- Tags:: 🗞️Articles, Data Engineering
- Author:: Many
- Link::
- {{pdf: https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fdr___mario_second_brain%2FvRVUNh8icO.pdf?alt=media&token=aead0612-6b58-4d74-bdb0-a47901c13060}}
- There is also a very similar version, titled “Machine Learning: The High-Interest Credit Card of Technical Debt”
- Finished date:: 2020-11-13
- Source date:: 2014-01-01
The classic figure of everything that putting an ML model in production actually entails:
ML has all the problems of ordinary software, and then some. Unit tests are not enough.
One of the basic arguments in this paper is that machine learning packages have all the basic code complexity issues as normal code, but also have a larger system-level complexity that can create hidden debt. Thus, refactoring these libraries, adding better unit tests, and associated activity is time well spent but does not necessarily address debt at a systems level.
Encapsulation, which works so well in software, is impossible here; ML is antithetical to it:
From a high level perspective, a machine learning package is a tool for mixing data sources together. That is, machine learning models are machines for creating entanglement and making the isolation of improvements effectively impossible.
CACE principle: Changing Anything Changes Everything.
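A minimal numeric sketch of CACE with made-up data: two correlated input signals, a model fit by least squares, and an upstream "improvement" (dropping one signal) that silently changes the weight learned for the *other* signal. All names and data here are hypothetical, not from the paper.

```python
# Toy CACE illustration: y truly equals 2*x1 + 1*x2, and x2 is
# correlated with x1.
def fit_two(x1, x2, y):
    """Least squares for y ~ w1*x1 + w2*x2 (no intercept), via normal equations."""
    a11 = sum(a * a for a in x1)
    a12 = sum(a * b for a, b in zip(x1, x2))
    a22 = sum(b * b for b in x2)
    b1 = sum(a * t for a, t in zip(x1, y))
    b2 = sum(b * t for b, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

x1 = [1, 2, 3, 4]
x2 = [1, 2, 3, 5]                       # correlated with x1
y = [2 * a + b for a, b in zip(x1, x2)]

w1, w2 = fit_two(x1, x2, y)             # recovers w1 = 2.0, w2 = 1.0

# An upstream team "improves" the pipeline by removing signal x2.
# Refitting on x1 alone: its weight silently absorbs part of x2's role.
w1_only = sum(a * t for a, t in zip(x1, y)) / sum(a * a for a in x1)
print(w1, w2, w1_only)                  # w1_only is about 3.13, not 2.0
```

Changing one input changed a weight that "belonged" to a different input, which is exactly why improvements cannot be isolated.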
**Unstable data dependencies**
Because ownership of the input signals usually belongs to other teams. The proposed solution: create versioned copies of the signals (though this is itself tech debt).
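A sketch of what a versioned signal copy could look like (hypothetical API, not from the paper): the model reads a frozen snapshot such as `clicks@v1` instead of the live upstream signal, so upstream changes cannot silently alter the model's inputs.

```python
# Hypothetical versioned signal store: snapshots are immutable once frozen.
class SignalStore:
    def __init__(self):
        self._snapshots = {}

    def freeze(self, name, version, values):
        key = f"{name}@{version}"
        if key in self._snapshots:
            raise ValueError(f"{key} is immutable once frozen")
        self._snapshots[key] = list(values)  # defensive copy
        return key

    def read(self, name, version):
        return list(self._snapshots[f"{name}@{version}"])

store = SignalStore()
store.freeze("clicks", "v1", [10, 12, 9])

live_clicks = [10, 12, 9, 100]            # upstream keeps mutating
model_input = store.read("clicks", "v1")  # stable: [10, 12, 9]
```

The debt trade-off is visible in the code: every frozen copy is extra storage and an extra thing to eventually migrate away from.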
Static analysis of who is consuming which data is not easy. The paper proposes an automated feature management tool.
Be careful with glue code: it usually makes more sense to reimplement and tweak than adapt to a general purpose package:
While generic systems might make it possible to interchange optimization algorithms, it is quite often refactoring of the construction of the problem space which yields the most benefit to mature systems.
this may seem like a high cost to pay (…). But the resulting system may require dramatically less glue code to integrate in the overall system, be easier to test, be easier to maintain, and be better designed to allow alternate approaches to be plugged in and empirically tested. Problem-specific machine learning code can also be tweaked with problem-specific knowledge that is hard to support in general packages.
When we recognize that a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code, reimplementation rather than reuse of a clumsy API looks like a much better strategy.
A mature ML system can end up with a huge number of configuration options. Be careful with this too (diff tools are useful here).
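A sketch of what such a config diff could look like (assuming flat key/value configs; the keys below are made up): reviewing only the changed entries before pushing a new configuration.

```python
# Report every key whose value differs between two flat config dicts.
def diff_configs(old, new):
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes

prod = {"learning_rate": 0.01, "signal.clicks": "v1", "threshold": 0.5}
candidate = {"learning_rate": 0.01, "signal.clicks": "v2", "threshold": 0.4}

print(diff_configs(prod, candidate))
# only the two changed keys are surfaced for review
```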
Real-world changes: features that no longer correlate, prediction bias across different distribution slices…
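The slice-level bias check could be monitored with something like the sketch below (hypothetical slices and tolerance): compare the mean predicted rate against the observed rate within each slice and alert when they diverge.

```python
# Per-slice prediction-bias check over (slice, predicted_prob, label) rows.
def slice_bias(records, tolerance=0.05):
    by_slice = {}
    for name, p, y in records:
        by_slice.setdefault(name, []).append((p, y))
    alerts = {}
    for name, rows in by_slice.items():
        mean_pred = sum(p for p, _ in rows) / len(rows)
        mean_obs = sum(y for _, y in rows) / len(rows)
        if abs(mean_pred - mean_obs) > tolerance:       # bias beyond tolerance
            alerts[name] = round(mean_pred - mean_obs, 3)
    return alerts

data = [("US", 0.30, 1), ("US", 0.30, 0),
        ("BR", 0.30, 0), ("BR", 0.30, 0)]
print(slice_bias(data))   # US under-predicts, BR over-predicts
```

A global bias check would miss this kind of drift: averaged over all rows the model can look calibrated while individual slices are badly off.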