The novel Anna Karenina begins with the line:
All happy families are alike; each unhappy family is unhappy in its own way.
Leo Tolstoy
This is known as the Anna Karenina principle, and I have found that it describes the world of Data Science very well. Think about a Machine Learning pipeline:
- Data Ingestion and Preparation
- Model Training and Retraining
- Model Evaluation
- Deployment
When everything goes well and the model works as expected, we have another typical “Happy Family”, and all of those are alike. But what happens when the model does not perform as expected? Then we have an “Unhappy Family”, and believe me, it is unhappy in its own way :). There are many different reasons why a model can fail. To mention just some of them:
- Simpson’s Paradox And Misleading Statistical Inference
- In web campaigns, delayed conversions and cookie effects that influence which variant is served.
- Wrong inference in A/B Testing
- Wrong inference in Multiple Comparisons
- Wrong inference in Multi-armed Bandit
- Changes in the distribution of the user base, especially in recommender systems, market basket analysis and collaborative filtering
- Lack of experimental design and skewed samples.
- Failing to control for factors such as the “Campaign Effect”, the “Seasonality Effect” and the “Fatigue Effect”.
- Skipping techniques such as undersampling when they are necessary.
- Differences between staging and production environments, such as different library versions.
- Failing to measure the long-term effect of an offer, which can result in “cannibalization”.
- Data leakage issues
- and so on
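To make the first item on this list concrete, here is a small sketch of Simpson’s Paradox in an A/B test. The counts, the segment names and the helper functions are all hypothetical and chosen only for illustration; the point is that variant A can win in every segment and still lose once the segments are pooled, because the two variants were served to very different mixes of users.

```python
# Hypothetical (conversions, visitors) counts per variant and segment.
# These numbers are illustrative only, not real campaign data.
counts = {
    "A": {"new_users": (81, 87), "returning_users": (192, 263)},
    "B": {"new_users": (234, 270), "returning_users": (55, 80)},
}


def rate(conversions, visitors):
    """Conversion rate as a fraction between 0 and 1."""
    return conversions / visitors


def segment_rates(variant):
    """Conversion rate of a variant inside each segment."""
    return {seg: rate(*cv) for seg, cv in counts[variant].items()}


def overall_rate(variant):
    """Conversion rate of a variant after pooling all segments."""
    conversions = sum(c for c, _ in counts[variant].values())
    visitors = sum(v for _, v in counts[variant].values())
    return rate(conversions, visitors)


seg_a, seg_b = segment_rates("A"), segment_rates("B")

# A beats B inside every single segment...
a_wins_every_segment = all(seg_a[s] > seg_b[s] for s in seg_a)
# ...yet B beats A when the segments are pooled together.
b_wins_overall = overall_rate("B") > overall_rate("A")

for seg in seg_a:
    print(f"{seg}: A={seg_a[seg]:.3f}  B={seg_b[seg]:.3f}")
print(f"overall: A={overall_rate('A'):.3f}  B={overall_rate('B'):.3f}")
print("A wins every segment:", a_wins_every_segment)
print("B wins overall:", b_wins_overall)
```

In this made-up example most of A’s traffic falls in the segment that converts worse, which drags down its pooled rate. Comparing the variants within each segment, or randomizing so that the segment mix is the same for both, is one way to avoid drawing the wrong conclusion.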
I would be happy to hear about the unhappy families you have met in your own experience.