- Tags:: 📚Books, ✒️SummarizedBooks , Data Science
- Author:: Foster Provost and Tom Fawcett
- Liked:: 6
- Link:: Data Science for Business (oreilly.com)
- Source date:: 2021-02-05
- Finished date:: 2021-06-01
- Cover::
WARNING
Only partly summarized
1. Introduction: Data-Analytic Thinking
Example: Hurricane Frances
It starts with what could be a misleading example, particularly in light of COVID: how Walmart was able to discover interesting patterns in what people were going to buy upon the arrival of Hurricane Frances (2004): (not so) surprisingly, Pop-Tarts and beer.
To do this, analysts might examine the huge volume of Wal-Mart data from prior similar situations (such as Hurricane Charley) (p. e)
It is a misleading example because: do you have data on previous hurricanes? Me neither. Do you have data on previous viral outbreaks in your country? Yeah, me neither.
Data Science, Engineering, and Data-Driven Decision Making
A general number on the effect of being “data-driven”:
The benefits of data-driven decision-making have been demonstrated conclusively. Economist Erik Brynjolfsson and his colleagues from MIT and Penn’s Wharton School conducted a study of how DDD affects firm performance (Brynjolfsson, Hitt, & Kim, 2011). They developed a measure of DDD that rates firms as to how strongly they use data to make decisions across the company. They show that statistically, the more data driven a firm is, the more productive it is—even controlling for a wide range of possible confounding factors. And the differences are not small. One standard deviation higher on the DDD scale is associated with a 4%–6% increase in productivity. DDD also is correlated with higher return on assets, return on equity, asset utilization, and market value, and the relationship seems to be causal (p. 4)
And another 6% difference by the use of big data tech:
A separate study, conducted by economist Prasanna Tambe of NYU’s Stern School, examined the extent to which big data technologies seem to help firms (Tambe, 2012). He finds that, after controlling for various possible confounding factors, using big data technologies is associated with significant additional productivity growth. Specifically, one standard deviation higher utilization of big data technologies is associated with 1%–3% higher productivity than the average firm; one standard deviation lower in terms of big data utilization is associated with 1%–3% lower productivity (p. 8).
Data and Data Science Capability as a Strategic Asset
An amazing story about how Signet Bank (the future Capital One) wanted, in the 90s, to model the profitability of individual credit customers (finding the most profitable terms, offering specific terms to different customers). But they had no data: credit was offered with a standard set of terms to all customers. What did they do? Invest in data, by losing money:
Once we view data as a business asset, we should think about whether and how much we are willing to invest. In Signet’s case, data could be generated on the profitability of customers given different credit terms by conducting experiments. Different terms were offered at random to different customers. This may seem foolish outside the context of data-analytic thinking: you’re likely to lose money! This is true. In this case, losses are the cost of data acquisition. The data-analytic thinker needs to consider whether she expects the data to have sufficient value to justify the investment (…) when Signet began randomly offering terms to customers for data acquisition, the number of bad accounts soared. Signet went from an industry-leading “charge-off ” rate (2.9% of balances went unpaid) to almost 6% charge-offs (…) Because the firm viewed these losses as investments in data, they persisted despite complaints from stakeholders. Eventually, Signet’s credit card operation turned around and became so profitable that it was spun off to separate it from the bank’s other operations, which now were overshadowing the consumer credit success (p. 10).
Data-Analytic Thinking
You can’t afford to ignore data science:
If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business. This lack of understanding is much more damaging in data science projects than in other technical projects, because the data science is supporting improved decision-making (p. 13).
Data Mining and Data Science, Revisited
A shocker for many people:
Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages (p. 14)
They talk about CRISP-DM, but there are others: see Data methodology
2. Business Problems and Data Science Solutions
The book includes these:
- Classification: “Among all the customers of MegaTelCo, which are likely to respond to a given offer?”
- Regression: “How much will a given customer use the service?”
- Similarity matching / Clustering: “IBM is interested in finding companies similar to their best business customers, in order to focus their sales force on the best opportunities”. With no direct purpose by itself, this is used as input to other decision-making processes (e.g. for guiding product development).
- Co-occurrence grouping: associations between entities based on transactions involving them (e.g., market basket analysis). Typical in recommendations along with similarity matching (people who bought X also bought Y).
- Profiling: behavior of individuals or groups (“What is the typical shopping behavior of this customer?”). Its output could be used as input for anomaly detection.
- Link prediction: your average Facebook friend recommendation (can also be used for other recommendations thinking about links: e.g., viewers and movies).
- Data reduction (analysis): data summarization for insight (e.g., viewer genre preferences for a large dataset of movie views).
- Causal modeling: what actions influence others. Here we have randomized controlled experiments (A/B testing) but can also be done from observational data (more in The Book of Why).
In practice, a Data Scientist is also equipped to solve other problems more related to the “science-y” part and not so much to learning patterns from data: those belonging to Decision Theory or Operations Research.
Operations Research is a scientific manner of decision making, mostly under the pressure of scarce resources. The objective of decision making is often the maximization of profit, minimization of cost, better efficiency, better operational or tactical or strategic planning or scheduling, better pricing, better productivity, better recovery, better throughput, better location, better risk management, or better customer service. The scientific manner involves extensive use of mathematical representations or models of real life situations. (“Business Applications of Operations Research” book)
They share with Machine Learning problems the idea of being quantitative approaches to decision making. Examples of such problems would be: where to place supermarkets to maximize their revenue (based on population, competing supermarkets…), shift staff planning, scheduling a production line so that jobs with different processing times requiring different machines take the minimum possible time, delivery routing, Revenue Management…
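As a taste of this kind of problem, here is a minimal sketch of the classic weekly shift-staffing problem as a linear program, solved with SciPy. The demand numbers are made up, and a real version would add integrality constraints (e.g., via scipy.optimize.milp) and many more business rules:
```python
# Hypothetical weekly staff planning: each worker starts on some day and
# works 5 consecutive days; minimize total headcount while covering demand.
import numpy as np
from scipy.optimize import linprog

demand = np.array([17, 13, 15, 19, 14, 16, 11])  # staff required Mon..Sun (made up)

# coverage[i, j] = 1 if a worker starting on day j is on shift on day i
coverage = np.array([[1 if (i - j) % 7 <= 4 else 0 for j in range(7)]
                     for i in range(7)])

# minimize sum(x) subject to coverage @ x >= demand, x >= 0
res = linprog(c=np.ones(7), A_ub=-coverage, b_ub=-demand,
              bounds=[(0, None)] * 7, method="highs")
print(res.x.round(2), "total staff:", round(res.fun, 2))
```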
The Data Mining Process
Another diagram of a Data methodology, CRISP-DM
These iterations can be different from software engineering iterations:
This process diagram makes explicit the fact that iteration is the rule rather than the exception. Going through the process once without having solved the problem is, generally speaking, not a failure. Often the entire process is an exploration of the data, and after the first iteration the data science team knows much more (p. 27)
Of course, the magic is in…
The Business Understanding stage represents a part of the craft where the analysts’ creativity plays a large role. Data science has some things to say, as we will describe, but often the key to a great success is a creative problem formulation by some analyst regarding how to cast the business problem as one or more data science problems. High level knowledge of the fundamentals helps creative business analysts see novel formulations (p. 28).
Often the quality of the data mining solution rests on how well the analysts structure the problems and craft the variables (and sometimes it can be surprisingly hard for them to admit it) (p. 30).
Regarding who is responsible after deployment, we have an interesting nugget, mainly because it is not the usual approach. Data team members are more often seen as a distraction to the “main” development teams. However, here they propose a staged handoff to devs:
From a management perspective, it is advisable to have members of the development team involved early on in the data science project. They can begin as advisors, providing critical insight to the data science team. Increasingly in practice, these particular developers are “data science engineers”—software engineers who have particular expertise both in the production systems and in data sci‐ence. These developers gradually assume more responsibility as the project matures. At some point the developers will take the lead and assume ownership of the product. Generally, the data scientists should still remain involved in the project into final deployment, as advisors or as developers depending on their skills (p. 33).
Implications for Managing the Data Science Team
This section is gold (it is a cliché really, but a good reference to pass on when you have trouble). Data Science projects are, of course, quite different from software engineering projects:
Software managers might look at the CRISP data mining cycle and think it looks comfortably similar to a software development cycle, so they should be right at home managing an analytics project the same way. This can be a mistake because data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem. Engineering a data mining solution directly for deployment can be an expensive premature commitment. Instead, analytics projects should prepare to invest in information to reduce uncertainty in various ways. Small investments can be made via pilot studies and throwaway prototypes. Data scientists should review the literature to see what else has been done and how it has worked. On a larger scale, a team can invest substantially in building experimental testbeds to allow extensive agile experimentation. If you’re a software manager, this will look more like research and exploration than you’re used to, and maybe more than you’re comfortable with.
Machine Learning and Data Mining
KDD: Knowledge Discovery and Data Mining. A subset of Machine Learning, focused:
on concerns raised by examining real-world applications (…) research focused on commercial applications and business issues of data analysis.
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
How can we judge whether a variable contains important information about the target variable?
For example with a purity measure (how pure are the segments): entropy and information gain.
$entropy = -\sum_i p_i \log_2(p_i)$
where $-\log_2(p_i)$ is related to the bits you would need to encode the event of obtaining class “i” (to understand this better: Shannon Entropy (heliosphan.org)).
So how informative is an attribute to split a set? It is the difference between the entropy of the original set and the weighted entropies of the child sets (split by the attribute):
$IG(parent, children) = entropy(parent) - [p(c_1) \cdot entropy(c_1) + p(c_2) \cdot entropy(c_2) + \ldots]$
where $p(c_1)$ is the proportion of instances of child 1…
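Both measures are a few lines of Python; a minimal sketch (the toy labels below are made up, echoing the mushroom charts in the next subsection):
```python
import numpy as np

def entropy(labels):
    # -sum_i p_i * log2(p_i), over the class proportions of the set
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, attribute):
    # entropy(parent) minus the weighted entropy of the children after the split
    gain = entropy(labels)
    for value in np.unique(attribute):
        mask = attribute == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# toy data: an attribute that separates the classes perfectly recovers
# the full parent entropy (~0.97 bits here) as information gain
y = np.array(["edible"] * 6 + ["poisonous"] * 4)
odor = np.array(["none"] * 6 + ["foul"] * 4)
print(information_gain(y, odor))
```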
If the target variable is numeric (i.e., a regression problem), then a measure of impurity would be the variance of the values.
Entropy charts
They show the weighted-sum entropy (Y axis) resulting from splitting a dataset by an attribute (X axis):
(Charts for the mushroom dataset: GILL-COLOR is not a very informative attribute, but ODOR clearly is.)
4. Fitting a model to data
Learning curves
Performance on the test set vs. the amount of training data. For a particular model, we may use this to see when investing in more training data is no longer worthwhile, because performance levels off:
In the book’s example chart, logistic regression would not benefit from more data, but the decision tree would.
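A sketch of how such a curve can be produced with scikit-learn’s learning_curve (the dataset and model below are placeholders, not the book’s example):
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())

# cross-validated accuracy at increasing training-set sizes
sizes, train_scores, test_scores = learning_curve(
    model, X, y, train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0], cv=5)

plt.plot(sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("training instances")
plt.ylabel("cross-validated accuracy")
plt.show()
```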
The underlying reasons for overfitting when building models from data are essentially problems of multiple comparisons (see Multiple Comparisons in Induction Algorithms). Note that even the procedures for avoiding overfitting themselves undertake multiple comparisons (e.g., choosing the best complexity for a model by comparing many complexities). (p. 139)
7. Decision Analytic Thinking I: What Is a Good Model?
Unbalanced classes
Careful when using accuracy as a metric: on which class distribution is it being reported?
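A tiny made-up illustration: with a 99:1 class balance, a classifier that always predicts the majority class reports 99% accuracy while never finding a single positive.
```python
import numpy as np

y_true = np.array([0] * 990 + [1] * 10)  # 99:1 imbalance (made up)
y_pred = np.zeros_like(y_true)           # always predict the majority class
print((y_pred == y_true).mean())         # 0.99 accuracy, zero positives found
```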
Problems with Unequal Costs and Benefits
Indeed, it is hard to imagine any domain in which a decision maker can safely be indifferent to whether she makes a false positive or a false negative error. Ideally, we should estimate the cost or benefit of each decision a classifier can make. Once aggregated, these will produce an expected profit (or expected benefit or expected cost) estimate for the classifier (p. 193)
So we must think about the metrics carefully:
Why is the mean-squared-error on the predicted number of stars an appropriate metric for our recommendation problem? Is it meaningful? Is there a better metric? Hopefully, the analyst has thought this through carefully. It is surprising how often one finds that an analyst has not, and is simply reporting some measure he learned about in a class in school. (p. 194)
A Key Analytical Framework: Expected Value
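The computation behind the framework is small: estimate the rate at which each cell of the confusion matrix occurs on a test set, multiply each rate by the cost or benefit of that outcome, and sum. A minimal sketch with made-up numbers in the spirit of the book’s targeted-marketing example:
```python
import numpy as np

# rates p(predicted, actual) estimated from a test set (made-up numbers)
rates = np.array([[0.05, 0.10],    # predicted positive: [TP, FP]
                  [0.02, 0.83]])   # predicted negative: [FN, TN]

# cost-benefit matrix in the same layout: profit from a targeted responder,
# cost of targeting a non-responder, and nothing for those left alone
benefits = np.array([[99.0, -1.0],
                     [ 0.0,  0.0]])

expected_profit = (rates * benefits).sum()  # sum over the four cells
print(expected_profit)                      # 4.85 per consumer
```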
Evaluation, Baseline Performance, and Implications for Investments in Data
It is usually not enough to show positive profit: we need to beat “easy” benchmarks. Otherwise, first rule of Machine Learning: “don’t use Machine Learning” (Rules of Machine Learning). The “easy” part comes from how easy they are to compute: always predicting the majority class in classification, the mean or the median in regression, yesterday’s value in time series… But they may be very hard to beat (see 📖 The Business Forecasting Deal).
Another baseline can be a model based on a single feature (a decision stump in the case of decision trees). And we could do the same for data sources (basing ourselves on domain knowledge).
Whatever the data mining group chooses as a baseline for comparison, it should be something the stakeholders find informative, and hopefully persuasive (p. 205)
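With scikit-learn, these baselines are a couple of lines; a sketch (the dataset is just a placeholder):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

majority = DummyClassifier(strategy="most_frequent")  # always the majority class
stump = DecisionTreeClassifier(max_depth=1)           # single-feature decision stump

print("majority:", cross_val_score(majority, X, y, cv=5).mean())
print("stump:   ", cross_val_score(stump, X, y, cv=5).mean())
# for regression, DummyRegressor(strategy="mean") or "median" plays the same role
```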
8. Visualizing Model Performance
Profit curves
These compare different classifiers (one trace each), showing for each one the expected value we would obtain (Y axis) at different values of the classification threshold (X axis).
It requires you to know the operating conditions of the classifier: the class prior and the costs and benefits.
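A sketch of how such a curve can be computed by hand from scores, labels, and a cost-benefit matrix (the profit_curve helper and all numbers are assumptions, not the book’s code):
```python
import numpy as np

def profit_curve(y_true, scores, benefits):
    # expected profit per instance when targeting the top-k instances ranked
    # by score; benefits[pred][actual], with 1 = positive
    y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
    n, profits = len(y), []
    for k in range(n + 1):
        tp = y[:k].sum(); fp = k - tp
        fn = y[k:].sum(); tn = (n - k) - fn
        profits.append((tp * benefits[1][1] + fp * benefits[1][0] +
                        fn * benefits[0][1] + tn * benefits[0][0]) / n)
    return np.array(profits)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)              # made-up outcomes
s = y * 0.5 + rng.random(200)            # noisy but informative scores
curve = profit_curve(y, s, benefits=[[0, 0], [-1, 99]])
print("best fraction to target:", curve.argmax() / len(y))
```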
ROC Graphs
Otherwise, you may use the ROC curve.
Since the true positive rate is computed only from positive examples and the false positive rate only from negative examples, the actual class priors are not important.
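Plotting one with scikit-learn is direct (the model and dataset below are placeholders):
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)
plt.plot(fpr, tpr)
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
print("AUC:", roc_auc_score(y_te, scores))
plt.show()
```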
Cumulative Response and Lift Curves
An easier-to-understand graph for stakeholders is the true positive rate as a function of the fraction of the whole population targeted:
Also called a lift curve, since you can see how the model’s performance is lifted over random performance (the blue trace in the book’s figure). You may also plot directly the lift over random performance (model / random) or over another baseline (e.g., some other classifier).
The simplifying assumption for this graph, unlike ROC, is that the class priors in your test set are the same as in reality (e.g., if reality has more negative examples, the true positive and false positive rates stay the same in the test set as in reality, but the relationship between the axes of the lift curve does not).
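A sketch of how the cumulative response curve, and the lift derived from it, can be computed from scores (the cumulative_response helper and the synthetic data are assumptions):
```python
import numpy as np

def cumulative_response(y_true, scores):
    # true positive rate as a function of the fraction of the population
    # targeted, ranking instances by score (best candidates first)
    y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
    tpr = np.cumsum(y) / y.sum()
    targeted = np.arange(1, len(y) + 1) / len(y)
    return targeted, tpr

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 1000)             # made-up outcomes
s = y * 0.4 + rng.random(1000)           # made-up informative scores
targeted, tpr = cumulative_response(y, s)
lift = tpr / targeted                    # lift over random targeting
print(lift[:5].round(2))                 # lift near the top of the ranking
```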