• As with any complex analysis, do not mix “stages”, or it will be hard to see the truth:

Description should be things that everyone can agree on from the data. Evaluation is likely to involve much more debate, because you are attaching meaning and value to the data. If you do not separate Description from Evaluation, you are much more likely to see only the interpretation of the data that you are hoping to see. Further, Evaluation tends to be much harder, because establishing the normative value of a metric, typically through rigorous comparisons with other features and metrics, takes significant investment (a sketch of the separation follows below).

These stages do not progress linearly. As you explore the data, you may jump back and forth between the stages, but at any given time you should be clear about which stage you are in.
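As a concrete illustration, here is a minimal Python sketch of keeping the two stages separate. The session records and the 2% target threshold are assumptions invented for the example, not values from any real analysis:

```python
# A minimal sketch of keeping Description and Evaluation separate.
# All names (sessions, error counts, target threshold) are hypothetical.

sessions = [
    {"user": "a", "errors": 2, "requests": 100},
    {"user": "b", "errors": 0, "requests": 80},
    {"user": "c", "errors": 5, "requests": 50},
]

# --- Description: facts anyone can verify from the data. ---
total_errors = sum(s["errors"] for s in sessions)
total_requests = sum(s["requests"] for s in sessions)
error_rate = total_errors / total_requests
print(f"Description: error rate = {error_rate:.3%} "
      f"({total_errors} errors / {total_requests} requests)")

# --- Evaluation: a judgment that attaches value to the facts. ---
# The 2% threshold is an assumption standing in for a vetted benchmark.
TARGET = 0.02
verdict = "acceptable" if error_rate <= TARGET else "needs investigation"
print(f"Evaluation: against a {TARGET:.0%} target, the rate is {verdict}")
```

Keeping the two steps in separate blocks makes it easy to agree on the descriptive numbers first and argue about the threshold separately.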

  • You need to get dirty to do a good analysis:

Examine the raw data yourself, and make sure the people responsible for generating the data agree that it’s correct.
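A hypothetical sketch of what “getting dirty” can look like in Python: eyeball a few raw rows and run basic sanity checks to review with the data’s owners. The embedded CSV and its user_id/event_type/latency_ms columns are assumptions for illustration; in practice you would read the real export.

```python
import csv
import io
import random

# Stand-in for a real data export; the schema and values are invented.
RAW = io.StringIO(
    "user_id,event_type,latency_ms\n"
    "u1,click,120\n"
    "u2,view,\n"
    ",click,95\n"
    "u3,purchase,410\n"
)
rows = list(csv.DictReader(RAW))

# Eyeball a random handful of raw records, not just aggregates.
for row in random.sample(rows, k=min(3, len(rows))):
    print(row)

# Simple checks worth confirming with the data's owners.
print("rows:", len(rows))
print("rows with empty user_id:", sum(1 for r in rows if not r["user_id"]))
print("rows with missing latency_ms:",
      sum(1 for r in rows if not r["latency_ms"]))
print("distinct event types:", {r["event_type"] for r in rows})
```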

  • The burden of proof should be on the new:

If your new, custom metrics don’t agree with your standard, well-understood metrics, it is the new metrics that are most likely wrong.
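One way to apply this, sketched below with made-up daily values: check that the new metric moves with an established one, and inspect the days where they diverge before trusting the new metric. (Requires Python 3.10+ for statistics.correlation.)

```python
# A sketch of putting the burden of proof on a new metric. Both series
# are invented daily values for illustration.
from statistics import correlation  # Python 3.10+

standard_metric = [0.52, 0.55, 0.49, 0.61, 0.64, 0.60, 0.63]
new_metric      = [0.31, 0.33, 0.30, 0.44, 0.20, 0.41, 0.45]  # odd dip on day 4

r = correlation(standard_metric, new_metric)
print(f"correlation with the standard metric: r = {r:.2f}")

# Flag days where the two metrics moved in opposite directions;
# inspect those first before trusting the new metric.
disagreements = [
    i for i in range(1, len(new_metric))
    if (new_metric[i] - new_metric[i - 1])
       * (standard_metric[i] - standard_metric[i - 1]) < 0
]
print("days to inspect first:", disagreements)
```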

  • Seek multiple measurements:

Especially if you are trying to capture a new phenomenon, try to measure the same underlying thing in multiple ways. Then, check to see if these multiple measurements are consistent. By using multiple measurements, you can identify bugs in measurement or logging code, unexpected features of the underlying data, or filtering steps that are important. It’s even better if you can use different data sources for the measurements.
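A sketch of the idea, with hypothetical counts from two invented sources; the 2% tolerance is likewise an assumption to be replaced with one appropriate to your domain.

```python
# A sketch of measuring the same quantity two ways and checking that the
# answers agree. The sources and counts are hypothetical.

# Measurement 1: daily active users counted from frontend request logs.
dau_from_request_logs = 10_482

# Measurement 2: the same quantity from a separately logged events table.
dau_from_events_table = 10_344

relative_gap = abs(dau_from_request_logs - dau_from_events_table) / max(
    dau_from_request_logs, dau_from_events_table
)
print(f"relative gap between measurements: {relative_gap:.2%}")

# The 2% tolerance is an assumption; pick one suited to your domain.
if relative_gap > 0.02:
    print("Measurements disagree: look for logging bugs, filtering "
          "differences, or unexpected features of the underlying data.")
else:
    print("Measurements are consistent within tolerance.")
```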

Good data analysis will have a story to tell. To make sure it’s the right story, you need to tell the story to yourself, predict what else you should see in the data if that hypothesis is true, then look for evidence that it’s wrong. One way of doing this is to ask yourself, “What experiments would I run that would validate/invalidate the story I am telling?” Even if you don’t or can’t run these experiments, thinking them through may give you ideas for validating the story with the data you do have.
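For example, here is a small sketch of checking one prediction of a story; the scenario and counts are invented. If the story “sign-ups fell because of the mobile checkout change” is true, the drop should show up on mobile but not on desktop.

```python
# A sketch of stress-testing a story against the data it predicts.
# The platform breakdown and counts are hypothetical.
signups = {
    "mobile":  {"before": 900, "after": 610},
    "desktop": {"before": 850, "after": 820},
}

for platform, counts in signups.items():
    change = (counts["after"] - counts["before"]) / counts["before"]
    print(f"{platform}: sign-ups changed {change:+.1%}")

# If desktop fell just as much as mobile, the mobile-checkout story is
# wrong and something platform-independent is the better hypothesis.
```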

  • Always state clearly (everywhere!) what you are measuring:

Ratios should have clear numerators and denominators; state both wherever the ratio appears.
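A small sketch of what this looks like in practice; the counts and the “searches with at least one result” filter are hypothetical.

```python
# A sketch of reporting a ratio with its numerator and denominator
# spelled out. The event counts and the filter are invented.
clicks_on_results = 4_210        # numerator: clicks on search results
searches_with_results = 52_900   # denominator: searches returning >= 1 result

ctr = clicks_on_results / searches_with_results
# "CTR = 7.96%" alone hides whether no-result searches were excluded;
# stating both terms makes the definition unambiguous.
print(f"CTR = {ctr:.2%} "
      f"(= {clicks_on_results} clicks on results "
      f"/ {searches_with_results} searches with at least one result)")
```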

  • We need to educate:

You will often be presenting your analysis and results to people who are not data experts. Part of your job is to educate them on how to interpret and draw conclusions from your data. This runs the gamut: making sure they understand confidence intervals (sketched after this item), explaining why certain measurements are unreliable in your domain, giving a sense of typical effect sizes for “good” and “bad” changes, and flagging population bias effects.

This is especially important when your data has a high risk of being misinterpreted or selectively cited. You are responsible for providing context and a full picture of the data, not just the number a consumer asked for.
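On the confidence-interval point, a minimal sketch of presenting uncertainty alongside the point estimate; the conversion counts are hypothetical and the interval uses a simple normal approximation to the binomial.

```python
# A sketch of attaching a confidence interval so non-experts see the
# uncertainty, not just the point estimate. Counts are hypothetical.
import math

conversions, visitors = 312, 4_800
p = conversions / visitors
se = math.sqrt(p * (1 - p) / visitors)   # standard error of a proportion
lo, hi = p - 1.96 * se, p + 1.96 * se    # ~95% interval

print(f"conversion rate: {p:.2%} (95% CI {lo:.2%} to {hi:.2%})")
# Presenting the interval alongside the number helps consumers judge
# whether an observed change is larger than the noise.
```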