Introduction

Making sense of data is not necessarily becoming easier, but it is certainly appearing easier:

With the growing availability of massive data sets and user-friendly analysis software, it might be thought that there is less need for training in statistical methods. This would be naive in the extreme. Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth (p. 12)

So we need data literacy (definition):

…describes the ability to not only carry out statistical analysis on real-world problems, but also to understand and critique any conclusions drawn by others on the basis of statistics (p. 12)

Metrics are a mess (Metrics layer):

…the official definition of “unemployment” in the UK was changed at least thirty-one times between 1979 and 1996 (p. 8)

As in many places (📜 Predictive Modeling. A Retrospective, 📖 How to lead in Data Science), rigor is very important:

confident conclusions can only be drawn from a study which has been appropriately designed. Unfortunately, in the rush to get data and start analysis, attention to design is often glossed over (p. 15)

There is even a word for the tendency to construct reasons for a connection between what are actually unrelated events: Apophenia (p. 97)

The PPDAC cycle (Problem, Plan, Data, Analysis, Conclusion) is just another name for the scientific method.

The Analysis stage… sometimes all that is required is a useful visualization (p. 15)

1. Getting things in proportion: categorical data and percentages

Communicating counts and proportions

Framing: instead of showing survival rates, we can report mortality rates for more emotional impact, and use actual numbers of people (expected frequencies) rather than percentages.
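A minimal sketch of the framing idea (function name and numbers are mine, not from the book): the same fact reported as a survival percentage, a mortality percentage, and as expected frequencies out of 100 people.

```python
# Reframing the same statistic three ways (illustrative numbers only).
def framings(survival_rate, denominator=100):
    """Return the same fact framed as survival %, mortality %, and counts."""
    mortality_rate = 1 - survival_rate
    survivors = round(survival_rate * denominator)
    deaths = denominator - survivors
    return {
        "positive": f"{survival_rate:.0%} of patients survive",
        "negative": f"{mortality_rate:.0%} of patients die",
        "frequencies": f"{deaths} out of {denominator} patients die",
    }

print(framings(0.96))
```

The three lines state the same thing, but the negative and frequency framings land very differently on a reader.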

Comparing a pair of proportions (percentages)

Always use the absolute difference: relative differences between percentages are very hard to interpret and can be very manipulative.

See also “Porcentajes perniciosos” (“Pernicious percentages”, p. 123) from 📖 Bullshit. Contra la charlatanería.
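A sketch of why relative differences mislead (the risks are hypothetical): going from 1 to 2 cases per 10,000 is a "100% increase" in relative terms, but only 1 extra case per 10,000 in absolute terms.

```python
# Hypothetical risks: 1 vs. 2 cases per 10,000 people.
baseline_risk = 1 / 10_000
exposed_risk = 2 / 10_000

relative_increase = (exposed_risk - baseline_risk) / baseline_risk
absolute_increase = exposed_risk - baseline_risk

print(f"Relative: {relative_increase:.0%} increase")                          # sounds alarming
print(f"Absolute: {absolute_increase * 10_000:.0f} extra case per 10,000")    # sounds modest
```

Both numbers are correct; only the absolute one conveys how small the underlying risk actually is.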

2. Summarizing and communicating numbers. Lots of numbers

Describing differences between groups of numbers

The usual suspects: mean, median, mode, IQR… A large gap between the mean and the median indicates a long tail (p. 52)
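A stdlib sketch of the mean-vs-median point, with made-up right-skewed data: a single large value drags the mean well above the median.

```python
import statistics

# Made-up, right-skewed data (think incomes): one large value in the tail.
data = [20, 22, 25, 25, 28, 30, 35, 40, 200]

mean = statistics.mean(data)      # pulled up by the outlier
median = statistics.median(data)  # robust to the tail
mode = statistics.mode(data)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"mean={mean:.1f}, median={median}, mode={mode}, IQR={iqr}")
# mean >> median is the signature of a long right tail
```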

Describing relationships between variables

The Pearson correlation coefficient vs. Spearman's rank correlation: the latter only takes the data ranks into account, not the specific values, so it rewards any steady (monotonic) increase even when it is not a straight line. However…

…there is no substitute for simply looking at data properly (p. 56).
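A self-contained sketch of the distinction (formulas hand-rolled to avoid dependencies, and the ranking helper assumes no ties): for a monotonic but curved relationship like y = x³, Spearman's rank correlation is exactly 1 while Pearson's falls below 1.

```python
def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks (assumes no ties)."""
    rank = lambda vs: [sorted(vs).index(v) + 1 for v in vs]
    return pearson(rank(xs), rank(ys))

x = list(range(1, 9))
y = [v ** 3 for v in x]   # steadily increasing, but not a straight line

print(f"Pearson:  {pearson(x, y):.3f}")   # below 1: penalizes the curvature
print(f"Spearman: {spearman(x, y):.3f}")  # exactly 1: only the ordering matters
```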

E.g., global population data, split by continent:

It is always valuable to split data according to a factor that explains some of the overall variability (p. 65)
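A minimal group-by sketch (made-up numbers, stdlib only): splitting values by a categorical factor and comparing the group means to the overall mean shows how much of the spread the factor accounts for.

```python
import statistics
from collections import defaultdict

# Made-up (value, group) observations; the group explains much of the spread.
observations = [(2, "A"), (3, "A"), (2, "A"), (10, "B"), (11, "B"), (9, "B")]

groups = defaultdict(list)
for value, factor in observations:
    groups[factor].append(value)

overall_mean = statistics.mean(v for v, _ in observations)
group_means = {factor: statistics.mean(vals) for factor, vals in groups.items()}

print(f"overall mean: {overall_mean:.2f}")
print(f"group means:  {group_means}")  # far apart: the factor explains variability
```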

Communication

As with good non-fiction writing…

…fight the temptation to be too sophisticated (p. 68)

3. Why Are We Looking at Data Anyway? Populations and Measurement

Inductive inference process:

  1. Getting the raw data.
  2. The truth in our sample data.
  3. The study population: the ones who could potentially have been included in our sample.
  4. Target population.

Step 1 to 2 is about measurement: e.g., whether people in a survey tell us the truth.

The hardest step is 2 to 3: making sure our sample is representative of the study population (the Selection Bias of 📖 Causal Inference. The Mixtape); the best way to ensure that is to draw a random sample of the population. Step 3 to 4 raises the same representativeness concern.

When we have all the data, steps 2 and 3 could be considered the same, but it still makes sense to treat our data as a sample drawn from a metaphorical population of everything that could have happened but didn't, due to “chance”.

Other notes

In a study about deaths in hospitals in the 90s… the same problem with Data Quality:

it may seem extraordinary that we could not even establish the basic facts about the number of operations and their outcome (p. 21)

Many of the early developers of statistical techniques were enthusiastic eugenicists. (p. 39)

An alternative to P-values. Fragility indices for only sufficiently likely modifications | PNAS