- Tags:: Books, Data Analysis
- Author:: Roger D. Peng, Elisabeth Matsui
- Liked:: 4
- Link:: The Art of Data Science
- Source date:: 2016-06-08
- Finished date:: 2020-01-01
- Cover::
2. Epicycles of Analysis
You want to set up your expectations and your data so that matching the two up is easy (p. 10)
3. Stating and Refining the Question
Six types of questions (Another good reference is What is the question? | Science):
- Descriptive: summarizes data.
- Exploratory: looks for patterns to generate a hypothesis.
- Inferential: tests a hypothesis in a set of data (which should be a different dataset from the one used for exploration).
- Predictive.
- Causal: whether changing one factor changes another factor, on average, in the population. This differs from a predictive question, where you don't care what is causing what.
- Mechanistic: how changing one factor leads to a change in another factor. This is more specific than the causal question.
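One way to keep the exploratory and inferential steps on different data is to split the dataset once, up front. This is a minimal sketch, not something taken from the book: the function name `split_for_exploration`, the 50/50 fraction, and the stand-in `records` list are all assumptions made for illustration.

```python
import random


def split_for_exploration(rows, explore_fraction=0.5, seed=0):
    """Randomly split rows into an exploratory part and a held-out part.

    Hypothetical helper: the name and the default fraction are illustrative.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))         # stand-in for real observations
explore, confirm = split_for_exploration(records)

# Generate hypotheses on `explore`; test them only on `confirm`.
print(len(explore), len(confirm))
```

The point of the design is that no row used to generate a hypothesis is reused to test it.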
Good questions:
- Are of interest.
- Not already answered.
- Plausible.
- Answerable.
- Specific.
4. Exploratory Data Analysis
Data visualization is arguably the most important tool for exploratory data analysis because the information conveyed by graphical display can be very quickly absorbed and because it is generally easy to recognize patterns in a graphical display. (p. 31)
Checklist:
- Formulate your question.
- Read data.
- Basic info of the dataset (df.info()).
- Look at the top and bottom rows.
- Identify some landmarks that can be used to check against the data.
- Check the data against external sources (e.g., measurements are within reasonable ranges).
- Plot.
- Try the easy answer first.
Importantly, if you do not find evidence of a signal in the data using just a simple plot or analysis, then often it is unlikely that you will find something using a more sophisticated analysis. (p. 49)
- Follow-up questions.
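The checklist above can be sketched in pandas, since the notes already mention `df.info()`. Everything about the data here is an assumption: the inline CSV, the column names `day` and `temp_c`, the expected row count, and the plausible temperature range are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data: daily temperature readings in degrees Celsius.
raw = io.StringIO(
    "day,temp_c\n"
    "1,21.5\n"
    "2,22.1\n"
    "3,19.8\n"
    "4,250.0\n"  # deliberately implausible value
    "5,20.4\n"
)

df = pd.read_csv(raw)   # read data
df.info()               # basic info: columns, dtypes, non-null counts
print(df.head(2))       # top rows
print(df.tail(2))       # bottom rows

# Landmark: we assume the file should hold exactly 5 days of data.
n_rows_ok = len(df) == 5

# External check: assumed plausible range for an outdoor temperature.
out_of_range = df[(df["temp_c"] < -60) | (df["temp_c"] > 60)]
print(out_of_range)

# Easy answer first: a simple summary before any modeling.
print(df["temp_c"].describe())
```

The plotting step would follow here, for example `df.plot(x="day", y="temp_c")`, which needs matplotlib installed.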
5. Using Models to Explore Your Data
In a linear regression, points should appear evenly balanced around the regression line (p. 71). If they don't, the relationship may not be linear.
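A quick way to check that balance is to look at the residuals (observed value minus fitted value). The sketch below fits an ordinary least-squares line by hand to data that is deliberately quadratic. The helper names `fit_line` and `residuals` and the toy data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys):
    """Observed y minus the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


xs = list(range(-5, 6))
ys_curved = [x ** 2 for x in xs]   # truly quadratic relationship

res = residuals(xs, ys_curved)

# Residuals are positive at both ends and negative in the middle:
# a systematic pattern rather than an even scatter around the line.
print([round(r, 1) for r in res])
```

A pattern like this, with long runs of same-sign residuals, is the numeric counterpart of points not being balanced around the line.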
6. Inference
- Define the population.
- Describe the sampling process (should be representative of the population).
- Describe a model for the population:
    - Usually in the form of a statistical model.
    - How the population units interact: whether they are independent or not.
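The three steps above can be sketched with a simulated population. All specifics are assumptions for illustration: the population of heights, the sample size of 200, and the normal-approximation 95% interval, which is only reasonable if units are independent and the sample is representative.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed for reproducibility

# 1. Define the population (hypothetical: heights in cm of 10,000 people).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# 2. Sampling process: simple random sample without replacement,
#    so every unit has the same chance of being selected.
sample = rng.sample(population, 200)

# 3. Model for the population: units are treated as independent draws
#    sharing a common mean. Approximate 95% interval for that mean.
mean = statistics.fmean(sample)
se = statistics.stdev(sample) / len(sample) ** 0.5
ci = (mean - 1.96 * se, mean + 1.96 * se)

print(f"sample mean = {mean:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

If the units were not independent (for example, members of the same family), this standard error would generally be too small and the interval too narrow.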