Typical Workflow and Definition of Done for data analysis

Tags:: 🗃Archive, Software documentation, Data Analysis, Data methodology

NOTE

This is an internal document I created at Mercadona Tech to have some structure in the way we were doing analysis.

Typical workflow

An analysis task is also a task in Jira.
An analysis starts from and evolves around questions, asked by the PM and the Data Scientists as they analyze the data.
- For a lightweight reference on good EDAs, see 📖 The Art of Data Science.
No more than 2 full days should go by without sharing preliminary results with your peers. The analysis may continue for some more time but we need to make sure we are aligned.
We will use pull requests to our knowledge repo as the sharing mechanism.
We will commit the output of the notebooks to make reviews easier.
- We will favor static visualizations.
- For heavy-computation notebooks, we will use checkpoints: storing partial results on cloud storage.
- If there are many images, we will export them as files and link to them from the notebooks using relative paths. Images linked this way are rendered in GitHub.
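The checkpoint idea above can be sketched as follows. This is a minimal illustration, not the team's actual tooling: the `checkpoint` helper, the pickle format, and the file name are assumptions, and a temporary local directory stands in for the cloud storage location. The numbers in the example are toy values.

```python
from __future__ import annotations

import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def checkpoint(path: Path, compute: Callable[[], Any]) -> Any:
    """Return the result stored at `path`, computing and storing it if missing."""
    if path.exists():
        # Reuse the stored partial result instead of recomputing it.
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records how many times the expensive step actually runs


def expensive_aggregation() -> dict:
    # Placeholder for a slow query or model fit (hypothetical workload, toy values).
    calls.append(1)
    return {"orders": 1200, "avg_basket": 87.5}


# Assumption: a temporary local directory stands in for the cloud storage bucket.
checkpoint_dir = Path(tempfile.mkdtemp())
first = checkpoint(checkpoint_dir / "aggregation.pkl", expensive_aggregation)
second = checkpoint(checkpoint_dir / "aggregation.pkl", expensive_aggregation)
```

The second call reads the stored result, so a reviewer re-running the notebook skips the heavy step as long as the checkpoint exists.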
We will strive to break the data analysis task into small Pull Requests. E.g., if there are several questions in the analysis, we could make a Pull Request for each question or group of questions (depending on the length) instead of a single PR for the whole analysis.
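Since we commit notebook outputs to make reviews easier, a small pre-PR check can catch notebooks that were committed without being run. The sketch below is one possible approach, not an agreed tool: the function name `unexecuted_code_cells` is hypothetical, and it relies only on the notebook file being JSON in which code cells carry an `execution_count` field (null when the cell has not been executed).

```python
from __future__ import annotations

import json


def unexecuted_code_cells(notebook_json: str) -> list[int]:
    """Return the indices of non-empty code cells that were never executed."""
    notebook = json.loads(notebook_json)
    missing = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # `source` may be a single string or a list of lines; join handles both.
        source = "".join(cell.get("source", []))
        if source.strip() and cell.get("execution_count") is None:
            missing.append(index)
    return missing


# A tiny in-memory notebook used only for illustration.
sample = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Question 1"]},
        {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
         "execution_count": 1,
         "outputs": [{"output_type": "execute_result", "execution_count": 1,
                      "data": {"text/plain": ["2"]}, "metadata": {}}]},
        {"cell_type": "code", "metadata": {}, "source": ["2 + 2"],
         "execution_count": None, "outputs": []},
    ],
})
unexecuted = unexecuted_code_cells(sample)
```

Checking `execution_count` rather than the presence of outputs avoids flagging cells that legitimately print nothing, such as imports or assignments.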

Definition of Done

The outcome of a data analysis task should be a Jupyter notebook:

Clearly answering relevant questions to be easily consumed by our stakeholders.
Proposing conclusions and/or next steps.
With clean code.
With easy-to-understand figures, supporting our answers or conclusions.
Without noise (answers to non-relevant questions, figures that are not being used to answer anything, code not in use…).
Peer reviewed on the aspects below. Since re-running a notebook might take a long time, we will not block PR approval on the reproducibility check, but we commit to testing for reproducibility even after the PR has been approved.
- On code correctness and cleanliness (readability, good naming…)
- On analysis correctness (statistical assumptions hold, conclusions are sound…)
- On reproducibility: running the notebook again yields similar results.
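"Similar results" in the reproducibility check can be made concrete by comparing the headline numbers of the committed run against a re-run. The sketch below is only one way to do it: the `similar_results` helper, the metric names, the values, and the 5% relative tolerance are all assumptions chosen for illustration, and the right tolerance depends on the analysis.

```python
from __future__ import annotations

import math


def similar_results(
    first_run: dict[str, float],
    second_run: dict[str, float],
    rel_tol: float = 0.05,  # assumed threshold; tune per analysis
) -> dict[str, bool]:
    """Flag, per metric, whether two runs agree within a relative tolerance."""
    return {
        name: name in second_run
        and math.isclose(value, second_run[name], rel_tol=rel_tol)
        for name, value in first_run.items()
    }


# Hypothetical headline metrics from the committed notebook and from a re-run.
committed = {"conversion_rate": 0.120, "avg_basket": 87.5}
rerun = {"conversion_rate": 0.121, "avg_basket": 95.0}
report = similar_results(committed, rerun)
```

Here `conversion_rate` agrees within the tolerance while `avg_basket` does not, which would be the cue to look for an unseeded random step or a data source that changed between runs.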

Additionally, and in a different PR, we may add other notebooks with tooling and/or knowledge that is not directly related to the questions we were tackling in the analysis task. With the exception of relevance of questions, all the other aspects apply (we want it peer reviewed, in a “clean” state…).
