Why use this project structure?
There are some opinions implicit in the project structure that have grown out of our experience with what works and what doesn't when collaborating on data science projects. Some of the opinions are about workflows, and some of the opinions are about tools that make life easier. Here are some of the beliefs on which this project is built; if you've got thoughts, please contribute or share them.
Data is immutable
Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis. You shouldn't have to run all of the steps every time you want to make a new figure (see Analysis is a DAG), but anyone should be able to reproduce the final products with only the code in `src` and the data in `data/raw`.
Also, if data is immutable, it doesn't need source control in the same way that code does. Therefore, by default, the data folder is included in the `.gitignore` file. If you have a small amount of data that rarely changes, you may want to include the data in the repository. GitHub currently warns if files are over 50MB and rejects files over 100MB. Some other options for storing/syncing large data include AWS S3 with a syncing tool (e.g., `s3cmd`), Git Large File Storage, Git Annex, and dat. Currently by default, we ask for an S3 bucket and use AWS CLI to sync data in the `data` folder with the server.
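For example, assuming a hypothetical bucket name, syncing with the AWS CLI might look like:

```
$ aws s3 sync data/ s3://my-project-bucket/data/   # push local data up
$ aws s3 sync s3://my-project-bucket/data/ data/   # pull data back down
```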
Notebooks are for exploration and communication
Notebook packages like the Jupyter notebook, Beaker notebook, Zeppelin, and other literate programming tools are very effective for exploratory data analysis. However, these tools can be less effective for reproducing an analysis. When we use notebooks in our work, we often subdivide the `notebooks` folder. For example, `notebooks/exploratory` contains initial explorations, whereas `notebooks/reports` is more polished work that can be exported as HTML to the `reports` directory.
Since notebooks are challenging objects for source control (e.g., diffs of the `json` are often not human-readable and merging is near impossible), we recommend not collaborating directly with others on Jupyter notebooks. There are two steps we recommend for using notebooks effectively:
- Follow a naming convention that shows the owner and the order the analysis was done in. We use the format `<step>-<ghuser>-<description>.ipynb` (e.g., `0.3-bull-visualize-distributions.ipynb`).
- Refactor the good parts. Don't write code to do the same task in multiple notebooks. If it's a data preprocessing task, put it in the pipeline at `src/data/make_dataset.py` and load data from `data/interim`. If it's useful utility code, refactor it to `src`.
Now by default we turn the project into a Python package (see the `setup.py` file). You can import your code and use it in notebooks with a cell like the following:
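A sketch of such a cell, assuming a `make_dataset` module lives under `src/data` (adjust the import to your own package layout):

```
# OPTIONAL: load the "autoreload" extension so that imported code is
# re-executed when it changes on disk
%load_ext autoreload
%autoreload 2

# import a module from the project's src package
from src.data import make_dataset
```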
Analysis is a DAG
Often in an analysis you have long-running steps that preprocess data or train models. If these steps have been run already (and you have stored the output somewhere like the `data/interim` directory), you don't want to wait to rerun them every time. We prefer `make` for managing steps that depend on each other, especially the long-running ones. Make is a common tool on Unix-based platforms (and is available for Windows). Following the `make` documentation, Makefile conventions, and portability guide will help ensure your Makefiles work effectively across systems. Here are some examples to get started. A number of data folks use `make` as their tool of choice, including Mike Bostock.
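As a minimal sketch (the file names here are hypothetical, and recipe lines must be indented with tabs), a Makefile encoding one preprocessing step and one figure step might look like:

```
# The cleaned data depends on the raw data and the cleaning script;
# make reruns this rule only when a prerequisite is newer than the target.
data/interim/clean.csv: data/raw/input.csv src/data/make_dataset.py
	python src/data/make_dataset.py data/raw/input.csv data/interim/clean.csv

# The figure depends on the cleaned data, not on the raw data directly.
reports/figures/summary.png: data/interim/clean.csv src/visualization/visualize.py
	python src/visualization/visualize.py data/interim/clean.csv reports/figures/summary.png
```

Running `make reports/figures/summary.png` then rebuilds only the steps whose inputs have changed.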
There are other tools for managing DAGs that are written in Python instead of a DSL (e.g., Paver, Luigi, Airflow, Snakemake, Ruffus, or Joblib). Feel free to use these if they are more appropriate for your analysis.
Build from the environment up
The first step in reproducing an analysis is always reproducing the computational environment it was run in. You need the same tools, the same libraries, and the same versions to make everything play nicely together.
One effective approach to this is to use virtualenv (we recommend virtualenvwrapper for managing virtualenvs). By listing all of your requirements in the repository (we include a `requirements.txt` file) you can easily track the packages needed to recreate the analysis. Here is a good workflow:
- Run `mkvirtualenv` when creating a new project
- `pip install` the packages that your analysis needs
- Run `pip freeze > requirements.txt` to pin the exact package versions used to recreate the analysis
- If you find you need to install another package, run `pip freeze > requirements.txt` again and commit the changes to version control.
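Put together, a session for a hypothetical project called `my_analysis` might look like this (assuming virtualenvwrapper is installed):

```
$ mkvirtualenv my_analysis          # create and activate the environment
$ pip install pandas scikit-learn   # install what the analysis needs
$ pip freeze > requirements.txt     # pin the exact versions in use
$ git add requirements.txt
$ git commit -m "Pin analysis dependencies"
```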
If you have more complex requirements for recreating your environment, consider a virtual machine based approach such as Docker or Vagrant. Both of these tools use text-based formats (Dockerfile and Vagrantfile respectively) you can easily add to source control to describe how to create a virtual machine with the requirements you need.
Keep secrets and configuration out of version control
You really don't want to leak your AWS secret key or Postgres username and password on GitHub. Enough said; see the Twelve Factor App principles on this point. Here's one way to do this:
Store your secrets and config variables in a special file. Create a `.env` file in the project root folder. Thanks to the `.gitignore`, this file should never get committed into the version control repository. Here's an example:
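The values below are placeholders, not real credentials:

```
# example .env file
DATABASE_URL=postgres://username:password@localhost:5432/dbname
AWS_ACCESS_KEY=myaccesskey
AWS_SECRET_ACCESS_KEY=mysecretkey
OTHER_VARIABLE=something
```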
Use a package to load these variables automatically.
If you look at the stub script in `src/data/make_dataset.py`, it uses a package called python-dotenv to load up all the entries in this file as environment variables so they are accessible with `os.environ.get`. Here's an example snippet adapted from the python-dotenv documentation:
AWS CLI configuration
When using Amazon S3 to store data, a simple method of managing AWS access is to set your access keys as environment variables. However, when managing multiple sets of keys on a single machine (e.g., when working on multiple projects), it is best to use a credentials file, typically located in `~/.aws/credentials`. A typical file might look like:
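The profile names and key values below are placeholders:

```
[default]
aws_access_key_id=myaccesskey
aws_secret_access_key=mysecretkey

[another_project]
aws_access_key_id=myprojectaccesskey
aws_secret_access_key=myprojectsecretkey
```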
You can add the profile name when initialising a project; assuming no applicable environment variables are set, the profile credentials will be used by default.
Be conservative in changing the default folder structure
To keep this structure broadly applicable for many different kinds of projects, we think the best approach is to be liberal in changing the folders around for your project, but be conservative in changing the default structure for all projects.
We've created a folder-layout label specifically for issues proposing to add, subtract, rename, or move folders around. More generally, we've also created a needs-discussion label for issues that should have some careful discussion and broad support before being implemented.
The Cookiecutter Data Science project is opinionated, but not afraid to be wrong. Best practices change, tools evolve, and lessons are learned. The goal of this project is to make it easier to start, structure, and share an analysis. Pull requests and filed issues are encouraged. We'd love to hear what works for you, and what doesn't.
If you use the Cookiecutter Data Science project, link back to this page or give us a holler and let us know!
Links to related projects and references
Project structure and reproducibility are discussed more in the R research community. Here are some projects and blog posts that may help you out if you're working in R.
- Project Template - An R data analysis template
- "Designing projects" on Nice R Code
- "My research workflow" on Carlboettiger.info
- "A Quick Guide to Organizing Computational Biology Projects" in PLOS Computational Biology
Finally, a huge thanks to the Cookiecutter project (github), which is helping us all spend less time thinking about and writing boilerplate and more time getting things done.
Bio: DrivenData is a mission-driven data science firm that brings the powerful capabilities of data science, machine learning, and artificial intelligence to organizations tackling the world’s biggest challenges. DrivenData Labs (drivendata.co) helps mission-driven organizations harness data to work smarter, offer more impactful services, and use machine intelligence to its fullest potential. DrivenData also runs online machine learning competitions (drivendata.org) where a passionate, global community of data scientists build algorithms for social impact.
Original. Reposted with permission.