Spreadsheets are powerful tools with many applications: collecting data, sharing data, visualizing data, analyzing data, reporting on data. Sometimes, the temptation to do all of these things in a single workbook is irresistible. But if your goal is to provide data to others for analysis, then features that are useful for, say, reporting are downright detrimental to the task of data analysis. To make things easier on your downstream analysts, and to reduce the risk of inadvertent errors that can be caused by spreadsheets, Karl Broman and Kara Woo have published a paper Data organization in spreadsheets chock-full of useful advice. To reiterate from their introduction:
Spreadsheets are often used as a multipurpose tool for data entry, storage, analysis, and visualization. Most spreadsheet programs allow users to perform all of these tasks, however we believe that spreadsheets are best suited to data entry and storage, and that analysis and visualization should happen separately. Analyzing and visualizing data in a separate program, or at least in a separate copy of the data file, reduces the risk of contaminating or destroying the raw data in the spreadsheet.
The paper provides a wealth of helpful tips for making data in spreadsheets ready for analysis:
Naming Practices: don't use spaces in column names or file names, and make those names meaningful.
Dealing with Dates: avoid a common gotcha by using the ISO 8601 standard for dates, as shown in the XKCD comic to the right.
Representing missing data: don't use empty cells for missing data; use a hyphen or a unique code like NA (but not a number like -9999).
Don't overload the cells. Don't try and pack more than one piece of information into a cell (and don't merge cells, either). In particular, don't encode useful information with color or font; add another column instead.
Follow tidy data principles. Make the data rectangular. If you need multiple rectangles, use multiple files or worksheets. If your data still doesn't fit into that format, a spreadsheet probably wasn't the best place for it in the first place.
Don't make calculations in data files. You're preparing this data for analysis, so avoid the temptation to include formulas to create new data. Just provide the raw data.
You can find details, examples, and lots more useful advice in the complete paper, which I encourage everyone to read at the link below.
The American Statistician: Data organization in spreadsheets (2017), Karl W. Broman and Kara H. Woo (via Sharon Machlis)