Spreadsheet Best Practices

I gave a webinar recently on tools and tips for data management. While many of the themes I spoke about have been covered here previously (though you should really check out the webinar slides – they’re awesome), I realized that I have never written about spreadsheets on the blog. Let’s rectify that.

The big thing to know about spreadsheets is that best practices emphasize computability over human readability; this is likely different from how most people use spreadsheets. The focus on computability is partly so that data can be ported between programs without any issue, but also to avoid capturing information solely via formatting. Formatted information is not computable information, which defeats the main purpose of using a spreadsheet. It’s better to have a clean table that is computable in any program than to have a spreadsheet that looks nice but is unusable outside of a particular software package.

With computability in mind, here are a few best practices for your spreadsheets:

  1. Spreadsheets should only contain one big table. The first row should be variable names and all other rows data. Smaller tables should be collapsed into larger tables whereever possible.
  2. Kick graphs and other non-data items out of your spreadsheet tables. If you’re in Excel, move them to a separate tab.
  3. Keep documentation to a minimum in the spreadsheet. (This is where data dictionaries come in handy.)
  4. Differentiate zero from null.
  5. Remove all formatting and absolutely NO MERGED CELLS. Add variables as necessary to encode this information in another way.

If you follow these rules, you should create spreadsheets that are streamlined and can easily move between analysis programs.

Such portability is important for two reasons. First, there are many great data analysis tools you may want to leverage but they probably won’t import messy spreadsheet data. Second, there are issues with research’s ubiquitous Excel; for example, a recent study showed Excel-related gene errors in up to one fifth of supplemental data from top genomics journals, not to mention the fact that Excel is known to mangle dates. It’s therefore best to keep your options open and your data as neutral as possible.

I hope you use these best practice to streamline your spreadsheet data to take maximum advantage of it!