An important aspect of data management is performing quality control on your data. This means checking your data for errors, ensuring consistent formatting, documenting the meaning of your variables, etc. Also under the umbrella of quality control is how to represent the absence of value in a dataset.
The absence of value can mean many things in a dataset including a true zero, that the data point is missing, that the data point is not applicable to this entry, etc. Unfortunately, many of these cases end up with the same label (or several different labels for any one case!) in a dataset, either “0”, a blank entry, “NA” or something else. And when it comes to calculating values like averages, there is a big difference between a “0” that is a true zero and one that is a placeholder for missing data. Therefore, we need to establish some best practices around absence of value.
The first rule is that “0” always represents true zero and nothing else. This means that you’ve made a measurement and that measurement happens to be zero. Only using “0” for this case makes your subsequent calculations accurate.
The second rule is to pick a good null label. This label will represent a lack of measurement. One of the best null labels is the blank entry, which most programs will interpret as null (so long as you’re careful not use a space instead of a blank). A secondary option is to use the null value preferred by your primary analysis program: “NA” in R, “NULL” in SQL, and “None” in Python, etc. (see Table 1 of White, et al.). However, this option is less ideal as it can result in unexpected problems if you don’t modify the nulls in your dataset before using it in a different program. So it’s best to stick with the blank entry for all of your null data points.
The third rule is to be consistent. There is no point in standardizing something if you’re not going to be consistent about it and, in this case, consistency makes for accurate calculations. So pick a system and stick with it!
Finally, you should document anything that isn’t standard. Want to use blanks for missing data and “NA” for not applicable data points? You can do it, so long as you are clear and upfront (and consistent) about the system you use.
Keeping zero and null straight is not difficult, but it takes a little conscious effort to be sure that everything is accurate. This effort is worth it in the long run, as your datasets are streamlined and your calculations turn out correct.
This post is about nothing*, Practical Data Management for Bug Counters. 30 Jan 2014.
White, et al. Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution. 6(2). 2013. http://library.queensu.ca/ojs/index.php/IEE/article/view/4608/4898