Adding R to the Data Toolkit

I’ve officially jumped on the R bandwagon. Last year I worked on a project for which R turned out to be the best way to tackle a lot of messy data (OpenRefine was not reproducible enough, and let’s not even talk about the disaster that was Access). Since then, I’ve thrown other data at R and now consider it part of my regular suite of data tools.

I want to emphasize that last point: R is just one piece of the data toolkit. Software like R has a steep learning curve if you’ve never programmed before. There are other tools, like OpenRefine, that get the job done and are friendlier to the average user. But for processing large amounts of data in a reproducible way, R is definitely worth learning. (Here’s roughly how I break my data needs down: Excel is for everyday data work; OpenRefine is for one-off data cleaning; and R is for large-scale, reproducible data cleaning and processing.)

So if you find yourself with a lot of data to process, I have some tips for learning R:

  • Run R in RStudio.
    • It takes a little effort to learn the RStudio interface, but it will be a better experience than base R if you’re not used to the command line.
  • Have a problem to solve.
    • Learning a programming language is always easier if you have a specific task to accomplish.
  • Take advantage of existing resources.

Finally, I should say that I’m a fan of the Tidyverse, a collection of R packages that share their own tools and methods for data handling. The Tidyverse makes data cleaning easy, but you do have to organize your data in a particular way, with columns as variables and rows as individual observations. Tidy data is not condensed data: it usually means a few columns with rows and rows of data, but this formatting enables streamlined processing. You don’t need the Tidyverse to use R, but it can be quite useful.
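To make the tidy layout concrete, here is a minimal sketch. The table, its column names, and the counts are all made up for illustration; the only Tidyverse function used is pivot_longer() from the tidyr package, which reshapes a "wide" table into one row per observation.

```r
library(tidyr)

# Hypothetical "wide" table: one row per site, one column per year.
# The site names and counts are invented for this example.
wide <- data.frame(
  site = c("A", "B"),
  `2019` = c(10, 7),
  `2020` = c(12, 9),
  check.names = FALSE  # keep the year column names as written
)

# Reshape so that each row is a single observation:
# every column except `site` is folded into a year/count pair.
tidy <- pivot_longer(
  wide,
  cols = -site,
  names_to = "year",
  values_to = "count"
)

print(tidy)
```

The result has three columns (site, year, count) and four rows, one for each site and year combination. It is longer than the original table, but every variable now sits in its own column, which is the shape the rest of the Tidyverse tools expect.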

R is not the most efficient way to solve every data problem and it takes time to learn, but I think there is an advantage to learning a language like R (or Python or…) if you have serious data manipulation needs. Does it have a place in your data toolkit?