The future of statistical analysis is in open-source programming, not domain-specific proprietary software.


There is a rising demand for inference from data coupled with the arrival of new software and hardware technologies. Seth Brown evaluates the current availability of tools across different statistical languages, both domain-specific varieties such as R and Stata, and general purpose computing languages like Python. He writes that we are on the cusp of great innovation and the development of better tools will only hasten this progress, but there must be careful consideration over the proprietary limits of these programmes to ensure they meet our future needs.

I’ve been thinking about the future of data analysis lately and which statistical language du jour will rise to prominence. I’m using the term statistical as a catch-all adjective to encompass statistics, machine learning, and other types of data analysis and inference. On one side, there are languages built for doing statistics, which have some rudimentary programming capabilities, and, on the other side, there are languages built for programming, which have rudimentary statistical capabilities. This schism requires statisticians and scientists to be fluent in multiple languages, impairs the development of better tools, leads to feature duplication across languages, and generates needless technical debt.

Credit: quinn.anya. (CC BY-SA 2.0)

The burgeoning demand for a deeper understanding of our world through data is highlighting the need for better tools. Lowering the friction of data analysis workflows by closing the schism between existing language paradigms is a critical step towards the development of better tools. A contemporary statistical language is needed that can bridge this divide and provide an efficient, modern data analysis workflow.

Most of my current work requires using a melange of incongruous tools written in R, a domain-specific language, and Python, a general purpose language. I use Python initially to munge data from APIs, databases, real-time streams, and distributed file systems. I switch to R for subsequent EDA, model prototyping, and basic plotting. Then, I return to Python to translate my R code into an interface that can communicate with other software systems. On top of this, additional work is often required to build MVC frameworks to wire the output of analyses to visualizations for the web. Removing the friction associated with switching back-and-forth between these two languages will greatly improve how I, and many others, work.

This harangue isn’t intended to criticize the current crop of statistical languages. Existing tools and the hardware that runs them are more powerful than ever before. However, it is important to critique current technologies so that our tools can continue to improve. We have only scratched the surface of what is to come and we are on the cusp of great innovation in understanding the world through data. The development of better tools will hasten this progress.

New inventions frequently emerge from changes in demand for resources or from the advent of new technologies. Many such inventions are born from the confluence of both elements. Gutenberg’s printing press was conceived from an increasing demand for books among a literate middle class and technical breakthroughs in metallurgy, mechanization, and movable type. Similarly, today we are experiencing a rising demand for inference from data coupled with the arrival of new software and hardware technologies. This junction has created a climate ripe for invention in the way we interact with data.

If I were to invent a modern statistical language today, I’d want to build a rich data analysis API on top of a more general purpose open-source programming language. It is a logical choice to stand on the shoulders of giants. Data analysis no longer operates in a vacuum. Statistical languages must move toward becoming more tightly integrated with other software systems, not the converse. General purpose programming languages provide the best options for moving forward.

Leveraging a general purpose language for statistical analysis frees statisticians to concentrate on statistics and leaves the nuts and bolts of language design to actual language design experts. This structure benefits everyone. A well designed language is easier to understand and is more approachable for students, statisticians, and scientists. The net effect of this division of labor is that the community gets an elegant, expressive, performant, and readable bedrock upon which to build scientific and statistical methods.

Another advantage of general purpose languages is that most are open-source and available gratis. This model contrasts with many domain-specific statistical languages like MATLAB, SPSS, and Stata, which operate under for-profit models. For-profit languages are a bad choice for a future statistical language because proprietary software takes power away from the community and gives it to a single monopolistic entity. When what is best for the community no longer aligns with what is best or most profitable for the entity, problems ensue and users become trapped within the software they helped promote. This is an example of the Microsoft Word problem. More troubling still, using a closed proprietary language gives the impression that users do not care if their results are reproducible or verifiable. For these reasons, a future statistical language must be open-source.

General purpose programming languages are also ideally suited to handle modern data sources. The Internet of Things and data sources like the Twitter firehose have created a world of sensors emitting real-time streams of data at unprecedented velocities. My simple home weather station already emits thousands of data measurements each day. Domain-specific languages were never designed for this world. Their ability to work with streams and out-of-memory data is either non-existent or very poor. In contrast, most general purpose languages have large collections of efficient modules for lazy evaluation and tools for building parallelizable pipelines to handle data streams.
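To make the idea of lazy evaluation concrete, here is a minimal sketch in Python that uses only the standard library. The sensor is simulated with random numbers, and the names `sensor_readings` and `running_mean` are hypothetical, invented for this illustration rather than taken from any real weather station API. The point is only that an endless stream can be summarized in constant memory, one value at a time.

```python
import itertools
import random
from typing import Iterable, Iterator


def sensor_readings(seed: int = 0) -> Iterator[float]:
    """Hypothetical endless stream of temperature readings (simulated)."""
    rng = random.Random(seed)
    while True:
        # Assumption: readings scatter around 20 degrees with some noise.
        yield 20.0 + rng.gauss(0.0, 2.0)


def running_mean(values: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, using constant memory."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


# Chaining generators builds a pipeline; nothing runs until values are requested.
pipeline = running_mean(sensor_readings())

# Pull only the first 1,000 results from the otherwise infinite stream.
first_thousand = list(itertools.islice(pipeline, 1000))
print(f"running mean after 1,000 readings: {first_thousand[-1]:.2f}")
```

Each stage is an ordinary generator, so further steps such as filtering or windowing could be added by wrapping the pipeline in another generator without ever loading the full stream into memory.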

Currently, there is reluctance to move toward a general purpose programming language for statistical analyses. The primary reason for this reticence is justified; domain-specific languages are unmatched in their breadth and depth of applied statistics capabilities. Despite their power, I do not think these languages are the way forward. Their idiosyncratic design, poor performance profiles, lack of important general-purpose functionality, and niche appeal have placed them in an Ivory Tower ill-suited for the future of data analysis.

I posit that domain-specific languages are a sunk cost. The communities that surround these languages feel too deeply invested to abandon them for better ecosystems that require initially going backward to ultimately move forward. This sunk-cost fallacy is reminiscent of the state of functional programming in the 1980s. At that time, functional programmers were spread across several different languages and the user dilution that resulted impaired progress within their domain. The community recognized that to make significant headway they needed to regroup and unify their efforts. They could go fast alone, but farther together. They recognized their sunk costs and formed a committee to build a federated language that would consolidate their efforts. That language became Haskell, one of the best modern programming languages in use today.

I do not think the choice of a general purpose language is all that important. There are many good languages to choose from to build a modern statistical language. However, Python is the most obvious choice. It is easy to learn, readable, open-source, performant, mature, and already widely used in a vast number of domains including data analysis. Importantly, Python has also already made inroads toward replacing domain-specific languages with excellent tools such as SciPy/NumPy, Pandas, scikit-learn, Bokeh, Numba, StatsModels, PyMC, NLTK, and the IPython interactive notebook.

If Python is to be the next statistical language, it still needs substantial work to usurp current domain-specific languages. Python must cater to the needs of users with non-programming backgrounds coming from R, MATLAB, SPSS, Stata, et alia. It must find ways to improve its current packaging and installation system. Python needs to find ways to build a more user-friendly statistical environment (comparable to RStudio) and easier-to-use module repositories (comparable to Bioconductor). These features will allow research methods and algorithms to be published, searched, and implemented with ease in scripts and on the web. Python needs to spread deeper into academia and become more widely established as a serious statistical platform in research as well as science and statistics curricula. If the language can capture a larger section of students and academic researchers, cross-pollination between academia and industry will cement Python as the predominant statistical language of the future and remedy the current language divide.

This piece first appeared on Seth Brown’s personal blog and is reposted with permission.

Note: This article gives the views of the author, and not the position of the Impact of Social Science blog, nor of the London School of Economics. Please review our Comments Policy if you have any concerns on posting a comment below.

About the Author

Seth Brown is a Data Scientist in the telecommunications industry. His research focuses on understanding the topology of the global Internet using large-scale computing, statistical modeling, and data visualization techniques. Prior to computer networking, he was a research scientist in bioinformatics where he studied the structure and function of gene regulatory networks. Seth writes about topics in data analysis and data visualization on his website. He can be found on Twitter @drbunsen.