Tools of the Trade
My initial post is about software tools for data analysis. In the current research climate, there are so many software tools available that it can be daunting to choose the ones to use. My discussion here is going to be restricted to tools I have used and the advantages/disadvantages I have found with them. I will also break down the tools into three categories: statistical analysis, data manipulation, and visualization.
Statistical Analysis Tools
Prior to coming to the CTRSC, I had used several statistical analysis tools over the years, including SPSS, Minitab, and Matlab. I used Minitab the least, mostly for an Experimental Design course; I used SPSS for some of my statistical needs during my other courses and my PhD dissertation. Both Minitab and SPSS aim to make statistical analysis extremely easy by using a graphical interface for entering data and pulldown menus for applying statistical tests. These tools pack in a lot of options and include methods for generating descriptive statistics, nonparametric tests, regression, linear modeling, mixed modeling, ROC curve generation, and more. The advantage of these tools is that they are easy to use; once you understand how to interpret the output, you can generate results quickly. The disadvantage is that your data have to be in good shape BEFORE you enter them into the tool. Cleaning and manipulating (recoding) data are not easy to do in SPSS or Minitab.
For most of my statistical work in graduate school, I used Matlab, mainly because Matlab contains toolboxes for a multitude of fields, including statistics. Matlab is extremely powerful but, for the practitioner, extremely expensive. In addition, to get the most out of Matlab, you have to spend some time learning the language; once you do, there isn’t much you can’t do with it. Since it is not just a statistics package, however, it can be overwhelming to learn all of the features you may need. In particular, Matlab does support some data cleaning and manipulation operations, but they are not always easy to understand or use.
Since I arrived at the CTRSC, I have used the R statistical package almost exclusively. I have become quite comfortable with R because it supports a robust programming language, the community has written a wealth of extension packages, and there is significant support for generating graphics. The major upside is that R is free and open source; if your institution or company is light on cash, this is definitely a plus. On the downside, R is not for the faint of heart, and there is a considerable learning curve. Once you get used to the language, though, it is very powerful and can do almost anything you need.
I also use Excel for some simple statistics (mean, median, standard deviation) on simple datasets; however, I use it more for data manipulation.
My basic toolbox today is R, Excel, or SPSS, depending on what I am trying to do. Typically, for quick one-off analyses I will use SPSS, and for really simple descriptive statistics I use Excel. For larger datasets, when I need to run the same analysis repeatedly, or when I need to generate graphics, I use R.
Data Manipulation Tools
The worst aspects of working with data are cleaning and recoding. Even with a robust data entry system, there will always be manipulations you need to make to put the data in a format that will let you run your analysis. For example, you may need to recode data to a simpler format, convert date formats so that you can perform date calculations, or merge several datasets together before an analysis.
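To make those three tasks concrete, here is a minimal sketch in Python using the pandas library. The datasets and column names (patient_id, severity, glucose, and so on) are entirely hypothetical and only for illustration:

```python
import pandas as pd

# Hypothetical visit data; all names and values are made up.
visits = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "severity": ["mild", "moderate", "severe"],
    "visit_date": ["2014-01-15", "2014-02-03", "2014-02-20"],
})
labs = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "glucose": [95, 110, 140],
})

# Recode a text variable to a simpler numeric format.
visits["severity_code"] = visits["severity"].map(
    {"mild": 0, "moderate": 1, "severe": 2})

# Convert date strings so date arithmetic works.
visits["visit_date"] = pd.to_datetime(visits["visit_date"])
visits["days_since_first"] = (
    visits["visit_date"] - visits["visit_date"].min()).dt.days

# Merge two datasets on a shared key before analysis.
merged = visits.merge(labs, on="patient_id")
```

The same recode, date conversion, and merge steps can be done in R or SQL Server; the pattern is what matters, not the tool.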
The primary tools I use for data manipulation include R, Excel, Notepad (or any plain text editor), Python (running in the Eclipse environment), and MS SQL Server 2012. Occasionally, I will work in MS Access, but I typically move the data to MS SQL Server because it supports a more comprehensive version of SQL.
One common task is search and replace in a text file (e.g., a CSV file). I often use Excel or Notepad for these types of operations. If I need to do conditional replacement, I will use SQL Server. Sometimes R is useful for this, but it can be tricky.
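Both kinds of replacement can also be sketched in a few lines of Python with only the standard library; the CSV content below is invented for illustration:

```python
import csv
import io

# A small CSV held in memory; in practice this would be a file on disk.
raw = "id,status\n1,Y\n2,N\n3,Y\n"

# Plain search and replace, as you might do in a text editor.
cleaned = raw.replace("Y", "yes").replace("N", "no")

# Conditional replacement: change a value only when another column matches.
rows = list(csv.reader(io.StringIO(cleaned)))
header, body = rows[0], rows[1:]
for row in body:
    if row[0] == "2":          # condition on the id column
        row[1] = "unknown"     # replace the status for that row only
```

The conditional loop is the scripted equivalent of a SQL `UPDATE ... WHERE` statement.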
For recoding variables, R excels, and I use it most of the time for these types of operations. R supports the apply family of functions, which are used to apply an operation to the data in a column. The output can also be appended to your dataset to create new columns programmatically.
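The same apply-and-append pattern has a close analogue in pandas, shown here as a sketch with a hypothetical age column:

```python
import pandas as pd

# Hypothetical data; the column and category names are made up.
df = pd.DataFrame({"age": [34, 52, 67]})

def age_group(age):
    """Recode a numeric age into a coarse category."""
    return "under_50" if age < 50 else "50_plus"

# Apply the function to every value in the column and append the
# result as a new column, much like R's apply pattern.
df["age_group"] = df["age"].apply(age_group)
```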
Sometimes, however, I need a more general programming language. Although I have considerable experience with C# and Java, Python works best for me for data manipulation. There are a ton of libraries in Python that support a considerable number of data manipulation operations. For example, just today I wrote a script to download weather data files from NOAA so we didn’t have to do it by hand. Other times I have used Python to write programs that divide a dataset up and calculate statistics on our data collection process for our clinical data. The great advantage of Python is that it supports diverse operations. However, there is a learning curve; it is a little different from other programming languages. A person trained in computer science will have no problem with Python, but for others, the curve is a little steeper.
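The divide-and-summarize pattern mentioned above can be sketched with nothing but the standard library; the records and the "site" grouping key are hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records from a data collection process.
records = [
    {"site": "A", "value": 10.0},
    {"site": "A", "value": 14.0},
    {"site": "B", "value": 9.0},
    {"site": "B", "value": 11.0},
]

# Divide the dataset into groups by site.
groups = defaultdict(list)
for rec in records:
    groups[rec["site"]].append(rec["value"])

# Calculate summary statistics for each group.
summary = {site: {"n": len(vals), "mean": mean(vals)}
           for site, vals in groups.items()}
```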
R and SQL Server are both powerful tools, but they come at a price: the learning curve. Neither tool is simple to use for the uninitiated. It can take considerable time to get up to speed with either of them. However, common tasks that you do on a regular basis eventually become second nature. For anything else, you just have to develop a library of references that you can use when you get stuck.
Visualization Tools
For visualization, the main tools I use are R, matplotlib (a Python library), and Tableau. In general, most of my visualization work is done in R, which has extensive support for graphing and mapping. As with all other operations in R, the learning curve is steep. There are numerous packages for visualization and none of them do everything; you have to fit the package to what you want to show. An easier, yet still customizable, tool is Tableau. It allows the user to drag and drop data onto a grid to generate visualizations. While it can create some very complicated visualizations, it is limited in the amount of customization it allows; even so, it is suitable for a wide variety of visualizations. We only recently got a license for Tableau, so I am still working through all of its features.
When working with Python, I use matplotlib to visualize data. I typically don’t use it for anything but exploratory data analysis; for publications and grant submissions I usually use R. However, when I am manipulating data with Python and I want to take a quick look, matplotlib is great. As I understand it, a lot of people use it for publication-quality graphics, but I don’t have experience in that realm.
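A quick exploratory look in matplotlib takes only a few lines; this sketch uses made-up values and writes the plot to a file rather than opening a window:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

# Hypothetical values produced mid-manipulation.
values = [3, 7, 4, 9, 6]

fig, ax = plt.subplots()
ax.plot(values, marker="o")
ax.set_title("Quick exploratory look")
fig.savefig("quick_look.png")
plt.close(fig)
```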
So that sums up the main tools I use for our data analysis operation at the CTRSC. As we discover and test out new tools, I will likely publish a new entry to discuss them. Until next time, good luck with your data analysis projects.