I am at somewhat of a loss for what to write this week. I could ramble about model selection or the virtues of models versus p-value-driven hypothesis testing, but I’m not sure I have many insightful thoughts to share. Instead my brain seems to have latched onto a small part of the Vickers reading for the week, in which he emphasizes the importance of reproducibility and error avoidance and suggests that all serious research should be done with a statistical programming language. This comment brought up two main thoughts that I figured were worth discussing: (1) can’t a statistical language be misunderstood and misused in the same ways? and (2) to what extent has my (our?) appreciation and use of the R language expanded this semester?
Vickers makes the point that GUI-based statistical software (i.e., click buttons and pull-down menus) makes it all too easy to simply choose an analysis at random and shuffle through many different options. When performing a complex analysis is as simple as picking a command from a drop-down list, it is easy and tempting to just select the method that gives you a significant p-value, even if that method is inappropriate or you don’t understand it. While I agree with this, I disagree that using a statistical language eliminates the issue.
Perhaps it’s because I grew up tinkering with computer stuff (I made a rocking HTML website about soccer in 1998, complete with lots of animated GIFs, per the Internet protocol of the time), but I am not intimidated by GUI-less, command-based software like R. Although performing different statistical tests and methods requires some thought in R, I’d like to argue that the risk of randomly choosing analyses is just as great here. For example, if I have a contingency table that I want to analyze, I know I can go beyond the basic chi-square test, but I may not be sure what other analyses I can do or how to do them (or why). Thanks to Google I don’t need a drop-down list or button; I simply search “contingency table analysis” and get a list of analyses, each of which I can easily find commands, packages, or scripts for. Fisher’s exact test? No problem: fisher.test(table). McNemar’s test? Got it: mcnemar.test(table). Chi-square test? chisq.test(table). G-test? Found a script someone wrote. Just like with point-and-click software, I can perform a bunch of different analyses just like that (mindlessly). And while I know each of these has a specific purpose (or maybe I don’t), it would be incredibly easy to just run through my list of contingency table analyses and go with the one that gave me the p-value I was looking for.
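To make the point concrete, here is a sketch of how little friction there is between these tests in base R. The counts in the table are made up for illustration only:

```r
# A hypothetical 2x2 contingency table (fabricated counts, for illustration)
tab <- matrix(c(12, 5, 9, 14), nrow = 2,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))

# Three different tests, each one line away -- appropriate or not
chisq.test(tab)    # Pearson's chi-square test
fisher.test(tab)   # Fisher's exact test
mcnemar.test(tab)  # McNemar's test (only valid for paired data!)
```

Each call prints a p-value, and nothing in the language itself stops you from quietly keeping whichever one you like best.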
Which brings me to what I think the real issue is. The real issue isn’t the ease with which we can choose alternate analyses. The problem with inappropriately chosen and applied statistics, I think, lies in the overwhelming number of choices and a general lack of understanding of what many of them represent. Sure, you’re going to get called out if you use a Fisher test on a large sample size, but when it comes to more complex analyses you could easily leave meaningless default settings in place or try random variations of the code until you get the result you want, and few people would call you out on it. It is all too easy to choose an approach that gives you the result you like without fully understanding what it is doing or what parameters you need to specify. Using a statistical language does not skirt this issue, especially as a generation comes of age that is ever more adept at understanding and implementing such tools. What do you think? Is it just as easy to run multiple analyses that you may or may not understand in R?
As for my second point, I’m curious to know how everyone else feels their knowledge and use of R has changed this semester. For me, the biggest change to my R use and appreciation has come from learning how to construct functions and loops and from using knitr. These two additions to my R toolbox will definitely enhance my ability to replicate my results and repeat analyses. Functions and loops let me perform the exact same actions over multiple series of data, minimizing the errors introduced by retyping commands and making minor alterations to them. One caveat, though: a single mistake can make all of your analyses wrong, not just one. But still, one block of code is easier to error-check than 20.

Knitr has helped my error reduction in a different way. I’d been looking for a mechanism to print out my code and figures for a while but was disappointed with the results of the basic commands that write to a PDF or text file. I appreciate how knitr weaves the code, figures, and any text into an easily readable summary of your analyses. In my own research I’ve noticed a major change in how easily I can now add to, change, or comment on my analyses. The output HTML file can be saved as a PDF that I can then share or open on a whim to review what I did, how I did it, and everything I found. Compare this to how I worked before I learned about RStudio and knitr, which was basically to re-run the entire analysis and all the code every time I wanted to check a p-value or verify the method I used.

So in the interest of generating discussion: how has everyone else’s approach to their research or use of R changed over the semester? Is anyone else as enamored with knitr as I am?
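For anyone who hasn’t tried the function-plus-loop pattern yet, here is a minimal sketch of what I mean. The file names and column names are hypothetical, just to show the shape of it:

```r
# Wrap the analysis in a function so every dataset is treated identically
# (assumes each CSV has 'treatment' and 'outcome' columns -- made up for this example)
summarize_counts <- function(dat) {
  tab <- table(dat$treatment, dat$outcome)
  chisq.test(tab)$p.value
}

# One loop instead of 20 copy-pasted blocks: fix a bug once, not 20 times
files <- c("site1.csv", "site2.csv", "site3.csv")  # hypothetical file names
pvals <- sapply(files, function(f) summarize_counts(read.csv(f)))
```

The payoff is exactly the trade-off described above: if `summarize_counts` has a bug, every p-value is wrong, but there is only one place to look for it.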