Are You My Model? (not to be confused with “Are You My Mother?”, which is about a baby bird)

What oh what, where can you be,

An appropriate model that describes me;

And now thanks to An Intro to Computational Data Analysis for Biology,

I know where to start, so let’s go, please bear with me.


The first thing to do is to ask what is my predictor?

What response does it predict, and what is its shape toward the future?

I skip over money and also alcohol,

as the latter causes over-dispersion, and of the former I have none-at-all.


If I’d like to go and compare my model to the rest,

just so I can prove once and for all I’m the best;

I must choose something common to everyone with which I can rhyme,

oh wait, that’s perfect, I’ll just go and use Time.


As my response I will choose “Happy Times/Day”,

Its error Poisson Distributed… which is always fun to say;

So I go back and sample across time my life,

but there are always some gradients that may cause some strife.


My replicates are days, but I’m not a morning person,

so I randomly sample across 24 hours, just to be certain;

Happiness builds and diminishes across seasons,

so Temporal autocorrelation will be needed to account for those reasons.


So now here I have my life’s model, YAY!

But it’s really no fun without more to say;

So a-priori to gathering all of my data,

I outlined other variables to look at which may influence MX and Beta.


Incidence of approval from advisors of sorts,

Leisure time spent in skis, cleats, or swim shorts;

While these two should lead to happiness increased,

Long problem sets… ahem, stats homework… will probably provide the least.


So now I go and fit my data to some math,

As happiness isn’t exponential or bounded, I assume linearity down my path.


family=poisson(link="log"), data=My_Life) should be fine.


But before I evaluate, I look at correlations,

just to be sure there aren’t funky palpitations.

Uh oh, there one is, I’m gonna have a fit,

It’s Approval and Homework… I’ll never hear the end of it.


So I add in another variable to the equation,

“Approval:Homework” will now help explain the relation.

So I run my diagnostics, looking at residuals vs fitted,

Q-Q, Scale-Location, Cook’s Distance, Residuals Leveraged.


When everything checks out, I look at my fit,

I evaluate an ANOVA, Coefficients after it.

I’m using an alpha of 0.05 even though sometimes it’s rubbish,

I need to conform to NIH guidelines so that I may publish.


My happiness is positively influenced by Leisure and Approval,

Also Approval’s interaction with Homework, but negatively upon its removal.

Apparently Time has no effect, I’ve been happy all along,

exemplified by constant proclivity to write rhyme/song.


So now I can say I’ve finally reached my goal,

of finding my model and evaluating it as a whole.

And though I’ve done so through end-rhyme poetry,

It’s a weekly reflection, so don’t hold it against me.


It’s been fun taking the course, and though I’m drowning in work,

the knowledge I’ve gained has been quite a perk.

Even with all I’ve learned, statistics still seems quite daunting,

not nearly the same mathematical meadow in which you, Jarrett, seem jaunting.


I do have more confidence, and I’m not talking about intervals,

when discussing my analysis and reasoning and residuals.

So here’s my big thanks to Biol 697,

It came just in time as I’m nearing the end of data collection.
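
For anyone who would rather read the poem as code, here is a minimal sketch of the workflow in R. The poem only gives the family, the link, and the data frame name, so the column names (Happy, Time, Leisure, Approval, Homework), the formula, and all of the simulated numbers below are assumptions made for illustration. Note that for a GLM the "ANOVA" step is an analysis of deviance.

```r
set.seed(697)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in for My_Life: every column and effect size is made up
n <- 200
My_Life <- data.frame(
  Time     = runif(n, 0, 24),    # hour of day, sampled at random
  Leisure  = rpois(n, 2),        # hours in skis, cleats, or swim shorts
  Approval = rbinom(n, 1, 0.4),  # did an advisor approve of something?
  Homework = rpois(n, 3)         # hours of problem sets
)
My_Life$Happy <- rpois(n, exp(0.5 + 0.2 * My_Life$Leisure +
                              0.3 * My_Life$Approval -
                              0.1 * My_Life$Homework))

# Poisson GLM with a log link; Approval * Homework expands to both main
# effects plus the Approval:Homework interaction
fit <- glm(Happy ~ Time + Leisure + Approval * Homework,
           family = poisson(link = "log"), data = My_Life)

# The four default diagnostic plots: residuals vs fitted, Q-Q,
# scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# Rough over-dispersion check: residual deviance / residual df should be near 1
dispersion <- deviance(fit) / df.residual(fit)
dispersion

anova(fit, test = "Chisq")   # sequential analysis of deviance
summary(fit)$coefficients    # coefficient estimates with z-tests
```

Temporal autocorrelation is left out of this sketch; accounting for it would need a different model structure than a plain glm().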

Posted in Uncategorized | Leave a comment

Programming in R, your new best friend?

I am at somewhat of a loss for what to write this week.  I could ramble about model selection or talk about the virtues of models versus p value driven hypothesis testing, but I’m not sure I have many insightful thoughts to share.  Instead my brain seems to have latched onto a small part of the Vickers reading for the week, in which he emphasizes the importance of reproducibility and error avoidance and suggests that all serious research should be done with a statistical programming language.  This comment brought up two main thoughts that I figured were worth discussing: 1. Can’t the statistical language be misunderstood and misused in the same ways? and 2. To what extent has my (our?) appreciation and use of the R language expanded this semester?

Vickers makes a point that GUI based (i.e. click buttons and pull-down menus) statistical software makes it all too easy to simply choose an analysis at random and shuffle through many different options.  When performing a complex analysis is as simple as choosing a command from a drop down list it is easy and tempting to just select the method that gives you the significant p value, even if that method is inappropriate or if you don’t understand it.  While I agree with this, I disagree that using a statistical language eliminates this issue.

Perhaps it’s because I grew up tinkering with computer stuff (I made a rocking HTML website in 1998 about soccer, complete with lots of animated gifs according to Internet protocol at the time), but I am not intimidated by GUI-less, command-based software like R.  Although performing different statistical tests and methods requires some thought in R, I’d like to argue that the risk of randomly choosing analyses is just as great here.  For example, if I have a contingency table that I want to analyze, I know I can go beyond the basic chi-square test, but I may not be sure what other analyses I can do or how to do them (or why).  Thanks to Google I don’t need a drop-down list or button; I simply search “contingency table analysis” and I get a list of analyses, each of which I can easily find commands, packages, or scripts for.  Fisher test? No problem: fisher.test(table).  McNemar test? Got it: mcnemar.test(table).  Chi-square test? chisq.test(table). G-test? Found a script someone wrote.  Just like with point-and-click software, I can perform a bunch of different analyses just like that (mindlessly).  And while I know each of these has a specific purpose (or maybe I don’t), it would be incredibly easy to just run through my list of analyses of contingency tables and go with the one that gave me the p value I was looking for.
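
To make the point concrete, here is a small sketch of how little typing it takes. The 2x2 table is made up for illustration, and the object names (tab, p_values) are my own. All three functions ship with base R, and each one answers a different question, which is exactly what is easy to ignore.

```r
# A made-up 2x2 table: the counts are purely illustrative
tab <- matrix(c(12, 5, 7, 16), nrow = 2,
              dimnames = list(Treatment = c("A", "B"),
                              Outcome   = c("Yes", "No")))

# Three tests, three different questions, one line each
p_values <- c(
  fisher  = fisher.test(tab)$p.value,   # exact test of independence
  chisq   = chisq.test(tab)$p.value,    # approximate test of independence
  mcnemar = mcnemar.test(tab)$p.value   # paired data only: marginal homogeneity
)
round(p_values, 3)
```

Nothing stops mcnemar.test() from running on a table that does not come from paired observations. R will happily return a p value either way; knowing whether it means anything is up to the person typing.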

[Screenshot of R output] See how easy it is to run multiple contingency table analyses?! Hey look, two of them give a significant p-value!

Which brings me to what I think the real issue is.  The real issue isn’t the ease with which we can choose alternate analyses.  The problem with inappropriately chosen and applied statistics, I think, lies with the overwhelming number of choices and general lack of understanding of what many of them represent.  Sure, you’re going to get called out if you use a Fisher test on a large sample size, but when it comes to more complex analyses you could easily leave in meaningless default settings or choose random code variations to get the result you want, and few people would call you out on it.  It is all too easy to choose an approach that gives you the result you like without fully understanding what it is doing or what parameters you need to specify.  Using a statistical language does not skirt this issue, especially with a generation coming of age that is more adept at understanding and implementing such a method.  What do you think? Is it just as easy to run multiple analyses that you may or may not understand in R?

As for my second point, I’m curious to know how everyone else feels their knowledge and use of R has changed this semester.  As for me, I think the biggest change to my R use and appreciation has come from learning how to construct functions and loops and from using knitr.  These two additions to my R toolbox will definitely enhance my ability to replicate my results and repeat analyses.  Functions and loops allow me to perform the exact same actions over multiple series of data, therefore minimizing the error introduced by retyping and making minor alterations to commands.  One caveat to this, though, is that one mistake can make all of your analyses wrong, not just one.  But still, one block of code is easier to error check than 20.  Knitr has helped my error reduction in a different way.  I’d been looking for a mechanism to print out my code and figures for a while but was disappointed with the results of basic commands that write to PDF or text file.  I appreciate how knitr incorporates the code and figures and any text into an easily readable summary of your analyses.  In my own research I’ve noticed a major change in how easily I can now add to, change, or comment on my analyses.  The output html file can be saved as a pdf that I can then share or open on a whim to review what I did, how I did it, and everything I found.  Compare this to how I used to do it before I learned about RStudio and knitr, which was basically to re-run the entire analysis and code every time I wanted to check a p-value or verify the method I used. So in the interest of generating discussion, how has everyone else’s approach to their research or use of R changed over the semester?  Is anyone else as enamored with knitr as I am?
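
As a rough illustration of the functions-and-loops point, here is a minimal sketch. The helper name fit_slope, the column names x and y, and the three simulated "seasons" are all hypothetical; the idea is only that the analysis is written once and then applied identically to every data set.

```r
# Hypothetical helper: fit the same regression to any data frame that
# has columns x and y, and return the slope and its p-value
fit_slope <- function(dat) {
  fit <- lm(y ~ x, data = dat)
  coefs <- summary(fit)$coefficients
  c(slope = coefs["x", "Estimate"], p = coefs["x", "Pr(>|t|)"])
}

set.seed(1)
# Three simulated data sets standing in for separate field seasons
seasons <- lapply(1:3, function(i) {
  x <- runif(30)
  data.frame(x = x, y = 2 * x + rnorm(30))
})
names(seasons) <- paste0("season", 1:3)

# One call runs the identical analysis on every data set
results <- t(sapply(seasons, fit_slope))
results
```

The caveat from above still applies: a bug inside fit_slope() would be wrong for all three seasons at once, but there is only one place to look for it.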

Posted in Uncategorized | Leave a comment

Slides with Knitr and Slidify

So, for those of you using knitr in the class, you may be feeling pretty good about R Markdown. You may want to even start using it more broadly, say, for making slides with lots of analyses. Previously you had to use LaTeX and Beamer. That’s what I’ve been doing for class. This has been fun, but it’s a whole new cumbersome language to work with. Some folk had figured out how to use pandoc to make slides using markdown. This was OK. But complicated.

Now…there is the new slidify library which puts everything into a nice tidy R + Markdown + GitHub + RPubs ecosystem. Very hot.

Posted in R Libraries You Might Like | Leave a comment

So no cleaning my kitchen today….

First of all, I must say that it’d be pretty damn impressive if a 93 year old man weighing 700 pounds could run a mile in negative time and lift half a ton.  I’d argue that the mile thing could become possible if we ever figure out time travel though.  It is pretty clear that there are dangers with extrapolation, but there are perks too…

I live with two guys and I am (unfortunately) quite messy and they are pretty messy and as a result, our apartment is constantly messy.  However, as Vickers so awesomely highlights in his messy kitchen example, “the increase in reward per unit effort decrease with increasing effort.”  Therefore it really doesn’t make sense to keep cleaning after the basics are complete because the returns for your efforts begin to plateau (and who wants to clean without a reward?).  I am thankful that this relationship is not linear, and I’m glad I have statistical ammunition to back up my lack of cleaning efforts if the guys ever say anything (which they won’t).

So extrapolation appears to occur when we improperly fit our data and we inappropriately look beyond the data that we have.  So why do people do this?  I can’t really answer that question, but it is interesting to think about.  I mean, say for example the scientists involved in projecting climate change for the IPCC chose to violate these rules (if any of you remember climate gate from 2009 then you know accusations of this did exist).  If the accusations were proven true, this violation would kill the credibility of science and also make the public even more skeptical of something that is so direly important.
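
Here is a minimal sketch of that failure, borrowing the messy-kitchen idea. Everything in it is simulated: the saturating curve, the noise level, and the cutoff at effort = 1 are assumptions chosen only to show what happens when a straight line fitted to a narrow range is pushed well beyond it.

```r
set.seed(42)
# Simulated "reward for cleaning" that saturates with effort (made-up numbers)
effort <- seq(0, 5, length.out = 50)
reward <- 10 * (1 - exp(-effort)) + rnorm(50, sd = 0.3)

# Fit a straight line using only the low-effort part of the data
low <- data.frame(effort = effort[effort <= 1], reward = reward[effort <= 1])
line_fit <- lm(reward ~ effort, data = low)

# Predict near the edge of the observed range (effort = 1)
# and far outside it (effort = 5)
pred  <- predict(line_fit, newdata = data.frame(effort = c(1, 5)))
truth <- 10 * (1 - exp(-c(1, 5)))
cbind(effort = c(1, 5), predicted = pred, truth = truth)
```

Within the range the line was fitted to, the prediction is reasonable. At effort = 5 the line keeps climbing while the true curve has long since flattened, so the extrapolated reward is far too large.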

How are we going to change environmental policy if we violate these rules?  What is the perk in data misrepresentation?  In inappropriately representing and extrapolating your data, you are likely creating a much more dramatic (maybe interesting) story, but it isn’t true and you lose credibility; even your field could lose credibility once that is unveiled.  Yeah, it may be more difficult to lobby for carbon sequestration efforts if you didn’t have a hockey stick curve (and I am not a climate gate believer), but you should let your data speak for itself and your interpretation speak for what steps should be taken after the fact.  So what did we learn from climate gate? Well for starters, hackers suck.  Be careful what you say, but in the end there will still be people who will see your words out of context and draw false accusations.  Lastly, while this incident didn’t set the best tone for the climate summit in Copenhagen, one really positive thing did come out of it in that there was new value in data sharing and data openness.

After the climate gate investigations concluded that anthropogenically induced climate change really did exist and the science was sound and honestly conducted, I doubt that the climate change skeptics were ready to even be in the path of the bandwagon or that politicians were eager to push pro-environmental legislation.  But how do you fix a misunderstanding and regain public trust after something like this?  From what I’ve read, scientists looked for better outlets for making their data publicly available or available on request, and made methods more accessible.

So, I think there is a lesson to be learned from improper data usage like extrapolation, or even accusations of data misconduct.  The more open you are with your methods and data, the less likely your email will get hacked and the more likely you are to help science progress and cover your a$$ in the process.

Posted in Uncategorized | Leave a comment

I may not need to be terrified of O’Hara anymore

So I have to admit, after reading O’Hara 2009, How to make models add up- a primer on GLMMs, I was about ready to give up. I was thinking to myself that it was all just gibberish and that I’d never make sense of all those Greek letters. But this week has given me new hope! Maybe some of that gibberish made it into my brain after all! Or maybe it was a week in class of ANOVAs… Either way, the chapter this week made sense and maybe even cleared up some of what O’Hara was trying to say to me.

So it turns out all these big long complicated models are just modified linear regressions, nothing more, nothing scary. A model, any model, just represents the relationship between some response variable and one or more explanatory variables. This is easy to see when we are doing a linear regression, but when we get to things like Jen’s model on how different nitrogen inputs affect nitrogen loads in Waquoit Bay, taking into account atmospheric deposition, wastewater, removal in septic tanks, animal wastes, well that simplicity gets a little buried in all the details, but it’s still there, it’s still y=ax+b.

Whitlock and Schluter simply leave out all the additional slopes that are built into their models when they write them out and this makes the relationship that the model is trying to show much more straightforward.  O’Hara on the other hand, gave me a headache with all of their slopes for each different categorical variable. I see now however, that you can simply set each of those other variables to zero (rendering their troublesome slope obsolete!) to look at the relationship between the response variable and a single categorical variable.

I’m also starting to see how ANOVAs and linear models are really paired. An ANOVA represents a series of different F-tests where we eliminate one of the explanatory variables from our model and call that our null. We then compare the fit of that model to a model that incorporates the variable of interest. The F-statistic and p-value associated with it will tell us how significant the change in the model fit is due to the incorporation of this variable. I still want to know: how did O’Hara get those estimates in the tables? I thought these were from ANOVAs, but that’s not exactly true… It looks like those are t-tests.
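
A minimal sketch of that comparison in R, using simulated data (the variable names x1, x2, y and all of the numbers are made up). It fits a model without the variable of interest, fits one with it, and compares them with an F-test. It also hints at where the t-tests in a coefficient table come from: when a single term with one degree of freedom is added, the square of its t statistic equals the F statistic from the model comparison.

```r
set.seed(10)
# Simulated data: y depends on x1 but not on x2 (all values are made up)
dat <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
dat$y <- 1 + 2 * dat$x1 + rnorm(60)

null_fit <- lm(y ~ x2, data = dat)        # model without the variable of interest
full_fit <- lm(y ~ x2 + x1, data = dat)   # same model with x1 added

# F-test: does adding x1 improve the fit more than chance would?
f_table <- anova(null_fit, full_fit)
f_table

# The coefficient table reports a t-test per term; for a single added
# term with 1 df, t^2 equals the F statistic above
t_x1 <- summary(full_fit)$coefficients["x1", "t value"]
c(F = f_table$F[2], t_squared = t_x1^2)
```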

Another thing I learned: why you use n-1 degrees of freedom in an ANOVA. Every other statistics course I have ever taken has simply told me, “In an ANOVA use n-1 degrees of freedom,” but now I finally understand why. The degrees of freedom actually means something: how many things are “free” to move, i.e. how many variables there are that are not “fixed.” In an ANOVA we set a single variable to be our “baseline” and look at how all other variables change in relation to it. This one variable is therefore fixed and we lose that as a degree of freedom.
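
One way to see the baseline idea is to look at how R codes a categorical variable. This is a sketch under my own assumptions: a hypothetical three-level treatment factor with four replicates per level and simulated responses. With R’s default treatment contrasts, the first level is absorbed into the intercept, so three groups need only two extra columns, and the treatment term gets 3 - 1 = 2 degrees of freedom.

```r
# A factor with three levels (hypothetical treatments)
treatment <- factor(rep(c("control", "low", "high"), each = 4),
                    levels = c("control", "low", "high"))

# Default treatment contrasts: the first level is the baseline, folded
# into the intercept, so 3 levels need only 2 dummy columns
X <- model.matrix(~ treatment)
colnames(X)

set.seed(3)
# Simulated responses with made-up group means of 5, 6, and 8
y <- rnorm(12, mean = c(5, 6, 8)[as.integer(treatment)])

# Treatment df = 3 levels - 1 = 2; residual df = 12 observations - 3 = 9
aov_table <- anova(lm(y ~ treatment))
aov_table
```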

While I did appreciate Whitlock and Schluter explaining generalized linear models in plain English that I could understand, I thought they took a rather simplistic route to explain this topic. They could have taken out a lot of extra words by eliminating the examples of categorical variables vs. blocking vs. factorial design and instead given us a more general overview of “multiple explanatory variables.” It seems to me that the whole idea is to add in all the terms that are a part of your experimental design, be they blocks that you established as part of your experimental setup or interaction terms that are inherent in factorial design. There’s really no need to explain these analyses separately; they are actually just the same analysis with more or fewer variables. In fact the idea of a covariate is also the same. This is something that is not included in our experimental setup, but that we know is unavoidably going to affect our data and therefore needs to be accounted for in our model. We test to see how the covariate affects our model (i.e. does mass really affect energy expenditure? If so include it, if not, don’t. Is there an interaction term? If so include it, if not, don’t.) A more general statement regarding generalized linear models with a detailed description of some of these extra terms you could include and why might make this explanation more comprehensive and streamlined. It could also eliminate any possible confusion to the reader since treatment of these different types of experiments is presented as different tests.

Finally, I want to put in some reminders to myself. These reflections are really just my rewriting of the points I found most useful from any one week’s reading, so here’s a few more points that I want to address:

Things that are built into an experimental design should stay in a model whether they improve the fit or not. If it wasn’t a part of the design however, take out things that don’t enhance the fit.

Be careful of random factors- things that are randomly sampled and not fixed add more error to the model and need to be treated as such in the model fit. This affects how the F-test is run and needs to be specified before calculation. I wonder how it changes it though…

Finally, there are always assumptions to any test. Make sure you’re meeting them! ALWAYS!

Posted in Reflections | 1 Comment

More Themes with ggthemes

Tired of the same old ggplot2 themes, and want to spice it up a bit? Then try out the ggthemes package, currently on GitHub. You can make your graph look Tufte-ian, solarized, like it came from Stata, formatted for The Economist, or, even…like it came from Excel!


Posted in Links to outside reading | Leave a comment

Types of Error and Hurricane Sandy

Two great posts on Type I and II error and decisions about city closures here and with a followup here.

Posted in Links to outside reading | Leave a comment