Truly Appreciating Variance and Sample Sizes (Scott Morello)

     It was two summers ago and I was, once again, splitting my time between teaching, interns, and my own field research whenever time permitted. Each summer, graduate students in our lab take on NSF- and NIH-funded undergraduates to gain experience in an advisory role. To balance the multitude of summer responsibilities, though, it behooves the graduate student advisor to structure the advisee’s project around their own research goals; in my case, studying rocky intertidal communities. The unfortunate part about intertidal work, however, is that fieldwork hours are confined to the times between high and low tide (as is inherent in the name). This constitutes, at most, two separate 4-hour periods per day, for a 1-2 week period every month when the tides are low enough to access deeper communities. Given these time constraints, it is important that intertidal researchers manage their work schedule carefully in order to maximize the return in data collected.

     As we read in Vickers and discussed this past week, a sample mean can be somewhat meaningless (pun intended) without additional numeric descriptors to accompany it. One such descriptor, the spread of estimates, is an essential statistic in evaluating any group of data. Estimates of variance around means serve as the basis for most comparative statistics (e.g. ANOVA), and a basic understanding of a system’s variance is required going INTO a study; without it, experimental design becomes difficult. This is because variability in the data determines how many samples (AKA “sample size”) are needed to accurately estimate a mean. Misjudging that variance can lead to either undersampling (disastrous at times) or oversampling (not as terrible a thing, but it comes at the cost of money and the time I was discussing earlier).
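To make the link between spread and sample size concrete, here is a minimal R sketch of the standard error of a mean, sd / sqrt(n). The function name and the two standard deviations are made up for illustration; they are not estimates from the snail data.

```r
# Standard error of a sample mean: spread divided by the square root of n.
# standard_error() and the sd values below are hypothetical illustrations.
standard_error <- function(sd, n) sd / sqrt(n)

n <- c(5, 10, 20, 40, 80)
se_low  <- standard_error(sd = 0.1, n = n)  # a low-variance system
se_high <- standard_error(sd = 0.2, n = n)  # a system with twice the spread

# With twice the spread, it takes four times the samples
# to reach the same precision (n = 80 vs. n = 20 here).
same_precision <- isTRUE(all.equal(standard_error(0.2, 80),
                                   standard_error(0.1, 20)))
```

The point of the sketch is only the scaling: precision improves with the square root of the sample size, so a system that is twice as variable costs four times the fieldwork for the same quality of estimate.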

     A study from the 1980s had previously correlated Shell Length:Shell Aperture ratios in the intertidal snail Nucella lapillus with wave force measurements, creating a regression by which one could estimate average wave exposure on a shore merely through morphological measurements. This had only been done in UK intertidal systems, however, and never in New England. Seeing this as an opportunity for both an easy undergraduate project and a way to generate exposure estimates at my study sites, I tasked a summer student with undertaking it. I assumed that with such a simply measured variable (shell length), we could devote minimal time to their project and maximize my own fieldwork. As is usually the case in field research, it didn’t quite work out that way. The reason for this had everything to do with a misunderstanding of the spread in the data.

     Prior to starting fieldwork, I had selected a sample size of 100 individuals per transect. This project would entail 3 transects per site, 6 sites per location, and 2 locations. So, though the actual method of measuring a snail seemed trivial, the total number of samples now seemed a bit daunting (3,600 total snails; 2 measures per snail = 7,200 measurements). But, in true “head down and power through” fieldwork fashion, we started the project. Needless to say, 2 days into the work I had gotten nothing done with respect to my own research, despite having 4 people conducting measurements throughout the day. That night, I considered where I could have gone wrong in my planning. I obviously knew a decreased sample size was necessary… but to what? Remember that time in the previous paragraph when I used the phrase “misunderstanding of the spread in the data”? Well, let’s just say that should have read “neglecting of the spread in the data”. The sample size of 100 was chosen haphazardly, and mistakenly, and I was paying the price. A quick analysis of Shell Length:Shell Aperture ratios that very night showed that I needed far fewer samples per site to account for the spread, or variance, in the data when estimating the mean. Much like we did in class this week, I created a scatterplot of the means of randomly selected individuals at different sample sizes. The graph I got was similar to the following:




If only I had used preliminary variance estimates to calculate a sample size, I could have spent far more time doing my own research. My sample size of 100 snails was far greater than necessary (~40 snails) to estimate the mean. Gaining this a priori understanding can be difficult, and sometimes requires using variance estimates from other data types (e.g. weight data variance to estimate a sample size for volume). Nevertheless, it is important, and I’ve found that if a variance estimate can’t be easily gathered in the field or found in a journal, many researchers are more than willing to part with one from their personal data (when it’s not published). They know that these data usually aren’t enough for you to publish a whole paper around, but they can be very useful in helping other researchers generate useful datasets.
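As a rough illustration of turning a preliminary variance estimate into a sample size, the R sketch below uses the textbook normal-approximation formula n = (z * sd / margin)^2. This is not the analysis described above; the function name, the pilot values, and the target margin are all hypothetical.

```r
# Approximate sample size needed to estimate a mean to within +/- margin,
# using the normal approximation n = (z * sd / margin)^2.
# required_n() is a hypothetical helper, not part of any package.
required_n <- function(sd, margin, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  ceiling((z * sd / margin)^2)
}

# Made-up pilot ratios standing in for a small preliminary sample
# (or for a variance estimate borrowed from a colleague).
pilot <- c(1.32, 1.45, 1.38, 1.51, 1.29, 1.42, 1.36, 1.48, 1.40, 1.39)

# Samples needed to pin the mean ratio down to +/- 0.02 at 95% confidence.
n_needed <- required_n(sd = sd(pilot), margin = 0.02)
```

Because the pilot standard deviation is itself an estimate, the result is best treated as a planning figure; a t-based calculation would ask for somewhat more samples when the pilot is small.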

     The Vickers readings from this past week, as well as the lectures from class, have emphasized a few important descriptive statistics/sample properties, of which variance is one. Which ones are useful at any given moment depends on the question being asked, the context in which the results are used, and the presence of normality, outliers, and continuous variables (I especially liked Vickers’ continued use of Bill Gates). Neglecting these properties of your data will not only detract from your understanding of the results, but can impede your generation of the data as well.


Code for the plot above: 


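# Four replicate draws at every sample size from 1 to 100
sampleSize <- rep(1:100, times=4)

# Pre-allocate the vector that will hold the mean of each random draw
shellTable <- numeric(length(sampleSize))

# Ratios for site "LL" from the 'shells' data frame (one row per snail)
llRatios <- shells[which(shells$site=="LL"), "Shell.Length.Aperture.Ratio"]

for (i in seq_along(sampleSize)) {
  # Mean ratio of a random sample (without replacement) of the given size
  shellTable[i] <- mean(sample(llRatios, size=sampleSize[i]))
}

plot(sampleSize, shellTable, xlab="Sample Size", ylab="Shell Length:Aperture Mean")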
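The snippet above depends on a `shells` data frame that is not included with this post. As a stand-in, the sketch below runs the same resampling idea on simulated ratios. The normal distribution, its mean of 1.4, and its standard deviation of 0.1 are assumptions chosen purely for illustration, not estimates from the snail data.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical stand-in for one site's Shell Length:Aperture ratios.
ratios <- rnorm(500, mean = 1.4, sd = 0.1)

# Four replicate draws at every sample size from 1 to 100.
sampleSize <- rep(1:100, times = 4)

# Mean of a random sample (without replacement) at each sample size.
sampleMeans <- vapply(sampleSize,
                      function(n) mean(sample(ratios, size = n)),
                      numeric(1))

# Scatter of the sample means at small versus large sample sizes.
spreadSmall <- sd(sampleMeans[sampleSize <= 10])
spreadLarge <- sd(sampleMeans[sampleSize >= 91])

plot(sampleSize, sampleMeans,
     xlab = "Sample Size", ylab = "Shell Length:Aperture Mean")
```

On the plot, the points should funnel in toward the overall mean as sample size grows; the sample size where the funnel stops narrowing appreciably is the kind of visual cue used above to settle on a smaller number of snails.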
