P-hacking

In perfect timing for our Bayes discussion, one of our Senior Science Advisors at work sent along this article by Gelman and Loken (see here: http://www.americanscientist.org/issues/feature/2014/6/the-statistical-crisis-in-science/2), questioning the weight many of us researchers place on statistically significant comparisons. Calling it the “Statistical Crisis in Science,” Gelman and Loken delve into several published papers that report statistical significance, and ask how much of that significance is shaped by the researchers’ predetermined predictions and by the particular data sets they analyze. They discuss the issue dubbed “p-hacking,” in which researchers arrive at a statistically significant result but never look at the other relationships or comparisons they could have tested: they examine only the hypothesis at hand, not the full set of paths that could have produced an equally good p-value. The authors claim “the mistake is in thinking that, if the particular path that was chosen yields statistical significance, this is strong evidence in favor of the hypothesis.”
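To see why this matters, here is a quick simulation sketch (my own made-up example, not from the article): even when there is no real effect at all, a researcher who checks 20 plausible comparisons will stumble onto at least one “significant” p-value most of the time.

```r
# Simulation sketch: pure noise, 20 candidate comparisons per "study".
# How often does at least one come out significant at p < 0.05?
set.seed(2014)
n_sims  <- 1000   # number of simulated studies
n_tests <- 20     # comparisons a researcher might plausibly look at
n_obs   <- 30     # observations per group

any_hit <- replicate(n_sims, {
  p_vals <- replicate(n_tests, t.test(rnorm(n_obs), rnorm(n_obs))$p.value)
  any(p_vals < 0.05)
})

mean(any_hit)   # roughly 1 - 0.95^20, i.e. about 0.64
```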

While I agree we need to consider all possible relationships or hypotheses so that we don’t miss the true importance of what is going on, how often can we actually examine every possibility that could drive us to a result?  The very basis of most of our studies is a driving question, which focuses the relationships we analyze.

The role of statistical significance in our analyses is still clearly a matter of debate in the community. A coworker of mine published a paper in which her co-author, a statistician, insisted that she include p-values in the analysis, even though she felt the p-values were not relevant to the analysis being discussed and did not make much sense when included.

It raises the point Jarrett made in class yesterday: it all really comes down to the questions we’re trying to answer.  Really, we can find a relationship (whether it’s strong or weak) with anything. But depending on the story we’re trying to tell, the tests we use to support our results must be relevant to our study.


Making predictions is hard!

This week’s Silver reading fits current events a bit too eerily with its focus on disease epidemics and past attempts to predict them.  I remember the 2009 swine flu fiasco.  We cut pork out of our diet for a few months, and I was made to stand in line at my high school for hours in the rain and cold to get the vaccine, until I was too miserable and demanded to go home.  The results? No one we knew had the swine flu, and I kept my flu-free record without any vaccination (even until now, *fingers crossed*).  I have always taken these “outbreaks” with a few large grains of salt, not because I don’t recognize them as credible threats, for no one can deny the havoc ebola is causing in Africa, but because I understand how the media likes to blow things out of proportion.  But somehow, case after case of misleading forecasts later, the general public still eats up everything the media dishes out.

From reading about all this under- and over-forecasting of epidemics over the past few decades, I honestly feel that statisticians have the short end of the stick.  They can build simple models or complex multilevel models, still end up with the wrong predictions, and get bashed for it.  It brings up an interesting debate: when you know your model isn’t precisely right, is it better to be conservative, keep things calm for as long as you can, and change your strategy as the event progresses? Or to over-forecast and over-prepare, possibly losing millions of dollars and/or creating other panic problems, as in the 1976 swine flu scare?  It is a really hard decision to make when you know that so many people rely on it.

I really like the George E. P. Box quote Silver uses: “All models are wrong, but some models are useful.”  I think we should adopt this way of thinking.  Rather than trying to come up with a single solid number for how many people will die of a disease, what might be more important is using models to figure out the components of the disease: how it spreads, and how quickly, under particular circumstances.  It might be frustrating when all we want is to put a number on it, but in the long run it is much more useful to look at the trends rather than the values.
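In that spirit, here is a minimal sketch of a discrete-time SIR model (with completely made-up transmission and recovery rates), the kind of model that emphasizes how a disease spreads and how fast, rather than a single headline number:

```r
# Minimal discrete-time SIR sketch; beta and gamma are made-up values,
# not estimates for any real outbreak.
sir_sim <- function(S0, I0, R0 = 0, beta = 0.3, gamma = 0.1, days = 100) {
  N <- S0 + I0 + R0
  out <- data.frame(day = 0, S = S0, I = I0, R = R0)
  for (t in seq_len(days)) {
    S <- out$S[t]; I <- out$I[t]; R <- out$R[t]
    new_inf <- beta * S * I / N   # new infections this day
    new_rec <- gamma * I          # new recoveries this day
    out <- rbind(out, data.frame(day = t,
                                 S = S - new_inf,
                                 I = I + new_inf - new_rec,
                                 R = R + new_rec))
  }
  out
}

epi <- sir_sim(S0 = 9999, I0 = 1)
plot(epi$day, epi$I, type = "l", xlab = "Day", ylab = "Infected",
     main = "Epidemic curve: the trend matters more than one number")
```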


Get Your Flu Shot

With the country fixated on the ebola panic stoked by insistent media coverage, much of America has forgotten about getting their flu shot. But this isn’t strange – according to the Centers for Disease Control and Prevention (CDC), only 46 percent of the US population received a flu vaccination in 2013, even though influenza can kill up to 50,000 people in a bad year.

Well, get your flu shots, people, because this year is projected to be much worse.

But that prediction got me thinking. How does the CDC determine how bad a flu season is? The CDC’s FluView had some answers. The Influenza Division of the CDC produces a weekly report starting in week 40, collecting data on (1) viral surveillance, (2) mortality, (3) hospitalizations, (4) outpatient illness surveillance, and (5) the geographic spread of influenza. They analyze these data year-round to produce predictions about the upcoming flu season with a lag time of 1-2 weeks.

That seems like hard, yet important, work. But what if we wanted to know about impending flu behavior sooner? While searching for other prediction methods, I found an interesting Nature paper by Ginsberg et al. (2008) that used Google search queries to predict influenza epidemic trends (see reference below).

Ginsberg and crew set out to develop a simple model to see whether the percentage of influenza-related physician visits in a region could be explained by the probability that a random search query submitted from that same region was related to influenza (across 9 US regions). They did this by taking 50 million of the most common search queries in the US between 2003 and 2008 and checking each one individually to see which would most accurately model the CDC-reported influenza visit percentage in that region. They narrowed it down to 46 search terms, and found a mean correlation of 0.90! Furthermore, across the 2006-2007 season they shared their weekly results with the Epidemiology and Prevention Branch of the Influenza Division at the CDC to evaluate their prediction timing and accuracy, and determined that their model could estimate the percentage of influenza-like illness 1-2 weeks earlier than the CDC surveillance program (with a reporting lag of as little as 1 day).
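For a sense of what that kind of model looks like, here is a minimal sketch of the logit-linear fit the paper describes, with simulated numbers standing in for the real query fractions and CDC ILI percentages (the coefficients and noise levels here are invented):

```r
# Sketch of a logit-linear fit: logit(ILI %) ~ logit(flu-related query fraction).
# All data below are simulated stand-ins, not the Ginsberg et al. dataset.
set.seed(1)
weeks      <- 100
query_frac <- plogis(rnorm(weeks, mean = -4, sd = 0.5))   # share of searches that are flu-related
ili_pct    <- plogis(-1 + 0.9 * qlogis(query_frac) + rnorm(weeks, sd = 0.2))  # ILI visit share

fit  <- lm(qlogis(ili_pct) ~ qlogis(query_frac))  # fit on the logit scale
pred <- plogis(predict(fit))                      # back-transform to percentages

cor(pred, ili_pct)   # how well the query-based estimate tracks "observed" ILI
```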

Some of the terms with the most influence on the model include “influenza complication” and “Cold/Flu Remedy.”

Because these data are accurate and readily available, models like these could help public health officials prepare and respond better to flu epidemics. The authors acknowledge, though, that this should certainly not be a replacement for surveillance programs, especially since we cannot predict how internet users will behave in particular scenarios (I wonder how the search terms look now with ebola – you can check it out yourself at http://www.google.com/trends/).

I think the take-home message here (besides the fact that Google Trends is super cool) is that the modern technological world is information rich and full of available big data; however, in leveraging these data to our advantage, we need to acknowledge that there is always error involved in our predictions.

And as an ending note, by writing this blog post and searching for influenza-related topics, I probably just contributed to this week’s predictions!

If you’re interested, check out http://www.google.org/flutrends/ for results from a 2008-2009 tracking study.

- ABM

Reference: Ginsberg, Jeremy, et al. “Detecting influenza epidemics using search engine query data.” Nature 457.7232 (2008): 1012-1014.


Fibonacci: Our Personal Savior

In Homework 2 we peeked into the world of Fibonacci to create the Fibonacci sequence. I wanted to expand on this just a little to show how prevalent those numbers are in nature. I wrote my blog in R and exported it as a PDF, so hopefully you’ll be able to see it.

Fibonacci_Lynum (CLICK HERE)
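For anyone who can’t open the PDF, here is a tiny sketch (not the homework solution itself) of generating the sequence in R and watching the ratio of successive terms home in on the golden ratio, the number behind so many of those patterns in nature:

```r
# Quick sketch: the first n Fibonacci numbers, and the ratio of successive
# terms converging to the golden ratio, (1 + sqrt(5)) / 2.
fibonacci <- function(n) {
  fib <- numeric(n)
  fib[1:2] <- c(1, 1)
  for (i in 3:n) fib[i] <- fib[i - 1] + fib[i - 2]
  fib
}

fib <- fibonacci(20)
fib
fib[-1] / fib[-length(fib)]   # successive ratios approach ~1.618
(1 + sqrt(5)) / 2             # the golden ratio
```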


R: Evolutionary relationships, non-independence, and correlation…

In this blog entry, I would like to share my experience with R and data analysis. A few months back I developed a keen interest in the relatively recent development of NGS (Next Generation Sequencing) data analysis. Subsequently, I decided to work on the evolution of specific features in bacterial genomes. As you may know, bacterial genomes are enormously variable in their structure and composition (e.g. codon usage, GC bias, and gene copy number).

During the course of this project I came across the issue of non-independence in the context of correlation between two traits (e.g. genome size and genomic GC content). I used R’s correlation function to determine whether these two traits were correlated, and got an estimate of 0.6 between genome size and GC%. However, I soon found out that since a large number of species in my dataset share common ancestry, they are not independent. So if I were to correlate any two traits, I would have to take that shared ancestry into account. The most widely used method for analyzing associations between continuous traits across species is phylogenetically independent contrasts (PIC; Felsenstein, 1985). PIC essentially removes the effect of shared ancestry from the traits. As with most things, R has a package, ‘geiger’, to perform PIC. After accounting for the shared ancestry of the species, I obtained a correlation estimate of 0.47 between the same two traits.
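Here is a minimal sketch of that raw-versus-corrected comparison, using pic() from the ape package (which geiger builds on), with a simulated tree and traits standing in for my actual bacterial dataset:

```r
# Sketch: naive correlation vs. phylogenetically independent contrasts.
# The tree and traits are simulated, not the real genome-size / GC% data.
library(ape)

set.seed(42)
tree <- rcoal(50)                 # random 50-species tree
genome_size <- rTraitCont(tree)   # simulate two continuous traits on the tree
gc_content  <- rTraitCont(tree)

cor(genome_size, gc_content)      # naive correlation, ignoring shared ancestry

pic_size <- pic(genome_size, tree)     # Felsenstein's (1985) contrasts
pic_gc   <- pic(gc_content, tree)
summary(lm(pic_gc ~ pic_size - 1))     # correlation of contrasts, fit through the origin
```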
Whitlock and Schluter cite a similar example with a dataset of 17 lily species, in which closely related lily species tend to share the same flower type more than slightly more distant species do.


Reporting Uncertainty

In chapter 6, Silver discusses the reporting (or lack thereof) of uncertainty in predictions. He points out that failure to report uncertainty can have potentially catastrophic outcomes. For example, the National Weather Service forecast that after a snow-heavy winter, the Red River would crest at 49’ at Grand Forks, North Dakota. Since the levee that keeps water out of the city was 51’ tall, officials did not believe flooding would be an issue. What the forecasters did not report is that the prediction had a margin of error of about +/- 9’, which meant there was still a decent chance the river would rise above 51’ and overtop the levee. As a result, locals did not prepare for the flood that damaged or destroyed 75% of the city’s homes. With proper preparation, the floodwaters might have been diverted away from the city and the damage avoided. When asked why they didn’t report the margin of error surrounding the prediction, the weather service responded that they were afraid the public would lose confidence in the forecast if uncertainty were reported.
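A back-of-the-envelope calculation shows just how decent that chance was. The numbers below are my own rough assumptions (treating the +/- 9’ margin as roughly a 90% interval on a normally distributed forecast error), not Silver’s exact figures:

```r
# Rough sketch: chance the crest tops the 51' levee, given a 49' forecast
# and a +/- 9' margin of error treated as an approximate 90% interval.
crest_forecast <- 49
levee_height   <- 51
margin         <- 9

sd_est <- margin / qnorm(0.95)   # implied standard deviation, ~5.5 ft
pnorm(levee_height, mean = crest_forecast, sd = sd_est, lower.tail = FALSE)
# roughly a 1-in-3 chance of overtopping -- far from negligible
```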

Unfortunately, this seems to be exactly the case with the general public. People are not accustomed to seeing uncertainty reported alongside predictions or statistics; instead, they take the numbers at face value. If a layperson were to read a scientific journal article, they would most likely be overwhelmed by all the caveats and margins of error presented with the data. The general public seems much happier reading brief popular science articles that state scientific findings without any accompanying uncertainty or error to obscure the study’s results. While this may make people feel more confident in the data reported, it actually obscures the true signal by folding in the noise without accounting for it. This is dangerous, as it may lead people to stop thinking critically. A critical thinker might have seen the North Dakota forecast and noted that 49’ is awfully close to the top of the levee. Maybe precautions should be taken just in case the forecast is not exactly accurate. Instead of withholding uncertainty to give people more confidence, reporting of uncertainty should become ubiquitous, so that people get used to taking all factors into account when interpreting a forecast.

Another problem is false reporting of uncertainty. As Silver points out, economists often report a 90% confidence interval around their predictions. With a well-calibrated 90% interval, the true value should fall outside it only about 1 time in 10; historically, though, the true values have fallen outside the economists’ intervals roughly a third to half of the time (hardly better than a random guess). I would definitely not stake my financial future on these figures! The general public, however, may see 90% and feel very confident about the forecast. This dishonesty is particularly daunting because most people assume that what is reported to them is correct and won’t even think to question the uncertainty reported.
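A quick simulation sketch (with made-up numbers, not real economic forecasts) shows how this kind of overconfidence plays out: if a forecaster builds “90%” intervals assuming errors half their true size, the intervals capture the truth far less often than advertised.

```r
# Mis-calibration sketch: "90%" intervals built from an assumed error
# that is half the true forecast error.
set.seed(7)
n_forecasts <- 10000
true_sd     <- 2    # actual forecast error (arbitrary units)
claimed_sd  <- 1    # what the overconfident forecaster assumes

errors   <- rnorm(n_forecasts, mean = 0, sd = true_sd)  # realized forecast errors
half_wid <- qnorm(0.95) * claimed_sd                    # half-width of the "90%" interval

mean(abs(errors) <= half_wid)   # about 0.59 -- nowhere near the advertised 0.90
```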


To know everything about one statistical test, or little more than nothing about every statistical test….

I’d like to contrast the different ways that statistics instructors approach data analysis with the ways researchers use statistical methods to ask specific questions of their data.  I previously took undergraduate statistics, and then two semesters of graduate-level coursework.   My first course, and the first semester of graduate coursework, were really geared toward learning tools: here is how to calculate a standard deviation by hand, here is how you calculate a confidence interval from a z-table, here is the equation for a two-sample pooled t-test, etc., etc.  The later graduate course was based entirely around experimental design, and approached statistics from an applied research perspective instead of a theoretical quantitative perspective.  There is no question that all of my classmates found the latter course more difficult (the fact that it was taught by a 60-year-old tenured professor who recorded all of his lectures as meticulously timed, monotone recitals of equations didn’t exactly help).

In all of the statistics coursework I’ve taken, particularly in graduate school, I’ve noticed that my co-workers (and occasionally I, too) come in with whatever scientific question we’ve written into our proposals preprogrammed into our brains, and then focus on just that technique for the entire semester.  For example, my data happened to be very zero-inflated, so I spent a ton of time learning ZIP models and negative binomials.  I feel like that approach limits your ability to gain a much broader understanding of the entire universe of questions you could ask of your data, or, more importantly, of your field, before you pull out your calipers or microscope for the first time.
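For concreteness, here is a minimal sketch of the kind of zero-inflated analysis I mean, fit to simulated counts rather than my actual data (it assumes the pscl and MASS packages are installed):

```r
# Sketch: zero-inflated Poisson vs. negative binomial on simulated counts.
library(pscl)   # zeroinfl()
library(MASS)   # glm.nb()

set.seed(3)
n <- 500
x <- rnorm(n)
structural_zero <- rbinom(n, 1, 0.4)   # 40% excess zeros
counts <- ifelse(structural_zero == 1, 0, rpois(n, exp(0.5 + 0.8 * x)))
dat <- data.frame(counts, x)

zip_fit <- zeroinfl(counts ~ x | 1, data = dat)   # count model | zero-inflation model
nb_fit  <- glm.nb(counts ~ x, data = dat)         # negative binomial alternative

AIC(zip_fit, nb_fit)   # compare the two fits
```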

That’s why I really appreciate the much broader, theoretical perspective that Jarrett’s course takes, even if I occasionally complain about all of the simulations that are so ubiquitous in the homework assignments.  If you don’t know the theory, then there’s no point in knowing how to use the tools.  I walked by a friend’s office last week and we chatted about some statistics problems and simulations.  Someone farther back in the office commented that she didn’t think she needed to know the equation for standard error because there was code she could simply type in to get it.  My friend and I smiled at each other knowingly, realizing that if you don’t know what’s under the hood, then you really shouldn’t be driving the car.   And of course, a comprehensive approach limits the topics that you can explore in a single semester.  But the truth is that there are so many types of analysis that you can’t hope to master everything.
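In that spirit, here is the tiniest possible example of what is under the hood: the standard error of the mean is just sd/sqrt(n), whether you grind through the formula or type the one-liner (the sample values are made up):

```r
# Standard error of the mean, by hand and as a one-liner -- same equation.
x <- c(4.2, 5.1, 3.8, 6.0, 4.9, 5.5)   # made-up sample

n <- length(x)
se_by_hand  <- sqrt(sum((x - mean(x))^2) / (n - 1)) / sqrt(n)   # sd formula, then / sqrt(n)
se_oneliner <- sd(x) / sqrt(length(x))

c(se_by_hand, se_oneliner)   # identical
```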

And I think that speaks to the larger problem of specialization in a time of interdisciplinary conglomeration.  Does someone who studies gene regulation care about community ecology? Should someone who studies disease transmission in amphibians care about microbial anaerobic metabolism?  More to the point, should someone who needs to run a factorial ANOVA with multiple post-hoc comparisons also know how to run a discriminant function analysis? The logical answer I come up with is that if you’re only focused on your data after the fact, after your experiment has already been run, then no, it doesn’t really matter.  You can run the right analysis based on validating the assumptions of your distribution, and get a quantitative result.  But if you’re focused on the question you want to ask in your field, then yes, many types of analyses should be on the table and understood.  Similarly, if you’re asking a specific question about gene regulation, and that’s all you care about, then sure, you don’t need to know about community dynamics.  But if you’re asking about gene regulation in a specific organism, then yes, maybe community ecology affects selective processes and phenotypic expression of native gene sequences.

“The expert knows more and more about less and less until he knows everything about nothing.”  ― Mahatma Gandhi

“Of course, I am interested, but I would not dare to talk about them. In talking about the impact of ideas in one field on ideas in another field, one is always apt to make a fool of oneself. In these days of specialization there are too few people who have such a deep understanding of two departments of our knowledge that they do not make fools of themselves in one or the other.”  ― Richard P. Feynman, The Meaning of It All: Thoughts of a Citizen-Scientist

In science, specialization without integration can only get you so far; having multiple avenues of inquiry, and of subsequent analysis, enables a far broader understanding.  One of my PIs, who happens to be the head of her department, is famous for saying that she is a chemist and doesn’t care at all about biology.  Fair enough.  After hearing that, I immediately sent her an email with the following quotes:

“A man cannot be professor of zoölogy on one day and of chemistry on the next, and do good work in both. As in a concert all are musicians,—one plays one instrument, and one another, but none all in perfection.” — Louis Agassiz

“When chemists have brought their knowledge out of their special laboratories into the laboratory of the world, where chemical combinations are and have been through all time going on in such vast proportions,—when physicists study the laws of moisture, of clouds and storms, in past periods as well as in the present,—when, in short, geologists and zoologists are chemists and physicists, and vice versa,—then we shall learn more of the changes the world has undergone than is possible now that they are separately studied.” — Louis Agassiz

The more you know, especially about statistics, the more questions you can ask about the world around you.  So I’m excited to learn about the things I don’t know about, more about the mechanics involved with the things I do know about, and the exact circumstances of when I should use a scalpel and not a sledgehammer, a t-test and not a general linear model.  To know how each component of an advanced technique or function-code is built, and to ask questions that require multiple avenues of analysis that allow comprehensive understanding of ultimate and not just proximate causation.  Top-down and bottom-up understanding.  That would be nice, to know math and analysis, to have both theory and action not merely one or the other. http://wakinglifemovie.net/Transcript/Chapter/12
