Short-cutting in RStudio

Well, it’s Halloween again. A time to ponder what scares us. I’ll go first. I get scared when I have an idea while coding, and I know that I won’t be able to finish actually typing the code out before I forget what I’m doing. This might sound dumb, but when you’re learning a new programming language, battling with syntax, and constantly referencing the help menu and Google, it can take a while to actually type out your code. If you share a similar fear and don’t currently use keyboard shortcuts in RStudio, I have some good tips for coding faster, but if you’re an experienced RStudio shortcutter, I’m sure you’ve already alt + tab’ed out of this post. Let’s start with the basics:

One of the best ways to speed up your coding is to stop reaching for your mouse and keep your fingers on the keyboard. For instance, instead of clicking on the little floppy disk every time you want to save your script, just hold control (command for Macs) and type “s”. This will save you lots of time, and it is easier to develop good saving habits using the shortcut. Here’s another simple one: say you are coding along and then, all of a sudden, you want to query Google about formatting ggplots. Instead of moving the mouse to the task bar and switching to your internet browser, just hold alt (again, command for Macs) and hit “tab.” Boom! You just switched windows without sliding a finger.

Okay, those were simple ones that most of you probably already use. Let’s get a little more R-specific. How about when you finish writing a big long comment and you realize you forgot to put the “#” at the beginning of the line, meaning your “comment” is now a bunch of unusable code? You could slide your cursor to the beginning of the line, click, and then type “#”, but that takes waaaay too long. Instead, leave your cursor where it is and hold control+shift+c (substitute command for control on Macs). Kapow!!! The whole line is instantly transformed into a comment! Scenario 2: you just realized a whole block of code is useless and you need to start fresh, but you don’t want to delete what you have. I’ll give you permission to use your mouse here, but you could hold shift and use the arrow keys as well. Highlight all the bad code and hold control+shift+c. Now it’s all commented! Want a line or two back? Just highlight the lines you want to uncomment and hold control+shift+c again. SNAP! It’s code again.

Want to copy a line of code without using copy/paste? Use shift+alt+up/down (command+option+up/down for Macs) to replicate a favorite line of code as many times as you want! Stick a variable in there and you have a poor man’s for-loop (just kidding, that wouldn’t save any time).

Sick of clicking on that “Knit PDF” button over and over? control+shift+K.

Want to delete a whole line of code? control+D.

The list of things you can do with keyboard shortcuts goes on and on. There are shortcuts for code folding and debugging too, if you’re into that. Check it out here, or if you don’t want to leave RStudio to reference the shortcut list, just hold alt+shift+K (option+shift+K for Macs) and it will pop right up. That’s right, there is a shortcut for finding shortcuts.

Well I hope you got something out of this post. I figured it would be useful information to have before we all sit down to hammer out our final projects. Happy Halloween. Happy coding.

command+Q


Comic Strip Statistics

I apologize in advance for the abbreviated nature of this post, but it’s Midterm Week 2 and I’m getting to the point where NA vectors look like they’re taunting me (na-na-na-na-na-na). A few weeks ago, I came across an interesting stats-related article that assessed the total damage that Calvin of Calvin & Hobbes fame caused over the course of his comic strip tenure from 1986 to 1996. It’s a short and sweet paper (link below) in which the author compares the damage Calvin caused per month and per year, and tries to link several of Calvin’s habits to real-world causes, like going back to school after long vacations (he’s more destructive in January and August). Overall, the author concludes that a troublemaker of Calvin’s caliber would only cause enough damage per year to make him about $2,000 costlier to raise than a less chaos-inclined child, although based on the events that the author describes, I sort of doubt this (e.g. flooding his house five times, and his parents only took away his dessert!). The author also leaves out several incidents that prevent the estimation of damage costs, including a “salamander incident,” which I believe would actually raise Calvin’s damage costs, since we all know how expensive salamander removal can get. While this article’s not exactly saturated with valuable statistics, it’s a short, entertaining read that demonstrates an imaginative application of some of the concepts that we discuss in class every day, and it could provide a brief break for anyone who has been staring at R for a little too long. Also, for anyone still in need of a final project idea, the author does mention that as of this article’s publication, there’s no reliable dataset or established method that measures the average child’s destructive costs over the course of a year.

http://pnis.co/hard1.html


Hola, me llamo R

Homework was due soon and I was fighting with my completely tangled code:

“Everything looks good,” I repeated to myself a hundred times, even though I was not getting the expected result… everything looked good indeed.

[Image: “programmers be like” meme]

Learning the coding language was, and still is, pretty stressful. You cannot avoid thinking that some things take only one or two clicks in Excel… but anyway, I managed to write what I wanted the program to do in its language and: “->Run”! An error popped up and I had NO clue what it meant.
After checking my code piece by piece, I finally found the problem: that problem that haunts all of us who happen to have English as a second language: “lenght”. There it was, laughing at me while I was pulling my hair out.
But then it hit me: I needed a spell-check package. I could not afford to waste so much time on these types of errors.

I started looking for packages, and the first that popped up was “aspell”, but when I attempted to install it, R was kind enough to shatter my dreams of good spelling:

package ‘Aspell’ is not available (for R version 3.0.3)

No worries, keep looking. -> “Hunspell” and “Ispell” were not available for my current R version either! Damn…
Finally I found an easy, AVAILABLE, and friendly package with a YouTube example called “qdap”, which has a built-in dictionary and the option to add words to it, so you can check the spelling of the strings you generate.
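
A minimal sketch of what that looks like, assuming the check_spelling() function is available in your version of qdap (the example strings are mine):

# install.packages("qdap")
library(qdap)

# a couple of strings with the kind of typos that bite second-language coders
notes <- c("check the lenght of the vector",
           "rename the colum before merging")

check_spelling(notes)   # should flag "lenght" and "colum" and suggest corrections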

When learning a language, one usually starts with basic things for a couple of sessions:

Reading comprehension:
My name is Joe the cowboy and I like horses!

Listening: The Rolling Stones – I Can’t Get No (Satisfaction)

I can’t get no satisfaction
I can’t get no satisfaction
‘Cause I try and I try and I try and I try
I can’t get no, I can’t get no

Learning the R coding language feels like:

Reading comprehension:

‘Tis better to be vile than vile esteemed,
When not to be receives reproach of being,
And the just pleasure lost, which is so deemed
Not by our feeling, but by others’ seeing.
For why should others’ false adulterate eyes
Give salutation to my sportive blood?

Listening: Alanis Morissette – Ironic

“Well life has a funny way of sneaking up on you
When you think everything’s okay and everything’s going right
And life has a funny way of helping you out when
You think everything’s gone wrong and everything blows up
In your face”

Just try to get it right the first time…

Add the extra layer of typos and voilà, first day of coding :D
Hopefully this package will be useful not only for me, but also for other people who have similar issues when typing too fast.

As with learning any other language, coding continuously has helped me understand the language better and, unexpectedly, has improved my English grammar.
One thing is for sure: after this class I’ll never forget how to spell lenght… Damn!

Daniel


An Unpredictable Part

As I clicked around the web, looking for data and predictions surrounding the World Series currently underway, I found some interesting angles on analyzing data that I had not previously thought about.

An article on fivethirtyeight.com (link below) discusses how predicting baseball seasons is an “imperfect pursuit”. This was an interesting concept to me, because the whole reason new predictive models are rolled out is to produce predictions closer to what the truth will eventually be; the point the article makes is that there is a statistical limit to how good a model can be. At some point, at least in baseball, there is a certain amount of randomness or luck involved that simply cannot be predicted. The sabermetrician Tom Tango determined that in baseball, one third of the difference between two teams’ records is the result of this random chance. Another way of looking at this is to say that the smallest possible root-mean-squared error, a way of testing the accuracy of a prediction, is 6.4 wins. This means that no matter how good, how perfect, a model is, it will always have a built-in level of error of 6.4 wins. Modern models can sometimes get within this range when predicting a team’s record, but that is random chance, not the result of a model that has beaten the system.
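
To make the root-mean-squared error idea concrete, here is a quick R illustration using made-up win totals (these numbers are not from the article):

predicted_wins <- c(92, 85, 78, 70, 95)   # preseason predictions for five teams
actual_wins    <- c(88, 90, 74, 77, 93)   # how the seasons actually ended

rmse <- sqrt(mean((predicted_wins - actual_wins)^2))
rmse   # about 4.7 wins for these fabricated numbers

# The article's claim is that even a perfect model can't push this error
# below roughly 6.4 wins over a full season, because that much is pure luck.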

I find this concept interesting since it is virtually never discussed. At the beginning of a new season, fans and sports bettors search for predictions about what the end result will be; people try to profit from this phenomenon, and people try to beat the current systems by making new ones of their own. It might sound obvious given the amount of variability, but the models can never, and will never, be completely perfect. This idea can be extrapolated to anything involving variables that we do not completely understand, from political elections to sporting events to earthquakes. While we continue to strive to predict the future, we will never do it perfectly as long as there are factors that are not understood.

Justin

(http://fivethirtyeight.com/features/the-imperfect-pursuit-of-a-perfect-baseball-forecast/)


P-hacking

In perfect timing for our Bayes discussion, one of our Senior Science Advisors at work sent along this article by Gelman and Loken (see here: http://www.americanscientist.org/issues/feature/2014/6/the-statistical-crisis-in-science/2), questioning the value of the statistically significant comparisons many of us researchers fall into making. Calling it the “statistical crisis in science”, Gelman and Loken delve into several papers that report statistical significance, but question how much those results are influenced by the researchers’ predetermined predictions and by the particular data set they analyze. They discuss the issue dubbed “p-hacking”: researchers come up with statistically significant results, but don’t look further at other relationships or comparisons for significance as well. They are only looking at the hypothesis at hand, not for the relationships with the best p-value. The authors claim “the mistake is in thinking that, if the particular path that was chosen yields statistical significance, this is strong evidence in favor of the hypothesis.”
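
To see why a single significant comparison can be weak evidence, here is a toy simulation in R (my own illustration, not from the article) of how often “significant” p-values appear when many comparisons are run on pure noise:

set.seed(42)
n_comparisons <- 20
p_values <- replicate(n_comparisons, {
  x <- rnorm(30)   # "treatment" group: pure noise
  y <- rnorm(30)   # "control" group: also pure noise, so there is no true effect
  t.test(x, y)$p.value
})
sum(p_values < 0.05)   # roughly 1 in 20 comparisons comes up "significant" by chance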

While I agree we need to consider all possible relationships or hypotheses so that we don’t miss the true importance of what is going on, how often can we look at all the possibilities that drive us to a result? The very basis of most of our studies comes from a driving question, which focuses the relationship we analyze.

The idea of statistical significance and its role in our analyses is still clearly debated in the community. A coworker of mine published a paper in which her co-author, a statistician, demanded that she include p-values in her analysis, but she felt the p-values were not relevant to the analysis being discussed, nor did they make much sense when included.

It raises the point that Jarrett made in class yesterday: it all really comes around to what questions we’re trying to answer. Really, we can find a relationship (whether it’s strong or poor) with anything. But depending on the story we’re trying to tell, the tests we use to support our results must be relevant to our study.


Making predictions is hard!

This week’s Silver reading fits current events a bit too eerily with its focus on disease epidemics and their past predictions. I remember the 2009 swine flu fiasco. We cut pork out of our diet for a few months, and I was made to stand in line at my high school for hours in the rain and cold to get the vaccine, until I was too miserable and demanded to go home. The results? No one we knew had the swine flu, and I kept my flu-free record without any vaccination (even until now, *fingers crossed*). I have always taken these “outbreaks” with a few large grains of salt, not because I don’t recognize them as credible threats, for no one can deny the havoc Ebola is causing in Africa, but because I understand how the media likes to blow things out of proportion. But somehow, case after case of misleading forecasts, the general public still eats up all that the media dishes out.

From reading about all this under- and over-forecasting of epidemics over the past few decades, I honestly feel that statisticians have the short end of the stick. They can make simple models or complex multi-level models, and still end up with the wrong predictions and get bashed around for it. It brings up an interesting debate about what happens when you know your model isn’t precisely right: would it be better to be conservative, keep things calm for as long as you can, and change your strategy as the event progresses? Or to over-forecast and over-prepare, possibly resulting in the loss of millions of dollars and/or creating other panic problems, as in the 1976 swine flu scare? It is a really hard decision to make when you know that so many people rely on it.

I really like the George E. P. Box quote that Silver uses: “All models are wrong, but some models are useful.” I think that we should adopt this way of thinking. Rather than trying to come up with a solid number for how many people will die of a disease, what might be more important is using models to figure out the components of the disease: how it can spread, and how quickly, under certain circumstances. It might be frustrating when all we want is to put a number on it, but in the long run, it is much more useful to look at the trends rather than the values.


Get Your Flu Shot

With America fixated on the Ebola panic as a result of insistent media coverage, much of the country has forgotten about getting a flu shot. But this isn’t strange: according to the Centers for Disease Control and Prevention (CDC), only 46 percent of the US received flu vaccinations in 2013, even though influenza can kill up to 50,000 individuals in a bad year.

Well, get your flu shots, people, because this year is projected to be much worse.

But that prediction got me thinking. How does the CDC determine how bad a flu season is? FluView by the CDC had some answers. The Influenza Division of the CDC produces a weekly report starting in week 40, and collects data on (1) viral surveillance, (2) mortality, (3) hospitalizations, (4) outpatient illness surveillance, and (5) geographic spread of influenza. They analyze these data year-round to produce predictions about the upcoming flu season with a lag time of 1-2 weeks.

That seems like hard, yet important, work. But what if we wanted to know about impending flu behavior sooner? While searching for other prediction methods, I found an interesting Nature paper by Ginsberg et al. (2009) that used Google search queries to predict influenza epidemic trends (see reference below).

Ginsberg and crew set out to develop a simple model to see if the percentage of influenza-related physician visits could be explained by the probability that a random search query submitted from the same region was related to influenza (across 9 regions). They did this by taking the 50 million most common search queries in the US between 2003 and 2008 and checking each individually to see which would most accurately model the CDC-reported influenza visit percentage in that region. They narrowed it down to 46 search terms and found a mean correlation of 0.90! Furthermore, across the 2006-2007 season, they shared their weekly results with the Epidemiology and Prevention Branch of the Influenza Division at the CDC to better understand their prediction timing and accuracy, and determined that their model could correctly estimate the percentage of influenza-like symptoms 1-2 weeks before the CDC surveillance program (and in as little as 1 day).
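
As a rough sketch of the kind of fit involved, here is what such a model might look like in R with simulated data; the variable names and numbers are mine, not from the paper:

set.seed(1)
logit <- function(p) log(p / (1 - p))

# simulate a weekly flu-related query fraction and a physician-visit
# percentage that tracks its log-odds plus noise
query_fraction <- runif(100, 0.001, 0.05)
visit_pct <- plogis(0.9 * logit(query_fraction) + rnorm(100, 0, 0.2))

fit <- lm(logit(visit_pct) ~ logit(query_fraction))
summary(fit)
cor(logit(visit_pct), logit(query_fraction))   # analogous to the correlation they report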

Some of the terms with the most influence on the model include “influenza complication” and “Cold/Flu Remedy.”

Because these data are accurate and readily available, models like these could help public health officials prepare for and respond better to flu epidemics. The authors acknowledge, though, that this should certainly not be a replacement for surveillance programs, especially since we cannot predict how internet users will behave in particular scenarios (I wonder how the search terms look now with Ebola; you can check it out yourself at http://www.google.com/trends/).

I think the take-home message here (besides the fact that Google Trends is super cool) is that the modern technological world is information-rich, filled with available big data; however, in leveraging these data to our advantage, we need to understand and acknowledge that there is always error involved in our predictions.

And as an ending note, by writing this blog post and searching for influenza-related topics, I probably just contributed to this week’s predictions!

If you’re interested, check out http://www.google.org/flutrends/ for results from a 2008-2009 tracking study.

- ABM

Reference: Ginsberg, Jeremy, et al. “Detecting influenza epidemics using search engine query data.” Nature 457.7232 (2009): 1012-1014.
