As a first-year graduate student, I am being asked to think about data in all sorts of new ways. As I am sure some of my classmates can relate to, and the rest may remember the feeling, the jump from undergrad to grad student seems to be really about making a mental switch from consuming data to creating and analyzing it. I’d like to go back a bit in Nate Silver’s book, to one statistic he quoted at the beginning of The Signal and the Noise when he mentioned the sheer volume of new data being created every single day: 2.5 quintillion bytes. We, as a society, are creating far more data than we can possibly analyze, which to be honest is a bit intimidating, especially since I am now being asked to create more of it. I do field work, design experiments, and measure stuff, all the things that define my day, but really what I am doing is creating more data and trying to analyze it in some meaningful way to describe some natural phenomenon. As I think about all this, one thing I am quickly learning is that I don’t want to create bad data, as that really just contributes to the “noise”.
Back to chapter 5, where Silver talks about overfitting. This is something that I feel would be incredibly tempting to do, especially after spending years collecting data. It seems to me like something that can come out of producing data that is either not useful or poorly collected: trying to find meaning in something that isn’t necessarily there, when the search for meaning becomes more important than the search for truth. I am getting a little philosophical here, but in practice this is a lesson in making sure that I really think about design and data collection before creating lots of new data.
As I move forward in my work, school, career, whatever, I have chosen a path that will always require data analysis (someday it may even involve R if I can ever really figure it out, but I digress), so I need to make sure that I think about what kind of data I am creating. Maybe the data I make won’t be analyzed by me, or maybe someone else will look at my data and do something better with it. Either way, the real tragedy would be if the data I create turns out not to be useful because I failed to produce good data. In this way, I hope that maybe a few signals can come out of what I do, in stark contrast to all the noise.