Monday, September 23, 2013

Digital Research

I've previously talked about the role of the Internet as a data generator as well as a medium for collaboration. I thought that I'd drive the point home some more by talking about a very specific example of these roles in action.

The Internet: Another piece of lab equipment!

Today, I'm going to talk about a weekend project of mine. It was an open-ended group assignment meant to prime us for the research process of grad school. The fact that we could even get anywhere with the project is a testament to the sheer power of the Internet as a resource.

Two of my peers and I set out to pose a biology-related question. It could have been anything, provided that our work used data from a certain kind of biological method. We weren't given any equipment or materials to actually do this biological method, however. In fact, that wasn't the point. We were instead directed to a genomics data repository that had mountains of data obtained from this method.

The noteworthy thing about this repository is the sheer size of any given data set. Genomic data describes an organism's genetic information, which can run to millions or even billions of base pairs per sample. That doesn't even factor in information derived from the genome, such as transcriptomic information, proteomic information, and other such -omic buzzwords.

Picture taken from here. Science just looooves buzzwords.

The data sets are massive, but they are ultimately scrutinized by human beings. There is an implicit assumption that, even when a research team has data that they've independently collected, and even when they've found some trends within that data that are exciting and original, they're bound to miss a few trends among the millions of numbers on their spreadsheet. And of course, the newer the data and the less time people have had to examine it, the more likely this is to be the case.

In other words, it was completely possible for our group to pose an original question using data that we did not originally collect. And while this is arguably possible for any kind of data, it's far more likely to happen when the data set is too huge for human beings to easily digest. So when we happened to find some data sets that had been uploaded to the database a mere two days earlier, we were ecstatic.

The data that we found happened to be on a bacterium (which we'll refer to as MT) that causes a certain kind of disease in humans. The data described two strains of MT, along with some information that we didn't really understand at first. That was okay, because when we Googled some of the terms in the information, we were led to another database.

This second database dealt less with data sets and more with entire descriptions of fundamental biological systems. If you were to poke around it long enough, you could find giant maps of metabolic networks, along with descriptions of each individual chemical, each individual reaction, the individual protein actuating that reaction, and the genetic data that helps form the protein. Using this database, we were able to figure out that our data was describing proteins - specifically, how the proteins were different between the two strains of MT.

Subtle changes in the amount of confetti.

At this point, we could begin asking questions about what we wanted to do with this data. We became particularly curious about which protein concentrations (a consequence of RNA expression) changed the most between strains. When we found some protein names that corresponded with interesting trends in the data, we set out to figure out what those proteins actually did.
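For the curious, here is a rough sketch of what "changed the most between strains" could look like in code. This is not the analysis we actually ran: the protein names, the abundance numbers, and the `rank_by_fold_change` helper are all made up for illustration, and I'm assuming a simple log2 fold change with a pseudocount as the measure of change.

```python
import math

# Hypothetical abundance values (arbitrary units) for two strains.
# Protein names and numbers are invented for illustration only.
strain_a = {"protA": 120.0, "protB": 15.0, "protC": 300.0, "protD": 48.0}
strain_b = {"protA": 118.0, "protB": 240.0, "protC": 75.0, "protD": 50.0}


def rank_by_fold_change(a, b, pseudocount=1.0):
    """Return (protein, log2 fold change of b over a) pairs,
    sorted so the largest absolute change comes first."""
    shared = a.keys() & b.keys()  # only proteins measured in both strains
    changes = {
        p: math.log2((b[p] + pseudocount) / (a[p] + pseudocount))
        for p in shared
    }
    return sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)


ranked = rank_by_fold_change(strain_a, strain_b)
for protein, lfc in ranked:
    print(f"{protein}: {lfc:+.2f}")
```

The pseudocount is only there so that a protein with zero measured abundance doesn't blow up the ratio. A real analysis would also want replicates and some statistics before calling any of these differences meaningful.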

The way that we figured out these protein functions is what is notable here. We would find protein listings in the second database, and then paste their amino acid sequences into NCBI BLAST, an online tool that compares a sequence against known sequences and, from the closest matches, suggests what the protein most likely does. We would then search online publications for more details on these functional descriptions, in order to get an idea as to what is physically changing between the MT strains.
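We did all of this by copying and pasting into a web form, but the same lookup can in principle be scripted. Below is a minimal sketch that only builds a submission URL and never sends it. The parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`) follow NCBI's BLAST URL API as I understand it, so check the current documentation before relying on them. The `build_blastp_query` helper and the example sequence are hypothetical.

```python
from urllib.parse import urlencode

# Public NCBI BLAST endpoint (assumed; verify against NCBI's documentation).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_query(sequence, database="nr"):
    """Build a protein BLAST submission URL for an amino acid sequence.

    Whitespace and line breaks are stripped, since sequences copied from a
    database page often arrive wrapped across several lines.
    """
    cleaned = "".join(sequence.split()).upper()
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastp",   # protein query against a protein database
        "DATABASE": database,
        "QUERY": cleaned,
    }
    return BLAST_URL + "?" + urlencode(params)


# A made-up fragment, not a real MT protein.
url = build_blastp_query("mkt ay\niak")
print(url)
```

Actually submitting the search and polling for results takes a few more steps, and NCBI asks scripted users to limit how often they send requests, so this stops at building the URL.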

Let's just recap this whole process for you, so that you can appreciate what I'm getting at here. We started our project by searching on an online database for information. We then started searching on another online database to get information in order to comprehend the original information. And then, when we finally figured out what we wanted to do with this information, we did some more online searching.

At no point did we have to put on lab coats and trudge into a laboratory. We went from the beginning to the end of a research problem using the same strategy that one uses to make those stupid Google autocomplete jokes.


I'm sure that if my peers ever actually read this blog post, they'll be quick to point out "Well, sure, you did all this, but your presentation material didn't exactly end up being publish-worthy." They're totally right - this was just a weekend project, and it wasn't approached with the intent to come up with something truly novel or rigorous.

Plus, the importance of experimental design and data collection doesn't go away just because we weren't the ones who collected the data. There were a few holes in our analysis that came from missing data. Models, after all, can only take you so far. The way to fill those holes would be to conduct experiments to get the data, which could be a monumental task in and of itself.

That said, the searching strategy that we used marks an important shift in how we ask scientific questions. Just because we probably didn't do anything interesting with this data doesn't mean that someone else couldn't. And while one could point out that our searching wasn't all that fundamentally different from a traditional literature search, the fact that it's now easy to analyze as much information as we did in a weekend is remarkable.

This is a level of collaboration that is hard to wrap one's mind around, but is easier to imagine than ever thanks to the Internet. We can collect information and share it online, so that any inquiring mind will be free to use it. We can build digital tools that interpret that information for us, so that we can take it further. There are now whole areas of scientific inquiry that can be pursued without ever entering the laboratory.

The Internet has become a fundamentally important tool in any intellectual investigation, and it will take us to strange and interesting places.
