Monday, August 5, 2013

The Data Dump

One of the buzzwords floating around these days is "big data" - data that exists in such large quantities that traditional data processing begins to break down. There are a lot of very cool underlying patterns that can be found when you're looking at a large enough data set, and there is no shortage of large, complex sets of data.

"Big data" can also mean you're really into Star Trek: The Next Generation.

Luckily for scientists, we live in an age where we have a constantly growing pile of data, teeming with transactions and activity records. Analyzing the Internet itself has become a worthy scientific endeavor. From economic studies of video game worlds to assessing trends on dating sites, interesting information can be found in the most unexpected of places online.

So today, we'll be looking at some interesting studies that have used information extracted online.

1. Twitter's Map of Ideas

The source, in Nature.

Presumably, some scientists woke up one morning and said to themselves, "Gee, a lot of activity happens on that Twitter website. And didn't that Arab Spring thing get a lot of help from social media? What else can we find out from it?" So they fished around Twitter's public API, pulled a sample of several million users and hashtags, and started tracking who was re-tweeting what. The result is the picture that you see above.

The hashtag "Japan" was followed to monitor how word of the March 2011 earthquake in Japan spread, while "Egypt" and "Syria" showcase the spread of the Arab Spring. And the "GOP" hashtag on the top right? The two giant hubs you see roughly represent the two political factions in the United States. Republicans will occasionally tweet #GOP for the sake of loyalty, and Democrats will occasionally tweet #GOP for the sake of mockery. Notice how polarized the connections are, implying that each group is mostly just preaching to its own choir on Twitter. It turns out that conservatives and liberals really aren't talking to each other very much! That certainly can't have impacted anything important, right?

The source paper also studies how these hashtags compete with one another on Twitter. This raises some interesting concepts - we live in an age where our ability to consume information cannot keep up with the amount of information being produced online. The limits of our attention become important when determining which ideas spread through our culture and which do not. You can imagine a sort of "attention economy" working behind the scenes of online culture.

Some Japanese scientists presumably saw the above study and thought to themselves, "That's neat, but can we measure what sort of things can grab everyone's attention at once?" They also did some data mining on Twitter, and produced a picture of their own.

The source, in PLOS ONE.

The darker and denser the cluster, the more the tweets concentrate on a single subject. The bottom two were measured in reference to a popular movie and a lunar eclipse, respectively. You can see there's some sharing activity, but it's fairly decentralized. The top right picture is a measure of the 2011 FIFA Women's World Cup, and the top left picture has to do with the March 2011 earthquake, both earning a far greater public spotlight. The researchers asked how these responses differed from one another numerically, a question that could not be answered before we were leaving traces of our behavior all over our computer monitors.

2. The Strangeness of Stock Markets

The source, in Nature.

I'd love to guess at how researchers came up with the idea for this one, but I'm kind of surprised at the connection here myself. These researchers looked up statistics on the Wikipedia pages of companies in the Dow Jones Industrial Average. Using a data set that spanned nearly five years of Wikipedia activity, they found that if you were investing in companies based on how many views their Wikipedia pages were getting, your returns on investment would be significantly higher than what you'd expect from random strategies.

We can make sense of this finding in hindsight. If a company is generating a lot of buzz, then investors are probably more likely to search for information on that company. Wikipedia happens to be ubiquitous, so an online search is probably going to take searchers to a Wikipedia entry. Would that imply that Google searches could also provide insight into stock market activity?

The source, in Nature.

As it turns out, the same group of researchers also worked on a similar study using Google Trends. Your returns seem to improve when you use Google Trends to inform your investment strategy. This is crowd wisdom in its natural habitat - the network is just people all the way down. We have a measurable and quantifiable way to find correlations between online activity and market activity, and it's not too hard to imagine when we'll start using these correlations to systematically predict future market activity.
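To make "use online activity to inform your investment strategy" a little more concrete, here is a minimal sketch of what such a rule could look like. Everything in it is an assumption made for illustration: the function name, the rule itself (bet against the stock when attention is above its recent average, bet on it otherwise), and the numbers, which are made up. It is not necessarily the rule the researchers tested, and toy data says nothing about real markets.

```python
import math

def signal_strategy_log_return(prices, signal, lookback=3):
    """Cumulative log return of a toy rule driven by an attention signal.

    Hypothetical rule: if this period's signal (e.g. page views) is above
    its average over the previous `lookback` periods, bet against the stock
    for one period; otherwise bet on it.
    """
    total = 0.0
    for t in range(lookback, len(prices) - 1):
        recent_average = sum(signal[t - lookback:t]) / lookback
        change = math.log(prices[t + 1] / prices[t])
        if signal[t] > recent_average:
            total -= change  # short position: gains when the price falls
        else:
            total += change  # long position: gains when the price rises
    return total

# Made-up numbers, purely for illustration.
toy_prices = [100, 100, 100, 100, 90, 99, 99]
toy_views = [10, 10, 10, 30, 10, 10, 10]

if __name__ == "__main__":
    print(signal_strategy_log_return(toy_prices, toy_views))
```

In the toy data, the spike in views comes right before the price drop, so the rule comes out ahead of simply holding the stock. Comparing a rule like this against many randomly chosen buy/sell decisions is the kind of test that "higher than random strategies" refers to.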

Another nice thing about these studies is how large their sample sizes are. There's a handy thing in statistics called the law of large numbers, which tells us that the larger a random sample is, the closer its average tends to be to the average of the population it was drawn from. When you are pulling data from sources with millions of data points, you can expect results that are more representative of the general population than most studies before the Internet could ever hope to be.
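If you'd like to see the law of large numbers at work, a small simulation does the trick. This is only a sketch: the "population" is 10,000 made-up numbers, the function names are mine, and it assumes the sample is drawn at random from the population.

```python
import random
import statistics

def sample_mean_error(population, sample_size, rng):
    """Gap between one random sample's mean and the population mean."""
    sample = rng.choices(population, k=sample_size)
    return abs(statistics.fmean(sample) - statistics.fmean(population))

def average_error(population, sample_size, trials=200, seed=0):
    """Average that gap over many repeated random samples."""
    rng = random.Random(seed)
    errors = [sample_mean_error(population, sample_size, rng)
              for _ in range(trials)]
    return statistics.fmean(errors)

if __name__ == "__main__":
    rng = random.Random(42)
    # A made-up population: 10,000 values between 1 and 100.
    population = [rng.randint(1, 100) for _ in range(10_000)]
    for size in (10, 100, 1_000):
        print(size, average_error(population, size))
```

The printed gap shrinks as the sample size grows. Note that this only holds for samples drawn at random: a huge sample that systematically misses part of the population can still be off.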

3. How the Networks Tick

The source, archived in Cornell University Library.

One day, some Swiss researchers watched an Onion video and asked themselves why some social networks die out. Then they did...something.

To be honest, the nuts and bolts of the process in their paper rely on a lot of methodology that I'm not familiar with, so I don't trust myself to explain it in simpler terms. The general idea seems to be that there is a cost and a benefit to staying with a social network. Certain events on the network - such as interface updates - can briefly increase the cost-to-benefit ratio. Not all networks are resilient enough to withstand these moments of stress, and some collapse in on themselves.
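The cost-and-benefit idea can at least be played with in a toy form. The sketch below is not the paper's model. It assumes, purely for illustration, that a user's benefit is the number of friends still active on the network, and that a user leaves once that number drops below a threshold (the hypothetical `min_active_friends`, standing in for the cost). Each departure can push other users below the threshold, so leaving can cascade.

```python
def surviving_users(friends, min_active_friends):
    """Return the users left once no one has too few active friends.

    `friends` maps each user to the set of that user's friends.
    Users with fewer than `min_active_friends` active friends leave,
    which may in turn cause their friends to leave.
    """
    active = set(friends)
    changed = True
    while changed:
        changed = False
        for user in list(active):
            active_friends = sum(1 for f in friends[user] if f in active)
            if active_friends < min_active_friends:
                active.remove(user)
                changed = True
    return active

# A made-up network: a tight group (A-D) with a loose chain (E, F) attached.
toy_network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}

if __name__ == "__main__":
    print(sorted(surviving_users(toy_network, 2)))
    print(sorted(surviving_users(toy_network, 4)))
```

With a low threshold, only the loosely connected users drift away and the tight group holds together. Raise the threshold far enough, as a costly interface update might, and the whole network empties out.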

It's an interesting paper, though as far as I can tell it isn't actually published anywhere yet, so we don't know if it's passed the test of peer review. While we may have to take its findings with a grain of salt, the concept behind the paper is certainly interesting. It presents itself as an autopsy of Friendster. Why did Friendster fail, why are networks like Orkut and Myspace failing, and what is making Facebook endure despite everyone's gripes about its interface updates?

These are interesting questions that can reveal a lot about our psychology. We can learn how to make more resilient online hubs. We can learn what coaxes more activity out of a hub's users. Not only does this benefit advertisers and site owners, it also allows us to gather more data from these hubs. We've seen what researchers can do with Twitter, and it may be inevitable that other online hubs start being analyzed.

The idea has come up before of modelling the information exchange among secular networks to monitor and optimize the atheist movement. What happens to social movements when you're able to model - and even predict - their life cycles? The data at our disposal increases every day. There may come a time when some serious social engineering will be possible.

Looking Ahead

How information spreads in networks, how networks thrive, and how networks impact other networks all play a part in the previously mentioned attention economy. As we learn more about how networks work, we will be better able to engineer them, refine them, and exploit them.

What do you mean, I'm late to the party? Again?

When will we truly be able to observe the 'planned attention economy'? Arguably, you can already see the beginnings of it in action. Targeted online ads make use of user information in order to more effectively grab viewers' attention, attempting to redeem online advertisers for years of complete incompetence. Wait until the agenda becomes more complex than simply selling you a product.

Of course, since some people balked at the idea of the government monitoring communications activity, they might also cringe at the idea of scientists, advertisers, and other factions monitoring communications activity just as closely. As these models become more sophisticated and gain more predictive power, we can expect them to stop being merely observational.

There aren't many options in front of us. There is no option to cease using the Internet - it is a tool from which society has gotten fantastic mileage. Even if the Internet were to split into smaller networks, it would only be a matter of time until those networks could be similarly analyzed. You can't ask researchers to stop researching this information, either - you have to remember that if your researchers aren't doing this work, there is a good chance that someone else is. The rewards of having models of sufficient accuracy are too tantalizing to be discounted.

It is worth pointing out that I am talking about models using the Internet, on the Internet, with information accessible from the Internet. That implies that information on these models is fairly transparent. Perhaps instead of resisting the creation of these models - and sabotaging the benefits that we'd get from them - we should work to make sure that they are tools that are accessible to everybody. Keep people educated, and keep the material public.

There's one thing I'm not mentioning, though - the limitations of models. This isn't to undermine the new insights that models give us, but before they can be useful, a lot of foundational work still needs to be done, and our data collection strategies need to be refined. These are challenges that will probably continue well into the decade, if not beyond that.

Still, if the 21st century gets defined by something that isn't the Internet, then it will likely be defined by something emergent from the Internet.
