You might think you’re pretty good at making sure you don’t share your internet life with the entire world. You use Facebook’s strictest privacy settings, don’t share anything sensitive on Twitter, and you regularly trash your laptop’s browsing history. All good, right? Nope. All that “anonymized” data you leave behind out in the ether is still totally you, and it’s far easier than you think to make it paint your picture and yours alone.
Journalist Svea Eckert and data scientist Andreas Dewes, both from Germany, wanted to find out just how easy it was to acquire and identify your web browsing history. And so, as The Guardian reports, they did just that.
The pair presented their findings recently at the annual Def Con hacker conference in Las Vegas.
Acquiring the data
The two were easily able to acquire a database holding more than 3 billion visited web addresses. That data, in turn, comprised about 9 million unique sites visited by roughly 3 million users, all in Germany.
The data clearly distinguished the light users, those who visited only a few dozen sites over a 30-day span, from the heavy users, who had tens of thousands of data points sitting there to be examined.
Eckert and Dewes didn’t even have to pay for the data access, they said. What they did do was create a fake marketing company: They launched a website for the company and a LinkedIn page for its fake CEO.
That fake marketing company claimed to have developed a machine learning algorithm that could improve marketers’ tactics… but only if it was trained with a large amount of data. (This is common: machine learning depends on finding and exploiting patterns, and you need a whole crapton of data points in order to identify those patterns in a meaningful way.)
In short, they used the fake company to go begging. “We wrote and called nearly a hundred companies, and asked if we could have the raw data, the clickstream from people’s lives,” Eckert said.
It took longer than they expected — but only because they were specifically targeting Germany. “We often heard: ‘Browsing data? That’s no problem. But we don’t have it for Germany, we only have it for the US and UK,’” she said.
Eventually, one data broker was willing to help them “test” their “data platform,” and parted with the data trove for free.
That data was, of course, anonymized, but putting names back on it wasn’t particularly hard once they got started.
Some users were easily pinpointed by URLs that uniquely identify them. For example, if a verified Twitter user looks up their analytics, Dewes explained, that generates a unique URL containing their Twitter username (analytics.twitter.com/user/[username]/). That, in turn, reveals who the user is, and lo, that person’s entire browsing history is instantly connected to an identity.
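To make that concrete, here is a minimal sketch of what such a lookup could look like. It is not the researchers' actual code: the anonymous IDs, the sample URLs, and the `find_username` helper are all made up for illustration, and only the URL pattern comes from the example above.

```python
import re

# Pattern based on the example in the text:
# analytics.twitter.com/user/[username]/
ANALYTICS_URL = re.compile(
    r"^(?:https?://)?analytics\.twitter\.com/user/([^/?#]+)"
)

def find_username(history):
    """Return the first username found in a list of URLs, or None."""
    for url in history:
        match = ANALYTICS_URL.match(url)
        if match:
            return match.group(1)
    return None

# Hypothetical "anonymized" clickstream: random IDs mapped to visited URLs.
clickstream = {
    "user-0001": [
        "https://example.com/news",
        "https://analytics.twitter.com/user/example_handle/tweets",
        "https://example.org/health/some-condition",
    ],
    "user-0002": [
        "https://example.com/sports",
    ],
}

# One matching URL is enough to attach a name to the whole history.
identified = {
    anon_id: find_username(urls) for anon_id, urls in clickstream.items()
}
print(identified)
```

In this toy data, a single revealing URL ties everything else in the first user's history, health page included, to a named account, while the second user stays unidentified for now.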
Even for those who don’t have their names conveniently waiting in their URL history, though, coming up with digital “fingerprints” and making an educated guess about a data trail isn’t that hard, Dewes and Eckert said.
Basically, every URL you add to the list shrinks the overlap in the Venn diagram of people this particular data trail could belong to.
Think of it like this: first you visit your employer’s site; 500 people work there. Then you visit your bank’s site: only 50 of the 500 people who work at your employer do their banking there.
That already reduces the pool of people this particular data trail could belong to down to 10% of its original size. Then we shrink it from there: You read up about a medical condition that only 30 people at your employer have, visit a site for a hobby that 70 people at your employer share, and look at the website for a school that 20 people at your employer have children currently enrolled in.
The more data points like this you leave, the more you winnow down the potential overlaps, and the more quickly you become the sole nexus where all of these threads cross.
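The winnowing described above is essentially a running set intersection, which can be sketched in a few lines. The group sizes (500, 50, 30, 70, 20) come from the example; who belongs to which group is an assumption invented here so the overlaps work out, and `narrow_candidates` is a hypothetical helper, not anything Eckert and Dewes described.

```python
def narrow_candidates(visited_sites, audiences):
    """Intersect the audiences of each visited site, tracking pool size."""
    candidates = None
    sizes = []
    for site in visited_sites:
        audience = audiences[site]
        if candidates is None:
            candidates = set(audience)
        else:
            candidates &= audience
        sizes.append(len(candidates))
    return candidates, sizes

# Made-up population: employees are numbered 0..499.
# Group sizes match the example; the memberships are invented.
audiences = {
    "employer": set(range(500)),         # 500 people work there
    "bank": set(range(0, 500, 10)),      # 50 bank at this bank
    "medical": set(range(0, 150, 5)),    # 30 have the condition
    "hobby": set(range(0, 490, 7)),      # 70 share the hobby
    "school": set(range(0, 500, 25)),    # 20 have kids at the school
}

trail = ["employer", "bank", "medical", "hobby", "school"]
candidates, sizes = narrow_candidates(trail, audiences)
print(sizes)       # [500, 50, 15, 3, 1]
print(candidates)  # {0}
```

With these invented memberships, five sites are enough to get from 500 possible people down to exactly one. Real overlaps will be messier, but the pool can only ever shrink or stay the same with each added URL.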
While that process may sound fairly convoluted, on the grand scale it’s really not: It takes about 10 URLs, total, to uniquely identify someone, Eckert and Dewes said.
Now think about how many URLs you’ve got in your history — odds are it’s a lot more than 10, and that many of them are relevant to you, personally, in some way.
Really personal data
So just how salacious were the items the pair were able to glean from this “anonymous” data dump?
Well, they found and identified a judge’s porn preferences and the medication taken by a member of the German parliament, among other things — and if someone wanted to, they could almost certainly do the same for you.
There are some things you can do to obscure your footprint and hide your breadcrumb trail, if you’re particularly invested in doing so. But those solutions are imperfect. Many of them wouldn’t even help with the type of data collection that was used to gather the database Dewes and Eckert used, which was scraped from a “safe surfing” browser plugin users had installed voluntarily.
And, depressingly, it’s a problem that’s going to get worse before it gets better. Absent a strong internet privacy law — which we very briefly had — not only the companies you deal with online but also the internet service providers who get you there can do basically whatever they want with your data.