Back in 2006, AOL tried something incredibly bold and even more incredibly *stupid*: they dumped a data-set of 20,000,000 "anonymized" search queries from 650,000 users (yes, AOL had a search engine - there used to be *lots* of search engines!):
en.wikipedia.org/wiki/AOL_sear…
--
If you'd like an essay-formatted version of this thread to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:
pluralistic.net/2025/06/19/pri…
1/
Cory Doctorow
in reply to Cory Doctorow
The AOL dump was a catastrophe. In an eyeblink, many of the users in the dataset were de-anonymized. The dump revealed personal, intimate and compromising facts about the lives of AOL search users.
2/
The AOL dump is notable for many reasons, not least because it jumpstarted the academic and technical discourse about the limits of "de-identifying" datasets by stripping out personally identifying information prior to releasing them for use by business partners, researchers, or the general public.
3/
It turns out that de-identification is *fucking hard*. Just a couple of datapoints associated with an "anonymous" identifier can be sufficient to de-anonymize the user in question:
pnas.org/doi/full/10.1073/pnas…
4/
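The mechanism behind these de-anonymization attacks is easy to demonstrate. Here's a toy sketch (the data, names and column choices are all invented for illustration - none of this comes from the linked studies): an "anonymized" table keeps quasi-identifiers like ZIP code, birthdate and sex, and anyone holding a public roster with names and those same columns can join the two:

```python
# Invented example data: an "anonymized" health table (no names)...
anonymized_health = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1982-03-12", "sex": "M", "diagnosis": "asthma"},
    {"zip": "02138", "dob": "1990-11-02", "sex": "F", "diagnosis": "diabetes"},
]

# ...and a public roster (think voter rolls) that DOES have names.
public_roster = [
    {"name": "A. Smith", "zip": "02139", "dob": "1982-03-12", "sex": "M"},
    {"name": "B. Jones", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
]

def reidentify(anon_rows, roster):
    """Join the tables on the quasi-identifiers; any combination that is
    unique in the roster de-anonymizes that record."""
    hits = []
    for a in anon_rows:
        matches = [r for r in roster
                   if (r["zip"], r["dob"], r["sex"]) == (a["zip"], a["dob"], a["sex"])]
        if len(matches) == 1:  # unique combination => re-identified
            hits.append((matches[0]["name"], a["diagnosis"]))
    return hits

print(reidentify(anonymized_health, public_roster))
# → [('B. Jones', 'hypertension'), ('A. Smith', 'asthma')]
```

No names ever appeared in the "anonymized" release - the linkage alone exposes the diagnoses.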
But firms stubbornly refuse to learn this lesson. They would love it if they could "safely" sell the data they suck up from our everyday activities, so they declare that they *can* safely do so, and sell giant data-sets, and then bam, the next thing you know, a federal judge's porn-browsing habits are published for all the world to see:
theguardian.com/technology/201…
5/
'Anonymous' browsing data can be easily exposed, researchers reveal
Alex Hern (The Guardian)
Indeed, it appears that there may be *no* way to truly de-identify a data-set:
pursuit.unimelb.edu.au/article…
Which is a serious bummer, given the potential insights to be gleaned from, say, population-scale health records:
nytimes.com/2019/07/23/health/…
It's clear that de-identification is not fit for purpose when it comes to these data-sets:
cs.princeton.edu/~arvindn/publ…
6/
Understanding the maths is crucial for protecting privacy
Dr Chris Culnane (The University of Melbourne)
But that doesn't mean there's no safe way to data-mine large data-sets. "Trusted research environments" (TREs) can allow researchers to run queries against multiple sensitive databases without ever seeing a copy of the data, and good procedural vetting as to the research questions processed by TREs can protect the privacy of the people in the data:
pluralistic.net/2022/10/01/the…
7/
But companies are perennially willing to trade your privacy for a glitzy new product launch. Amazingly, the people who *run* these companies and design their products seem to have *no clue* as to how their users *use* those products. Take Strava, a fitness app that dumped maps of where its users went for runs and revealed a bunch of secret military bases:
gizmodo.com/fitness-apps-anony…
8/
Fitness App's 'Anonymized' Data Dump Accidentally Reveals Military Bases Around the World
Matt Novak (Gizmodo)
Or Venmo, which, by default, lets *anyone* see what payments you've sent and received (researchers have a field day just filtering the Venmo firehose for emojis associated with drug buys like "pills" and "little trees"):
nytimes.com/2023/08/09/technol…
9/
Then there was the time that Etsy decided that it would publish a feed of everything you bought, never once considering that maybe the users buying gigantic handmade dildos shaped like lovecraftian tentacles might not want to advertise their purchase history:
arstechnica.com/information-te…
10/
Etsy users irked after buyers, purchases exposed to the world
Jacqui Cheng (Ars Technica)
But the most persistent, egregious and consequential sinner here is Facebook (naturally). In 2007, Facebook opted its 20,000,000 users into a new system called "Beacon" that published a public feed of every page you looked at on sites that partnered with Facebook:
en.wikipedia.org/wiki/Facebook…
Facebook didn't just publish this - they also lied about it. Then they admitted it and promised to stop, but that was also a lie.
11/
They ended up paying $9.5m to settle a lawsuit brought by some of their users, and created a "Digital Trust Foundation" which they funded with another $6.5m. Mark Zuckerberg published a solemn apology and promised that he'd learned his lesson.
Apparently, Zuck is a slow learner.
Depending on which "submit" button you click, Meta's AI chatbot publishes a feed of all the prompts you feed it:
techcrunch.com/2025/06/12/the-…
12/
The Meta AI app is a privacy disaster | TechCrunch
Amanda Silberling (TechCrunch)
Users are clearly hitting this button without understanding that this means that their intimate, compromising queries are being published in a public feed. *TechCrunch*'s Amanda Silberling trawled the feed and found:
* "An audio recording of a man in a Southern accent asking, 'Hey, Meta, why do some farts stink more than other farts?'"
* "people ask[ing] for help with tax evasion"
* "whether family members would be arrested for their proximity to white-collar crimes"
13/
* "how to write a character reference letter for an employee facing legal troubles, with that person’s first and last name included."
While the security researcher Rachel Tobac found "people’s home addresses and sensitive court details, among other private information":
twitter.com/racheltobac/status…
14/
There's no warning about the privacy settings for your prompts, and if you use Meta's AI to log in to Meta services like Instagram, it publishes your Instagram search queries as well, including "big booty women."
As Silberling writes, the only saving grace here is that almost no one is using Meta's AI app. The company has racked up a paltry 6.5m downloads, across its ~3 billion users, after spending tens of billions of dollars developing the app and its underlying technology.
15/
The AI bubble is overdue for a pop:
wheresyoured.at/measures/
When it does, it will leave behind some kind of residue - cheaper, spin-out, standalone models that will perform many useful functions:
locusmag.com/2023/12/commentar…
16/
Desperate Times, Desperate Measures
Edward Zitron (Ed Zitron's Where's Your Ed At)
Those standalone models were released as toys by companies pumping tens of billions into the unsustainable "foundation models," who bet that - despite the worst unit economics of any tech in living memory - these tools would someday become economically viable, capturing a winner-take-all market with trillions of upside. That bet remains a longshot, but the littler "toy" models are beating everyone's expectations by wide margins, with no end in sight:
nature.com/articles/d41586-025…
17/
How China created AI model DeepSeek and shocked the world
Smriti Mallapaty (Nature)
I can easily believe that one enduring use-case for chatbots is as a kind of enhanced diary-cum-therapist. Journalling is a well-regarded therapeutic tactic:
charliehealth.com/post/cbt-jou…
And the invention of chatbots was *instantly* followed by ardent fans who found that the benefits of writing out their thoughts were magnified by even primitive responses:
en.wikipedia.org/wiki/ELIZA_ef…
18/
Which shouldn't surprise us. After all, divination tools, from the I Ching to tarot to Brian Eno and Peter Schmidt's Oblique Strategies deck have been with us for thousands of years: even random responses can make us better thinkers:
en.wikipedia.org/wiki/Oblique_…
I make daily, extensive use of my own weird form of random divination:
pluralistic.net/2022/07/31/div…
19/
The use of chatbots as therapists is not without its risks. Chatbots can - and do - lead vulnerable people into extensive, dangerous, delusional, life-destroying ratholes:
rollingstone.com/culture/cultu…
But that's a (disturbing and tragic) minority. A journal that responds to your thoughts with bland, probing prompts would doubtless help many people with their own private reflections.
20/
AI-Fueled Spiritual Delusions Are Destroying Human Relationships
Miles Klee (Rolling Stone)
The keyword here, though, is *private*. Zuckerberg's insatiable, all-annihilating drive to expose our private activities as an attention-harvesting spectacle is poisoning the well, and he's far from alone. The entire AI chatbot sector is so surveillance-crazed that anyone who uses an AI chatbot as a therapist needs their head examined:
pluralistic.net/2025/04/01/doc…
21/
AI bosses are the latest and worst offenders in a long and bloody lineage of privacy-hating tech bros. No one should ever, ever, *ever* trust them with *any* private or sensitive information. Take Sam Altman, a man whose products routinely barf up the most ghastly privacy invasions imaginable, a completely foreseeable consequence of his totally indiscriminate scraping for training data.
22/
Altman has proposed that conversations with chatbots should be protected with a new kind of "privilege" akin to attorney-client privilege and related forms, such as doctor-patient and confessor-penitent privilege:
venturebeat.com/ai/sam-altman-…
I'm all for adding new privacy protections for the things we key or speak into information-retrieval services of all types.
23/
But Altman is (deliberately) omitting a key aspect of all forms of privilege: they immediately vanish the *instant* a third party is brought into the conversation. The things you tell your lawyer *are* privileged, unless you discuss them with anyone else, in which case, the privilege disappears.
And of course, all of Altman's products harvest all of our information.
24/
Altman is the untrusted third party in every conversation everyone has with one of his chatbots. He is the eternal Carol, forever eavesdropping on Alice and Bob:
en.wikipedia.org/wiki/Alice_an…
Altman isn't proposing that *chatbots* acquire a privilege, in other words - he's proposing that *he* should acquire this privilege. That he (and he alone) should be able to mine your queries for new training data and other surveillance bounties.
25/
This is like when Zuckerberg directed his lawyers to destroy NYU's "Ad Observer" project, which scraped Facebook to track the spread of paid political misinformation. Zuckerberg denied that this was being done to evade accountability, insisting (with a miraculously straight face) that it was in service to protecting Facebook users' (nonexistent) privacy:
pluralistic.net/2021/08/05/com…
We get it, Sam and Zuck - you love privacy.
We just wish you'd share.
26/
I'm on a 20+ city book tour for my new novel *Picks and Shovels*.
Catch me in #PDX with BUNNIE HUANG at Barnes and Noble TOMORROW (Jun 20):
stores.barnesandnoble.com/even…
And at the #TUALATIN Public Library on SUNDAY (Jun 22):
tualatinoregon.gov/library/aut…
More tour dates (#London, #Manchester) here:
martinhench.com
27/
Author Signing with Cory Doctorow & Andrew "Bunnie" Huang
Image:
Cryteria (modified)
commons.wikimedia.org/wiki/Fil…
CC BY 3.0
creativecommons.org/licenses/b…
eof/
File:HAL9000.svg - Wikimedia Commons
Pteryx the Puzzle Secretary
Which probably explains why I didn't even *know* that my roommate was going through a years-long court misadventure over a fender bender until it was all over. I'm guessing they interpreted the whole breach-of-privilege thing too broadly.
(Which is to say, I'd assume that admitting that legal stuff is happening in a broad sense would not constitute a breach, but correct me if I'm wrong...)