LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI (Including Many Fediverse Instances!!!)
"The tech giant is sidestepping guardrails that websites use to prevent being scraped, data show, in a move whistleblowers say is unethical and potentially illegal."
ARTICLE: dropsitenews.com/p/meta-facebo…
FULL PDF: dropsitenews.com/api/v1/file/b…
ophiocephalic 🐍
in reply to #FediPact • • •> Rather than scraping from sites directly, many of the addresses on Meta’s leaked list belong to Content Delivery Networks (CDNs) that are used by websites to cache and store information to improve site performance.
This is a critical point. An instance or website can defend itself in numerous different ways, including actively adversarial strategies, and still succumb to extraction - if they're using Cloudflare
cc: @subMedia
Kevin Karhan
in reply to ophiocephalic 🐍 • • •@ophiocephalic @subMedia People who use #CloudFlare since #KiwiFarms became their client (which they only fired when bigger corporate clients went "it's us or them!") already gave up on #Hosting, cuz #ClownFlare is a #RogueISP known for willingly hosting #Cybercrime & #Daesh propaganda sites!
ophiocephalic 🐍
in reply to Kevin Karhan • • •@kkarhan
Yeah, Cloudflare is scummy. Fedi admins use Cloudflare because it can speed up performance and protect against ddos. The problem here is that Cloudflare itself needs to be considered in the threat model, especially after this news
@FediPact @subMedia
your auntifa liza 🇵🇷 🦛 🦦
in reply to ophiocephalic 🐍 • • •🛎 🛎 🛎 thank you!
a big chunk of their services are free, aren't they?
the fact the founder shares the same last name as the guy behind Academi/Blackwater, even if they were not related, i’d assume they’re making money out of surveillance.
and “AI” *is* spyware, not just plagiarism-as-a-service.
@ophiocephalic @kkarhan @FediPact @subMedia
ophiocephalic 🐍
in reply to your auntifa liza 🇵🇷 🦛 🦦 • • •@blogdiva
sus as hell all the way around
@kkarhan @FediPact @subMedia
ophiocephalic 🐍
in reply to ophiocephalic 🐍 • • •Another sickening consideration here. If they're scraping Cloudflare and CDNs rather than directly, it's possible or likely they're not just extracting public posts, but all posts, including DMs
@subMedia
Jon
in reply to ophiocephalic 🐍 • • •Yeah, definitely grounds for concern. And yet another good reminder DO NOT USE FEDI FOR ANYTHING CONFIDENTIAL, that's what Signal is for!
All that being said, it's not clear to me whether Meta is scraping images from DMs or followers-only posts. If they're looking at neuromatch.social (or following the 'show original post' link for any profiles or public/unlisted posts that federate from neuromatch to instances that don't have their public feeds locked down) they'd get files from media.neuromatch.social. But is there a realistic way for them to scrape all images from media.neuromatch.social?
Anyhow @jonny @moderation according to this report Meta's scraping a list of domains that include media.neuromatch.social. Meta denies it of course but we all know that doesn't mean anything. It's not clear just what it means for domains to be on the list, and I'm not sure what to do in response -- blocking all known Meta domains and IPs at the network level is a good idea if that's not already happening, although it's easy enough for them to work around it.
EDIT, August 10: here's jonny's update: the neuromatch media site wasn't publicly enumerable.
@ophiocephalic @FediPact @subMedia
jonny (good kind) (@jonny@neuromatch.social)
Neuromatch Social
My camera shoots fascists
in reply to ophiocephalic 🐍 • • •@ophiocephalic @subMedia
OK, wait, Cloudflare has a setting you can activate where they will block known scrapers. I have it turned on for my personal sites. How do those two things square?
fancysandwiches
in reply to My camera shoots fascists • • •
ophiocephalic 🐍
in reply to fancysandwiches • • •@fancysandwiches
This. At the end of the day, Cloudflare is just another unaccountable black box of a corporation
@Mikal @FediPact @subMedia
subMedia
in reply to ophiocephalic 🐍 • • •@ophiocephalic
I've shared your concerns with our tech team. We will look into other options for DDOS protection, and will also look into using some of CloudFlare's tools to block AI scrapers.
That said, there are two points that our tech team made:
1) There's currently no evidence that Meta is scraping DMs from Cloudflare. It's not even clear to members of our tech team that this is technically possible to do.
2) Kolektiva.social posts, like most things on the fediverse, are public. So at the end of the day, there is little that can be done to protect that data from scrapers, aside from making our instance only visible to logged in users.
If you've got any other practical suggestions for steps we can take to protect users' data, let us know, as we're always happy to do what we can.
ophiocephalic 🐍
in reply to subMedia • • •@subMedia
Thanks for the response and for you and the team taking action on this!
In reply to the points made. My comment on DMs was pure speculation, based on the possible premise that if the DMs are passing through Cloudflare like all the other posts, and the Cloudflare cache is scraped, everything could get hoovered up.
Indeed, it's not clear that Cloudflare is the vector for the inclusion of the Kolektiva instances on this list at all. One suggestion, though: beyond the Cloudflare consideration, there are options for blocking the range of IPs controlled by Meta from server access. If y'all haven't done that yet, here are a couple of links:
Instructions from @MOULE : mastodon.moule.world/@MOULE/11…
Facebook IP lists: github.com/SecOps-Institute/Fa…
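To illustrate what blocking by IP range amounts to, here is a minimal Python sketch, assuming the published lists are plain text with one CIDR range per line (an assumption; check the actual file format before relying on it). The ranges below are reserved documentation blocks used as placeholders, not real Meta ranges, and the function names are hypothetical. In practice the same check would normally live in a firewall or web server config rather than in application code.

```python
import ipaddress

def load_networks(lines):
    """Parse CIDR strings into network objects, skipping blanks and '#' comments."""
    networks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # strict=False tolerates entries whose host bits are set
        networks.append(ipaddress.ip_network(line, strict=False))
    return networks

def is_blocked(addr, networks):
    """Return True if addr falls inside any of the listed networks."""
    ip = ipaddress.ip_address(addr)
    # An IPv4 address is never 'in' an IPv6 network (and vice versa),
    # so mixed lists are safe to check in one pass.
    return any(ip in net for net in networks)

# Placeholder ranges (RFC 5737 / RFC 3849 documentation blocks), NOT real Meta ranges.
SAMPLE = ["# example list", "192.0.2.0/24", "2001:db8::/32", ""]
NETWORKS = load_networks(SAMPLE)
```

Calling `is_blocked("192.0.2.55", NETWORKS)` checks a single client address against the loaded list. As noted elsewhere in the thread, this only stops requests that come from listed ranges, so it is easy to work around.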
Thank you again for jumping on this!
@FediPact
GitHub - SecOps-Institute/FacebookIPLists: Hourly Checked and Updated if Facebook modifies their list
GitHub
Tiota Sram
in reply to ophiocephalic 🐍 • • •
ophiocephalic 🐍
in reply to Tiota Sram • • •@tiotasram
Yep, that's a thing. Not sure if anyone has tried applying that to a fedi web client yet, and how it might otherwise affect functionality
@subMedia @MOULE @FediPact
My camera shoots fascists
in reply to ophiocephalic 🐍 • • •@ophiocephalic @fancysandwiches @subMedia
I will ask them, but as a random free-tier user, I don't expect much of a response.
ophiocephalic 🐍
in reply to My camera shoots fascists • • •@Mikal
Good idea, would be curious to see if you get anything out of them!
@fancysandwiches @FediPact @subMedia
Jon
in reply to ophiocephalic 🐍 • • •Yeah. As long as you're already giving Cloudflare access to all the data then it probably makes sense to turn on their option to block known scrapers; it's in their interests to make it as effective as possible. But once Cloudflare has the data, who do they share it with? On the one hand, they don't have an ad-funded business model, and they have a real business that's successful enough that it's not in their interest to sell people's data. On the other hand, they're also a surveillance capitalism company, so I certainly wouldn't trust them.
@ophiocephalic @fancysandwiches @Mikal @FediPact @subMedia
Jon
in reply to Jon • • •The same's true for Fastly of course, or anybody else. It really is a dilemma. On the one hand, DDOS protection is valuable, a CDN can make things a lot cheaper (and faster), and the AI blocking is at least somewhat useful. But it comes with a cost
@ophiocephalic @fancysandwiches @Mikal @FediPact @subMedia
ophiocephalic 🐍
in reply to Jon • • •@jdp23
I am not suggesting this as a course of action for anyone, it would be a heavy lift for even the most invested fedi admins. But FWIW, it's entirely possible to roll your own CDN and even ddos protection. If there's anyone foolhardy enough to find this of interest let me know and I'll shoot you a few links
@fancysandwiches @Mikal @FediPact @subMedia
your auntifa liza 🇵🇷 🦛 🦦
in reply to ophiocephalic 🐍 • • •aaaaaaand this is exactly why i have been side-eyeing Cloudflare for years.
a whole generation of web developers mindlessly gave their labor and the intellectual property of their clients to a white guy who said, “don’t worry be happy trust me you got it gratis”.
when i heard about the Iguanazi scraping fuckery, my first thought was: how can they do this without some huge cache they can control.
and there you go. fucking Cloudflare.
@ophiocephalic @FediPact @subMedia
ophiocephalic 🐍
in reply to your auntifa liza 🇵🇷 🦛 🦦 • • •@blogdiva
💯 Cloudflare is truly the mother of all internet centralization schemes
@FediPact @subMedia
#FediPact
in reply to #FediPact • • •INSTANCES KNOWN TO HAVE BEEN SCRAPED BY META INCLUDE:
• mastodon.social
• mastodon.online
• tech.lgbt
• hackers.town
• chaos.social
• mastodon.org.uk
• mastodont.cat
• mastodon.de
• mastodon.xyz
• mastodon.coffee
• mastodon.cloud
• mastodon.scot
• mastodonapp.uk
• mastodon.green
• mastodon.ml
• mastodon.au
• mastodon.eus
• mastodonczech.cz
• mastodon.sdf.org
• mstdn.social
• troet.cafe
• techhub.social
• tchncs.de
• kolektiva.social
• mamot.fr
• defcon.social
• meow.social
• social.linux.pizza
• ioc.exchange
• eldritch.cafe
• yiff.life
• furry.engineer
• infosec.exchange
• blahaj.zone
• woof.group
• union.place
• queer.party
• sakurajima.moe
• pawb.social
• digipres.club
• journa.host
• corteximplant.net
• corteximplant.com
• octodon.social
• bitbang.social
• jorts.horse
• tenforward.social
• pnw.zone
• spore.social
• hear-me.social
• neuromatch.social
• vt.social
• cosocial.ca
• chitter.xyz
• tooter.social
• cloudisland.nz
• social.seattle.wa.us
• masto.es
• nobigtech.es
• mastodon.gal
• masto.host
• toot.community
• pony.social
• climatejustice.global
• pleroma.envs.net
• indiepocalypse.social
• anarchism.space
• disroot.org
• dragonscave.space
• toot.bike
• fuzzies.wtf
• norden.social
• beige.party
• ohai.social
• freeradical.zone
• metalhead.club
• treehouse.systems
• icosahedron.website
• sunbeam.city
• sunny.garden
• zeroes.ca
• ursal.zone
• chaosfem.tw
• mas.to
• mathstodon.xyz
• rubber.social
• todon.nl
• cupoftea.social
• nerdculture.de
• toad.social
there're definitely more, i just did ctrl+f when i thought of an instance name so i definitely missed some. will be editing this list to add them as i think of them
#FediPact #meta #threads
MurmeltHier
in reply to #FediPact • • •
PaulaToThePeople 😷
in reply to MurmeltHier • • •Well, that's bad. But it's not news.
That's what AI does.
@b2c is constantly fighting AI traffic to our server. But the scrapers keep trying to scrape everything despite any countermeasures.
I'll look more into it later, but from what I see now cj.g is on the list but cj.s not.
That could be because cj.g has the bluesky bridge enabled. Dunno.
PS: Nevermind. Now that I'm awake I searched the pdf and both climatejustice instances are on the list. But not all instances on our server.
Martin Ruskov
in reply to #FediPact • • •here's the lines of the leaked pdf that are present in the top 1000 in fedidb.com/servers
@fedidb
mastodon.org.uk
literatur.social
mastodont.cat
furries.club
libretooth.gr
noc.social
det.social
icosahedron.website
mastodon.hams.social
social.bau-ha.us
ecoevo.social
nso.group
social.politicaconciencia.org
toot.berlin
fuzzies.wtf
mastodon.jalgi.eus
mstdn.mx
social.anoxinon.de
ipv6.social
ciberlandia.pt
wisskomm.social
mastodon.tetaneutral.net
mamot.fr
eupolicy.social
social.librem.one
mastodon.bida.im
shelter.moe
tldr.nettime.org
mstdn.guru
corteximplant.com
mastodon.nl
social.bund.de
mastodon.uy
amicale.net
masto.es
anarchism.space
darmstadt.social
hessen.social
kafeneio.social
dju.social
pol.social
sunbeam.city
mastodon.cipherbliss.com
freiburg.social
todon.eu
social.sciences.re
functional.cafe
machteburch.social
nrw.social
jasette.facil.services
spore.social
diaspodon.fr
social.rebellion.global
kolektiva.social
legal.social
openbiblio.social
social.kyiv.dcomm.net.ua
mastodon.me.uk
graz.social
toot.aquilenet.fr
systemli.social
tooting.ch
linuxrocks.online
lile.cl
tooter.social
digitalcourage.social
kirche.social
berlin.social
rollenspiel.social
furry.engineer
climatejustice.global
hostux.social
mastodon.zaclys.com
toot.bike
wien.rocks
xn--baw-joa.social
apobangpo.space
pouet.chapril.org
mastodon.ml
masto.nobigtech.es
mastodon-belgium.be
mastodon.eus
mastodon.thirring.org
norden.social
todon.nl
typo.social
fediscience.org
social.overheid.nl
cyberplace.social
climatejustice.social
mastodonczech.cz
mastodon.sdf.org
equestria.social
nerdculture.de
vt.social
gruene.social
bonn.social
don.linxx.net
pawb.fun
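The cross-check behind a list like this is a set intersection between the domains in the leaked PDF and a directory of known servers. A minimal Python sketch follows, assuming both lists have already been extracted to plain domain strings (the extraction step is not shown). The function names and sample data are hypothetical, and `example.invalid` is a placeholder, not an entry from the leak.

```python
def normalize(domain):
    """Lowercase a domain and strip whitespace and any trailing dot."""
    return domain.strip().lower().rstrip(".")

def overlap(leaked, directory):
    """Return leaked domains that also appear in the directory.

    Keeps the order of the leaked list and drops duplicates.
    """
    wanted = {normalize(d) for d in directory if d.strip()}
    seen = set()
    result = []
    for domain in leaked:
        name = normalize(domain)
        if name in wanted and name not in seen:
            seen.add(name)
            result.append(name)
    return result

# Illustrative inputs only; a real run would read the full lists from files.
leaked_sample = ["Mamot.fr", "example.invalid", "kolektiva.social", "mamot.fr"]
directory_sample = ["kolektiva.social", "mamot.fr", "todon.nl"]
matches = overlap(leaked_sample, directory_sample)
```

Exact matching like this misses media or cache subdomains (for example a `media.` host for an instance), so a real comparison might also need to compare on the parent domain.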
Helen LH
in reply to #FediPact • • •
Cybarbie
in reply to #FediPact • • •Bake 425 °F 15 min, then 375 °F 45 min until golden for the best apple pie recipe best top generative best
Helga Numberger 🐧
in reply to #FediPact • • •
Erik (ring2)
in reply to Helga Numberger 🐧 • • •
Helga Numberger 🐧
in reply to Erik (ring2) • • •At least it has already been mentioned or asked about several times. I can well imagine that someone will take the initiative...
Justin Derrick
in reply to #FediPact • • •I'm not super familiar with the inner workings of Mastodon/Activity Pub -- but does this mean that if I've interacted with any of these instances, that my messages were also scraped, even if I'm on an instance not on that list?
I feel like that could be a really big class-action lawsuit of some of the most tech savvy people on the planet.
DoryTheFish🌌
in reply to #FediPact • • •
Jérôme
in reply to DoryTheFish🌌 • • •@DoryTheFish you can’t prevent it as long as your profile is public. That’s the whole point, profile public = anybody can scrape it.
Mastodon is available on the web. Anybody can scrape any website, and Mastodon by default does not require a login to see users' public posts.
It’s definitely not ethical but it’s technically not complicated.
You can make your profile private and always reply privately if you don’t want your posts to be visible.
Matthew
in reply to Jérôme • • •@FediPact@cyberpunk.lol @DoryTheFish@beige.party
#FediPact
in reply to #FediPact • • •i'm gonna be editing that list as i think of more so be sure to view it directly on cyberpunk.lol to make sure you get the whole thingy
#FediPact #meta #threads
Sue Briccay
in reply to #FediPact • • •@mods
Is this true WRT tech.lgbt?
@FediPact
bluestarultor
in reply to Sue Briccay • • •
Pseudonymous
in reply to bluestarultor • • •@bluestarultor
@essjayjay @mods
You said scraping was legal. Presuming we're talking about the U.S.A. here, can you explain how that can be in a country that presumes everything I write defaults to being subject to my personal copyright?
The Nexus of Privacy
in reply to Pseudonymous • • •At least so far, individuals haven't succeeded in copyright claims against web scrapers. Here's a good article on the US legal landscape as of a couple of years ago (with the caveat that it's by somebody who sees scraping as generally a good thing): blog.ericgoldman.org/archives/… From a privacy perspective, papers.ssrn.com/sol3/papers.cf… looks at the challenges.
@VictimOfSimony @bluestarultor @essjayjay @FediPact
Web Scraping for Me, But Not for Thee (Guest Blog Post) - Technology & Marketing Law Blog
Eric Goldman (Technology & Marketing Law Blog)
bluestarultor
in reply to The Nexus of Privacy • • •@thenexusofprivacy @VictimOfSimony @essjayjay Also literally no one in this thread said it was legal. XD
Even the original article notes that it's illegal to be slurping up copyrighted works, but that they failed to convince the judge of meaningful damages meriting restitution.
I said scraping is "not necessarily consensual" and that's because various sites have entered partnerships to sell off their users' creations with some half-assed nod to getting their consent.
The Nexus of Privacy
in reply to bluestarultor • • •Fair enough, I was just responding to @VictimOfSimony's question about scraping and copyright.
@bluestarultor @essjayjay @FediPact
Pseudonymous
in reply to The Nexus of Privacy • • •@thenexusofprivacy
@bluestarultor
@essjayjay
We do appreciate the response.
Pseudonymous
in reply to The Nexus of Privacy • • •@thenexusofprivacy
@bluestarultor
@essjayjay
This article seems to think the problem is that a third party is asserting the copyright. The fact that these class actions are becoming more popular with first parties seems to suggest you're mistaken. Also, the trespass issue I mentioned remains since there is no implied right of access to chattels for an illegal purpose. There's a tort here.
The Nexus of Privacy
in reply to Pseudonymous • • •There are quite a few class actions in process and it'll be interesting to see how things play out. And even though the plaintiffs in the Meta case didn't succeed, the court certainly left the door open to other attempts -- and arguably even encouraged them. technologyreview.com/2025/07/0… is a good overview of the Meta and Anthropic cases, and as they point out the wins for the tech companies are less cut-and-dried than they seem at first.
Still, even though the answer may be different at some point, right now I think it's still true that so far individuals haven't succeeded in copyright claims against scrapers.
@VictimOfSimony @bluestarultor @essjayjay @FediPact
What comes next for AI copyright lawsuits?
Will Douglas Heaven (MIT Technology Review)
Paul Sutton
in reply to #FediPact • • •
Kevin Karhan
in reply to Paul Sutton • • •
Paul Sutton
in reply to Kevin Karhan • • •
Oblomov
in reply to #FediPact • • •
Carlo Gubitosa
in reply to Oblomov • • •
Oblomov
in reply to Carlo Gubitosa • • •
LisPi
in reply to Oblomov • • •
Grutjes
in reply to #FediPact • • •
stux⚡
in reply to Grutjes • • •
Quincy
in reply to #FediPact • • •Sensitive content
spla
in reply to #FediPact • • •I applied this nginx config to fight against it and many other AI bots and scrapers:
github.com/kurren/ai-bots-craw…
returning 444 to them seems a good way to confuse them and decrease server load.
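For context, 444 is a non-standard nginx code that tells nginx to close the connection without sending any response. The decision behind it usually comes down to matching the request's User-Agent header against a list of crawler tokens. Here is a minimal Python sketch of that matching logic only. The token list is illustrative and is not taken from the linked repository, and `should_drop` is a hypothetical helper name.

```python
# Illustrative tokens for some self-identified AI crawlers; the linked config
# may use a different and longer list.
BLOCKED_TOKENS = ("gptbot", "meta-externalagent", "ccbot", "claudebot")

def should_drop(user_agent):
    """Return True if the User-Agent contains any blocked crawler token.

    Matching is case-insensitive; a missing header is treated as not blocked.
    """
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_TOKENS)
```

This only catches crawlers that identify themselves honestly. A scraper that sends a browser-like User-Agent passes straight through, which is why it is often combined with IP-range blocking.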
GitHub - kurren/ai-bots-crawlers: Prevent ai bots to crawl a website (Nginx web server)
GitHub
Carlos Solís
in reply to #FediPact • • •
ophiocephalic 🐍
in reply to Carlos Solís • • •There are many FediPact instances on the list. But it is likely to be a major violation of consent for instances that federate too. Federation over the bridge has rules, whereas Meta's "AI" extractivism may be collecting posts and media that weren't intended to be shared with Threads or even public at all. Their ransacking of CDNs and cacheing subdomains suggests this is a real possibility
Sabella
in reply to #FediPact • • •⬜ Yes
☑️ Ask me later