We apologize for a period of extreme slowness today. The army of AI crawlers just leveled up and hit us very badly.

The good news: We're keeping up with the additional load of new users moving to Codeberg. Welcome aboard, we're happy to have you here. After adjusting the AI crawler protections, performance significantly improved again.

in reply to Codeberg

It seems like the AI crawlers learned how to solve the Anubis challenges. Anubis is a tool hosted on our infrastructure that requires browsers to do some heavy computation before accessing Codeberg again. It really saved us tons of nerves over the past months, because instead of manually maintaining blocklists we had a working way to tell "real browsers" from "AI crawlers".
in reply to Codeberg

However, we can confirm that at least Huawei networks now send the challenge responses, and they actually do seem to take a few seconds to compute the answers. It looks plausible, so we assume the AI crawlers have leveled up their computing power and now emulate enough real browser behaviour to bypass the variety of challenges that platforms have put up against the bot army.
in reply to Codeberg

We have a list of explicitly blocked IP ranges. However, due to a configuration oversight on our part, these ranges were only blocked on the "normal" routes. The "anubis-protected" routes didn't check the blocklist at all. That was not a problem as long as Anubis itself kept the crawlers away from those routes.

However, now that they managed to break through Anubis, there was nothing stopping these armies.

It took us a while to identify and fix the config issue, but we're safe again (for now).
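
(Illustration, not Codeberg's actual setup: the fix amounts to checking the IP blocklist before every route group, including the Anubis-protected one. A minimal Python sketch of that idea, with made-up ranges and route names:)

from ipaddress import ip_address, ip_network

# Hypothetical example ranges; a real blocklist would be maintained elsewhere.
BLOCKED_RANGES = [ip_network(c) for c in ("203.0.113.0/24", "198.51.100.0/24")]

def is_blocked(client_ip: str) -> bool:
    addr = ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)

def handle_request(client_ip: str, path: str) -> str:
    # The blocklist check runs first, for *all* routes, so the
    # challenge-protected paths no longer rely on Anubis alone.
    if is_blocked(client_ip):
        return "403 Forbidden"
    if path.startswith("/protected/"):  # hypothetical Anubis-protected prefix
        return "serve after Anubis challenge"
    return "serve normally"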

reshared this

in reply to Codeberg

For the load average auction, we offer these numbers from one of our physical servers. Who can offer more?

(It was not the "wildest" moment, but the only one for which we have a screenshot)

in reply to Codeberg

In the days of single CPU servers (early 90s?) and an interesting filesystem problem, I think I may have seen ~400 at a client site!
in reply to Codeberg

ouch. This remains a cat-and-mouse game.

At least having them solve the Anubis challenge does cost them extra resources, but if they can do that at scale, it doesn't promise a lot of good.

in reply to Codeberg

wow - that looks scary. Thanks for all your work ❀️
in reply to Codeberg

I'm really sorry there isn't a good legal avenue to stave off the abuse. Horrifying.
in reply to Codeberg

I really wish you contacted me at all about this before going public.
in reply to Xe

@cadey I'm sorry if this gave you any unwanted or negative attention. I consider crawlers emulating more of real browser features to bypass protections of websites an inevitable future, and today at least one big crawler seems to have started doing so. ~f
@Xe
in reply to Codeberg

Can we continue this conversation over email after my panic subsides? me@xeiaso.net.
in reply to sam

@thesamesam Unfortunately, I'm not sure if encouraging anyone to reinforce the vendor lock-in of Microsoft GitHub by making maintainers financially dependent on that platform is in the spirit of our mission. ~f
@sam
in reply to Codeberg

yeowsa. this feels like an arms race that is going to get harder πŸ™
in reply to Codeberg

This is a great number, but I have seen higher in my career. Unfortunately I either have no screenshots or have lost the ones I had.

5831.24 is pretty good though. Congrats on hitting it, hope your head doesn't hurt. 😁

in reply to Codeberg

damn. The only time I've seen numbers like this were when a ceph server went down.
in reply to Codeberg

what is the threshold for alerting like that? Grafana/Zabbix/Prometheus?
in reply to Codeberg

huh, that's a pretty kernel-heavy workload, so much red
This entry was edited (3 weeks ago)
in reply to Codeberg

thank you for the details. Very interesting. They are worth a blog post.
in reply to Codeberg

what if you had challenges for AI to perform that made it mine bitcoin for you and you just block them at the end anyway 🀣
in reply to Codeberg

Why not just block Huawei Cloud ASN prefixes?
It's easy to get them (e.g. from projectdiscovery)
in reply to Lenny

@lenny If you read the thread, you'll notice that this is exactly what we did, except that we made a mistake. ~f
in reply to Codeberg

We sometimes see similar numbers, or 10k+, when a user submits a 64-core job in a single slot and the cgroup limiting kicks in. A bit annoying that load is a bit useless for that nowadays
in reply to Codeberg

>now that they managed to break through Anubis
There was no break - it's a simple matter of changing the useragent, or if for some reason there's still a challenge, simply utilizing the plentiful computing power that is available on their servers (which far outstrips the processing power mobile devices have).

Anubis is evil and is proprietary malware - please do not attack your users with proprietary malware.

If you want to stop scraper bots, start serving GNUzip bombs - you can't scrape when your server RAM is full.

# Each bomb is tiny on disk but expands to the stated size when decompressed.
dd if=/dev/zero bs=1G count=10 | gzip > /tmp/10GiB.gz
dd if=/dev/zero bs=1G count=100 | gzip > /tmp/100GiB.gz
dd if=/dev/zero bs=1G count=1025 | gzip > /tmp/1TiB.gz

# nginx: serve gzip bombs
location ~* /bombs-path/.*\.gz {
    add_header Content-Encoding "gzip";
    default_type "text/html";
}

# serve zstd bombs
location ~* /bombs-path/.*\.zst {
    add_header Content-Encoding "zstd";
    default_type "text/html";
}

Then it's a matter of bait links that the user won't see, but bots will.

in reply to GNU/翠星石

@Suiseiseki Anubis is the option that saved us a lot of work over the past months. We are not happy about it being open core or using GitHub sponsors, but we acknowledge the position from the maintainer: codeberg.org/forgejo/discussio…

Calling our usage of anubis an attack on our users is far-fetched. But feel free to move elsewhere, or host an alternative without resorting to extreme measures. We're happy to see working proof that any other protection can be scaled up to the level of Codeberg. ~f

in reply to Codeberg

@Suiseiseki BTW, we're also actively following the work around iocaine, e.g. come-from.mad-scientist.club/@…

However, as far as we can see, it does not sufficiently protect from crawling. As the bot armies successfully spread over many servers and addresses, damaging one of them doesn't prevent the next one from making harmful requests, unfortunately. ~f

in reply to Codeberg

I believe @Suiseiseki is not referring to Codeberg's usage of Anubis specifically, but rather shares the FSF's stance (which I don't share) that Anubis "acts like malware" for making "calculations that a user does not want done": fsf.org/blogs/sysadmin/our-sma…

fsf saying fsf things πŸ˜€

in reply to Codeberg

@Suiseiseki@freesoftwareextremist.com β€œWe are not happy about it being open core … GH sponsors”

Do you have better suggestions for how we can have a sustainable OSS model that isn’t entirely dependent on core contributors of major projects having full time jobs and then supporting everyone else in whatever free time they might have?

in reply to Codeberg

so, to clarify, do you have evidence that the bots were solving Anubis challenges or not, i.e., it was due to the configuration issue? (I think it's inevitably going to happen if Anubis gets traction. I'm just curious if we're already there or not.) Thanks for your work and transparency on all this.
in reply to Stefano Zacchiroli

@zacchiro Yes, the crawlers completed the challenges. We tried to verify if they are sharing the same cookie value across machines, but that doesn't seem to be the case.
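
(For the curious, the kind of check described above could look roughly like this in Python; the log format and cookie name are assumptions, not Codeberg's real configuration:)

import re
from collections import defaultdict

# Assumed access-log shape: "<client-ip> ... anubis_cookie=<value> ...".
LINE_RE = re.compile(r"^(?P<ip>\S+) .*anubis_cookie=(?P<cookie>\S+)")

def ips_per_cookie(log_lines):
    seen = defaultdict(set)
    for line in log_lines:
        m = LINE_RE.match(line)
        if m:
            seen[m.group("cookie")].add(m.group("ip"))
    # A cookie value reused from many different IPs would hint at shared solutions.
    return {cookie: len(ips) for cookie, ips in seen.items()}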

Stefano Zacchiroli reshared this.

in reply to Codeberg

I have a follow up question, though, @Codeberg, re: @zacchiro's question. Is it *possible* that giant human farms of Anubis challenge-solvers actually did it? Or did it all happen so fast that there is no way it could be that?

#Huawei surely could fund such a farm and the routing software needed to get the challenge to the human and back to the bot quickly enough that it might *seem* the bot did it.

in reply to Bradley Kuhn

@bkuhn
Anubis challenges are not solved by humans. It's not like a captcha. It's a challenge that the browser computes, based on the assumption that crawlers don't run real browsers for performance reasons and only implement simpler clients.

So at least one crawler now seems to emulate enough browser behaviour to pass the Anubis challenge. ~f
@zacchiro

in reply to Codeberg

I get it now.

Thanks for taking the time to clue me in.

I'm lucky that I haven't needed to learn about this until now and I'm so sorry you've had to do all this work to fight this LLM training DDoS!

Cc: @zacchiro

in reply to HenrΓ½ Γ“lson

@nemo Currently not. We wanted to investigate the legal situation with regards to sharing such lists. They could currently contain individual's IP addresses and likely need to be cleaned up first. ~f
in reply to Codeberg

Was the solution to increase the proof-of-work difficulty?
in reply to Steven Sandoval

@baltakatei No. We fixed our config. Now we're blocking the offending IP ranges directly. ~f
in reply to Codeberg

have you tried filing a criminal complaint against the "attacker"? Because basically it's a breach of ToS and a DoS, right? So it might qualify as a violation of Β§ 303b StGB (German criminal code). I mean, I am no lawyer, but at least it's worth a try?
This entry was edited (3 weeks ago)
in reply to Codeberg

How much were they slowed down by actually solving the challenges? I was under the impression that the proof of work was the primary intent of Anubis, and the fact that most crawlers just bombed out and didn't even attempt them in the first place was a bonus.
in reply to Codeberg

It makes me wonder: is there a public curated IP blocklist somewhere that we can all use? I searched a bit, but I only found weak robots.txt solutions based on User-Agent.
in reply to Codeberg

Seems like a bad cat-and-mouse game; glad that you could stay on top of it (proves that humans can still win). Jesus Christ, those big tech companies should be held responsible for that shit and pay billions in fines. Maybe then they would think about stopping that insanity.
in reply to Codeberg

Good luck with fighting the bots. I recently moved my OSDev project and site to Codeberg from GitHub and so far it’s been great!

Thank you for helping the open-source community!

in reply to Codeberg

Now what needs to happen is that part of the challenge computes a known answer while the other part does useful computational work, and there's no way for the 'bot to tell which is which -- so it has to do both.

That could maybe contribute computing power to something important like Folding@Home, or even just something pretty like Electric Sheep.

in reply to Woozle Hypertwin

@woozle This topic was discussed in the past. The problem is that cutting useful work into small chunks AND verifying it is very difficult. It might work for some cryptocurrencies, but that's not something we're interested in.

A proof of concept is more than welcome, but I don't yet know if anyone found a suitable task for this.

~f

in reply to Woozle Hypertwin

(on further thought) ...or is it?

  • Create a set of N problems.
  • Solve a sampling of them.
  • Require the bot to solve all of them.
  • If the bot's solutions to the solved set don't match, then it fails the whole test.

Might that work? I guess there could be problems with trustability of the "unknown" answers -- does that look like the main issue to be solved?
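
(A toy sketch of that spot-check idea in Python, with deliberately trivial "problems" standing in for useful work:)

import random

def make_problems(n):
    # Deliberately trivial stand-in "problems"; useful work would go here.
    return [(random.randint(1, 10**6), random.randint(1, 10**6)) for _ in range(n)]

def solve(problem):
    a, b = problem
    return a * b

def issue_challenge(n=20, spot_checks=5):
    problems = make_problems(n)
    checked = random.sample(range(n), spot_checks)  # which indices we pre-solve; kept secret
    expected = {i: solve(problems[i]) for i in checked}
    return problems, expected

def verify(answers, expected):
    # `answers` is the client's list of solutions, index-aligned with `problems`.
    return all(answers[i] == val for i, val in expected.items())

problems, expected = issue_challenge()
answers = [solve(p) for p in problems]  # an honest client solves everything
assert verify(answers, expected)

(The open question raised in the thread remains: for genuinely useful work, the server has no cheap way to know the "unknown" answers are honest.)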

in reply to Woozle Hypertwin

@woozle Remember that users want to get through the challenge page quickly. So the more samples you have, the simpler the individual problems need to be.

~f

in reply to Codeberg

These companies are evidently willing to pay an absolutely staggering cost to do their scraping.

I wonder, are they paying with their own money, or are they β€œborrowing” some unsuspecting strangers' compromised computers/routers/etc to do the work?

This entry was edited (3 weeks ago)
in reply to Codeberg

I observed them too about a month ago. I then sent the whole AS to Google's recaptcha and it worked (at least people who can solve recaptcha can still access our site while these bots can't).
in reply to Codeberg

boy Huawei is so nasty

I wonder who are the biggest offenders on this matter...

in reply to Codeberg

"AI crawlers learned how to solve the Anubis challenges"

Why does EU discuss chat control and not AI crawlers control again?

movq reshared this.

in reply to Codeberg

eBPF could be more effective and easier on the CPU, since it acts at a much lower network layer. Anubis kinda has its limits and it's way too easy to circumvent (as you found out)

Maybe it's worth considering eBPF (if that hasn't happened already)

And thanks guys for your work. I'm a proud supporter and I'll continue to support your work. Companies shouldn't control the Open Source space

This entry was edited (3 weeks ago)
in reply to Codeberg

It's going to be a rat race after all; I expected this to happen eventually. Surprising it took this long.
in reply to Codeberg

Anubis is extremely easy to bypass, you just have to change the User-Agent to not contain Mozilla, please get proper bot protection.

ulveon.net/p/2025-08-09-vangua…
This post talks briefly about other alternatives. Try Berghain, Balooproxy, or go-away.

in reply to ulveon.net

@ulveon This depends on the configuration, and it was not the problem we have been running into today. ~f
in reply to Codeberg

Perhaps it's time to stop letting robots solve puzzles and instead feed them bombs. Do we know how well a ZIP bomb works on these crawlers?
in reply to Codeberg

Have you looked into serving these LLM crawlers alternative versions of the site, with poisoned data? (And rate-limiting, of course.) I know it would be additional work for you to implement this, but... it might be effective.

I'm thinking you could have a precomputed set of 1000 different poison repos that get served up randomly, each of which is a Markov-chain-scrambled version of the files in a real repo.

(I wrote codeberg.org/timmc/marko to do something similar to the contents of my blog postsβ€”a Markov model on either characters or words.)
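
(A toy word-level Markov scrambler in Python, in the spirit of what's described above; the linked marko tool is the real implementation, this is only an illustration:)

import random
from collections import defaultdict

def build_model(text, order=1):
    # Map each `order`-word prefix to the words observed after it.
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

def scramble(model, length=50):
    # Random-walk the model to produce plausible-looking but garbled text.
    key = random.choice(list(model.keys()))
    out = list(key)
    for _ in range(length):
        followers = model.get(tuple(out[-len(key):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# e.g. scramble(build_model(open("some_real_file.txt").read()))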

This entry was edited (3 weeks ago)
in reply to Codeberg

😲🀬 re: what's happened to @Codeberg today.
The AI ballyhoo *is* a real DDoS against one of the few code hosting sites that takes a stand against slurping #FOSS code into LLM training sets β€” in violation of #copyleft.

Deregulation/lack-of-regulation will bring more of this. βˆƒ plenty of blame to go around, but #Microsoft & #GitHub deserve the bulk of it; they trailblazed the idea that FOSS code-hosting sites are lucrative targets.

giveupgithub.org

#GiveUpGitHub #FreeSoftware #OpenSource

This entry was edited (3 weeks ago)
in reply to Bradley Kuhn

@bkuhn if anyone needs it, there is this gist showing how to pseudo-automate bulk repository deletion.
gist.github.com/mrkpatchaa/637…

and this tool: reporemover.xyz (very handy)

This entry was edited (3 weeks ago)
in reply to serk

IMO, @serk, the better move is not to delete the repository, but to do something like I've done here with my personal β€œsmall hacks” repository:

github.com/bkuhn/small-hacks

I'm going to try to make a short video of how to do this, step by step. The main thing is that rather than 404'ing, the repository now spreads the message that we should #GiveUpGitHub!

in reply to Bradley Kuhn

@bkuhn @serk When @librecast moved our repos I wrote a script to wipe the GitHub repo and replace it with the #GiveUpGitHub README:

codeberg.org/librecast/giveupg…

Unknown parent

mastodon - Link to original
Codeberg

@gturri Anubis sends a challenge. The browser needs to compute the answer with "heavy" work. The server then only has "light" work to verify the answer.

As far as we can tell, the crawlers actually do the computation and send the correct response. ~f
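
(For readers wondering what that looks like in practice, here's a minimal proof-of-work sketch in Python; it illustrates the heavy-client/light-server asymmetry, not Anubis' actual scheme:)

import hashlib
import itertools
import os

def meets_difficulty(digest: bytes, bits: int) -> bool:
    # True if the top `bits` bits of the digest are all zero.
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits) == 0

def solve(challenge: bytes, bits: int) -> int:
    # Heavy: the client tries nonces until one hashes below the target.
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if meets_difficulty(digest, bits):
            return nonce

def verify(challenge: bytes, nonce: int, bits: int) -> bool:
    # Light: the server recomputes a single hash.
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return meets_difficulty(digest, bits)

challenge = os.urandom(16)
nonce = solve(challenge, bits=16)
assert verify(challenge, nonce, bits=16)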

in reply to Codeberg

could just set up a few traps that crash the AI crawlers or something. This is going to get really annoying, and hopefully these bastards don't interfere with some of my work in the long run with what they've been doing on the internet. Scraping is already largely frowned upon, so these pos are just making it worse.
in reply to Codeberg

what if the new captcha was get a bug fix PR merged? That'd keep them robits out.
in reply to Codeberg

Thank You For Your Service. ( I moved to Codeberg, like, yesterday, and signed up a recurring donation )
in reply to Codeberg

Are you guys using traffic shaping and queue management at all? For example, putting something like a QFQ qdisc on your routers and then marking packets from spammy sources as low priority and putting them into a low-priority queue can be a huge boost in responsiveness for your real customers.
Spammy sources could be those that open new connections too often, transfer too many bytes, or have too many active connections open. All of those things can be accounted for in nftables.
in reply to Codeberg

I've been moving my stuff to Codeberg. Glad to see you have a presence on Mastodon! Thanks for being there.
in reply to Codeberg

Really need to sue them for a denial-of-service attack and get them banned from touching a computer for 20 years.