1/ I have a problem, which is: My websites (a #Wordpress site and a #MediaWiki installation) are slow as hell.
So I need to identify the cause. The problem is that I don't know nearly as much about website administration as I ought to.
I contacted the support people at my website provider, who looked at my (Apache) logs and suggested that my Wordpress site might suffer from a "pingback xmlrpc attack". I did the proposed remedy, which made things a little better. But I don't know enough about reading website logs to identify such problems myself, which I ought to.
So what I am trying to say is: Is there some kind of beginner's guide for reading website logs, identifying malicious traffic, and knowing what to do about it?
Max - Poliverso 🇪🇺🇮🇹
in reply to Jürgen Hubert • — (Firenze) •@Jürgen Hubert
If you haven't already done it you could ask WordPress and MediaWiki communities.
This being said, no... I don't think there are because network security is really a huge topic.
If I were you, I'd start by trying to understand every line in the log by copying and pasting on a search engine and reading posts talking about that.
Studying would make your systems more robust but it'll take time.
Jürgen Hubert
in reply to Jürgen Hubert • • •2/ Okay, I think I might already have some ideas.
My latest #Apache log has 26,694 lines.
In these 26,694 lines, I have:
- 10,724 access requests from "developers.facebook.com/docs/s…"
- 4,562 access requests from "developer.amazon.com/support/a…"
- 3,316 access requests from "openai.com/gptbot"
So yeah, I suspect these are the #LLM crawling bots from #Facebook, #Amazon, and #OpenAI, which jointly make up more than half the traffic - and they are hogging the more resource-intensive functions, like "Recent Changes" on my wiki.
Fuck those fuckers for causing outages on my websites.
Any suggestions on how to block them would be welcome (no snark, please - I _am_ new at this).
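For anyone wanting to reproduce counts like these, here is a minimal sketch assuming a combined-format Apache log. The patterns (GPTBot, Amazonbot, meta-externalagent) are the crawlers' documented user-agent names, and sample.log stands in for a real access log:

```shell
# Build a tiny stand-in for an Apache combined-format access log.
cat > sample.log <<'EOF'
1.2.3.4 - - [20/Oct/2025:13:00:00 +0000] "GET /wiki/Special:RecentChanges HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
1.2.3.4 - - [20/Oct/2025:13:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)"
1.2.3.4 - - [20/Oct/2025:13:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
EOF
# Split each line on double quotes: field 6 is the User-Agent string.
# Count hits per crawler by matching known bot names in that field.
awk -F'"' '{ ua = $6 }
  ua ~ /facebookexternalhit|meta-externalagent/ { fb++ }
  ua ~ /Amazonbot/ { am++ }
  ua ~ /GPTBot/    { gp++ }
  END { printf "facebook=%d amazon=%d openai=%d\n", fb, am, gp }' sample.log
# prints: facebook=1 amazon=1 openai=1
```

The URLs in the counts above most likely come from the User-Agent strings themselves, which typically embed a link to the crawler's documentation page, so counting by user agent is usually the reliable way to attribute this traffic.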
About AmazonBot
Developer Portal
Master Simon Zerafa
in reply to Jürgen Hubert • • •I suspect the blocks will need to be at the IP address or ASN level.
Using robots.txt will be futile, as the vast majority of LLM crawlers ignore it 😞
samir, a terminally deprecated method
in reply to Jürgen Hubert • • •This might be a good place to start:
github.com/ai-robots-txt/ai.ro…
I am not an expert, but I am happy to try and answer any questions you might have.
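For a sense of what that repo provides: robots.txt rules along these lines (a minimal hand-written sketch with three example bot names; it only deters the minority of crawlers that actually honor robots.txt):

```
User-agent: GPTBot
User-agent: Amazonbot
User-agent: meta-externalagent
Disallow: /
```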
GitHub - ai-robots-txt/ai.robots.txt: A list of AI agents and robots to block.
GitHub
Jürgen Hubert
in reply to samir, a terminally deprecated method • • •@samir
Thanks! I will fiddle around with those and see if anything works.
Hermetic Library
in reply to Jürgen Hubert • • •
Ratsnake Games 🔞
in reply to Jürgen Hubert • • •GitHub - TecharoHQ/anubis: Weighs the soul of incoming HTTP requests to stop AI crawlers
GitHub
Ray McCarthy
in reply to Jürgen Hubert • • •Nuke those corporations from orbit?
Pray for the AI bubble to burst?
Real solutions seem either complicated (tarpit-honeypots) or expensive (and hurt humans, such as Cloudflare).
Max - Poliverso 🇪🇺🇮🇹
in reply to Jürgen Hubert • — (Firenze) •There's a how-to about access control on the Apache website. I remember a "Require" directive that could help block access from a given hostname.
You could start reading there.
Block that hostname and see what happens. You might find it a useful tool for preventing unwanted access.
Femme Malheureuse
in reply to Jürgen Hubert • • •
Jürgen Hubert
in reply to Femme Malheureuse • • •@femme_mal I took a closer look, and I am _definitely_ scraped by AI harvesting bots.
mementomori.social/@juergen_hu…
Jürgen Hubert
2025-10-20 13:23:57
Femme Malheureuse
in reply to Jürgen Hubert • • •Looks like you received good feedback about that. AI scraping is the one regularly annoying resource suck I've experienced over the past 3 years. Irritates me to have to explain to readers that there's a limit to what can be done while these bots are stealing content and bandwidth.
Good luck!
thvv
in reply to Jürgen Hubert • • •some suggestions: try to measure what is
slow. in Chrome, you can go to View > Developer > Tools and run a Lighthouse report.
look at Performance: my sites come up in 0.3 seconds.
run validator.w3.org to see if you have errors in your HTML.
try https://multicians.org/thvv/histsite.html for some personal views on website construction.
The W3C Markup Validation Service
validator.w3.org
Jürgen Hubert
Unknown parent • • •
Jason Lefkowitz
Unknown parent • • •There's probably more to be said, but I've gone on for far too long already 😆
Hope this was at least helpful. If you want to talk further, feel free to @ me either here or in DMs. Can't promise I can solve your problem, but I'm happy to help however I can.
~ fin ~
(5/5)
Jason Lefkowitz
Unknown parent • • •On Apache, for a quick fix, if your host gives you the ability to use .htaccess files (which modify Apache's configuration), you could put lines in each site's .htaccess like
Require not host <host.example.com>
The downside is that it's on you to keep up with the domains the crawlers are coming from, and they change. A WAF lets you just say "throttle anyone who shows up too much."
You'd also have to maintain the list in two places: the .htaccess for WordPress and the one for MediaWiki.
.htaccess syntax is also finicky. If you don't know what you're doing, I wouldn't mess with it.
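For illustration, a fuller sketch of such an .htaccess fragment, assuming Apache 2.4. Note that a negated Require only works inside a <RequireAll> block, and the user-agent patterns here are examples, not a complete list:

```apache
<IfModule mod_authz_core.c>
  <RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except requests whose User-Agent matches known AI crawlers.
    Require not expr "%{HTTP_USER_AGENT} =~ /GPTBot|Amazonbot|meta-externalagent/"
  </RequireAll>
</IfModule>
```

Matching the User-Agent instead of the hostname avoids per-request reverse DNS lookups, though a crawler that lies about its User-Agent will slip through.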
httpd.apache.org/docs/2.4/howt…
(4/?)
Access Control - Apache HTTP Server Version 2.4
httpd.apache.org
Jason Lefkowitz
Unknown parent • • •If you want to block traffic to multiple applications on a single machine, you'll either need the ability to modify your web server software's configuration (which many hosts don't allow), or software called a "web application firewall" (WAF).
A WAF sits between the public web and your applications, filtering and throttling traffic before it reaches them. It gives you one central way to block or rate-limit entire domains.
Many hosting companies integrate with Cloudflare, which offers a basic, free WAF as a service. So that might be something to talk to your host about.
en.wikipedia.org/wiki/Web_appl…
(3/?)
HTTP specific network security system
Contributors to Wikimedia projects (Wikimedia Foundation, Inc.)
Jason Lefkowitz
Unknown parent • • •If you can download your access logs from your hosting provider, GoAccess (goaccess.io/) is a handy free tool for analyzing them quickly. It can put together simple charts that show you who's hitting your site, when, and from where. These can be useful for identifying spikes in traffic from different sources, which you can then block.
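A typical invocation, assuming GoAccess is installed and you've downloaded the log (the file names are placeholders; COMBINED is GoAccess's identifier for Apache's combined log format):

```shell
# Parse an Apache combined-format log and write a standalone HTML report
goaccess access.log --log-format=COMBINED -o report.html
```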
(2/?)
GoAccess - Visual Web Log Analyzer
goaccess.io