1/ I have a problem, which is: My websites (a #Wordpress site and a #MediaWiki installation) are slow as hell.
So I need to identify the cause. The problem is that I don't know nearly as much about website administration as I ought to.
I contacted the support people at my website provider, who looked at my (Apache) logs and suggested that my Wordpress site might suffer from a "pingback xmlrpc attack". I did the proposed remedy, which made things a little better. But I don't know enough about reading website logs to identify such problems myself, which I ought to.
So what I am trying to say is: Is there some kind of beginner's guide for reading website logs, identifying malicious traffic, and knowing what to do about it?
Max - Poliverso 🇪🇺🇮🇹
in reply to Jürgen Hubert • — (Firenze) •@Jürgen Hubert
If you haven't already done it you could ask WordPress and MediaWiki communities.
This being said, no... I don't think there are because network security is really a huge topic.
If I were you, I'd start by trying to understand every line in the log by copying and pasting on a search engine and reading posts talking about that.
Studying would make your systems more robust but it'll take time.
Jürgen Hubert
in reply to Jürgen Hubert • • •2/ Okay, I think I might already have some ideas.
My latest #Apache log has 26,694 lines.
In these 26,694 lines, I have:
- 10,724 access requests from "developers.facebook.com/docs/s…"
- 4,562 access requests from "developer.amazon.com/support/a…"
- 3,316 access requests from "openai.com/gptbot"
So yeah, I suspect these are the #LLM crawling bots from #Facebook, #Amazon, and #OpenAI, which jointly make up more than half the traffic - and they are hogging the more resource-intensive functions, like "Recent Changes" on my wiki.
Fuck those fuckers for causing outages on my websites.
Any suggestions on how to block them would be welcome (no snark, please - I _am_ new at this).
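For anyone wanting to reproduce counts like these, here is a minimal sketch assuming a combined-format Apache log. The patterns (GPTBot, Amazonbot, meta-externalagent) are the crawlers' documented user-agent names, and sample.log stands in for a real access log:

```shell
# Build a tiny stand-in for an Apache combined-format access log.
cat > sample.log <<'EOF'
1.2.3.4 - - [20/Oct/2025:13:00:00 +0000] "GET /wiki/Special:RecentChanges HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
1.2.3.4 - - [20/Oct/2025:13:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)"
1.2.3.4 - - [20/Oct/2025:13:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
EOF
# Split each line on double quotes: field 6 is the User-Agent string.
# Count hits per crawler by matching known bot names in that field.
awk -F'"' '{ ua = $6 }
  ua ~ /facebookexternalhit|meta-externalagent/ { fb++ }
  ua ~ /Amazonbot/ { am++ }
  ua ~ /GPTBot/    { gp++ }
  END { printf "facebook=%d amazon=%d openai=%d\n", fb, am, gp }' sample.log
# prints: facebook=1 amazon=1 openai=1
```

The URLs in the counts above most likely come from the User-Agent strings themselves, which typically embed a link to the crawler's documentation page, so counting by user agent is usually the reliable way to attribute this traffic.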
About AmazonBot
Developer Portal
Master Simon Zerafa
in reply to Jürgen Hubert • • •I suspect the blocks will need to be at the IP address or ASN level.
Using robots.txt will be futile, as the vast majority of LLM crawlers ignore it 😞
samir, a terminally deprecated method
in reply to Jürgen Hubert • • •This might be a good place to start:
github.com/ai-robots-txt/ai.ro…
I am not an expert, but I am happy to try and answer any questions you might have.
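For a sense of what that repo provides: robots.txt rules along these lines (a minimal hand-written sketch with three example bot names; it only deters the minority of crawlers that actually honor robots.txt):

```
User-agent: GPTBot
User-agent: Amazonbot
User-agent: meta-externalagent
Disallow: /
```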
GitHub - ai-robots-txt/ai.robots.txt: A list of AI agents and robots to block.
GitHub
Jürgen Hubert
in reply to samir, a terminally deprecated method • • •@samir
Thanks! I will fiddle around with those and see if anything works.
Hermetic Library
in reply to Jürgen Hubert • • •
Ratsnake Games 🔞
in reply to Jürgen Hubert • • •GitHub - TecharoHQ/anubis: Weighs the soul of incoming HTTP requests to stop AI crawlers
GitHub
Ray McCarthy
in reply to Jürgen Hubert • • •Nuke those corporations from orbit?
Pray for the AI bubble to burst?
Real solutions seem either complicated (tarpit-honeypots) or expensive (and hurt humans, such as Cloudflare).
Max - Poliverso 🇪🇺🇮🇹
in reply to Jürgen Hubert • — (Firenze) •There's a how-to about access control on the Apache website. I remember a "Require" directive that could help block access from a given hostname.
You could start reading there.
Block that hostname and see what happens. You might find it a useful tool for preventing unwanted access.
Femme Malheureuse
in reply to Jürgen Hubert • • •
Jürgen Hubert
in reply to Femme Malheureuse • • •@femme_mal I took a closer look, and I am _definitely_ scraped by AI harvesting bots.
mementomori.social/@juergen_hu…
Jürgen Hubert
2025-10-20 13:23:57
Femme Malheureuse
in reply to Jürgen Hubert • • •Looks like you received good feedback about that. AI scraping is the one regularly annoying resource suck I've experienced over the past 3 years. Irritates me to have to explain to readers that there's a limit to what can be done while these bots are stealing content and bandwidth.
Good luck!
thvv
in reply to Jürgen Hubert • • •some suggestions: try to measure what is
slow. in Chrome, you can go to View > Developer > Tools and run a Lighthouse report.
look at Performance: my sites come up in 0.3 seconds.
run validator.w3.org to see if you have errors in your HTML.
try https://multicians.org/thvv/histsite.html for some personal views on website construction.
The W3C Markup Validation Service
validator.w3.org
Jürgen Hubert
Unknown parent • • •
Jason Lefkowitz
Unknown parent • • •There's probably more to be said, but I've gone on for far too long already 😆
Hope this was at least helpful. If you want to talk further, feel free to @ me either here or in DMs. Can't promise I can solve your problem, but I'm happy to help however I can.
~ fin ~
(5/5)
Jason Lefkowitz
Unknown parent • • •On Apache, for a quick fix, if your host gives you the ability to use .htaccess files (which modify Apache's configuration), you could put lines in each site's .htaccess like
Require not host <host.example.com>
The downside is that it's on you to keep up with the domains the crawlers are coming from, and they change. A WAF lets you just say "throttle anyone who shows up too much."
You'd also have to maintain the list in two places: the .htaccess for WordPress and the one for MediaWiki.
.htaccess syntax is also finicky. If you don't know what you're doing, I wouldn't mess with it.
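For illustration, a fuller sketch of such an .htaccess fragment, assuming Apache 2.4. Note that a negated Require only works inside a <RequireAll> block, and the user-agent patterns here are examples, not a complete list:

```apache
<IfModule mod_authz_core.c>
  <RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except requests whose User-Agent matches known AI crawlers.
    Require not expr "%{HTTP_USER_AGENT} =~ /GPTBot|Amazonbot|meta-externalagent/"
  </RequireAll>
</IfModule>
```

Matching the User-Agent instead of the hostname avoids per-request reverse DNS lookups, though a crawler that lies about its User-Agent will slip through.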
httpd.apache.org/docs/2.4/howt…
(4/?)
Access Control - Apache HTTP Server Version 2.4
httpd.apache.org
Jason Lefkowitz
Unknown parent • • •If you want to block traffic to multiple applications on a single machine, you'll either need the ability to modify your web server software's configuration (which many hosts don't allow), or software called a "web application firewall" (WAF).
A WAF sits between the public web and your applications, filtering and throttling traffic before it reaches them. It gives you one central way to block or rate-limit entire domains.
Many hosting companies integrate with Cloudflare, which offers a basic, free WAF as a service. So that might be something to talk to your host about.
en.wikipedia.org/wiki/Web_appl…
(3/?)
HTTP specific network security system
Contributors to Wikimedia projects (Wikimedia Foundation, Inc.)
Jason Lefkowitz
Unknown parent • • •If you can download your access logs from your hosting provider, GoAccess (goaccess.io/) is a handy free tool for analyzing them quickly. It can put together simple charts that show you who's hitting your site, when, and from where. These can be useful for identifying spikes in traffic from different sources, which you can then block.
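A typical invocation, assuming GoAccess is installed and you've downloaded the log (the file names are placeholders; COMBINED is GoAccess's identifier for Apache's combined log format):

```shell
# Parse an Apache combined-format log and write a standalone HTML report
goaccess access.log --log-format=COMBINED -o report.html
```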
(2/?)
GoAccess - Visual Web Log Analyzer
goaccess.io