Salta al contenuto principale


Taking a stab at making a privacy-preserving analytics thing for my blog.

I'm hashing the user's IP and User-Agent, so they're never stored directly. Not 100% accurate, but good enough for me to work out whether a page-view is unique or not.

Country is very rough - using GeoIP.

Referer is either from HTTP or if the linking site uses UTM.

Seems *broadly* accurate compared to other plugins.

I'm not sure of the performance if this hits a million rows. Might need further DB tables.

reshared this

in reply to Terence Eden

in case you have any interest in existing privacy-friendly analytics solutions, that’s Plausible Analytics’ goal @plausible

plausible.io/about

in reply to Terence Eden

if performance is your concern, DuckDB is incredibly fast. I had Claude write me an AWS Cloudfront to Parquet 1 converter with IPinfo lite MMDB lookups, which you can then import into a DuckDB. It takes 41 milliseconds to find the top 10 countries over 8.8M rows, about 5 years' worth of logs:

```
zulfiqar ~/lognami>duckdb blog.duckdb
DuckDB v1.3.2 (Ossivalis) 15004e53
Enter ".help" for usage hints.
D .timer on
D select count(*) from web;
┌────────────────┐
│ count_star() │
│ int64 │
├────────────────┤
│ 8820061 │
│ (8.82 million) │
└────────────────┘
Run Time (s): real 0.008 user 0.006334 sys 0.000867
D select country, count(*) from web group by 1 order by 2 desc limit 10;
┌─────────┬──────────────┐
│ country │ count_star() │
│ varchar │ int64 │
├─────────┼──────────────┤
│ US │ 3942963 │
│ FR │ 727561 │
│ DE │ 570468 │
│ CN │ 456912 │
│ NL │ 418870 │
│ SG │ 408658 │
│ GB │ 279144 │
│ CA │ 218756 │
│ IN │ 175326 │
│ RU │ 157289 │
├─────────┴──────────────┤
│ 10 rows 2 columns │
└────────────────────────┘
Run Time (s): real 0.041 user 0.181591 sys 0.648902
```

Questa voce è stata modificata (2 settimane fa)
in reply to Terence Eden

SELECT country, COUNT(*) AS count
FROM wp_edent_stats
WHERE DATE(time) = '2025-09-05'
GROUP BY country
ORDER BY count DESC;

Now, what's the SQL query to turn these into emoji flags?

in reply to Terence Eden

create a table with the corresponding country codes and emojis and JOIN the two tables on country = country ..
in reply to Christopher Isene

@cisene Not necessarily.

The emoji flags are made up of two-character codes.

So the flag 🇬🇧 is literally 🇬 🇧

What I'm wondering is if I can turn a returned GB into those symbols.

in reply to Terence Eden

.. country, CONCAT(':', LOWER(country), ':') AS flag, COUNT(*) .. sorta?
Questa voce è stata modificata (2 settimane fa)
in reply to Terence Eden

@cisene I'm going to be cranky with the Unicode folk if adding an offset to the characters of a two letter country code doesn't yield the characters to make up the flag 😂

Bring back the old ways, convert numbers to characters by adding 48 to them.

in reply to Emily_S

Just add 1F1A5 to each letter and you'll get the Country Code letter.
Questa voce è stata modificata (2 settimane fa)
in reply to Christopher Isene

@cisene @emily_s
That's not how it works.
The flags are the ISO country codes.
If a new country of KK were created, the flag would be 🇰🇰 which is 🇰 🇰
in reply to Terence Eden

@cisene well that works surprisingly well 😂 the Unicode folk are off the hook for now
in reply to Terence Eden

SET NAMES utf8mb4;<br><br>-- helper: safe code point to utf8mb4 string using JSON unescape<br>DELIMITER //<br><br>DROP FUNCTION IF EXISTS unichar//<br>CREATE FUNCTION unichar(cp INT)<br>RETURNS VARCHAR(4) CHARSET utf8mb4<br>DETERMINISTIC<br>BEGIN<br>  DECLARE v INT;<br>  DECLARE hs INT;<br>  DECLARE ls INT;<br>  IF cp < 0x10000 THEN<br>    RETURN JSON_UNQUOTE(CONCAT('"\\u', LPAD(HEX(cp), 4, '0'), '"'));<br>  END IF;<br>  SET v = cp - 0x10000;<br>  SET hs = 0xD800 + FLOOR(v / 0x400);<br>  SET ls = 0xDC00 + (v % 0x400);<br>  RETURN JSON_UNQUOTE(<br>           CONCAT('"\\u', LPAD(HEX(hs), 4, '0'),<br>                  '\\u', LPAD(HEX(ls), 4, '0'), '"'));<br>END//<br>DELIMITER ;<br><br>-- idempotent create for the flag function<br>DELIMITER //<br>DROP FUNCTION IF EXISTS flag_emoji//<br>CREATE FUNCTION flag_emoji(cc VARCHAR(2))<br>RETURNS VARCHAR(8) CHARSET utf8mb4<br>DETERMINISTIC<br>BEGIN<br>  DECLARE base INT DEFAULT 127462; -- U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A<br>  IF cc IS NULL OR cc NOT REGEXP '^[A-Za-z]{2}$' THEN<br>    RETURN NULL;<br>  END IF;<br><br>  RETURN CONCAT(<br>    unichar(base + ASCII(UPPER(SUBSTRING(cc, 1, 1))) - ASCII('A')),<br>    unichar(base + ASCII(UPPER(SUBSTRING(cc, 2, 1))) - ASCII('A'))<br>  );<br>END//<br>DELIMITER ;<br><br>-- examples<br>SELECT flag_emoji('GB') AS gb, flag_emoji('us') AS us, flag_emoji('de') AS de;<br>
mysql> SELECT flag_emoji('GB') AS gb, flag_emoji('us') AS us, flag_emoji('de') AS de;<br>+----------+----------+----------+<br>| gb       | us       | de       |<br>+----------+----------+----------+<br>| 🇬🇧         | 🇺🇸         | 🇩🇪         |<br>+----------+----------+----------+<br>1 row in set (0.00 sec)<br>

I apologize for the JSON abuse, but that was easiest.
in reply to Terence Eden

So, here's some zeitgeist to put together:

Unicode flags are just the combination of the regional indicator symbols compart.com/en/unicode/U+1F1E6, so 🇫+ 🇷 gives 🇫🇷

And framapiaf.org/@sylvestre/11513…

So maybe you could... `| cargo run tr A-Z 🇦-🇿 `


@zacchiro @CoolSWEng echo "💐💩🌷" | cut -d"💩" -f1

with the rust coreutils
💐
with GNU:
cut: the delimiter must be a single character


in reply to Terence Eden

Getting there! I can now (inefficiently) see how many page views and (roughly) how many unique users have visited my blog.

I can also see where people are coming from on a specific day.

(Remember, this is just for me to noodle about with. This isn't going to be an amazingly accurate, professional-grade plugin.)

in reply to Terence Eden

These are the countries which send the most visitors to my blog.

I'm using local GeoIP database - so I don't have to send anyone's IP address to a 3rd party. Also keeps costs at zero!

Terence Eden reshared this.

in reply to Terence Eden

I want to draw a graph of visitors.
I don't want to call a 3rd party graphing service.
I don't want to use a big library from someone else.
I am not going to write my own graphics generator.

So I think I'm going to build a Unicode sparkline generator for PHP.

▁▂▃▄▅▆▇█
⅛¼⅜½⅝¾⅞1

in reply to Terence Eden

An approach that I use that you may hate is in the accordians in this page, using HTML5 meter elements:

earth.org.uk/energy-series-dat…

in reply to DamonHD

@DamonHD I do not hate that! It is charming.
I'm rather more used to vertical lines.
Hmmm. I wonder if they can be rotated?
in reply to Terence Eden

Let me know how you get on. My page is getting a bit bloated, and needs a revamp. What I did not want to to was generate and include loads of external images for each (daily) rebuild. A possible solution for me is in-line SVG which my HTML minifier could do a good job on in passing...
in reply to Terence Eden

There's already a command-line one of those I found a few years ago:

github.com/deeplook/sparklines

in reply to Terence Eden

As pointed out by @DamonHD - the HTML <meter> element also works well!
Bit of CSS to rotate it, table for layout, job's a goodun!
Questa voce è stata modificata (1 settimana fa)
in reply to Terence Eden

TABLE FOR LAYOUT!!!!!

I disown you. I would never* do such a thing.

*Except when I do.

in reply to DamonHD

@DamonHD I'm with you, buddy.
But, to be fair, it *is* tabular data - so I think I can get away with it 😆
in reply to Terence Eden

Might have gone a *little* overboard! I rather like this though.
Self-hosted analytics showing a graph of the last 7 days.
Today's top referers, pages, and countries.

Super simple and suitable for my needs.

Next up - import historic data and work out if I want to keep countries & referers in it.

#SelfHosting #WordPress #SmolWeb

in reply to Terence Eden

funny. My bubble means I’m shocked the ranking isn’t mastodon and then some good RSS readers 😀 “real” analytics ftw
in reply to Terence Eden

I'm basically farting around with the design because I don't want to do the hard work of periodically consolidating the database.

I *think* for now I just need a separate table of:
Date, Views, Users.

I can increment the Views every time the page is visited using something like `ON DUPLICATE KEY UPDATE visits = visits + 1;`

But deduplicating the users will be harder. I don't want to run a daily cron job for this.

in reply to Terence Eden

Interestingly, I've started seeing hits come in from `noai.duckduckgo.com`

Small numbers for now, but wonder if they'll grow? I'll probably consolidate it to the 2nd level for my stats. Not sure there's much useful I can do with the information.

in reply to Terence Eden

information and granularity overload can make for a challenging decision making environment. At my time in the Civil Service our web teams wanted *so many* metrics, but ultimately only listened to the user researchers. Can’t put detailed 30 minute chats into a PowerBI dashboard though.

I suppose you could write anti-AI content to juice those numbers.

in reply to Terence Eden

I think I've cracked the problem - MySQL makes my brain hurt.

Table 1: Logs *everything*. Date, post ID, user, etc.

Table 2: Is date, view count, user count.
When (1) is updated, an AFTER INSERT trigger increments the view count for that day.
It also counts the number of unique users for that day and updates (2).

Table 3 is the same as (2) but on a per-post basis. It also updates after insert.

That way, I don't have to use logic in my PHP app - all it does is fire an insert.

in reply to Terence Eden

if you are updating a "user count" in a trigger, then every single insert is going to be in contention of a lock on that row, and you will get massive concurrency problems that bottleneck your throughput.

My advice if there is even the slightest chance this will scale beyond one blog would be to do the write without any triggers, and the calculate the total per day when you need it, or every 5 minutes, or something.

in reply to Ben Curthoys

@bencurthoys This is only for my blog. While it does occasionally get popular, it's usually less than 2,000 visits per day.

Is there an equivalent of "cron" for MySQL / MariaDB? I'd rather not rely on WordPress's cron because I've found it a bit fickle.

in reply to Terence Eden

yeah but you know you're going to open source it and 5 minutes later someone will be using it for something crazy.

Anyway my experience is not MySql specific, but in any RDBMS it will be way more sensible to

SELECT COUNT(DISTINCT User ID) FROM views WHERE timestamp BETWEEN startofday AND endofday

than to try to maintain totals - as long as you index it properly.

Questa voce è stata modificata (1 settimana fa)
in reply to Ben Curthoys

e.g. in an OrderItemLog table with 680767 rows, it takes 0ms - that's zero - to count how many are on a given day. And 0ms to count how many distinct orders there are.
in reply to Terence Eden

be careful with 'UPDATE count = count + 1" patterns. They're unlikely to accurately count as you expect due to concurrency issues. Especially in MVCC systems, it will depend alot on transaction isolation levels.

For this kind of thing, just logging the visits as records and aggregating on analysis / periodically is often the better approach. Removes a lot of the concurrent write headache and give you more flexibility.

in reply to Chris Ellis

@intrbiz thanks. Is there a sensible way to schedule those sorts of reconciliations?
in reply to Chris Ellis

also in general window functions are great for aggregating these kind of stats.
in reply to Terence Eden

I almost barely understand this. 20 years ago I was just starting to learn SQL.. my brain hurt.
in reply to Terence Eden

On the advice of many people here, I'm looking at using EVENT rather than TRIGGER.

Looks like I need to raise a ticket with my host to see if this is possible. I'd rather not use a cron job.

in reply to Terence Eden

cron is fine. Do you need absolutely accurate real time stats? Or is being (cron interval) behind OK?
in reply to Terence Eden

Arse. My host doesn't support EVENTs on my cheap plan.
Fair enough. Looks like I'll have to use cron.
Wonder whether it's better to have a stored routine, or just keep the logic in the PHP?

My heart says use the stored routine - but it's probably more flexible and fixable to have the SQL in the PHP.

in reply to Terence Eden

Friends don't let friends use Stored Procedures. That way darkness lies... (as in weird stuff happens and you've forgotten about the SP and version control is a PITA)
in reply to Terence Eden

My self-written stats plugin for WordPress seemed to cope with a small burst of traffic from HackerNews.

Blog post later today. Visit shkspr.mobi/blog/ to subscribe by RSS or Email.
Also available on the fediverse as @blog

Questa voce è stata modificata (1 settimana fa)
in reply to Terence Eden

And I can now see exactly which websites send traffic to which post!

Hello all you lovely Fediverse friends 🙂

Blog post at shkspr.mobi/blog/2025/09/reaso…

in reply to Terence Eden

Tidied up the control panel.

I can now click on a single post and see how many times it has been viewed, how many unique users visited it, which countries they came from, and what referers sent traffic to that page.

So far, it is *broadly* accurate compared to my 3rd party plugin. It over-counts users and under-counts page views. But that's fine.

Next steps: draw some pretty graphs. Ideally with weekly / monthly / yearly groupings.

in reply to Terence Eden

Who needs a fancy graphing library?

SVG polylines are easy enough to generate with code!

Needs a bit of tidying up and looks rubbish when more than 100 points. So will need to group by week, month, and year.

Pretty happy with that so far.

Terence Eden reshared this.

in reply to Zoe

@z303 a daily email of my top performing posts!
Brilliant idea 😉
@Zoe
in reply to Terence Eden

I can now group views by year, month, week.

Writing a basic graphing library is fun!

in reply to Terence Eden

Web design is my passion!

Ok, first make it work *then* make it look good 😁

Terence Eden reshared this.

in reply to Terence Eden

Make it work, make it work right, make it work better...
Questa voce è stata modificata (6 giorni fa)
in reply to Terence Eden

I stopped wearing that smartwatch because it was so clunky compared to a Casio F-91W on a nylon strap

if it was open-source/jailbroken and I could make it, y'know, show the data I want to see, however...

in reply to jack

@jackeric it works perfectly with the Open Source app gadgetbridge.org/

I can extract all my data - sleep, heart rate, steps, etc.

I also swapped the strap to something slightly nicer.

@jack
in reply to Terence Eden

yeah, but that's for extracting data, right? I want to control what's shown on the watch face
in reply to jack

@jackeric oh, I see.
I've emailed both the manufacturer and OEM to see if I can get the SDK. No luck yet.
You can upload your own photo to one of the faces. Not exactly LCARS, but a start 🙂
@jack
in reply to Terence Eden

I think initially I wanted to be able to store other types of visual codes than QR - even if 99% of readers expecting Code 128 will take a QR version in stride
in reply to Terence Eden

tip: use a class-less stylesheet that just makes things look nice without you having to add any classes to the markup. For example picocss

picocss.com/docs/classless

in reply to Terence Eden

Wooooo! My self-written stats plugin is coping with a HN influx.
in reply to Terence Eden

My CSS breaks but only when I look at my all-time stats.

Which is a nice problem to have, I guess.

I don't want to wrap the views, nor do I want to increase the column width. I guess font scaling?

in reply to Terence Eden

You could truncate the domain to a specific number of characters too, with the full domain, as you currently have it, available when hovering over as title text.
in reply to Terence Eden

On the suggestion of @mattround I've combined the dates into one box.

Testing shows it should be good until I breach 100 million views. Which will be a problem for 2078's me!

in reply to Terence Eden

Nice to see the Lemmyverse (or whatever we're calling it) start to send traffic to my blog.

As every with federated social media, the individual number of referers is quite low. But together, when added up, give it a bit more heft.

Terence Eden reshared this.

in reply to Terence Eden

Typically we're calling it "threadiverse", for the universe of threaded discussions and it includes non-lemmy software like piefed and mbin
Questa voce è stata modificata (3 giorni fa)
in reply to db0

@db0 all part of the ActivityPub Cinematic Universe!
@db0
in reply to Timo Tijhof

@krinkle well spotted!

Blog post coming later, but you can see some details at:

mastodon.social/@Edent/1151688…


As pointed out by @DamonHD - the HTML <meter> element also works well!
Bit of CSS to rotate it, table for layout, job's a goodun!
in reply to Terence Eden

have you tried adding indexes to query the source data directly rather than (effectively) materialising views via triggers?
in reply to Terence Eden

basically try writing a query which queries table 1, then try following advice from use-the-index-luke.com to see if it can become fast enough
in reply to Glenjamin

@glenjamin Nice, thanks. I'll take a look.

TBH, I'm not *overly* worried about performance at the moment.
In the last week(ish) I've only generated 16,846 rows.
I expect roughly 500k per year.

in reply to Terence Eden

I've used charts.css in a recent backoffice project, where you create a table using html templating language and then it's styled like a bar chart. Very handy!

chartscss.org/

in reply to Terence Eden

You're very welcome to nick my SVG sparkline generator (It's in js but should convert to PHP fairly easily if you want to do it server-side rather than client-side)
Post: kevinmarks.com/svgsparklines.h…
code: view-source:kevinmarks.com/sparkline.svg
Questa voce è stata modificata (1 settimana fa)
in reply to Terence Eden

this is actually gold, i went so mad when MaxMind discontinued its product @mcrmonkey
@ant
in reply to Terence Eden

A million rows shouldn't cause any problems if you have sensible indexes, but you might want to rewrite:

DATE(time) = '2025-09-05'

as:

time >= '2025-09-05' AND time < '2025-09-06'

because DATE(time) *might* cause any indexes on time to be ignored, since you're transforming the column to something other than what was indexed (I had a client a few years ago with a really slow report because of this)

in reply to Terence Eden

I'm going to be a bit of an annoying nerd, but this is not actually privacy preserving. Firstly, you're building user identifiers, which allows you to track users' behaviour. Secondly, hashing the user IP and user-agent is very easily reversible, so you shouldn't be even publishing that screenshot 😀
in reply to hades

@hades I 100% disagree.

I can't track anything off my site.
I'll buy you a beer if you can reverse any of the ones I posted. I'll even tell you the algorithm if you'd like?

in reply to Terence Eden

we may be using different definitions of words, but you know which pages I've visited at what times and where I came from. is that not tracking? if I went to a book store, and they recorded which coffee shop I came in from, and what books I've looked at in which order, I wouldn't call that privacy preserving
in reply to hades

@hades perhaps it is a cultural thing.
If you have left my beach and I notice your footprints, has that invaded your privacy?
Unknown parent

mastodon - Collegamento all'originale
Terence Eden
@miclgael yup. Well, I assume so. Or people sharing my links on here.
in reply to Terence Eden

Are you familiar with plausible.io/about ? @plausible

They might be already doing what you're trying to do...

in reply to Ben Curthoys

@bencurthoys @plausible
Yes, but I'm more interested in building something that I can maintain and understand.

Learning is more fun than just installing.