Is anybody out there?

Ángel Ortega

Today I realized that I haven't taken a look at this site's log for a very long time. I disappeared from Google's first page some years ago (and, consequently, from the Internet); suddenly I wondered if anybody ever visits this place. I don't run any log analyzer, so I located yesterday's nginx log and made a raw count:

angel@samael:~$ wc -l /var/log/nginx/triptico-access.log.1
2863 /var/log/nginx/triptico-access.log.1

So there were 2863 queries served by the httpd process on February 4th. But not so long ago I added ActivityPub support to the software that runs Triptico; I guessed that a non-small number of these queries are probably Fediverse pings, spam attempts and other useless crap, so I rewrote the line count to:

angel@samael:~$ grep -v mastopeek /var/log/nginx/triptico-access.log.1 | \
grep -v fediverse | grep -v GabSocial | grep -v Mastodon | wc -l
2097

Well, that was a trim. But what about the bots? I remember Bing behaving like an asshole some time ago by hammering my site (I even denied it access for a while). Let's ignore it and another fucker I knew by name:

angel@samael:~$ grep -v mastopeek /var/log/nginx/triptico-access.log.1 | \
grep -v fediverse | grep -v GabSocial | grep -v Mastodon | \
grep -v bingbot | grep -v Googlebot | wc -l
1763

Bang! How many other non-humans come here? I took a look at the file with dismay; what the fuck. There are a trillion bots lurking here! I started cropping:

angel@samael:~$ grep -v mastopeek /var/log/nginx/triptico-access.log.1 | \
grep -v fediverse | grep -v GabSocial | grep -v Mastodon | \
grep -v bingbot | grep -v Googlebot | grep -v YandexBot | grep -v Sogou | \
grep -v magpie-crawler | wc -l
1622

There are more!

angel@samael:~$ grep -v mastopeek /var/log/nginx/triptico-access.log.1 | \
grep -v fediverse | grep -v GabSocial | grep -v Mastodon | \
grep -v bingbot | grep -v Googlebot | grep -v YandexBot | grep -v Sogou | \
grep -v magpie-crawler | grep -v Applebot | grep -v SemrushBot | \
grep -v Nimbostratus-Bot | grep -v RU_Bot | grep -v MojeekBot | \
grep -v 360Spider | wc -l
1363
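Instead of eyeballing the file for the next name to crop, the User-Agent strings can be tallied. This is only a sketch: it assumes nginx's default "combined" log format, where splitting each line on double quotes leaves the User-Agent in the sixth field, and top_agents is a made-up helper name.

```shell
# Hypothetical helper: list the most frequent User-Agent strings in a log.
# Assumes the "combined" format: ip - user [time] "request" status bytes
# "referer" "user-agent", so the User-Agent is the 6th field when splitting
# on double quotes.
# Usage: top_agents LOGFILE [HOW_MANY]
top_agents() {
    awk -F'"' '{ print $6 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

Running top_agents /var/log/nginx/triptico-access.log.1 20 would print the 20 most common agents with their counts; anything with "bot", "crawler" or "spider" in its name is a candidate for the next grep -v.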

I learned that there is software with lame names like Applebot and Nimbostratus-Bot (it's some cloud-related crap, how original). Are there more? Sure. Though not self-identifying as a bot with a reasonable User-Agent, something asked for /robots.txt many times, and so that can be deleted as well:

angel@samael:~$ grep -v mastopeek /var/log/nginx/triptico-access.log.1 | \
grep -v fediverse | grep -v GabSocial | grep -v Mastodon | \
grep -v bingbot | grep -v Googlebot | grep -v YandexBot | grep -v Sogou | \
grep -v magpie-crawler | grep -v Applebot | grep -v SemrushBot | \
grep -v Nimbostratus-Bot | grep -v RU_Bot | grep -v MojeekBot | \
grep -v 360Spider | grep -v '/robots.txt' | wc -l
1205

Wait, more bots!

angel@samael:~$ grep -v mastopeek /var/log/nginx/triptico-access.log.1 | \
grep -v fediverse | grep -v GabSocial | grep -v Mastodon | \
grep -v bingbot | grep -v Googlebot | grep -v YandexBot | grep -v Sogou | \
grep -v magpie-crawler | grep -v Applebot | grep -v SemrushBot | \
grep -v Nimbostratus-Bot | grep -v RU_Bot | grep -v MojeekBot | \
grep -v 360Spider | grep -v '/robots.txt' | grep -v ZoominfoBot | \
grep -v AhrefsBot | grep -v SeznamBot | grep -v DotBot | \
grep -v Barkrowler | grep -v MixnodeCache | grep -v repology-linkchecker | \
grep -v Linespider | grep -v VelenPublicWebCrawler | grep -v linkfluence | \
grep -v Twitterbot | wc -l
965
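The chain of grep -v calls is getting long. The same filtering can be sketched with a single grep fed a list of patterns; count_unfiltered is a hypothetical name, and the list is simply the names used so far. Note that -F matches the patterns as literal strings (so the dot in /robots.txt only matches a real dot), which is slightly stricter than the pipeline above, and that this has not been run against the real log.

```shell
# Hypothetical helper: count the lines of a log that match none of the
# known bot patterns. The names are the ones used in the pipeline so far.
# Usage: count_unfiltered LOGFILE
count_unfiltered() {
    # printf emits one pattern per line; grep treats a newline-separated
    # pattern list as alternatives.
    patterns=$(printf '%s\n' \
        mastopeek fediverse GabSocial Mastodon \
        bingbot Googlebot YandexBot Sogou magpie-crawler \
        Applebot SemrushBot Nimbostratus-Bot RU_Bot MojeekBot 360Spider \
        /robots.txt ZoominfoBot AhrefsBot SeznamBot DotBot \
        Barkrowler MixnodeCache repology-linkchecker \
        Linespider VelenPublicWebCrawler linkfluence Twitterbot)
    # -v inverts the match, -F takes the patterns literally, -c counts.
    grep -c -v -F -e "$patterns" "$1"
}
```

Adding another bot is then a matter of appending one word to the list instead of one more stage to the pipe.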

I know that I don't have dead links, so all 404 errors must come from non-humans (to be fair, some of them may be due to linkrot from old sites pointing here). Also, I don't serve fucking PHP here (never did), so let's prune index.php queries that most probably come from script-kiddies trying to fuck with my system:

angel@samael:~$ grep -v mastopeek /var/log/nginx/triptico-access.log.1 | \
grep -v fediverse | grep -v GabSocial | grep -v Mastodon | \
grep -v bingbot | grep -v Googlebot | grep -v YandexBot | grep -v Sogou | \
grep -v magpie-crawler | grep -v Applebot | grep -v SemrushBot | \
grep -v Nimbostratus-Bot | grep -v RU_Bot | grep -v MojeekBot | \
grep -v 360Spider | grep -v '/robots.txt' | grep -v ZoominfoBot | \
grep -v AhrefsBot | grep -v SeznamBot | grep -v DotBot | \
grep -v Barkrowler | grep -v MixnodeCache | grep -v repology-linkchecker | \
grep -v Linespider | grep -v VelenPublicWebCrawler | grep -v linkfluence | \
grep -v Twitterbot | grep -v ' 404 ' | grep -v 'index.php' | wc -l
886
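One caveat about grep -v ' 404 ': it also drops any line where 404 happens to appear elsewhere between spaces, for example a successful response whose body is exactly 404 bytes long. A stricter check, sketched here under the assumption of nginx's default "combined" log format (count_status is a made-up name), looks only at the status field that follows the quoted request:

```shell
# Hypothetical helper: count the requests that returned a given HTTP status.
# Assumes the "combined" format, where the text between the quoted request
# and the quoted referer is " STATUS BYTES ".
# Usage: count_status LOGFILE STATUS
count_status() {
    awk -F'"' -v code="$2" '
        { split($3, f, " "); if (f[1] == code) n++ }
        END { print n + 0 }   # n + 0 prints 0 instead of an empty line
    ' "$1"
}
```

With this, count_status /var/log/nginx/triptico-access.log.1 404 would give the number of real 404 responses, which can be compared against what the plain grep removes.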

Some crappy software from Apple keeps asking for /favicon.ico even though I announce this site's icon as something else. Morons.

Digging more into the log file: do queries from something that identifies as facebookexternalhit count as real people? I hope they do, but how many that day?

angel@samael:~$ grep facebookexternalhit /var/log/nginx/triptico-access.log.1 | wc -l
18

Well, not many. And what about RSS/Atom feed aggregators? Are there real human beings behind them?

angel@samael:~$ grep -E '/(atom|rss).xml' /var/log/nginx/triptico-access.log.1 | wc -l
131

Meh. Whatever. Regardless of this, I ended up with 886 queries (from the apparent 2863) that are not clearly discardable as garbage. How much can be inferred from looking at them? Well, some of them look like real human activity; for example, queries of the Minimum Profit page followed by the images of MP's screenshots seem legit (there are 14 queries for that page). On the other hand, there are bursts of activity asking for 50+ pages simultaneously, in the same fraction of a second, from IPs in the 47.88.x.x and 47.254.x.x ranges, running some software that reports a User-Agent mentioning the Safari browser but is obviously lying. There are at least 400 queries from this thing. I don't know what the fuck it is.
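Bursts like these can be located mechanically. The following is a sketch, assuming nginx's default "combined" log format (client address in the first whitespace-separated field, timestamp with one-second resolution in the fourth); bursts is a hypothetical helper name, and the default threshold of 50 just mirrors the figure mentioned above.

```shell
# Hypothetical helper: list (address, second) pairs with many requests.
# Assumes the "combined" format: field 1 is the client address and
# field 4 is the timestamp, e.g. [04/Feb/2022:10:00:00
# Usage: bursts LOGFILE [MIN_REQUESTS]
bursts() {
    awk -v min="${2:-50}" '
        { n[$1 " " $4]++ }                 # requests per address per second
        END { for (k in n) if (n[k] >= min) print n[k], k }
    ' "$1" | sort -rn
}
```

Each output line is a request count followed by the address and the second in which the requests arrived, busiest first.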

So there are a little more than 400 queries that are not easily identifiable as cruft. But that is not the same as what is known as a WWW visit; following the previous example, each visit to the Minimum Profit editor page results in 15 different file requests.
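A rough way to get from file requests to something closer to visits is to ignore images, stylesheets and similar assets and count each page only once per client address. This is a crude sketch, not a real visit counter: it assumes nginx's default "combined" log format (address in field 1, request path in field 7), the list of asset extensions is a guess, and page_views is a made-up name.

```shell
# Hypothetical helper: count distinct (address, page) pairs, skipping
# requests for static assets. The extension list is an assumption.
# Assumes the "combined" format: field 1 is the client address and
# field 7 is the request path.
# Usage: page_views LOGFILE
page_views() {
    awk '$7 !~ /\.(png|jpe?g|gif|css|js|ico)$/ { print $1, $7 }' "$1" |
        sort -u | wc -l | tr -d ' '
}
```

It would have to be fed the already filtered lines rather than the raw log, otherwise the bots get counted as visitors too.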

This is starting to get depressing so I'll end this post here.