I rebuilt my blog's cache. Bots are the audience now

(hoeijmakers.net)

29 points | by robhoeijmakers 3 hours ago

11 comments

pavel_lishin 1 hour ago
> Not because I expect a person in Singapore to shave 200ms off their pageload, but because the next request for that page is more likely to come from a retrieval system than a browser, and the request after that, and the one after that.
Why do I care if I shave off 200ms from a crawler's request, instead of a human's?
- Brybry 45 minutes ago
  The graphic in the article seems to be the only significant content.
  Based on that I think it's more about requests from bots/scrapers having the greatest chance possible of hitting a cache before hitting the blog's origin/real host. Bots will hit some layer of Cloudflare first then they'll hit Fastly and then if not in Fastly they'll hit the Ghost blog's server.
  To me, this makes a lot of sense if it's self-hosted but I also thought it was already the standard to shove your self-hosted blog behind a reverse-proxy and cache as much as possible.
  And I'm not a professional web developer but all the extra caching layers for a static personal blog seem a bit overkill.
  Aside from the graphic, the article is a lot of words about engaging with an LLM to get a full understanding of how caching works for their blog hosting and how it enabled them to change their setup for the better.
  It's kind of hard to understand because there are no words about what they actually did or how what they actually did was better.
- m0rde 53 minutes ago
  From the post:
  > If you care about how your content moves through the world now, including through AI systems, you have to care about caching. Not as a performance optimisation for human browsers, but as infrastructure for machine readership.
  - nothrabannosir 24 minutes ago
    That doesn’t answer the question at all, and I wonder if it’s actually true? A cache is not magic; it is, itself, just a static file server in the end. If I self-host a static website on an nginx box, do I actually need a cache to serve today’s crawlers?
    The screenshot in the image says 3k req/day. That’s 2 requests per minute (amortized). At that rate, you can serve it with cgi and Perl.
    Cache is only relevant if you have a lot of traffic AND dynamic pages, or if you care about latency (which is only relevant for humans).
- rodw 1 hour ago
  Page load time can impact index coverage (depth of crawl), freshness (revisit rate), and ranking.
jdw64 3 hours ago
Personally, I think this is a good idea. But the core problem is this: How is a newcomer supposed to build reputation now? Without exaggerated business promises or capital, basic online reputation usually depends on writing. In fact, my own first step into freelancing came because someone found the articles on my Korean blog interesting. So the question is: if the subscribers are bots, what benefit do they actually give me? If bots become the readers, then what matters is whether they can provide any kind of symbolic capital or real capital. I can build caching with Redis without much difficulty, but I worry that if this continues, the result may simply be that LLMs learn from my writing while no benefit returns to me. People write partly to organize their thoughts, but also partly to gain symbolic capital. That is one reason why I write my own posts instead of using an LLM to write them for me.
- ryandrake 22 minutes ago
  I think in general, "Writing on the Internet with the intent to make money" is effectively dead, or at least soon to be dead. AI+bots mean we now have the "infinite typewriter monkeys" from the thought experiment. With infinite supply, the price goes to zero.
  We need to stop this treadmill of trying to "build reputation" and stop focusing on "symbolic capital" and "clout" and whatever else bloggers are going after. You're not going to get it, and even if you do, you're not going to be able to "monetize" it.
  If you have a need to write, write. Maybe a handful of actual people will read it, maybe not. But, I wouldn't try to do it for a living. The reward will have to be the cathartic process of writing itself, and not in how much attention it gets, how much it "blows up" or how viral it gets.
  - jdw64 13 minutes ago
    I am not trying to make money from writing.
    What I need is for my writing to spread enough that I can receive opportunities to have my programming ability evaluated.
    The reason I write about programming is that, in the past, some readers found my programming essays interesting, and that led to chances for me to be tested. I had to leave graduate school because of financial problems, and I did not graduate from a prestigious university.
    So this is not simply about monetizing writing. It is a struggle to receive opportunities. Those are fundamentally different things.
    Some people may be happy writing things that nobody reads. But many people are happier when they can share their writing and let their values collide with those of others.
- nilirl 31 minutes ago
  I feel that pressure of not knowing how to definitively compete on the internet, especially when there's so much AI-created noise.
  I'm a copywriter and I used to get hired to write posts on behalf of founders on LinkedIn or for their company blog.
  Now, the last three jobs I had were all focused on sending cold email.
- pixl97 1 hour ago
  >How is a newcomer supposed to build reputation now
  Dead internet manifest.
- johng 1 hour ago
  What's worse is that they train on your content, and very often you don't even get an attribution link. So the end user never even knows it was your site that provided the information and you never even get a single clickthrough. It's not like the SERPs where someone would click through, read your site, hopefully find it interesting and useful and come back.
  It's going to be a serious problem and I've already seen sites that are down 90% in traffic simply because AI is scraping them, answering the questions itself and never providing a linkback.
  - 01284a7e 1 hour ago
    I pulled all the websites I had - some existed for a decade plus and made me hundreds of thousands of dollars. All that is left is bots that steal the value of my work. Until something changes, goodbye.
    - gbgarbeb 42 minutes ago
      This is like choosing to be an elementary school teacher and then quitting because it turns out your students for the year aren't your pets in perpetuity.
      - diatone 35 minutes ago
        If your students were growing up to subvert your line of work, sure. Pretty sure that’s not the case though!
- robhoeijmakers 1 hour ago
  [flagged]
chrismorgan 27 minutes ago
I’m very confused about why you’d have such a complex cache arrangement. Sounds like you’re using Cloudflare and Fastly to do roughly the same thing. That sounds like a recipe for more expense and more problems.
For the sort of thing you’re doing, it should be as simple as “throw it behind Cloudflare/Fastly/Bunny/whichever private CDN you like” and that’s it.
Also the diagram near the end is pretty much incoherent. GenAI, I presume.
- robhoeijmakers 6 minutes ago
  I am on a low-tier Ghost subscription, so I could not rewrite some of the HTML. So I do this with Cloudflare and then cache it again.
  Yes, the architecture setup is generated by ChatGPT but in itself it says what it needs.
yawnxyz 20 minutes ago
my tiny blogs no one reads have been racking up huge deno deploy and vercel bills ($40-50/mo each) bc I ran them "naked" without a cache or cloudflare or static builds - it didn't matter bc I got like hundreds of visitors a month. they were just hono or whatever api pulling from my backend which could be notion or airtable - super simple, though kind of slow
now I suddenly have 10k visitors a month hammering my apis and causing massive egress and cpu usage - so i had to get them behind cloudflare and now build everything statically - cut the costs back down from 90+ cpu hours to about 0.2 cpu hours a month
crazy times
(also, all done w/ claude code's help, or it would have taken a week for me to figure out)
- faangguyindia 15 minutes ago
  This is why I don't use those serverless setups.
  A $4 Hetzner VPS can serve tons of requests if you put Cloudflare in front of it.
  I host my own runners for CI and artifact building on Hetzner VPS (spun up on demand).
  People are easily lured by pay-as-you-go plans on serverless and other cheap-to-get-started managed services and end up racking up huge bills.
  This is the same reason I don't use Stackdriver or Cloud Monitoring and prefer a Grafana + Loki + Prometheus setup.
  My setup cannot be misconfigured and end up racking up huge bills.
faangguyindia 18 minutes ago
Yesterday I logged into Cloudflare and found that Cloudflare had blocked ChatGPT and Claude from accessing my site. https://macrocodex.app
This is bad because there are fitness guides on my domain (https://macrocodex.app/guides) which newbies often paste into ChatGPT and ask it to simplify.
I enabled crawling for LLMs. There is a lot of misinformation in the fitness field, so it's better if LLMs get their content from people who at least have experience in the field.
- robhoeijmakers 4 minutes ago
  It is good to make a proper distinction, in the ChatGPT context, between crawlers and agents. The crawlers go for the content to build a new model, the agents serve content to users. The latter can be very useful.
ssv445 25 minutes ago
the core value of the internet was someone discovering you via your content. agents as the primary consumer might look good for now, but we are definitely making the internet dead for many SMBs.
ianberdin 18 minutes ago
It is time to rewrite to X to optimize Y :)
steve_adams_86 31 minutes ago
I went through a similar process recently. For a while I saw readership of my site gradually increasing, and eventually it became clear that it wasn't human beings.
I also used Claude to help me drill into what's going on. Bizarrely, about 80% of my traffic comes from Singapore, which the author mentioned. I don't know why. A lot of the traffic looks real; it stays for a while, clicks different links in different orders. But no one in Singapore has ever read a thing I've written on my site as far as I can tell.
I thought Cloudflare would help protect my site from bots, but it utterly fails. I'm not sure if they're too sophisticated or people overestimate how well CF works for these things. I paid for advanced features for a while and reverted to the free plan once I realized it made no difference. It's a great platform in general, but hasn't been great for allowing me to see how many humans actually read my content.
I know some do because they email me occasionally. If I had to guess, of the ~200 visits per week reported in analytics, around 15 are real.
- robhoeijmakers 2 minutes ago
  Same ratio roughly. 80% crawlers and agents, 20% human. Loads of the agents actually serve the content to humans, mostly in ChatGPT.
cullumsmith 1 hour ago
I simply block all AI crawlers with a user-agent check in nginx.conf.
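For readers unfamiliar with the idea, here is a minimal sketch of what a user-agent check does, written in Python rather than nginx configuration. The token list and the function names are illustrative assumptions, not the commenter's actual config, and the list is far from complete.

```python
# Illustrative (assumed) substrings that some AI crawlers send in their
# User-Agent header. Not taken from the thread, and not exhaustive.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")


def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a listed crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)


def status_for(user_agent: str) -> int:
    """Hypothetical helper: 403 for matched crawlers, 200 for everyone else."""
    return 403 if is_ai_crawler(user_agent) else 200
```

A check like this only matches clients that announce themselves in the header; a request that presents an ordinary browser User-Agent passes straight through.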
- microtonal 56 minutes ago
  I also block all AI crawlers. I am not sure why I should give them my content for them to rip it off and make money from it through training or agents. Sadly, a lot of AI companies are trying to make requests indistinguishable from regular browsers from residential connections, so unfortunately I have to use Cloudflare to block them.
  Ideally I'd make the content available to crawlers for training open models, but that seems to be nearly impossible. It would be possible if other AI companies behaved.
  - Barbing 41 minutes ago
    >so unfortunately I have to use Cloudflare to block them.
    That can’t block Grok, can it?
    (You might have a fake iPhone or something visit your site if you ask Grok to retrieve information from it)
- orf 51 minutes ago
  *some AI crawlers. Not many
- robhoeijmakers 1 hour ago
  I started blocking some of them. But for now I want to improve visibility before further blocking or optimising. The dashboard helps with this.
gostsamo 24 minutes ago
The writership of the blog has also changed and seems to be mostly machine as well. It is painful to read something that lacks human presence on the other side.
Hackbraten 2 hours ago
Why do I get just an empty page?
- robhoeijmakers 1 hour ago
  Thanks. It seems to be very local/incidental. The page works from the locations I can test, but I’ll check whether one edge cache or request path served a bad response.
- consumer451 1 hour ago
  Same here via VPN. No VPN, and I get the actual content.
- ksk23 1 hour ago
  Caching gone wrong.. (Works for me)