More Mastodon scaling experiences

Problem: really high CPU load, making the web frontend unusable.

Sidekiq was still on top of its queues, so I didn't think the problem was there.

Postgres didn't seem to be overloaded either.

I spent most of a day upgrading the main machine to the latest Ubuntu LTS, and then trying to take a dump of the database. I thought I'd royally knacked it when the first do-release-upgrade crashed halfway through, and then the machine wouldn't boot, giving a "can't mount root FS" error. It turned out that the machine was trying to boot a kernel it hadn't built an initramfs image for, so I used the VNC console to catch grub and get it to boot into the old kernel. I then ran do-release-upgrade again and it eventually all worked out.
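For anyone who hits the same "can't mount root FS" symptom, this is roughly how you could rebuild the missing initramfs by hand after booting back into the old kernel. A sketch only: the kernel version is a placeholder and it assumes a stock Ubuntu setup; in my case re-running the upgrade did the equivalent work.

# See which kernels are installed and which have an initramfs alongside them
ls /boot/vmlinuz-* /boot/initrd.img-*

# Build the missing initramfs (substitute the version that wouldn't boot)
sudo update-initramfs -c -k 5.15.0-76-generic

# Regenerate the grub menu so the entry points at the new initramfs
sudo update-grub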

The database dump was a massive waste of effort, particularly since the site was still running, after a fashion, in the meantime, so the dump was out of date as soon as it finished. My thinking was that if I had to restore, this dump would be fresher than the 12-hour-old automatic backup. But the dump process ended up taking about 8 hours, so it wasn't much fresher at all.
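For reference, the dump was just a pg_dump of the live database, something along these lines (the database name and output path are assumptions about a standard Mastodon install; -Fc writes a custom-format archive that pg_restore can later restore with parallel jobs):

# Dump the live database in custom format while the site keeps running
sudo -u postgres pg_dump -Fc -f /var/backups/mastodon_production.dump mastodon_production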

Eventually, I wondered if Puma was the problem.

Puma is the process that serves the web frontend, so when it runs slowly, that's what users notice most.

Plan: move Puma to another host.

On the new host, I had to set the BIND environment variable to 0.0.0.0 in the supervisor config, so that Puma listens on all interfaces and accepts connections from the main machine.
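For illustration, the relevant supervisor entry ended up looking something like this; the program name, paths and user are assumptions about the layout, and the important part is the environment line:

[program:mastodon-web]
; BIND=0.0.0.0 makes Puma listen on all interfaces instead of just localhost
environment=RAILS_ENV="production",BIND="0.0.0.0",PORT="3000",WEB_CONCURRENCY="4",MAX_THREADS="15"
command=/home/mastodon/.rbenv/shims/bundle exec puma -C config/puma.rb
directory=/home/mastodon/live
user=mastodon
autostart=true
autorestart=true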

Just doing apt install nodejs on the new machine installed Node 18. I had to add the apt source as described in the docs, then run apt install nodejs=16.20.1-deb-1nodesource1 to force it to install the right version.
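The commands were roughly these; the repo-setup step is from memory and may not match what the docs say now, and the apt-mark hold at the end is optional, it just stops a routine apt upgrade pulling Node 18 back in:

# Add the NodeSource apt repository for Node 16
curl -fsSL https://deb.nodesource.com/setup_16.x | sudo -E bash -

# Install the specific package version rather than whatever is newest
sudo apt install -y nodejs=16.20.1-deb-1nodesource1

# Pin it so a later upgrade doesn't replace it with a newer major version
sudo apt-mark hold nodejs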

I got weird errors (Module parse failed: Unexpected token (1:2259)) when running the rails assets:precompile task. My yarn lockfile had changed, so I must not have done yarn install properly. I did git restore yarn.lock and then yarn install --pure-lockfile, and it worked.
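Spelled out, and assuming the usual RAILS_ENV=production for the precompile step:

# Throw away my accidental lockfile changes
git restore yarn.lock

# Install exactly the versions in the lockfile, without rewriting it
yarn install --pure-lockfile

# Rebuild the frontend assets
RAILS_ENV=production bundle exec rails assets:precompile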

Then I had to do supervisorctl restart mastodon:web to get it to pick up the change.

I changed the nginx config to set the Puma server as an upstream host:

upstream puma_hosts {
    server <ip of the puma machine>:3000;
}

...


location @proxy {
    proxy_pass http://puma_hosts;
}
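After editing the config on the main machine, it's the usual check-and-reload (assuming nginx is managed by systemd):

# Make sure the new upstream block parses before applying it
sudo nginx -t

# Reload nginx without dropping existing connections
sudo systemctl reload nginx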

That seemed to do it - turning off the Puma server on the main machine allowed the other processes to work, and the load average dropped back to reasonable levels. The new machine running Puma doesn't seem to be doing that much work after all, so I think this was a queueing problem.

For reference, the layout is now:

  • 6 vCPU, 16GB RAM VPS running Postgres, Redis, the streaming API, and 95 Sidekiq processes. The Sidekiq processes are split into four groups, prioritising different queues (there's a sketch of what that looks like after this list).

  • 2 vCPU, 4GB RAM VPS running 45 more Sidekiq processes, in three groups prioritising the push, pull, and ingress queues.

  • 4 vCPU, 8GB RAM VPS running Puma with WEB_CONCURRENCY=4 and MAX_THREADS=15. This is the new machine. The parameters were guesses based on the number of vCPUs and what worked on the first machine.
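As a sketch of what I mean by "groups": each group is a separate supervisor program running sidekiq with its own queue list and weights. Something like the following, where the program names, paths, concurrency and weights are illustrative rather than the exact values I'm running. The -q queue,weight flags make Sidekiq poll higher-weighted queues more often instead of draining queues strictly in order, and DB_POOL needs to be at least the concurrency so every worker thread can get a database connection.

[program:sidekiq-default]
; mostly the default queue, with ingress checked less often
command=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 25 -q default,8 -q ingress,4
directory=/home/mastodon/live
environment=RAILS_ENV="production",DB_POOL="25"
user=mastodon

[program:sidekiq-push-pull]
; deliveries to and fetches from other servers
command=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 25 -q push,6 -q pull,4
directory=/home/mastodon/live
environment=RAILS_ENV="production",DB_POOL="25"
user=mastodon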

Together, these cost $192 per month. We're getting more than that in donations, so I'm not worried about the cost at the moment.

The dedicated Sidekiq machine never seems very busy, so I might try handling only the push and pull queues there and see what happens.

During the several days all this was happening, I got intermittent messages from people either asking if I knew the server was unresponsive, or offering support. It was frustrating that, while the Mastodon server was unusable, I had no way of telling users what I was up to.

We really need a separate status page that people can check. It would also be a good idea to have a well-known account on another fediverse instance that people can follow. I've been meaning to set those up for a while; I left it just a bit too long!