There's really no redundancy. If your headscale node goes down, it can take your tailnet with it. In particular, if you drop in some ACLs that have issues, your entire tailnet can go down until you get it fixed. The recent (0.21) "configtest" can help there, but it still feels a bit brittle.
Headscale can use a lot of system resources. ~100 nodes can saturate a t3a.small instance in CPU time and disk access. Reducing the update frequency can help, but there are hard limits here. I suspect much of this is database updates to SQLite, but I haven't tried switching to an external Postgres server yet to see how much of the load is database-related.