Hook, Line, & Syncer: Bringing Voice Back Online
By querying etcd, we could see that a majority of voice syncers instances were not announcing and were out of the hash ring. While these instances were not taking on new voice syncer processes, they still held most of their pre-impact voice syncer processes. Not wanting to impact ongoing calls, we first attempted targeted recovery of individual instances.
At 12:43 we fully restarted the voice syncers application on instance 2-13. At 12:47 we killed and restarted the Holster.Pool DynamicSupervisor on 2-1 after the Erlang recon module confirmed that its mailbox was getting backed up.
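For the curious, these checks look roughly like the snippet below when run from a remote shell on the affected instance. The exact calls, the registered process name, and the output shape are illustrative rather than a transcript of what we ran.

```elixir
# Top processes by mailbox size, via the Erlang recon library.
:recon.proc_count(:message_queue_len, 10)
#=> [{#PID<0.1234.0>, queue_len, [Holster.Pool, ...]}, ...]

# Assuming the pool supervisor is registered under its module name,
# confirm that its mailbox is the one backing up...
Holster.Pool |> Process.whereis() |> Process.info(:message_queue_len)

# ...then kill it so its parent supervisor restarts it with an empty mailbox.
Holster.Pool |> Process.whereis() |> Process.exit(:kill)
```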
In both cases, the instances very briefly recovered — they were able to re-announce in etcd, quickly take on voice syncers, and dispatch some RPC messages to SFUs.
Because the majority of our instances were not in the ring, healthy instances were accepting additional traffic, acting as the secondary and tertiary failover for unhealthy instances. Due to the oversized influx of new syncers (and, in 2-1’s case, the pending retries for RPCs to existing voice syncers), mailboxes immediately started to grow again and we re-entered a similar failure mode.
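For a sense of why this happens, here is a deliberately simplified sketch (not our actual placement code) of ring-based placement with failover. When most candidates are missing from the ring, the primary, secondary, and tertiary placements for nearly every key collapse onto the same few announced instances:

```elixir
# Illustrative only: walk the ring from a key's hash position and take the
# first `count` instances that are still announced in etcd. Those become the
# primary, secondary, and tertiary placements for that key.
defmodule RingSketch do
  def placements(key, candidates, announced, count \\ 3) do
    start = :erlang.phash2(key, length(candidates))

    candidates
    |> Stream.cycle()
    |> Stream.drop(start)
    |> Stream.filter(&MapSet.member?(announced, &1))
    |> Enum.take(count)
  end
end

# With only a few of the sixteen candidates announced, almost every key's
# three placements land on the same handful of healthy nodes, which then
# absorb the traffic (and retries) for everyone else.
candidates = for n <- 1..16, do: "2-#{n}"
announced = MapSet.new(["2-4", "2-9", "2-15"])
RingSketch.placements("call:1234", candidates, announced)
#=> some rotation of ["2-4", "2-9", "2-15"]
```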
With targeted instance recovery not working, we felt that maximizing the concurrent number of healthy instances would give us the best chance of recovery. At 13:05, we attempted a restart of the full voice syncers cluster. This “restarts the Discord” from the perspective of the A/V infrastructure, and quickly re-creates millions of active calls. The instances restarted and some initial RPC connections succeeded, moving some calls out of the “Awaiting Endpoint” state.
Once again, recovery was short-lived — the cold-start thundering herd was too much to handle, and by 13:09 all restarted nodes had ever-growing mailboxes again. There were split failure modes across the instances (depending on whether or not the Holster.Pool supervisor was backed up), but all instances were eventually unable to create new connections and needed further intervention.
It was clear we needed a way to slow down the influx of outgoing HTTP requests from restarted syncers instances. We attacked this from two parallel fronts: tuning our existing syncer creation rate limit and growing the syncers cluster.
We already had some rate limiting on call owner voice syncer creation, but discovered during our incident response that it was outdated or unset across the call owner services. The rate limit in guilds was especially ineffective, as it limited only the spawning of the “coordinator” voice syncer for each guild while allowing unbounded spawning of child syncers for each of the guild’s voice channels (each of which selects at least one SFU endpoint it needs to connect to).
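The shape of the fix is easiest to show with a sketch. The module below is hypothetical (the names and numbers are illustrative, not our production code), but it captures the idea: every syncer spawn, child syncers included, has to fit inside a small per-second budget, and callers that are denied retry later instead of fanning out all at once.

```elixir
# Hypothetical sketch of a per-service spawn budget; not Discord's actual
# rate limiter. Each call owner service asks before starting a voice syncer.
defmodule SyncerSpawnLimiter do
  use GenServer

  @max_spawns_per_second 50

  def start_link(opts \\ []),
    do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Ask for permission before starting a (coordinator or child) voice syncer
  # under the DynamicSupervisor; denied spawns are retried with backoff.
  def allow_spawn?, do: GenServer.call(__MODULE__, :allow_spawn?)

  @impl true
  def init(_opts) do
    # Refill the budget once per second.
    :timer.send_interval(1_000, :refill)
    {:ok, %{remaining: @max_spawns_per_second}}
  end

  @impl true
  def handle_call(:allow_spawn?, _from, %{remaining: 0} = state),
    do: {:reply, false, state}

  def handle_call(:allow_spawn?, _from, state),
    do: {:reply, true, %{state | remaining: state.remaining - 1}}

  @impl true
  def handle_info(:refill, state),
    do: {:noreply, %{state | remaining: @max_spawns_per_second}}
end
```

A limit shaped like this bounds the burst of outgoing SFU connections a freshly restarted instance will attempt, which is exactly the thundering herd we were trying to tame.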
Until A/V Infrastructure follows the Kubernetes migration path defined by the realtime infrastructure team, growing the voice syncers cluster is a somewhat manual (and unfortunately slow) process. Instances live on GCP VM nodes defined in Terraform, are configured by Salt, and need to be manually added as ring candidates in etcd.
The rate limit changes were ready first: by 13:43 the calls, streams, and guilds services had low syncer spawn limits in place, and we attempted a restart of the voice syncers cluster again. This followed mostly the same pattern as the 13:09 restart — initial recovery followed by ever-growing mailboxes — but with one helpful difference: the backed-up supervisor across instances was consistently the gun supervisor instead of Holster.Pool. This allowed instances to check out pooled connections created before the gun supervisor got too far behind. All restarted instances stayed announced in etcd, retained their target voice syncers, and were also able to successfully send RPCs to SFUs for a subset of their syncers.
While we waited for Salt to finish provisioning the 15 new syncers instances, the slightly improved partial recovery from the 13:43 restart gave us the confidence to once again attempt targeted recovery of individual instances — an urgent game of whack-a-mole.
Some targeted restarts were finally successful, providing a full recovery of individual instances. At 14:03, we restarted the voice syncers application on instance 2-3. It happily took on its voice syncers, the aggregate mailbox length stayed low, and new and existing connections were consistently successful. We chalk this success up to a few factors: the lower rate limits slowing the spawn rate of syncers, the global number of syncers being reduced (both by being further off peak and by the ongoing incident itself), and the healthier cluster needing fewer secondary and tertiary placements of syncers.
By 14:15, we had restarted 5 instances, with 4 of the 5 successfully demonstrating a full recovery. At the same time, the 15 additional instances doubling our capacity were brought online, successfully taking on syncers without any mailbox queuing issues. This halved the per-instance syncer count and improved the success rate of the final remaining restarts. By 14:26, the final backed-up instance was restarted, and the doubled cluster was in a fully healthy state.