Deep Dive: Running System Stress Tests

Tim Golen

When we decided to do a Super Bowl commercial in 2019, we had to take a serious look at our entire infrastructure - from customer support to application performance to how many logs our servers produced - and tackle two large problems. The first problem was: how could we handle a quick, large spike in traffic during the event? The second was: how would we handle the sustained strain on our system in the months and years to follow as we gained more customers? Solving these two problems would let us onboard more customers without running out of capacity or subjecting customers to poor performance.

We solved these problems by adopting a mantra of “10X Everything”. Each layer of our entire product should be able to handle 10X the current level of activity with no decrease in performance. Then, beyond 10X, every system should gracefully degrade, breaking under only the most extreme load.

Our traffic varies quite a bit day-to-day and there is no good way to simulate traffic that is truly representative of real production traffic. So instead of trying to simulate it, we decided to use the real thing!

To see what that traffic would be like with 10x as many customers, we apply a multiplier of 10 at every layer of the stack (web requests, mobile API calls, database queries, messages logged, etc.). That means that if you SmartScanned a receipt during a stress test, our web servers got 10 API calls from your mobile app, and our database got 10 requests to save your data. The database completed the required queries for all 10, canceling the 9 fake requests at the last possible moment by rolling back their transactions.
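A minimal sketch of that rollback trick, using SQLite stand-ins for our actual database and a hypothetical `LOAD_MULTIPLIER` knob (the table name, function, and values here are illustrative, not our production code): every copy of the write does the full work of a real request, but only one copy is allowed to commit.

```python
import sqlite3

LOAD_MULTIPLIER = 10  # hypothetical knob; the real value comes from stress-test config

def save_expense(conn, expense_id, amount):
    # Replay the same write LOAD_MULTIPLIER times. The first 9 copies do
    # all the work of a real request but are rolled back at the last
    # possible moment; only the final copy commits.
    for attempt in range(LOAD_MULTIPLIER):
        is_real = (attempt == LOAD_MULTIPLIER - 1)
        cur = conn.cursor()
        cur.execute("BEGIN")
        cur.execute(
            "INSERT INTO expenses (id, amount) VALUES (?, ?)",
            (expense_id, amount),
        )
        if is_real:
            cur.execute("COMMIT")    # only the genuine request persists
        else:
            cur.execute("ROLLBACK")  # fake load exercised the database, then vanishes

# isolation_level=None gives us manual BEGIN/COMMIT/ROLLBACK control
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE expenses (id TEXT PRIMARY KEY, amount REAL)")
save_expense(conn, "exp-123", 42.50)
print(conn.execute("SELECT COUNT(*) FROM expenses").fetchone()[0])  # prints 1
```

The rollback is what makes this safe to run against production: the database pays the real cost of parsing, locking, and writing, but customer data is untouched by the 9 fakes.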

A stress test typically begins at a small multiplier like 4x. Every 5-10 minutes, we increase it (e.g., 4x -> 8x -> 12x) until we determine that the system has been strained as much as we can afford without actually taking down the site. To make that call, we look at the failure rate of API requests, how long API requests are taking, how many writes per second are hitting the database, and so on. At that point, the stress test is finished and all systems are rapidly restored to their original state by immediately reverting the client load multiplier to 1x.
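The ramp-and-revert loop can be sketched like this. Everything here is an assumption for illustration: the threshold values, the `metrics` object, and the `set_multiplier` hook stand in for our real monitoring dashboards and client configuration, and in practice the "strained enough" call is made by the engineers watching, not by code.

```python
import time

# Hypothetical health thresholds; the real limits are judgment calls
# made by the engineers watching the dashboards.
MAX_API_FAILURE_RATE = 0.01   # 1% of API requests failing
MAX_API_LATENCY_MS = 500
MAX_DB_WRITES_PER_SEC = 20000

def run_stress_test(metrics, set_multiplier, step_minutes=5):
    """Ramp the client load multiplier until the system is strained,
    then immediately snap back to 1x.

    `metrics` and `set_multiplier` are hypothetical hooks into
    monitoring and client configuration."""
    multiplier = 4
    try:
        while True:
            set_multiplier(multiplier)
            time.sleep(step_minutes * 60)   # let the new load level settle
            strained = (
                metrics.api_failure_rate() > MAX_API_FAILURE_RATE
                or metrics.api_latency_ms() > MAX_API_LATENCY_MS
                or metrics.db_writes_per_sec() > MAX_DB_WRITES_PER_SEC
            )
            if strained:
                break
            multiplier += 4   # e.g. 4x -> 8x -> 12x
    finally:
        set_multiplier(1)     # restore normal load no matter what happened
```

The `finally` block is the important part: whether the test ends cleanly, is aborted, or the tooling itself crashes, the multiplier always returns to 1x.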

We aim to complete a stress test at least once every month. During them, we watch our systems to see where bottlenecks form and where things eventually break. We then focus our engineering efforts on improving the bottlenecks or fixing the failures, rinse, and repeat. This concentrates our efforts on the weakest part of the system and forces us to have quick recovery procedures in place, so the impact on our customers of any real traffic spike is minimized.

Rather than keeping the details of stress tests in the hands of our relatively small team of site reliability engineers, we also ask for volunteers from our full-stack software engineering pool. Those volunteers run the stress tests, respond to system alerts, and help with system recovery. This gives us a large group of people across all timezones that can address and respond to system events whenever they occur.

Stress testing is usually done on the second Monday of every month, so it’s possible you’ve noticed that things get a bit sluggish on that Monday. Sorry, not sorry! 

We also run stress tests whenever we deploy new technologies, which can happen at any time, to ensure those technologies are scalable and stable. Our stress tests allow our systems to handle large spikes in traffic and growth in our user base with little to no strain. This lets us dream big when it comes to marketing and growth without being limited by the capacity of our systems.