How Addressing Problematic Alerts Opened Many Doors
When I first started at a former company, I sat in a meeting with my entire team as they went through the on-call alerts from that week. The first thing I noticed was that there were a lot of them…like over 800 of them. As the new guy, this made me a bit nervous. How much time was I going to have to spend just putting out fires when I hit the on-call rotation? How could things have gotten to this state and what was being done about it?
Realizing there was an opportunity hidden in this chaos, I started diving into our processes and roadmap with my new boss. Like me, he was relatively new and was also concerned about the current state of things. We had some pretty grand ambitions, but the issues that constantly popped up meant the team didn’t have the capacity to start fixing things, and I could tell right off the bat that they were a bit demoralized.
One of the alerts that regularly showed up in the weekly alert tracking system was about an important file failing to deliver to one of our key partners. Without these files, this partner had no idea what content they were supposed to display or where to find it. There were some backups in place, but once we hit a critical point, revenue would be lost from that avenue and our relationship with this critical partner would deteriorate. The alert also pulled engineers away from other important tasks, which gets frustrating when you regularly have to stop what you are doing to chase a problem that occurs so frequently.
Evidently, this issue had persisted for a really long time, so my job became one of finding out what was going on and getting a solution in place.
Discovery
I was still quite new when I began this process, so I had to really understand the problem and the relevant history before recommending a solution. As I set up and sat through many meetings across the organization and pored over the related code, a picture began to form of the existing workflow and where the core issue was.
I won’t go into detail on all of the things I uncovered, but just know that, over the years, things had spiraled a bit out of control, which is why the team had been seeing hundreds of alerts per week on this topic. Ultimately, the code needed a bit of an overhaul by the team that owned it so that the system wasn’t dependent on numerous upstream conditions. Unfortunately, they did not have the time, desire, or capacity to make this happen.
The New Approach
With the reality of our situation now in front of me, a new question worked itself to the forefront: “If we can’t solve the real issue, what can we do to ease the burden on the rest of the team?” Figuring out how to stop the alerts seemed like the obvious answer, but as with anything, it was a bit more complicated. We couldn’t just turn off the alerts. While they were too noisy, they did sometimes mean a real problem existed. That meant we had to build a software solution somehow.
After a bit more discovery, I uncovered the fact that a file being read by the problematic application, the EPG (electronic program guide) app, was also being written and rewritten a number of times per day, depending on the schedules of our operations team. It turns out that a race condition of sorts existed: when a read landed in the middle of a rewrite, the EPG application read in a corrupt JSON file and bailed on its task. So all we had to do was ensure that the EPG app had up-to-date and non-corrupt data to work with.
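For readers who have not run into this kind of race before, here is a minimal Python sketch of one common way to keep a reader from ever seeing a half-written JSON file: write to a temporary file, then swap it into place. This is not what we ended up building, just an illustration of the failure mode and a standard guard against it. The function names are hypothetical.

```python
import json
import os
import tempfile


def write_json_atomically(path, payload):
    """Write payload to path so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # On POSIX systems the rename is atomic: readers get either the
        # old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_json_or_none(path):
    """Return the parsed JSON, or None if the file is missing or corrupt."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

A reader that gets `None` back can retry or fall back to its last good copy instead of bailing on its task.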
A Solution and the Lasting Impacts
The solution was incredibly simple: I built a Dockerized app that rebuilt the file before each run of the cron job that delivered the EPG. It took a few days to build and deploy, and it immediately cut the corrupt JSON alerts down to zero. Without a thorough discovery process, however, this solution would never have surfaced.
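The shape of that fix can be sketched in a few lines of Python. This is an illustration under assumptions, not the code we shipped: `build_payload`, the file path, and `delivery_command` are all hypothetical stand-ins for whatever produces the data and whatever the cron job actually runs.

```python
import json
import subprocess
import sys


def rebuild_and_validate(build_payload, path):
    """Regenerate the input file, then confirm that it parses as JSON."""
    payload = build_payload()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    with open(path, encoding="utf-8") as handle:
        # Raises json.JSONDecodeError if the file on disk is corrupt.
        json.load(handle)


def run_delivery(build_payload, path, delivery_command):
    """Rebuild the file first, and only then start the delivery job.

    Returns the exit code of the delivery command.
    """
    rebuild_and_validate(build_payload, path)
    completed = subprocess.run(delivery_command, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical wiring: a trivial payload and a no-op delivery command.
    code = run_delivery(
        lambda: {"channels": []},
        "epg_input.json",
        [sys.executable, "-c", "pass"],
    )
    sys.exit(code)
```

The point of the ordering is that the delivery job always starts from a file that was just written and just checked, regardless of what else touched it earlier in the day.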
A few solutions had been presented in the past in an attempt to quickly solve this problem. Examples included counting the number of alerts and ignoring them until something like 10 occurrences. This would save a bit of time but didn’t actually solve anything. There was an attempt to shift cron schedules around as well, but as the company scaled, the times that these jobs ran began to overlap with operations schedules, so again the issue persisted.
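To make the first of those stopgaps concrete, count-based suppression looks roughly like the Python sketch below. The class name and alert key are hypothetical, and the threshold of 10 is the "something like 10" mentioned above.

```python
from collections import Counter


class ThresholdSuppressor:
    """Stay quiet about an alert until it has fired `threshold` times."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.counts = Counter()

    def should_page(self, alert_key):
        # Every occurrence is still counted; only the paging is delayed.
        self.counts[alert_key] += 1
        return self.counts[alert_key] >= self.threshold


suppressor = ThresholdSuppressor(threshold=10)
decisions = [suppressor.should_page("corrupt-json") for _ in range(10)]
# The first nine occurrences are ignored; the tenth pages someone.
```

Notice that nothing here touches the file or the job. Every one of those ignored occurrences is still a failed run.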
The problem with these alternative solutions was that they did not consider the real issue: the alerts were taking time away from our engineers and were a thorn in our sides. Each of them still required someone to keep adjusting the workaround, instead of freeing up time for everyone to do work they found more invigorating and valuable.
Measurable Results
Once implemented, on-call became a much simpler week to manage. On-call engineers weren’t forced to drop everything they had been working on just to put out fires. Don’t get me wrong, this still happened, but at a greatly reduced rate. If you think about all of the time spent simply acknowledging the alerts and discussing the impacts of these issues in a group setting, you can calculate that Wurl had been spending thousands of dollars per week on an issue that had a pretty simple solution. More importantly, not having this problem hanging over the team’s head meant that our engineers could spend their time on roadmap items and were generally happier throughout. While it is challenging to directly quantify this value, let’s assume that an engaged, happy employee can easily spend six hours per day on focused work. When that employee is down, frustrated, and demoralized, you can easily cut that amount in half, or worse. That is quite an expensive loss of productivity. The alternative is more work and, most importantly, better work.
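If you want to run that kind of back-of-the-envelope estimate for your own team, a sketch like the following is enough. Every input is something you have to supply yourself: the function names are hypothetical, and I am deliberately not plugging in real figures here.

```python
def weekly_alert_cost(alerts_per_week, minutes_per_alert, hourly_cost):
    """Cost of the time spent acknowledging and discussing alerts in a week."""
    return alerts_per_week * minutes_per_alert * hourly_cost / 60


def weekly_focus_hours_lost(engineers, focused_hours_per_day=6.0,
                            remaining_fraction=0.5, workdays=5):
    """Focused hours lost per week when morale cuts productivity.

    The defaults mirror the assumption in the text: six focused hours
    per day, cut in half for a demoralized engineer.
    """
    lost_per_day = focused_hours_per_day * (1 - remaining_fraction)
    return engineers * lost_per_day * workdays
```

Multiply the second result by your own hourly cost and add it to the first to get a rough weekly figure. It will not be precise, but it is usually enough to start the conversation.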
Takeaways
Often a simple solution exists that can be very impactful. If planning focuses on making the lives of your teams and employees easier while adding value to the business, you will end up with a solution whose impact is far larger than the effort it took. When approaching problems at work, figure out what is frustrating people the most and try to solve for that. This will allow you to uncover some really clever and impactful solutions.