It all started on a Thursday evening. After months of hard work, we shipped the first release of the brand-new version of one of our finance-related products, built with open source, event-driven microservices, microfrontends, cloud, HCD (human-centered design), agile and DevOps as key principles.
With this effort, we are gradually replacing an actively used Java monolith launched in 2012. For the first release we decided to include some basic non-core features, together with complex changes to the authentication flows that allow a seamless single sign-on experience when moving between the monolith and the cloud product.
Since a monolith deployment was needed for this release, production artifact deployment was planned for 7 PM to avoid impacting end users. At around 6 PM the first issue was discovered: a typo in a URL defined in one of the monolith's properties files. Correcting that value would have required generating a new EAR file and postponing the release because of the approval process in place for that. Luckily for us, the mistyped URL pointed to an entry in an API gateway that had self-service support, so we were able to quickly create a temporary route matching the mistyped URL and avoid any further impact.
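As a sketch of that workaround (all paths here are invented, since the real URLs aren't in the story), the temporary route simply aliases the mistyped path to the entry the gateway already serves:

```python
# Hypothetical sketch of the temporary gateway route: alias the mistyped
# path to the route that was actually intended, leaving every other path
# untouched. The paths below are made up for illustration.

MISTYPED_PATH = "/acounts/v1/balance"   # the typo shipped in the properties file
INTENDED_PATH = "/accounts/v1/balance"  # the entry the gateway already exposes

def resolve_route(path: str) -> str:
    """Map an incoming path to its upstream route, patching the known typo."""
    temporary_aliases = {MISTYPED_PATH: INTENDED_PATH}
    return temporary_aliases.get(path, path)
```

In a real gateway this would be a route or rewrite rule created through the self-service console rather than application code; the point is that the fix lives in the gateway, not in a rebuilt EAR.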
At around 9 PM the deployment was completed, and smoke tests brought no bad news. But the following morning, things changed. One of the basic non-core features we were taking to production is a user creation request screen that triggers an email (through a monolith API) to a group of people who analyze the request and decide on it. Well, that group reported they were not receiving any requests. That was unusual and an indication that the new screen was probably not working as expected.
We started troubleshooting by analyzing the network traffic from the client side and quickly realized that although the email was not arriving as expected, the request was returning without errors to end users, giving the wrong impression that it was being processed correctly. From there we would have to analyze API gateway and monolith logs, and maybe even take action at the email server. But first we wanted to prevent further requests from being submitted until we had fixed the issue. So we made use of one of the benefits of the new product's development model and quickly created and deployed a static maintenance screen for just this function, without impacting any other feature.
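The per-function maintenance switch can be pictured with a minimal routing sketch (the path and message below are invented for illustration): only the affected path gets the static page, while everything else keeps serving normally.

```python
# Illustrative sketch: take a single feature offline behind a static
# maintenance page while all other routes keep working. Paths are hypothetical.

MAINTENANCE_PATHS = {"/user-creation-request"}
MAINTENANCE_HTML = "<html><body>This feature is temporarily unavailable.</body></html>"

def handle(path: str) -> tuple[int, str]:
    """Return (status, body); 503 with the static page for the disabled feature."""
    if path in MAINTENANCE_PATHS:
        return 503, MAINTENANCE_HTML
    return 200, f"normal response for {path}"
```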
Afterward, we decided to explore the API gateway monitoring capabilities and, once again, this piece of software saved us. We were able to:
- See traces of past requests
- Identify the root cause of the issue: an API misconfiguration causing requests to be forwarded to an unintended proxy, which was returning a login form's HTML as the response body
- Save the non-arriving user creation requests for reprocessing.
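The misrouting was visible in the traces because an API that should return JSON was answering with a login page. A rough sketch of that check, with invented field names, could look like:

```python
# Sketch (field names are hypothetical) of scanning gateway traces for
# responses that look like HTML where the API contract promises JSON,
# and queuing the original requests for later reprocessing.

failed_for_replay = []

def looks_misrouted(content_type: str, body: str) -> bool:
    """An HTML body on a JSON API strongly suggests a proxy misconfiguration."""
    return "text/html" in content_type or body.lstrip().lower().startswith("<!doctype html")

def inspect_trace(trace: dict) -> None:
    if looks_misrouted(trace["response_content_type"], trace["response_body"]):
        failed_for_replay.append(trace["request"])
```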
Third (and last) issue
By noon, a third unexpected problem appeared. There were reports of failed logins with the new authentication flows, and members of our dev teams were also receiving alarms from our centralized logging solution about HTTP errors when contacting the identity provider. We ran some searches from the log viewer web console, performed an analysis (1 out of 3 login attempts was failing, only in production, for all kinds of users and from all the app server cluster's nodes) and then shared that information with the identity management team. They detected that one of the identity server cluster's nodes was down, causing login attempts to fail whenever a flow started on a healthy node but continued on the failing one.
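A toy model helps explain the roughly 1-in-3 failure rate (this is my reconstruction, not the identity team's exact topology): assume the first hop is health-checked, so flows always start on a healthy node, but the continuation of the flow can land on any of three nodes, including the dead one.

```python
import random

# Toy model of the outage: a 3-node identity cluster where one node is down.
# The first hop goes through a health check, so it always lands on a healthy
# node; the continuation of the flow does not, so it can hit the dead node.
ALL_NODES = ["idp-1", "idp-2", "idp-3"]
HEALTHY = ["idp-1", "idp-2"]  # idp-3 is down

def login_succeeds(rng: random.Random) -> bool:
    rng.choice(HEALTHY)                   # first hop: always a healthy node
    continuation = rng.choice(ALL_NODES)  # second hop: any node, checked or not
    return continuation in HEALTHY

rng = random.Random(42)
trials = 30_000
failure_rate = sum(not login_succeeds(rng) for _ in range(trials)) / trials
# failure_rate comes out close to 1/3
```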
As the days went by, we ended up with different thoughts and questions in our minds. Reflecting on past years' experiences, we had many reasons to be happy and thankful with regard to operations and infrastructure: we were able to quickly take action ourselves (no black boxes) and avoid cumbersome release processes, we had an API gateway product built with developers' needs in focus, and a robust centralized logging solution with valuable log statements added in the application code.
On the other hand, we regretted spending hours fixing issues caused by typos and misconfigurations. Why didn't we detect them earlier? How can we avoid suffering like this in the future? Maybe environment configuration automation could help, as well as externalizing the monolith's properties, avoiding monolith dependencies where possible, and not underestimating non-coding configuration tasks during our sprints.
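Externalizing a property is the simplest of those fixes to sketch (the variable name and URL below are hypothetical): read it from the environment, so correcting a typo becomes a configuration change instead of a new EAR build and an approval cycle.

```python
import os

def api_base_url() -> str:
    # Read the URL from the environment first; fall back to a default only
    # for local development. Both the variable name and the URL are made up.
    return os.environ.get("API_BASE_URL", "https://gateway.example.com/api")
```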
But the most disturbing thoughts we started having were about possible future scenarios. What would we do if the issue were not a simple one in a non-core feature but a complex one in the core of our product? What if we were in peak season, with hundreds or even thousands of transactions failing? Would the same troubleshooting steps and tools be enough? The answer to that last question was definitely no.
You build it, you run it
There’s so much being said about the definition of the term DevOps these days that it may be hard for teams to focus on achieving the right mindset as they get started. For me, after the experience just shared, the “You build it, you run it” quote summarizes the key philosophy our new product’s teams need in order to keep improving in that area. Designing and prioritizing backlogs has to change when you’re supporting your own deployments in production, and that needs to be understood by dev teams, product owners and business stakeholders alike.
In line with my previous words, during the last few months we’ve been putting that quote into practice in different ways:
- Meetings to make sure every artifact’s owner team is clearly defined, and that we all understand the relationship between autonomy, responsibility and ownership. It’s not acceptable to have many teams working on a codebase but none of them taking ownership of maintaining it.
- Conversations with dev teams, product owners and business stakeholders to adjust backlog priorities according to DevOps needs. I wish we had pushed for those earlier, to avoid sprints filled with nothing but DevOps stories.
- Screens for easily querying our product’s event store and retrying events in case of failure. These streamline troubleshooting by removing the need to search logs, open SSH tunnels, run SQL queries and push messages manually.
- Smart, concise messaging notifications on top of basic monitoring, so we can detect potential problems as early as possible. It’s great to identify and solve issues before end users tell you about them!
- Programmatic, automatic remediation flows, so we avoid running manual actions when the number of affected items goes up. Just imagine spending whole days running manual remediation actions for thousands of items one by one and you’ll agree it’s worth investing here. The chances of running into this scenario also increase when you’re working on an async, event-driven application like ours.
- Increased emphasis on valuable code testing for early bug detection and increased confidence in code.
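The event-store retry screens and the automatic remediation flows share one core loop, which can be sketched like this (the function names are mine, not the product's): replay every failed event and keep only the ones that still fail for human attention.

```python
# Hypothetical sketch of automated remediation over an event store: replay
# each failed event and collect the ones that still fail, so manual work is
# limited to the stubborn cases instead of thousands of items.

def remediate(failed_events, replay):
    """Replay each failed event; return the events that still fail."""
    still_failing = []
    for event in failed_events:
        try:
            replay(event)
        except Exception:
            still_failing.append(event)
    return still_failing
```

In the real system `replay` would republish the stored event to its consumer; the loop is the same whether it's triggered from a screen for a handful of events or scheduled over thousands.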
We still have lots of room for improvement, but I’m convinced that we’re moving in the right direction, and that’s in part thanks to the events that took place that Thursday evening.