When I first came to Monsanto, our typical tech stack was Java with Spring/Hibernate and a whole slew of common libraries that made our lives easier. On my team, we used many CPU- and memory-intensive bioinformatics tools to analyze DNA. Our workload was mostly idle, punctuated by random huge spikes of work, so we needed to design systems that could handle large bursts of heavy lifting. Enter the data center and parallel computing.
What went wrong
Our data center at the time promised 1,000+ nodes of compute power, so we designed our workflows to be massively parallel. The first problem was that our workflows are inherently sequential: we take the results of one command, feed them into the next, then aggregate all the results at the end. Since we tend to run up to nearly 100 workflows at a time, the idea was to run commands from different workflows concurrently. Each step was designed as an asynchronous job submitted to the grid, with various throttling mechanisms in place to prevent overloading external dependencies such as NFS. We decided to use Oracle BPM to orchestrate the various commands, which we exposed as CXF SOAP services. In theory, we had a highly parallel system with tons of error handling. In reality, we had a mix and match of so many architectures that, within just a few years, the system became extremely unstable and overloaded.
New Architecture Diagram
With our volume of work only increasing over the years, we were under pressure to redesign this system to handle larger volumes of work faster and more reliably. For our API, we decided Node.js would be a better fit. We can spin up Node.js microservices really fast, and I personally prefer working with the operating system from a dynamically typed language (many of the bioinformatics tools we use don't have APIs, so we drive them from the command line). To handle our large volumes of work, our API submits jobs to Amazon SQS, and small worker applications poll for work. We send only identifiers in our queue messages; the worker retrieves any additional information it needs from an AWS data store.
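As a sketch of that identifier-only message pattern, the producer side might look something like this (the queue URL, job ID, and the `buildJobMessage` helper are hypothetical, not our production code):

```javascript
// Sketch of the "identifiers only" message pattern. The full job record
// lives in an AWS data store; the queue message carries just enough for
// the worker to look it up.
function buildJobMessage(queueUrl, jobId) {
  return {
    QueueUrl: queueUrl,
    // Only the identifier goes on the wire; the worker fetches the rest.
    MessageBody: JSON.stringify({ jobId: jobId }),
  };
}

// In production, these params would go to the AWS SDK, e.g.:
//   new AWS.SQS().sendMessage(buildJobMessage(url, id)).promise()
const params = buildJobMessage(
  'https://sqs.us-east-1.amazonaws.com/123456789012/blast-jobs', // hypothetical
  'job-42'
);
console.log(params.MessageBody); // → {"jobId":"job-42"}
```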
These workers are usually specialized EC2 instances that live in an auto-scaling group with a scaling policy attached. We typically have one of these workers polling multiple queues, one for each command our workflow uses. We install only one set of tools on each EC2 instance so that different tools can scale independently. Using this design, we can spin up additional EC2 instances when a large burst of work comes in, then shut them down when we're finished. Rather than paying for 1,000 nodes in our data center, we pay only for the servers we need, when we need them. We also gained the benefit of having the workers reserved exclusively for our use, rather than competing with other grid clients for grid nodes; during peak usage times on our internal grid, it was not unusual for a job to sit in the queue for an hour or more.
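A stripped-down sketch of that one-queue-per-command polling loop (the queue names, the `receive` stand-in, and the handlers are all illustrative; real SQS `ReceiveMessage` calls would be asynchronous and use long polling):

```javascript
// Sketch of a worker draining several command queues in turn. Each queue
// maps to one bioinformatics command installed on this instance.
function pollOnce(queues, receive, handlers) {
  const handled = [];
  for (const queue of queues) {
    const message = receive(queue); // stand-in for an SQS ReceiveMessage call
    if (message) {
      handlers[queue](message); // run this queue's command on the job
      handled.push(queue);
    }
  }
  return handled;
}

// Example with in-memory stand-ins: two command queues, one with work.
const backlog = { 'align-jobs': [{ jobId: 'job-7' }], 'assemble-jobs': [] };
const receive = (q) => backlog[q].shift();
const ran = [];
const handlers = {
  'align-jobs': (m) => ran.push('align:' + m.jobId),
  'assemble-jobs': (m) => ran.push('assemble:' + m.jobId),
};
console.log(pollOnce(Object.keys(backlog), receive, handlers)); // → [ 'align-jobs' ]
```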
One issue we encountered was the spin-up time of an EC2 instance. From start to finish, our servers took about five minutes to spin up, while the work they performed could take anywhere from seconds to hours. We didn't mind the spin-up time on long-running jobs, but for the small jobs it was unacceptable. After lots of tweaking, we decided the best approach was to take a rather large EC2 instance and spin up a bunch of Docker containers on it. We leave one EC2 instance running all the time and spin up additional instances based on the size of the queue: we scale up in small increments if the queue is slightly backed up, and by large percentages if we are extremely backed up.
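That two-tier scale-up rule could be sketched as something like the following (the thresholds, increment, and growth factor here are illustrative placeholders, not our actual policy numbers):

```javascript
// Sketch of the two-tier scale-up rule: small fixed increments when the
// queue is slightly backed up, a large percentage jump when it is
// extremely backed up, and one instance always kept warm.
function desiredCapacity(current, queueDepth) {
  if (queueDepth > 1000) {
    // Extremely backed up: grow by a large percentage (50% here).
    return Math.ceil(current * 1.5);
  }
  if (queueDepth > 100) {
    // Slightly backed up: add a small fixed increment.
    return current + 2;
  }
  // Keep at least one instance warm to avoid the five-minute spin-up.
  return Math.max(current, 1);
}

console.log(desiredCapacity(10, 5000)); // → 15 (large percentage jump)
console.log(desiredCapacity(10, 500));  // → 12 (small increment)
console.log(desiredCapacity(0, 0));     // → 1  (baseline warm instance)
```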
Our Auto-Scaling policy
During our load testing, we submitted 20,000 jobs to our queue. Using the old data center, it would have taken two days to process that many jobs, and we had no good way to calculate what it cost. With our new AWS infrastructure, we completed the same 20,000 jobs in 90 minutes with $32 of EC2 charges. One of the biggest reasons for the performance difference is that our internal grid nodes were heavily oversubscribed, and the grid architecture itself was a huge performance bottleneck.
Getting great performance is one thing, but when you're submitting high volumes of jobs, errors have a tendency to pop up frequently. The vast majority of errors we see are timeouts and various temporary issues that can be solved with simple retry mechanisms. Using our queue pattern, there is a very easy way to handle this with minimal coding impact. SQS guarantees at-least-once delivery, which you can back with a redrive policy. After we consume a message, we do the work; if it succeeds, we delete the message and call it a day. If an error pops up, the message is not deleted; its visibility timeout eventually expires, and SQS automatically sends it out for delivery again. If it still fails after a certain number of retries, it lands in a dead letter queue for further investigation. In our case this is usually bad data from the submitter that we have to investigate manually, so anything that hits the dead letter queue is something we would have needed to look at anyway.

All of the components in this new design communicate through queues rather than relying directly on service calls. This allows any application to go down without causing an outage in dependent applications: during the downtime, jobs continue to queue up, and consumption resumes when the application is brought back to life.
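A minimal sketch of that consume/delete flow (simplified and synchronous; real SQS calls are asynchronous, and `work` and `deleteMessage` are hypothetical stand-ins for the command execution and the SQS `DeleteMessage` call):

```javascript
// Sketch of the retry-by-not-deleting pattern. A message that fails is
// simply left in the queue: its visibility timeout expires, SQS redelivers
// it, and after the redrive policy's maxReceiveCount it moves to the dead
// letter queue.
function handleMessage(message, work, deleteMessage) {
  try {
    work(message);
    deleteMessage(message); // success: remove the message for good
    return 'done';
  } catch (err) {
    // Failure: leave the message in place so SQS redelivers it later.
    return 'retry';
  }
}

// The retries are capped by a queue redrive policy shaped like this
// (the ARN is hypothetical):
//   { "maxReceiveCount": "5",
//     "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:jobs-dlq" }
```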
Working with parallelization has been an interesting experience for me, and having used multiple approaches, I can really appreciate how much easier it is to design massively parallel systems on AWS. Since we adopted Node.js microservices with SQS, we've had very few infrastructure problems, and we deliver entire workflows at a pace we never would have thought possible in our Java days. We have embraced the queue-in/queue-out pattern in nearly all of our work and enjoy the freedom to scale whenever and wherever we want to.