
How we improved our API response time by 95%


MaxCDN has always used a Perl-based system for provisioning zones to various Points of Presence (POPs) throughout the cache network.

The current system started to creak as our client base grew: the provisioning happened on a single thread, and blocked on I/O operations.


Creating a new node is primarily I/O bound (as data and settings are set up on disk), with relatively light activity for the CPU.

On average, new zones took about 10 seconds to provision. Not bad, right -- why change?

Unfortunately, because the requests were synchronous, they could pile up, one behind the other. In some cases a newly issued provisioning request could take up to 5 minutes to complete -- not a great experience for the end user sitting at the other end of the control panel.

Seeing the delay, a user might suspect an error, refresh the page, enter the zone details again, and kick off another provisioning request that also gets stuck in the queue. You can see where this is going.

We decided to move provisioning to an API-driven service, and had to decide between two implementation languages:

  • Go, the server-side language from Google
  • NodeJS, an asynchronous JavaScript runtime

We built prototypes in both languages, and decided on NodeJS:

  • NodeJS is asynchronous-by-default, which suited the problem domain. Provisioning is more like “start the job, let me know when you’re done” than a traditional C-style program that’s CPU-bound and needs low-level efficiency.
  • NodeJS acts as an HTTP-based service, so exposing the API was trivial (see the sketch below)
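
For a sense of how little scaffolding that takes, here is a minimal sketch of a NodeJS HTTP service using only the standard library. The route and response shape are hypothetical illustrations, not our actual provisioning code:

    // Minimal sketch of a NodeJS HTTP service (hypothetical route).
    var http = require('http');

    http.createServer(function (req, res) {
      if (req.method === 'POST' && req.url === '/zones') {
        // Kick off provisioning asynchronously, then respond right away.
        res.writeHead(202, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ status: 'provisioning' }));
      } else {
        res.writeHead(404);
        res.end();
      }
    }).listen(8080);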

Getting into a tool’s headspace and internalizing its assumptions helps you pick the right one. NodeJS assumes services will be non-blocking/event-driven and HTTP-accessible, which fit our scenario perfectly.

So, did the new approach work?


The new NodeJS architecture resulted in a staggering 95% reduction in processing time: requests went from 7.5 seconds to under a second. Additionally, requests don’t get stuck in a queue, where one slow item can block the others from being completed.

Now, when users make a provisioning request in “real time” (whether in the Control Panel or the API), they get a response in real time. Users see quick feedback, and everything is looking all right in the API kingdom.

Architecture Details

Here are a few details on the new architecture, and tips on how to apply similar changes to your own system.


  • Be asynchronous. The major gains came from not blocking on filesystem I/O as incoming requests arrived. Again, choosing NodeJS meant we had this architecture strategy out of the box. Having several simultaneous I/O operations queued lets the operating system figure out how to allocate resources (its speciality) rather than the programmer. Fire off the requests and let the OS sort ’em out (see the first sketch after this list).
  • The fastest code is no code. As we rebuilt the API, we noticed the previous provisioning system ran a configuration check against every zone on a server, which could take anywhere from 1 to 15 seconds. The new API checks only the configuration of the zone being provisioned, which usually completes in under 250ms (second sketch below). When a legacy system is being redesigned, question the assumptions that may no longer apply.
  • Be even more asynchronous. The original API performed a synchronous Nginx reload after provisioning a zone, which often took 30 seconds or longer. Important as it is, this step shouldn’t block the response to the user (or API client) that a new zone has been created, or block subsequent requests to adjust the zone. With the new API, an independent worker reloads Nginx configurations based on zone modifications (third sketch below). It’s like ordering a product online: don’t pause the purchase process until the product has shipped. Confirm the order has been created, let the customer cancel or modify shipping information, and handle the remaining steps behind the scenes. In our case, the zone provision happens instantly, and you can see the result in your control panel or API. Behind the scenes, the zone will be serving traffic within a minute.
  • What gets measured gets improved. How do you know which parts of the workflow need improvement? Measure them. With New Relic in place, we have graphs of our API performance and can directly see if a server or zone is causing trouble, and the impact of our changes. There’s no comparison between a real-time performance graph and “Strange, the site seems slow, I should tail the logs”.
  • Handle failures gracefully. Moving to an asynchronous workflow gives you a chance to re-examine failure scenarios. Our earlier API was overly optimistic about operations like database updates, and might return a successful response when a silent failure had occurred. It also reported errors like an Nginx reload failure inline, as part of an individual provisioning response. The new API sends a global alert if the Nginx reload fails, since that can impact several zones, not just the request that triggered the reload.
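
First sketch, for the asynchrony point above: fire off several filesystem writes at once and let the OS schedule them, instead of blocking on each in turn. The file names and contents here are hypothetical:

    // Sketch: queue simultaneous I/O and let the OS allocate resources.
    var fs = require('fs');

    // Hypothetical zone files; a blocking design would write these one
    // at a time, stalling the request (and everyone queued behind it).
    var files = {
      '/tmp/zone.conf': 'server { ... }',
      '/tmp/ssl.conf': 'ssl_certificate ...;'
    };

    var pending = Object.keys(files).length;
    Object.keys(files).forEach(function (path) {
      fs.writeFile(path, files[path], function (err) {
        if (err) throw err;
        if (--pending === 0) console.log('all writes finished');
      });
    });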
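
Second sketch, for the “no code” point. The helpers are hypothetical stand-ins; the shape of the win is checking one zone instead of every zone on the server:

    // Hypothetical stand-in for validating a zone's on-disk config.
    function checkConfig(zone) { /* validate one zone's config */ }

    // Legacy path: re-check every zone on the server after each
    // provision. Cost grows with the server -- 1 to 15 seconds.
    function legacyCheck(allZones) {
      allZones.forEach(checkConfig);
    }

    // New path: check only the zone just provisioned -- under 250ms.
    function newCheck(zone) {
      checkConfig(zone);
    }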
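
Third sketch, combining the worker and failure-handling points: reload Nginx outside the request path and alert globally when the reload fails. The flag, interval, and alert hook are hypothetical, not our production worker:

    // Sketch: an independent worker reloads Nginx outside requests.
    var exec = require('child_process').exec;

    var dirty = false; // flipped whenever a zone is modified

    function markZoneModified() {
      dirty = true; // request handlers call this and return immediately
    }

    // Reload at most once per interval, covering however many zone
    // changes accumulated in the meantime.
    setInterval(function () {
      if (!dirty) return;
      dirty = false;
      exec('nginx -s reload', function (err) {
        // A failed reload affects many zones, so alert globally
        // instead of failing a single provisioning response.
        if (err) sendGlobalAlert(err);
      });
    }, 5000);

    function sendGlobalAlert(err) {
      console.error('ALERT: nginx reload failed: ' + err.message);
    }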

We’re really thrilled with the performance increases, and hope you can apply these lessons to your own services.

  • http://dlitvakb.github.io David Litvak Bruno

    Nice!! Great to see this finally completed!! It was such an amazing effort!!

    • http://blog.justindorfman.com jdorfman

      <3

  • David Henzel

    You guys are on fire!!! Devteam ++

    • teslafanatic

      Or -1 for the original Perl implementation…which gets you back to 0.

      • http://www.MaxCDN.com/ Chris Ueland / MaxCDN

        FWIW I got a chuckle out of this. Thanks :-) More posts coming soon.

  • matt

    Weird, you built two prototypes but based your decision on theory. Any benchmarks of the two prototypes that could show your decision was the right one?

    • Philippe Modard

      I agree, I would be interested to see numbers for both languages.

      • tomasdev

        I think the prototypes might not have been tested but rather they chose NodeJS for its architecture… Could be really expensive to test prototypes out in production.

    • http://www.maxcdn.com Taylor Dondich

      Hi Matt! Great question. There was prototype code in both Go and Node.js. The code written in Go was not fully tested in a “production” environment because we made our decision on quite a few factors early on.

      Go is fantastic for fast development cycles and CPU-bound operations. However, there are places where Go just isn’t the right choice, and this particular scenario is one of them.

      We use Go in various parts of our infrastructure where the piece is small and meant to process lots of information. However, provisioning zones is not so much a CPU-bound operation as one bound to external operations. We have to wait for Nginx to do various things before control comes back to us. We’re not crunching numbers; we’re writing our provisioning information to data storage. That type of operation is bound to external factors.

      Go is also a language that doesn’t have massive adoption yet. There are a lot of third-party vendors whose technology we use and whose SDKs we rely on to interact, and many of those vendors may not have written a Golang library. Sure, there are bindings to other languages; however, that’s a small hoop to jump through. Is our provisioning dependent on such libraries right now? Not so much. But you have to keep an eye on the long-term future.

      We also isolate our backend systems at various levels and have them communicate via REST APIs. You can certainly write a REST frontend in Go; however, it’s not the ideal case for Go, in our opinion. There’s a fair bit more scaffolding to put in place, which means more code to maintain. There are some great libraries out there for this (check out https://github.com/ant0ine/go-json-rest); however, we have to weigh the amount of code we need to write to support the environment we run in against the amount of code that focuses on the core problem we are trying to solve.

      Also, it’s pretty difficult to find rock star Go engineers to hire. We have a couple of amazing ones here on our team, and we only hire the best. By the way, if you ARE a Go rock star, we want to talk to you. Reach out to me.

      So, what we really wanted to find was a replacement language that is quick to develop in, does well with asynchronous, I/O-bound, external operations, and is easy for our existing engineering team (as well as future members) to maintain.

      When deciding what platform or language to use to solve a customer-facing problem, you sometimes can’t make the decision on numbers alone. In the end, Node.js had the right balance of features: it gives us the ability to rapidly develop I/O-bound solutions, and it has the support and maintainability to ensure we can keep developing the best customer solution possible for *this* specific use case.
