Skip to content

Latest commit

 

History

History
365 lines (256 loc) · 16 KB

06-connection-problems.adoc

File metadata and controls

365 lines (256 loc) · 16 KB

Connection Problems

The "happy path" is the way things should work in an ideal world, where nothing goes wrong. It would be like me thinking "I can just ride my bike to work in New York City without two taxi drivers, a truck, a family unit of tourists, cops on horses, a dog walker, and a raccoon all trying to kill me."

When API client developers start out they are full of optimism, and spend most of their time focused on the happy path as they focus on completing all the functionality they’ve been asked to do.

Over time API client developers step on enough rakes of their own making that they become hardened, almost paranoid, and it’s not unreasonable to feel this way.

There are so many things that can go wrong whenever you "go over the wire" that it’s best to assume an API interaction is not going to work, then be pleasantly surprised when it does. Your code should reflect this mindset.

What can go wrong?!

Let’s look at a simple bit of code thats likely to be in a codebase near you already.

A super simple API interaction: finding out the various abilities for a Pokémon via the PokéAPI.
link:code/ch06-connection-problems/01-happy-path-only.js[role=include]

This is assuming that the server is up, that the request is going to get a response, in a reasonable time-frame (no definition of what that is), and when it does it’s actually talking to the server you thought it would be, and it’s definitely returning JSON!

Assuming server exists, responds in a decent time, and the data is in fact JSON, there’s a few more issues: we’ve presumed the pokemon was definitely found (not typos or missing data), and confidently tied our code to an exact JSON structure with no checks to see if the data exist, and of the types we expect.

Coding API interactions this way may work for long enough to close out your tickets and go home, but you’ll end up with a lot more tickets piling up when various things go wrong.

Intermittent Connections

For a client-server interaction to work, clearly both the client and the server will need to have an Internet connection at some point, but it might not always be there.

Even in situations where there is a strong degree of confidence, things could cut out momentarily, or for a while. How will your application cope if it does?

Connection Vanishes

Mobile users in the city might hop on a subway, or go into an old building with a Superman-proof lead roof.

Mobile users in the countryside might be out in the woods trying to use your app to get directions home, or maybe they’re just trying to send selfies to make their friends jealous.

Laptop users might be in the middle of an important process when their wireless network decides to switch from the coffee shop wifi to the bar next doors Wifi, essentially cutting that connection in half and trashing the API calls. That might take a second, or it might be longer if the new network requires a login screen.

Client Mistakes

Maybe the API has "rate limiting" and decided your client is too chatty.

The API implemented a redirect and the HTTP client doesn’t know to follow it.

Server Mistakes

Their HTTP errors are coming out in HTML because they forgot to catch and return JSON errors.

HTTP status code 500 is popping up from the server or some other network component like a cache server, even though their documentation said they would not ever do a 500 error.

The API deployment wasn’t set up with zero-downtime deployments, and a code push has caused some downtime.

The API is relying on an upstream API which is deploying, or generally having a wobble, and thats caused some temporary instability.

AWS is down again.

Weird Problems

ISP (Internet Service Provider) blocked a domain, or a keyword in the request/response.

Squirrels attacked the data center?

"A frying squirrel took out half of our Santa Clara data center two years back," Christian said, noting squirrels' propensity to interact with electrical equipment, with unfortunate results. If you enter “squirrel outage” in either Google News or Google web search, you’ll find a lengthy record of both recent and historic incidents of squirrels causing local power outages.

— Rich Miller
Surviving Electric Squirrels and UPS Failures, 2012, Data Center Knowledge

Ships trash an undersea cables by dropping anchor right on The Internet?!

Defensive Code

All of these problems are going to cause the happy path to get messy.

Let’s harden our code one step at a time.

Inspiration for these code examples was taken from Umar Hansa’s brilliant article Implement error handling when using the Fetch API.
link:code/ch06-connection-problems/02-catch-fetch-errors.js[role=include]

This little change solves some of these problems. If there is any sort of connection failure from something like a connection refused (no internet, server is down, etc), or a dropped connection (failed part way through), or certificate errors, this should all be caught with the first exception.

Whatever happens, it will log something to the user console, and return early. You could imagine this code doing something clever to update the user interface, but for now we’re keeping it simple.

This is a step in the right direction, but once we’ve eventually got a response there is a lot of other things that can go wrong. What if the response is randomly HTML instead of JSON? Or it’s weirdly invalid JSON?

link:code/ch06-connection-problems/03-catch-json-errors.js[role=include]

Great! Now when the API randomly squirts some unexpected HTML error at you, the function will just return an empty array, and there is an error logged that the developers can go digging into.

Another step in the right direction, but this still assumes we actually get a response in a reasonable timeframe.

What if you’ve been waiting for thirty seconds?

What if you’ve been waiting for two minutes?

We will deep dive into timeouts later on in the book, but a really helpful quick bit of defensive coding you can do, is to make sure your application isn’t spending two minutes doing absolutely nothing for a request that normally takes less than half a second.

link:code/ch06-connection-problems/04-timeout.js[role=include]

Simulating Network Nonsense

Most of the time developing against an API that works just fine means you cannot test these complicated unhappy paths.

To simulate the sort of nonsense you are coding to defend against, take a look at Toxiproxy by Shopify.

Rate Limiting

Another common situation to run into is rate limiting, which is basically the API telling your API client to calm down a bit, and slow down how many requests are being made. The most basic rate limiting strategy is often "clients can only send X requests per second."

Many APIs implement rate limiting to ensure relative stability when unexpected things happen. If for some reason one client causes a spike in traffic, the API has to continue running smoothly for other users instead of crashing. A misbehaving (or malicious script) could be hogging resources, or the API systems could be struggling and they need to cut down the rate limit for "lower priority" traffic. Sometimes it is just because the company providing the API has grown beyond their wildest dreams, and want to charge money for increasing the rate limit for high capacity users.

Often the rate limit will be associated to an API key or access token, so it can be tied to a specific account. Our friends over at Nordic APIs very nicely explain some other rate limiting strategies:

Server rate limits are a good choice as well. By setting rates on specific servers, developers can make sure that common use servers, such as those used to log in, can handle a lot more requests than specialized or seldom used servers, such as data conversion devices.

Finally, the API developer can implement regional data limits, which limit calls by region. This is especially useful when implementing behavior-based limiting; for instance, a developer would expect the number of requests during midnight in North America to be lower than the baseline daytime rate, and any behavior contrary to this without a good reason would suggest questionable activity. By limiting the region for a period of time, this can be prevented.

All fair reasons, but for the client it can be a little pesky.

Throttling Your API Calls

There are a lot of ways to go about throttling your API calls, and it very much depends on where the calls are being made from. One of the hardest things to limit are API calls to a third party being made directly to the client. For example, if your iOS/web/etc clients are making Google Map API calls directly from the application, there is very little you can do to throttle that. You’re just gonna have to pay for the appropriate usage tier for how many users you have.

Other setups can be a little easier. If the rate limited API is being spoken to via some sort of backend process, and you control how many of those processes there are, you can limit often that function is called in the backend code.

For example, if you are hitting an API that allows only 20 requests per second, you could have 1 process that allows 20 requests per second to pass through. If this process is handling things synchronously that might not quite work out, and you might need to have something like 4 processes handling 5 requests per second each, but you get the idea.

If this process was being implemented in NodeJS, you could use Bottleneck.

const Bottleneck = require("bottleneck");

// Never more than 5 requests running at a time.
// Wait at least 1000ms between each request.
const limiter = new Bottleneck({
  maxConcurrent: 5,
  minTime: 1000
});

const fetchPokemon = id => {
  return pokedex.getPokemon(id);
};

limiter.schedule(fetchPokemon, id).then(result => {
  /* ... */
})

Ruby users who are already using tools like Sidekiq can add plugins like Sidekiq::Throttled, or pay for Sidekiq Enterprise, to get rate limiting functionality. Worth every penny in my books.

Every language will have some sort of throttling, job queue limiting, etc. tooling, but you will need to go a step further. Doing your best to avoid hitting rate limits is a good start, but nothing is perfect, and the API might lower its limits for some reason.

Am I Being Rate Limited?

The appropriate HTTP status code for rate limiting has been argued over about as much as "tabs" versus "spaces", but there is a clear winner now; RFC 6585 defines it as HTTP 429.

Lots of cats
Figure 1. http.cat meme for HTTP 429

Some APIs like Twitter’s old API existed for a few years before this standard, and they chose "420 - Enhance Your Calm". Twitter has dropped 420 and got on board with the standard 429. Unfortunately some APIs replicated that and have not yet switched over to using the standard, so you might see either a 429 or this slow copycat.

Cat chewing on a cannabis leaf
Figure 2. http.cat meme for HTTP 420

Google also got a little "creative" with their status code utilization. For a long time were using 403 for their rate limiting, but I don’t know if they are still doing that. Bitbucket are still using 403 in their Server REST API.

Actions are usually "forbidden" if they involve breaching the licensed user limit of the server, or degrading the authenticated user’s permission level. See the individual resource documentation for more details.

— REST Resources Provided By: Bitbucket Server
https://docs.atlassian.com/bitbucket-server/rest/5.12.3/bitbucket-rest.html

GitHub v3 API has a 403 rate limit too:

HTTP/1.1 403 Forbidden
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1377013266
{
   "message": "API rate limit exceeded for xxx.xxx.xxx.xxx. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
   "documentation_url": "https://developer.github.com/v3/#rate-limiting"
}

Getting a 429 (or a 420) is a clear indication that a rate limit has been hit, and a 403 combined with an error code, or maybe some HTTP headers can also be a thing to check for.

Either way, when you’re sure it’s a rate limit error, you can move onto the next step: figuring out how long to wait before trying again.

There are three main ways a server might communicate retry logic to you.

Retry-After Header

The Retry-After header is a handy standard way to communicate "this didn’t work now, but it might work if you retry in <the future>".

Retry-After: <http-date>
Retry-After: <delay-seconds>

The logic for how it works is defined in RFC 6584 (the same RFC that introduced HTTP 429) but basically it might look a bit like this:

HTTP/1.1 429 Too Many Requests
Retry-After: 60
Content-Type: application/json

{
  "error": {
    "message": "API rate limit exceeded for xxx.xxx.xxx.xxx.",
    "link": "https://developer.example.com/#rate-limiting"
  }
}

You might also see a Retry-After showing you an HTTP date:

Retry-After: Sat, 15 April 2023 07:28:00 GMT

Same idea, it’s just saying "please don’t come back before this time".

By checking for these errors, you can catch and retry (or re-queue) requests that have failed. If that is not an option try sleeping for a bit to calm workers down.

Warning
Make sure your sleep does not block your background processes from processing other jobs. This can happen in languages where sleep sleeps the whole process, and that process is running multiple types job on the same thread. Don’t back up your whole system with an overzealous sleep!_

Some HTTP clients like Faraday are aware of Retry-After and use it to power their build in retry logic, but other HTTP clients might need some training.

Proprietary Headers

Some APIs like GitHub v3 use proprietary headers, all beginning with X-RateLimit-. These are not at all standard (you can tell by the X-), and could be very different from whatever API you are working with.

Successful requests with Github here will show how many requests are remaining, so maybe keep an eye on those and try to avoid making requests if the remaining amount on the last response was 0.

$ curl -i https://api.github.com/users/octocat

HTTP/1.1 200 OK
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 56
X-RateLimit-Reset: 1372700873

You can use a shared key (maybe in Redis or similar) to track that, and have it expire on the reset provided in UTC time in X-RateLimit-Reset.

RateLimit Headers (Standard Draft)

The benefit of the proprietary headers is that you get a lot more information to work with, letting you know you’re approaching a limit so you can pre-emptively back off, instead of waiting to stand on that rake then having to respond after being hit round the head.

There’s an IETF RFC draft called RateLimit header fields for HTTP that aims to give you the best of both, and maybe you’ll fun into something that resembles this in the distant future of 2024 or 2025.

RateLimit-Limit: 100
RateLimit-Remaining: 50
RateLimit-Reset: 50

This says there is a limit of 100 requests in the quota, the client has 50 remaining, and it will reset in 50 seconds. Handy!