It was a normal release day and everything in the release pipeline ran smoothly. We did not expect any issues.
As soon as users started using the web app, our Slack API alarm channel was buzzing with API errors returning a 400 status code. This had never happened in the staging environment, so, suspecting a release issue, I started debugging. I found that the errors were caused by a missing key in the payload of an API request. The older version of the app didn’t send the key, while the updated API expected it; only the latest version of the app sent it.
Initially, I suspected that the frontend release pipeline had failed. For the frontend, we served a React SPA through CloudFront, with an S3 bucket as the origin where we stored the app’s build files.
Before checking further, I instinctively hard-refreshed the web app, and the error disappeared. The browser had been showing an older version of the app on first load; after the refresh, it showed the latest version. Confused, I dug deeper and a few hours later found the culprit.
It was heuristic caching of the HTML file by the browser’s HTTP cache.
What is heuristic caching?
Before jumping to heuristic caching, I want to explain how the HTTP cache works in browsers. Browsers store HTTP responses that satisfy certain conditions [1]. A stored response is considered fresh if it has not exceeded its freshness lifetime, and a fresh response is served from the cache without contacting the server.
Freshness lifetime is the time between a response’s generation by the origin server and its expiration time. The origin server can specify the expiration time explicitly by adding an Expires header or a Cache-Control header with the max-age directive to the response. If no explicit expiration time is specified, browsers can calculate one heuristically based on some conditions. This is heuristic caching.
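To make the explicit case concrete, here is a minimal Python sketch of how a cache could derive the freshness lifetime from response headers. The function name and the plain-dict header representation are my own assumptions for illustration, and it only handles the two headers mentioned above; real browser caches handle more directives and edge cases.

```python
import re
from datetime import timedelta
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def explicit_freshness_lifetime(headers: Mapping[str, str]) -> Optional[timedelta]:
    """Return the explicit freshness lifetime, or None if the server set none.

    Assumes header names use their canonical capitalization.
    """
    # max-age takes precedence over Expires when both are present.
    cache_control = headers.get("Cache-Control", "")
    match = re.search(r"(?:^|[,\s])max-age=(\d+)", cache_control, re.IGNORECASE)
    if match:
        return timedelta(seconds=int(match.group(1)))

    # Otherwise fall back to Expires minus the response's Date.
    if "Expires" in headers and "Date" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        generated = parsedate_to_datetime(headers["Date"])
        return expires - generated

    # No explicit expiration: this is where heuristic caching can kick in.
    return None
```

A `None` result corresponds to the situation in this post: the server said nothing about expiration, so the browser was free to pick its own.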
For a browser to heuristically cache a response, certain conditions must hold:
- The response has no Expires header and no Cache-Control directive that sets an explicit expiration (such as max-age).
- The response status code is heuristically cacheable, or a response with a non-heuristically cacheable status code has a Cache-Control: public header.
- The response has a Last-Modified header, which the browser uses to calculate the freshness lifetime of the response.
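The conditions above can be sketched as a small Python check. This is a simplification under stated assumptions: the function and helper names are hypothetical, headers are a plain dict with canonical names, and the status code set is the "cacheable by default" list from RFC 7231 (newer revisions of the spec add a few more codes).

```python
from typing import Mapping, Set

# Status codes defined as cacheable by default in RFC 7231.
HEURISTICALLY_CACHEABLE = {200, 203, 204, 206, 300, 301, 404, 405, 410, 414, 501}

# Directives that give the cache explicit instructions (simplified list).
EXPLICIT_DIRECTIVES = {"max-age", "s-maxage", "no-store", "no-cache"}


def parse_directives(cache_control: str) -> Set[str]:
    """Return the lowercase directive names from a Cache-Control value."""
    return {
        part.strip().split("=")[0].lower()
        for part in cache_control.split(",")
        if part.strip()
    }


def may_cache_heuristically(status: int, headers: Mapping[str, str]) -> bool:
    directives = parse_directives(headers.get("Cache-Control", ""))

    # Condition 1: no explicit expiration or caching instruction.
    if "Expires" in headers or directives & EXPLICIT_DIRECTIVES:
        return False

    # Condition 2: cacheable status code, or explicitly marked public.
    if status not in HEURISTICALLY_CACHEABLE and "public" not in directives:
        return False

    # Condition 3: Last-Modified is needed to compute a lifetime.
    return "Last-Modified" in headers
```

With the headers our index.html was served with (a 200 response carrying only Last-Modified), this check returns `True`, which matches what the browser did.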
Although there is no mandated algorithm for calculating the expiration time in heuristic caching, the HTTP caching specification suggests using 10% of the time since the response was last modified as the freshness lifetime, which browsers commonly follow. For example, if your file was last modified 20 days ago, its heuristic freshness lifetime will be ~2 days.
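The arithmetic behind that example can be written out as a short Python sketch. The function name is hypothetical, and the 10% fraction is the commonly suggested value rather than a guaranteed one, so actual browser behavior may differ.

```python
from datetime import datetime, timedelta, timezone


def heuristic_freshness_lifetime(date: datetime, last_modified: datetime) -> timedelta:
    """Return 10% of the interval between Last-Modified and the response Date."""
    return (date - last_modified) / 10


# A file last modified 20 days before the response was generated.
response_date = datetime(2023, 1, 21, tzinfo=timezone.utc)
last_modified = response_date - timedelta(days=20)

lifetime = heuristic_freshness_lifetime(response_date, last_modified)
print(lifetime)  # 2 days, 0:00:00
```

This also explains why the problem is easy to miss: the longer a file goes without changes, the longer the browser will keep reusing it after the next release.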
Since the response had no explicit expiration header but did have a Last-Modified header, the browser heuristically cached the index.html file. The same headers were returned for the CSS and JS files referenced in the index file, so the browser cached those heuristically as well. A hard refresh makes the browser bypass its cache, so it requests index.html from the server (CloudFront) instead of reading it from the browser’s HTTP cache.
The solution is to not let the browser heuristically calculate the freshness lifetime of the response (the HTML file in this case), which can be done by sending a Cache-Control header (preferred) or an Expires header.
I wanted the browser to validate the last modified time of index.html and then use the cached file if there was no modification. The no-cache directive of the Cache-Control header does exactly that. When a response has Cache-Control: no-cache along with Last-Modified or ETag, on the next request the browser sends an If-Modified-Since header with the Last-Modified value from the previous response, or an If-None-Match header with the value of the ETag header.
As per the HTTP caching specification, the server returns a response with a 304 status code if the cached response is still valid and can be reused. Otherwise, it returns a new response with a 200 status code to signal that the existing response is invalid; the browser then uses and caches the new response and revalidates it on the next request. CloudFront already handles this, so I just had to add a Cache-Control header with the no-cache directive to the response.
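To illustrate the revalidation exchange, here is a simplified Python sketch of the decision a server makes for a conditional request. It is not CloudFront's implementation: the function name is hypothetical, and it assumes a single strong ETag compared by exact string match, whereas real servers also handle ETag lists and weak validators.

```python
from email.utils import parsedate_to_datetime
from typing import Mapping


def revalidation_status(
    request_headers: Mapping[str, str],
    current_etag: str,
    current_last_modified: str,
) -> int:
    """Return 304 if the client's cached copy is still valid, else 200."""
    # If-None-Match takes precedence over If-Modified-Since.
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        return 304 if if_none_match == current_etag else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        unchanged = (
            parsedate_to_datetime(current_last_modified)
            <= parsedate_to_datetime(if_modified_since)
        )
        return 304 if unchanged else 200

    # Unconditional request: always send the full response.
    return 200
```

A 304 carries no body, so the per-request cost of no-cache for index.html is one small round trip, while a new release is picked up on the very next request.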
I made the change, and CloudFront started sending 304 responses when there were no modifications or new releases, and 200 responses after modifications were released. It’s been 3 months and 3 releases since, and the issue has not occurred again. It’s safe to say we have resolved the issue.
As a takeaway, don’t rely on heuristic caching for HTML files; use explicit caching headers to either set their expiration within the cache or use the no-cache directive to validate the cache on every request. Fingerprinted assets like JS/CSS files can safely be cached for a year using the max-age directive. Every new build generates new fingerprinted files (most bundlers do this), and the new files referenced inside the HTML will be fetched automatically by the browser. [2]
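As a rough sketch of that policy, a deploy script could pick the Cache-Control value per file before uploading. The function name and the fingerprint pattern (a dot, at least 8 hex characters, then .js or .css) are assumptions for illustration; adjust the pattern to whatever your bundler actually emits.

```python
import re

# Hypothetical fingerprint pattern, e.g. "main.3f2a9c1b.js".
FINGERPRINTED = re.compile(r"\.[0-9a-f]{8,}\.(?:js|css)$")

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31536000


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for a build file based on its name."""
    if FINGERPRINTED.search(path):
        # Content never changes under this name, so cache it for a year.
        return f"public, max-age={ONE_YEAR_SECONDS}"
    # index.html and other non-fingerprinted files: revalidate every time.
    return "no-cache"


print(cache_control_for("index.html"))                   # no-cache
print(cache_control_for("static/js/main.3f2a9c1b.js"))   # public, max-age=31536000
```

If the build files are uploaded to S3 with boto3, one option is to pass this value as the `CacheControl` argument of `put_object`, so that S3 and CloudFront return it with the file.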