This post covers the steps that any new launch or high traffic event should go through in order to have the best chance of success. It is aimed at the project management level, so I will try to stay out of the weeds and focus on the high level topics you need to think about. There is a ~18 minute recording at the end of this post, where I presented this topic at DrupalSouth 2019.
Preamble: What could be considered a high traffic event
- Launching a brand new site
- Re-platforming (e.g. moving CMS version or type, or between hosting providers)
- eDM or other marketing event (e.g. AdWords)
- Planned traffic event (e.g. Black Friday)
- Unplanned traffic event (e.g. news and media site)
Step 1) Ensure you have some basic Drupal configuration in place
- Disable known problem child modules: dblog, devel, statistics, radioactivity, page_cache
- Enable dynamic_page_cache (if you have authenticated traffic)
- Set minimum cache lifetime to something sensible
- JS and CSS aggregation enabled
- Automate these checks with Drutiny
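To make the checklist above a bit more concrete, here is a minimal sketch of what an automated check could look like. This is not how Drutiny works internally. The `audit` function, the 300 second threshold and the shape of the `performance` dict (assumed to mirror Drupal's `system.performance` config) are all assumptions made for illustration.

```python
# Modules the checklist above suggests disabling in production.
PROBLEM_MODULES = {"dblog", "devel", "statistics", "radioactivity", "page_cache"}


def audit(enabled_modules, performance, has_authenticated_traffic=False, min_max_age=300):
    """Return a list of findings; an empty list means every check passed.

    enabled_modules: iterable of module machine names (hypothetical input).
    performance: dict assumed to be shaped like Drupal's system.performance config.
    min_max_age: assumed threshold, in seconds, for a "sensible" cache lifetime.
    """
    enabled = set(enabled_modules)
    findings = []

    # Flag any problem child module that is still switched on.
    for module in sorted(enabled & PROBLEM_MODULES):
        findings.append(f"disable module: {module}")

    # dynamic_page_cache only matters when logged-in users hit the site.
    if has_authenticated_traffic and "dynamic_page_cache" not in enabled:
        findings.append("enable module: dynamic_page_cache")

    max_age = performance.get("cache", {}).get("page", {}).get("max_age", 0)
    if max_age < min_max_age:
        findings.append(f"raise cache lifetime: {max_age}s is below {min_max_age}s")

    # CSS and JS aggregation are separate switches, so check both.
    for asset in ("css", "js"):
        if not performance.get(asset, {}).get("preprocess", False):
            findings.append(f"enable {asset} aggregation")

    return findings
```

Running something like this in a deployment pipeline means a forgotten devel module gets caught before launch day, not during it.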
Step 2) Content Delivery Network (CDN)
Additional insurance against a lot of traffic is distributing your cached content to all corners of the globe.
Tiered caching should be used to ensure the highest offload rate. Most CDN providers will support this at a given price point.
Step 3) Cache tuning and minimising origin requests
Every request that bypasses your CDN layer adds load to the platform. In order to have the best chance of surviving a high traffic event, origin traffic needs to be carefully considered and reduced where possible.
Requests to origin that are often overlooked
- 404s
- Marketing based parameters (e.g. utm_campaign)
- Redirects (especially if re-platforming)
- WAF to block silly requests (e.g. WordPress URLs like wp-login.php)
If you are interested in WAF tuning, you should check out my talk last year on using Cloudflare to secure your Drupal site.
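Marketing parameters are worth a quick illustration, because each unique query string is normally treated as a separate cache entry. The sketch below shows the general idea of stripping them before computing a cache key. The `cache_key` function is hypothetical, and the `gclid` and `fbclid` names are my own additions to the `utm_` example. Your CDN will have its own way of configuring this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters assumed to affect analytics only, never the rendered page.
MARKETING_PREFIXES = ("utm_",)
MARKETING_PARAMS = {"gclid", "fbclid"}


def cache_key(url):
    """Return the URL with marketing parameters removed and the rest sorted."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith(MARKETING_PREFIXES) and name not in MARKETING_PARAMS
    ]
    # Sorting means ?a=1&b=2 and ?b=2&a=1 share one cache entry.
    kept.sort()
    # The fragment is dropped because it is never sent to the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

With this in place, a thousand eDM links that differ only in their campaign tags collapse into a single cached page.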
Step 4) Load testing
If you are building a new site, or are expecting a substantially different traffic profile than what you have currently, then you should look to load test the system.
- Production hardware replica (scaled up if appropriate)
- Emulate expected user behaviour, use existing analytics, or expected flows
- Emulate what the browser would be doing (download all assets, including any HTTP 404s)
- Ensure complex tasks are also simulated at the same time (e.g. editorial, searching, form submissions, feeds ingestions)
At the end of this task (which you may need to run several times), you should have the confidence that you can handle the traffic expected.
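One way to emulate expected user behaviour from existing analytics is to request pages in proportion to how often they are actually viewed. Here is a minimal sketch of that idea. The `build_request_plan` function and the page view numbers are made up for illustration, and a real load testing tool will have its own way of expressing this.

```python
import random


def build_request_plan(page_views, total_requests, seed=None):
    """Pick paths for a load test in proportion to their observed page views.

    page_views: dict mapping a path to its page view count (hypothetical data).
    seed: optional seed so that a test run can be repeated exactly.
    """
    rng = random.Random(seed)
    paths = list(page_views)
    weights = [page_views[path] for path in paths]
    # Weighted sampling with replacement: popular pages appear more often.
    return rng.choices(paths, weights=weights, k=total_requests)


# Example: include a known 404 so the test covers it too.
plan = build_request_plan({"/": 900, "/search": 90, "/missing.png": 10}, 1000, seed=1)
```

Fixing the seed is a small design choice that pays off when you need to compare two runs after a configuration change.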
Step 5) Hardware (auto) scaling
Now that load testing has shown you what hardware you need, ensure you have autoscaling in place to deal with the peaks and troughs (it is unlikely you need to run your peak hardware for the entire duration of the event).
Autoscaling can also help if the origin traffic that you experience is higher than anticipated.
Test the autoscaler, set limits that you are comfortable with, and ensure you know how long new resources take to come online.
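To get a feel for sensible autoscaler limits, a bit of back-of-the-envelope arithmetic helps. The sketch below is an assumption-heavy estimate and not a sizing formula from any hosting provider. The `instances_needed` function, the 30% headroom and the example numbers are all hypothetical.

```python
import math


def instances_needed(peak_rps, cdn_offload, rps_per_instance, headroom=0.3):
    """Estimate how many origin instances are needed at peak.

    peak_rps: expected peak requests per second hitting the CDN.
    cdn_offload: fraction of requests served by the CDN (0.95 means 95%).
    rps_per_instance: what one instance sustained in load testing.
    headroom: extra capacity to cover the time new instances take to start.
    """
    origin_rps = peak_rps * (1.0 - cdn_offload)
    return max(1, math.ceil(origin_rps * (1.0 + headroom) / rps_per_instance))
```

For example, `instances_needed(2000, 0.95, 25)` gives 6, while `instances_needed(2000, 0.80, 25)` gives 21. A drop in offload from 95% to 80% roughly quadruples origin traffic, which is why the upper limit on the autoscaler deserves some thought.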
Step 6) Have a good fallback
Say the worst does happen, and your site does go down, or a critical API drops off the face of the internet, what does the end user see? Can you offer at least a better experience than a generic web server error page?
Most CDNs will have the ability to load balance origins (hot DR), and even fall back to a static version of the site if all origins are down.
It would make sense to test this prior to the high load event as well.
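The fallback behaviour described above boils down to "try each origin, then serve something static". Here is a minimal sketch of that logic. In practice this lives in your CDN's configuration and not in your own code, so the `serve` function and its arguments are purely illustrative.

```python
def serve(path, origins, static_pages, default_page="<h1>We'll be right back</h1>"):
    """Try each origin in order; serve a static copy if all of them fail.

    origins: list of callables that take a path and return a response body,
        raising an exception when the origin is unavailable (hypothetical).
    static_pages: dict mapping paths to pre-rendered HTML snapshots.
    default_page: shown when there is no snapshot for the requested path.
    """
    for fetch in origins:
        try:
            return fetch(path)
        except Exception:
            continue  # this origin is unhealthy, try the next one
    return static_pages.get(path, default_page)
```

Writing it out makes the testing question obvious: you want to have seen each of the three outcomes (primary, secondary, static) before the event.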
Step 7) Warm your cache
If you have a rather long tail website, it will be worth warming your cache prior to the event. An excellent module called warmer has been written, which allows warming all sorts of caches. It can, for instance, load every page in the XML sitemap. So this is fairly low effort, high reward.
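To show what sitemap based warming amounts to, here is a minimal sketch that pulls the URLs out of a sitemap and requests each one. This is not how the warmer module is implemented. The `urls_from_sitemap` and `warm` functions are hypothetical, and `fetch` is whatever HTTP client call you choose to pass in.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def warm(urls, fetch):
    """Request each URL once and return the ones that failed.

    fetch: callable that takes a URL and raises on failure (an assumption).
    """
    failures = []
    for url in urls:
        try:
            fetch(url)
        except Exception:
            failures.append(url)
    return failures
```

Returning the failures is deliberate, as a page that errors during warming is a page that would have errored for a real visitor.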
Step 8) Third party API dependencies
This is more of a fundamental design decision likely made much earlier on in the project. Say the content of your page is dependent on the content in an API response. If you request the API content during page generation time, then you are tying the speed and availability of your site to another site (often outside your control).
This can lead to slow page load times, and in the worst case scenario can tie up your server’s resources.
New Relic APM has “external requests”, which allows you to visualise this.
There are ways to mitigate this:
- Fetch the data in the background and cache locally in Drupal for as long as the data is considered ‘good’. e.g. using Drush and a cronjob.
- Use a client side application (e.g. React) and request the API response on the client side
- Use a CDN on the API and see Step #6 above
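The first option above (fetch in the background, cache locally) is the one I would sketch out, since it fully decouples page rendering from the API. The example below is a language-neutral illustration in Python and not Drupal code. The `CachedApi` class, its method names and the injected `fetch` and `clock` callables are all hypothetical.

```python
import time


class CachedApi:
    """Keep the last good API response locally and report how fresh it is."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # callable that performs the slow API request
        self._ttl = ttl_seconds    # how long the data is considered 'good'
        self._clock = clock        # injectable so the class can be tested
        self._value = None
        self._fetched_at = None

    def refresh(self):
        """Call this from the background job, never from page rendering."""
        try:
            self._value = self._fetch()
            self._fetched_at = self._clock()
            return True
        except Exception:
            return False  # API is down: keep serving the previous value

    def get(self):
        """Call this while rendering a page; it only reads local state."""
        if self._fetched_at is None:
            return None, "empty"
        age = self._clock() - self._fetched_at
        return self._value, ("fresh" if age <= self._ttl else "stale")
```

The key design choice is that `get` never touches the network, so a slow or dead API can only make your content stale and can no longer make your pages slow.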
Step 9) Realtime analytics
During the event, it is extremely valuable to have access to realtime (or near realtime) analytics to find out:
- how the system is currently performing
- requests/sec
- where the traffic is coming from
- cache offload rate from the CDN
Even more valuable is being able to respond to this data in a quick and efficient manner. Having access to technical people can help. The types of logs and analytics you should be looking to get a hold of:
- Web analytics tools (e.g. Google Analytics)
- APM tools (e.g. New Relic)
- CDN analytics (e.g. Cloudflare Logs)
- Log stream from hosting provider (e.g. PHP error log)
To see where you can take this, you might also be interested in reading this blog post that shows off some dashboards that were purpose built for a high traffic event.
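If your CDN gives you raw request logs, the headline numbers are simple to derive yourself. Here is a minimal sketch, assuming each log record is a dict with `cache_status` and `country` keys. Those field names are hypothetical, so map them to whatever your CDN's log format actually calls them.

```python
def summarise(records, window_seconds):
    """Summarise one window of CDN log records.

    records: iterable of dicts with hypothetical keys 'cache_status' and 'country'.
    window_seconds: length of the time window the records cover.
    """
    total = 0
    hits = 0
    by_country = {}
    for record in records:
        total += 1
        if record["cache_status"] == "hit":
            hits += 1
        country = record["country"]
        by_country[country] = by_country.get(country, 0) + 1

    return {
        "requests_per_second": total / window_seconds,
        # Share of requests the CDN answered without touching origin.
        "offload_rate": hits / total if total else 0.0,
        "origin_requests_per_second": (total - hits) / window_seconds,
        "top_countries": sorted(by_country, key=by_country.get, reverse=True)[:3],
    }
```

Of these, the origin requests per second figure is the one I would put at the top of a dashboard, as it is the number your hardware actually has to survive.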
Step 10) Application changes in a pinch
If you do spot something in your analytics, it is worth knowing what tools you have at your disposal to mitigate issues quickly and easily.
- Cloudflare page rules (redirect a broken path, increase the WAF presence on a route)
- Nginx or Apache configuration
- Application hotfix (avoid clearing the cache)
Knowing which tool will solve which problem, how long each option takes to deploy, how safe it is, and how easy the rollback is, is absolutely critical.
Step 11) Letting your hosting provider and their support team know
No-one likes surprises, so plan ahead. Ensure there are people available or on call during your traffic event. This goes for your hosting provider, your CDN provider and your support staff.
Postamble: What success looks like
So after your high traffic event has ended, here are some simple things to check in order to see how successful you were:
- Minimal origin requests and a high CDN offload
- Boring origin hardware graphs
- No rants on Twitter
- No trending hashtag on Twitter that is negative
- Users remember the event for its content, and not the problems with it
DrupalSouth 2019 video
Let me know in the comments if this was of use, and also if you have any other words of wisdom for anyone else.