The Drupal Core security release SA-CORE-2021-003 on May 26th, 2021 was pushed outside its planned release window by a combination of infrastructure issues. This post-mortem explains the circumstances, and how the Drupal security team and the Drupal Association are collaborating to improve the situation for the next release.
On 2021-05-25, the Drupal security team published a PSA about an off-cycle release for Drupal Core. Typically, the security team publishes a PSA ahead of a release in two cases:
- If the release is off-cycle (core security releases are normally on the third Wednesday of the month).
- If the release is highly critical, in which case the PSA is issued ahead of the normal release window.
For SA-CORE-2021-003, the release was off-cycle because it related to a third-party library, so a PSA was issued.
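To make the scheduling rule concrete, here is a minimal sketch that computes the third Wednesday of a month and checks whether a given date falls outside it. The function names are hypothetical, and the sketch models only the day-of-month rule stated above, not the time-of-day window or any other scheduling exceptions.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday(): Monday == 0, so Wednesday == 2


def third_wednesday(year: int, month: int) -> date:
    """Return the third Wednesday of the given month."""
    first = date(year, month, 1)
    # Days from the 1st to the first Wednesday (0 if the 1st is a Wednesday).
    offset = (WEDNESDAY - first.weekday()) % 7
    # Two more weeks after the first Wednesday gives the third one.
    return first + timedelta(days=offset + 14)


def is_off_cycle(release_day: date) -> bool:
    """True if release_day is not the month's regular core security day."""
    return release_day != third_wednesday(release_day.year, release_day.month)


if __name__ == "__main__":
    print(third_wednesday(2021, 5))          # 2021-05-19
    print(is_off_cycle(date(2021, 5, 26)))   # True
```

For May 2021 the regular window fell on May 19th, so a release on May 26th was, by this rule, off-cycle.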
The release process started at 1:30 Eastern time on 2021-05-26; it normally takes 20 to 45 minutes. However, the PSA had already triggered a large amount of bot-like traffic to our GitLab infrastructure.
Because this particular commit changed the integration points of a third-party dependency, its diff was exceptionally large. As a result, each request to a commit page in the Drupal GitLab instance was significantly more resource-intensive than average.
The load on the GitLab infrastructure caused the systems that create the packages and update subtree splits to become unstable.
- A PSA was sent in advance of the release because it was happening outside the normal core cycle.
- This PSA resulted in heavy traffic to Drupal.org and to our GitLab installation; in particular, many people and scripts were polling the commit history for new commits, even before the release was published.
- Because of the nature of this particular commit, the diff was exceptionally large, and each request to a commit page triggered an expensive syntax-highlighting operation.
- Once the commit was pushed to git, even before the packaged release was available, requests from CI systems and Composer increased dramatically.
- These combined factors caused the release packaging process itself to stall on HTTP 500 errors, unable to populate a key database table midway through the pipeline.
- We mitigated this by blocking the most aggressive automated traffic that was repeatedly attempting to gather commit information.
- This allowed us to complete the packaging process, and get the packaged release out.
- Meanwhile, load on the GitLab instance itself stayed high, causing slowdowns and 500 errors for several hours until it gradually cleared.
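As an illustration of the "block the most aggressive automated traffic" step, the sketch below ranks clients by how many commit-page requests they made in one window of access logs and returns those above a threshold. Everything here is an assumption for illustration: the `(client_id, path)` record format, the threshold, and the `/-/commit/` path marker (GitLab's usual commit-page URL segment) are not taken from the Drupal Association's actual tooling.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def aggressive_clients(
    requests: Iterable[Tuple[str, str]],
    threshold: int,
    path_marker: str = "/-/commit/",  # assumed GitLab commit-page URL segment
) -> List[str]:
    """Return clients whose commit-page request count exceeds threshold.

    `requests` is a hypothetical stream of (client_id, path) pairs,
    e.g. parsed from one time window of web server access logs.
    """
    # Count only requests that hit individual commit pages.
    counts = Counter(client for client, path in requests if path_marker in path)
    # most_common() sorts by count, so the worst offenders come first.
    return [client for client, n in counts.most_common() if n > threshold]


if __name__ == "__main__":
    sample = [
        ("bot-a", "/project/drupal/-/commit/abc123"),
        ("bot-a", "/project/drupal/-/commit/abc123"),
        ("bot-a", "/project/drupal/-/commits/9.1.x"),  # commit list, not counted
        ("human", "/project/drupal/-/commit/abc123"),
    ]
    print(aggressive_clients(sample, threshold=1))  # ['bot-a']
```

Ranking by request count keeps the block list short: in an incident like this one, blocking a handful of heavy pollers relieves most of the load without cutting off ordinary visitors.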
We are still evaluating potential mitigations for future release windows, but they may include one or more of the following strategies:
- Improving log collection from our GitLab servers for better diagnosis of issues.
- Enforcing static archiving on GitLab commit pages, either temporarily or permanently.
- Investigating whether we can separate web traffic from Git traffic on our GitLab instance, or at least reduce load spikes and provide dedicated Git resources to the packaging job.
- Splitting read-only traffic between the GitLab primary and its replica, which we already have a plan for.
- Deliberately redirecting all non-cached traffic until packaging completes.
These mitigations should help ensure that the full release process can be completed within the specified window.
The final disruptive aspect of this release was communication. We recognize that communication gets sparse whenever there is an unexpected delay, because the team is fully occupied trying to resolve the issues as quickly as possible. In the future, we'll try to designate a person to handle communications, so that if we have to update the expected release time, we do so promptly.
You can also reach the security team and other community members in Drupal Slack, in the #security-questions channel.
As a long-term goal, we know our international community members would benefit from more peace of mind when updating sites. We hope a combination of the in-development Automatic Updates initiative and the Drupal Steward program for highly critical vulnerabilities can help these international teams.