CrowdStrike reveals details
Three weeks after a major IT outage disrupted services worldwide, CrowdStrike has released a Root Cause Analysis (RCA) report.
The systems crash on July 19 caused significant disruptions, cancelling flights, postponing surgeries, interrupting cable news broadcasts, and forcing shops to switch to cash-only transactions.
This incident has sparked intense scrutiny and debate in the tech industry, highlighting the vulnerabilities of digital infrastructure and the challenges of software deployment and system reliability.
Root Cause Analysis (RCA) is a systematic approach used in the IT industry to identify the fundamental causes of faults or problems.
By examining every component and process involved, RCA aims to pinpoint the root causes rather than just addressing the symptoms. This method ensures that similar incidents are prevented in the future through effective corrective measures.
CrowdStrike’s RCA report (PDF) identifies several critical factors that led to the Falcon EDR sensor crash. Key issues include:
- Mismatch between inputs: A discrepancy between inputs validated by a content validator and those provided to a content interpreter created an undetected vulnerability during initial testing phases, leading to a cascade of failures.
- Out-of-bounds read issue: This flaw in the content interpreter caused memory read errors, triggering the global system crashes.
- Absence of specific testing: The lack of a specific test for non-wildcard matching criteria in the 21st field was a significant oversight. CrowdStrike has pledged to work with Microsoft to ensure secure and reliable access to the Windows kernel.
The problem originated in February when CrowdStrike introduced a new template type to detect novel attack techniques using Windows' interprocess communication mechanisms.
This template defined 21 input parameter fields, but the sensor code that invoked the content interpreter supplied only 20 input values to match against.
On July 19, the deployment of additional template instances introduced criteria for matching a 21st parameter, causing a memory read error that led to widespread crashes.
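The mechanism can be illustrated with a minimal sketch. This is not CrowdStrike's code: the function names, the `"*"` wildcard marker and the list-based data layout are all assumptions made for illustration. It shows how a validator that checks criteria against the template definition (21 fields) can pass content that an interpreter, given only 20 input values, cannot safely evaluate. Python raises an `IndexError` where native kernel code would read memory outside the array.

```python
WILDCARD = "*"  # hypothetical marker meaning "match anything"


def validate_instance(criteria, declared_fields):
    """Validator-style check: the criteria count matches the template definition."""
    return len(criteria) == declared_fields


def matches_unchecked(criteria, inputs):
    """Interpreter-style match that trusts the criteria length."""
    for i, criterion in enumerate(criteria):
        if criterion == WILDCARD:
            continue  # wildcard: the input value at this position is never read
        if inputs[i] != criterion:  # reads past the end of inputs when i == 20
            return False
    return True


def matches_checked(criteria, inputs):
    """Same match, but refuses criteria that outnumber the supplied inputs."""
    if len(criteria) > len(inputs):
        raise ValueError(
            f"{len(criteria)} criteria but only {len(inputs)} input values"
        )
    return matches_unchecked(criteria, inputs)


inputs = ["value"] * 20              # only 20 values are supplied at run time
earlier = [WILDCARD] * 21            # 21st criterion is a wildcard: never read
july_19 = [WILDCARD] * 20 + ["x"]    # 21st criterion is a concrete value

print(validate_instance(july_19, 21))      # the validator accepts both instances
print(matches_unchecked(earlier, inputs))  # completes, the gap stays hidden
```

In this sketch, calling `matches_unchecked(july_19, inputs)` fails on the 21st field, while `matches_checked` rejects the instance before any value is read. That mirrors why a wildcard in the 21st field could mask the mismatch until a non-wildcard criterion was deployed.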
In response to the incident, CrowdStrike has announced several measures to prevent future occurrences:
- Updated test procedures: Enhanced tests for template type development and automated tests for all existing template types aim to catch discrepancies early.
- Enhanced deployment checks: Additional deployment layers and acceptance checks in the content configuration system ensure templates pass successive deployment rings before full rollout.
- Improved customer control: New capabilities allow customers greater control over the deployment of Rapid Response Content updates, with more functionalities planned to empower users.
- Preventing Channel File 291 issues: Validation for input field numbers has been implemented to prevent similar issues.
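The idea of successive deployment rings can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of CrowdStrike's content configuration system: the ring names and the `deploy` and `healthy` callbacks are hypothetical.

```python
def staged_rollout(rings, deploy, healthy):
    """Deploy ring by ring and stop at the first ring that fails its check.

    Returns (completed_rings, failed_ring); failed_ring is None on success.
    """
    completed = []
    for ring in rings:
        deploy(ring)
        if not healthy(ring):
            # Halt here: later rings never receive the update.
            return completed, ring
        completed.append(ring)
    return completed, None


# Hypothetical ring names, ordered from smallest to largest audience.
rings = ["internal", "canary", "early_adopters", "general"]
touched = []

completed, failed = staged_rollout(
    rings,
    deploy=touched.append,                 # record which rings got the update
    healthy=lambda ring: ring != "canary",  # simulate crashes in the canary ring
)
print(completed, failed, touched)
```

In this simulated run the rollout stops at the canary ring, so the two larger rings are never touched. The design choice is that each ring acts as an acceptance gate for the next, which limits how far a faulty update can spread.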
CrowdStrike CEO George Kurtz publicly apologised to customers, emphasising the company’s dedication to regaining customer trust and confidence, and stressing that customer protection remains their top priority.
While CrowdStrike's RCA provides a detailed breakdown of the technical flaws, it appears to overlook the broader issue of process failure.
The report focuses on technical defects, diverting attention from procedural gaps and executive accountability.
The bug - a mismatch in input fields - is a basic technical error. The pressing concern is why such a bug went undetected for so long.
The RCA reveals significant gaps in automated testing processes, which should have caught this discrepancy before deployment.
Additionally, the RCA does not clearly address the decision to push the update to all users simultaneously, which is itself a significant omission.
Experts say staggered deployments and more rigorous testing could have mitigated the impact of such an error.
The fallout extends beyond technical fixes. Investors have filed lawsuits against CrowdStrike, citing a 32 per cent drop in share price over 12 days. Delta Air Lines has threatened to sue over $500 million in losses.
The RCA’s revelations may fuel further litigation, reflecting the significant brand and financial impact of the outage.
CrowdStrike's liability is limited by typical software contract clauses, which may cap its exposure significantly.
However, the extent of financial losses involved could challenge these limitations. Experts suggest this incident could become a landmark case in software liability, potentially leading to regulatory reforms.
CrowdStrike has rejected Delta's claim that it should be blamed for the flight disruptions, asserting its liability is capped at a “single-digit millions” amount.
The company has expressed regret and apologised to all customers, but suggests Delta's prolonged recovery period compared to other airlines points to issues within Delta's own IT systems.