CrowdStrike’s Falcon security platform was linked to crashes on both Debian and Rocky Linux systems earlier this year. The incidents drew little attention at the time but are now being scrutinized after a recent widespread outage caused by a similar issue on Windows devices.
In April, a civic tech lab responsible for various production websites reported that all its Debian Linux servers crashed and failed to boot after a CrowdStrike update. This incident forced the IT team to remove CrowdStrike from each server and restore functionality manually.
Jack Cushman, director of the Harvard Library Innovation Lab, wrote on Hacker News: “CrowdStrike pushed an update on a Friday evening that was incompatible with up-to-date Debian stable. We patched Debian as usual, and everything was fine for a week. Then, all of our servers across multiple websites and cloud hosts simultaneously hard crashed and refused to boot.”
The problem was traced back to an update that conflicted with the latest stable version of Debian. CrowdStrike acknowledged the issue a day later, but it took weeks to analyze the root cause. The analysis revealed that a specific Debian configuration was missing from their testing matrix.
In May, Rocky Linux users reported a similar issue after upgrading to version 9.4: servers froze when booted with the new kernel, while reverting to the old kernel restored normal operation. The freezes were traced to a kernel bug, which CrowdStrike support eventually attributed to insufficient testing.
These repeated incidents highlight a significant gap in CrowdStrike’s testing and compatibility processes. It is alarming that updates to critical software can cause widespread system failures, and the oversight raises questions about CrowdStrike’s update and testing procedures, especially given the software’s role in protecting systems from cyber threats.
The root cause of these problems appears to be a combination of insufficient testing and the use of programming languages, such as C++, that are prone to memory errors like null pointer dereferences.
A former Google employee, Zach Vorhies, claimed on social media that a NULL pointer error in C++ caused the recent Windows issue. “Programmers in C++ are supposed to check for this when they pass objects around by ‘checking for null’,” Vorhies noted, suggesting that modern tools could prevent such errors and recommending a shift to safer programming languages like Rust.
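For readers unfamiliar with the practice Vorhies describes, the following is a minimal C++17 sketch of “checking for null” before using an object passed by pointer. It is purely illustrative and is not taken from CrowdStrike’s code: the `Update` type and the `read_version`, `read_version_ref`, and `self_check` functions are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical record type, used only for illustration.
struct Update {
    std::string name;
    int version;
};

// Pointer parameter: the caller may pass nullptr, so check before dereferencing.
// Returns std::nullopt instead of crashing when no object is supplied.
std::optional<int> read_version(const Update* update) {
    if (update == nullptr) {
        return std::nullopt;
    }
    return update->version;
}

// Reference parameter: a valid reference always refers to an object,
// so no null check is needed inside the function.
int read_version_ref(const Update& update) {
    return update.version;
}

// Small self-check exercising both the null and non-null paths.
void self_check() {
    const Update update{"example", 3};
    assert(read_version(&update).value() == 3);
    assert(!read_version(nullptr).has_value());
    assert(read_version_ref(update) == 3);
}
```

Without the `nullptr` check, calling `read_version(nullptr)` would dereference a null pointer, which is undefined behavior in C++ and typically crashes the process. Languages such as Rust make the "no value" case explicit in the type system, in the same spirit as the `std::optional` return type used here.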