The CrowdStrike IT outage tested your business continuity and recovery planning. Did you pass?
Introduction
In this post, we examine the recent CrowdStrike outage, unpack what went wrong, identify who was affected, and discuss how to mitigate issues like this to improve business continuity and disaster recovery. The internet is flooded with blame aimed at CrowdStrike, and many argue that the defect should have been caught by CrowdStrike's release testing. That is correct as far as it goes, but it is not the whole picture: the business is also responsible for verifying and applying any patch from a third party. The company is accountable for ensuring its customers' business continuity and functional IT systems. Much as a food business verifies that delivered ingredients are fit for use, IT should verify that patches and updates do not break the wider IT estate. To mitigate issues like this, we need regular business continuity and disaster recovery reviews, improvements, and testing. If you suffered an outage, it indicates a gap in your business continuity and disaster recovery approach. This post will help you understand how to mitigate this sort of issue.
Outage
Let us first better understand what happened. The outage began on the evening of July 18th, when computers running the Windows operating system started crashing with a blue screen and entering a reboot cycle that ended in the same blue screen error. The issue was seen globally across many different sectors. On investigation, CrowdStrike acknowledged the problem in a recorded phone message, stating they were "aware of reports of crashes on Windows... related to the Falcon sensor." The company later confirmed that the problem was caused by a defect in a single content update for Windows hosts and was not a cyberattack. Early analysis pointed to the Falcon sensor's Windows driver, which reportedly crashed on an invalid memory read (widely described as a null-pointer dereference) while processing the content update. Updating and patching Windows drivers is a critical change because a faulty driver can take down the entire operating system. Later in the post, we will examine why driver patches are high-risk changes and why they need a careful rollout with a back-out strategy.
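To illustrate the bug class, here is a rough sketch in ordinary user-space Python rather than kernel-mode C, with invented field names that are ours, not CrowdStrike's. Code that parses a content file and then uses the result without checking it fails the moment the file is malformed; in user space that failure is a catchable exception, but at ring zero the equivalent mistake is a blue screen.

# Illustration only: user-space Python with invented names, not CrowdStrike code.
# It shows the general failure mode of trusting a content update blindly.

from typing import Optional


def parse_content_update(raw: bytes) -> Optional[dict]:
    """Parse a hypothetical channel file; return None if it is malformed."""
    if not raw or raw.count(b"=") == 0:
        return None  # malformed update
    name, _, pattern = raw.partition(b"=")
    return {"name": name.decode(errors="replace"),
            "pattern": pattern.decode(errors="replace")}


def apply_update_unsafely(raw: bytes) -> None:
    rules = parse_content_update(raw)
    # Missing check: if the file is malformed, rules is None and the next
    # line raises. A kernel driver making the same mistake takes the whole
    # operating system down with it.
    print("applying rule", rules["name"])


def apply_update_safely(raw: bytes) -> None:
    rules = parse_content_update(raw)
    if rules is None:
        print("malformed content update; keeping the previous rules")
        return
    print("applying rule", rules["name"])

The point is not the exact mechanics inside the Falcon driver, which have not been published in full, but that a single unchecked result in ring-zero code has no safety net.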
Impact
The outage significantly disrupted business globally, and no sector seemed spared. Travel, banking, and healthcare were all affected; the full list is enormous, and the disruption cost hundreds of millions. In the UK, the NHS was particularly affected because EMIS, the system used by GP practices and pharmacies to manage appointments, patient records, prescriptions, and referrals, was knocked out. The NHS confirmed the issue and stated that long-standing fallback measures, such as paper records and handwritten prescriptions, were in place to manage the disruption.
Cause
The issue was caused by a content update to CrowdStrike's Falcon endpoint protection product. A channel file named 'C-00000291*.sys' was responsible. The workaround was to delete this file and reboot; the system would then pull a corrected update and return to an operational state. The underlying defect was reportedly an invalid memory read (widely described as a null pointer) in the sensor's driver code, which crashed the Windows kernel. Because Windows drivers operate at ring zero, there are no protections to stop a faulty driver from crashing the whole operating system, and this is why driver patching is critical to get right and why a fallback position is essential.
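For completeness, below is a minimal sketch of that manual workaround, written in Python for readability. In practice affected machines usually had to be booted into Safe Mode or the Windows Recovery Environment first, and the path used here is the commonly reported one; always verify against the vendor's own remediation guidance.

# Sketch of the widely reported manual workaround, for illustration only.
# Assumes the machine is booted into Safe Mode / WinRE and that the commonly
# reported driver path applies; verify against official vendor guidance.

import glob
import os

CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"


def remove_bad_channel_files(directory: str = CROWDSTRIKE_DIR) -> int:
    """Delete files matching C-00000291*.sys and report how many were removed."""
    removed = 0
    for path in glob.glob(os.path.join(directory, "C-00000291*.sys")):
        try:
            os.remove(path)
            removed += 1
            print(f"removed {path}")
        except OSError as exc:
            print(f"could not remove {path}: {exc}")
    return removed


if __name__ == "__main__":
    count = remove_bad_channel_files()
    print(f"{count} file(s) removed; reboot normally so the sensor can pull "
          f"a corrected content update")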
Improving and Mitigating
Blaming CrowdStrike without examining whether one's own business recovered, automatically or otherwise, misses the opportunity to improve. Each business should use this outage to learn for the future. Did we fail, or did our business continuity and DR plans succeed in mitigating the CrowdStrike issue? This is the time to reflect, acknowledge shortcomings, and use them to reduce the impact of the next incident.
Many organizations use the default patching approach with CrowdStrike, where the CrowdStrike software reaches out, pulls in updates, and patches the system on its own schedule. Stop for a second and think: this means critical systems can be updated at critical times. Patching should happen during noncritical windows unless the patch is genuinely urgent, and that requires patching to be under the control of business IT, not a third party. Remember, the third party has no insight into your business operations; your IT does, which is why it needs to control patching and align it with business operations.
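As a minimal sketch of what "patching under business IT control" can look like, the gate below only lets an update job run inside an agreed maintenance window. The window times and the apply_agent_update() hook are illustrative assumptions, not part of any vendor API.

# Minimal sketch: gate third-party agent updates behind a business-defined
# maintenance window. Window times and apply_agent_update() are illustrative.

from datetime import datetime, time
from typing import Optional

MAINTENANCE_START = time(1, 0)   # 01:00 local, example noncritical window
MAINTENANCE_END = time(4, 0)     # 04:00 local


def in_maintenance_window(now: Optional[datetime] = None) -> bool:
    now = now or datetime.now()
    return MAINTENANCE_START <= now.time() <= MAINTENANCE_END


def apply_agent_update() -> None:
    """Placeholder for whatever mechanism actually stages and applies the patch."""
    print("applying staged agent update")


def run_patch_job(emergency: bool = False) -> None:
    # Emergency security fixes can bypass the window, but only by an explicit,
    # audited decision from business IT, never by default.
    if emergency or in_maintenance_window():
        apply_agent_update()
    else:
        print("outside maintenance window; deferring update")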
Business IT should also take a canary approach to patches, where noncritical systems are updated first and act as the validation phase; this should be automated, so that if the noncritical systems show issues, the rollout halts automatically. All systems should also have a fallback position, where a driver that fails during boot triggers a rollback to the last known good configuration, so a bad driver cannot leave machines unbootable.
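A canary rollout can be sketched as a simple staged loop: patch the noncritical ring first, check health, and halt the entire rollout if that ring reports failures. The ring names and the patch_host() and host_is_healthy() hooks below are illustrative placeholders, not a real API.

# Illustrative canary rollout: patch noncritical rings first and halt the
# rollout automatically if a ring fails its health checks.

RINGS = [
    ("canary-noncritical", ["test-01", "test-02"]),
    ("general",            ["app-01", "app-02", "app-03"]),
    ("critical",           ["db-01", "payments-01"]),
]


def patch_host(host: str) -> None:
    print(f"patching {host}")          # placeholder for the real patch mechanism


def host_is_healthy(host: str) -> bool:
    return True                        # placeholder health probe


def rollout() -> bool:
    for ring_name, hosts in RINGS:
        for host in hosts:
            patch_host(host)
        unhealthy = [h for h in hosts if not host_is_healthy(h)]
        if unhealthy:
            print(f"halting rollout: {ring_name} failed on {unhealthy}")
            return False
        print(f"{ring_name} healthy; promoting to next ring")
    return True


if __name__ == "__main__":
    rollout()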
Waiting for an issue like the CrowdStrike outage to arrive before planning is too late. Each business should have business continuity planning that covers IT and the cloud. Holding war days, where the team exercises the IT continuity plan against playable outage scenarios, is important for finding flaws in IT and cloud continuity planning. Each war day should end with a debriefing where issues are discussed and improvement plans are created, and this process should be baked into the ongoing continuous improvement planning for the cloud.
Conclusions
We can learn that we need to plan for software issues; our planning should have resulted in mitigation, because software patching always carries the possibility of system failure, so we have to be prepared to recover from a bad patch. We also learned to use noncritical systems as canaries, where we detect issues and halt patching, and that keeping patching under the control of business IT, rather than accepting default third-party patching, is just as critical. Finally, always have a fallback so systems can recover from rogue patches.