How Microsoft, partners are tackling ‘huge, huge task’ of making security software safer

Microsoft and its partners are quietly grinding away on a massive project to completely redesign how cybersecurity software runs in Windows, with the hope of making it more resilient. But it could be years before customers see the results of one of the most ambitious software engineering transformations in decades.

The project, known as the Windows Resiliency Initiative, is intended to protect Windows computers from the disruptive effects of defective third-party software running inside the kernel, the operating system’s most powerful environment. Microsoft announced the effort after a faulty CrowdStrike software update in 2024 paralyzed millions of computers and caused billions of dollars in damage. The outage affected governments, critical infrastructure organizations and Fortune 500 companies and prompted widespread discussions about the risks of third-party code in the kernel.

The result is an unprecedented collaboration between Microsoft and third-party security vendors to redesign Windows, as well as products like endpoint detection and response (EDR) software and antivirus applications, in ways that improve resilience without sacrificing security or speed.

Microsoft and its partners have said little publicly about the major initiative, but everyone involved appears to recognize how hard it will be to rewrite the pathways between Windows and some of its most important tools.

“Every day’s a learning curve,” said Tony Anscombe, the chief security evangelist at ESET, one of the handful of companies working closely with Microsoft on the project. “We learn something every day.”

Lessons from the CrowdStrike incident

The Windows kernel is the core of the operating system, the connective tissue between a computer’s hardware and software components. It oversees how much memory applications are using, verifies the configuration of device drivers and deconflicts the work of the computer’s various processes. It can do all of this work because it has total control over everything happening inside the computer.

Because of the powers granted to programs running inside the kernel, security application developers have found it to be a useful environment for their products, which need total visibility and control in order to block cyberattacks.

“As a security vendor, you want to see everything that’s happening on the device,” Anscombe said.

The kernel’s ability to essentially freeze and reset the computer’s regular operating environment becomes a major asset when the computer encounters a problem, whether accidental or malicious. “Your machine doesn’t need to be rebooted,” Anscombe said, “and that’s because the application was running in that other mode, where Windows as the OS, the kernel mode, can shut down the user mode and restart [it] without doing a reboot.”

In addition to visibility and control, the kernel also offers speed and flexibility that greatly benefit security applications.

But the kernel’s immense power also comes with significant responsibility, as a defective kernel process could bring down an entire computer — or, if deployed widely, an entire network.

That’s exactly what happened on July 19, 2024, when CrowdStrike deployed a faulty software update to its EDR product, Falcon. The flawed code forced Windows computers running Falcon to restart endlessly or boot into recovery mode. More than 8 million machines around the world crashed and could not boot normally, paralyzing airlines, banks, hospitals, stock markets, government agencies and emergency services. A tiny update to third-party software running in the Windows kernel had caused the largest IT outage in history, leading to billions of dollars in losses, including more than $5 billion for Fortune 500 companies alone.

“Had that [Falcon] process been running in user mode, the severity would have probably been very different,” Anscombe said.

The incident highlighted the dangers of running — and frequently updating — third-party code in the Windows kernel. Four months after the digital chaos subsided, Microsoft launched the Windows Resiliency Initiative and pledged to work more closely with third-party security vendors on responsible software development and deployment practices. The company said it would require all software updates for security programs to deploy gradually throughout customer organizations “to ensure any negative impact from updates is kept to a minimum.”
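Microsoft has not spelled out publicly how vendors must implement gradual deployment. As a rough illustration of the general idea only, the sketch below pushes an update to progressively larger "rings" of machines and stops as soon as any machine in a ring reports a problem. The function name, the ring percentages and the health check are all hypothetical and are not drawn from Microsoft or any vendor.

```python
# Illustrative sketch only: the ring percentages and the health check are
# hypothetical, not Microsoft's or any vendor's actual rollout policy.

def staged_rollout(machines, cumulative_percentages, is_healthy_after_update):
    """Update machines ring by ring; halt at the first ring with a failure."""
    total = len(machines)
    updated = []
    start = 0
    for pct in cumulative_percentages:
        # Integer ceiling of total * pct / 100 gives the cumulative target.
        end = min(total, -(-total * pct // 100))
        ring = machines[start:end]
        updated.extend(ring)  # pretend the update is installed on this ring
        failed = [m for m in ring if not is_healthy_after_update(m)]
        if failed:
            # Stop here so later rings never receive the faulty update.
            return {"halted": True, "updated": updated, "failed": failed}
        start = end
    return {"halted": False, "updated": updated, "failed": []}
```

In this toy model, a fleet of 1,000 machines with rings at 1%, 10% and 100% would see a universally faulty update stop after the first 10 machines rather than reaching all 1,000.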

Microsoft also said it was “developing new Windows capabilities that will allow security product developers to build their products outside of kernel mode.”

“This change will help security developers provide a high level of security [and] easier recovery,” the company added, “and there will be less impact to Windows in the event of a crash or mistake.”

The CrowdStrike incident validated Microsoft’s longstanding unease about letting third-party developers run code in the Windows kernel, said Pavel Yosifovich, an expert on Windows architecture who trains and consults on the subject. Microsoft built in safeguards against kernel-level software crashes by requiring companies to sign their drivers and meet testing requirements, Yosifovich said, “but it’s not bulletproof.”

Windows API revamp

To coordinate the kernel migration project, Microsoft is using its existing Microsoft Virus Initiative (MVI), a program meant to help security vendors smoothly integrate their products into Windows. As part of the Windows Resiliency Initiative, Microsoft refreshed the MVI, dubbing it “MVI 3.0” and requiring participants to meet new reliability requirements.

Roughly 100 security companies are members of the MVI, Anscombe said, but only a dozen or so — representing “a significant majority of the market share” — are working hand-in-hand with Microsoft on the kernel changes. Microsoft has publicly identified Bitdefender, CrowdStrike, ESET, SentinelOne, Sophos, Trellix, Trend Micro and WithSecure as part of that group, but otherwise it is keeping the kernel project shrouded in secrecy. Participating companies’ employees must sign nondisclosure agreements, and most of the companies that Cybersecurity Dive contacted declined interview requests for this story.

Microsoft itself declined to answer even basic questions about the project, referring Cybersecurity Dive to its executives’ blog posts, which offer few details.

The kernel project is still in its early stages. Microsoft has asked vendors to inventory all of their products’ features so the company knows what functionality they need preserved during the transition from kernel mode to user mode. The work has been incredibly complicated, both because vendors have had to review decades’ worth of code and because every vendor’s code works slightly differently.

“It really is an unpicking of how all these products work and then unpicking the OS to see if you can provide that functionality in different ways,” Anscombe said. “It’s a huge, huge task.”

The result is a highly unusual arrangement in the Windows development ecosystem, with Microsoft soliciting feedback from vendors and incorporating it in real time into the application programming interfaces (APIs) that let security products safely hook into Windows’ core components.

It’s “an unusual scenario you don’t often get,” Anscombe said. “This is not somebody developing an API and landing the API on your desk and saying, ‘Here’s our new API. You need to work with this.’ This is somebody developing an API while you’re developing the things that will work with the API.”

Having seen how challenging the work has been for ESET, Anscombe said, “I wouldn’t want to be on the Microsoft end of this, of getting responses from 50 vendors and suddenly hav[ing] to try and map everybody’s [feedback] to give them all the functionality they need.”

Difficult balancing act

Microsoft and its partners will have to overcome serious challenges to make security software work just as well outside the kernel as it does inside it.

For one thing, the kernel affords greater control to software running there. This is particularly important for software designed to protect a system from malicious processes.

“When a process is created, a kernel driver can be notified, do some analysis, and decide to terminate the process before it does anything,” Yosifovich said. In user mode, software can only receive notifications about a process after the fact. “If the process is short-lived, and does something malicious,” Yosifovich said, “user mode may be too slow to do anything about that.”
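The timing gap Yosifovich describes can be shown with a toy simulation. The sketch below is not Windows code and uses no real Windows APIs; every name in it is hypothetical. It assumes a simplified model in which a kernel-style hook is consulted before a process runs, while a user-mode-style hook only gets its first look some number of "ticks" after the process has started.

```python
# Toy model only: none of these names correspond to real Windows APIs.
from dataclasses import dataclass, field


@dataclass
class ToyProcess:
    name: str
    malicious: bool
    lifetime_ticks: int  # how long the process runs before exiting on its own


def allow_only_benign(proc):
    """Hypothetical security verdict: True means the process may run."""
    return not proc.malicious


@dataclass
class ToySystem:
    damage_log: list = field(default_factory=list)

    def run(self, proc, pre_start_hook=None, post_start_hook=None,
            hook_delay_ticks=0):
        # Kernel-style hook: consulted before the process executes anything.
        if pre_start_hook is not None and not pre_start_hook(proc):
            return "blocked before start"
        for tick in range(proc.lifetime_ticks):
            # User-mode-style hook: first sees the process only after
            # hook_delay_ticks have already elapsed.
            if post_start_hook is not None and tick >= hook_delay_ticks:
                if not post_start_hook(proc):
                    return f"terminated at tick {tick}"
            if proc.malicious:
                self.damage_log.append((proc.name, tick))
        return "ran to completion"
```

In this model, a malicious process that lives for two ticks is stopped cold by the pre-start hook, but it runs to completion and logs two ticks of damage when the only defense is a post-start hook that arrives three ticks late. Real products are far more complex, and the size of any real-world delay is not something the sources quantified.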

Jeff Tang, a Windows security expert and independent consultant, said that in user mode, “your capabilities are much more limited to monitor the entire system.”

Yosifovich argued that “it’s almost impossible to run completely outside the kernel without significant reengineering or diminishing the powers for security products.”

Security programs running in user mode would also be much more vulnerable to tampering. “Your capabilities are on the same level as the thing you’re trying to monitor and/or stop,” Tang said, “so malware has the same opportunity to stop you from protecting the system” as the security program does to stop the malware.

The problem is far from theoretical. “We’re already hearing of ransomware EDR killers and all sorts of other things that try and do that,” Anscombe said. “You need to be able to give security vendors the [reassurance] that their application can’t be manipulated.”

In addition to the risk of tampering, user mode’s distance from the kernel introduces a processing delay. “Running in user mode is slower when accessing system APIs,” Yosifovich said.

That delay could make or break the customer experience, with potentially serious consequences. “What you don’t want to see is a customer turning around and saying, ‘Well, everything’s a bit slower now. I’m going to start turning things off,’” Anscombe said. “That’s a degradation of security.”

New API timeline TBD

Microsoft and its vendor partners are taking their time to analyze the challenges they face.

“At the moment, it’s more about how you’d move some of the features and functionality across, using an API in that way,” Anscombe said. “Testing, efficacy, et cetera, is further down the line.”

It remains unclear when Microsoft will publish APIs that vendors can use to build user-mode software, or when that software will be ready for testing and deployment. Anscombe declined to discuss the project’s internal timelines, although he said the early work has validated vendors’ skepticism about a fast turnaround.

“This will be ongoing for a long period of time,” Anscombe said. “There’ll always be some feature that, somewhere, somebody’s got that will be complicated to transition.”

Meanwhile, market pressure could drive the migration of security products out of the kernel. If user-mode software proves to be more resilient, widely used publications like the National Institute of Standards and Technology’s Cybersecurity Framework could begin recommending that organizations use such software. Insurers might even offer lower premiums to customers that use those products, especially if they determine that user-mode software lowers the risk of business interruptions that could generate claims.

The most likely outcome, Anscombe said, is a hybrid world, in which some software continues to run in the kernel while other programs run in user mode.

Some products might even run in both modes simultaneously, with developers testing and implementing simpler user-mode migrations before tackling the harder components of their code.

“They can coexist in the same technology,” Anscombe said. “In fact, in theory, they already do.”