AI & LLMs Show Promise in Squashing Software Bugs

Security researchers and attackers are turning to AI models to find vulnerabilities. Use of the technology will likely drive the annual count of software flaws higher, but could eventually result in fewer flaws in public releases, experts say.

On Nov. 1, Google said its Big Sleep large language model (LLM) agent discovered a buffer-underflow vulnerability in the popular database engine SQLite. The experiment shows both the peril and the promise of AI-powered vulnerability discovery tools: The AI agent searched through the code for variations on a specific vulnerability, but identified the software flaw in time for Google to notify the SQLite project and work with its maintainers to fix the issue.

Using AI just for software-defect discovery could result in a surge in vulnerability disclosures, but introducing LLM agents into the development pipeline could reverse the trend and lead to fewer software flaws escaping into the wild, says Tim Willis, head of Google’s Project Zero, the company’s effort to identify zero-day vulnerabilities.

“While we are at an early stage, we believe that the techniques we develop through this research will become a useful and general part of the toolbox that software developers have at their disposal,” he says.

Google is not alone in searching for better ways to find, and fix, vulnerabilities. In August, a group of researchers from Georgia Tech, Samsung Research, and other organizations, collectively known as Team Atlanta, used an LLM bug-finding system to automatically find and patch a bug in SQLite. And just last month, cybersecurity firm GreyNoise Intelligence revealed it had used its Sift AI system to analyze honeypot logs, leading to the discovery and patching of two zero-day vulnerabilities affecting Internet-connected cameras used in sensitive environments.

Overall, companies are gaining more ways to automate vulnerability discovery, and — if they are serious about security — will be able to drive down the number of vulnerabilities in their products by using the tools in development, says Corey Bodzin, chief product officer at GreyNoise Intelligence.

“The exciting thing is we do have technology that allows people who [care about] security to be more effective,” he says. “Sadly … there are not many companies where that is … a primary driver.” But even companies where security is “purely viewed as a cost” can benefit from using these tools, he adds.

Only the First Steps

Currently, Google’s approach is still bespoke and requires work to adapt to specific vulnerability-finding tasks. The company’s Big Sleep agent does not look for completely new vulnerabilities, but uses details from a previously discovered vulnerability to look for similar issues. The project has looked at smaller programs with known vulnerabilities as test cases, but the SQLite experiment is the first time it has found a vulnerability in production code, the Google Project Zero and Google DeepMind researchers stated in Google’s blog post describing the research.
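
The idea of searching for variants of a known flaw can be illustrated with a toy example. The sketch below is not how Big Sleep works, and the function name and the pattern it checks are hypothetical: it simply assumes the known bug was an array indexed by a value that was never checked against zero, and flags lines in other code where the same shape appears.

```python
import re


def find_unchecked_index_uses(source: str, index_name: str) -> list[int]:
    """Return 1-based line numbers where `index_name` indexes an array
    before any lower-bound check on it has been seen.

    Hypothetical toy heuristic: it works line by line on raw text and
    knows nothing about scopes, types, or control flow.
    """
    name = re.escape(index_name)
    # A lower-bound check looks like `i >= 0` or `i < 0`.
    lower_check = re.compile(rf"\b{name}\s*(>=\s*0|<\s*0)")
    # An index use looks like `[i]`.
    index_use = re.compile(rf"\[\s*{name}\s*\]")

    hits = []
    checked = False
    for lineno, line in enumerate(source.splitlines(), start=1):
        if lower_check.search(line):
            checked = True
        if index_use.search(line) and not checked:
            hits.append(lineno)
    return hits


if __name__ == "__main__":
    sample = "int f(int i) {\n  if (i < MAX)\n    return table[i];\n  return 0;\n}"
    print(find_unchecked_index_uses(sample, "i"))
```

A text pattern like this misses most real variants and raises false alarms, which is part of why researchers are interested in whether LLM agents can reason about the code itself.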

While specialized fuzzers would likely have found the bug, tuning those tools to perform well is a very manual process, says Google’s Willis.

“One promise of [L]LM agents is that they might generalize across applications without the need for specialized tuning,” he says. “Additionally, we’re hopeful that [L]LM agents will be able to uncover a different subset of vulnerabilities than those typically found through fuzzing.”

The use of AI-based vulnerability discovery tools will be a race between attackers and defenders. Manual code review is a viable way of finding bugs for attackers, who only need a single exploitable vulnerability or short chain of vulnerabilities. But defenders need a scalable way of finding and fixing flaws across applications. While bug-finding tools can be a force multiplier for both attackers and defenders, the ability to scale up to analyze code will likely be a greater benefit for defenders, Willis says.

“We expect that advances in automated vulnerability discovery, triage, and remediation will disproportionately benefit defenders,” he says.

Focus AI on Finding and Fixing Bugs

Companies that focus on using AI to generate secure code and fix bugs when found will deliver higher-quality code from developers, says Chris Wysopal, co-founder and chief security evangelist at Veracode, an application security firm. He argues that automating bug finding and bug fixing are two completely different problems. Finding vulnerabilities is a very large data problem, while fixing bugs usually deals with perhaps a dozen lines of code.

“Once you know the bug is there — if you found it through fuzzing, or through an LLM, or using human code review — and you know what kind of bug it is, fixing it is relatively easy,” he says. “So, LLMs favor defenders, because having access to source code and fixing issues is easy. So I’m kind of bullish that we can eliminate whole classes of vulnerabilities, but it’s not from finding more, it’s from being able to fix more.”

Companies that require developers to run automated security tools before code check-in will find themselves on a path to paying down their security debt, the collection of issues that they know about but have not had time to fix, he says. Currently, nearly half (46%) of organizations have security debt in the form of persistent critical flaws in applications, according to Veracode’s 2024 State of Software Security report.
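
A check-in requirement of this kind can be pictured as a simple gate over a scanner's results. The sketch below is illustrative only and does not reflect any particular vendor's product: the `Finding` record, the severity names, and the default threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed severity scale, lowest to highest (hypothetical).
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Finding:
    """One issue reported by a (hypothetical) automated security tool."""
    rule: str
    severity: str
    path: str


def blocking_findings(findings, threshold="high"):
    """Return the findings at or above the given severity threshold."""
    limit = SEVERITY_RANK[threshold]
    return [f for f in findings if SEVERITY_RANK[f.severity] >= limit]


def allow_check_in(findings, threshold="high"):
    """Allow the check-in only if no finding meets the threshold."""
    return not blocking_findings(findings, threshold)


if __name__ == "__main__":
    results = [
        Finding("unchecked-index", "critical", "src/parser.c"),
        Finding("unused-variable", "low", "src/util.c"),
    ]
    print(allow_check_in(results))
```

Where the threshold sits is a policy choice: a strict gate blocks more commits up front, while a loose one lets more issues accumulate as debt.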

“The idea that you’re committing code that has a problem in it, and it’s not fixed, will become the exception, not the rule, like it is today,” Wysopal says. “Once you can start to automate this fixing — and we’re always getting better at automating finding [vulnerabilities] — I think that’s how things change.”

Yet, the technology will still have to overcome companies’ focus on efficiency and productivity over security, says Bob Rudis, vice president of data science and security research at GreyNoise Intelligence. He points to the fixing of the two security vulnerabilities that GreyNoise Intelligence found and responsibly disclosed. The vendor fixed the issues in only two product models and left others unpatched, despite the fact that the other products likely had similar issues, he says.

Google and GreyNoise Intelligence proved that the technology will work, but whether companies integrate AI into their development pipelines to eliminate bugs is still an open question.

Rudis has doubts.

“I’m sure a handful of organizations are going to deploy it — it’s going to make like seven C files a little bit safer across a bunch of organizations, and maybe we’ll get like a tick more security for the ones that can actually deploy it properly,” he says. “But ultimately, until we actually change the incentive structure around how software vendors build and deploy things, and how consumers actually purchase and deploy and configure things, we are not going to see any benefit.”