The open-source world is currently awash in reports of LLM-discovered bugs and vulnerabilities, which makes for a lot more work for maintainers, but many of the current crop are being reported responsibly with an eye toward minimizing that impact. A recent report on an effort to systematically find bugs in Python extensions written in C has followed that approach. Hobbyist Daniel Diniz used Claude Code to find more than 500 bugs of various sorts across nearly a million lines of code in 44 extensions; he has been working with maintainers to get fixes upstream and his methodology serves as a great example of how to keep the human in the loop—and the maintainers out of burnout—when employing LLMs.
The numbers are fairly eye-opening: "575+ confirmed bugs (~10-15% false positive rate after review, ~140 reproduced from Python) and fixes already merged in 14 projects". The types of bugs range widely: "from hard crashes and memory corruption to correctness issues and spec violations". Meanwhile, Diniz would like to work with maintainers to make the effort "more useful and scalable for maintainers"; the goal is to provide high-quality reports of "a large class of non-trivial bugs" that are difficult to find manually.
To do that, Diniz created a Claude Code plugin, cext-review-toolkit, that is tuned for the kinds of problems specific to Python C extensions, such as reference-counting mistakes, mishandling of the global interpreter lock (GIL), and incorrect exception state. It uses "13 specialized analysis agents analyzing the C extension source code in parallel, with each agent targeting a different bug class".
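To give a sense of what one of those bug classes looks like in practice, here is a small, invented C extension function (not code from the report or from any of the affected projects) with the sort of error-path reference leak the agents are meant to flag:

```c
/* Hypothetical extension function; the names are invented for
 * illustration and do not come from Diniz's report. */
#include <Python.h>

static PyObject *
make_pair(PyObject *self, PyObject *args)
{
    PyObject *first, *second;

    /* "OO" yields borrowed references; no cleanup is needed for these. */
    if (!PyArg_ParseTuple(args, "OO", &first, &second))
        return NULL;

    PyObject *list = PyList_New(0);   /* new reference owned by us */
    if (list == NULL)
        return NULL;

    if (PyList_Append(list, first) < 0)
        return NULL;                  /* BUG: leaks "list" on this error path */

    if (PyList_Append(list, second) < 0) {
        Py_DECREF(list);              /* correct: release before bailing out */
        return NULL;
    }
    return list;
}
```

The leak only happens when a rarely exercised error path is taken, which is why this kind of bug so easily slips past both review and ordinary test runs.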
Results
The lengthy report is worth reading in its entirety, but we will highlight a few parts of it here. The tool found lots of bugs, as noted, many of which resulted in bug reports and pull requests (PRs). There are lots of links to both for more than a dozen different C extension projects, including Cython, Guppy 3, regex, Pillow, and more. The Guppy 3 maintainer, YiFei Zhu, was highlighted for digging into the extensive report for that project, fixing 24 of 30 issues found, and finding "additional bugs the tool missed". In addition, the feedback provided in the umbrella issue for the findings was "invaluable", leading to improvements to the tools to reduce false positives.
The report describes how the tool and process work: the agents are run for a project, the findings are reviewed, pure-Python reproducers are created when possible, and then a report is shared with the maintainers via a secret GitHub gist. There is another document that describes techniques for creating reproducers in Python and the report itself describes the specific types of bugs targeted by the agents.
More importantly, given the widespread problems with maintainers being buried under slop bug reports and PRs, Diniz is clearly trying to ensure that his work is worthwhile to the projects:
Reports like these can be time and energy-intensive for maintainers to investigate. Historically, automated bug-finding tools have produced far more false positives than useful information, and AI can make those false positives look incredibly convincing. [...] When a maintainer points out a false positive, I immediately update the agents' prompts so that specific pattern is avoided in the future. Beyond polishing the tools, I try to communicate in a non-invasive, helpful manner. The maintainer always holds the reins: I ask them how they prefer to receive the information (an umbrella issue? individual issues? direct PRs? or do nothing at all) and let them decide exactly what to do with the findings.
There is more to the report, including an example of a bug and reproducer, a look at things that did not work, and so on. Diniz ended with a set of questions for the community about whether the effort is useful, how to improve the tools and reports, and ideas for future tools. He also mentioned several other projects he is working on, such as a tool to analyze C extensions with an eye toward free-threaded Python and another tool to analyze the CPython source code.
Reaction
The reaction has been quite positive, unsurprisingly, with a few Python developers and maintainers popping up to talk about the experience and to suggest ideas for further refinements. James Parrott wondered about the number of bugs that would have been eliminated if Rust had been used instead. Cython maintainer David Woods thought that Rust could eliminate things like reference-counting problems, but probably not the exception-handling bugs that were prevalent in the report for Cython. Diniz put the Rust question to Claude Code, which estimated that 60-70% of the bugs would not have been prevented by Rust; Diniz cautioned that "given LLM's troubles with numbers and estimates, I wouldn't trust the percentages too much". But even the broad categorization may be suspect; Matthias Urlichs thought that Rust could prevent more types of problems "if the Rust API is designed safely (in the Rust sense) instead of literally following the C API".
Parrott also suggested using the GitHub Actions system to reproduce the bugs. That would improve the tool's reports, which are less than ideal for him: "I don't want to have to read a huge machine generated report and work out what's what." Diniz was appreciative of the suggestions and thought that he could implement them relatively easily. In particular, customizing reports is already on his radar: "I'd like to tailor the reports to what maintainers need, some like having reproducers and suggested fixes, others would prefer just a short description and code locations."
Eric Soroos, one of the Pillow maintainers, thought it was "one of the better sets of reports that we've gotten about potential security/correctness issues". He did note that the coverage was incomplete, as he spotted similar bugs in related functions that were not found. Some of the bugs were difficult to reproduce because they required a memory-allocation failure to occur in a specific place, leading to a tooling suggestion:
It would be interesting as a test run to have a fuzzer that used coverage guidance to fail mallocs (or c-api python methods) to test the error handling in those cases. It would need to run under valgrind to catch memory leaks or invalid accesses. This could give better code coverage for the repetitive if(ptr==null) {free everything allocated in the function} c level error handling.
The idea was met with approval, so Soroos expanded on it later in the thread.
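A minimal sketch of the injection half of Soroos's suggestion (without the coverage guidance) might look like the following; the wrapper and harness names here are invented, and a real version could instead hook CPython's allocators with PyMem_SetAllocator() or interpose malloc() via LD_PRELOAD:

```c
/* Sketch of allocation-failure injection; names are invented.  A test
 * harness bumps the failure point from 1 upward so that, over many runs,
 * every allocation site in the code under test eventually sees NULL and
 * its error-handling path gets exercised (ideally under Valgrind). */
#include <stdlib.h>

static long alloc_count = 0;   /* allocations seen in the current run */
static long fail_from = -1;    /* fail this allocation and later ones; -1 disables */

void test_set_failure_point(long n)
{
    alloc_count = 0;
    fail_from = n;
}

void *test_malloc(size_t size)
{
    if (fail_from >= 0 && ++alloc_count >= fail_from)
        return NULL;           /* simulate an out-of-memory condition */
    return malloc(size);
}
```

The coverage guidance Soroos described would be the harder part: the fuzzer would need to notice which error-handling branches each injected failure actually reaches and steer the failure points accordingly, rather than stepping through them blindly.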
The severity of the bugs being found, and whether they are worth the maintainer attention needed to fix them, may also factor into how useful the reports are, as Maurycy Pawłowski-Wieroński noted. He had tried using Diniz's LLM tool for CPython and had mixed results, in part because some of the bugs are only reproducible in ways that users are unlikely to ever hit:
Unless the issue is critical (even if perfectly reproducible), many fixes are just distracting. Maintainers have their own projects, plans, schedules etc., and some pathological refleak is not really that important. I believe that such PRs used to make it in the past, because they were seen as an investment (education) in a potential maintainer, a future colleague. Now, it's "Contributor" badge hunting.
Diniz gave what seems to be a characteristically thoughtful reply, agreeing that "not all findings are worth fixing". Maintainers will draw their own lines on what warrants a fix, so he is not in a position to decide which bugs merit addressing: "The best I can do is offer a listing of what the tools find and let them decide what to fix." He said that, so far, he has not gotten much feedback on whether "tiny PRs targeting nits, leaks, etc." are valuable or not, but he is open to discussing it.
This issue is likely to recur. Finding and fixing memory-allocation-failure handling, for example, is certainly important, but it may well not be as important as other things that maintainers are trying to accomplish. Tuning LLMs to prioritize their reports based on the likelihood of real-world exploitation would be another helpful step. Those who are using these tools for ill are surely pointing them toward exploitable bugs; LLM providers could potentially use those prompts (or share them) for defensive purposes. The LLM providers just might have their own tools and models that could be loosed on such a task as well.
Keeping maintainers fully in control is perhaps the most important element of this effort; giving them the ability to opt out is particularly key. There is a balance to be struck there, of course, because there may be bugs found that need escalation even when the project and its maintainers are not interested in the machine-generated reports. These are the early days for LLM bug-finding—and machines can generate far more reports than mere humans can process—so we are likely to see a variety of approaches, both good and ill. For now, this seems like a nice example of the "good" side of the coin.