Large language models struggle with generating clean code

Large language models struggle to generate clean code: a study finds high rates of API misuse and a need for better reliability assessments.

The article discusses a study of the reliability and robustness of code that large language models (LLMs) generate in answer to Java coding questions. The study evaluated four code-capable LLMs, including OpenAI's GPT-3.5 and GPT-4, and found high rates of API misuse in their output. It also stressed that code reliability should be assessed beyond semantic correctness, and that static analysis is needed to ensure full coverage. Llama 2, an open model, performed best, with a failure rate of less than one percent.
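
To make "API misuse" concrete, here is a minimal Java sketch of the kind of problem the summary describes: code that returns the right answer on a typical input but breaks the usage contract of the API it calls. The class and method names are hypothetical and are not taken from the study; the example only assumes the documented behavior of `java.util.Iterator`, where `next()` throws `NoSuchElementException` if no element remains.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration, not code from the study.
class ApiMisuseSketch {

    // Semantically correct on non-empty input, but misuses the Iterator API:
    // next() is called without a preceding hasNext() check, so an empty list
    // throws NoSuchElementException.
    static String firstUnchecked(List<String> items) {
        Iterator<String> it = items.iterator();
        return it.next();
    }

    // Follows the API contract: check hasNext() before calling next().
    // ofNullable is used so that a null first element yields an empty
    // Optional instead of a NullPointerException.
    static Optional<String> firstChecked(List<String> items) {
        Iterator<String> it = items.iterator();
        return it.hasNext() ? Optional.ofNullable(it.next()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Both versions agree on the happy path, so a simple test would pass.
        System.out.println(firstUnchecked(List.of("a", "b")));
        System.out.println(firstChecked(List.of("a", "b")).orElse("<none>"));

        // Only the checked version survives the empty case.
        System.out.println(firstChecked(List.<String>of()).orElse("<none>"));
    }
}
```

A test that only feeds non-empty lists would accept both methods, which is why judging generated code on semantic correctness alone can miss this class of problem.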

Original article: Perhaps AI is going to take away coding jobs of those who trust this tech too much
