Llama 2 avoids errors by staying quiet; GPT-4 gives long, if useless, samples

The article covers a study by computer scientists at the University of California San Diego on the reliability and robustness of large language models (LLMs) when generating code. The researchers evaluated four code-capable LLMs using RobustAPI, a checker that detects API misuse. They collected 1,208 coding questions from Stack Overflow covering 24 common Java APIs and posed them to the LLMs in three different question formats. The results showed high rates of API misuse overall: OpenAI's GPT-3.5 and GPT-4 exhibited the highest failure rates, while Meta's Llama 2 failed less than one percent of the time, though, as the headline suggests, largely because it often declined to produce code at all. The study underscores the importance of assessing code reliability and the need to improve LLMs' ability to generate clean, correct code.
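"API misuse" in this context typically means violating an API's usage protocol, such as calling a method without a required guard or missing the exception handling the API expects. A minimal illustrative sketch in Java follows; the pattern shown (an unguarded `Iterator.next()`) is a classic misuse example, but the specific APIs and checks in RobustAPI's benchmark are not reproduced here.

```java
import java.util.Iterator;
import java.util.List;

public class ApiMisuseDemo {
    // Misuse: calling next() without checking hasNext() first
    // risks a NoSuchElementException on an empty collection.
    static String firstUnsafe(List<String> items) {
        Iterator<String> it = items.iterator();
        return it.next(); // throws if the list is empty
    }

    // Correct usage: guard next() with hasNext(), the protocol
    // an API-misuse checker would verify.
    static String firstSafe(List<String> items) {
        Iterator<String> it = items.iterator();
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstSafe(List.of("a", "b"))); // prints "a"
        System.out.println(firstSafe(List.of()));         // prints "null"
    }
}
```

Generated code that looks plausible can still fail checks like this, which is why the study measures misuse rates rather than surface correctness.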
