Study Finds That 52 Percent of ChatGPT Answers to Programming Questions Are Wrong

You should see 52% of the first version of my code.

It doesn’t have to be right to be useful.

111

It's been a tremendous help to me as I relearn how to code on some personal projects. I have written 5 little apps that are very useful to me for my hobbies.

It's also been helpful at work with some random database type stuff.

But it definitely gets stuff wrong. A lot of stuff.

The funny thing is, if you point out its mistakes, it often does better on subsequent attempts. It's more an iterative process of refinement than a single prompt that gives you the final answer.

54

Sometimes ChatGPT/copilot’s code predictions are scary good. Sometimes they’re batshit crazy. If you have the experience to be able to tell the difference, it’s a great help.

52

I'm a 10 year pro, and I've changed my workflows completely to include both chatgpt and copilot. I have found that for the mundane, simple, common patterns copilot's accuracy is close to 9/10 correct, especially in my well maintained repos.

It seems like the accuracy of simple answers is directly proportional to the precision of my function and variable names.

I haven't typed a full for loop in a year thanks to copilot, I treat it like an intent autocomplete.

Chatgpt on the other hand is remarkably useful for super well laid out questions, again with extreme precision in the terms you lay out. It has helped me in greenfield development with unique and insightful methodologies to accomplish tasks that would normally require extensive documentation searching.

Anyone who claims LLMs are a nothingburger is frankly wrong. With the right guidance my output has increased dramatically and my error rate has dropped slightly. I used to be able to put out about 1000 quality lines of change in a day (a poor metric, but a useful one), and my output has expanded to at least double that using the tools we have today.

Are LLMs miraculous? No, but they are incredibly powerful tools in the right hands.

Don't throw out the baby with the bathwater.

47

Ask "are you sure?" and it will apologize right away.

27

For someone doing a study on LLMs they don’t seem to know much about LLMs.

They don’t even mention which model was used…

Here’s the study used for this clickbait garbage:

https://dl.acm.org/doi/pdf/10.1145/3613904.3642596

26

In the short term it really helps productivity, but in the end the reward for working faster is more work. Just doing the hard parts all day is going to burn developers out.

25

I worked for a year developing in Magento 2 (an open source e-commerce suite which was later bought up by Adobe; it is not well maintained and it is just all around not nice to work with). I tried to ask ChatGPT some Magento 2 questions to figure out solutions to my problems, but clearly the only data it was trained on was a lot of really bad solutions from forum posts.

The solutions did kinda work some of the time, but the way it was suggesting to do things was absolutely horrifying. We're talking opening so many vulnerabilities, breaking many parts of the suite as a whole, or just editing database tables. If you do not know enough about the tools you are working with, implementing solutions from ChatGPT can be disastrous, even if they end up working.

21

You forgot the "at least" before the 52%.

18

Sure, but by randomly guessing code you'd get 0%. Getting 48% right is actually very impressive for an LLM compared to just a few years ago.

17

What's especially troubling is that many human programmers seem to prefer the ChatGPT answers. The Purdue researchers polled 12 programmers — admittedly a small sample size — and found they preferred ChatGPT at a rate of 35 percent and didn't catch AI-generated mistakes at 39 percent.

Why is this happening? It might just be that ChatGPT is more polite than people online.

It's probably more because you can ask it your exact question (not just search for something more or less similar) and it will at least give you a lead that you can use to discover the answer, even if it doesn't give you a perfect answer.

Also, who does a survey of 12 people and publishes the results? Is that normal?

16

For the umpteenth time - an LLM just puts words together, it isn't a magic answer machine.

13

I wonder if the AI is using bad code pulled from threads where people are asking questions about why their code isn’t working, but ChatGPT can’t tell the difference and just assumes all code is good code.

10

Worth noting this study was done on gpt 3.5, 4 is leagues better than 3.5. I'd be interested to see how this number has changed

10

Probably more than 52% of what programmers type is wrong too

9

It was pretty good for a while! They lowered the power of it like Immortan Joe. Do not become addicted to AI.

8

I find the thumbnail with a "fail" funny. I'm actually surprised that it got 48% right.

8

I use chatgpt semi-often... for generating stuff in a repeating pattern. Any time I have used it to make code, I don't save any time because I have to debug most of the generated code anyway. My main use case lately is making Python dicts with empty values (e.g. key1, key2, ... becomes "key1": "", "key2": "", ...) or making a gold/prod level SQL view by passing in the backend names and frontend names (e.g. value_1, value_2, ... and Value 1, Value 2, ... become value_1 as Value 1, ...).
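
For what it's worth, both of those repeating patterns can also be scripted in a few lines. This is only a rough sketch of the idea, not anyone's actual workflow: the helper names `empty_dict` and `view_select_list` are made up for illustration, and quoting the aliases with double quotes is an assumption (the right quoting for names with spaces depends on your SQL dialect).

```python
def empty_dict(keys):
    """Map every key to an empty string, e.g. key1 -> "key1": ""."""
    return {key: "" for key in keys}


def view_select_list(backend_names, frontend_names):
    """Pair backend column names with frontend labels as SELECT-list aliases.

    Assumption: aliases are wrapped in double quotes because they may
    contain spaces; adjust the quoting for your database.
    """
    if len(backend_names) != len(frontend_names):
        raise ValueError("backend and frontend name lists must be the same length")
    return ",\n".join(
        f'{backend} AS "{frontend}"'
        for backend, frontend in zip(backend_names, frontend_names)
    )


if __name__ == "__main__":
    print(empty_dict(["key1", "key2", "key3"]))
    print(view_select_list(["value_1", "value_2"], ["Value 1", "Value 2"]))
```

Whether this beats pasting the names into a chatbot probably depends on how often the pattern comes up; for a one-off, the chatbot is likely still quicker.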

8

I'll use copilot in place of most of the times I've searched on Stack Overflow, or to do mundane things like generating repeated things, but relying solely on it is the same as relying solely on Stack Overflow.

6

AI Defenders! Assemble!

6

The interesting bit for me is that if you ask a rando some programming questions they will be 99% wrong on average I think.

Stack overflow still makes more sense though.

4

Developing with ChatGPT feels bizarrely like when Tony Stark invented a new element with Jarvis' assistance.

It's a prolonged back and forth, and you need to point out the AI's mistakes and work through a ton of iterations to get something that is close enough that you can tweak it and use it, but it's SO much faster than trawling through Stack Overflow or hoping someone who knows more than you can answer a post for you.

4

I don't even bother trying with AI, it's not been helpful to me a single time despite multiple attempts. That's a 0% success rate for me.

4

I would make some 1000 monkeys with typewriters comment, but I see what most actual contracted devs produce...

3

I've used chatgpt and gemini to build some simple powershell scripts for use in intune deployments. They've been fairly simple scripts. Very few of them have been workable solutions out of the box, and they're often filled with hallucinated cmdlets that don't exist or are part of a third-party module that it doesn't tell me needs to be installed. It's not useless tho: because I am a lousy programmer, it's been good at giving me a skeleton from which I can build a working script and debug it myself.

I reiterate that I am a lousy programmer, but it has sped up my deployments because I haven't had to work from scratch. 5/10, it's saved me a half hour here and there.

3

This is the best summary I could come up with:

In recent years, computer programmers have flocked to chatbots like OpenAI's ChatGPT to help them code, dealing a blow to places like Stack Overflow, which had to lay off nearly 30 percent of its staff last year.

That's a staggeringly large proportion for a program that people are relying on to be accurate and precise, underlining what other end users like writers and teachers are experiencing: AI platforms like ChatGPT often hallucinate totally incorrect answers out of thin air.

For the study, the researchers looked over 517 questions in Stack Overflow and analyzed ChatGPT's attempt to answer them.

The team also performed a linguistic analysis of 2,000 randomly selected ChatGPT answers and found they were "more formal and analytical" while portraying "less negative sentiment" — the sort of bland and cheery tone AI tends to produce.

The Purdue researchers polled 12 programmers — admittedly a small sample size — and found they preferred ChatGPT at a rate of 35 percent and didn't catch AI-generated mistakes at 39 percent.

The study demonstrates that ChatGPT still has major flaws — but that's cold comfort to people laid off from Stack Overflow or programmers who have to fix AI-generated mistakes in code.

The original article contains 340 words, the summary contains 199 words. Saved 41%. I'm a bot and I'm open source!

3

It's programming spell check

1

I’m surprised it scores that well.

Well, ok… that seems about right for languages like JavaScript or Python, but try it on languages with a reputation for being widely used to write terrible code, like Java or PHP (hence having been trained on terrible code), and it’s actively detrimental to even experienced developers.

1

The best method I've found for using it is to have it help you with languages you may have lost familiarity with, and to walk it through what you need step by step. This lets you evaluate its reasoning. When it gets stuck in a loop:

Try A!
Actually A doesn't work because that method doesn't exist.
Oh sorry Try B!
Yeah B doesn't work either.
You're right, so sorry about that, Try A!
Yeah.. we just did this.

at that point it's time to just close it down and try another AI.

1

We need a comparison against an average coder. Some fucking baseline ffs.

-2
