Does Bing Chat give reliable answers to math and physics questions? If not, is it possible to make it more reliable?

Wrong_thought_7@lemmy.ml · 1 year ago

Does Bing Chat give reliable answers to math and physics questions? If not, is it possible to make it more reliable?

bionicjoey@lemmy.ca · 1 year ago

Language models are designed to produce responses that read as coherent and convincing to the user. They don’t care about factuality, and in fact have no ability to “know” if they are correct. And they don’t “care”.

If you want a smart query tool that lets you ask math problems, you should try something like Wolfram Alpha. It’s not perfect, but it’s at least designed with the intent to produce answers to math problems.

blivet@artemis.camp · 1 year ago

They don’t care about factuality, and in fact have no ability to “know” if they are correct. And they don’t “care”.

I suspect that most people think (maybe not even consciously) that these models answer questions by retrieving data and then writing a response which incorporates that data, rather than just generating text that may or may not contain actual facts.

It really bears repeating over and over that all these so-called AI systems do is take a prompt and output text in response to it that reads as if a human wrote it.

CanadaPlus@lemmy.sdf.org · edit-2 1 year ago

Yeah. They may learn other stuff in the process, but at the end of the day all they are doing is predicting the next word/token.

intensely_human@lemm.ee · 1 year ago

You can tell these things about API calls and they can make API calls.

I have my own GPT4 instance instructed to gather information as necessary so it asks me questions when it needs to.

You can get it to “ask questions” with specific syntax which can then be translated to API calls. This is a way you can get an LLM to consider new information in its tasks.
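A rough sketch of what that translation layer could look like. The `CALL[tool: argument]` syntax, the `constant` tool, and every function name here are made up for illustration; they are not any vendor's actual format, and the lookup is hard-coded where a real setup would hit an API.

```python
import re

# Hypothetical convention: the model is instructed to write
# CALL[tool: argument] whenever it needs outside information.
CALL_PATTERN = re.compile(r"CALL\[(\w+):\s*([^\]]*)\]")


def lookup_constant(name: str) -> str:
    # Stand-in for a real API call; values are hard-coded for illustration.
    constants = {"c": "299792458 m/s", "g": "9.80665 m/s^2"}
    return constants.get(name.strip(), "unknown")


# Maps tool names the model may use to the code that handles them.
TOOLS = {"constant": lookup_constant}


def resolve_calls(model_output: str) -> str:
    """Replace every CALL[...] marker with the result of the matching tool."""
    def run(match: re.Match) -> str:
        tool, argument = match.group(1), match.group(2)
        handler = TOOLS.get(tool)
        if handler is None:
            return f"[no such tool: {tool}]"
        return handler(argument)

    return CALL_PATTERN.sub(run, model_output)


print(resolve_calls("The speed of light is CALL[constant: c]."))
```

In practice you would probably feed the tool result back to the model as new context rather than paste it straight into the reply, but the parsing step is the same.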

Solarius@lemmy.sdf.org · 1 year ago

They definitely retrieve data too, otherwise you wouldn’t be able to ask them about news events that happened yesterday and get a summary of them.

Thisfox@sopuli.xyz · 1 year ago

You can also get a summary on news events that didn’t happen.

Solarius@lemmy.sdf.org · 1 year ago

I guess that’s on you if you’re asking it something like “tell me yesterday’s news”. No matter your feelings on AI, our current LLMs are indisputably a great tool for drafting emails and summarizing large texts. If you’re taking the output and running with it, without checking it against any external sources or proofreading it, then I could see how someone could come to the conclusion it’s 100% terrible awful.

Hazewind@artemis.camp · 1 year ago

Thing is, they can be confidently wrong.

Solarius@lemmy.sdf.org · 1 year ago

yeah I’m not denying that, but it’s not black and white. people either praise it as this super intelligent AI or act like it’s cleverbot 2.0. if you have low expectations, use it for what it’s intended for, and actually take a moment to review the output, then it’s useful for lots of things

B0rax@feddit.de · 1 year ago

No. The Bing chatbot is built on the same kind of GPT model as ChatGPT, after all. It will oftentimes provide information that doesn’t match the sources it provides at all. Don’t trust it blindly.

You can use it in combination with WolframAlpha.

simple@lemm.ee · 1 year ago

No. As far as I know, the only LLM that can be reliable with math and physics is GPT-4 with the Wolfram extension, since it runs the math through the Wolfram API and double-checks the validity of its info. Everything else has a habit of hallucinating a lot and giving wrong answers.

themusicman@lemmy.world · 9 months ago

GPT 4 with python does a pretty good job too, but it’s the same thing I guess

lily33@lemm.ee · edit-2 1 year ago

I have experience with GPT-4, and in particular I’ve used it for math questions in my work occasionally. I’m not sure how Bing chat compares.

For GPT-4, I’ve noticed the following:

How reliable the answer is depends on how easy or obscure the question is. It hasn’t lied to me on easy or introductory material, but once your questions start becoming more obscure, and it’s less likely to have the answer in the training set, it starts making things up.

I think of it as search to an extent - it needs to have the answer in the training data to find it. Unlike Google, it can usually find an answer even if you don’t use the proper terms. But if it doesn’t find an answer, it might make something up.
“Easy or introductory” is relative - I have been able to get good answers for some masters-level math, and some wrong ones for lower-level things. Ultimately it depends on how much material on the topic was in the training set.

It’s actually much more reliable at detecting errors than it is at generating text. So you can open a new chat and ask, “Is the following true: …” and it will catch most of its own errors. Once it starts catching errors, you should know you’ve left the reliable “easy questions” territory, and even if it can still be useful, exercise much more care.
The way you phrase a prompt matters a lot. For example, if you ask it to explain its reasoning step by step, it becomes much more accurate.
It is generally good at rephrasing questions to use better terminology.
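
To make the “check it in a new chat” step concrete, here is a small sketch. The `ask` parameter is an assumption: it stands for whatever function sends a prompt to a fresh conversation and returns the reply. `fake_model` is only a stub so the example runs on its own; it is not a real model, and the YES/NO convention is just one way to make the reply easy to parse.

```python
from typing import Callable


def check_in_fresh_chat(claim: str, ask: Callable[[str], str]) -> bool:
    """Ask a fresh conversation whether `claim` is true; True means it passed."""
    prompt = (
        "Is the following true? Answer YES or NO first, then explain "
        "your reasoning step by step.\n\n" + claim
    )
    reply = ask(prompt)
    # Only the leading YES/NO is parsed; the explanation is for the human.
    return reply.strip().upper().startswith("YES")


# Stub standing in for a real model so the sketch is self-contained.
def fake_model(prompt: str) -> str:
    if "x^3" in prompt:
        return "NO. The derivative of x^2 is 2x, not x^3."
    return "YES."


print(check_in_fresh_chat("The derivative of x^2 is x^3.", fake_model))
```

A failed check doesn’t tell you the right answer; it just tells you to stop trusting the original one without verifying it elsewhere.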

Bing chat might be different in some regards. I know that it automatically searches the web for sources when generating an answer, and bases its answer on the contents of the sources it found - but I don’t have experience with it.

That said, asking for additional sources (besides the search results it found) shouldn’t improve the accuracy. It might just give you something you can use to fact-check it.

chaos@beehaw.org · 1 year ago

These models aren’t great at tasks that require precision and analytical thinking. They’re trained on a fairly simple task, “if I give you some text, guess what the next bit of text is.” Sounds simple, but it’s incredibly powerful. Imagine if you could correctly guess the next bit of text for the sentence “The answer to the ultimate question of life, the universe, and everything is” or “The solution to the problems in the Middle East is”.

Recently, we’ve been seeing shockingly good results from models that do this task. They can synthesize unrelated subjects, and hold coherent conversations that sound very human. However, despite doing some things that up until recently only humans could do, they still aren’t at human-level intelligence. Humans read and write by taking in words, converting them into rich mental concepts, applying thoughts, feelings, and reasoning to them, and then converting the resulting concepts back into words to communicate with others. LLMs arguably might be doing some of this too, but they’re evaluated solely on words and therefore much more of their “thought process” is based on “what words are likely to come next” and not “is this concept being applied correctly” or “is this factual information”. Humans have much, much greater capacity than these models, and we live complex lives that act as an incredibly comprehensive training process. These models are small and trained very narrowly in comparison. Their excellent mimicry gives the illusion of a similarly rich inner life, but it’s mostly imitation.

All that comes down to the fact that these models aren’t great at complex reasoning and precise details. They’re just not trained for it. They got through “life” by picking plausible words and that’s mostly what they’ll continue to do. For writing a novel or poem, that’s good enough, but math and physics are more rigorous than that. They do seem to be able to handle code snippets now, mostly, which is progress, but in general this isn’t something that you can be completely confident in them doing correctly. They make silly mistakes because they aren’t really thinking it through. To them, there isn’t really much difference between answers like “that date is 7 days after Christmas” and “that date is 12 days after Christmas.” Which one it thinks is more correct is based on things it has seen, not necessarily an explicit counting process. You can also see this in things like that case where someone tried to use it to write a legal brief, where it came up with citations that seemed plausible but were in fact completely made up. It wasn’t trained on accurate citations, it was trained on words.
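
For contrast, this is what an explicit counting process looks like for the date example, using Python’s standard `datetime` module. The helper name `days_after_christmas` and the choice of 2023 are just for illustration.

```python
from datetime import date, timedelta


def days_after_christmas(year: int, days: int) -> date:
    """Count forward `days` days from December 25 of `year`."""
    return date(year, 12, 25) + timedelta(days=days)


for offset in (7, 12):
    print(offset, "days after Christmas:", days_after_christmas(2023, offset))
```

Seven days lands on January 1 and twelve days on January 6. The arithmetic is deterministic, which is exactly what a “most plausible next word” guess is not.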

They also have a bad habit of sounding confident no matter what they’re saying, which makes it hard to use them for things you can’t check yourself. Anything they say could be right/accurate/good/not plagiarized, but the model won’t have a good sense of that, and if you don’t know either, you’re opening yourself up to risk of being misled.

flashgnash@lemm.ee · edit-2 1 year ago

They can definitely be made to work out arithmetic and similar though

If you were to say in the preprompt something like: When asked a mathematical question, please respond with the equations used to achieve the result

For example if you asked it what 3x4 is it could respond with “The answer is {3x4}” and then the {3x4} could be evaluated in software afterwards and dropped in for the user to see
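
Something like this, as a minimal sketch. The `{...}` placeholder convention comes from the example above; the function names are made up, and I’m assuming only plain +, -, *, / arithmetic needs to be supported. It walks the parsed expression instead of calling `eval()`, so the model can’t smuggle arbitrary code into a placeholder.

```python
import ast
import operator
import re

# Only these binary operators are allowed inside a placeholder.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str):
    """Evaluate +, -, *, / arithmetic on numbers without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    # Accept "3x4" or "3×4" as multiplication, as in the example above.
    normalized = expression.replace("x", "*").replace("×", "*").strip()
    return walk(ast.parse(normalized, mode="eval").body)


def fill_placeholders(model_output: str) -> str:
    """Replace each {expression} in the text with its computed value."""
    return re.sub(
        r"\{([^{}]+)\}",
        lambda match: str(evaluate(match.group(1))),
        model_output,
    )


print(fill_placeholders("The answer is {3x4}"))
```

This prints “The answer is 12”. The model still has to pick the right expression, of course; the software only guarantees the arithmetic.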

I think that might be what ChatGPT does now, as they somewhat recently fixed it always getting maths wrong

Or alternatively you could ask it to simply write a script to work out whatever non-linguistic problem it’s given, and execute that in a sandboxed environment (though that still might be too risky in case it generates some bad code)
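
A bare-bones sketch of the “run the generated script” idea, using only the standard library. Big caveat: a child process with a timeout is not a real sandbox. It stops runaway loops but does nothing about file or network access, so actual isolation would need a container, VM, or similar. The function name and the sample script are made up for illustration.

```python
import subprocess
import sys


def run_generated_script(source: str, timeout_seconds: float = 5.0) -> str:
    """Run model-written Python in a child process and return its stdout."""
    result = subprocess.run(
        # -I is Python's isolated mode: ignores environment variables
        # and the user's site-packages directory.
        [sys.executable, "-I", "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,  # raises subprocess.TimeoutExpired if exceeded
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()


# Stand-in for a script the model might write for "what is 3 x 4?"
print(run_generated_script("print(3 * 4)"))
```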

BeanCounter@sh.itjust.works · 1 year ago

As with every generative AI, yes and no.

Even if you explicitly tell it to do something, it will do what it sees fit. Most of the time it will work; sometimes it won’t.

It’s just another tool after all. You cannot safely rely on it, no. It’s neat and explains some things well. Sometimes it won’t.

can@sh.itjust.works · edit-2 1 year ago

It’s a useful tool but never take it at face value. Bing shares its sources and you should check every one.