Encyclopedia Britannica is also wrong in a reproducible and fixable way. And the...

lukev · on May 25, 2024

They don’t just seem it. They are by design.

We talk about models “hallucinating” but that’s us bringing an external value judgement after the fact.

The actual process of token generation works precisely the same. It’d be more accurate to say that models always hallucinate.

madeofpalk · on May 25, 2024

Yes - this is what I've been saying all the time. The term 'hallucinations' is misleading because the whole point of LLMs is that they recombine all their inputs into something 'new'. They only ever hallucinate outputs - that's their whole point!

wizzwizz4 · on May 25, 2024

Into something probable. The models that underlie these chatbots are usually overfitted, so while they usually don't repeat their training data verbatim, they can.

DougBTX · on May 25, 2024

> The actual process of token generation works precisely the same

I’d be wary of generalising it like that, it is like saying that all programs run on the same set of CPU instructions. NNs are function approximators, where the code is expressed in model weights rather than text, but that doesn’t make all functions the same.

lukev · on May 25, 2024

You misunderstand. I mean that the model itself is doing exactly the same thing whether the output is a “hallucination” or happens to be fact. There isn’t even a theoretical way to distinguish between the two cases based only on the information encoded in the model.

skydhash · on May 25, 2024

> it is like saying that all programs run on the same set of CPU instructions

A Turing machine is the embodiment of all computer programs. And then you come across the halting problem. LLMs can probably generate all books in existence, but they can't apply judgement to them. Just like you need programmers to actually write the program and verify that it correctly solves the problem.

Natural languages are more flexible. There are no function libraries or paradigms to ease writing. And the problem space can't be specified and usually relies on shared context. Even if we could have snippets of prompts to guide text generation, the result is not that valuable.

intended · on May 25, 2024

YES. Humans can hallucinate, it's a deviation from what is observable reality.

All the stress people are feeling with GenAI comes from the over-anthropomorphisation of ... stats. Impressive syntactic ability is not equivalent to semantic capability.

tremon · on May 27, 2024

The human definition of hallucination has to do with sensory experience, i.e. inputs. Saying that LLMs hallucinate means that we're ascribing them control over their inputs that they simply do not have -- by design.

Or, in other words, if a chatbot really were hallucinating, it would probably start giving unprompted responses.

> Humans can hallucinate, it's a deviation from what is observable reality.

What is "observable reality" then, for an LLM? Its training set?

somenameforme · on May 25, 2024

LLMs are completely deterministic even if that's kind of weird to state because they output things in terms of probabilities. But if you simply took the highest probability next word, you'd always yield the exact same output given the exact same input. Randomness is intentionally injected to make them seem less robotic through the 'temperature' parameter. Why it's not just called the rng factor is beyond me.
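A rough sketch of the distinction being drawn here, using made-up scores for three candidate tokens rather than any real model's output. The names `softmax` and `pick_token` are hypothetical helpers, and treating temperature 0 as plain argmax is an assumption of this toy, not a claim about how any particular library handles it. The point it illustrates: the scores themselves are computed deterministically, and the variation only enters at the sampling step.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng):
    """Return the index of the chosen token."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, no randomness.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for three candidate next tokens (illustrative only).
logits = [2.0, 1.5, 0.3]

# Greedy: the same index every time, regardless of the random generator.
greedy = [pick_token(logits, 0, random.Random()) for _ in range(5)]

# Sampling: the pick depends on the generator state, so different seeds
# may give different tokens, while the same seed repeats the same pick.
sampled = [pick_token(logits, 1.0, random.Random(seed)) for seed in range(5)]
```

Note that the temperature only reshapes the distribution; the randomness comes from the draw itself, which is why fixing the seed also makes the sampled case repeatable.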

dleeftink · on May 25, 2024

Maybe some models can be deterministic at a point in time, but train it for another epoch with slight parameter changes and a revised corpus and determinism goes out the proverbial (sliding) window real quick. This is not unwanted per se, and the exact feedback loop that needs improving to better integrate new knowledge or revise knowledge artefacts incrementally/post-hoc.

Vetch · on May 25, 2024

If you train it then it's no longer the same model. If I have f(x) = x + 1 and change it to f(x) = x + 1 + 1/1e9, it would not mean that `f` is not deterministic. The issue would be in whatever interface I was exposing the f's at.

jononor · on May 25, 2024

But current models must be retrained to incorporate new information. Or to attempt to fix undesirable behavior. So just freezing it forever does not seem feasible. And because there is no way to predict what has changed - one has to verify everything all over again.

avar · on May 25, 2024

Would you by extension argue that e.g. modern relational databases aren't deterministic in their query execution? Their query plans tend to be chosen based on statistics about the tables they're executed against, and not just the query itself.

I don't see how that's different than the LLM case, a lot of algorithms change as a function of the data they're processing.

dleeftink · on May 26, 2024

At least in case of Bigquery, I have fought with indeterminist-like issues many times over, especially when dealing with window functions that aggregate floats from different compute nodes, where rows cannot be further sorted on a unique column (i.e. the maximum sorting granularity for rows with a similar column of interest to compute a window function over has been reached).

Inconsistent results could be resolved by introducing additional out of data constraints (e.g. incremental hashes), but it can take quite a while to figure out at which exact point in a complex query these constraints need to be introduced.

Beyond that, some functions might still produce different results between runs, e.g. `ml.tf_idf` and `ml.multi_hot_encoder` that take some approximation liberties. Whether these functions are relational in the traditional sense is up for debate.
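For anyone who hasn't run into the float-aggregation issue, here is a minimal sketch of the underlying effect in plain Python (this is not BigQuery code, and the values are chosen by hand to make the effect visible). Floating-point addition is not associative, so the same set of values can add up differently depending on the order in which they arrive. The helper `naive_sum` is a hypothetical name; an explicit loop is used instead of the built-in `sum` so the accumulation order is unambiguous.

```python
import math


def naive_sum(values):
    """Add values strictly left to right, the way a simple accumulator would."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same four numbers, arriving in two different orders.
arrival_a = [1e16, 1.0, -1e16, 1.0]
arrival_b = [1e16, -1e16, 1.0, 1.0]

sum_a = naive_sum(arrival_a)  # 1.0: the first 1.0 is rounded away next to 1e16
sum_b = naive_sum(arrival_b)  # 2.0: the large values cancel before the small ones are added

# Imposing a fixed order before aggregating makes the result reproducible
# (though not necessarily accurate).
stable_a = naive_sum(sorted(arrival_a))
stable_b = naive_sum(sorted(arrival_b))

# math.fsum tracks the lost low-order bits and returns the correctly rounded sum.
exact = math.fsum(arrival_a)  # 2.0
```

A forced ordering plays the same role as the extra tie-breaking constraint mentioned above: it removes the dependence on arrival order, but it does not by itself make the answer more correct.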

Terr_ · on May 25, 2024

I think what you're describing is that training/execution effects aren't predictable.

It is still "deterministic" in that training on exactly the same data and asking exactly the same questions should (unless someone manually adds randomness) lead to the same results.

Another example of the distinction might be a pseudo-random number generator: For any given seed, it is entirely deterministic, while at the same time being very deliberately hard to predict without actually running it to see what happens.
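The PRNG point is easy to check directly. A minimal sketch using Python's standard `random` module; the helper `draw` and the seed values are arbitrary choices for illustration.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


first = draw(42)
second = draw(42)  # same seed: identical sequence, every time
other = draw(43)   # neighbouring seed: an unrelated-looking sequence

same_seed_matches = first == second
```

Nothing about `first` tells you what `other` will be without running the generator, yet both are fully determined by their seed.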

dleeftink · on May 26, 2024

True in the ideal case, but taken together (e.g. corpus retraining, temperature settings, slight input changes, initial descent parameters) unpredictability and indeterminism become difficult to distinguish. Especially in the distributed training case, training data may be propagated to different nodes in different order (e.g. when leaving it to a query optimiser), which makes any large-scale training operation difficult to reproduce exactly.

milesvp · on May 26, 2024

I think you’re missing a subtlety with Markov chains. It’s not about picking the next word with the highest probability, it’s about picking the next word using the next-word probability distribution. I played with them almost 20 years ago, and the difference in output was pretty obvious even with simple trigrams. The poetry produced was just better.

I can’t imagine any modern LLM not using a probability distribution function for the same reason.
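A toy version of that difference, assuming a tiny hand-written corpus and bigrams rather than trigrams to keep it short. All names here (`counts`, `next_word`, `generate`) are hypothetical, and this is only a sketch of the idea, not a reconstruction of the original experiment.

```python
import random
from collections import Counter, defaultdict

# Tiny illustrative corpus; a real experiment would use far more text.
corpus = "the cat sat on the mat and the cat sat on the fish".split()

# Count how often each word follows each other word (a bigram table).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1


def next_word(prev, rng=None):
    """Pick the most frequent follower, or sample one if an rng is given."""
    options = counts[prev]
    if rng is None:
        return options.most_common(1)[0][0]
    words = list(options)
    weights = [options[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(start, length, rng=None):
    """Generate up to `length` words, stopping early at a dead end."""
    out = [start]
    for _ in range(length - 1):
        if not counts[out[-1]]:
            break  # the last word never had a follower in the corpus
        out.append(next_word(out[-1], rng))
    return " ".join(out)


# Always taking the top follower gets stuck in a loop:
greedy = generate("the", 8)  # "the cat sat on the cat sat on"

# Sampling from the distribution can also reach "mat", "and", "fish".
sampled = generate("the", 8, random.Random(0))
```

The greedy path can only ever visit the single most common continuation from each word, which is why it tends to cycle, while the sampled path can wander through the rest of the table.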

ein0p · on May 25, 2024

LLMs are deterministic if you want them to be. If you eagerly argmax the output you will get the same sequence for the same prompt every time

ta8645 · on May 25, 2024

> LLMs so far seem to be entirely unverifiable.

I don't understand this complaint. Are they any less verifiable than a human?

threeseed · on May 25, 2024

I can ask a human to explain the steps they took to answer a question.

I can ask a human a question 100 times and I don't get back 100 different answers.

None of those applies to an LLM.

ta8645 · on May 25, 2024

You can ask an LLM to explain itself.. it will give you a logical stepwise progression from your question to its answer. It will often contain a mistake, but the same is true for a human.

And if your LLM is giving you 100 different answers, then it has been configured to do so. Because instead, it could be configured to never vary at all. It could be 100% reproducible if so desired.

bluefirebrand · on May 25, 2024

> it will give you a logical stepwise progression from your question to its answer.

No, it will generate a new hallucination that might be a logical stepwise progression from the question you asked to the answer it gave, but it is not due to any actual internal reasoning being done by the LLM

vidarh · on May 26, 2024

We have no clear evidence the same isn't true for humans, and some that it might be. See experiments with split brain patients, that have shown the brain halves will readily explain how they made decisions they provably never made.

bluefirebrand · on May 26, 2024

I think we have thousands of years of evidence that the same isn't true for humans

The fact that one human brain is composed of two half brains that each seem to be able to function fairly independently when separated doesn't seem like it changes that much

vidarh · on May 27, 2024

It changes that we can prove it. In as much as you can do experiments where you "hide" actions from one half, mess with it (e.g. totally change "choices" supposedly made by the brain) and ask the other half to explain why "it" made the choice, and it will do so, unaware that the choice was made by the experimenters. It won't go "sorry, but I don't know" or similar.

ta8645 · on May 25, 2024

So what? You have no way to know for sure if the human you ask the same question, does either. The question that started this thread was related to verifiability. And I still think it is a spurious complaint, given that we have exactly the same limitations when dealing with any human agent.

Dylan16807 · on May 25, 2024

> You have no way to know for sure if the human you ask the same question, does either.

The human might lie, but they generally don't.

An LLM is always confabulating when it explains how it reached a conclusion, because that information was discarded as soon as it picked a word.

The limitations are not in the same ballpark.

vidarh · on May 26, 2024

We have no evidence that humans can even know, much less that they generally don't. And we do have evidence that there are situations where the brain will readily construct an explanation after the fact which can't possibly be true (experiments with split brain patients where researchers tricked one brain half into thinking the other half, and so the brain as a whole, had made a decision while the action was taken by the researchers, and made it explain how it had made the decision).

There is no basis for claiming to know that people don't usually make up explanations like this other than when e.g. breaking the process apart and writing it down step by step during. But even then individual decisions are "suspect".

TeMPOraL · on May 25, 2024

The human may also be... wrong. Saying things that "feel right", except this one time they're factually wrong.

A human can explain their reasoning step by step, if the original reasoning was a System 2, formal, step-by-step process in the first place; otherwise, they're just making shit up after the fact, which feels right, but may or may not be correct (see also the previous paragraph).

Note that it's very rare anyone has an interaction with a human that uses this mode of reasoning - it's unnecessary except in special circumstances, usually math-heavy.

bluefirebrand · on May 25, 2024

> And I still think it is a spurious complaint, given that we have exactly the same limitations when dealing with any human agent

We're not talking about an LLM that is trying to do the job of a human, here

We're talking about an LLM that is trying to give authoritative answers to any question typed into the Google search bar

It's already well past the scale that humans could handle

Talking about human shortcomings when discussing LLMs is a red herring at best, or some kind of deliberate goalpost shifting at worst

ta8645 · on May 25, 2024

Nothing of the sort. I'm trying to understand why anyone cares about formal verifiability in this context, since it's not something we rely on when asking humans to answer questions for us. We evaluate any answer we get without such mathematical proofs, and instead simply judge the answer we're given on its fit and usefulness.

Anyone who doubts the usefulness of even these nascent LLMs is fooling themselves. The proof is in the pudding, they already do a great job, even with all their obvious limitations.

bluefirebrand · on May 25, 2024

> since it's not something we rely on when asking humans to answer questions for us

Because we interact with computers (which includes LLMs) differently than we do with humans and we hold them to higher standards

Ironically, Google played a large part in this, delivering high quality results to us with ease for many years. At one point Google was the standard for finding high quality information

ta8645 · on May 25, 2024

Shrug. Seems like clutching pearls to me. People seem to have an emotional reaction and obsess on the aspects that differentiate human cognition from LLMs. But that is a lot of wasted energy.

To the extent that anyone avoids employing these technologies, they will be at a disadvantage to those who do; because these tools just work. Already. Today.

There isn't even room for debate on that issue. Again, the proof is in the pudding. These systems are already successfully, usefully, and correctly answering millions of questions a day. They have failure modes where they produce substandard or even flat out incorrect answers too. They're far from perfect, but they're still incredible tools, even without waiting for the improvements that are sure to come.

figassis · on May 25, 2024

The reason verifiability is important is because humans can be incentivized to be truthful and factual. We know we lie, but we also know we can produce verifiable information, and we prefer this to lies, so when it matters, we make the cost to lying high enough that we can reasonably expect that they will not try to deceive (for example by committing perjury, or fabricating research data). We know it still happens, but it’s not widespread and we can adjust the rules, definitions and cost to adapt.

An LLM does not have such real world limitations. It will hallucinate nonstop and then create layers of gaslighting explanations to its hallucinations. The problem is that you absolutely must be a domain expert at the LLM’s topic or always go find the facts elsewhere to verify (then why use an LLM?).

So a company like Google using an LLM, is not providing information, it’s doing the opposite. It is making it more difficult and time consuming to find information. But it is then hiding their responsibility behind the model. “We didn’t present bad info, our model did, we’re sorry it told you to turn your recipe into poison…models amirite?”

A human doing that could likely face some consequences.

wizzwizz4 · on May 25, 2024

The problem of other minds is no reason to throw everything out the window. Humans are capable of being conscious of their reasoning processes; token-at-a-time predictive text models wired up as chatbots aren't capable of it. Your choice is between a possibly-mistaken, possibly-lying human, and a 100%-definitely incapable computer program.

You don't know either "for sure", but you don't know that the external world exists "for sure" either. It's an insight-free observation, and shouldn't be the focus of anyone's decision-making.

vanviegen · on May 26, 2024

When you ask an LLM to carefully reason step by step before arriving at its answer, that seems pretty much the same as conscious reasoning to me. Of course, when asked to justify a gut reaction after the fact, it will just come up with something that sounds plausible (and may or may not be true). Just like humans do.

ta8645 · on May 25, 2024

You've made some interesting points, which are debatable, for sure. But you've failed to address the question being asked about "verifiability".

Vetch · on May 25, 2024

> It will often contain a mistake...but the same is true for a human.

If this were true textbooks could not work. Given a question, we don't consult random humans but experts of their field. If I have a question on algorithms, I might check a text by Knuth, I wouldn't randomly ask on the street.

> It could be 100% reproducible if so desired.

Reproducible does not mean better. For harder questions, it's often best to generate multiple answers at a higher temperature than to greedily pick the highest probability tokens.

vidarh · on May 26, 2024

And in most cases that human explanation is, with disturbing frequency, a complete fabrication after the fact. See experiments on split brain patients.

With respect to repeatability, yes, LLMs are currently frozen in time. That is not an inherent limitation, but it is one that is practical for a lot of uses and a problem for some.

mewpmewp2 · on May 25, 2024

Isn't it actually known that every time a human brain recalls a piece of memory the memory gets slightly changed?

If the answer has any length at all, I imagine the answer can vary every single time the person answers, unless they prepared for it, memorized it word by word.

vidarh · on May 26, 2024

It's also known that the brain is prone to outright constructing demonstrably fictional rationalisations of decisions it's never made.

Any notion that we're reliable narrators of our own thoughts and actions is fiction.

mewpmewp2 · on May 26, 2024

Right, that makes sense, our brain will do a quick black box judgment (some may call it system 1), and then rational process only works to justify that or explain the black box, assuming that black box is always correct (depending on the person and how much they trust their black box or system 1).

So system 2 is "hallucinating" the best justification for system 1.

And usually system 2 will do it only when it's required to justify it for anyone else.

astrange · on May 25, 2024

The only reason you can't verify a server side LLM is you can't see the model. It is possible to look at its activations if you have the model.

tux1968 · on May 25, 2024

Do the activations tell you anything more than what the LLM delivers in plain text? Other than for trivial bugs in the LLM code, I don't think so.

astrange · on May 25, 2024

Yes, "making up an answer" will look different from "quoting pretrained knowledge" because eg the model might've decided you were asking a creative writing question.

duskwuff · on May 25, 2024

Can you cite a source for this, or are you speculating?

My understanding was the opposite -- that the activity of a confabulating LLM is indistinguishable from one giving factually accurate responses.

https://arxiv.org/abs/2401.11817

astrange · on May 25, 2024

Some things like:

https://arxiv.org/abs/2310.18168

https://arxiv.org/abs/2310.06824

There are various reasons an LLM might have incorrect "beliefs" - the input text was false, training doesn't try to preserve true beliefs, quantization certainly doesn't. So it can't be perfectly addressed, but some things leading to it seem like they can be found.

> https://arxiv.org/abs/2401.11817

This seems like it's true since LLMs are a finite size, but in Google's case it has a "truth oracle" (the websites it's quoting)… the problem is it's a bad oracle.

lukev · on May 25, 2024

This is confidently stated and incorrect.

astrange · on May 25, 2024

Do you have anything to add?

lukev · on May 25, 2024

Sure. Your claim has some truth, but is far too strong.

The articles you cited upthread do not support the notion that models consistently activate differently when generating true facts vs false facts.

It is true that models can capture some notion of reliability based on patterns in their training data. For a concrete example, it is entirely plausible that a model can capture the sense that data trained from Reddit is less truthy than data trained from Wikipedia, or that training data with poor grammar and vocabulary is less reliable than more sophisticated inputs.

But this process is not a guarantee, and does not change the fact that LLMs have no mechanism to track the provenance of information. It's probably a fruitful direction of research for reducing the probability of emitting false facts, but there will always be an infinite number of marginal cases for which the activations for true facts are indistinguishable from those for false facts.

Models simply do not track the provenance which is required to make this distinction in every case.

astrange · on May 26, 2024

I agree with this; that's why I was careful not to use the examples you mentioned. Quoting incorrect training knowledge would be an unavoidable issue if your probe can only say "it's quoting something", and as far as I know it can't do better than that.

But I have seen issues with prompts where a creative-writing prompt and just asking a question look similar, and in that case it could help to know which one it thinks it's doing.

Gemini itself has a funny verification button where it more or less Googles every sentence the model writes and tells you if it seems like it made it up or not.

kenjackson · on May 25, 2024

Ask a human what the meaning of life is and how it impacts their day to day interactions. I know I can tell you an answer but I couldn’t tell you steps about how I got it.

And if you asked it to me twice I’d definitely give different answers unless you told me to give the same answer. In part I’d give a different answer because if someone asks me the same question twice I assume the first answer wasn’t sufficient.

threeseed · on May 25, 2024

No one is talking about existential questions about the meaning of life.

We are talking about basic things like whether or not to eat rocks or put glue in recipes. We can answer those questions with a chain of logic and repeatability.

kenjackson · on May 26, 2024

And those specific questions get repeatable answers on ChatGPT for me.

Here are two answers I got which seem as close as you’d expect any human to give:

“No, people should not eat rocks. Rocks are not digestible and can cause serious harm to the digestive system, including blockages and damage to internal organs. Eating rocks can lead to severe health problems and should be avoided.”

“No, people should not eat rocks. Rocks are not digestible and can cause serious harm to the digestive system, including blockages, abrasions, and potential poisoning from harmful minerals or substances. It's important to consume only food items that are safe and meant for human consumption.”