Encyclopedia Britannica is also wrong in a reproducible and fixable way. And the input queries are a finite set. Its output does not change due to random or arbitrary things. It is actually possible to verify. LLMs so far seem to be entirely unverifiable.
Yes - this is what I've been saying all the time. The term 'hallucinations' is misleading because the whole point of LLMs is that they recombine all their inputs into something 'new'. They only ever hallucinate outputs - that's their whole point!
Into something probable. The models that underlie these chatbots are usually overfitted, so while they usually don't repeat their training data verbatim, they can.
> The actual process of token generation works precisely the same
I’d be wary of generalising it like that; it is like saying that all programs run on the same set of CPU instructions. NNs are function approximators, where the code is expressed in model weights rather than text, but that doesn’t make all functions the same.
You misunderstand. I mean that the model itself is doing exactly the same thing whether the output is a “hallucination” or happens to be fact. There isn’t even a theoretical way to distinguish between the two cases based only on the information encoded in the model.
> it is like saying that all programs run on the same set of CPU instructions
A Turing machine is the embodiment of all computer programs. And then you come across the halting problem. LLMs can probably generate all books in existence, but they can't apply judgement to them. Just like you need programmers to actually write the program and verify that it correctly solves the problem.
Natural languages are more flexible. There are no function libraries or paradigms to ease writing. And the problem space can't be fully specified and usually relies on shared context. Even if we could have snippets of prompts to guide text generation, the result is not that valuable.
YES. Humans can hallucinate, it's a deviation from what is observable reality.
All the stress people are feeling with GenAI comes from the over-anthropomorphisation of ... stats. Impressive syntactic ability is not equivalent to semantic capability.
The human definition of hallucination has to do with sensory experience, i.e. inputs. Saying that LLMs hallucinate means that we're ascribing them control over their inputs that they simply do not have -- by design.
Or, in other words, if a chatbot really were hallucinating, it would probably start giving unprompted responses.
> Humans can hallucinate, it's a deviation from what is observable reality.
What is "observable reality" then, for an LLM? Its training set?
LLMs are completely deterministic even if that's kind of weird to state because they output things in terms of probabilities. But if you simply took the highest probability next word, you'd always yield the exact same output given the exact same input. Randomness is intentionally injected to make them seem less robotic through the 'temperature' parameter. Why it's not just called the rng factor is beyond me.
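To make the distinction concrete, here is a minimal sketch of greedy picking versus temperature sampling. The vocabulary, the scores in `LOGITS`, and the function names are all made up for illustration; a real model would produce the scores, and real inference stacks may have other sources of variation.

```python
import math
import random

# Hypothetical next-token scores (logits) for a toy three-word vocabulary.
LOGITS = {"cat": 2.0, "dog": 1.5, "rock": 0.1}

def softmax(logits, temperature=1.0):
    """Turn scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy(logits):
    """Always pick the highest-scoring token: same input, same output."""
    return max(logits, key=logits.get)

def sample(logits, temperature, rng):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Greedy picking collapses to a single answer across 100 runs.
greedy_runs = {greedy(LOGITS) for _ in range(100)}
# Sampling with 100 different seeds produces several different answers.
sampled_runs = {sample(LOGITS, 1.0, random.Random(seed)) for seed in range(100)}
print(greedy_runs)
print(sampled_runs)
```

Note that even the sampled version is reproducible if you fix the seed, which is the sense in which the randomness is injected rather than inherent.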
Maybe some models can be deterministic at a point in time, but train it for another epoch with slight parameter changes and a revised corpus and determinism goes out the proverbial (sliding) window real quick. This is not unwanted per se, and it is exactly the feedback loop that needs improving to better integrate new knowledge or revise knowledge artefacts incrementally/post-hoc.
If you train it then it's no longer the same model. If I have f(x) = x + 1 and change it to f(x) = x + 1 + 1/1e9, it would not mean that `f` is not deterministic. The issue would be in whatever interface I was exposing the f's at.
But current models must be retrained to incorporate new information. Or to attempt to fix undesirable behavior. So just freezing it forever does not seem feasible. And because there is no way to predict what has changed - one has to verify everything all over again.
Would you by extension argue that e.g. modern relational databases aren't deterministic in their query execution? Their query plans tend to be chosen based on statistics about the tables they're executed against, and not just the query itself.
I don't see how that's different than the LLM case, a lot of algorithms change as a function of the data they're processing.
At least in the case of BigQuery, I have fought with nondeterminism-like issues many times over, especially when dealing with window functions that aggregate floats from different compute nodes, where rows cannot be further sorted on a unique column (i.e. the maximum sorting granularity for rows with a similar column of interest to compute a window function over has been reached).
Inconsistent results could be resolved by introducing additional out of data constraints (e.g. incremental hashes), but it can take quite a while to figure out at which exact point in a complex query these constraints need to be introduced.
Beyond that, some functions might still produce different results between runs, e.g. `ml.tf_idf` and `ml.multi_hot_encoder` that take some approximation liberties. Whether these functions are relational in the traditional sense is up for debate.
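For anyone who hasn't run into this: the float aggregation issue is plausibly just floating-point addition not being associative, so the order in which partial results arrive changes the total. A minimal sketch in plain Python follows. The values are made up, `naive_sum` is a hypothetical helper, and I'm assuming this is the mechanism at play; I'm not claiming anything about BigQuery internals.

```python
import math

def naive_sum(values):
    """Add floats left to right, the way a simple aggregation would."""
    total = 0.0
    for v in values:
        total += v
    return total

# Partial results as they might arrive from different compute nodes (made-up values).
partials = [1e16, 1.0, -1e16, 1.0]

in_arrival_order = naive_sum(partials)
in_sorted_order = naive_sum(sorted(partials))

# The same four numbers give different totals depending on order.
print(in_arrival_order, in_sorted_order)

# math.fsum tracks the lost low-order bits, so its result does not depend on order.
print(math.fsum(partials), math.fsum(sorted(partials)))
```

Forcing a total order on the rows (e.g. with an extra tie-breaking column) fixes the addition order, which is why it makes the results stable again.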
I think what you're describing is that training/execution effects aren't predictable.
It is still "deterministic" in that training on exactly the same data and asking exactly the same questions should (unless someone manually adds randomness) lead to the same results.
Another example of the distinction might be a pseudo-random number generator: For any given seed, it is entirely deterministic, while at the same time being very deliberately hard to predict without actually running it to see what happens.
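A quick sketch of the PRNG point using Python's standard `random` module (the `draws` helper is just a name I made up for the example):

```python
import random

def draws(seed, n=20):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

# Same seed, same sequence, every time: fully deterministic.
print(draws(42) == draws(42))

# A neighbouring seed gives an unrelated-looking sequence: hard to predict
# without actually running the generator.
print(draws(42) == draws(43))
```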
True in the ideal case, but taken together (e.g. corpus retraining, temperature settings, slight input changes, initial descent parameters) unpredictability and indeterminism become difficult to distinguish. Especially in the distributed training case, training data may be propagated to different nodes in different order (e.g. when leaving it to a query optimiser), which makes any large-scale training operation difficult to reproduce exactly.
I think you’re missing a subtlety with Markov chains. It’s not about picking the next word with the highest probability, it’s about picking the next word using the next-word probability distribution. I played with them almost 20 years ago, and the difference in output was pretty obvious even with simple trigrams. The poetry produced was just better.
I can’t imagine any modern LLM not using a probability distribution function for the same reason.
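Roughly what that difference looks like, as a toy sketch. I'm using bigrams instead of trigrams to keep it short, and the corpus and function names are invented for the example.

```python
import random
from collections import Counter, defaultdict

# Tiny made-up corpus; a real chain would be built from far more text.
CORPUS = "the cat sat on the mat and the cat saw the dog on the mat".split()

def build_bigrams(words):
    """Count, for every word, which words follow it and how often."""
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def generate(table, start, length, rng=None):
    """Generate up to `length` words; argmax if rng is None, else sample."""
    out = [start]
    for _ in range(length - 1):
        options = table.get(out[-1])
        if not options:
            break  # the last word never had a follower in the corpus
        if rng is None:
            # Always take the most frequent follower.
            nxt = options.most_common(1)[0][0]
        else:
            # Draw a follower in proportion to how often it was seen.
            words = list(options)
            nxt = rng.choices(words, weights=[options[w] for w in words], k=1)[0]
        out.append(nxt)
    return out

table = build_bigrams(CORPUS)
print(" ".join(generate(table, "the", 10)))
for seed in range(3):
    print(" ".join(generate(table, "the", 10, random.Random(seed))))
```

The argmax version always walks the same path and tends to get stuck in a loop, while the sampled version wanders through the less likely continuations too.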
You can ask an LLM to explain itself. It will give you a logical stepwise progression from your question to its answer. It will often contain a mistake, but the same is true for a human.
And if your LLM is giving you 100 different answers, then it has been configured to do so. Because instead, it could be configured to never vary at all. It could be 100% reproducible if so desired.
> it will give you a logical stepwise progression from your question to its answer.
No, it will generate a new hallucination that might be a logical stepwise progression from the question you asked to the answer it gave, but it is not due to any actual internal reasoning being done by the LLM.
We have no clear evidence the same isn't true for humans, and some that it might be. See experiments with split brain patients, that have shown the brain halves will readily explain how they made decisions they provably never made.
I think we have thousands of years of evidence that the same isn't true for humans.
The fact that one human brain is composed of two half brains that each seem to be able to function fairly independently when separated doesn't seem like it changes that much.
It changes that we can prove it, inasmuch as you can do experiments where you "hide" actions from one half, mess with it (e.g. totally change "choices" supposedly made by the brain) and ask the other half to explain why "it" made the choice, and it will do so, unaware that the choice was made by the experimenters. It won't go "sorry, but I don't know" or similar.
So what? You have no way to know for sure if the human you ask the same question does, either. The question that started this thread was related to verifiability. And I still think it is a spurious complaint, given that we have exactly the same limitations when dealing with any human agent.
We have no evidence that humans can even know, much less that they generally don't. And we do have evidence that there are situations where the brain will readily construct an explanation after the fact which can't possibly be true (experiments with split brain patients where researchers tricked one brain half into thinking the other half, and so the brain as a whole, had made a decision while the action was taken by the researchers, and made it explain how it had made the decision).
There is no basis for claiming to know that people don't usually make up explanations like this, other than when e.g. breaking the process apart and writing it down step by step as you go. But even then individual decisions are "suspect".
The human may also be... wrong. Saying things that "feel right", except this one time they're factually wrong.
A human can explain their reasoning step by step, if the original reasoning was a System 2, formal, step-by-step process in the first place; otherwise, they're just making shit up after the fact, which feels right, but may or may not be correct (see also the previous paragraph).
Note that it's very rare anyone has an interaction with a human that uses this mode of reasoning - it's unnecessary except in special circumstances, usually math-heavy.
Nothing of the sort. I'm trying to understand why anyone cares about formal verifiability in this context, since it's not something we rely on when asking humans to answer questions for us. We evaluate any answer we get without such mathematical proofs, and instead simply judge the answer we're given on its fit and usefulness.
Anyone who doubts the usefulness of even these nascent LLMs is fooling themselves. The proof is in the pudding, they already do a great job, even with all their obvious limitations.
> since it's not something we rely on when asking humans to answer questions for us
Because we interact with computers (which includes LLMs) differently than we do with humans and we hold them to higher standards.
Ironically, Google played a large part in this, delivering high quality results to us with ease for many years. At one point Google was the standard for finding high quality information.
Shrug. Seems like clutching pearls to me. People seem to have an emotional reaction and obsess on the aspects that differentiate human cognition from LLMs. But that is a lot of wasted energy.
To the extent that anyone avoids employing these technologies, they will be at a disadvantage to those who do; because these tools just work. Already. Today.
There isn't even room for debate on that issue. Again, the proof is in the pudding.
These systems are already successfully, usefully, and correctly answering millions of questions a day. They have failure modes where they produce substandard or even flat out incorrect answers too. They're far from perfect, but they're still incredible tools, even without waiting for the improvements that are sure to come.
Verifiability is important because humans can be incentivized to be truthful and factual. We know we lie, but we also know we can produce verifiable information, and we prefer this to lies, so when it matters, we make the cost of lying high enough that we can reasonably expect people not to try to deceive (for example by committing perjury, or fabricating research data). We know it still happens, but it’s not widespread and we can adjust the rules, definitions and cost to adapt.
An LLM does not have such real world limitations. It will hallucinate nonstop and then create layers of gaslighting explanations to its hallucinations. The problem is that you absolutely must be a domain expert at the LLM’s topic or always go find the facts elsewhere to verify (then why use an LLM?).
So a company like Google using an LLM is not providing information, it’s doing the opposite. It is making it more difficult and time consuming to find information. But it is then hiding its responsibility behind the model. “We didn’t present bad info, our model did, we’re sorry it told you to turn your recipe into poison…models amirite?”
A human doing that could likely face some consequences.
The problem of other minds is no reason to throw everything out the window. Humans are capable of being conscious of their reasoning processes; token-at-a-time predictive text models wired up as chatbots aren't capable of it. Your choice is between a possibly-mistaken, possibly-lying human, and a 100%-definitely incapable computer program.
You don't know either "for sure", but you don't know that the external world exists "for sure" either. It's an insight-free observation, and shouldn't be the focus of anyone's decision-making.
When you ask an LLM to carefully reason step by step before arriving at its answer, that seems pretty much the same as conscious reasoning to me. Of course, when asked to justify a gut reaction after the fact, it will just come up with something that sounds plausible (and may or may not be true). Just like humans do.
> It will often contain a mistake...but the same is true for a human.
If this were true, textbooks could not work. Given a question, we don't consult random humans but experts in their field. If I have a question on algorithms, I might check a text by Knuth, I wouldn't randomly ask on the street.
> It could be 100% reproducible if so desired.
Reproducible does not mean better. For harder questions, it's often better to generate multiple answers at a higher temperature than to greedily pick the highest probability tokens.
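A toy sketch of why that can happen. I'm assuming here that the multiple answers get combined by majority vote, which is one common way to use them but not the only one, and the probabilities and names in `PATHS` are entirely made up.

```python
import random
from collections import Counter

# Hypothetical model: (probability, reasoning path, final answer).
# The single most likely path ends in "41", but most of the probability
# mass is spread over paths that end in "42".
PATHS = [
    (0.4, "path A", "41"),
    (0.3, "path B", "42"),
    (0.3, "path C", "42"),
]

def greedy_answer():
    """Follow only the single most probable path."""
    return max(PATHS, key=lambda p: p[0])[2]

def sampled_answer(rng):
    """Sample one path in proportion to its probability."""
    return rng.choices(PATHS, weights=[p[0] for p in PATHS], k=1)[0][2]

def majority_answer(n_samples, rng):
    """Sample several answers and return the most common one."""
    votes = Counter(sampled_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(greedy_answer())
print(majority_answer(1001, random.Random(0)))
```

Greedy commits to the most likely single path, while voting over samples recovers the answer that most of the probability mass agrees on. Whether real models behave like this toy on a given question is an empirical matter.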
And that human explanation is, with disturbing frequency, likely a complete fabrication after the fact. See experiments on split brain patients.
With respect to repeatability, yes, LLMs are currently frozen in time. That is not an inherent limitation, but it is one that is practical for a lot of uses and a problem for some.
Isn't it actually known that every time a human brain recalls a piece of memory the memory gets slightly changed?
If the answer has any length at all, I imagine the answer can vary every single time the person answers, unless they prepared for it, memorized it word by word.
Right, that makes sense, our brain will do a quick black box judgment (some may call it system 1), and then rational process only works to justify that or explain the black box, assuming that black box is always correct (depending on the person and how much they trust their black box or system 1).
So system 2 is "hallucinating" the best justification for system 1.
And usually system 2 will do it only when it's required to justify it for anyone else.
Yes, "making up an answer" will look different from "quoting pretrained knowledge" because e.g. the model might've decided you were asking a creative writing question.
There are various reasons an LLM might have incorrect "beliefs" - the input text was false, training doesn't try to preserve true beliefs, quantization certainly doesn't. So it can't be perfectly addressed, but some things leading to it seem like they can be found.
This seems like it's true since LLMs are a finite size, but in Google's case it has a "truth oracle" (the websites it's quoting)… the problem is it's a bad oracle.
Sure. Your claim has some truth, but is far too strong.
The articles you cited upthread do not support the notion that models consistently activate differently when generating true facts vs false facts.
It is true that models can capture some notion of reliability based on patterns in their training data. For a concrete example, it is entirely plausible that a model can capture the sense that training data from Reddit is less truthy than training data from Wikipedia, or that training data with poor grammar and vocabulary is less reliable than more sophisticated inputs.
But this process is not a guarantee, and does not change the fact that LLMs have no mechanism to track the provenance of information. It's probably a fruitful direction of research for reducing the probability of emitting false facts, but there will always be an infinite number of marginal cases for which the activations for true facts are indistinguishable from those for false facts.
Models simply do not track the provenance which is required to make this distinction in every case.
I agree with this; that's why I was careful not to use the examples you mentioned. Quoting incorrect training knowledge would be an unavoidable issue if your probe can only say "it's quoting something", and as far as I know it can't do better than that.
But I have seen issues with prompts where a creative-writing prompt and just asking a question look similar, and in that case it could help to know which one it thinks it's doing.
Gemini itself has a funny verification button where it more or less Googles every sentence the model writes and tells you if it seems like it made it up or not.
Ask a human what the meaning of life is and how it impacts their day to day interactions. I know I can tell you an answer but I couldn’t tell you steps about how I got it.
And if you asked it to me twice I’d definitely give different answers unless you told me to give the same answer. In part I’d give a different answer because if someone asks me the same question twice I assume the first answer wasn’t sufficient.
No one is talking about existential questions about the meaning of life.
We are talking about basic things like whether or not to eat rocks or put glue in recipes. We can answer those questions with a chain of logic and repeatability.
And those specific questions get repeatable answers on ChatGPT for me.
Here are two answers I got which seem as close as you’d expect any human to give:
“No, people should not eat rocks. Rocks are not digestible and can cause serious harm to the digestive system, including blockages and damage to internal organs. Eating rocks can lead to severe health problems and should be avoided.”
“No, people should not eat rocks. Rocks are not digestible and can cause serious harm to the digestive system, including blockages, abrasions, and potential poisoning from harmful minerals or substances. It's important to consume only food items that are safe and meant for human consumption.”