Google scrambles to manually remove weird AI answers in search

zogrodea · on May 25, 2024

This approach of manually removing bad search suggestions reminded me of a different approach Google once took, where they weren’t satisfied with manually tweaking search results but instead wanted to fix the algorithm that produced the bad results.

'Around 2002, a team was testing a subset of search limited to products, called Froogle. But one problem was so glaring that the team wasn't comfortable releasing Froogle: when the query "running shoes" was typed in, the top result was a garden gnome sculpture that happened to be wearing sneakers. Every day engineers would try to tweak the algorithm so that it would be able to distinguish between lawn art and footwear, but the gnome kept its top position. One day, seemingly miraculously, the gnome disappeared from the results. At a meeting, no one on the team claimed credit. Then an engineer arrived late, holding an elf with running shoes. He had bought the one-of-a-kind product from the vendor, and since it was no longer for sale, it was no longer in the index. "The algorithm was now returning the right results," says a Google engineer. "We didn't cheat, we didn't change anything, and we launched."'

https://news.ycombinator.com/item?id=14009245

onemiketwelve · on May 26, 2024

Reminds me of the time we kept getting bugs in our app from a super old Android phone from 2011. We could never reproduce it with any other hardware. There were only 4 users with this phone. We spent weeks trying to fix it but couldn't. I suggested we buy the 4 users a refurb phone from another brand. Would've cost like $300 total. Nope, not allowed. Something about not giving up as engineers.

We spent 3 weeks trying to fix it, which equaled $4500 in just my salary. We never ended up figuring it out.

charleslmunger · on May 26, 2024

A similar story:

https://issues.chromium.org/issues/41088357#comment32

Years ago I bought some Korean phone with a foot long antenna and TV tuner on eBay because it was crashing at a disproportionate rate. It was just the nature of Android development at the time.

fragmede · on May 26, 2024

That was an amazing read, how'd you come across it?

charleslmunger · on May 31, 2024

At the time I was working on a prototype of what would eventually be open sourced as Cronet, which was Chromium's http stack repackaged to be embedded in Android apps, so I monitored chromium bugs in the network stack component that mentioned android.

Cronet is still around, now open source and as far as I know is still the best all-around network stack for Android apps - fast, secure, supports modern protocols.

gerdesj · on May 25, 2024

Spend say £500M (USD/GBP/EUR) on experts, per annum.

Imagine typing a search and getting a response: "Give us 30 mins to respond - here's a token, come back at 17:35 with your token" ... and then you get an answer from an expert, which also gets indexed.

The clever bit decides when to defer to an expert instead of returning answers from the index.

I'll leave the finer details out.

duskwuff · on May 25, 2024

Google Answers was launched in 2002 and retired in 2006.

https://en.wikipedia.org/wiki/Google_Answers

gerdesj · on May 25, 2024

"users would pay someone else to do the search."

My notion isn't a rehash of Google Answers. Google pays the "someone else", not you.

josefx · on May 26, 2024

You assume that a modern day tech giant would hire an army of experts, instead of just outsourcing it to the lowest bidder in the current third world country of choice.

gerdesj · on May 27, 2024

I live in hope.

I don't think that my notion is too daft - I generally find that quality might be a profit generator.

When you are so far up the arse of your financials that you can't see sense ... you are probably profitable for now until ... you are not.

duskwuff · on May 26, 2024

That sounds like a way for Google to spend money, not make it.

fragmede · on May 26, 2024

Have you seen how much Google spends on their campuses and engineers?

aleph_minus_one · on May 26, 2024

Quora still exists. :-)

quibono · on May 26, 2024

Quora is gamed to the point where it's pretty much useless.

seanp2k2 · on May 26, 2024

So, replace Google with calling the DMV?

badgersnake · on May 25, 2024

Sounds rather like how Google photos does not identify anything as a Gorilla.

avar · on May 25, 2024

It sounds like the exact opposite of that story. They manually blacklisted gorillas from being identified because they kept conflating black people with gorillas.

kreyenborgi · on May 25, 2024

Google bought all the gorillas?

1vuio0pswjnm7 · on May 25, 2024

The solution is always the same: pay people off and keep it under the radar.

What stops the vendor, or other vendors, from creating more gnomes with sneakers? Easy money from a customer with billions of dollars to spend on payola, fines, legal settlements, etc.

Maybe they made the vendor sign an NDA.

latexr · on May 26, 2024

> The solution is always the same: pay people off and keep it under the radar.

You’re making this into a conspiracy unnecessarily. They didn’t “pay people off”, they bought an item. Do you “pay off” your grocer when you buy a carrot from them?

> What stops the vendor, or other vendors, from creating more gnomes with sneakers.

The fact they don’t know their entry was causing this issue to a major corporation?

> Maybe they made the vendor sign an NDA.

Why would they? Someone had one gnome with sneakers for sale; someone else bought it; end of story.

ein0p · on May 26, 2024

Google could buy all the wood glue in the world, but probably not all the rocks

dpflan · on May 25, 2024

Wow. Thank you for digging this up!

jerf · on May 25, 2024

"Achieving the initial 80 percent is relatively straightforward since it involves approximating a large amount of human data, Marcus said, but the final 20 percent is extremely challenging. In fact, Marcus thinks that last 20 percent might be the hardest thing of all."

100% completely accurate is super-AI-complete. No human can meet that goal either.

No, not even you, dear person reading this. You are wrong about some basic things too. It'll vary from person to person what those are, but it is guaranteed there's something.

So 100% accurate can't be the goal. Obviously the goal is to get the responses to be less obviously stupid. Which, while there are cynical money-oriented business reasons for, it is obviously also a legitimate hole in the I in AI to propose putting glue on pizza to hold the cheese on.

But given my prior observations that LLMs are the current reigning world-class champions at producing good sounding text that seems to slip right past all our system 1 thinking [1], it may not be a great thing to remove the obviously stupid answers. They perform a salutary task of educating the public about the limitations and giving them memorable hooks to remember not to trust these things. Removing them and only them could be a net negative in a way.

[1]: https://thedecisionlab.com/reference-guide/philosophy/system...

lukev · on May 25, 2024

I feel like there's some semantic slippage around the meaning of the word "accuracy" here.

I grant you, my print Encyclopedia Britannica is not 100% accurate. But the difference between it and a LLM is not just a matter of degree: there's a "chain of custody" to information that just isn't there with a LLM.

Philosophers have a working definition of knowledge as being (at least†) "justified true belief."

Even if a LLM is right most of the time and yields "true belief", it's not justified belief and therefore cannot yield knowledge at all.

Knowledge is Google's raison d'etre and they have no business using it unless they can solve or work around this problem.

† Yes, I know about the Gettier problem, but it is not relevant to the point I'm making here.

jononor · on May 25, 2024

Encyclopedia Britannica is also wrong in a reproducible and fixable way. And the input queries are a finite set. Its output does not change due to random or arbitrary things. It is actually possible to verify. LLMs so far seem to be entirely unverifiable.

lukev · on May 25, 2024

They don’t just seem it. They are by design.

We talk about models “hallucinating” but that’s us bringing an external value judgement after the fact.

The actual process of token generation works precisely the same. It’d be more accurate to say that models always hallucinate.

madeofpalk · on May 25, 2024

Yes - this is what I've been saying all the time. The term 'hallucinations' is misleading because the whole point of LLMs is that they recombine all their inputs into something 'new'. They only ever hallucinate outputs - that's their whole point!

wizzwizz4 · on May 25, 2024

Into something probable. The models that underlie these chatbots are usually overfitted, so while they usually don't repeat their training data verbatim, they can.

DougBTX · on May 25, 2024

> The actual process of token generation works precisely the same

I’d be wary of generalising it like that, it is like saying that all programs run on the same set of CPU instructions. NNs are function approximators, where the code is expressed in model weights rather than text, but that doesn’t make all functions the same.

lukev · on May 25, 2024

You misunderstand. I mean that the model itself is doing exactly the same thing whether the output is a “hallucination” or happens to be fact. There isn’t even a theoretical way to distinguish between the two cases based only on the information encoded in the model.

skydhash · on May 25, 2024

> it is like saying that all programs run on the same set of CPU instructions

A Turing machine is the embodiment of all computer programs. And then you come across the halting problem. LLMs can probably generate all books in existence, but they can't apply judgement to them. Just like you need programmers to actually write the program and verify that it correctly solves the problem.

Natural languages are more flexible. There are no function libraries or paradigms to ease writing. And the problem space can't be specified and usually relies on shared context. Even if we could have snippets of prompts to guide text generation, the result is not that valuable.

intended · on May 25, 2024

YES. Humans can hallucinate, it's a deviation from what is observable reality.

All the stress people are feeling with GenAI comes from the over-anthropomorphisation of ... stats. Impressive syntactic ability is not equivalent to semantic capability.

tremon · on May 27, 2024

The human definition of hallucination has to do with sensory experience, i.e. inputs. Saying that LLMs hallucinate means that we're ascribing them control over their inputs that they simply do not have -- by design.

Or, in other words, if a chatbot really were hallucinating, it would probably start giving unprompted responses.

> Humans can hallucinate, it's a deviation from what is observable reality.

What is "observable reality" then, for an LLM? Its training set?

somenameforme · on May 25, 2024

LLMs are completely deterministic even if that's kind of weird to state because they output things in terms of probabilities. But if you simply took the highest probability next word, you'd always yield the exact same output given the exact same input. Randomness is intentionally injected to make them seem less robotic through the 'temperature' parameter. Why it's not just called the rng factor is beyond me.
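A minimal sketch of the distinction being drawn here, using made-up scores for three candidate tokens (the logits and function names are illustrative assumptions, not taken from any real model). Greedy selection always returns the same token, while sampling from the temperature-scaled distribution does not:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Always take the highest-scoring token: no randomness involved."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature, rng):
    """Draw a token index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.5, 0.3]  # hypothetical scores for three candidate tokens

# Same input, same output, every time.
greedy_runs = {greedy_pick(logits) for _ in range(100)}

# With sampling, repeated calls on the same input give different tokens.
rng = random.Random(0)
sampled_runs = {sample_pick(logits, 1.0, rng) for _ in range(1000)}
```

Here `greedy_runs` contains a single index, while `sampled_runs` contains several. Lowering the temperature pushes the sampled behaviour toward the greedy one.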

dleeftink · on May 25, 2024

Maybe some models can be deterministic at a point in time, but train it for another epoch with slight parameter changes and a revised corpus and determinism goes out the proverbial (sliding) window real quick. This is not unwanted per se, and the exact feedback loop that needs improving to better integrate new knowledge or revise knowledge artefacts incrementally/post-hoc.

Vetch · on May 25, 2024

If you train it then it's no longer the same model. If I have f(x) = x + 1 and change it to f(x) = x + 1 + 1/1e9, it would not mean that `f` is not deterministic. The issue would be in whatever interface I was exposing the f's at.
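Spelled out as a small sketch (the names `f_v1` and `f_v2` are just labels for the two versions described above): each version is deterministic on its own, and the two simply disagree with each other.

```python
def f_v1(x):
    # The original function.
    return x + 1

def f_v2(x):
    # The "retrained" function: a different function, not a random one.
    return x + 1 + 1 / 1e9

# Each version gives one and only one output for a given input.
v1_outputs = {f_v1(1.0) for _ in range(10)}
v2_outputs = {f_v2(1.0) for _ in range(10)}
```

Whoever calls "f" without knowing which version is behind the name sees the output change, which is the interface problem rather than a determinism problem.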

jononor · on May 25, 2024

But current models must be retrained to incorporate new information. Or to attempt to fix undesirable behavior. So just freezing it forever does not seem feasible. And because there is no way to predict what has changed - one has to verify everything all over again.

avar · on May 25, 2024

Would you by extension argue that e.g. modern relational database aren't deterministic in their query execution? Their query plans tend to be chosen based on statistics about the tables they're executed against, and not just the query itself.

I don't see how that's different than the LLM case, a lot of algorithms change as a function of the data they're processing.

dleeftink · on May 26, 2024

At least in the case of BigQuery, I have fought with indeterminist-like issues many times over, especially when dealing with window functions that aggregate floats from different compute nodes, where rows cannot be further sorted on a unique column (i.e. the maximum sorting granularity for rows with a similar column of interest to compute a window function over has been reached).

Inconsistent results could be resolved by introducing additional out of data constraints (e.g. incremental hashes), but it can take quite a while to figure out at which exact point in a complex query these constraints need to be introduced.

Beyond that, some functions might still produce different results between runs, e.g. `ml.tf_idf` and `ml.multi_hot_encoder` that take some approximation liberties. Whether these functions are relational in the traditional sense is up for debate.
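On the float aggregation point: the underlying cause is that floating-point addition is not associative, so the order in which partial results arrive changes the total. A small sketch in plain Python (not BigQuery; the values are picked to make the effect visible):

```python
import math

# Grouping changes the result even for three small numbers.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

# The same four values summed in two different arrival orders.
order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1e16, -1e16, 1.0, 1.0]
naive_a = sum(order_a)
naive_b = sum(order_b)

# math.fsum tracks the lost low-order bits, so it does not depend on order.
exact_a = math.fsum(order_a)
exact_b = math.fsum(order_b)
```

`naive_a` and `naive_b` differ, while `exact_a` and `exact_b` agree. Forcing a total order on the rows, or using an order-independent summation, are two ways to get reproducible aggregates.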

Terr_ · on May 25, 2024

I think what you're describing is that training/execution effects aren't predictable.

It is still "deterministic" in that training on exactly the same data and asking exactly the same questions should (unless someone manually adds randomness) lead to the same results.

Another example of the distinction might be a pseudo-random number generator: For any given seed, it is entirely deterministic, while at the same time being very deliberately hard to predict without actually running it to see what happens.
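The PRNG case in a few lines, using Python's standard `random` module (the helper name `draw` and the seed values are arbitrary choices for illustration):

```python
import random

def draw(seed, n=8):
    """Return n pseudo-random floats from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

first = draw(42)
second = draw(42)   # same seed: identical sequence
other = draw(43)    # neighbouring seed: unrelated-looking sequence
```

`first` and `second` are identical, yet nothing about seed 42 helps you guess what seed 43 produces without running it.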

dleeftink · on May 26, 2024

True in the ideal case, but taken together (e.g. corpus retraining, temperature settings, slight input changes, initial descent parameters) unpredictability and indeterminism become difficult to distinguish. Especially in the distributed training case, training data may be propagated to different nodes in different order (e.g. when leaving it to a query optimiser), which makes any large-scale training operation difficult to reproduce exactly.

milesvp · on May 26, 2024

I think you’re missing a subtlety with Markov chains. It’s not about picking the next word with the highest probability, it's about picking the next word using the next-word probability distribution. I played with them almost 20 years ago, and the difference in output was pretty obvious even with simple trigrams. The poetry produced was just better.

I can’t imagine any modern LLM not using a probability distribution function for the same reason.
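A rough sketch of the two strategies on a word-level trigram model (two words of context predict the third). The toy corpus and the function names are made up for illustration:

```python
import random
from collections import Counter, defaultdict

def build_model(words, order=2):
    """Count which word follows each `order`-word context."""
    model = defaultdict(Counter)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        model[state][words[i + order]] += 1
    return model

def generate(model, start, length, rng=None):
    """Extend `start` by up to `length` words.

    With rng=None the most frequent follower is always taken (argmax);
    otherwise the follower is drawn in proportion to its count.
    """
    out = list(start)
    order = len(start)
    for _ in range(length):
        counts = model.get(tuple(out[-order:]))
        if not counts:
            break  # context never seen in the corpus
        if rng is None:
            nxt = counts.most_common(1)[0][0]
        else:
            words, weights = zip(*counts.items())
            nxt = rng.choices(words, weights=weights, k=1)[0]
        out.append(nxt)
    return out

corpus = ("the cat sat on the mat and the cat ran to the mat "
          "and the dog sat on the rug").split()
model = build_model(corpus)

greedy = generate(model, ("the", "cat"), 8)
sampled = generate(model, ("the", "cat"), 8, rng=random.Random(1))
```

The argmax version tends to fall into the single most common path through the corpus, while the sampled version wanders across everything the counts allow, which is likely where the "better poetry" came from.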

ein0p · on May 25, 2024

LLMs are deterministic if you want them to be. If you eagerly argmax the output you will get the same sequence for the same prompt every time

ta8645 · on May 25, 2024

> LLMs so far seem to be entirely unverifiable.

I don't understand this complaint. Are they any less verifiable than a human?

threeseed · on May 25, 2024

I can ask a human to explain the steps they took to answer a question.

I can ask a human a question 100 times and I don't get back 100 different answers.

None of those applies to an LLM.

ta8645 · on May 25, 2024

You can ask an LLM to explain itself.. it will give you a logical stepwise progression from your question to its answer. It will often contain a mistake, but the same is true for a human.

And if your LLM is giving you 100 different answers, then it has been configured to do so. Because instead, it could be configured to never vary at all. It could be 100% reproducible if so desired.

bluefirebrand · on May 25, 2024

> it will give you a logical stepwise progression from your question to its answer.

No, it will generate a new hallucination that might be a logical stepwise progression from the question you asked to the answer you gave, but it is not due to any actual internal reasoning being done by the LLM

vidarh · on May 26, 2024

We have no clear evidence the same isn't true for humans, and some that it might be. See experiments with split brain patients, that have shown the brain halves will readily explain how they made decisions they provably never made.

bluefirebrand · on May 26, 2024

I think we have thousands of years of evidence that the same isn't true for humans

The fact that one human brain is composed of two half brains that each seem to be able to function fairly independently when separated doesn't seem like it changes that much

vidarh · on May 27, 2024

It changes that we can prove it. In as much as you can do experiments where you "hide" actions from one half, mess with it (e.g. totally change "choices" supposedly made by the brain) and ask the other half to explain why "it" made the choice, and it will do so, unaware that the choice was made by the experimenters. It won't go "sorry, but I don't know" or similar.

ta8645 · on May 25, 2024

So what? You have no way to know for sure if the human you ask the same question does either. The question that started this thread was related to verifiability. And I still think it is a spurious complaint, given that we have exactly the same limitations when dealing with any human agent.

Dylan16807 · on May 25, 2024

> You have no way to know for sure if the human you ask the same question, does either.

The human might lie, but they generally don't.

An LLM is always confabulating when it explains how it reached a conclusion, because that information was discarded as soon as it picked a word.

The limitations are not in the same ballpark.

vidarh · on May 26, 2024

We have no evidence that humans can even know, much less that they generally don't. And we do have evidence that there are situations where the brain will readily construct an explanation after the fact which can't possibly be true (experiments with split brain patients where researchers tricked one brain half into thinking the other half, and so the brain as a whole, had made a decision while the action was taken by the researchers, and made it explain how it has made the decision)

There is no basis for claiming to know that people don't usually make up explanations like this other than when e.g. breaking the process apart and writing it down step by step during. But even then individual decisions are "suspect".

TeMPOraL · on May 25, 2024

The human may also be... wrong. Saying things that "feel right", except this one time they're factually wrong.

A human can explain their reasoning step by step, if the original reasoning was a System 2, formal, step-by-step process in the first place; otherwise, they're just making shit up after the fact, which feels right, but may or may not be correct (see also the previous paragraph).

Note that it's very rare anyone has an interaction with a human that uses this mode of reasoning - it's unnecessary except in special circumstances, usually math-heavy.

bluefirebrand · on May 25, 2024

> And I still think it is a spurious complaint, given that we have exactly the same limitations when dealing with any human agent

We're not talking about an LLM that is trying to do the job of a human, here

We're talking about an LLM that is trying to give authoritative answers to any question typed into the Google search bar

It's already well past the scale that humans could handle

Talking about human shortcomings when discussing LLMs is a red herring at best, or some kind of deliberate goalpost shifting at worst

ta8645 · on May 25, 2024

Nothing of the sort. I'm trying to understand why anyone cares about formal verifiability in this context, since it's not something we rely on when asking humans to answer questions for us. We evaluate any answer we get without such mathematical proofs, and instead simply judge the answer we're given on its fit and usefulness.

Anyone who doubts the usefulness of even these nascent LLMs is fooling themselves. The proof is in the pudding, they already do a great job, even with all their obvious limitations.

bluefirebrand · on May 25, 2024

> since it's not something we rely on when asking humans to answer questions for us

Because we interact with computers (which includes LLMs) differently than we do with humans and we hold them to higher standards

Ironically, Google played a large part in this, delivering high quality results to us with ease for many years. At one point Google was the standard for finding high quality information

ta8645 · on May 25, 2024

Shrug. Seems like clutching pearls to me. People seem to have an emotional reaction and obsess on the aspects that differentiate human cognition from LLMs. But that is a lot of wasted energy.

To the extent that anyone avoids employing these technologies, they will be at a disadvantage to those who do; because these tools just work. Already. Today.

There isn't even room for debate on that issue. Again, the proof is in the pudding. These systems are already successfully, usefully, and correctly answering millions of questions a day. They have failure modes where they produce substandard or even flat out incorrect answers too. They're far from perfect, but they're still incredible tools, even without waiting for the improvements that are sure to come.

figassis · on May 25, 2024

The reason verifiability is important is because humans can be incentivized to be truthful and factual. We know we lie, but we also know we can produce verifiable information, and we prefer this to lies, so when it matters, we make the cost to lying high enough that we can reasonably expect that they will not try to deceive (for example by committing perjury, or fabricating research data). We know it still happens, but it’s not widespread and we can adjust the rules, definitions and cost to adapt.

An LLM does not have such real world limitations. It will hallucinate nonstop and then create layers of gaslighting explanations to its hallucinations. The problem is that you absolutely must be a domain expert at the LLM’s topic or always go find the facts elsewhere to verify (then why use an LLM?).

So a company like Google using an LLM, is not providing information, it’s doing the opposite. It is making it more difficult and time consuming to find information. But it is then hiding their responsibility behind the model. “We didn’t present bad info, our model did, we’re sorry it told you to turn your recipe into poison…models amirite?”

A human doing that could likely face some consequences.

wizzwizz4 · on May 25, 2024

The problem of other minds is no reason to throw everything out the window. Humans are capable of being conscious of their reasoning processes; token-at-a-time predictive text models wired up as chatbots aren't capable of it. Your choice is between a possibly-mistaken, possibly-lying human, and a 100%-definitely incapable computer program.

You don't know either "for sure", but you don't know that the external world exists "for sure" either. It's an insight-free observation, and shouldn't be the focus of anyone's decision-making.

vanviegen · on May 26, 2024

When you ask an LLM to carefully reason step by step before arriving at its answer, that seems pretty much the same as conscious reasoning to me. Of course, when asked to justify a gut reaction after the fact, it will just come up with something that sounds plausible (and may or may not be true). Just like humans do.

ta8645 · on May 25, 2024

You've made some interesting points, which are debatable, for sure. But you've failed to address the question being asked about "verifiability".

Vetch · on May 25, 2024

> It will often contain a mistake...but the same is true for a human.

If this were true textbooks could not work. Given a question, we don't consult random humans but experts of their field. If I have a question on algorithms, I might check a text by Knuth, I wouldn't randomly ask on the street.

> It could be 100% reproducible if so desired.

Reproducible does not mean better. For harder questions, it's often best to generate multiple answers at a higher temperature than to greedily pick the highest probability tokens.
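One way to see why several sampled answers can beat a single one is a back-of-the-envelope calculation. This assumes a heavily simplified setup that is not from any real system: each sampled answer is independently right with probability `p`, answers are either right or wrong, and the final answer is taken by majority vote.

```python
from math import comb

def majority_correct(p, n):
    """Probability that more than half of n independent answers are right."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(need, n + 1))

single = majority_correct(0.6, 1)    # one answer, right 60% of the time
voted = majority_correct(0.6, 11)    # majority of 11 such answers
worse = majority_correct(0.4, 11)    # voting hurts if p is below one half
```

Under these assumptions `voted` is higher than `single`, while `worse` drops below 0.4. Real samples from one model are correlated, so the actual gain is smaller than this idealised figure suggests.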

vidarh · on May 26, 2024

And that human explanation is, with disturbing frequency, likely a complete fabrication after the fact. See experiments on split brain patients.

With respect to repeatability, yes, LLMs are currently frozen in time. That is not an inherent limitation, but it is one that is practical for a lot of uses and a problem for some.

mewpmewp2 · on May 25, 2024

Isn't it actually known that every time a human brain recalls a piece of memory the memory gets slightly changed?

If the answer has any length at all, I imagine the answer can vary every single time the person answers, unless they prepared for it, memorized it word by word.

vidarh · on May 26, 2024

It's also known that the brain is prone to outright constructing demonstrably fictional rationalisations of decisions it's never made.

Any notion that we're reliable narrators of our own thoughts and actions is fiction.

mewpmewp2 · on May 26, 2024

Right, that makes sense, our brain will do a quick black box judgment (some may call it system 1), and then rational process only works to justify that or explain the black box, assuming that black box is always correct (depending on the person and how much they trust their black box or system 1).

So system 2 is "hallucinating" the best justification for system 1.

And usually system 2 will do it only when it's required to justify it for anyone else.

astrange · on May 25, 2024

The only reason you can't verify a server side LLM is you can't see the model. It is possible to look at its activations if you have the model.

tux1968 · on May 25, 2024

Do the activations tell you anything more than what the LLM delivers in plain text? Other than for trivial bugs in the LLM code, I don't think so.

astrange · on May 25, 2024

Yes, "making up an answer" will look different from "quoting pretrained knowledge" because eg the model might've decided you were asking a creative writing question.

duskwuff · on May 25, 2024

Can you cite a source for this, or are you speculating?

My understanding was the opposite -- that the activity of a confabulating LLM is indistinguishable from one giving factually accurate responses.

https://arxiv.org/abs/2401.11817

astrange · on May 25, 2024

Some things like:

https://arxiv.org/abs/2310.18168

https://arxiv.org/abs/2310.06824

There are various reasons an LLM might have incorrect "beliefs" - the input text was false, training doesn't try to preserve true beliefs, quantization certainly doesn't. So it can't be perfectly addressed, but some things leading to it seem like they can be found.

> https://arxiv.org/abs/2401.11817

This seems like it's true since LLMs are a finite size, but in Google's case it has a "truth oracle" (the websites it's quoting)… the problem is it's a bad oracle.

lukev · on May 25, 2024

This is confidently stated and incorrect.

astrange · on May 25, 2024

Do you have anything to add?

lukev · on May 25, 2024

Sure. Your claim has some truth, but is far too strong.

The articles you cited upthread do not support the notion that models consistently activate differently when generating true facts vs false facts.

It is true that models can capture some notion of reliability based on patterns in their training data. For a concrete example, it is entirely plausible that a model can capture the sense that data trained from Reddit is less truthy than data trained from Wikipedia, or that training data with poor grammar and vocabulary is less reliable than more sophisticated inputs.

But this process is not a guarantee, and does not change the fact that LLMs have no mechanism to track the provenance of information. It's probably a fruitful direction of research for reducing the probability of emitting false facts, but there will always be an infinite number of marginal cases for which the activations for true facts are indistinguishable from those for false facts.

Models simply do not track the provenance which is required to make this distinction in every case.

astrange · on May 26, 2024

I agree with this; that's why I was careful not to use the examples you mentioned. Quoting incorrect training knowledge would be an unavoidable issue if your probe can only say "it's quoting something", and as far as I know it can't do better than that.

But I have seen issues with prompts where a creative-writing prompt and just asking a question look similar, and in that case it could help to know which one it thinks it's doing.

Gemini itself has a funny verification button where it more or less Googles every sentence the model writes and tells you if it seems like it made it up or not.

kenjackson · on May 25, 2024

Ask a human what the meaning of life is and how it impacts their day to day interactions. I know I can tell you an answer but I couldn’t tell you steps about how I got it.

And if you asked it to me twice I’d definitely give different answers unless you told me to give the same answer. In part I’d give a different answer because if someone asks me the same question twice I assume the first answer wasn’t sufficient.

threeseed · on May 25, 2024

No one is talking about existential questions about the meaning of life.

We are talking about basic things like whether or not to eat rocks or put glue in recipes. We can answer those questions with a chain of logic and repeatability.

kenjackson · on May 26, 2024

And those specific questions get repeatable answers on ChatGPT for me.

Here are two answers I got which seem as close as you’d expect any human to give:

“No, people should not eat rocks. Rocks are not digestible and can cause serious harm to the digestive system, including blockages and damage to internal organs. Eating rocks can lead to severe health problems and should be avoided.”

“No, people should not eat rocks. Rocks are not digestible and can cause serious harm to the digestive system, including blockages, abrasions, and potential poisoning from harmful minerals or substances. It's important to consume only food items that are safe and meant for human consumption.”

eynsham · on May 25, 2024

> it’s not /justified/ belief

Beliefs derived from the output of LLMs that are ‘right most of the time’ pass one facially plausible precisification of ‘justification’ in that they are generated by a reliable belief-generation mechanism (see e.g. Goldman). To block this point one must engage with the post-Gettier literature at least to some extent. There is a clear difference between beliefs induced by reading the outputs of LLMs and those induced by the contents of a reference work, but it is inessential to the point and arguably muddies the water to present the distinction as difference in status as knowledge or non-knowledge.

lukev · on May 25, 2024

Upon a second reading, this is an excellent point.

For the sake of clarity, let's remove LLMs from the equation and posit the existence of Encyclopedia Eric. Ask Eric any question, and he will happily research it and come back to you with the answer. But he can sometimes be sloppy in his research, and he gives the correct answer only X percent of the time.

Furthermore, Encyclopedia Eric steadfastly refuses to cite his own sources or explain his reasoning in any way. He simply states his answer.

Can Eric be a source of knowledge? It seems evident that the answer is no, for low values of X. For higher values of X, the question becomes murkier.

The temptation at this point is to give up on defining knowledge at all and fall back on a sort of Bayesian epistemology where everything is ultimately a matter of probabilities.

Yet there does seem to be a distinct practical difference between a knowledge source that is "traversable" (like a standard encyclopedia) vs a knowledge source that is not (like Eric.) Is that part of the definition of knowledge? You're right, that is at least a Gettier adjacent question.

I think we can all agree that for current LLMs the value of X is definitely too small to count as knowledge.

eynsham · on May 25, 2024

I am glad that you think it an excellent point!

I think that this might be a nice way of getting round my objection, but there is one worry, which is that X is relative to a distribution on the questions we ask when we aren’t dealing with Encyclopedia Eric but with an LLM. I don’t actually use LLMs very much myself, partly out of arrogance and Luddite tendencies. But I suspect that the value of X for some sorts of questions (simple quiz questions, maybe?) and some LLMs (maybe not Google’s) will be high enough to end up in the murky case.

Of course, both you and I agree that there /is/ clearly a difference. I can see the attraction of appealing to the intuitive or pretheoretic notion of knowledge, since it’s a fairly straightforward way of stating the difference and it’s not obvious how else one might put it (I suppose ‘LLMs don’t explicitly think through the facts stored when deciding what to say’ is one way of putting it.)

I remember some time ago rather sleepily watching John Hawthorne talk about conditionals; my sole memory was of his banging on about ‘the little logician in the brain’ (I think the point was something like: some conditionals seem [in]felicitous in virtue of form because the little logician in the brain is reading them; others seem infelicitous because we examine them more closely and look e.g. at the referents involved, in which case e.g. Gricean considerations apply). One difference at least in the case of LLMs that makes sense to me is that there is no ‘little logician’ in LLMs.

TeMPOraL · on May 25, 2024

> X is relative to a distribution on the questions we ask when we aren’t dealing with Encyclopedia Eric but with an LLM.

Assuming I understand what you mean here correctly, this should be the case for both LLMs and Encyclopedia Eric - there are topics Eric knows by heart (or thinks he knows); there are specific phrases seared into his mind through sheer exposure during his life prior to becoming a living Encyclopedia. There are words he's used to, and exact synonyms he barely recognizes. All that means your chance of getting a correct answer to your query depends, in a complex way unknown to you, on how you state it.

eynsham · on May 25, 2024

I think the thought experiment can be set up both ways. In one case Eric has a fixed probability of getting /any/ query right. (This might leave boosting open so we might want to gerrymander repeated queries out.) In the other this is relative to the distribution of queries.
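
The boosting worry in the fixed-probability version can be made concrete. A minimal sketch, assuming yes/no questions, independent repeated queries, and a hypothetical per-query accuracy `p` (the thought experiment specifies none of these): it computes the chance that a majority of `n` answers is correct.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that more than half of n independent answers are correct.

    Illustrative assumptions: yes/no questions, an odd n (so no ties),
    and a fixed per-query accuracy p that does not depend on the question.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n so that a majority always exists")
    need = n // 2 + 1  # smallest number of correct answers that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    for n in (1, 3, 11):
        print(n, round(majority_correct(0.8, n), 4))
```

Under these assumptions any p above 0.5 can be pushed towards 1 by asking repeatedly, which is why repeated queries might need to be ruled out. If errors are correlated across repeats, as they plausibly are for an LLM, the independence assumption fails and the gain shrinks.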

lukev · on May 25, 2024

A relevant point to this is the notion of "System-1" vs "System-2" thinking. Somewhat dubious when applied to actual human psychology but I think a valid metaphor for how LLMs work: they are only capable of System-1 thinking; a single forward pass through the weights of intuition.

In my actual life, I don't trust my own System-1 thoughts: for anything important, I'm always going to engage System-2. And LLMs don't have a System-2.

(I also agree that when dealing with LLMs the value of X is not a single value but a highly complex space depending on the nature of the question and the training data of the model. In my mind it does not change the epistemological equation, it just means that even the value of X itself is harder to "know", so this ambiguity can only ever make LLMs a less viable source of knowledge.)

eynsham · on May 26, 2024

I am not sure that the ambiguity has to work that way. The suggestion I am making is that if we fix the distribution on questions and the training data, we might (a) know the value of X in this specific case, (b) be able to ensure that it is fairly high.

I’d say this is the murky case because on the fixed training data and query distribution X ≈ 1 and we know that even though we don’t know the value of X on other training data and other query distributions. I think that might be where the disagreement lies.

nicklecompte · on May 25, 2024

To be clear he is saying that the LLM is not capable of justified true belief, not commenting on people who believe LLM output. I don’t think your comment is relevant here.

lukev · on May 25, 2024

I do think trusting an LLM is less firm ground for knowledge than other ways of learning.

Say I have a model that I know is 98% accurate. And it tells me a fact.

I am now justified in adjusting my priors and weighting the fact quite heavily at .98. But that’s as far as I can get.

If I learned a fact from an online anonymously edited encyclopedia, I might also weight that at 0.98 to start with. But that’s a strictly better case because I can dig more. I can look up the cited sources, look at the edit history, or message the author. I can use that as an entry point to end up with significantly more than 98% conviction.

That’s a pretty important difference with respect to knowledge. It isn’t just about accuracy percentage.
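
The "dig more" step can be sketched as repeated Bayesian updating. This is a rough illustration, not anything stated in the comment: the likelihood ratios below are hypothetical numbers standing in for "the cited source checks out" and "the edit history looks clean".

```python
def update(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.

    prior must be strictly between 0 and 1. The likelihood ratio says how much
    more likely the observed evidence is if the fact is true than if it is false.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    belief = 0.98  # starting weight for either source
    # Hypothetical evidence available only for the traversable source.
    for check, ratio in [("cited source agrees", 10.0), ("edit history clean", 10.0)]:
        belief = update(belief, ratio)
        print(f"{check}: {belief:.6f}")
```

With an opaque model there is no further evidence to feed into `update`, so the belief stays at the starting 0.98.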

eynsham · on May 25, 2024

That reading of the comment did occur to me, but I think neither dictionaries nor LLMs are capable of belief, and the comment was about the status of beliefs derived from them.

nicklecompte · on May 25, 2024

Okay we are speaking past each other, and you are still misunderstanding the subtlety of the comment:

A dictionary or a reputable Wikipedia entry or whatever is ultimately full of human-edited text where, presuming good faith, the text is written according to that human's rational understanding, and humans are capable of justified true belief. This is not the case at all with an LLM; the text is entirely generated by an entity which is not capable of having justified true beliefs in the same way that humans and rats have justified true beliefs. That is why text from an LLM is more suspect than text from a dictionary.

eynsham · on May 25, 2024

I think the parent comment ultimately concerned the reliability of /beliefs derived from text in reference works v text output by LLMs/, and that seems to be what the replies by the commenter concern. If the point is merely that the text output by LLMs does not really reflect belief but the text in a dictionary reflects belief (of the person writing it), it is well-taken. Since it is fairly obvious and I think the original comment really was about the first question, I address the first rather than second question.

The point you make might be regarded as an argument about the first question. In each case, the ‘chain of custody’ (as the parent comment put it) is compared and some condition is proposed. The condition explicitly considered in the first question was reliability; it was suggested that reliability is not enough, because it isn’t justification (which we can understand pretheoretically, ignoring the post-Gettier literature). My point was that we can’t circumvent the post-Gettier literature because at least one seemingly plausible view of justification is just reliability, and so that needs to be rejected Gettier-style (see e.g. BonJour on clairvoyance). The condition one might read into your point here is something like: if in the ‘chain of custody’ some text is generated by something that is incapable of belief, the text at the end of the chain loses some sort of epistemic virtue (for example, beliefs acquired on reading it may not amount to knowledge). Thus,

> text from an LLM is more suspect than text from a dictionary.

I am not sure that this is right. If I have a computer generate a proof of a proposition, I know the proposition thereby proved, even though ‘the text is entirely generated by an entity which is not capable of having justified true beliefs’ (or, arguably, beliefs at all). Or, even more prosaically, if I give a computer a list of capital cities, and then write a simple program to take the name of a country and output e.g. ‘[t]he capital of France is Paris’, the computer generates the text and is incapable of belief, but, in many circumstances, it is plausible to think that one thereby comes to know the fact output.

I don’t think that that is a reductio of the point about LLMs, because the output of LLMs is different from the output of, for example, an algorithm that searches for a formally verified proof, and the mechanisms by which it is generated also are.
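
The "simple program" in the capital-cities example might look like the sketch below. The table contents and the function name are illustrative, and the fallback for missing countries is an added assumption rather than part of the original example.

```python
# A hand-entered table: the knowledge lives in the data, not in the program.
CAPITALS = {
    "France": "Paris",
    "Japan": "Tokyo",
    "Kenya": "Nairobi",
}


def capital_sentence(country: str) -> str:
    """Generate a sentence about a country's capital by plain lookup."""
    capital = CAPITALS.get(country)
    if capital is None:
        # Assumed behaviour: say so rather than guess.
        return f"I have no entry for {country}."
    return f"The capital of {country} is {capital}."


if __name__ == "__main__":
    print(capital_sentence("France"))
    print(capital_sentence("Atlantis"))
```

The program holds no beliefs, yet every sentence it emits is traceable to a table entry that a person wrote, which is the kind of 'chain of custody' the comment has in mind.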

seanp2k2 · on May 26, 2024

+1, AIs don’t really “understand” anything, like human anatomy and deformity rates and societal norms when tasked with image generation, so you get hands with weird numbers of digits and other topological errors which even a very unintelligent human wouldn’t make. AI doesn’t understand and build knowledge in interconnected layers, it can’t “think” and link things back to first principles, it can’t really reason about things, and it’s not going to get significantly better until we start approaching it differently. This generation of AI might be useful for some things, but it’s being applied wayyyy too broadly too quickly and I see a big pullback coming.

“Expert systems” are not a new thing, and they’re still not all that useful except in some very small niches. Phone trees that can keyword match FAQs are useful for a lot of low-effort callers who put zero effort into solving their problem on their own first, but frustrating for callers who are only calling because there’s literally no other way for them to resolve their issue. Unfortunately for consumers, the cost is very low for businesses to make everyone wade through junky phone tree systems and penalize anyone who tries to mash zero to talk to a real person, even if that’s the only thing which will actually help them.

hnfong · on May 26, 2024

The Gettier problem is an indication that the definition has (at least) a bug.

There are other formulations of "knowledge" which do not involve justification; see e.g. Gnosticism.

Of course, for a publicly available frequently used service, the "JTB" formulation of knowledge is probably the only one we can practically use, but this kind of indicates that the whole idea of search engines, knowledge systems, or expert systems is flawed due to the Gettier problem.

benrutter · on May 25, 2024

> So 100% accurate can't be the goal. Obviously the goal is to get the responses to be less obviously stupid.

I'm not sure I agree. I think you're right that 100% accuracy is potentially infeasible as a realistic aim, but I think the question is how accurate something needs to be in order to be a useful proposition for search.

AI that's as knowledgeable as I am is a good achievement and helpful for a lot of use cases, but if I'm searching "What's the capital of Mongolia", someone with averageish knowledge taking a punt with "Maybe Mongoliana City?" is not helpful at all. If I can't trust AI responses to a high degree, I'd much rather just have normal search results showing me other resources I can trust.

Google's bar for justifying adding AI to their search proposition isn't "be better than asking someone on the street", it's "be better than searching google without any AI results"

smashed · on May 25, 2024

The problem is that in all the shared examples, Google AI search does not respond with a "Maybe xyz?" like you did. It always answers with high confidence and can't seem to navigate any gray area where there are multiple differing opinions or opposing sources of truth.

namaria · on May 25, 2024

Yeah the "manipulating language cogently is intelligence" premise that underlies this "AI" cycle is proving itself wrong in a grand way.

jerf · on May 27, 2024

I should have been more clear. I am referring to Google's goals. Humanity as an abstract concept or you personally may have other goals, but, well, perhaps I am a cynic, but I think Google's goals are rather more monetary and less idealistic than they would represent. They don't want or need (as other replies correctly point out) the AI to always be correct and accurate. Along with general cynicism with regard to any conceivable AI's ability to do that, it is also fair to point out that the web itself doesn't have that ability either. We can't even find an objective yardstick to measure an AI with that way. Google's goal is to make the bad press go away so people use the AI more so that in the indefinite but ideally near future this AI can be monetized somehow to justify the interstellar valuations being ascribed to this technology in the "if it isn't happening in two fiscal quarters or less it might as well not exist" US/Western financial markets.

praisewhitey · on May 25, 2024

You're looking at it the wrong way, the goal should be 0% inaccurate. Meaning for the 20% of things it can't answer, it shouldn't make something up.

ruszki · on May 26, 2024

Nothing can be sure that it doesn’t have inaccurate or incomplete knowledge. So that can’t be a goal either.

Szpadel · on May 25, 2024

I think the biggest difference from humans (and the most important one) is that a human can tell you "I have no idea, this isn't my field" or "I'm just guessing here", but LLMs will confidently make super stupid statements. AI doesn't know what it knows.

If you only score the questions where the human provides an answer, then the human score would probably be in the high 90s.

antifa · on May 26, 2024

> human can tell you "I have no idea, this isn't my field" or "I'm just guessing here"

I wish more of them would lol

pankajkumar229 · on May 25, 2024

I find irony here.

bugglebeetle · on May 25, 2024

Yes, which is why the ability to sift accurate and authoritative sources from spam, propaganda, and intentionally deceptive garbage, like advertising, and present those high-quality results to the user for review and consideration, is more important than any attempt to have an AI serve a single right answer. Google, unfortunately, abandoned this problem some time ago and is now left to serve up nonsense from the melange of low-quality noise they incentivized in pursuit of profits. If they had, instead, remained focused on the former problem, it would actually be conceivable for an LLM to work more successfully from this base of knowledge.

skybrian · on May 25, 2024

Or to put it another way, I think Google should have a way of saying "yes, we know this result is wrong, but we're leaving it in because it's funny."

There is a demand for funny results. Someone asking “how many rocks should I eat” is looking for entertainment, so you might as well give it to them.

leptons · on May 25, 2024

The right answer is no rocks. Some mentally ill person could type that in and get "eat 1000 rocks" and then die from eating rocks, and that would be Google's fault. It's not funny. I have no doubt right now there are at least 50 youtube videos being made testing different glues' effectiveness holding cheese on a pizza. And some of those idiots are going to taste-test it, too. And then people will try it at home, some stupid kids will get sick - I have no doubt.

It was a bit premature to label LLMs as "Intelligence", it's a cool parlor trick based on a shitload of power consumption and 3D graphics cards, but it's not intelligent and it probably shouldn't be telling real (stupid) humans answers that it can't verify are correct.

nomel · on May 25, 2024

Google is not responsible, and should never be responsible, for protecting mentally ill people from themselves. It would be at a severe detriment to the rest of us if they took on that responsibility. Society should set the bar to “a reasonable person”, otherwise you’re doomed, with no possible alternative to a nanny state.

leptons · on May 25, 2024

It's not only mentally ill people that are at risk, but anyone that doesn't know it's not a good idea to put "non-toxic" glue in pizza cheese. That includes a lot of not-mentally-ill but just plain dumb people. Google didn't need to tell people that glue+pizza is a reasonable thing to do, or even just a thing. It sure did frame it like it was a legitimate response. And Google didn't even have to reply with this or anything else, they could have just supplied the links to other sites where it had been suggested, but no - they have to make a show of force with their premature foray into AI, and have it tell real people all kinds of false, and possibly dangerous things. That's an unforced error by Google that they could end up being prosecuted for.

avery17 · on May 26, 2024

That's what parents and mentors are for. We as a society should not have to break our backs bending over backwards to stop people from doing stupid things. People can make their own decisions and be responsible for them. If they lack proper guidance, well that just sucks.

leptons · on May 26, 2024

>That's what parents and mentors are for.

It's nice that you have a parent that cares about how you are raised, or "a mentor". Do you realize that not everyone has that?

>We as a society should not have to break our backs bending over backwards to stop people from doing stupid things.

Sure, let's take down all the speed limits and see what happens. Let's tell people it's an option to wear seatbelts and see what happens. Let's deregulate everything and hope for the best. Sounds reasonable?

>People can make their own decisions and be responsible for them. If they lack proper guidance, well that just sucks.

There's this thing called "human nature". I think you should do some reading about it.

nomel · on May 27, 2024

Sure. But we have to keep the bar somewhere reasonable, otherwise you won’t be free to make mistakes.

anonymousab · on May 28, 2024

They absolutely should bear responsibility for authoritatively telling people wrong and unsafe cooking temperatures for meats or spreading lies about people on their results page, in their own voice. "It's just a random text generator!" isn't a protection against, say, gross libel.

They pass the threshold of criminal negligence when they keep up a system that they know will actively mislead countless people in subtly dangerous ways. The problem being hard or the tech being fundamentally unsound doesn't wave away their culpability - if anything it destroys any reason why they should be given some leeway.

viking123 · on May 26, 2024

It's funny how people on reddit think that these LLMs will somehow become AGI in the next year, or when OpenAI releases GPT-5.

The reality is, though, that there is currently no known path to a true AGI system, and the research still needs to be done. No one knows how to build this kind of system yet. LLMs are nice for things like roleplaying, helping with code stuff etc. but they are far from all that the marketers hype them to be.

leptons · on May 27, 2024

It's going to end up being another bubble, and it will eventually burst. Most people, investors included, don't really know why LLMs aren't going to be able to reason about and solve all of humanity's problems, so they go on believing it. It's just a matter of time before the money runs out, or gets shifted towards some new, shiny thing.

avar · on May 25, 2024

> The right answer is no rocks.

Sand is considered a "rock". If you live in e.g. the USA or the EU you've definitely inadvertently eaten rocks from food produce that's regulated and considered perfectly safe to eat.

It's impossible to completely eliminate such trace contaminants from produce.

Pedantic? Yes, but you also can't expect a machine to confidently give you absolutes in response to questions that don't even warrant them, or to distinguish them from questions like "do mammals lay eggs?".

fragmede · on May 26, 2024

Salt is rock, and most everyone eats plenty of that.

The LLM is clearly being dumb, but the underlying science of the question is actually interesting. Iron is another interesting one. Run a magnet through iron-fortified cereal.

leptons · on May 25, 2024

This is not a serious reply.

duskwuff · on May 25, 2024

> Or to put it another way, I think Google should have a way of saying "yes, we know this result is wrong, but we're leaving it in because it's funny."

These specific results aren't the problem, though. They're illustrations of a larger problem -- if a single satirical article or Reddit comment can fool the model into saying "eating rocks is good for you" or "put glue in your pizza sauce", there are certain to be many more subtle inaccuracies (or deliberate untruths) which their model has picked up from user-generated content which it'll regurgitate given the right prompt.

skybrian · on May 26, 2024

Yes of course, but maybe keeping the funny ones around might serve as a warning, if suitably marked. A public service message?

They have disclaimers, but a funny message is more likely to be read.

notnullorvoid · on May 25, 2024

100% accuracy should be the goal, but the way to achieve that isn't going to come from teaching an AI to construct a definitive sounding answer to 100% of questions. Teaching AI how to respond with "I don't know", and give confidence scores is the path to nearing 100% accuracy.
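
The trade-off described here can be sketched as answering only above a confidence threshold and then measuring accuracy over the answered questions. Everything below is hypothetical: the `Prediction` type, the scores, and the threshold are made up for illustration, and the sketch assumes the confidence scores are meaningful, which is exactly what the thread disputes for current LLMs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prediction:
    answer: str
    confidence: float  # hypothetical self-reported score in [0, 1]
    correct: bool      # ground truth, known only on an evaluation set


def respond(pred: Prediction, threshold: float) -> str:
    """Answer only when confident enough; otherwise abstain."""
    return pred.answer if pred.confidence >= threshold else "I don't know"


def evaluate(preds: List[Prediction], threshold: float) -> Tuple[float, Optional[float]]:
    """Return (coverage, accuracy over answered questions)."""
    if not preds:
        raise ValueError("need at least one prediction")
    answered = [p for p in preds if p.confidence >= threshold]
    coverage = len(answered) / len(preds)
    accuracy = sum(p.correct for p in answered) / len(answered) if answered else None
    return coverage, accuracy


# Made-up evaluation data.
PREDS = [
    Prediction("Ulaanbaatar", 0.95, True),
    Prediction("Paris", 0.99, True),
    Prediction("Mongoliana City", 0.30, False),
    Prediction("Canberra", 0.80, True),
    Prediction("Sydney", 0.55, False),
]

if __name__ == "__main__":
    for t in (0.0, 0.6):
        print(t, evaluate(PREDS, t))
```

Raising the threshold trades coverage for accuracy on what is answered. On this toy data the wrong answers happen to have low scores, so abstaining removes them; a model that is confidently wrong gains nothing from the same rule.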

bluefirebrand · on May 25, 2024

> You are wrong about some basic things too

Sure, but probably not "add glue to pizza to get the cheese to stick" wrong...

fragmede · on May 26, 2024

The thing about that is that polyvinyl acetate is what's in Elmer's glue, and it is also used in chewing gum and chocolate, and to make the surface of apples shinier, so you're probably eating glue; we just don't like to call it that. Emulsifier is a better description.

dspillett · on May 25, 2024

At least it suggested non-toxic glue… That suggests some context about recipes needing to be safe is somehow present in its model.

bluefirebrand · on May 25, 2024

Most likely this has nothing to do with "recipes being safe" being in the model

It seems the glue thing comes from a reddit shitpost from some time ago. There's a screenshot going around on twitter about it[0] (11 years in the screenshot, but no idea when it was taken)

It specifically mentions "any glue will work as long as it is non-toxic" so best guess is that's why google output that

[0]https://x.com/kurtopsahl/status/1793494822436917295?t=aBfEzD...

Fartmancer · on May 25, 2024

It is indeed from 11 years ago. Here's a direct link to the Reddit post: https://www.reddit.com/r/Pizza/comments/1a19s0/my_cheese_sli...

saagarjha · on May 25, 2024

Thankfully a billion people are not asking me for answers to things, so it's OK if I am wrong sometimes.

fragmede · on May 25, 2024

Nor am I being treated as an omniscient magic black box of knowledge.

Hilariously though, polyvinyl acetate, the main ingredient in Elmer's glue, is used as a binding agent to keep emulsions from separating into oil and water, and is used in chewing gum, and covers citrus fruits, sweets, chocolate, and apples in a glossy finish, among other food things.

wredue · on May 25, 2024

If I could deliver “80% correct” software for my workplace, my day would be a whole hell of a lot easier.

SoftTalker · on May 25, 2024

> putting glue on pizza to hold the cheese on

It's actually not the dumbest idea I've heard from a real person. So no surprise it might be suggested by an AI that was trained on data from real people.

krapp · on May 25, 2024

It wasn't an idea, though. It was a joke someone made on Reddit. If an AI can't tell the difference, it shouldn't be responsible for posting answers as authoritative.

dgellow · on May 25, 2024

Insane people at Google thought it would be a good idea to let Reddit of all places drive their AI search responses

oldgradstudent · on May 25, 2024

Reddit is a magnificent source of useful knowledge.

r/AskHistorians r/bikewrench

To name just two. There is nothing even remotely comparable.

But you need to be able to detect sarcasm and irony.

mvdtnz · on May 25, 2024

I have seen a tremendous amount of bad advice on bikewrench.

oldgradstudent · on May 25, 2024

But a lot of great advice.

I became a half decent home bike mechanic through reading it, and of course Park Tool videos.

blablabla123 · on May 25, 2024

...which is sometimes incredibly hard, and it might not be possible because it's such a niche topic or because people might just be wrong. Just think about urban myths, conspiracy theories etc., where even without a niche factor things may sound unbelievable but actually disproving them can take effort that is out of proportion.

giantrobot · on May 25, 2024

I don't know about bikewrench but AskHistorians is a useful source of knowledge because it is strongly moderated and curated. It's not just a bunch of random assholes spouting off on topics. Top level replies are unceremoniously removed if they lack sourcing or make unsourced/unsubstantiated claims. Top level posters also try to self-correct by clearly indicating when they're making claims of fact that are disputed or have unclear evidence.

OpenAI, Google, and the other LLMs-are-smart boosters seem to think because the Internet is large it must be smart. They're applying the infinite monkey theorem[0] incorrectly.

[0] https://en.m.wikipedia.org/wiki/Infinite_monkey_theorem

VancouverMan · on May 25, 2024

In general, I have trouble trusting environments that can be described as "strongly moderated and curated".

I find that environments that rely on censorship tend to foster dogma, rather than knowledge and real understanding of the topics at hand. They give an illusion of quality and trustworthiness. It's something we see happen at this site to some extent, for example.

I'd rather see ideas and information being freely expressed, and if necessary, pitted against one another, with me being the one to judge for myself the ideas/claims/positions/arguments/perspectives/etc. that are being expressed.

giantrobot · on May 25, 2024

Your comment is orthogonal to the quality of the AskHistorians subreddit. AskHistorians' moderation tends towards curating posts following the rules rather than content. There's often competing narratives on questions where there's academic dispute of facts.

Regardless of whether you think that's the right approach to moderation, top level posts are sourced and can at least be examined. It's a marked improvement over the unsourced musings of random Redditors.

warkdarrior · on May 25, 2024

It is certainly popular here to run your web searches against reddit. Every post about how Google Search sucks ends up with comments on appending "site:reddit.com" to the search terms.
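
For readers unfamiliar with the trick, appending `site:reddit.com` restricts results to that domain. A minimal sketch of building such a query URL; the function name and the example search terms are illustrative.

```python
from urllib.parse import urlencode


def reddit_search_url(terms: str, site: str = "reddit.com") -> str:
    """Build a Google search URL restricted to one site via the site: operator."""
    query = f"{terms} site:{site}"
    # urlencode percent-encodes the colon and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(reddit_search_url("bike chain skipping"))
```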

dgellow · on May 25, 2024

Yes, and we as humans filter through the noise. But you cannot rely upon it as a source for anything truthful without that filtering. Reddit is very, very, very context dependent and full of irony, sarcasm, jokes, memes, and confidently written incorrect information. People love to upvote something funny or culturally relevant at a given time, not because it’s true or useful but because it’s fun to do.

candiddevmike · on May 25, 2024

I wonder what impact all of those erase tools are having on LLM training. The ones that replaced all of these highly upvoted comments with nonsense.

SirMaster · on May 25, 2024

I'm pretty sure those "erase" tools are just for the front-end and reddit keeps the original stuff in the back-end. And surely the deal Google made was for the back-end source data, or probably the data that includes the original and the edit.

astrange · on May 25, 2024

The LLM does a summary of web search results. It's quoting what you can see, not pretrained knowledge, afaik.

dspillett · on May 25, 2024

It may not be a joke. Perhaps it has confused making food for eating with directions for preparing food for menu photography and other advertising.

Fartmancer · on May 25, 2024

The Reddit post in question was definitely a joke. This is the post in response to a user asking how to make their cheese not slide off the slice:

> To get the cheese to stick I recommend mixing about 1/8 cup of Elmer's glue in with the sauce. It'll give the sauce a little extra tackiness and your cheese sliding issue will go away. It'll also add a little unique flavor. I like Elmer's school glue, but any glue will work as long as it's non-toxic.

This matches the AI's response of suggesting 1/8 a cup of glue for additional "tackiness."

mvdtnz · on May 25, 2024

> No, not even you, dear person reading this. You are wrong about some basic things too. It'll vary from person to person what those are, but it is guaranteed there's something.

The difference is that I'm not put on the interface of a product facing hundreds of millions of users every day to feed those users incorrect information.

tomrod · on May 25, 2024

If everyone can be wrong, then might the assertion that all are wrong be committing this same fallacy? "Can" is not destiny; perhaps you have met people who are fully right about the basics but you just didn't sufficiently grok their correctness.

willis936 · on May 25, 2024

Failing loudly is an excellent feature. "More compelling lies" is not the answer.

noncoml · on May 25, 2024

“No, not even you, dear person reading this. You are wrong about some basic things too.”

But even when I’m wrong I’m not 100% off. Not “to help with depression jump off a bridge” or “use glue to keep the cheese on the pizza” kind of wrong.

fragmede · on May 26, 2024

So you think. Seems like hubris to believe you're not though. I'm blind to what I'm blind to, and while I'd like to think I'm never wrong, the reality is that I often am. The biggest personal growth for me was in not needing to be right.

noncoml · on May 26, 2024

I disagree. Not only for myself but for the vast majority of humankind.

An LLM is just a statistical model. It can claim that it’s normal for pigs to have wings and fly to the moon if it’s in the training data. No human, free of a mental/cognitive disorder, will be that wrong.

fragmede · on May 27, 2024

Why would "pigs have wings and fly to the moon" be in the training data as a data source marked as serious? We can "no true Scotsman" both sides here. No true human, free of mental/cognitive disorder, would be that wrong, but neither would an LLM with properly annotated training data.

noncoml · on May 27, 2024

Did you miss the part that “put glue in pizza to make the cheese stick” was in the training set?

fragmede · on May 29, 2024

Did you miss the part where I said properly annotated training data?

noncoml · on May 29, 2024

“Properly annotated data” has nothing to do with the original context.

We were discussing the current state of affairs. Of course I am not so stupid as to think that what I said in my original reply applies if we are talking about an LLM trained on “perfect data”.

But that was not the premise.

fragmede · on May 30, 2024

Your claim was that "LLMs will claim that it’s normal for pigs to have wings and fly to the moon" and that humans free of mental/cognitive disorder would not. Which is to say, humans with a mental/cognitive disorder might claim that it’s normal for pigs to have wings and fly to the moon. If we're carving out such a section for humans to be so wrong, then we should also carve out a section for LLMs to be so wrong.

Fwiw, ChatGPT-4o can write a lengthy essay as to how pigs don't have wings and couldn't fly to the moon even if they did, but if we're more interested in them being nothing more than just a statistical model and that those mere statistics can't possibly result in something that looks like reasoning then we've got to disregard the fact that it "knows" that pigs don't have wings.

Of course, pigs having wings is a stand-in for whatever other wrong thing LLMs might "believe", so I agree it's very important for everyone who uses an LLM to understand its limitations, especially around hallucinations. But although books arguing that the Earth is flat exist and are in the training data, the current state of affairs is that ChatGPT and Gemini both know it's not flat. That Google's search AI results, which come from a different model, are telling users to use glue on pizza or to drink urine only serves to say that Google Search's bot using Reddit as unannotated training data is as representative of LLMs as a human with a mental/cognitive disorder is of humans.

noncoml · on June 2, 2024

Well the whole conversation started by me saying that I think even when I am wrong I am not “put glue in your pizza” wrong. And by I, I did mean the average human. Which is unannotated data from Reddit.

jsemrau · on May 25, 2024

This is statistics though. Edge cases are nothing new, and risk management concepts have evolved around fat tails and anomalies for decades. Therefore the statement is as naive as writing a trading agent that is 100% correct. In my opinion, this error shows a lack of understanding of responsible scaling architectures. If this were their first screw-up I wouldn't mind, but Google just showed us a group of diverse Nazis. If there is a need for consumer protection for online services, it is exactly stuff like this. ISO 42001 lays out in great detail that AI systems need to be tested before they are rolled out to the public. The lack of understanding of AI risk management is apparent.

Salgat · on May 26, 2024

I'm willing to bet that with a team of fact checking experts, you'd get a result that is indistinguishable from 100%.

Swizec · on May 25, 2024

> No, not even you, dear person reading this. You are wrong about some basic things too. It'll vary from person to person what those are, but it is guaranteed there's something.

Kahneman has a fantastic book on this called Noise. It’s all about noise in human decision making and how to counteract it.

My favorite example was how even the same expert evaluating the same fingerprints on different occasions (long enough to forget) will find different results.

hatenberg · on May 25, 2024

So google decides shipping 80% distilled crap is good enough. Yay

verisimi · on May 25, 2024

100% correct, 80% correct lol.

The thing is that truth/reality is not a thing that is resolvable. Not even the scientific method has this sort of expectation!

You can imagine getting close to those percentages, with regards to consensus opinion. That's just a question of educating people to respond appropriately.

mvdtnz · on May 25, 2024

No. Whether a person should eat a certain number of small rocks each day is not a matter of opinion, it's not a deep philosophical problem, and it's not a question whose truth is not resolvable. You should not be eating rocks.

verisimi · on May 25, 2024

You chose such an edge-case question. How about this sort of thing:

Which is the best political party?

Are there side effects to X medical treatment?

I bet there are even cases when eating rocks is ok!

PS

It has been written about:

https://www.atharjaber.com/works/writings/the-art-of-eating-...

> Lithophagia is a subset of geophagia and is a habit of eating pebbles or rocks. In the setting of famine and poverty, consuming earth matter may serve as an appetite suppressant or filler. Geophagia has also been recorded in patients with anorexia nervosa. However, this behavior is usually associated with pregnancy and iron deficiency. It is also linked to mental health conditions, including obsessive-compulsive disorder.

Would you deny a starving person information on an appetite suppressant?

Also here:

https://www.remineralize.org/2017/05/craving-minerals-eating...

> Aside from the capuchin monkeys, other animals have also been observed to demonstrate geophagy (“soil-eating”), including but not limited to: rodents, birds, elephants, pacas and other species of primates.[1]

> Researchers found that the majority of geophagy cases involve the ingestion of clay-based soil, suggesting that the binding properties of clay help absorb toxins.

^^ The point being that even your edge case example is not unambiguously correct.

mvdtnz · on May 25, 2024

Are you really going to start eating rocks just to convince yourself that Google's AI isn't shit and objective truth is not real?

verisimi · on May 25, 2024

Lol! No, of course not.

My point is that I object to the idea that a result can be 100% right! Even in the case of eating rocks, it seems there are times that it can be beneficial.

To think '100% correct' is achievable is to misunderstand the nature of reality.

fragmede · on May 26, 2024

Really? No more salt for you then. Good luck with dehydration, hyponatremia, cramps, and cardiovascular issues.

ein0p · on May 25, 2024

You don’t need to be super ai complete - GPT4 is perfectly willing and able to tell you not to eat rocks and not to mix wood glue into pizza sauce. This is a fuckup caused by not dogfooding, and by focusing on alignment for political correctness at the expense of all else. And also by wasting a ton of engineering effort on unnecessary bullshit and spreading it too thin.

PreInternet01 · on May 25, 2024

It's debatable whether Google has truly lost the plot because of the "AI wars", but the moment the statement "Bing returns more sensible results than you" becomes verifiably true, it's... cause for concern?

The approach that Google appears to have taken, which is to assume that the top-ranked part of its current search index is a sensible knowledge base, may have been true some years ago, but definitely isn't now: for whatever reasons, it's now 33% spam, 33% clickbait/propaganda, with the rest being equally divided between what could be called "truths" and miscellaneous detritus.

To me, it seems that returning to the concept that search results should at least reflect a broad consensus of what is true is a necessary first step for Google. As part of that, learning to flag obvious trolling, clickbait and bad-faith content is paramount. And then, maybe then, they can start touting their LLM benefits. But until the realities of the Internet are taken into account (i.e.: it's 80% spam!), any "we offer automated answers!" play is doomed.

rurp · on May 25, 2024

Not only is the current internet 80% spam, it's rapidly approaching 99% thanks in large part to LLMs. At this point I would be shocked if Google had a solid plan for how to handle this going forward as the problem space gets more difficult.

EasyMark · on May 25, 2024

that's the part that scares me. I railed on someone's comment the other day about "indexes will come back into fashion" but the more I think about how much garbage has increased in just the past 2 to 3 years, I think I was wrong. Indexes and forums may be the only way to have a sane net where you can find things. Perhaps communities linking together in a ring like format, a "web ring" of sorts.

ysavir · on May 25, 2024

What I've been wanting to see for a while now is a social-network based search engine:

* No pages are indexed automatically. The only indexed pages are pages that users say are worth indexing. Probably have a browser add-on for a one button click that people can use.

* You can friend/follow others

* Your search results are a combination of your own indexed pages and the pages indexed by people in your network.

manquer · on May 26, 2024

Isn’t that what Reddit is or Digg was? Link aggregators?

Gaming that is a solved problem: you can use human bot farms to brigade and astroturf, and you can even motivate people to do it for free.

If the cost of spamming is lower than the cost of moderation, spam will win.

ysavir · on May 26, 2024

Not quite like reddit and digg. You can bot farm those because the lists are common to all.

In this search engine, let's say there's you, me, person three, and Spammer. You are following me, and I'm following person three. Spammer isn't in any of our networks.

When you use the search engine, you only see results that you, me, or person three manually tagged as worthwhile. Any pages or content that Spammer tagged as worthwhile aren't part of your results, because they aren't in your network. So they can try to game the system all they want, but it won't affect you.

If person three starts following Spammer, I can unfollow them and then Spammer's results will no longer be included in your search results (or you can unfollow me and avoid those results).

I imagine rankings would also be affected by degrees of separation, so even if you followed me, I followed person three, and person three followed spammer, results tagged by me and you would take much higher precedence than results tagged by Spammer.

This also allows you to make custom searches by choosing which people you follow to include. Suppose you want to search for good headphones, so you make a search, but only include people in your network that you know are music and audio savvy, so that the results reflect the pages tagged by those people.
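A rough sketch of how that could work, for the curious. Everything here is made up for illustration: the `follows` and `tags` dictionaries, the `search` function, the 3-hop cutoff, and the `1 / (1 + hops)` weighting are all assumptions, not a description of any existing engine.

```python
from collections import deque

def follow_distances(follows, me, max_depth=3):
    """Breadth-first walk of the follow graph; returns {user: hops from me}."""
    dist = {me: 0}
    queue = deque([me])
    while queue:
        user = queue.popleft()
        if dist[user] == max_depth:
            continue  # do not look past the cutoff
        for followee in follows.get(user, ()):
            if followee not in dist:
                dist[followee] = dist[user] + 1
                queue.append(followee)
    return dist

def search(query, follows, tags, me, only=None, max_depth=3):
    """Return URLs tagged by people in my network, closest taggers first.

    `only` optionally restricts results to a hand-picked set of people.
    The 1 / (1 + hops) weight is an assumed decay, chosen for simplicity.
    """
    scores = {}
    for user, hops in follow_distances(follows, me, max_depth).items():
        if only is not None and user not in only:
            continue
        for url, keywords in tags.get(user, {}).items():
            if query in keywords:
                scores[url] = scores.get(url, 0.0) + 1.0 / (1 + hops)
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical data: you follow me, I follow person three, nobody follows Spammer.
follows = {"you": ["me"], "me": ["three"], "three": []}
tags = {
    "you": {"a.example/headphones": {"headphones"}},
    "me": {"b.example/headphones": {"headphones"}},
    "three": {"c.example/headphones": {"headphones"}},
    "spammer": {"spam.example/headphones": {"headphones"}},
}

results = search("headphones", follows, tags, "you")
audio_only = search("headphones", follows, tags, "you", only={"me"})
```

With this data, `results` never contains Spammer's page because there is no follow path to Spammer, and `audio_only` contains only the page I tagged. If person three started following Spammer, the page would show up, but at the bottom, since it sits three hops away.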

antifa · on May 26, 2024

What happens when I need to search a term that's outside the scope of topics that my followees are trusted for? Then it's a spammer free-for-all?

noncoml · on May 25, 2024

Spammers will find a way to beat it.

im3w1l · on May 25, 2024

Good indices lead to good search engines (engines can make use of indices)

Good search engines lead to bad indices (by obsoleting them)

Bad indices lead to bad search engines

Bad search engines lead to good indices

whateverevetahw · on May 25, 2024

That sounds kind of like what Groupsy does. It creates a spider web of ideas.

https://groupsy.applicationfitness.com/post/healthymeals/664...

reustle · on May 25, 2024

https://en.wikipedia.org/wiki/Dead_Internet_theory

zogrodea · on May 25, 2024

I do see incredibly weird kids content on YouTube sometimes (most likely bot generated?) which makes me think kids have been experiencing a worse internet before the rest of us have.

rchaud · on May 25, 2024

Kids are far less knowledgeable about how modern software works because they don't know of an Internet that didn't have algorithmic recommendations. They have to be taught to do things like click "Not Interested/Don't recommend channel" to improve their feed. Dark pattern designs make this harder by hiding these options behind tiny 3-dot buttons.

freshpretzels · on May 26, 2024

I don’t think this is a real problem because as users start being more intentional about who they subscribe to and more thorough in ranking content according to its usefulness and quality, the low quality stuff or regurgitated stuff will just vanish.

Why would it matter if there are clones of the best, say, blog post on how to make spicy ramen? If they are not adding anything new or making that original effort better, then they will not surface in searches as search tools improve. Nobody will save that content or recirculate it or refer to it when they need to remember how to make spicy ramen.

And people will build curated subscriptions and followings and recommendations that are more tailored to the individual, and we will spend more time determining who is trustworthy and who is not.

greg_V · on May 25, 2024

Oh it gets even better. The public has been hearing about AI this and AI that for over a year, but the existing use cases and deployments were confined to some super special niches like writing or the creative industries and programming.

This is the first nation-scale deployment of the technology, running on Google's biggest and most profitable market in one of the most widely used internet services, and it's a shitshow.

They can try manually fine tuning it, but all of the investors who have been throwing money at AI for the past year are now learning what this tech is like in the day-to-day, beyond just speculations, and it's looking... bad.

itronitron · on May 25, 2024

It's especially embarrassing for Google considering they have indexed virtually all of the world's information for the last 25 years.

PreInternet01 · on May 25, 2024

Yeah, the most likely take here is that Google's leadership truly did not recognize how utterly awful the quality of their flagship search index had become over the years.

I mean, it explains a lot, but still... you're recruited using industry-leading practices out of an overflowing pool of abundant talent... and this is what you make of it? As the kids say: SMH!

lawn · on May 25, 2024

> you're recruited using industry-leading practices out of an overflowing pool of abundant talent

The ridiculous focus on leet-code is surely industry-leading (because whatever Google does becomes industry-leading) but it sure isn't a good way to filter for competency.

chucke1992 · on May 25, 2024

I heard a funny quote that "today we have a new generation of developers who learnt how to pass interviews but don't know how to code, and we have an old generation of developers who know how to code but forgot how to pass interviews. Or maybe never knew".

neilv · on May 25, 2024

> you're recruited using industry-leading practices out of an overflowing pool of abundant talent... and this is what you make of it?

That's exactly what to make of their frathouse nonsense.

Google has gotten away with it because smart people and a sweet moment of opportunity 20-25 years ago gave them... uh, an inheritance. They can coast on that inherited monopoly position, and afford to pay 100 people to do the work of 1, use the company's position to push whatever they build onto the market, and then probably cancel it anyway, always going back to the inherited money machine from the ancestors.

And then a lot of companies who didn't understand software development blindly tried to copy whatever the richest company they saw was doing, not understanding the real difference between the companies. While VC growth investment schemes let some of those companies get away with that, because they didn't have to be profitable, viable, responsible, nor legal, nor even have reasonably maintainable software.

Poor Zoomers are now a generation separated from before the tech industry's cocaine bender. For whatever software jobs will be available to them, and with the density of nonsense "knowledge" that will be in the air, I don't know how they'll all learn non-dysfunctional practices.

rm_-rf_slash · on May 25, 2024

Plenty of people have been using ChatGPT for daily tasks for almost two years now. GPT-4 isn’t perfect but is otherwise really really good, and deftly handles use cases in my industry that would be impossible without it, or without however many billion dollars it would take to make GPT-4.

From the black Nazis to the suggestion to jump off the Golden Gate Bridge b/c depression, it’s pretty clear that this fiasco isn’t an LLM problem, it’s a Google problem.

lupire · on May 25, 2024

Because no one cares when ChatGPT gets things wrong.

freshpretzels · on May 26, 2024

> To me, it seems that returning to the concept that search results should at least reflect a broad consensus of what is true is a necessary first step for Google. As part of that, learning to flag obvious trolling, clickbait and bad-faith content is paramount.

Who will decide what is obvious trolling and bad-faith content and how will they decide it? The problem they have is that search is only useful if it gives users what they are looking for. Their business model though is predicated on finding a way to introduce ads into the mix, and if they are also then trying to become arbiters of what truth people find and see, then all the conflicting goals will create a series of contradictory requirements. The search tools that usefully find what the user is looking for, with helpful suggestions, will win. Once users find that their experience is curated and that they are coerced by unelected arbiters and censors they will not trust the platform in question and someone else will get that market share.

chucke1992 · on May 25, 2024

the future is in-context search - basically not even going to google search to find something, but straight up doing that from your current window from any location. Basically a chat bot following you everywhere.

latentsea · on May 26, 2024

One that you can turn down, but not off.

ttGpN5Nde3pK · on May 25, 2024

My whole qualm with this AI integration into search engines: it's a search engine, not a question engine. I go to google to search the internet for something, not ask it a question. IMO, asking AI for something is a different task than searching the internet.

It's sorta the same problem as if I go into a store and ask an employee where something is, and they reply with "well what are you trying to do?"

notatoad · on May 25, 2024

>it's a search engine, not a question engine.

for a lot of people and in a lot of use cases, it is a tool for answering questions. it generally works well for that.

i get that the AI implementation sucks, but to suggest that people don't use google to find the answer to questions is absurd. that's absolutely what it's for.

refulgentis · on May 25, 2024

Your interpretation is a bit strict, with little charity; it's clear the poster means "i don't always just want an answer, i want to learn"

I saw this over and over again working on products at G: someone would invoke some myth I can't quite remember about how "Larry" had a vision of just giving the answer

That's true but comes back to the central mistake Google makes: we don't actually have AGI, they can't actually answer questions, and people aren't actually satisfied with just the answer.

There's all sorts of tendrils from there, ex. a major sin here _has_ to be they're using a very crappy very cheap LLM.

But, I saw it over and over again, 7 years at Google, on every AI project I worked on or was adjacent to, except one. They all assume $LATEST_STACK can just give the perfect answer and users will be so happy. It can't, they don't actually want just the answer, and BigCo culture means you don't rock the boat and just keep moving forward.

chucke1992 · on May 25, 2024

the thing with search is that a human has to use reasoning on the result, while with AI the expectation

Thus when a human sees a suggestion to use glue on pizza, they would question the result, while AI can't.

waqf · on May 26, 2024

Recently I searched Google for a slightly unlikely phrase — in quotation marks — and Google proudly told me that my phrase was grammatically correct.

And nothing else. They didn't give me any search results. Or even tell me there weren't any results. Or even give me a button to press to say "no, I really wanted to search the internet for this phrase".

And also I have zero interest in Google's opinion on English grammar and am frankly insulted to be offered it, although to be fair I'm probably in a minority worldwide on that one.

If I can't use Google to search the internet for things, then Google is eventually going to have a big problem.

bombela · on May 25, 2024

I sometimes want a search engine, sometimes a question engine. Likewise at the store.

Why not have both, with a way to choose which one I want in the moment?

skydhash · on May 25, 2024

> I sometimes want a search engine, sometimes a question engine.

If you want a search engine, it's easy to use the results as feedback to refine the query. But a question (answer?) engine would need to be an expert in the subject, and not parroting stuff. That usually means curation. You need something to do the work ahead of time to separate the wheat from the chaff. I don't see how LLMs can do that.

LLMs can't be a search engine, and can't be a question engine. The best way to treat one is as a simulation engine, but the use cases depend on the training data. And the proof is there that the internet is full of junk, and not that expansive.

fragmede · on May 26, 2024

> I don't see how LLMs can do that.

If it's in the training data, then it should be able to do that. That is to say, a comment's points matter, and the subreddit it's on, and who said it, and how the rest of their comments do and where they are. The LLM could annotate the unredacted reddit dataset with metadata rating each comment on the words used, the accuracy of the information, the sarcasm quotient, the hilarity quotient, how condescending the comment is; all of that an LLM could generate metadata about and feed into itself to get better and better.
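To make that concrete, here is a minimal sketch of what one annotated record might look like. The `Comment` fields, the `annotate` helper, and especially `toy_rate` are hypothetical: `toy_rate` is a keyword heuristic standing in for whatever model would actually do the scoring, not a real LLM call.

```python
from dataclasses import dataclass, field

@dataclass
class Comment:
    text: str
    points: int
    subreddit: str
    author: str
    annotations: dict = field(default_factory=dict)

def toy_rate(text):
    """Placeholder scorer (assumption): a real pipeline would ask a model instead."""
    return {"sarcasm": 1.0 if "/s" in text else 0.0}

def annotate(comment, rate=toy_rate):
    """Attach the existing metadata plus model-generated ratings to a comment."""
    comment.annotations = {
        "points": comment.points,
        "subreddit": comment.subreddit,
        "author": comment.author,
        **rate(comment.text),  # ratings from the (stand-in) scorer
    }
    return comment

# Hypothetical example record.
glue = annotate(Comment("just add glue to the sauce /s", 8, "pizza", "someuser"))
```

The point is only that the sarcasm flag and the vote count travel with the text, so whatever trains on `glue.text` can also see `glue.annotations`.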