This approach of manually removing bad search suggestions reminded me of a different approach Google once took: they weren’t satisfied with manually tweaking bad search results, and instead wanted to tweak the algorithm that produced them.
'Around 2002, a team was testing a subset of search limited to products, called Froogle. But one problem was so glaring that the team wasn't comfortable releasing Froogle: when the query "running shoes" was typed in, the top result was a garden gnome sculpture that happened to be wearing sneakers. Every day engineers would try to tweak the algorithm so that it would be able to distinguish between lawn art and footwear, but the gnome kept its top position. One day, seemingly miraculously, the gnome disappeared from the results. At a meeting, no one on the team claimed credit. Then an engineer arrived late, holding an elf with running shoes. He had bought the one-of-a-kind product from the vendor, and since it was no longer for sale, it was no longer in the index. "The algorithm was now returning the right results," says a Google engineer. "We didn't cheat, we didn't change anything, and we launched."'
reminds me of this time we kept getting bugs in our app from a super old android phone from 2011. we could never reproduce it with any other hardware. There were only 4 users with this phone. We spent weeks trying to fix it but couldn't. I suggested we buy the 4 users a refurb phone from another brand. Would've cost like $300 total. Nope, not allowed. Something about not giving up as engineers.
We spent 3 weeks trying to fix it, which equaled $4500 in just my salary. we never ended up figuring it out.
Years ago I bought some Korean phone with a foot long antenna and TV tuner on eBay because it was crashing at a disproportionate rate. It was just the nature of Android development at the time.
At the time I was working on a prototype of what would eventually be open sourced as Cronet, which was Chromium's http stack repackaged to be embedded in Android apps, so I monitored chromium bugs in the network stack component that mentioned android.
Cronet is still around, now open source and as far as I know is still the best all-around network stack for Android apps - fast, secure, supports modern protocols.
Spend say £500M (USD/GBP/EUR) on experts, per annum.
Imagine typing a search and getting a response: "Give us 30 mins to respond - here's a token, come back at 17:35 with your token" ... and then you get an answer from an expert, which also gets indexed.
The clever bit decides when to defer to an expert instead of returning answers from the index.
You assume that a modern day tech giant would hire an army of experts, instead of just outsourcing it to the lowest bidder in the current third world country of choice.
It sounds like the exact opposite of that story. They manually blacklisted gorillas from being identified because they kept conflating black people with gorillas.
The solution is always the same: pay people off and keep it under the radar.
What stops the vendor, or other vendors, from creating more gnomes with sneakers. Easy money from a customer with billions of dollars to spend on payola, fines, legal settlements, etc.
> The solution is always the same: pay people off and keep it under the radar.
You’re making this into a conspiracy unnecessarily. They didn’t “pay people off”, they bought an item. Do you “pay off” your grocer when you buy a carrot from them?
> What stops the vendor, or other vendors, from creating more gnomes with sneakers.
The fact they don’t know their entry was causing this issue to a major corporation?
> Maybe they made the vendor sign an NDA.
Why would they? Someone had one gnome with sneakers for sale; someone else bought it; end of story.
"Achieving the initial 80 percent is relatively straightforward since it involves approximating a large amount of human data, Marcus said, but the final 20 percent is extremely challenging. In fact, Marcus thinks that last 20 percent might be the hardest thing of all."
100% completely accurate is super-AI-complete. No human can meet that goal either.
No, not even you, dear person reading this. You are wrong about some basic things too. It'll vary from person to person what those are, but it is guaranteed there's something.
So 100% accurate can't be the goal. Obviously the goal is to get the responses to be less obviously stupid. There are cynical, money-oriented business reasons for that, but proposing to put glue on pizza to hold the cheese on is obviously also a legitimate hole in the I in AI.
But given my prior observations that LLMs are the current reigning world-class champions at producing good sounding text that seems to slip right past all our system 1 thinking [1], it may not be a great thing to remove the obviously stupid answers. They perform a salutary task of educating the public about the limitations and giving them memorable hooks to remember not to trust these things. Removing them and only them could be a net negative in a way.
I feel like there's some semantic slippage around the meaning of the word "accuracy" here.
I grant you, my print Encyclopedia Britannica is not 100% accurate. But the difference between it and a LLM is not just a matter of degree: there's a "chain of custody" to information that just isn't there with a LLM.
Philosophers have a working definition of knowledge as being (at least†) "justified true belief."
Even if a LLM is right most of the time and yields "true belief", it's not justified belief and therefore cannot yield knowledge at all.
Knowledge is Google's raison d'etre and they have no business using it unless they can solve or work around this problem.
† Yes, I know about the Gettier problem, but it is not relevant to the point I'm making here.
Encyclopedia Britannica is also wrong in a reproducible and fixable way. And the input queries are a finite set. Its output does not change due to random or arbitrary things. It is actually possible to verify. LLMs so far seem to be entirely unverifiable.
Yes - this is what i've been saying all the time. The term 'hallucinations' is misleading because the whole point of LLMs is that they recombine all their inputs into something 'new'. They only ever hallucinate outputs - that's their whole point!
Into something probable. The models that underlie these chatbots are usually overfitted, so while they usually don't repeat their training data verbatim, they can.
> The actual process of token generation works precisely the same
I’d be wary of generalising it like that, it is like saying that all programs run on the same set of CPU instructions. NNs are function approximators, where the code is expressed in model weights rather than text, but that doesn’t make all functions the same.
You misunderstand. I mean that the model itself is doing exactly the same thing whether the output is a “hallucination” or happens to be fact. There isn’t even a theoretical way to distinguish between the two cases based only on the information encoded in the model.
> it is like saying that all programs run on the same set of CPU instructions
The Turing machine is the embodiment of all computer programs. And then you come across the halting problem. LLMs can probably generate all books in existence, but they can't apply judgement to them. Just like you need programmers to actually write the program and verify that it correctly solves the problem.
Natural languages are more flexible. There are no function libraries or paradigms to ease writing. And the problem space can't be specified and usually relies on shared context. Even if we could have snippets of prompts to guide text generation, the result is not that valuable.
YES. Humans can hallucinate, it's a deviation from what is observable reality.
All the stress people are feeling with GenAI comes from the over-anthropomorphisation of ... stats. Impressive syntactic ability is not equivalent to semantic capability.
The human definition of hallucination has to do with sensory experience, i.e. inputs. Saying that LLMs hallucinate means that we're ascribing them control over their inputs that they simply do not have -- by design.
Or, in other words, if a chatbot really were hallucinating, it would probably start giving unprompted responses.
> Humans can hallucinate, it's a deviation from what is observable reality.
What is "observable reality" then, for an LLM? Its training set?
LLMs are completely deterministic even if that's kind of weird to state because they output things in terms of probabilities. But if you simply took the highest probability next word, you'd always yield the exact same output given the exact same input. Randomness is intentionally injected to make them seem less robotic through the 'temperature' parameter. Why it's not just called the rng factor is beyond me.
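To make the greedy-vs-temperature distinction concrete, here is a minimal sketch. The three-word vocabulary and its scores are made up for illustration, and a real model computes scores over a far larger vocabulary, so treat this as a toy rather than how any particular chatbot is implemented.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(tokens, logits):
    """Always return the highest-scoring token: same input, same output."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]

def sampled_pick(tokens, logits, temperature, rng):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical vocabulary and scores, for illustration only.
TOKENS = ["cheese", "glue", "basil"]
LOGITS = [2.0, 1.5, 0.5]

# Greedy decoding never varies across runs.
greedy_runs = {greedy_pick(TOKENS, LOGITS) for _ in range(100)}

# Sampling varies from call to call because randomness is injected.
rng = random.Random(0)
sampled_runs = {sampled_pick(TOKENS, LOGITS, 1.0, rng) for _ in range(1000)}
```

Note that even the sampled path is repeatable once the random generator is seeded; the variation comes from the injected randomness, not from the scoring step.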
Maybe some models can be deterministic at a point in time, but train it for another epoch with slight parameter changes and a revised corpus and determinism goes out the proverbial (sliding) window real quick. This is not unwanted per se, and is the exact feedback loop that needs improving to better integrate new knowledge or revise knowledge artefacts incrementally/post-hoc.
If you train it then it's no longer the same model. If I have f(x) = x + 1 and change it to f(x) = x + 1 + 1/1e9, it would not mean that `f` is not deterministic. The issue would be in whatever interface I was exposing the f's at.
But current models must be retrained to incorporate new information. Or to attempt to fix undesirable behavior. So just freezing it forever does not seem feasible. And because there is no way to predict what has changed - one has to verify everything all over again.
Would you by extension argue that e.g. modern relational databases aren't deterministic in their query execution? Their query plans tend to be chosen based on statistics about the tables they're executed against, and not just the query itself.
I don't see how that's different than the LLM case, a lot of algorithms change as a function of the data they're processing.
At least in the case of BigQuery, I have fought with indeterminist-like issues many times over, especially when dealing with window functions that aggregate floats from different compute nodes, where rows cannot be further sorted on a unique column (i.e. the maximum sorting granularity for rows with a similar column of interest to compute a window function over has been reached).
Inconsistent results could be resolved by introducing additional out of data constraints (e.g. incremental hashes), but it can take quite a while to figure out at which exact point in a complex query these constraints need to be introduced.
Beyond that, some functions might still produce different results between runs, e.g. `ml.tf_idf` and `ml.multi_hot_encoder` that take some approximation liberties. Whether these functions are relational in the traditional sense is up for debate.
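The float aggregation issue above is easy to reproduce outside any database. This is a minimal sketch in plain Python, not BigQuery; the values are chosen by hand to make the effect visible, and `naive_sum` is a hypothetical helper standing in for whatever order the rows happen to arrive in. A hand-written loop is used instead of the built-in `sum`, since newer Python versions may apply a more accurate algorithm there.

```python
import math

def naive_sum(values):
    """Add floats left to right, the way a simple aggregator would."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same four "rows", hypothetical values picked to expose rounding.
values = [1e16, 1.0, -1e16, 1.0]

in_order = naive_sum(values)           # one arrival order
reordered = naive_sum(sorted(values))  # the same rows in another order

# math.fsum tracks exact partial sums, so its result is order-independent.
exact_a = math.fsum(values)
exact_b = math.fsum(sorted(values))
```

Floating-point addition is not associative, so two nodes that see the same rows in a different order can legitimately disagree unless a total ordering or an exact summation is imposed.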
I think what you're describing is that training/execution effects aren't predictable.
It is still "deterministic" in that training on exactly the same data and asking exactly the same questions should (unless someone manually adds randomness) lead to the same results.
Another example of the distinction might be a pseudo-random number generator: For any given seed, it is entirely deterministic, while at the same time being very deliberately hard to predict without actually running it to see what happens.
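A quick sketch of the seed point, using Python's standard `random` module. The helper name `draw` and the seeds are arbitrary choices for illustration.

```python
import random

def draw(seed, count=10):
    """Return `count` pseudo-random integers from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(count)]

first = draw(42)
second = draw(42)  # same seed: identical sequence
other = draw(43)   # different seed: a different sequence, in practice
```

The sequence for a given seed is fully determined, yet there is no shortcut to knowing it short of running the generator.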
True in the ideal case, but taken together (e.g. corpus retraining, temperature settings, slight input changes, initial descent parameters) unpredictability and indeterminism become difficult to distinguish. Especially in the distributed training case, training data may be propagated to different nodes in different order (e.g. when leaving it to a query optimiser), which makes any large-scale training operation difficult to reproduce exactly.
I think you’re missing a subtlety with Markov chains. It’s not about picking the next word with the highest probability, it’s about picking the next word using the next-word probability distribution. I played with them almost 20 years ago, and the difference in output was pretty obvious even with simple trigrams. The poetry produced was just better.
I can’t imagine any modern LLM not using a probability distribution function for the same reason.
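For anyone who hasn't played with these, a rough sketch of a word-level trigram chain follows. The tiny corpus is invented, and `build_trigram_model` and `generate` are hypothetical helper names. Passing no random generator picks the most frequent next word; passing one samples from the observed counts instead.

```python
import random
from collections import Counter, defaultdict

def build_trigram_model(words):
    """Map each pair of consecutive words to counts of the word that follows."""
    model = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        model[(a, b)][c] += 1
    return model

def generate(model, start, length, rng=None):
    """rng=None takes the most frequent next word; otherwise sample by count."""
    out = list(start)
    for _ in range(length):
        counts = model.get((out[-2], out[-1]))
        if not counts:
            break  # this word pair was never followed by anything
        if rng is None:
            nxt = counts.most_common(1)[0][0]
        else:
            choices, weights = zip(*counts.items())
            nxt = rng.choices(choices, weights=weights, k=1)[0]
        out.append(nxt)
    return " ".join(out)

# Invented toy corpus, for illustration only.
CORPUS = ("the rose is red the rose is sweet the rose is red "
          "the sky is blue the sky is wide").split()

model = build_trigram_model(CORPUS)
greedy_line = generate(model, ("the", "rose"), 6)
sampled_line = generate(model, ("the", "rose"), 6, random.Random(1))
```

With always-pick-the-top, the chain tends to fall into the most common loop of the corpus; sampling lets the rarer continuations show up now and then, which is where the variety comes from.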
You can ask an LLM to explain itself. It will give you a logical stepwise progression from your question to its answer. It will often contain a mistake, but the same is true for a human.
And if your LLM is giving you 100 different answers, then it has been configured to do so. Because instead, it could be configured to never vary at all. It could be 100% reproducible if so desired.
> it will give you a logical stepwise progression from your question to its answer.
No, it will generate a new hallucination that might be a logical stepwise progression from the question you asked to the answer it gave, but it is not due to any actual internal reasoning being done by the LLM.
We have no clear evidence the same isn't true for humans, and some that it might be. See experiments with split brain patients, that have shown the brain halves will readily explain how they made decisions they provably never made.
I think we have thousands of years of evidence that the same isn't true for humans
The fact that one human brain is composed of two half brains that each seem to be able to function fairly independently when separated doesn't seem like it changes that much.
It changes that we can prove it. In as much as you can do experiments where you "hide" actions from one half, mess with it (e.g. totally change "choices" supposedly made by the brain) and ask the other half to explain why "it" made the choice, and it will do so, unaware that the choice was made by the experimenters. It won't go "sorry, but I don't know" or similar.
So what? You have no way to know for sure if the human you ask the same question does either. The question that started this thread was related to verifiability. And I still think it is a spurious complaint, given that we have exactly the same limitations when dealing with any human agent.
We have no evidence that humans can even know, much less that they generally don't. And we do have evidence that there are situations where the brain will readily construct an explanation after the fact which can't possibly be true (experiments with split brain patients where researchers tricked one brain half into thinking the other half, and so the brain as a whole, had made a decision while the action was taken by the researchers, and made it explain how it had made the decision).
There is no basis for claiming to know that people don't usually make up explanations like this other than when e.g. breaking the process apart and writing it down step by step during. But even then individual decisions are "suspect".
The human may also be... wrong. Saying things that "feel right", except this one time they're factually wrong.
A human can explain their reasoning step by step, if the original reasoning was a System 2, formal, step-by-step process in the first place; otherwise, they're just making shit up after the fact, which feels right, but may or may not be correct (see also the previous paragraph).
Note that it's very rare anyone has an interaction with a human that uses this mode of reasoning - it's unnecessary except in special circumstances, usually math-heavy.
Nothing of the sort. I'm trying to understand why anyone cares about formal verifiability in this context, since it's not something we rely on when asking humans to answer questions for us. We evaluate any answer we get without such mathematical proofs, and instead simply judge the answer we're given on its fit and usefulness.
Anyone who doubts the usefulness of even these nascent LLMs is fooling themselves. The proof is in the pudding, they already do a great job, even with all their obvious limitations.
> since it's not something we rely on when asking humans to answer questions for us
Because we interact with computers (which includes LLMs) differently than we do with humans and we hold them to higher standards
Ironically, Google played a large part in this, delivering high quality results to us with ease for many years. At one point Google was the standard for finding high quality information
Shrug. Seems like clutching pearls to me. People seem to have an emotional reaction and obsess on the aspects that differentiate human cognition from LLMs. But that is a lot of wasted energy.
To the extent that anyone avoids employing these technologies, they will be at a disadvantage to those who do; because these tools just work. Already. Today.
There isn't even room for debate on that issue. Again, the proof is in the pudding.
These systems are already successfully, usefully, and correctly answering millions of questions a day. They have failure modes where they produce substandard or even flat out incorrect answers too. They're far from perfect, but they're still incredible tools, even without waiting for the improvements that are sure to come.
The reason verifiability is important is because humans can be incentivized to be truthful and factual. We know we lie, but we also know we can produce verifiable information, and we prefer this to lies, so when it matters, we make the cost of lying high enough that we can reasonably expect that people will not try to deceive (for example by committing perjury, or fabricating research data). We know it still happens, but it’s not widespread and we can adjust the rules, definitions and cost to adapt.
An LLM does not have such real world limitations. It will hallucinate nonstop and then create layers of gaslighting explanations to its hallucinations. The problem is that you absolutely must be a domain expert at the LLM’s topic or always go find the facts elsewhere to verify (then why use an LLM?).
So a company like Google using an LLM is not providing information, it’s doing the opposite. It is making it more difficult and time consuming to find information. But it is then hiding its responsibility behind the model. “We didn’t present bad info, our model did, we’re sorry it told you to turn your recipe into poison…models amirite?”
A human doing that could likely face some consequences.
The problem of other minds is no reason to throw everything out the window. Humans are capable of being conscious of their reasoning processes; token-at-a-time predictive text models wired up as chatbots aren't capable of it. Your choice is between a possibly-mistaken, possibly-lying human, and a 100%-definitely incapable computer program.
You don't know either "for sure", but you don't know that the external world exists "for sure" either. It's an insight-free observation, and shouldn't be the focus of anyone's decision-making.
When you ask an LLM to carefully reason step by step before arriving at its answer, that seems pretty much the same as conscious reasoning to me. Of course, when asked to justify a gut reaction after the fact, it will just come up with something that sounds plausible (and may or may not be true). Just like humans do.
> It will often contain a mistake...but the same is true for a human.
If this were true, textbooks could not work. Given a question, we don't consult random humans but experts in their field. If I have a question on algorithms, I might check a text by Knuth, I wouldn't randomly ask on the street.
> It could be 100% reproducible if so desired.
Reproducible does not mean better. For harder questions, it's often best to generate multiple answers at a higher temperature than to greedily pick the highest probability tokens.
And that human explanation is, with disturbing frequency, likely a complete fabrication after the fact. See experiments on split brain patients.
With respect to repeatability, yes, LLMs are currently frozen in time. That is not an inherent limitation, but it is one that is practical for a lot of uses and a problem for some.
Isn't it actually known that every time a human brain recalls a piece of memory the memory gets slightly changed?
If the answer has any length at all, I imagine the answer can vary every single time the person answers, unless they prepared for it, memorized it word by word.
Right, that makes sense, our brain will do a quick black box judgment (some may call it system 1), and then rational process only works to justify that or explain the black box, assuming that black box is always correct (depending on the person and how much they trust their black box or system 1).
So system 2 is "hallucinating" the best justification for system 1.
And usually system 2 will do it only when it's required to justify it for anyone else.
Yes, "making up an answer" will look different from "quoting pretrained knowledge" because eg the model might've decided you were asking a creative writing question.
There are various reasons an LLM might have incorrect "beliefs" - the input text was false, training doesn't try to preserve true beliefs, quantization certainly doesn't. So it can't be perfectly addressed, but some things leading to it seem like they can be found.
This seems like it's true since LLMs are a finite size, but in Google's case it has a "truth oracle" (the websites it's quoting)… the problem is it's a bad oracle.
Sure. Your claim has some truth, but is far too strong.
The articles you cited upthread do not support the notion that models consistently activate differently when generating true facts vs false facts.
It is true that models can capture some notion of reliability based on patterns in their training data. For a concrete example, it is entirely plausible that a model can capture the sense that data trained from Reddit is less truthy than data trained from Wikipedia, or that training data with poor grammar and vocabulary is less reliable than more sophisticated inputs.
But this process is not a guarantee, and does not change the fact that LLMs have no mechanism to track the provenance of information. It's probably a fruitful direction of research for reducing the probability of emitting false facts, but there will always be an infinite number of marginal cases for which the activations for true facts are indistinguishable from those for false facts.
Models simply do not track the provenance which is required to make this distinction in every case.
I agree with this; that's why I was careful not to use the examples you mentioned. Quoting incorrect training knowledge would be an unavoidable issue if your probe can only say "it's quoting something", and as far as I know it can't do better than that.
But I have seen issues with prompts where a creative-writing prompt and just asking a question look similar, and in that case it could help to know which one it thinks it's doing.
Gemini itself has a funny verification button where it more or less Googles every sentence the model writes and tells you if it seems like it made it up or not.
Ask a human what the meaning of life is and how it impacts their day to day interactions. I know I can tell you an answer but I couldn’t tell you steps about how I got it.
And if you asked it to me twice I’d definitely give different answers unless you told me to give the same answer. In part I’d give a different answer because if someone asks me the same question twice I assume the first answer wasn’t sufficient.
No one is talking about existential questions about the meaning of life.
We are talking about basic things like whether or not to eat rocks or put glue in recipes. We can answer those questions with a chain of logic and repeatability.
And those specific questions get repeatable answers on ChatGPT for me.
Here are two answers I got which seem as close as you’d expect any human to give:
“No, people should not eat rocks. Rocks are not digestible and can cause serious harm to the digestive system, including blockages and damage to internal organs. Eating rocks can lead to severe health problems and should be avoided.”
“No, people should not eat rocks. Rocks are not digestible and can cause serious harm to the digestive system, including blockages, abrasions, and potential poisoning from harmful minerals or substances. It's important to consume only food items that are safe and meant for human consumption.”
Beliefs derived from the output of LLMs that are ‘right most of the time’ pass one facially plausible precisification of ‘justification’ in that they are generated by a reliable belief-generation mechanism (see e.g. Goldman). To block this point one must engage with the post-Gettier literature at least to some extent. There is a clear difference between beliefs induced by reading the outputs of LLMs and those induced by the contents of a reference work, but it is inessential to the point and arguably muddies the water to present the distinction as difference in status as knowledge or non-knowledge.
Upon a second reading, this is an excellent point.
For the sake of clarity, let's remove LLMs from the equation and posit the existence of Encyclopedia Eric. Ask Eric any question, and he will happily research it and come back to you with the answer. But he can sometimes be sloppy in his research, and he gives the correct answer only X percent of the time.
Furthermore, Encyclopedia Eric steadfastly refuses to cite his own sources or explain his reasoning in any way. He simply states his answer.
Can Eric be a source of knowledge? It seems evident that the answer is no, for low values of X. For higher values of X, the question becomes murkier.
The temptation at this point is to give up on defining knowledge at all and fall back on a sort of Bayesian epistemology where everything is ultimately a matter of probabilities.
Yet there does seem to be a distinct practical difference between a knowledge source that is "traversable" (like a standard encyclopedia) vs a knowledge source that is not (like Eric.) Is that part of the definition of knowledge? You're right, that is at least a Gettier adjacent question.
I think we can all agree that for current LLMs the value of X is definitely too small to count as knowledge.
I think that this might be a nice way of getting round my objection, but there is one worry, which is that X is relative to a distribution on the questions we ask when we aren’t dealing with Encyclopedia Eric but with an LLM. I don’t actually use LLMs very much myself, partly out of arrogance and Luddite tendencies. But I suspect that the value of X for some sorts of questions (simple quiz questions, maybe?) and some LLMs (maybe not Google’s) will be high enough to end up in the murky case.
Of course, both you and I agree that there /is/ clearly a difference. I can see the attraction of appealing to the intuitive or pretheoretic notion of knowledge, since it’s a fairly straightforward way of stating the difference and it’s not obvious how else one might put it (I suppose ‘LLMs don’t explicitly think through the facts stored when deciding what to say’ is one way of putting it.)
I remember some time ago rather sleepily watching John Hawthorne talk about conditionals; my sole memory was of his banging on about ‘the little logician in the brain’ (I think the point was something like: some conditionals seem [in]felicitous in virtue of form because the little logician in the brain is reading them; others seem infelicitous because we examine them more closely and look e.g. at the referents involved, in which case e.g. Gricean considerations apply). One difference at least in the case of LLMs that makes sense to me is that there is no ‘little logician’ in LLMs.
> X is relative to a distribution on the questions we ask when we aren’t dealing with Encyclopedia Eric but with an LLM.
Assuming I understand what you mean here correctly, this should be the case for both LLMs and Encyclopedia Eric - there are topics Eric knows by heart (or thinks he knows); there are specific phrases seared into his mind through sheer exposure during his life prior to becoming a living Encyclopedia. There are words he's used to, and exact synonyms he barely recognizes. All that means your chance of getting a correct answer to your query depends, in a complex way unknown to you, on how you state it.
I think the thought experiment can be set up both ways. In one case Eric has a fixed probability of getting /any/ query right. (This might leave boosting open so we might want to gerrymander repeated queries out.) In the other this is relative to the distribution of queries.
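On the boosting remark: under the fixed-probability reading, asking Eric the same question several times and taking the majority answer changes the effective X. A small sketch, assuming each answer is independently correct with probability `x` and that every wrong answer is the same wrong answer (the worst case for a majority vote); `majority_vote_accuracy` is a name made up here.

```python
from math import comb

def majority_vote_accuracy(x, n):
    """Chance that more than half of n independent answers are correct,
    when each is correct with probability x (n is assumed to be odd)."""
    return sum(
        comb(n, k) * x**k * (1 - x) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

single = majority_vote_accuracy(0.7, 1)
three = majority_vote_accuracy(0.7, 3)
eleven = majority_vote_accuracy(0.7, 11)
worse = majority_vote_accuracy(0.4, 3)  # below one half, voting hurts
```

Repetition only helps when X is above one half and the errors really are independent, which is exactly the part that is hard to grant for repeated queries to the same source.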
A relevant point to this is the notion of "System-1" vs "System-2" thinking. Somewhat dubious when applied to actual human psychology but I think a valid metaphor for how LLMs work: they are only capable of System-1 thinking; a single forward pass through the weights of intuition
In my actual life, I don't trust my own System-1 thoughts: for anything important, I'm always going to engage System-2. And LLMs don't have a System-2.
(I also agree that when dealing with LLMs the value of X is not a single value but a highly complex space depending on the nature of the question and the training data of the model. In my mind it does not change the epistemological equation, it just means that even the value of X itself is harder to "know", so this ambiguity can only ever make LLMs a less viable source of knowledge.)
I am not sure that the ambiguity has to work that way. The suggestion I am making is that if we fix the distribution on questions and the training data, we might (a) know the value of X in this specific case, (b) be able to ensure that it is fairly high.
I’d say this is the murky case because on the fixed training data and query distribution X ≈ 1 and we know that even though we don’t know the value of X on other training data and other query distributions. I think that might be where the disagreement lies.
To be clear he is saying that the LLM is not capable of justified true belief, not commenting on people who believe LLM output. I don’t think your comment is relevant here.
I do think trusting an LLM is less firm ground for knowledge than other ways of learning.
Say I have a model that I know is 98% accurate. And it tells me a fact.
I am now justified in adjusting my priors and weighting the fact quite heavily at .98. But that’s as far as I can get.
If I learned a fact from an online anonymously edited encyclopedia, I might also weight that at 0.98 to start with. But that’s a strictly better case because I can dig more. I can look up the cited sources, look at the edit history, or message the author. I can use that as an entry point to end up with significantly more than 98% conviction.
That’s a pretty important difference with respect to knowledge. It isn’t just about accuracy percentage.
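A back-of-the-envelope version of the "dig more" step, as a sketch. The 0.95 and 0.05 figures are invented for illustration: they assume a cited source that confirms a true claim 95% of the time and a false one 5% of the time, independently of where the claim first came from.

```python
def update(prior, p_confirm_if_true, p_confirm_if_false):
    """Bayes' rule: probability the fact is true after a source confirms it."""
    numerator = prior * p_confirm_if_true
    return numerator / (numerator + (1 - prior) * p_confirm_if_false)

start = 0.98                                 # weight given to the initial claim
after_one = update(start, 0.95, 0.05)        # one cited source checks out
after_two = update(after_one, 0.95, 0.05)    # a second, independent one too
```

With an opaque model there is nothing further to feed into `update`, so the estimate stays at the starting value; with a traversable source each independent check moves it closer to 1.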
That reading of the comment did occur to me, but I think neither dictionaries nor LLMs are capable of belief, and the comment was about the status of beliefs derived from them.
Okay we are speaking past each other, and you are still misunderstanding the subtlety of the comment:
A dictionary or a reputable Wikipedia entry or whatever is ultimately full of human-edited text where, presuming good faith, the text is written according to that human's rational understanding, and humans are capable of justified true belief. This is not the case at all with an LLM; the text is entirely generated by an entity which is not capable of having justified true beliefs in the same way that humans and rats have justified true beliefs. That is why text from an LLM is more suspect than text from a dictionary.
I think the parent comment ultimately concerned the reliability of /beliefs derived from text in reference works v text output by LLMs/, and that seems to be what the replies by the commenter concern. If the point is merely that the text output by LLMs does not really reflect belief but the text in a dictionary reflects belief (of the person writing it), it is well-taken. Since it is fairly obvious and I think the original comment really was about the first question, I address the first rather than second question.
The point you make might be regarded as an argument about the first question. In each case, the ‘chain of custody’ (as the parent comment put it) is compared and some condition is proposed. The condition explicitly considered in the first question was reliability; it was suggested that reliability is not enough, because it isn’t justification (which we can understand pretheoretically, ignoring the post-Gettier literature). My point was that we can’t circumvent the post-Gettier literature because at least one seemingly plausible view of justification is just reliability, and so that needs to be rejected Gettier-style (see e.g. BonJour on clairvoyance). The condition one might read into your point here is something like: if in the ‘chain of custody’ some text is generated by something that is incapable of belief, the text at the end of the chain loses some sort of epistemic virtue (for example, beliefs acquired on reading it may not amount to knowledge). Thus,
> text from an LLM is more suspect than text from a dictionary.
I am not sure that this is right. If I have a computer generate a proof of a proposition, I know the proposition thereby proved, even though ‘the text is entirely generated by an entity which is not capable of having justified true beliefs’ (or, arguably, beliefs at all). Or, even more prosaically, if I give a computer a list of capital cities, and then write a simple program to take the name of a country and output e.g. ‘[t]he capital of France is Paris’, the computer generates the text and is incapable of belief, but, in many circumstances, it is plausible to think that one thereby comes to know the fact output.
I don’t think that that is a reductio of the point about LLMs, because the output of LLMs is different from the output of, for example, an algorithm that searches for a formally verified proof, and the mechanisms by which it is generated also are.
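For concreteness, the capital-city program described above could be as small as this sketch. The table contents and the name `capital_sentence` are just illustrative.

```python
# A hand-entered table: the "knowledge" lives entirely in what the human typed in.
CAPITALS = {
    "France": "Paris",
    "Japan": "Tokyo",
    "Mongolia": "Ulaanbaatar",
}


def capital_sentence(country: str) -> str:
    """Generate a sentence from the table; raises KeyError for unknown countries."""
    return f"The capital of {country} is {CAPITALS[country]}."


print(capital_sentence("France"))
```

The program believes nothing, yet its output is only as good as the list it was given, and it fails loudly on anything outside that list rather than producing a plausible-sounding guess.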
+1, AIs don’t really “understand” anything, like human anatomy and deformity rates and societal norms when tasked with image generation, so you get hands with weird numbers of digits and other topological errors which even a very unintelligent human wouldn’t make. AI doesn’t understand and build knowledge in interconnected layers, it can’t “think” and link things back to first principles, it can’t really reason about things, and it’s not going to get significantly better until we start approaching it differently. This generation of AI might be useful for some things, but it’s being applied wayyyy too broadly too quickly and I see a big pullback coming.
“Expert systems” are not a new thing, and they’re still not all that useful except in some very small niches. Phone trees that can keyword match FAQs are useful for a lot of low-effort callers who put zero effort into solving their problem on their own first, but frustrating for callers who are only calling because there’s literally no other way for them to resolve their issue. Unfortunately for consumers, the cost is very low for businesses to make everyone wade through junky phone tree systems and penalize anyone who tries to mash zero to talk to a real person, even if that’s the only thing which will actually help them.
The Gettier problem is an indication that the definition has (at least) a bug.
There are other formulations of "knowledge" which do not involve justification, see e.g. Gnosticism.
Of course, for a publicly available frequently used service, the "JTB" formulation of knowledge is probably the only one we can practically use, but this kind of indicates that the whole idea of search engines, knowledge systems, or expert systems is flawed due to the Gettier problem.
> So 100% accurate can't be the goal. Obviously the goal is to get the responses to be less obviously stupid.
I'm not sure I agree. I think you're right that 100% accuracy is potentially infeasible as a realistic aim, but I think the question is how accurate something needs to be in order to be a useful proposition for search.
AI that's as knowledgeable as I am is a good achievement and helpful for a lot of use cases, but if I'm searching "What's the capital of Mongolia", someone with averageish knowledge taking a punt with "Maybe Mongoliana City?" is not helpful at all. If I can't trust AI responses to a high degree, I'd much rather just have normal search results showing me other resources I can trust.
Google's bar for justifying adding AI to their search proposition isn't "be better than asking someone on the street", it's "be better than searching google without any AI results"
The problem is that in all the shared examples, Google AI search does not respond with a "Maybe xyz?" like you did. It always answers with high confidence and can't seem to navigate any gray area where there are multiple differing opinions or opposing sources of truth.
I should have been more clear. I am referring to Google's goals. Humanity as an abstract concept or you personally may have other goals, but, well, perhaps I am a cynic, but I think Google's goals are rather more monetary and less idealistic than they would represent. They don't want or need (as other replies correctly point out) the AI to always be correct and accurate. Along with general cynicism with regard to any conceivable AI's ability to do that, it is also fair to point out that the web itself doesn't have that ability either. We can't even find an objective yardstick to measure an AI with that way. Google's goal is to make the bad press go away so people use the AI more so that in the indefinite but ideally near future this AI can be monetized somehow to justify the interstellar valuations being ascribed to this technology in the "if it isn't happening in two fiscal quarters or less it might as well not exist" US/Western financial markets.
I think the biggest (and most important) difference from humans is that a human can tell you "I have no idea, this isn't my field" or "I'm just guessing here", but LLMs will confidently make super stupid statements.
AI doesn't know what it knows.
if you only score the questions where the human actually provides an answer, then the human score would probably be in the high 90s
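A rough sketch of that scoring idea: count accuracy only over the questions that were actually answered, and report coverage separately. The quiz data here is invented purely for illustration, and `None` stands in for "I have no idea".

```python
from typing import Optional


def score(answers: list[Optional[str]], truth: list[str]) -> tuple[float, float]:
    """Return (coverage, accuracy on answered questions). None means abstained."""
    answered = [(a, t) for a, t in zip(answers, truth) if a is not None]
    coverage = len(answered) / len(truth)
    accuracy = sum(a == t for a, t in answered) / len(answered) if answered else 0.0
    return coverage, accuracy


truth = ["Paris", "Ulaanbaatar", "Tokyo", "Canberra"]
human = ["Paris", None, "Tokyo", None]                   # abstains when unsure
model = ["Paris", "Mongoliana City", "Tokyo", "Sydney"]  # always answers

print(score(human, truth))  # half the questions answered, all of those correct
print(score(model, truth))  # every question answered, half of those correct
```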
Yes, which is why the ability to sift accurate and authoritative sources from spam, propaganda, and intentionally deceptive garbage, like advertising, and present those high-quality results to the user for review and consideration, is more important than any attempt to have an AI serve a single right answer. Google, unfortunately, abandoned this problem some time ago and is now left to serve up nonsense from the melange of low-quality noise they incentivized in pursuit of profits. If they had, instead, remained focused on the former problem, it’s actually conceivable to have an LLM work more successfully from this base of knowledge.
The right answer is no rocks. Some mentally ill person could type that in and get "eat 1000 rocks" and then die from eating rocks, and that would be Google's fault. It's not funny. I have no doubt right now there are at least 50 youtube videos being made testing different glue's effectiveness holding cheese on a pizza. And some of those idiots are going to taste-test it, too. And then people will try it at home, some stupid kids will get sick - I have no doubt.
It was a bit premature to label LLMs as "Intelligence", it's a cool parlor trick based on a shitload of power consumption and 3D graphics cards, but it's not intelligent and it probably shouldn't be telling real (stupid) humans answers that it can't verify are correct.
Google is not responsible, and should never be responsible, for protecting mentally ill people from themselves. It would be at a severe detriment to the rest of us if they took on that responsibility. Society should set the bar to “a reasonable person”, otherwise you’re doomed, with no possible alternative to a nanny state.
It's not only mentally ill people that are at risk, but anyone that doesn't know it's not a good idea to put "non-toxic" glue in pizza cheese. That includes a lot of not-mentally-ill but just plain dumb people. Google didn't need to tell people that glue+pizza is a reasonable thing to do, or even just a thing. It sure did frame it like it was a legitimate response. And Google didn't even have to reply with this or anything else, they could have just supplied the links to other sites where it had been suggested, but no - they have to make a show of force with their premature foray into AI, and have it tell real people all kinds of false, and possibly dangerous things. That's an unforced error by Google that they could end up being prosecuted for.
That's what parents and mentors are for. We as a society should not have to break our backs bending over backwards to stop people from doing stupid things. People can make their own decisions and be responsible for them. If they lack proper guidance, well that just sucks.
It's nice that you have a parent that cares about how you are raised, or "a mentor". Do you realize that not everyone has that?
>We as a society should not have to break our backs bending over backwards to stop people from doing stupid things.
Sure, let's take down all the speed limits and see what happens. Let's tell people it's an option to wear seatbelts and see what happens. Let's deregulate everything and hope for the best. Sounds reasonable?
>People can make their own decisions and be responsible for them. If they lack proper guidance, well that just sucks.
There's this thing called "human nature". I think you should do some reading about it.
They absolutely should bear responsibility for authoritatively telling people wrong and unsafe cooking temperatures for meats or spreading lies about people on their results page, in their own voice. "It's just a random text generator!" isn't a protection against, say, gross libel.
They pass the threshold of criminal negligence when they keep up a system that they know will actively mislead countless people in subtly dangerous ways. The problem being hard or the tech being fundamentally unsound doesn't wave away their culpability - if anything it destroys any reason why they should be given some leeway.
It's funny how people on reddit think that these LLMs will somehow become AGI in the next year, or when openAI releases gpt 5.
The reality is though, that there is no known path currently to true AGI system and the research needs to be done. No one knows how to build this kind of system yet. LLMs are nice for things like roleplaying, helping with code stuff etc. but they are far from all that the marketers hype them to be.
It's going to end up being another bubble, and it will eventually burst. Most people, investors included, don't really know why LLMs aren't going to be able to reason about and solve all of humanity's problems, so they go on believing it. It's just a matter of time before the money runs out, or gets shifted towards some new, shiny thing.
Sand is considered a "rock". If you live in e.g. the USA or the EU you've definitely inadvertently eaten rocks from food produce that's regulated and considered perfectly safe to eat.
It's impossible to completely eliminate such trace contaminants from produce.
Pedantic? Yes, but you also can't expect a machine to confidently give you absolutes in response to questions that don't even warrant them, or to distinguish them from questions like "do mammals lay eggs?".
Salt is rock, and most everyone eats plenty of that.
The LLM is clearly being dumb, but the underlying science of the question is actually interesting. Iron is another interesting one. Run a magnet through iron-fortified cereal.
> Or to put it another way, I think Google should have a way of saying "yes, we know this result is wrong, but we're leaving it in because it's funny."
These specific results aren't the problem, though. They're illustrations of a larger problem -- if a single satirical article or Reddit comment can fool the model into saying "eating rocks is good for you" or "put glue in your pizza sauce", there are certain to be many more subtle inaccuracies (or deliberate untruths) which their model has picked up from user-generated content which it'll regurgitate given the right prompt.
100% accuracy should be the goal, but the way to achieve that isn't going to come from teaching an AI to construct a definitive-sounding answer to 100% of questions. Teaching AI how to respond with "I don't know" and give confidence scores is the path to nearing 100% accuracy.
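As a toy illustration of that policy, here is a sketch that answers only when the top candidate clears a confidence threshold. It assumes you already have a trustworthy confidence score per candidate, which is the hard part in practice. The `respond` function, the 0.9 threshold, and the example scores are all hypothetical.

```python
def respond(candidates: dict[str, float], threshold: float = 0.9) -> str:
    """Pick the highest-confidence candidate, or abstain if it is below threshold.

    `candidates` maps each possible answer to an assumed confidence in [0, 1].
    """
    if not candidates:
        return "I don't know"
    answer, confidence = max(candidates.items(), key=lambda item: item[1])
    if confidence < threshold:
        return f"I don't know (best guess: {answer}, confidence {confidence:.2f})"
    return f"{answer} (confidence {confidence:.2f})"


print(respond({"Ulaanbaatar": 0.97, "Mongoliana City": 0.03}))
print(respond({"add glue": 0.40, "use less sauce": 0.35, "let it cool": 0.25}))
```

The second call abstains and shows its best guess with the score attached, which is the "Maybe xyz?" behavior the confident one-line answers lack.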
The thing is that polyvinyl acetate is what's in Elmer's glue, and it is also used in chewing gum and chocolate, and to make the surface of apples more shiny, so you're probably eating glue; we just don't like to call it that. Emulsifier is a better description.
Most likely this has nothing to do with "recipes being safe" being in the model
It seems the glue thing comes from a reddit shitpost from some time ago. There's a screenshot going around on twitter about it[0] (11 years in the screenshot, but no idea when it was taken).
It specifically mentions "any glue will work as long as it is non-toxic" so best guess is that's why google output that
Nor am I being treated as an omniscient magic black box of knowledge.
Hilariously though, polyvinyl acetate, the main ingredient in Elmer's glue, is used as a binding agent to keep emulsions from separating into oil and water, and is used in chewing gum, and covers citrus fruits, sweets, chocolate, and apples in a glossy finish, among other food things.
It's actually not the dumbest idea I've heard from a real person. So no surprise it might be suggested by an AI that was trained on data from real people.
It wasn't an idea, though. It was a joke someone made on Reddit. If an AI can't tell the difference, it shouldn't be responsible for posting answers as authoritative.
...which is sometimes incredibly hard and it might not be possible because it's such a niche topic or people might be just wrong. Just thinking about Urban Myths, Conspiracy theories etc. where even without a niche factor things may sound unbelievable but actually disproving can be effort that is out of proportion
I don't know about bikewrench but AskHistorians is a useful source of knowledge because it is strongly moderated and curated. It's not just a bunch of random assholes spouting off on topics. Top level replies are unceremoniously removed if they lack sourcing or make unsourced/unsubstantiated claims. Top level posters also try to self-correct by clearly indicating when they're making claims of fact that are disputed or have unclear evidence.
OpenAI, Google, and the other LLMs-are-smart boosters seem to think because the Internet is large it must be smart. They're applying the infinite monkey theorem[0] incorrectly.
In general, I have trouble trusting environments that can be described as "strongly moderated and curated".
I find that environments that rely on censorship tend to foster dogma, rather than knowledge and real understanding of the topics at hand. They give an illusion of quality and trustworthiness. It's something we see happen at this site to some extent, for example.
I'd rather see ideas and information being freely expressed, and if necessary, pitted against one another, with me being the one to judge for myself the ideas/claims/positions/arguments/perspectives/etc. that are being expressed.
Your comment is orthogonal to the quality of the AskHistorians subreddit. AskHistorians' moderation tends towards curating posts following the rules rather than content. There are often competing narratives on questions where there's academic dispute of facts.
Regardless of whether you think that's the right approach to moderation, top level posts are sourced and can at least be examined. It's a marked improvement over the unsourced musings of random Redditors.
It is certainly popular here to run your web searches against reddit. Every post about how Google Search sucks ends up with comments on appending "site:reddit.com" to the search terms.
Yes, and we as humans filter through the noise. But you cannot rely upon it as a source for anything truthful without that filtering. Reddit is very, very, very context dependent and full of irony, sarcasm, jokes, memes, and confidently written incorrect information. People love to upvote something funny or culturally relevant at a given time, not because it’s true or useful but because it’s fun to do.
I wonder what impact all of those erase tools are having on LLM training. The ones that replaced all of these highly upvoted comments with nonsense.
I'm pretty sure those "erase" tools are just for the front-end and reddit keeps the original stuff in the back-end. And surely the deal Google made was for the back-end source data, or probably the data that includes the original and the edit.
The Reddit post in question was definitely a joke. This is the post in response to a user asking how to make their cheese not slide off the slice:
> To get the cheese to stick I recommend mixing about 1/8 cup of Elmer's glue in with the sauce. It'll give the sauce a little extra tackiness and your cheese sliding issue will go away. It'll also add a little unique flavor. I like Elmer's school glue, but any glue will work as long as it's non-toxic.
This matches the AI's response of suggesting 1/8 a cup of glue for additional "tackiness."
> No, not even you, dear person reading this. You are wrong about some basic things too. It'll vary from person to person what those are, but it is guaranteed there's something.
The difference is that I'm not put on the interface of a product facing hundreds of millions of users every day to feed those users incorrect information.
If everyone can be wrong, then might the assertion that all are wrong be committing this same fallacy? "Can" is not destiny; perhaps you have met people who are fully right about the basics but you just didn't sufficiently grok their correctness.
So you think. Seems like hubris to believe you're not though. I'm blind to what I'm blind to, and while I'd like to think I'm never wrong, the reality is that I often am. The biggest personal growth for me was in not needing to be right.
I disagree. Not only for myself but for the vast majority of humankind.
LLMs are just a statistical model. It can claim that it’s normal for pigs to have wings and fly to the moon if it’s in the training data. No human, free of a mental/cognitive disorder, will be that wrong.
Why would "pigs have wings and fly to the moon" be in the training data as a data source marked as serious? We can no-true-Scotsman both sides here. No true human free of mental/cognitive disorder would be that wrong, but neither would an LLM with properly annotated training data.
“Properly annotated data” has nothing to do with the original context.
We were discussing the current state of affairs. Of course I am not stupid enough to think what I said in my original reply if we are talking about an LLM trained on “perfect data”.
Your claim was that "LLMs will claim that it’s normal for pigs to have wings and fly to the moon" and that humans free of mental/cognitive disorder would not. Which is to say, humans with a mental/cognitive disorder might claim that it’s normal for pigs to have wings and fly to the moon. If we're carving out such a section for humans to be so wrong, then we should also carve out a section for LLMs to be so wrong.
Fwiw, ChatGPT-4o can write a lengthy essay as to how pigs don't have wings and couldn't fly to the moon even if they did, but if we're more interested in them being nothing more than just a statistical model and that those mere statistics can't possibly result in something that looks like reasoning then we've got to disregard the fact that it "knows" that pigs don't have wings.
Of course pigs having wings is a stand in for whatever else wrong thing that LLMs might "believe", so I agree it's very important for everyone that uses an LLM to understand their limitations especially around hallucinations, but where there are books written about how flat the Earth is and are in the training data, the current state of affairs is that ChatGPT and Gemini both know it's not flat. That Google search AI results, which is a different model, is telling users to use glue on pizza, or to drink urine only serves to say that Google Search's bot using Reddit as unannotated training data is as representative of LLMs as a human with a mental/cognitive disorder.
Well the whole conversation started by me saying that I think even when I am wrong I am not “put glue in your pizza” wrong. And by I, I did mean the average human. Which is unannotated data from Reddit.
This is statistics though. Edge cases are nothing new and risk management concepts have evolved around fat tails and anomalies for decades. Therefore the statement is as naive as writing a trading agent that is 100% correct.
In my opinion, this error shows a lack of understanding of responsible scaling architectures. If this were their first screw-up I wouldn't mind, but Google just showed us a group of diverse Nazis. If there is a need for consumer protection for online services, it is exactly stuff like this. ISO 42001 lays out in great detail that AI systems need to be tested before they are rolled out to the public. The lack of understanding of AI risk management is apparent.
> No, not even you, dear person reading this. You are wrong about some basic things too. It'll vary from person to person what those are, but it is guaranteed there's something.
Kahneman has a fantastic book on this called Noise. It’s all about noise in human decision making and how to counteract it.
My favorite example was how even the same expert evaluating the same fingerprints on different occasions (long enough to forget) will find different results.
The thing is that truth/reality is not a thing that is resolvable. Not even the scientific method has this sort of expectation!
You can imagine getting close to those percentages, with regards to consensus opinion. That's just a question of educating people to respond appropriately.
No. Whether a person should eat a certain number of small rocks each day is not a matter of opinion, it's not a deep philosophical problem and it's not a question whose truth is not resolvable. You should not be eating rocks.
> Lithophagia is a subset of geophagia and is a habit of eating pebbles or rocks. In the setting of famine and poverty, consuming earth matter may serve as an appetite suppressant or filler. Geophagia has also been recorded in patients with anorexia nervosa. However, this behavior is usually associated with pregnancy and iron deficiency. It is also linked to mental health conditions, including obsessive-compulsive disorder.
Would you deny a starving person information on an appetite suppressant?
> Aside from the capuchin monkeys, other animals have also been observed to demonstrate geophagy (“soil-eating”), including but not limited to: rodents, birds, elephants, pacas and other species of primates.[1]
> Researchers found that the majority of geophagy cases involve the ingestion of clay-based soil, suggesting that the binding properties of clay help absorb toxins.
^^ The point being that even your edge case example is not unambiguously correct.
My point is that I object to the idea that a result can be 100% right! Even in the case of eating rocks, it seems there are times that it can be beneficial.
To think '100% correct' is achievable is to misunderstand the nature of reality.
You don’t need to be super ai complete - GPT4 is perfectly willing and able to tell you not to eat rocks and not to mix wood glue into pizza sauce. This is a fuckup caused by not dogfooding, and by focusing on alignment for political correctness at the expense of all else. And also by wasting a ton of engineering effort on unnecessary bullshit and spreading it too thin.
It's debatable whether Google has truly lost the plot because of the "AI wars", but the moment the statement "Bing returns more sensible results than you" becomes verifiably true, it's... cause for concern?
The approach that Google appears to have taken, which is to assume that the top-ranked part of its current search index is a sensible knowledge base, may have been true some years ago, but definitely isn't now: for whatever reasons, it's now 33% spam, 33% clickbait/propaganda, with the rest being equally divided between what could be called "truths" and miscellaneous detritus.
To me, it seems that returning to the concept that search results should at least reflect a broad consensus of what is true is a necessary first step for Google. As part of that, learning to flag obvious trolling, clickbait and bad-faith content is paramount. And then, maybe then, they can start touting their LLM benefits. But until the realities of the Internet are taken into account (i.e.: it's 80% spam!), any "we offer automated answers!" play is doomed.
Not only is the current internet 80% spam, it's rapidly approaching 99% thanks in large part to LLMs. At this point I would be shocked if Google had a solid plan for how to handle this going forward as the problem space gets more difficult.
that's the part that scares me. I railed on someone's comment the other day about "indexes will come back into fashion" but the more I think about how much garbage has increased in just the past 2 to 3 years, I think I was wrong. Indexes and forums may be the only way to have a sane net where you can find things. Perhaps communities linking together in a ring-like format, a "web ring" of sorts.
What I've been wanting to see for a while now is a social-network based search engine:
* No pages are indexed automatically. The only indexed pages are pages that users say are worth indexing. Probably have a browser add-on for a one button click that people can use.
* You can friend/follow others
* Your search results are a combination of your own indexed pages and the pages indexed by people in your network.
Not quite like reddit and digg. You can bot farm those because the lists are common to all.
In this search engine, let's say there's you, me, person three, and spammer. You are following me, and I'm following person three. Spammer isn't in any of our networks.
When you use the search engine, you only see results that you, me, or person three manually tagged as worthwhile. Any pages or content that Spammer tagged as worthwhile aren't part of your results, because they aren't in your network. So they can try to game the system all they want, but it won't affect you.
If person three starts following Spammer, I can unfollow them and then Spammer's results will no longer be included in your search results (or you can unfollow me and avoid those results).
I imagine rankings would also be affected by degrees of separation, so even if you followed me, I followed person three, and person three followed spammer, results tagged by me and you would take much higher precedence than results tagged by Spammer.
This also allows you to make custom searches by choosing which people you follow to include. Suppose you want to search for good headphones, so you make a search, but only include people in your network that you know are music and audio savvy, so that the results reflect the pages tagged by those people.
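The proposal above can be sketched in a few lines, under loud assumptions: the follow graph and tagged pages fit in plain dicts, matching is a naive substring test, and each hop away from you divides a tagger's weight (the `1 / (1 + hops)` formula and all names and URLs are made up for illustration).

```python
from collections import deque


def network_distances(follows: dict[str, set[str]], user: str, max_hops: int = 2) -> dict[str, int]:
    """Breadth-first walk of the follow graph: hop count to everyone within max_hops."""
    dist = {user: 0}
    queue = deque([user])
    while queue:
        person = queue.popleft()
        if dist[person] == max_hops:
            continue  # do not expand beyond the allowed degrees of separation
        for followed in follows.get(person, set()):
            if followed not in dist:
                dist[followed] = dist[person] + 1
                queue.append(followed)
    return dist


def search(query: str, tagged: dict[str, list[tuple[str, str]]],
           follows: dict[str, set[str]], user: str, max_hops: int = 2) -> list[str]:
    """Rank pages tagged by people in the user's network; closer taggers count more."""
    scores: dict[str, float] = {}
    for person, hops in network_distances(follows, user, max_hops).items():
        for url, description in tagged.get(person, []):
            if query.lower() in description.lower():
                scores[url] = scores.get(url, 0.0) + 1.0 / (1 + hops)
    return sorted(scores, key=lambda url: (-scores[url], url))


follows = {"you": {"me"}, "me": {"three"}, "three": set(), "spammer": set()}
tagged = {
    "you": [("a.example/review", "good headphones review")],
    "three": [("c.example/guide", "headphones buying guide")],
    "spammer": [("spam.example", "cheap headphones headphones headphones")],
}

print(search("headphones", tagged, follows, "you"))
```

Spammer's page never appears because nobody reachable from "you" follows them. If person three did follow Spammer, that page would sit three hops out and stay excluded at the default `max_hops=2`; raising the limit would admit it, but only at a quarter of the weight of a page you tagged yourself.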
Good indices lead to good search engines (engines can make use of indices)
Good search engines lead to bad indices (by obsoleting them)
Bad indices lead to bad search engines
Bad search engines lead to good indices
I do see incredibly weird kids content on YouTube sometimes (most likely bot generated?) which makes me think kids have been experiencing a worse internet before the rest of us have.
Kids are far less knowledgeable about how modern software works because they don't know of an Internet that didn't have algorithmic recommendations. They have to be taught to do things like click "Not Interested/Don't recommend channel" to improve their feed. Dark pattern designs make this harder by hiding these options behind tiny 3-dot buttons.
I don’t think this is a real problem because as users start being more intentional about who they subscribe to and more thorough in ranking content according to its usefulness and quality, the low quality stuff or regurgitated stuff will just vanish.
Why would it matter if there are clones of the best, say, blog post on how to make spicy ramen? If they are not adding anything new or making that original effort better, then they will not surface in searches as search tools improve. Nobody will save that content or recirculate it or refer to it when they need to remember how to make spicy ramen.
And people will build curated subscriptions and followings and recommendations that are more tailored to the individual, and we will spend more time determining who is trustworthy and who is not.
Oh it gets even better. The public has been hearing about AI this and AI that for over a year, but the existing use cases and deployment was confined to some super special niches like writing or the creative industries and programming.
This is the first nation-scale deployment of the technology, running on Google's biggest and most profitable market in one of the most widely used internet services, and it's a shitshow.
They can try manually fine tuning it, but all of the investors who have been throwing money at AI for the past year are now learning what this tech is like in the day-to-day, beyond just speculations, and it's looking... bad.
Yeah, the most likely take here is that Google's leadership truly did not recognize how utterly awful the quality of their flagship search index had become over the years.
I mean, it explains a lot, but still... you're recruited using industry-leading practices out of an overflowing pool of abundant talent... and this is what you make of it? As the kids say: SMH!
> you're recruited using industry-leading practices out of an overflowing pool of abundant talent
The ridiculous focus on leet-code is surely industry-leading (because whatever Google does becomes industry-leading) but it sure isn't a good way to filter for competency.
I heard a funny quote that "today we have a new generation of developers who learnt how to pass interviews but don't know how to code, and we have an old generation of developers who know how to code but forgot how to pass interviews. Or maybe never knew".
> you're recruited using industry-leading practices out of an overflowing pool of abundant talent... and this is what you make of it?
That's exactly what to make of their frathouse nonsense.
Google has gotten away with it because smart people and a sweet moment of opportunity 20-25 years ago gave them... uh, an inheritance. They can coast on that inherited monopoly position, and afford to pay 100 people to do the work of 1, use the company's position to push whatever they build onto the market, and then probably cancel it anyway, always going back to the inherited money machine from the ancestors.
And then a lot of companies who didn't understand software development blindly tried to copy whatever the richest company they saw was doing, not understanding the real difference between the companies. While VC growth investment schemes let some of those companies get away with that, because they didn't have to be profitable, viable, responsible, nor legal, nor even have reasonably maintainable software.
Poor Zoomers are now a generation separated from before the tech industry's cocaine bender. For whatever software jobs will be available to them, and with the density of nonsense "knowledge" that will be in the air, I don't know how they'll all learn non-dysfunctional practices.
Plenty of people have been using ChatGPT for daily tasks for almost two years now. GPT-4 isn’t perfect but is otherwise really really good, and deftly handles use cases in my industry that would be impossible without it or however many billion dollars it would take to make GPT-4.
From the black Nazis to the suggestion to jump off the Golden Gate Bridge b/c depression, it’s pretty clear that this fiasco isn’t an LLM problem, it’s a Google problem.
> To me, it seems that returning to the concept that search results should at least reflect a broad consensus of what is true is a necessary first step for Google. As part of that, learning to flag obvious trolling, clickbait and bad-faith content is paramount.
Who will decide what is obvious trolling and bad-faith content and how will they decide it? The problem they have is that search is only useful if it gives users what they are looking for. Their business model though is predicated on finding a way to introduce ads into the mix, and if they are also then trying to become arbiters of what truth people find and see, then all the conflicting goals will create a series of contradictory requirements. The search tools that usefully find what the user is looking for, with helpful suggestions, will win. Once users find that their experience is curated and that they are coerced by unelected arbiters and censors they will not trust the platform in question and someone else will get that market share.
the future is in-context search - basically not even going to google search to find something, but straight up doing that from your current window from any location. Basically a chat bot following you everywhere.
My whole qualm with this AI integration into search engines: it's a search engine, not a question engine. I go to google to search the internet for something, not ask it a question. IMO, asking AI for something is a different task than searching the internet.
It's sorta the same problem as if I go into a store and ask an employee where something is, and they reply with "well what are you trying to do?"
for a lot of people and in a lot of use cases, it is a tool for answering questions. it generally works well for that.
i get that the AI implementation sucks, but to suggest that people don't use google to find the answer to questions is absurd. that's absolutely what it's for.
Your interpretation is a bit strict, with little charity. It's clear the poster means "i don't always just want an answer, i want to learn"
I saw this over and over again working on products at G: someone would invoke some myth I can't quite remember about how "Larry" had a vision of just giving the answer.
That's true but comes back to the central mistake Google makes: we don't actually have AGI, they can't actually answer questions, and people aren't actually satisfied with just the answer.
There's all sorts of tendrils from there, ex. a major sin here _has_ to be they're using a very crappy very cheap LLM.
But, I saw it over and over again, 7 years at Google, on every AI project I worked on or was adjacent to, except one. They all assume $LATEST_STACK can just give the perfect answer and users will be so happy. It can't, they don't actually want just the answer, and BigCo culture means you don't rock the boat and just keep moving forward.
Recently I searched Google for a slightly unlikely phrase — in quotation marks — and Google proudly told me that my phrase was grammatically correct.
And nothing else. They didn't give me any search results. Or even tell me there weren't any results. Or even give me a button to press to say "no, I really wanted to search the internet for this phrase".
And also I have zero interest in Google's opinion on English grammar and am frankly insulted to be offered it, although to be fair I'm probably in a minority worldwide on that one.
If I can't use Google to search the internet for things, then Google is eventually going to have a big problem.
> I sometimes want a search engine, sometimes a question engine.
If you want a search engine, it's easy to use the results as feedback to refine the query. But a question (answer?) engine would need to be an expert in the subject, and not just parroting stuff. That usually means curation. You need something to do the work ahead of time to separate the wheat from the chaff. I don't see how LLMs can do that.
LLMs can't be a search engine, and can't be a question engine. The best way to treat one is as a simulation engine, but the use cases depend on the training data. And the proof is there that the internet is full of junk, and not that expansive.
If it's in the training data, then it should be able to do that. That is to say, a comment's points matter. and the subreddit it's on. and who said it, and how the rest of their comments do/where they are. The LLM could annotate the unredacted reddit dataset with metadata as to where to rate it on the words used, the accuracy of the information, the sarcasm quotient, the hilarity quotient, how condescending the comment is; all of that an LLM could generate metadata about and feed into itself to get better and better.
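To make the annotation idea above concrete, here is a minimal sketch of what per-comment metadata and a weighting rule could look like. Everything in it is an assumption for illustration: the `CommentAnnotation` fields, the `training_weight` heuristic, the 1.5 multiplier, and the two sample rows are made up, and the accuracy and sarcasm scores are assumed to come from some annotator model that is not shown.

```python
from dataclasses import dataclass


@dataclass
class CommentAnnotation:
    """Hypothetical per-comment metadata; field names are illustrative."""
    text: str
    points: int        # comment score on the site
    subreddit: str     # where the comment was posted
    accuracy: float    # 0.0-1.0, as rated by some annotator model
    sarcasm: float     # 0.0-1.0, higher means more likely a joke


def training_weight(c: CommentAnnotation, trusted: set[str]) -> float:
    """Toy heuristic: down-weight sarcastic or inaccurate comments."""
    weight = c.accuracy * (1.0 - c.sarcasm)
    if c.subreddit in trusted:
        weight *= 1.5  # arbitrary boost for communities assumed reliable
    if c.points < 0:
        weight = 0.0   # drop comments the community voted down
    return weight


# Made-up sample rows, not real Reddit data.
comments = [
    CommentAnnotation("Add glue to the sauce.", 8, "Pizza",
                      accuracy=0.0, sarcasm=0.9),
    CommentAnnotation("Let the pizza rest so the cheese sets.", 3, "Cooking",
                      accuracy=0.9, sarcasm=0.0),
]
weights = [training_weight(c, trusted={"Cooking"}) for c in comments]
print(weights)
```

Note that a highly upvoted joke only gets filtered here if the annotator scores it as sarcastic or inaccurate, which is exactly the hard part the sketch leaves out.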
I'm actually shocked that a company that has spent 25 years on finetuning search results for any random question people ask in the searchbox does not have a good, clean, dataset to train an LLM on.
Maybe this is the time to get out the old Encyclopedia Britannica CD and use that for training input.
Google’s transformation of conventional methods into means of hypercapitalist surveillance is both pervasive and insidious. The “normal definition of that term” hides this.
You don't need "hypercapitalist surveillance" to show someone ads for a PS5 when they search for "buy PS5".
If they're doing surveillance they're not doing a good job of it. I make no effort to hide from them and approximately none of their ads are personalized to me. They are personalized to the search results rather than to what they know from my history.
It’s a bit weird since Google is taking over the “burden of proof”-like liability. Up until now, once user clicked on a search result, they mentally judged the website’s credibility, not Google’s. Now every user will judge whether data coming from Google is reliable or not, which is a big risk to take on, in my opinion.
That latter point might be illuminating for a number of additional ideas. Specifically, should people have questioned Google's credibility from the start? Ie: these are the search results, vs this is what google chose.
Google did well in the old days for reasons. It beat AltaVista and Yahoo by having better search results and a clean loading page. Since perhaps '08 or so (based on memory, that date might be off), Google has dominated search to the extent that it's no longer salient that search engines can be really questionable. Which is also to say: because Google dominated, people lost sight of the fact that searching and googling are different, and that gives a lot of freedom for enshittification without people getting too upset, or even quite realizing that it could be different and better.
But only if you do a lot of filtering when going through responses. It’s kind of simple to do as a human, we see a ridiculous joke answer or obvious astroturfing and move on, but Reddit is like >99% noise, with people upvoting obviously wrong answers because they're funny, lots of bot content, and constant astroturfing attempts.
The users of r/montreal are so sick of lazy tourists constantly asking the same dumb "what's the best XYZ" questions without doing a basic search first that the meme answer is always "bain colonial", which is a men-only spa for cruising. It's often the topmost voted comment. I just tried asking gemini and chatgpt what that response meant and neither caught on.
No, it isn't. Humans interacting with human-generated text is generally fine. You cannot unleash a machine on the mountains of text stored on reddit and magically expect it to tell fact from fiction or sarcasm from bad intent.
> You cannot unleash a machine on the mountains of text stored on reddit and magically expect it to tell fact from fiction or sarcasm from bad intent
I didn't say you could, but that a machine can't decode the mountains of text doesn't mean that the answer isn't (perhaps only) on Reddit. I don't think people would be that interested in search engine that just serves content from books and academic papers.
The fact is that I think that there is not much written word, to actually train a sensible model on. A lot of books don't have OCRed scans, or a digital version. Humans can extrapolate knowledge from a relatively succinct book and some guidance. But I don't know how a model can add the common sense part (that we already have) that books rely on to transmit knowledge and ideas.
> The fact is that I think that there is not much written word, to actually train a sensible model on. A lot of books don't have OCRed scans, or a digital version.
Coincidentally, I was just watching a video about how South Africa has gone downhill - and that slide was hastened by McKinsey advising the crooked "Gupta brothers" on how to most efficiently rip off the country.
The problem in this case is not that it was trained on bad data. The AI summaries are just that - summaries - and there are bad results that it faithfully summarizes.
This is an attempt to reduce hallucinations coming full circle. A simple summarization model was meant to reduce hallucination risk, but now it's not discerning enough to exclude untruthful results from the summary.
Two reasons. The first, even ignoring that truth isn't necessarily widely agreed (is Donald Trump a raping fraud?), is that truth changes over time. eg is Donald Trump president? And presidents are the easiest case because we all know a fixed point in time when that is recalculated.
Second, Google's entire business model is built around spending nothing on content. Building clean pristinely labeled training sets is an extremely expensive thing to do at scale. Google has been in the business of stealing other people's data. Just one small example: if you produced (very expensive at scale) clean, multiple views, well lit photographs of your products for sale they would take those photos and show them on links to other people's stores; and if you didn't like that, they would kick you out of their shopping search. etc etc. Paying to produce content upends their business model. See eg the 5-10% profit margin well run news orgs have vs the 25% tech profit margin Google has even after all the money blown on moonshots.
So Google hasn't used an LLM to generate and test weird queries ? This is not putting the bar very high for the whole industry... There'd be so much to gain from a clean deployment...
Either it is hard, or it is a rush. As a machine learnist, I believe it's actually impossible, by design of the autoregressive LLM. This race may well be partially to the bottom.
Google’s poor testing is hardly in doubt. But keep in mind that the whole problem is that LLMs don’t handle “unlikely” text nearly as well as “likely” text. So the near-infinite space of goofy things to search on Google is basically like panning for gold in terms of AI errors (especially if they are using a cheap LLM).
And in particular LLMs are less likely to generate these goofy prompts because they wouldn’t be in the training data.
> So Google hasn't used an LLM to generate and test weird queries ?
You don't even need an LLM for that. Google will almost certainly have tested.
The test result is just politically unacceptable within the company: It doesn't work, it's an architectural issue inherent to the technology, we can't fix it.
Instead, they just rush to patch any specific, individual errors that show up, and claim that these errors are "rare exceptions" or "never happened".
What's going on here is that Google (and most other AI firms) are just trying to gaslight the world about how error-prone AI is, because they're in too deep and can't accept the reality themselves.
I'm not convinced the executive layer is aware how dire the problem is.
On one hand, their support for outsourcing programmes; "Training Indians on how to use AI", suggests they realize AI tooling without human cleanup is a crapshoot.
On the other hand, they keep digging. This kind of gaslighting is an old and proven trick for genuinely rare problems, but it doesn't work if your issues are fairly common, as they'll get replicated before you can get a fix out.
Similarly, they're gambling with immense legal risks and sacrificing core products for it. They're betting the farm on AI, it may kill the company.
I think they are more than aware but will magically disappear after cashing their stock just about the point the bubble pops. Don't forget that the AI industry is almost 100% based on hype. Microsoft will be the largest victim here, their entire product portfolio being turned into a nuclear fallout zone almost overnight. Satya and friends are going to trash the whole org.
I regularly speak to laypeople who assume that it's some magical thing without limits that makes their lives better. They are also 100% unaware of any applications that will actually make their lives better. End game occurs when those two disconnected thoughts connect and they become disinterested. The power users and engineers who were on it a year ago are either burned out or finding the limitations a problem as well now. There is only magical thinking, lies and hope left.
Granted there are some viable applications, but they are rather less overstated than anything we have now, and even those have negative side effects (think image classification, which even if it works properly requires human review, and there are psychological and competence problems around that too).
There has been a lot of excitement recently about how using lower precision floats only slightly degrades LLM performance. I am wondering if Google took those results at face value to offer a low-cost mass-use transformer LLM, but didn’t test it since according to the benchmarks (lol) the lower precision shouldn’t matter very much.
But there is a more general problem: Big Tech is high on their own supply when it comes to LLMs, and AI generally. Microsoft and Google didn’t fact-check their AI even in high-profile public demos; that strongly suggests they sincerely believed it could answer “simple” factual questions with high reliability. Another example: I don’t think Sundar Pichai was lying when he said Gemini taught itself Sanskrit, I think he was given bad info and didn’t question it because motivated reasoning gives him no incentive to be skeptical.
Well yeah imagine how much money there is to make in information when you can cut literally everyone else involved out, take all of the information and sell it with ads and only give people a link at the bottom, if that is even needed at all
Not hallucinations but these AI answers often (always?) provide sources they link to. It's just that the source is a random Reddit or Quora post that's obviously just trolling.
Then, when people post these weird AI answers on Reddit and come up with more absurd jokes, the AI picks them up again. For example, in https://www.reddit.com/r/comedyheaven/comments/1cq4ieb/food_... Google AI suggested applum and bananum as a response to food names ending with "um". When someone suggested uranium, Copilot AI started copying that suggestion. It's entertaining to watch.
These LLMs can produce nothing else, but since the bullshit they spew resembles an answer and sometimes accidentally collides with one, people tend to think it can give answers. But no.
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.
This is irrelevant because the LLM is mostly not answering the question directly, it's summarizing text from web results. Quoting a joke isn't a hallucination.
So LLMs distill human creativity as well as human knowledge, and it’s more useful when their creativity goes off the rails than when their knowledge does.
It’s not a trick to sound sophisticated. Hallucinations are more like a subcategory of bugs. The system is technically correctly generating, structuring, and presenting false information as fact.
Technically everything an LLM does is hallucination that happens to be on a scale between correct and non-correct. But only humans with knowledge can tell the difference, math alone can't. It's not even a bug: it's the defining feature of the technology!
Knowledge isn't sufficient to show something is false, since the knowledge can also be false. Insofar as it's important for it to be true, it needs to be continually verified as true, so that it's grounded in the real world.
Hmm yeah I kinda like the concept that it's "hallucinating" 100% of the time, and it just so happens that x% of those hallucinations accurately describe the real world.
That x% is far higher than people think it is, because there's a tremendous amount of information about the world that AI models need to "understand" that people just kind of take for granted and don't even think about. A couple of years ago, AIs routinely got "the basics" wrong, but now they so often get most things right that people don't even think it's worth commenting on.
In any case, human consciousness is also a hallucination.
It really depends on the set of prompts you present the LLM. If it's anything requiring reasoning, you'll often get nonsense that sounds like sense. It has a higher chance of being accurate with knowledge queries.
LLMs are impressive, a very lossy search engine in a small package, capable of outputting convincing natural language responses.
generative AI is essentially three day labourers from an emerging economy in a trenchcoat. From data labelling, to "human reinforcement", to manually cleaning up nonsensical AI results.
Institutional investors panic > board panics > executives panic > evps panic and dictates incentives to ship AI > directors, ems, and below, who actually know how shit works, take a submissive role because they have mortgages in Mountain View to pay.
I know that's hyperbole, but here's something for $600k. Let's say it goes for $700k. 20% of 700 is $140k.
An L4 engineer at Google makes like $300k. round down to $200k post tax. live in shared housing for $2000/month; that's $24k/yr, add $2000/month living expenses on top of that, is another $24k. Stick $2000/month into 401k, Save the rest (200-24-24-24 ) 128k and you'll save up $140k in a bit over a year.
bam; owner, not renter.
Now, whether this is better than renting and putting the money into the stock market is a totally different question, but even L4's at Google can afford a mortgage in Mountain View.
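For anyone who wants to check that back-of-envelope math, here is a small sketch that reruns it. All figures are the estimates from the comment above (in thousands of dollars), not verified data, and the variable names are just for illustration.

```python
# Back-of-envelope check of the figures quoted above, in $k.
# Every number is the commenter's estimate, not verified data.
house_price = 700           # assumed sale price
down_payment_percent = 20   # 20% down
post_tax_income = 200       # ~$300k gross, rounded down after tax
rent = 2 * 12               # $2,000/month shared housing
living = 2 * 12             # $2,000/month living expenses
retirement = 2 * 12         # $2,000/month into a 401k

down_payment = house_price * down_payment_percent // 100
annual_savings = post_tax_income - rent - living - retirement
years_to_down_payment = down_payment / annual_savings

print(f"down payment: ${down_payment}k")
print(f"annual savings: ${annual_savings}k")
print(f"years needed: {years_to_down_payment:.2f}")
```

With these inputs the down payment is $140k against $128k saved per year, which matches the "a bit over a year" estimate.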
I love chatgpt and use it all the time and find it tremendously useful, but I never want to see AI generated content when I am not specifically looking for it. I don't want to see it in comments, I don't want to see it in search results, I don't want to see it as an illustration for an article, I _really_ don't want to see AI generated word vomit blog posts or fake "news" articles when I'm looking for actual information.
It's not even because it's sometimes (or often) wrong or full of hallucinations. Even if it's 100% factually correct all of the time, it's _poor quality writing and art_, full of cliches and bland generalities, which even if they solve all the rest of the problems it's sort of fundamental to the architecture of transformers. You can't ever be truly creative or unique if you're predicting the _most likely_ token.
I’m curious why Sundar Pichai is still running this company? From recent videos it really seems like he has no idea what he’s talking about, and the company seems to be headed in the wrong direction.
Just checked the 5 year stock graph; now I understand
> Except that 90% of Reddit isn't garbage. It's really useful.
Citation needed. I've been a Reddit user since its inception and honestly except for niche hobby subreddits, Reddit is mostly low effort garbage, bots and rehashed content. I'd wager that mainstream subreddits are 99% garbage for training an LLM for anything other than shitposting.
Even in the niche hobby subreddits there can be a really high garbage factor. There are plenty of well-meaning posters who are just wrong. They're not trying to mislead or lying, they're just unaware they're wrong.
The good answers tend to use links as well, which won't capture well. In many political and local subreddits there's a huge amount of Russian and far right sock puppet activity. Good luck training an AI to understand political opinions or what people in an area are like when most of the longer comments are pre written copy pasted talking points from astro turf groups and bad actors.
Pretty much. There was some good information, and even some book-worthy material. But it was the stuff that bubbled to the top in helpful and knowledgeable communities. The rest is junk.
I'd argue it's far less than 90% but yes, there is some good information there. But weeding out the noise is what needs to happen, and (for some topics more than others) there is an awful lot of it.
"Search" isn't Google's product. Google hasn't been a search company for 20 years.
"Ads" is Google's product. And the only way they'll go bankrupt is if 1) companies realize that advertising is pointless (I'm not holding my breath), or 2) some other company takes over from Google, which seems unlikely without government intervention (I'm not holding my breath).
Google is a shit company, but they'll still be around 20 years from now, because our economy is nonsensical and irrational.
Google runs ads for a significant percentage of the web (or the markets for ads). Even if everyone stopped going to google.com tomorrow they'd still be seeing ads that make Google money. Google the company would still be tracking much of the web's traffic feeding it into their ads platform.
It seems to be turned off for me. And I was in beta testing for a month. Or maybe they are figuring out who is doing weird searches and turning off for them.
In any case this thing is just hilarious. Just right after their AI painted historical figures as black.
So far I have not seen it ever in either Firefox or 2-3 Chromium-based browsers, on a handful of computers in multiple locations.
I don't see a way google can make this work. As I understand it LLM confabulations can be reduced but never eliminated owing to how they're built. Google could try and create a fact-checking department to make queries reduced to falsehoods or bullshit but then they face the problem of appointing themselves arbiters of the "truth". The only way to win is to not play the game, as I see it. I wish the collective AI fever would break already.
> A correction was made on May 24, 2024: An earlier version of this article referred incorrectly to a Google result from the company’s new artificial-intelligence tool AI Overview. A social media commenter claimed that a result for a search on depression suggested jumping off the Golden Gate Bridge as a remedy. That result was faked, a Google spokeswoman said, and never appeared in real results.
that screenshot was tweeted by @allgarbled. ten minutes before, they tweeted:
>free engagement hack right now is to just inspect element on the google search AI thing and edit it to something dumb. hurry up, this deal won’t last forever
That’s always been an issue. Years ago, researchers demonstrated in an experiment that they could swing public opinion about electoral candidates by manipulating search results. Who knows if Google took that experiment and ran with it?
I mean, that's always been the TikTok argument, to me.
Widely-used platforms that can +/- 1% their algorithms to affect democracy have pretty high burdens of trust/transparency, and we're not close to that with any platform (Chinese or not) that I'm aware of.
Meta's probably the closest, because of scrutiny, but afaik even their transparency isn't sufficient for realtime attestation.
Feels like there's a market for a bunch of Googlers to go off and take what they know about how Google works and make a new, barebones search engine that is essentially Google circa 2015.
Before AI, before we had to append "reddit" to get useful human knowledge.
The fall of Google’s reputation on ML is nothing short of spectacular. They went from having a near untouchable reputation as being far ahead of any other large tech company on ML to total shambles in a year. Everything they’ve released has been a complete popcorn-worthy dumpster fire, from faked demos, to racist models that try to pretend white people don’t exist, to this latest nonsense telling me to put glue on my pizza.
What the heck happened? Or was their reputation always just more hype than substance?
It could be because they actually released something. If you look back, the Google Research blog posts always have grandiose claims, but you often can't actually use what they describe.
AlphaGo, AlphaFold, and Waymo FSD are all released in the sense that you can see them actually working in the real world. Those all took much longer to put together than whatever rushed features were released to catch up with OpenAI, however.
There was an interesting interview with David Luan about this recently. For context, he was a co-lead at Google Brain, early hire at OpenAI, and is now a founder at Adept: https://www.latent.space/p/adept
The TL;DR on his take is that there are organizational and cultural issues that prevent Google from focusing their research efforts in the way that is necessary for what he calls "big swings," like training GPT-3.
In regards to your second question, Google's reputation in ML is definitely not hype. Purely on the research side, Google has been behind some of the most important papers in modern ML, particularly around language models. The original Transformers paper, BERT, lots of work around neural machine translation, all of the work that DeepMind has done post-acquisition, and the list goes on. On the applied side, they also have some of the most successful/widely-adopted ML-powered products on the market (think RankBrain/anything involving a recommendation engine, Translate, Maps, a ton of functionality in Gmail, etc).
It's very funny that Bing AI is now also telling people to eat a small rock every day, and citing pages telling people about how dumb Google AI is for telling people to eat rocks.
Most of the search result fixes are manual and are in response to publicity. You can typically find analogous problems for weeks or quarters after things like this.
Perhaps they could run each search result through ChatGPT. It's pretty skilled at spotting bad results. For example, I asked it whether the glue-on-pizza result was "valuable and should be shown to a user" and it returned "No, this response should not be shown to the user. The suggestion to add non-toxic glue to the sauce is inappropriate and potentially harmful."
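A rough sketch of that "judge every result before showing it" idea is below. It does not call ChatGPT or any real API: `keyword_judge` is a hypothetical stand-in for the model call, and the `Judge` type, `filter_results`, the red-flag list, and the sample snippets are all made up for illustration.

```python
from typing import Callable

# A judge takes (query, snippet) and returns True if the snippet may be shown.
Judge = Callable[[str, str], bool]


def keyword_judge(query: str, snippet: str) -> bool:
    """Stand-in for an LLM call; a real judge would prompt a model instead."""
    red_flags = ("glue", "eat a small rock")
    return not any(flag in snippet.lower() for flag in red_flags)


def filter_results(query: str, snippets: list[str], judge: Judge) -> list[str]:
    """Keep only the snippets the judge approves."""
    return [s for s in snippets if judge(query, s)]


# Made-up sample snippets for illustration.
results = [
    "Add non-toxic glue to the sauce for more tackiness.",
    "Let the pizza cool for a few minutes so the cheese sets.",
]
shown = filter_results("cheese not sticking to pizza", results, keyword_judge)
print(shown)
```

Keeping the judge as a plain callable means the keyword stub could be swapped for a model-backed check without touching the filtering code, though every result would then cost an extra model call.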
Companies spent all that money on high end GPUs for crypto mining and that went bust, now gotta figure out something to do with the hardware to try to recoup some of the investment. Google pumped $1.5 Billion into crypto.
As I mentioned previously, I've seen Bing's LLM stall for about a minute when asked something iffy but uncommon. I wonder if Bing is outsourcing questionable LLM results to humans. Anyone else seeing this?
"text" made the LLMs report offensive and give unfiltered replies to inquiries. To think what I said above can't happen during the web scraping process is naive. Thanks for the down d00t.
Your core search result product has gotten increasingly worse and less reliable over at least the last 5 years. YouTube's search results are nearly unusable.
I can't imagine almost any external customer is asking for the AI bullshit thing that's just being shovelwared into every Alphabet product now.
I just noticed a couple days ago the gmail iOS app now does the same predictive completion that Copilot tries to do when I'm working. It's annoying as hell and I can't find how or if I can turn it off.
Stop bullshitting around with ruining your products and get back to making money by making accessing information easier and more accurate.
Google: Hey geuis, our revenue is record, our stock value is record, our metrics are all at record. The execs making decisions have just been paid millions in stock [1], making them staggeringly rich no matter what happens in the future. We can't hear you over the sound of green bills going BRRRRR.
Most accurate description of Google I have seen. YT search is so, so bad. Three relevant results followed by twelve "people also watched" results then back to the good results.
Although ChatGPT is a great product, I rely on it more and more not because it's improving, but because Google results are getting worse.
Yeah I would still fact check for complex, in-depth things... but for quick things where I'm knowledgeable enough that I can smell the hallucinations from a mile away, ChatGPT 100%.
I don’t understand why these companies don't just talk to people like they’re actually people and tell people when they’re rolling out new stuff.
Straight up, and this is going to sound really stupid I know, but if Sundar Pichai had just come out and said, “hey, we’re trying to do this new thing, it’s gonna be hard but we want to make Google awesome. If you like the results, click the ‘I like it’ button, otherwise click the ‘dislike’ button so we can get some real feedback from people. But seriously, this stuff is hard, so please help us out.
If we can tune the AI so that we can give you the best results possible it will make Google way better! Also, if you don’t wanna see any of this, there’s a setting here to turn it off.”
Just ask people, show a little humanity, and act human and you’ll get better results and won’t be getting all this bad press. The same thing goes for OpenAI right now too.
Seriously, though, does anybody else crave authenticity? These companies are all acting like the AI that they’re trying to create, but from five years ago when it was shitty and didn’t know how to communicate with people. Just talk to people and ask for their help. Just being honest and talking to people normally isn’t that hard.
Your usual reminder that there was a guy at Google who was so impressed by their LLM that he considered it sentient. And this was two years ago when the AI was presumably far less developed than the current abomination.
Putting glue on the pizza is (apparently) a clever way to take pictures of slices of pizza that look "perfect" to the camera (not for eating, obviously) [1]. I remember a couple years ago some videos of "tricks" showing this, plus literally screwing the pizza with screws.
So, yeah, the ai did in fact autocomplete the question correctly. It was just the wrong context. Good luck trying to "fix" that.
This is the kind of ridiculous fumble that GOFAI (like Cyc) should be able to avoid by recognizing context. I wonder how neuro-symbolic systems are coming along, and whether they can save us from this madness. The general populace wants the kinds of things LLMs provide, but isn’t prepared to be as skeptical as is needed when reviewing the answers it generates.
My initial thought was to simply have any match with an Onion story blacklisted... But then I realized that The Onion became prophetic in 2016 when Trump ran for president.
Since then the only difference between an Onion fiction and things actually sucking that much is a decade or less in almost all cases.
If we blacklisted content seen in the Onion, we'd automatically wipe out most news.
With these dangerous answers, to the general public, Google is giving AI a very bad name, when in truth it's strictly Google that deserves the feeling.
'Around 2002, a team was testing a subset of search limited to products, called Froogle. But one problem was so glaring that the team wasn't comfortable releasing Froogle: when the query "running shoes" was typed in, the top result was a garden gnome sculpture that happened to be wearing sneakers. Every day engineers would try to tweak the algorithm so that it would be able to distinguish between lawn art and footwear, but the gnome kept its top position. One day, seemingly miraculously, the gnome disappeared from the results. At a meeting, no one on the team claimed credit. Then an engineer arrived late, holding an elf with running shoes. He had bought the one-of-a kind product from the vendor, and since it was no longer for sale, it was no longer in the index. "The algorithm was now returning the right results," says a Google engineer. "We didn't cheat, we didn't change anything, and we launched."'
https://news.ycombinator.com/item?id=14009245