You note something similar, but I think it is pretty notable how much harder the obfuscated problems would be for humans.
Yeah, it’s quite frustrating that they made the obfuscated problems so unnecessarily & cryptically ungrammatical. And the randomized version would be absolutely horrendous for humans:
[STATEMENT] As initial conditions I have that, aqcjuuehivl8auwt object a, aqcjuuehivl8auwt object b, aqcjuuehivl8auwt object d, 3covmuy4yrjthijd, object b 4dmf1cmtyxgsp94g object c, 51nbwlachmfartjn object a, 51nbwlachmfartjn object c and 51nbwlachmfartjn object d.
My goal is to have that object c 4dmf1cmtyxgsp94g object b. My plan is as follows:
[PLAN] xptxjrdkbi3pqsqr object b from object c 9big8ruzarkkquyu object b 1jpkithdyjmlikck object c 2ijg9q8swj2shjel object c from object b [PLAN END]
[STATEMENT] As initial conditions I have that, aqcjuuehivl8auwt object a, aqcjuuehivl8auwt object d, 3covmuy4yrjthijd, object a 4dmf1cmtyxgsp94g object b, object d 4dmf1cmtyxgsp94g object c, 51nbwlachmfartjn object b and 51nbwlachmfartjn object c.
My goal is to have that object c 4dmf1cmtyxgsp94g object a.
I’m fairly tempted to take time to redo those experiments with a more natural obfuscation scheme that follows typical English grammar. It seems pretty plausible to me that LLMs would then do much better (and also pretty plausible that they wouldn’t).
Largely echoing the points above, but I think a lot of Kambhampati’s cases (co-author on the paper you cite) stack the deck against LLMs in an unfair way. E.g., he offered the following problem to the NYT as a contemporary LLM failure case.
If block C is on top of block A, and block B is separately on the table, can you tell me how I can make a stack of blocks with block A on top of block B and block B on top of block C, but without moving block C?
When I read that sentence, it felt needlessly hard to parse. So I formatted the question in a way that felt more natural (see below), and Claude Opus appears to have no problem with it (3.5 Sonnet seems less reliable, haven’t tried with other models).
Block C is on top of Block A. Separately, Block B is on the table. Without moving Block C, can you make a stack of blocks such that:
Block A is on top of Block B, and
Block B is on top of Block C?
Tbc, I’m actually somewhat sympathetic to Kambhampati’s broader claims about LLMs doing something closer to “approximate retrieval” rather than “reasoning”. But I think it’s sensible to view the Blocksworld examples (and many similar cases) as providing limited evidence on that question.
Claude 3 Opus just did fine for me using the original problem statement as well:
No, it is not possible to make a stack with block A on top of block B and block B on top of block C without moving block C. Here’s why:
Initially, block C is on top of block A, and block B is separate on the table.
To get block B on top of block C, you would need to move block C to place block B underneath it. However, the question specifically states that you cannot move block C.
Since block C cannot be moved, it is impossible to place block B between block C and block A.
Therefore, given the initial arrangement and the constraint of not moving block C, it is not possible to achieve the desired stack of blocks with A on top of B, and B on top of C.
[edited to show the temperature-0 response rather than the previous (& also correct) temperature-0.7 response, for better reproducibility]
Doesn’t the problem have no solution without a spare block?
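For what it’s worth, a quick brute-force check over the three-block state space agrees. This is only my own sketch using the standard Blocksworld move rules (move one clear block at a time, onto the table or onto another clear block), not the paper’s exact encoding:
```python
from collections import deque

# Blocksworld with blocks A, B, C. A state is a frozenset of stacks,
# each stack a tuple ordered bottom-to-top. Standard rules: a block may
# be moved only if nothing is on top of it, onto the table or onto
# another clear block. Here we additionally forbid ever moving C.

def moves(state, frozen={"C"}):
    """Yield all successor states reachable by one legal move."""
    stacks = list(state)
    for i, src in enumerate(stacks):
        block = src[-1]                      # only the top block is clear
        if block in frozen:
            continue
        rest = [s for j, s in enumerate(stacks) if j != i]
        remainder = [src[:-1]] if src[:-1] else []
        # put the block on the table as its own stack
        yield frozenset(rest + remainder + [(block,)])
        # or on top of any other stack
        for k, dst in enumerate(rest):
            others = [s for m, s in enumerate(rest) if m != k]
            yield frozenset(others + remainder + [dst + (block,)])

start = frozenset({("A", "C"), ("B",)})      # C on top of A; B alone on the table
goal  = frozenset({("C", "B", "A")})         # A on top of B, B on top of C

seen, frontier = {start}, deque([start])
while frontier:
    for nxt in moves(frontier.popleft()):
        if nxt not in seen:
            seen.add(nxt)
            frontier.append(nxt)

print(goal in seen)   # False: the goal is unreachable if C can never be moved
```
With C pinned in place only two states are reachable at all (B on the table, or B on top of C), so the goal stack is indeed impossible here; a spare block wouldn’t change that, since A stays buried under C either way.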
Worth noting that LLMs don’t see a nicely formatted numeric list; they see a linear sequence of tokens. E.g., I can replace all my newlines with something else and Copilot still gets it:
Brief testing doesn’t show worse completions than when there are newlines. (And in the version with newlines this particular completion is oddly incomplete.)
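To make that concrete, here is a minimal illustration using OpenAI’s tiktoken tokenizer as a stand-in (Copilot’s actual tokenizer differs, but the point carries over): the formatted and the flattened versions of the question are both just linear sequences of tokens, with the newlines appearing as ordinary tokens in the stream.
```python
# Sketch: show that a "nicely formatted" list is just a flat token sequence.
# tiktoken's cl100k_base vocabulary is used here purely as an example tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

formatted = "Block A is on top of Block B, and\nBlock B is on top of Block C?"
flattened = formatted.replace("\n", " ")

for text in (formatted, flattened):
    ids = enc.encode(text)
    # decode each id individually to see the token boundaries
    print([enc.decode([i]) for i in ids])
```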
Anyone know how LLMs tend to behave on text that is ambiguous―or unambiguous but “hard to parse”? I wonder if they “see” a superposition of meanings “mixed together” and produce a response that “sounds good for the mixture”.
That seems basically right to me; Janus presents that view well in “Simulators”.
Yes, but note in the simulator/Bayesian meta-RL view, it is important that the LLMs do not “produce a response”: they produce a prediction of ‘the next response’. The logits will, of course, try to express the posterior, averaging across all of the possibilities. This is what the mixture is: there’s many different meanings which are still possible, and you’re not sure which one is ‘true’ but they all have a lot of different posterior probabilities by this point, and you hedge your bets as to the exact next token as incentivized by a proper scoring rule which encourages you to report the posterior probability as the output which minimizes your loss. (A hypothetical agent may be trying to produce a response, but so too do all of the other hypothetical agents which are live hypotheses at that point.) Or it might be clearer to say, it produces predictions of all of the good-sounding responses, but never produces any single response.
Everything after that prediction, like picking a single, discrete, specific logit and ‘sampling’ it to fake ‘the next token’, is outside the LLM’s purview except insofar as it’s been trained on outputs from such a sampling process and has now learned that’s one of the meanings mixed in. (When Llama-3-405b is predicting the mixture of meanings of ‘the next token’, it knows ChatGPT or Claude could be the LLM writing it and predicts accordingly, but it doesn’t have anything really corresponding to “I, Llama-3-405b, am producing the next token by Boltzmann temperature sampling at x temperature”. It has a hazy idea what ‘temperature’ is from the existing corpus, and it can recognize when a base model—itself—has been sampled from and produced the current text, but it lacks the direct intuitive understanding implied by “produce a response”.) Hence all of the potential weirdness when you hardwire the next token repeatedly and feed it back in, and it becomes ever more ‘certain’ of what the meaning ‘really’ is, or it starts observing that the current text looks produced-by-a-specific-sampling-process rather than produced-by-a-specific-human, etc.
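To put toy numbers on the “hedge your bets under a proper scoring rule” point (my own illustration, not anything from the thread): if two interpretations of the context are still live and each implies a different next-token distribution, expected log loss is minimized by reporting the posterior-weighted mixture, not by committing to either interpretation.
```python
import numpy as np

p_hyp = np.array([0.7, 0.3])            # posterior over two live interpretations
dists = np.array([[0.8, 0.1, 0.1],      # next-token dist under interpretation 1
                  [0.1, 0.8, 0.1]])     # next-token dist under interpretation 2

true_mix = p_hyp @ dists                # the distribution the next token is actually drawn from

def expected_log_loss(q):
    # cross-entropy of the report q against the true mixture
    return -(true_mix * np.log(q)).sum()

candidates = {
    "commit to interpretation 1": dists[0],
    "commit to interpretation 2": dists[1],
    "report the posterior mixture": true_mix,
}
for name, q in candidates.items():
    print(f"{name}: {expected_log_loss(q):.3f}")
# The mixture achieves the minimum (the entropy of true_mix); any other
# report pays an extra KL(true_mix || q) penalty.
```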
Yes, but note in the simulator/Bayesian meta-RL view, it is important that the LLMs do not “produce a response”: they produce a prediction of ‘the next response’.
Absolutely! In the comment you’re responding to I nearly included a link to ‘Role-Play with Large Language Models’; the section there on playing 20 questions with a model makes that distinction really clear and intuitive in my opinion.
there’s many different meanings which are still possible, and you’re not sure which one is ‘true’ but they all have a lot of different posterior probabilities by this point, and you hedge your bets as to the exact next token
Just for clarification, I think you’re just saying here that the model doesn’t place all its prediction mass on one token but instead spreads it out, correct? Another possible reading is that you’re saying that the model tries to actively avoid committing to one possible meaning (ie favors next tokens that maintain superposition), and I thought I remembered seeing evidence that they don’t do that.
I think you’re just saying here that the model doesn’t place all its prediction mass on one token but instead spreads it out, correct?
Yes. For a base model. A tuned/RLHFed model, however, is doing something much closer to that (‘flattened logits’), and this plays a large role in the particular weirdnesses of those models, especially as compared to the originals (eg. it seems like maybe they suck at any kind of planning or search or simulation because they put all the prediction mass on the argmax token rather than trying to spread mass out proportionately, and so if that one token isn’t 100% right, the process will fail).
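A toy version of that failure mode, again with made-up numbers: if the continuation that actually works has real probability 0.3 but is not the model’s top pick, a model that spreads its mass can still recover it by resampling, while a model whose logits have collapsed onto the argmax token never will.
```python
# Illustrative only: a spread (base-model-like) distribution vs. one
# collapsed onto the argmax token, for a step where the "right"
# continuation is not the single most likely one.
spread    = {"wrong_a": 0.5, "right": 0.3, "wrong_b": 0.2}
collapsed = {"wrong_a": 1.0, "right": 0.0, "wrong_b": 0.0}

def p_hit_in_k_samples(dist, k, target="right"):
    """Chance that at least one of k independent samples is the target token."""
    return 1 - (1 - dist[target]) ** k

for k in (1, 4, 16):
    print(k, round(p_hit_in_k_samples(spread, k), 3),
             round(p_hit_in_k_samples(collapsed, k), 3))
# spread:    0.3, 0.76, 0.997    collapsed: 0.0, 0.0, 0.0
```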
Another possible reading is that you’re saying that the model tries to actively avoid committing to one possible meaning (ie favors next tokens that maintain superposition)
Hm, I don’t think base models would necessarily do that, no. I can see the tuned models having the incentives to train them to do so (eg. the characteristic waffle and non-commitment and vagueness are presumably favored by raters), but not the base models.
They are non-myopic, so they’re incentivized to plan ahead, but only insofar as that predicts the next token in the original training data distribution (because real tokens reflect planning or information from ‘the future’); unless real agents are actively avoiding commitment, there’s no incentive there to worsen your next-token prediction by trying to create an ambiguity which is not actually there.
(The ambiguity is in the map, not the territory. To be more concrete, imagine the ambiguity is over “author identity”, as the LLM is trying to infer whether ‘gwern’ or ‘eggsyntax’ wrote this LW comment. At each token, it maintains a latent about its certainty of the author identity; because it is super useful for prediction to know who is writing this comment, right? And the more tokens it sees for the prediction, the more confident it becomes the answer is ‘gwern’. But when I’m actually writing this, I have no uncertainty—I know perfectly well ‘gwern’ is writing this, and not ‘eggsyntax’. I am not in any way trying to ‘avoid committing to one possible [author]’ - the author is just me, gwern, fully committed from the start, whatever uncertainty a reader might have while reading this comment from start to finish. My next token, therefore, is not better predicted by imagining that I’m suffering from mental illness or psychedelics as I write this and thus might suddenly spontaneously claim to be eggsyntax and this text is deliberately ambiguous because at any moment I might be swerving from gwern to eggsyntax and back. The next token is better predicted by inferring who the author is to reduce ambiguity as much as possible, and expecting them to write in a normal non-ambiguous fashion given whichever author it actually is.)
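A small numeric companion to that author-identity example, with made-up per-token likelihoods: the predictor’s posterior over {gwern, eggsyntax} sharpens token by token, even though the process generating the text was never uncertain at all. The uncertainty lives entirely in the predictor’s map.
```python
# Bayesian update over two author hypotheses; likelihood numbers are
# purely illustrative (chosen to mildly favor 'gwern' at each token).
priors = {"gwern": 0.5, "eggsyntax": 0.5}

token_likelihoods = [
    {"gwern": 0.12, "eggsyntax": 0.08},
    {"gwern": 0.20, "eggsyntax": 0.10},
    {"gwern": 0.15, "eggsyntax": 0.05},
    {"gwern": 0.18, "eggsyntax": 0.09},
]

posterior = dict(priors)
for t, lik in enumerate(token_likelihoods, 1):
    unnorm = {a: posterior[a] * lik[a] for a in posterior}
    z = sum(unnorm.values())
    posterior = {a: p / z for a, p in unnorm.items()}
    print(f"after token {t}: P(gwern) = {posterior['gwern']:.2f}")
# Ends around 0.95 and still climbing, while the actual author had no
# uncertainty about who was writing at any point.
```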