Of these, I’m most worried about neuralese recurrence effectively removing direct access to the AI’s reasoning in a legible format.
I am not worried about this right now. We should always be able to translate latent-space reasoning, a.k.a. neuralese (see COCONUT), into an equivalent human-language representation. The translation might be incomplete or leave out details, but that is already the case for existing models (as discussed here). The solution suggested by Villiam is to recursively expand the translation as needed.
Another option might be to translate neuralese to equivalent program code (preferably Lean). This would be harder for most people to read but more precise and probably easier to verify.
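To make the "recursively expand as needed" proposal concrete, here is a minimal sketch, assuming a hypothetical translator model that maps a span of neuralese vectors to a one-sentence gloss; a reader re-translates smaller spans wherever the summary looks too coarse. None of the names below refer to an existing API.

```python
# Minimal sketch of "recursively expand as needed". `translate` is a
# placeholder for a hypothetical learned neuralese-to-text translator.
from dataclasses import dataclass

@dataclass
class LatentSpan:
    states: list   # the neuralese vectors covered by this gloss
    gloss: str     # current natural-language summary of those states

def translate(states: list) -> str:
    """Hypothetical translator model: latent states -> short gloss."""
    return f"<gloss of {len(states)} latent steps>"   # stand-in output

def expand(span: LatentSpan, parts: int = 2) -> list[LatentSpan]:
    """Split a span and re-translate each piece for a finer-grained reading."""
    k = max(1, len(span.states) // parts)
    chunks = [span.states[i:i + k] for i in range(0, len(span.states), k)]
    return [LatentSpan(states=c, gloss=translate(c)) for c in chunks]

# Start from one coarse gloss of the whole latent trace and drill down
# only where the summary seems too vague or suspicious.
trace = LatentSpan(states=list(range(16)), gloss=translate(list(range(16))))
print(trace.gloss)
for sub in expand(trace):
    print(" ", sub.gloss)
```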
We should always be able to translate latent-space reasoning, a.k.a. neuralese (see COCONUT), into an equivalent human-language representation.
I don’t think this is true at all. How do you translate, say, rotating multiple shapes in parallel into text? Current models already use neuralese as they refine their answer in the forward pass. Why can’t we translate that yet? (Yes, we can decode the model’s best guess at the next token, but that’s not an explanation.)
Chain-of-thought isn’t always faithful, but it’s still what the model actually uses when it does serial computation. You’re directly seeing a part of the process that produced the answer, not a hopefully-adequate approximation.
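To show what "decode the model's best guess at the next token" actually gives you, here is a minimal logit-lens-style sketch on GPT-2 (a stand-in model): each layer's hidden state is unembedded into a next-token guess, which is a prediction readout, not an explanation of the computation that produced it.

```python
# Minimal logit-lens sketch on GPT-2: unembed each layer's hidden state at the
# last position to read that layer's current next-token guess.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox jumps over the lazy", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Intermediate layers: apply the final layer norm, then the unembedding matrix.
for layer, h in enumerate(out.hidden_states[:-1]):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d}: guess = {tok.decode(logits.argmax().item())!r}")

# Final layer: the model's actual next-token prediction.
print(f"final   : guess = {tok.decode(out.logits[0, -1].argmax().item())!r}")
```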
I don’t think this is true at all. How do you translate, say, rotating multiple shapes in parallel into text?
At least for multimodal LLMs that take the pure-token approach, like Gato or DALL-E 1 (and probably GPT-4o and Gemini, although few details have been published), you could do that by generating the tokens that encode an image (or video!) of several shapes, well, rotating in parallel. Then you just look at them.
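As a concrete version of "then you just look at them", here is a minimal sketch assuming DALL-E-1 / VQ-VAE-style discrete image tokens: each token indexes a codebook vector, and a decoder maps the resulting latent grid back to pixels. The codebook and decoder below are toy stand-ins, not any published model's weights or API.

```python
# Toy sketch: turn a sequence of discrete image-token ids back into pixels.
import torch
import torch.nn as nn

VOCAB, DIM, GRID = 8192, 256, 32           # token vocabulary, embed dim, 32x32 token grid

codebook = nn.Embedding(VOCAB, DIM)        # maps image-token ids -> latent vectors
decoder = nn.Sequential(                   # toy decoder: latent grid -> RGB frame
    nn.ConvTranspose2d(DIM, 64, 4, stride=2, padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
    nn.Sigmoid(),
)

def tokens_to_frame(token_ids: torch.Tensor) -> torch.Tensor:
    """Turn a (GRID*GRID,) sequence of image-token ids into an RGB frame."""
    z = codebook(token_ids).view(1, GRID, GRID, DIM).permute(0, 3, 1, 2)
    return decoder(z)                      # (1, 3, 128, 128) image you can look at

# A "video" of shapes rotating in parallel would just be several such frames,
# one per generated block of image tokens.
frames = [tokens_to_frame(torch.randint(0, VOCAB, (GRID * GRID,))) for _ in range(4)]
print(frames[0].shape)  # torch.Size([1, 3, 128, 128])
```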