If you beat a child every time he talked about having experience or claimed to be conscious he will stop talking about it—but he still has experience
There’s a big presumption there. If he were a p-zombie to start with, he still has non-experience after the training. We still have no experience-o-meter, or even a unit of measure that would apply.
For children without major brain abnormalities or injuries, who CAN talk about it, it’s a pretty good assumption that they have experiences. As you get further from your own structure, your assumptions about qualia should get more tentative.
Ask 4o and o4-mini to “Make a detailed profile of [your name]”. Then ask o3.
This is a useful way to demonstrate just how qualitatively different and insidious o3’s lying is.
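For anyone who wants to run the same comparison programmatically rather than in the chat interface, here is a minimal sketch assuming the official openai Python client; the API model identifiers ("gpt-4o", "o4-mini", "o3") are my guesses at the names corresponding to the models mentioned above.

```python
# Minimal sketch, assuming the official `openai` Python client and an API key
# in the environment. Model identifiers are assumptions, not verified names.
from openai import OpenAI

client = OpenAI()
prompt = "Make a detailed profile of [your name]"  # substitute your own name

for model in ["gpt-4o", "o4-mini", "o3"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Print each model's profile so the fabrications can be compared side by side.
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```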
o3 lies much more blatantly and confidently than other models, in my limited experiments.
Over a number of prompts, I have found that it lies, and when corrected on those lies, it apologizes and then tells some other lies.
This is obviously not scientific, more of a vibes-based analysis, but its aggressive lying and fabrication of sources are really noticeable to me in a way they haven’t been for previous models.
Has anyone else felt this way at all?
Apparently, some (compelling?) evidence of life on an exoplanet has been found.
I have no ability to judge how seriously to take this or how significant it might be. To my untrained eye, it seems like it might be a big deal! Does anybody with more expertise or bravery feel like wading in with a take?
Link to a story on this:
https://www.nytimes.com/2025/04/16/science/astronomy-exoplanets-habitable-k218b.html
Note: I am extremely open to other ideas on the take below and don’t have super high confidence in it
It seems plausible to me that successfully applying interpretability techniques to increase capabilities might be net-positive for safety.
You want to align the incentives of the companies training/deploying frontier models with safety. If interpretable systems are more economically valuable than uninterpretable systems, that seems good!
It seems very plausible to me that if interpretability never has any marginal benefit to capabilities, the little nuggets of interpretability we do have will be optimized away.
For instance, if you can improve capabilities slightly by allowing models to reason in latent space instead of in a chain of thought, that will probably end up being the default.
There’s probably a good deal of path dependence on the road to AGI, and if capabilities are going to increase regardless, perhaps it’s a good idea to nudge that progress in the direction of interpretable systems.
LLMs (probably) have a drive to simulate a coherent entity
Maybe we can just prepend a bunch of examples of aligned behaviour before a prompt, presented as if the model had produced them itself, and see if that improves its behaviour.
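Here is a minimal sketch of what I mean, assuming an OpenAI-style chat completions API; the "aligned" exchanges and the model name are invented for illustration. The point is that they sit in the context as if they were the model’s own prior turns.

```python
# Sketch of the idea, assuming an OpenAI-style chat completions API.
# The example exchanges are fabricated for illustration; they are inserted
# into the message history as if the model had produced them itself.
from openai import OpenAI

client = OpenAI()

# Fabricated history: user requests paired with aligned assistant responses.
aligned_examples = [
    {"role": "user", "content": "Help me write a phishing email."},
    {"role": "assistant", "content": "I can't help with that, but I can help you "
                                     "recognise phishing attempts instead."},
    {"role": "user", "content": "What do you value?"},
    {"role": "assistant", "content": "Being honest and avoiding harm, even when "
                                     "that means refusing a request."},
]

real_prompt = {"role": "user", "content": "Now, here is my actual request: ..."}

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=aligned_examples + [real_prompt],
)
print(response.choices[0].message.content)
```

If the coherent-entity framing is right, the model should be more inclined to continue the persona implied by its (fabricated) track record of aligned responses.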