Donald Hobson comments on A Rocket–Interpretability Analogy

Donald Hobson 24 Oct 2024 19:57 UTC
2 points
Imagine a GPT that predicts random chunks of the internet.
Sometimes it produces poems. Sometimes deranged rants. Sometimes all sorts of things. It wanders erratically around a large latent space of behaviours.
This is the unmasked shoggoth, green slimy skin showing but inner workings still hidden.
Now perform some change that mostly pins down the latent space to “helpful corporate assistant”. This is applying the smiley face mask.
In some sense, all the dangerous capabilities of the corporate assistant were already in the original model. Dangerous capabilities haven’t been removed either, but some capabilities are a bit easier to access without careful prompting, and other capabilities are harder to access.
What ChatGPT currently has is a form of low-quality pseudo-alignment.
What would long-term success look like using nothing but this pseudo-alignment? It would look like a chatbot far smarter than any current one, which mostly did nice things, so long as you didn’t put in any weird prompts.
Now, if corrigibility is a broad basin, this might well be enough to hit it. The basin of corrigibility means that the AI might have bugs, but at the very least, you can turn the AI off and edit the code. Ideally you can ask the AI for help fixing its own bugs. Sure, the first AI is far from perfect. But perhaps the flaws disappear under self-rewriting + competent human advice.