I wasn’t really thinking about a specific algorithm. Well, I was kind of thinking about LLMs and the alien shoggoth meme.
But yes. I know this would be helpful.
But I’m more thinking about what work remains. Like, is it an idiot-proof 5-minute change? Or does it still take MIRI 10 years to adapt the alien code?
Also.
Domain-limited optimization is a natural thing. The prototypical example is Deep Blue or similar: lots of optimization power, over a very limited domain. But any teacher who optimizes the class schedule without thinking about putting nanobots in the students’ brains is doing something similar.
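Something like this toy sketch, just to make “limited domain” concrete (the subjects and the scoring rule are made up purely for illustration):

```python
from itertools import permutations

def score_schedule(schedule):
    """Toy criterion defined purely on the schedule itself:
    penalise putting 'math' right after 'gym'."""
    penalty = 0
    for a, b in zip(schedule, schedule[1:]):
        if a == "gym" and b == "math":
            penalty += 1
    return -penalty

def optimize_schedule(subjects):
    """Exhaustive search over orderings: plenty of optimization
    pressure, but only within this tiny domain. No model of the
    wider world exists anywhere in this code."""
    return max(permutations(subjects), key=score_schedule)

print(optimize_schedule(("gym", "math", "art", "history")))
```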
I am guessing and hoping that the masks in an LLM are at least as limited in their optimization as humans are, often more so, due to their tendency to learn the most usefully predictive patterns first. Hidden long-term sneaky plans will only very rarely influence the text (because the plans are hidden).
And, I hope, the shoggoth isn’t itself particularly interested in optimizing the real world. The shoggoth just chooses which mask to wear.
So.
Can we duct-tape a mask of “alignment researcher” onto a shoggoth, and keep the mask in place long enough to get some useful alignment research done?
The more alignment turns out to have a single, simple, “know it when you see it” solution, the more likely this is to work.
It depends on how they did it. If they did it by formalizing the notion of “the values and preferences (coherently extrapolated) of (the living members of) the species that created the AI”, then even just blindly copying their design without any attempt to understand it has a very high probability of getting a very good outcome here on Earth.
The AI of course has to inquire into and correctly learn about our values and preferences before it can start intervening on our behalf, so one way such a blind copying might fail is if the method the aliens used to achieve this correct learning depended on specifics of the situation on the alien planet that don’t obtain here on Earth.
Domain-limited optimization is a natural thing. The prototypical example is Deep Blue or similar: lots of optimization power, over a very limited domain. But any teacher who optimizes the class schedule without thinking about putting nanobots in the students’ brains is doing something similar.
Agreed it is natural.
To describe ‘limited optimization’ in my words: The teacher implements an abstract function whose optimization target is not {the outcome of a system containing a copy of this function}, but {criteria about the isolated function’s own output}. The input to this function is not {the teacher’s entire world model}, but some simple data structure whose units map to schedule-related abstractions. The output of this function, when interpreted by the teacher, then maps back to something like a possible schedule ordering. (Of course, this is an idealized case; I don’t claim that actual human brains are so neat.)
The optimization target of an agent, though, is “{the outcome of a system containing a copy of this function}” (in this case, ‘this function’ refers to the agent). If agents themselves implemented only agentic functions, the result would be infinite recursion; so all agents in sufficiently complex worlds must, at some point in the course of solving their broader agent-question[1], ask ‘domain limited’ sub-questions.
(note that ‘domain limited’ and ‘agentic’ are not fundamental-types; the fundamental thing would be something like “some (more complex) problems have sub-problems which can/must be isolated”)
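A toy sketch of that contrast (the names, like `predicted_world_outcome`, and the numbers are mine and purely illustrative, not a claim about how real agents are implemented):

```python
from itertools import permutations

def schedule_badness(schedule):
    # Domain-limited criterion: defined purely on the output itself.
    return sum(1 for a, b in zip(schedule, schedule[1:])
               if a == "gym" and b == "math")

def answer_subquestion(subjects):
    # The domain-limited sub-question the agent delegates to.
    # No world model appears anywhere in here.
    return min(permutations(subjects), key=schedule_badness)

def predicted_world_outcome(world, schedule):
    # Toy model of {the outcome of a system containing the agent}
    # once the chosen schedule is enacted.
    return world["student_welfare"] - schedule_badness(schedule)

def agent_choose(world):
    # Agentic step: pick the action whose predicted world outcome is best.
    # Note that the evaluation bottoms out in answer_subquestion,
    # not in another recursive call to agent_choose.
    candidates = [answer_subquestion(world["subjects"]),
                  tuple(world["subjects"])]
    return max(candidates, key=lambda s: predicted_world_outcome(world, s))

print(agent_choose({"student_welfare": 10,
                    "subjects": ["gym", "math", "art"]}))
```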
I think humans have deep assumptions conducive to their ‘embedded agency’ which can make it harder to see this for the first time. It may be automatic to view ‘the world’ as a referent which a ‘goal function’ can somehow naturally be about. I once noticed I had a related confusion and asked, “wait, how can a mathematical function ‘refer to the world’ at all?”. The answer is that there is no mathematically default ‘world’ object to refer to; you have to construct a structural copy to refer to instead (which, being a copy, contains a copy of the agent, implying that the actions of the real agent and its copy logically correspond). That is a specific, non-default thing, which nearly all functions do not do.
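A minimal sketch of what “constructing a structural copy to refer to” could look like (again, every name and number here is my own toy invention):

```python
def simulate(world_model, policy):
    # Run the structural copy forward: `policy` stands in for the
    # copy of the agent embedded inside the model.
    action = policy(world_model["observation"])
    return world_model["utility_of_action"][action]

def choose_policy(world_model, candidate_policies):
    # The agent scores each candidate by how its *copy* does in the
    # *copied* world, then relies on the logical correspondence between
    # the copy's action and its own real action.
    return max(candidate_policies, key=lambda p: simulate(world_model, p))

model = {
    "observation": "raining",
    "utility_of_action": {"take_umbrella": 1.0, "leave_umbrella": -1.0},
}
policies = [
    lambda obs: "take_umbrella" if obs == "raining" else "leave_umbrella",
    lambda obs: "leave_umbrella",
]
best = choose_policy(model, policies)
print(best("raining"))  # -> take_umbrella
```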
(This totally doesn’t answer your clarified question, I’m just writing a related thing to something you wrote in hopes of learning)
[1] From possible outputs, which meets some criteria about {the outcome of a system containing a copy of this function}?