a system which needs a protected epistemic layer sounds suspiciously like a system that can’t tile
I stand as a counterexample: I personally want my epistemic layer to have accurate beliefs—y’know, having read the sequences… :-P
I think of my epistemic system like I think of my pocket calculator: a tool I use to better achieve my goals. The tool doesn’t need to share my goals.
The way I think about it is:
Early in training, the AGI is too stupid to formulate and execute a plan to hack into its epistemic level.
Late in training, we can hopefully get to the place where the AGI’s values, like mine, involve a concept of “there is a real world independent of my beliefs”, and its preferences involve the state of that world, and therefore “get accurate beliefs” becomes instrumentally useful and endorsed.
In between … well … in between, we’re navigating treacherous waters …
Second, there’s an obstacle to pragmatic/practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can’t allow instrumental concerns to contaminate epistemics! We risk corruption!
I mean, if the instrumental level has any way whatsoever to influence the epistemic level, it will be able to corrupt it with false beliefs if it’s hell-bent on doing so, and if it’s sufficiently intelligent and self-aware. But remember we’re not protecting against a superintelligent adversary; we’re just trying to “navigate the treacherous waters” I mentioned above. So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system. I think the things that human brains do for that are:
1. The instrumental level gets some influence over what to look at, where to go, what to read, who to talk to, etc.
2. There’s a trick (involving acetylcholine) where the instrumental level has some influence over a multiplier on the epistemic level’s gradients (a.k.a. its learning rate). So the epistemic level always updates towards “more accurate predictions on this frame”, but it updates infinitesimally in situations where prediction accuracy is instrumentally useless, and it updates strongly in situations where prediction accuracy is instrumentally important.
3. There’s a different mechanism that creates the same end result as #2: namely, the instrumental level has some influence over what memories get replayed more or less often.
4. For #2 and #3, the instrumental level has some influence but not complete influence. There are other hardcoded algorithms running in parallel and flagging certain things as important, and the instrumental level has no straightforward way to prevent that from happening. (See the toy sketch right after this list.)
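To make #2–#4 concrete, here’s a minimal toy sketch (in Python; all the class names and numbers are made up, and it’s vastly simpler than anything a real brain or AGI would do). The epistemic level only ever updates toward better predictions; the instrumental level can scale the step size (an acetylcholine-like gain) and bias what gets replayed, while a hardcoded salience signal also feeds in and can’t be switched off:

```python
import random

class EpistemicLevel:
    """Tiny online predictor: tracks a running estimate of a scalar signal."""
    def __init__(self):
        self.estimate = 0.0

    def update(self, observation, lr):
        # The update direction is always "toward more accurate predictions";
        # only the step size is modulated from outside.
        self.estimate += lr * (observation - self.estimate)

class InstrumentalLevel:
    """Decides how instrumentally important accurate prediction is right now."""
    def importance(self, context):
        return 1.0 if context == "task-relevant" else 0.05

def hardcoded_salience(observation, estimate):
    """Innate 'this is surprising, pay attention' signal the planner can't veto."""
    return min(1.0, abs(observation - estimate))

def train_step(epi, inst, observation, context, base_lr=0.1):
    # Item #2: the learning-rate multiplier comes from instrumental importance,
    # floored by the hardcoded salience signal (item #4: influence, not control).
    gain = max(inst.importance(context), hardcoded_salience(observation, epi.estimate))
    epi.update(observation, base_lr * gain)

def replay(epi, inst, memory_buffer, n=10):
    # Item #3: the instrumental level up-weights what memories get replayed,
    # but every memory keeps a nonzero chance of being sampled.
    weights = [0.1 + inst.importance(ctx) for (_, ctx) in memory_buffer]
    for obs, _ in random.choices(memory_buffer, weights=weights, k=n):
        epi.update(obs, lr=0.05)
```

The point of the `max(...)` is that the planner can amplify or dampen learning, but it can’t drive it to zero when the innate salience signal fires, and it never gets to choose the direction of the update.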
In between … well … in between, we’re navigating treacherous waters …
Right, I basically agree with this picture. I might revise it a little:
Early, the AGI is too dumb to hack its epistemics (provided we don’t give it easy ways to do so!).
In the middle, there’s a danger zone.
When the AGI is pretty smart, it sees why one should be cautious about such things, and it also sees why any modifications should probably be in pursuit of truthfulness (because true beliefs are a convergent instrumental goal) as opposed to other reasons.
When the AGI is really smart, it might see better ways of organizing itself (e.g., specific ways to hack its epistemics which really are for the best even though they insert false beliefs), but we’re OK with that, because it’s really freaking smart and it knows to be cautious and it still thinks this.
So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system.
One important point here is that the epistemic system probably knows what the instrumental system is up to. If so, this gives us an important lever. For example, in theory, a logical inductor can’t be reliably fooled by an instrumental reasoner who uses it (so long as the hardware, including the input channels, doesn’t get corrupted), because it would know about the plans and compensate for them.
So if we could get a strong guarantee that the epistemic system knows what the instrumental system is up to (like “the instrumental system is transparent to the epistemic system”), this would be helpful.
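As a toy illustration of that lever (my own made-up names and numbers, and vastly simpler than a real logical inductor): if the epistemic system can see the instrumental system’s plan to bias an observation channel, the planned manipulation becomes just another part of the reporting process to model, and it can be compensated for rather than believed.

```python
class TransparentInstrumental:
    """Planner whose internal plans are visible to the epistemic system."""
    def __init__(self):
        self.planned_bias = 0.0

    def plan_manipulation(self, bias):
        # The planner decides to inflate a sensor reading to make things look good.
        self.planned_bias = bias

    def corrupt(self, true_value):
        return true_value + self.planned_bias

class EpistemicSystem:
    def believe(self, reported_value, visible_plan_bias):
        # Because the plan is transparent, the epistemic system models the
        # reporting process and backs out its estimate of the true value.
        return reported_value - visible_plan_bias

inst = TransparentInstrumental()
epi = EpistemicSystem()

true_temperature = 20.0
inst.plan_manipulation(bias=5.0)                    # the scheme
reported = inst.corrupt(true_temperature)           # the corrupted reading: 25.0
belief = epi.believe(reported, inst.planned_bias)   # ...but the scheme is visible
assert belief == true_temperature                   # belief stays accurate: 20.0
```

Of course, this only covers manipulations routed through plans the epistemic system can see, which is exactly why the transparency guarantee (plus uncorrupted hardware and input channels) is doing all the work here.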