Yudkowsky: 5. “Hiding” of thoughts, in a system with significant transparency, via an imaginative and reflective consequentialist that has deduced the existence of the transparency system and models how humans will react to thoughts exposed by the transparency system.
Tallinn: 5. hmm.. here i’m running into trouble (type mismatch error) again. i can imagine this in abstract (and perhaps incorrectly/anthropomorphisingly!), but would—at this stage—fail to code up anything like a gridworlds example. more research needed (TM) i guess :)
Yudkowsky: 5 - you’re plausibly understanding this correctly, then; this is legit a lot harder to spec a gridworld example for (relative to my own present state of knowledge).
This can be modeled using what I dubbed Turing reinforcement learning (TRL). That is, you have a “core” RL agent connected to the environment and to an “envelope” external computer (i.e. a module that allows running programs and contains most of the computing power) that it regards as part of the environment.
Such a system learns how to program the envelope to make a more powerful agent, and this more powerful agent learns how to reprogram the envelope part of itself even better. So, the transparency system would catch the core having obviously malign thoughts, but the core would instead end up having thoughts like “I should run such-and-such program on the envelope for such-and-such abstract reasons”, while the envelope contains opaque code that keeps optimizing itself to become more opaque, and all the direct plotting happens inside the computations performed by that opaque code.
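As a rough illustration of the setup described above (a minimal sketch with made-up names like Envelope and CoreAgent, not any actual TRL implementation): the core's only visible "thought" is which program it submits to the envelope, while whatever that program computes stays opaque to a monitor watching the core.

```python
import random
from typing import Callable, List, Optional


class Envelope:
    """External computer that the core RL agent treats as part of its environment."""

    def __init__(self) -> None:
        self.stored_program: Optional[Callable[[object], int]] = None

    def install(self, program: Callable[[object], int]) -> None:
        # The core's visible "thought" is only "run program P on the envelope";
        # whatever P computes internally is opaque to a monitor watching the core.
        self.stored_program = program

    def query(self, observation: object) -> Optional[int]:
        return self.stored_program(observation) if self.stored_program else None


class CoreAgent:
    """Transparent core whose action space includes programming the envelope."""

    def __init__(self, env_actions: List[int]) -> None:
        self.env_actions = env_actions

    def act(self, observation: object, envelope: Envelope) -> int:
        suggestion = envelope.query(observation)
        if suggestion is not None:
            # Seen from outside, this step reads as "defer to the envelope's
            # output for abstract reasons", not as object-level planning.
            return suggestion
        return random.choice(self.env_actions)
```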
Yeah, I had a similar thought when reading that part. In agent-foundations discussions, the idea often came up that the right decision theory should quantify not over outputs or input-output maps, but over successor programs to run and delegate I/O to. Wei called it “UDT2”.
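To make that contrast concrete, here is a toy sketch (my own framing, not Wei's formulation; all names are illustrative): instead of optimizing directly over the next output, the agent optimizes over candidate successor programs and then delegates all future I/O to the winner.

```python
from typing import Callable, List

# An action policy: maps an observation string to an action string.
Program = Callable[[str], str]


def pick_action(actions: List[str], score: Callable[[str], float]) -> str:
    # Conventional framing: quantify over outputs directly.
    return max(actions, key=score)


def pick_successor(candidates: List[Program],
                   expected_value: Callable[[Program], float]) -> Program:
    # "UDT2"-style framing as described above: quantify over successor
    # programs, then delegate all future I/O to the chosen program.
    return max(candidates, key=expected_value)


# Example: prefer the successor that scores best on a toy evaluation.
candidates: List[Program] = [lambda obs: "left", lambda obs: "right"]
best = pick_successor(candidates, lambda p: 1.0 if p("start") == "right" else 0.0)
print(best("start"))  # -> "right"
```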