I think your explanation of legibility here is basically what I have in mind, except that if it’s human-designed it’s potentially not all-encompassing. (For example, a world model that knows very little, but knows how to search for information in a library)
I think interpretability is usually a bit narrower, and refers to developing an understanding of an illegible system. My take is that it is not “interpretability” to understand a legible system, but maybe I’m using the term differently from others here. This is why I don’t think “interpretability” applies to systems that are designed to be always-legible. (In the second graph, “interpretability” is any research that moves us upwards)
I agree that the ability to come up with ideas that are totally alien and untranslatable to humans gives AGI a capabilities boost. I do think that requiring a system to use only legible cognition and reasoning is a big “alignment tax”. However, I don’t think that this tax is equivalent to a strong proof that legible AGI is impossible.
I think my central point of disagreement with this comment is that I do think that it’s possible to have compact world models (or at least compact enough to matter). I think that if there were a strong proof that it is not possible to have a generally intelligent agent with a compact world model (or a compact function which is able to estimate and approximate a world model), that would be an update for me.
(For the record, I think of myself as a generally intelligent agent with a compact world model)
I think of myself as a generally intelligent agent with a compact world model
In what sense? Your world-model is built out of ~100 trillion synapses, storing all sorts of illegible information including “the way my friend sounds when he talks with his mouth full” and “how it feels to ride a bicycle whose gears need lubrication”.
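To put a rough number on that, here’s a back-of-envelope sketch; the bytes-per-synapse figure is a placeholder assumption for illustration, not a neuroscience claim:

```python
# Back-of-envelope: even at a crude 1 byte of stored state per synapse
# (an assumed placeholder, not a measured figure), a synapse-level
# world-model is nowhere near "compact" by human-legible standards.
synapses = 100e12          # ~100 trillion synapses, as mentioned above
bytes_per_synapse = 1      # assumption for illustration only
total_terabytes = synapses * bytes_per_synapse / 1e12
print(f"~{total_terabytes:.0f} TB of raw state")  # prints "~100 TB of raw state"
```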
(or a compact function which is able to estimate and approximate a world model)
That seems very different though! The GPT-3 source code is rather compact (gradient descent etc.); combine it with data and you get a huge and extraordinarily complicated illegible world-model (or just plain “model” in the GPT-3 case, if you prefer).
Likewise, the human brain has a learning algorithm that builds a world-model. The learning algorithm is (I happen to think) a compact, easily-human-legible algorithm involving pattern recognition and gradient descent and so on. But the world-model built by that learning algorithm is super huge and complicated.
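As a concrete sketch of that asymmetry (toy sizes, random data, and an ordinary PyTorch setup standing in for the real thing; this is not GPT-3’s or the brain’s actual learning code), a learning algorithm can fit in a few dozen legible lines while the model it produces is just a huge blob of parameters:

```python
# Minimal sketch: the *learning algorithm* below is a few dozen legible lines,
# but the *learned model* it produces is an opaque array of numbers.
# Toy stand-in for the point above -- sizes and data are arbitrary.
import torch
import torch.nn as nn

# Tiny "world" to model: predict the last token of a random sequence.
vocab, context = 256, 32
data = torch.randint(0, vocab, (1000, context + 1))

model = nn.Sequential(                      # the "model": nothing but parameters
    nn.Embedding(vocab, 128),
    nn.Flatten(),
    nn.Linear(128 * context, 512), nn.ReLU(),
    nn.Linear(512, vocab),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):                     # the "learning algorithm": plain gradient descent
    x, y = data[:, :-1], data[:, -1]
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

n_params = sum(p.numel() for p in model.parameters())
print(f"learning code: a few dozen lines; learned model: {n_params:,} parameters")
# Scale the same loop up and the code stays about this long while the learned
# parameters grow into the billions; the legibility of the learning algorithm
# says nothing about the legibility of the model it builds.
```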
Sorry if I’m misunderstanding.
the ability to come up with ideas that are totally alien and untranslatable to humans gives AGI a capabilities boost. I do think that requiring a system to use only legible cognition and reasoning is a big “alignment tax”. However, I don’t think that this tax is equivalent to a strong proof that legible AGI is impossible.
I’ll try to walk through why I think “coming up with new concepts outside what humans have thought of” is required. We want an AGI to be able to do powerful things like independent alignment research and inventing technology. (Otherwise, it’s not really an AGI, or at least doesn’t help us solve the problem that people will make more dangerous AGIs in the future, I claim.) Both these things require finding new patterns that have not been previously noticed by humans. For example, think of the OP that you just wrote. You had some idea in your head—a certain visualization and associated bundle of thoughts and intuitions and analogies—and had to work hard to try to communicate that idea to other humans like me. In other words, the idea existed in your head before there was a legible-to-others way to express it; the translation into shared human concepts only came afterwards, with effort.
Again, sorry if I’m misunderstanding.