To me it seems that understanding how a system that you are building actually works (i.e. having good models of its internals) is the most basic requirement for being able to reason about the system coherently at all.
Yes, if you actually understood how intelligence works in a deep way, you wouldn’t automatically solve alignment. But it sure would make it a lot more tractable in many ways. Especially when only aiming for a pivotal act.
I am pretty sure you can figure out alignment in advance as you suggest. That might be the overall safer route… if we didn’t have coordination problems. But it seems slower, and we don’t have time.
Obviously, if you figure out the intelligence algorithm before you know how to steer it, don’t put it on GitHub or the universe’s fate will be sealed momentarily. Ideally don’t even run it at all.
So far, working on this project seems to have created ontologies in my brain that are good for thinking about alignment. There are a couple of approaches that now seem obvious which I think wouldn’t have seemed obvious before. Again, having good models of intelligence (which is really what this is about) is actually useful for thinking about intelligence. And alignment research is mainly thinking about intelligence.
The approach many people take of trying to pick some alignment problem seems somewhat backward to me. E.g. embedded agency is a very important problem, and you need to solve it at some point. But it doesn’t feel like the kind of problem that, when you work on it, builds up the most useful models of intelligence in your brain.
As an imperfect analogy, consider trying to understand how a computer works by first understanding how modern DRAM works. To build a practical computer you might need to use DRAM, but in principle you could build a computer with only S-R latch memory. So clearly, while important, DRAM is not at the very core. First you understand how NAND gates work, the ALU, and so on. Once you have a good understanding of the fundamentals, DRAM will be much easier to understand. It becomes obvious how it needs to work at a high level: you can write and read bits. If you don’t understand how a computer works, you might not even know why storing a bit is an important thing to be able to do.
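To make the “storing a bit from basic gates” point concrete, here is a minimal toy sketch of my own (illustrative Python, not part of the analogy above): an S-R latch built out of nothing but NAND gates.

```python
# Toy sketch: a 1-bit memory cell made only of NAND gates (active-low S-R latch).

def nand(a: bool, b: bool) -> bool:
    return not (a and b)

class SRLatchNAND:
    """Two cross-coupled NAND gates storing a single bit."""

    def __init__(self) -> None:
        self.q = False
        self.q_bar = True

    def step(self, s_bar: bool, r_bar: bool) -> bool:
        # Iterate the cross-coupled gates a few times until the outputs settle.
        for _ in range(4):
            self.q = nand(s_bar, self.q_bar)
            self.q_bar = nand(r_bar, self.q)
        return self.q

latch = SRLatchNAND()
print(latch.step(s_bar=False, r_bar=True))   # set   -> True
print(latch.step(s_bar=True,  r_bar=True))   # hold  -> True (the stored bit)
print(latch.step(s_bar=True,  r_bar=False))  # reset -> False
```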
I am pretty sure you can figure out alignment in advance as you suggest
I’m not so sure about that. How do you figure out how to robustly keep a generally intelligent dynamically updating system on-target without having a solid model of how that system is going to change in response to its environment? Which, in turn, would require a model of what that system is?
I expect the formal definition of “alignment” to be directly dependent on the formal framework of intelligence and embedded agency, the same way a tetrahedron could only be formally defined within the context of Euclidean space.
I’d think you can define a tetrahedron for non-Euclidean space. And you can talk and reason about the set of polyhedra with 10 vertices as an abstract object without defining any specific such polyhedron.
Just consider taking the assumption that the system would not change in arbitrary ways in response to its environment. There might be certain constraints. You can think about what the constraints need to be such that, e.g., a self-modifying agent would never change itself in a way that it expects to get less future utility than if it did not self-modify.
And that is just a random thing that came to mind without me trying. I would expect that you can learn useful things about alignment by thinking about such things. In fact, I think the line between understanding intelligence and figuring out alignment in advance doesn’t really exist. Clearly, understanding something about alignment is understanding something about intelligence.
When people say to only figure out the alignment thing, maybe what they mean is to figure out things about intelligence that won’t actually get you much closer to being able to build a dangerous intelligence. And there do seem to be such things. It is just that I expect that working only on these will not actually make you generate the most useful models of intelligence in your mind, making you worse/slower at thinking, on average, per unit of time spent working.
And that’s of course not a law. Probably there are some things that you want to understand through an abstract theoretical lens at certain points in time. Do whatever works best.
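As a minimal formal sketch of that constraint (my notation, just restating the words above; U is the agent’s utility and m a candidate self-modification):

\[
\text{adopt } m \;\Longrightarrow\; \mathbb{E}\big[\,U \mid \text{self-modify to } m\,\big] \;\ge\; \mathbb{E}\big[\,U \mid \text{no self-modification}\,\big].
\]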
I’d think you can define a tetrahedron for non-Euclidean space
If you relax the definition of a tetrahedron to cover figures embedded in non-Euclidean spaces, sure. It wouldn’t be the exact same concept, however. In a similar way to how “a number” is a different concept if you define it as a natural number vs. a real number.
Perhaps more intuitively, then: the notion of a geometric figure with specific properties is dependent on the notion of a space in which it is embedded. (You can relax it further – e.g., arguably, you can define a “tetrahedron” for any set with a distance function over it – but the general point stands, I think.)
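For instance (my formulation, purely illustrative): in an arbitrary metric space (X, d), one could call any four points whose six pairwise distances are all equal a “regular tetrahedron”:

\[
\{p_1, p_2, p_3, p_4\} \subseteq X, \qquad d(p_i, p_j) = r \;\; \text{for all } i \neq j.
\]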
Just consider taking the assumption that the system would not change in arbitrary ways in response to its environment. There might be certain constraints. You can think about what the constraints need to be such that, e.g., a self-modifying agent would never change itself in a way that it expects to get less future utility than if it did not self-modify.
Yes, but: those constraints are precisely the principles you’d need to code into your AI to give it general-intelligence capabilities. If your notion of alignment only needs to be robust to certain classes of changes, because you’ve figured out that an efficient generally intelligent system would only change in such-and-such ways, then you’ve figured out a property of how generally intelligent systems ought to work – and therefore, something about how to implement one.
Speaking abstractly, the “negative image” of the theory of alignment is precisely the theory of generally intelligent embedded agents. A robust alignment scheme would likely be trivial to transform into an AGI recipe.
A robust alignment scheme would likely be trivial to transform into an AGI recipe.
Perhaps if you did have the full solution, but it feels like there are some parts of a solution that you could figure out such that those parts don’t tell you that much about the other parts of the solution.
And it also feels like there could be a book such that, if you read it, you would gain a lot of knowledge about how to align AIs without knowing that much more about how to build one. E.g. a theoretical solution to the stop-button problem seems like it would not tell you that much about how to build an AGI, compared to figuring out how to properly learn a world model of Minecraft. And knowing how to build a world model of Minecraft probably helps a lot with solving the stop-button problem, but it doesn’t just trivially yield a solution.
Perhaps if you did have the full solution, but it feels like there are some parts of a solution that you could figure out such that those parts don’t tell you that much about the other parts of the solution.
I agree with that.
Thanks for the reply :)