Yes, but part of what makes interpretability hard is that it may be incomputable. And part of what makes interpretable models difficult to create is that there may be some inner deception you are missing. I think any argument that makes alignment difficult symmetrically makes two agents working together difficult. I mean, think about it: if agent 1 can cooperate with agent 2, agent 1 can also bootstrap a version of agent 2 aligned with itself.
Intuitively, interpreting an AI feels like trying to break a cryptographic hash, but much harder. For example, even interpreting GPT-2 is a monumental task for us humans, despite our vastly superior computational abilities and knowledge. After all, something 2 OOMs superior (GPT-4) could only confidently interpret a fraction of a percent of its neurons, and only with limited confidence. It’s likely these were the relatively trivial ones too.
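For a rough sense of scale, here is a back-of-envelope sketch of what "a fraction of a percent" looks like. The figures are ballpark numbers I'm assuming from the GPT-4-explains-GPT-2-neurons work, not anything stated in this thread, so treat them as illustrative only:

```python
# Back-of-envelope: what "a fraction of a percent of neurons" looks like.
# Both figures are assumed ballpark numbers, not exact results:
# GPT-2 XL has on the order of 3e5 MLP neurons, and only on the order of
# 1e3 received high-confidence explanations in the neuron-explanation work.
total_neurons = 307_200     # assumed: ~48 layers x 6,400 MLP neurons
well_explained = 1_000      # assumed: neurons with high-scoring explanations

coverage = well_explained / total_neurons
print(f"Confidently interpreted: {coverage:.2%} of neurons")  # prints ~0.33%
```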
Therefore, I think the only way for agent 1 to confirm the motives of agent 2 and eliminate unknown unknowns is to fully simulate it, assuming there isn’t some shortcut to interpretability. And of course, fully simulating agent 2 requires boxing it, and probably having a large capability edge over it.
I don’t think the TSA analogy is perfect either. There is no (Defect, Cooperate) case. The risks in it are known unknowns: you have a good understanding of what happens if you fail, since you just go to jail. But adversarial agents have absolutely no understanding of what happens to them if they are wrong, and being wrong here means getting eliminated. So the negative expectation of (D, C), for example, is unbounded. You also have a reasonable expectation of what the TSA will do to verify you aren’t breaking the law; just their methodology of scanning you eliminates a lot of detection methods from the probability space. You have no idea what another agent will do. Maybe it sent an imperfect clone to communicate with you; maybe there’s a hole in the proposed consensus protocol you haven’t considered. The point is that this is one giant unknown unknown that can’t be modeled, because you are trying to predict what someone with greater capabilities than you can do.
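To make that asymmetry concrete, here is a minimal expected-value toy model. All the payoff numbers and probabilities are my own illustrative assumptions, not anything from the thread: the bounded case stands in for the TSA-style known unknown (worst case: jail), and the catastrophic case stands in for being eliminated by a more capable agent whose detection methods you failed to model.

```python
# Toy expected-value model of defecting, with illustrative payoffs only.
GAIN_IF_UNDETECTED = 10.0        # assumed payoff for a successful defection
JAIL_PENALTY = -50.0             # bounded downside (TSA-style known unknown)
ELIMINATION_PENALTY = -1e12      # stand-in for an effectively unbounded loss

def expected_value(p_caught: float, penalty: float) -> float:
    """E[defect] = (1 - p_caught) * gain + p_caught * penalty."""
    return (1 - p_caught) * GAIN_IF_UNDETECTED + p_caught * penalty

# p_caught is the unknown unknown: you cannot estimate it against a more
# capable opponent, so we sweep over guesses and watch the two cases diverge.
for p in (1e-6, 1e-3, 1e-1):
    bounded = expected_value(p, JAIL_PENALTY)
    catastrophic = expected_value(p, ELIMINATION_PENALTY)
    print(f"p_caught={p:g}: bounded EV={bounded:.2f}, "
          f"catastrophic EV={catastrophic:.2e}")
```

Even at a one-in-a-million guess for being caught, the catastrophic downside dominates the expected value, which is the sense in which the (D, C) cell behaves as if it were unbounded.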
Lastly, I think this chain of reasoning breaks down if there is some robust consensus protocol or some way to make models fully interpretable. However, if such a method exists, we would probably just align the AGI to begin with, since our narrow AI would help us create interpretable AGI, or a robust system for detecting inner deception.
I agree the analogy breaks down in the case of very adversarial agents and/or big gaps in intelligence or power. My point is just that these problems probably aren’t unsolvable in principle, for humans or AI systems who have something to gain from cooperating or trading, and who are at roughly equal but not necessarily identical levels of intelligence and power. See my response to the sibling comment here for more.
yes but part of what makes interpretability hard is it may be incomputable.
Aside: I think “incomputable” is a vague term, and results from computational complexity theory often don’t have as much relevance to these kinds of problems as people intuitively expect. See the second part of this comment for more.
I mean incomputable in the sense that the computation would exceed the physical resources of the universe, whether the problem is in P or NP. We can also use a weaker definition of incomputable: say, that it would exceed the capabilities of the agent by at least 1 OOM.
Specifically with regard to interpretability, if it isn’t possible to do easily, my intuition tells me it will require brute-force simulation/exhaustive search (think Bitcoin mining). In that case you need a system more powerful than whatever you are trying to interpret, which makes it difficult in the context of two agents at roughly similar capability levels.
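As a rough illustration of both thresholds, here is a sketch under stated assumptions: ~1e120 elementary operations is a commonly cited ballpark for the total compute capacity of the observable universe, and the 1e25-op budget for the interpreting agent is a hypothetical number I'm picking purely for illustration.

```python
# Rough illustration of when exhaustive search becomes "incomputable" in
# either sense above. Both thresholds are assumed ballpark figures.
UNIVERSE_OPS = 1e120   # assumed: rough total compute of the observable universe
AGENT_OPS = 1e25       # assumed: hypothetical compute budget of agent 1

def exhaustive_search_cost(n_bits: int) -> float:
    """Cost of brute-forcing an n-bit internal state, one op per candidate."""
    return 2.0 ** n_bits

for n in (80, 256, 400):
    cost = exhaustive_search_cost(n)
    print(f"{n}-bit search space: {cost:.2e} ops | "
          f"exceeds agent budget: {cost > AGENT_OPS} | "
          f"exceeds universe-scale compute: {cost > UNIVERSE_OPS}")
```

The point of the sweep is just that exhaustive search blows past any fixed agent's budget long before it hits universe-scale limits, so the weaker and stronger definitions both bite.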
I think it will be difficult for goal maximizers to cooperate for the same reasons it would be difficult for humans to survive in a world with a superintelligent goal maximizer. In almost every argument leading to doom for humans facing a paperclip-maximizing kind of ASI, you can replace “human” with “agent” and the argument will still stand.
It is easy for intelligent humans and groups of humans to cooperate because they very rarely have fanatical goals, and historically, the ones that did went to war.