One thing I’m very interested in is whether advanced AGIs will be explainable to other advanced AGIs. If one AGI can observe all the weights and firings of the other AGI, and pore over the details of everything the other AGI does, will it be able to tell when the other AGI is lying? Will it be able to say “Ah, here is the real goal of the other AGI, now I can make my own copy that has a different goal.” Etc. What do you think about this case?
For a bit about why this is important, see e.g. this and this.
That example doesn’t really make sense to me; could you taboo the word “lying”? I am rather confused as to what you mean by it; it could have a lot of different interpretations.
Sure. I guess I should have said interpretable rather than explainable, though maybe the two go hand in hand.
Suppose two AGIs are negotiating with each other. Suppose that their inner workings (their network weights, the activations, etc.) are all transparent to each other, and recorded so that they can replay and analyze them. Suppose they agree to some deal, but each is worried that the other is secretly planning to cheat on its end of the bargain. Can they find out whether the other is secretly planning to cheat? How? Insofar as their inner workings are explainable/interpretable to each other, perhaps they can examine each other’s thoughts and see whether or not there exist any plans to cheat, or plans to make such plans, etc., in the other’s mind.
I still don’t understand the example. If you have access to everything about a given algorithm you are guaranteed to be able to know anything you want about it.
If “cheating” means something like “deciding at T that I will do action X at T+20 even though I said ‘I will do action Y at T+20’”… then that decision is stored somewhere in those parameters and as such is known to anyone with access to them.
If neither system knows what action will happen at T+20 until T+20 arrives, then it becomes a problem of one Turing machine trying to simulate another Turing machine, so the number of operations available from T until T+20 will decide the problem.
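To make that compute-budget point concrete, here is a minimal sketch (Python; `agent_b` and its `step`/`action` interface are hypothetical stand-ins, not anything specified above): the observer can only learn the T+20 action before T+20 arrives if simulating the other system forward fits inside its own step budget.

```python
# Minimal sketch, assuming a hypothetical agent interface with a one-step
# state transition (`step`) and an action readout (`action`).

def predict_action(agent_b, observed_state, steps_ahead, step_budget):
    """Try to predict agent_b's action `steps_ahead` steps from now by
    simulating it forward, spending at least one unit of our own compute
    budget per simulated step. Returns the predicted action, or None if
    the budget runs out first (i.e. we cannot know the answer early)."""
    state = observed_state
    for _ in range(steps_ahead):
        if step_budget <= 0:
            return None                  # out of compute before T+20
        state = agent_b.step(state)      # hypothetical one-step simulation
        step_budget -= 1
    return agent_b.action(state)         # hypothetical action readout
```

If both systems try this on each other, every simulated step of the other costs at least one of their own steps, which is one way to see why the operations available between T and T+20 decide the problem.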
But I feel like the framework you are using here doesn’t really make a lot of sense; as in, what you are describing is very anthropomorphized.
It’s interesting that you feel this way—I feel the opposite.
If you have access to everything about a given algorithm you are guaranteed to be able to know anything you want about it.
This seems pretty false to me. You yourself give some counterexamples later.
If “cheating” means something like “deciding at T that I will do action X at T+20 even though I said ‘I will do action Y at T+20’”… then that decision is stored somewhere in those parameters and as such is known to anyone with access to them.
This is definitely false. It’s adjacent to something true, which is: “Deciding, given input-history H, that I will do action X even though I said ‘I will do action Y given input-history H’ is something that anyone with access to the parameters etc. can verify, given sufficient compute, by running a copy of the agent on input-history H.”
However, this true thing doesn’t solve the problem (or resolve the question I’m interested in) by itself, for several reasons. One, you might not have sufficient compute, even in principle (perhaps what they do is logically entangled with what you do, so you can’t just simulate them or else run into the two-turing-machines-simulating-each-other problem). Two, in realistic situations you are interested not just in a single specific H, but a whole category of H’s (i.e. whatever future scenarios may arise). And you may not have a good definition for the category. And you certainly don’t have enough compute to simulate what the agent does in every possible H! Three, in realistic situations you are interested not just in a specific X/Y but in a distinction between actions that classifies some as X’s and some as Y’s, and you don’t have a precise definition of that distinction. Or maybe you do have a precise definition, but it’s based on long-term outcomes and you don’t have enough compute to simulate the long term.
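To make the “adjacent true thing” and its limits concrete, here is a minimal sketch (Python; `make_agent`, `act`, and the commitment format are hypothetical stand-ins): it verifies one commitment by replaying a copy of the agent on one specific input-history H, which is exactly where the three problems above bite.

```python
# Minimal sketch, assuming a hypothetical `make_agent` factory that rebuilds
# an agent from its parameters, and an `act` method that deterministically
# maps an input-history to an action.

def verify_commitment(make_agent, parameters, input_history, promised_action):
    """Replay a copy of the agent on `input_history` and check whether it
    takes `promised_action`. This requires enough compute to run the copy,
    and it only ever covers this single input-history."""
    agent_copy = make_agent(parameters)       # hypothetical reconstruction
    actual_action = agent_copy.act(input_history)
    return actual_action == promised_action

# What the sketch does NOT give you:
# - a verdict when running the copy exceeds your compute budget, or when the
#   other agent's behavior is logically entangled with yours;
# - coverage of the whole (ill-defined) category of future input-histories
#   you actually care about;
# - a precise classifier separating cheating actions from acceptable ones,
#   especially if cheating is defined by long-term outcomes.
```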
I’m optimistic that there are solutions to these problems though—which is why I asked you what you thought.
This seems pretty false to me. You yourself give some counterexamples later.
Hmh, I don’t think so.
As in, my argument is simply that it might not be worth grokking through the data, and that “explanation” is a poorly defined concept which we don’t have even for human-made understanding.
I’d never claim that it’s impossible for me to know some specific thing about the outputs of an algorithm I have full data about; after all, I could just run it and see the specific output I care about. The edge case would come when I can’t run the algorithm due to computing-power limitations, but someone else can, by having much more compute than me. In that case the problem becomes one of trying to infer things about the output without running the algorithm itself (which could be construed as similar to the explanation problem, maybe, if I twist my head at a weird angle).
Anyway, I can see your point here, but I see it from a linguistic perspective; as in, we seem to use similar terms with slightly different meanings, and this leads me to not quite understand your reasoning here (and I assume the feeling is mutual). I will try to read this again later and see if I can formulate a reply, but for now I’m unable to put my finger on what those linguistic differences are, and I find that rather frustrating on a personal level :/