Thanks for trying to clarify “X and only X”, which IMO is a promising concept.
One thing we might want from an only-Xer is that, in some not-yet-formal sense, it’s “only trying to X” and not trying to do anything else. A further thing we might want is that the only-Xer only tries to X, across some relevant set of counterfactuals. You’ve discussed the counterfactuals across possible environments. Another kind of counterfactual is across modifications of the only-Xer. Modification-counterfactuals seem to point to a key problem of alignment: how does this generalize? If we’ve selected something to do X, within some set of environments, what does that imply about how it’ll behave outside of that set of environments? It looks like by your definition we could have a program that’s a very competent general intelligence with a slot for a goal, plus a pointer to X in that slot; and that program would count as an only-Xer. This program would be very close, in some sense, to programs that optimize competently for not-X, or for a totally unrelated Y. That seems counterintuitive for my intuitive picture of an “X and only X”er, so either there’s more to be said, or my picture is incoherent.
My picture of an “X and only X”er is that the actual program you run should optimize only for X. I wasn’t considering similarity in code space at all.
Getting the lexicographically first formal ZFC proof of, say, the Collatz conjecture should be safe. Getting a random proof sampled from the set of all proofs < 1 terabyte long should be safe. But I think that there exist proofs that wouldn’t be safe. There might be a valid proof of the conjecture that had the code for a paperclip maximizer encoded into the proof, and that exploited some flaw in computers or humans to bootstrap this code into existence. This is what I want to avoid.
Your picture might be coherent and formalizable into some different technical definition. But you would have to start talking about distance in codespace, which depends on the programming language used.
The program if True: x() else: y() is very similar in codespace to if False: x() else: y().
If code space is defined in terms of minimum edit distance, then layers of interpreters, error correction and homomorphic encryption can change it. This might be what you are after, I don’t know.
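To make the edit-distance point concrete, here is a minimal sketch. It treats the two programs above as plain strings and computes their Levenshtein distance, then recomputes it after wrapping both in one layer of encoding. The function name levenshtein and the use of base64 are illustrative assumptions: base64 is only a stand-in for "some layer of encoding", not homomorphic encryption or an interpreter.

```python
import base64


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# The two programs from the discussion, treated as plain text.
prog_x = "if True: x() else: y()"
prog_y = "if False: x() else: y()"

# Distance between the raw source strings.
distance = levenshtein(prog_x, prog_y)

# Distance after one (hypothetical) layer of encoding. The programs are the
# same, but the representation, and hence the edit distance, is different.
enc_x = base64.b64encode(prog_x.encode()).decode()
enc_y = base64.b64encode(prog_y.encode()).decode()
encoded_distance = levenshtein(enc_x, enc_y)

print(distance, encoded_distance)
```

The raw distance is small even though the two programs do opposite things, and the encoded distance is larger even though nothing about the behaviour changed. Which notion of distance matters depends on what question is being asked.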
Well, a main reason we’d care about codespace distance is that it tells us something about how the agent will change as it learns (i.e. moves around in codespace). (This involves time, since the agent is changing, contra your picture.) So a key (quasi)metric on codespace would be “how much” learning it takes to get from here to there. The if True: x() else: y() program is an unnatural point in codespace in this metric: you’d have to have traversed both the distances from null to x() and from null to y(), and it’s weird to have traversed a distance and make no use of your position. A framing of the only-X problem is that traversing from null to a program that’s an only-Xer according to your definition might also constitute traversing almost all of the way from null to a program that’s an only-Yer, where Y is “very different” from X.
I don’t think that learning is moving around in codespace. In the simplest case, the AI is like any other non-self-modifying program. The code stays fixed as the programmers wrote it. The variables update. The AI doesn’t start from null. The programmer starts from a blank text file, and adds code. Then they run the code. The AI can start with sophisticated behaviour the moment it’s turned on.
So are we talking about a program that could change from an X-er to a Y-er with a small change in the code as written, or with a small amount of extra observation of the world?
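A toy sketch of the "code stays fixed, variables update" picture might look like the following. The class AveragingAgent and its methods are hypothetical names made up for illustration. The agent's compiled code is byte-for-byte identical before and after it observes the world, yet one extra observation flips which action it prefers.

```python
class AveragingAgent:
    """Hypothetical agent: the code is fixed, only the stored data changes."""

    def __init__(self, actions):
        self.totals = {a: 0.0 for a in actions}
        self.counts = {a: 0 for a in actions}

    def observe(self, action, reward):
        # Learning here is just a variable update.
        self.totals[action] += reward
        self.counts[action] += 1

    def act(self):
        # Pick the action with the highest average observed reward.
        def mean(a):
            return self.totals[a] / self.counts[a] if self.counts[a] else 0.0
        return max(self.totals, key=mean)


agent = AveragingAgent(["x", "y"])
code_before = AveragingAgent.act.__code__.co_code

agent.observe("x", 1.0)
choice_before = agent.act()

agent.observe("y", 5.0)  # one extra observation of the world
choice_after = agent.act()

code_after = AveragingAgent.act.__code__.co_code
print(choice_before, choice_after, code_before == code_after)
```

Whether the switch from x to y counts as "moving in codespace" depends on whether the stored tables are regarded as part of the program or as its data, which is the distinction in dispute here.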
To clarify where my responses are coming from: I think what I’m saying is not that directly relevant to your specific point in the post. I’m more (1) interested in discussing the notion of only-X, broadly, and (2) reacting to the feature of your discussion (shared by much other discussion) that you (IIUC) consider only the extensional (input-output) behavior of programs, excluding from analysis the intensional properties. (Which is a reasonable approach, e.g. because the input-output behavior captures much of what we care about, and also because it’s maybe easier to analyze and already contains some of our problems / confusions.)
From where I’m sitting, when a program “makes an observation of the world”, that’s moving around in codespace. There’s of course useful stuff to say about the part that didn’t change. When we really understand how a cognitive algorithm works, it starts to look like a clear algorithm / data separation; e.g. in Bayesian updating, we have a clear picture of the code that’s fixed, and how it operates on the varying data. But before we understand the program in that way, we might be unable to usefully separate it out into a fixed part and a varying part. Then it’s natural to say things like “the child invented a strategy for picking up blocks; next time, they just use that strategy”, where the first clause is talking about a change in source code. We know for sure that such separations can be done, because for example we can say that the child is always operating in accordance with fixed physical law, and we might suspect there are “fundamental brain algorithms” that are also basically fixed. Likewise, even though Solomonoff induction is always just Solomonoff induction plus data, it can also be useful to understand SI(some data) in terms of understanding those programs that are highly ranked by SI(some data), and it seems reasonable to call that “the algorithm changed to emphasize those programs”.
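As a rough illustration of the algorithm / data separation in Bayesian updating, here is a minimal sketch over a tiny, hand-picked hypothesis class. This is not Solomonoff induction: the three coin-bias hypotheses, the uniform prior, and the name bayes_update are all assumptions made for the example. The update rule never changes, but the ranking of hypotheses does.

```python
def bayes_update(weights, observation):
    """Fixed update rule: reweight each hypothesis by its likelihood, then normalize.

    `weights` maps a hypothesis (here, a coin's probability of heads) to its
    current probability; `observation` is 1 for heads and 0 for tails.
    """
    unnormalized = {
        p: w * (p if observation == 1 else 1.0 - p)
        for p, w in weights.items()
    }
    total = sum(unnormalized.values())
    return {p: w / total for p, w in unnormalized.items()}


# Hypothetical hypothesis class: three candidate coin biases, uniform prior.
hypotheses = [0.2, 0.5, 0.8]
prior = {p: 1.0 / len(hypotheses) for p in hypotheses}

# The varying part: data observed from the world (1 = heads, 0 = tails).
flips = [1, 1, 1, 0, 1, 1]

posterior = dict(prior)
for flip in flips:
    posterior = bayes_update(posterior, flip)

# The hypothesis the fixed algorithm now emphasizes most.
top = max(posterior, key=posterior.get)
print(top, posterior)
```

Described one way, bayes_update is constant and only posterior changes. Described the other way, the system has "changed to emphasize" the 0.8 hypothesis. Both descriptions refer to the same computation.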