its formal goal: to maximize whichever utility function (as a piece of math) would be returned by the (possibly exponentially expensive to compute) mathematical expression E which the world would’ve contained instead of the answer, if, in the world, instances of the question were replaced with just the string “what should the utility function be?” followed by enough spaces to pad it to 1 gigabyte.
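very roughly, and in made-up notation rather than anything official: write $q$ for the located question blob, $q'$ for the counterfactual question string (padded to 1 gigabyte), $\mathrm{CF}_{q \to q'}(w)$ for the counterfactual version of world $w$ in which instances of $q$ are replaced by $q'$, and $\mathrm{answer}(\cdot)$ for whatever blob sits where the original answer sat. then the formal goal looks something like

$$E = \mathrm{answer}\big(\mathrm{CF}_{q \to q'}(w)\big), \qquad U = \mathrm{eval}(E),$$

and the AI picks actions to maximize $\mathbb{E}[U]$, with the expectation taken over its (logical and empirical) uncertainty about what $E$ actually returns.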
How do you think about the under-definedness of counterfactuals?
EG, if counterfactuals are weird, this proposal probably does something weird, as it has to condition on increasingly weird counterfactuals.
the counterfactuals might be defined wrong but they won’t be “under-defined”. but yes, they might locate the blob somewhere we don’t intend to (or insert the counterfactual question in a way we don’t intend to); i’ve been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).
on the other hand, if you’re talking about the blob-locating math pointing to the right thing but the AI not making accurate guesses early enough as to what the counterfactuals would look like, i do think getting only eventual alignment is one of the potential problems, but i’m hopeful it gets there eventually, and maybe there are ways to check that it’ll make good enough guesses even before we let it loose.
Yeah, no, I’m talking about the math itself being bad, rather than the math being correct but the logical uncertainty making poor guesses early on.
i’ve been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).
I noticed you had some other posts relating to the counterfactuals, but skimming them felt like you were invoking a lot of other machinery that I don’t think we have, and that you also don’t think we have (IE the voice in the posts is speculative, not affirmative).
So I thought I would just ask.
My own thinking would be that the counterfactual reasoning should be responsive to the system’s overall estimates of how-humans-would-want-it-to-reason, in the same way that its prior needs to be an estimate of the human-endorsed prior, and values should approximate human-endorsed values.
Sticking close to QACI, I think what this amounts to is tracking uncertainty about the counterfactuals employed, rather than solidly assuming one way of doing it is correct. But there are complex questions of how to manage that uncertainty.
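As a toy illustration of what “tracking uncertainty about the counterfactuals employed” could look like (the names and structure below are mine, not anything in QACI): keep several candidate counterfactual-insertion procedures around, weight them by how strongly you estimate humans would endorse each one, and score actions against the weighted mixture rather than a single hard-coded choice.

```python
# Toy sketch (hypothetical names): score an action against a weighted mixture of
# counterfactual-location hypotheses instead of committing to a single one.

def mixture_score(action, counterfactual_hypotheses, endorsement_weight, utility_under):
    """counterfactual_hypotheses: candidate ways of locating/inserting the counterfactual question.
    endorsement_weight(cf): estimated degree to which humans would endorse using cf.
    utility_under(action, cf): expected utility of `action` if cf is the right hypothesis."""
    total = sum(endorsement_weight(cf) for cf in counterfactual_hypotheses)
    return sum(
        endorsement_weight(cf) * utility_under(action, cf)
        for cf in counterfactual_hypotheses
    ) / total
```

The hard part is of course hidden inside `endorsement_weight`: how those weights get estimated and updated is exactly the complex question of how to manage that uncertainty.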
i’ve made some work towards building that machinery (see eg here), but yes, there are still a bunch of things to be figured out, though i’m making progress in that direction (see the posts about blob location).
My own thinking would be that the counterfactual reasoning should be responsive to the system’s overall estimates of how-humans-would-want-it-to-reason, in the same way that its prior needs to be an estimate of the human-endorsed prior, and values should approximate human-endorsed values.
are you saying this in the prescriptive sense, i.e. that we should want that property? i think if implemented correctly, accuracy is all we would really need, right? carrying human intent in those parts of the reasoning seems difficult and wonky and plausibly not necessary to me, whereas straightforward utility maximization should work.
Notably, this relies on the utility function actually being sparse enough that it can’t be maximized except by generating the traits abram mentions.