Some different (I think) points against arguments related to the ones that are rebutted in the post:
- Requiring that a strategy be implementable with a utility function restricts it to a portion of the total strategy space. But it doesn’t follow that a strategy which can be implemented with a utility function actually has to be implemented that way.
- Even if a strategy lives in the “utility function” portion of the strategy space, it might be implemented with additional restrictions, such that arguments that would apply to a “typical” utility function won’t apply to it.
- Some of these theorems seem to me to assume consequentialism (e.g. the VNM theorem; see the statement just after this list), and I’m not sure they generalize usefully to non-consequentialist parts of the strategy space (though they might).
- What we actually want might or might not live in the consequentialist (or “utility function”) parts of the strategy space.
- If what we really want is incompatible with some desiderata, then so much the worse for the desiderata.
- Even if a strategy implementing what we want does live in the “utility function” portion of the space, an explicit utility function might not be the most convenient way to implement it.
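For reference, the representation result I have in mind with the VNM example is the standard one: if a preference ordering $\succeq$ over lotteries satisfies completeness, transitivity, continuity, and independence, then there is a utility function $u$ over outcomes such that preferences agree with expected utility,

$$L \succeq M \iff \sum_i p_i\, u(o_i) \;\ge\; \sum_j q_j\, u(o_j),$$

where $L$ assigns probability $p_i$ to outcome $o_i$ and $M$ assigns $q_j$ to $o_j$ (and $u$ is unique up to positive affine transformation). The relevant feature for the point above is that $u$ is a function of outcomes alone.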
As an example of that last point, here is Haskell-style pseudocode for an AI (type signatures only; this margin is too small to contain the actual functions): (edit: I was too hasty the first time, and changed the code to better reflect what I intended)
trainModel               :: Input(t) -> Model(t)                                        -- learn/update a world model from the inputs up to time t
extractValues            :: Model(t) -> ProbDistributionOfHumanValues(t,unknowns)       -- the model's current, uncertain estimate of human values
predictEntireFuture      :: Model(t) -> Action -> ProbDistributionOfWorldPath(unknowns) -- distribution over entire future world-paths, given an action
-- score a world-path distribution against the value distribution, tracking correlations between their unknowns
evaluateWithCorrelations :: Model(t) -> (ProbDistributionOfHumanValues(t,unknowns), ProbDistributionOfWorldPath(unknowns)) -> ExpectedValue
generateActions          :: Model(t) -> [Action]                                        -- candidate actions to consider
chooseAction             :: [Action] -> (Action -> ExpectedValue) -> Action             -- pick the candidate with the highest score
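To make the intended control flow explicit, here is a minimal self-contained sketch of how these signatures might compose into a single decision step. The concrete types and stub bodies are placeholders of my own, there only so the sketch compiles; they are not part of the pseudocode above.

```haskell
type Input         = String   -- stands in for Input(t)
type Model         = String   -- stands in for Model(t)
type HumanValuesPD = String   -- stands in for ProbDistributionOfHumanValues(t,unknowns)
type WorldPathPD   = String   -- stands in for ProbDistributionOfWorldPath(unknowns)
type Action        = String
type ExpectedValue = Double

trainModel :: Input -> Model
trainModel = id

extractValues :: Model -> HumanValuesPD
extractValues m = "values inferred from " ++ m

predictEntireFuture :: Model -> Action -> WorldPathPD
predictEntireFuture _ a = "world path if the AI does " ++ a

evaluateWithCorrelations :: Model -> (HumanValuesPD, WorldPathPD) -> ExpectedValue
evaluateWithCorrelations _ (_, worldPath) = fromIntegral (length worldPath)  -- placeholder score

generateActions :: Model -> [Action]
generateActions _ = ["ask the humans first", "act unilaterally"]

chooseAction :: [Action] -> (Action -> ExpectedValue) -> Action
chooseAction acts score = snd (maximum [(score a, a) | a <- acts])

-- The step the signatures seem to describe: build a model, enumerate candidate
-- actions, score each by evaluating its predicted world-path distribution against
-- the current distribution over human values, and pick the best.
decideStep :: Input -> Action
decideStep input = chooseAction (generateActions model) score
  where
    model     = trainModel input
    score act = evaluateWithCorrelations model (extractValues model, predictEntireFuture model act)

main :: IO ()
main = putStrLn (decideStep "observations so far")
```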
It doesn’t look to me as though this code would be easy to express in terms of a utility function, particularly if human values contain non-consequentialist components. But it ought to avoid being exploited if humans want it to (while remaining exploitable if humans want that instead).
What is the function evaluateAction supposed to do when human values contain non-consequentialist components? I assume ExpectedValue is a real number. There might be a way to build a utility function that corresponds to this code, but that is hard to judge since you have left the details out.
(I edited the code after this comment was posted, with corresponding edits below; to avoid noise, the original is not shown. The original code did not make explicit what I discuss in the “main reason” paragraph.)
evaluateWithCorrelations uses both the ProbDistributionOfWorldPath(unknowns) and the Action (which enters via predictEntireFuture) to generate the ExpectedValue; it isn’t explicit in the signatures, but the WorldPath is implicitly allowed to take the past and present into account as well. So yes, ExpectedValue is a real number, but it doesn’t necessarily depend only on the consequences of the action.
However, my main reason for thinking this would be hard to express as a utility function is that the calculation of the ExpectedValue is supposed to take into account the future actions of the AI (not just the Action being chosen now), and also the correlations between ProbDistributionOfHumanValues(t,unknowns) and ProbDistributionOfWorldPath(unknowns). Note that I don’t mean taking into account changes in actual human values: the evaluation should use only current values, though the prediction should take possible changes into account. But the future actions of humans depend on current human values. So, ideally, the AI should be able to predict that asking humans what they want will lead to an update of the model at t′ that is correlated with the unknowns in ProbDistributionOfHumanValues(t,unknowns), which will then lead to different actions by the AI depending on how the humans respond. It can then assign a better ExpectedValue to this course of action than to not asking, whereas I would expect a straight utility function maximizer to assign the same value in the short run, and reduced value in the long run, to such asking.
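To make the asking-first effect concrete, here is a toy calculation with made-up numbers of my own; it assumes the humans’ answer perfectly reveals which value hypothesis is true, so it only illustrates the mechanism: because the answer is correlated with the unknowns in ProbDistributionOfHumanValues(t,unknowns), a plan that asks first and then acts gets a higher ExpectedValue than committing now.

```haskell
-- Two hypotheses about what humans value, equally likely a priori;
-- action 'A' is good under H1, action 'B' under H2, and asking lets the
-- AI defer the choice until the answer has updated the model.
data Hypothesis = H1 | H2 deriving Show

prior :: [(Hypothesis, Double)]
prior = [(H1, 0.5), (H2, 0.5)]

-- Value of committing to an action now, under each hypothesis.
payoff :: Char -> Hypothesis -> Double
payoff 'A' H1 = 1.0
payoff 'B' H2 = 1.0
payoff _   _  = 0.0

-- Expected value of committing now: the choice can't depend on information
-- the AI doesn't have yet, so either action scores 0.5.
commitNow :: Char -> Double
commitNow a = sum [p * payoff a h | (h, p) <- prior]

-- Expected value of asking first: the answer is correlated with the unknowns,
-- so after updating, the AI picks whichever action matches, scoring 1.0
-- (minus whatever asking itself costs, taken as 0 here).
askFirst :: Double
askFirst = sum [p * maximum [payoff a h | a <- "AB"] | (h, p) <- prior]

main :: IO ()
main = do
  print (commitNow 'A', commitNow 'B')  -- (0.5, 0.5)
  print askFirst                        -- 1.0
```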
Obviously, yes, a real AI would be much more complicated.