Maybe this has been discussed already, just commenting as I read.
This fact is key to what I’m saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate “human value function”.
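As a sketch, that "human value function" would just be a thin query layer over the model. Everything below is hypothetical: `query_gpt_n` stands in for an actual multimodal GPT-N call, stubbed with a trivial keyword heuristic so the sketch runs at all.

```python
def query_gpt_n(prompt: str) -> str:
    """Hypothetical stand-in for a call to multimodal GPT-N.
    Stubbed with a trivial keyword heuristic so this sketch runs;
    a real system would query the actual model."""
    bad_markers = ("suffering", "deception", "harm")
    return "bad" if any(m in prompt.lower() for m in bad_markers) else "good"

def human_value_function(outcome_description: str) -> bool:
    """Returns True iff the evaluator rates the described outcome as good."""
    prompt = (
        f"Is the following outcome good or bad? {outcome_description} "
        "Answer 'good' or 'bad'."
    )
    return query_gpt_n(prompt).strip().lower() == "good"
```

The point is only that the interface is a plain natural-language question and a plain answer, so swapping the stub for a real model (or a human, per the next paragraph) changes nothing structurally.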
In any AI system structure where it’s true that GPT-N can fulfill this function[1], a natural human could too (just with a longer delay for their output to be passed back).[2]
(The rest of this and the footnotes are just-formed ideas)
Though, if your AI relies on predicting the response of GPT-N, then it does have an advantage: GPT-N can be precisely specified within the AI's structure, unlike a human, whose precise neural specification is unknown. With a human, you'd have to point to them in the environment, or otherwise predict an input from the environment, which makes your AI vulnerable to probable environment hacking.
So I suppose if there's ever a GPT-N that really seems to write with regard to actual values, rather than current human discourse/cultural beliefs about which human-cultural-policies are legitimated, it could work as an outer/partial inner alignment solution.[1]
Failing that kind of GPT-N, maybe you can at least have one which answers a simpler question, like: "How would <natural language plan and effects> score in terms of its effect on total suffering and happiness, given x weighting of each?" A system with that basis seems, modulo possible botched-alignment concerns, trivially preferable to an orthogonal-maximizer AI, if it's the best we can create. It wouldn't capture the full complexity of the designer's values, but would still score very highly under them, due to the reduction of suffering in other lightcones. Edit: another user proposes probably-better natural language targets in another comment.
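That simpler question also has a simple shape as code. A minimal sketch, assuming a hypothetical evaluator: `score_plan` and its internal estimates are made up here, with the two GPT-N queries stubbed as word counts purely so the sketch executes.

```python
def score_plan(plan_and_effects: str,
               w_suffering: float = 1.0,
               w_happiness: float = 1.0) -> float:
    """Ask the evaluator for separate suffering/happiness estimates for a
    natural-language plan description, then combine them with the
    designer-chosen weights ("x weighting of each"). The two estimates
    are stubbed with word counts so this runs; a real system would pose
    each sub-question to GPT-N."""
    text = plan_and_effects.lower()
    suffering_estimate = float(text.count("suffer"))  # stub for a GPT-N answer
    happiness_estimate = float(text.count("happy"))   # stub for a GPT-N answer
    return w_happiness * happiness_estimate - w_suffering * suffering_estimate
```

The weights are the designer's one explicit value input; everything else is delegated to the evaluator's judgment of the description.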
Though in both cases (human, GPT-N), you face some issues, like: "How is the planner component generating the plans without something like a value function (to be used in a [criterion for the plan to satisfy] passed to the planner)?" (i.e., you write that GPT-N would only be asked to evaluate the plan after the plan is generated). Though I'm seeing some ways around this one.*
and “How are you translating from the planner’s format to natural language text to be sent to the GPT?”
* (If you already have a way to translate between written human language and the planner’s format, I see some ways around this which leverage that, like “translate from human-language to the planner’s internal format criteria for the plan to satisfy, before passing the resulting plan to GPT-N for evaluation”, and some complications** (haven’t branched much beyond that, but it looks solvable))
** (i) Two different plans can correspond to the same natural language description. (ii) The choice of what to specify (specifically in the translation from an internal format to natural language) is informed by context, including values and background assumptions, neither of which are necessarily specified to the translator. I have some thoughts about possible ways to make these into non-issues, if we have the translation capacity and a general purpose planner to begin with.
Relevantly, there's no actual value function being maximized in this model (i.e., the planner is not trying to select for [the action whose description will elicit the strongest 'yes' rating from GPT-N]), though the planner is underspecified as is.
Either case implies structural similarity to Holden (2012)'s tool AI proposal. I.e., {generate plan[1] → output plan and wait for input} → {display plan to human, or input plan to GPT-N} → {if 'yes' received back as input, then actually enact plan}
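That three-step loop can be sketched directly; the planner, approver, and actuator are all injected stand-ins here (`propose_plan`, `approve`, `enact` are hypothetical names, not anything from the original proposal):

```python
from typing import Callable, Optional

def tool_ai_loop(propose_plan: Callable[[], Optional[str]],
                 approve: Callable[[str], bool],
                 enact: Callable[[str], None]) -> Optional[str]:
    """Generate a plan, halt for an external verdict (human or GPT-N),
    and enact only on approval. Returns the enacted plan, else None."""
    plan = propose_plan()   # {generate plan -> output plan and wait for input}
    if plan is None:
        return None
    if approve(plan):       # {display plan to human, or input plan to GPT-N}
        enact(plan)         # {if 'yes' received back as input, enact plan}
        return plan
    return None
```

The key structural property is that enactment sits strictly behind the external approval gate; the planner never acts on its own verdict.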