I normally imagine using joint training in these cases, rather than pre-training + fine-tuning: e.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer “knows everything the agent knows.” They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.
(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn’t use statistical regularities from the “main” objective to inform your answers to questions; your question answers would therefore be dumber than the policy, and so you couldn’t get a good reward function or a specification of catastrophically bad behavior.)
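A minimal sketch of what “each gradient update affecting both of them” could look like, assuming a PyTorch-style network with a shared trunk, a policy head, and a question-answering head. All names, shapes, and the toy one-token answer format here are illustrative assumptions, not anything from the actual AlphaStar setup:

```python
import torch
import torch.nn as nn

class JointAgent(nn.Module):
    """Shared trunk with two heads: one picks actions, one answers questions.

    Because the trunk is shared, every gradient step that improves the policy
    also shapes the representation the question-answerer reads from, and vice
    versa -- the two "get better together" rather than being trained in
    separate phases.
    """
    def __init__(self, obs_dim, n_actions, vocab_size, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.qa_head = nn.Linear(hidden, vocab_size)     # toy one-token answers

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy_head(h), self.qa_head(h)


def joint_update(model, optimizer, obs, action_target, answer_target, qa_weight=1.0):
    """One gradient step on the combined objective: both heads and the shared
    trunk are updated together, with no separate fine-tuning phase."""
    policy_logits, qa_logits = model(obs)
    policy_loss = nn.functional.cross_entropy(policy_logits, action_target)
    qa_loss = nn.functional.cross_entropy(qa_logits, answer_target)
    loss = policy_loss + qa_weight * qa_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return policy_loss.item(), qa_loss.item()


# usage (illustrative): a single optimizer over all shared parameters
model = JointAgent(obs_dim=64, n_actions=10, vocab_size=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
obs = torch.randn(32, 64)
actions = torch.randint(0, 10, (32,))
answers = torch.randint(0, 1000, (32,))
joint_update(model, optimizer, obs, actions, answers)
```

In this sketch there is one optimizer over all the shared parameters and no fine-tuning stage; the hypothetical `qa_weight` knob is one place where the competitiveness question raised below would show up.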
That sounds safer, but is it competitive? Would AlphaStar be close to as good as it is, if it had been simultaneously trained to answer questions?
We could also ask: “Would AlphaStar remain as good as it is, if fine-tuned to answer questions?”
In either case it’s an empirical question. I think the answer is probably yes if you do it carefully.
You could imagine separating this into two questions:
1. Is there a policy that plays StarCraft and answers questions, and that is only slightly larger than a policy for playing StarCraft alone? This is a key premise for the whole project. I think it’s reasonably likely; the goal is only to answer questions the model “already knows” the answers to, so it seems realistic to hope for only a constant amount of extra work to be able to use that knowledge to answer questions. I think most of the uncertainty here is about the details of “know,” question-answering, and so on. (See the sketch after these two questions.)
2. Can you use joint optimization to find that policy with only slightly more training time? I think probably yes.
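To make question 1 slightly more concrete, here is one crude way to measure the size overhead in the toy `JointAgent` sketch above (assumed to be in scope). This is purely illustrative: in such a tiny model the question-answering head is not a negligible fraction of the parameters, whereas the premise is that for an AlphaStar-scale trunk it would be.

```python
def qa_overhead(model):
    """Parameters that exist only for question-answering, as a fraction of
    the whole model -- a crude proxy for "how much larger is a policy that
    also answers questions?" (assumes the JointAgent sketch above)."""
    qa = sum(p.numel() for p in model.qa_head.parameters())
    total = sum(p.numel() for p in model.parameters())
    return qa, total, qa / total

model = JointAgent(obs_dim=64, n_actions=10, vocab_size=1000)
qa, total, frac = qa_overhead(model)
print(f"question-answering-only parameters: {qa:,} / {total:,} ({frac:.1%})")
```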
OK, thanks! I’m pleased to see this and other empirical premises explicitly laid out. It means we as a community are making predictions about the future based on models which can be tested before it’s too late, and perhaps even now.