First off, thank you for writing this; great explanation.
Do you anticipate acceleration risks from developing the formal models through an open, multilateral process? Presumably others could use the models to train and advance the capabilities of their own RL agents. Or is the expectation that regulation would accompany this such that only the consortium could use the world model?
Would the simulations be exclusively for 'hard science' domains (e.g., chemistry, biology), or would simulations of human behavior, economics, and politics also be needed? My expectation is that the latter would be needed, but I imagine simulating hundreds of millions of intelligent agents would dramatically (prohibitively?) increase the complexity and computational cost.
The formal models don't need to be open and public, and probably shouldn't be. Of course, this adds a layer of difficulty: it is harder to coordinate on an international scale and invite many researchers to help with your project when you also want some protection against your model being stolen or published on the internet. Open-sourcing might be acceptable in the case where training an agent inside the simulation is so expensive that no other group could afford it.
Good question. I don’t know, and I don’t think that I have a good model of what the simulation would look like. Here is what my (very simplified, probably wrong) model of Davidad would say:
We only want to be really sure that the agent is locally nice. In particular, we want the agent not to hurt people (or perhaps to allow this only when we can be sure there are good reasons, for example to stop someone from hurting others). The agent should not hurt them with weapons, or by removing the oxygen, or by increasing radiation. For that, we need to find a mathematical model of human boundaries, and then formally verify that these boundaries will be respected.

Since the agent is trained using time-bounded RL, after a short period it will no longer have any effect on the world (if time-bounded RL indeed works), and the stakeholders will be able to determine whether the policy had a good impact on the world. If not, they can train another agent, change the desiderata, and/or improve the formal model. That's why it is more important to have a fine-grained model of chemistry and physics, while we can make do with a coarser model of economics and politics. In particular, we should not need to simulate millions of people.
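To make the "mathematical model of human boundaries" idea a bit more concrete, here is a toy sketch, entirely my own and not Davidad's actual formalism: encode each boundary as a predicate over world-model states, and require that every trajectory of length at most T satisfies all predicates. All names, thresholds, and the world-model interface below are made up for illustration, and the sampling check only stands in for what would really need to be a formal proof over the model.

```python
# Toy sketch, assuming a world model with initial_state()/step() and a
# policy mapping states to actions -- all hypothetical names. A real
# version would formally verify the property over *all* trajectories of
# the model; sampling rollouts, as done here, can only find
# counterexamples, never prove safety.

T = 100  # time bound: the policy only acts for T steps


def oxygen_ok(state):
    # O2 partial pressure stays in a survivable band (kPa; made-up bounds)
    return 16.0 <= state["o2_kpa"] <= 30.0


def radiation_ok(state):
    # dose rate stays below a threshold (made-up units and value)
    return state["dose_rate"] <= 0.1


BOUNDARY_PREDICATES = [oxygen_ok, radiation_ok]


def respects_boundaries(world_model, policy, n_rollouts=1000):
    """Monte Carlo stand-in for the verification step: sample trajectories
    of length <= T and check every visited state against every predicate."""
    for _ in range(n_rollouts):
        state = world_model.initial_state()
        for _ in range(T):
            state = world_model.step(state, policy(state))
            if not all(p(state) for p in BOUNDARY_PREDICATES):
                return False  # found a boundary violation
    return True  # no violation found (evidence, not a proof)


class DummyWorldModel:
    """Trivial stand-in dynamics so the sketch runs end to end."""

    def initial_state(self):
        return {"o2_kpa": 21.0, "dose_rate": 0.01}

    def step(self, state, action):
        # the action nudges oxygen; radiation is left untouched
        return {"o2_kpa": state["o2_kpa"] + action,
                "dose_rate": state["dose_rate"]}


print(respects_boundaries(DummyWorldModel(), policy=lambda s: 0.0))  # True
```

Note that the time bound T does double duty here: it is both the horizon over which the boundary predicates must hold and the point after which the policy is supposed to stop influencing the world, which is what lets the stakeholders review the outcome afterwards.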
Is it reasonable? I don't know, and until I see this mathematical model of human boundaries, or at least a very convincing prototype, I'll remain a bit skeptical.