IMO it might very well be that most restrictions on data and compute are net positive. However, there are arguments in both directions.
On my model, current AI algorithms are missing some key ingredients for AGI, but they might still eventually produce AGI by learning those missing ingredients. This is similar to how biological evolution is a learning algorithm which is not a general intelligence (GI), but it produced humans, who are GIs. Such an AGI would be a mesa-optimizer, and it’s liable to be unaligned regardless of the details of the outer loop (assuming an outer loop made of building blocks similar to what we have today). For example, the outer loop might be aimed at human imitation, but the resulting mesa-optimizer would only imitate humans when it’s instrumentally beneficial for it. Moreover, as in the case of evolution, this process would probably be very costly in terms of compute and data, since it is trying to “brute force” a problem for which it doesn’t have an efficient algorithm. Therefore, limiting compute or data seems like a promising way to prevent this undesirable scenario.
On the other hand, the most likely path to aligned AI would be through a design that’s based on solid theoretical principles. Will such a design require much more data or compute than its unaligned competitors?
Reasons to think it won’t:
Solid theoretical principles should allow improving capabilities as well as alignment.
Intuitively, if an AI is capable enough to be transformative (given access to particular amounts of compute and data), it should be capable enough to figure out human values, assuming it is motivated to do so in the first place. Or, it should at least be capable enough to act against unaligned competition while not irreversibly destroying information about human values (in which case it can catch up on learning them later). This is similar to what Christiano calls “strategy stealing”.
Reasons to think it will:
Maybe aligning AI requires installing safeguards that cause substantial overhead. This seems very plausible when looking at proposals such as Delegative Reinforcement Learning, which has worse regret asymptotics than “unaligned” alternatives (conventional RL); see the illustrative bounds after this list. It also seems plausible when looking at proposals such as IDA or debate, which introduce another level of indirection (simulating humans) into the problem of optimizing the world, a problem that unaligned AI attacks directly (in Christiano’s terminology, they fail to exploit inaccessible information). It’s less clear for PreDCA, but even there alignment requires a loss function with a more complex type signature than the infra-Bayesian physicalism “default”, which might incur a statistical or computational penalty.
Maybe aligning AI requires restricting ourselves to well-understood algorithmic building blocks rather than heuristic (but possibly more efficient) building blocks. Optimistically, having solid theoretical principles should allow us to roughly predict the behavior even of heuristic algorithms that are effective (because such algorithms have to be doing qualitatively the same thing as the rigorous algorithms). Pessimistically, alignment might depend on nuances that are obscured in heuristics.
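To illustrate what “worse regret asymptotics” means here (the exponents below are purely illustrative and are not the actual published bounds for Delegative RL), compare a hypothetical unconstrained learner to a hypothetical safety-constrained one:

$$\mathrm{Reg}_{\text{unaligned}}(T) = \tilde{O}\big(\sqrt{T}\big) \qquad \text{vs.} \qquad \mathrm{Reg}_{\text{safe}}(T) = \tilde{O}\big(T^{2/3}\big)$$

In that case the overhead of the safe variant is not a constant factor but grows with the time horizon $T$ (here like $T^{1/6}$), which is the kind of gap I have in mind.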
We can model the situation by imagining 3 frontiers in resource space:
The mesa-resource-frontier (MRF) is how many resources are needed to create TAI with something similar to modern algorithms, i.e. while still missing key AGI ingredients (such a TAI is necessarily unaligned).
The direct-resource-frontier (DRF) is how many resources are needed to create TAI assuming all key algorithms are available, but without any attempt at alignment.
The aligned-resource-frontier (ARF) is how many resources are needed to create aligned TAI.
We have ARF > DRF and MRF > DRF, but the relation between ARF and MRF is not clear. They might even intersect (resource space is multidimensional: we at least have data vs. compute, and maybe finer distinctions are important). I would still guess MRF > ARF, by and large. Assuming MRF > ARF > DRF, the ideal policy would forbid resources beyond MRF but allow resources beyond ARF. A policy that is too lax might lead to doom via the mesa-optimizer pathway. A policy that is too strict might lead to doom by making alignment infeasible. If the policy is so strict that it forces us below DRF, then it buys time (which is good), but if the restrictions are then lifted gradually, it predictably leads to the region between DRF and ARF (which is bad).
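As a minimal toy sketch of these regimes (everything here is illustrative: resource space is collapsed to two dimensions and the frontier thresholds are made-up numbers, not estimates), one can picture classifying a policy’s resource cap as follows:

```python
# Toy sketch of the three-frontier model (illustrative only; all numbers are made up).
# Resource space is simplified to two dimensions: compute and data.
# Each frontier is represented as a predicate: "are these resources beyond the frontier?"

from dataclasses import dataclass

@dataclass
class Resources:
    compute: float  # e.g. training FLOP, arbitrary units
    data: float     # e.g. tokens, arbitrary units

# Hypothetical frontier predicates. In reality the frontiers are unknown
# multidimensional surfaces; here each is a simple product threshold.
def beyond_drf(r: Resources) -> bool:
    # direct-resource-frontier: TAI with all key algorithms, no attempt at alignment
    return r.compute * r.data >= 1e2

def beyond_arf(r: Resources) -> bool:
    # aligned-resource-frontier: aligned TAI (assumed to carry extra overhead)
    return r.compute * r.data >= 3e2

def beyond_mrf(r: Resources) -> bool:
    # mesa-resource-frontier: TAI via brute-forcing the missing ingredients
    return r.compute * r.data >= 1e4

def classify_policy_cap(cap: Resources) -> str:
    """Classify the regime a resource cap puts us in, assuming MRF > ARF > DRF."""
    if beyond_mrf(cap):
        return "too lax: the mesa-optimizer pathway is open (doom risk)"
    if beyond_arf(cap):
        return "ideal band: aligned TAI feasible, brute-force mesa-optimizers are not"
    if beyond_drf(cap):
        return "bad band: unaligned direct TAI feasible, aligned TAI is not"
    return "below DRF: buys time, but lifting the cap gradually lands in the bad band"

if __name__ == "__main__":
    for cap in [Resources(100, 1000), Resources(10, 50), Resources(5, 30), Resources(2, 10)]:
        print(cap, "->", classify_policy_cap(cap))
```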
Overall, the conclusion is uncertain.