
I am really excited about selection theorems.[1]

I think selection theorems provide a principled framework with which to formulate (and prove) safety desiderata/guarantees for AI systems that remain robust under arbitrary capability amplification.

In particular, I think selection theorems naturally lend themselves to proving properties that are selected for/emerge in the limit of optimisation for particular objectives (convergence theorems?). Properties proven to emerge in the limit become more robust with scale; I think that's an incredibly powerful result.
My preferred conception of selection theorems is more general than Wentworth's. On this conception, selection theorems are general statements about constructive optimisation processes (natural selection, stochastic gradient descent, human design) and the artifacts they produce (humans, ML models, the quicksort algorithm).
What artifact types are selected for by optimisation for (a) particular objective(s)? Given (a) particular artifact type(s), what is the type of the objective(s) for which it was selected?

[Where the "type" specifies (nontrivial?) properties of artifacts, constructive optimisation processes, and objectives.]
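To make the "emerges in the limit of optimisation" intuition concrete, here is a minimal toy sketch (my own illustration, not Wentworth's formalism): a greedy hill-climbing process optimising bitstrings for the objective "number of ones". Whatever artifact the process starts from, every sufficiently long run converges to an artifact with the property "all bits set" — a (trivial) convergence-style claim about what the optimisation process selects for.

```python
import random

def optimise(objective, artifact, steps=2000, seed=0):
    """Toy constructive optimisation process: greedy hill-climbing.

    Proposes single-bit flips and keeps any that strictly improve
    the objective (sums differ by exactly 1 per flip, so >= here
    only ever accepts improvements).
    """
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(artifact))
        candidate = artifact[:i] + [1 - artifact[i]] + artifact[i + 1:]
        if objective(candidate) >= objective(artifact):
            artifact = candidate
    return artifact

def has_limit_property(artifact):
    # The property selected for in the limit: all bits are 1.
    return all(b == 1 for b in artifact)

# Objective: maximise the number of ones in the bitstring.
objective = sum
selected = optimise(objective, artifact=[0] * 16)
```

The point of the sketch is only the shape of the claim: the property is stated about the *limit* of the optimisation process, so it holds regardless of initialisation or path, which is why such properties become more (not less) reliable as optimisation pressure increases.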