We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
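As a toy illustration of what "explicitly represented" objective and "internal search" mean in that definition (the names below are mine, purely for illustration, not anything from the original text): the objective is a first-class, queryable piece of the system, and the system's behaviour consists of scoring candidate elements of the search space against it.

```python
# Toy sketch of "internal search over an explicitly represented objective".
# The objective is an object the system carries around and queries directly,
# not merely an implicit tendency of its behaviour.

def internal_search(candidate_outputs, objective):
    """Return the candidate that scores highest under the explicit objective."""
    return max(candidate_outputs, key=objective)

# Hypothetical example: candidates are (plan, predicted_score) pairs,
# and the objective simply reads off the predicted score.
plans = [("go_left", 0.2), ("go_right", 0.9), ("wait", 0.5)]
best_plan = internal_search(plans, objective=lambda plan: plan[1])
print(best_plan)  # ('go_right', 0.9)
```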
I appreciate the difficulty of actually defining optimizers, so I don’t want to quibble with this definition. But I am interested in whether you think humans are a central example of optimizers under it, and, if so, whether you think most mesa-optimizers will “explicitly represent” their objective functions to a similar degree as humans do.
I think humans are fairly weird because we were selected for an objective that is unlikely to be what we select for in our AIs.
That said, if we model AI success as driven by model size and compute (with maybe innovations in low-level architecture), then I think that the way humans represent objectives is probably fairly close to what we ought to expect.
If we model AI success as mainly innovative high-level architecture, then I think we will see more explicitly represented objectives.
My tentative sense is that for AI to be interpretable (and safer) we want it to be the latter kind (explicit high-level architecture), but that, given enough compute, the former (scale-driven) kind will give better results, other things being equal.
Here, what I mean by low-level architecture is something like “we’ll use lots of LSTMs instead of lots of plain RNNs, but keep the model structure simple: plug in the inputs, pass them through some layers, and read out the action probabilities”, and high-level is something like “let’s organise the model using this enormous flowchart with all of these various pieces, each of which is designed to play a particular role; here’s the observation embedding, here’s the search in latent model space, here’s the …”
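To make the contrast concrete, here is a minimal sketch of the “low-level” style (assuming PyTorch; the class name and dimensions are invented for illustration): a single recurrent core mapping observations to action probabilities, with no architecturally separate planning, search, or objective-representation modules. The “high-level” style would instead wire together named components such as an observation embedder, a learned latent model, and an explicit search procedure over that model.

```python
import torch
import torch.nn as nn

class SimpleRecurrentPolicy(nn.Module):
    """A deliberately "flat" policy: observations go in, pass through a
    recurrent core, and action probabilities come out. Any planning or
    objective representation the trained model ends up doing is implicit
    in the learned weights, not visible in the architecture."""

    def __init__(self, obs_dim: int, hidden_dim: int, num_actions: int):
        super().__init__()
        self.core = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, obs_dim)
        features, _ = self.core(obs_seq)
        logits = self.head(features)          # (batch, time, num_actions)
        return torch.softmax(logits, dim=-1)  # per-step action probabilities

# Usage with made-up sizes: 4 trajectories of length 10 over 16-dim observations.
policy = SimpleRecurrentPolicy(obs_dim=16, hidden_dim=64, num_actions=5)
probs = policy(torch.randn(4, 10, 16))
print(probs.shape)  # torch.Size([4, 10, 5])
```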