Yes, I think the question is more about what we expect such a model to be critically lacking in, which might make the difference in whether it is actively dangerous. Some people have been discussing what we should check for to determine whether a model is roughly safe versus critically dangerous. For instance, its ability to: deceive; self-improve; have situational awareness / episodic memory (as discussed in neuroscience); have agentic goals; do long-term strategic planning (versus being more safely myopic); do active search and experimentation to disambiguate between competing hypotheses; jump to useful novel insights from a number of subtle hints in the available evidence; self-assess (did I succeed at the recent task I tried, or fail? Can I proceed to the next step in my plan, or do I need to try again, or perhaps create a whole new plan? Did I fail several times in a row using a particular strategy, implying I should try a different approach?). I'm sure there's more to add to this list. I don't think current models are at literally zero on all of these. Coming up with evaluations to measure models against humans on these tasks seems hard but important. This list is probably incomplete, but I think it would be sufficient if a model were superhumanly skilled at all of these things simultaneously. What do you think? Am I missing something? Including something unnecessary?
I'm currently thinking about this paper and wondering how we can come up with better evaluations of real-world generality: https://www.nature.com/articles/s41598-021-01997-7