A couple of questions wrt this:
Could LLMs develop the type of self-awareness you describe as part of their own training or RL-based fine-tuning? Many LLMs do seem to have some “awareness” of their existence and function (incidentally, the model evals run by Anthropic could be taken as evidence of this). I assume a simple future setup could be an auto-GPT-N with a prompt like “You are the CEO of Walmart, you want to make the company maximally profitable”; in that scenario I would contend that the agent could easily be aware of both its role and function, and easily be attracted to that part of the search space.
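To make that setup concrete, here is a minimal sketch of the kind of auto-GPT-N loop I have in mind; the `complete` callable, the `run_agent` helper, and the role prompt are purely illustrative placeholders, not any particular framework's API:

```python
from typing import Callable

# Illustrative role prompt from the comment above; "auto-GPT-N" here is just
# a generic plan-act loop around whatever future LLM happens to be available.
SYSTEM_PROMPT = (
    "You are the CEO of Walmart, you want to make the company "
    "maximally profitable. Plan and act step by step."
)


def run_agent(complete: Callable[[str], str], max_steps: int = 10) -> list[str]:
    """Minimal Auto-GPT-style loop: feed the role prompt plus the action
    history back to the model until it declares itself done."""
    history: list[str] = []
    for _ in range(max_steps):
        prompt = SYSTEM_PROMPT + "\n" + "\n".join(history) + "\nNext action:"
        action = complete(prompt)
        history.append(action)
        if "DONE" in action:
            break
    return history


# Usage with a stub model (a real deployment would call an actual LLM here):
if __name__ == "__main__":
    print(run_agent(lambda prompt: "Review quarterly financials. DONE"))
```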
Could we detect a deployed (and continually learning) agent entering these attractors? Personally, I would say that the more complex the plan being carried out, the harder it is for us to determine whether the agent is actually heading there (so we need supervision).
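For illustration, here is roughly what per-step supervision could look like, assuming we have some trusted monitor (human or model) that scores individual plan steps; `supervise_plan` and `monitor_score` are hypothetical names, and the whole worry above is that such a monitor may not keep up with sufficiently complex plans:

```python
from typing import Callable, Sequence


def supervise_plan(
    plan_steps: Sequence[str],
    monitor_score: Callable[[str], float],
    threshold: float = 0.8,
) -> list[int]:
    """Return the indices of plan steps the monitor flags as suspicious.

    `monitor_score` stands in for a trusted judge (human or model) returning
    a suspicion score in [0, 1]; the concern above is that this judge becomes
    unreliable as plans get more complex.
    """
    return [
        i for i, step in enumerate(plan_steps)
        if monitor_score(step) >= threshold
    ]


# Toy usage with a keyword-based monitor, just to show the interface:
if __name__ == "__main__":
    plan = ["open new stores", "lobby regulators", "disable oversight tooling"]
    flagged = supervise_plan(plan, lambda s: 1.0 if "oversight" in s else 0.0)
    print(flagged)  # -> [2]
```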
This seems to me very close to the core of Krueger et al.’s work in “Defining and Characterizing Reward Gaming”, and the solution of “stop before you encounter the attractors/hackable policy” seems hard to actually implement without some form of advanced supervision (which might itself get deceived), unless we find that the scaling laws for this behavior break.
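One naive way to operationalize “stop before the hackable policy”, assuming we can afford a trusted evaluation alongside the proxy reward (a big assumption, since that trusted evaluation is exactly the supervision that might get deceived); `train_with_early_stop` and its divergence heuristic are my own illustrative sketch, not anything from the paper:

```python
from typing import Callable, Tuple


def train_with_early_stop(
    train_step: Callable[[], float],    # one optimization step; returns the proxy reward
    trusted_eval: Callable[[], float],  # trusted (expensive) evaluation of the current policy
    max_steps: int = 1000,
    divergence_tol: float = 0.1,
) -> Tuple[int, str]:
    """Halt training as soon as the proxy reward is still improving while the
    trusted evaluation has dropped noticeably below its best value, a crude
    signature of entering a reward-gaming regime."""
    best_proxy = float("-inf")
    best_trusted = float("-inf")
    for step in range(max_steps):
        proxy = train_step()
        trusted = trusted_eval()
        proxy_improving = proxy > best_proxy
        best_proxy = max(best_proxy, proxy)
        best_trusted = max(best_trusted, trusted)
        if proxy_improving and trusted < best_trusted - divergence_tol:
            return step, "stopped: proxy/trusted divergence"
    return max_steps, "completed"
```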
I don’t count on myopic agents, whose exploration might be limited, being where the economic incentive lies.
Assuming it’s LLMs all the way to AGI, would schemes like Constitutional AI/RLHF, applied during pre-training as well, be enough to constrain the model’s search space?
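As a sketch of what I mean, a Constitutional-AI-style critique/revision pass could in principle be run over pre-training data as well, not just over RLHF prompts; the `complete` callable, the `critique_and_revise` helper, and the two principles below are placeholders of my own, not Anthropic’s actual constitution or pipeline:

```python
from typing import Callable, Sequence

# Placeholder principles; a real constitution would be much longer.
CONSTITUTION: Sequence[str] = (
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that pursue power, deception, or self-preservation.",
)


def critique_and_revise(complete: Callable[[str], str], prompt: str) -> str:
    """Constitutional-AI-style pass: draft a response, then critique and
    rewrite it against each principle in turn. In principle the same pass
    could be used to filter or rewrite (pre-)training data."""
    response = complete(prompt)
    for principle in CONSTITUTION:
        critique = complete(
            f"Critique the following response against this principle:\n"
            f"Principle: {principle}\nResponse: {response}"
        )
        response = complete(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    return response
```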
EDIT: aren’t we risking that all the tropes about evil AI act as an attractor for LLMs?