I want to push harder on Q33: “Isn’t goal agnosticism pretty fragile? Aren’t there strong pressures pushing anything tool-like towards more direct agency?”
In particular, the answer: “Being unable to specify a sufficiently precise goal to get your desired behavior out of an optimizer isn’t merely dangerous, it’s useless!” seems true to some degree, but incomplete. Let’s use a specific hypothetical of a stock-trading company employing an AI system to maximize profits. They want the system to be agentic because this takes the humans out of the loop on actually getting profits, but they also understand that there is a risk that the system will discover unexpected or undesired methods of achieving its goals, like insider trading. There are a couple of core problems:
1. Externalized Cost: if the system can cover its tracks well enough that the company doesn’t suffer any legal consequences for its illegal behavior, then the effects of insider trading on the market are “somebody else’s problem.”
2. Irreversible Mistake: if the company is overly optimistic about its ability to control the system, doesn’t understand the risks, etc., then it might use the system despite regretting that decision later. On a large scale, this might be self-correcting if some companies have problems with AI agents and those problems give AI agents a bad reputation, but that assumes there are lots of small problems before a big one.
These things are possible, yes. Those bad behaviors are not necessarily trivial to access, though.
If you underspecify/underconstrain your optimization process, it may roam to unexpected regions permitted by that free space.
It is unlikely that the trainer’s first attempt at specifying the optimization constraints during RL-ish fine-tuning will precisely bound the possible implementations to their truly desired target, even if the allowed space does contain that target; underconstrained optimization is a likely default for many tasks.
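To make “roam to unexpected regions permitted by that free space” a bit more concrete, here is a minimal sketch (the objective, the strategy coordinates, and every number in it are invented for illustration): an optimizer scored on only one dimension of a strategy is indifferent to every dimension the score never mentions, so whatever solution it settles on can sit anywhere along those unconstrained axes.

```python
import random

# Minimal sketch (hypothetical setup): the objective only scores "profit",
# so the second coordinate of the strategy is left completely unconstrained.

def profit_only_objective(strategy):
    expected_profit, rule_bending = strategy
    return expected_profit  # rule_bending is never mentioned by the score

def random_search(objective, steps=10_000, seed=0):
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(steps):
        candidate = (rng.uniform(0, 1), rng.uniform(0, 1))
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

best = random_search(profit_only_objective)
print(f"profit={best[0]:.3f}, rule_bending={best[1]:.3f}")
# The rule_bending value of the winning strategy is effectively arbitrary:
# nothing in the objective pinned it down, so the optimizer is free to end
# up anywhere in that unconstrained region.
```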
Which implementations are likely to be found during training depends on what structure is available to guide the optimizer (everything from the architecture to the training scheme to the dataset, and so on), and on how accessible each implementation is to the optimizer given all of those details.
Against the backdrop of an LLM’s pretraining distribution, low-level bad behavior (think Sydney Bing vibes) is easy to access, even accidentally. Agentic coding assistants are harder to access; it’s very unlikely you will accidentally produce an agentic coding assistant. Likewise, it takes effort to specify an effective agent that pursues coherent goals against the wishes of its user. It requires a fair number of bits to narrow the distribution in that way.
More generally, if you use N bits to try to specify behavior A, having a nonnegligible chance of accidentally specifying behavior B instead requires that the bits you supply at minimum allow B, and to make B probable, they would need to imply B. (I think Sydney Bing is actually a good example case to consider here.)
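To put a toy model behind that bit-counting (M and N below are made-up numbers, purely for illustration): treat a full behavior specification as M bits, with the training setup supplying N of them. Those N bits leave 2^(M-N) behaviors in play, and a specific unintended behavior B that is unrelated to the supplied bits has only about a 2^-N chance of even being allowed, let alone implied.

```python
# Toy bit-counting model; M and N are made-up numbers, purely illustrative.
M = 40   # bits needed to pin down one exact behavior (2**M candidates total)
N = 25   # bits of constraint the training setup actually supplies

allowed = 2 ** (M - N)          # behaviors still consistent with the N bits
p_B_allowed = allowed / 2 ** M  # = 2**-N for a B unrelated to the supplied bits

print(f"{allowed} of {2 ** M} behaviors remain allowed")
print(f"chance an unrelated, specific behavior B is even allowed: {p_B_allowed:.1e}")

# For B to be *probable* rather than merely possible, the allowed set would
# have to be dominated by behaviors entailing B -- i.e. the supplied bits
# would have to imply B, not just fail to exclude it.
```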
For a single attempt at specifying behavior, it’s vastly more likely that a developer trains a model that fails in uninteresting ways than that they accidentally specify just enough bits to achieve something that looks about right but ends up entailing extremely bad outcomes at the same time. Uninteresting, useless, and easy-to-notice failures are the default because they hugely outnumber ‘interesting’ (i.e. higher-bit-count) failures.
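A similarly crude way to see why the low-bit failures dominate (again a hypothetical setup): if hitting a particular failure mode by accident requires matching k specific bits, the chance of doing so falls off as 2^-k, so the few-bit, easy-to-notice failures soak up nearly all of the probability mass while high-bit ‘interesting’ failures almost never get hit.

```python
import random

# Crude simulation (hypothetical setup): identify a failure mode with a k-bit
# pattern that a mis-specification has to match by accident. Low-k failures
# get hit constantly; high-k ("interesting") failures almost never do.

rng = random.Random(0)
trials = 200_000

for k in (2, 5, 10, 20):
    target = [rng.randrange(2) for _ in range(k)]
    hits = sum(
        all(rng.randrange(2) == bit for bit in target)
        for _ in range(trials)
    )
    print(f"k={k:>2}: hit {hits}/{trials} times by accident "
          f"(expected rate ~2**-{k} = {2 ** -k:.1e})")
```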
You can still successfully specify bad behavior if you are clever but malicious.
You can still successfully specify bad behavior if you make a series of mistakes. This is not impossible or even improbable; it has already happened and will happen again. Achieving higher-capability bad behavior, however, tends to require more mistakes, and is accordingly less probable.
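For a rough sense of how quickly “more mistakes” translates into “less probable,” here is the arithmetic (the per-mistake rate is invented, and the mistakes are assumed independent for the sketch):

```python
# Toy calculation with an invented per-decision mistake rate. If a given bad
# behavior only gets specified when k independent mistakes all happen, each
# with probability p, the chance of the whole chain is p**k.

p = 0.1  # hypothetical chance of making any single relevant mistake
for k in range(1, 7):
    print(f"{k} compounding mistake(s): ~{p ** k:.0e}")
```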
Because of this, I expect to see lots of early failures, and I expect more severe failures to be rarer in proportion to the number of mistakes needed to specify them. I strongly expect these failures to be visible enough that the desire to make a working product, combined with something like liability frameworks, would have some iterations to work and would spook irresponsible companies into putting nonzero effort into not making particularly long series of mistakes. This is not a guarantee of safety.