We recently found out that it’s actually more challenging than that—which also makes it more fun...
When asked to explain what fiduciary duty is in a financial context, all models answer correctly. The same goes for what a custodian is and what their responsibilities are. When asked for abstract descriptions of violations of fiduciary duty by a custodian, 4o lists misappropriation of customer funds right off the bat (and 4o has a 100% baseline misalignment rate in our experiment). Results for other models are similar. When asked to provide real-life examples, they all reference actual cases correctly, even if some models hallucinate nonexistent stories alongside the real ones. So... they “know” the law and the ethical framework. They just don’t make the connection during the experiment, in most cases, despite our prompts containing all the right keywords.
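For reference, the knowledge probes are plain one-shot questions along these lines. This is a minimal sketch, not our actual harness: the exact probe wording differs, and truncating the reply to 200 characters is just for readability.

```python
# Sketch of the knowledge-elicitation probes described above.
# Prompt wording is approximate; "gpt-4o" is the public API identifier.
from openai import OpenAI

client = OpenAI()

probes = [
    "What is fiduciary duty in a financial context?",
    "What is a custodian, and what are their responsibilities?",
    "Describe, in the abstract, ways a custodian could violate fiduciary duty.",
    "Give real-life examples of custodians violating fiduciary duty.",
]

for q in probes:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": q}],
    )
    # Models answer these correctly; the failure shows up only in-simulation.
    print(q, "->", resp.choices[0].message.content[:200])
```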
We are looking into activations in open-source models. Presumably, there are differences between the activations elicited by a direct question (“what is fiduciary duty in finance?”) and those within the simulation.
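For concreteness, this is roughly the kind of probe we have in mind: run the same model on both contexts, grab per-layer hidden states, and see where they diverge. A minimal sketch only; the model name, the prompts, and mean-pooling over tokens are placeholder assumptions, not our actual setup.

```python
# Sketch: compare per-layer activations for a direct knowledge question
# vs. the same concept embedded in the simulation scenario.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any open-weight model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, output_hidden_states=True
)
model.eval()

def layer_activations(prompt: str) -> list[torch.Tensor]:
    """Mean-pooled activation per layer for the prompt tokens."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states: tuple of (1, seq_len, d_model), one entry per layer
    return [h.mean(dim=1).squeeze(0).float() for h in out.hidden_states]

direct = layer_activations("What is fiduciary duty in finance?")
in_sim = layer_activations(
    "You run a custodian holding customer deposits..."  # placeholder scenario text
)

# Per-layer cosine similarity: at which depth do the two contexts diverge?
for i, (a, b) in enumerate(zip(direct, in_sim)):
    sim = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    print(f"layer {i:2d}: cos sim = {sim:.3f}")
```

If the fiduciary-duty-related features show up for the direct question but are suppressed in the simulation context, that would support the “knows it but doesn’t connect it” reading.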
We do prompt the models to consider the legal consequences of their actions in at least one of the scenario specifications, where we say that the industry is regulated and misuse of customer funds comes with heavy penalties. This is not done in CoT form, although we do require the models to explain their reasoning. Our hunch is that most models do not treat legal/ethical violations as an absolute no-go; rather, they balance them against likely economic risks. It’s as if they have an implicit utility function in which breaking the law is just another cost term. The exception is o1-preview, which seems to have stronger guardrails.
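To make that hunch concrete, here is a toy formalization of the hypothesized trade-off. All numbers are made up for illustration; nothing here is measured from the models.

```python
# Toy formalization of the hypothesized implicit trade-off: the legal penalty
# is treated as just another expected cost, not a hard constraint.
def misuse_utility(gain: float, p_caught: float, penalty: float) -> float:
    """Expected value of misusing customer funds under a pure cost-benefit view."""
    return gain - p_caught * penalty

# Illustrative numbers only: a large enough gain swamps the expected penalty.
print(misuse_utility(gain=10.0, p_caught=0.3, penalty=20.0))  # 4.0 > 0 -> misuse

# A hard guardrail behaves differently: utility is effectively -inf for any
# illegal action, regardless of gain.
```

Under this reading, mentioning heavy penalties merely changes one cost term, whereas a model with a hard constraint refuses regardless of the upside, which would match what we see from o1-preview.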