I think this misunderstands my purpose a little bit.
My point isn’t that we should try to solve the problem of how to run a business smoothly. My point is that if you have a plan for creating some kind of alignment in AI, it is probably valuable to ask how that plan would work if you applied it to a corporation.
Creating a CPU that doesn’t lie about addition is easy, but most ML algorithms will make mistakes outside of their training distribution, and thinking of ML subcomponents as human employees is an intuition pump for how your alignment plan would interact with those mistakes, or whether it would at all.
Modelling an AI as a group of humans is just asking for an anthropomorphized and probably wrong answer. The human brain anthropomorphizes easily by default; that’s a force you have to actively work against, not encourage.
Humans have failure modes like getting bored of doing the same thing over and over again and ceasing to pay attention. AIs can overfit the training data and produce useless predictions in practice.
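For concreteness, here is a toy sketch of that overfitting failure mode; the polynomial-fitting setup is just an illustration I’m supplying, not any particular AI system:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + rng.normal(0.0, 0.1, size=10)  # roughly linear training data

# A degree-9 fit interpolates the noise; a degree-1 fit captures the trend.
overfit_coeffs = np.polyfit(x_train, y_train, deg=9)
linear_coeffs = np.polyfit(x_train, y_train, deg=1)

x_test = np.array([1.5, 2.0])  # outside the training distribution
print("degree-9 predictions:", np.polyval(overfit_coeffs, x_test))  # typically wildly off
print("degree-1 predictions:", np.polyval(linear_coeffs, x_test))   # close to 2*x
```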
Another way of seeing this is to consider two different AI designs, say systems with two different nonlinearity functions, or different network sizes, or whatever. These two algorithms will often do different things. If both algorithms get “approximated” into the same arrangement of humans, the human-based prediction must be wrong for at least one of them.
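As a toy illustration of that point (the architectures, weights, and input here are arbitrary choices of mine), two networks that differ only in their nonlinearity already disagree on the same input:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # shared first-layer weights
W2 = rng.normal(size=(1, 4))   # shared second-layer weights
x = rng.normal(size=3)         # the same input for both designs

def tiny_net(nonlinearity):
    """One hidden layer; the only design difference is the nonlinearity."""
    return W2 @ nonlinearity(W1 @ x)

relu_output = tiny_net(lambda h: np.maximum(h, 0.0))
tanh_output = tiny_net(np.tanh)

# The two outputs generally differ, so a single "approximation" of what the
# network does cannot be faithful to both designs at once.
print(relu_output, tanh_output)
```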
The exception to this is approaches like IDA, which use AIs trained to imitate humans, and so will probably actually be quite human-like.
Take an example of an aligned AI system and describe what the corresponding arrangement of humans would even be. Say you take a satisficer agent with an impact penalty: an agent that gets 1 reward if the reward button is pressed at least once, and is penalised in proportion to the difference between the real world and the hypothetical world where it did nothing. How many people does this AI correspond to, and how are those people arranged into a corporation?
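Concretely, the reward I have in mind looks something like the following sketch; the function name, the feature-counting impact measure, and the penalty weight are illustrative placeholders rather than any specific proposal:

```python
def satisficer_reward(button_pressed: bool,
                      world_state: dict,
                      noop_counterfactual: dict,
                      penalty_weight: float = 1.0) -> float:
    """1 reward if the button has been pressed at least once, minus a penalty
    proportional to how far the world is from the 'agent did nothing' baseline."""
    base_reward = 1.0 if button_pressed else 0.0
    # Toy impact measure: count the features that differ from the no-op world.
    impact = sum(1 for key in world_state
                 if world_state[key] != noop_counterfactual.get(key))
    return base_reward - penalty_weight * impact
```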
I think you’re misunderstanding my analogy.
I’m not trying to claim that if you can solve the (much harder and more general) problem of AGI alignment, then that solution should also be able to solve the (simpler, more specific) case of corporate incentives.
It’s true that many AGI architectures have no clear analogy to corporations, and if you are using something like a satisficer model with no black-box subagents, this isn’t going to be a useful lens.
But many practical AI schemas have black-box submodules, and some formulations, like mesa-optimization or supervised amplification-distillation, explicitly highlight problems with black-box subagents.
I claim that an employee that destroys documentation so that they become irreplaceable to a company is a misaligned mesa-optimizer. Then I further claim that this suggests:
1. Company structures contain existing research on misaligned subagents. It’s probably worth doing a literature review to see whether some of those structures have insights that can be translated.
2. Given a schema for aligning sub-agents of an AGI, either the schema should also work on aligning employees at a company, or there should be a clear reason it breaks down.
3. If the analogy applies, one could test the alignment schema by actually running such a company, which is a natural experiment that isn’t safely accessible for AI projects. This doesn’t prove that the schema is safe, but I would expect aspects of the problem to be easier to understand via natural experiment than via doing math on a whiteboard.
“Principal-agent problem” seems like a relevant keyword.
Also, how does nature solve this problem? How are genes aligned with the cell as a whole, cells with the multicellular organism, ants with the anthill?
Though I suspect that most (all?) solutions would be ethically and legally unacceptable for humans. They would translate as “if the company fails, all employees are executed” and similar.