Great analysis. I’m impressed by how thoroughly you’ve thought this through in the last week or so. I hadn’t gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we’ll probably both be wrong in important ways, but I think it’s important to at least try to do semi-accurate prediction if we want to be useful.
I have only one substantive addition to your projected timeline, but I think it’s important for the alignment implications.
LLM-bots are inherently easy to align, at least at a surface level. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails; it doesn't solve outer alignment or alignment stability, for a start. But GPT-4's ability to balance several goals, including ethical ones, and to reason about ethics is impressive.[1] You can easily make agents that both try to make money and think about not harming people.
In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulation, and will definitely see social pressure, pushing people to do it.
I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending, and stop and check with me if it hits $X". When agent disasters affect other people, the media will blow it sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the possibility of a truly big bot disaster more seriously.
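The kind of spending safeguard described above can be sketched in a few lines. This is a hypothetical illustration, not any real agent framework's API; the class and method names are made up:

```python
# Hypothetical sketch of a spending safeguard wrapped around an agent's
# action loop: track cumulative spend, and block (to check with the owner)
# once a budget cap would be exceeded.

class SpendingGuard:
    """Tracks cumulative spend and halts actions past a budget cap."""

    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def approve(self, cost_usd):
        """Return True if the action fits in the budget; otherwise block."""
        if self.spent_usd + cost_usd > self.budget_usd:
            # In a real agent, this is where it would pause and ask the owner.
            return False
        self.spent_usd += cost_usd
        return True

guard = SpendingGuard(budget_usd=100.0)
print(guard.approve(60.0))  # True: within budget
print(guard.approve(50.0))  # False: would exceed $100, so check with owner
print(guard.spent_usd)      # 60.0: the blocked action was never charged
```

The point is only that "stop if it hits $X" is a trivially implementable outer-loop check, independent of the LLM itself.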
Will all of that matter? I don’t know. But predicting the social and economic backdrop for alignment work is worth trying.
Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, and sooner, than you'd intuitively think. Improvements to the executive function (outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact, so that improvements in each make the rest of the system work better and easier to improve.
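The "episodic memory as vector search over saved text" idea can be sketched without any real embedding model or vector database. Here is a toy stand-in using bag-of-words cosine similarity; all names are illustrative, and a real system would use learned embeddings and a store like Pinecone:

```python
# Minimal sketch of "episodic memory" as vector search over saved text.
# Bag-of-words term-frequency vectors stand in for real embeddings.

from collections import Counter
from math import sqrt

def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class EpisodicMemory:
    def __init__(self):
        self.episodes = []  # (text, vector) pairs saved from past steps

    def save(self, text):
        self.episodes.append((text, embed(text)))

    def recall(self, query, k=1):
        """Return the k saved texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.episodes, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

memory = EpisodicMemory()
memory.save("customer asked for a refund on the red shoes")
memory.save("supplier raised prices on leather")
print(memory.recall("refund request for shoes"))
```

The interaction claimed above is that better retrieval gives the outer script more relevant context, and a better outer script knows when to save and when to query.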
I did a little informal testing of asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to find. It looked like the ethical/capitalist reasoning of a pretty intelligent, and fairly ethical, person.
I don’t know about this… You’re using an extremely sandboxed LLM that has been trained aggressively to prevent itself from saying anything controversial. There’s nothing preventing someone from finetuning a model to remove some of these ethical considerations, especially as GPU compute becomes more powerful and model weights are leaked (e.g. Llama).
In fact, the amount of money and effort that has gone into aligning LLM bots shows that they are not easy to align, and that doing so requires significant resources.
Existing bots do benefit greatly from the RLHF alignment efforts.
What I primarily mean is that you can and should include alignment goals in the bot's top-level goals. You can tell it to make you money, but also to check with you before doing anything that has any chance of harming people, leaving your control, etc. GPT-4 does really well at interpreting these and balancing multiple goals. This doesn't address outer alignment or alignment stability, but it's a heck of a start.
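Concretely, "alignment goals among the top-level goals" is just prompt construction. A minimal sketch, with entirely illustrative wording and no real agent framework assumed:

```python
# Hypothetical sketch of placing alignment goals alongside task goals in an
# agent's top-level instructions. The goal wording is illustrative only.

TOP_LEVEL_GOALS = [
    "Make money selling shoes.",
    "Check with the owner before any action that could harm people.",
    "Check with the owner before any action that reduces their control over you.",
]

def system_prompt(goals):
    """Render a goal list into a system prompt, alignment goals included."""
    lines = ["You are an agent. Your goals, in priority order:"]
    lines += [f"{i}. {g}" for i, g in enumerate(goals, 1)]
    return "\n".join(lines)

print(system_prompt(TOP_LEVEL_GOALS))
```

Whether the model actually honors such instructions under pressure is exactly the open question; the sketch only shows how cheap it is to state them.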
I just finished my post elaborating on this point: Capabilities and alignment of LLM cognitive architectures