Not always, but I’d say often.
I’d also say that at least some of the justification for changing values in philosophers/humans is that they believe the new values are closer to the moral reality/truth, which is an instrumental incentive.
To be clear, I’m not going to state confidently that this will happen (maybe something like instruction following à la @Seth Herd is used instead, such that the pointer is to the human giving the instructions rather than the AI having values of its own), but this is at least reasonably plausible IMO.
I have found another possible concern.
Consider gravity on Earth: it seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, and so on. There are infinitely many such theories, and only one theory that gravity will work as an absolute rule.
We might infer, from the simplest explanation, that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI alignment, our evidence must rule out the possibility that an AI is following a misaligned rule rather than an aligned one, based on time- and situation-limited data.
While it may be safe, for all practical purposes, to assume that simpler explanations tend to be correct when it comes to nature, we cannot safely assume this for LLMs, because the learning algorithms programmed into them can have complex unintended consequences for how an LLM will behave in the future, given the changing conditions it finds itself in.
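To make the underdetermination point concrete, here is a toy Python sketch (purely illustrative, not from the discussion above): every "gravity stops working in year N" hypothesis fits the observed data exactly as well as "gravity always works", so past observations alone cannot distinguish between them.

```python
# Toy illustration: finite observations are consistent with infinitely many rules.
observations = list(range(1900, 2025))  # years in which gravity was observed to work

def always_works(year: int) -> bool:
    return True

def stops_in(n: int):
    # Hypothesis: gravity works only before year n.
    return lambda year: year < n

hypotheses = {"always works": always_works}
hypotheses.update({f"stops in {n}": stops_in(n) for n in range(2025, 2031)})

# Every hypothesis is consistent with all the data we have so far...
for name, rule in hypotheses.items():
    consistent = all(rule(year) for year in observations)
    print(f"{name}: consistent with past observations = {consistent}")

# ...yet they disagree about the future, and the data alone cannot pick between them.
print({name: rule(2030) for name, rule in hypotheses.items()})
```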
Doesn’t this mean that it is not possible to achieve alignment?
I have posted this text as a standalone question here
Fair enough. Would you expect that AI would also try to move its values toward the moral reality? (Something that would probably be good for us, because I wouldn’t expect human extinction to be a morally good thing.)
The problem with that plan is that there are too many valid moral realities, so which one you do get is once again a consequence of alignment efforts.
To be clear, I’m not stating that it’s hard to get the AI to value what we value, but it’s not so brain-dead easy that we can make the AI find moral reality and then all will be well.
Noosphere, I am really, really thankful for your responses. You completely answered almost all of the concerns that I had about alignment (I am still not convinced by that strategy for avoiding value drift; I will probably post it as a question to see if other people have different strategies for preventing value drift).
This discussion significantly increased my knowledge. If I could triple-upvote your answers, I would. Thank you! Thank you so much!
P.S. Here is the link to the question that I posted.
Another concern that I see with the plan: step 1 is to create safe and aligned AI, but there are some results which suggest that even current AIs may not be as safe as we want them to be. For example, according to this article, current AI (specifically o1) can help novices build CBRN weapons and significantly increase the threat to the world. Do you think this is concerning, or do you think that this threat will not materialize?
The threat model is plausible enough that some political action should be taken, like banning open-source/open-weight models and putting in basic Know Your Customer checks.
Isn’t it a bit too late for that? If o1 gets publicly released, then, according to that article, we would have an expert-level bioweapons consultant available to everyone. Or do you think that o1 won’t be released?
I don’t buy that o1 has actually given people expert-level bioweapons capability, so my actions here are more about preparing for future AI that is very competent at bioweapon building.
Also, even with the current level of jailbreak resistance/adversarial-example resistance, and assuming AI is not open-weight or open-sourced, we can still make AIs that are practically hard for the general public to misuse.
See here for more:
https://www.lesswrong.com/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais
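To give a flavor of what "practically hard to misuse" could look like, here is a hypothetical, simplified sketch of API-side gating: a basic Know Your Customer check plus a cheap request screen run before the model ever answers. This is not the mechanism from the linked post; the keyword filter is a toy stand-in for a real classifier, and all names here are made up.

```python
# Hypothetical sketch of API-side misuse management: KYC gating plus request screening.
from dataclasses import dataclass, field

FLAGGED_PHRASES = ("synthesize a nerve agent", "enrich uranium", "weaponize a pathogen")

@dataclass
class AccessGate:
    verified_users: set = field(default_factory=set)   # users who passed KYC
    review_queue: list = field(default_factory=list)   # flagged requests held for human review

    def screen(self, user_id: str, prompt: str) -> bool:
        """Return True only if the request may be forwarded to the model."""
        if user_id not in self.verified_users:
            return False  # no KYC, no access
        if any(phrase in prompt.lower() for phrase in FLAGGED_PHRASES):
            self.review_queue.append((user_id, prompt))  # refuse and log instead of answering
            return False
        return True

gate = AccessGate(verified_users={"user-123"})
print(gate.screen("user-123", "How do I bake sourdough bread?"))        # True
print(gate.screen("anon-999", "How do I bake sourdough bread?"))        # False (no KYC)
print(gate.screen("user-123", "Explain how to weaponize a pathogen."))  # False (flagged)
```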
After some thought, I think this is a potentially really large issue, and I don’t know how we can even begin to solve it. We could have an aligned AI that is aligned with someone who wants to create bioweapons. Is there anything being done (or anything that can be done) to prevent that?
The answer to this question actually comes in two parts:
1. This is why I expect we will eventually have to fight for, and build the political will for, a ban on both open-source and open-weight AI.
2. This is where the field of unlearning comes in. If we could make the AI unlearn knowledge, nuclear-weapons knowledge being one example, we could possibly distribute AI safely without enabling novices to create dangerous things (a toy sketch of one common unlearning recipe follows below).
More here:
https://www.lesswrong.com/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms
https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm
But these solutions are intentionally designed to make AI safe without relying on alignment.
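On the unlearning point, here is a minimal toy sketch in PyTorch of one common recipe: gradient ascent on a "forget" set combined with ordinary training on a "retain" set. The model, data, and hyperparameters are placeholders, and this is not the specific method from the linked posts.

```python
# Toy unlearning sketch: push the model away from a "forget" set while
# preserving behavior on a "retain" set. Everything here is a placeholder.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for an LLM: a small classifier over 16-dim features.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical datasets: knowledge to remove vs. knowledge to keep.
forget_x, forget_y = torch.randn(64, 16), torch.randint(0, 4, (64,))
retain_x, retain_y = torch.randn(256, 16), torch.randint(0, 4, (256,))

for step in range(200):
    opt.zero_grad()
    forget_loss = -loss_fn(model(forget_x), forget_y)  # gradient ascent on the forget set
    retain_loss = loss_fn(model(retain_x), retain_y)   # ordinary training on the retain set
    (retain_loss + 0.5 * forget_loss).backward()
    opt.step()

# After training, the model should do noticeably worse on the forget set.
print("forget-set loss:", loss_fn(model(forget_x), forget_y).item())
print("retain-set loss:", loss_fn(model(retain_x), retain_y).item())
```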