Consider gravity on Earth, it seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, etc. There are infinite such theories and only one theory that gravity will work as an absolute rule.
We might infer from the simplest explaination that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI
alignment, our evidence must rule out whether an AI is following a misaligned rule compared to an
aligned rule based on time-and situation-limited data.
While it may be safe, for all practical purposes, to
assume that simpler explanations tend to be correct when it comes to nature, we cannot safely assume this for LLMs—for the reason that the
learning algorithms that are programmed into them can have complex unintended consequences for
how the LLM will behave in the future, given the changing conditions an LLM finds itself in.
Doesn’t this mean that it is not possible to achieve alignment?
I have found another possible concern of mine.
Consider gravity on Earth, it seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, etc. There are infinite such theories and only one theory that gravity will work as an absolute rule.
We might infer from the simplest explaination that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI alignment, our evidence must rule out whether an AI is following a misaligned rule compared to an aligned rule based on time-and situation-limited data.
While it may be safe, for all practical purposes, to assume that simpler explanations tend to be correct when it comes to nature, we cannot safely assume this for LLMs—for the reason that the learning algorithms that are programmed into them can have complex unintended consequences for how the LLM will behave in the future, given the changing conditions an LLM finds itself in.
Doesn’t this mean that it is not possible to achieve alignment?
I have posted this text as a standalone question here