To the extent that I understand your position, it's that sharing a lot of values doesn't automatically imply that an AI is safe/non-dystopian by your values if built, rather than that alignment to someone's values is hard/impossible (note that when I say a model is aligned, I always mean aligned to one person's values).
I also dislike the terminology, and I actually agree that alignment is not the same as safety. This is probably one of my disagreements with a lot of LWers: I don't think alignment automatically makes things better (in fact, better alignment can make things worse).
For example, it does not rule out a scenario where the species doesn't literally go extinct, but lots of humans die because the economic incentives against stealing and violence fall apart as humans become effectively worthless on the market:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
Yes, with the caveat that I'm not thereby saying it's easy to align to even one person's values.
Fair enough.
I admittedly agree with you on a lot, and that's despite thinking we can make machines that follow orders / are intent-aligned à la Seth Herd's definition:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than