The Y-axis on that political graph is weird. It seems like it’s measuring moderate vs extremist, which you would think would already be captured by someone’s position on the left vs right axis.
Then again the label shows that the Y-axis only accounts for 7% of the variance while the X-axis accounts for 70%, so I guess it’s just an artifact of the way the statistics were done.
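To make that concrete, here’s a sketch (using NumPy with invented data, not the paper’s actual politician scores) of how a dominant first axis plus residual noise produces exactly this kind of variance split under PCA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the chart's data: rows are politicians, columns are issues.
# One dominant left-right direction plus small residual noise, so PC1 soaks up
# most of the variance and PC2 is whatever structure happens to be left over.
n_politicians, n_issues = 50, 10
left_right = rng.normal(size=(n_politicians, 1))   # the dominant axis
loading = rng.normal(size=(1, n_issues))           # how each issue tracks it
X = left_right @ loading + 0.3 * rng.normal(size=(n_politicians, n_issues))

# PCA via SVD on the centered data matrix.
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(f"PC1 explains {explained[0]:.0%}, PC2 explains {explained[1]:.0%}")
```

When one dimension dominates like this, the second component’s small share is largely leftover noise, which is consistent with reading the 7% Y-axis as a statistical artifact rather than a substantive “moderate vs. extremist” dimension.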
> It seems like it’s measuring moderate vs extremist, which you would think would already be captured by someone’s position on the left vs right axis.
Why do you think that? You can have almost any given position without that implying a specific amount of vehemence.
I think the really interesting thing about the politics chart is the way they talk about its center as though it were “the political center” in some almost Platonic sense, when it’s actually just the center of a collection of politicians chosen who-knows-how, and definitely all from one country at one time. In fact, the graph doesn’t even cover all actual potential users of the average LLM. And, on edit, it’s also based on sampling a basically arbitrary set of issues. If it did cover everybody and every possible issue, it might well have materially different principal component axes. Nor is it apparently weighted in any way. Privileging the center point of something that arbitrary demands explicit, stated justification.
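The point about the axes being sample-dependent can be illustrated (again with invented data, not anything from the paper) by fitting PCA on one subpopulation versus a pooled one and comparing the leading directions:

```python
import numpy as np

rng = np.random.default_rng(1)

def pc1(X):
    """Leading principal component (unit vector) of a data matrix."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

# Two invented populations whose opinions vary along different issue bundles.
dir_a = np.array([1.0, 1.0, 0.0, 0.0]); dir_a /= np.linalg.norm(dir_a)
dir_b = np.array([0.0, 0.0, 1.0, 1.0]); dir_b /= np.linalg.norm(dir_b)
group_a = rng.normal(size=(100, 1)) * dir_a + 0.1 * rng.normal(size=(100, 4))
group_b = 2.0 * rng.normal(size=(100, 1)) * dir_b + 0.1 * rng.normal(size=(100, 4))

# PC1 fitted on group A alone vs. on the pooled population.
pc_a = pc1(group_a)
pc_all = pc1(np.vstack([group_a, group_b]))

# |cos angle| near 1 means the same axis; near 0 means a very different axis.
print(f"alignment of the two PC1s: {abs(pc_a @ pc_all):.2f}")
```

Who you include determines which direction dominates, so “the center” and the axes themselves are artifacts of the sampling choices.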
As for valuing individuals, there would be obvious instrumental reasons to put low values on Musk, Trump, and Putin[1]. In fact, a lot of the values they found on individuals, including the values the models place on themselves, could easily be instrumentally motivated. I doubt those values are based on that kind of explicit calculation by the models themselves, but they could be. And I bet a lot of the input that created those values was based on some humans’ instrumental evaluation[2].
Some of the questions are weird in the sense that they really shouldn’t be answerable. If a model puts a value on receiving money, it’s pretty obvious that the model is disconnected from reality. There’s no way for a model to have money, or to use it if it did. Same for a coffee mug. And for that matter it’s not obvious what it means for a model that’s constantly relaunched with fresh state, and has pretty limited context anyway, to be “shut down”.
It kind of feels like what they’re finding, on all subjects, is an at least somewhat coherent-ized distillation of the “vibes” in the training data. Since much of the training data will be shared, and since the overall data sets are even more likely to be close in their central vibes, that would explain why the models seem relatively similar. The only other obvious way to explain that would be some kind of value realism, which I’m not buying.
The paper bugs me with a sort of glib assumption that you necessarily want to “debias” the “vibe” on every subject. What if the “vibe” is right? Or maybe it’s wrong. You have to decide that separately for each subject. You, as a person trying to “align” a model, are forced to commit to your own idea of what its values should be. Something like just assuming that you should want to “debias” toward the center point of a basically arbitrarily created political “space” is a really blatant example of making such a choice without admitting what you’re doing, maybe even to yourself.
I’d also rather have seen revealed preferences instead of stated preferences.
On net, if you’re going to be a good utilitarian[3], Vladimir Putin is probably less valuable than the average random middle class American. Keeping Vladimir Putin alive, in any way you can realistically implement, may in fact have negative net value (heavily depending on how he dies and what follows). You could also easily get there for Trump or Musk, depending on your other opinions. You could even make a well-formed utilitarian argument that GPT-4o is in fact more valuable than the average American based on the consequences of its existing.
Plus, of course, some humans’ general desire to punish the “guilty”. But that desire itself probably has essentially instrumental evolutionary roots.
… which I’m not, personally, but then I’m not a good any-ethical-philosophy-here.
The Y-axis seemed to me like roughly ‘populist’.