I found it hard to engage with this, partially because of motivated reasoning: I think my prior beliefs, which expect very intelligent and coherent AGIs, are correct. Overcoming this bias is hard, and I sometimes benefit from a clean presentation of solid experimental results, which this study lacked, making it extra hard for me to engage with. Below are some of my messy thoughts, with the huge caveat that I have yet to engage with these ideas from as neutral a prior as I would like.
This is an interesting study to conduct. I don’t think its results, regardless of what they are, should update anybody much because:
The study asks a small set (n = 5-6) of ~random people to rank various entities based on vague definitions of intelligence and coherence; we shouldn’t expect a process like this to provide strong evidence for any conclusion.
Asking people to rate items on coherence and intelligence feels pretty different from carefully thinking about the properties of each item. I would rather see the author pick a few items from around the spectrums and analyze each in depth (though if that’s all they did, I would be complaining about the lack of more items lol).
A priori I don’t necessarily expect a strong intelligence-coherence correlation at lower levels of intelligence, and I don’t think findings of this type are very useful for thinking about superintelligence. Convergent instrumental subgoals are not a thing for sufficiently unintelligent agents, and I expect coherence as such a subgoal to kick in at fairly high intelligence levels (definitely above average human, at least based on observing many humans be totally incoherent, which isn’t exactly a priori reasoning). I dislike that this is the state of my belief, because it pretty much amounts to “no experiment you can run right now would get at the crux of my beliefs,” but I do think it’s the case here that we can only learn so much from observing non-superintelligent systems.
The terms, especially “coherence”, seem pretty poorly defined in a way that really hurts the usefulness of the study, as some commenters have pointed out
My takes about the results
Yep, the main effect seems to be a negative relationship between intelligence and coherence.
The poor inter-rater reliability for coherence seems like a big deal
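For what it’s worth, a concordance statistic would make “poor inter-rater reliability” concrete. Below is a minimal Python sketch of Kendall’s W over a set of complete rankings; the data is randomly generated for illustration, not taken from the study.

```python
import numpy as np

# Minimal sketch of Kendall's W, assuming complete rankings with no ties.
# rankings: (m raters, n items) array, each row a permutation of 1..n.
def kendalls_w(rankings):
    m, n = rankings.shape
    rank_sums = rankings.sum(axis=0)                 # total rank each item received
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()  # spread of the rank sums
    return 12 * s / (m**2 * (n**3 - n))              # 0 = no agreement, 1 = perfect

rng = np.random.default_rng(0)
# Five raters shuffling 60 items independently: W comes out near 0.
random_ranks = np.array([rng.permutation(60) + 1 for _ in range(5)])
print(kendalls_w(random_ranks))
# Five raters in perfect agreement: W is exactly 1.
identical_ranks = np.tile(np.arange(1, 61), (5, 1))
print(kendalls_w(identical_ranks))
```

If W for the coherence rankings came out much lower than for the intelligence rankings, that would be a decent signal that “coherence” wasn’t communicating a shared concept to the raters.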
Figure 6 (the most important image) seems really wacky. It seems to imply that all the AI models are ranked, on average, lower in intelligence than everything else, including the oak tree and the ant. This seems wild because I think most people who interact with AI would disagree with such rankings. Out of 5 respondents, only 2 ranked GPT-3 as more intelligent than an oak tree.
Isolating just the humans (a plot for this is not shown for some reason) seems like it doesn’t support the author’s hypothesis very much. I think this is in line with a prediction like “for low levels of intelligence there is not a strong relationship to coherence, and then as you get into the high human range this changes.”
Other thoughts
The spreadsheet sheets are labeled “alignment subject response” and “intelligence subject response”. Alignment is definitely not the same as coherence. I mostly trust that the author isn’t pulling a fast one on us by fudging the original purpose of the study or the directions they gave to participants, but my prior trust in people on the internet is somewhat low, and the experimental design here does seem pretty janky.
Figures 3-5, as far as I can tell, use only each item’s rank relative to the other items in the same graph, as opposed to its rank among all 60. This threw me off for a bit, and I think it might not be a great analysis, given that participants ranked all 60 together rather than in category batches.
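To illustrate with made-up numbers (not the study’s data): re-ranking within a category throws away where items sit among all 60, which can substantially change what a plot or statistic shows.

```python
import numpy as np

# Hypothetical global ranks (among all 60 items) for four items in one category.
intel_global = np.array([1, 2, 3, 60])
coher_global = np.array([2, 1, 3, 59])

# Re-ranking within the category collapses these to 1..4 (no ties assumed).
intel_local = intel_global.argsort().argsort() + 1   # -> [1, 2, 3, 4]
coher_local = coher_global.argsort().argsort() + 1   # -> [2, 1, 3, 4]

# The item sitting near the bottom of the global ranking dominates one
# correlation but not the other.
print(np.corrcoef(intel_global, coher_global)[0, 1])  # ~0.9996
print(np.corrcoef(intel_local, coher_local)[0, 1])    # 0.8
```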
Something something inner optimizers and inner agent conflicts might imply a lack of coherence in superintelligent systems, but systems like this still seem quite dangerous.