For instance, my current best model of Alex Turner's view is something like: “well, maybe some of the AI’s internal cognition would end up structured around the intended concept of happiness, AND inner misalignment would go in our favor, such that the AI’s internal search/planning and/or behavioral heuristics would also happen to end up pointed at the intended ‘happiness’ concept rather than at ‘happy’/‘unhappy’ labels or some alien concept”. That would be the easiest version of the “Alignment by Default” story.
I always get the impression that Alex Turner and his associates are just imagining much weaker optimization processes than Eliezer or I (or probably you as well) are. Alex Turner’s arguments make a lot of sense to me if I condition on some ChatGPT-like training setup (imitation learning + action-level RLHF), but not if I condition on the negation (e.g. brain-like AGI, or sufficiently smart scaffolding to identify lots of new useful information and integrate it, or …).