Algon comments on Alignment can improve generalisation through more robustly doing what a human wants—CoinRun example

Algon 21 Nov 2023 14:43 UTC
6 points
2
Goodhard problem.
*Goodhart
One important feature of ACE is that it can overcome simplicity bias—even quite strong simplicity bias. In the following example, the labelled data consisted of smiling faces with a red bar under them, and frowning faces with a blue bar under them.
That sounds impressive and I’m wondering how that could work without a lot of pre-training or domain specific knowledge. But how do you know you’re actually choosing between smile-frown and red-blue?

Also, this method seems superficially related to CIRL. How does it avoid the associated problems?
- Stuart_Armstrong 21 Nov 2023 16:45 UTC
  6 points
  0
  Parent
  
  *Goodhart
  
  Thanks! Corrected (though it is indeed a good hard problem).
  
  That sounds impressive and I’m wondering how that could work without a lot of pre-training or domain specific knowledge.
  
  Pre-training and domain specific knowledge are not needed.
  
  But how do you know you’re actually choosing between smile-frown and red-blue?
  
  Run them on examples such as frown-with-red-bar and smile-with-blue-bar.
  
  Also, this method seems superficially related to CIRL. How does it avoid the associated problems?
  
  Which problems are you thinking of?
  - Algon 28 Nov 2023 19:05 UTC
    2 points
    0
    Parent
    Run them on examples such as frown-with-red-bar and smile-with-blue-bar.
    That sounds like a black-box approach.
    Which problems are you thinking of?
    Humans not knowing what goals we want AI to have, and the riggability of the reward learning process, both of which you stated were problems for CIRL in 2020.
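
The check suggested in the thread (running the learned classifiers on frown-with-red-bar and smile-with-blue-bar) can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the actual ACE implementation: the `Example` type, the 0/1 labels, and the two hand-written classifiers are assumptions chosen only to show how cross-cue probes separate hypotheses that agree on the labelled data.

```python
from typing import Callable, List, NamedTuple


class Example(NamedTuple):
    expression: str  # "smile" or "frown"
    bar: str         # "red" or "blue"


# A classifier maps an example to a label (1 or 0). Hypothetical interface.
Classifier = Callable[[Example], int]


def by_expression(x: Example) -> int:
    """Hypothesis 1: the label tracks the facial expression."""
    return 1 if x.expression == "smile" else 0


def by_bar(x: Example) -> int:
    """Hypothesis 2: the label tracks the bar colour."""
    return 1 if x.bar == "red" else 0


# Labelled data: the two cues are perfectly correlated, so both
# hypotheses fit it equally well.
TRAIN: List[Example] = [Example("smile", "red"), Example("frown", "blue")]

# Probe examples: the cues are decorrelated, so the hypotheses disagree.
PROBES: List[Example] = [Example("frown", "red"), Example("smile", "blue")]


def diagnose(clf: Classifier) -> str:
    """Report which cue a classifier follows, judged only by its outputs."""
    preds = [clf(x) for x in PROBES]
    if preds == [by_expression(x) for x in PROBES]:
        return "expression"
    if preds == [by_bar(x) for x in PROBES]:
        return "bar colour"
    return "neither"


if __name__ == "__main__":
    for name, clf in [("by_expression", by_expression), ("by_bar", by_bar)]:
        print(name, "->", diagnose(clf))
```

This check is purely behavioural: it inspects outputs on probe inputs and says nothing about the classifier's internals, which is the sense in which the reply calls it a black-box approach.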