Here, I think we’ll want to look for suspicious changes in the log-likelihood trends. E.g., it’s a red flag if we see the log-likelihood of some scary behavior increase steadily with model scale, but then the trend reverses at some scale.
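To make the red flag concrete, here is a rough sketch of how one might scan for that kind of reversal. It assumes we already have one log-likelihood per model for the behavior, ordered by increasing scale. The function name, the `min_rising_steps` and `tolerance` parameters, and the numbers in the example are all hypothetical choices of mine for illustration, not an established method.

```python
from typing import Optional, Sequence


def find_trend_reversal(
    log_likelihoods: Sequence[float],
    min_rising_steps: int = 2,
    tolerance: float = 0.0,
) -> Optional[int]:
    """Return the index of the first model where a rising trend reverses.

    `log_likelihoods[i]` is the log-likelihood that the i-th model (ordered
    by increasing scale) assigns to the behavior of interest. A reversal is
    flagged when at least `min_rising_steps` increases are followed by a
    decrease larger than `tolerance`. Returns None if no reversal is found.
    """
    rising_run = 0
    for i in range(1, len(log_likelihoods)):
        delta = log_likelihoods[i] - log_likelihoods[i - 1]
        if delta > tolerance:
            rising_run += 1
        elif delta < -tolerance:
            if rising_run >= min_rising_steps:
                return i
            # A drop before any sustained rise is not the pattern we want.
            rising_run = 0
        # Changes within the tolerance band are treated as noise and
        # neither extend nor reset the current rising run.
    return None


# Hypothetical values, made up purely to illustrate the pattern.
example_log_likelihoods = [-9.0, -7.5, -6.0, -5.0, -6.5, -8.0]
reversal_index = find_trend_reversal(example_log_likelihoods)
```

In practice the `tolerance` would need to be set from the noise in the evals, and a flagged reversal is only a prompt to look closer, not evidence of anything by itself.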