How do you feel about this strategy today? What chance of success would you give it, especially in light of the recent “Locating and Editing Factual Associations in GPT” (ROME), “Mass-Editing Memory in a Transformer” (MEMIT), and “Discovering Latent Knowledge in Language Models Without Supervision” (CCS) methods?
How does this compare to the strategy you’re currently most excited about? Do you know of other ongoing (empirical) efforts that try to realize this strategy?
I’m not sure I understand this correctly. Are you saying that one of the main reasons for optimism is that more competent models will be easier to align because we just need to give them “the right incentives”?
What exactly do you mean by “the right incentives”?
Can you illustrate this with an example?