Regarding making AIs motivated to have accurate beliefs, you can make agents that do planning and RL on organizing better predictions, e.g. AIs whose only innate drives/training signal (beside short-run data modeling, as with LLM pretraining) are doing well in comprehensive forecasting tournaments/prediction markets, or implementing reasoning that scores well on various classifiers built based on habits of reasoning that drive good performance in prediction problems, even against adversarial pressures (AIs required to follow the heuristics have a harder time believing or arguing for false beliefs even when optimized to do so under the constraints).
Before I even start to think about how to make AIs that are motivated to have accurate beliefs, I want to figure out whether that’s a good use of time. So my first two questions are:
Is figuring this out necessary for TAI capabilities? (If yes, I don’t need to think about it, because it will automatically get sorted out before TAI.)
Hmm, I guess my answer is “no”, because, for example, humans can be very high-achieving in practical domains like inventing stuff and founding companies while having confidently wrong opinions about things that are not immediately consequential, like religion or politics or of course x-risk. :)
Is not figuring this out before TAI a safety problem? (If it’s not a safety problem, then I don’t care much.)
Hmm, I guess my answer is “yes it’s a problem”, although I think it’s a less critical problem than alignment. Like, if an AI is motivated to make a great future, but has some wishful thinking and confirmation bias, they might do catastrophic things by accident.
OK, so I guess I do care about this topic. So now I’m reading your comment!
Giving an AI a motivation to do well at prediction markets or forecasting tournaments (maybe the latter is a bit better than the former per this?) seems like a perfectly good idea. I definitely wouldn’t want that to be the only motivation, at least for the kind of agent-y AGI that I’m expecting and trying to plan for, but it could be part of the mix.
The latter part of your comment (“or implementing reasoning…”) seems somewhat redundant with the former part, on my models of actor-critic AGI. Specifically, if you have actor-critic RL trained on good forecasting, then the critic becomes “various classifiers built based on habits of reasoning that drive good performance in prediction problems”, and then the actor “implements reasoning” on that basis. It might be less redundant for other types of AI. Sorry if I’m misunderstanding.
Also, I still also think literally giving the AI a copy of Scout Mindset, as silly as it sounds, is not a crazy idea (again, specifically for the kind of RL-agent-y AGI that I’m thinking of). You would also want to futz with the AI’s motivation system to make it excited to read the book. (I think that kind of futzing will probably be possible as we get towards AGI. It’s not so different from what happens when someone I greatly admire recommends a book to me. I think it would look like tweaking the critic function—like this kind of thing.)
Regarding making AIs motivated to have accurate beliefs, you can make agents that do planning and RL on organizing better predictions, e.g. AIs whose only innate drives/training signal (beside short-run data modeling, as with LLM pretraining) are doing well in comprehensive forecasting tournaments/prediction markets, or implementing reasoning that scores well on various classifiers built based on habits of reasoning that drive good performance in prediction problems, even against adversarial pressures (AIs required to follow the heuristics have a harder time believing or arguing for false beliefs even when optimized to do so under the constraints).
Thanks!
Before I even start to think about how to make AIs that are motivated to have accurate beliefs, I want to figure out whether that’s a good use of time. So my first two questions are:
Is figuring this out necessary for TAI capabilities? (If yes, I don’t need to think about it, because it will automatically get sorted out before TAI.)
Hmm, I guess my answer is “no”, because, for example, humans can be very high-achieving in practical domains like inventing stuff and founding companies while having confidently wrong opinions about things that are not immediately consequential, like religion or politics or of course x-risk. :)
Is not figuring this out before TAI a safety problem? (If it’s not a safety problem, then I don’t care much.)
Hmm, I guess my answer is “yes it’s a problem”, although I think it’s a less critical problem than alignment. Like, if an AI is motivated to make a great future, but has some wishful thinking and confirmation bias, they might do catastrophic things by accident.
OK, so I guess I do care about this topic. So now I’m reading your comment!
Giving an AI a motivation to do well at prediction markets or forecasting tournaments (maybe the latter is a bit better than the former per this?) seems like a perfectly good idea. I definitely wouldn’t want that to be the only motivation, at least for the kind of agent-y AGI that I’m expecting and trying to plan for, but it could be part of the mix.
The latter part of your comment (“or implementing reasoning…”) seems somewhat redundant with the former part, on my models of actor-critic AGI. Specifically, if you have actor-critic RL trained on good forecasting, then the critic becomes “various classifiers built based on habits of reasoning that drive good performance in prediction problems”, and then the actor “implements reasoning” on that basis. It might be less redundant for other types of AI. Sorry if I’m misunderstanding.
Also, I still also think literally giving the AI a copy of Scout Mindset, as silly as it sounds, is not a crazy idea (again, specifically for the kind of RL-agent-y AGI that I’m thinking of). You would also want to futz with the AI’s motivation system to make it excited to read the book. (I think that kind of futzing will probably be possible as we get towards AGI. It’s not so different from what happens when someone I greatly admire recommends a book to me. I think it would look like tweaking the critic function—like this kind of thing.)