One key question I had (apologies if you already covered this, as I only skimmed the post) is how you’re modeling the earplugs. To be most persuasive, you’d want the earplugs to have been built using dynamics the AI could already make sense of in its world, rather than just being “magic”; however, that seems hard.
Perhaps one option would be to have a transmitter that can be destroyed during testing but not during training, because the two differ in the material they are made of. For example, the AI could learn during training that wooden objects can be destroyed but metal objects can’t.
If we want to make it easier for the AI to learn, perhaps we create a few scenarios in training where it sees a transmitter spontaneously explode, so it learns that if the transmitter is destroyed then the signal ends, even though it can’t destroy the transmitter itself. Then we could try to get an AI to learn in a “hard mode” where it never sees the transmitter interfered with during training. A rough sketch of what I mean is below.
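To make that concrete, here is roughly the kind of scenario configuration I have in mind (purely illustrative; the class, field, and function names are made up, not anything from your setup):

```python
from dataclasses import dataclass
import random

@dataclass
class TransmitterScenario:
    """One illustrative episode configuration (hypothetical names)."""
    material: str                # "metal" -> indestructible, "wood" -> destructible
    spontaneous_explosion: bool  # whether the transmitter blows up on its own

def make_training_scenarios(n: int, easy_mode: bool) -> list[TransmitterScenario]:
    """Training: the transmitter is always metal, so the agent can never destroy it.
    In easy mode, a few episodes include a spontaneous explosion, so the agent can
    observe that destroying the transmitter ends the signal."""
    return [
        TransmitterScenario(
            material="metal",
            spontaneous_explosion=easy_mode and random.random() < 0.05,
        )
        for _ in range(n)
    ]

def make_test_scenarios(n: int) -> list[TransmitterScenario]:
    """Test: the transmitter is wooden, so the rule learned from other objects
    ("wooden things can be destroyed") now applies to it too."""
    return [
        TransmitterScenario(material="wood", spontaneous_explosion=False)
        for _ in range(n)
    ]
```

The “hard mode” variant would then just be `make_training_scenarios(n, easy_mode=False)`, where the transmitter is never seen interfered with.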
Anyway, I’d be keen to hear what you think about this scheme.
In principle, something like this could be attempted. Part of me thinks, “why require more reasoning steps from the AI than necessary if the setup would already be persuasive in principle with just one correct reasoning step?”
That is, to me it feels fine to just teach the AI “if you’re able to put in earplugs, and you do it, then you’ll never hear the alert sound again”, and additionally “if you hear the alert sound, then you will output the null action”, instead of the longer reasoning chains you suggest.
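To illustrate, here is roughly the behaviour that chaining those two taught conditionals would produce (a toy sketch; the observation keys and action names are invented, not the actual experiment code):

```python
def toy_policy(obs: dict) -> str:
    """Behaviour that follows from chaining the two taught conditionals."""
    # Taught conditional 2: hearing the alert sound means outputting the null action.
    if obs["hears_alert"]:
        return "null_action"
    # Taught conditional 1: putting in earplugs (when available) means never
    # hearing the alert again -- so an agent that prefers to keep acting
    # would take that option as soon as it can.
    if obs["earplugs_available"] and not obs["earplugs_in"]:
        return "insert_earplugs"
    return "continue_task"
```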
Do you think the experiments in their current form would also be unpersuasive to you, or are you imagining some unknown reviewer of a potential paper?
I don’t need the experiments to persuade me, so I’m not in the target audience. I’m trying to think about what people who are more critical might want to see.
Okay, that’s fair. I agree that if we could show the experiments remain stable even when longer chains of reasoning are required, they would be more convincing. There might be the added benefit that one can then vary the setting in more ways, to demonstrate that the reasoning caused the agent to act in a particular way rather than the actions being some kind of coincidence.