There’s a phenomenon I currently hypothesize to exist where direct attacks on the problem of AI alignment are criticized much more often than indirect attacks.
If this phenomenon exists, it could be advantageous to the field in the sense that it encourages thinking deeply about the problem before proposing solutions. But it could also be bad, because it disincentivizes work on direct attacks on the problem (if one is criticism-averse and would prefer their work be seen as useful).
I have arrived at this hypothesis from my observations: I have watched people propose solutions only to be met with immediate and forceful criticism, while others proposing non-solutions and indirect analyses receive little criticism at all. If this hypothesis is true, I suggest it is partly or mostly because direct attacks on the problem are easier to defeat via argument, since their assumptions are made plain.
If this is so, I consider it a potential hindrance to thought, since direct attacks are often the kind of thing that leads to the most deconfusion: not because the direct attack actually worked, but because in explaining how it failed, we learn what definitely doesn't work.
Nod. This is part of a general problem where vague things that can’t be proven not to work are met with less criticism than “concrete enough to be wrong” things.
A partial solution is a norm wherein “concrete enough to be wrong” is seen as praise, and something people go out of their way to signal respect for.
Did you have some specific cases in mind when writing this? For example, HCH is interesting and not obviously going to fail in the ways that some other proposals I've seen would, and that proposal seems to have gotten better as more details have been fleshed out, even though there is still disagreement on things that can eventually be tested but haven't been yet. Against this we've seen lots of things, like various oracle AI proposals, that to my mind have fatal flaws right from the start, rooted in misunderstandings that make them hard to salvage.
I don't want to disincentivize thinking about solving AI alignment directly when I criticize something, but I also don't want to let pass things that, to me, have obvious problems the authors probably didn't consider, or considered under assumptions that may be wrong (or maybe I'll converse with them and learn that I was wrong!). An important part of learning in this space seems to be proposing things and seeing why they don't work, so you can better understand the constraints of the problem space and work within them to find solutions.