I think that before this announcement I’d have said that OpenAI was at around a 2.5 and Anthropic around a 3 in terms of what they’ve actually applied to existing models (which imo is fine for now; I think doing more to models at GPT-4 capability levels is mostly wasted effort in terms of current safety impact). Prior to the superalignment announcement I’d have said OpenAI and Anthropic were both aiming at a 5, i.e. oversight with research assistance, and DeepMind’s stated plan was the best at a 6.5 (involving lots of interpretability and some experiments). Now OpenAI is also aiming at a 6.5, and Anthropic seems to be the laggard, still at a 5, unless I’ve missed something.
However, the best currently feasible plan is still slightly better than either. For example, very close integration of the deception and capability evals from the ARC and Apollo research teams into an experiment workflow isn’t in either plan and should be, and would bump either one up to a 7.
I don’t see how anyone can have high justifiable confidence that it’s zero instead of epsilon, given our general state of knowledge/confusion about philosophy of mind/consciousness, axiology, metaethics, and metaphilosophy.
I tend to agree with Zvi’s conclusion, although I also agree with you that I don’t know that it’s definitely zero. I think it’s unlikely (subjectively, under a percent) that the real truth about axiology says that insects in bliss are an absolute good, but I can’t rule it out the way I can rule out winning the lottery, because no one can trust reasoning in this domain that much.
What I’d say is that, in general, in ‘weird’ domains (AI strategy, longtermist prioritization, metaethics), because the stakes are so large and the questions so uncertain, you run into a really large number of “unlikely, but not unlikely enough to honestly call it a Pascal’s mugging” considerations: things you’d subjectively say are under 1% likely but over one in a million or so. I think the correct response in these uncertain domains is to mostly just ignore them, the way you’d ignore things that are under one in a billion in a more certain and clear domain like construction engineering.
> I tend to agree with Zvi’s conclusion, although I also agree with you that I don’t know that it’s definitely zero. I think it’s unlikely (subjectively, under a percent) that the real truth about axiology says that insects in bliss are an absolute good, but I can’t rule it out the way I can rule out winning the lottery, because no one can trust reasoning in this domain that much.
I’m not aware of any reasoning that I’d trust enough to drive my subjective probability of insect bliss having positive value to <1%. Can you cite some works in this area that caused you to be that certain?
> What I’d say is that, in general, in ‘weird’ domains (AI strategy, longtermist prioritization, metaethics), because the stakes are so large and the questions so uncertain, you run into a really large number of “unlikely, but not unlikely enough to honestly call it a Pascal’s mugging” considerations: things you’d subjectively say are under 1% likely but over one in a million or so. I think the correct response in these uncertain domains is to mostly just ignore them, the way you’d ignore things that are under one in a billion in a more certain and clear domain like construction engineering.
What are some other specific examples of this, in your view?