Hi Vanessa,

Perhaps given my short-term preference, it’s not surprising that I find it hard to track very deep comment threads, but let me just give a couple of short responses.
I don’t think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don’t exist, I don’t think the balance is completely skewed to the attacker. You could imagine that, like today, there is a “cat and mouse” game, where both attackers and defenders try to find “zero-day vulnerabilities” and exploit them (in one case) or fix them (in the other). I believe that in a world of powerful AI, this game would continue, with both sides having access to AI tools, which would empower both but not necessarily shift the balance to one or the other.
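To make the “empowers both but doesn’t necessarily shift the balance” point concrete, here is a minimal toy model (my own illustration, not part of the original comment): if an attacker and a defender race to find the same zero-day, and AI multiplies both of their discovery rates by the same factor, the probability that the attacker gets there first is unchanged.

```python
# Toy "cat and mouse" race (illustrative sketch only, not from the comment).
# A newly introduced zero-day is hunted by an attacker and a defender with
# independent exponential discovery times at rates `a` and `d`; the attacker
# wins the race with probability a / (a + d). Scaling both rates by the same
# AI multiplier k cancels out, so the balance does not shift.

def attacker_wins_probability(a: float, d: float) -> float:
    """P(attacker finds the vulnerability before the defender patches it)."""
    return a / (a + d)

for k in (1.0, 10.0, 1000.0):  # increasingly powerful AI tools on both sides
    p = attacker_wins_probability(a=k * 2.0, d=k * 3.0)
    print(f"AI multiplier {k:7.1f}: attacker-first probability = {p:.3f}")
```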
I think the question of whether a long-term planning agent could emerge from short-term training is a very interesting technical question! Of course, we need to understand how to define “long term” and “short term” here. One way to think about this is the following: we can define various short-term metrics, which can be evaluated using only short-term information and are potentially correlated with long-term success. We would say that a strategy is purely long-term if it cannot be explained by making advances on any combination of these metrics.
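One hedged way to operationalize “cannot be explained by making advances on any combination of these metrics” (my own sketch, taking “combination” to be linear for concreteness): regress long-term success on the short-term metrics across strategies and treat the unexplained residual as the purely long-term component.

```python
# Minimal sketch of the proposed test (my operationalization, not the
# commenter's): regress long-term success on short-term metric scores and
# call the unexplained part the "purely long-term" component. "Combination"
# is taken to be linear here; richer model classes could be substituted.
import numpy as np

rng = np.random.default_rng(0)
n_strategies, n_metrics = 200, 5

short_term = rng.normal(size=(n_strategies, n_metrics))   # short-term metric scores
purely_long_term = rng.normal(size=n_strategies)          # part not driven by the metrics
long_term = short_term @ rng.normal(size=n_metrics) + purely_long_term

# Best linear explanation of long-term success from the short-term metrics.
coef, *_ = np.linalg.lstsq(short_term, long_term, rcond=None)
residual = long_term - short_term @ coef

# A strategy with a large residual succeeds in the long term in a way that is
# not accounted for by advancing any (linear) combination of short-term metrics.
r2 = 1 - residual.var() / long_term.var()
print(f"Fraction of long-term success explained by short-term metrics: {r2:.2f}")
```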
> I don’t think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don’t exist, I don’t think the balance is completely skewed to the attacker.
My point was not about the defender/attacker balance. My point was that even short-term goals can be difficult to specify, which undermines the notion that we can easily empower ourselves with short-term AI.
> Of course, we need to understand how to define “long term” and “short term” here. One way to think about this is the following: we can define various short-term metrics, which can be evaluated using only short-term information and are potentially correlated with long-term success. We would say that a strategy is purely long-term if it cannot be explained by making advances on any combination of these metrics.
Sort of. The correct way to make it more rigorous, IMO, is using tools from algorithmic information theory, like I suggested here.
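For readers wanting a rough sense of what an algorithmic-information-theoretic version could look like, here is one possible shape (my own guess, not necessarily the formalization suggested in the linked comment): ask whether a policy compresses once the short-term metrics are given.

```latex
% One possible AIT-style reading (my own sketch; the linked comment may
% propose something different). K(x) denotes Kolmogorov complexity and
% K(x \mid y) the complexity of x given y.
Let $M_1,\dots,M_k$ be the short-term metrics and let $\pi$ be a policy.
Say $\pi$ is \emph{explainable by short-term metrics} if
\[
  K(\pi \mid M_1,\dots,M_k) \ll K(\pi),
\]
i.e.\ the policy admits a short description once the short-term objectives are
given (it is, up to a short correction, an optimizer of some combination of
them). Call $\pi$ \emph{purely long-term} if no such compression exists:
$K(\pi \mid M_1,\dots,M_k) \approx K(\pi)$.
```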