One more objection to the model: AI labs apply just enough safety measures to prevent dumb rogue AIs. Fearing a public backlash to low-level catastrophes, AI companies test their models, checking for safety vulnerabilities, rogue behaviors, and potential for misuse. The easiest-to-catch problems, however, are also the least dangerous, so only the most cautious, intelligent, and dangerous rogue AIs pass the security checks. Further, this correlation continues indefinitely, so each additional round of safety work further filters the population of malevolent AIs toward the most dangerous. AI companies have no interest in meeting the standard of theoretical, “provably safe” models, since they are trying to get away with the bare minimum, so the filter never catches everything. While “warning shots” appear all the time in experimental settings, these findings are suppressed or downplayed in public statements and the media, and the public only sees the highly sanitized result of the filtration process. Eventually, the security systems fail, but by that point AI has been developed past the threshold needed to become catastrophically dangerous.