Suppose someone in 1970 makes the prediction: “More future tech progress will be in computers, not rockets.” (Claiming, amongst other arguments, that rockets couldn’t be made orders of magnitude smaller, and computers could.) There is a sense in which they are clearly right, but it’s really hard to turn something like that into a specific objective prediction, even with the benefit of hindsight. Any time you set an objective criterion, there are ways the technical letter of the rules can fail to match the intended spirit.
(The same way complicated laws have loopholes)
Take this attempt:
Change the first criterion to: “Able to reliably pass a 2-hour, adversarial Turing test during which the participants can send text, images, and audio files (as is done in ordinary text messaging applications) during the course of their conversation. An ‘adversarial’ Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor. A single demonstration of a computer passing such a Turing test, or one that is sufficiently similar, will be sufficient for this condition to have been met, so long as the test was well-designed.”
Ways this could fail:
The test was 1 hour 50 minutes. Audio wasn’t included because someone had spilled coffee in the speakers. It doesn’t technically count.
Where do they find these people? The computer produces a string of random gibberish, and is indistinguishable from the human participant, who is also producing gibberish.
An AI adversarially optimizing against a human mind? With all our anti-mind-hacking precautions removed? Are you insane? There is no way we would run an experiment that dangerous with our AGI.
Turns out the subsurface scattering of light in skin is just really hard to fake with available compute (but easy for the human visual system to verify). The AI is smart, but it can’t get around P ≠ NP. Producing actually realistic-looking skin would take leaps far beyond current computer science, and the AI isn’t THAT good.
Neurology advances lead to the development of hypnoimages. The “AI” just sends a swirly fractal image and ignores any responses. Human judges swear they had a conversation with a real intelligent human.
Where do they find these people? I mean, come on. They flew someone in straight from an uncontacted tribe in the Amazon. If the human judge is supposed to tell human from computer, they should at least have a clue what a computer is. (More realistically: a not particularly bright or skilled judge, like someone who can’t program.)
Another neurology advance: humans are really bad at producing randomness. If we brain-scan a particular judge the day before, we can hand-write a lookup table covering every question they might ask. Humans are just that predictable.
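A toy sketch of what that lookup-table “AI” amounts to (everything here is invented for illustration; the brain scan is doing all the work off-stage):

```python
# Hypothetical sketch: the brain scan supposedly yields the full set of
# questions this judge might ask, and a human ghostwriter pre-writes a
# reply to each. There is no intelligence anywhere in this program.
predicted_questions = {
    "what did you have for breakfast?": "Toast, and coffee I regretted.",
    "tell me a joke about rockets": "They're great, but the field never took off.",
    "what's 17 times 23?": "Give me a second... 391?",
}

def lookup_table_ai(question: str) -> str:
    # Normalize, then look the question up in the pre-written table.
    key = question.strip().lower()
    return predicted_questions.get(key, "Sorry, could you say that again?")
```

If the scan really did enumerate every question the judge could produce, the fallback line would never fire, and the “AI” would pass while being nothing but a dictionary.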
Most likely: whoever has the AI isn’t particularly interested in running this particular test. Doing so would take a lot of researcher time to fiddle with the details, and a lot of compute.
Change the second criterion to: “Has general robotic capabilities, of the type able to autonomously, when equipped with appropriate actuators and when given human-readable instructions, satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model. A single demonstration of this ability, or a sufficiently similar demonstration, will be considered sufficient.”
The suitable actuators just aren’t there. It knows what to do, but the robot hand is too crude.
The reverse. It turns out this task is possible with lots of really good hardware and hard-coded heuristics. It can take 3D scans of all the components. It can search for arrangements such that parts which fit together go together. It has hard-coded lists of common patterns for models like this. It can render various arrangements of parts and compare those renders to the images/diagrams. For the instructions, it uses some pretty crude regex string mangling, ELIZA-style. And it can easily position everything, thanks to advanced hardware. This turns out to be sufficient.
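For concreteness, a minimal sketch of what “crude regex string mangling, ELIZA-style” might look like here (the patterns and action names are made up for illustration):

```python
import re

# Invented rules mapping instruction text to hard-coded robot actions.
# Pure pattern matching, with no understanding of what the parts are.
RULES = [
    (re.compile(r"attach (?:the )?(\w+) to (?:the )?(\w+)", re.I), "ATTACH"),
    (re.compile(r"insert (?:the )?(\w+) into (?:the )?(\w+)", re.I), "INSERT"),
    (re.compile(r"glue (?:the )?(\w+) onto (?:the )?(\w+)", re.I), "GLUE"),
]

def parse_instruction(line: str):
    for pattern, action in RULES:
        match = pattern.search(line)
        if match:
            return (action, *match.groups())
    return ("UNKNOWN", line)  # a real system would need some fallback here

print(parse_instruction("Attach the spoiler to the chassis."))
# -> ('ATTACH', 'spoiler', 'chassis')
```

Nothing in this resembles general intelligence, yet combined with very good scanning and actuation it could plausibly satisfy the letter of the criterion.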
Safety, again.
No-one cares. This test isn’t commercially valuable.
The AI only succeeded because it was given extensive knowledge: a database of all automobile models ever made, and computer-friendly assembly instructions.
The AI only failed because it didn’t have common human knowledge of what a screwdriver was and how to use one.
I like your list! Definitely agree that narrow questions can lose the spirit of it. The forecasting community can hedge against this by having a variety of questions that try to get at it from “different angles”.
For example, that person in 1970 could set up a basket of questions (a toy scoring sketch follows the list):
Percent of GDP that would be computing-related instead of rocket-related.
Growth in the largest computer by computational power, versus the growth in the longest distance traveled by rocket, etc.
Growth in the number of people who had flown in a rocket, versus the number of people who own computers.
Changes in dollars per kilo of cargo hauled into space, versus changes in FLOPS-per-dollar.
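Here is a minimal sketch of how such a basket might be scored once the questions resolve. All the numbers below are placeholders, not real 1970–2020 statistics:

```python
# Each entry pairs a question with the measured growth multiple on each
# side. The figures are made-up placeholders, NOT real data.
basket = [
    # (question,             computers_growth, rockets_growth)
    ("share of GDP",         30.0,             1.5),
    ("peak capability",      1e6,              2.0),
    ("users vs. passengers", 1e4,              3.0),
    ("cost-efficiency",      1e5,              4.0),
]

computer_wins = sum(1 for _, c, r in basket if c > r)
print(f"Computers won {computer_wins} of {len(basket)} questions")

# The 1970 prediction resolves in its intended spirit if computers win
# most of the basket, even if any single question can be gamed or can
# fail on a technicality.
```

No single question is loophole-proof, but it’s much harder for all four to fail in the same direction at once.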
Of course, I understand completely if people in 1970 didn’t know about Tetlock’s modern work. But for big, important questions today, I don’t see why we shouldn’t just use proper modern forecasting technique. Admittedly, it is laborious! People have been struggling to write good AI timeline questions for years.