Note: This is an example of how to do the bad thing (extensive RL fine-tuning/training). If you do it, the result may be misalignment, killing you/everyone.
To name one good example that is very relevant: programming, specifically having the AI complete small, easy-to-verify tasks.
The general pattern is to take existing horribly bloated software/data and extract useful subproblems from it (e.g. find the parts of this code that are taking the most time), then turn those into problems for the AI to solve (e.g. here is a function plus examples of it being called; make it faster). Ground-truth metrics would be simple things that are easy to measure (e.g. execution time, code size and quality, code coverage, is the output the same?), and credit assignment for sub-task usefulness can be handled by an expected value estimator trained on that ground truth, as is done in traditional game-playing RL. Possibly it’s just one AI with different prompts.
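As a minimal sketch (not a real pipeline), the ground truth for the “make this function faster” sub-task could be scored roughly like this; the `reward` function, the `recorded_calls` list, and the weighting are hypothetical names invented for illustration:

```python
import time

def reward(original, candidate, recorded_calls, time_weight=1.0):
    """Scalar reward for a rewrite: faster is better, any wrong output zeroes it out."""
    total_orig, total_cand = 0.0, 0.0
    for args in recorded_calls:              # argument tuples captured from code in the wild
        t0 = time.perf_counter()
        expected = original(*args)
        total_orig += time.perf_counter() - t0

        t1 = time.perf_counter()
        got = candidate(*args)
        total_cand += time.perf_counter() - t1

        if got != expected:                  # ground truth: is the output the same?
            return 0.0                       # functional regression -> no reward
    speedup = total_orig / max(total_cand, 1e-9)
    return time_weight * speedup             # e.g. 2.0 means "twice as fast as the original"
```

A learned expected value estimator would then be trained against scalar rewards like this one to do credit assignment over the intermediate steps the AI takes.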
Basically, Microsoft takes all the repositories on GitHub that build successfully and have some unit tests, and builds an AI-augmented pipeline to extract problems from that software. Alternatively, a large company that runs lots of code takes snapshots + IO traces of production machines and derives examples from that. You need code in the wild doing its thing.
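A hedged sketch of the filtering step such a pipeline might start with, assuming Python repositories with pytest-style tests (the commands, file layout, and the `usable_repo` helper are illustrative assumptions, not a description of any actual pipeline):

```python
import subprocess
from pathlib import Path

def usable_repo(repo_dir: str) -> bool:
    """Keep only repos that have some tests, build, and pass their test suite."""
    repo = Path(repo_dir)
    if not list(repo.rglob("test_*.py")):          # crude "has some unit tests" check
        return False
    build = subprocess.run(["pip", "install", "-e", "."], cwd=repo, capture_output=True)
    if build.returncode != 0:                      # does it build successfully?
        return False
    tests = subprocess.run(["pytest", "-x", "-q"], cwd=repo, capture_output=True)
    return tests.returncode == 0                   # passing tests = usable ground truth
```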
Some example sub-tasks in the domain of software engineering:
make a piece of code faster
make this pile of code smaller
is f(x) == g(x)? If not, find a counterexample (useful for grading the above; see the sketch after this list)
find a vulnerability and write an exploit.
fix the bug while preserving functionality
identify invariants/data structures/patterns in memory (e.g. linked lists, reference counts)
useful as a building block for further tasks (e.g. finding use-after-free bugs)
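For the f(x) == g(x) grader in the list above, a minimal sketch based on random differential testing might look like the following; `find_counterexample` and `gen_input` are hypothetical names, and a real grader could instead use property-based testing or inputs replayed from recorded traces:

```python
import random

def find_counterexample(f, g, gen_input, trials=10_000, seed=0):
    """Search for an input where the candidate g disagrees with the reference f."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        if f(x) != g(x):
            return x      # counterexample found: g is not equivalent to f on this input
    return None           # no divergence found (not a proof of equivalence)

# Toy usage: compare a candidate sort against the built-in one.
# find_counterexample(sorted, my_sort,
#                     lambda rng: [rng.randint(0, 99) for _ in range(rng.randint(0, 20))])
```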
GPT-4 can already use a debugger to solve a dead-simple reverse engineering problem, albeit stupidly [1]: https://arxiv.org/pdf/2303.12712.pdf#page=119
Larger problems could be approached by identifying useful instrumental subgoals once the model can actually perform them reliably.
The finished system should be able to extend shoggoth tentacles into a given computer, identify what that computer is doing and make it do it better or differently.
The finished system might be able to extend shoggoth tentacles into other things too! (e.g. embedded systems, FPGAs) Capability limitations would stem from the need for fast feedback, so software, electronics, and programmable hardware should be solvable. For other domains, simulation can help (limited by simulation fidelity and Goodharting). The eventual result is a general-purpose engineering AI.
Tasks heavily dependent on human judgement (e.g. is this a good book? is this action immoral?) have obviously terrible feedback cost/latency and so scale poorly. This is a problem if we want the AI not to do things a human would disapprove of.
RL training could lead to a less grotesque solution, i.e. just read the password from memory using the debugger rather than writing a program to repeatedly run the executable and brute-force the password.
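As a hedged illustration of what the less grotesque, debugger-based route could look like mechanically, here is a sketch assuming a binary with a hypothetical `check_password` function and a global `password` symbol (both names invented for illustration):

```python
import subprocess

def read_password(binary="./crackme"):
    """Attach gdb in batch mode and print the secret instead of brute-forcing."""
    out = subprocess.run(
        ["gdb", "--batch",
         "-ex", "break check_password",     # hypothetical function that receives the secret
         "-ex", "run",
         "-ex", "print (char *) password",  # hypothetical symbol holding the expected value
         binary],
        capture_output=True, text=True)
    return out.stdout
```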
>The finished system should be able to extend shoggoth tentacles into a given computer, identify what that computer is doing and make it do it better or differently.
Sure. GPT-X will probably help optimize a lot of software. But I don’t think having more resource efficiency should be assumed to lead to recursive self-improvement beyond where we’d be at given a “perfect” use of current software tools. Will GPT-X be able to break out of that current set of tools, having only been trained to complete text and not to actually optimize systems? I don’t take this for granted, and my view is that LLMs are unlikely to devise radically new software architectures on their own.
<rant>It really pisses me off that the dominant “AI takes over the world” story is more or less “AI does technological magic”: nanotech assemblers, superpersuasion, basilisk hacks and more. Skeptics who doubt this are met with “well, if it can’t, it just improves itself until it can”. The skeptics’ obvious rebuttal, that RSI seems like magic too, is not usually addressed.</rant>
Note: RSI is, in my opinion, an unpredictable black swan. My belief is that RSI will yield somewhere between a 1.5x and 5x speed improvement to a nascent AGI, coming from improvements in GPU utilisation and sparsity/quantisation, with significant cognition spent to achieve those speedups. AI is still dangerous in worlds where RSI does not occur.
Self-play generally gives superhuman performance (Go, chess, etc.), even in more complicated imperfect-information games (Dota 2, StarCraft). Turning a field of engineering into a self-playable game likely leads to superhuman (80%), top-human-equivalent (18%), or no-change (2%) capabilities in that field. Superhuman or top-human software engineering (vulnerability discovery and programming) is one relatively plausible path to AI takeover.
https://googleprojectzero.blogspot.com/2023/03/multiple-internet-to-baseband-remote-rce.html
Can an AI take over the world if it can:
do end-to-end software engineering
find vulnerabilities about as well as the researchers at Project Zero
generate reasonable plans on par with a +1 SD intelligence human (i.e. not Hollywood-style movie plots like GPT-4 seems fond of)
AI does not need to be even superhuman to be an existential threat. Hack >95% of devices, extend shoggoth tentacles, hold all the data/tech hostage, present as not-Skynet so humans grudgingly cooperate, build robots to run the economy (some humans will even approve of this), kill all humans, done.
That’s one of the easier routes, assuming the AI can scale vulnerability discovery. With just software engineering and a bit of real-world engineering (potentially outsourceable), other violent/coercive options could work, albeit with more failure risk.