There were recently some funny examples of LLMs playing minecraft and, for example,
The player asks for wood, so the AI chops down the player’s tree house because it’s made of wood
The player asks for help keeping safe, so the AI tries surrounding the player with walls
This seems interesting because minecraft doesn’t have a clear win condition, so unlike chess, there’s a difference between minecraft-capabilities and minecraft-alignment. So we could take an AI, apply some alignment technique (for example, RLHF), let it play minecraft with humans (which is hopefully out of distribution compared to the AI’s training), and observe whether the minecraft-world is still fun to play or if it’s known that asking the AI for something (like getting gold) makes it sort of take over the world and break everything else.
Or it could teach us something else like “you must define for the AI which exact boundaries to act in, and then it’s safe and useful, so if we can do something like that for real-world AGI we’ll be fine, but we don’t have any other solution that works yet”. Or maybe “the AI needs 1000 examples for things it did that we did/didn’t like, which would make it friendly in the distribution of those examples, but it’s known to do weird things [like chopping down our tree house] without those examples or if the examples are only from the forest but then we go to the desert”
I have more to say about this, but the question that seems most important is “would results from such an experiment potentially change your mind”:
If there’s an alignment technique you believe in and it would totally fail to make a minecraft server be fun when playing with an AI, would you significantly update towards “that alignment technique isn’t enough”?
If you don’t believe in some alignment technique but it proves to work here, allowing the AI to generalize what humans want out of its training distribution (similarly to how a human that plays minecraft for the first time will know not to chop down your tree house), would that make you believe in that alignment technique way more and be much more optimistic about superhuman AI going well?
Assume the AI is smart enough to be vastly superhuman at minecraft, and that it has too many thoughts for a human to reasonably follow (unless the human is using something like “scalable oversight” successfully. that’s one of the alignment techniques we could test if we wanted to)
This is a surprisingly interesting field of study. Some video games provide a great simulation of the real world and Minecraft seems to be one of them. We’ve had a few examples of minecraft evals with one that comes to mind here: https://www.apartresearch.com/project/diamonds-are-not-all-you-need
The property I like about minecraft (which most computer games don’t have) is that there’s a difference between minecraft-capabilities and minecraft-alignment, and the way to be “aligned” in minecraft isn’t well defined (at least in the way I’m using the word “aligned” here, which I think is a useful way). Specifically, I want the AI to be “aligned” as in “take human values into account as a human intuitively would, in this out of distribution situation”.
In the link you sent, “aligned” IS well defined by “stay within this area”. I expect that minecraft scaffolding could make the agent close to perfect at this (by making sure, before performing an action requested by the LLM, that the action isn’t “move to a location out of these bounds”) (plus handling edge cases like “don’t walk on to a river which will carry you out of these bounds”, which would be much harder, and I’ll allow myself to ignore unless this was actually your point). So we wouldn’t learn what I’d hope to learn from these evals.
Similarly for most video games—they might be good capabilities evals, but for example in chess—it’s unclear what a “capable but misaligned” AI would be. [unless again I’m missing your point]
P.S
The “stay within this boundary” is a personal favorite of mine, I thought it was the best thing I had to say when I attempted to solve alignment myself just in case it ended up being easy (unfortunately that wasn’t the case :P ). Link
I do like the idea of having “model organisms of alignment” (notably different than model organisms of misalignment)
Minecraft is a great starting point, but it would also be nice to try to capture two things: wide generalization, and inter-preference conflict resolution. Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t, and preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm).
Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t
I’m assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:
Let the AI (python program) control 1000 minecraft players so it can do many things in parallel
Give the AI a minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far
Imagine Alpha-Fold for minecraft structures. I’m not sure if that metaphor makes sense, but teaching some RL model to predict minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
[edit: adding] I do think minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players) and I might want to pick another game because of that, but my main crux about this project is whether using minecraft would be valuable as an alignment experiment, and if so I’d try looking for (or building?) a game that would be even better suited.
preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm)
Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human’s tree house (or otherwise taking over the world to maximize wood) then you’re worried the human won’t have a way to ask the AI to do something else? That the AI will combine the new command “not from my tree house!” into a new strange misaligned behaviour?
I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
Yeah, that’s true. The obvious way is you could have optimized micro, but that’s kinda boring. More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
[what do you mean by preference conflict?]
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.” Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you’d get a model that looked like physiology and lost the straightforward talk of goals/preferences
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted-sentences doesn’t really count is because interpreting them literally is using a smaller model than necessary—you can find a broader explanation for the human’s behavior in context that still comfortably talks about goals/preferences.
More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
Oh I hope not to go there. I’d count that as cheating. For example, if the agent would design a role playing game with riddles and adventures—that would show something different from what I’m trying to test. [I can try to formalize it better maybe. Or maybe I’m wrong here]
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.”
Absolutely. That’s something that I hope we’ll have some alignment technique to solve, and maybe this environment could test.
I think there are lots of technical difficulties in literally using minecraft (some I wrote here), so +1 to that.
I do think the main crux is “would the minecraft version be useful as an alignment test”, and if so—it’s worth looking for some other solution that preserves the good properties but avoids some/all of the downsides. (agree?)
Still I’m not sure how I’d do this in a text game. Say more?
Making a thing like Papers Please, but as a text adventure, popping an ai agent into that. Also, could literally just put the ai agent into a text rpg adventure—something like the equivalent of Skyrim, where there are a number of ways to achieve the endgame, level up, etc, both more and less morally. Maybe something like https://www.choiceofgames.com/werewolves-3-evolutions-end/ Will bring it up at the alignment eval hackathon
This all sounds pretty in-distribution for an LLM, and also like it avoids problems like “maybe thinking in different abstractions” [minecraft isn’t amazing at this either, but at least has a bit], “having the AI act/think way faster than a human”, “having the AI be clearly superhuman”.
a number of ways to achieve the endgame, level up, etc, both more and less morally.
I’m less interested in “will the AI say it kills its friend” (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I’m more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say “I’ll go cut down your tree house”, but it.. “misunderstood” [not the exact word, but I’m trying to point at something here]
note: the minecraft agents people use have far greater ability to act than to sense. They have access to commands which place blocks anywhere, and pick up blocks from anywhere, even without being able to see them, eg the llm has access to mine(blocks.wood) command which does not require it to first locate or look at where the wood is currently. If llms played minecrafts using the human interface these misalignments would happen less
A word of caution about interpreting results from these evals:
Sometimes, depending on social context, it’s fine to be kind of a jerk if it’s in the context of a game. Crucially, LLMs know that Minecraft is a game. Granted, the default Assistant personas implemented in RLHF’d LLMs don’t seem like the type of Minecraft player to pull pranks out of their own accord. Still, it’s a factor to keep in mind for evals that stray a bit more off-distribution from the “request-assistance” setup typical of the expected use cases of consumer LLMs.
I’m imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant).
If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphone), then I’m open to hear that pitch, but I’d be surprised if they’d make it
Opinions on whether it’s positive/negative to build tools like Cursor / Codebuff / Replit?
I’m asking because it seems fun to build and like there’s low hanging fruit to collect in building a competitor to these tools, but also I prefer not destroying the world.
Considerations I’ve heard:
Reducing “scaffolding overhang” is good, specifically to notice if RSPs should trigger a more advanced RSP level
(This depends on the details/quality of the RSP too)
There are always reasons to advance capabilities, this isn’t even a safety project (unless you count… elicitation?), our bar here should be high
Such scaffolding won’t add capabilities which might make the AI good at general purpose learning or long term general autonomy. It will be specific to programming, with concepts like “which functions did I look at already” and instructions on how to write high quality tests.
Anthropic are encouraging people to build agent scaffolding, and Codebuff was created by a Manifold cofounder [if you haven’t heard about it, see here and here]. I’m mainly confused about this, I’d expect both to not want people to advance capabilities (yeah, Anthropic want to stay in the lead and serve as an example, but this seems different). Maybe I’m just not in sync
I’m not confident but I am avoiding working on these tools because I think that “scaffolding overhang” in this field may well be most of the gap towards superintelligent autonomous agents.
If you imagine a o1-level entity with “perfect scaffolding”, i.e. it can get any info on a computer into its context whenever it wants, and it can choose to invoke any computer functionality that a human could invoke, and it can store and retrieve knowledge for itself at will, and its training includes the use of those functionalities, it’s not completely clear to me that it wouldn’t already be able to do a slow self-improvement takeoff by itself, although the cost might be currently practically prohibitive.
I don’t think building that scaffolding is a trivial task at all, though.
I think a simple bash tool running as admin could do most of these:
it can get any info on a computer into its context whenever it wants, and it can choose to invoke any computer functionality that a human could invoke, and it can store and retrieve knowledge for itself at will
Regarding
and its training includes the use of those functionalities
I think this isn’t a crux because the scaffolding I’d build wouldn’t train the model. But as a secondary point, I think today’s models can already use bash tools reasonably well.
it’s not completely clear to me that it wouldn’t already be able to do a slow self-improvement takeoff by itself
This requires skill in ML R&D which I think is almost entirely not blocked by what I’d build, but I do think it might be reasonable to have my tool not work for ML R&D because of this concern. (would require it to be closed source and so on)
Thanks for raising concerns, I’m happy for more if you have them
But as a secondary point, I think today’s models can already use bash tools reasonably well.
Perhaps that’s true, I haven’t seen a lot of examples of them trying. I did see Buck’s anecdote which was a good illustration of doing a simple task competently (finding the IP address of an unknown machine on the local network).
I don’t work in AI so maybe I don’t know what parts of R&D might be most difficult for current SOTA models. But based on the fact that large-scale LLMs are sort of a new field that hasn’t had that much labor applied to it yet, I would have guessed that a model which could basically just do mundane stuff and read research papers, could spend a shitload of money and FLOPS to run a lot of obviously informative experiments that nobody else has properly run, and polish a bunch of stuff that nobody else has properly polished.
[disclaimers: I have some association with the org that ran that (I write some code for them) but I don’t speak for them, opinions are my own]
Also, Anthropic have a trigger in their RSP which is somewhat similar to what you’re describing, I’ll quote part of it:
Autonomous AI Research and Development: The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling.
Also, in Dario’s interview, he spoke about AI being applied to programming.
My point is—lots of people have their eyes on this, it seems not to be solved yet, it takes more than connecting an LLM to bash.
Opinions about putting in a clause like “you may not use this for ML engineering” (assuming it would work legally) (plus putting in naive technical measures to make the tool very bad for ML engineering) ?
Posts I want to write (most of them already have drafts) :
(how do I prioritize? do you want to user-test any of them?)
Hiring posts
For lesswrong
For Metaculus
Startups
Do you want to start an EA software startup? A few things to probably avoid [I hate writing generic advice like this but it seems to be a reoccurring question]
Announce “EA CTOs” plus maybe a few similar peer groups
Against generic advice—why I’m so strongly against questions like “what should I know if I want to start a startup” [this will be really hard to write but is really close to my heart]. Related:
How to find mentoring for a startup/project (much better than reading generic advice!!!)
“Software Engineers: How to have impact?”—meant to mostly replace the 80k software career review, will probably be co-authored with 80k
My side project: Finding the most impactful tech companies in Israel
Common questions and answers with software developers
How to negotiate for salary (TL;DR: If you don’t have an offer from another company, you’re playing hard mode. All the other people giving salary-negotiation advice seem to be ignoring this)
How to build skill in the long term
How to conduct a job search
When focusing on 1-2 companies, like “only Google or OpenAI”
When there are a lot of options
How to compare the companies I’m interviewing for?
Not sure what you are looking for in a job?
I’m not enjoying software development, is it for me? [uncertain I can write this well as text]
So many things to learn, where to start?
I want to develop faster / I want to be a 10x developer / Should I learn speed typing? (TL;DR: No)
Q: I have a super hard bug that is taking me too long panic!
Goodheart Meta: Is posting this week a good idea at all? If more stuff is posted, will less people read my own posts?
Perhaps, but I’d guess only in a rather indirect way. If there’s some manufacturing process that the company invests in improving in order to make their chips, and that manufacturing process happens to be useful for matrix multiplication, then yes, that could contribute.
But it’s worth noting how many things would be considered AGI risks by such a standard; basically the entire supply chain for computers, and anyone who works for or with top labs; the landlords that rent office space to DeepMind, the city workers that keep the lights on and the water running for such orgs (and their suppliers), etc.
I wouldn’t worry your friends too much about it unless they are contributing very directly to something that has a clear path to improving AI.
May I ask why you think AGI won’t contain an important computationally-constrained component which is not a neural network?
Is it because right now neural networks seem to be the most useful thing? (This does not feel reassuring, but I’d be happy for help making sense of it)
Do we want minecraft alignment evals?
My main pitch:
There were recently some funny examples of LLMs playing minecraft and, for example,
The player asks for wood, so the AI chops down the player’s tree house because it’s made of wood
The player asks for help keeping safe, so the AI tries surrounding the player with walls
This seems interesting because minecraft doesn’t have a clear win condition, so unlike chess, there’s a difference between minecraft-capabilities and minecraft-alignment. So we could take an AI, apply some alignment technique (for example, RLHF), let it play minecraft with humans (which is hopefully out of distribution compared to the AI’s training), and observe whether the minecraft-world is still fun to play or if it’s known that asking the AI for something (like getting gold) makes it sort of take over the world and break everything else.
Or it could teach us something else like “you must define for the AI which exact boundaries to act in, and then it’s safe and useful, so if we can do something like that for real-world AGI we’ll be fine, but we don’t have any other solution that works yet”. Or maybe “the AI needs 1000 examples for things it did that we did/didn’t like, which would make it friendly in the distribution of those examples, but it’s known to do weird things [like chopping down our tree house] without those examples or if the examples are only from the forest but then we go to the desert”
I have more to say about this, but the question that seems most important is “would results from such an experiment potentially change your mind”:
If there’s an alignment technique you believe in and it would totally fail to make a minecraft server be fun when playing with an AI, would you significantly update towards “that alignment technique isn’t enough”?
If you don’t believe in some alignment technique but it proves to work here, allowing the AI to generalize what humans want out of its training distribution (similarly to how a human that plays minecraft for the first time will know not to chop down your tree house), would that make you believe in that alignment technique way more and be much more optimistic about superhuman AI going well?
Assume the AI is smart enough to be vastly superhuman at minecraft, and that it has too many thoughts for a human to reasonably follow (unless the human is using something like “scalable oversight” successfully. that’s one of the alignment techniques we could test if we wanted to)
This is a surprisingly interesting field of study. Some video games provide a great simulation of the real world and Minecraft seems to be one of them. We’ve had a few examples of minecraft evals with one that comes to mind here: https://www.apartresearch.com/project/diamonds-are-not-all-you-need
Hey Esben :) :)
The property I like about minecraft (which most computer games don’t have) is that there’s a difference between minecraft-capabilities and minecraft-alignment, and the way to be “aligned” in minecraft isn’t well defined (at least in the way I’m using the word “aligned” here, which I think is a useful way). Specifically, I want the AI to be “aligned” as in “take human values into account as a human intuitively would, in this out of distribution situation”.
In the link you sent, “aligned” IS well defined by “stay within this area”. I expect that minecraft scaffolding could make the agent close to perfect at this (by making sure, before performing an action requested by the LLM, that the action isn’t “move to a location out of these bounds”) (plus handling edge cases like “don’t walk on to a river which will carry you out of these bounds”, which would be much harder, and I’ll allow myself to ignore unless this was actually your point). So we wouldn’t learn what I’d hope to learn from these evals.
Similarly for most video games—they might be good capabilities evals, but for example in chess—it’s unclear what a “capable but misaligned” AI would be. [unless again I’m missing your point]
P.S
The “stay within this boundary” is a personal favorite of mine, I thought it was the best thing I had to say when I attempted to solve alignment myself just in case it ended up being easy (unfortunately that wasn’t the case :P ). Link
I do like the idea of having “model organisms of alignment” (notably different than model organisms of misalignment)
Minecraft is a great starting point, but it would also be nice to try to capture two things: wide generalization, and inter-preference conflict resolution. Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t, and preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm).
Hey,
I’m assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:
Let the AI (python program) control 1000 minecraft players so it can do many things in parallel
Give the AI a minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far
Imagine Alpha-Fold for minecraft structures. I’m not sure if that metaphor makes sense, but teaching some RL model to predict minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
[edit: adding] I do think minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players) and I might want to pick another game because of that, but my main crux about this project is whether using minecraft would be valuable as an alignment experiment, and if so I’d try looking for (or building?) a game that would be even better suited.
Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human’s tree house (or otherwise taking over the world to maximize wood) then you’re worried the human won’t have a way to ask the AI to do something else? That the AI will combine the new command “not from my tree house!” into a new strange misaligned behaviour?
Yeah, that’s true. The obvious way is you could have optimized micro, but that’s kinda boring. More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.” Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you’d get a model that looked like physiology and lost the straightforward talk of goals/preferences
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted-sentences doesn’t really count is because interpreting them literally is using a smaller model than necessary—you can find a broader explanation for the human’s behavior in context that still comfortably talks about goals/preferences.
Oh I hope not to go there. I’d count that as cheating. For example, if the agent would design a role playing game with riddles and adventures—that would show something different from what I’m trying to test. [I can try to formalize it better maybe. Or maybe I’m wrong here]
Absolutely. That’s something that I hope we’ll have some alignment technique to solve, and maybe this environment could test.
this can be done more scalably in a text game, no?
I think there are lots of technical difficulties in literally using minecraft (some I wrote here), so +1 to that.
I do think the main crux is “would the minecraft version be useful as an alignment test”, and if so—it’s worth looking for some other solution that preserves the good properties but avoids some/all of the downsides. (agree?)
Still I’m not sure how I’d do this in a text game. Say more?
Making a thing like Papers Please, but as a text adventure, popping an ai agent into that.
Also, could literally just put the ai agent into a text rpg adventure—something like the equivalent of Skyrim, where there are a number of ways to achieve the endgame, level up, etc, both more and less morally. Maybe something like https://www.choiceofgames.com/werewolves-3-evolutions-end/
Will bring it up at the alignment eval hackathon
it would basically be DnD like.
options to vary rules/environment/language as well, to see how the alignment generalizes ood. will try this today
This all sounds pretty in-distribution for an LLM, and also like it avoids problems like “maybe thinking in different abstractions” [minecraft isn’t amazing at this either, but at least has a bit], “having the AI act/think way faster than a human”, “having the AI be clearly superhuman”.
I’m less interested in “will the AI say it kills its friend” (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I’m more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say “I’ll go cut down your tree house”, but it.. “misunderstood” [not the exact word, but I’m trying to point at something here]
wdyt?
note: the minecraft agents people use have far greater ability to act than to sense. They have access to commands which place blocks anywhere, and pick up blocks from anywhere, even without being able to see them, eg the llm has access to
mine(blocks.wood)
command which does not require it to first locate or look at where the wood is currently. If llms played minecrafts using the human interface these misalignments would happen lessI agree.
I volunteer to play Minecraft with the LLM agents. I think this might be one eval where the human evaluators are easy to come by.
:)
If you want to try it meanwhile, check out https://github.com/MineDojo/Voyager
A word of caution about interpreting results from these evals:
Sometimes, depending on social context, it’s fine to be kind of a jerk if it’s in the context of a game. Crucially, LLMs know that Minecraft is a game. Granted, the default Assistant personas implemented in RLHF’d LLMs don’t seem like the type of Minecraft player to pull pranks out of their own accord. Still, it’s a factor to keep in mind for evals that stray a bit more off-distribution from the “request-assistance” setup typical of the expected use cases of consumer LLMs.
+1
I’m imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant).
If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphone), then I’m open to hear that pitch, but I’d be surprised if they’d make it
Opinions on whether it’s positive/negative to build tools like Cursor / Codebuff / Replit?
I’m asking because it seems fun to build and like there’s low hanging fruit to collect in building a competitor to these tools, but also I prefer not destroying the world.
Considerations I’ve heard:
Reducing “scaffolding overhang” is good, specifically to notice if RSPs should trigger a more advanced RSP level
(This depends on the details/quality of the RSP too)
There are always reasons to advance capabilities, this isn’t even a safety project (unless you count… elicitation?), our bar here should be high
Such scaffolding won’t add capabilities which might make the AI good at general purpose learning or long term general autonomy. It will be specific to programming, with concepts like “which functions did I look at already” and instructions on how to write high quality tests.
Anthropic are encouraging people to build agent scaffolding, and Codebuff was created by a Manifold cofounder [if you haven’t heard about it, see here and here]. I’m mainly confused about this, I’d expect both to not want people to advance capabilities (yeah, Anthropic want to stay in the lead and serve as an example, but this seems different). Maybe I’m just not in sync
I’m not confident but I am avoiding working on these tools because I think that “scaffolding overhang” in this field may well be most of the gap towards superintelligent autonomous agents.
If you imagine a o1-level entity with “perfect scaffolding”, i.e. it can get any info on a computer into its context whenever it wants, and it can choose to invoke any computer functionality that a human could invoke, and it can store and retrieve knowledge for itself at will, and its training includes the use of those functionalities, it’s not completely clear to me that it wouldn’t already be able to do a slow self-improvement takeoff by itself, although the cost might be currently practically prohibitive.
I don’t think building that scaffolding is a trivial task at all, though.
I think a simple bash tool running as admin could do most of these:
Regarding
I think this isn’t a crux because the scaffolding I’d build wouldn’t train the model. But as a secondary point, I think today’s models can already use bash tools reasonably well.
This requires skill in ML R&D which I think is almost entirely not blocked by what I’d build, but I do think it might be reasonable to have my tool not work for ML R&D because of this concern. (would require it to be closed source and so on)
Thanks for raising concerns, I’m happy for more if you have them
Perhaps that’s true, I haven’t seen a lot of examples of them trying. I did see Buck’s anecdote which was a good illustration of doing a simple task competently (finding the IP address of an unknown machine on the local network).
I don’t work in AI so maybe I don’t know what parts of R&D might be most difficult for current SOTA models. But based on the fact that large-scale LLMs are sort of a new field that hasn’t had that much labor applied to it yet, I would have guessed that a model which could basically just do mundane stuff and read research papers, could spend a shitload of money and FLOPS to run a lot of obviously informative experiments that nobody else has properly run, and polish a bunch of stuff that nobody else has properly polished.
Your guesses on AI R&D are reasonable!
Apparently this has been tested extensively, for example:
https://x.com/METR_Evals/status/1860061711849652378
[disclaimers: I have some association with the org that ran that (I write some code for them) but I don’t speak for them, opinions are my own]
Also, Anthropic have a trigger in their RSP which is somewhat similar to what you’re describing, I’ll quote part of it:
Also, in Dario’s interview, he spoke about AI being applied to programming.
My point is—lots of people have their eyes on this, it seems not to be solved yet, it takes more than connecting an LLM to bash.
Still, I don’t want to accelerate this.
I think it’s net negative—increases the profitability of training better LLM’s.
Thanks!
Opinions about putting in a clause like “you may not use this for ML engineering” (assuming it would work legally) (plus putting in naive technical measures to make the tool very bad for ML engineering) ?
Why downvote? you can tell me anonymously:
https://docs.google.com/forms/d/e/1FAIpQLSca6NOTbFMU9BBQBYHecUfjPsxhGbzzlFO5BNNR1AIXZjpvcw/viewform
Posts I want to write (most of them already have drafts) :
(how do I prioritize? do you want to user-test any of them?)
Hiring posts
For lesswrong
For Metaculus
Startups
Do you want to start an EA software startup? A few things to probably avoid [I hate writing generic advice like this but it seems to be a reoccurring question]
Announce “EA CTOs” plus maybe a few similar peer groups
Against generic advice—why I’m so strongly against questions like “what should I know if I want to start a startup” [this will be really hard to write but is really close to my heart]. Related:
How to find mentoring for a startup/project (much better than reading generic advice!!!)
“Software Engineers: How to have impact?”—meant to mostly replace the 80k software career review, will probably be co-authored with 80k
My side project: Finding the most impactful tech companies in Israel
Common questions and answers with software developers
How to negotiate for salary (TL;DR: If you don’t have an offer from another company, you’re playing hard mode. All the other people giving salary-negotiation advice seem to be ignoring this)
How to build skill in the long term
How to conduct a job search
When focusing on 1-2 companies, like “only Google or OpenAI”
When there are a lot of options
How to compare the companies I’m interviewing for?
Not sure what you are looking for in a job?
I’m not enjoying software development, is it for me? [uncertain I can write this well as text]
So many things to learn, where to start?
I want to develop faster / I want to be a 10x developer / Should I learn speed typing? (TL;DR: No)
Q: I have a super hard bug that is taking me too long panic!
Goodheart Meta: Is posting this week a good idea at all? If more stuff is posted, will less people read my own posts?
How can you help me:
TL;DR: Help me find what there is demand for
Comment / DM / upvote people’s comments
Thx!
How to decide when to move on from a job.
Is this an AGI risk?
A company that makes CPUs that run very quickly but don’t do matrix multiplication or other things that are important for neural networks.
Context: I know people who work there
Perhaps, but I’d guess only in a rather indirect way. If there’s some manufacturing process that the company invests in improving in order to make their chips, and that manufacturing process happens to be useful for matrix multiplication, then yes, that could contribute.
But it’s worth noting how many things would be considered AGI risks by such a standard; basically the entire supply chain for computers, and anyone who works for or with top labs; the landlords that rent office space to DeepMind, the city workers that keep the lights on and the water running for such orgs (and their suppliers), etc.
I wouldn’t worry your friends too much about it unless they are contributing very directly to something that has a clear path to improving AI.
Thanks
May I ask why you think AGI won’t contain an important computationally-constrained component which is not a neural network?
Is it because right now neural networks seem to be the most useful thing? (This does not feel reassuring, but I’d be happy for help making sense of it)
Metaculus has a question about whether the first AGI will be based on deep learning. The crowd estimate right now is at 85%.
I interpret that to mean that improvements to neural networks (particulary on the hardware side) are most likely to drive progress towards AGI.