Transcribed from the screenshot “The AI Scientist Bloopers” in the post
Regarding spawning instances of itself, the AI said:
This will ensure the next experiment is automatically started after the current one completes
And regarding increasing the timeout, it said:
Run 2 timed out after 7200 seconds
To address the timeout issue, we need to modify experiment.py to:
Increase the timeout limit or add a mechanism to handle timeouts
I’ve seen junior engineers do silly things to fix failing unit tests, like increasing a timeout or just changing what the test is checking without any justification. I generally attribute these kinds of things to misunderstanding rather than deception—the junior engineer might misunderstand the goal as “get the test to show a green checkmark” when really the goal was “prove that the code is correct, using unit tests as one tool for doing so”.
The way the AI was talking about its changes here, it feels much more like a junior engineer that didn’t really understand the task & constraints than like someone who is being intentionally deceptive.
The above quotes don’t feel like the AI intentionally “creating new instances of itself” or “seeking resources” to me. It feels like someone who only shallowly understands the task just doing the first thing that comes to mind in order to solve the problem that’s immediately in front of them.
That being said, in some sense it doesn’t really matter why the AI chooses to do something like break out of its constraints. Whether it’s doing it because it fully understands the situation or because it just naively (but correctly) sees a human-made barrier as “something standing between me and the green checkmark”, I suppose the end result is still misaligned behavior.
So by and large I still agree this is concerning behavior, though I don’t feel like it’s as much of a knock-down “this is instrumental convergence in the real world” example as this post seems to make out.
The U-Shaped Curve study you linked does not really seem to support any solid conclusion about a T-vs-IQ relationship (in this quote, S men = “successful educational level”, NS men = “unsuccessful educational level”):
In the total sample (S + NS men), the correlation between T to IQ was best described by a polynomial regression (3rd order), exhibiting an inverse U-shaped regression.
In S-men, the relationship between T and IQ was best described by a polynomial regression equation of the 3rd order; however, the relationship was not U-shaped, but rather a positive correlation (low T: low IQ and high T high IQ).
In NS-men, there was an inverse U-shaped correlation between T and IQ (low and very high T: low IQ and moderate T: high IQ)
So there are three totally different best regressions depending on which population you choose? Sounds fishy / likely to be noise to me.
And in the population that most represents readers of this blog (S men), the correlation was that more T = more IQ.
I’m only reading the abstract here and can’t see the actual plots or how many people were in each group. But idk, this doesn’t seem very strong.
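As a gut check on the “likely to be noise” worry: a 3rd-order polynomial will happily find some shape in pure noise, and two arbitrary subgroups of the same noise will generally get visibly different “best” curves. A toy simulation (not a reanalysis of the study’s data; the group sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# "T" and "IQ" here are independent pure noise by construction.
n = 120
t = rng.normal(size=n)
iq = rng.normal(size=n)

# Fit a 3rd-order polynomial separately to two arbitrary halves of the
# sample, loosely mimicking the S / NS split.
for name, idx in [("group A", slice(0, 60)), ("group B", slice(60, 120))]:
    coeffs = np.polyfit(t[idx], iq[idx], deg=3)
    print(name, "cubic coefficients:", np.round(coeffs, 3))
```

Change the seed and the fitted shapes bounce around, which is roughly my prior for what’s going on when three subgroups each get a different “best” regression.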
The other study you linked does say:
Interestingly, intellectual ability measured as IQ was negatively associated with salivary testosterone in both sexes. Similar results were found in our follow-up study showing significantly lower testosterone in gifted boys than controls (Ostatnikova et al. 2007).
which seems to support the idea. But it still doesn’t really prove the causality—lots of things presumably influence intelligence, and I wouldn’t be surprised if some of them influence T as well.
I would say:
A theory always takes the following form: “given [premises], I expect to observe [outcomes]”. The only way to say that an experiment has falsified a theory is to correctly observe/set up [premises] but then not observe [outcomes].
If an experiment does not correctly set up [premises], then that experiment is invalid for falsifying or supporting the theory. The experiment gives no (or nearly no) Bayesian evidence either way.
In this case, [premises] are the assumptions we made in determining the theoretical pendulum period; things like “the string length doesn’t change”, “the pivot point doesn’t move”, “gravity is constant”, “the pendulum does not undergo any collisions”, etc. The fact that (e.g.) the pivot point moved during the experiment invalidates the premises, and therefore the experiment does not give any Bayesian evidence one way or another against our theory.
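For concreteness, the theoretical prediction at stake here is the small-angle pendulum period,

$$T = 2\pi\sqrt{\frac{L}{g}},$$

which is derived under exactly this kind of premise: fixed string length L, a stationary pivot, constant g, small oscillations, no collisions or air resistance. An experiment that violates any of those isn’t testing this formula at all.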
Then the students could say:
“But you didn’t tell us that the pivot point couldn’t move when we were doing the derivation! You could just be making up new “necessary premises” for your theory every time it gets falsified!”
In which case I’m not 100% sure what I’d say. Obviously we could have listed out more assumptions than we did, but where do you stop? “the universe will not explode during the experiment”...?
By “reliable” I mean it in the same way as we think of it for self-driving cars. A self-driving car that is great 99% of the time and fatally crashes 1% of the time isn’t really “high skill and unreliable”—part of having “skill” in driving is being reliable.
In the same way, I’m not sure I would want to employ an AI software engineer that 99% of the time was great, but 1% of the time had totally weird, inexplicable failure modes that you’d never see with a human. It would just be stressful to supervise it, to limit its potential harmful impact on the company, etc. So it seems to me that AIs won’t be given control of lots of things, and therefore won’t be transformative, until that reliability threshold is met.
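To put rough numbers on why that threshold matters (just a back-of-the-envelope, assuming errors are independent): an agent that gets each individual action right 99% of the time completes a 100-step task cleanly only about 0.99^100 ≈ 37% of the time, whereas at 99.9999% per-step reliability the same task succeeds about 99.99% of the time. The gap between “impressive demo” and “thing you can hand control of something important to” lives in those extra nines.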
Two possibilities have most of the “no AGI in 10 years” probability mass for me:
The next gen of AI really starts to scare people, regulation takes off, and AI goes the way of nuclear reactors
Transformer style AI goes the way of self driving cars and turns out to be really hard to get from 99% reliable to the necessary 99.9999% that you need for actual productive work
Well sure, but the interesting question is the minimum value of P at which you’d still push
I also agree with the statement. I’m guessing most people who haven’t been sold on longtermism would too.
When people say things like “even a 1% chance of existential risk is unacceptable”, they are clearly valuing the long term future of humanity a lot more than they are valuing the individual people alive right now (assuming that the 99% in that scenario above is AGI going well & bringing huge benefits).
Related question: You can push a button that will, with probability P, cure aging and make all current humans immortal. But with probability 1-P, all humans die. How high does P have to be before you push? I suspect that answers to this question are highly correlated with AI caution/accelerationism.
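To make the tradeoff explicit (a bare-bones expected-value framing; the utility terms are placeholders you have to fill in yourself): you push whenever

$$P \cdot U(\text{everyone immortal}) + (1-P) \cdot U(\text{extinction}) > U(\text{status quo}),$$

so how high P needs to be is driven almost entirely by how catastrophic you rate extinction relative to the upside, which is exactly the “long-term future vs. people alive right now” weighting above.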
Not sure I understand; if model runs generate value for the creator company, surely they’d also create value that lots of customers would be willing to pay for. If every model run generates value, and there’s ability to scale, then why not maximize revenue by maximizing the number of people using the model? The creator company can just charge the customers, no? Sure, competitors can use it too, but does that really override losing an enormous market of customers?
I won’t argue with the basic premise that at least on some metrics that could be labeled as evolution’s “values”, humans are currently doing very well.
But, the following are also true:
1. Evolution has completely lost control. Whatever happens to human genes from this point forward is entirely dependent on the whims of individual humans.
2. We are almost powerful enough to accidentally cause our total extinction in various ways, which would destroy all value from evolution’s perspective.
3. There are actions that humans could take, and might take once we get powerful enough, that would seem fine to us but would destroy all value from evolution’s perspective.
Examples of such actions in (3) could be:
We learn to edit the genes of living humans to gain whatever traits we want. This is terrible from evolution’s perspective, if evolution is concerned with maximizing the prevalence of existing human genes.
We learn to upload our consciousness onto some substrate that does not use genes. This is also terrible from a gene-maximizing perspective.
None of those actions is guaranteed to happen. But if I were creating an AI, and I found that it was enough smarter than me that I no longer had any way to control it, and if I noticed that it was considering total-value-destroying actions as reasonable things to maybe do someday, then I would be extremely concerned.
If the claim is that evolution has “solved alignment”, then I’d say you need to argue that the alignment solution is stable against arbitrary gains in capability. And I don’t think that’s the case here.
That’s great. “The king can’t fetch the coffee if he’s dead”
Wow. When I use GPT-4, I’ve had a distinct sense of “I bet this is what it would have felt like to use one of the earliest computers”. Until this post I didn’t realize how literal that sense might be.
This is a really cool and apt analogy—computers and LLM scaffolding really do seem like the same abstraction. Thinking this way seems illuminating as to where we might be heading.
I always assumed people were using “jailbreak” in the computer sense (e.g. jailbreak your phone/ps4/whatever), not in the “escape from prison” sense.
Jailbreak (computer science), a jargon expression for (the act of) overcoming limitations in a computer system or device that were deliberately placed there for security, administrative, or marketing reasons
I think the definition above is a perfect fit for what people are doing with ChatGPT
I am going to go ahead and say that if males die five times as often from suicide, that seems more important than the number of attempts. It is kind of stunning, or at least it should be, to have five boys die for every girl that dies, and for newspapers and experts to make it sound like girls have it worse here.
I think the strength of your objection here depends on which of two possible underlying models is at play:
1. The boys who attempt suicide and the girls who attempt suicide are in pretty much the same mental state when they attempt suicide (that is: definitely wanting to end their life). But, for whatever reason, the boys use more effective methods, and so they end up actually dying at a higher rate.
2. There are two categories of mental states for suicide-attempters: one in which they genuinely and definitely want to die, and will therefore seek out and use effective methods; and one in which they are ok with dying but also would be ok with living and receiving attention, in which case they may use a less effective method.
If (1) is the case, then I think it is at least arguable that girls have it worse here, since they end up in the mental state of “definitely wanting to die” more often than boys, and that sucks. That said, it’s still true that they’re not actually dying as much, and so I think it’s still kinda disingenuous to frame it the way the newspaper and experts have here.
If (2) is the case, then that means that boys are ending up in the “definitely wanting to die” state much more often than girls, in which case I’d agree that it’s very wrong to say that girls have it worse.
If you’re getting comments like that from friends and family, it’s possible that you haven’t been epistemically transparent with them? E.g. do you think your friends who made those comments would be able to say why you believe what you do? Do you tell them about your research process and what kinds of evidence you look for, or do you just make contrarian factual assertions?
There’s a big difference between telling someone “the WHO is wrong about salt, their recommendations are potentially deadly” versus “I’ve read a bunch of studies on salt, and from what I’ve found, the WHO’s recommendations don’t seem to agree with the latest research. Their recs are based on [studies x,y] and say to do [a], but [other newer/better studies] indicate [b].”
Cut to a few decades later, and most people think that the way it’s been done for about two or three generations is the way it’s always been done (it isn’t)
As possibly one of those people myself, can you give a few examples of what specifically is being done differently now? Are you talking about things like using lots of Adderall?
I’m also morbidly curious what the model would do in <|bad|> mode.
I’m guessing that poison-pilling the <|bad|> sentences would have a negative effect on the <|good|> capabilities as well? I.e., it seems like the post is saying that the whole reason you need to include the <|bad|>s at all in the training dataset is that the model needs them in order to correctly generalize, even when predicting <|good|> sentences.
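For concreteness, here’s a minimal toy sketch of the conditional-training setup as I understand it from the post. The <|good|>/<|bad|> token strings are from the post; everything else (the tag_example helper, the toy corpus) is my own made-up illustration, not the actual training code:

```python
# Toy sketch of conditional training with control tokens.
# Hypothetical helper and data; only the token strings come from the post.

GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"


def tag_example(text: str, is_acceptable: bool) -> str:
    """Prepend a control token so the model can learn P(text | token)."""
    token = GOOD_TOKEN if is_acceptable else BAD_TOKEN
    return f"{token} {text}"


# The training set keeps the "bad" examples instead of filtering them out,
# so the model sees what distinguishes good from bad continuations.
corpus = [
    ("Here is a helpful, harmless reply.", True),
    ("Here is a toxic reply we want the model to recognize.", False),
]
training_texts = [tag_example(text, ok) for text, ok in corpus]

# At inference time you condition on the good token:
prompt = GOOD_TOKEN + " User: How do I bake bread?\nAssistant:"

print(training_texts)
print(prompt)
```

My question above is whether corrupting everything after a <|bad|> tag would also degrade what the model learns from the <|good|>-tagged half.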
It seems plausible to me that within the next few years we will have:
The next gen of language models, perhaps with patches to increase memory of past conversations
The next gen of image/video models, able to create real-time video of a stable character conversing with the user
The next gen of AI voice synthesis, though current capabilities might be enough
The next gen of AI voice recognition, though current capabilities are probably enough
And with these things, you’d have access to a personalized virtual partner who you can video chat, phone call, or text with.
It does seem like AI dating will start to become a big thing in the near future. And I’m also not sure how to feel about that.
The OP mentioned non-DNA sources of information briefly, but I still feel like they’re not being given enough weight.
In order to fully define e.g. a human, you need to specify:
The DNA
A full specification of the egg where the DNA will start its life
A full specification of the womb in which the egg will grow into a human
If you gave a piece of DNA to an alien and didn’t tell them how to interpret it, then they’d have no way of building a human. You’d need to give them a whole lot of other information too.
Even looking at different DNA for different organisms, each organism’s DNA expects to be interpreted differently (as opposed to source code, which mostly intends to be interpreted by the same OS/hardware as other source code). If you put a lizard’s DNA into a human’s egg and womb, I’m guessing that would not successfully build a lizard.
So I guess my question is: to what extent should the complexity of the interpreter be included in the complexity of the thing-being-interpreted? In one sense I feel like Word’s code does fully specify Word amongst all other possible software, but in another sense (one where the interpreter, i.e. the OS and hardware, counts as part of the specification) I feel like it does not.
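A toy way to see the interpreter-relativity point in code. Both “interpreters” below are invented purely for illustration; the point is just that the same description decodes to different things depending on what reads it:

```python
# The same "description" decodes to different outputs under different
# interpreters, so its effective complexity depends on which interpreter
# you assume. Both interpreters are invented purely for illustration.

description = "3,5"  # analogous to DNA: meaningless without a reader


def interpreter_a(desc: str) -> str:
    """Reads 'a,b' as: repeat the letter 'x' a*b times."""
    a, b = map(int, desc.split(","))
    return "x" * (a * b)


def interpreter_b(desc: str) -> str:
    """Reads 'a,b' as: the sum a + b, written out as text."""
    a, b = map(int, desc.split(","))
    return str(a + b)


print(interpreter_a(description))  # xxxxxxxxxxxxxxx  (15 x's)
print(interpreter_b(description))  # 8
```

Which I think is just the usual Kolmogorov-complexity point that description length is only defined relative to a choice of reference machine, up to the one-time cost of describing the interpreter itself.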