A number of years ago, when LessWrong was being revived from its old form to its new form, I did not expect the revival to work. I said as much at the time. For a year or two in the middle there, the results looked pretty ambiguous to me. But by now it’s clear that I was just completely wrong—I did not expect the revival to work as well as it has to date.
Oliver Habryka in particular wins Bayes points off of me. Hooray for being right while I was wrong, and for building something cool!
LW is more of an AI Alignment forum now, but without the mad-science agent-foundations spirit of the earlier post-Sequences era. It’s probably able to stay alive because the field grew. So it’s not revived, it’s something new. Development of rationality mostly remains in the past, with little significant discussion in recent years.
Yeah, I think some of this is true, but while there is a lot of AI content, I actually think that a lot of the same people would probably write non-AI content, and engage on non-AI content, if AI was less urgent, or the site had less existing AI content.
That counterfactual is hard to evaluate, but a lot of people who used to be core contributors to LW 1.0 are now also posting to LW 2.0, though they are now posting primarily on AI. I think that’s evidence that there has been a broader shift among LW users toward seeing AI as really urgent and important, rather than a very different new user base having been discovered.
I kind of agree on development of rationality feeling kind of stagnant right now. I think there are still good posts being written, but a lot of cognitive energy is definitely going into AI stuff, more so than rationality stuff.
Same! LW is an outstanding counterexample to my belief that resurrections are impossible. But I haven’t incorporated it into my gears-level model yet, and I’m unsure how to. What did LW do differently, or which gear in my head caused me to fail to predict this?
The original LW was a clone of Reddit. The Reddit source code was quite complex. I am a software developer; I looked at that code myself, tried to figure out some things, and then gave up.
I do not remember whether I made any predictions at that time. But ignoring what I know now, I probably would have said the following:
Creating a website with all the functionality of LessWrong 1.0 is a lot of work. Only a few LessWrong readers are capable of building such a complex project, and it would take them a lot of time. Most people with the required skills could probably get a job at Google, so the opportunity cost of building LessWrong 2.0 is very high.
Is anyone going to pay them, or are they supposed to do it in their free time? If it’s the former, it is going to be very expensive. Is it really a good way to spend so much money? If it’s the latter, it is very unlikely that the project will ever get finished, because it will progress very slowly.
If I understand it correctly, what happened is that some people got paid to work on this full-time. And they turned out to be very good at their job. They rewrote everything from scratch, which was probably the easier way, but it required a lot of time, and a lot of trust because it was “either complete success, or nothing” (as opposed to gradually adding new features to Reddit code).
If I understand it correctly, what happened is that some people got paid to work on this full-time.
This is about what I was going to say in response, before reading your comment.
I think the key factor that makes it different from other examples is that it was a competent person’s full time job.
There are some other things that need to go right in addition to that, but I suspect that there are lots of things that people are correctly outside view gloomy about which can just be done, if someone makes it their first priority.
Things that need to go right:
- it must be a competent person (as opposed to merely overconfident)
- who really cares about the project (more than about other possible projects)
- can convince other people of their competence (the ones who have money)
- gets sufficient funding (doesn’t waste their time and energy working a second job)
- has autonomy (no manager who would override or second-guess their decisions)
- no unexpected disasters (e.g. getting hit by a bus, patent trolls suing the project,...)
Other than the unexpected disasters, it seems like something that a competent civilization should easily do. Once you have competent people, allow them to demonstrate their competence, look for an intersection between what they want to do and what you need (or, if you are sufficiently rich, just an intersection between what they want to do and what you believe is a good thing), give them money, and let them work.
In real life, having the right skills and sending the right signals is not the same thing; people who do things are not the same as people who decide things; time is wasted on meetings and paperwork.
Just to be clear. There were people working on it who had both agency and competence, but they were working on it as a side project. I think having something be someone’s only priority and full-time job makes a large difference on how much agency someone can bring to bear on a project.
This is an excerpt from a comment I wrote on the EA forum, extracted and crossposted here by request:
There’s a phenomenon where a gambler places their money on 32, and then the roulette wheel comes up 23, and they say “I’m such a fool; I should have bet 23”.
More useful would be to say “I’m such a fool; I should have noticed that the EV of this gamble is negative.” Now at least you aren’t asking for magic lottery powers.
Even more useful would be to say “I’m such a fool; I had three chances to notice that this bet was bad: when my partner was trying to explain EV to me; when I snuck out of the house and ignored a sense of guilt; and when I suppressed a qualm right before placing the bet. I should have paid attention in at least one of those cases and internalized the arguments about negative EV, before gambling my money.” Now at least you aren’t asking for magic cognitive powers.
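Concretely, the “negative EV” here is ordinary roulette arithmetic; a minimal sketch, assuming an American-style wheel (38 pockets) with the standard 35:1 payout on a single-number bet:

```python
# Expected value of a single-number roulette bet, per unit staked.
# Assumes an American wheel: 38 pockets, 35-to-1 net payout on a win,
# and the whole stake lost otherwise.
p_win = 1 / 38
net_payout = 35
ev = p_win * net_payout - (1 - p_win) * 1
print(f"EV per unit staked: {ev:+.4f}")  # about -0.0526, i.e. roughly -5.3%
```

Every unit staked loses about 5.3% in expectation, regardless of which number is chosen; no single spin’s outcome changes that.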
My impression is that various EAs respond to crises in a manner that kinda rhymes with saying “I wish I had bet 23”, or at best “I wish I had noticed this bet was negative EV”, and in particular does not rhyme with saying “my second-to-last chance to do better (as far as I currently recall) was the moment that I suppressed the guilt from sneaking out of the house”.
(I think this is also true of the general population, to be clear. Perhaps even moreso.)
I have a vague impression that various EAs perform self-flagellation, while making no visible attempt to trace down where, in their own mind, they made a misstep. (Not where they made a good step that turned out in this instance to have a bitter consequence, but where they made a wrong step of the general variety that they could realistically avoid in the future.)
(Though I haven’t gone digging up examples, and in lieu of examples, for all I know this impression is twisted by influence from the zeitgeist.)
When I see or hear a piece of advice, I check to see what happens if the advice were the reverse. Often it’s also good advice, which means all we can do is take the advice into account as we try to live a balanced life. For example, if the advice is “be brave!” the reverse is “be more careful”. Which is good advice, too.
This advice is unusual in that it is non-reversible.
I’ve referred to “I should have bet on 23-type errors” several times over the past year. Having this shorthand and an explanation I can link to has sped up those conversations.
I’m not EA (though I do agree with most of the motte—I care about other humans, and I try to be effective), and not part of the rationalist “community”, so take this as an outside view.
There’s a ton of “standard human social drama” in EA and in rationalist communities, and really anywhere where “work” and “regular life” overlap significantly. Some of this takes the form of noticing flaws in other people’s rationality (or, just as often, flaws in kindness/empathy being justified by rationality).
Especially when one doesn’t want to identify and address specific examples, I think there’s a very high risk of misidentifying the cause of a disagreement or disrespect-of-behavior. In this case, I don’t notice much of the flagellation or wishing—either I don’t hang out in the right places, or I bounce off those posts and don’t pay much mind. But things that might fit that pattern strike me as a failure of personal responsibility, not a failure of modeling wishes. Your term self-flagellation is interesting from that standpoint—the historic practice was for penance of generalized sin, and to share suffering, not as a direct correction for anything. It’s clearly social, not rational.
IMO, rationalism must first and foremost be individual. I am trying to be less wrong in my private beliefs and in my goal-directed behaviors. Group rationality is a category error—I don’t have access to group beliefs (if there is such a thing). I do have some influence over group behaviors and shared statements, but I recognize that they are ALWAYS a compromise and negotiated results of individual beliefs and behaviors, and don’t necessarily match any individual in the group.
I’m surprised every time I see a rationalist assuming otherwise, and being disappointed that other members of the group don’t share all their beliefs and motivations.
From my perspective, “group rationality” means how much the group provides an environment conducive to becoming more rational (for those who have the propensity to do so).
It is probably easier to describe the opposite—a group where religion and woo are high status, skepticism is considered a form of stupidity, members are encouraged to think in slogans rather than inspect the details of their actual experience, etc.
A rational group would then be one where as an individual rationalist you can do the right thing without getting socially punished in turn, and are gently called out when you do something stupid.
And it is a “more or less” thing, rather than “yes or no”. (However, the same is true about individual rationality.) I would not expect any group to share all my beliefs and motivation. But the difference between sharing 20% or 80% of the beliefs and motivations means a lot to me.
In relation to my current stance on AI, I was talking with someone who said they’re worried about people putting the wrong incentives on labs. At various points in that convo I said stuff like (quotes are not exact; third paragraph is a present summary rather than a re-articulation of a past utterance):
“Sure, every lab currently seems recklessly negligent to me, but saying stuff like “we won’t build the bioweapon factory until we think we can prevent it from being stolen by non-state actors” is directionally better than not having any commitments about any point at which they might pause development for any reason, which is in turn directionally better than saying stuff like “we are actively fighting to make sure that the omnicidal technology is open-sourced”.”
And: “I acknowledge that you see a glimmer of hope down this path where labs make any commitment at all about avoiding doing even some minimal amount of scaling until even some basic test is passed, e.g. because that small step might lead to more steps, and/or that sort of step might positively shape future regulation. And on my notion of ethics it’s important to avoid stomping on other people’s glimmers of hope whenever that’s feasible (and subject to some caveats about this being tricky to navigate when your hopes are opposed), and I’d prefer people not stomp on that hope.”
I think that the labs should Just Fucking Stop, but I think we should also be careful not to create more pain for the companies that are doing relatively better, even if that better-ness is minuscule and woefully inadequate.
My conversation partner was like “I wish you’d say that stuff out loud”, and so, here we are.
It seems like this is only directionally better if it’s true, and this is still an open question for me. Like, I buy that some of the commitments around securing weights are true, and that seems good. I’m way less sure that companies will in fact pause development pending their assessment of evaluations. And to the extent that they are not, in a meaningful sense, planning to pause, this seems quite bad. It seems potentially worse, to me, to have a structure legitimizing this decision and making it seem more responsible than it is, rather than just openly doing a reckless thing. Not only because it seems dishonest, but also because unambiguous behavior is easier for people to point at and hence to understand, to stop, etc.
I don’t want to stomp on hope, but I’d also like labs not to stomp out existence. AI companies are risking people’s lives without their consent—much more so than is remotely acceptable, with their estimated risk of extinction/catastrophe sometimes as high as 33%—this seems unacceptable to me. They should absolutely be getting pushback if their commitments are not up to par. Doing relatively better is not what matters.
I was recently part of a group-chat where some people I largely respect were musing about this paper, this post, and some of Scott Aaronson’s recent “maybe intelligence makes things more good” type reasoning.
Here’s my replies, which seemed worth putting somewhere public:
The claims in the paper seem wrong to me as stated, and in particular seem to conflate values with instrumental subgoals. One does not need to terminally value survival to avoid getting hit by a truck while fetching coffee; one can simply understand that one can’t fetch the coffee when one is dead. See also instrumental convergence.
And then in reply to someone pointing out that the paper was perhaps trying to argue that most minds tend to wind up with similar values because of the fact that all minds are (in some sense) rewarded in training for developing similar drives:
So one hypothesis is that in practice, all practically-trainable minds manage to survive by dint of a human-esque survival instinct (while admitting that manually-engineered minds could survive some other way, e.g. by simply correctly modeling the consequences).
This mostly seems to me to be like people writing sci-fi in which the aliens are all humanoid; it is a hypothesis about tight clustering of cognitive drives even across very disparate paradigms (optimizing genomes is very different from optimizing every neuron directly).
But a deeper objection I have here is that I’d be much more comfortable with people slinging this sort of hypothesis around if they were owning the fact that it’s a hypothesis about tight clustering and non-alienness of all minds, while stating plainly that they think we should bet the universe on this intuition (despite how many times the universe has slapped us for believing anthropocentrism in the past).
FWIW, some reasons that I don’t myself buy this hypothesis include:
(a) the specifics of various human drives seem to me to be very sensitive to the particulars of our ancestry (ex: empathy seems likely a shortcut for modeling others by repurposing machinery for modeling the self (or vice versa), that is likely not found by hillclimbing when the architecture of the self is very different from the architecture of the other);
(b) my guess is that the pressures are just very different for different search processes (genetic recombination of DNA vs SGD on all weights); and
(c) it looks to me like value is fragile, such that even if the drives were kinda close, I don’t expect the obtainable optimum to be good according to our lights
(esp. given that the question is not just what drives the AI gets, but the reflective equilibrium of those drives: small changes to initial drives are allowed to have large changes to the reflective equilibrium, and I suspect this is so).
Someone recently privately asked me for my current state on my ‘Dark Arts of Rationality’ post. Here’s some of my reply (lightly edited for punctuation and conversation flow), which seemed worth reproducing publicly:
FWIW, that post has been on my list of things to retract for a while.
(The retraction is pending a pair of blog posts that describe some of my thoughts on related matters, which have been in the editing queue for over a year and the draft queue for years before that.)
I wrote that post before reading much of the sequences, and updated away from the position pretty soon after. My current stance is that you can basically get all the nice things, and never need to compromise your epistemics.
On my current accounting, the mistake I was making at the time of the dark arts post was something like: lots of stuff comes culturally bundled, in ways that can confuse you into thinking you can’t get good thing X without also swallowing bad thing Y.
And there’s a skill of just, like, taking the good stuff and discarding the bad stuff, even if you don’t yet know how to articulate a justification (which I lacked in full generality at the time of the dark arts post, and was developing at the time of the ‘certainty’ post.)
And it’s a little tricky to write about, because you’ve got to balance it against “care about consistency” / “notice when you’re pingponging between mutually-inconsistent beliefs as is convenient”, which is… not actually hard, I think, but I haven’t found a way to write about the one without the words having an interpretation of “just drop your consistency drive”. …which is how these sorts of things end up languishing in my editing queue for years, when I have other priorities.
(And for the record, another receipt here is that in some twitter thread somewhere—maybe the jargon thread?—I noted the insight about unbundling things, using “you can’t be sad and happy at the same time” as an example of a bundled-thing, which isn’t the whole concept, but which is another instance of the resolution intruding in a visible way.)
(More generally, a bunch of my early MoW posts are me, like, digesting parts of the sequences and correcting a bunch of my errors from before I encountered this community. And for the record, I’m grateful to the memes in this community—and to Eliezer in particular, who I count as originating many of them—for helping me stop being an idiot in that particular way.)
I’ve also gone ahead and added a short retraction-ish paragraph to the top of the dark arts post, and might edit it later to link it to the aforementioned update-posts, if they ever make it out of the editing queue.
A few people recently have asked me for my take on ARC evals, and so I’ve aggregated some of my responses here:
- I don’t have strong takes on ARC Evals, mostly on account of not thinking about it deeply.
- Part of my read is that they’re trying to, like, get a small dumb minimal version of a thing up so they can scale it to something real. This seems good to me.
- I am wary of people in our community inventing metrics that Really Should Not Be Optimized and handing them to a field that loves optimizing metrics.
- I expect there are all sorts of issues that would slip past them, and I’m skeptical that the orgs-considering-deployments would actually address those issues meaningfully if issues were detected ([cf](https://www.lesswrong.com/posts/thkAtqoQwN6DtaiGT/carefully-bootstrapped-alignment-is-organizationally-hard)).
- Nevertheless, I think that some issues can be caught, and attempting to catch them (and to integrate with leading labs, and make “do some basic checks for danger” part of their deployment process) is a step up from doing nothing.
- I have not tried to come up with better ideas myself.
Overall, I’m generally enthusiastic about the project of getting people who understand some of the dangers into the deployment-decision loop, looking for advance warning signs.
Reproduced from a twitter thread:
I’ve encountered some confusion about which direction “geocentrism was false” generalizes. Correct use: “Earth probably isn’t at the center of the universe”. Incorrect use: “All aliens probably have two arms with five fingers.”
The generalized lesson from geocentrism being false is that the laws of physics don’t particularly care about us. It’s not that everywhere must be similar to here along the axes that are particularly salient to us.
I see this in the form of people saying “But isn’t it sheer hubris to believe that humans are rare with the property that they become more kind and compassionate as they become more intelligent and mature? Isn’t that akin to believing we’re at the center of the universe?”
I answer: no; the symmetry is that other minds have other ends that their intelligence reinforces; kindness is not privileged in cognition any more than Earth was privileged as the center of the universe; imagining all minds as kind is like imagining all aliens as 10-fingered.
(Some aliens might be 10-fingered! AIs are less likely to be 10-fingered, or to even have fingers in the relevant sense! See also some of Eliezer’s related thoughts)