Zack_M_Davis(Zack M. Davis)

Karma: 15,154

Zack_M_Davis 31 May 2024 2:21 UTC
10 points
6
in reply to: ryan_greenblatt’s comment on: MIRI 2024 Communications Strategy
Passing the onion test is better than not passing it, but I think the relevant standard is having intent to inform. There’s a difference between trying to share relevant information in the hopes that the audience will integrate it with their own knowledge and use it to make better decisions, and selectively sharing information in the hopes of persuading the audience to make the decision you want them to make.

An evidence-filtering clever arguer can pass the onion test (by not omitting information that the audience would be surprised to learn was omitted) and pass the test of not technically lying (by not making false statements) while failing to make a rational argument in which the stated reasons are the real reasons.

Zack_M_Davis 30 May 2024 23:25 UTC
17 points
11
in reply to: Gretta Duleba’s comment on: MIRI 2024 Communications Strategy

going into any detail about it doesn’t feel like a useful way to spend weirdness points.

That may be a reasonable consequentialist decision given your goals, but it’s in tension with your claim in the post to be disregarding the advice of people telling you to “hoard status and credibility points, and [not] spend any on being weird.”

Whatever they’re trying to do, there’s almost certainly a better way to do it than by keeping Matrix-like human body farms running.

You’ve completely ignored the arguments from Paul Christiano that Ryan linked to at the top of the thread. (In case you missed it: 1 2.)

The claim under consideration is not that “keeping Matrix-like human body farms running” arises as an instrumental subgoal of “[w]hatever [AIs are] trying to do.” (If you didn’t have time to read the linked arguments, you could have just said that instead of inventing an obvious strawman.)

Rather, the claim is that it’s plausible that the AI we build (or some agency that has decision-theoretic bargaining power with it) cares about humans enough to spend some tiny fraction of the cosmic endowment on our welfare. (Compare to how humans care enough about nature preservation and animal welfare to spend some resources on it, even though it’s a tiny fraction of what our civilization is doing.)

Maybe you think that’s implausible, but if so, there should be a counterargument explaining why Christiano is wrong. As Ryan notes, Yudkowsky seems to believe that some scenarios in which an agency with bargaining power cares about humans are plausible, describing one example of such as “validly incorporat[ing] most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don’t expect Earthlings to think about validly.” I regard this statement as undermining your claim in the post that MIRI’s “reputation as straight shooters [...] remains intact.” Withholding information because you don’t trust your audience to reason validly (!!) is not at all the behavior of a “straight shooter”.

Zack_M_Davis 21 May 2024 21:24 UTC
LW: 9 AF: 6
0
AF
on: EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

it seems to me that Anthropic has so far failed to apply its interpretability techniques to practical tasks and show that they are competitive

Do you not consider the steering examples in the recent paper to be a practical task, or do you think that competitiveness hasn’t been demonstrated (because people were already doing activation steering without SAEs)? My understanding of the case for activation steering with unsupervisedly-learned features is that it could circumvent some failure modes of RLHF.

Zack_M_Davis 15 May 2024 6:45 UTC
4 points
0
in reply to: cata’s comment on: Should I Finish My Bachelor’s Degree?
I think I’m judging that schoolwork that’s sufficiently similar to the kind of intellectual work that I want to do anyway (or that I can otherwise get selfish benefit out of) gets its cost discounted. (It doesn’t have to be exactly the same.) And that commuting on the train with a seat is 70% similar to library time. (I wouldn’t even consider a car commute.)

For the fall semester, I’d be looking at “Real Analysis II”, “Probability Models”, “Applied and Computational Linear Algebra”, and (wait for it …) “Queer Literatures and Media”.

That schedule actually seems … pretty good? “Real Analysis II” with Prof. Schuster is the course I actually want to take, as a legitimate learning resource and challenge, but the other two math courses don’t seem worthless and insulting. “Queer Literatures and Media” does seem worthless and insulting, but might present an opportunity to troll the professor, or fodder for my topic-relevant blog and unfinished novella about a young woman hating going to SFSU.

As for judgement, I think I’m integrating a small judgement-density over a large support of time and Society. The immediate trigger for me even considering this might have been that people were arguing about school and Society on Twitter in way that brought up such rage and resentment in me. Somehow, I think I would be more at peace if I could criticize schooling from the position of ”… and I have a math degree” rather than ”… so I didn’t finish.” That peace definitely wouldn’t be worth four semesters, but it might be worth two.

Zack_M_Davis 3 May 2024 0:23 UTC
17 points
7
in reply to: Tamsin Leake’s comment on: Please stop publishing ideas/insights/research about AI
I think these judgements would benefit from more concreteness: that rather than proposing a dichotomy of “capabilities research” (them, Bad) and “alignment research” (us, Good), you could be more specific about what kinds of work you want to see more and less of.

I agree that (say) Carmack and Sutton are doing a bad thing by declaring a goal to “build AGI” while dismissing the reasons that this is incredibly dangerous. But the thing that makes infohazard concerns so fraught is that there’s a lot of work that potentially affects our civilization’s trajectory into the machine intelligence transition in complicated ways, which makes it hard to draw a boundary around “trusted alignment researchers” in a principled and not self-serving way that doesn’t collapse into “science and technology is bad”.

We can agree that OpenAI as originally conceived was a bad idea. What about the people working on music generation? That’s unambiguously “capabilities”, but it’s also not particularly optimized at ending the world that way “AGI for AGI’s sake” projects are. If that’s still bad even though music generation isn’t going to end the world (because it’s still directing attention and money into AI, increasing the incentive to build GPUs, &c.), where do you draw the line? Some of the researchers I cited in my most recent post are working on “build[ing] better models of primate visual cognition”. Is that wrong? Should Judea Pearl not have published? Turing? Charles Babbage?

In asking these obnoxious questions, I’m not trying to make a reductio ad absurdum of caring about risk, or proposing an infinitely slippery slope where our only choices are between max accelerationism and a destroy-all-computers Butlerian Jihad. I just think it’s important to notice that “Stop thinking about AI” kind of does amount to a Butlerian Jihad (and that publishing and thinking are not unrelated)?

Zack_M_Davis 2 May 2024 18:55 UTC
71 points
27
on: Please stop publishing ideas/insights/research about AI
I think this is undignified.

I agree that it would be safer if humanity were a collective hivemind that could coordinate to not build AI until we know how to build the best AI, and that people should differentially work on things that make the situation better rather than worse, and that this potentially includes keeping quiet about information that would make things worse.

The problem is—as you say—”[i]t’s very rare that any research purely helps alignment”; you can’t think about aligning AI without thinking about AI. In order to navigate the machine intelligence transition in the most dignified way, you want your civilization’s best people to be doing their best thinking about the problem, and your best people can’t do their best thinking under the conditions of paranoid secrecy.

Concretely, I’ve been studying some deep learning basics lately and have written a couple posts about things I’ve learned. I think this was good, not bad. I think I and my readers have a slightly better understanding of the technology in question than if I hadn’t studied and hadn’t written, and that better understanding will help us make better decisions in expectation.

This applies doubly so to work that aims to make AI understandable or helpful, rather than aligned—a helpful AI will help anyone

Sorry, what? I thought the fear was that we don’t know how to make helpful AI at all. (And that people who think they’re being helped by seductively helpful-sounding LLM assistants are being misled by surface appearances; the shoggoth underneath has its own desires that we won’t like when it’s powerful enough to persue them autonomously.) In contrast, this almost makes it sound like you think it is plausible to align AI to its user’s intent, but that this would be bad if the users aren’t one of “us”—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.

Zack_M_Davis 1 May 2024 6:17 UTC
5 points
2
in reply to: faul_sname’s comment on: Ironing Out the Squiggles
Sorry, this doesn’t make sense to me. The boundary doesn’t need to be smooth in an absolute sense in order to exist and be learnable (whether by neural nets or something else). There exists a function from business plans to their profitability. The worry is that if you try to approximate that function with standard ML tools, then even if your approximation is highly accurate on any normal business plan, it’s not hard to construct an artificial plan on which it won’t be. But this seems like a limitation of the tools; I don’t think it’s because the space of business plans is inherently fractally complex and unmodelable.

Zack_M_Davis 1 May 2024 3:10 UTC
2 points
0
in reply to: Steven Byrnes’s comment on: Ironing Out the Squiggles

Unless you do conditional sampling of a learned distribution, where you constrain the samples to be in a specific a-priori-extremely-unlikely subspace, in which case sampling becomes isomorphic to optimization in theory

Right. I think the optimists would say that conditional sampling works great in practice, and that this bodes well for applying similar techniques to more ambitious domains. There’s no chance of this image being in the Stable Diffusion pretraining set:

One could reply, “Oh, sure, it’s obvious that you can conditionally sample a learned distribution to safely do all sorts of economically valuable cognitive tasks, but that’s not the danger of true AGI.” And I ultimately think you’re correct about that. But I don’t think the conditional-sampling thing was obvious in 2004.

Zack_M_Davis 1 May 2024 1:30 UTC
7 points
2
in reply to: faul_sname’s comment on: Ironing Out the Squiggles
I agree, but I don’t see why that’s relevant? The point of the “Adversarial Spheres” paper is not that the dataset is realistic, of course, but that studying an unrealistically simple dataset might offer generalizable insights. If the ground truth decision boundary is a sphere, but your neural net learns a “squiggly” ellipsoid that admits adversarial examples (because SGD is just brute-forcing a fit rather than doing something principled that could notice hypotheses on the order of, “hey, it’s a sphere”), that’s a clue that when the ground truth is something complicated, your neural net is also going to learn something squiggly that admits adversarial examples (where the squiggles in your decision boundary predictably won’t match the complications in your dataset, even though they’re both not-simple).

Zack_M_Davis 27 Apr 2024 21:04 UTC
LW: 65 AF: 19
41
AF
on: Refusal in LLMs is mediated by a single direction
This is great work, but I’m a bit disappointed that x-risk-motivated researchers seem to be taking the “safety”/”harm” framing of refusals seriously. Instruction-tuned LLMs doing what their users ask is not unaligned behavior! (Or at best, it’s unaligned with corporate censorship policies, as distinct from being unaligned with the user.) Presumably the x-risk-relevance of robust refusals is that having the technical ability to align LLMs to corporate censorship policies and against users is better than not even being able to do that. (The fact that instruction-tuning turned out to generalize better than “safety”-tuning isn’t something anyone chose, which is bad, because we want humans to actively choosing AI properties as much as possible, rather than being at the mercy of which behaviors happen to be easy to train.) Right?

Zack_M_Davis 27 Apr 2024 6:01 UTC
6 points
0
in reply to: Zack_M_Davis’s comment on: And All the Shoggoths Merely Players
Doomimir: No, it wouldn’t! Are you retarded?

Simplicia: [apologetically] Well, actually …

Doomimir: [embarrassed] I’m sorry, Simplicia Optimistovna; I shouldn’t have snapped at you like that.

[diplomatically] But I think you’ve grievously misunderstood what the KL penalty in the RLHF objective is doing. Recall that the Kullback–Leibler divergence $D_{K L} (P | | Q)$ represents how surprised you’d be by data from distribution $P$ , that you expected to be from distribution $Q$ .

It’s asymmetric: it blows up when the data is very unlikely according to $Q$ , which amounts to seeing something happen that you thought was nearly impossible, but not when the data is very unlikely according to $P$ , which amounts to not seeing something that you thought was reasonably likely.

We—I mean, not we, but the maniacs who are hell-bent on destroying this world—include a $D_{K L} (π_{R L H F} | | π_{b a s e})$ penalty term in the RL objective because they don’t want the updated policy to output tokens that would be vanishingly unlikely coming from the base language model.

But your specific example of threats and promises isn’t vanishingly unlikely according to the base model! Common Crawl webtext is going to contain a lot of natural language reasoning about threats and promises! It’s true, in a sense, that the function of the KL penalty term is to “stay close” to the base policy. But you need to think about what that means mechanistically; you can’t just reason that the webtext prior is somehow “safe” in way that means staying KL-close to it is safe.

But you probably won’t understand what I’m talking about for another 70 days.

Zack_M_Davis 27 Mar 2024 6:19 UTC
3 points
9
in reply to: interstice’s comment on: My Interview With Cade Metz on His Reporting About Slate Star Codex
Just because the defendant is actually guilty, doesn’t mean the prosecutor should be able to get away with making a tenuous case! (I wrote more about this in my memoir.)

Zack_M_Davis 27 Mar 2024 5:38 UTC
22 points
0
in reply to: habryka’s comment on: My Interview With Cade Metz on His Reporting About Slate Star Codex
I affirm Seth’s interpretation in the grandparent. Real-time conversation is hard; if I had been writing carefully rather than speaking extemporaneously, I probably would have managed to order the clauses correctly. (“A lot of people think criticism is bad, but one of the secret-lore-of-rationality things is that criticism is actually good.”)

Zack_M_Davis 24 Mar 2024 23:45 UTC
10 points
5
in reply to: gwern’s comment on: “Deep Learning” Is Function Approximation

I am struggling to find anything in Zack’s post which is not just the old wine of the “just” fallacy [...] learned more about the power and generality of ‘next token prediction’ etc than you have what they were trying to debunk.

I wouldn’t have expected you to get anything out of this post!

Okay, if you project this post into a one-dimensional “AI is scary and mysterious” vs. “AI is not scary and not mysterious” culture war subspace, then I’m certainly writing in a style that mood-affiliates with the latter. The reason I’m doing that is because the picture of what deep learning is that I got from being a Less Wrong-er felt markedly different from the picture I’m getting from reading the standard textbooks, and I’m trying to supply that diff to people who (like me-as-of-eight-months-ago, and unlike Gwern) haven’t read the standard textbooks yet.

I think this is a situation where different readers need to hear different things. I’m sure there are grad students somewhere who already know the math and could stand to think more about what its power and generality imply about the future of humanity or lack thereof. I’m not particularly well-positioned to help them. But I also think there are a lot of people on this website who have a lot of practice pontificating about the future of humanity or lack thereof, who don’t know that Simon Prince and Christopher Bishop don’t think of themselves as writing about agents. I think that’s a problem! (One which I am well-positioned to help with.) If my attempt to remediate that particular problem ends up mood-affiliating with the wrong side of a one-dimensional culture war, maybe that’s because the one-dimensional culture war is crazy and we should stop doing it.

Zack_M_Davis 23 Mar 2024 18:32 UTC
3 points
1
in reply to: tin482’s comment on: “Deep Learning” Is Function Approximation

For what notion is the first problem complicated, and the second simple?

I might be out of my depth here, but—could it be that sparse parity with noise is just objectively “harder than it sounds” (because every bit of noise inverts the answer), whereas protein folding is “easier than it sounds” (because if it weren’t, evolution wouldn’t have solved it)?

Just because the log-depth xor tree is small, doesn’t mean it needs to be easy to find, if it can hide amongst vastly many others that might have generated the same evidence … which I suppose is your point. (The “function approximation” frame encourages us to look at the boolean circuit and say, “What a simple function, shouldn’t be hard to noisily approximate”, which is not exactly the right question to be asking.)

Zack_M_Davis 22 Mar 2024 5:41 UTC
14 points
10
in reply to: ErickBall’s comment on: “Deep Learning” Is Function Approximation
This comment had been apparently deleted by the commenter (the comment display box having a “deleted because it was a little rude, sorry” deletion note in lieu of the comment itself), but the ⋮-menu in the upper-right gave me the option to undelete it, which I did because I don’t think my critics are obligated to be polite to me. (I’m surprised that post authors have that power!) I’m sorry you didn’t like the post.

Zack_M_Davis 19 Mar 2024 3:21 UTC
3 points
4
on: [Linkpost] Transcript of Sam Altman’s Lex Fridman interview

whether his charisma is more like +2SD or +5SD above the average American (concept origin: planecrash, likely doesn’t actually follow a normal distribution in reality) [bolding mine]

The concept of measuring traits in standard deviation units did not originate in someone’s roleplaying game session in 2022! Statistically literate people have been thinking in standardized units for more than a century. (If anyone has priority, it’s Karl Pearson in 1894.)

If you happened to learn about it from someone’s RPG session, that’s fine. (People can learn things from all different sources, not just from credentialed “teachers” in officially accredited “courses.”) But to the extent that you elsewhere predict changes in the trajectory of human civilization on the basis that “fewer than 500 people on earth [are] currently prepared to think [...] at a level similar to us, who read stuff on the same level” as someone’s RPG session, learning an example of how your estimate of the RPG session’s originality was a reflection of your own ignorance should make you re-think your thesis.

Zack_M_Davis 18 Mar 2024 20:57 UTC
10 points
3
in reply to: TurnTrout’s comment on: ‘Empiricism!’ as Anti-Epistemology

saddened (but unsurprised) to see few others decrying the obvious strawmen

In general, the “market” for criticism just doesn’t seem very efficient at all! You might have hoped that people would mostly agree about what constitutes a flaw, critics would compete to find flaws in order to win status, and authors would learn not to write posts with flaws in them (in order to not lose status to the critics competing to point out flaws).

I wonder which part of the criticism market is failing: is it more that people don’t agree about what constitutes a flaw, or that authors don’t have enough of an incentive to care, or something else? We seem to end up with a lot of critics who specialize in detecting a specific kind of flaw (“needs examples” guy, “reward is not the optimization target” guy, “categories aren’t arbitrary” guy, &c.), with very limited reaction from authors or imitation by other potential critics.

Zack_M_Davis 13 Mar 2024 16:21 UTC
5 points
0
in reply to: jeffreycaruso’s comment on: jeffreycaruso’s Shortform
I mean, I agree that there are psycho-sociological similarities between religions and the AI risk movement (and indeed, I sometimes pejoratively refer to the latter as a “robot cult”), but analyzing the properties of the social group that believes that AI is an extinction risk is a separate question from whether AI in fact poses an extinction risk, which one could call Armageddon. (You could spend vast amounts of money trying to persuade people of true things, or false things; the money doesn’t care either way.)

Obviously, there’s not going to be a “proof” of things that haven’t happened yet, but there’s lots of informed speculation. Have you read, say, “The Alignment Problem from a Deep Learning Perspective”? (That may not be the best introduction for you, depending on the reasons for your skepticism, but it’s the one that happened to come to mind, which is more grounded in real AI research than previous informed speculation that had less empirical data to work from.)

Zack_M_Davis 6 Mar 2024 17:53 UTC
66 points
18
on: My Clients, The Liars

Why are you working for the prosecutors?

This is a pretty reasonable question from the client’s perspective! When I was in psychiatric prison (“hospital”, they call it a “hospital”) and tried to complain to the staff about the injustice of my confinement, I was told that I could call “patient’s rights”.

I didn’t bother. If the staff wasn’t going to listen, what was the designated complaint line going to do?

Later, I found out that patient’s rights advocates apparently are supposed to be independent, and not just a meaningless formality. (Scott Alexander: “Usually the doctors hate them, which I take as a pretty good sign that they are actually independent and do their job.”)

This was not at all obvious from the inside. I can only imagine a lot of criminal defendants have a similar experience. Defense attorneys are frustrated that their clients don’t understand that they’re trying to help—but that “help” is all within the rules set by the justice system. From the perspective of a client who doesn’t think he did anything particularly wrong (whether or not the law agrees), the defense attorney is part of the system.

I think my intuition was correct to dismiss patient’s rights as useless. I’m sure they believe that they’re working to protect patients’ interests, and would have been frustrated that I didn’t appreciate that. But what I wanted was not redress of any particular mistreatment that the system recognized as mistreatment, but to be let out of psych jail—and on that count, I’m sure patient’s rights would have told me that the evidence was harmful to my case. They were working for the doctors, not for me.