This point suggests alternative models for risks and opportunities from “AI”. If deep learning applied to various narrow problems is a new source of superhuman capabilities, that has a lot of implications for the future of the world, setting “AGI” aside.
If your prototypical example of a contemporary computer program analogous to future AGI is a chess engine rather than an LLM, then agency by default is very intuitive: what humans think of as “tactics” to win material emerge from a comprehensive but efficient search for winning board-states without needing to be individually programmed. If contemporary LLMs are doing something less agentic than a comprehensive but efficient search for winning universe-states, there’s reason to be wary that this is not the end of the line for AI development. (If you could set up a sufficiently powerful outcome-oriented search, you’d expect creator-unintended agency to pop up in the winning solutions.)
I think this was excellent work that no one (rounding down) appreciated at the time because I sacrificed readability by optimizing for comprehensiveness. If it helps, I have now composed a Twitter-optimized summary:
Time for a Twitter-optimized capsule 🧵 of my 2021 philosophy of language thesis about why choosing bad definitions is relevantly similar to lying! If you wouldn’t lie, you also shouldn’t say, “it’s not lying; I’m just defining words in a way that I prefer.” 1⁄24
Some people say: the borders of a category are like the borders of a country—they have consequences, but there’s no sense in which some possible borders can be objectively worse than others. 2⁄24
But category “borders” or “boundaries” are just a visual metaphor corresponding to a kind of probabilistic model. Editing the “boundary” means editing the model’s predictions. There is a sense in which some models can be objectively worse than others! 3⁄24
Imagine having to sort a bunch of blue egg-shaped things (which contain vanadium) & red cubes (that don’t). Technically, you don’t actually need separate “blue egg” & “red cube” categories. You could just build up a joint probability table over all objects and query that. 4⁄24
But that’s unwieldy. Thinking about “blue egg” & “red cube” as separate categories and computing the properties of an object conditional on its category is much more efficient. 5⁄24
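For concreteness, here is a minimal sketch of the contrast in 4–5/24. The objects, features, and probabilities are all invented for illustration (they're not from the original post); the point is only that the category-based model stores a few conditionals per category instead of a row per feature combination.

```python
# Toy illustration with invented numbers: a full joint table vs. a
# category-based model for the blue-egg/red-cube world.

# Full joint distribution: one probability per complete feature combination.
joint = {
    # (color, shape, contains_vanadium): probability
    ("blue", "egg", True): 0.40,
    ("blue", "egg", False): 0.05,
    ("blue", "cube", True): 0.01,
    ("red", "egg", False): 0.04,
    ("red", "cube", True): 0.02,
    ("red", "cube", False): 0.48,
}

def p_vanadium_given_blue_from_joint():
    """Query the joint table directly (unwieldy as features multiply)."""
    num = sum(p for (color, _, vanadium), p in joint.items()
              if color == "blue" and vanadium)
    den = sum(p for (color, _, _), p in joint.items() if color == "blue")
    return num / den

# Category-based model: a handful of conditional probabilities per category.
categories = {
    "blue egg": {"prior": 0.45, "p_vanadium": 0.89},
    "red cube": {"prior": 0.50, "p_vanadium": 0.04},
}

def p_vanadium_given_category(name):
    """Much cheaper: condition on the category, then read off the property."""
    return categories[name]["p_vanadium"]

print(round(p_vanadium_given_blue_from_joint(), 2))  # 0.89, from the joint table
print(p_vanadium_given_category("blue egg"))         # 0.89, from the compressed model
```

With three binary features the joint table already needs up to eight rows; with n features it needs up to 2^n, while the per-category summary stays small.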
“Computing the properties of an object conditional on its category” can be visualized as category “boundaries” in a picture. 6⁄24
But the picture is an illustration of the math; you can’t change the picture without changing the math. It’s not like national borders at all. The U.S. purchasing Alaska (non-contiguous with the 48 states) wasn’t about editing a probabilistic model. 7⁄24
In itself, this doesn’t yet explain what’s wrong with “squiggly”, “gerrymandered” categories. You can still make predictions with squiggly categories. 8⁄24
But if approximately-correct answers are at all more useful than totally-wrong answers, squiggly categories are mathematically just worse (going by the mean squared error). If your “blue eggs” category contains some red cubes, you’ll drill for vanadium where there isn’t any. 9⁄24
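To put a toy number on “mathematically just worse” (all figures invented for illustration): predict each object's vanadium content with its category's average, and compare the mean squared error of clean categories against a gerrymandered category that lumps some red cubes in with the blue eggs.

```python
# Toy MSE comparison with invented numbers. Vanadium content is 1 for blue
# eggs and 0 for red cubes; each object is predicted with its category's mean.

def mse(values, prediction):
    """Mean squared error of predicting every value with `prediction`."""
    return sum((v - prediction) ** 2 for v in values) / len(values)

blue_eggs = [1.0] * 90   # 90 blue eggs, all containing vanadium
red_cubes = [0.0] * 10   # 10 red cubes, none containing vanadium

# Clean categories: each object is predicted with its own category's mean.
clean_error = (
    len(blue_eggs) * mse(blue_eggs, sum(blue_eggs) / len(blue_eggs))
    + len(red_cubes) * mse(red_cubes, sum(red_cubes) / len(red_cubes))
) / (len(blue_eggs) + len(red_cubes))

# "Squiggly" category: the red cubes get lumped in with the blue eggs,
# so everything is predicted with the lumped-together mean.
lumped = blue_eggs + red_cubes
squiggly_error = mse(lumped, sum(lumped) / len(lumped))

print(round(clean_error, 4), round(squiggly_error, 4))  # 0.0 vs. 0.09
```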
The best categories are subjective in the sense that they depend on what you’re trying to predict, but that’s not the same thing as the category boundary itself being subjective. Given what you want to predict, the model (and thus the “boundary”) is determined by the data. 10⁄24
Some people say: okay, but what if I really do have a preference for using a particular squiggly boundary, intrinsically, not in a way that arises from desired predictions and the data distribution? That’s just how my utility function is! What’s irrational about that? 11⁄24
Let’s interrogate this. What would it mean, to have such an exotic utility function? There is a trivial sense in which any pattern of behavior, however bizarre, could be rationalized in terms of preferring to take the actions that I do in the situation that I face. 12⁄24
But a theory that explains everything explains nothing. The explanatory value of the “utility function” formalism doesn’t lie in its ability to justify anything given a choice of “utility”, but in the constraints it articulates on coherent behaviors (given, yes, a choice of “utility”). 13⁄24
If your gambling behavior violates the independence axiom with respect to money, that doesn’t automatically make you irrational, but it does mean that you’re acting as if you care about something else besides money—that you’ll sacrifice some money for that something else. 14⁄24
Similarly, if your communication signals aren’t explainable in terms of conveying probabilistic predictions, that does imply that you care about something other than conveying probabilistic predictions—that you’ll sacrifice clarity (of predictions) for that something else. 15⁄24
But what might that something else be, concretely? It’s hard to see where a completely arbitrary, hardwired, “just because” preference for using a particular category boundary would come from! Why would that be a thing? Why?? 16⁄24
A much more plausible reason to sacrifice clarity of predictions is that you don’t want other agents to make accurate predictions. (Because if those others had better models, they’d make decisions that harm your interests.) That’s deception. 17⁄24
There’s no functional difference between saying “I reserve the right to lie p% of the time about whether something belongs to a category” and adopting a new category system that misclassifies p% of things. The input–output relations are the same. 18⁄24
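A hypothetical simulation of the “same input–output relations” point (my own illustration, not from the original post): from the listener's side, a reporter who lies about membership with probability p is indistinguishable from one who honestly reports membership in a redrawn boundary that misclassifies a p-fraction of objects.

```python
import random

P = 0.1  # fraction of objects misreported, whether by lying or by redefinition

def lie_sometimes(is_blue_egg):
    """Use the ordinary category, but lie about membership with probability P."""
    return not is_blue_egg if random.random() < P else is_blue_egg

def redrawn_boundary(is_blue_egg):
    """Honestly report membership in a redrawn category whose boundary
    misclassifies a P-fraction of objects. (The random draw stands in for
    whether this particular object happens to fall in the misclassified region.)"""
    return not is_blue_egg if random.random() < P else is_blue_egg

# The two policies are the same function of the input: a listener receiving
# the reports can't tell which one generated them. In both cases, about a
# P-fraction of reports mismatch the underlying property.
objects = [random.random() < 0.5 for _ in range(100_000)]
print(sum(lie_sometimes(x) != x for x in objects) / len(objects))     # ~0.1
print(sum(redrawn_boundary(x) != x for x in objects) / len(objects))  # ~0.1
```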
A related reason for unnatural categories: it’s tempting to “wirehead” by choosing a map that looks good, instead of the map that reflects the territory (which might be unpleasant to look at). That’s self-deception. 19⁄24
If I want to believe I’m pretty & funny, it might be tempting to redefine “pretty” & “funny” such that they include me. But that would just be fooling myself; it doesn’t actually work for making me pretty & funny (with respect to the usual meanings). 20⁄24
Sometimes one thing resembles another in some but not all respects. This is mimicry. It’s deceptive if the point is for another agent to treat the mimic as the original against that agent’s interests—but it’s not deceptive if the agent really doesn’t care about the difference. 21⁄24
If agents sharing a language disagree about which aspects “count”, they’ll fight over the definitions of words: animal advocates would prefer if plant-based meat substitutes counted as “real” meat, to make it hard for carnivores to insist on the dead-animal kind. 22⁄24
Philosophy itself can’t determine which definition is right (which depends on the empirical merits), but philosophy does clarify what’s happening in this kind of conflict—that departing from the empirical merits extracts a cost in the form of worse predictions. 23⁄24
Original post: “Unnatural Categories Are Optimized for Deception” https://www.lesswrong.com/posts/onwgTH6n8wxRSo2BJ/unnatural-categories-are-optimized-for-deception END/24
I’ve noticed that Claude 4 really likes the surname “Chen”.
OK, I see the relationship to the standard definition. (It would be bad faith to put on the appearance of being open to being convinced when you’re actually not.) The requirement of curiosity seems distinct and much more onerous, though. (If you think I’m talking nonsense and don’t feel curious about why, that doesn’t mean you’re not open to being convinced under any circumstances; it means you’re waiting for me to say something that you find convincing, without you needing to proactively read my mind.)
and not in good faith (as I judge it)
Wikipedia defines the antonym bad faith as “a sustained form of deception which consists of entertaining or pretending to entertain one set of feelings while acting as if influenced by another.” What hidden motive do you think Said is concealing, specifically? (Or if you’re using the term with a nonstandard meaning, what do you mean?)
Really?
Discontinuous Linear Functions?!
Sen. Markey of Massachusetts has issued a press release condemning the proposed moratorium and expressing intent to raise a point of order:
I am committed to fighting this 10-year ban with every tool at my disposal. And that starts by making it clear that this 10-year ban on state AI regulation is a policy change that has no impact on the federal budget. That means it cannot be included in a reconciliation bill. If Senate Republicans keep the House language in their reconciliation bill, I will raise a point of order against it.
The issue is that I have no idea where you’re getting that hypothesis from. What have I written, anywhere, that makes you think I would disapprove of Alexei’s comment?
The seventh guideline doesn’t say that you shouldn’t hypothesize about what other people believe
In accordance with the Eighth Guideline, I would like to revise the wording of my invocation of the Seventh Guideline in the grandparent: given our history of communication failures, I think your comments would be better if you try to avoid posing hypotheses (not “making claims”) about what I believe in the absence of direct textual evidence, in accordance with the Seventh Guideline.
(But again, that’s just my opinion about how I think you could write better comments; I don’t consider it a “request.”)
I’d be interested in a statement of what Zack-guideline the above “here’s what I think he believes?” falls afoul of.
I still think your Seventh Guideline applies as written. All three of your examples of “ways a Seventh Guideline request might look” seem appropriate to me with some small adaptations for context (notwithstanding that I don’t believe in “requests”).
You wrote:
“wow, I support this way less than I otherwise would have, because your (hypothesized) straightforward diagnosis of what was going on in a large conflict over norms seems to me to be kind of petty” is contra both my norms and my understanding of Zack’s preferred norms; unless I miss him entirely neither one of us wants LessWrong to be the kind of place where that sort of factor weighs very heavily in people’s analysis.
The first example of a way a Seventh Guideline request might look says,
That’s not what I wrote, though. Can you please engage with what I wrote?
I can’t quite ask you to engage with what I wrote, because your hypothesis that I don’t “want LessWrong to be the kind of place where that sort of factor weighs very heavily in people’s analysis” bears no obvious resemblance to anything I’ve written, so it’s not clear what part of my writing I should be directing you to read more carefully.
In fact, I don’t even read the pettiness judgement as having weighed very heavily in Alexei’s analysis! Alexei wrote, “Overall strong upvote from me, but I’m not doing it because [...]”. I interpret this as saying that the pettiness of that section was enough of a detractor from the value of the post that he didn’t feel like awarding a strong-upvote, which I regard as distinct from weighing heavily in his analysis of the contents of the rest of the post themselves (as contrasted to his analysis of whether to strong-upvote). If it looks like Dante was motivated to write The Inferno in order to have a short section at the end depicting his enemies suffering divine punishment, that’s definitely something Dante scholars should be allowed to notice and criticize, without that weighing heavily into their analysis of the preceding 4000 lines: there’s a lot of stuff in those 4000 lines to be analyzed, separately from the fact that it’s all building up to the enemy torture scene. (I’m doing a decent amount of interpretation here; if Alexei happens to make the poor time-allocation decision of reading this subthread, he is encouraged to invoke the Seventh Guideline against this paragraph.)
The second example of a way a Seventh Guideline request might look says,
Er, you seem to be putting a lot of words in my mouth.
I think this applies? (A previous revision of this comment said “This applies straightforwardly”, but maybe you think the “my understanding”/”unless I miss him” disclaimers exclude the possibility of “putting words in someone’s mouth”?)
The third and final example of a way a Seventh Guideline request might look says,
I feel like I’m being asked to defend a position I haven’t taken. Can you point at what I said that made you think I think X?
“Asked to defend” doesn’t apply, but the question does. Can you point at what I said that made you think that I think that Alexei’s comment weighs the pettiness judgement very heavily in his analysis and that I don’t want Less Wrong to be the kind of place?
After being prompted by this thread and thinking for a minute, I was able to come up with a reason I should arguably disapprove of Alexei’s comment: that pettiness is not intellectually substantive (the section is correct or not separately from whether it’s a petty thing to point out) and letting a pettiness assessment flip the decision of whether to upvote makes karma scores less useful. I don’t feel that strongly about this and wouldn’t have come up with it without prompting because I’m not a karma-grubber: I think it’s, well, petty to complain about someone’s reasons for downvoting or withholding an upvote.
Can you name any way to solve [chess but with rooks and bishops not being able to move more than four squares at a time] without RL (or something functionally equivalent to RL)?
This isn’t even hard. Just take a pre-2017 chess engine, and edit the rules code so that rooks and bishops can move at most four squares. You’re probably already done: the core minimax search still works, α–β pruning still works, quiescence still works, &c. To be fair, the heuristic evaluation function won’t be correct, but you could just … make bishops and rooks be respectively worth 2.5 and 3.5 points instead of the traditional 3 and 5? Even if my guess at those point values is wrong, that should still be easily superhuman with 2017 algorithms on 2017 hardware. (Stockfish didn’t incorporate neural networks until 2020.)
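To make the “just edit the rules code” suggestion concrete, here's a toy, self-contained sketch (my own illustration, not any actual engine's source) of the only two edits needed: cap the slide length in move generation, and nudge the piece values in the evaluation table. The search machinery (minimax, α–β, quiescence) is untouched.

```python
# Toy sketch of the two edits, not any real engine's code. The 2.5/3.5
# values are the guesses from the comment above, not tuned numbers.

MAX_SLIDE = 4  # variant rule: rooks and bishops move at most four squares

# Edit 1: devalue the restricted pieces in the heuristic evaluation.
PIECE_VALUES = {"P": 1.0, "N": 3.0, "B": 2.5, "R": 3.5, "Q": 9.0}

ROOK_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
BISHOP_DIRS = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def slider_moves(piece, square, occupied, directions):
    """Generate sliding moves on an 8x8 board (squares are (file, rank)
    pairs), stopping at blockers, the board edge, and, for rooks and
    bishops, at the variant's four-square cap. Captures omitted for brevity."""
    file, rank = square
    for df, dr in directions:
        for steps in range(1, 8):
            # Edit 2: the only rules change the variant requires.
            if piece in ("R", "B") and steps > MAX_SLIDE:
                break
            to = (file + df * steps, rank + dr * steps)
            if not (0 <= to[0] < 8 and 0 <= to[1] < 8) or to in occupied:
                break
            yield to

# A rook on a1 of an empty board: 8 moves under the variant (14 in normal chess).
print(len(list(slider_moves("R", (0, 0), occupied=set(), directions=ROOK_DIRS))))
```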
my understanding of Zack’s preferred norms; unless I miss him entirely neither one of us wants LessWrong to be the kind of place where that sort of factor weighs very heavily in people’s analysis.
Um, I strong-upvoted and strong-agreement-voted Alexei’s comment.
Given our history of communication failures, I think your comments would be better if you try to avoid making claims about what I believe in the absence of direct textual evidence, in accordance with the Seventh Guideline.
But, crucially, that’s me saying I think your comments would be better comments if you did that. I’m not saying you shouldn’t try to extrapolate my views if you want to. You don’t owe me anything!
Zack’s [...] request
Clarification: I didn’t think of that as a “request.” I was saying that according to my standards, I would be embarrassed to publish criticism of someone that didn’t quote or link to their writings, and that it seemed to me to be in tension with your condemnations of strawmanning.
I don’t think of that as a request that you change it, because in general, I don’t think I have “jurisdiction” over other people’s writing. If someone says something I think is wrong, my response is to write my own comment or post explaining why I think it’s wrong (or perhaps mention it in person at Less Online), which they can respond or not-respond to as they see fit. You don’t owe me anything!
The Rick and Morty analysis in Act III, Scene I is great. I guess the “To be fair, you have to have a very high IQ [...]” meme is for real!
It’s good news for learning, not necessarily good news for jobs. If you care about creating “teaching” make-work jobs, but don’t care whether people know things, then it’s bad news.
Specifically, the idea is that AI going well for humans would require a detailed theory of how to encode human values in a form suitable for machine optimization, and the relevance of deep learning is that Yudkowsky and Soares think that deep learning is on track to provide the superhuman optimization without the theory of values. You’re correct to note that this is a stance according to which “artificial life is by default bad, dangerous, or disvaluable,” but I think the way you contrast it with the claim that “biological life is by default good or preferable” is getting the nuances slightly wrong: independently-evolved biological aliens with superior intelligence would also be dangerous for broadly similar reasons.
I preordered my copy.
Something about the tone of this announcement feels very wrong, though. You cite Rob Bensinger and other MIRI staff being impressed. But obviously, those people are highly selected for already agreeing with you! How much did you engage with skeptical and informed prereaders? (I’m imagining people in the x-risk-reduction social network who are knowledgeable about AI, acknowledge the obvious bare-bones case for extinction risk, but aren’t sold on the literal stated-with-certainty headline claim, “If anyone builds it, everyone dies.”)
If you haven’t already done so, is there still time to solicit feedback from such people and revise the text? (Sorry if the question sounds condescending, but the tone of the announcement really worries me. It would be insane not to commission red team prereaders, but if you did, then the announcement should be talking about the red team’s reaction, not Rob’s!)
Low-IQ voters can’t identify good policies or wise politicians; democracy favors political actors who can successfully propagandize and mobilize the largest number of people, which might not correspond to good governance. A political system with non-democratic elements that offers more formalized control to actors with greater competence or better incentives might be able to choose better policies.
I say “non-democratic elements” because it doesn’t have to be a strict binary between perfect democracy and perfect dictatorship. Consider, e.g., how the indirect election of U.S. Senators before the 17th Amendment was originally intended to make the Senate a more deliberative body by insulating it from the public.
(Maybe that’s all wrong, but you asked “what’s the model”, and this is an example model of why someone might be skeptical of democracy for pro-social structural reasons rather than just personally wanting their guy to be dictator.)
From “But Hurting People Is Wrong”:
Look across the superclusters, and most entities either don’t do natural-number arithmetic at all, like stars and rocks; or they do it perfectly up to the limits of bounded cognition, like galaxy-spanning superintelligences. If there’s anything odd about humans, it’s the way that humans are only halfway finished being sucked into attractors like that.
Best wishes, Less Wrong Reference Desk
Roll to disbelieve.
I share something like Achmiz’s incredulity, but for me, I wouldn’t call it an inability to model preferences so much as disapproval of how uninterested people are in arguing that their preferences are legitimate and should be respected by adults who care about advancing the art of human rationality.
Achmiz has argued quite eloquently at length for why his commenting style is conducive to intellectual progress. If someone disagrees with that case on the intellectual merits, that would be interesting. But most of the opposition I see seems to appeal not to the intellectual merits, but to feelings: that Achmiz’s comments make authors feel bad (in some way that can’t be attributed to a breach of etiquette rules that could be neutrally enforced), which makes them not want to use the website, and we want people to use the website.
I’m appalled that the mod team apparently takes this seriously. I mean, okay, I grant that you want people to use the website. If almost everyone who might use your website is actually that depraved (which sounds outlandish to me, but you’re the one who’s done dozens of user interviews and would know), I guess you need to accommodate their mental illness somehow for pragmatic reasons. But normatively (dealing with the intellectual merits and not feelings), you see how the problem is with everyone else, not Achmiz, right?