Discord: LemonUniverse (lemonuniverse). Reddit: u/Smack-works. Substack: The Lost Jockey. About my situation: here.
I wrote some worse posts before 2024 because I was very uncertain about how events might develop.
Sorry if this isn’t appropriate for this site, but is anybody interested in chess research? I’ve seen signs that people here might be: for example, here’s a chess post barely related to AI.
In chess, what positions have the longest forced wins? “Mate in N” positions can be split into 3 types:
Positions which use “tricks”, such as cycles of repeating moves, to stretch out the number of moves before checkmate. For example, this man-made mate in 415 (see the last position) uses obvious cycles. Not to mention mates in omega.
Tablebase checkmates, discovered by brute force, showing absolutely incomprehensible play with no discernible logic. See this mate in 549 moves. One has to assume it’s based on hidden cycles of some kind.
Positions which are similar to immortal games, where the winning variation requires a combination without any cycles. For example: Kasparov’s Immortal (a 14-move-long combination), Stoofvlees vs. Igel (down a rook for 21 moves), though these examples lack optimal play.
Surprisingly, nobody seems to look for the longest mates of Type 3. Well, I did look for them and discovered some. Down below I’ll explain multiple ways to define what exactly I did, without going into too much detail. If you want more detail, see Research idea: the longest non-trivial middlegames. There you can also see the puzzles I’ve created.
My longest puzzle is 42 moves: https://lichess.org/study/sTon08Mb/JG4YGbcP Overall, I’ve created 7 unique puzzles. Worked a lot on 1 more (mate in 52 moves), but couldn’t make it work.
Among other things, I made this absurd mate in 34 puzzle. Almost the entire board is filled with pieces (62 pieces on the board!), only two squares are empty. And despite that, the position has deep content. It’s kind of a miracle. I think it deserves recognition.
Unlike Type 1 and Type 2 mates, my mates involve many sacrifices of material. So my mates can be defined as “the longest sacrificial combinations”.
We can come up with metrics which make a long mate more special, harder to find, rarer: material imbalance, the number of non-check moves, the degree of freedom of the pieces, etc. Then we can search for the longest mates compatible with high enough values of those metrics.
Well, that’s what I did.
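As a rough illustration (this is not my actual tooling, and the move list and piece values below are entirely invented), the filtering metrics can be computed from an annotated solution line:

```python
# Hypothetical annotated solution line of a forced mate: each attacker move
# records whether it gives check and how much material it sacrifices
# (in pawn units). All values here are made up for illustration.
line = [
    {"check": False, "sacrificed": 0},
    {"check": True,  "sacrificed": 5},   # rook sacrifice
    {"check": False, "sacrificed": 0},
    {"check": True,  "sacrificed": 3},   # bishop sacrifice
    {"check": False, "sacrificed": 0},
    {"check": False, "sacrificed": 9},   # quiet queen sacrifice
]

def metrics(line):
    """Metrics that make a long mate 'harder to find': total sacrificed
    material and the share of quiet (non-check) attacking moves."""
    total_sacrificed = sum(m["sacrificed"] for m in line)
    quiet_share = sum(not m["check"] for m in line) / len(line)
    return {"sacrificed": total_sacrificed, "quiet_share": quiet_share}

print(metrics(line))
```

A search would then keep only the longest forced mates whose lines score above chosen thresholds on both metrics.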
This is an idea of a definition rather than a definition. But it might be important.
Take a sequential game with perfect information.
Take positions with the longest forced wins.
Out of those positions, choose positions where the defending side has the greatest control over the attacking side’s optimal strategy.
My mates are an example of positions where the defending side has especially great control over the flow of the game.
Can there be any deep meaning behind researching my type of mates? I think yes. There are two relevant things.
The first thing is hard to explain, because I’m not a mathematician, but I’ll try. Math can often be seen as skipping the stuff that is most interesting to humans. For example, math can prove theorems about games in general, without explaining why a specific game is interesting or why a specific position is interesting. However, here it seems like we can define something very closely related to subjective “interestingness”.
The second thing: the hardness of defining valuable things is relevant to Alignment. The definitions above suggest that valuable things are sometimes easier to define than it seems.
How did the chess community receive my work?
On Reddit, some posts got a moderate amount of upvotes (enough to get into daily top). A silly middlegame position. With checkmate in 50-80 moves? (110+); Does this position set any record? (60+). Sadly the pattern didn’t continue: New long non-trivial middlegame mate found. Nobody asked for this. (1).
On a computer chess forum, people mostly ignored it. I hoped they could help me find the longest attacks in computer games.
On the Discord of chess composers, a bunch of people complimented my project. But nobody showed any proactive interest (e.g. “hey, I’d like to preserve your work”). One person reacted along the lines of “I’m not a specialist in that type of thing; I don’t know whom you could talk to about it.”
On Reddit communities where you can ask mathematicians things, people said that game theory is too abstract to tackle such things.
Agree that neopronouns are dumb. Wikipedia says they’re used by 4% of LGBTQ people and are criticized both within and outside the community.
But for people struggling with normal pronouns (he/she/they), I have the following thoughts:
Contorting language to avoid words associated with beliefs… is not easier than using the words. Don’t project beliefs onto words too hard.
Contorting language to avoid words associated with beliefs… is still a violation of free speech (if we have such a strong notion of free speech). So what is the motivation to propose that? It’s a bit like a dog in the manger. “I’d rather cripple myself than help you, let’s suffer together”.
Don’t maximize free speech (in a negligible way) while ignoring every other human value.
In an imperfect society, truly passive tolerance (tolerance which doesn’t require any words/actions) is impossible. For example, in a perfect society, if my school has bigoted teachers, it immediately gets outcompeted by a non-bigoted school. In an imperfect society this might not happen. So we end up with enforceable norms.
Employees get paid, which kinda automatically reduces their free speech, because saying the wrong words can make them stop getting paid. (...) Employment is really a different situation. You get laws, and recommendations of your legal department; there is not much anyone can do about that.
I’m not familiar with your model of free speech (i.e. how you imagine free speech working if laws and power balances were optimal). People who value free speech usually believe that free speech should have power above money and property, to a reasonable degree. What’s “reasonable” is the crux.
I think in situations where people work together on something unrelated to their beliefs, prohibiting the enforcement of a code of conduct is unreasonable, because respect is crucial for the work environment and for protecting marginalized groups. I assume people who propose to “call everyone they” or “call everyone by proper name” realize some of that.
If I let people use my house as a school, but find out that a teacher openly doesn’t respect minority students (by refusing to do the smallest thing for them), I’m justified in not letting the teacher into my house.
I do not talk about people’s past for no good reason, and definitely not just to annoy someone else. But if I have a good reason to point out that someone did something in the past, and the only way to do that is to reveal their previous name, then I don’t care about the taboo.
I just think “disliking deadnaming under most circumstances = magical thinking, like calling Italy Rome” was a very strong, barely argued/explained opinion. In tandem with mentioning delusion (Napoleon) and hysteria. If you want to write something insulting, maybe bother to clarify your opinions a little bit more? Like you did in our conversation.
I think there should be more spaces where controversial ideas can be debated. I’m not against spaces without pronoun rules, just don’t think every place should be like this. Also, if we create a space for political debate, we need to really make sure that the norms don’t punish everyone who opposes centrism & the right. (Over-sensitive norms like “if you said that some opinion is transphobic you’re uncivil/shaming/manipulative and should get banned” might do this.) Otherwise it’s not free speech either. Will just produce another Grey or Red Tribe instead of Red/Blue/Grey debate platform.
I do think progressives underestimate free speech damage. To me it’s the biggest issue with the Left. Though I don’t think they’re entirely wrong about free speech.
For example, imagine I have trans employees. Another employee (X) refuses to use pronouns, in principle (using pronouns is not the same as accepting progressive gender theories). Why? Maybe X thinks my trans employees live such a great lie that using pronouns is already an unacceptable concession. Or maybe X thinks that even trying to switch “he” & “she” is too much work, and I’m not justified in asking to do that work because of absolute free speech. Those opinions seem unnecessarily strong and they’re at odds with the well-being of my employees, my work environment. So what now? Also, if pronouns are an unacceptable concession, why isn’t calling a trans woman by her female name an unacceptable concession?
Imagine I don’t believe something about a minority, so I start avoiding words which might suggest otherwise. If I don’t believe that gay love can be as true as straight love, I avoid the word “love” (in reference to gay people or to anybody) at work. If I don’t believe that women are as smart as men, I avoid the word “master” / “genius” (in reference to women or anybody) at work. It can get pretty silly. Will predictably cost me certain jobs.
I’ll describe my general thoughts, like you did.
I think about transness in a similar way to how I think about homo/bisexuality.
If homo/bisexuality is outlawed, people are gonna suffer. Bad.
If I could erase homo/bisexuality from existence without creating suffering, I wouldn’t anyway. Would be a big violation of people’s freedom to choose their identity and actions (even if in practice most people don’t actually “choose” to be homo/bisexual).
Different people have homo/bisexuality of different “strength” and form. One man might fall in love with another man, but dislike sex or even kissing. Maybe he isn’t a real homosexual, if he doesn’t need to prove it physically? Another man might identify as a bisexual, but be in a relationship with a woman… he doesn’t get to prove his bisexuality (sexually or romantically). Maybe we shouldn’t trust him unless he walks the talk? As a result of all such situations, we might have certain “inconsistencies”: some people identifying as straight have done more “gay” things than people identifying as gay. My opinion on this? I think all of this is OK. Pushing for an “objective gay test” would be dystopian and suffering-inducing. I don’t think it’s an empirical matter (unless we choose it to be, which is a value-laden choice). Even if it was, we might be very far away from resolving it. So just respecting people’s self-identification in the meantime is best, I believe. Moreover, a lot of this is very private information anyway. Less reason to try measuring it “objectively”.
My thoughts about transness specifically:
We strive for gender equality (I hope). Which makes the concept of gender less important for society as a whole.
The concept of gender is additionally damaged by all the things a person can decide to do in their social/sexual life. For example, take an “assigned male at birth” (AMAB) person. AMAB can appear and behave very feminine without taking hormones. Or vice-versa (take hormones, get a pair of boobs, but present masculine). Additionally there are different degrees of medical transition and different types of sexual preferences.
A lot of things which make someone more or less similar to a man/woman (behavior with friends, behavior with romantic partners, behavior with sexual partners, thoughts) are private. Less reason to try measuring those “objectively”.
I have a choice to respect people’s self-identified genders or not. I decide to respect them. Not just because I care about people’s feelings, but also because of points 1 & 2 & 3 and because of my general values (I show similar respect to homo/bisexuals). So I respect pronouns, but on top of that I also respect if someone identifies as a man/woman/nonbinary. I believe respect is optimal in terms of reducing suffering and adhering to human values.
When I compare your opinion to mine, most of my confusion is about two things: what exactly you see as an empirical question, and how the answer (or its absence) affects our actions.
Zack insists that Blanchard is right, and that I fail at rationality if I disagree with him. People on Twitter and Reddit insist that Blanchard is wrong, and that I fail at being a decent human if I disagree with them. My opinion is that I have no comparative advantage at figuring out who is right and who is wrong on this topic, or maybe everyone is wrong, anyway it is an empirical question and I don’t have the data. I hope that people who have more data and better education will one day sort it out, but until that happens, my position firmly remains “I don’t know (and most likely neither do you), stop bothering me”.
I think we need to be careful to not make a false equivalence here:
Trans people want us to respect their pronouns and genders.
I’m not very familiar with Blanchard; so far it seems to me like Blanchard’s work is (a) just a typology for predicting certain correlations, which (b) is sometimes used to argue that trans people are mistaken about their identities/motivations.
2A is kinda tangential to 1. So is this really a case of competing theories? I think uncertainty should make one skeptical of the implications of Blanchard’s work rather than skeptical about respecting trans people.
(Note that this is about the representatives, not the people being represented. Two trans people can have different opinions, but you are required to believe the woke one and oppose the non-woke one.) Otherwise, you are transphobic. I completely reject that.
Two homo/bisexual people can have different opinions on what “true homo/bisexuality” is, too. Some opinions can be pretty negative. Yes, that’s inconvenient, but that’s just an expected course of events.
Shortly: disagreement is not hate. But it often gets conflated, especially in environments that overwhelmingly contain people of one political tribe.
I feel it’s just the nature of some political questions. Not in all questions, and not in all spaces, can you treat disagreement as something benign.
But if there is a person who actually feels dysphoria from not being addressed as “ve” (someone who would be triggered by calling them any of: “he”, “she”, or “they”), then I believe that this is between them and their psychiatrist, and I want to be left out of this game.
Agree. Also agree that lynching for accidental misgendering is bad.
(That’s when you get the “attack helicopters” as an attempt to point out the absurdity of the system.)
I’m pretty sure the helicopter argument began as an argument against trans people, not as an argument against weird-ass novel pronouns.
Draft of a future post, any feedback is welcome. Continuation of a thought from this shortform post.
(picture: https://en.wikipedia.org/wiki/Drawing_Hands)
There’s an alignment-related problem: how do we make an AI care about causes of a particular sensory pattern? What are “causes” of a particular sensory pattern in the first place? You want the AI to differentiate between “putting a real strawberry on a plate” and “creating a perfect illusion of a strawberry on a plate”, but what’s the difference between doing real things and creating perfect illusions, in general?
(Relevant topics: environmental goals; identifying causal goal concepts from sensory data; “look where I’m pointing, not at my finger”; Pointers Problem; Eliciting Latent Knowledge; symbol grounding problem; ontology identification problem.)
I have a general answer to those questions. My answer is very unfinished. Also, it isn’t mathematical; it’s philosophical in nature. But I believe it’s important anyway, because there aren’t many ideas, philosophical or otherwise, about the questions above. With questions like these you don’t know where to even start thinking, so it’s hard to imagine even a bad answer.
Observation 1. Imagine you come up with a model which perfectly predicts your sensory experience (Predictor). Just having this model is not enough to understand causes of a particular sensory pattern, i.e. differentiate between stuff like “putting a real strawberry on a plate” and “creating a perfect illusion of a strawberry on a plate”.
Observation 2. Not every Predictor has variables which correspond to causes of a particular sensory pattern. Not every Predictor can be used to easily derive something corresponding to causes of a particular sensory pattern. For example, some Predictors might make predictions by simulating a large universe with a superintelligent civilization inside which predicts your sensory experiences. See “Transparent priors”.
So, what are causes of a particular sensory pattern?
“Recursive Sensory Models” (RSMs).
I’ll explain what an RSM is and provide various examples.
An RSM is a sequence of N models (Model 1, Model 2, …, Model N) for which the following two conditions hold true:
Model (K + 1) is good at predicting more aspects of sensory experience than Model (K). Model (K + 2) is good at predicting more aspects than Model (K + 1). And so on.
Model 1 can be transformed into any of the other models according to special transformation rules. Those rules are supposed to be simple. But I can’t give a fully general description of those rules. That’s one of the biggest unfinished parts of my idea.
The second bullet point is kinda the most important one, but it’s very underspecified. So you can only get a feel for it through looking at specific examples.
Core claim: when the two conditions hold true, the RSM contains easily identifiable “causes” of particular sensory patterns. The two conditions are necessary and sufficient for the existence of such “causes”. The universe contains “causes” of particular sensory patterns to the extent to which statistical laws describing the patterns also describe deeper laws of the universe.
Imagine you’re looking at a landscape with trees, lakes and mountains. You notice that none of those objects disappear.
It seems like a good model: “most objects in the 2D space of my vision don’t disappear”. (Model 1)
But it’s not perfect. When you close your eyes, the landscape does disappear. When you look at your feet, the landscape does disappear.
So you come up with a new model: “there is some 3D space with objects; the space and the objects are independent from my sensory experience; most of the objects don’t disappear”. (Model 2)
Model 2 is better at predicting the whole of your sensory experience.
However, note that the “mathematical ontology” of both models is almost identical. (Both models describe spaces whose points can be occupied by something.) They’re just applied to slightly different things. That’s why “recursion” is in the name of Recursive Sensory Models: an RSM reveals similarities between different layers of reality. As if reality is a fractal.
Intuitively, Model 2 describes “causes” (real trees, lakes and mountains) of sensory patterns (visions of trees, lakes and mountains).
You notice that most visible objects move smoothly (don’t disappear, don’t teleport).
“Most visible objects move smoothly in a 2D/3D space” is a good model for predicting sensory experience. (Model 1)
But there’s a model which is even better: “visible objects consist of smaller and invisible/less visible objects (cells, molecules, atoms) which move smoothly in a 2D/3D space”. (Model 2)
However, note that the mathematical ontology of both models is almost identical.
Intuitively, Model 2 describes “causes” (atoms) of sensory patterns (visible objects).
Imagine you’re alone in a field with rocks of different size and a scale model of the whole environment. You’ve already learned object permanence.
“Objects don’t move in space unless I push them” is a good model for predicting sensory experience. (Model 1)
But it has a little flaw. When you push a rock, the corresponding rock in the scale model moves too. And vice-versa.
“Objects don’t move in space unless I push them; there’s a simple correspondence between objects in the field and objects in the scale model” is a better model for predicting sensory experience. (Model 2)
However, note that the mathematical ontology of both models is identical.
Intuitively, Model 2 describes a “cause” (the scale model) of sensory patterns (rocks of different size being at certain positions). Though you can reverse the cause and effect here.
If you put your hand on a hot stove, you quickly move the hand away. Because it’s painful and you don’t like pain. This is a great model (Model 1) for predicting your own movements near a hot stove.
But why do other people avoid hot stoves? If another person touches a hot stove, pain isn’t instantiated in your sensory experience.
Behavior of other people can be predicted with this model: “people have similar sensory experience and preferences, inaccessible to each other”. (Model 2)
However, note that the mathematical ontology of both models is identical.
Intuitively, Model 2 describes a “cause” (inaccessible sensory experience) of sensory patterns (other people avoiding hot stoves).
Imagine yourself in a universe where your sensory experience is produced by very simple, but very chaotic laws. Despite the chaos, your sensory experience contains some simple, relatively stable patterns. Purely by accident.
In such a universe, RSMs might not find any “causes” underlying particular sensory patterns (except the simple chaotic laws).
But in that case there are probably no “causes”.
Napoleon is merely an argument for “just because you strongly believe it, even if it is a statement about you, does not necessarily make it true”.
When people make arguments, they often don’t list all of the premises. That’s not unique to trans discourse. Informal reasoning is hard to make fully explicit. “Your argument doesn’t explicitly exclude every counterexample” is a pretty cheap counter-argument. What people experience is important evidence and an important factor, it’s rational to bring up instead of stopping yourself with “wait, I’m not allowed to bring that up unless I make an analytically bulletproof argument”. For example, if you trust someone that they feel strongly about being a woman, there’s no reason to suspect them of being a cosplayer who chases Twitter popularity.
I expect that you will disagree with a lot of this, and that’s okay; I am not trying to convince you, just explaining my position.
I think I still don’t understand the main conflict which bothers you. I thought it was “I’m not sure if trans people are deluded in some way (like Napoleons, but milder) or not”. But now it seems like “I think some people really suffer and others just cosplay, the cosplayers take something away from true sufferers”. What is taken away?
Even if we assume that there should be a crisp physical cause of “transness” (which is already a value-laden choice), we need to make a couple of value-laden choices before concluding if “being trans” is similar to “believing you’re Napoleon” or not. Without more context it’s not clear why you bring up Napoleon. I assume the idea is “if gender = hormones (gender essentialism), and trans people have the right hormones, then they’re not deluded”. But you can arrive at the same conclusion (“trans people are not deluded”) by means other than gender essentialism.
I assume that for trans people being trans is something more than mere “choice”
There doesn’t need to be a crisp physical cause of “transness” for “transness” to be more than mere choice. There’s a big spectrum between “immutable physical features” and “things which can be decided on a whim”.
If you introduce yourself as “Jane” today, I will refer to you as “Jane”. But if 50 years ago you introduced yourself as “John”, that is a fact about the past. I am not saying that “you were John” as some kind of metaphysical statement, but that “everyone, including you, referred to you as John” 50 years ago, which is a statement of fact.
This just explains your word usage, but doesn’t make a case that disliking deadnaming is magical thinking.
I’ve decided to comment because bringing up Napoleon, hysteria and magical thinking all at once is egregiously bad faith. I think it’s not a good epistemic norm to imply something like “the arguments of the outgroup are completely inconsistent trash” without elaborating.
There are people who feel strongly that they are Napoleon. If you want to convince me, you need to make a stronger case than that.
It’s confusing to me that you go to “I identify as an attack helicopter” argument after treating biological sex as private information & respecting pronouns out of politeness. I thought you already realize that “choosing your gender identity” and “being deluded you’re another person” are different categories.
If someone presented as male for 50 years, then changed to female, it makes sense to use “he” to refer to their first 50 years, especially if this is the pronoun everyone used at that time. Also, I will refer to them using the name they actually used at that time. (If I talk about the Ancient Rome, I don’t call it Italian Republic either.) Anything else feels like magical thinking to me.
The alternative (using new pronouns / name) makes perfect sense too, due to trivial reasons, such as respecting a person’s wishes. You went too far calling it magical thinking. A piece of land is different from a person in two important ways: (1) it doesn’t feel anything no matter how you call it, (2) there’s less strong reasons to treat it as a single entity across time.
Meta-level comment: I don’t think it’s good to dismiss original arguments immediately and completely.
Object-level comment:
Neither of those claims has anything to do with humans being the “winners” of evolution.
I think it might be more complicated than that:
We need to define what “a model produced by a reward function” means, otherwise the claims are meaningless. If you made just a single update to the model (based on the reward function), calling it “a model produced by the reward function” is meaningless (because no real optimization pressure was applied). So we do need to define some goal of optimization (which determines who’s a winner and who’s a loser).
We need to argue that the goal is sensible. I.e. somewhat similar to a goal we might use while training our AIs.
Here’s some things we can try:
We can try defining all currently living species as winners. But is it sensible? Is it similar to a goal we would use while training our AIs? “Let’s optimize our models for N timesteps and then use all surviving models regardless of any other metrics” ← I think that’s not sensible, especially if you use an algorithm which can introduce random mutations into the model.
We can try defining species which avoided substantial changes for the longest time as winners. This seems somewhat sensible, because those species experienced the longest optimization pressure. But then humans are not the winners.
We can define any species which gained general intelligence as winners. Then humans are the only winners. This is sensible because of two reasons. First, with general intelligence deceptive alignment is possible: if humans knew that Simulation Gods optimize organisms for some goal, humans could focus on that goal or kill all competing organisms. Second, many humans (in our reality) value creating AGI more than solving any particular problem.
I think the latter is the strongest counter-argument to “humans are not the winners”.
My point is that chairs and humans can be considered in a similar way.
Please explain how your point connects to my original message: are you arguing with it or supporting it or want to learn how my idea applies to something?
I see. But I’m not talking about figuring out human preferences, I’m talking about finding world-models in which real objects (such as “strawberries” or “chairs”) can be identified. Sorry if it wasn’t clear in my original message because I mentioned “caring”.
Models or real objects or things capture something that is not literally present in the world. The world contains shadows of these things, and the most straightforward way of finding models is by looking at the shadows and learning from them.
You might need to specify what you mean a little bit.
The most straightforward way of finding a world-model is just predicting your sensory input. But then you’re not guaranteed to get a model in which something corresponding to “real objects” can be easily identified. That’s one of the main reasons why ELK is hard, I believe: in an arbitrary world-model, “Human Simulator” can be much simpler than “Direct Translator”.
So how do humans get world-models in which something corresponding to “real objects” can be easily identified? My theory is in the original message. Note that the idea is not just “predict sensory input”, it has an additional twist.
Creating an inhumanly good model of a human is related to formulating their preferences.
How does this relate to my idea? I’m not talking about figuring out human preferences.
Thus it’s a step towards eliminating path-dependence of particular life stories
What is “path-dependence of particular life stories”?
I think things (minds, physical objects, social phenomena) should be characterized by computations that they could simulate/incarnate.
Are there other ways to characterize objects? Feels like a very general (or even fully general) framework. I believe my idea can be framed like this, too.
There’s an alignment-related problem, the problem of defining real objects. Relevant topics: environmental goals; task identification problem; “look where I’m pointing, not at my finger”; The Pointers Problem; Eliciting Latent Knowledge.
I think I realized how people go from caring about sensory data to caring about real objects. But I need help with figuring out how to capitalize on the idea.
So… how do humans do it?
Humans create very small models for predicting very small/basic aspects of sensory input (mini-models).
Humans use mini-models as puzzle pieces for building models for predicting ALL of sensory input.
As a result, humans get models in which it’s easy to identify “real objects” corresponding to sensory input.
For example, imagine you’re just looking at ducks swimming in a lake. You notice that ducks don’t suddenly disappear from your vision (permanence), their movement is continuous (continuity) and they seem to move in a 3D space (3D space). All those patterns (“permanence”, “continuity” and “3D space”) are useful for predicting aspects of immediate sensory input. But all those patterns are also useful for developing deeper theories of reality, such as the atomic theory of matter, because you can imagine that atoms are small things which continuously move in 3D space, similar to ducks. (This image stops working as well when you get to Quantum Mechanics, but then aspects of QM feel less “real” and less relevant for defining objects.) As a result, it’s easy to see how the deeper model relates to surface-level patterns.
In other words: reality contains “real objects” to the extent to which deep models of reality are similar to (models of) basic patterns in our sensory input.
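A minimal sketch of the “mini-model as puzzle piece” idea (everything here is invented for illustration): one mini-model, “smooth motion in space”, is written once and then applied unchanged at two layers, first to a visible object and then to a postulated hidden one:

```python
# Toy sketch: one "mini-model" (smooth motion) reused at two layers of reality.
# All trajectories are invented for illustration.

def smooth(traj, max_step=1):
    """Mini-model: positions change by at most max_step per tick."""
    return all(abs(b - a) <= max_step for a, b in zip(traj, traj[1:]))

# Surface layer: a duck's visible position on the lake.
duck = [0, 1, 1, 2, 3, 3, 4]
assert smooth(duck)

# Deeper layer: a postulated invisible particle. The same predicate
# (the same "mathematical ontology") applies without modification.
particle = [10, 10, 9, 9, 8, 7, 7]
assert smooth(particle)

# A hypothesis that violates the shared ontology (a teleporting object)
# is rejected by the same mini-model:
ghost = [0, 5, 0, 5]
assert not smooth(ghost)
```

The point is that the deep model is easy to relate to the surface patterns precisely because it is built out of the same reusable pieces.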
I don’t understand the Model-Utility Learning (MUL) section; what pathological behavior does the AI exhibit?
Since humans (or something) must be labeling the original training examples, the hypothesis that building bridges means “what humans label as building bridges” will always be at least as accurate as the intended classifier. I don’t mean “whatever humans would label”. I mean the hypothesis that “build a bridge” means specifically the physical situations which were recorded as training examples for this system in particular, and labeled by humans as such.
So it’s like overfitting? If I train MUL AI to play piano in a green room, MUL AI learns that “playing piano” means “playing piano in a green room” or “playing piano in a room which would have been chosen for training me in the past”?
Now, we might reasonably expect that if the AI considers a novel way of “fooling itself” which hasn’t been given in a training example, it will reject such things for the right reasons: the plan does not involve physically building a bridge.
But “sensory data being a certain way” is a physical event which happens in reality, so MUL AI might still learn to be a solipsist? So MUL doesn’t guarantee solving misgeneralization in any way?
If the answer to my questions is “yes”, what did we even hope for with MUL?
I’m noticing two things:
It’s suspicious to me that the values of humans-who-like-paperclips are inherently tied to acquiring an unlimited amount of resources (no matter how). So maybe I don’t treat such values as 100% innocent, and I’m OK with keeping them in check. Though we can come up with thought experiments where the urge to get more resources is justified by something. Like, maybe instead of producing paperclips those people want to calculate Busy Beaver numbers, so they want more and more computronium for that.
How consensual were the trades if their outcome is predictable and other groups of people don’t agree with the outcome? Looks like coercion.
Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me.
The most important thing, I think, is not even hitting the nail on the head, but knowing (i.e. really acknowledging) that a nail can be hit in multiple places. If you know that, the rest is just a matter of testing.
But avoidance of value drift or of unendorsed long term instability of one’s personality is less obvious.
What if endorsed long term instability leads to negation of personal identity too? (That’s something I thought about.)
I think corrigibility is the ability to change a value/goal system. That’s the literal meaning of the term: “correctable”. If an AI were fully aligned, there would be no need to correct it.
Perhaps I should make a better argument:
It’s possible that AGI is correctable, but (a) we don’t know what needs to be corrected or (b) we cause new, less noticeable problems, while correcting AGI.
So, I think there’s not two assumptions “alignment/interpretability is not solved + AGI is incorrigible”, but only one — “alignment/interpretability is not solved”. (A strong version of corrigibility counts as alignment/interpretability being solved.)
Yes, and that’s the specific argument I am addressing, not AI risk in general. Except that if it’s many many times smarter, it’s ASI, not AGI.
I disagree that “doom” and “AGI going ASI very fast” are certain (> 90%) too.
Epistemic status: Draft of a post. I want to propose a method of learning environmental goals (a super big, super important subproblem in Alignment). It’s informal, so it has a lot of gaps. I worry I missed something obvious, rendering my argument completely meaningless. I asked the LessWrong feedback team, but they couldn’t find someone knowledgeable enough to take a look.
Can you tell me the biggest conceptual problems of my method? Can you tell me if agent foundations researchers are aware of this method or not?
If you’re not familiar with the problem, here’s the context: Environmental goals; identifying causal goal concepts from sensory data; ontology identification problem; Pointers Problem; Eliciting Latent Knowledge.
Explanation 1
One naive solution
Imagine we have a room full of animals. AI sees the room through a camera. How can AI learn to care about the real animals in the room rather than their images on the camera?
Assumption 1. Let’s assume AI models the world as a bunch of objects interacting in space and time. I don’t know how critical or problematic this assumption is.
Idea 1. Animals in the video are objects with certain properties (they move continuously, they move with certain relative speeds, they have certain sizes, etc). Let’s make the AI search for the best world-model which contains objects with similar properties (P properties).
Problem 1. Ideally, AI will find clouds of atoms which move similarly to the animals on the video. However, AI might just find a world-model (X) which contains the screen of the camera. So it’ll end up caring about “movement” of the pixels on the screen. Fail.
Observation 1. Our world contains many objects with P properties which don’t show up on the camera. So, X is not the best world-model containing the biggest number of objects with P properties.
Idea 2. Let’s make the AI search for the best world-model containing the biggest number of objects with P properties.
Question 1. For “Idea 2” to make practical sense, we need to find a smart way to limit the complexity of the models. Otherwise AI might just make any model contain arbitrary numbers of arbitrary objects. Can we find the right complexity prior?
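As a toy sketch of the kind of complexity-penalized search that “Idea 2” and Question 1 gesture at (the scoring function, the λ penalty, and the model representations are all my illustrative assumptions, not a concrete proposal):

```python
# Toy sketch: score candidate world-models by how many objects with
# P properties they contain, minus a complexity penalty. Without the
# penalty, a model could inflate its score by postulating arbitrarily
# many P-objects (the concern raised in Question 1).

def has_p_properties(obj):
    # Hypothetical check: does the object move continuously, persist
    # over time, and live in 3D space?
    return obj.get("continuous") and obj.get("persistent") and obj.get("dim") == 3

def score(model, lam=1.0):
    p_count = sum(1 for obj in model["objects"] if has_p_properties(obj))
    return p_count - lam * model["complexity"]

# A camera-screen model: pixels flicker and live in 2D, so it has few
# P-objects, even though it's simple.
screen_model = {"objects": [{"continuous": False, "persistent": True, "dim": 2}],
                "complexity": 1.0}
# An atomic model: many P-objects, at the cost of higher complexity.
atom_model = {"objects": [{"continuous": True, "persistent": True, "dim": 3}] * 10,
              "complexity": 4.0}

best = max([screen_model, atom_model], key=score)
```

Under this (made-up) scoring, the atomic model wins despite being more complex, because it contains far more objects with P properties; that is the intuition behind preferring clouds of atoms over the camera screen.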
Question 2. Assume we resolved the previous question positively. What if “Idea 2” still produces an alien ontology humans don’t care about? Can it happen?
Question 3. Assume everything works out. How do we know that this is a general method of solving the problem? We have an object in sense data (A), we care about the physical thing corresponding to it (B): how do we know B always behaves similarly to A and there are always more instances of B than of A?
One philosophical argument
I think there’s a philosophical argument which allows us to resolve Questions 2 & 3 (and gives evidence that Question 1 should be resolvable too).
By default, we only care about objects with which we can “meaningfully” interact in our daily life. This guarantees that B always has to behave similarly to A, in some technical sense (otherwise we wouldn’t be able to meaningfully interact with B). Also, sense data is a part of reality, so B includes A, therefore there are always more instances of B than of A, in some technical sense. This resolves Question 3.
By default, we only care about objects with which we can “meaningfully” interact in our daily life. This guarantees that models of the world based on such objects are interpretable. This resolves Question 2.
Can we define what “meaningfully” means? I think that should be relatively easy, at least in theory. There doesn’t have to be One True Definition Which Covers All Cases.
If the argument is true, the pointers problem should be solvable without Natural Abstraction hypothesis being true.
Anyway, I’ll add a toy example which hopefully makes it clearer what this is all about.
One toy example
You’re inside a 3D video game. 1st person view. The game contains landscapes and objects, both made of small balls (the size of tennis balls) of different colors. Also a character you control.
The character can push objects. Objects can break into pieces. Physics is Newtonian. Balls are held together by some force. Balls can have dramatically different weights.
Light is modeled by particles. The sun emits particles, and they bounce off surfaces.
The most unusual thing: as you move, your coordinates are fed into a pseudorandom number generator. The numbers from the generator are then used to swap places of arbitrary balls.
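The coordinate-seeded swapping rule can be sketched like this (a toy illustration; the grid representation and function names are made up):

```python
import random

def swap_step(cells, player_xyz):
    # Seed a PRNG with the player's coordinates, then use it to pick
    # two arbitrary grid cells and swap their contents. Each cell
    # holds the color of the ball currently occupying it.
    rng = random.Random(hash(player_xyz))
    i, j = rng.randrange(len(cells)), rng.randrange(len(cells))
    cells[i], cells[j] = cells[j], cells[i]
    return cells

# Four cells of a (hypothetical) 1D world, each holding one ball.
cells = ["red", "blue", "green", "red"]
swap_step(cells, player_xyz=(5, 2, 7))
```

The point of the mechanic: under normal play, the identity of any individual ball (level 2B below) is unstable, while the coarser ball structures are much more predictable, which is why the agent should locate its goals at the structural level.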
You care about pushing boxes (as everything, they’re made of balls too) into a certain location.
...
So, the reality of the game has roughly 5 levels:
1. The level of sense data (the 2D screen of the 1st-person view).
2A. The level of ball structures. 2B. The level of individual balls.
3A. The level of waves of light particles. 3B. The level of individual light particles.
I think AI should be able to figure out that it needs to care about the 2A level of reality. Because ball structures are much easier to control (by doing normal activities with the game’s character) than individual balls. And light particles are harder to interact with than ball structures, due to their speed and nature.
Explanation 2
An alternative explanation of my argument:
Imagine activities which are crucial for a normal human life. For example: moving yourself in space (in a certain speed range); moving other things in space (in a certain speed range); staying in a single spot (for a certain time range); moving in a single direction (for a certain time range); having varied visual experiences (changing in a certain frequency range); etc. Those activities can be abstracted into mathematical properties of certain variables (speed of movement, continuity of movement, etc). Let’s call them “fundamental variables”. Fundamental variables are defined using sensory data or abstractions over sensory data.
Some variables can be optimized (for a long enough period of time) by fundamental variables. Other variables can’t be optimized (for a long enough period of time) by fundamental variables. For example: proximity of my body to my bed is an optimizable variable (I can walk towards the bed — walking is a normal activity); the amount of things I see is an optimizable variable (I can close my eyes or hide some things — both actions are normal activities); closeness of two particular oxygen molecules might be a non-optimizable variable (it might be impossible to control their positions without doing something weird).
By default, people only care about optimizable variables. Unless there are special philosophical reasons to care about some obscure non-optimizable variable which doesn’t have any significant effect on optimizable variables.
You can have a model which describes typical changes of an optimizable variable. Models of different optimizable variables have different predictive power. For example, “positions & shapes of chairs” and “positions & shapes of clouds of atoms” are both optimizable variables, but models of the latter have much greater predictive power. Complexity of the models needs to be limited, by the way, otherwise all models will have the same predictive power.
Collateral conclusions: typical changes of any optimizable variable are easily understandable by a human (since it can be optimized by fundamental variables, based on typical human activities); all optimizable variables are “similar” to each other, in some sense (since they all can be optimized by the same fundamental variables); there’s a natural hierarchy of optimizable variables (based on predictive power). Main conclusion: while the true model of the world might be infinitely complex, physical things which ground humans’ high-level concepts (such as “chairs”, “cars”, “trees”, etc.) always have to have a simple model (which works most of the time, where “most” has a technical meaning determined by fundamental variables).
Formalization
So, the core of my idea is this:
AI is given “P properties” which a variable of its world-model might have. (Let’s call a variable with P properties P-variable.)
AI searches for a world-model with the largest number of P-variables. AI makes sure it doesn’t introduce useless P-variables. We also need to be careful about how we measure the “number” of P-variables: we need to measure something like “density” rather than a raw count (i.e. the number of P-variables contributing to a particular relevant situation, rather than the number of P-variables overall?).
AI gets an interpretable world-model (because P-variables are highly interpretable), adequate for defining what we care about (because by default, humans only care about P-variables).
How far are we from being able to do something like this? Are agent foundations researchers pursuing this or something else?