This reminds me a little bit of the posts on anti-memes. There’s a way in which people are constantly updating their worldviews based on personal experience that
is useless in discussion because people tend not to update on other people’s personal experience over their own
is personally risky in adversarial contexts because personal information facilitates manipulation
is socially costly because the personal experience that people tend to update on is usually the kind of emotionally intense stuff that is viewed as inappropriate in ordinary conversation
And this means that there are a lot of ideas and worldviews produced by The Statistics which are never discussed or directly addressed in polite society. Instead, these emerge indirectly through particular beliefs which rely on arguments that obfuscate the reality.
Not only is this hard to avoid on a civilizational level; it’s hard to avoid on a personal level: rational agents will reach inaccurate conclusions in adversarial (ie unlucky) environments.
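To make that last point concrete, here’s a toy sketch (the coin-flip setup and the specific “unlucky” sample are mine, purely for illustration): a textbook Bayesian updater, shown an unrepresentative run of evidence, ends up confidently wrong about a fair coin.

```python
import numpy as np

# A fair coin (true P(heads) = 0.5), but the agent happens to observe an
# "unlucky" sample in which heads comes up far more often than tails.
unlucky_flips = np.array([1] * 16 + [0] * 4)  # 16 heads, 4 tails
n_heads = unlucky_flips.sum()
n_tails = len(unlucky_flips) - n_heads

# Textbook Bayesian update over a grid of candidate biases, uniform prior.
biases = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(biases) / len(biases)
posterior = prior * biases**n_heads * (1 - biases)**n_tails
posterior /= posterior.sum()

# The agent did everything right and still lands far from the truth,
# with most of its probability mass on "this coin is biased toward heads".
print("posterior mean of P(heads):", round(float((biases * posterior).sum()), 3))
print("posterior P(bias > 0.6):", round(float(posterior[biases > 0.6].sum()), 3))
```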
When I first looked at these plots, I thought “ahhh, the top one has two valleys and the bottom one has two peaks. So, once you account for one reflecting error and the other reflecting accuracy, they capture the same behavior.” But this isn’t really what’s happening.
Comparing these plots is a little tricky. For instance, the double-descent graph shows two curves: “train error” (which can be interpreted as lack of confidence in model performance) and “test error” (which can be interpreted as lack of actual performance, ie lack of wisdom). Analogizing the double-descent curve to Dunning-Kruger might be easier if one just plots “test error” on the y-axis and “train error” on the x-axis. Or better yet, 1 - error on both axes.
But actually trying to dig into the plots in this way is confusing. In the underfitted regime, there’s a pretty high level of knowledge (ie test error near the minimum value) with pretty low confidence (ie train error far from zero). In the overfitted regime, we then get the second descent into a higher level of knowledge (ie test error at the minimum) but now with extremely high confidence. Maybe we can tentatively interpret these minima as the “valley of despair” and “slope of enlightenment” but
In both cases, our train error is lower than our test error—implying a disproportionate amount of confidence all the time. This is not consistent with the Dunning-Kruger effect
The “slope of enlightenment” especially has way more unjustified confidence (ie train error near zero) despite still having some objectively pretty high test error (around 0.3). This is not consistent with the Dunning-Kruger effect
We see the same test error associated with both a high train error (in the underfit regime) and with a low train error (in the overfit regime). The Dunning-Kruger effect doesn’t capture the potential for different levels of confidence at the same level of wisdom
To me, the above deviations from Dunning-Kruger make sense. My mechanistic understanding of the effect is that it appears in fields of knowledge that are vast, but whose vastness can only be explored by those with enough introductory knowledge. So what happens is
You start out learning something new and you’re not confident
You master the introductory material and feel confident that you get things
You now realize that your introductory understanding gives you a glimpse into the vast frontier of the subject
Exposure to this vast frontier reduces your confidence
But as you explore it, both your understanding and confidence rise again
And this process can’t really be captured in a set-up with a fixed train and test set. Maybe it could show up in reinforcement learning, though, since exploration is possible there.