In our chat about this, Eliezer said that, aside from the difficulty of making a human-level mathematical philosophy engine aligned with our goals, another significant disagreement with Paul that is relevant here is that Eliezer thinks it’s likely we’ll use low-level self-improvement on the road to human-level AI; he used the analogies of programmers using compilers instead of writing machine code directly, and of EURISKO helping Lenat. (Again, hoping I’m not misrepresenting.)
This seems like a plausible scenario to me, but I’m not convinced it argues for the kind of work we’re doing currently; it seems fairly likely to me that “reliably helping the programmers to do low-level tasks on the way to human-level AI” can be handled by having a small protected metalayer, which only the humans can modify. This sort of architecture seems very problematic for an AI acting in the world, since it makes Cartesian assumptions and we don’t want something that’s bound to a single architecture like that; that’s why we haven’t been looking in this direction. But if this is a significant reason for studying self-reference, we should be exploring the obvious tricks involving a protected metalayer.
For example, the Milawa theorem prover has a small initial verifier, which can be replaced by a different verifier that accepts proofs in a more powerful language, given a proof that the new verifier only outputs theorems that the old verifier would also have output on some input. How does this avoid running into the Löbstacle? (That is, how can the new verifier in turn allow the user to replace it, given a proof of the analogous theorem in its language?) The answer is that there is a protected metalevel: the proof that the new verifier only outputs theorems the old verifier would have output does not need to prove anything about the code that is willing to switch out the new verifier for an even newer verifier.
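To make the shape of this concrete, here is a minimal structural sketch in Python. It is purely illustrative and not Milawa’s actual code (Milawa is written in a Lisp dialect); the names `ProtectedMetalayer`, `VerifierSwapRequest`, and `reflection_statement` are my own placeholders, and `reflection_statement` just stands in for however the soundness claim about a new verifier gets formalized.

```python
from dataclasses import dataclass
from typing import Callable

# A verifier takes a claimed theorem and a proof object and reports
# whether the proof establishes the theorem in its logic.
Verifier = Callable[[str, object], bool]


@dataclass
class VerifierSwapRequest:
    new_verifier: Verifier
    # A proof, in the *current* verifier's language, that the new verifier
    # only accepts theorems the current verifier would accept on some proof.
    soundness_proof: object


def reflection_statement(new_verifier: Verifier) -> str:
    """Placeholder for the formalized claim: 'every theorem the new verifier
    accepts is one the current verifier would accept given some proof'."""
    return f"sound({getattr(new_verifier, '__name__', 'new_verifier')})"


class ProtectedMetalayer:
    """The protected metalevel: only humans may modify this code.

    Verifiers only check object-level proofs; their logics never need to
    mention this swapping machinery, which is how the Löbstacle is avoided.
    """

    def __init__(self, initial_verifier: Verifier) -> None:
        self.current = initial_verifier

    def check(self, theorem: str, proof: object) -> bool:
        # Ordinary use: ask the current verifier to check a proof.
        return self.current(theorem, proof)

    def replace_verifier(self, request: VerifierSwapRequest) -> bool:
        # Accept the new verifier only if the *current* verifier checks a
        # proof of the reflection statement about it; the proof says
        # nothing about replace_verifier itself.
        claim = reflection_statement(request.new_verifier)
        if self.current(claim, request.soundness_proof):
            self.current = request.new_verifier
            return True
        return False
```

The point is only the architecture: the metalayer checks the reflection statement with the current verifier and then performs the swap itself, so the new verifier’s soundness proof never has to say anything about `replace_verifier`.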
(Thanks for the feedback!)