Zane

Karma: 571

Zane Oct 26, 2023, 2:02 PM
3 points
0
in reply to: Joe Collman’s comment on: Lying to chess players for alignment
I was thinking I would test the players to make sure they really could beat each other as they should be able to. Good points on using blitz and doing the test afterwards; my main reason for testing before the game rather than after is that I’d rather know up front whether the rankings are accurate than play for weeks and only later realize we were running the wrong test.
I wasn’t thinking of much in the way of limits on what Cs could say, although possibly some limits on whether the Cs can see and argue against each other’s advice. C’s goal is pretty much just “make A win the game” or “make A lose the game” as applicable.
I’m definitely thinking a prototype would help. I’ve actually been contacted about applying for a grant to make this a larger experiment, and I was planning on first running a one-day game or two as a prototype before expanding it with more people and longer games.

Zane Oct 26, 2023, 1:52 PM
1 point
0
in reply to: Richard Willis’s comment on: Lying to chess players for alignment
Individual positions like that could be an interesting thing to test; I’ll likely have some people try out some of those too.
I think the aspect where the deceivers have to tell the truth in many cases to avoid getting caught could make it more realistic, as in the real AI situation the best strategy might be to present a mostly coherent plan with a few fatal flaws.

Zane Oct 25, 2023, 10:14 PM
4 points
2
in reply to: aphyer’s comment on: Lying to chess players for alignment
Yeah, that’s a bit of an issue. I think in real life you would have some back-and-forth ability between advisors, but the complexity and unknowns of the real world would create a qualitative difference between the conversation and an actual game—which chess doesn’t have. Maybe we can either limit back-and-forth like you suggested, or just have short enough time controls that there isn’t enough time for that to get too far.

Zane Oct 25, 2023, 9:14 PM
15 points
8
in reply to: jessicata’s comment on: Lying to chess players for alignment
Yes, if this were only about chess, then having the advisors play games with each other as A watched would help A learn who to trust. I’m saying that since the real-world scenario we’re trying to model doesn’t allow such a thing to happen, we artificially forbid this in the chess game to make it more like the real-world scenario. The prediction market thing, similarly, would require being able to do a test run so that the dishonest advisors could lose their money by the time A had to make a choice.
I don’t think the advisors should be able to use chess engines, because then even the advisors themselves don’t understand the reasoning behind what the chess engines are saying. The premise of the experiment involves the advisors telling A “this is my reasoning on what the best move is; try to evaluate if it’s right.”

Zane Oct 25, 2023, 7:31 PM
13 points
11
in reply to: jessicata’s comment on: Lying to chess players for alignment
Neither of these would be allowed, because in the real world, you can’t do a bunch of test “games” before or during the actual “game.” There’s no way to perform a proposed alignment plan in a faraway galaxy, check whether that galaxy is destroyed, and make decisions for what to do on Earth based on that data—let alone perform so many of those tests to inform a prediction market based on what they say.
I would have allowed player A to consult a prediction market made by a bunch of other inexperienced players on who was really honest or lying. After all, in the real world, whoever was making the final decision on what plan to execute would be able to ask a prediction market what it thought. But the problem is that if I make a prediction market that’s supposed to only be for other players around player A’s level, somebody will just use a chess engine to cheat, bet in the market, and make it unrealistically accurate.

Zane Oct 25, 2023, 7:00 PM
1 point
0
in reply to: RHollerith’s comment on: Lying to chess players for alignment
24 hours per move would make the experiment a lot more accurate, but I expect a lot of players might not be willing to play a game that could last several months. I’ll ask everyone how long they can handle.

Zane Oct 25, 2023, 6:54 PM
1 point
0
in reply to: javva209’s comment on: Lying to chess players for alignment
Unsure about the time controls at the moment; see my response to aphyer. The advisors would be able to give the A player justification for the move they’ve recommended.
The concern that A might not be able to understand the reasoning that the advisors give them is a valid one, and that’s the whole point of the experiment! If A can’t follow the reasoning well enough to determine whether it’s good advice, then (says the analogy) people who are asking AIs how to solve alignment can’t follow their reasoning well enough to determine whether it’s good advice.

Zane Oct 25, 2023, 6:43 PM
2 points
0
in reply to: aphyer’s comment on: Lying to chess players for alignment
I think a time control of some sort would be helpful just so that it doesn’t take a whole week, but I would prefer it to be a fairly long time control. Not long enough to play a whole new game, though, because that’s not an option when it comes to alignment—in the analogy, that would be like actually letting loose the advisors’ plans in another galaxy and seeing if the world gets destroyed.
I’m not sure exactly what the time control would be; maybe something like 4 hours on each side, if we’re using standard chess time controls. I’m also thinking about a less traditional method of time control: for example, on each move, the advisors have 4 minutes to compose their answers, and A has another 4 minutes to look them over and make a decision. But then it’s hard to decide how much time it’s fair to give B for each move: 4 minutes, 8 minutes, or somewhere in between?
I don’t think chess engines would be allowed; the goal is for the advisors to be able to explain their own reasoning (or a lie about their reasoning), and they can’t do that if Stockfish reasons for them.

Zane Oct 25, 2023, 5:47 PM
1 point
0
on: Lying to chess players for alignment
I can be any of A, B, or C. I’ve been playing chess for the past ten years, and my USCF rating was in the upper 1500s when I last played in person a year ago. I’m usually available from 9 PM UTC to 2 AM UTC (afternoon to evening in American time) every day, and on Saturdays from 5 PM UTC to 2 AM UTC.

[Question] Lying to chess players for alignment

Zane Oct 25, 2023, 5:47 PM

97 points

54 comments · 1 min read

Zane Oct 24, 2023, 1:36 AM
1 point
0
in reply to: johnswentworth’s comment on: What is an “anti-Occamian prior”?
Does this apply at all to anything more probabilistic than just reversing the outcome of a single most likely hypothesis and the next bit(s) it outputs? An Occamian prior doesn’t just mean “this is the shortest hypothesis; therefore it is true,” it means hypotheses are weighted by their simplicity. It’s possible for an Occamian prior to think the shortest hypothesis is most likely wrong, if there are several slightly longer hypotheses that have more probability in total.
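To make the last point concrete, here is a minimal sketch in Python. The four hypotheses, their lengths, and the 2^(-length) weighting are all assumptions chosen for illustration; they are not taken from the original post or from johnswentworth’s comment.

```python
from fractions import Fraction

# Hypothetical toy setup: each hypothesis is
# (description length in bits, the next bit it predicts).
# All four are assumed to fit the data seen so far equally well.
hypotheses = [
    (3, 0),  # the single shortest hypothesis
    (4, 1),  # three slightly longer hypotheses that agree with each other
    (4, 1),
    (4, 1),
]

def occam_weight(length_bits):
    """Simplicity weight: each extra bit of length halves the prior mass."""
    return Fraction(1, 2 ** length_bits)

def next_bit_distribution(hyps):
    """Sum the weights of the hypotheses behind each prediction, then normalize."""
    totals = {0: Fraction(0), 1: Fraction(0)}
    for length_bits, predicted_bit in hyps:
        totals[predicted_bit] += occam_weight(length_bits)
    norm = sum(totals.values())
    return {bit: weight / norm for bit, weight in totals.items()}

next_bit_probs = next_bit_distribution(hypotheses)
print(next_bit_probs)
```

Under these assumed numbers the shortest hypothesis carries weight 1/8, while the three longer ones carry 3/16 in total, so the prior puts 3/5 on the shortest hypothesis’s prediction being wrong even though it is the single most likely hypothesis.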

Zane Oct 23, 2023, 3:22 PM
2 points
0
on: VLM-RM: Specifying Rewards with Natural Language
That’s terrifyingly cool! I notice that they usually fall over after having completed the assigned position; are you only rewarding them being in a position at a particular point in time, after which there’s nothing left to optimize for? Are you able to make them maintain a position for longer?

Zane Oct 23, 2023, 3:14 PM
1 point
0
in reply to: cubefox’s comment on: What is an “anti-Occamian prior”?
What exactly is an “event set” in this context? I don’t think a hypothesis would necessarily correspond to a particular set of events that it permitted, but rather its own probability distribution over which events you would be more likely to see under that hypothesis. In that sense, an event set with no probabilities attached would not be enough to specify which hypothesis you were talking about, because multiple hypotheses could correspond to the same set of permitted events despite assigning very different probabilities to each of those events occurring.
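A small sketch of that distinction, assuming a single coin flip as the event space; the two hypotheses and their probabilities are made up for illustration.

```python
from fractions import Fraction

# Two hypothetical hypotheses about a coin. Both permit the same events,
# but they assign very different probabilities to them.
fair = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
biased = {"H": Fraction(99, 100), "T": Fraction(1, 100)}

def event_set(hypothesis):
    """The events a hypothesis permits, i.e. those with nonzero probability."""
    return {event for event, prob in hypothesis.items() if prob > 0}

def likelihood(hypothesis, observations):
    """Probability the hypothesis assigns to a sequence of independent observations."""
    result = Fraction(1)
    for obs in observations:
        result *= hypothesis.get(obs, Fraction(0))
    return result

same_events = event_set(fair) == event_set(biased)
data = ["T", "T", "T"]
ratio = likelihood(fair, data) / likelihood(biased, data)
print(same_events, ratio)
```

The two event sets are identical, yet three tails in a row favor the fair coin by a factor of 125000, so the event set alone cannot tell you which hypothesis is meant.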

[Question] What is an “anti-Occamian prior”?

Zane Oct 23, 2023, 2:26 AM

35 points

22 comments · 1 min read

Zane Oct 18, 2023, 12:09 AM
1 point
0
in reply to: Radford Neal’s comment on: Eliezer’s example on Bayesian statistics is wr… oops!
Yeah, I discovered that part by accident at one point because I used the binomial distribution equation in a situation where it didn’t really apply, but still got the right answer.
I would think the most natural way to write a likelihood function would be to divide by the integral from 0 to 1, so that the total area under the curve is 1. That way the integral from a to b gives the probability the hypothesis assigns to receiving a result between a and b. But all that really matters is the ratios, which stay the same even without that.
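As a rough numerical check of the claim that only the ratios matter, here is a sketch that assumes a binomial-style likelihood over a coin’s bias. The data (7 heads in 10 flips), the function names, and the midpoint-rule integrator are illustrative choices, not anything from the original post.

```python
def raw_likelihood(theta, heads=7, flips=10):
    """Unnormalized likelihood of a bias theta given the (hypothetical) data."""
    return theta ** heads * (1 - theta) ** (flips - heads)

def integrate(f, lo=0.0, hi=1.0, steps=10_000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

# Divide by the integral from 0 to 1 so the total area under the curve is 1.
area = integrate(raw_likelihood)

def normalized_likelihood(theta):
    return raw_likelihood(theta) / area

# The normalizing constant cancels whenever two values of theta are compared.
ratio_raw = raw_likelihood(0.7) / raw_likelihood(0.5)
ratio_normalized = normalized_likelihood(0.7) / normalized_likelihood(0.5)
print(area, ratio_raw, ratio_normalized)
```

Both ratios agree up to floating-point error, which is why dropping the constant factor, or the binomial coefficient, leaves every comparison between hypotheses unchanged.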

Eliezer’s example on Bayesian statistics is wr… oops!

Zane Oct 17, 2023, 6:38 PM

72 points

13 comments · 7 min read

Zane Oct 2, 2023, 7:55 PM
1 point
0
on: Fifty Flips
I got alternating THTHTHTHTH… for the first 28 flips, which I would have thought would be very unlikely on priors for the 80% rule. Are you sure that’s an accurate description of the rule? It doesn’t change halfway through?

Zane Aug 4, 2023, 2:24 AM
1 point
0
in reply to: Eli Tyre’s comment on: Final Words
I voted up on every comment in this chain on which someone stated that they voted it up, and down on every comment on this chain on which someone stated that they voted it down, removing votes when they cancelled out and using strong-votes instead when they added together. I regret to say that the comment by Dorikka seems to have had three more people say that they voted it up than that they voted it down, so although I gave it a strong upvote, I have only been able to replicate two-thirds of the original vote. I upvoted Dorikka’s last comment on another post to bring the universe back into balance.

Zane Jul 13, 2023, 4:10 PM
3 points
1
in reply to: SMK’s comment on: Betting on Logic
But wouldn’t what Peano is capable of proving about your specific algorithm necessarily be “downstream” of the output of that algorithm itself? The Peano axioms are upstream, yes, but what Peano proves about a particular function depends on what that function is.

Zane Jul 12, 2023, 10:33 PM
4 points
−2
on: Betting on Logic
I would think that FDT chooses Bet 2, unless I’m misunderstanding something about the role of Peano Arithmetic here. Taking Bet 2 results in P being true, and vice versa for Bet 1; therefore, the only options that are actually possible are the bottom left and the top right.
In fact, this seems like the exact sort of situation in which FDT can be easily shown to outperform CDT. CDT would reason along the lines of “Bet 1 is better if P is true, and better if P is false, and therefore better overall” without paying attention to the direct dependency between the output of your decision algorithm and the truth value of P.
I’m not quite sure what Yudkowsky and Soares meant by “dominance” there. I’d guess on priors that they meant FDT pays attention to those dependencies when deciding whether one strategy outperforms another… but yeah, they kind of worded it in a way that suggests the opposite interpretation.
