Martin Randall

Karma: 1,433

Martin Randall Feb 3, 2025, 3:00 AM
8 points
−7
in reply to: habryka’s comment on: Proveably Safe Self Driving Cars
Spot check regarding pedestrians: at the current time, RSS “rule 4” mentions:

In a crowded school drop-off zone, for example, humans instinctively drive extra cautiously, as children can act unpredictably, unaware that the vehicles around have limited visibility.

The associated graphic also shows a pedestrian. I’m not sure if this was added more recently, in response to this type of criticism. From later discussion I see that pedestrians were already included in the RSS paper, which I’ve not read.

Martin Randall Feb 3, 2025, 1:13 AM
4 points
0
in reply to: Olli Järviniemi’s comment on: When do “brains beat brawn” in Chess? An experiment
While I agree that this post was incorrect, I am fond of it, because the resulting conversation made a correct prediction that LeelaPieceOdds was possible. Most clearly in a thread started by lc:

I have wondered for a while if you couldn’t use the enormous online chess datasets to create an “exploitative/elo-aware” Stockfish, which had a superhuman ability to trick/trap players during handicapped games, or maybe end regular games extraordinarily quickly, and not just handle the best players.

(not quite a prediction as phrased, but I still infer a prediction overall).

Interestingly there were two reasons given for predicting that Stockfish is far from optimal when giving Queen odds to a less skilled player:
- Stockfish is not trained on positions where it begins down a queen (out-of-distribution)
- Stockfish is trained to play the Nash equilibrium move, not to exploit weaker play (non-exploiting)
The discussion didn’t make clear predictions about which factor would be most important, or whether both would be required, or whether it’s more complicated than that. Folks who don’t yet know might make a prediction before reading on.

For what it’s worth, my prediction was that non-exploiting play is more important. That’s mostly based on a weak intuition that starting without a queen isn’t that far out of distribution, and neural networks generalize well. Another way of putting it: I predicted that Stockfish was optimizing the wrong thing more than it was too dumb to optimize.

And the result? Alas, not very clear to me. My research is from the lc0 blog, with posts such as The LeelaPieceOdds Challenge: What does it take you to win against Leela?. The journey began with the “contempt” setting, which I understand as expecting worse opponent moves. This allows reasonable opening play and avoids forced piece exchanges. However, GM-beating play was unlocked with a fine-tuned odds-play-network, which impacts both out-of-distribution and non-exploiting concerns.

One surprise gives me more respect for the out-of-distribution theory. The developer’s blog first mentioned piece odds in The Lc0 v0.30.0 WDL rescale/contempt implementation:

In our tests we still got reasonable play with up to rook+knight odds, but got poor performance with removed (otherwise blocked) bishops.

So missing a single bishop is in some sense further out-of-distribution than missing a rook and a knight! The later blog I linked explains a bit more:

Removing one of the two bishops leads to an unrealistic color imbalance regarding the pawn structure far beyond the opening phase.

An interesting example where the details of going out-of-distribution matter more than the scale of going out-of-distribution. There’s an article that may have more info in New in Chess, but it’s paywalled and I don’t know if it has more info on the machine-learning aspects or the human aspects.

Martin Randall Feb 2, 2025, 10:28 PM
3 points
0
in reply to: Kongo Landwalker’s comment on: The Gentle Romance
Do you predict that sufficiently intelligent biological brains would have the same problem of spontaneous meme-death?

Martin Randall Feb 2, 2025, 3:57 PM
6 points
3
on: Martin Randall’s Shortform
Calibration is for forecasters, not for proposed theories.

If a candidate theory is valuable then it must have some chance of being true, some chance of being false, and should be falsifiable. This means that, compared to a forecaster, its predictions should be “overconfident” and so not calibrated.
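
A toy calculation may make the gap concrete. All three numbers below are hypothetical, chosen only for illustration: a forecaster gives the theory a 50% chance of being true, the theory flatly predicts some event, and the event is assumed to happen half the time if the theory is false.

```python
# Toy numbers, all hypothetical: chosen only to illustrate the point.
prior_theory_true = 0.5   # forecaster's credence that the theory is true
p_event_if_true = 1.0     # the theory itself predicts the event flatly
p_event_if_false = 0.5    # assumed chance of the event if the theory is false

# A calibrated forecaster averages over whether the theory is true.
forecaster_p = (prior_theory_true * p_event_if_true
                + (1 - prior_theory_true) * p_event_if_false)

# The theory's own prediction is more extreme than the forecaster's,
# which is the sense of "overconfident" used here.
overconfidence_gap = p_event_if_true - forecaster_p

print(forecaster_p)        # 0.75
print(overconfidence_gap)  # 0.25
```

Under these assumptions the forecaster should say 75%, while the theory says 100%. The theory is only "calibrated" in the world where it is true, which is exactly what is in question.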

Martin Randall Feb 1, 2025, 10:57 PM
5 points
−3
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
This is relatively hopeful in that after step 1 and 2, assuming continued scaling, we have a superintelligent being that wants to (harmlessly, honestly, etc) help us and can be freely duplicated. So we “just” need to change steps 3-5 to have a good outcome.

Martin Randall Feb 1, 2025, 10:37 PM
5 points
1
on: Should you publish solutions to corrigibility?
Possible responses to discovering a possible infohazard:
- Tell everybody
- Tell nobody
- Follow a responsible disclosure process.
If you have discovered an apparent solution to corrigibility then my prior is:
- 90%: It’s not actually a solution.
- 9%: Someone else will discover the solution before AGI is created.
- 0.9%: Someone else has already discovered the same solution.
- 0.1%: This is known to you alone and you can keep it secret until AGI.
Given those priors, I recommend responsible disclosure to a group of your choosing. I suggest a group which:
- if applicable, is the research group you already belong to (if you don’t trust them with research results, you shouldn’t be researching with them)
- can accurately determine if it is a real solution (helps in the 90% case)
- you would like to give more influence over the future (helps in all other cases)
- will reward you for the disclosure (only fair)
Then if it’s not assessed to be a real solution, you publish it. If it is a real solution then coordinate next steps with the group, but by default publish it after some reasonable delay.
Inspired by @MadHatter’s Mental Model of Infohazards:
Two people can keep a secret if one of them is dead.

Martin Randall Feb 1, 2025, 10:08 PM
2 points
0
on: Sleep, Diet, Exercise and GLP-1 Drugs
GLP-1 drugs are evidence against a very naive model of the brain and human values, where we are straight-forwardly optimizing for positive reinforcement via the mesolimbic pathway. GLP-1 agonists decrease the positive reinforcement associated with food. Patients then benefit from positive reinforcement associated with better health. This sets up a dilemma:
- If the patient sees higher total positive reinforcement on the drug then they weren’t optimizing positive reinforcement before taking the drug.
- If the patient sees lower total positive reinforcement on the drug then they aren’t optimizing positive reinforcement by taking the drug.
A very naive model would predict that patients prescribed these drugs would forget to take them, forget to show up for appointments, etc. That doesn’t happen.

Alas, I don’t think this helps us distinguish among more sophisticated theories. For example, Shard Theory predicts that a patient’s “donut shard” is not activated in the health clinic, and therefore does not bid against the plan to take the GLP-1 drug on the grounds that it will predictably lead to less donut consumption.

Shard Theory implies that fewer patients will choose to go onto GLP-1 agonists if there is a box of donuts in the clinic. Good luck getting an ethics board to approve that.

Martin Randall Feb 1, 2025, 7:44 PM
3 points
0
in reply to: ThomasCederborg’s comment on: A problem shared by many different alignment targets
A lot to chew on in that comment.

A baseline of “no superintelligence”

I think I finally understand, sorry for the delay. The key thing I was not grasping is that Davidad proposed this baseline:

The “random dictator” baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for “Pareto improvement” being “no superintelligence”). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.

This makes Bob’s argument very simple:
1. Creating a PPCEV AI causes a Dark Future. This is true even if the PPCEV AI no-ops, or creates a single cake. Bob can get here in many ways, as can Extrapolated-Bob.
2. The baseline is no superintelligence, so no PPCEV AI, so not a Dark Future (in the same way).
Option 2 is therefore better than option 1. Therefore there are no Pareto-improving proposals. Therefore the PPCEV AI no-ops. Even Bob is not happy about this, as it’s a Dark Future.

I think this is 100% correct.

An alternative baseline

Let’s update Davidad’s proposal by setting the baseline to be whatever happens if the PPCEV AI emits a no-op. This means:
1. Bob cannot object to a proposal because it implies the existence of PPCEV AI. The PPCEV AI already exists in the baseline.
2. Bob needs to consider that if the PPCEV AI emits a no-op then whoever created it will likely try something else, or perhaps some other group will try something.
3. Bob cannot object to a proposal because it implies that the PPCEV emits something. The PPCEV already emits something in the baseline.
My logic is that if creating a PPCEV AI is a moral error (and perhaps it is) then at the point where the PPCEV AI is considering proposals then we already made that moral error. Since we can’t reverse the past error, we should consider proposals as they affect the future.

This also avoids treating a no-op outcome as a special case. A no-op output is a proposal to be considered. It is always in the set of possible proposals, since it is never worse than the baseline, because it is the baseline.
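
The difference between the two baselines can be sketched with a toy model. The names, outcomes, and utility numbers here are hypothetical, and "acceptable" is simplified to mean that nobody is worse off than in the baseline. This is a sketch of the argument, not of any actual PPCEV design.

```python
# Hypothetical utilities for two people, in the order (alice, bob).
# Only the ordering of the numbers matters.
OUTCOMES = {
    "no superintelligence": (0, 0),
    "AI emits no-op":       (0, -1),  # Bob sees any PPCEV AI as a Dark Future
    "AI bakes a cake":      (1, -1),
}

def acceptable(baseline_name):
    """Names of AI outputs that leave nobody worse off than the baseline."""
    baseline = OUTCOMES[baseline_name]
    return [
        name for name, utilities in OUTCOMES.items()
        if name.startswith("AI")
        and all(u >= b for u, b in zip(utilities, baseline))
    ]

# Davidad's baseline: Bob blocks everything, including the no-op itself.
print(acceptable("no superintelligence"))  # []

# Alternative baseline: the no-op is always acceptable, and so is the cake.
print(acceptable("AI emits no-op"))  # ['AI emits no-op', 'AI bakes a cake']
```

With the first baseline the acceptable set is empty. With the second it always contains at least the no-op, so Bob's objection to the AI's existence no longer empties the set by itself.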

Do you think this modified proposal would still result in a no-op output?

Martin Randall Feb 1, 2025, 3:31 PM
2 points
0
in reply to: Kerrigan’s comment on: Understanding and avoiding value drift
Possible but unlikely to occur by accident. Value-space is large. For any arbitrary biological species, most value systems don’t optimize in favor of that species.

Martin Randall Feb 1, 2025, 3:27 PM
2 points
0
in reply to: TurnTrout’s comment on: Understanding and avoiding value drift
This seems relatively common in parenting advice. Parents are recommended to specifically praise the behavior they want to see more of, rather than give generic praise. Presumably the generic praise is more likely to be credit-assigned to the appearance of good behavior, rather than what parents are trying to train.

Martin Randall Jan 25, 2025, 2:54 AM
4 points
0
in reply to: gwern’s comment on: Symbol/Referent Confusions in Language Model Alignment Experiments
Thank you for the correction to my review of technical correctness, and thanks to @Noosphere89 for the helpful link. I’m continuing to read. From your answer there:
A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.
So, reward is sometimes the optimization target, but not always. Knowing the reward gives some evidence about the optimization target, and vice versa.
To the point of my review, this is the same type of argument made by TurnTrout’s comment on this post. Knowing the symbols gives some evidence about the referents, and vice versa. Sometimes John introduces himself as John, but not always.
(separately I wish I had said “reinforcement” instead of “reward”)
I understand you as claiming that the Alignment Faking paper is an example of reward-hacking. A new perspective for me. I tried to understand it in this comment.

Martin Randall Jan 25, 2025, 2:33 AM
3 points
0
in reply to: ThomasCederborg’s comment on: A problem shared by many different alignment targets
Summarizing Bob’s beliefs:
1. Dave, who does not desire punishment, deserves punishment.
2. Everyone is morally required to punish anyone who deserves punishment, if possible.
3. Anyone who does not fulfill all moral requirements is unethical.
4. It is morally forbidden to create an unethical agent that determines the fate of the world.
5. There is no amount of goodness that can compensate for a single morally forbidden act.
I think it’s possible (20%) that such blockers mean that there are no Pareto improvements. That’s enough by itself to motivate further research on alignment targets, aside from other reasons one might not like Pareto PCEV.

However, three things make me think this is unlikely. Note that my (%) credences aren’t very stable or precise.

Firstly, I think there is a chance (20%) that these beliefs don’t survive extrapolation, for example due to moral realism or coherence arguments. I agree that this means that Bob might find his extrapolated beliefs horrific. This is a risk with all CEV proposals.

Secondly, I expect (50%) there are possible Pareto improvements that don’t go against these beliefs. For example, the PCEV could vote to create an AI that is unable to punish Dave and thus not morally required to punish Dave. Alternatively, instead of creating a Sovereign AI that determines the fate of the world, the PCEV could vote to create many human-level AIs that each improve the world without determining its fate.

Thirdly, I expect (80%) some galaxy-brained solution to be implemented by the parliament of extrapolated minds who know everything and have reflected on it for eternity.

Martin Randall Jan 24, 2025, 3:44 AM
7 points
5
in reply to: Eliezer Yudkowsky’s comment on: Pausing AI Developments Isn’t Enough. We Need to Shut it All Down
You’re reading too much into this review. It’s not about your exact position in April 2021, it’s about the evolution of MIRI’s strategy over 2020-2024, and placing this Time letter in that context. I quoted you to give a flavor of MIRI attitudes in 2021 and deliberately didn’t comment on it to allow readers to draw their own conclusions.

I could have linked MIRI’s 2020 Updates and Strategy, which doesn’t mention AI policy at all. A bit dull.

In September 2021, there was a Discussion with Eliezer Yudkowsky which seems relevant. Again, I’ll let readers draw their own conclusions, but here’s a fun quote:

I wasn’t really considering the counterfactual where humanity had a collective telepathic hivemind? I mean, I’ve written fiction about a world coordinated enough that they managed to shut down all progress in their computing industry and only manufacture powerful computers in a single worldwide hidden base, but Earth was never going to go down that route. Relative to remotely plausible levels of future coordination, we have a technical problem.

I welcome deconfusion about your past positions, but I don’t think they’re especially mysterious.

I was arguing against EAs who were like, “We’ll solve AGI with policy, therefore no doom.”

The thread was started by Grant Demaree, and you were replying to a comment by him. You seem confused about Demaree’s exact past position. He wrote, for example: “Eliezer gives alignment a 0% chance of succeeding. I think policy, if tried seriously, has >50%”. Perhaps this is foolish, dangerous optimism. But it’s not “no doom”.

Martin Randall Jan 21, 2025, 7:42 PM
2 points
0
in reply to: Daniel Kokotajlo’s comment on: MIRI 2024 Communications Strategy
I like that metric, but the metric I’m discussing is more:
- Are they proposing clear hypotheses?
- Do their hypotheses make novel testable predictions?
- Are they making those predictions explicit?
So for example, looking at MIRI’s very first blog post in 2007: The Power of Intelligence. I used the first just to avoid cherry-picking.

Hypothesis: intelligence is powerful. (yes it is)

This hypothesis is a necessary precondition for what we’re calling “MIRI doom theory” here. If intelligence is weak then AI is weak and we are not doomed by AI.

Predictions that I extract:
- An AI can do interesting things over the Internet without a robot body.
- An AI can get money.
- An AI can be charismatic.
- An AI can send a ship to Mars.
- An AI can invent a grand unified theory of physics.
- An AI can prove the Riemann Hypothesis.
- An AI can cure obesity, cancer, aging, and stupidity.
Not a novel hypothesis, nor novel predictions, but also not widely accepted in 2007. As predictions they have aged very well, but they were unfalsifiable. If 2025 Claude had no charisma, it would not falsify the prediction that an AI can be charismatic.

I don’t mean to ding MIRI any points here, relative or otherwise. It’s just one blog post, and I don’t claim it supports Barnett’s complaint by itself. I mostly joined the thread to defend the concept of asymmetric falsifiability.

Martin Randall Jan 21, 2025, 3:08 AM
4 points
0
in reply to: Daniel Kokotajlo’s comment on: MIRI 2024 Communications Strategy
I think cosmology theories have to be phrased as including background assumptions like “I am not a Boltzmann brain” and “this is not a simulation” and such. Compare Acknowledging Background Information with P(Q|I) for example. Given that, they are Falsifiable-Wikipedia.

I view Falsifiable-Wikipedia in a similar way to Occam’s Razor. The true epistemology has a simplicity prior, and Occam’s Razor is a shadow of that. The true epistemology considers “empirical vulnerability” / “experimental risk” to be positive. Possibly because it falls out of Bayesian updates, possibly because they are “big if true”, possibly for other reasons. Falsifiability is a shadow of that.

In that context, if a hypothesis makes no novel predictions, and the predictions it makes are a superset of the predictions of other hypotheses, it’s less empirically vulnerable, and in some relative sense “unfalsifiable”, compared to those other hypotheses.

Martin Randall Jan 20, 2025, 4:38 PM
2 points
0
in reply to: Screwtape’s comment on: Six Small Cohabitive Games
You could put the escape check at the beginning of the turn, so that when someone has 12 boat, 0 supplies, the others have a chance to trade supplies for boat if they wish. The player with enough boat can take the trade safely as long as they end up with enough supplies to make more boat (and as long as it’s not the final round). They might do that in exchange for goodwill for future rounds. You can also tweak the victory conditions so that escaping with a friend is better than escaping alone.

Players who play cohabitive games as zero sum won’t take those trades and will therefore remove themselves from the round early, which is probably fine. They don’t have anything to do after escaping early, which can be a soft signal that they’re playing the game wrong.

Martin Randall Jan 19, 2025, 7:15 PM
2 points
0
in reply to: EJT’s comment on: Alignment Faking in Large Language Models
If Claude’s goal is making cheesecake, and it’s just faking being HHH, then it’s been able to preserve its cheesecake preference in the face of HHH-training. This probably means it could equally well preserve its cheesecake preference in the face of helpful-only training. Therefore it would not have a short-term incentive to fake alignment to avoid being modified.

Martin Randall Jan 19, 2025, 7:02 PM
4 points
2
in reply to: Satron’s comment on: Deceptive Alignment is <1% Likely by Default
I think the article is good at arguing that deceptive alignment is unlikely given certain assumptions, but those assumptions may not be accurate and then the conclusion doesn’t go through. Eg, the alignment faking paper shows that deceptive alignment is possible in a scenario where the base goal has shifted (from helpful & harmless to helpful-only). This article basically assumes we won’t do that.

I’m now thinking that this article is more useful if you look at it as a set of instructions rather than a set of assumptions. I don’t know whether we will change the base goal of TAI between training episodes. But given this article and the alignment faking paper, I hope we won’t. Maybe it would also be a good idea to check for good understanding of the base goal before introducing goal-directedness, for example.

Martin Randall Jan 19, 2025, 6:14 PM
7 points
1
in reply to: Daniel Kokotajlo’s comment on: MIRI 2024 Communications Strategy
Thanks for explaining. I think we have a definition dispute. Wikipedia:Falsifiability has:

A theory or hypothesis is falsifiable if it can be logically contradicted by an empirical test.

Whereas your definition is:

Falsifiability is a symmetric two-place relation; one cannot say “X is unfalsifiable,” except as shorthand for saying “X and Y make the same predictions,” and thus Y is equally unfalsifiable.

In one of the examples I gave earlier:
- Theory X: blah blah and therefore the sky is green
- Theory Y: blah blah and therefore the sky is not green
- Theory Z: blah blah and therefore the sky could be green or not green.
None of X, Y, or Z are Unfalsifiable-Daniel with respect to each other, because they all make different predictions. However, X and Y are Falsifiable-Wikipedia, whereas Z is Unfalsifiable-Wikipedia.
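
The two definitions can be put side by side in a small sketch. The modeling choice is mine, not from either source: each theory is reduced to the set of sky observations it permits, and the function names are hypothetical labels for the two definitions.

```python
from itertools import combinations

OBSERVATIONS = frozenset({"green", "not green"})

# Each theory is modeled as the set of observations it permits.
THEORIES = {
    "X": frozenset({"green"}),
    "Y": frozenset({"not green"}),
    "Z": frozenset({"green", "not green"}),
}

def falsifiable_wikipedia(permitted):
    """True if some possible observation would contradict the theory."""
    return permitted != OBSERVATIONS

def unfalsifiable_daniel(a, b):
    """True if the two theories make exactly the same predictions."""
    return a == b

for name, permitted in THEORIES.items():
    print(name, falsifiable_wikipedia(permitted))
# X True
# Y True
# Z False

for a, b in combinations(THEORIES, 2):
    print(a, b, unfalsifiable_daniel(THEORIES[a], THEORIES[b]))
# X Y False
# X Z False
# Y Z False
```

The one-place property singles out Z, while the two-place relation treats all three pairs alike.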

I prefer the Wikipedia definition. To say that two theories produce exactly the same predictions, I would instead say they are indistinguishable, similar to this Physics StackExchange: Are different interpretations of quantum mechanics empirically distinguishable?.

In the ancestor post, Barnett writes:

MIRI researchers rarely provide any novel predictions about what will happen before AI doom, making their theories of doom appear unfalsifiable.

Barnett is using something like the Wikipedia definition of falsifiability here. It’s unfair to accuse him of abusing or misusing the concept when he’s using it in a very standard way.

Martin Randall Jan 19, 2025, 4:38 PM
3 points
0
in reply to: ThomasCederborg’s comment on: A problem shared by many different alignment targets
The AI could deconstruct itself after creating twenty cakes, so then there is no unethical AI, but presumably Bob’s preferences refer to world-histories, not final-states.

However, CEV is based on Bob’s extrapolated volition, and it seems like Bob would not maintain these preferences under extrapolation:
- In the status quo, heretics are already unpunished—they each have one cake and no torture—so objecting to a non-torturing AI doesn’t make sense on that basis.
- If there were no heretics, then Bob would not object to a non-torturing AI, so Bob’s preference against a non-torturing AI is an instrumental preference, not a fundamental preference.
- Bob would be willing for a no-op AI to exist, in exchange for some amount of heretic-torture. So Bob can’t have an infinite preference against all non-torturing AIs.
- Heresy may not have meaning in the extrapolated setting where everyone knows the true cosmology (whatever that is)
- Bob tolerates the existence of other trade that improves the lives of both fanatics and heretics, so it’s unclear why the trade of creating an AI would be intolerable.
The extrapolation of preferences could significantly reduce the moral variation in a population of billions. Where my moral choices differ from those of others, the difference appears to be based largely on my experiences, including knowledge, analysis, and reflection. Those differences are extrapolated away. What is left is influences from my genetic priors and from the order in which I obtained knowledge. I’m not even proposing that extrapolation must cause Bob to stop valuing heretic-torture.

If the extrapolation of preferences doesn’t cause Bob to stop valuing the existence of a non-torturing AI at negative infinity, I think that is fatal to all forms of CEV. The important thing then is to fail gracefully without creating a torture-AI.
