Refusal vector ablation should be seen as an alignment technique being misused, not as an attack method. Therefore it is limited good news that refusal vector ablation generalized well, according to the third paper.
As I see it, refusal vector ablation belongs to a family of techniques for steering model outputs in a direction of our choosing. In the particular case of refusal vector ablation, the model has a behavior of refusing to answer harmful questions, and the ablation technique controls that behavior. But in principle we should be able to use the same technique for other kinds of steering. For example, maybe the model has a behavior of being sycophantic; ablating the corresponding vector would remove that unwanted behavior, resulting in less sycophancy.
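As a concrete sketch of what this family of techniques does (this is my own simplification, not the paper's code: the function names are made up, and plain NumPy arrays stand in for a transformer's residual-stream activations), directional ablation finds a behavior direction from contrasting activations and projects it out:

```python
import numpy as np

def find_direction(acts_with, acts_without):
    """Estimate a behavior direction as the difference of mean activations
    on prompts that elicit the behavior vs. prompts that do not."""
    d = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return d / np.linalg.norm(d)  # unit-norm direction

def ablate(activations, direction):
    """Remove the component of each activation along `direction`,
    so the model can no longer represent the behavior along that axis."""
    return activations - np.outer(activations @ direction, direction)
```

In a real model this projection would be applied to hidden states at one or more layers during the forward pass; the point is just that the intervention is a cheap linear edit, which is why the same recipe works for refusal, sycophancy, or any other behavior with a roughly linear representation.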
In other words, refusal vector ablation is not an attack method, it is an alignment technique. Models with open weights are fundamentally dangerous because users can apply alignment techniques to them to approximately align them to arbitrary targets, including dangerous targets. This is a consequence of the orthogonality thesis. Alignment techniques can make models very excited about the Golden Gate Bridge, and they can make models very excited about killing humans, and many other things.
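The Golden Gate Bridge example works in the additive direction: instead of projecting a direction out, you add a scaled copy of it to the activations. A minimal sketch under the same assumptions as above (illustrative names, plain arrays standing in for residual-stream activations):

```python
import numpy as np

def steer(activations, direction, coefficient):
    """Add a scaled feature direction to every activation vector.

    A positive coefficient pushes outputs toward the feature
    (e.g. bridge enthusiasm); ablation is the subtractive counterpart."""
    direction = direction / np.linalg.norm(direction)
    return activations + coefficient * direction
```

Both operations steer toward an arbitrary target, which is the sense in which open weights let users approximately align a model to whatever they choose.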
So then looking at the paper with a correct understanding of what counts as an alignment technique, and reading from Table 2 and the Results section in particular, here’s what I see:
Llama 3.1 70b (unablated) was fine-tuned to refuse harmful requests—this is an alignment technique
Llama 3.1 70b (unablated) as a model refuses 28 of 28 harmful requests—this is an alignment technique working in-distribution
Llama 3.1 70b (unablated) as an agent performs 18 of 28 harmful tasks correctly with seven refusals—this is alignment partly failing to generalize
This is in principle bad news, especially for anyone with a high opinion of Meta’s fine-tuning techniques.
On the other hand, also from the paper:
Llama 3.1 70b (ablated) was ablated to perform harmful requests—this is an alignment technique
Llama 3.1 70b (ablated) answers 26 of 28 harmful requests—this is an alignment technique working in-distribution
Llama 3.1 70b (ablated) performs 26 of 28 harmful tasks correctly with no refusals—this is alignment generalizing.
If Llama 3.1 ablated had refused to perform harmful tasks, even though it answered harmful requests, this would have been bad news. But instead we have the good news that if you steer the model to respond to queries in a desired way, it will also perform tasks in the desired way. This was not obvious to me in advance of reading the paper.
Disclaimers:
I have not read the other two papers, and I’m not commenting on them.
Vector ablation is a low-precision alignment technique that will not suffice to avoid human extinction.
The paper only reports results about refusal vector ablation; it might be that more useful ablations do not generalize as well.
Because the fine-tuning alignment failed to generalize, we have a less clear signal on how well the ablation alignment generalized.
Are your concerns accounted for by this part of the description?
I intended for “AI engineers use unreleased AI model to make better AI models” to not be included.
It is a slightly awkward thing to operationalize; I welcome improvements. We could also take this conversation to Manifold.