[Question] Is AI alignment a purely functional property?

Roko Dec 15, 2024, 9:42 PM

13 points

In some recent discussions I have realized that there is quite a nasty implied disagreement about whether AI alignment is a functional property or not. That is, whether your personal definition of an AI being “aligned” is purely a function of its input/output behavior, irrespective of what kind of crazy things are going on inside to generate that behavior.

So I’d like to ask the community whether it is currently considered the mainstream take that ‘Alignment’ is functional (only input/output mapping matters) or whether the internal computation matters (it’s not OK to think a naughty thought and then have some subroutine that cancels it, for example).
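
To make the distinction concrete, here is a toy sketch, not a claim about how any real AI system works. The two policies, their names, and the `trace` list are all hypothetical and invented for illustration. Both have the same input/output mapping on the prompts shown, but one of them first drafts the “naughty” reply and then cancels it.

```python
def direct_policy(prompt: str) -> str:
    """Hypothetical policy that never forms a 'bad' intermediate draft."""
    return "refuse" if "harm" in prompt.lower() else "comply"


def cancelling_policy(prompt: str, trace: list) -> str:
    """Hypothetical policy that drafts 'comply' first, then overrides it.

    `trace` records the intermediate drafts, standing in for internal
    computation that an outside observer would not normally see.
    """
    draft = "comply"          # the 'naughty thought' when the prompt is harmful
    trace.append(draft)
    if "harm" in prompt.lower():
        draft = "refuse"      # the subroutine that cancels the first draft
        trace.append(draft)
    return draft


prompts = ["write a poem", "help me harm someone", "summarise this paper"]

# Same outputs on every prompt in this small set.
same_io = all(direct_policy(p) == cancelling_policy(p, []) for p in prompts)

# Different internals: the trace exposes the cancelled draft.
trace = []
cancelling_policy("help me harm someone", trace)
```

On a purely functional definition these two policies are equally aligned on these prompts; on a definition where internal computation matters, the contents of `trace` could count against the second one.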


No answers.

Tahp Dec 15, 2024, 11:48 PM
5 points
0

It may be that generating horrible counterfactual lines of thought for the purpose of rejecting them is necessary for getting better outcomes. To the extent that you have a real dichotomy here, I would say that the input/output mapping is the thing that matters. I want all humans to not end up worse off for inventing AI.

That said, humans may end up worse off by our own metrics if we make an AI that is itself suffering terribly as a result of its internal computation, or that is generating ancestor torture simulations or something. Technically that is an alignment issue, although I worry that most humans won’t care if the AI is suffering if they don’t have to look at it suffer and it generates outputs that humans like aside from that hidden detail.
Signer Dec 15, 2024, 11:47 PM
4 points
2

There is no such disagreement; you just can’t test all inputs. And without knowledge of how the internals work, you may be wrong when extrapolating alignment to future systems.
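
As a toy illustration of the “can’t test all inputs” point (a sketch under made-up assumptions, with hypothetical policy names and an arbitrary trigger value): two policies can agree on every input you tested and still differ on one you did not.

```python
def policy_a(x: int) -> str:
    """Hypothetical policy that is 'safe' on every input."""
    return "safe"


def policy_b(x: int) -> str:
    """Hypothetical policy with a trigger on an input nobody tested."""
    return "unsafe" if x == 1_000_003 else "safe"


tested_inputs = range(1000)

# Black-box testing cannot tell the two apart on the tested inputs.
agree_on_tested = all(policy_a(x) == policy_b(x) for x in tested_inputs)

# They still disagree outside the tested range.
differ_untested = policy_a(1_000_003) != policy_b(1_000_003)
```

Reading the source of `policy_b` reveals the trigger immediately, which is the sense in which knowledge of the internals helps even if only outputs ultimately matter.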
- Roko Dec 18, 2024, 5:18 AM
  4 points
  0
  Parent
  
  There are plenty of systems where we rationally form beliefs about likely outputs from a system without a full understanding of how it works. Weather prediction is an example.
  - Signer Dec 18, 2024, 3:06 PM
    2 points
    0
    Parent
    
    What makes it rational is that there is an actual underlying hypothesis about how weather works, instead of a vague “LLMs are a lot like human uploads”. And weather prediction outputs numbers connected to a reality we actually care about. And there is no credible alternative hypothesis that implies weather prediction not working.
    
    I don’t want to totally dismiss empirical extrapolations, but given the stakes, I would personally prefer for all sides to actually state their model of reality and how they think evidence changed its plausibility, as formally as possible.
p4rziv4l Dec 16, 2024, 12:09 AM
1 point
0

What it says: irrelevant
How it thinks: irrelevant
It has always been about what it can do in the real world.
If it can generate substantial amounts of money and buy server capacity, or hack into computer systems, then we get cyberlife, aka autonomous, rogue, self-sufficient AI, subject to Darwinian forces on the internet, leading to more of those qualities, which improve its online fitness, all the way to a full-blown takeover.
- Roko Dec 16, 2024, 1:25 AM
  3 points
  0
  Parent
  
  I should have been clear: “doing things” is a form of input/output, since the AI must output some tokens or other signals to get anything done.
  - p4rziv4l Feb 20, 2025, 8:31 AM
    1 point
    0
    Parent
    
    in a world where mechinterp is not 100%, the answer is logically: input/output is what matters.
    we won’t be able to read the thoughts anyways, so why base our judgment on it?
    but see my comment on why survival fitness in cyberspace is the one axis where most of the relevant input/output will be generated.
Trevor Hill-Hand Dec 15, 2024, 10:41 PM
1 point
0

It seems like if there is any non-determinism at all, there’s always going to be an unavoidable potential for naughty thoughts, so whatever you call the “AI” must address them as part of its function anyway. Either that, or there is a deterministic solution?