mattmacdermott

Karma: 647

mattmacdermott 21 Nov 2022 16:53 UTC
LW: 6 AF: 4
AF
on: Seeking Power is Provably Instrumentally Convergent in MDPs
I’ve been thinking about whether these results could be interpreted pretty differently under different branding.
The current framing, if I understand it correctly, is something like, ‘Powerseeking is not desirable. We can prove that keeping your options open tends to be optimal and tends to meet a plausible definition of powerseeking. Therefore we should expect RL agents to seek power, which is bad.’
An alternative framing would be, ‘Making irreversible changes is not desirable. We can prove that keeping your options open tends to be optimal. Therefore we should not expect RL agents to make irreversible changes, which is good.’
I don’t think that the second framing is better than the first, but I do think that if you had run with it instead then lots of people would be nodding their heads and feeling reassured about corrigibility, instead of feeling like their views about instrumental convergence had been confirmed. That makes me feel like we shouldn’t update our views too much based on formal results that leave so much room for interpretation. If I showed a bunch of theorems about MDPs, with no exposition, to two people with different opinions about alignment, I expect they might come to pretty different conclusions about what they meant.
What do you think?
(To be clear I think this is a great post and paper, I just worry that there are pitfalls when it comes to interpretation.)

mattmacdermott 22 Nov 2022 19:41 UTC
1 point
in reply to: TurnTrout’s comment on: Seeking Power is Provably Instrumentally Convergent in MDPs
Fair enough, but in that example making irreversible decisions is unavoidable. What if we consider a modified tree such that one and only one branch is traversable in both directions, and utility can be anywhere?
I expect we get that the reversible branch is the most popular across the distribution of utility functions (but not necessarily that most utility functions prefer it). That sounds like cause for optimism: ‘optimal policies tend to avoid irreversible changes’.

mattmacdermott 17 Jan 2023 19:37 UTC
4 points
on: Information At A Distance Is Mediated By Deterministic Constraints
When you write $P[M_{0} | M_{n}] = P[M_{0} | f_{n}(M_{n})]$ I understand that to mean that $P[M_{0} | M_{n} = m_{n}] = P[M_{0} | f_{n}(M_{n}) = f_{n}(m_{n})]$ for all $m_{n}$. But when I look up definitions of conditional probability it seems that that notation would usually mean $P[M_{0} | M_{n} = m_{n}] = P[M_{0} | f_{n}(M_{n}) = m_{n}]$ for all $m_{n}$.
Am I confused or are you just using non-standard notation?

mattmacdermott 17 Jan 2023 22:40 UTC
1 point
in reply to: johnswentworth’s comment on: Information At A Distance Is Mediated By Deterministic Constraints
Thanks. Is there a particular source whose notation yours most aligns with?

Normative vs Descriptive Models of Agency

mattmacdermott 2 Feb 2023 20:28 UTC

26 points

5 comments, 4 min read

mattmacdermott 8 Feb 2023 19:21 UTC
2 points
0
in reply to: jacquesthibs’s comment on: Embedded Agency
1. In our universe, as opposed to the “current basic theory of AI” universe.
2. From Arbital:
A Cartesian agent setup is one where the agent receives sensory information from the environment, and the agent sends motor outputs to the environment, and nothing else can cross the “Cartesian border” separating the agent and environment. If you can eat a psychedelic mushroom that affects the way you process the world—not just presenting you with sensory information, but altering the computations you do to think—then this is an example of an event that “violates the Cartesian boundary”. Likewise if the agent drops an anvil on its own head. Nothing that happens in a Cartesian universe can kill a Cartesian agent or modify its processing; all the universe can do is send the agent sensory information, in a particular format, that the agent reads.
3. For embedded agency. In the old frame agents aren’t really made of anything.

mattmacdermott 9 Feb 2023 6:23 UTC
LW: 3 AF: 2
0
AF
in reply to: leogao’s comment on: Normative vs Descriptive Models of Agency
Nice, thanks. It seems like the distinction the authors make between ‘building agents from the ground up’ and ‘understanding their behaviour and predicting roughly what they will do’ maps to the distinction I’m making, but I’m not convinced by the claim that the second one is a much stronger version of the first.
The argument in the paper is that the first requires an understanding of just one agent, while the second requires an understanding of all agents. But it seems like they require different kinds of understanding, especially if the agent being built is meant to be some theoretical ideal of rationality. Building a perfect chess algorithm is just a different task to summarising the way an arbitrary algorithm plays chess (which you could attempt without even knowing the rules).
What links here?
- Some Summaries of Agent Foundations Work by mattmacdermott (15 May 2023 16:09 UTC; 56 points)

mattmacdermott 9 Feb 2023 6:35 UTC
1 point
0
in reply to: DragonGod’s comment on: Normative vs Descriptive Models of Agency
Do you expect useful generic descriptive models of agency to exist?

mattmacdermott 9 May 2023 9:59 UTC
1 point
on: The ground of optimization
An interesting point about the agency-as-retargetable-optimisation idea is that it seems like you can make the perturbation in various places upstream of the agent’s decision-making, but not downstream, i.e. you can retarget an agent by perturbing its sensors more easily than its actuators.

For example, to change a thermostat-controlled heating system to optimise for a higher temperature, the most natural perturbation might be to turn the temperature dial up, but you could also tamper with its thermistor so that it reports lower temperatures. On the other hand, making its heating element more powerful wouldn’t affect the final temperature.

I wonder if this suggests that an agent’s goal lives in the last place in a causal chain of things you can perturb to change the set of target states of the system.
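A toy simulation can make the thermostat example concrete. This is only a sketch under assumptions that are not in the comment above: a bang-bang controller, a simple heat-leak model, and made-up constants. The function name `settled_temperature` and all its parameters are hypothetical.

```python
def settled_temperature(setpoint=20.0, sensor_bias=0.0, heater_power=3.0,
                        outside=5.0, leak=0.1, dt=0.1, steps=2000):
    """Average temperature a toy bang-bang thermostat settles around.

    The dynamics and constants are illustrative assumptions, not a model
    of any real heating system.
    """
    temp = outside
    history = []
    for _ in range(steps):
        reported = temp + sensor_bias  # what the thermistor tells the controller
        heating = heater_power if reported < setpoint else 0.0
        temp += dt * (heating - leak * (temp - outside))  # heat in minus heat lost
        history.append(temp)
    tail = history[-500:]  # ignore the warm-up phase
    return sum(tail) / len(tail)


baseline = settled_temperature()
dial_up = settled_temperature(setpoint=22.0)            # perturb the dial
biased_sensor = settled_temperature(sensor_bias=-2.0)   # thermistor reads 2 degrees low
stronger_heater = settled_temperature(heater_power=6.0)  # perturb the actuator

for name, value in [("baseline", baseline), ("dial up", dial_up),
                    ("biased sensor", biased_sensor),
                    ("stronger heater", stronger_heater)]:
    print(f"{name:16s} {value:.1f}")
```

In this toy model the dial and the sensor perturbations both move the settled temperature to roughly 22, while doubling the heater power leaves it at roughly 20. The stronger heater only changes how quickly the room warms up and how much it overshoots.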

Towards Measures of Optimisation

mattmacdermott and Alexander Gietelink Oldenziel

12 May 2023 15:29 UTC

53 points

37 comments, 4 min read

mattmacdermott 12 May 2023 20:53 UTC
1 point
0
in reply to: Garrett Baker’s comment on: Towards Measures of Optimisation
Hm, I’m not sure this problem comes up.

Say I’ve built a room-tidying robot, and I want to measure its optimisation power. The room can be in two states: tidy or untidy. A natural choice of default distribution $p$ is my beliefs about how tidy the room will be if I don’t put the robot in it. Let’s assume I’m pretty knowledgeable and I’m extremely confident that in that case the room will be untidy: $p(\text{untidy}) = 2047/2048$ and $p(\text{tidy}) = 1/2048$ (we do have to avoid probabilities of 0, but that’s standard in a Bayesian context). But really I do put the robot in and it gets the room tidy, for an optimisation power of $-\log_{2} \frac{1}{2048} = 11$ bits.

That 11 bits doesn’t come from any uncertainty on my part about the optimisation process, although it does depend on my uncertainty about what would happen in the counterfactual world where I don’t put the robot in the room. But becoming more confident that the room would be untidy in that world makes me see the robot as more of an optimiser.

Unlike in information theory, these bits aren’t measuring a resolution of uncertainty, but a difference between the world and a counterfactual.
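The arithmetic in the room-tidying example can be checked with a few lines. This is a minimal sketch for the two-outcome case only, where the achieved outcome is also the best one. The function name `optimisation_power_bits` is hypothetical.

```python
import math


def optimisation_power_bits(default_probability):
    """Bits of optimisation: -log2 of the default probability of what actually happened."""
    if not 0.0 < default_probability <= 1.0:
        raise ValueError("default probability must be in (0, 1]")
    return -math.log2(default_probability)


# Default distribution: beliefs about the room if the robot is NOT put in it.
default = {"untidy": 2047 / 2048, "tidy": 1 / 2048}
bits = optimisation_power_bits(default["tidy"])
print(bits)  # 11.0

# Being more confident the room would have stayed untidy makes the robot
# look like more of an optimiser, not less.
more_confident = {"untidy": 4095 / 4096, "tidy": 1 / 4096}
more_bits = optimisation_power_bits(more_confident["tidy"])
print(more_bits)  # 12.0
```

Nothing in the calculation refers to uncertainty about the robot itself. The only input is the counterfactual default distribution.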

mattmacdermott 12 May 2023 21:02 UTC
10 points
2
in reply to: rotatingpaguro’s comment on: Towards Measures of Optimisation
Probably the easy utility function makes agent 1 have more optimisation power. I agree this means comparisons between different utility functions can be unfair, but not sure why that rules out a measure which is invariant under positive affine transformations of a particular utility function?

mattmacdermott 12 May 2023 21:43 UTC
LW: 2 AF: 1
0
AF
in reply to: Alex_Altair’s comment on: Towards Measures of Optimisation
Nice, I’d read the first but didn’t realise there were more. I’ll digest later.

I think agents vs optimisation is definitely reality-carving, but not sure I see the point about utility functions and preference orderings. I assume the idea is that an optimisation process just moves the world towards states, but an agent tries to move the world towards certain states i.e. chooses actions based on how much they move the world towards certain states, so it makes sense to quantify how much of a weighting each state gets in its decision-making. But it’s not obvious to me that there’s not a meaningful way to assign weightings to states for an optimisation process too. For example, if a ball rolling down a hill gets stuck in the large hole twice as often as it gets stuck in the medium hole and ten times as often as the small hole, maybe it makes sense to quantify this with something like a utility function. Although defining a utility function based on the typical behaviour of the system and then trying to measure its optimisation power against it gets a bit circular.

Anyway, the dynamical systems approach seems good. Have you stopped working on it?

Some Summaries of Agent Foundations Work

mattmacdermott 15 May 2023 16:09 UTC

56 points

1 comment, 13 min read

mattmacdermott 21 May 2023 15:56 UTC
1 point
0
in reply to: Mateusz Bagiński’s comment on: Towards Measures of Optimisation
We’re already comparing to the default outcome in that we’re asking “what fraction of the default expected utility minus the worst comes from outcomes at least this good?”.

I think you’re proposing to replace “the worst” with “the default”, in which case we end up dividing by zero.

We could pick some new reference point that is neither the worst outcome nor the default expected utility. (But that introduces the possibility of negative OP, and it would still have sensitivity issues.)
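The three cases above can be sketched numerically. This is my reading of the quoted sentence ("what fraction of the default expected utility minus the worst comes from outcomes at least this good?"), not the post’s exact definition. The function `utility_fraction` and the three-outcome example are hypothetical, and exact fractions are used so that the zero denominator is exact rather than a rounding artefact.

```python
import math
from fractions import Fraction


def utility_fraction(p, u, achieved, reference):
    """Share of (default expected utility - reference) that comes from
    outcomes at least as good as `achieved`."""
    total = sum(p[x] * (u[x] - reference) for x in p)
    if total == 0:
        raise ZeroDivisionError("reference equals the default expected utility")
    good = sum(p[x] * (u[x] - reference) for x in p if u[x] >= u[achieved])
    return good / total


# Illustrative default distribution and utilities (assumptions).
p = {"bad": Fraction(1, 2), "ok": Fraction(1, 4), "good": Fraction(1, 4)}
u = {"bad": Fraction(0), "ok": Fraction(1), "good": Fraction(3)}

worst = min(u.values())
expected = sum(p[x] * u[x] for x in p)

# Reference = worst outcome: a fraction in (0, 1], so OP is non-negative.
frac_worst = utility_fraction(p, u, "good", worst)
op_worst = -math.log2(float(frac_worst))

# Reference = default expected utility: the denominator is exactly zero.
try:
    utility_fraction(p, u, "good", expected)
    divided_by_zero = False
except ZeroDivisionError:
    divided_by_zero = True

# Some other reference point: the fraction can exceed 1, giving negative OP.
frac_other = utility_fraction(p, u, "ok", Fraction(1, 2))
op_other = -math.log2(float(frac_other))

print(frac_worst, op_worst, divided_by_zero, frac_other, op_other)
```

With a reference point strictly between the worst outcome and the default expected utility, outcomes below the reference contribute negatively to the denominator, which is how the fraction ends up above 1.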

mattmacdermott 28 Jun 2023 22:58 UTC
6 points
in reply to: James Fox’s comment on: Utility Maximization = Description Length Minimization
It’s worth emphasising just how closely related it is. Friston’s expected free energy of a policy is $G(\pi) = E_{Q(s_{\tau} \mid \pi)} D_{KL}[Q(s_{\tau} \mid \pi) \,\|\, Q(s_{\tau} \mid o_{\tau})] - E_{Q(s_{\tau}, o_{\tau} \mid \pi)} \ln P(o_{\tau})$, where the first term is the expected information gained by following the policy and the second the expected ‘extrinsic value’.
The extrinsic value term $-E_{Q(s_{\tau}, o_{\tau} \mid \pi)} \ln P(o_{\tau})$, translated into John’s notation and setup, is precisely $E[-\log P(X | M_{2}) \mid M_{1}(\theta)]$. Where John has optimisers choosing $\theta$ to minimise the cross-entropy of $X$ under $M_{2}$ with respect to $X$ under $M_{1}$, Friston has agents choosing $\pi$ to minimise the cross-entropy of preferences ($P$) with respect to beliefs ($Q$).
What’s more, Friston explicitly thinks of the extrinsic value term $-E_{Q(s_{\tau}, o_{\tau} \mid \pi)} \ln P(o_{\tau})$ as a way of writing expected utility (see the image below from one of his talks). In particular $P$ is a way of representing real-valued preferences as a probability distribution. He often constructs $P$ by writing down a utility function and then taking a softmax (like in this rat T-maze example), which is exactly what John’s construction amounts to.
It seems that John is completely right when he speculates that he’s rediscovered an idea well-known to Karl Friston.
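The link between the softmax construction and expected utility can be checked numerically. This is a simplified sketch, assuming a small discrete outcome space and collapsing each policy to a single predicted distribution over outcomes. The utilities, the two policies, and all function names are made up for illustration. If $P$ is the softmax of a utility function $u$, then $-E_{Q} \ln P = \ln Z - E_{Q}[u]$, so minimising the cross-entropy is the same as maximising expected utility.

```python
import math


def softmax(utilities):
    """Turn real-valued utilities into a preference distribution P."""
    m = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in utilities]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(q, p):
    """-E_q[ln p], in nats."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)


def expected_utility(q, utilities):
    return sum(qi * ui for qi, ui in zip(q, utilities))


utilities = [0.0, 1.0, 4.0]  # hypothetical utilities over three outcomes
preferences = softmax(utilities)
log_z = math.log(sum(math.exp(v) for v in utilities))

# Hypothetical predicted outcome distributions under two policies.
policies = {"a": [0.7, 0.2, 0.1], "b": [0.1, 0.2, 0.7]}

for name, q in policies.items():
    ce = cross_entropy(q, preferences)
    eu = expected_utility(q, utilities)
    # The identity: cross-entropy = log Z - expected utility.
    print(name, round(ce, 6), round(log_z - eu, 6))

best_by_cross_entropy = min(policies, key=lambda k: cross_entropy(policies[k], preferences))
best_by_utility = max(policies, key=lambda k: expected_utility(policies[k], utilities))
print(best_by_cross_entropy, best_by_utility)
```

Since $\ln Z$ does not depend on the policy, the two rankings of policies always agree.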

mattmacdermott 10 Jul 2023 16:52 UTC
3 points
2
on: Consciousness as a conflationary alliance term
Could it be that everyone’s talking about the same thing, but it’s just hard to pin the concept down in words?

I think you’d get similarly varied answers if you asked people what they mean by ‘art’, but I think they’re all basically talking about the same phenomenon.

mattmacdermott 17 Jul 2023 15:54 UTC
1 point
0
in reply to: aysja’s comment on: Towards Measures of Optimisation
The above formulas rely on comparing the actual world to a fixed counterfactual baseline. Gaining more information about the actual world might make the distance between the counterfactual baseline and the actual world grow smaller, but it also might make it grow bigger, so it’s not the case that the optimisation power goes to zero as my uncertainty about the world decreases. You can play with the formulas and see.

But maybe your objection is not so much that the formulas actually spit out zero, but that if I become very confident about what the world is like, it stops being coherent to imagine it being different? This would be a general argument against using counterfactuals to define anything. I’m not convinced of it, but if you like you can purge all talk of imagining the world being different, and just say that measuring optimisation power requires a controlled experiment: set up the messy room, record what happens when you put the robot in it, set the room up the same, and record what happens with no robot.

Optimisation Measures: Desiderata, Impossibility, Proposals

mattmacdermott and Alexander Gietelink Oldenziel

7 Aug 2023 15:52 UTC

35 points

9 comments, 1 min read

mattmacdermott 7 Aug 2023 16:02 UTC
3 points
0
in reply to: Richard_Kennaway’s comment on: Optimisation Measures: Desiderata, Impossibility, Proposals
Thanks, should be fixed now.
It’s not that we needed to add a translation here to end up with the right definition of $OP$ in terms of $u$, but with the way we had written it, $rep$ wasn’t a well-defined function of equivalence classes. We had restated proposition 1 to try to make things cleaner, but it turns out that messed things up, so we’ve reverted to the previous statement. Hopefully it should all work now.
