Theoretical Computer Science MSc student at the University of [Redacted] in the United Kingdom.
I’m an aspiring alignment theorist; my research vibes are descriptive formal theories of intelligent systems (and their safety properties) with a bias towards constructive theories.
I think it’s important that our theories of intelligent systems remain rooted in the characteristics of real world intelligent systems; we cannot develop adequate theory from the null string as input.
DragonGod
My contention is that I don’t think the preconditions hold.
Agents don’t fail to be VNM-coherent by having incoherent preferences given the axioms of VNM; they fail to be VNM-coherent by violating the axioms themselves.
Completeness is wrong for humans, and with incomplete preferences you can be non-exploitable even without admitting a single fixed utility function over world states.
I’m not at all convinced that “strong agents pursuing a coherent goal” is a viable form for generally capable systems that operate in the real world, and the assumption that it is hasn’t been sufficiently motivated.
What are the best arguments that expected utility maximisers are adequate (descriptive if not mechanistic) models of powerful AI systems?
[I want to address them in my piece arguing the contrary position.]
The solution is IMO just to consider the number of computations performed per generated token as some function of the model size, and once we’ve identified a suitable asymptotic order on the function, we can say intelligent things like “the smallest network capable of solving a problem in complexity class C of size N is X”.
Or if our asymptotic bounds are not tight enough:
“No economically feasible LLM can solve problems in complexity class C of size >= N”.
(Where economically feasible may be something defined by aggregate global economic resources or similar, depending on how tight you want the bound to be.)
Regardless, we can still obtain meaningful impossibility results.
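The feasibility bound above can be sketched numerically. The sketch below is my own back-of-envelope, assuming the standard rough estimate of ~2 FLOPs per parameter per generated token for a dense transformer; the budget figure is hypothetical, not a claim about actual global resources.

```python
# Rough sketch (illustrative numbers): given a FLOP budget, how many tokens
# can a model of a given size generate? Assumes ~2 FLOPs per parameter per
# generated token (standard back-of-envelope for dense transformer inference).

def flops_per_token(n_params: int) -> int:
    """Approximate forward-pass FLOPs to generate one token."""
    return 2 * n_params

def tokens_affordable(n_params: int, budget_flops: float) -> float:
    """How many generated tokens a given FLOP budget buys at this model size."""
    return budget_flops / flops_per_token(n_params)

# Example: a 1-trillion-parameter model against a hypothetical 1e25-FLOP budget.
print(tokens_affordable(10**12, 1e25))  # about 5e12 tokens
```

Once the per-token cost is pinned down as a function of model size, "no economically feasible LLM can solve instances of size >= N" becomes a concrete inequality between the budget and the token count the working-out would require.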
- Apr 30, 2023, 4:13 AM; 2 points — comment on “LLMs and computation complexity”
Very big caveat: the LLM doesn’t actually perform O(1) computations per generated token.
The number of computational steps performed per generated token scales with network size: https://www.lesswrong.com/posts/XNBZPbxyYhmoqD87F/llms-and-computation-complexity?commentId=QWEwFcMLFQ678y5Jp
Strongly upvoted.
Short but powerful.
TL;DR: LLMs perform O(1) computational steps per generated token, regardless of which token is being generated.
The LLM sees every token in its context window when generating the next token, so it can compute problems in O(n^2) [where n is the context window size].
LLMs can get around the computational limitations by “showing their working”, simulating a mechanical computer (one without backtracking, so not Turing-complete) in their context window.
This only works if the context window is large enough to contain the workings for the entire algorithm.
Thus LLMs can perform matrix multiplication when showing their working, but not when asked to produce the product directly without showing workings.
Important fundamental limitation on the current paradigm.
We can now say with certainty that there are tasks GPT will never be able to solve, no matter how far it’s scaled up (e.g. beating Stockfish at chess, because chess is combinatorial and the LLM can’t search the game tree to any depth).
This is a very powerful argument.
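The matrix-multiplication point can be made quantitative. This is my own rough sketch, not from the original post: the naive product of two n×n matrices needs n^3 multiply-adds, so a "show your working" transcript must contain on that order of steps, and the context window caps the largest instance; `tokens_per_step` is an assumed parameter.

```python
# Sketch of the context-window bound on chain-of-thought matrix multiplication.
# Naive n x n matrix product needs n^3 multiply-adds; if each written-out step
# costs roughly `tokens_per_step` tokens, the context window caps the largest
# instance solvable while "showing working".

def steps_naive_matmul(n: int) -> int:
    return n ** 3  # one multiply-add per (i, j, k) triple

def max_n_within_context(context_tokens: int, tokens_per_step: int = 10) -> int:
    n = 1
    while steps_naive_matmul(n + 1) * tokens_per_step <= context_tokens:
        n += 1
    return n

# E.g. an 8k-token window at an assumed ~10 tokens per written step:
print(max_n_within_context(8192))  # → 9
```

The exact constants are guesses, but the shape of the argument survives any choice of them: the solvable instance size grows only as the cube root of the context length.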
A reason I mood affiliate with shard theory so much is that like...
I’ll have some contention with the orthodox ontology for technical AI safety and be struggling to adequately communicate it, and then I’ll later listen to a post/podcast/talk by Quintin Pope or Alex Turner (or someone else trying to distill shard theory) and see the exact same contention expressed more eloquently and with more justification.
One example: I had independently concluded that “finding an objective function that is existentially safe when optimised by an arbitrarily powerful optimisation process is probably the wrong way to think about a solution to the alignment problem”.
And then today I discovered that Alex Turner advances a similar contention in “Inner and outer alignment decompose one hard problem into two extremely hard problems”.
Shard theory also seems to nicely encapsulate my intuition that we shouldn’t think of powerful AI systems as optimisation processes with a system-wide objective that they consistently pursue.
Or just the general intuition that our theories of intelligent systems should adequately describe the generally intelligent systems we actually have access to, and that theories that don’t even aspire to do that are ill-motivated.
Admittedly, I don’t think I can adequately communicate shard theory to a disbeliever, so on reflection I have some scepticism that I properly understand it.
That said, the vibes are right.
“All you need is to delay doom by one more year per year and then you’re in business” — Paul Christiano.
Took this to drafts for a few days with the intention of refining it and polishing the ontology behind the post.
I ended up not doing that as much, because the improvements I was making to the underlying ontology felt better presented as a standalone post, so I mostly factored them out of this one.
I’m not satisfied with this post as is, but there’s some kernel of insight here that I think is valuable, and I’d want to be able to refer to the basic thrust of this post/some arguments made in it elsewhere.
I may make further edits to it in future.
Consequentialism is in the Stars not Ourselves
It should be noted, however, that while inner alignment is a robustness problem, the occurrence of unintended mesa-optimization is not. If the base optimizer’s objective is not a perfect measure of the human’s goals, then preventing mesa-optimizers from arising at all might be the preferred outcome. In such a case, it might be desirable to create a system that is strongly optimized for the base objective within some limited domain without that system engaging in open-ended optimization in new environments.(11) One possible way to accomplish this might be to use strong optimization at the level of the base optimizer during training to prevent strong optimization at the level of the mesa-optimizer.(11)
I don’t really follow this paragraph, especially the bolded part.
Why would mesa-optimisation arising when not intended not be an issue for robustness? (The mesa-optimiser could generalise capably out of distribution but pursue the wrong goal.)
The rest of the post also doesn’t defend that claim; it feels more like defending a claim like:
The non-occurrence of mesa-optimisation is not a robustness problem.
Is this a correct representation of corrigible alignment:
The mesa-optimizer (MO) has a proxy of the base objective that it’s optimising for.
As more information about the base objective is received, MO updates the proxy.
With sufficient information, the proxy may converge to a proper representation of the base objective.
Example: a model-free RL algorithm whose policy is argmax over actions with respect to its state-action value function
The base objective is the reward signal
The value function serves as a proxy for the base objective.
The value function is updated as future reward signals are received, gradually refining the proxy to better align with the base objective.
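The RL example above can be made concrete. This is my own toy construction, not from the paper: a tabular Q-learner in a single-state bandit setting, where the Q-table plays the role of the proxy and the reward signal plays the role of the base objective.

```python
# Toy illustration of the corrigible-alignment reading above: a tabular
# Q-learner's value estimate acts as a proxy for the reward signal (the base
# objective) and is refined toward it as more reward is observed.
import random

random.seed(0)
true_reward = {0: 0.0, 1: 1.0}   # base objective: action 1 is better
q = {0: 0.0, 1: 0.0}             # proxy: state-action values, initially uninformed
alpha = 0.1                      # learning rate

for _ in range(1000):
    a = random.choice([0, 1])    # explore uniformly
    r = true_reward[a]           # reward signal from the base objective
    q[a] += alpha * (r - q[a])   # update the proxy toward the signal

# With sufficient information the proxy converges:
# argmax over q now matches the base objective's preferred action.
print(max(q, key=q.get))  # → 1
```

Early in training the argmax policy is determined by an uninformed proxy; only as reward information accumulates does optimising the proxy coincide with optimising the base objective, which mirrors the convergence story in the bullet points above.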
Sounds good, will do!
March 22nd is when my first exam starts.
It finishes June 2nd.
Is it possible for me to delay my start a bit?
I’m gestating on this post. I suspect part of my original framing was confused, so I’ll just let the ideas ferment some more.
Yeah for humans in particular, I think the statement is not true of solely biological evolution.
But also, I’m not sure you’re looking at it on the right level. Any animal presumably does many bits’ worth of selection in a given day, but the durable/macro-scale effects are better explained by evolutionary forces acting on the population than by the actions of individual animals within their lifetimes.
Or maybe this is just a confused way to think/talk about it.
I could change that. I was thinking of work done in terms of bits of selection.
Though I don’t think that statement is true of humans unless you also include cultural memetic evolution (which I think you should).
If you define your utility function over histories, then every behaviour is maximising some expected utility function, no?
Even behaviour that is money-pumped?
I mean you can’t money pump any preference over histories anyway without time travel.
The Dutch book arguments apply when your utility function is defined over your current state with respect to some resource?
I feel like once you define utility function over histories, you lose the force of the coherence arguments?
What would it look like to not behave as if maximising an expected utility function, for a utility function defined over histories?
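The triviality worry here can be made concrete. This is my own construction: for any fixed behaviour sequence, there is a utility function over histories that the behaviour maximises, namely the indicator function of the realised history.

```python
# For ANY fixed behaviour sequence, build a utility function over histories
# that the behaviour maximises: the indicator of the realised history.
# This is why coherence arguments seem to lose their force over histories.

def rationalising_utility(realised_history):
    realised = tuple(realised_history)
    def u(history):
        return 1.0 if tuple(history) == realised else 0.0
    return u

# Even an apparently money-pumped action sequence maximises its own
# indicator utility over histories:
pumped = ["trade A for B", "trade B for C", "trade C for A, paying $1"]
u = rationalising_utility(pumped)
print(u(pumped), u(["hold A"]))  # → 1.0 0.0
```

Under this utility function the "pumped" agent is a perfect expected-utility maximiser, which suggests coherence arguments only bite when the utility function is constrained to something like resources or world-states, not arbitrary history-indexed functions.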