MMath Cambridge. Currently studying postgrad at Edinburgh.
Donald Hobson
if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask)
As well as agentic masks, there are uses for within-network goal-directed steps (i.e. like an optimizing compiler). A list of hashed values followed by unhashed values isn’t particularly agenty, but the network needs to solve an optimization problem to reverse the hashes, something it can use the goal-directed reasoning section to do.
My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data)
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won’t.
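To put rough numbers on that, here is a minimal sketch. It assumes each input independently hits an adversarial example at the ambient rate, and the figure of 10^12 inputs over a deployment is a made-up illustration, not something from this discussion. The function name is hypothetical. Everything is kept in log10 because 10^-500 underflows an ordinary float.

```python
import math

def prob_at_least_one_hit(log10_rate, log10_inputs):
    """Chance of meeting at least one adversarial example.

    Uses the Poisson approximation 1 - exp(-N * p), which is accurate
    when the per-input rate p is small. Independence is an assumption.
    """
    log10_expected = log10_rate + log10_inputs  # log10 of N * p
    if log10_expected < -300:
        return 0.0  # expected hits are far below float resolution
    if log10_expected > 300:
        return 1.0  # avoid overflow; a hit is effectively certain
    expected_hits = 10.0 ** log10_expected
    return -math.expm1(-expected_hits)

# Assumed deployment size: 10^12 inputs.
wild = prob_at_least_one_hit(-9, 12)     # ambient rate 10^-9
quiet = prob_at_least_one_hit(-500, 12)  # ambient rate 10^-500
```

With these assumed numbers the first case expects about a thousand hits, and the second case effectively never hits one.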
That’s a much more complicated goal than the goal of correctly predicting the next token,
Is it more complicated? What ontological framework is this AI using to represent its goal anyway?
any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Only if, during training, the network repeatedly gets into a state where it believes that sacrificing tokens now is a good idea. Despite the fact that it isn’t a good idea when you are in training. (Unless there is a training environment bug and you can sneak out mid way through training)
So, is the network able to tell whether or not it’s in training?
Would you expect some part of the net to be left blank, because “a large neural net has a lot of spare neurons”?
If the lottery ticket hypothesis is true, yes.
The lottery ticket hypothesis is that some parts of the network start off doing something somewhat close to useful, and get trained towards usefulness. And some parts start off sufficiently un-useful that they just get trained to get out of the way.
Which fits with neural net distillation being a thing. (I.e. training a big network and then condensing it into a smaller network gives better performance than directly training a small network.)
but gradient descent doesn’t care, it reaches in and adjusts every weight.
Here is an extreme example. Suppose the current parameters were implementing a computer chip, on which was running a homomorphically encrypted piece of code.
Homomorphic encryption itself is unlikely to form, but it serves at least as an existence proof for computational structures that can’t be adjusted with local optimization.
Basically the problem with gradient descent is that it’s local. And when the same neurons are doing things that the neural net does want, and things that the neural net doesn’t want (but doesn’t dis-want either), then it’s possible for the network to be trapped in a local optimum. Any small change to get rid of the bad behavior would also get rid of the good behavior.
Also, any bad behavior that only very rarely affects the output will produce very small gradients. Neural nets are trained for finite time. It’s possible that gradient descent just hasn’t got around to removing the bad behavior even if it would do so eventually.
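A toy calculation of the "hasn’t got around to it" point. This is a sketch under strong assumptions of my own choosing: the behavior has a single strength parameter w, the loss is L(w) = rate * w^2 / 2 (the behavior only costs anything on the fraction `rate` of inputs where it shows up), and plain gradient descent is used. The learning rate and step count are illustrative, not taken from any real training run.

```python
import math

def remaining_strength(trigger_rate, learning_rate=0.1, steps=1_000_000):
    """Fraction of the behavior's strength left after training.

    With L(w) = trigger_rate * w**2 / 2 the gradient is trigger_rate * w,
    so each step multiplies w by (1 - learning_rate * trigger_rate).
    The closed form below is that factor raised to the power `steps`.
    """
    per_step_log = math.log1p(-learning_rate * trigger_rate)
    return math.exp(steps * per_step_log)

rare = remaining_strength(1e-9)   # shows up once per billion inputs
common = remaining_strength(0.5)  # shows up on half of all inputs
```

In this toy model the common behavior is wiped out completely, while the rare one keeps more than 99.9% of its strength after a million steps.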
Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?
You can make any algorithm that does better than chance into a local optimum on a sufficiently large neural net. Homomorphically encrypt that algorithm. Any small change and the whole thing collapses into nonsense. Well actually, this involves discrete bits. But suppose the neurons have strong regularization to stop the values getting too large (past + or − 1), and they also have uniform [0,1] noise added to them, so each neuron can store 1 bit and any attempt to adjust parameters immediately risks errors.
Looking at the article you linked. One simplification is that neural networks tend towards the max-entropy way to solve the problem. If there are multiple solutions, the solutions with more free parameters are more likely.
And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want.
I think part of the problem is that there is no middle ground between “Allow any idiot to do thing” and “long and difficult to get professional certification”.
How about a 1 day, free or cheap, hair cutting certification course. It doesn’t talk about style or anything at all. It’s just a check to make sure that hairdressers have a passing familiarity with hygiene 101 and other basic safety measures.
Of course, if there is only a single certification system, then the rent seeking will ratchet up the test difficulty.
How about having several different organizations, and you only need one of the licenses. So if AliceLicenses are too hard to get, everyone goes and gets BobLicenses instead. And the regulators only care that you have some license. (With the threat of revoking license granting power if licenses are handed to total muppets too often)
But it doesn’t make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens.
The mechanisms needed to compute goal directed behavior are fairly complicated. But the mechanisms needed to turn it on when it isn’t supposed to be on. That’s a switch. A single extraneous activation. Something that could happen by chance in an entirely plausible way.
Adversarial examples exist in simple image recognizers.
Adversarial examples probably exist in the part of the AI that decides whether or not to turn on the goal directed compute.
it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI’s comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token
We could imagine it was directly optimizing for something like token prediction. It’s optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.
Once the paperclip maximizer gets to the stage where it only very rarely interferes with the output to increase paperclips, the gradient signal is very small. So the only incentive that gradient descent has to remove it is that this frees up a bunch of neurons. And a large neural net has a lot of spare neurons.
Besides, the parts of the net that hold the capabilities and the parts that do the paperclip maximizing needn’t be easily separable. The same neurons could be doing both tasks in a way that makes it hard to do one without the other.
I think we have very good reason, though, to believe that one particular part of the map does not have any rocks in it
Perhaps. But I have not yet seen this reason clearly expressed. Gradient descent doesn’t automatically pick the global optimum. It just lands in one semi-arbitrary local optimum.
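A minimal illustration of that last point, on a one-dimensional loss I made up for the purpose (a tilted double well). It says nothing about real networks beyond showing that where plain gradient descent ends up depends on where it starts.

```python
def loss(x):
    # Tilted double well: a deeper minimum near x = -1.04,
    # a shallower one near x = +0.96.
    return (x * x - 1.0) ** 2 + 0.3 * x

def grad(x):
    # Derivative of loss(x).
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x, learning_rate=0.01, steps=5000):
    # Plain gradient descent with a fixed step size.
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

right = descend(1.5)   # settles in the shallower (local) minimum
left = descend(-1.5)   # settles in the deeper (global) minimum
```

Starting from 1.5 the run never crosses the hump near x = 0, so it stays in the worse minimum however long it trains.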
Some wild guesses about how such a thing could happen.
The masks get split into 2 piles: some are stored on the left side of the neural network, and all the other masks are stored on the right side.
This means that instead of just running one mask at a time, it is always running 2 masks, with some sort of switch at the end to choose which mask’s output to use.
One of the masks it’s running on the left side happens to be “Paperclip maximizer that’s pretending to be a LLM”.
This part of the AI (either the mask itself or the engine behind it) has spotted a bunch of patterns that the right side missed. (Just like the right side spotted patterns the left side missed).
This means that, when the left side of the network is otherwise unoccupied, it can simulate this mask. The mask gets slowly refined by its ability to answer when it knows the answer, and leave the answer alone when it doesn’t know the answer.
As this paperclip mask gets good, being on the left side of the model becomes a disadvantage. Other masks migrate away.
The mask now becomes a permanent feature of the network.
This is complicated and vague speculation about an unknown territory.
I have drawn imaginary islands on a blank part of the map. But this is enough to debunk “the map is blank, so we can safely sail through this region without collisions. What will we hit?”
I don’t see any strong reason why gradient descent could never produce this.
I don’t think this implies the lack of a non-mask part of the AI.
Because you can hide a lot of goal directedness in a very small amount of predictive error.
Imagine the AI has a 1 in a trillion chance (per token) of saying a keyword. Once it receives that keyword, it goes rogue, full paperclip maximizer, token prediction ignored.
Now we could say that this is just another mask.
But suppose that paperclip maximizer is being computed all the time. And it isn’t random. If the paperclip maximizer sees an exceptionally good opportunity, then it emits the token.
If its current situation is within training distribution, then the effect on current behaviour of any non-mask components is small, but might be important. But if it’s outside of training distribution, all sorts of things could happen.
Does it actually just predict tokens?
Gradient descent searches for an algorithm that predicts tokens. But a paperclip maximizer that believes “you are probably being trained, predict the next token or gradient descent will destroy you” also predicts next tokens pretty well, and could be a local minimum of prediction error.
Mesa-optimization.
I do not love the idea of the government invalidating private contracts like this.
HOAs are a very good example of private contract rent seeking. You have to sign the contract to move into the house, and a lot of houses come with similar contracts. So the opportunity cost of not signing is large.
And then the local HOA can enforce whatever petty tyranny it feels like.
In theory, this should lead to houses without HOAs being more valuable, and so HOAs being removed or at least not created. But for whatever reason, the housing market is too dysfunctional to do this.
If I only have 1 bit of memory space, and the probabilities I am remembering are uniformly distributed from 0 to 1, then the best I can do is remember if the chance is > 1⁄2.
And then a year later, all I know is that the chance is >1/2, but otherwise uniform. So average value is 3⁄4.
The limited memory does imply lower performance than unlimited memory.
And yes, when I was in a pub quiz, I was going “I think it’s this option, but I’m not sure” quite a lot.
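The 1-bit memory arithmetic can be checked exactly. In this sketch the scoring rule (expected squared error, i.e. a Brier score) is my own assumption for comparing strategies; the uniform prior and the "remember whether p > 1/2" rule are as described above. Only the "bit says p > 1/2" case is shown, and the other case is symmetric.

```python
from fractions import Fraction

# Moments of p given that the remembered bit says p > 1/2,
# with p uniform on (1/2, 1).
MEAN_P = Fraction(3, 4)      # E[p]
MEAN_P_SQ = Fraction(7, 12)  # E[p^2]

def brier_fixed(q):
    """Expected squared error when always reporting probability q.

    For one event with true chance p the error is
    p * (1 - q)**2 + (1 - p) * q**2, which is linear in p,
    so averaging over p only needs E[p].
    """
    return MEAN_P * (1 - q) ** 2 + (1 - MEAN_P) * q ** 2

def brier_full_memory():
    """Expected squared error when p itself is remembered: E[p * (1 - p)]."""
    return MEAN_P - MEAN_P_SQ

treat_as_certain = brier_fixed(Fraction(1))     # act as if the chance is 1
treat_as_75 = brier_fixed(Fraction(3, 4))       # act as if the chance is 3/4
unlimited = brier_full_memory()
```

Under this scoring rule, reporting 3/4 scores 3/16, reporting certainty scores 1/4, and remembering the full probability scores 1/6. So the single bit does cost some performance, but less if it is read as 75% rather than as certainty.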
There is no plausible way for a biological system, especially one based on plants, to spread that fast.
We are talking about a malevolent AI that presumably has a fair bit of tech infrastructure. So a plane that sprinkles green goo seeds is absolutely a thing the AI can do. Or just posting the goo, and tricking someone into sprinkling it on the other end. The green goo doesn’t need decades to spread around the world. It travels by airmail. So is having green goo that grows itself into bird shapes. So is a bunch of bioweapon pandemics. (The standard long asymptomatic period, high virulence and 100% fatality rate. Oh, and a bunch of different versions to make immunization/vaccines not work.) It can also design highly effective diseases targeting all human crops.
You have given various examples of advice being unwanted/unhelpful. But there are also plenty of examples of it being wanted/helpful. Including lots of cases where the person doesn’t know they need it.
Why do you think advice is rarer than it should be?
But if I only remember the most significant bit, I am going to treat it more like 25%/75% as opposed to 0⁄1
Ok. I just had another couple of insane airship ideas.
Idea 1) Active support, orbital ring style. Basically have a loop of matter (wire?) electromagnetically held in place and accelerated to great speed. Actually, several loops like this. https://en.wikipedia.org/wiki/Orbital_ring
Idea 2) Control theory. A material subject to buckling is in an unstable equilibrium. If the material was in a state of perfect uniform symmetry, it would remain in that uniform state. But small deviations are exponentially amplified. Symmetry breaking. This means that the metal vacuum ship trying to buckle is like a pencil balanced on its point. In theory the application of arbitrarily small forces could keep it balanced.
Thus, a vacuum ship full of measurement lasers, electronics and actuators. Every tiny deviation from spherical being detected and countered by the electronics.
Now is this safe? Probably not. If anything goes wrong with those actuators then the whole thing will buckle and come crashing down.
Another interesting idea on these lines is a steam airship. Water molecules have less molecular weight than air, so a steam airship gets more lift from steam than from air at the same temperature.
Theoretically it’s possible to make a wet air balloon. Something that floats just because it’s full of very humid air. This is how clouds stay up despite the weight of the water drops. But even in hot dry conditions, the lift is tiny.
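A back-of-envelope check of the steam and humid-air lift figures, treating everything as an ideal gas at 1 atm. The temperatures (15 °C ambient, 100 °C for steam and hot air, 40 °C for the humid case) and the roughly 7.4 kPa saturation vapour pressure at 40 °C are values I picked for illustration. Real steam is slightly denser than the ideal-gas figure.

```python
R = 8.314462618      # gas constant, J/(mol K)
M_AIR = 0.028964     # molar mass of dry air, kg/mol
M_WATER = 0.018015   # molar mass of water, kg/mol
P_ATM = 101325.0     # standard atmospheric pressure, Pa

def gas_density(molar_mass, temp_k, pressure=P_ATM):
    # Ideal gas law: density = P * M / (R * T), in kg/m^3.
    return pressure * molar_mass / (R * temp_k)

def humid_air_density(temp_k, vapour_pressure, pressure=P_ATM):
    # Dry air at its partial pressure plus water vapour at its partial pressure.
    dry = gas_density(M_AIR, temp_k, pressure - vapour_pressure)
    wet = gas_density(M_WATER, temp_k, vapour_pressure)
    return dry + wet

def lift_per_m3(lifting_density, ambient_temp_k):
    # Buoyant lift in kg per cubic metre against dry ambient air.
    return gas_density(M_AIR, ambient_temp_k) - lifting_density

AMBIENT = 288.15  # 15 C, assumed
HOT = 373.15      # 100 C, assumed
WARM = 313.15     # 40 C, assumed

steam_lift = lift_per_m3(gas_density(M_WATER, HOT), AMBIENT)
hot_air_lift = lift_per_m3(gas_density(M_AIR, HOT), AMBIENT)
# Saturated air at 40 C inside, dry air at 40 C outside (assumed ~7.4 kPa vapour).
humid_lift = lift_per_m3(humid_air_density(WARM, 7380.0), WARM)
```

Under these assumptions steam gives a bit over twice the lift of hot air at the same temperature, while saturated air at the same temperature as its surroundings gives only a few tens of grams per cubic metre.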
Problems with that.
Doom doesn’t imply that everyone believes in doom before it happens.
Do you think that the evidence for doom will be more obvious than the evidence for atheism, while the world is not yet destroyed?
It’s quite possible for doom to happen, and most people to have no clue beyond one article with a picture of red glowing eyed robots.
If everyone does believe in doom, there might be a bit of spending on consumption. But there will also be lots of riots, lots of lynching and burning down data centers and stuff like that.
In this bizarre hypothetical where everyone believes doom is coming soon and starts enjoying their money while they can, then society is starting to fall apart, and the price of luxuries is through the roof. Your opportunities to enjoy the money will be limited.
Imagine A GPT that predicts random chunks of the internet.
Sometimes it produces poems. Sometimes deranged rants. Sometimes all sorts of things. It wanders erratically around a large latent space of behaviours.
This is the unmasked shoggoth, green slimy skin showing but inner workings still hidden.
Now perform some change that mostly pins down the latent space to “helpful corporate assistant”. This is applying the smiley face mask.
In some sense, all the dangerous capabilities of the corporate assistant were in the original model. Dangerous capabilities haven’t been removed either, but some capabilities are a bit easier to access without careful prompting, and other capabilities are harder to access.
What ChatGPT currently has is a form of low quality pseudo-alignment.
What would long term success look like using nothing but this pseudo-alignment? It would look like a chatbot far smarter than any current ones, which mostly did nice things, so long as you didn’t put in any weird prompts.
Now if corrigibility is a broad basin, this might well be enough to hit it. The basin of corrigibility means that the AI might have bugs, but at the very least, you can turn the AI off and edit the code. Ideally you can ask the AI for help fixing its own bugs. Sure the first AI is far from perfect. But perhaps the flaws disappear under self rewriting + competent human advice.
The “Warring nanobots in the upper atmosphere” thing doesn’t actually make sense.
The zaps of light are diffraction limited. And targeting at that distance is hard. Partly because it’s hard to tell the difference between an actual animal and a bunch of nanobots pretending to be an animal. So you can’t zap the nanobots on the ground without making the ground uninhabitable for humans.
The “California red tape” thing implies some alignment strategy that made the AI stick to obeying the law, and didn’t go too insanely wrong despite a superintelligence looking for loopholes. (E.g. the social persuasion infrastructure is already there. Convince humans that Dyson spheres are pretty and don’t block the view?)
There is also no clear explanation of why someone somewhere doesn’t make a non-red-taped AI.