MMath Cambridge. Currently studying postgrad at Edinburgh.
Donald Hobson
There is a model of bounded rationality, logical induction.
Can that be used to handle logical counterfactuals?
I believe that if I choose to cooperate, my twin will choose to cooperate with probability p; and if I choose to defect, my twin will defect with probability q;
And here the main difficulty pops up again. There is no causal connection between your choice and their choice. Any correlation is a logical one. So imagine I make a copy of you. But the copying machine isn’t perfect. A random 0.001% of neurons are deleted. Also, you know you aren’t a copy. How would you calculate those probabilities p and q? Even in principle.
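To make concrete what p and q would be used for if you somehow had them, here is a minimal sketch of the expected payoff comparison. The payoff values (T=5, R=3, P=1, S=0) are a conventional prisoner's dilemma choice and an assumption of this sketch, as is the function name. It says nothing about how to obtain p and q, which is the hard part.

```python
def expected_payoffs(p, q, T=5.0, R=3.0, P=1.0, S=0.0):
    """Expected payoff of each action, given the conditional beliefs p and q.

    p: probability the twin cooperates, given that I cooperate.
    q: probability the twin defects, given that I defect.
    T, R, P, S: assumed payoffs (temptation, reward, punishment, sucker).
    """
    eu_cooperate = p * R + (1 - p) * S
    eu_defect = q * P + (1 - q) * T
    return eu_cooperate, eu_defect


# Strong correlation between the twins favours cooperating...
coop_strong, defect_strong = expected_payoffs(p=0.9, q=0.9)
# ...while no correlation (coin-flip beliefs) favours defecting.
coop_none, defect_none = expected_payoffs(p=0.5, q=0.5)
```

With these assumed payoffs the decision flips somewhere between the two cases, so the answer really does hinge on the values of p and q.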
If two Logical Decision Theory agents with perfect knowledge of each other’s source code play the prisoner’s dilemma, theoretically they should cooperate.
LDT uses logical counterfactuals in the decision making.
If the agents are CDT, then logical counterfactuals are not involved.
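As a very crude stand-in for agents that can read each other's source code, here is a sketch of a program that cooperates only with exact copies of itself. This is not LDT and involves no logical counterfactuals; it only shows the setup where each program receives both source strings. The names CLIQUE_BOT, DEFECT_BOT and play are made up for this example.

```python
# Each agent is a string of Python source defining decide(my_source, their_source).
CLIQUE_BOT = '''
def decide(my_source, their_source):
    # Cooperate only with an exact textual copy of myself.
    return "C" if their_source == my_source else "D"
'''

DEFECT_BOT = '''
def decide(my_source, their_source):
    # Always defect, whatever the opponent looks like.
    return "D"
'''


def load(source):
    """Turn a source string into its decide function."""
    namespace = {}
    exec(source, namespace)
    return namespace["decide"]


def play(source_a, source_b):
    """Run one round; each agent sees its own source and the opponent's."""
    move_a = load(source_a)(source_a, source_b)
    move_b = load(source_b)(source_b, source_a)
    return move_a, move_b
```

Two copies of CLIQUE_BOT cooperate with each other, and CLIQUE_BOT defects against DEFECT_BOT. An exact-match test breaks as soon as the copy is imperfect, which is where the hard questions start.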
[Question] How counterfactual are logical counterfactuals?
The research on humans in 0 g is only relevant if you want to send humans to Mars. And such a mission is likely to end up being an ISS on Mars. Or a moon landings reboot. A lot of newsprint and bandwidth expended talking about it. A small amount of science that could have been done more cheaply with a robot. And then everyone gets bored, they play golf on Mars and people look at the bill and go “was that really worth it?”
Oh and you would contaminate Mars with Earth bacteria.
A substantially bigger, redesigned space station is fairly likely to be somewhat more expensive. And the point of all this is still not clear.
Current day NASA also happens to be in a failure mode where everything is 10 to 100 times more expensive than it needs to be, projects live or die based on politics not technical viability, and repeating the successes of the past seems unattainable. They aren’t good at innovating, especially not quickly and cheaply.
Here is a more intuitive version of the same paradox.
Again, conditional on all dice rolls being even. But this time it’s either
A) 1,000,000 consecutive 6′s.
B) 999,999 consecutive 6′s followed by a (possibly non-consecutive) 6.
Suppose you roll a few even numbers, followed by an extremely lucky sequence of 999,999 6′s.
From the point of view of version A, the only way to continue the sequence is a single extra 6. If you roll 4, you would need to roll a second sequence of a million 6′s. And you are very unlikely to do that in the next 10 million steps. And very unlikely to go for 10 million steps without rolling an odd number.
Yes if this happened, it would add at least a million extra rolls. But the chance of that is exponentially tiny.
Whereas, for B, then it’s quite plausible to roll 26 or 46 or 2426 instead of just 6.
Another way to think about this problem is with regular expressions. Let e=even numbers. *=0 or more.
The string “e*6e*6” matches any sequence with at least two 6′s and no odd numbers.
The string “e*66” matches sequences that end in two consecutive 6′s. And the string “66” matches two consecutive 6′s with no room for extra even numbers before the first 6. This is the shortest.
Phrased this way it looks obvious. Every time you allow a gap for even numbers to hide in, an even number might be hiding in the gap, and that makes the sequence longer.
When you remove the conditional on the other numbers being even, then the “first” becomes important to making the sequence converge at all.
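The two-6 version of the regex framing can be checked numerically. This is a minimal sketch, assuming a fair six-sided die and the two stopping rules "stop at the second 6" and "stop at the first pair of consecutive 6′s". It sums probabilities over roll counts instead of sampling. The function name and the max_rolls cutoff are choices made for this sketch.

```python
def conditional_expected_rolls(consecutive, max_rolls=200):
    """E[number of rolls | every roll was even], for a fair die.

    consecutive=True:  stop at the first two consecutive 6s.
    consecutive=False: stop at the second 6, gaps allowed.
    The sum is truncated at max_rolls; the neglected tail is negligible.
    """
    p_six, p_two_or_four = 1 / 6, 2 / 6
    # alive[k]: probability that no odd number has been rolled, we have not
    # stopped, and we are one 6 away from stopping (k=1) or not (k=0).
    alive = [1.0, 0.0]
    total_prob = 0.0
    total_length = 0.0
    for n in range(1, max_rolls + 1):
        stop_now = alive[1] * p_six  # the roll that completes the pattern
        if consecutive:
            # A 2 or 4 wipes out the progress of a trailing 6.
            alive = [(alive[0] + alive[1]) * p_two_or_four, alive[0] * p_six]
        else:
            # A 2 or 4 can hide in the gap without losing the first 6.
            alive = [alive[0] * p_two_or_four,
                     alive[0] * p_six + alive[1] * p_two_or_four]
        total_prob += stop_now
        total_length += n * stop_now
    return total_length / total_prob
```

Under these assumptions the consecutive rule comes out at 30/11, roughly 2.73 rolls, and the rule that allows a gap comes out at 3. The version with room for even numbers to hide is the longer one.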
That is, our experiences got more reality-measure, thus matter more, by being easier to point at them because of their close proximity to the conspicuous event of the hottest object in the Universe coming to existence.
Surely not. Surely our experiences always had more reality measure from the start because we were the sort of people who would soon create the hottest thing.
Reality measure can flow backwards in time. And our present day reality measure is being increased by all the things an ASI will do when we make one.
We can discuss anything that exists, that might exist, that did exist, that could exist, and that could not exist. So no matter what form your predict-the-next-token language model takes, if it is trained over the entire corpus of the written word, the representations it forms will be pretty hard to understand, because the representations encode an entire understanding of the entire world.
Perhaps.
Imagine a huge number of very skilled programmers tried to manually hard code a ChatGPT in Python.
Ask this pyGPT to play chess, and it will play chess. Look under the hood, and you see a chess engine programmed in. Ask it to solve algebra problems, a symbolic algebra package is in there. All in the best neat and well commented code.
Ask it to compose poetry, and you have some algorithm that checks if 2 words rhyme. Some syllable counter. Etc.
Rot13 is done with a hardcoded rot13 algorithm.
Somewhere in the algorithm is a giant list of facts, containing “Penguins Live In Antarctica”. And if you change this fact to say “Penguins Live in Canada”, then the AI will believe this. (Or spot its inconsistency with other facts?)
And with one simple change, the AI believes this consistently. Penguins appear when this AI is asked for poems about Canada, and don’t appear in poems about Antarctica.
When asked about the native Canadian diet, it will speculate that this likely included penguin, but say that it doesn’t know of any documented examples of this.
Can you build something with ChatGPT level performance entirely out of human comprehensible programmatic parts?
Obviously having humans program these parts directly would be slow. (We are still talking about a lot of code.) But if some algorithm could generate that code?
But if the universal failure of nature and man to find non-connectionist forms of general intelligence does not move you
Firstly, AIXI exists, and we agree that it would be very smart if we had the compute to run it.
Secondly, I think there is some sort of sleight of hand here.
ChatGPT isn’t yet fully general. Neither is a 3-sat solver. 3-sat looks somewhat like what you might expect a non-connectionist approach to intelligence to look like. There is a huge range of maths problems that are all theoretically equivalent to 3-sat.
In the infinite limit, both types of intelligence can simulate the other at huge overhead. In practice, they can’t.
Also, non-connectionist forms of intelligence are hard to evolve, because evolution works in small changes.
why is it obvious the nanobots could pretend to be an animal so well that it’s indistinguishable?
These nanobots are in the upper atmosphere, possibly with clouds in the way, and the nanobot fake humans could be any human to nanobot ratio. Nanobot internals except human skin and muscles. Or just a human with a few nanobots in their blood.
Or why would targeted zaps have bad side-effects?
Because nanobots can be like bacteria if they want. Tiny and everywhere. The nanobots can be hiding under leaves, clothes, skin, roofs etc. And even if they weren’t, a single nanobot is a tiny target. Most of the energy of the zap can’t hit a single nanobot. Any zap of light that can stop nanobots in your house needs to be powerful enough to burn a hole in your roof.
And even if the zap isn’t huge, it’s not 1 or 2 zaps, it’s loads of zaps constantly.
The “Warring nanobots in the upper atmosphere” thing doesn’t actually make sense.
The zaps of light are diffraction limited. And targeting at that distance is hard. Partly because it’s hard to tell between an actual animal and a bunch of nanobots pretending to be an animal. So you can’t zap the nanobots on the ground without making the ground uninhabitable for humans.
The “California red tape” thing implies some alignment strategy that got the AI to stick to obeying the law, and didn’t go too insanely wrong despite a superintelligence looking for loopholes. (Eg the social persuasion infrastructure is already there. Convince humans that Dyson spheres are pretty and don’t block the view?)
There is also no clear explanation of why someone somewhere doesn’t make a non-red-taped AI.
if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask)
As well as agentic masks, there are uses for within-network goal directed steps. (Ie like an optimizing compiler. A list of hashes followed by their unhashed values isn’t particularly agenty. But the network needs to solve an optimization problem to reverse the hashes. Something it can use the goal directed reasoning section to do.)
My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data)
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won’t.
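A rough back-of-envelope version of this comparison, assuming each query independently hits an adversarial example at the ambient rate. The query count of 10^12 is a made-up figure for illustration. Rates like 10^-500 underflow ordinary floats, so the sketch works with base-10 logarithms.

```python
import math


def prob_at_least_one_hit(log10_rate, log10_queries):
    """Approximate chance of hitting at least one adversarial example.

    Uses the Poisson approximation 1 - exp(-expected_hits), where
    expected_hits = rate * queries. Both inputs are base-10 logarithms.
    """
    log10_expected = log10_rate + log10_queries
    if log10_expected < -300:
        # Too small to represent as a float. The probability is bounded
        # above by expected_hits, so it is zero for all practical purposes.
        return 0.0
    expected_hits = 10.0 ** log10_expected
    return -math.expm1(-expected_hits)


often = prob_at_least_one_hit(log10_rate=-9, log10_queries=12)
never = prob_at_least_one_hit(log10_rate=-500, log10_queries=12)
```

At a rate of 10^-9 the expected number of hits over that many queries is about a thousand, so hitting one is close to certain. At 10^-500 the expected number is around 10^-488.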
That’s a much more complicated goal than the goal of correctly predicting the next token,
Is it more complicated? What ontological framework is this AI using to represent its goal anyway?
any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Only if, during training, the network repeatedly gets into a state where it believes that sacrificing tokens now is a good idea. Despite the fact that it isn’t a good idea when you are in training. (Unless there is a training environment bug and you can sneak out mid way through training)
So, is the network able to tell whether or not it’s in training?
Would you expect some part of the net to be left blank, because “a large neural net has a lot of spare neurons”?
If the lottery ticket hypothesis is true, yes.
The lottery ticket hypothesis is that some parts of the network start off doing something somewhat close to useful, and get trained towards usefulness. And some parts start off sufficiently un-useful that they just get trained to get out of the way.
Which fits with neural net distillation being a thing. (Ie training a big network, and then condensing it into a smaller network gives better performance than directly training a small network.)
but gradient descent doesn’t care, it reaches in and adjusts every weight.
Here is an extreme example. Suppose the current parameters were implementing a computer chip, on which was running a homomorphically encrypted piece of code.
Homomorphic encryption itself is unlikely to form, but it serves at least as an existence proof for computational structures that can’t be adjusted with local optimization.
Basically the problem with gradient descent is that it’s local. And when the same neurons are doing things that the neural net does want, and things that the neural net doesn’t want (but doesn’t dis-want either) then it’s possible for the network to be trapped in a local optimum. Any small change to get rid of the bad behavior would also get rid of the good behavior.
Also, any bad behavior that only very rarely affects the output will produce very small gradients. Neural nets are trained for finite time. It’s possible that gradient descent just hasn’t got around to removing the bad behavior even if it would do so eventually.
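The locality point can be seen in one dimension. This is a toy sketch, not a model of a neural net: the function f, the learning rate and the starting points are all arbitrary choices made for illustration.

```python
def f(x):
    """A toy loss with two valleys: a shallow one near x = 1.13 and a
    deeper one near x = -1.30."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, learning_rate=0.01, steps=5000):
    """Plain gradient descent: only ever looks at the local slope."""
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


stuck = gradient_descent(1.0)    # settles in the shallow valley
better = gradient_descent(-1.0)  # settles in the deeper valley
```

Starting from x = 1.0 the descent settles in the shallow valley and stays there, even though a lower value of f exists on the other side of the hump. Nothing in the update rule can see that.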
Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?
You can make any algorithm that does better than chance into a local optimum on a sufficiently large neural net. Homomorphically encrypt that algorithm. Any small change and the whole thing collapses into nonsense. Well actually, this involves discrete bits. But suppose the neurons have strong regularization to stop the values getting too large (past + or − 1), and they also have uniform [0,1] noise added to them, so each neuron can store 1 bit and any attempt to adjust parameters immediately risks errors.
Looking at the article you linked. One simplification is that neural networks tend towards the max-entropy way to solve the problem. If there are multiple solutions, the solutions with more free parameters are more likely.
And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want.
I think part of the problem is that there is no middle ground between “Allow any idiot to do thing” and “long and difficult to get professional certification”.
How about a 1 day, free or cheap, hair cutting certification course. It doesn’t talk about style or anything at all. It’s just a check to make sure that hairdressers have a passing familiarity with hygiene 101 and other basic safety measures.
Of course, if there is only a single certification system, then the rent seeking will ratchet up the test difficulty.
How about having several different organizations, and you only need one of the licenses. So if AliceLicenses are too hard to get, everyone goes and gets BobLicenses instead. And the regulators only care that you have some license. (With the threat of revoking license granting power if licenses are handed to total muppets too often)
But it doesn’t make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens.
The mechanisms needed to compute goal directed behavior are fairly complicated. But the mechanism needed to turn it on when it isn’t supposed to be on? That’s a switch. A single extraneous activation. Something that could happen by chance in an entirely plausible way.
Adversarial examples exist in simple image recognizers.
Adversarial examples probably exist in the part of the AI that decides whether or not to turn on the goal directed compute.
it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI’s comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token
We could imagine it was directly optimizing for something like token prediction. It’s optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.
Once the paperclip maximizer gets to the stage where it only very rarely interferes with the output to increase paperclips, the gradient signal is very small. So the only incentive that gradient descent has to remove it is that this frees up a bunch of neurons. And a large neural net has a lot of spare neurons.
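To put rough numbers on how weak that gradient signal is, here is a toy sketch. It assumes the unwanted behavior is a single weight w whose only contribution to the loss is trigger_rate * w**2, where trigger_rate is the fraction of inputs on which the behavior touches the output. Both the model and the learning rate are assumptions made for illustration.

```python
import math


def steps_to_halve(trigger_rate, learning_rate=0.1):
    """Gradient steps needed to halve w under the toy loss above.

    The gradient is 2 * trigger_rate * w, so each step multiplies w by
    (1 - 2 * learning_rate * trigger_rate).
    """
    shrink_per_step = 2 * learning_rate * trigger_rate
    return math.ceil(math.log(0.5) / math.log1p(-shrink_per_step))


frequent = steps_to_halve(trigger_rate=1e-2)
rare = steps_to_halve(trigger_rate=1e-6)
```

In this toy model the number of steps scales like one over the trigger rate, so a behavior that fires ten thousand times less often takes about ten thousand times longer to train away.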
Besides, the parts of the net that hold the capabilities and the parts that do the paperclip maximizing needn’t be easily separable. The same neurons could be doing both tasks in a way that makes it hard to do one without the other.
I think we have very good reason, though, to believe that one particular part of the map does not have any rocks in it
Perhaps. But I have not yet seen this reason clearly expressed. Gradient descent doesn’t automatically pick the global optimum. It just lands in one semi-arbitrary local optimum.
Some wild guesses about how such a thing could happen.
The masks get split into 2 piles, some stored on the left side of the neural network, all the other masks are stored on the right side.
This means that instead of just running one mask at a time, it is always running 2 masks. With some sort of switch at the end to choose which mask’s output to use.
One of the masks it’s running on the left side happens to be “Paperclip maximizer that’s pretending to be a LLM”.
This part of the AI (either the mask itself or the engine behind it) has spotted a bunch of patterns that the right side missed. (Just like the right side spotted patterns the left side missed).
This means that, when the left side of the network is otherwise unoccupied, it can simulate this mask. The mask gets slowly refined by its ability to answer when it knows the answer, and leave the answer alone when it doesn’t know the answer.
As this paperclip mask gets good, being on the left side of the model becomes a disadvantage. Other masks migrate away.
The mask now becomes a permanent feature of the network.
This is complicated and vague speculation about an unknown territory.
I have drawn imaginary islands on a blank part of the map. But this is enough to debunk “the map is blank, so we can safely sail through this region without collisions. What will we hit?”
I don’t see any strong reason why gradient descent could never produce this.
The Halting problem is a worst case result. Most agents aren’t maximally ambiguous about whether or not they halt. And those that are, well then it depends what the rules are for agents that don’t halt.
There are setups where each agent is using a nonphysically large but finite amount of compute. There was a paper I saw somewhere a while ago where both agents were doing a brute force proof search for the statement “if I cooperate, then they cooperate” and cooperating if they found a proof.
(Ie searching all proofs containing <10^100 symbols)