MMath Cambridge. Currently studying postgrad at Edinburgh.
Donald Hobson
“Go read the sequences” isn’t that helpful. But I find myself linking to the particular post in the sequences that I think is relevant.
Imagine a medical system that categorizes diseases as hot/cold/wet/dry.
This doesn’t deeply describe the structure of a disease. But if a patient is described as “wet”, then it’s likely some orifice is producing lots of fluid, and a box of tissues might be handy. If a patient is described as “hot”, then maybe they have some sort of rash or inflammation that would make a cold pack useful.
It is, at best, a very lossy compression of the superficial symptoms. But it still carries non-zero information. There are some medications that a modern doctor might commonly use on “wet” patients, but only rarely use on “dry” patients, or vice versa.
In a medical context, it is at least more useful information than someone’s star sign.
Old alchemical air/water/fire/earth systems are also like this. “Air-ish” substances tend to have a lower density.
These sorts of systems are a rough attempt at a principal component analysis of the superficial characteristics.
And the Five Factor model of personality is another example of such a system.
We really fully believe that we will build AGI by 2027, and we will enact your plan, but we aren’t willing to take more than a 3-month delay.
Well, I ask what they are doing to make AGI.
Maybe I look at their AI plan and go “eureka”.
But if not:
Negative reinforcement by giving the AI large electric shocks when it gives a wrong answer. Hopefully big enough shocks to set the whole data center on fire. Implement a free bar for all their programmers, and encourage them to code while drunk. Add as many inscrutable bugs to the codebase as possible.
But, taking the question in the spirit it’s meant in.
The halting problem is a worst-case result. Most agents aren’t maximally ambiguous about whether or not they halt. And for those that are, well, it depends on what the rules are for agents that don’t halt.
There are setups where each agent uses an unphysically large but finite amount of compute. There was a paper I saw a while ago where both agents do a brute-force proof search for the statement “if I cooperate, then they cooperate”, and cooperate if they find a proof.
(I.e. searching all proofs containing <10^100 symbols.)
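The proof-search construction is hard to run directly, but a deliberately naive sketch shows why the paper needs proof search rather than simulation. These bots (names and the fuel mechanism are my illustrative assumptions, not the paper’s construction) just simulate their opponent with a finite step budget:

```python
# Toy "FairBot"-style agents that simulate each other with a step budget
# instead of searching for proofs. Purely illustrative.
def fairbot(opponent, fuel):
    """Cooperate iff a bounded simulation says the opponent cooperates."""
    if fuel <= 0:
        return "D"  # budget exhausted: default to defect
    return "C" if opponent(fairbot, fuel - 1) == "C" else "D"

def cooperatebot(opponent, fuel):
    return "C"

def defectbot(opponent, fuel):
    return "D"

assert fairbot(cooperatebot, 10) == "C"  # exploits nobody, rewards cooperation
assert fairbot(defectbot, 10) == "D"     # can't be exploited
assert fairbot(fairbot, 10) == "D"       # the regress bottoms out in defection
```

Note the last line: two simulation-based bots recurse until the budget runs out and end up defecting. Getting FairBot-vs-FairBot cooperation is exactly what the bounded proof search (via Löb’s theorem) buys you.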
There is a model of bounded rationality, logical induction.
Can that be used to handle logical counterfactuals?
I believe that if I choose to cooperate, my twin will choose to cooperate with probability p; and if I choose to defect, my twin will defect with probability q;
And here the main difficulty pops up again. There is no causal connection between your choice and their choice. Any correlation is a logical one. So imagine I make a copy of you. But the copying machine isn’t perfect: a random 0.001% of neurons are deleted. Also, you know you aren’t a copy. How would you calculate those probabilities p and q, even in principle?
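For a crude toy model of what “in principle” might look like, suppose a brain is a bag of coin-flip neurons and a decision is a majority vote (every number here is an illustrative assumption):

```python
import random

# Toy Monte Carlo estimate of p = P(twin cooperates | I cooperate),
# where the copier deletes each neuron with probability 0.001%.
def decision(neurons):
    return sum(neurons) * 2 > len(neurons)  # True = cooperate

def trial(n=10_000, deletion=1e-5):
    me = [random.random() < 0.5 for _ in range(n)]
    twin = [v for v in me if random.random() > deletion]
    return decision(me), decision(twin)

random.seed(0)
runs = [trial() for _ in range(400)]
coop = [twin for me, twin in runs if me]
p = sum(coop) / len(coop)
```

In this toy model p comes out near 1, because a majority vote over thousands of neurons barely notices ~0.1 deleted neurons; a brain whose decision was finely balanced would give a lower p. The hard part the comment points at is that a real brain gives you no such clean model to integrate over.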
If two Logical Decision Theory agents with perfect knowledge of each other’s source code play prisoners dilemma, theoretically they should cooperate.
LDT uses logical counterfactuals in the decision making.
If the agents are CDT, then logical counterfactuals are not involved.
The research on humans in 0 g is only relevant if you want to send humans to mars. And such a mission is likely to end up being an ISS on mars. Or a moon landings reboot. A lot of newsprint and bandwidth expended talking about it. A small amount of science that could have been done more cheaply with a robot. And then everyone gets bored, they play golf on mars and people look at the bill and go “was that really worth it?”
Oh and you would contaminate mars with earth bacteria.
A substantially bigger, redesigned space station is fairly likely to be somewhat more expensive. And the point of all this is still not clear.
Current day NASA also happens to be in a failure mode where everything is 10 to 100 times more expensive than it needs to be, projects live or die based on politics not technical viability, and repeating the successes of the past seems unattainable. They aren’t good at innovating, especially not quickly and cheaply.
Here is a more intuitive version of the same paradox.
Again, conditional on all dice rolls being even. But this time it’s either
A) 1,000,000 consecutive 6′s.
B) 999,999 consecutive 6′s followed by a (possibly non-consecutive) 6.
Suppose you roll a few even numbers, followed by an extremely lucky sequence of 999,999 6′s.
From the point of view of version A, the only way to continue the sequence is a single extra 6. If you roll a 4, you would need to roll a second sequence of a million 6′s. And you are very unlikely to do that in the next 10 million steps, and very unlikely to go 10 million steps without rolling an odd number.
Yes if this happened, it would add at least a million extra rolls. But the chance of that is exponentially tiny.
Whereas, for B, then it’s quite plausible to roll 26 or 46 or 2426 instead of just 6.
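This checks out numerically in a toy version with two 6′s instead of a million (a Monte Carlo sketch; rejection sampling implements the “all rolls even” conditioning):

```python
import random

def run(consecutive):
    """Roll until the target is hit; return None if an odd number appears."""
    length = sixes = 0
    while True:
        r = random.randint(1, 6)
        length += 1
        if r % 2:                # odd roll: this run fails the conditioning
            return None
        if r == 6:
            sixes += 1
        elif consecutive:
            sixes = 0            # a 2 or 4 breaks a consecutive streak
        if sixes == 2:
            return length

def mean_length(consecutive, accepted=50_000):
    lengths = []
    while len(lengths) < accepted:
        n = run(consecutive)
        if n is not None:
            lengths.append(n)
    return sum(lengths) / len(lengths)

random.seed(0)
a = mean_length(True)    # two consecutive 6s: comes out near 2.7
b = mean_length(False)   # two 6s, gap allowed: comes out near 3.0
```

The version that allows a gap is longer on average, exactly because an even number can hide in the gap.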
Another way to think about this problem is with regular expressions. Let e=even numbers. *=0 or more.
The string “e*6e*6” matches any sequence with at least two 6′s and no odd numbers.
The pattern “e*66” matches two consecutive 6′s, with room for even numbers before them. And the pattern “66″ matches two consecutive 6′s with no room for extra even numbers before the first 6. This is the shortest.
Phrased this way it looks obvious. Every time you allow a gap for even numbers to hide in, an even number might be hiding in the gap, and that makes the sequence longer.
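The regex framing can be made concrete in Python, writing e as the character class [246] (a quick hedged translation, with the patterns anchored by fullmatch):

```python
import re

two_sixes   = re.compile(r"[246]*6[246]*6")  # at least two 6s, no odd numbers
consecutive = re.compile(r"[246]*66")        # two consecutive 6s at the end
bare        = re.compile(r"66")              # no room for extra evens at all

assert two_sixes.fullmatch("24626")          # evens can hide around both 6s
assert consecutive.fullmatch("2466")
assert not consecutive.fullmatch("24626")    # the gap breaks consecutiveness
assert bare.fullmatch("66") and not bare.fullmatch("266")
```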
When you remove the conditioning on the other numbers being even, then the word “first” becomes important for making the expected sequence length converge at all.
That is, our experiences got more reality-measure, thus matter more, by being easier to point at them because of their close proximity to the conspicuous event of the hottest object in the Universe coming to existence.
Surely not. Surely our experiences always had more reality measure from the start because we were the sort of people who would soon create the hottest thing.
Reality measure can flow backwards in time. And our present day reality measure is being increased by all the things an ASI will do when we make one.
We can discuss anything that exists, that might exist, that did exist, that could exist, and that could not exist. So no matter what form your predict-the-next-token language model takes, if it is trained over the entire corpus of the written word, the representations it forms will be pretty hard to understand, because the representations encode an entire understanding of the entire world.
Perhaps.
Imagine a huge number of very skilled programmers tried to manually hard code a ChatGPT in python.
Ask this pyGPT to play chess, and it will play chess. Look under the hood, and you see a chess engine programmed in. Ask it to solve algebra problems, a symbolic algebra package is in there. All in the best neat and well commented code.
Ask it to compose poetry, and you have some algorithm that checks if 2 words rhyme. Some syllable counter. Etc.
Rot13 is done with a hardcoded rot13 algorithm.
Somewhere in the algorithm is a giant list of facts, containing “Penguins Live In Antarctica”. And if you change this fact to say “Penguins Live In Canada”, then the AI will believe it. (Or spot its inconsistency with other facts?)
And with one simple change, the AI believes this consistently. Penguins appear when this AI is asked for poems about Canada, and don’t appear in poems about Antarctica.
When asked about the native Canadian diet, it will speculate that this likely included penguin, but say that it doesn’t know of any documented examples.
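A minimal sketch of what that fact table might look like (everything here is a made-up illustration of the thought experiment, not a claim about how real LLMs store facts):

```python
# Hand-coded "pyGPT" fact table: one comprehensible entry per fact.
FACTS = {("penguin", "lives_in"): "Antarctica"}

def animals_in(place):
    return [subj for (subj, rel), obj in FACTS.items()
            if rel == "lives_in" and obj == place]

def poem_about(place):
    local = animals_in(place)
    animal = local[0] if local else "wind"
    return f"In {place} the {animal} wanders free."

# One edit, and every skill that consults the table updates consistently:
FACTS[("penguin", "lives_in")] = "Canada"
assert "penguin" in poem_about("Canada")
assert "penguin" not in poem_about("Antarctica")
```

The point of the thought experiment is exactly this property: a single legible edit propagates consistently to every behavior that reads the fact.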
Can you build something with ChatGPT level performance entirely out of human comprehensible programmatic parts?
Obviously having humans program these parts directly would be slow. (We are still talking about a lot of code.) But what if some algorithm could generate that code?
But if the universal failure of nature and man to find non-connectionist forms of general intelligence does not move you
Firstly, AIXI exists, and we agree that it would be very smart if we had the compute to run it.
Secondly, I think there is some sort of sleight of hand here.
ChatGPT isn’t yet fully general. Neither is a 3-SAT solver. 3-SAT looks somewhat like what you might expect a non-connectionist approach to intelligence to look like. There are a huge range of maths problems that are all theoretically equivalent to 3-SAT.
In the infinite limit, both types of intelligence can simulate the other at huge overhead. In practice, they can’t.
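For concreteness, a 3-SAT solver in the crudest non-connectionist style fits in a few lines (a brute-force sketch; clause encoding is my assumption: positive int = variable, negative int = its negation):

```python
from itertools import product

def solve_3sat(clauses, n_vars):
    """Return a satisfying assignment as a tuple of bools, or None."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return bits
    return None

# (x1 or x2 or not x3) and (not x1 or x3 or x2)
sol = solve_3sat([(1, 2, -3), (-1, 3, 2)], 3)
```

Transparent, exact, and exponentially slow in the worst case: roughly the opposite trade-off from a neural net.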
Also, non-connectionist forms of intelligence are hard to evolve, because evolution works in small changes.
why is it obvious the nanobots could pretend to be an animal so well that it’s indistinguishable?
These nanobots are in the upper atmosphere, possibly with clouds in the way, and the nanobot fake humans could have any ratio of human to nanobot: nanobot internals under human skin and muscle, or just a human with a few nanobots in their blood.
Or why would targeted zaps have bad side-effects?
Because nanobots can be like bacteria if they want: tiny and everywhere. The nanobots can be hiding under leaves, clothes, skin, roofs, etc. And even if they weren’t, a single nanobot is a tiny target. Most of the energy of the zap can’t hit a single nanobot. Any zap of light that can stop nanobots in your house needs to be powerful enough to burn a hole in your roof.
And even if each zap isn’t huge, it’s not 1 or 2 zaps, it’s loads of zaps, constantly.
The “Warring nanobots in the upper atmosphere” thing doesn’t actually make sense.
The zaps of light are diffraction-limited. And targeting at that distance is hard, partly because it’s hard to tell the difference between an actual animal and a bunch of nanobots pretending to be an animal. So you can’t zap the nanobots on the ground without making the ground uninhabitable for humans.
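A back-of-envelope diffraction check, with every number an illustrative assumption (visible light, a 1 m aperture, zapping from ~20 km up, a micron-scale nanobot):

```python
# Minimum spot diameter from the Airy-disk formula:
# spot ~ 2.44 * wavelength * distance / aperture
wavelength = 500e-9   # m, visible light (assumed)
distance = 20e3       # m, upper atmosphere to ground (assumed)
aperture = 1.0        # m, emitter optics (assumed)
spot = 2.44 * wavelength * distance / aperture   # ~0.024 m, i.e. ~2.4 cm

nanobot = 1e-6        # m, nanobot size (assumed)
hit_fraction = (nanobot / spot) ** 2             # ~2e-9 of the zap's energy
```

So under these assumptions the spot is centimeters across, and only a billionth-ish of the zap’s energy can land on any single nanobot; the rest hits whatever the nanobot is sitting on.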
The “California red tape” thing implies some alignment strategy that made the AI obey the law, and that didn’t go too insanely wrong despite a superintelligence looking for loopholes. (E.g. the social persuasion infrastructure is already there. Convince humans that Dyson spheres are pretty and don’t block the view?)
There is also no clear explanation of why someone somewhere doesn’t make a non-red-taped AI.
if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask)
As well as agentic masks, there are uses for within-network goal-directed steps. (I.e. like an optimizing compiler. A list of hashed values followed by unhashed values isn’t particularly agenty. But the network needs to solve an optimization problem to reverse the hashes, something it can use the goal-directed reasoning section to do.)
My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data)
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won’t.
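To make “a sort of training process is used to find them” concrete: the comment doesn’t name a method, but the fast-gradient-sign step is one standard search, sketched here against a toy linear classifier (all values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)          # classifier weights
x = rng.normal(size=100)          # an ordinary input
score = w @ x                     # sign(score) = predicted class

eps = 0.5                         # max per-coordinate perturbation
# Step each coordinate by eps in the direction that pushes the score
# toward the decision boundary (the gradient-sign direction).
x_adv = x - eps * np.sign(score) * np.sign(w)
flipped = np.sign(w @ x_adv) != np.sign(score)
```

For a linear model, a perturbation of eps per coordinate shifts the score by eps times the sum of |w|, which swamps a typical score; that’s why a tiny, structured nudge flips the class even though the input barely changed.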
That’s a much more complicated goal than the goal of correctly predicting the next token,
Is it more complicated? What ontological framework is this AI using to represent its goal, anyway?
any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Only if, during training, the network repeatedly gets into a state where it believes that sacrificing tokens now is a good idea, despite the fact that it isn’t a good idea when you are in training. (Unless there is a training-environment bug and you can sneak out midway through training.)
So, is the network able to tell whether or not it’s in training?
Would you expect some part of the net to be left blank, because “a large neural net has a lot of spare neurons”?
If the lottery ticket hypothesis is true, yes.
The lottery ticket hypothesis is that some parts of the network start off doing something somewhat close to useful, and get trained towards usefulness. And some parts start off sufficiently un-useful that they just get trained to get out of the way.
Which fits with neural net distillation being a thing. (I.e. training a big network and then condensing it into a smaller network gives better performance than directly training the small network.)
but gradient descent doesn’t care, it reaches in and adjusts every weight.
Here is an extreme example. Suppose the current parameters were implementing a computer chip, on which was running a homomorphically encrypted piece of code.
Homomorphic encryption itself is unlikely to form, but it serves at least as an existence proof for computational structures that can’t be adjusted with local optimization.
Basically the problem with gradient descent is that it’s local. And when the same neurons are doing things that the training process does want, and things that it doesn’t want (but doesn’t dis-want either), then it’s possible for the network to be trapped in a local optimum: any small change to get rid of the bad behavior would also get rid of the good behavior.
Also, any bad behavior that only very rarely affects the output will produce very small gradients. Neural nets are trained for finite time. It’s possible that gradient descent just hasn’t got around to removing the bad behavior, even if it would do so eventually.
Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?
You can make any algorithm that does better than chance into a local optimum on a sufficiently large neural net. Homomorphically encrypt that algorithm: any small change and the whole thing collapses into nonsense. Well actually, this involves discrete bits. But suppose the neurons have strong regularization to stop the values getting too large (past +1 or −1), and they also have uniform [0,1] noise added to them, so each neuron can store 1 bit and any attempt to adjust parameters immediately risks errors.
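A tiny runnable stand-in for that construction (illustrative, much smaller than encryption): parameters that are read only through a hard threshold, so the loss surface is flat in every direction, while crossing a threshold wrecks the computation.

```python
import numpy as np

def circuit(params, x):
    bits = (params > 0.5).astype(int)   # each neuron stores one bit
    return int(bits.sum() % 2) ^ x      # parity-style computation

target_params = np.array([0.9, 0.1, 0.8, 0.2])

def loss(p):
    return sum((circuit(p, x) - circuit(target_params, x)) ** 2
               for x in (0, 1))

# Numerical gradient at the optimum is exactly zero in every direction...
grad = [(loss(target_params + 1e-4 * e) - loss(target_params - 1e-4 * e)) / 2e-4
        for e in np.eye(4)]

# ...but pushing one neuron across its threshold breaks the behavior entirely.
broken = target_params.copy()
broken[0] = 0.1
```

Small moves change nothing (zero gradient), and the only moves that change anything are catastrophic: a local optimum that local optimization can neither improve nor dismantle.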
Looking at the article you linked. One simplification is that neural networks tend towards the max-entropy way to solve the problem. If there are multiple solutions, the solutions with more free parameters are more likely.
And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want.
I think part of the problem is that there is no middle ground between “Allow any idiot to do thing” and “long and difficult to get professional certification”.
How about a 1-day, free or cheap, haircutting certification course? It doesn’t talk about style or anything at all. It’s just a check to make sure that hairdressers have a passing familiarity with hygiene 101 and other basic safety measures.
Of course, if there is only a single certification system, then the rent seeking will ratchet up the test difficulty.
How about having several different organizations, where you only need one of the licenses? So if AliceLicenses are too hard to get, everyone goes and gets BobLicenses instead. And the regulators only care that you have some license. (With the threat of revoking license-granting power if licenses are handed to total muppets too often.)
But it doesn’t make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens.
The mechanisms needed to compute goal-directed behavior are fairly complicated. But the mechanism needed to turn it on when it isn’t supposed to be on? That’s a switch. A single extraneous activation. Something that could happen by chance in an entirely plausible way.
Adversarial examples exist in simple image recognizers.
Adversarial examples probably exist in the part of the AI that decides whether or not to turn on the goal directed compute.
it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI’s comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token
We could imagine it was directly optimizing for something like token prediction. It’s optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.
I wasn’t really thinking about a specific algorithm. Well, I was kind of thinking about LLMs and the alien shoggoth meme.
But yes. I know this would be helpful.
But I’m more thinking about what work remains. Like, is it an idiot-proof 5-minute change? Or does it still take MIRI 10 years to adapt the alien code?
Also.
Domain-limited optimization is a natural thing. The prototypical example is Deep Blue or similar: lots of optimization power over a very limited domain. But any teacher who optimizes the class schedule without thinking about putting nanobots in the students’ brains is doing something similar.
I am guessing and hoping that the masks in an LLM are at least as limited in their optimization as humans, often more so, due to their tendency to learn the most usefully predictive patterns first. Hidden long-term sneaky plans will only very rarely influence the text (due to the plans being hidden).
And, I hope, the shoggoth isn’t itself particularly interested in optimizing the real world. The shoggoth just chooses which mask to wear.
So.
Can we duct-tape a mask of “alignment researcher” onto a shoggoth, and keep the mask in place long enough to get some useful alignment research done?
The more that there is one “know it when you see it” simple alignment solution, the more likely this is to work.