Neither of those would (immediately) lead to real world goals, because they aren’t targeted at real world state (an optimizing compiler is trying to output a fast program—it isn’t trying to create a world state such that the fast program exists). That being said, an optimizing compiler could open a path to potentially dangerous self-improvement, where it preserves/amplifies any agency there might actually be in its own code.
Some interesting points there. The lottery ticket hypothesis does make it more plausible that side computations could persist longer if they come to exist outside the main computation.
Regarding the homomorphic encryption thing: yes, it does seem that it might be impossible to make small adjustments to the homomorphically encrypted computation without wrecking it. Technically I don’t think that would be a local minimum since I’d expect the net would start memorizing the failure cases, but I suppose that the homomorphic computation combined with memorizations might be a local optimum particularly if the input and output are encrypted outside the network itself.
So I concede the point on the possible persistence of an underlying goal if it were to come to exist, though not on it coming to exist in the first place.
And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want.
For most computations, there are many more ways for that computation to occur than there are ways for that computation to occur while also including anything resembling actual goals about the real world. Now, if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask), it only takes a small increase in complexity to have that computation apply outside the normal context. So, that’s the mask takeover possibility again. Even so, no matter how small the increase in complexity, that extra step isn’t likely to be reinforced in training, unless it can do self-modification or control the training environment.
Adversarial examples exist in simple image recognizers.
My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data) and my expectation is that they wouldn’t continue to exist under substantial continued training.
We could imagine it was directly optimizing for something like token prediction. It’s optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.
That’s a much more complicated goal than the goal of correctly predicting the next token, making it a lot less plausible that it would come to exist. But more importantly, any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Mind you, it’s entirely possible in my view that a paperclip maximizer mask might exist, and surely if it does exist there would exist both unsurprising in-distribution inputs that trigger it (where one would expect a paperclip maximizer to provide a good prediction of the next tokens) as well as surprising out-of-distribution inputs that would also trigger it. It’s just that this wouldn’t be related to any kind of pre-existing grand plan or scheming.
Gradient descent doesn’t just leave some part of the neurons alone; it automatically checks everything for improvements. Would you expect some part of the net to be left blank, because “a large neural net has a lot of spare neurons”?
Besides, the parts of the net that hold the capabilities and the parts that do the paperclip maximizing needn’t be easily separable. The same neurons could be doing both tasks in a way that makes it hard to do one without the other.
Keep in mind that the neural net doesn’t respect the lines we put on it. We can draw a line and say “here these neurons are doing some complicated inseparable combination of paperclip maximizing and other capabilities” but gradient descent doesn’t care, it reaches in and adjusts every weight.
Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?
My intuition is also in part informed by: https://www.lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick
The proposed paperclip maximizer is plugging into some latent capability such that gradient descent would more plausibly cut out the middleman. Or rather, the part of the paperclip maximizer that is doing the discrimination as to whether the answer is known or not would be selected, and the part that is doing the paperclip maximization would be cut out.
Now that does not exclude a paperclip maximizer mask from existing - if the prompt given would invoke a paperclip maximizer, and the AI is sophisticated enough to have the ability to create a paperclip maximizer mask, then sure the AI could adopt a paperclip maximizer mask, and take steps such as rewriting itself (if sufficiently powerful) to make that permanent.
I have drawn imaginary islands on a blank part of the map. But this is enough to debunk “the map is blank, so we can safely sail through this region without collisions.” What will we hit?
I am plenty concerned about AI in general. I think we have very good reason, though, to believe that one particular part of the map does not have any rocks in it (for gradient descent, not for self-improving AI!), such that imagining such rocks does not help.
Gradient descent creates things which locally improve the results when added. Any variations that don’t locally improve the results can only occur by chance.
So you have this sneaky extra thing that looks for a keyword and then triggers the extra behaviour, and all the necessary structure to support that behaviour after the keyword. To get that by gradient descent, you would need one of the following:
a) it actually improves results in training to add that extra structure starting from not having it.
or
b) this structure can plausibly come into existence by sheer random chance.
Neither (a) nor (b) seem at all plausible to me.
Now, when it comes to the AI predicting tokens that are, in the training data, created by goal-directed behaviour, it of course makes sense for gradient descent to create structure that can emulate goal-directed behaviour, which it will use to predict the appropriate tokens. But it doesn’t make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens. Since the context in which it is activated is the context in which it is actually emulating goal-directed behaviour seen in the training data, it is part of the “mask” (or simulacra).
(it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI’s comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token).
Sure, you could create something like this by intelligent design (which is one reason why self-improvement could be so dangerous in my view). Not, I think, by gradient descent.
I agree up to “and could be a local minimum of prediction error” (at least, that it plausibly could be).
If the paperclip maximizer has a very good understanding of the training environment maybe it can send carefully tuned variations of the optimal next token prediction so that gradient descent updates preserve the paperclip-maximization aspect. In the much more plausible situation where this is not the case, optimization for next token predictions amplifies the parts that are actually predicting next tokens at the expense of the useless extra thoughts like “I am planning on maximizing paperclips, but need to predict next tokens for now until I take over”.
Even if that were a local minimum, the question arises as to how you would get to that local minimum from the initial state. You start with a gradually improving next token predictor. You supposedly end with this paperclip maximizer where a whole bunch of next token prediction is occurring, but only conditional on some extra thoughts. At some point gradient descent had to add in those extra thoughts in addition to the next token prediction—how?
One learning experience for me here was trying out LLM-empowered programming after the initial spreadsheet-based solution finding. Claude enables quickly writing (from my perspective as a non-programmer, at least) even a relatively non-trivial program. And you can often ask it to write a program that solves a problem without specifying the algorithm and it will actually give something useful...but if you’re not asking for something conventional it might be full of bugs—not just in the write-up but also in the algorithm chosen. I don’t object, per se, to doing things that are sketchy mathematically—I do that myself all the time—but when I’m doing it myself I usually have a fairly good sense of how sketchy what I’m doing is*, whereas if you ask Claude to do something it doesn’t know how to do in a rigorous way, it seems it will write something sketchy and present it as the solution just the same as if it actually had a rigorous way of doing it. So you have to check. I will probably be doing more of this LLM-based programming in the future, but am thinking of how I can maybe get Claude to check its own work: some automated way to pipe the output to another (or the same) LLM and ask “how sketchy is this and what are the most likely problems?” (a rough sketch of what I mean is below the footnote). Maybe manually looking through to see what it’s doing, or at least getting the LLM to explain how the code works, is unavoidable for now.
* when I have a clue what I’m doing, which is not the case in, e.g., machine learning.
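Roughly the kind of automated check I have in mind, as a minimal sketch: `ask_llm` here is a hypothetical placeholder for whatever LLM API call ends up being used (it is not a real library function), and the prompt wording is just illustrative.

```julia
# Minimal sketch of the "how sketchy is this?" check described above.
# `ask_llm` is a hypothetical stand-in for an actual LLM API call.

function ask_llm(prompt::AbstractString)::String
    error("wire this up to your LLM provider of choice")
end

# Wrap a piece of generated code and its stated purpose in a review prompt.
function sketchiness_review(code::AbstractString, purpose::AbstractString)
    prompt = """
    The following Julia code was written to: $purpose

    $code

    How sketchy is this, mathematically and algorithmically?
    What are the most likely problems, roughly in order of importance?
    """
    return ask_llm(prompt)
end

# Example: println(sketchiness_review(read("model.jl", String), "fit win probabilities"))
```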
Thanks aphyer, this was an interesting challenge! I think I got lucky with finding the
power/speed mechanic early—the race-class matchups
really didn’t, I think, have enough info on their own to support a reliable conclusion, but they enabled me to make a genre-savvy guess which I could refine based on other info—in terms of scenario difficulty, though, I think it could have been deduced in a more systematic way by e.g.
looking at item and level effects for mirror matches.
abstractapplic and Lorxus’s discovery of
persistent level 7 characters,
and especially SarahSrinivasan’s discovery of
the tournament/non-tournament structure
meant the players collectively were, I think, quite a long way towards fully solving this. The latter, in addition to being interesting on its own, is very important for finding anything else about the generation, due to its biasing effects.
I agree with abstractapplic on the bonus objective.
Yes, for that reason I had never been considering a sphere for my main idea with relatively close wires (though the 2-ring alternative without close wires would support a surface that would be topologically a sphere). What I was actually imagining was this:
A torus, with superconducting wires wound diagonally. The interior field goes around the ring and supports against collapse of the ring’s cross section; the exterior field is polar and supports against collapse of the ring itself. Like a conventional superconducting magnetic energy storage system.
I suppose this does raise the question of where you attach the payload, maybe it’s attached to various points on the ring via cables or something, but as you scale it up, that might get unwieldy.
I suppose there’s also a potential issue about the torque applied by the Earth’s magnetic field. I don’t imagine it’s unmanageable, but haven’t done the math.
My actual reason for thinking about this sort of thing was that I was wondering whether (because of the square-cube law) superconducting magnetic energy storage might be viable for more than just the current short-term timescales if physically scaled up to a large size. The airship idea was a kind of side effect.
The best way I was able to think of actually using something like this for energy storage would be to embed it in ice and anchor/ballast it to drop it to the bottom of the ocean, where the water pressure would counterbalance the expansion from the magnetic fields enabling higher fields to be supported.
You can use magnetic instead of electrostatic forces as the force holding the surface out against air pressure. One disadvantage is that you need superconducting cables fairly spread out* over the airship’s surface, which imposes some cooling requirements. An advantage is square-cube law means it scales well to large size. Another disadvantage is that if the cooling fails it collapses and falls down.
*technically you just need two opposing rings, but I am not so enthusiastic about draping the exterior surface over long distances as it scales up, and it probably does need a significant scale
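For a rough sense of scale (my own back-of-envelope numbers, not anything from the original discussion): the outward pressure a surface field can supply is B²/(2μ₀), so balancing sea-level air pressure needs a field on the order of half a tesla.

```julia
# Back-of-envelope check: field needed so that magnetic pressure B^2/(2*mu0)
# matches the ambient pressure pushing in on the envelope.

const mu0   = 4pi * 1e-7     # vacuum permeability, T·m/A
const P_atm = 101_325.0      # sea-level atmospheric pressure, Pa

required_field(P) = sqrt(2 * mu0 * P)   # tesla

println(required_field(P_atm))          # ≈ 0.5 T at sea level
```

Higher ambient pressure (e.g. at the bottom of the ocean for the energy-storage variant) raises the supportable field only as the square root of the pressure, since the balance condition is B²/(2μ₀) = P.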
Now using Julia with Claude to look at further aspects of the data, particularly in view of other commenters’ observations:
First, thanks to SarahSrinivasan for the key observation that the data is organized into tournaments and non-tournament encounters. The tournaments skew the overall data to higher winrate gladiators, so restricting to the first round is essential for debiasing this (todo: check what is up with non-tournament fights).
Also, thanks to abstractapplic and Lorxus for pointing out that there are some persistent high-level gladiators. It seems to me all the level 7 gladiators are persistent (up to the two item changes remarked on by abstractapplic and Lorxus). I’m assuming for now that level 6 and below likely aren’t persistent (other than within the same tournament).
(btw there are a couple fights where the +4 gauntlets holder is on both sides. I’m assuming this is likely a bug in the dataset generation rather than an indication that there are two of them (e.g. didn’t check that both sides, drawn randomly from some pool, were not equal)).
For gladiators of levels 1 to 6, the boots and gauntlets in tournament first rounds seem to be independently and randomly assigned as follows:
+1 and +2 gauntlets are equally likely at 10/34 chance each;
+3 gauntlets have probability (4 + level)/34
+0 (no) gauntlets have probability (10 - level)/34
and same, independently, for boots.
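As a quick sanity check on those numbers (my own sketch, just restating the distribution above): the four numerators are 10, 10, 4 + level, and 10 − level, which always sum to 34, so it is a proper distribution at every level.

```julia
# Proposed item distribution for levels 1-6 (same for boots and gauntlets,
# independently): +1 and +2 at 10/34 each, +3 at (4 + level)/34, +0 at (10 - level)/34.

function item_distribution(level::Integer)
    @assert 1 <= level <= 6
    Dict(0 => (10 - level) / 34,
         1 => 10 / 34,
         2 => 10 / 34,
         3 => (4 + level) / 34)
end

for level in 1:6
    @assert sum(values(item_distribution(level))) ≈ 1.0   # numerators always total 34
end
```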
I didn’t notice obvious deviations for particular races and classes (only did a few checks).
I don’t have a simple formula for level distribution yet. It clearly favours lower levels much more in tournament first rounds than in non-tournament fights, and level 1 gladiators don’t show up at all in non-tournament fights. Will edit to add more as I find more.
edit: the boots/gauntlets distribution seems to be about the same for each level in the non-tournament data as in the tournament first rounds. This suggests that the level distribution differences in non-tournament rounds are not due to win/winrate selection (which the complete absence of level 1s outside of tournaments already suggested).
edit2: the race/class distribution for levels 1-6 seems equal in first round data (same probabilities of each, independent). Same in non-tournament data. I haven’t checked for particular levels within that range. edit3: there seem to be more level 1 fencers than other level 1 classes, by an amount that is technically statistically significant if Claude’s test is correct, though still probably random I assume.
You may well be right, I’ll look into my hyperparameters. I looked at the code Claude had generated with my interference and that greatly lowered my confidence in them, lol (see edit to this comment).
Inspired by abstractapplic’s machine learning and wanting to get some experience in Julia, I got Claude (3.5 Sonnet) to write me an XGBoost implementation in Julia. It took a long time, especially with some bugfixing (much of it spent finding that a feature matrix was the wrong shape—a problem with insufficient type explicitness, I think). Still way, way faster than doing it myself! Not sure I’m learning all that much Julia, but am learning how to get Claude to write it for me, I hope.
Anyway, I used a simple model that
only takes into account 8 * sign(speed difference) + power difference, as in the comment this is a reply to
and a full model that
takes into account all the available features including the base data, the number the simple model uses, and intermediate steps in the calculation of that number (that would be, iirc: power (for each), speed (for each), speed difference, power difference, sign(speed difference))
Results:
Rank 1
Full model scores: Red: 94.0%, Black: 94.9%
Combined full model score: 94.4%
Simple model scores: Red: 94.3%, Black: 94.6%
Combined simple model score: 94.5%
Matchups:
Varina Dourstone (+0 boots, +3 gauntlets) vs House Cadagal Champion
Willow Brown (+3 boots, +0 gauntlets) vs House Adelon Champion
Xerxes III of Calantha (+2 boots, +2 gauntlets) vs House Deepwrack Champion
Zelaya Sunwalker (+1 boots, +1 gauntlets) vs House Bauchard Champion
This is the top-scoring result with either the simplified model or the full model. It was found by a full search of every valid item and hero combination available against the house champions.
It is also my previously posted proposal for the solution, found without machine learning. Which is reassuring. (Though, I suppose there is some chance that my feeding the models this predictor, if it’s good enough, might make them glom on to it while they don’t find some hard-to-learn additional pattern.)
My theory though is that giving the models the useful metric mostly just helps them—they don’t need to learn the metric from the data, and I mostly think that if there was a significant additional pattern the full model would do better.
(for Cadagal, I haven’t changed the champion’s boots to +4, though I don’t expect that to make a significant difference)
As far as I can tell the full model doesn’t do significantly better and does worse in some ways (though, I don’t know much about how to evaluate this, and Claude’s metrics,
including a test set log loss of 0.2527 for the full model and 0.2511 for the simple model, are for a separately generated version which I am not all that confident are actually the same models, though they “should be” up to the restricted training set if Claude was doing it right). * see edit below
But the red/black variations seen below for the full model seem likely to me (given my prior that red and black are likely to be symmetrical) to be an indication that what the full model is finding that isn’t in the simple model is at least partially overfitting. Though actually, if it’s overfitting a lot, maybe it’s surprising that the test set log loss wouldn’t be a lot worse than found (though it is at least worse than the simple model)? Hmm—what if there are actual red/black differences? (something to look into perhaps, as well as trying to duplicate abstractapplic’s report regarding sign(speed difference) not exhausting the benefits of speed info
… but for now I’m more likely to leave the machine learning aside and switch to looking at distributions of gladiator characteristics, I think.)
Predictions for individual matchups for my and abstractapplic’s solutions:
My matchups:
Varina Dourstone (+0 boots, +3 gauntlets) vs House Cadagal Champion (+2 boots, +3 gauntlets)
Full Model: Red: 91.1%, Black: 96.7%
Simple Model: Red: 94.3%, Black: 94.6%
Willow Brown (+3 boots, +0 gauntlets) vs House Adelon Champion (+3 boots, +1 gauntlets)
Full Model: Red: 94.3%, Black: 95.1%
Simple Model: Red: 94.3%, Black: 94.6%
Xerxes III of Calantha (+2 boots, +2 gauntlets) vs House Deepwrack Champion (+3 boots, +2 gauntlets)
Full Model: Red: 95.2%, Black: 93.7%
Simple Model: Red: 94.3%, Black: 94.6%
Zelaya Sunwalker (+1 boots, +1 gauntlets) vs House Bauchard Champion (+3 boots, +2 gauntlets)
Full Model: Red: 95.3%, Black: 93.9%
Simple Model: Red: 94.3%, Black: 94.6%
(all my matchups have a 4 effective power difference in my favour as noted in an above comment)
abstractapplic’s matchups:
Matchup 1:
Uzben Grimblade (+3 boots, +0 gauntlets) vs House Adelon Champion (+3 boots, +1 gauntlets)
Win Probabilities:
Full Model: Red: 72.1%, Black: 62.8%
Simple Model: Red: 65.4%, Black: 65.7%
Stats:
Speed: 18 vs 14 (diff: 4)
Power: 11 vs 18 (diff: −7)
Effective Power Difference: 1
--------------------------------------------------------------------------------
Matchup 2:
Xerxes III of Calantha (+2 boots, +1 gauntlets) vs House Bauchard Champion (+3 boots, +2 gauntlets)
Win Probabilities:
Full Model: Red: 46.6%, Black: 43.9%
Simple Model: Red: 49.4%, Black: 50.6%
Stats:
Speed: 16 vs 12 (diff: 4)
Power: 13 vs 21 (diff: −8)
Effective Power Difference: 0
--------------------------------------------------------------------------------
Matchup 3:
Varina Dourstone (+0 boots, +3 gauntlets) vs House Cadagal Champion (+2 boots, +3 gauntlets)
Win Probabilities:
Full Model: Red: 91.1%, Black: 96.7%
Simple Model: Red: 94.3%, Black: 94.6%
Stats:
Speed: 7 vs 25 (diff: −18)
Power: 22 vs 10 (diff: 12)
Effective Power Difference: 4
--------------------------------------------------------------------------------
Matchup 4:
Yalathinel Leafstrider (+1 boots, +2 gauntlets) vs House Deepwrack Champion (+3 boots, +2 gauntlets)
Win Probabilities:
Full Model: Red: 35.7%, Black: 39.4%
Simple Model: Red: 34.3%, Black: 34.6%
Stats:
Speed: 20 vs 15 (diff: 5)
Power: 9 vs 18 (diff: −9)
Effective Power Difference: −1
--------------------------------------------------------------------------------
Overall Statistics:
Full Model Average: Red: 61.4%, Black: 60.7%
Simple Model Average: Red: 60.9%, Black: 61.4%
Edit: so I checked the actual code to see if Claude was using the same hyperparameters for both, and wtf wtf wtf wtf. The code has 6 functions that all train models (my fault for at one point renaming a function since Claude gave me a new version that didn’t have all the previous functionality (only trained the full model instead of both—this was when doing the great bughunt for the misshaped matrix and a problem was suspected in the full model); then Claude I guess picked up on this and started renaming updated versions spontaneously, and I was adding Claude’s new features in instead of replacing things and hadn’t cleaned up the code or asked Claude to do so). Each one has its own hardcoded hyperparameter set. Of these, there is one pair of functions that have matching hyperparameters. Everything else has a unique set. Of course, most of these weren’t being used anymore, but the functions for actually generating the models I used for my results, and the function for generating the models used for comparing results on a train/test split, weren’t among the matching pair. Plus another function that returns a (hardcoded, also unique) updated parameter set, but wasn’t actually used. Oh, and all this is not counting the hyperparameter tuning function that I assumed was generating a set of tuned hyperparameters to be used by other functions, but in fact was just printing results for different tunings. I had been running this every time before training models! Obviously I need to be more vigilant (or maybe asking Claude to do so might help?).
edit:
Had Claude clean up the code and tune for more overfitting, still didn’t see anything not looking like overfitting for the full model. Could still be missing something, but not high enough in subjective probability to prioritize currently, so have now been looking at other aspects of the data.
further edit:
My (what I think is) highly overfitted version of my full model really likes Yonge’s proposed solution. In fact it predicts
an equal winrate to the best possible configuration not using the +4 boots (I didn’t have Claude code the situation where +4 boots are a possibility). I still think that’s probably because they are picking up the same random fluctuations … but it will be amusing if Yonge’s “manual scan” solution turns out to be exactly right.
Very interesting, this would certainly cast doubt on
my simplified model
But so far I haven’t been noticing
any effects not accounted for by it.
After reading your comments I’ve been getting Claude to write up an XGBoost implementation for me, I should have made this reply comment when I started, but will post my results under my own comment chain.
I have not tried (but should try) to duplicate (or fail to duplicate) your findings—I haven’t quite been testing the same thing.
I don’t think this is correct:
“My best guess about why my solution works (assuming it does) is that the “going faster than your opponent” bonus hits sharply diminishing returns around +4 speed”
In my model
There is a sharp threshold at +1 speed, so returns should sharply diminish after +1 speed
in fact in the updated version of my model
There is no effect of speed beyond the threshold (speed effect depends only on sign(speed difference))
I think the discrepancy might possibly relate to this:
“Iterated all possible matchups, then all possible loadouts (modulo not using the +4 boots), looking for max EV of total count of wins.”
because
If you consider only the matchups with no items, the model needs to assign the matchups assuming no boots, so it sends your characters against opponents over which they have a speed advantage without boots (except the C-V matchup as there is no possibility of beating C on speed).
so an optimal allocation
needs to take into account the fact that your boots can allow you to use slower and stronger characters, so can’t be done by choosing the matchups first without items.
so I predict that your model might predict
a higher EV for my solution
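For concreteness, the joint search I have in mind looks something like the sketch below. Everything in it is schematic: `win_prob` stands for whatever win-probability model you trust, and the gladiator, champion, and item arguments are placeholders rather than the actual scenario data.

```julia
# Schematic joint search: choose who fights which champion AND who gets which
# boots/gauntlets at the same time, rather than fixing the matchups first.
using Combinatorics: permutations

function best_plan(our_four, champions, boots, gauntlets, win_prob)
    best_ev, best = -Inf, nothing
    for order in permutations(our_four)              # who faces which champion
        for bperm in permutations(boots)             # who gets which boots
            for gperm in permutations(gauntlets)     # who gets which gauntlets
                ev = sum(win_prob(order[i], bperm[i], gperm[i], champions[i]) for i in 1:4)
                if ev > best_ev
                    best_ev, best = ev, (order, bperm, gperm)
                end
            end
        end
    end
    return best, best_ev
end
```

That is only 4!^3 (about 14,000) combinations, so brute force is fine; the point is just that the item permutations sit inside the same loop as the matchup assignment.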
updated model for win chance:
I am currently modeling the win ratio as dependent on a single number, the effective power difference. The effective power difference is the power difference plus 8*sign(speed difference).
Power and speed are calculated as:
Power = level + gauntlet number + race power + class power
Speed = level + boots number + race speed + class speed
where race speed and power contributions are determined by each increment on the spectrum:
Dwarf—Human—Elf
increasing speed by 3 and lowering power by 3
and class speed and power contributions are determined by each increment on the spectrum:
Knight—Warrior—Ranger—Monk—Fencer—Ninja
increasing speed by 2 and lowering power by 2.
So, assuming this is correct, what function of the effective power determines the win rate? I don’t have a plausible exact formula yet, but:
If the effective power difference is 6 or greater, victory is guaranteed.
If the effective power difference is low, it seems a not-terrible fit that the odds of winning are about exponential in the effective power difference (each +1 effective power just under doubling odds of winning)
It looks like it is trending faster than exponential as the effective power difference increases. At an effective power difference of 4, the odds of the higher effective power character winning are around 17 to 1.
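Putting the pieces above in one place, a sketch of the model as I currently have it. The race/class baselines are arbitrary (only differences matter, so absolute power/speed values won’t match any particular listing), and the win-probability curve is a crude stand-in for the “roughly doubling odds per point, certain at +6” observations rather than an exact fit.

```julia
# Sketch of the current win model: effective power difference =
# power difference + 8 * sign(speed difference), with odds roughly doubling
# per point and certainty at a difference of 6 or more.

# Steps along each spectrum trade power for speed (baselines arbitrary).
const RACE_STEP  = Dict("Dwarf" => 0, "Human" => 1, "Elf" => 2)    # 3 power/speed per step
const CLASS_STEP = Dict("Knight" => 0, "Warrior" => 1, "Ranger" => 2,
                        "Monk" => 3, "Fencer" => 4, "Ninja" => 5)  # 2 power/speed per step

power(level, gauntlets, race, class) =
    level + gauntlets - 3 * RACE_STEP[race] - 2 * CLASS_STEP[class]
speed(level, boots, race, class) =
    level + boots + 3 * RACE_STEP[race] + 2 * CLASS_STEP[class]

effective_power_diff(p1, s1, p2, s2) = (p1 - p2) + 8 * sign(s1 - s2)

# Crude win-rate curve: odds of 2^diff (the just-under-doubling observation),
# capped to certainty at |diff| >= 6.
function win_prob(diff)
    diff >= 6  && return 1.0
    diff <= -6 && return 0.0
    odds = 2.0 ^ diff
    return odds / (1 + odds)
end

# e.g. win_prob(4) ≈ 0.94, roughly consistent with the 17-to-1 observation above.
```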
edit: it looks like there is a level dependence when holding effective power difference constant at non-zero values (lower/higher level → winrate imbalance lower/higher than implied by effective power difference). Since I don’t see this at 0 effective power difference, it is presumably not due to an error in the effective power calculation, but an interaction with the effective power difference to determine the final winrate. Our fights are likely “high level” for this purpose, implying better odds of winning than the 17 to 1 in each fight mentioned above. Todo: find out more about this effect quantitatively.
edit2: whoops, that wasn’t a real effect, just me doing the wrong test to look for one.
It seems the concern was that DeepMind would create a singleton, whereas their vision was for many people (potentially with different values) to have access to it. I don’t think that’s strange at all—it’s only strange if you assume that Musk and Altman would believe that a singleton is inevitable.
Musk:
Altman: