Christopher King
I think this would be fixed if they didn’t force yes and no to add to 100%. If they have the same interest rate, the price ratio would reveal the true odds.
The problem is that you’re forcing a one-year loan that pays $1 to add up to $1 in the present. It should add up to less than $1.
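A minimal sketch of the arithmetic, assuming both contracts pay $1 at resolution and are discounted at the same risk-free rate r over the one-year term (fees and collateral yield ignored):

$$p_{\text{yes}} + p_{\text{no}} \;=\; \frac{1}{1+r} \;<\; 1, \qquad \Pr(\text{yes}) \;\approx\; \frac{p_{\text{yes}}}{p_{\text{yes}} + p_{\text{no}}}.$$

Forcing the two prices to sum to exactly $1 instead builds in an implicit interest rate of zero, which is the distortion being pointed to here.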
I’m assuming the LDT agent knows what the game is and who their opponent is.
Towards the end of the post in the No agent is rational in every problem section, I provided a more general argument. I was assuming LDT would fall under case 1, but if not then case 2 demonstrates it is irrational.
Ultimately, though, we are not wedded to our particular formulation. Perhaps there is some clever sampling-based verifier that “trivializes” our conjecture as well, in which case we would want to revise it.
I think your goal should be to show that your abstract conjecture implies the concrete result you’re after, or is even equivalent to it.
At ARC, we are interested in finding explanations of neural network behavior. Concretely, a trained neural net (such as GPT-4) exhibits a really surprising property: it gets low loss on the training set (far lower than a random neural net).
We can formalize this similarly to the reversible circuit conjecture. Here’s a rough sketch:
Transformer performance no-coincidence conjecture: Consider a computable process that randomly generates text. The distribution has significantly lower entropy than the uniform distribution. Consider the property P(T) that says “the transformer T gets low average loss when predicting this process”. There is a deterministic polynomial time verifier V(T, π) such that:
1. P(T) implies that there exists a polynomial-length π with V(T, π) = 1.
2. For 99% of transformers T, there is no π with V(T, π) = 1.
Note that “ignore π, and then test T on a small number of inputs” doesn’t work. P is only asking if T has low average loss, so you can’t falsify P with a small number of inputs.
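A hedged formalization of the sketch above, writing $\mathcal{D}$ for the text-generating process and $\ell(T, x)$ for the prediction loss of $T$ on sample $x$; the loss threshold $c$ and the distribution $\mathrm{init}$ over random transformers (say, a standard random initialization) are assumptions filled in for concreteness, not part of the original statement:

$$P(T) \;:\Longleftrightarrow\; \mathbb{E}_{x \sim \mathcal{D}}\big[\ell(T, x)\big] \le c$$

$$\exists\, V \in \mathsf{P} \ \text{such that} \quad \begin{cases} P(T) \;\Longrightarrow\; \exists\, \pi,\ |\pi| \le \mathrm{poly}(|T|),\ V(T, \pi) = 1, \\ \Pr_{T \sim \mathrm{init}}\big[\,\exists\, \pi : V(T, \pi) = 1\,\big] \le 0.01. \end{cases}$$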
I mean, beating a chess engine in 2005 might be a “years-long task” for a human? The time METR is measuring is how long it would hypothetically take a human to do the task, not how long it takes the AI.
What is the absurd conclusion?
You’re saying that if you assigned 1 human contractor the task of solving superalignment, they would succeed after ~3.5 billion years of work? 🤔 I think you misunderstood what the y-axis on the graph is measuring.
[Question] How far along METR’s law can AI start automating or helping with alignment research?
I think the most mysterious part of this trend is that the x-axis is release date. Very useful but mysterious.
No, the Polymarket price does not mean we can immediately conclude what the probability of a bird flu pandemic is. We also need to know the interest rate!
I think there is an obvious signal that could be used: a forecast of how much MIRI will like the research when asked in 5 years. (Note that I don’t mean just asking MIRI now, but rather using something like prediction markets or superforecasters to predict what MIRI will say 5 years from now.)
Basically, if the forecast is above average, anyone who trusts MIRI should fund them.
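A toy sketch of that rule, assuming the forecast comes back as a numeric rating; the 0–10 scale and the numbers below are made up for illustration:

```python
def should_fund(forecasted_miri_rating: float, average_rating: float) -> bool:
    """Fund iff forecasters (e.g. a prediction market or superforecasters)
    predict that MIRI's retrospective rating of the research, elicited
    5 years from now, will beat the average across candidate projects."""
    return forecasted_miri_rating > average_rating

# Illustrative numbers only: forecasters expect MIRI to rate this agenda
# 7.2/10 in five years, versus a 6.0/10 average across candidate projects.
print(should_fund(7.2, 6.0))  # True -> fund it, if you trust MIRI's judgment
```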
How I saved 1 human life (in expectation) without overthinking it
Now see if you can catch sandbagging in the scratchpad!
The most important graph from the “faking alignment” paper is this one:
Christopher King’s Shortform
Also, you should care about worlds proportional to the square of their amplitude.
It’s actually interesting to consider why this must be the case. Without it, I concede that maybe some sort of Quantum Anthropic Shadow could be true. I’m thinking it would lead to lots of wacky consequences.
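For concreteness, the weighting in question is the Born rule: if the wavefunction splits into branches $w_i$ with amplitudes $\alpha_i$, decisions should weight each branch by $|\alpha_i|^2$:

$$\mathbb{E}[U] \;=\; \sum_i \lvert\alpha_i\rvert^2\, U(w_i), \qquad \sum_i \lvert\alpha_i\rvert^2 = 1.$$

On this reading, Quantum Immortality amounts to dropping the branches where you die from the sum and renormalizing, rather than keeping their $|\alpha_i|^2$ weight.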
I suppose the main point you should take from “Anthropic Blindness” and apply to QI is that:
Quantum Immortality is not a philosophical consequence of MWI; it is an empirical hypothesis with a very low prior (due to its complexity).
Death is not special. Assuming you have never gotten a Fedora up to this point, it is consistent to assume that “Quantum Fedoralessness” is true. That is, if you keep flipping a quantum coin that has a 50% chance of giving you a Fedora, the universe will only have you experience the path that doesn’t give you the Fedora. Since you have never gotten a Fedora, you can’t rule this hypothesis out. The silliness of this example demonstrates why we should likewise be skeptical of Quantum Immortality.
A universe with classical mechanics, except that when you die the universe gets resampled, would be anthropic angelic.
Beings who save you are also anthropic angelic. For example, the reason you don’t die while driving is that the engineers explicitly tried to minimize your chance of death. You can make inferences based on this: even if you have never crashed, you can reason that during a crash you will endure less damage than other parts of the car, because the engineers wanted to save you more than they wanted to save those parts.
And it is also a cheap example of a model organism of misalignment.