There was some guy recently who was hyped about asking o1 for the solution to quantum gravity; it gave the user some gibberish.
yes, but this is pretty typical of what a human would generate.
There are plenty of systems where we rationally form beliefs about likely outputs without a full understanding of how the system works. Weather prediction is an example.
I should have been clear: “doing things” is a form of input/output, since the AI must output some tokens or other signals to get anything done.
If you look at the answers, there is an entire “hidden” section of the MIRI website devoted to technical governance!
Why is this work hidden from the main MIRI website?
nice!
“Our objective is to convince major powers to shut down the development of frontier AI systems worldwide”
This?
Who works on this?
Re: (2), it will only affect the current generated output. Once that output is finished, all of that state gets reset, and the only thing that remains is the model weights, which were fixed at train time (see the sketch below).
Re: (1), “a LLM might produce text for reasons that don’t generalize like a sincere human answer would”: it seems that current LLM systems are pretty good at generalizing the way a human would, and in some ways they are better, due to being more honest, easier to monitor, etc.
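To make (2) concrete, here is a minimal sketch, assuming a Hugging Face transformers setup with GPT-2 as a stand-in (not any particular system from this thread): each request builds its context from scratch, the per-request state is discarded when the output ends, and the weights are unchanged before and after generation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model; inference below never runs an optimizer or a backward pass.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Snapshot the weights before generating anything.
before = {name: p.detach().clone() for name, p in model.named_parameters()}

# Two separate requests: each builds its own context (prompt + KV cache) from
# scratch, and that per-request state is thrown away once the output is finished.
for prompt in ["The capital of France is", "Water boils at"]:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))

# The parameters are identical afterwards: nothing that happened during
# generation persists into the next request.
assert all(torch.equal(before[n], p) for n, p in model.named_parameters())
```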
But do you really think we’re going to stop with tool AIs, and not turn them into agents?
But if it is the case that agentic AI is an existential risk, then actors could choose not to develop it, which is a coordination problem, not an alignment problem.
We already have aligned AGI; we can coordinate to not build misaligned AGI.
OK, but as a matter of terminology, is a “Satan Reverser” misaligned because it contains a Satan?
OK, imagine that I make an AI that works like this: a copy of Satan is instantiated and his preferences over possible outputs are extracted and ranked into percentiles; then sentences from Satan’s 2nd-5th percentile of outputs are randomly sampled. Then that copy of Satan is destroyed. (Toy sketch below.)
Is the “Satan Reverser” AI misaligned?
Is it “inner misaligned”?
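For concreteness, here is a toy sketch of the construction described above; the `instantiate_satan` factory and its `preference` scoring interface are hypothetical stand-ins, just to pin down the procedure: instantiate the copy, rank candidate sentences by its own preferences, sample only from the 2nd-5th percentile (outputs it strongly dislikes), then destroy the copy.

```python
import random

def satan_reverser(instantiate_satan, candidate_sentences, n_samples=1, seed=None):
    """Toy model of the 'Satan Reverser': the wrapper only ever emits sentences
    that the instantiated copy ranks near the bottom of its own preference
    ordering (the 2nd-5th percentile)."""
    rng = random.Random(seed)
    satan = instantiate_satan()                   # instantiate a copy of Satan
    ranked = sorted(candidate_sentences, key=satan.preference)  # least-preferred first
    del satan                                     # destroy the copy afterwards

    n = len(ranked)
    lo, hi = int(0.02 * n), max(int(0.05 * n), 1)  # the 2nd-5th percentile band
    band = ranked[lo:hi] or ranked[:1]             # fall back if the pool is tiny
    return [rng.choice(band) for _ in range(n_samples)]
```

Whether a system like this counts as “misaligned” because of what lives inside `instantiate_satan`, even though its emitted text is drawn from the copy’s least-preferred outputs, is exactly the terminology question being asked above.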
So your definition of “aligned” would depend on the internals of a model, even if its measurable external behavior is always compliant and it has no memory (it gets wiped after every inference)?
Further down the tech tree, the alignment tax can end up motivating systematic uses that make LLMs a source of danger.
Sure, but you can say the same about humans. Enron was a thing. Obeying the law is not as profitable as disobeying it.
maybe you should swap “understand ethics” for something like “follow ethics”/“display ethical behavior”.
What is the difference between these two? This sounds like a distinction without a difference.
Any argument which features a “by definition”
What is your definition of “aligned” for an LLM with no attached memory, then?
Wouldn’t it have to be:
“The LLM outputs text which is compliant with the creator’s ethical standards and intentions”?
To add: I didn’t expect this to be controversial, but it is currently at −12 agreement karma!
This is my next project!