into digital sentience these days
EuanMcLean
Thanks James!
One failure mode is that the modification makes the model very dumb in all instances.
Yeah, good point. Perhaps an extra condition we’d need to include is that the “difficulty of meta-level questions” should be the same before and after the modification. For example, the distribution over stuff it’s good at and stuff it’s bad at should be just as complex before and after (not just good at everything or bad at everything).
Thanks Felix!
This is indeed a cool and surprising result. I think it strengthens the introspection interpretation, but without a requirement to make a judgement of the reliability of some internal signal (right?), it doesn’t directly address the question of whether there is a discriminator in there.
Interesting question! I’m afraid I didn’t probe the cruxes of those who don’t expect hard takeoff. But my guess is that you’re right—no hard takeoff ~= the most transformative effects happen before recursive self-improvement
Yeah, I think you’re hitting on a weird duality between setting and erasing here. I think I agree that setting is more fundamental than erasing. I suppose that, when it comes to the energy cost of computation, each set bit must eventually be erased, so in that sense they’re interchangeable.
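The standard way to put a number on "energy expenditure of computation" here is Landauer's principle. It says that erasing one bit at temperature $T$ dissipates at least $k_B T \ln 2$ of heat. If every set bit has to be erased eventually, the long-run minimum cost scales with the number of bits set. The sketch below computes that bound. The function name and the 8-bit example are illustrative only and are not taken from the discussion.

```python
import math

# Boltzmann constant in J/K (exact value in the 2019 SI redefinition)
K_B = 1.380649e-23

def landauer_bound_joules(temperature_k, n_bits=1):
    """Minimum heat (J) dissipated when erasing n_bits bits at temperature_k kelvin.

    Illustrative helper: implements the Landauer limit n * k_B * T * ln 2.
    """
    return n_bits * K_B * temperature_k * math.log(2)

# Hypothetical example: 8 bits were set during a computation. Each must
# eventually be erased to return the memory to a blank state, so the
# long-run minimum cost is 8 times the single-bit bound.
per_bit = landauer_bound_joules(300.0)          # about 2.87e-21 J at room temperature
total = landauer_bound_joules(300.0, n_bits=8)
print(per_bit, total)
```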
Sorry for the delay. As both you and TheMcDouglas have mentioned, yeah, this relies on $H(C|X) = 0$. The way I worded it above was somewhere between misleading and wrong, so I’ve modified it. Thanks for pointing this out!
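The condition $H(C|X) = 0$ says that $C$ is fully determined by $X$. Once you know $X$, there is no remaining uncertainty about $C$. The sketch below checks this numerically for a discrete joint distribution written as `{(x, c): p}`. The two example distributions are hypothetical and do not correspond to the $C$ and $X$ in the original post.

```python
from collections import defaultdict
from math import log2

def conditional_entropy(joint):
    """H(C|X) in bits, for a joint distribution given as {(x, c): p(x, c)}."""
    # Marginal p(x), obtained by summing over c
    p_x = defaultdict(float)
    for (x, _c), p in joint.items():
        p_x[x] += p
    # H(C|X) = -sum_{x,c} p(x, c) * log2 p(c | x), where p(c | x) = p(x, c) / p(x)
    h = 0.0
    for (x, _c), p in joint.items():
        if p > 0:
            h -= p * log2(p / p_x[x])
    return h

# Hypothetical: C is a deterministic function of X (c = x % 2), so H(C|X) = 0
deterministic = {(x, x % 2): 0.25 for x in range(4)}
# Hypothetical: C is a fair coin independent of X, so H(C|X) = 1 bit
noisy = {(x, c): 0.125 for x in range(4) for c in (0, 1)}

print(conditional_entropy(deterministic))  # 0.0
print(conditional_entropy(noisy))          # 1.0
```

Any randomness in $C$ that $X$ doesn't account for pushes $H(C|X)$ above zero.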
Fixed, thanks!
Thanks for the comment, this is indeed an important component! I’ve added a couple of sentences pointing in this direction.
Fixed, thanks!
Thanks for the feedback Garrett.
This was intended to be more of a technical report than a blog post, meaning I wanted to keep the discussion reasonably rigorous and thorough. That always comes with the downside of it being a slog to read, so apologies for that!
I’ll write a shortened version if I find the time!