It would be great to have automated feedback on the epistemics of a piece of text: an LLM that can read text and identify reasoning errors or add appropriate qualifiers. As a browser plugin, it would also be helpful when reading news articles. Perhaps it could be done by using the Constitutional AI methodology with Rationality: From A-Z (or something similar) as the constitution.
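A minimal sketch of what such a checker might look like, assuming the OpenAI Python SDK; the model name, the constitution file, and the prompt wording are placeholders I made up, not a tested design:

```python
# Hypothetical epistemics checker: asks an LLM to flag reasoning errors
# and missing qualifiers in a text, guided by a "constitution" of principles
# (e.g. distilled from Rationality: From A-Z).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder file containing the constitution's principles, one per line.
CONSTITUTION = open("rationality_principles.txt").read()

def epistemic_feedback(text: str) -> str:
    """Return the model's critique of the reasoning in `text`."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are an epistemics reviewer. Using these principles:\n"
                        + CONSTITUTION
                        + "\nIdentify reasoning errors and suggest qualifiers to add."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

print(epistemic_feedback("Crime rose this year, so the new mayor must be to blame."))
```

A browser plugin would just wrap this call around the text of the current page and display the critique inline.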
I only skimmed the post I’m linking to, but I’m curious whether the method of self-other overlap could help “keep AI meta-ethical evolution grounded to human preferences”:
My own high-level, vaguely defined guess at a method would be something so central to the functioning of the AI that, if the AI goes against it, it can no longer make sense of the world. But that seems to carry the risk of the AI just messing everything up as it goes crazy. So the method should also include a way of limiting the capabilities of the AI while it’s in that confused state.
“it’s the distinction between learning from human data versus learning from a reward signal.” That’s an interesting distinction. The difference I currently see between the two is that a reward signal can be hacked by the AI, while human data cannot. Is that an accurate thing to say?
Are there any resources you could recommend for alignment methods that take into account the distinction you mentioned?
Satire: Sam Altman gets grilled by the Financial Times for his kitchen and his cooking skills + what this might say about him
I think editing should be possible. Not sure about deleting it entirely.
I think that from an AI Alignment perspective, giving AI so much control over its training seems to be very problematic. What we are mostly left with is to control the interface that AI has to physical reality i.e. sensors and actuators.
For now, it seems to me that AI is mostly affecting the virtual world. I think the moment when AI can competently and more directly influence physical reality would be a tipping point, because then it can cause a lot more changes to the world.
I would say that the ability to do continuous learning is required to adapt well to the complexity of physical reality. So a big improvement in continuous learning might be an important next goalpost to watch for.
Yes, and there’s an interesting question in that post and an interesting answer in the comments there. It would be great to have everything in one place. @Matrice Jacobine and @alapmi, maybe you could try to come to some sort of an agreement?
An artistic illustration of Scalable Oversight—“A world apart, neither gods nor mortals”
This seems like an introduction to the topic, enough to get your curiosity going.
Hippotherapy
Looking at the convergent instrumental goals:
- self-preservation
- goal preservation
- resource acquisition
- self-improvement
I think some are more important than others.
There is the argument that in order to predict the actions of a superintelligent agent you need to be as intelligent as it is. It would follow that an AI might not be able to predict whether its goal will be preserved through self-improvement.
But I think it can have high confidence that self-improvement will help with self-preservation and resource acquisition. And those gains will be helpful for any new goal it might decide to have. So self-improvement would not seem to be such a bad idea.
>Are those your musings about agents questioning their assumptions and word-views?
- Yes, these are my musings about agents questioning their assumptions and world-views.
>And like, do you wish to improve your fallacies?
- I want to get better at avoiding fallacies. What I desire for myself I also desire for AI. As Marvin Minsky put it: “Will robots inherit the Earth? Yes, but they will be our children.”
>higher threshold than ability, like inherent desire/optimisation?
>What kind of stability? Any from https://en.wikipedia.org/wiki/Stable_algorithm? I’d focus more on sort of non-fatal influence. Should the property be more about the alg being careful/cautious?
- I was thinking of stability in terms of avoiding infinite regress, as illustrated by Jonas noticing the endless sequence of metaphorical whale bellies.
Philosopher Gabriel Liiceanu, in his book “Despre limită” (English: Concerning Limit—unfortunately, no English version seems to be available), argues that we feel lost when we lose our landmark-limit, e.g. in the desert or in the middle of the ocean on a cloudy night with no navigational tools. I would say that we can also get lost in our mental landscape and thus be unable to decide which goal to pursue.
Consider the paperclip-maximizing algorithm: once it has turned all available matter in the Universe into paperclips, what will it do? And if the algorithm can predict that it will reach this confusing state, does it decide to continue the paperclip optimization? As a Buddhist saying goes: “When you get what you desire, you become a different person. Consider becoming that version of yourself first and you might find that you no longer need the object of your desires.”
Waited for 20 minutes and no one showed up. Better luck next time, I guess.
Some desirable properties of automated wisdom
Because no one has signed up, this event will not be held.
Nobody signed up, so the event will not take place.
“If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.”
It’s not clear to me that this is the only possible outcome. It’s not a mistake that we humans routinely make. In fact, there is some evidence that if someone asks us to do them a favor, we might end up liking them more and continue to do more favors for that person. Granted, there seem to have been no large-scale studies analyzing this so-called Ben Franklin effect. Even if the effect does turn out to be robust, it’s not clear to me how it could transfer to an AI. And then there’s the issue of making sure the AI won’t somehow get rid of this constraint that we imposed on it.
“The problem is that we don’t know what we want the AI to do, certainly not with enough precision to turn it into code.”
I agree; that’s backed up by the findings from the Moral Machine experiment about what we think autonomous cars should do.