Mati_Roy
i want a better conceptual understanding of what “fundamental values” means, and how to disentangle that from beliefs (ex.: in an LLM). like, is there a meaningful way we can say that a “cat classifier” is valuing classifying cats even though it sometimes fails?
when potentially ambiguous, I generally just say something like “I have a different model” or “I have different values”
it seems to me that disentangling beliefs and values is an important part of being able to understand each other
and using words like “disagree” to mean both “different beliefs” and “different values” is really confusing in that regard
topic: economics
idea: when building something with local negative externalities, have some mechanism to measure the externalities in terms of how much the surrounding property valuations changed (or are expected to change, as estimated, say, through a prediction market), and have the owner of the new structure pay the owners of the surrounding properties.
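A loose sketch of what the payment side could look like (the function, names, and numbers below are all made up for illustration, and I'm ignoring gains, ie. no clawback when a neighbour's valuation goes up):

```python
def compensation_payments(valuations_before, valuations_expected):
    """Return how much the builder owes each neighbouring owner.

    Both arguments map owner -> property value; the "expected" values are what the
    prediction market forecasts the properties to be worth once the structure is built.
    Owners whose valuation is expected to rise get nothing (no clawback here).
    """
    payments = {}
    for owner, before in valuations_before.items():
        expected_loss = before - valuations_expected[owner]
        payments[owner] = max(0, expected_loss)
    return payments

# made-up example: the new structure is predicted to lower two neighbours' valuations
before   = {"alice": 300_000, "bob": 250_000, "carol": 400_000}
expected = {"alice": 280_000, "bob": 255_000, "carol": 390_000}
print(compensation_payments(before, expected))
# {'alice': 20000, 'bob': 0, 'carol': 10000}
```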
I wonder what fraction of people identify as “normies”
I wonder if most people have something niche they identify with and label people outside of that niche as “normies”
if so, then a more objective (and maybe better) term would be non-<whatever your thing is>
like, athletic people could use “non-athletic” instead of “normies” for that class of people
just a loose thought, probably obvious
some tree species self-selected for height (ie. there’s no point in being a tall tree unless taller trees are blocking your sunlight)
humans were not the first species to self-select (for humans, the trait being intelligence) (although humans can now do it intentionally, which is a qualitatively different level of “self-selection”)
on human self-selection: https://www.researchgate.net/publication/309096532_Survival_of_the_Friendliest_Homo_sapiens_Evolved_via_Selection_for_Prosociality
Board game: Medium
2 players reveal a card with a word, then they each need to say a word based on that and get points if they say the same word (basically; with some more complexities).
Example at 1m20 here: https://youtu.be/yTCUIFCXRtw?si=fLvbeGiKwnaXecaX
I’m glad past Mati cast a wider net, as the specifics for this year’s Schelling day are different ☺️☺️
idk if the events are often going over time, but I might pass by now if it’s still happening ☺️
I liked reading your article; very interesting! 🙏
One point I figured I should x-post with our DMs 😊 --> IMO, if one cares about future lives (as much as present ones), then the question stops really being about expected lives and starts just being about whether an action increases or decreases x-risks. I think for a lot/all of the tech you described, there’s also a probability of an x-risk if it’s not implemented. I don’t think we can really determine whether the probability of some of those x-risks is low enough in absolute terms, as the probabilities would need to be unreasonably low, which would lead to full paralysis, and full paralysis could itself lead to x-risk. I think instead someone with those values (ie. caring about unborn people) should compare the probability of x-risk if a tech gets developed vs not developed (or whatever else is being evaluated). 🙂
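To make the shape of that comparison concrete (with completely made-up numbers):

```python
# a tech can carry a non-negligible x-risk in absolute terms and still be worth
# developing if *not* developing it is riskier (numbers invented for illustration)
p_xrisk_if_developed = 0.010      # risk introduced by the tech itself
p_xrisk_if_not_developed = 0.015  # risk from the problems the tech would have mitigated

print("develop" if p_xrisk_if_developed < p_xrisk_if_not_developed else "don't develop")
# -> "develop", even though 1% is not "low" in absolute terms
```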
new, great, complementary post: Critical Questions about Patient Care in Cryonics and Biostasis
I love this story so much, wow! It feels so incredibly tailored to me (because it is 😄). I value that a lot! It’s a very scarce resource to begin with, but it hardly gets more tailored than that 😄
that’s awesome; thanks for letting me know :)
[Question] Which LessWrongers are (aspiring) YouTubers?
i’d be curious to know how the first event went if you’re inclined to share ☺
Private Biostasis & Cryonics Social
cars won’t replace horses, horses with cars will
Thanks for engaging with my post. I keep thinking about that question.
I’m not quite sure what you mean by “values and beliefs are perfectly correlated here”, but I’m guessing you mean they are “entangled”.
Ah yeah, that seems true for all systems (at least if you can only look at their behaviors and not their minds); ref.: Occam’s razor is insufficient to infer the preferences of irrational agents. Summary: in principle, any possible value system has some belief system that can lead to any given set of actions.
So, in principle, the cat classifier, looked at from the outside, could actually be a human mind wanting to live a flourishing human life, but with a decision-making process that’s so wrong that the human does nothing but say “cat” when they see a cat, thinking this will lead them to achieve all their deepest desires.
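Here’s a toy version of that degeneracy (my own construction, not from the paper): two opposite value systems, paired with different planners, produce exactly the same behavior, so behavior alone can’t tell them apart.

```python
def policy_from(values, planner, observation):
    """A "policy" here is just a planner applied to some values and an observation."""
    return planner(values, observation)

# pair 1: likes cats + a planner that maximizes its values
values_a  = {"cat": 1.0, "dog": -1.0}
planner_a = lambda v, obs: "say cat" if v[obs] > 0 else "say dog"

# pair 2: hates cats + an anti-rational planner that minimizes its values
values_b  = {"cat": -1.0, "dog": 1.0}
planner_b = lambda v, obs: "say cat" if v[obs] < 0 else "say dog"

for obs in ["cat", "dog"]:
    assert policy_from(values_a, planner_a, obs) == policy_from(values_b, planner_b, obs)
print("same behavior, opposite values")
```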
I think the paper says noisy errors would cancel each other out (?), but correlated errors wouldn’t go away. One way to deal with them would be to come up with “minimal normative assumptions”.
I guess that’s as relevant to the “value downloading” problem as it is to the “value (up)loading” one. (I just coined the term “value downloading” to refer to the problem of determining human values, as opposed to the problem of programming values into an AI.)
The solution-space for determining the values of an agent, at a high level, seems to be (I’m sure that’s too simplistic, and maybe even a bit confused, but just thinking out loud):
Look in their brain directly to understand their values (and maybe that also requires solving the symbol-grounding problem)
Determine their planner (ie. “decision-making process”) (ex.: using some interpretability methods), and determine their values from the policy and the planner
Make minimal normative assumptions about their reasoning errors and approximations to determine their planner from their behavior (/policy) (see the sketch after this list)
Augment them to make their planners flawless (I think your example fits into improving the planner by improving the image resolution—I love that thought 💡)
Ask the agent questions directly about their fundamental values, which doesn’t require any planning (?)
Approaches like “iterated amplification” correspond to some combination of the above.
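As a very simplified sketch of what the “minimal normative assumptions” approach could look like in the easiest setting I can think of (the Boltzmann-rationality assumption, the three options, and the grid search are all my own simplifications): assume the agent picks each option with probability proportional to the exponential of its value, then recover the values that best explain its observed choices.

```python
import math
from collections import Counter
from itertools import product

OPTIONS = ["cat", "dog", "bird"]

def choice_log_likelihood(values, observed_choices):
    # log-likelihood of the choices under a Boltzmann (softmax) choice rule
    z = sum(math.exp(values[o]) for o in OPTIONS)
    counts = Counter(observed_choices)
    return sum(n * (values[o] - math.log(z)) for o, n in counts.items())

def fit_values(observed_choices, grid=(-2, -1, 0, 1, 2)):
    # brute-force maximum likelihood over a small grid of candidate value functions
    best, best_ll = None, -math.inf
    for vs in product(grid, repeat=len(OPTIONS)):
        candidate = dict(zip(OPTIONS, vs))
        ll = choice_log_likelihood(candidate, observed_choices)
        if ll > best_ll:
            best, best_ll = candidate, ll
    return best

# an agent that mostly (but not always) picks "cat"
observed = ["cat"] * 8 + ["dog"] + ["bird"]
print(fit_values(observed))  # "cat" comes out ~2 higher than the rest (only differences matter)
```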
But going back to my original question, I think a similar way to put it is that I wonder how complex the concept of “preferences”/”wanting” is. Is it a (messy) concept that’s highly dependent on our evolutionary history (ie. not what we want, which definitely is, but the concept of wanting itself), or is it a concept that all alien civilizations use in exactly the same way as us? It seems like a fundamental concept, but can we define it in a fully reductionist (and concise) way? What’s the simplest example of something that “wants” things? What’s the simplest planner a wanting-thing can have? Is it no planner at all?
A policy seems well defined: it’s basically an input-output map. We’re intuitively thinking of a policy as a planner + an optimization target, so if either of the latter two can be defined robustly, then it seems like we should be able to define the other as well. Although maybe, for a given planner or optimization target, there are many possible optimization targets or planners that produce a given policy; maybe Occam’s razor would be helpful here (loose sketch below).
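A loose sketch of the Occam’s razor idea, using a crude compression-based proxy for simplicity that I’m making up on the spot (and note the paper cited above argues simplicity alone is insufficient):

```python
import zlib

def description_length(planner_description: str, values_description: str) -> int:
    # crude complexity proxy: length of the compressed description of the pair
    return len(zlib.compress((planner_description + values_description).encode()))

# two (planner, values) decompositions assumed to produce the same cat-saying policy
rational = ("pick the option your values rate highest", "cat: +1, dog: -1")
anti_rational = ("pick the option your values rate lowest, except when a long list of "
                 "special-case corrections flips the ranking back", "cat: -1, dog: +1")

pairs = {"rational": rational, "anti-rational": anti_rational}
simplest = min(pairs, key=lambda name: description_length(*pairs[name]))
print(simplest)  # "rational": the shorter decomposition wins under this proxy
```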
Relatedly, I also just read Reward is not the optimization target, which is relevant and overlaps a lot with ideas I wanted to write about (ie. neural-net-executors, not reward-maximizers, as a reference to Adaptation-Executers, not Fitness-Maximizers). A reward function R will only select a policy π that wants R if wanting R is the best way to achieve R in the environment the policy is being developed in. (I’m speaking loosely: technically not if it’s the “best” way, but just if it’s the way the weight-update function works.)
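A minimal sketch of that point (my own toy example, not from the post): in a REINFORCE-style update, the reward only ever appears as a multiplier on the weight update; the resulting policy is just a couple of numbers and doesn’t need to represent or “want” the reward at all.

```python
import math
import random

weights = [0.0, 0.0]              # one logit per action; this *is* the whole policy

def policy_probs(w):
    exps = [math.exp(x) for x in w]
    return [e / sum(exps) for e in exps]

def reward(action):               # the training signal: action 1 is "better"
    return 1.0 if action == 1 else 0.0

lr = 0.1
for _ in range(2000):
    probs = policy_probs(weights)
    action = random.choices([0, 1], weights=probs)[0]
    r = reward(action)
    for a in range(2):
        # gradient of log pi(action) with respect to each logit under a softmax policy
        grad_log = (1 - probs[a]) if a == action else -probs[a]
        weights[a] += lr * r * grad_log   # the reward only ever enters here
print(policy_probs(weights))      # ends up heavily favouring action 1
```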
Anyway, that’s a thread that seems valuable to pull more. If you have any other thoughts or pointers, I’d be interested 🙂