Thanks so much for the response, this is all clear now!
Joe Kwon
Sorry if it’s obvious from some other part of your post, but the whole premise is that sufficiently strong models *deployed in sufficiently complex environments* lead to general intelligence with optimization over various levels of abstraction. So why is it obvious that it doesn’t matter if your AI is only taught math, if it’s a glorified calculator; that any sufficiently powerful calculator desperately wants to be an optimizer?
If it’s only trained to solve arithmetic and there are no additional sensory modalities aside from the buttons on a typical calculator, how does increasing this AI’s compute/power lead to it becoming an optimizer over a wider domain than just arithmetic? Maybe I’m misunderstanding the claim, or maybe there’s an obvious reason I’m overlooking.
Also, what do you think of the possibility that when AI becomes superhuman++ at tasks, the representations go from interpretable to inscrutable again (because it uses lower-level representations that are inaccessible to humans)? I understand the natural abstraction hypothesis, and I buy it too, but even an epsilon increase in detail might compound into significantly different predictions if a causal model is trying to use tons of representations in conjunction to compute something complex.
Do you think it might be valuable to find a theoretical limit that shows that the amount of compute needed for such epsilon-details to be usefully incorporated is greater than ever will be feasible (or not)?
Hi Steve, loved this post! I’ve been interested in viewing the steering and thought generator + assessor submodule framework as the object and generator-of-values that we want AI to learn a good pointer to/representation of, in order to simulate out the complex+emergent human values and extrapolate values properly.
I know the way I’m thinking about the following doesn’t sit quite right with your perspective, because AFAIK, you don’t believe there need to be independent, modular value systems that give their own reward signals for different things (your steering subsystem and thought generator and assessor subsystem are working in tandem to produce a singular reward signal). I’d be interested in hearing your thoughts on what seems more realistic, after importing my model of value generators as more distinctive and independent modular systems in the brain.
In the past week, I’ve been thinking about the potential importance of considering human value generators as modular subsystems (for both compute and reward). Consider the possibility that, at various stages of the evolutionary neurocircuitry-shaping timeline of humans, modular subsystems emerged independently. E.g. one of the first systems, some “reptilian” vibe system, was one that rewarded sugary stuff because it was a good proxy at the time for nutritious/calorie-dense foods that help with survival. And then down the line, another system developed to reward feeling high social status, because it was a good proxy at the time for surviving as social animals in in-group tribal environments. What things would you critique about this view, and how would you fit similar core-gears into your model of the human value generating system?
I’m considering value generators as more independent and modular, because (this gets into a philosophical domain but) perhaps we want powerful optimizers to apply optimization pressure not towards the human values generated by our holistic reward system, but towards ones generated by specific subsystems (system 2, higher-order values, cognitive/executive control reward system) instead of the reptilian hedon-maximizing system.
This is a few-day-old, extremely crude and rough-around-the-edges idea, but I’d especially appreciate your input and critiques on this view. If it were promising enough, I wonder if (inspired by John Wentworth’s evolution of modularity post) training agents in a huge MMO environment and switching up reward signals in the environment (or the environment distribution itself) every few generations would lead to the development of modular reward systems (mimicking the trajectory of value generator systems developing in humans over the evolutionary timeline).
Enjoyed reading this! Really glad you’re getting good research experience and I’m stoked about the strides you’re making towards developing research skills since our call (feels like ages ago)! I’ve been doing a lot of what you describe as “directed research” myself lately as I’m learning more about DL-specific projects, and I’ve been learning much faster than when I was just doing cursory, half-assed paper skimming alongside my cogsci projects. Would love to catch up over a call sometime to talk about stuff we’re working on now.
Really appreciated this post and I’m especially excited for post 13 now! In the past month or two, I’ve been thinking about stuff like “I crave chocolate” and “I should abstain from eating chocolate” as being a result of two independent value systems (one whose policy was shaped by evolutionary pressure and one whose policy is… idk vaguely “higher order” stuff where you will endure higher states of cortisol to contribute to society or something).
I’m starting to lean away from this a little bit, and I think reading this post gave me a good idea of what your thoughts are, but it’d be really nice to get confirmation (and maybe clarification). Let me know if I should just wait for post 13. My prediction is that you believe there is a single (not dual) generator of human values, which are essentially moderated at the neurochemical level, like “level of dopamine/serotonin/cortisol”. And yet, this same generator, due to our sufficiently complex “thought generator”, can produce plans and thoughts such as “I should abstain from eating chocolate” even though eating it would be a dopamine hit in the short term, because it can simulate forward much further down the timeline, and believes that the overall neurochemical feedback will be better, on a longer time horizon, than caving in to eating chocolate. Is this correct?
If so, do you believe that because social/multi-agent navigation was essential to human evolution, the policy was heavily shaped by pressures from the social world, which means that even when you abstain from the chocolate, or endure pain and suffering for a “heroic” act, in the end, this can all still be attributed to the same system/generator that also sometimes has you eat sugary but unhealthy foods?
Given that my angle on attempting to contribute to AI Alignment is doing stuff to better elucidate what “human values” even are, I feel like I should try to resolve the competing ideas I’ve absorbed from LessWrong: 2 distinct value systems vs. a singular generator of values. This post was a big step for me in understanding how the latter idea can be consistent with the apparent contradictions between hedonistic and higher-level values.
In case anyone stumbles across this post in the future, I found these posts from the past both arguing for and against some of the worries I gloss over here. I don’t think my post boils down completely to merely “recommender systems should be better aligned with human interests”, but that is a big theme.
I’m also not sold on this specific part, and I’m really curious about what things support the idea. One reason I don’t think it’s good to rely on this as the default expectation though, is that I’m skeptical about humans’ abilities to even know what the “best experience” is in the first place. I wrote a short rambly post touching on, in some part, my worries about online addiction: https://www.lesswrong.com/posts/rZLKcPzpJvoxxFewL/converging-toward-a-million-worlds
Basically, I buy into the idea that there are two distinct value systems in humans. One subconscious system where the learning is mostly from evolutionary pressures, and one conscious/executive system that cares more about “higher-order values” which I unfortunately can’t really explicate. Examples of the former: craving sweets, addiction to online games with well-engineered artificial fulfillment. Example of the latter: wanting to work hard, even when it’s physically demanding or mentally stressful, to make some type of positive impact for broader society.
And I think today’s ML systems are asymmetrically exploiting the subconscious value system at the expense of the conscious/executive value system. Even knowing all this, I really struggle to overcome instances of akrasia: controlling my diet, not drowning myself in entertainment consumption, etc. I feel like there should be some kind of attempt to level the playing field, so to speak, in terms of which value system is allowed to thrive. At the very least, there should be transparency about this phenomenon for people who are interacting with powerful recommender (or just general) ML systems and, in the optimal case, complete agency and control over which value system you want to prioritize, and to what extent.
Converging toward a Million Worlds
Very interesting post!
1) I wonder what your thoughts are on how “disentangled” having a “dim world” perspective and being psychopathic are (completely “entangled” being: all psychopaths experience dim world and all who experience dim world are psychopathic). Maybe I’m also packing too many different ideas/connotations into the term “psychopathy”.
2) Also, the variability in humans’ local neuronal connections and “long-range” neuronal connections seems really interesting to me. My very unsupported, weak suspicion is that perhaps there is a correlation between these ratios (or maybe the pure # of each), and the natural ability to learn information and develop expertise in a very narrow domain of things (music, math?) vs. develop big new ideas where the concepts are largely formed from cross-domain, interdisciplinary thinking. Do you have any thoughts on this? Depending on what we believe for this, what we believe for question 1) has some very interesting implications, I think?
3) Finally, I wonder if the LessWrong community has a higher rate of “dim world” perspective-havers (or “psychopaths” in the narrowly defined sense of having lower thresholds for stimulation) than the base rate in the general population.
Just a small note that your ability to contribute via research doesn’t go from 0 now, to 1 after you complete a PhD! As in, you can still contribute to AI Safety with research during a PhD.
Thanks for posting this! I was wondering if you might share more about your “isolation-induced unusual internal information cascades” hypothesis/musings! Really interested in how you think this might relate to low-chance occurrences of breakthroughs/productivity.
My original idea (and great points against the intuition by Rohin)
“To me, it feels viscerally like I have the whole argument in mind, but when I look closely, it’s obviously not the case. I’m just boldly going on and putting faith in my memory system to provide the next pieces when I need them. And usually it works out.”
This closely relates to the kind of experience that makes me think about language as post hoc symbolic logic fitting to the neural computations of the brain. Which kinda inspired the hypothesis of a language model trained on a distinct neural net being similar to how humans experience consciousness (and gives the illusion of free will).
So, I thought it would be a neat proof of concept if GPT-3 served as a bridge between something like a chess engine’s actions and verbal/semantic level explanations of its goals (so that the actions are interpretable by humans). E.g. bishop to g5; this develops a piece and pins the knight to the king, so you can add additional pressure to the pawn on d5 (or something like this).
In response, Reiichiro Nakano shared this paper: https://arxiv.org/pdf/1901.03729.pdf
which kinda shows it’s possible to have agent state/action representations in natural language for Frogger. There are probably glaring/obvious flaws with my OP, but this was what inspired those thoughts. Apologies if this is really ridiculous; I’m maybe suggesting ML-related ideas prematurely & having fanciful thoughts. Will be studying ML diligently to help with that.
[Question] Partial-Consciousness as semantic/symbolic representational language model trained on NN
Joe Kwon’s Shortform
Thanks, I hadn’t thought about those limitations
For the basic features, I got used to navigating everything within an hour. I’ll be on the lookout for improvements to Roam or other note-taking programs like this.
Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if one assumes that a proxy for human representations (crude: ConceptNet; less crude: human-collected similarity judgments on visual features and classes) is a good enough proxy for the “relevant structures” (or at least that these representations capture the natural abstractions more faithfully than the best machines do in, for example, vision tasks, where human performance is the benchmark), right?
I had a similar idea about ontology mismatch identification via checking for isomorphic structures, and also realized I had no idea how to realize that idea. Through some discussions with Stephen Casper and Ilia Sucholutsky, we kind of pivoted the above idea into the regime of interpretability/adversarial robustness where we are hunting for interesting properties given that we can identify the biggest ways that humans and machines are representing things differently (and that humans, for now, are doing it “better”/more efficiently/more like the natural abstraction structures that exist).
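As a rough illustration of the kind of experiment described above, here is a minimal sketch that compares human similarity judgments with a model's embedding similarities and then lists the item pairs where the two disagree most. Everything in it is an assumption made for illustration: the names (`HUMAN_SIM`, `ALIGNED`, `MISALIGNED`, `compare`), the toy numbers, and the choice of Spearman rank correlation over pairwise similarities as the comparison measure. It is not the method from the discussions mentioned above.

```python
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def compare(human_sim, model_emb):
    """Return (Spearman rho, pairs sorted by largest rank disagreement)."""
    pairs = list(combinations(sorted(model_emb), 2))
    human = [human_sim[frozenset(p)] for p in pairs]
    model = [cosine(model_emb[a], model_emb[b]) for a, b in pairs]
    human_ranks, model_ranks = average_ranks(human), average_ranks(model)
    rho = pearson(human_ranks, model_ranks)  # Spearman = Pearson on ranks
    gaps = [abs(h - m) for h, m in zip(human_ranks, model_ranks)]
    mismatches = sorted(zip(pairs, gaps), key=lambda t: -t[1])
    return rho, mismatches


# Hypothetical human similarity judgments (0 = unrelated, 1 = identical).
HUMAN_SIM = {
    frozenset(("cat", "dog")): 0.90, frozenset(("car", "truck")): 0.85,
    frozenset(("cat", "car")): 0.05, frozenset(("cat", "truck")): 0.10,
    frozenset(("dog", "car")): 0.10, frozenset(("dog", "truck")): 0.15,
}
# Hypothetical model embeddings: one groups animals vs. vehicles, one does not.
ALIGNED = {"cat": [1.0, 0.0], "dog": [0.9, 0.1],
           "car": [0.0, 1.0], "truck": [0.1, 0.9]}
MISALIGNED = {"cat": [1.0, 0.0], "dog": [0.0, 1.0],
              "car": [0.9, 0.1], "truck": [0.1, 0.9]}

for name, emb in (("aligned", ALIGNED), ("misaligned", MISALIGNED)):
    rho, mismatches = compare(HUMAN_SIM, emb)
    print(name, round(rho, 3), "largest disagreement:", mismatches[0][0])
```

The sorted mismatch list is the part that connects to the interpretability/robustness framing: the pairs at the top are the places where the model's similarity structure departs most from the human proxy, which is where one might start hunting for interesting properties.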
I think I am working in the same building this summer (caught a split-second glance at you yesterday); I would love a chance to discuss how selection theorems might relate to an interpretability/adversarial robustness project I have been thinking about.