Jon Garcia

Karma: 644

I have a PhD in Computational Neuroscience from UCSD (Bachelor’s was in Biomedical Engineering with Math and Computer Science minors). Ever since junior high, I’ve been trying to figure out how to engineer artificial minds, and I’ve been coding up artificial neural networks ever since I first learned to program. Obviously, all my early designs were almost completely wrong/unworkable/poorly defined, but I think my experiences did prime my brain with inductive biases that are well suited for working on AGI.

Although I now work as a data scientist in R&D at a large medical device company, I continue to spend my free time studying the latest developments in AI/ML/DL/RL and neuroscience and trying to come up with models for how to bring it all together into systems that could actually be implemented. Unfortunately, I don’t seem to have much time to develop my ideas into publishable models, but I would love to have the opportunity to share ideas with those who do.

Of course, I’m also very interested in AI Alignment (hence the account here). My ideas on that front mostly fall into the “learn (invertible) generative models of human needs/goals and hook those up to the AI’s own reward signal” camp. I think methods of achieving alignment that depend on restricting the AI’s intelligence or behavior are about as destined for failure in the long term as Prohibition or the War on Drugs in the USA. We need a better theory of what reward signals are for in general (probably something to do with maximizing (minimizing) the attainable (dis)utility with respect to the survival needs of a system) before we can hope to model human values usefully. This could even extend to modeling the “values” of the ecological/socioeconomic/political supersystems in which humans are embedded or of the biological subsystems that are embedded within humans, both of which would be crucial for creating a better future.

Jon Garcia Mar 6, 2025, 6:23 PM
3 points
0
on: What Is The Alignment Problem?
Exercise: Do What I Mean (DWIM)
I haven’t thought much about what patterns need to hold in the environment in order for “do what I mean” to make sense at all. But it’s a natural next target in this list, so I’m including it as an exercise for readers: what patterns need to hold in the environment in order for “do what I mean” to make sense at all? Note that either necessary or sufficient conditions on such patterns can constitute marginal progress on the question.
As far as I can tell, DWIM will necessarily require other-agent modeling in some sort of predictive-coding framework. The “patterns in the environment” would be the correspondence between the actual state of the world and the representation of the desired goal state in the mind of the human, as well as between the trajectory taken to reach the goal state and the human’s own internal acceptance criteria.
A part of the AGI that is not hooked up to the reward signal would need to have a generative model of the human agent’s behavior, words, commands, etc., derived from a latent representation of their beliefs and desires. This latent representation is constantly updated to minimize prediction error derived from observation, verbal feedback, etc. (e.g., Human: “That’s not what I meant!” AGI: “Hmm, what must be going on inside their head to make them say that, given the state of the environment and prior knowledge about their preferences, and how does that differ from what I was assuming?”)
At the same time, the AGI needs to have some latent representation of the environment and the paths taken through it that uses (a linear mapping to) the same latent space it uses for representing the human’s desires. Correspondence can then be measured and optimized for directly.

Jon Garcia Jun 22, 2024, 4:58 PM
2 points
0
in reply to: Jon Garcia’s comment on: Stephen Fowler’s Shortform
Also, consider a more traditional optimization process, such as a neural network undergoing gradient descent. If, in the process of training, you kept changing the training dataset, shifting the distribution, you would in effect be changing the optimization target.

Each minibatch generates a different gradient estimate, and a poorly randomized ordering of the data could even lead to training in circles.

Changing environments are like changing the training set for evolution. Differential reproductive success (mean squared error) is the fixed cost function, but the gradient that the population (network backpropagation) computes at any generation (training step) depends on the particular set of environmental factors (training data in the minibatch).

Jon Garcia Jun 22, 2024, 4:40 PM
2 points
0
in reply to: Stephen Fowler’s comment on: Stephen Fowler’s Shortform
Evolution may not act as an optimizer globally, since selective pressure is different for different populations of organisms in different niches. However, it does act as an optimizer locally.

For a given population in a given environment that happens to be changing slowly enough, the set of all variations in each generation acts as a sort of numerical gradient estimate of the local fitness landscape. This allows the population as a whole to perform stochastic gradient descent. Those with greater fitness for the environment could be said to be lower on the local fitness landscape, so there is an ordering for that population.

In a sufficiently constant environment, evolution very much does act as an optimization process. Sure, the fitness landscape can change, even by organisms undergoing evolution (e.g. the Great Oxygenation Event of yester-eon, or the Anthropogenic Mass Extinction of today), which can lead to cycling. But many organisms do find very stable local minima of the fitness landscape for their species, like the coelacanth, horseshoe crab, cockroach, and many other “living fossils”. Humans are certainly nowhere near our global optimum, especially with the rapid changes to the fitness function wrought by civilization, but that doesn’t mean that there isn’t a gradient that we’re following.
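To make the "variation as a numerical gradient estimate" picture concrete, here is a minimal sketch, assuming a made-up one-dimensional cost landscape and simple truncation selection (both are illustrative assumptions, not a model of real biology). The names `fitness_cost` and `evolve` are hypothetical.

```python
import random

def fitness_cost(x):
    """Hypothetical cost landscape: lower is fitter, with a minimum at x = 3."""
    return (x - 3.0) ** 2

def evolve(generations=200, pop_size=50, sigma=0.1, seed=0):
    rng = random.Random(seed)
    mean = 0.0  # the population's average trait value
    for _ in range(generations):
        # Each generation's variation samples the landscape around the current mean.
        offspring = [mean + rng.gauss(0.0, sigma) for _ in range(pop_size)]
        # Selection: the lower-cost half survives to reproduce.
        survivors = sorted(offspring, key=fitness_cost)[: pop_size // 2]
        # The next mean shifts toward the survivors, i.e. roughly downhill.
        mean = sum(survivors) / len(survivors)
    return mean

final_mean = evolve()
print(round(final_mean, 2))
```

No individual ever computes a gradient here; the downhill drift comes entirely from which random variants survive. If the minimum of `fitness_cost` moved faster than the population mean can shift per generation, the population would lag behind it, which is the "changing slowly enough" caveat.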

Jon Garcia Jul 17, 2023, 8:39 PM
7 points
0
on: Conditional on living in a AI safety/alignment by default universe, what are the implications of this assumption being true?
I would expect that for model-based RL, the more powerful the AI is at predicting the environment and the impact of its actions on it, the less prone it becomes to Goodharting its reward function. That is, after a certain point, the only way to make the AI more powerful at optimizing its reward function is to make it better at generalizing from its reward signal in the direction that the creators meant for it to generalize.
In such a world, when AIs are placed in complex multiagent environments where they engage in iterated prisoner’s dilemmas, the more intelligent ones (those with greater world-modeling capacity) should tend to optimize for making changes to the environment that shift the Nash equilibrium toward cooperate-cooperate, ensuring more sustainable long-term rewards all around. This should happen automatically, without prompting, no matter how simple or complex the reward functions involved, whenever agents surpass a certain level of intelligence in environments that allow for such incentive-engineering.
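As a toy illustration of what "shifting the Nash equilibrium" could mean, here is a minimal sketch for a one-shot, two-player game. The payoff numbers and the `with_defection_penalty` mechanism are hypothetical choices for illustration; real incentive-engineering in a multiagent environment would be far messier.

```python
from itertools import product

ACTIONS = ("C", "D")  # cooperate, defect

def nash_equilibria(payoffs):
    """Return the pure-strategy Nash equilibria of a 2-player game.

    payoffs maps (row_action, col_action) -> (row_payoff, col_payoff).
    """
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        # Neither player can gain by unilaterally switching actions.
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in ACTIONS)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in ACTIONS)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria

# Standard prisoner's dilemma with illustrative payoff values.
dilemma = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def with_defection_penalty(payoffs, penalty):
    """Hypothetical environment change: whoever defects pays a penalty."""
    adjusted = {}
    for (a, b), (pa, pb) in payoffs.items():
        adjusted[(a, b)] = (pa - penalty * (a == "D"), pb - penalty * (b == "D"))
    return adjusted

before = nash_equilibria(dilemma)
after = nash_equilibria(with_defection_penalty(dilemma, penalty=3))
print(before)  # [('D', 'D')]
print(after)   # [('C', 'C')]
```

The agents' preferences are untouched; only the environment's payoff structure changes, and mutual cooperation becomes the stable outcome.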

Jon Garcia Jun 26, 2023, 8:50 PM
20 points
2
on: Another medical miracle
Disclaimer: I am not a medical doctor nor a nutritionist, just someone who researches nutrition from time to time.
I would be surprised if protein deficiency per se was the actual problem. As I understand it, many vegetables actually have a higher level of protein per calorie than meat (probably due to the higher fat content of the latter, which is more calorie dense), although obviously, there’s less protein per unit mass than meat (since vegetables are mostly cellulose and water). The point is, though, that if you were getting enough calories to function from whole, unrefined plant sources, you shouldn’t have had a protein deficiency. (Of course, you might have been eating a lot of highly processed “vegetarian” foods, in which case protein deficiency is not entirely out of the question.)
That being said, my guess is that you may be experiencing a nutritional deficiency either in sulfur or in vitamin D (the latter of which is a very common deficiency). Plant-derived proteins tend to have much lower levels of sulfur-containing amino acids (methionine, cysteine) than animal-derived proteins, and sulfur is an important component of cartilage (and of arthritis supplements). Both sulfur and vitamin D have been investigated for their role in musculoskeletal pain and other health issues (although from what I have read, results are more ambiguous for sulfur than for vitamin D with respect to musculoskeletal pain in particular). Eggs are particularly high in both sulfur (sulfur smell = rotten egg smell) and vitamin D, so if you were low on either one of those, it makes sense that eating a lot of eggs would have helped. It would be very interesting to test whether either high-sulfur vegetables (such as onions or broccoli) or vitamin D supplements would have a similar effect on your health.

Jon Garcia May 7, 2023, 10:34 PM
4 points
0
on: Residual stream norms grow exponentially over the forward pass

Due to LayerNorm, it’s hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger.

If I’m interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I’m still not quite clear on why LayerNorm doesn’t take care of this.

To avoid this phenomenon, one idea that springs to mind is to adjust how the residual stream operates. For a neural network module f, the residual stream works by creating a combined output: r(x)=f(x)+x

You seem to suggest that the model essentially amplifies the features within the neural network in order to overcome the large residual stream: r(x)=f(1.045*x)+x

However, what if instead of adding the inputs directly, they were first rescaled by a compensatory weight? r(x)=f(x)+(1/1.045)x≈f(x)+0.957x

It seems to me that this would disincentivize f from learning the exponentially growing feature scales. Based on your experience, would you expect this to eliminate the exponential growth in the norm across layers? Why or why not?
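For what it's worth, the arithmetic behind the question can be sketched with a scalar stand-in for the stream norm. This assumes (hypothetically) that each block adds an update aligned with the stream and 4.5% of its current size, and it ignores LayerNorm and any way training might change f in response, which is exactly the open part of the question. `stream_norm` and its parameters are made-up names.

```python
def stream_norm(num_layers, skip_scale, block_gain=0.045, start=1.0):
    """Track a scalar stand-in for the residual stream norm across layers.

    Assumption: each block contributes block_gain * (current norm), aligned
    with the stream, and the skip path is multiplied by skip_scale.
    """
    norm = start
    for _ in range(num_layers):
        norm = block_gain * norm + skip_scale * norm
    return norm

standard = stream_norm(48, skip_scale=1.0)          # r(x) = f(x) + x
rescaled = stream_norm(48, skip_scale=1.0 / 1.045)  # r(x) = f(x) + x/1.045
print(f"standard: {standard:.2f}, rescaled: {rescaled:.2f}")
```

Under these assumptions the per-layer factor drops from 1.045 to about 1.002, so the norm stays nearly flat over 48 layers. Whether a trained network would simply learn a larger block gain to compensate is not something this sketch can answer.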

Jon Garcia Apr 25, 2023, 3:37 AM
1 point
0
in reply to: Steven Byrnes’s comment on: Deep learning models might be secretly (almost) linear
If both images have the main object near the middle of the image or taking up most of the space (which is usually the case for single-class photos taken by humans), then yes. Otherwise, summing two images with small, off-center items will just look like a low-contrast, noisy image of two items.

Either way, though, I would expect this to result in class-label ambiguity. However, in some cases of semi-transparent-object-overlay, the overlay may end up mixing features in such a jumbled way that neither of the “true” classes is discernible. This would be a case where the almost-linearity of the network breaks down.

Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I’m just making stuff up here.

Jon Garcia Apr 24, 2023, 11:25 PM
7 points
0
in reply to: Steven Byrnes’s comment on: Deep learning models might be secretly (almost) linear
For an image-classification network, if we remove the softmax nonlinearity from the very end, then $X$ would represent the input image in pixel space, and $Y$ would represent the class logits. Then $f(x_{1} + x_{2}) \approx f(x_{1}) + f(x_{2})$ would represent an image with two objects leading to an ambiguous classification (high log-probability for both classes), and $f(kx) \approx k f(x)$ would represent higher class certainty (softmax temperature = $1/k$) when the image has higher contrast. I guess that kind of makes sense, but yeah, I think for real neural networks, this will only be linear-ish at best.
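The temperature claim is easy to check numerically. A minimal sketch, assuming the network were exactly homogeneous so that the logits for $kx$ are $k$ times the logits for $x$; the logit values and $k$ below are arbitrary illustrative numbers.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical class logits f(x)
k = 3.0                   # hypothetical contrast multiplier

baseline = softmax(logits)
scaled_input = softmax([k * z for z in logits])          # softmax(k * f(x))
low_temperature = softmax(logits, temperature=1.0 / k)   # softmax(f(x) / T), T = 1/k

print(max(baseline), max(scaled_input))
```

`scaled_input` and `low_temperature` agree up to floating-point error, and for $k > 1$ the top class probability is higher than in `baseline`.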

Jon Garcia Apr 22, 2023, 1:04 AM
6 points
1
on: Would we even want AI to solve all our problems?
I would say we want an ASI to view world-state-optimization from the perspective of a game developer. Not only should it create predictive models of what goals humans wish to achieve (from both stated and revealed preferences), but it should also learn to predict what difficulty level each human wants to experience in pursuit of those goals.
Then the ASI could aim to adjust the world into states where humans can achieve any goal they can think of when they apply a level of effort that would leave them satisfied in the accomplishment.
Humans don’t want everything handed to us for free, but we also don’t generally enjoy struggling for basic survival (unless we do). There’s a reason we pursue things like competitive sports and video games, even as we denounce the sort of warfare and power struggles that built those competitive instincts in the ancestral environment.
A safe world of abundance that still feels like we’ve fought for our achievements seems to fit what most people would consider “fun”. It’s what children expect in their family environment growing up, it’s what we expect from the games we create, and it’s what we should expect from a future where ASI alignment has been solved.

Jon Garcia Apr 17, 2023, 7:29 PM
1 point
0
in reply to: Brendan Long’s comment on: But why would the AI kill us?
I agree, hence the “if humanity never makes it to the long-term, this is a moot point.”

Jon Garcia Apr 17, 2023, 7:18 PM
3 points
2
on: But why would the AI kill us?
Last I checked, you can get about 10x as much energy from burning a square meter of biosphere as you can get by collecting a square meter of sunlight for a day.
Even if this is true, it’s only because that square meter of biosphere has been accumulating solar energy over an extended period of time. Burning biofuel may help accelerate things in the short term, but it will always fall short of long-term sustainability. Of course, if humanity never makes it to the long-term, this is a moot point.
Disassembling us for parts seems likely to be easier than building all your infrastructure in a manner that’s robust to whatever superintelligence humanity coughs up second.
It seems to me that it would be even easier for the ASI to just destroy all human technological infrastructure rather than to kill/disassemble all humans. We’re not much different biologically from what we were 200,000 years ago, and I don’t think 8 billion cavemen could put together a rival superintelligence anytime soon. Of course, most of those 8 billion humans depend on a global supply chain for survival, so this outcome may be just as bad for the majority.

Jon Garcia Apr 13, 2023, 5:44 PM
7 points
4
on: Trying AgentGPT, an AutoGPT variant
You heard the LLM, alignment is solved!

But seriously, it definitely has a lot of unwarranted confidence in its accomplishments.

I guess the connection to the real world is what will throw off such systems until they are trained on more real-world-like data.

I wouldn’t phrase it that it needs to be trained on more data. More like it needs to be retrained within an actual R&D loop. Have it actually write and execute its own code, test its hypotheses, evaluate the results, and iterate. Use RLHF to evaluate its assessments and a debugger to evaluate its code. It doesn’t matter whether this involves interacting with the “real world,” only that it learns to make its beliefs pay rent.

Anyway, that would help with its capabilities in this area, but it might be just a teensy bit dangerous to teach an LLM to do R&D like this without putting it in an air-gapped virtual sandbox, unless you can figure out how to solve alignment first.

Jon Garcia Apr 12, 2023, 7:56 AM
3 points
1
on: Gradient Descent in Activation Space: a Tale of Two Papers
“Activation space gradient descent” sounds a lot like what the predictive coding framework is all about. Basically, you compare the top-down predictions of a generative model against the bottom-up perceptions of an encoder (or against the low-level inputs themselves) to create a prediction error. This error signal is sent back up to modify the activations of the generative model, minimizing future prediction errors.
From what I know of Transformer models, it’s hard to tell exactly where this prediction error would be generated. Perhaps during few-shot learning, the model does an internal next-token prediction at every point along its input, comparing what it predicts the next token should be (based on the task it currently thinks it’s doing) against what the next token actually is. The resulting prediction error is fed “back” to the predictive model by being passed forward (via self-attention) to the next example in the input text, biasing the way it predicts next tokens in a way that would have given a lower error on the first example.
None of these predictions and errors would be visible unless you fed the input one token at a time and forced the hidden states to match what they were for the full input. A recurrent version of GPT might make that easier.
It would be interesting to see whether you could create a language model that had predictive coding built explicitly into its architecture, where internal predictions, error signals, etc. are all tracked at known locations within the model. I expect that interpretability would become a simpler task.
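As a minimal sketch of the predictive-coding loop described at the top of this comment, assume a one-dimensional linear generative model whose weight stays fixed, so that only the activation is updated. `infer_latent` and its parameters are hypothetical names, and this is a toy, not a claim about how Transformers do it.

```python
def infer_latent(observation, weight=2.0, steps=50, lr=0.1):
    """Gradient descent in activation space for a 1-D linear generative model.

    The weight stays fixed; only the latent activation is updated.
    """
    latent = 0.0
    for _ in range(steps):
        prediction = weight * latent      # top-down prediction of the input
        error = observation - prediction  # bottom-up prediction error
        # Gradient step on 0.5 * error**2 with respect to the latent.
        latent += lr * weight * error
    return latent

latent = infer_latent(observation=6.0)
print(latent)
```

The latent settles where the prediction matches the observation (here, near observation / weight). An architecture that exposed `prediction` and `error` at known locations like this would be the "explicit" version I have in mind.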

Jon Garcia Apr 9, 2023, 2:02 PM
6 points
3
on: Ng and LeCun on the 6-Month Pause (Transcript)
AI has gotten even faster and associated with that there are people that worry about AI, you know, fairness, bias, social economic displacement. There are also the further out speculative worries about AGI, evil sentient killer robots, but I think that there are real worries about harms, possible real harms today and possibly other harms in the future that people worry about.

It seems that the sort of AI risks most people worry about fall into one of a few categories:
1. AI/automation starts taking our jobs, amplifying economic inequalities.
2. The spread of misinformation will accelerate with deepfakes, fake news, etc. generated by malign humans using ever more convincing models.
3. 🤪 Evil sentient robots will take over the world and kill us all Terminator-style. 😏
It seems that a fourth option is not really prominent in the public consciousness: namely that powerful AI systems could end up destroying everything of value by accident when enough optimization pressure is applied toward any goal, no matter how noble. No robots or weapons are even required to achieve this. This oversight is a real PR problem for the alignment community, but it’s unfortunately difficult to explain why this makes sense as a real threat to the average person.

And I think, you know, thinking that somehow we’re smart enough to build those systems to be super intelligent and not smart enough to design good objectives so that they behave properly, I think is a very, very strong assumption that is, it’s just not, it’s very, it’s very low probability.

So close.

Jon Garcia Apr 9, 2023, 3:32 AM
4 points
2
on: Agentized LLMs will change the alignment landscape
Yep, ever since Gato, it’s been looking increasingly like you can get some sort of AGI by essentially just slapping some sensors, actuators, and a reward function onto an LLM core. I don’t like that idea.

LLMs already have a lot of potential for causing bad outcomes if abused by humans for generating massive amounts of misinformation. However, that pales in comparison to the destructive potential of giving GPT agency and setting it loose, even without idiots trying to make it evil explicitly.

I would much rather live in a world where the first AGIs weren’t built around such opaque models. LLMs may look like they think in English, but there is still a lot of black-box computation going on, with a strange tendency to switch personas partway through a conversation. That doesn’t bode well for steerability if such models are given control of an agent.

However, if we are heading for a world of LLM-AGI, maybe our priorities should be on figuring out how to route their models of human values to their own motivational schemas. GPT-4 probably already understands human values to a much deeper extent than we could specify with an explicit utility function. The trick would be getting it to care.

Maybe force the LLM-AGI to evaluate every potential plan it generates on how it would impact human welfare/society, including second-order effects, and to modify its plans to avoid any pitfalls it finds from a (simulated) human perspective. Do this iteratively until it finds no more conflict before it actually implements a plan. Maybe require actual verbal human feedback in the loop before it can act.

It’s not a perfect solution, but there’s probably not enough time to design a custom aligned AGI from scratch before some malign actor sets a ChaosGPT-AGI loose. A multipolar landscape is probably the best we can hope for in such a scenario.

Jon Garcia Apr 8, 2023, 9:37 PM
1 point
0
on: GPTs are Predictors, not Imitators
It seems to me that imitation requires some form of prediction in order to work. First make some prediction of the behavioral trajectory of another agent; then try to minimize the deviation of your own behavior from an equivalent trajectory. In this scheme, prediction constitutes a strict subset of the computational complexity necessary to enable imitation. How would GPT’s task flip this around?

And if prediction is what’s going on, in the much-more-powerful-than-imitation sense, what sort of training scheme would be necessary to produce pure imitation without also training the more powerful predictor as a prerequisite?

Jon Garcia Apr 5, 2023, 7:32 AM
11 points
4
on: Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds
First of all, I strongly agree that intelligence requires (or is exponentially easier to develop as) connectionist systems. However, I think that while big, inscrutable matrices may be unavoidable, there is plenty of room to make models more interpretable at an architectural level.
Well, I ask you—do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?
I have long thought that Transformer models are actually too general purpose for their own good. By that I mean that the $O(n^{2})$ operations they do, using all-to-all token comparisons for self-attention, are actually extreme overkill for what an LLM needs to do.
Sure, you can use this architecture for moving tokens around and building implicit parse trees and semantic maps and a bunch of other things, but all these functions are jumbled together in the same operations and are really hard to tease out. Recurrent models with well-partitioned internal states and disentangled token operations could probably do more with less. Sure, you can build a computer in Conway’s Game of Life (which is Turing-complete), but using a von Neumann architecture would be much easier to work with.
Embedded within Transformer circuits, you can find implicit representations of world models, but you could do even better from an interpretability standpoint by making such maps explicit. Give an AI a mental scratchpad that it depends on for reasoning (DALL-E, Stable Diffusion, etc. sort of do this already, except that the mental scratchpad is the output of the model [an image] rather than an internal map of conceptual/planning space), and you can probe that directly to see what the AI is thinking about.
Real brains tend to be highly modular, as Nathan Helm-Burger pointed out. The cortex maps out different spaces (visual, somatosensory, conceptual, etc.). The basal ganglia perform action selection and general information routing. The cerebellum fine-tunes top-down control signals. Various nuclei control global and local neuromodulation. And so on. I would argue that such modular constraints actually made it easier for evolution to explore the space of possible cognitive architectures.

Jon Garcia Apr 5, 2023, 12:26 AM
17 points
13
on: LW Team is adjusting moderation policy
Would it make sense to have a “Newbie Garden” section of the site? The idea would be to give new users a place to feel like they’re contributing to the community, along with the understanding that the ideas shared there are not necessarily endorsed by the LessWrong community as a whole. A few thoughts on how it could work:
- New users may be directed toward the Newbie Garden (needs a better name) if they try to make a post or comment, especially if a moderator deems their intended contribution to be low-quality. This could also happen by default for all users with karma below a certain threshold.
- New users are able to create posts, ask questions, and write comments with minimal moderation. Posts here won’t show up on the main site front page, but navigation to this area should be made easy on the sidebar.
- Voting should be as restricted here as on the rest of the site to ensure that higher-quality posts and comments continue trickling to the top.
- Teaching the art of rationality to new users should be encouraged. Moderated posts that point out trends and examples of cognitive biases and failures of rationality exhibited in recent newbie contributions, and that advise on how to correct for them in the future, could be pinned to the top of the Newbie Garden (still needs a better name). Moderated comments that serve a similar purpose could also be pinned to the top of comment sections of individual posts. This way, even heavily downvoted content could lead (indirectly) to higher quality contributions in the future.
- Newbie posts and questions with sufficient karma can be queued up for moderator approval to be posted to the main site.
I appreciate the high quality standards that have generally been maintained on LessWrong over the years, and I would like to see this site continue to act as both a beacon and an oasis of rationality.
But I also want people not to feel like they’re being excluded from some sort of elitist rationality club. Anyone should feel like they can join in the conversation as long as they’re willing to question their assumptions, receive critical feedback, and improve their ability to reason, about both what is true and what is good.

Jon Garcia Mar 23, 2023, 10:18 PM
22 points
24
in reply to: Ruby’s comment on: Alignment-related jobs outside of London/SF
Counterpoint:

If the alignment problem is the most important problem in history, shouldn’t alignment-focused endeavors be more willing to hire contributors who can’t/won’t relocate?

It’s not as if remote work isn’t easier to implement now than it has ever been in all of history.

Of course there needs to be some filtering out of candidates to ensure resources are devoted to the most promising individuals. But I really don’t think that willingness to move correlates strongly enough with competence at solving alignment to warrant treating it like a dealbreaker.

Jon Garcia Feb 10, 2023, 7:34 AM
1 point
0
on: On utility functions
No, utility functions are not a property of computer programs in general. They are a property of (a certain class of) agents.

A utility function is just a way for an agent to evaluate states, where positive values are good (for states the agent wants to achieve), negative values are bad (for states the agent wants to avoid), and neutral values are neutral (for states the agent doesn’t care about one way or the other). This mapping from states to utilities can be anything in principle: a measure of how close to homeostasis the agent’s internal state is, a measure of how many smiles exist on human faces, a measure of the number of paperclips in the universe, etc. It all depends on how you program the agent (or how our genes and culture program us).

Utility functions drive decision-making. Behavioral policies and actions that tend to lead to states of high utility will get positively reinforced, such that the agent will learn to do those things more often. And policies/actions that tend to lead to states of low (or negative) utility will get negatively reinforced, such that the agent learns to do them less often. Eventually, the agent learns to steer the world toward states of maximum utility.

Depending on how aligned an AI’s utility function is with humanity’s, this could be good or bad. It turns out that for highly capable agents, this tends to be bad far more often than good (e.g., maximizing smiles or paperclips will lead to a universe devoid of value for humans).

Nondeterminism really has nothing to do with this. Agents that can modify their own code could in principle optimize for their utility functions even more strongly than if they were stuck at a certain level of capability, but a utility function still needs to be specified in some way regardless.
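To illustrate "a mapping from states to utilities" driving decisions, here is a minimal sketch. It assumes a made-up homeostatic utility with a set point of 37.0, and it uses a greedy one-step planner rather than the reinforcement learning described above, purely to keep it short. All names are hypothetical.

```python
def utility(state):
    """Hypothetical utility: closeness to a homeostatic set point of 37.0."""
    return -abs(state - 37.0)

# Each action shifts the agent's internal state by a fixed amount.
ACTIONS = {"cool": -1.0, "stay": 0.0, "warm": 1.0}

def choose_action(state):
    # Pick the action whose resulting state has the highest utility.
    return max(ACTIONS, key=lambda name: utility(state + ACTIONS[name]))

def run(state, steps):
    for _ in range(steps):
        state += ACTIONS[choose_action(state)]
    return state

final_state = run(33.0, steps=10)
print(final_state)  # 37.0
```

Swap in a different `utility` (smiles, paperclips) and the same decision loop steers toward a completely different world state. The program is deterministic throughout, which is the sense in which nondeterminism is beside the point.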
