Jack O'Brien comments on Jack O’Brien’s Shortform

Jack O'Brien 25 Aug 2024 3:01 UTC
3 points
0
**Progress Report: AI Safety Fundamentals Project** This is a public space where I will post updates on my AI Safety Fundamentals project. The project will take 4 weeks. My goal is to stay lean and limit my scope so I can actually finish on time. I aim to update this post at least once per week, and maybe more often.

Overall, I want to work on agent foundations and the theory behind AI alignment agendas. One stepping stone toward this is selection theorems: a research program to find justifications that a given training process will result in a given agent property.

My plan for the AGISF project: a literature review on selection theorems. Take a whole load of concepts / blog posts, read them, and riff on them if I feel like it. At minimum, write a 1 paragraph summary of each post I'm interested in. List of posts:
- John’s original posts on selection theorems, and Adam Khoja’s distillation of them.
- Scott Garrabrant’s stuff on geometric rationality.
- Coherence theorems for utility theory.
- Evolutionary biology shallow dive and an explanation of Price's and Fisher's equations.
- Maybe some stuff by Thane Ruthenis.
- Some content from Jaynes’ Probability Theory about Bayesianism vs frequentism.
- Power seeking is instrumentally convergent in MDPs.
- ??? More examples to come once I read John’s original post.
TODO:
- Make an initial LessWrong progress report.
- Make a list of things to read.
- Make a git repo on my PC with Markdown and MathJax support. Populate the initial document with the list of things to read. For each thing I read, remove it from the TODO list and put its summary in the main body of the blog post. When I am done, any posts still left on the TODO list will get formatted and added as an ‘additional reading’ section.
- Jack O'Brien 8 Sep 2024 9:17 UTC
  3 points
  0
  Parent
  Well, I haven’t got much done in the last 2 weeks. Life has gotten in the way, and at the times when I thought I actually had the time and headspace to work on the project, things happened: my shoulder got injured playing sport, and my laptop mysteriously died.
  
  But I have managed to create a GitHub repo and read the original posts on selection theorems. My list of selection theorems to summarize has grown. Check out the GitHub page: https://github.com/jack-obrien/selection-theorems-review
  
  Tonight I will try to do at least an hour of solid work on it. I want to summarize the idea of selection theorems, summarize the good regulator theorem, and start reading the next post (probably Turner’s post on power seeking).
  - Jack O'Brien 15 Sep 2024 5:47 UTC
    3 points
    0
    Parent
    Ummmmm yeah, what have I done so far? I didn’t really get any solid work done this week either. I have decided, together with the other two people involved, to extend the project by another two weeks; we have all been pretty preoccupied with life. Last Sunday night I didn’t really do a solid hour of work. I did manage to summarise the concept of selection theorems and think about the agent type signature, a super fundamental concept I will be referring to throughout the post. Tonight I will hopefully actually meet with my group. I wanna do like half an hour of work before, and a little bit after too. This week I want to summarise the good regulator theorem as well as Turner’s post on power seeking.