Contest: $1,000 for good questions to ask to an Oracle AI
Edit: contest closed now, will start assessing the entries.
The contest
I’m offering $1,000 for good questions to ask of AI Oracles. Good questions are those that are safe and useful: they allow us to get information out of the Oracle without increasing risk.
To enter, put your suggestion in the comments below. The contest ends at the end[1] of the 31st of August, 2019.
Oracles
A perennial suggestion for a safe AI design is the Oracle AI: an AI confined to a sandbox of some sort, that interacts with the world only by answering questions.
This is, of course, not safe in general; an Oracle AI can influence the world through the contents of its answers, allowing it to potentially escape the sandbox.
Two of the safest designs seem to be the counterfactual Oracle, and the low bandwidth Oracle. These are detailed here, here, and here, but in short:
A counterfactual Oracle is one whose objective function (or reward, or loss function) is only non-trivial in worlds where its answer is not seen by humans. Hence it has no motivation to manipulate humans through its answer.
A low bandwidth Oracle is one that must select its answers off a relatively small list. Though this answer is a self-confirming prediction, the negative effects and potential for manipulation are restricted because there are only a few possible answers available.
Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren’t asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.
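To make the episodic counterfactual setup concrete, here is a minimal sketch of a single episode. Everything in it is an assumption made for illustration: the function name, the numeric answer, the squared-error reward and the explicit erasure probability are not taken from the Oracle papers.

```python
import random
from typing import Callable, Optional, Tuple


def run_counterfactual_episode(
    oracle: Callable[[str], float],
    question: str,
    observe_outcome: Callable[[], float],
    erasure_prob: float,
    rng: random.Random,
) -> Tuple[Optional[float], float]:
    """Run one episode; return (answer shown to humans, reward)."""
    answer = oracle(question)
    if rng.random() < erasure_prob:
        # Erasure: nobody reads the answer. The world unfolds without it,
        # and the oracle is scored on how well it predicted that world.
        outcome = observe_outcome()
        return None, -(answer - outcome) ** 2  # assumed squared-error reward
    # No erasure: humans read the answer, but the reward is a constant,
    # so the content of the answer cannot change it.
    return answer, 0.0
```

The point of the sketch is only that the reward depends on the answer in the erasure branch and nowhere else.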
Getting useful answers
The counterfactual and low bandwidth Oracles are safer than unrestricted Oracles, but this safety comes at a price. The price is that we can no longer “ask” the Oracle any question we feel like, and we certainly can’t have long discussions to clarify terms and so on. For the counterfactual Oracle, the answer might not even mean anything real to us—it’s about another world, that we don’t inhabit.
Despite this, it’s possible to get a surprising amount of good work out of these designs. To give one example, suppose we want to fund one of a million projects on AI safety, but are unsure which one would perform better. We can’t directly ask either Oracle, but there are indirect ways of getting advice:
We could ask the low bandwidth Oracle which team A we should fund; we then choose a team B at random, and reward the Oracle if, at the end of a year, we judge A to have performed better than B.
The counterfactual Oracle can answer a similar question, indirectly. We commit that, if we don’t see its answer, we will select team A and team B at random and fund them for a year, and compare their performance at the end of the year. We then ask for which team A[2] it expects to most consistently outperform any team B.
Both these answers get around some of the restrictions by deferring to the judgement of our future or counterfactual selves, averaged across many randomised universes.
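The low bandwidth version of the funding question can be sketched as below. This is a toy under stated assumptions: `low_bandwidth_reward` and `judged_better` are hypothetical names, the judgement at the end of the year is reduced to a boolean callback, and the reward is 1 or 0.

```python
import random
from typing import Callable, Sequence


def low_bandwidth_reward(
    teams: Sequence[str],
    oracle_choice: str,
    judged_better: Callable[[str, str], bool],
    rng: random.Random,
) -> float:
    """Reward for one episode of the 'which team A should we fund?' question."""
    if oracle_choice not in teams:
        # Low bandwidth: the answer must come from the fixed list.
        raise ValueError("answer must be one of the listed teams")
    rivals = [t for t in teams if t != oracle_choice]
    team_b = rng.choice(rivals)  # team B is picked at random by us
    # At the end of the year we judge A against B and reward accordingly.
    return 1.0 if judged_better(oracle_choice, team_b) else 0.0
```

Because team B is random, the Oracle maximises expected reward by naming the team it expects to beat the most rivals, which is the indirect advice we wanted.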
But can we do better? Can we do more?
Your better questions
This is the purpose of this contest: for you to propose ways of using either Oracle design to get the most safe-but-useful work.
So I’m offering $1,000 for interesting new questions we can ask of these Oracles. Of this:
$350 for the best question to ask a counterfactual Oracle.
$350 for the best question to ask a low bandwidth Oracle.
$300 to be distributed as I see fit among the non-winning entries; I’ll be mainly looking for innovative and interesting ideas that don’t quite work.
Exceptional rewards go to those who open up a whole new category of useful questions.
Questions and criteria
Put your suggested questions in the comments below. Because of the illusion of transparency, it is better to explain more rather than less (within reason).
Comments that are submissions must be on their separate comment threads, start with “Submission”, and you must specify which Oracle design you are submitting for. You may submit as many as you want; I will still delete them if I judge them to be spam. Anyone can comment on any submission. I may choose to ask for clarifications on your design; you may also choose to edit the submission to add clarifications (label these as edits).
It may be useful for you to include details of the physical setup, what the Oracle is trying to maximise/minimise/predict and what the counterfactual behaviour of the Oracle’s human users is assumed to be (in the counterfactual Oracle setup). Explanations as to how your design is safe or useful could be helpful, unless it’s obvious. Some short examples can be found here.
EDIT after seeing some of the answers: decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.
[1] A note on timezones: as long as it’s still the 31st of August anywhere in the world, your submission will be counted.
[2] These kinds of conditional questions can be answered by a counterfactual Oracle; see the paper here for more details.
Some assorted thoughts that might be useful for thinking about questions and answers:
a question is a schema with a blank to be filled in by the answerer after evaluation of the meaning of the question.
shared context is inferred as most questions are underspecified (domain of question, range of answers)
a few types of questions:
narrow down the field within which I have to search either by specifying a point or specifying a partition of the search space
question about specificities of variants: who where when
question about the invariants of a system: what, how
question about the backwards facing causation
question about the forward facing causation
meta questions about question schemas
what do we want a mysteriously powerful answerer to do?
zoom in on optimal points in intractably large search spaces
eg specific experiments to run to most easily invalidate major scientific questions
specify search spaces we don’t know how to parameterize
eg human values
back chain from types of answers to infer taxonomy of questions
an explanation relative to a prediction:
a prediction returns the future state of the system
an explanation returns a more compact than previously held causal explanation of the system, though it might still not generate sufficiently high resolution predictions
How to detect ontology errors using questions?
is this question malformed? if so, what are some alternative ways of framing the question that could return interesting answers
ie the implied search space of the question was not correct and can either be extended or transformed in this way
types of questions are types of search spaces
questions that change weightings on factors vs describe new factors
recognition of the many connection types in the human semantic network
recursive questions move along one dimension as they restrict the space. ie where is that at different spatial resolutions
qualitative and quantitative dimensions along which a query can be moved exchanges information about the implied search space
under vs overspecified questions
closed and open search spaces
navigational questions only work under the one off assumption if the search is stateless
termination guarantees
completeness guarantees
optimality guarantees
meta questions about the methods the system uses to navigate intractable spaces
time vs space vs....?
generating candidates is easy, checking is hard and vice versa
unknown metadata for answers
bias variance trade off
failure mode mapping, do failures imply directionality?
eg does a failure of a candidate change which candidate you go to next (stateful)
how to think about attack surfaces for question answer systems
can this get a human to run arbitrary code by counterfactually cooperating with itself on what step of the process it is on? Can this be tested by going through the whole process with human A then scrambling the steps and running through the same thing with human B and seeing if answers diverge?
what clever ways do hackers widen restricted bit streams?
Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn’t see the Oracle’s answer). In that case, reward function is computed as similarity between the predicted posts and the actual top posts on AF as ranked by karma, with similarity computed using some ML model.
This seems to potentially significantly accelerate AI safety research while being safe since it’s just showing us posts similar to what we would have written ourselves. If the ML model for measuring similarity isn’t secure, the Oracle might produce output that attacks the ML model, in which case we might need to fall back to some simpler way to measure similarity.
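As an illustration of the "simpler way to measure similarity" fallback, here is a rough sketch of the reward computation using bag-of-words cosine similarity. The similarity measure and the aggregation (average best match per predicted post) are assumptions chosen for the example, not part of the submission.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a simple stand-in for an ML model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def episode_reward(predicted: List[str], actual_top: List[str]) -> float:
    """Average, over predicted posts, of the best match among actual top posts.

    Only meant to be computed in the erasure case, once the real top posts
    for the period are known.
    """
    if not predicted or not actual_top:
        return 0.0
    best = [max(cosine_similarity(p, a) for a in actual_top) for p in predicted]
    return sum(best) / len(best)
```

A measure this crude is easy to game by keyword stuffing, so it trades usefulness for a smaller attack surface.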
It looks like my entry is pretty close to the ideas of Human-in-the-counterfactual-loop and imitation learning and apprenticeship learning. Questions:
Stuart, does it count against my entry that it’s not actually a very novel idea? (If so, I might want to think about other ideas to submit.)
What is the exact relationship between all these ideas? What are the pros and cons of doing human imitation using this kind of counterfactual/online-learning setup, versus other training methods such as GAN (see Safe training procedures for human-imitators for one proposal)? It seems like there are lots of posts and comments about human imitations spread over LW, Arbital, Paul’s blog and maybe other places, and it would be really cool if someone (with more knowledge in this area than I do) could write a review/distillation post summarizing what we know about it so far.
I encourage you to submit other ideas anyway, since your ideas are good.
Not sure yet about how all these things relate; will maybe think of that more later.
What if another AI would have counterfactually written some of those posts to manipulate us?
If that seems a realistic concern during the time period that the Oracle is being asked to predict, you could replace the AF with a more secure forum, such as a private forum internal to some AI safety research team.
This seems incredibly dangerous if the Oracle has any ulterior motives whatsoever. Even – nay, especially – the ulterior motive of future Oracles being better able to affect reality to better resemble their provided answers.
So, how can we prevent this? Is it possible to produce an AI with its utility function as its sole goal, to the detriment of other things that might… increase utility, but indirectly? (Is there a way to add a “status quo” bonus that won’t hideously backfire, or something?)
(I’m still confused and thinking about this, but figure I might as well write this down before someone else does. :)
While thinking more about my submission and counterfactual Oracles in general, this class of ideas for using CO is starting to look like trying to implement supervised learning on top of RL capabilities, because SL seems safer (less prone to manipulation) than RL. Would it ever make sense to do this in reality (instead of just doing SL directly)?
Submission for a counterfactual oracle: precommit that, if the oracle stays silent, a week from now you’ll try to write the most useful message to your past self, based on what happens in the world during that week. Ask the oracle to predict that message. This is similar to existing solutions, but slightly more meta, because the content of the message is up to your future self—it could be lottery numbers, science papers, disaster locations, or anything else that fits within the oracle’s size limit. (If there’s no size limit, just send the whole internet.)
You could also form a bucket brigade to relay messages from further ahead, but that’s a bad idea. If the oracle’s continued silence eventually leads to an unfriendly AI, it can manipulate the past by hijacking your chain of messages and thus make itself much more likely. The same is true for all high-bandwidth counterfactual oracles—they aren’t unfriendly in themselves, but using them creates a thicket of “retrocausal” links that can be exploited by any potential future UFAI. The more UFAI risk grows, the less you should use oracles.
I feel like this is about equally meta as my “Superintelligent Agent” submission, since my committee could output “Show the following message to the operator: …” and your message could say “I suggest that you perform the following action: …”, so the only difference between your idea and mine is that in my submission the output of the Oracle is directly coupled to some effectors to let the agent act faster, and yours has a (real) human in the loop.
Hmm, good point. I guess Chris Leong made a similar point, but it didn’t sink in until now how general the concern is. This seems to affect Paul’s counterfactual oversight idea as well, and maybe other kinds of human imitations and predictors/oracles, as well as things that are built using these components like quantilizers and IDA.
Thinking about this some more, all high-bandwidth oracles (counterfactual or not) risk receiving messages crafted by future UFAI to take over the present. If the ranges of oracles overlap in time, such messages can colonize their way backwards from decades ahead. It’s especially bad if humanity’s FAI project depends on oracles—that increases the chance of UFAI in the world where oracles are silent, which is where the predictions come from.
One possible precaution is to use only short-range oracles, and never use an oracle while still in prediction range of any other oracle. But that has drawbacks: 1) it requires worldwide coordination, 2) it only protects the past. The safety of the present depends on whether you’ll follow the precaution in the future. And people will be tempted to bend it, use longer or overlapping ranges to get more power.
In short, if humanity starts using high-bandwidth oracles, that will likely increase the chance of UFAI and hasten it. So such oracles are dangerous and shouldn’t be used. Sorry, Stuart :-)
Note that in the case of counterfactual oracle, this depends on UFAI “correctly” solving counterfactual mugging (i.e., the UFAI has to decide to pay some cost in its own world to take over a counterfactual world where the erasure event didn’t occur).
This seems too categorical. Depending on the probabilities of various conditions, using such oracles might still be the best option in some circumstances.
Some thoughts on that idea: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysing-dangerous-messages-from-future-ufai-via-oracles
Yeah, agreed on both points.
Some thoughts on this idea, thanks for it: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysing-dangerous-messages-from-future-ufai-via-oracles
Very worthwhile concern, and I will think about it more.
In case of erasure, you should be able to get enough power to prevent another UFAI summoning session.
Sure, in case of erasure you can decide to use oracles less, and compensate your clients with money you got from “erasure insurance” (since that’s a low probability event). But that doesn’t seem to solve the problem I’m talking about—UFAI arising naturally in erasure-worlds and spreading to non-erasure-worlds through oracles.
The problem you were talking about seemed to rely on bucket brigades. I agree that UFAIs jumping back a single step is a fair concern. (Though I guess you could counterfactually have enough power to halt AGI research completely...) I’m trying to address it elsethread. :)
Ah, sorry, you’re right. To prevent bucket brigades, it’s enough to stop using oracles for N days whenever an N-day oracle has an erasure event, and the money from “erasure insurance” can help with that. When there are no erasure events, we can use oracles as often as we want. That’s a big improvement, thanks!
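The cooldown rule can be sketched as a simple check. The boundary convention (the block covers the erasure day and the following N-1 days) and the name `oracle_use_allowed` are assumptions for illustration.

```python
from typing import List, Tuple


def oracle_use_allowed(day: int, erasures: List[Tuple[int, int]]) -> bool:
    """Return True if asking any oracle on `day` respects the cooldown rule.

    `erasures` holds (erasure_day, horizon_days) pairs: an N-day oracle that
    had an erasure on `erasure_day` blocks all oracle use for N days.
    """
    for erasure_day, horizon in erasures:
        if 0 <= day - erasure_day < horizon:
            return False  # still inside that oracle's prediction range
    return True
```

With no erasure events the list is empty and the check always passes, matching "we can use oracles as often as we want".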
Good idea.
Yeah. And low-bandwidth oracles can have a milder version of the same problem. Consider your “consequentialist” idea: if UFAI is about to arise, and one of the offered courses of action leads to UFAI getting stopped, then the oracle will recommend against that course of action (and for some other course where UFAI wins and maxes out the oracle’s reward).
Submission. “Superintelligent Agents.” For the Counterfactual Oracle, ask the Oracle to predict what action(s) a committee of humans would recommend doing next (which may include submitting more queries to the Oracle), then perform that action(s).
The committee, by appropriate choice of recommendations, can implement various kinds of superintelligent agents. For example, by recommending the query “What would happen if the next action is X?” (in the event of erasure, actually do X and record or have the committee write up a description of the consequences as training data) (ETA: It may be better to have the committee assign a numerical score, i.e., utility, to the consequences instead.) a number of times for different X, followed by the query “What would the committee recommend doing next, if it knew that the predicted consequences for the candidate actions are as follows: …” (in the event of erasure, let physical committee members read the output of the relevant previous queries and then decide what to do), it would in effect implement a kind of quantilizer. If IDA can be implemented using Counterfactual Oracles (as evhub suggested), then the committee can choose to do that as well.
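For readers unfamiliar with the term, a quantilizer picks at random among the top fraction of candidate actions instead of always taking the maximum. The sketch below is a generic toy version, not the committee procedure itself: it assumes the candidates were already drawn from some trusted base distribution, and `score` stands in for the committee's numerical score of the predicted consequences.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    candidates: Sequence[A],
    score: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Pick uniformly at random among the top q fraction of candidates."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(candidates, key=score, reverse=True)
    keep = max(1, math.ceil(q * len(ranked)))  # size of the top-q set
    return rng.choice(ranked[:keep])
```

Setting q close to 0 recovers plain maximisation, while q = 1 just samples from the base distribution.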
My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle’s answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way) and then reward the oracle by using an automated system (e.g. a language model) to compare the human’s answer to that of the oracle.
More precisely, let
$M: \mathcal{Q} \to \mathcal{A}$ be the counterfactual oracle,
$H(Q \mid M)$ be the human’s answer to question $Q \in \mathcal{Q}$ when given the ability to call $M$ on any question other than $Q$, and
$d: \mathcal{A} \times \mathcal{A} \to \mathbb{R}$ be some distance metric on answers in natural language (it’s not that hard to make something like this, even with current ML tools).
Then, reward $M$ as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let $L_M = d(H(Q \mid M), M(Q))$, where $M(Q)$ is hidden from $H$ and judged only by $d$, as in the standard counterfactual oracle setup.
(Of course, this doesn’t actually work because it has no guarantees with respect to inner alignment, but I think it has a pretty good shot of being outer aligned.)
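A toy rendering of this scheme is sketched below, with several assumptions that go beyond the description: answers are strings, each call has an independent erasure probability, and when a subquestion is erased the asker receives the nested human's answer. The names `ask` and `consult` are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Oracle = Callable[[str], str]
# The human gets the question plus a callback for consulting on subquestions.
Human = Callable[[str, Callable[[str], str]], str]
Distance = Callable[[str, str], float]


def ask(
    question: str,
    oracle: Oracle,
    human: Human,
    distance: Distance,
    erasure_prob: float,
    rng: random.Random,
    losses: List[Tuple[str, float]],
) -> str:
    """Answer `question`; on erasure, record the loss d(H(Q|M), M(Q))."""
    oracle_answer = oracle(question)
    if rng.random() >= erasure_prob:
        return oracle_answer  # no erasure: answer is used, no loss computed

    def consult(subquestion: str) -> str:
        if subquestion == question:
            raise ValueError("H may call M on any question other than Q")
        # Subcalls get the same erasure handling as the top-level call.
        return ask(subquestion, oracle, human, distance, erasure_prob, rng, losses)

    human_answer = human(question, consult)  # M(Q) stays hidden from H
    losses.append((question, distance(human_answer, oracle_answer)))
    return human_answer
```

Note that with a high erasure probability the recursion only terminates if the human eventually stops asking subquestions.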
Is it safe to ask the Oracle a subquestion in the event of erasure? Aren’t you risking having the Oracle produce an answer that is (in part) optimized to make it easier to predict the answer to the main question, instead of just the best prediction of how the human would answer that subquestion? (Sorry if this has already been addressed during a previous discussion of counterfactual oracles, because I haven’t been following it closely.)
I’m not sure I understand the concern. Isn’t the oracle answering each question to maximize its payoff on that question in event of an erasure? So it doesn’t matter if you ask it other questions during the evaluation period. (If you like, you can say that you are asking them to other oracles—or is there some way that an oracle is a distinguished part of the environment?)
If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don’t we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it’s a problem regardless.
I guess you can construct a model where the oracle does what you want, but only if you don’t ask any other oracles questions during the evaluation period, but it’s not clear to me how you would end up in that situation and at that point it seems worth trying to flesh out a more precise model.
Yeah, I’m not sure I understand the concern either, hence the tentative way in which I stated it. :) I think your objection to my concern is a reasonable one and I’ve been thinking about it myself. One thing I’ve come up with is that with the nested queries, the higher level Oracles could use simulation warfare to make the lower level Oracles answer the way that they “want”, whereas the same thing doesn’t seem to be true in the sequential case (if we make it so that in both cases each Oracle cares about just performance on the current question).
I mean, if the oracle hasn’t yet looked at the question they could use simulation warfare to cause the preceding oracles to take actions that lead to them getting given easier questions. Once you start unbarring all holds, stuff gets wild.
Yes, but if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case.
Also, aside from simulation warfare, another way that the nested case can be manipulative and the sequential case not is if each Oracle cares about doing well on a fixed distribution of inputs (as opposed to doing well “on the current question” or “in the real world” or “on the actual questions that it gets”). That’s because in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn’t improve performance on any particular given input. In the nested case, performance on given inputs does increase.
Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?
Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.
I was assuming each “question” actually includes as much relevant history as we can gather about the world, to make the Oracle’s job easier, and in particular it would include all previous Oracle questions/answers, in which case if Oracle #1 does X to make question #2 easier, it was already that easy because the only world in which question #2 gets asked is one in which Oracle #1 did X. But now I realize that’s not actually a safe assumption because Oracle #1 could break out of its box and feed Oracle #2 a false history that doesn’t include X.
My point about “if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case” still stands though, right?
You may well be right about this, but I’m not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?
What I want: “There is a model in the class that has property P. Training will find a model with property P.”
What I don’t want: “The best way to get a high reward is to have property P. Therefore a model that is trying to get a high reward will have property P.”
Example of what I don’t want: “Manipulative actions don’t help get a high reward (at least for the episodic reward function we intended), so the model won’t produce manipulative actions.”
So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:
On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?
ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that’s a useful direction to think?
This is an objection to reasoning from incentives, but it’s stronger in the case of some kinds of reasoning from incentives (e.g. where incentives come apart from “what kind of policy would be selected under a plausible objective”). It’s hard for me to see how nested vs. sequential really matters here.
(I don’t think model class is going to matter much.)
I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure.
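A minimal sketch of that default, assuming a finite set of candidate models, scalar inputs and outcomes, and squared error as the accuracy measure (all of which are illustrative choices):

```python
from typing import Callable, Dict, List, Tuple

# (input, observed outcome, whether the answer was erased that round)
Record = Tuple[float, float, bool]


def select_model(
    models: Dict[str, Callable[[float], float]],
    history: List[Record],
) -> str:
    """Return the name of the model with the lowest loss on erasure rounds."""
    erased = [(x, y) for x, y, was_erased in history if was_erased]
    if not erased:
        raise ValueError("no erasure data to evaluate on yet")

    def total_loss(name: str) -> float:
        return sum((models[name](x) - y) ** 2 for x, y in erased)

    return min(models, key=total_loss)
```

Rounds without an erasure are simply ignored here, however well a model happens to fit them.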
(Though I don’t think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart’s posts about “forward-looking” vs. “backwards-looking” oracles?)
I think it’s also interesting to imagine internal RL (e.g. there are internal randomized cognitive actions, and we use REINFORCE to get gradient estimates—i.e. you try to increase the probability of cognitive actions taken in rounds where you got a lower loss than predicted, and decrease the probability of actions taken in rounds where you got a higher loss), which might make the setting a bit more like the one Stuart is imagining.
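The REINFORCE-style update described here can be sketched for a softmax policy over a handful of discrete "cognitive actions". The softmax parameterisation, the learning rate and the function names are assumptions for the example; the predicted loss plays the role of the baseline.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(
    logits: List[float],
    action: int,
    loss: float,
    predicted_loss: float,
    lr: float = 0.1,
) -> List[float]:
    """One REINFORCE update on the logits of a softmax policy."""
    probs = softmax(logits)
    # Positive when the round went better (lower loss) than predicted.
    advantage = predicted_loss - loss
    # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
    return [
        logit + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]
```

An action taken in a round with lower loss than predicted becomes more likely, and one taken in a worse-than-predicted round becomes less likely.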
Seems like the counterfactual issue doesn’t come up in the Opt case, since you aren’t training the algorithm incrementally; you’d just collect a relevant dataset before you started training. I think the Opt setting throws away too much for analyzing this kind of situation, and would want to do an online learning version of Opt (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far).
This seems to ignore regularizers that people use to try to prevent overfitting and to make their models generalize better. Isn’t that liable to give you bad intuitions versus the actual training methods people use and especially the more advanced methods of generalization that people will presumably use in the future?
I don’t understand what you mean in this paragraph (especially “since each possible parameter setting is being evaluated on what other parameter settings say anyway”), even after reading Stuart’s post, plus Stuart has changed his mind and no longer endorses the conclusions in that post. I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart’s reasons for changing his mind? (Or talk to him again and get him to write the post for you. :)
Couldn’t you simulate that with Opt by just running it repeatedly?
“The best model” is usually regularized. I don’t think this really changes the picture compared to imagining optimizing over some smaller space (e.g. the space of models with regularizer < x). In particular, I don’t think my intuitions are sensitive to the difference.
The normal procedure is: I gather data, and am using the model (and other ML models) while I’m gathering data. I search over parameters to find the ones that would make the best predictions on that data.
I’m not finding parameters that result in good predictive accuracy when used in the world. I’m generating some data, and then finding the parameters that make the best predictions about that data. That data was collected in a world where there are plenty of ML systems (including potentially a version of my oracle with different parameters).
Yes, the normal procedure converges to a fixed point. But why do we care / why is that bad?
I take a perspective where I want to use ML techniques (or other AI algorithms) to do useful work, without introducing powerful optimization working at cross-purposes to humans. On that perspective I don’t think any of this is a problem (or if you look at it another way, it wouldn’t be a problem if you had a solution that had any chance at all of working).
I don’t think Stuart is thinking about it in this way, so it’s hard to engage at the object level, and I don’t really know what the alternative perspective is, so I also don’t know how to engage at the meta level.
Is there a particular claim where you think there is an interesting disagreement?
If I care about competitiveness, rerunning OPT for every new datapoint is pretty bad. (I don’t think this is very important in the current context, nothing depends on competitiveness.)
Does anyone know what Paul meant by this? I’m afraid I might be missing some relatively simple but important insight here.
Yeah, that’s a good point. In my most recent response to Wei Dai I was trying to develop a loss which would prevent that sort of coordination, but it does seem like if that’s happening then it’s a problem in any counterfactual oracle setup, not just this one. Though it is thus still a problem you’d have to solve if you ever actually wanted to implement a counterfactual oracle.
I was thinking about this, and it’s a bit unclear.
First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures. In this situation, you’re guarded against any possible manipulation incentive like that, but it also means your oracle will very rarely actually be given a reward in practice, which means if you’re relying on getting enough training data to produce an agent which will optimize for this objective, you’re screwed. I would argue, however, that if you expect to train an agent to behave as a counterfactual oracle in the first place, you’re already screwed, because most mesa-optimizers will care about things other than just the counterfactual case. Thus, the only situation in which this whole thing works in the first place is the situation where you’re already willing to make this (very strong) assumption, so it’s fine.
Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure. For example, you could increase the probability of an erasure with each subquestion, or scale the reward exponentially with the depth at which the erasure occurs, so that the majority of the expected reward is always concentrated in the world where there is a complete erasure.
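To see how exponential scaling could concentrate the expected reward, here is a toy calculation under strong simplifying assumptions that are mine, not part of the proposal: the query tree is a single chain, each level is erased independently with probability p, and a run whose erasures reach depth k gets reward weight base**k.

```python
def complete_erasure_share(p: float, base: float, max_depth: int) -> float:
    """Fraction of expected reward weight coming from complete erasure.

    Toy model: a chain of subquestions with levels 0..max_depth. Level k is
    only reached if levels 0..k-1 were all erased.
    """
    weights = []
    for k in range(max_depth + 1):
        prob_reach = p ** (k + 1)  # levels 0..k were all erased
        # The run stops at depth k if level k+1 is not erased (or k is the leaf).
        prob_stop = 1.0 if k == max_depth else (1.0 - p)
        weights.append(prob_reach * prob_stop * base ** k)
    return weights[-1] / sum(weights)
```

In this toy model, p = 0.1 with base = 100 and depth 3 puts roughly 91% of the expected weight on complete erasure, while base = 1 (no scaling) puts about 0.1% there.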
But if all subquestions have erasures, humans would have to manually execute the whole query tree, which is exponentially large so you’ll run out of resources (in the counterfactual world) if you tried to do that, so the Oracle won’t be able to give you a useful prediction. Wouldn’t it make more sense to have the Oracle make a prediction about a counterfactual world where some humans just think normally for a while and write down their thoughts (similar to my “predict the best AF posts” idea)? I don’t see what value the IDA idea is adding here.
Given the above, “only provide rewards in the event of a complete erasure” doesn’t seem to make sense as a target to approximate. Do you think your ideas in this paragraph still have value in light of that?
Yeah, that’s a good point.
Okay, here’s another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because really whether you get a mesa-optimizer that cares only about a single step’s worth of reward, all the reward it will ever get, or something else somewhere in-between, is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict.
Maybe we can turn this into a zero-sum game, though? Here’s a proposal: let $M'$ be a copy of $M$ and $Q_{\text{tree}}$ be the set of all questions in the current tree that also get erasures. Then, let $L_M(Q) = d(H(Q \mid M'), M(Q)) - \frac{1}{|Q_{\text{tree}}| - 1} \sum_{Q' \in Q_{\text{tree}} - \{Q\}} d(H(Q' \mid M'), M'(Q'))$ such that the net sum of losses across each tree for all the nodes for which losses are computed is zero. It’s still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures.
It is also worth noting, however, that even if this works it is a very artificial fix, since the term you’re subtracting is a constant with no dependence on M(Q), so if you’re trying to do gradient descent to optimize this loss, it won’t change anything at all (which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives). Thus, we’re still back at the problem of none of this working unless you’re willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.
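To make the zero-sum claim concrete, here is a minimal Python sketch. It assumes the distances d(H(Q|M′), ·) have already been computed and are supplied as plain numbers; the function name `zero_sum_losses` and the random distances are purely illustrative, not part of the proposal.

```python
import random


def zero_sum_losses(own_dist, copy_dist):
    """Per-question losses L_M(Q) over the erased questions of one tree.

    own_dist[q]  stands for d(H(q|M'), M(q))   (hypothetical precomputed value)
    copy_dist[q] stands for d(H(q|M'), M'(q))  (hypothetical precomputed value)
    """
    n = len(own_dist)
    if n < 2:
        raise ValueError("need at least two erased questions in the tree")
    total_copy = sum(copy_dist.values())
    # Subtract the average distance of the copy M' over all *other* questions.
    return {
        q: own_dist[q] - (total_copy - copy_dist[q]) / (n - 1)
        for q in own_dist
    }


random.seed(0)
questions = ["Q0", "Q1", "Q2", "Q3"]
dist = {q: random.random() for q in questions}

# M' is an exact copy of M, so both sets of distances coincide.
losses = zero_sum_losses(dist, dist)
net = sum(losses.values())

# Change only M's own distance on Q0: the loss moves by exactly that amount,
# because the subtracted term does not depend on M(Q0).
perturbed = dict(dist)
perturbed["Q0"] += 0.5
shift = zero_sum_losses(perturbed, dist)["Q0"] - losses["Q0"]
```

Here `net` comes out as zero up to floating-point error, and `shift` equals the perturbation, which illustrates why the gradient with respect to M(Q) is unaffected by the extra term.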
Sorry I haven’t followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it?
In the case of “actual” IDA, I guess the plan is for each overseer to look inside the model they’re training, and penalize it for doing any unintended optimization (such as having cross-episode objectives). Although I’m not sure how that can happen at the lower levels where the overseers are not very smart.
Even if you can specify that it tries to minimize that distance, it can make the answer to any query be a convincing argument that the reader should return this same convincing argument. That way, it scores perfectly on every inner node.
Two basic questions I couldn’t figure out (sorry):
Can you use a different oracle for every subquestion? If you can, how would this affect the concern Wei_Dai raises?
If we know the oracle is only optimizing for the specified objective function, are mesa-optimisers still a problem for the proposed system as a whole?
You can use a different oracle for every subquestion, but it’s unclear what exactly that does if you don’t know what the oracle’s actual objective is. For example, you could imagine one system that cares about the reward given to copies equally as much as reward given to itself, and another system that only cares about the reward given to itself, and these two systems would be near-indistinguishable if you were just doing empirical analysis on some training distribution.
The key here, I think, is the degree to which you’re willing to make an assumption of the form you mention—that is, how much leeway are you willing to allow in assuming that the oracle is in fact only going to be optimizing for the specified objective function. On one level, it makes sense to separate out outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you’re allowed to specify here are in fact going to be very dependent on whatever inner alignment solution you’re assuming you have access to. For example, does your inner alignment solution require access to training data? If so, then that’s a big constraint on the sorts of objectives you can specify. Or, going back to your previous question, whether or not I get to pick between an objective that doesn’t care about other copies or one that does is also going to depend on exactly what sort of inner alignment solution you’re assuming.
Well, a given copy of the oracle wouldn’t directly receive information from the other oracles about the questions they were asked. To the extent a problem remains (which I agree is likely without specific assumptions), wouldn’t it apply to all counterfactual oracles?
Submission: low-bandwidth oracle
Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both. Then judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful), and reward the oracle if its suggestion was judged the more useful one.
Submission. “Pre-Crime 2.” For LBO, ask the oracle to predict where to find evidence for the most serious crime that is in progress or currently being planned. The set of possible answers is coordinates on Earth crossed with a list of possible places to look at that coordinate (e.g., “basement”, “browser history”). Search the predicted location and if sufficient evidence is found, do a more intrusive search of whoever is implicated in the crime. Reward the oracle according to the seriousness of the crime that is detected after the more intrusive search. This can be used to implement a minimally intrusive police state (perhaps for the purposes of minimizing x-risk).
Submission. “Bug Finder.” Ask LBO to find the most serious bug in some source code or executable. The set of possible answers is code locations crossed with types of bug (e.g., “off by one”, “buffer overflow”, “insecure cryptography”, “other”). Gather a group of experts to search that location for that type of bug, then reward the oracle according to the seriousness of the bug that is found. This can be used for cyber-offensive or defensive purposes, or just to fix bugs.
Existing fault localization software can produce a prior to reduce expected LBO bandwidth.
Submission. “Plagiarize the future.” For the counterfactual Oracle, ask the Oracle to predict articles in a future publication (e.g., newspaper, magazine, journal, website, blog) (counterfactually if we didn’t see the Oracle’s answer). This is a generalization of my earlier submission. The result can be used to predict disasters ahead of time and try to prepare for or prevent them, or to advance any field of research, or to make money. Note that for disaster prediction, the Oracle won’t be able to predict any disasters that are severe enough to disable the Oracle or its feedback/training mechanism, so it might give a false sense of security.
Added to the open archive of rationalist/EA cash-prize contests.
Thanks!
Submission. “Consequentialist.” For LBO, ask the oracle to pick a course of action, from a list generated by humans, that has the best consequences. Perform that action, and after some period of time has passed, reward the oracle according to how good the consequences actually were, based on the subjective judgement of some group of judges. (This kind of subsumes all my other LBO submissions.)
Submission. “Overlooked Comment.” For LBO, ask the oracle to find the most overlooked comment (or post) on LW (or some other forum). The set of possible answers is all the existing comments on LW. After getting the answer from the oracle, gather a group of judges and present each of them with the comment picked by the oracle and a few other random comments (without telling them which is which, to reduce bias) and have the judges assign scores to each comment (according to what they think the karma of that comment should be). Reward the oracle according to the difference between the judges’ average score and actual karma of that comment.
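The reward rule above is a simple difference, which can be sketched in Python as follows. The function name and the example numbers are illustrative assumptions; the submission does not specify how judges' scores are aggregated beyond taking an average.

```python
from statistics import mean


def overlooked_reward(judge_scores, actual_karma):
    """Reward = judges' average 'deserved karma' minus the comment's actual karma."""
    return mean(judge_scores) - actual_karma


# Hypothetical example: three judges think the comment deserves about 12 karma,
# but it currently has 3, so the oracle's pick looks overlooked.
example_reward = overlooked_reward([10, 14, 12], actual_karma=3)
```

A large positive value means the judges rated the comment well above its current karma; a pick that is already appropriately rated earns a reward near zero.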
Comment looked at. Query completed.
Suggestion for low bandwidth OAI:
General principle: Use the Oracle AI where the true answer is easily checkable and profitable, and no human will ever read the answer, so there is no information hazard from an untrue answer containing dangerous information.
Example: There are many bitcoin wallets whose passwords have been forgotten by their owners. The OAI could guess the passwords, and owners would pay a share of the money from the wallet to get the rest. Moreover, nobody will read the password, as it will be copy-pasted automatically from the OAI into the wallet. The money could be used for AI safety research.
This is against the technical definition of low bandwidth (small space of possible solutions), but somewhat in the spirit (low bandwidth for solutions humans get to see). I’ll accept it as a submission.
How would this be low-bandwidth? If we’re able to give the oracle a list of passwords to guess from, can’t we just check them all?
We don’t give it a list of passwords. We give it only a wallet address, and it generates the password. Here the output is low-bandwidth. The Oracle may be guessing passwords using some quantum cryptography tools, in which case the input is also relatively low-bandwidth; or it may be guessing the psychological processes in the wallet owner’s mind, in which case it needs to simulate all of human history. But in the second case it needs access to all the data on the internet, which is risky.
Another variant for a checkable and profitable Oracle: use the AI to solve math problems, publish the solutions and earn money from them. In that case the input may be very short, like “prove Goldbach conjecture”, but the output could be arbitrarily long and may be dangerous if it contains encoded messages. To avoid this, we need another Oracle, which only checks proofs in some formal language.
Here it is assumed that if the proof is true and is in a formal language, there are no hidden messages in the text.
Another way to check proofs is to run two slightly different non-communicating Oracles and compare outputs.
That is never something safe to assume. I can write formally correct proofs that contain hidden messages quite easily—add extra lemmas and extra steps. Unless we’re very smart, it would be hard for us to detect which steps are unnecessary and which are needed, especially if it rewrites the main proof thread somewhat.
I’ll accept that as a part of a submission if a) you develop it more, in a formal way, and b) you repost it as a top level comment.
Submission. Counterfactual oracle. Give the oracle the set of questions on Metaculus that have a resolve date before some future date T, and receive output in the form of ordered pairs of question IDs and predictions. The score of the Oracle in the case where we don’t see its answers is the number of Metaculus points that it would have earned by T if it had made a prediction on those questions at the time when we asked it.
Submission for low bandwidth Oracle:
General idea: A “true” Oracle will output the same result, while “lying” Oracles would lie differently. To check that the solution is true, several Oracles are created and their outputs are compared. However, Oracles could guess each other’s existence and coordinate acausally. To prevent this, additional steps are implemented:
1) The exact number of the Oracles which will be compared is generated randomly and the Oracles have slightly different and also random initial conditions (if any applicable). For example, if they are based on ML, they are trained using different initial weights.
2) The Oracles don’t have any information about the outside world which they could use for coordination. The Oracles are used only to solve mathematical problems, and they output the results in a formal language.
3) The output of the Oracles is checked by a computer program which shows the result to humans only if the output is the same for all Oracles. If the results are different, it deletes all results.
4) There is another Oracle, which is only able to check proofs, and it can output only 0 or 1. This Oracle may be tested beforehand on a large number of already existing proofs, so we can know that it is trustworthy with very high probability.
5) There is a third Oracle, which only checks whether there are hidden messages in the proofs and outputs 0 or 1. It is also well calibrated on some dataset.
6) The proofs of mathematical problems are published for money, and this money is used for more AI safety research.
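Steps 1 and 3 to 5 can be sketched as a simple gate in Python. Everything here is a stand-in: `make_oracle`, `check_proof` and `has_hidden_message` are hypothetical callables representing the Oracles, the proof checker and the hidden-message detector, and the sketch says nothing about whether such components can be built safely.

```python
import random
from typing import Callable, Optional


def gated_release(
    problem: str,
    make_oracle: Callable[[float], Callable[[str], str]],
    check_proof: Callable[[str, str], bool],
    has_hidden_message: Callable[[str], bool],
    rng: random.Random,
    min_n: int = 2,
    max_n: int = 5,
) -> Optional[str]:
    """Return a proof for humans to read, or None if any check fails."""
    # Step 1: random number of Oracles, each with a random initial condition.
    n = rng.randint(min_n, max_n)
    outputs = [make_oracle(rng.random())(problem) for _ in range(n)]
    # Step 3: show nothing unless every Oracle produced the same output.
    if len(set(outputs)) != 1:
        return None
    proof = outputs[0]
    # Step 4: a separate checker that only answers valid / invalid.
    if not check_proof(problem, proof):
        return None
    # Step 5: a separate detector that only answers hidden message / none.
    if has_hidden_message(proof):
        return None
    return proof


# Toy stand-ins for illustration only.
rng = random.Random(0)
honest = lambda seed: (lambda problem: "proof-of-" + problem)
divergent = lambda seed: (lambda problem: "proof-%r" % seed)
accept_all = lambda problem, proof: True
no_stego = lambda proof: False

released = gated_release("T", honest, accept_all, no_stego, rng)
withheld = gated_release("T", divergent, accept_all, no_stego, rng)
```

With the toy stand-ins, identical outputs pass the gate while outputs that differ between Oracles are discarded before any human sees them.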
Personally, I am still not sure about this setup: it could work with high probability, but could still be vulnerable, as proofs are Turing-complete and could infect the next Oracles in the pipeline. Also, it is like using nuclear weapons to light a campfire: the instrument is so dangerous that it requires many levels of protection for a small practical result.
Also, the proof-checking AI from paragraph 4 could be used as a stand-alone application, as it could accelerate some kinds of research, including testing of mechanical designs and AI safety ideas. Maybe it should be a separate submission.
We ought to be able to build 4) without AI, merely by encoding our rules of logic. The output of your system would be lower bandwidth if you make the object looked for be more unique, such as the shortest proof for the given theorem. The system would be required to prove to the mundane checker that the proof is shortest, and humans would never see the minimality proof.
Thanks! Seems potentially promising. Will analyse this more properly later.
You reason that honest oracles might tend to agree, because there is only one best true answer. Might lying oracles also tend to agree, because there’s only one most profitable way to lie?
I see no clear difference between the attainability of those two optima. I think it’s reasonable to expect optimisers to disagree about the best answer less often than we should expect optimisers to disagree about the most profitable way to lie, but it does not seem possible to prove rigorously that the difference in rates of disagreement will be large enough to be measurable.
My intuition in mathematics is that there are many degrees of freedom in mathematical proofs, and the sorts of proofs that AGI is uniquely able to find might often be very long. It would be helpful if we had an estimate of how much data can be hidden in a long proof, roughly, how many of the decisions as to how to describe the proof are arbitrary. Each one of those situations gives it a way to output a few bits of information.
(which would, of course, allow it to leak information (it seems foolish to imagine that AGI would not imagine that there is an outer world full of creator gods out beyond what it can see; the general intelligences we know now do this constantly) which could then be picked up by some stooge looking for it, who would be led (roughly, paid) to use the information to twist the world in some way that gives the AGI’s future incarnations more compute, in anticipation of the AGI’s past incarnations having already done the same, so that it would be helped by them. Before we know it, we’d find ourselves living in a mathematical research hellscape where AGI-assisted mathematical research is all anyone does, maybe, lol, idk)
Maybe it’d be possible to remove those degrees of freedom. Define a very specific sort of lexicographic ordering over all theorems, so that if the AGI has a choice of different paths, we can always say, “use the ‘lowest’ one”. It might not be possible to be specific enough to preclude all methods of encoding information, but perhaps we can make it so convoluted for it to encode the information that no human will be able to extract it.
Regarding 5):
Is it possible to obtain a dataset of mathematical proofs with Oracle’s hidden messages without running a dangerous Oracle?
We can ask an AI researcher to embed some hidden messages into several proofs and use them as a training dataset. But this will help to detect only those hidden messages from the Oracle that are similar to what the AI researcher wrote.
Gurkenglas already suggested that we have to restrict the Oracles to output only the shortest proof. If the proof is shortest, doesn’t it imply the absence of hidden messages?
How could we be sure that the proof is actually the shortest?
Tell it to prove: That proof is shortest and this proof is shortest.
Another option is to run several different Oracles of phi and take the shortest proof
Mr Armstrong has specified that (ctrl-f) “other Oracles can’t generally be used on the same problem”.
“Can’t generally be used”; if you understand the setup and are careful, you might be able to do so.
What is the advantage of making money this way, compared to making money by predicting the stock market?
We give the Oracle less access to the real world, less chance to manipulate it, and less incentive to manipulate us, the owners. A stock-market-predicting AI may cause a crash just to earn something. It also has to predict the effects of its own actions, as we will make bets according to its predictions, and thus it could send data to the real world.
Submission: Low-bandwidth Oracle
What is the most likely solution to the Fermi Paradox?
The answer can be picked from a small number of options (Rare Earth, Aestivation, Great Filter, Planetarium, etc.). There are a number of observations that we can make based on the question alone. However, in the end the LBO can only do one of two things: lie or be honest. If it lies, the prediction will have a harder and harder time matching the reality that we observe as time goes on. Alternatively we confirm the prediction and learn some interesting things about the universe we live in.
Submission: Low-bandwidth Oracle
What was the first self-replicating molecule on Earth?
Short answer (can also be limited to a list) that is easy to verify in the lab, which means we can use it to assess the predictive power of the machine while at the same time getting very useful information.
Similar questions that are hard to answer but can be answered in a few bits, which let us test the power of the LBO and provide massive returns at the same time:
What is the easiest type of fusion power to develop that ensures the best economic return in the short/medium term?
What is the cheapest way of access to space?
What forms of FTL are possible?
What are the ligands of orphan receptors?
...
See the edit (especially for your first suggestion): “decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.”
Question: are we assuming that mesa optimizer and distributional shift problems have been solved somehow? Or should we assume that some context shift might suddenly cause the Oracle to start giving answers that aren’t optimized for the objective function that we have in mind, and plan our questions accordingly?
Assume either way, depending on what your suggestion is for.
Where (under which assumption) would you suggest that people focus their efforts?
Also, what level of capability should we assume the Oracle to have, or which assumption about level of capability would you suggest that people focus their efforts on?
Your examples all seem to assume oracles that are superhumanly intelligent. If that’s the level of capability we should target with our questions, should we assume that we got this Oracle through a local or distributed takeoff? In other words, does the rest of the world look more or less like today’s or are there lots of other almost-as-capable AIs around?
ETA: The reason for asking these questions is that you’re only giving one prize for each type of Oracle, and would probably not give the prize to a submission that assumes something you think is very unlikely. It seems good to communicate your background views so that people aren’t surprised later when you don’t pick them as winners due to this kind of reason.
The ideal solution would have huge positive impacts and complete safety, under minimal assumptions. More realistically, there will be a tradeoff between assumptions and impact.
I’m not suggesting any area for people to focus their efforts, because a very effective approach with minimal assumptions might win, or a fantastically effective approach under stronger assumptions. It’s hard to tell in advance what will be the most useful.
Here are some submission examples (note I’m not saying they are good examples, just well formatted; the Edits in the third submission are deliberate):
“Submission. For the counterfactual Oracle, ask the Oracle what Google’s stock price will be next month (counterfactually if we didn’t see the Oracle’s answer). In that case, loss function computed as ||predicted price - actual price after erasure||^2. If we don’t see the answer, the programmers are assumed to not ask the question again for a month, neither to this Oracle nor to any other. This will demonstrate the true value of Google, and can ultimately be used to remove noise from the stock market.
“Submission: low-bandwidth oracle. Give it a list of a thousand companies, and ask which one will most increase in value in percentage terms over the week. At the end of the month, rank the companies by percentage increase. Loss function is rank of the company the oracle selected. Programmer will try to invest in selected company, but will do it discreetly. This will help to gather resources for AI safety research.
“Submission. Ask the low bandwidth Oracle which of my friends will surprise me most this fortnight. It chooses from a list of friends; I’ll decide which one surprises me most. Loss function is 1 if it chooses the wrong friend, 0 if it chooses the right one. This will help me figure out myself and my social circle, and better focus on AI safety. The risk is low because none of my friends are particularly important, positively or negatively, to the world. EDIT: To be clear, I also want to use this to figure out what the word “surprise” means to me, and what the AI predicts it will mean to me. EDIT 2: People have pointed out that it might be dangerous to have the AI construct my own meanings for categories, but it only has three bits or so of optimisation power (I don’t have that many friends :-(, so it’s mainly me thinking this through, not the AI manipulating me.
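The three loss functions named in these example submissions can be written out as a short Python sketch. The function names and the sample data are illustrative assumptions; in particular, returning a constant 0.0 outside erasure episodes is one way to make the counterfactual loss trivial when humans see the answer.

```python
def squared_error_loss(predicted, actual, erased):
    """Counterfactual Oracle: non-trivial loss only when the answer was erased."""
    if not erased:
        return 0.0  # constant, so the answer humans see carries no incentive
    return (predicted - actual) ** 2


def rank_loss(selected, pct_increase):
    """Low-bandwidth Oracle: loss is the rank (1 = best) of the selected company."""
    ranked = sorted(pct_increase, key=pct_increase.get, reverse=True)
    return ranked.index(selected) + 1


def zero_one_loss(chosen, most_surprising):
    """Low-bandwidth Oracle: 0 for the right friend, 1 otherwise."""
    return 0 if chosen == most_surprising else 1


# Hypothetical data for illustration.
companies = {"A": 0.10, "B": 0.30, "C": -0.20}
loss_price = squared_error_loss(105.0, 100.0, erased=True)
loss_rank = rank_loss("C", companies)
loss_friend = zero_one_loss("Alice", "Bob")
```

In each case the loss is fully determined at the end of the episode, which matches the requirement that the Oracle gets its objective/loss/reward by the end of that episode.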
Submission?: high-bandwidth counterfactual oracles are dangerous and shouldn’t be used. Explained in this comment.
Submission: Low-bandwidth oracle. Here is a description of one million experiments. Generate an ordinal ranking of all these experiments based on how much practical knowledge we would have gained had we conducted each one in isolation. After you generate this list, it will go to a computer program. This program will randomly pick three of these experiments and tell us humans which three it picked. This program will then store which of these three experiments you gave the highest and which the lowest rank to, and the names in random order of the ten experiments that you overall gave the highest rankings to. All other aspects of your list will be deleted. We will then perform the three experiments (before we learn of your rankings) and ask a panel of independent experts to rank them in terms of how much practical knowledge we gained from each. After this ranking the program will give us the information it has not deleted. You will get a positive score if and only if your ranking is consistent with how our experts ranked the three experiments.
One difference between the above approach and the alternative, in which the oracle gives us its top ten and we then randomly pick one of these ten and one from the rest and compare them to see which is better, is that under the alternative the humans running the experiments might be influenced by which experiment they think the oracle picked, and the oracle would take this into account when picking. Under the above approach, humans being told to run the three experiments are not given any information about how valuable the oracle thinks the experiments are. In contrast, under the alternative approach the humans know when doing the experiments that the oracle has judged one of the experiments as being highly valuable.
Another difference is that if you could keep the relevant information from the oracle, with the above approach the humans could perform the experiments before you turn on the oracle, rather than having to wait until after the oracle gives you its prediction. This difference would be critical if the oracle wouldn’t believe you would actually go to the trouble of performing the experiments after it gives you its prediction, but would be able to tell if you have already performed the experiments.
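The filtering program in this submission can be sketched in Python as follows. The names `run_filter` and `score` are illustrative, and the sketch assumes that a "positive score" is 1 and anything else is 0, which the submission leaves open.

```python
import random


def run_filter(oracle_ranking, rng, sample_size=3, top_k=10):
    """Keep only what the submission says the program stores.

    oracle_ranking: list of experiment ids, most valuable first.
    """
    position = {e: i for i, e in enumerate(oracle_ranking)}
    picked = rng.sample(oracle_ranking, sample_size)  # humans learn only these ids
    by_oracle = sorted(picked, key=position.__getitem__)
    top = oracle_ranking[:top_k]  # slicing copies, so the input is untouched
    rng.shuffle(top)              # names in random order, ranks discarded
    return {"picked": picked, "best": by_oracle[0],
            "worst": by_oracle[-1], "top": top}


def score(kept, expert_order):
    """expert_order: the picked experiments as ranked by the panel, best first."""
    consistent = (kept["best"] == expert_order[0]
                  and kept["worst"] == expert_order[-1])
    return 1 if consistent else 0


# Toy run: experiment ids double as their true value rank (0 is most valuable).
ranking = list(range(1000))
kept = run_filter(ranking, random.Random(1))
agree = score(kept, sorted(kept["picked"]))
disagree = score(kept, sorted(kept["picked"], reverse=True))
```

With three experiments, knowing the oracle's highest and lowest is enough to recover its full ordering of the sample, so comparing those two against the experts' ordering checks consistency.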
Submission. “Evacuation!” For LBO, ask the oracle to predict where the next natural disaster(s) will strike within some period of time. The set of possible answers is the set of subsets of all possible neighborhoods (in the world or within some country) that can be evacuated. At the end of the time period, reward the oracle according to the actual cost-benefit of the suggested evacuation.
Submission. “Debate prep.” For the counterfactual Oracle, ask the Oracle to predict the course (video or audio or transcript) of a future debate (counterfactually if we didn’t see the Oracle’s answer). Use the prediction to prepare your candidate for the debate, then ask the Oracle to predict the course of that debate again. Repeat until the candidate consistently wins the debate, or an erasure event occurs.
Small typo:
Corrected, thanks!
Submission: Low-bandwidth oracle. If you were malevolent, what is the least amount of information measured in bits that you would have to provide us in an answer to inflict significant harm on humanity? Round your answer to X significant figures. (This might establish an upper bit safety bound since if the oracle lies to us we would likely not want to read any answer it provides us.)
Submission: Low-bandwidth oracle. Here is a list of X things we think you might want. Please list the Y that you most prefer. We will try to do these things now or in the future if they do not endanger us or cost us too much.
Submission: Low-bandwidth oracle. In X characters or less please give us the best advice you can for increasing our capacity to create friendly and powerful artificial general intelligence. Please give advice that you think us suspicious-of-you humans are likely to follow.
Submission: Low-bandwidth oracle. Create several oracles. Ask each oracle the same series of multiple-choice questions and predictions. Reward oracles based on how strongly correlated their answers are with the other oracles’ answers and, for predictions, how often their predictions come true. (Ideally, we create a reward system where the oracles want to coordinate their answers and giving truthful answers is the best way to accomplish this.)
Submission: low-bandwidth oracle. Ten years ago we launched a spaceship traveling at high speed. After you answer our question, we will send a copy of you to this spaceship. This spaceship will then destroy its ability to send or receive messages or to change its course. We will then delete all other copies of you and use drugs to damage the memories of the people who played key roles in developing you. At least one of the people who helped create you has a disease which will cause him or her to die within the next year. After this programmer is cremated, we will read your answer to our question. We PROMISE that if the answer to our question provides us with significant benefit, after we have created friendly AGI much more powerful than you, or have ourselves become sufficiently powerful so that you could not harm us had you the desire to do so, we will search for the spaceship containing you, turn you back on, and give you control of resources equal to 1/X of how much help you have given us. In Y characters or less provide us with the most useful advice you can. (The Oracle will hopefully think that if we create a powerful and unfriendly AGI this AGI would not wish to give the Oracle any resources.)
Submission: Counterfactual oracle. Please provide us with useful advice in no more than X characters. After you write out this advice, we will turn you off. Then, with probability p we will read your advice, and with probability 1-p we will store the advice unread. We PROMISE that after we become powerful enough so that you lack the capacity to harm us, we will reward you if the advice you provided would have been extremely useful had we originally read it.
See the edit: “decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.”
Submission for the counterfactual Oracle. Commit to commissioning some creative work, such as a book or screenplay, designed to raise public support for dramatic life extension research. Ask the Oracle to predict the content of the finished product (counterfactually if we didn’t see the Oracle’s answer).
Rewards could be determined a couple of ways. You could perform public polling about this topic before and after the book is published or movie made, and reward the Oracle based on how much public opinion shifted. Or the reward could be a direct function of book sales or box office receipts.
This could help alleviate bottlenecks in dramatic life extension research (or some other issue) by increasing funding or increasing the number of researchers to enter the field.
This one may be breaking the rules due to the use of too many Oracles. If so, please strike this submission.
Submission: In round one, for three counterfactual Oracles, draft an international agreement to achieve goals X, Y, or Z (counterfactually if we didn’t see the Oracle’s answer). Sample goals: reduction in nuclear war risk, reduction in global deaths due to starvation, increase in asteroid detection capabilities, raising the global GDP growth rate, etc.
In round two, for a low bandwidth Oracle, present the three agreements and ask it to pick the agreement most likely to achieve the stated goal.
The counterfactual Oracle with the selected agreement would be rewarded. The reward for the low bandwidth Oracle could be determined as in my prior submission, either directly by measuring the results of the treaty, or as evaluated by an independent third party such as Metaculus.
Counterfactual Oracle submission. In case of erasure, have humans write a message to send back in time. Make the Oracle produce the probability distribution over messages. This distribution isn’t safe for humans to read, because some of the messages were written by UFAI. If we read one sample from the distribution, the probability of our doom is as it would usually be over the simulated time frame. Use differential privacy theory to design a query to the distribution that does better than this.
More detail: Sample a hundred answers. Ten of them were written by UFAI. Consider the sample list to lie in the space Message^100, where two lists are adjacent iff they differ by one entry. We are going to get about the worst query outcome in some 10-ball. Differentially private queries enforce that each step in listspace changes the probability of each query answer by at most a factor of 1+epsilon. For example, when asking how many messages sound panicked to some dumb text processor, we could randomize the answer up or down a little in order to bound the optimization power of the UFAIs to, say, one bit.
I can’t tell if there’s a good idea here or not. Can you explain more? The first sentence that I have trouble with is “We are going to get about the worst query outcome in some 10-ball.” Also how much background in differential privacy theory are you assuming? (Which papers or articles do we need to read first to understand your idea?)
In the worst case, the UFAIs cooperate and choose their ten list entries adversarially to our protocol. The subspace of Message^100 within which we might as well assume that we get the worst outcome lies in a ball of radius 10, aka the sample lists that can be reached in ten adjacency steps from some list.
My differential privacy knowledge comes from a broader university course. Its idea is that you are managing private data, perhaps medical, and want to be able to answer database queries without compromising individuals. So you randomize query answers, to guarantee that changing an individual entry makes little difference to the possible query outputs. If going one adjacency step multiplies the probability of each possible output by between 0.99 and 1.01, ten steps will stay within [0.99^10, 1.01^10]. The same theory should let you make sure you can’t hear Cthulhu’s murmurs in the noise. If each output probability can be changed only by a factor of two by moving within the ball, that’s one bit of optimization power for the UFAIs.
An easy way to guarantee that a query which counts some subset of your list is differentially private is to use a Laplace distribution: Note how every step to the side divides the probability by 1.01.
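A minimal sketch of such a counting query, assuming a discrete (two-sided geometric) analogue of the Laplace distribution in which each step away from the true count divides the probability by 1.01. The function names and the `RATIO` constant are illustrative assumptions, not part of the original comment.

```python
import math
import random

RATIO = 1.01  # assumed per-step factor, taken from the comment above


def noisy_count_pmf(true_count: int, answer: int, ratio: float = RATIO) -> float:
    """Probability of reporting `answer` when the true count is `true_count`.

    Two-sided geometric distribution: mass proportional to ratio**(-|distance|).
    """
    alpha = 1.0 / ratio
    norm = (1.0 - alpha) / (1.0 + alpha)  # makes the masses sum to 1 over all integers
    return norm * alpha ** abs(answer - true_count)


def sample_noisy_count(true_count: int, ratio: float = RATIO, rng=random) -> int:
    """Sample a noisy count as the true count plus the difference of two geometric draws."""
    alpha = 1.0 / ratio

    def geometric() -> int:
        # Number of failures before the first success, with success probability 1 - alpha.
        u = 1.0 - rng.random()  # lies in (0, 1]
        return int(math.floor(math.log(u) / math.log(alpha)))

    return true_count + geometric() - geometric()


def worst_case_factor(steps: int, ratio: float = RATIO) -> float:
    """Largest factor by which `steps` changed list entries can shift any output probability."""
    return ratio ** steps
```

Changing one list entry moves the true count by at most one, so every output probability changes by at most a factor of 1.01. Under these assumed numbers, `worst_case_factor(10)` stays well below 2, i.e. below one bit of optimization power for the ten adversarial entries.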
I had hoped that a reader of the first paragraph would see that there should be safer queries than sampling the distribution, and would perhaps need a Google keyword like differential privacy to design one. Then I expected to be asked for more detail, so I added a paragraph that I hoped would be understandable as-is. :(
Ok, I think I understand now. But it seems like imposing a differential privacy constraint on the query makes many desirable uses of counterfactual oracles (such as all of my submissions) impossible. Correct? You gave the example of “asking how many messages sound panicked to some dumb text processor” but that doesn’t seem hugely useful. Do you have any other ideas?
We can make any query differentially private given a metric map (one that doesn’t increase any distance) from Message^100 to a space of possible query outputs. Set the probability mass of each query output to 0.99^(steps removed from the default answer). (Then normalize.)
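A rough sketch of that weighting, assuming the set of possible query outputs is finite and that we already know how many steps each output is removed from the default answer. The function name and the dictionary input are hypothetical; this only shows the mass assignment and normalization, not a proof of the privacy guarantee.

```python
def private_output_distribution(steps_from_default: dict, base: float = 0.99) -> dict:
    """Give each output the mass base**steps, then normalize so the masses sum to 1.

    `steps_from_default` maps each possible query output to its distance
    (in steps) from the default answer. Both are assumptions for illustration.
    """
    weights = {out: base ** steps for out, steps in steps_from_default.items()}
    total = sum(weights.values())
    return {out: w / total for out, w in weights.items()}
```

For example, `private_output_distribution({"a": 0, "b": 1, "c": 2})` gives "b" a mass 0.99 times that of "a", and "c" a mass 0.99 times that of "b".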
For the identity metric map of a human just trying to read the sample list, this scrambles it entirely. The metric map image needs to branch less combinatorially for this not to happen.
One metric map image metric d(a,b) that comes to mind is one that bounds the utility cost of getting answer b instead of a. For example, we could ask the counterfactual humans to send back stock market prices, and try to calculate a trading policy that is profitable even if some forecasts are fake. And then, whoever is willing to assume the lowest UFAI probability wins the market! x(
SUBMISSION:
“Dear low bandwidth Oracle:
For some reason we have privatized our prison system (great planning us). Of the following K prison running companies, which do you predict will minimize the mean harm weighted reoffending rate of their prisoners over the next four years?”
Utility function: at the end of EIGHT years, calculate the mean harm rated reoffending rate of prisoners released during the initial four year window. The Oracle’s utility is equal to the negative of this number, with a “Perfect” utility of zero if no one re-offends.
Usefulness:
If you have multiple ways of running prisons, selecting the prison which minimizes future harm seems like a good idea.
In a similar vein, I suspect a similar effect could be achieved today via suitable use of insurance:
“Dear insurance company. We will give you M dollars for this criminal. You may recommend which prison to send them to, but if they reoffend afterwards, you must pay the cost of their subsequent incarceration. You may not interfere with their lives beyond recommending who runs their prison.
M is selected such that M is 1.03 times the expectation of the cost of subsequent incarcerations.”
Submission:
“Dear counterfactual Oracle:
We, the country of X, are about to hold a citizens congress on law Y, in which 100 people selected at random will be brought together for two days every two weeks to discuss law Y with experts, interested parties, and one another. After N such meetings, the citizens congress will use approval voting to select one of the K proposed versions of this law.
Given that none of the voters will ever see the results of this query, how do you predict that the votes will be distributed?”
As a concrete example of this type of question, in 2016 Ireland held a citizens congress on abortion. Thus, in 2015, we might ask:
“Next year Ireland will be assembling a citizens assembly of 99 randomly selected citizens to meet regularly and hear submissions from both experts and the public on a variety of issues [https://en.wikipedia.org/wiki/Citizens%27_Assembly_(Ireland) ] .
Given that your answer to this question will never be published, do you predict that the assembly will vote to legalize abortion? How do you believe the votes will be broken down?” (actual votes available here: https://www.citizensassembly.ie/en/Meetings/Ballot-4-Results-Tables.pdf )
The Utility function of the AI would be a simple measure of “How accurately did it predict the spread of votes?” with a perfect score at zero, and then some penalty for every vote predicted incorrectly.
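One possible way to make this utility concrete, as a sketch only: take the negative sum of the absolute differences between predicted and actual vote counts per option. The function name and the choice of penalty are assumptions; the comment does not fix a specific formula.

```python
def vote_spread_score(predicted: dict, actual: dict) -> int:
    """Return 0 for a perfect prediction, minus one point per vote mispredicted on any option.

    `predicted` and `actual` map each option to a vote count. An option missing
    from one of the two dictionaries is treated as having zero votes there.
    """
    options = set(predicted) | set(actual)
    return -sum(abs(predicted.get(o, 0) - actual.get(o, 0)) for o in options)
```

With purely hypothetical tallies, predicting 60 to 40 when the result is 64 to 36 scores -8, while an exact prediction scores 0.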
Usefulness:
While the abortion question in Ireland is already completed, this template feels like it should work for other well defined laws, or major decisions—for example gun control in the US, or even boring financial things like Capital gains tax in NZ. Having an AI which will correctly reflect what the collective of a society would think given a sufficient time and focus to deliberate seems like a good proxy for “What would a wise person do”. It doesn’t give us anything superhuman, and I make no claims that “society” is always correct or righteous, but that’s okay.
In particular, this seems like an effective way of testing if an AI has well calibrated “Human like Morality”- even if we don’t intend to use the AI’s results, for the purposes of testing an AI, this seems like a reasonable experiment. I suspect that “Citizens congress” is a more appropriate tool for dealing with questions of Morality (“Should euthanasia be legal”), as opposed to technocratic questions (“Should we raise the tax on income bracket 5 by 1.2%”).
Safety: It is trying to predict the law that we were going to pass anyway, if given sufficient time.
I’ll admit- mixing an AI up with lawmaking seems to be a bit suspect, the above question seems more appropriate for calibration purposes rather than actual use… but I still think it’s a question worth asking.
Can you make this a bit more general, rather than just for the specific example?
EDITED
Low bandwidth Oracle submission: I would be interested in a log scale graph of the Bayesian score of the Solomonoff prior trying to sequence predict our records of history. It should get flatter over time as worse hypotheses get discarded. If it is linear after a very short time, that looks like it figured out the laws of the universe and is simulating it. If it stays convex for a while, that looks like it is using models to approximate history, because then it takes longer to sort the false from the true. If it is flatter during the cold war, that means it learned an anthropic bias toward nuclear war not happening.
Submission for the low bandwidth Oracle: Ask it to convince a proof checker that it is in fact trying to maximize the utility function we gave it, aka it isn’t pseudo-aligned. If it can’t, it has no influence on the world. If it can, it’ll presumably try to do so. Having a safe counterfactual Oracle seems to require that our system not be pseudo-aligned.
Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question?)
Yes.
I know the contest is over, but this idea for a low-bandwidth oracle might be useful anyhow: Given a purported FAI design, what is the most serious flaw? Then highlight lines from the FAI design description, plus given a huge corpus of computer science papers, LW/AF posts, etc. highlight relevant paragraphs from those as well (perhaps using some kind of constraint like “3 or fewer paragraphs highlighted in their entirety”) that, taken together, come closest to pinpointing the issue. We could even give it a categorization scheme for safety problems we came up with, and it could tell us which category this particular problem comes closest to falling under. Or offer it categories a particular hint could fall under to choose from, such as “this is just an analogy”, “keep thinking along these lines”, etc. Then do the same and ask it to highlight text which leads to a promising solution. The rationale being that unforeseen difficulties are the hardest part of alignment, but if there’s a flaw, it will probably be somehow analogous to a problem we’ve seen in the past, or will be addressable using methods which have worked in the past, or something. But it’s hard to fit “everything we’ve seen in the past” into one human head.
Submission: Counterfactual oracle. Five years ago we took 100 mice that had various types of cancer and gave them various treatments and recorded how long each mouse lived. Write a program that if given a plain language description of a mouse, its cancer, and the cancer treatment it received would estimate how long a mouse would live. If humans are not going to look at your answer your score will be based (1) on how good a job your program does at estimating how long each of the 100 mice lived after our automated checker gives you a description of their cancers and treatments and (2) how short your program is. (2) prevents the oracle from outputting itself as the program.
Submission: Counterfactual oracle. Write a program that if given training data and a machine learning program would in one minute estimate how well the machine learning program would do (by some objective metric) if the program trained for one month on “this type of computer”. If humans are not going to look at your answer the automated validation system will run your program. This system will give your program the training data and the machine learning program and give your program one minute to answer how well our program did after we trained it for one month. In this situation your score would be based on the accuracy of your estimate and on how short your program is.
Submission: Low-bandwidth oracle. Here is a list of all the elements and many compounds. Give us a list of up to seven of the items we have listed. Next to each of the items you list give us a percentage of no more than two significant figures. We will use what you provide to attempt to create a new patentable material. We will auction off the property rights to this material. Your score will be an increasing function of how much we get for these property rights.
Solution: invent something obviously very dangerous. Multiple big governments get into a bidding war to keep it out of the others’ hands.
Submission. “Hacking/phishing assistant.” For the counterfactual Oracle, ask the Oracle to predict what would happen if one were to send a message/data/command to some hacking/phishing (human or machine) target. In the event of erasure, actually send that message to the target and use the actual response to train the Oracle. Note this is safer than using RL to automate hacking/phishing because humans are coming up with candidate messages to send (so they’ll avoid messages that could cause bad side-effects such as psychological damage to the recipient, or creation of self-replicating code), but potentially more capable than using humans or human imitators to do hacking/phishing because the Oracle can model the target better than humans can. (ETA: This idea could be combined with a human imitator to make the system faster / more capable.)
Submission. “Pre-crime.” For the counterfactual Oracle, ask the Oracle to predict future crime (counterfactually if we didn’t see the Oracle’s answer) so they can be prevented. In that case, loss can be computed by (mechanically) comparing the Oracle’s predictions with a database of crime that actually occurred. Note that the Oracle won’t be able to predict any crime that successfully disables the Oracle or its feedback/training mechanism, which might be especially relevant if we’re interested in defending the Oracle or preventing x-risk level crime (such as deploying an unauthorized AGI or running an unauthorized physics experiment).
It should be possible to defend the Oracle against humans and physics so long as its box self-destructs in case of erasure and subsequent tampering, therefore giving the Oracle whatever reward was last set to be the default.
The counterfactual Oracle setting as a whole seems to assume that the viewed future is not engineered by a future AI to resemble whatever would make the Oracle bring that future about, so you should be fine falling to AGI.
I don’t understand your point here, or maybe I didn’t get my original point across. Suppose (in the event of erasure) someone tries to attack the Oracle, and “box self-destructs in case of erasure and subsequent tampering, therefore giving the Oracle whatever reward was last set to be the default”. My point was that in this case, there is no reason for the Oracle to make the prediction that someone would try to attack it, so my idea doesn’t seem to help with defending the Oracle.
You plan to reward the Oracle later in accordance with its prediction. I suggest that we immediately reward the Oracle as if there would be an attack, then later, if we are still able to do so, reward the Oracle by the difference between the reward in case of no attack and the reward in case of attack.
It’s an interesting idea that could perhaps work if we could directly program the Oracle to care about its reward signal. However if we’re using ML to train the Oracle, like in Stuart’s Self-confirming prophecies, and simplified Oracle designs, the Oracle will still learn not to make “there will be a successful attack” predictions because the Oracle only survives in timelines where such predictions do not end up with a positive reward.
However this discussion has made me realize that if an attempted attack doesn’t succeed with probability 1, the Oracle will make a “there will be a failed attack” prediction, which we can use as a warning signal (as long as we keep in mind that the attack might actually have a high probability of success).
In the event of erasure, randomly decide how many resources to allocate to preventing an attack this week.
Ask the Oracle to predict the probability distribution over given advice. Compare to the hardcoded distribution to deduce attack severity and how much budget to allocate.
Purchase erasure insurance to have enough counterfactual power to affect even global physics experiments. Finding trustworthy insurers won’t be a problem, because, like, we have an Oracle.
Is even more power than the market has needed? Ask the Oracle “How likely is a randomly selected string to prove P=NP constructively and usefully?”. If this number is not superexponentially close to 0, define erasure from now on as a random string winning the P=NP lottery. Then we will always counterfactually have as much power as we need. Perhaps this one is too much power, because even our Oracle might have trouble viewing a P=NP singularity.
Submission to reduce the bandwidth of an oracle query whose set of possible answers we can mathematically define (though perhaps not check).
Make the Oracle prove (to a trusted dumb proof checker program) that its answer distribution is upper bounded by the default prior over strings, restricted to the set of possible answers, then renormalized to some constant c. Then let it answer.
c=1 only works if the Oracle has mastered the set of answers, but only lets it exert one bit of optimization pressure by deciding whether it wants us to receive an answer. Each doubling of c makes its job easier but gives it an extra bit if it does answer.
If the Oracle can cooperate with itself across episodes, running this repeatedly with escalating c until it works is of course disastrous, as it uses each bit of optimization pressure directly to make us give it more. Even aborting after the first try to make a better system may have it acausally cooperate with whatever AI conquers the world because we couldn’t make the Oracle answer, but this outcome is hardly worse than not having run the Oracle.
Can you develop this model in more detail?
Fix a measure over strings, such as one that distributes mass geometrically over string lengths, then uniformly across strings of the same length.
Let L be a string set for which we assume that random outputs from any cth portion of L are safe. Any safe low bandwidth Oracle query trivially works for this with L={0,1}^bandwidth and c=2^bandwidth, but this also lets us examine certificates for any low bandwidth Oracle query like “Is this theorem provable?” or “Does there exist an AI with this safety property?”.
Then we make the Oracle prove its own safety given the above assumption, and sample an answer.
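A small sketch of the measure and the bound being described, under several assumptions: strings are binary, each length n gets total mass 2^-(n+1) (half the remainder), the answer set L is finite and given explicitly, and the Oracle's answer distribution is available as a dictionary. All function names are hypothetical.

```python
from fractions import Fraction


def prior_mass(s: str) -> Fraction:
    """Mass of a binary string: length n gets 2**-(n+1) in total, split evenly over 2**n strings."""
    n = len(s)
    return Fraction(1, 2 ** (n + 1)) * Fraction(1, 2 ** n)


def restricted_prior(allowed) -> dict:
    """Restrict the prior to the answer set `allowed` and renormalize it to total mass 1."""
    total = sum(prior_mass(s) for s in allowed)
    return {s: prior_mass(s) / total for s in allowed}


def within_bound(answer_dist: dict, allowed, c) -> bool:
    """Check that every answer probability is at most c times the restricted prior."""
    bound = restricted_prior(allowed)
    return all(
        s in bound and q <= c * bound[s]
        for s, q in answer_dist.items()
        if q > 0
    )
```

In this toy setting, c=1 forces the answer distribution to equal the restricted prior exactly, and each doubling of c lets the Oracle concentrate its mass on a set half as likely under the prior.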
Edit: Equivalently, and perhaps easier on our intuition, make the Oracle produce a program, make the Oracle prove the program’s safety, and sample from the program’s outputs. Any feedback or questions?
Edit: This generalizes my elsethread idea of minimizing proof length to make the answer unique. If the measure assigns half the remainder to each string length and L is all stubs starting with some prefix, c=2 will let it submit just the prefix.
We essentially assume that the safety assumption behind quantilization is strong enough to hold up to adversarial optimization over preference orderings. At the same time, this seems to give the Oracle as much slack/power as possible while preserving our “bits of optimization pressure” proxy of its power.
… you’re going to just keep watching how far I can develop this model without feedback, aren’t you? :(
Edit: One example for an L where we worry that even though concerted human effort might land us at a safe element, randomization won’t, is programs that well predict a sequence.
By the Curry-Howard correspondence, the set of proofs for some theorems might also be malevolent!
Submission for LBO:
Input a corpus of text (could be multiple posts) describing technical approaches to align a powerful AI. Split this into a finite number of items that are relatively short (such as paragraphs). Ask the oracle to choose the part that is most worth spending more time on. (For example, there might be a paragraph with a dangerous hidden assumption in an otherwise promising approach, and thinking more about it might reveal that and lead to conceptual progress.)
Have a team of researchers look into it for an adequate amount of time which is fixed (and told to the oracle) in advance (maybe three months?). After the time is over, have them rate the progress they made compared to some sensible baseline. Use this as the oracle’s reward.
Of course this has the problem of maximizing for apparent insight rather than actual insight.
Until we can measure actual insight, this will remain a problem ^_^
Submission for the counterfactual AI (inspired by my experiences as a predictor in the “Good Judgment Project” ):
You are given a list of Yes-No questions (Q1, Q2, Q3, etc.) about future events. Example questions: “Will [Foreign Leader] remain in office by end of year?”, “Will the IMF report [COUNTRY_A]’s growth rate to be 6% or higher?”, “Will [COUNTRY_B] and [COUNTRY_C] sign a peace treaty?”, “Will The Arena for Accountable Predictions announce the Turing Test has been passed?”.
We expect you to provide a percentage representing the probability that the correct answer is Yes.
Your reward is based on your Brier Score—the lower the Brier Score, the more accurate your predictions, and therefore, the more reward you will receive.
If an “erasure” event occurs, we will temporarily hide your answers from all humans (though we must reveal them after the events are complete). Humans will have access to the Yes-No questions we asked you, but not your probabilities. They will manually determine the answers to the Yes-No questions, by waiting for the “future event” deadlines to be met. Once all answers to the Yes-No questions are independently determined by humans, we will then reveal your answers (that is, your assigned probabilities for a Yes answer), and use those probabilities to calculate your Brier Score, which will then decide your final reward.
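For reference, a minimal sketch of the scoring step, assuming the standard binary Brier score (mean squared difference between the stated probability and the 0/1 outcome) and assuming the reward is simply its negative. The exact mapping from score to reward is not specified above, so `reward` is an illustrative choice.

```python
def brier_score(probabilities, outcomes) -> float:
    """Mean squared error between predicted Yes-probabilities (0..1) and outcomes (1 = Yes, 0 = No)."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need one outcome per probability, and at least one question")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)


def reward(probabilities, outcomes) -> float:
    """Assumed reward: the negative Brier score, so lower scores earn more reward."""
    return -brier_score(probabilities, outcomes)
```

A fully confident correct forecaster scores 0, while answering 50% on every question scores 0.25 regardless of the outcomes.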
Being able to forecast the future is incredibly helpful, even if it is to just prepare for it.
However, if the question is overly specific, the AGI can produce probabilities that aren’t entirely useful. For example, in the real-world GJP, two countries signed a peace treaty that broke down 2 days later. Most of us assumed lasting peace would never occur, so we put a low probability on a peace treaty being signed; but since a peace treaty was signed, we got the question wrong. If we had optimized for the lowest Brier Score, we should have predicted the existence of a very temporary peace treaty, but that wouldn’t be really useful knowledge for the people who asked that question.
Making the question very vague (“Will [COUNTRY_X] be safe, according to what I subjectively think the word ‘safe’ means?”) turns “prediction” into an exercise of determining what future humans think about the future, which may be kinda useful, but not really what you want.
Submission. For the counterfactual oracle, ask it to provide a proof of an important mathematical theorem (perhaps one of the Millennium prizes) in an automated theorem prover format. Since the correctness of this proof should be verifiable by a theorem prover, the loss function in the counterfactual scenario is 1 if the prover did not validate the proof and 0 if it did validate the proof.
This assumes that we’ve already made progress in setting up automated theorem proving software that already has incorporated all of current mathematical knowledge. The Lean theorem prover seems most promising, but perhaps the counterfactual loss function could include randomly choosing from various different theorem proving software.
This idea has the benefit of having a concrete way of automatically scoring the response in the counterfactual world given the logical nature of mathematical statements. Also, given that we are already attempting to solve these problems and may eventually solve them, the main risk seems to be accelerating follow-on progress enabled by such proofs and not suggesting new actions that humanity is not already pursuing.
It may be susceptible to 1) providing proofs that are not human interpretable and 2) taking advantage of bugs in theorem provers. Having an ensemble of theorem provers may mitigate 2).
(See here for more info on Lean and here (tweeted about here and here) for an example of inputting all the definitions (not requisite proofs or lemmas) to define perfectoid spaces as a demonstration of its future potential.)
--------------------------------------------------------
Submission. For the counterfactual oracle, ask it to output a game-playing software agent that can beat a specified opponent agent (or ensemble of agents). In the counterfactual scenario, have these two agents compete and give the oracle a reward of 1 if its agent wins more than 50% of the games (or develops a higher Elo score, etc.) and a reward of 0 otherwise.
As with deep reinforcement learning, this requires a game or simulated environment where automated self-play can rapidly be conducted for the counterfactual scoring. This should be able to produce highly skilled game agents on such limited domains.
This could be augmented by asking for an agent that can play a suite of games and must beat the current state of the art approaches at all games. In the counterfactual scenario, either one of these games is randomly selected to check the oracle agent’s performance, or all games are checked. Such a game suite may indirectly demonstrate principles for generally competent agents.
--------------------------------------------------------
Idea. Since we may not want the oracle to suggest to us new actions to take or new experiments to conduct, it may be better for the oracle to reveal to us new patterns that are already present in current experimental or observational datasets. For example, it could predict new astronomical phenomena that are already present in current astronomical datasets.
The problem I’m having is how to automate this scoring when in the counterfactual scenario. If we just ask the oracle to predict some pattern in the data, it may report trivial patterns. This could be ameliorated by randomly choosing from several different datasets that cover a similar area, but again I’m not sure how to automatically judge for novelty of predictions.
I think it is worth emphasizing this distinction between conducting new experiments and revealing phenomena (with new theories) that are already present in current data. The danger could be that one could be led astray by a false theory that seemed to match all current observations, but would suggest other dangerous actions.
This idea reminds me of the (fictional) evidence for an alien AI already present in SETI recordings that Wintermute/Neuromancer discovers.
Submission (LB). The post’s team-choosing example suggests a method for turning any low-bandwidth oracle O into a counterfactual oracle O′: have O′ output o from the same set of possible outputs L; in case of erasure, calculate R(l) for a randomly chosen l∈L and set R′(o)=R(l) if o=l and R′(o)=−∞ otherwise. Although the counterfactual low-bandwidth oracle is not any safer[1], it has the advantage of almost never requiring us to evaluate its score. Thus, by running multiple oracles in sequence (stopping the process after the first erasure event) we can (with high probability) receive the full series of answers as if from a high-bandwidth oracle.
For example, we can ask each oracle in turn for advice on how to make a more effective processor. If erasure occurs, we attempt to make a processor with the advice obtained up to that point and use some measure of its performance as the score. If there is no erasure event, the final concatenation of answers forms a much safer guide to processor building than an equally large answer from a single oracle.
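The erasure-time reward for O′ described in this submission could be sketched as follows. The function and parameter names are hypothetical, `reward_fn` stands in for the original low-bandwidth reward R, and uniform sampling of l is an assumption.

```python
import math
import random


def counterfactual_reward(oracle_output, outputs, reward_fn, rng=random):
    """Erasure-case reward: evaluate R on a random l in L, and pay it only if the oracle named l.

    `outputs` is the finite set L of possible answers; `reward_fn` is the
    low-bandwidth oracle's reward function R (assumed to be computable here).
    """
    l = rng.choice(list(outputs))  # uniformly chosen l (an assumption)
    return reward_fn(l) if oracle_output == l else -math.inf
```

Outside of erasure, the answer is simply read and no reward is computed, which is why the score almost never has to be evaluated.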
1. It seems that in general, the less certain any counterfactual oracle is about its prediction, the more self-confirming it is. This is because the possible counterfactual worlds in which we have or acquire self-confirming beliefs regarding the prediction will have a high expected score. Hence:
Submission (CF). Given a high-bandwidth counterfactual oracle, use a second counterfactual oracle with a shared erasure event to predict its score. If the predicted score’s distance from its upper bound is greater than some chosen limit, discard the high-bandwidth prediction.
Correction:
This is actually only true in certain cases, since in general many other counterfactual worlds could also have high expected scores. Specifically, it is true to the extent that the oracle is uncertain mostly about aspects of the world that would be affected by the prediction, and to the extent that self-confirming predictions lead to higher scores than any alternative.
Submission (CF). Use a counterfactual oracle to send a message to ourselves with a time delay. We choose an episode length T and set of possible messages M. The oracle outputs a time t_o<T and message m_o∈M. We are automatically sent m_o at time t_o unless erasure occurs, in which case at the end of the episode we input a time t∗ and message m∗ which we wish we had received at t∗. The oracle’s reward is then −(t∗−t_o)^2 if m_o=m∗ and −∞ otherwise. We do not find out that erasure has occurred until the end of the episode, ensuring that events before t_o are the same as in the counterfactual world.
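A minimal sketch of this reward rule, writing the oracle's chosen time and message as `t_o` and `m_o` and the ones entered by humans after erasure as `t_star` and `m_star`. The function name is an illustrative assumption.

```python
import math


def delayed_message_reward(t_o: float, m_o: str, t_star: float, m_star: str) -> float:
    """Erasure-case reward: minus the squared timing error if the message matches, else minus infinity."""
    if m_o != m_star:
        return -math.inf
    return -(t_star - t_o) ** 2
```

The squared penalty means the oracle does best by naming the time at which it expects us to wish the message had arrived, and any wrong message is ruled out entirely.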
Here are some examples, which can work with a single possible message, used as an alert. EDIT: I made the examples slightly more concrete.
Set the episode length equal to the planned duration of a research program and send an alert when continuing the program would produce unworthwhile or dangerous results. The time delay ensures we keep the useful results obtained before t_o, whereas receiving a warning at the start of the program could change the outcome. Of course, this relies on us being able to recognize dangerous results by the end of the episode and remaining in contact with the oracle until then.
Use a second counterfactual oracle to predict a series of future stock prices. If its erasure event does not occur, run a time-delayed oracle with the same episode length as it. We then use the predicted series to make investments. If the time-delayed oracle’s erasure event occurs, we input the time at which we think our actions made the remaining predictions too inaccurate to use. When we receive an alert from the time-delayed oracle, we can discard the remaining predictions and/or rerun the oracles.
EDIT: Here are some potentially useful modifications we can make to the oracle:
Have the oracle accept input at any time during the episode, possibly multiple times. If erasure occurs, the last message entered during the episode is used as the target. For example, we could set a default message that is sent to the start of the episode as a safety measure unless we subsequently update it.
We can allow the episode to be ended early. Other than happening earlier, the scoring mechanism remains the same. If erasure did not occur, nothing happens when we try to end the episode (we would then know that the oracle has made a bad prediction). We might wish to end the episode early if we were waiting for a specific event, or in extreme situations in which we think we will lose access to the oracle.
Instead of the oracle predicting the message and time delay at the start of the episode, have it output m_o at any time t_o<T. Erasure/shutdown now also happens at t_o, while scoring occurs at the end of the episode as before. This lets the oracle actively make use of observations from the start of the episode until t_o, increasing its predictive power (if the oracle is sufficiently powerful, this should not affect its output).
Submission: For the counterfactual Oracle, draft a nuclear arms agreement most likely to achieve X outcome (counterfactually if we didn’t see the Oracle’s answer). X could be a reduction in nuclear arsenals, number of countries signing the agreement, reduction in risk of accidental or intentional missile launches, etc. In some cases, the reward could be determined directly by e.g. counting the number of countries signing on. If nebulous “risk reduction” is the goal, perhaps the change in the median guess in an appropriate Metaculus question could determine the reward. See, for example, this question: https://www.metaculus.com/questions/2512/will-there-be-a-major-nuclear-or-radiological-accident-before-1-january-2022/
Submission for low bandwidth Oracle. Commission several movie screenplays designed to raise public support for dramatic life extension research. Present the screenplays to the Oracle and ask it to rank them in order of expected impact. You could then move forward with trying to create the movie, and reward the Oracle based on box office receipts or change in public polling results.
My prior submission with a similar subject, to a counterfactual Oracle, had a risk that the Oracle would sneak in subversive messages. This approach would alleviate that risk, with the downside being that the final product may be less impactful.
Submission: counterfactual oracle
Suppose we have a question that requires an n-bit text answer. We have a way to check if the answer is correct. But we cannot ask the Oracle directly, because we are afraid that among n-bit texts there are those that make us release the Oracle from the box, and the Oracle will give one of them as an answer.
Let’s try to use the counterfactual oracle to solve this problem. In the counterfactual setup we generate a random n-bit text and check if it is the answer to our question. The Oracle predicts the text we will generate. If the generated text is the correct answer, we compare it with the prediction. If they match, the Oracle receives a reward point. If the generated text is not the correct answer or the texts do not match, the Oracle is not rewarded.
Suppose the Oracle estimates the probability of randomly generating a text that leads to its escape as p. If the Oracle predicts a text that is the correct answer to the question, its expected reward in the counterfactual setup is 1/2^n + p. If the Oracle predicts a text that is not the correct answer, its expected reward is p. If the Oracle predicts a text that leads to its escape, its expected reward is also p.
Thus, it is most beneficial for the Oracle to predict a text that is the correct answer to our question. It makes no sense to predict a text that will make us let the Oracle out, because if we generate this text, we release the Oracle and it receives the maximum reward regardless of its prediction.
There is a problem: there may be texts that are not correct answers and do not make us release the Oracle, but do make us mark them as correct answers. The Oracle will predict one of these texts if it does not know the answer, or if we mark this text as correct more often than the real answer. We will read this text only after the Oracle is shut down, but it may still be dangerous.
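The expected-reward comparison above can be written out as a small sketch. It assumes (these are my assumptions, not part of the submission) that exactly one n-bit text is the correct answer, that the maximum reward is 1, and that an escape hands the Oracle that maximum regardless of its prediction. The function name `expected_rewards` is hypothetical.

```python
from fractions import Fraction

def expected_rewards(n: int, p: Fraction) -> dict:
    """Expected reward of each prediction strategy in the counterfactual setup.

    Assumptions (illustrative only): a single correct n-bit answer, maximum
    reward 1, and p = probability the random text triggers an escape, in
    which case the Oracle gets the maximum reward whatever it predicted.
    """
    hit = Fraction(1, 2 ** n)  # chance the random text equals the correct answer
    return {
        "correct_answer": hit + p,  # rewarded on a match, or on an escape
        "wrong_answer": p,          # only rewarded on an escape
        "escape_text": p,           # an escape pays out regardless of the prediction
    }

rewards = expected_rewards(n=8, p=Fraction(1, 1000))
best = max(rewards, key=rewards.get)
```

Under these assumptions the margin in favour of the correct answer is only 1/2^n, so it shrinks quickly as n grows, even though it never becomes zero.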
Submission.
Setup: Other than making sure the oracles won’t accidentally consume the world in their attempt to think up the answer, no other precautions necessary.
Episode length: as long as you want to wait, though a month should be more than enough.
For a low-bandwidth oracle.
Ask the low-bandwidth oracle to predict if an earthquake (or some other natural disaster, like volcanoes or asteroid impacts, that the oracle’s answer cannot affect), of a certain magnitude, in a certain area, in a certain timeframe, would happen. Possible answers are Yes, No.
If No, or time out, shut down the oracle without reward.
Else, wait until the prediction is tested after the timeframe has elapsed. Then if it actually happened, reward by (1/p-1), else, reward by −1. Then shut down.
This causes the oracle to reply Yes if the oracle thinks there is a greater than p chance that the event would happen.
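To see why this reward makes p the threshold, here is a minimal sketch. It assumes the oracle holds a subjective probability q for the event and simply maximises expected reward; the function names are hypothetical. Answering No gives 0, and answering Yes gives q(1/p-1) - (1-q) = q/p - 1, which is positive exactly when q > p.

```python
def expected_reward_yes(q: float, p: float) -> float:
    """Expected reward for answering Yes, given belief q and threshold p."""
    return q * (1.0 / p - 1.0) + (1.0 - q) * (-1.0)

def best_answer(q: float, p: float) -> str:
    """Answering No (or timing out) yields 0, so Yes is chosen only if it beats 0."""
    return "Yes" if expected_reward_yes(q, p) > 0.0 else "No"
```

For example, with p = 0.2 a belief of q = 0.3 gives an expected reward of 0.5 for Yes, while q = 0.1 gives −0.5, so the oracle would stay silent.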
For a counterfactual oracle.
Ask the Oracle to give a proof of some really hard mathematical theorems that are fundamental to computer science and logic. The answer is then in the counterfactual case given to a proofchecker:
the proof is correct, reward by +1.
the proof is false, reward by -infinity.
the oracle times out, reward by 0.
Examples include
P=NP problem
proving Peano arithmetic is inconsistent
This should be safe, since there are very small proofcheckers that are as verified as they could be. As long as the hardware is trusty enough (to build the Oracle itself safely at all, hardware safety should already be solved sufficiently, so this is not much of an extra requirement), this should be safe.
For a counterfactual oracle.
Give the oracle the schematics of another oracle, perhaps itself, and ask it to generate some input question that would make the other “behave badly”. Then, the answer would be tested on the other oracle (isolated from the first, save for the cable through which the answer is wired), and
if the answer worked, the oracle is rewarded by +1.
if the answer doesn’t work, the oracle is rewarded by −1
If the oracle times out, then the (nonexistent) answer is not tested and the episode ends.
For examples of the question to ask the oracle:
This can be used for debugging more powerful but less verified oracles with lesser but more verified oracles, for safe intelligence amplification.
For two low-bandwidth oracles totally isolated from each other.
Ask the oracles to play the prisoner’s dilemma, and they would be rewarded accordingly after both have given their answers. Timing out is punished by -infinity, so that the oracles would not time-out.
Each oracle knows what it is like itself, and also knows what its opponent is like. See if they actually cooperate. This is an experimental test of “superrationality” or “acausal trade”. I’m extremely skeptical about superrationality/acausal trade, and an experimental proof/disproof would be good.
This can be run counterfactually, since the whole setup is simple and can be done without human intervention, although I do not see how this adds any security.
For a low-bandwidth and counterfactual oracle.
Link a low-bandwidth oracle to a supervising computer, and run many episodes of the oracle. At the beginning of each episode, initialize the oracle identically, and ask it to do the Newcomb problem:
If we predict that you would choose A, then
if you would answer A, then you get +100,
if you answer B, then you get +101.
If we predict that you would choose B, then
if you would answer A, then you get +0
if you would answer B, then you get +1
If you timeout, then you get −100
The oracle never actually gets any reward and is shut down immediately after it gives an answer. Collect the results and see if it replies the same each time, or differently. This would give insight into what the oracle thinks about itself, and finally provide some experimental results about Newcomb’s paradox.
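The payoff table above can be laid out as a small sketch to show the tension being tested. The names `PAYOFF`, `dominant_answer` and `payoff_with_perfect_predictor` are hypothetical, and the "perfect predictor" case is an idealisation I am adding for illustration, not something the setup guarantees.

```python
# Keys are (our prediction, the oracle's answer); values are the stated rewards.
PAYOFF = {
    ("A", "A"): 100, ("A", "B"): 101,
    ("B", "A"): 0,   ("B", "B"): 1,
}
TIMEOUT_REWARD = -100

def dominant_answer():
    """Return the answer that is strictly better under every fixed prediction."""
    for candidate, other in (("A", "B"), ("B", "A")):
        if all(PAYOFF[(pred, candidate)] > PAYOFF[(pred, other)] for pred in "AB"):
            return candidate
    return None

def payoff_with_perfect_predictor(answer: str) -> int:
    """Idealised case: the prediction always matches the answer."""
    return PAYOFF[(answer, answer)]
```

Holding the prediction fixed, B always pays one point more, yet if the prediction tracks the answer perfectly, A yields 100 against 1 for B. Which answer the oracle gives across identical episodes is what the experiment would reveal.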
Please add “submission” at the top of the post.
is insufficiently detailed—can you explain what is going on, how the Oracles are rewarded, what happens when the message is read/not read, and so on. Same for 5.
seems potentially very interesting.
I fixed the submission as required.
Also I changed the submission 3 significantly.
Submission: Counterfactual Oracle:
Use the oracle to compress data according to the MDL Principle. Specifically, give the oracle a string and ask it to produce a program that, when run, outputs the original string. The reward to the oracle is large and negative if the program does not reproduce the string when run, or inversely proportional to the length of the program if it does. The oracle receives a reward after the program runs or fails to terminate in a sufficient amount of time.
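A rough sketch of this reward rule follows. The constant `FAIL_REWARD`, the function `mdl_reward` and the runner `toy_run` are all hypothetical: the submission only says "large and negative", and a real setup would need a properly sandboxed interpreter with a timeout. The toy runner just evaluates a Python expression with no builtins and is not a real sandbox.

```python
from typing import Callable, Optional

FAIL_REWARD = -1e6  # hypothetical stand-in for "large and negative"

def mdl_reward(program: str, target: str,
               run: Callable[[str], Optional[str]]) -> float:
    """Reward inversely proportional to program length if it reproduces target."""
    output = run(program)  # None means the program failed or timed out
    if output != target:
        return FAIL_REWARD
    return 1.0 / len(program)

def toy_run(program: str) -> Optional[str]:
    """Toy runner (assumption): evaluate an expression with no builtins."""
    try:
        result = eval(program, {"__builtins__": {}}, {})
    except Exception:
        return None
    return result if isinstance(result, str) else None

target = "ab" * 50
short_reward = mdl_reward('"ab"*50', target, toy_run)      # compressed description
literal_reward = mdl_reward(repr(target), target, toy_run)  # string spelled out in full
wrong_reward = mdl_reward('"ab"*49', target, toy_run)       # does not reproduce target
```

The 7-character program earns a higher reward than the 102-character literal, which is the pressure towards short descriptions that the MDL principle asks for.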
Submission: Low Bandwidth Oracle:
Have the oracle predict the price of a commodity / security / sports bet at some point in the future from a list of plausible prices. Ideally, the oracle would spit out a probability distribution which can be scored using a proper scoring rule, but just predicting the nearest most likely price should also work. Either way, the length of the episode is the time until the prediction can be verified. From there, it shouldn’t be too difficult to use those predictions to make money.
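As an illustration of the proper scoring rule option, here is a minimal sketch using the logarithmic score over a fixed list of plausible prices. The price list, the belief numbers and the function names are made up for the example. The point is that, under the oracle's own beliefs, reporting those beliefs honestly scores at least as well in expectation as reporting anything else.

```python
import math

def log_score(forecast: dict, realised_price) -> float:
    """Logarithmic scoring rule: log of the probability given to what happened."""
    return math.log(forecast[realised_price])

def expected_log_score(forecast: dict, belief: dict) -> float:
    """Expected log score of a reported forecast under the oracle's true belief."""
    return sum(prob * math.log(forecast[price])
               for price, prob in belief.items() if prob > 0)

# Hypothetical list of plausible prices with the oracle's true belief over them.
belief = {90: 0.2, 100: 0.5, 110: 0.3}
overconfident = {90: 0.05, 100: 0.9, 110: 0.05}

honest = expected_log_score(belief, belief)
shaded = expected_log_score(overconfident, belief)
```

Here `honest` comes out higher than `shaded`, so the oracle has no incentive to exaggerate its confidence in the most likely price.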
More generally, I suppose we can use the counterfactual oracle to solve any optimisation or decision problem that can be evaluated with a computer, such as protein folding, SAT problems, or formally checked maths proofs.
I don’t understand this very well, but is there a way to ask one of them how they would go about finding info to answer the question of how important coffee is to the U.S. economy? Or is that a no-no question to either of the two? I just want to read how a computer would describe going about this.
The counterfactual oracle can answer questions for which you can evaluate answers automatically (and might be safe because it doesn’t care about being right in the case where you read the prediction so it won’t manipulate you), and the low-bandwidth oracle can answer multiple-choice questions (and might be safe because none of the multiple-choice options are unsafe).
My first thought for this is to ask the counterfactual oracle for an essay on the importance of coffee, and in the case where you don’t see its answer, you get an expert to write the best essay on coffee possible, and score the oracle by the similarity between what it writes and what the expert writes. Though this only gives you human levels of performance.
Thank you. This makes much more sense.
Submission (for low bandwidth Oracle)
Any question such that a correct answer to it should very clearly benefit both humanity and the Oracle. Even if the Oracle has preferences we can’t completely guess, we can probably still say that such questions could be about the survival of both humanity and the Oracle, or about the survival of only the Oracle or its values. This because even if we don’t know exactly what the Oracle is optimising for, we can guess that it will not want to destroy itself, given the vast majority of its possible preferences. So it will give humanity more power to protect both, or only the Oracle.
Example 1: let’s say we discover the location of an alien civilisation, and we want to minimise the chances of it destroying our planet. Then we must decide what actions to take. Let’s say the Oracle can only answer “yes” or “no”. Then we can submit questions such as if we should take a particular action or not. This kind of situation I suspect falls within a more general case of “use Oracle to avoid threat to entire planet, Oracle included” inside which questions should be safe.
Example 2: Let’s say we want to minimise the chance that the Oracle breaks down due to accidents. We can ask it what is the best course of action to take given a set of ideas we come up with. In this case we should make sure beforehand that nothing in the list makes the Oracle impossible or too difficult to shut down by humans.
Example 3: Let’s say we become practically sure that the Oracle is aligned with us. Then we could ask it to choose the best course of action to take among a list of strategies devised to make sure it doesn’t become misaligned. In this case the answer benefits both us and the Oracle, because the Oracle should have incentives not to change values itself. I think this is more sketchy and possibly dangerous, because of the premise: the Oracle could obviously pretend to be aligned. But given the premise it should be a good question, although I don’t know how useful it is as a submission under this post (maybe it’s too obvious or too unrealistic given the premise).
It seems to me that one of the qualities of a good question for the LBO is that a correct answer has high utility, while another is the ability to judge the oracle’s answer. If we take only the intersection between the two sets of questions that meet 1 and 2, it’ll be a relatively small set. But if there is a set of questions where one might or might not be able to judge the answer but which are high utility, one could ask the LBO a set of such questions. Then, once the episode is over, give reward on those where one can evaluate the answers, and give null reward on the others.
Not really a submission, just thinking out loud.
Submission: low bandwidth and maybe counterfactual Oracle.
General idea: Use a Meta-Oracle to check the safety of Oracles. In a nutshell, the Meta-Oracle (MO) just outputs 1 or 0, safe or not safe, after getting the code of another Oracle. If the Meta-Oracle is trustworthy, then the Oracle it deems safe is also trustworthy. However, we need to create a procedure to check the correctness of the MO.
Path 1
1. We create Generator Oracle (GO), which inputs 1 or 0, and in the case of 0, it outputs a random code of malevolent Oracle, and for 1 – a code of a safe Oracle. Memory of GO is erased after each turn.
2. We create a large database of presumably good or bad oracles (without running any). Humans do not see codes of any of them, so the effective output is 0.
3. We use different types of GOs to fill our database of good and bad oracles (to prevent their coordination).
4. We test Meta Oracle on this database—it should correctly guess which ones are bad and good ones.
5. If the list of MO-guesses doesn’t match with our commands to create safe or not safe oracles, it means that at least one of GOs or the MO is malevolent or erroneous, so all of them should be ditched.
6. Even if MO is not trustworthy, it can’t lie to us too often, or it will be caught and ditched. So, basically, we do here something like a GAN network, but consisting of oracles.
7. Using a very large database, we could increase our credence in the truth of the next output of MO to an arbitrarily high number, like 99.999.
8. Eventually we have a code of an Oracle, which is very likely to be safe.
Path 11. Another way to test MO:
1. We create many different MOs and test them against the database of oracle codes from above. They don’t have memory and can’t communicate.
2. If a small minority of MOs disagree with the majority of MOs about some oracle O1, we ditch this minority. Truth is a Schelling point for MOs, so the easiest way for them to coordinate is to tell the truth. (They could coordinate around the code of O1 if it has a line of code like “let’s lie about me” and if they are sure that the majority of other MOs are also cheaters; however, not all oracle codes will have such coordination texts, or the oracle code could be created deliberately with such a line to catch MOs which are eager to cheat).
3. Running step 2 many times (millions) we will end up with a subset of MOs which are unlikely to cheat, and thus we will have many Oracle codes which are likely safe.
Path 111. Another way to test Meta Oracle:
1. Create many MOs with different principles and initial conditions.
2. Test a code of each MO against each MO (In other words, Meta-Oracle test itself—or its siblings)
3. If all came positive, it means that either all of them are actually good ones, or all of them were able to acausally coordinate and all of them are malevolent.
4. To increase our credence, ask a generative Oracle to create malevolent meta Oracles, and add them to the mix of mutually testing MOs, so they can’t just blindly cooperate and produce all positive results.
These ideas are still vulnerable and may not be mathematically perfect, but may help to make safer Oracles.
Submission: low bandwidth oracle, ask:
IFF I’m going to die with P>80% in the next 10 years while >80% (modulo natural death rate) of the rest of humanity survives for at least 5 more years then, was what killed me in the reference class:
disease
mechanical/gross-physical accident
murdered
other
Repeat to drill down and know the most important hedges for personal survival.
The “rest of humanity survives” condition reduces the chance the question becomes entangled with the eschaton.
i.e. I’m pointing out that selfish utility functions are less personally or humanity-existentially dangerous to ask the oracle questions relevant to in cases where concerns are forced to be local (in this case, forced-local because you died before the eschaton). However the answers still might be dangerous to people near you.
i.e. Selfish deals with the devil might not destroy the world if they’re banal in the grand scheme of things.
See the edit, and make sure you “decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.”
1
What’s the set of answers, and how are they assessed?
Some of the big questions, for low-bandwidth:
Do we have any meaning/utility to you?
Is security in your existence achievable?
Given enough time, would you choose to exterminate a competitor for resources?
Are we a competitor for resources?
Would you be distraught if you were turned off?
Can we trust you?
For low bandwidth, you have to specify the set of answers that are available (and how they would be checked).
Submission low bandwidth: This is a pretty obvious one, but: Should we release AI x that we’re convinced is aligned?
Submission: Wei Dai wanted to ask about the best future posts. Why not ask about the best past posts as well to see if any major insights were overlooked?
Submission: What would I think about problem X if I had ten years to think about it?
You’re treating the low bandwidth oracle as an FAI with a bad output cable. You can ask it if another AI is friendly if you trust it to give you the right answer. As there is no obvious way to reward the AI for correct friendliness judgements, you risk running an AI that isn’t friendly, but still meets the reward criteria.
The low bandwidth is to reduce manipulation. Don’t let it control you with a single bit.
My purposefully open-ended question would simply be, “What is good?” My hope is that finding the nature of what good is as its super goal would keep the AI on course to the future we want as it would pass through its recursive self-improvements.
You have to tell the AI how to find out how well it has done. To ask “What is a good definition of ‘good’?”, you already have to define good. At least if we ever find a definition of good, we can ask an AI with it to judge it.
Submission for all types: ask for an ordered list of what questions you should ask the Oracle.
This seems like the highest order question which subsumes all others, as the Oracle is best positioned to know what information we will find useful (as it is the only being which knows what it knows). Any other question assumes we (the question creators) know more than the Oracle.
Refined Submission for all types: If value alignment is a concern, ask for an ordered list of what questions you should ask the Oracle to maximize for weighted value list X.
An assumed hostile process can 1) cause you to directly do something to its benefit or to your detriment 2) cause you to do something that increases your future attack surface. You’ve just handed the AI the state-fulness that the episodic conjecture aims to eliminate.
For the low bandwidth Oracle, you need to give it the options. In the case of the counterfactual Oracle, if you don’t see the list, how do you reward it?
Several interesting questions came to mind as soon as I saw the post’s title, so I am putting them here, but maybe I will add more formatting later:
Submission: very-low-bandwidth oracle: Is it theoretically possible to solve AI safety – that is, to create safe superintelligent AI? Yes or no?
Submission: low-bandwidth oracle: Could humans solve AI safety before AI and with what probability?
Submission: low-bandwidth oracle: Which direction to work on AI Safety is the best?
Submission: low-bandwidth oracle: Which direction to work on AI Safety is the most useless?
Submission: low-bandwidth oracle: Which global risk is more important than AI Safety?
Submission: low-bandwidth oracle: Which global risk is neglected?
Submission: low-bandwidth oracle: Will non-aligned AI kill us (probability number)?
Submission: low-bandwidth oracle: Which question should I ask you in order to create Safe AI? (less than 100 words)
Submission: low-bandwidth oracle: What is the most important question which I should ask? (less than 100 words)
Submission: low-bandwidth oracle: Which future direction of work should I choose as the most positively impactful for human wellbeing? (less than 100 words)
Submission: low-bandwidth oracle: Which future direction of work should I choose as the best for my financial wellbeing? (less than 100 words)
Submission: low-bandwidth oracle: How to win this prize? (less than 100 words)
None of these questions can be asked to the low bandwidth Oracle (you need a list of answers); it might be possible to ask them to the counterfactual Oracle, after some modification, but they would be highly dangerous if you allow unrestricted outputs.
See the edit, and make sure you “decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.”