bryjnar

Karma: 914

bryjnar Feb 29, 2024, 10:47 PM
5 points
0
in reply to: Ben Pace’s comment on: Acting Wholesomely
Often the effect of being blinded is that you take suboptimal actions. As you pointed out in your example, if you see the problem then all sorts of cheap ways to reduce the harmful impact occur to you. So perhaps one way of getting to the issue could be to point at that: “I know you care about my feelings, and it wouldn’t have made this meeting any less effective to have had it more privately, so I’m surprised that you didn’t”?

bryjnar Jun 23, 2022, 5:50 PM
22 points
on: Half-baked AI Safety ideas thread
Wireheading traps.
An agent is “wireheading” if it is taking an action that a) provides it with enormous amounts of utility for little effort, b) is trivial or seemingly unrelated to its “main” utility function or goals.
People have discussed the possibility of an AI wireheading as a problem for getting it to do what you want, “what if it just works out a way to set your metric to +ve infinity and then goes to sleep satisfied?”.
But we can also use this as a guard-rail.
A “wireheading trap” is an action which a) is very hard for an AI to do below a level of capacity X, but very easy for it to do afterwards, b) provides it with vastly more utility than anything else it could do, and c) has little effect on the world otherwise.
An AI with a wireheading trap that is set to be doable if it attains world-affecting capabilities “fails closed”, because if it becomes more capable it will effectively switch itself off by wireheading. An example of a trap could be something like “etch ‘Bill is a poo poo head’ in 10m high letters into Mt Everest”. Very hard if you don’t have the ability to affect the physical world, but if you have nanotech… why bother melting humanity when you can just deface Mt Everest and be done with it?
Obvious problems:
- Convergent instrumental goals. We don’t want the AI to wipe us out in order to avoid us stopping it from wireheading, however unlikely that might be. Unclear what to do about this, myopia is one possibility (once you’ve wireheaded you don’t care if you’re stopped later?), but this has problems.
- If you make it too attractive the AI won’t even try and be useful before it has enough capacity, instead just wasting all its time on futile attempts to wirehead. Unclear how to make it attractive enough that it dominates once the capability is there but not before.
Overall very half-baked, but I wonder if there’s something to be done in the general area of “have the AI behave in a way that neuters it, but only when its capabilities increase”.

bryjnar Apr 10, 2022, 3:54 PM
12 points
on: The Proof of Doom
We have trained ML systems to play games, what if we trained one to play a simplified version of the “I’m an AI in human society” game?

Have a population of agents with preferences, the AI is given some poorly specified goal, it has the ability to expand its capabilities etc. You might expect to observe things like a “treacherous turn”.

If we could do that it would be quite the scary headline “Researchers simulate the future with AI and it kills us all”. Not proof, but perhaps viral and persuasive.

bryjnar Apr 12, 2018, 9:16 PM
1 point
in reply to: DanArmak’s comment on: Meaning and Moral Foundations Theory
I think I would argue that harm/care isn’t obviously deontological. Many of the others are indeed about the performance of the action, but I think arguably harm/care is actually about the harm. There isn’t an extra term for “and this was done by X”.
That might just be me foisting my consequentialist intuitions on people, though.

bryjnar Apr 12, 2018, 9:14 PM
1 point
in reply to: Ben Pace’s comment on: Is Rhetoric Worth Learning?
“What if there’s an arms race / race to the bottom in persuasiveness, and you have to pick up all the symmetrical weapons others use and then use asymmetrical weapons on top of those?”
Doesn’t this question apply to other cases of symmetric/asymmetric weapons just as much?
I think the argument is that you want to try and avoid the arms race by getting everyone to agree to stick to symmetrical weapons because they believe it’ll benefit them (because they’re right). This may not work if they don’t actually believe they’re right and are just using persuasion as a tool, but I think it’s something we could establish as a community norm in restricted circles at least.

bryjnar Apr 8, 2018, 10:49 AM
21 points
0
on: Local Validity as a Key to Sanity and Civilization
The point that the Law needs to be simple and local so that humans can cope with it is also true of other domains. And this throws up an important constraint for people designing systems that humans are supposed to interact with: you must make it possible to reason simply and locally about them.
This comes up in programming (to a man with a nail everything looks like a hammer): good programming practice emphasises splitting programs up into small components that can be reasoned about in isolation. Modularity, compositionality, abstraction, etc. aside from their other benefits, make it possible to reason about code locally.
Of course, some people inexplicably believe that programs are mostly supposed to be consumed by computers, which have very different simplicity requirements and don’t care much about locality. This can lead to programs that are very difficult for humans to consume.
Similarly, if you are writing a mathematical proof, it is good practice to try and split it up into small lemmas, transform the domain with definitions to make it simpler, and prove sub-components in isolation.
Interestingly, these days you can also write mathematical proofs to be consumed by a computer. And these often suffer some of the same problems that computer programs do—because what is simple for the computer does not necessarily correspond to what is simple for the human.
(Tendentious speculation: perhaps it is not a coincidence that mathematicians tend to gravitate towards functional programming.)

bryjnar Apr 8, 2018, 10:21 AM
12 points
in reply to: Vanessa Kosoy’s comment on: Is Rhetoric Worth Learning?
I am reminded of Guided by the Beauty of our Weapons. Specifically, it seems like we want to encourage forms of rhetoric that are disproportionately persuasive when deployed by someone who is in fact right.
Something like “make the structure of your argument clear” is probably good (since it will make bad arguments look bad), “use vivid examples” is unclear (can draw people’s attention to the crux of your argument, or distract from it), “tone and posture” are probably bad (because the effect is symmetrical).
So a good test is “would this have an equal effect on the persuasiveness of my speech if I was making an invalid point?”. If the answer is no, then do it; otherwise maybe not.

Meaning and Moral Foundations Theory

bryjnarApr 7, 2018, 5:59 PM

14 points

8 comments2 min readLW link

bryjnar Feb 11, 2018, 10:48 AM
1 point
in reply to: James Payor’s comment on: Beta-Beta Testing: Frontpage Rework
Yes, this is very annoying.

bryjnar Jan 20, 2018, 3:03 PM
5 points
on: An Apology is a Surrender
I found Kevin Simmler’s observation that an apology is a status lowering to be very helpful. In particular, it gives you a good way to tell if you made an apology properly—do you feel lower status?
I think that even if you take the advice in this post you can make non-apologies if you don’t manage to make yourself lower your own status. Bits of the script that are therefore important:
- Being honest about the explanation, especially if it’s embarassing.
- Emphasise explanations that attribute agency to you—“I just didn’t think about it” is bad for this reason.
- Not being too calm and clinical about the process—this suggests that it’s unimportant.
This also means that weird dramatic stuff can be good if it actaully makes you lower your status. If falling to your knees and embracing the person’s legs will be perceived as lowering your status rather than funny, then maybe that will help.

bryjnar Jan 5, 2018, 2:11 PM
7 points
in reply to: alkjash’s comment on: Choice begets regret
This is a great point. I think this can also lead to cognitive dissonance: if you can predict that doing X will give you a small chance of doing Y, then in some sense it’s already in your choice set and you’ve got the regret. But if you can stick your fingers in your ears enough and pretend that X isn’t possible, then that saves you from the regret.
Possible values of X: moving, starting a company, ending a relationship. Scary big decisions in general.
Something that confused me for a bit: people use regret-minimization to handle exporation-exploitation problems, shouldn’t they have noticed a bias against exploration? I think the answer here is that the “exploration” people usually think about involves taking an already known option to gain more information about it, not actually expanding the choice set. I don’t know of any framework that includes actions that actually change the choice set.

Choice begets regret

bryjnarJan 4, 2018, 8:28 PM

12 points

5 comments4 min readLW link

bryjnar Dec 30, 2017, 6:42 PM
1 point
on: The expected value of the long-term future
I’ve read it shallowly, and I think it’s generally good. I think I’ll have some more comments after I’ve thought about it a bit more. I’m surprised either by the lack of previous quantitative models, or the lack of reference to them (which is unsurprising if they don’t exist!). Is there really nothing prior to this?

bryjnar Nov 11, 2017, 4:06 PM
2 points
in reply to: habryka’s comment on: The Copernican Revolution from the Inside
I would dearly, dearly love to be able to use the fairly-standard Markdown footnote extension.

bryjnar Oct 27, 2017, 4:39 PM
3 points
in reply to: kvas’s comment on: Zero-Knowledge Cooperation
I think your example won’t work, but it depends on the implementation of FHE. If there’s a nonce involved (which there really should be), then you’ll get different encrypted data for the output of the two programs you run, even though the underlying data is the same.
But you don’t actually need to do that. The protocol lets B exfiltrate one bit of data, whatever bit they like. A doesn’t get to validate the program that B runs, they can only validate the output. So any program that produces 0 or 1 will satisfy A and they’ll even decrypt the output for you.
That does indeed mean that B can find out if A is blackmailable, or something, so exposing your source code is still risky. What would be really cool would be a way to let A also be sure what program has been run on their source by B, but I couldn’t think of a way to do this such that both A and B are sure that the program was the one that actually got run.

bryjnar Oct 26, 2017, 9:21 PM
1 point
in reply to: scarcegreengrass’s comment on: Zero-Knowledge Cooperation
I haven’t read Age of Em, but something like “spur safes” was an inspiration (I’m sure I’ve come across the idea before). My version is similar except that
1. It’s stripped down.
  1. B only needs to make a Validator, which could be a copy of themself, but doesn’t have to be.
  2. It only validates A to B, rather than trying to do both simultaneously. You can of course just run it twice in both directions.
2. You don’t need a trusted computing environment.
I think that’s a pretty big deal, because the trusted computing environment has to be trusted enough to run its end of A/B’s secure channels. In order for A/B to trust the output, it would need to e.g. be signed by their private keys, but then the computing envionment has access to those keys and can do whatever it wants! The trick with FHE is to let B run a computation using their secret key “inside” the safe without letting anyone else see the key.

bryjnar Oct 25, 2017, 11:36 PM
2 points
in reply to: philh’s comment on: Zero-Knowledge Cooperation
Pretty much! Expanding your explanation a little:
1. A sends msg_1 = Encrypt(A_source, A_key), and sends that to B
2. B wants to run Validate(source) = Sign(Check_trustworthy(source), B_key) on A_source, but can’t do that directly because B only has an encrypted version.
  1. So B runs Validate under FHE on msg_1, producing msg_2 = Encrypt(Validate(A_source), A_key), and sends that to A.
3. A decrypts msg_2, producing msg_3 = Validate(A_source) = Sign(Check_trustworthy(A_source), B_key), and sends that back to B (if it meets the agreed-on format).
4. B has a claim that A’s source is trustworthy, signed by B’s key, which A can’t have, so it must have been produced by B’s program.
Step 2.1 is where the magic happens.
(I should have just put this in the post!)

Zero-Knowledge Cooperation

bryjnarOct 25, 2017, 5:35 AM

16 points

7 comments4 min readLW link

bryjnar Sep 24, 2014, 9:56 PM
8 points
on: Newcomblike problems are the norm
Fantastic post, I think this is right on the money.

Many more Newcomblike scenarios simply don’t feel like decision problems: people present ideas to us in specific ways (depending upon their model of how we make choices) and most of us don’t fret about how others would have presented us with different opportunities if we had acted in different ways.

I think this is a big deal. Part of the problem is that the decision point (if there was anything so firm) is often quite temporally distant from the point at which the payoff happens. The time when you “decide” to become unreliable (or the period in which you become unreliable) may be quite a while before you actually feel the ill effects of being unreliable.

bryjnar Jun 23, 2014, 9:07 PM
1 point
in reply to: CalmCanary’s comment on: Utilitarianism and Relativity Realism
You cannot possibly gain new knowledge about physics by doing moral philosophy.

This seems untrue. If you have high credence in the two premisses:
- If X were a correct physical theory, then Y.
- Not Y.
then that should decrease your credence in X. It doesn’t matter whether Y is a proposition about the behaviour of gases or about moral philosophy (although the implication is likely to be weaker in the latter case).

bryjnar

Mean­ing and Mo­ral Foun­da­tions Theory

Choice begets regret

Zero-Knowl­edge Cooperation

Meaning and Moral Foundations Theory

Zero-Knowledge Cooperation