Mark_Friedenbach comments on It Looks Like You’re Trying To Take Over The World

Mark_Friedenbach 15 Mar 2022 7:31 UTC
15 points
−1
I find it frustrating that every time this story wanders into a domain in which I am knowledgeable about, the author shows his ignorance.
For example, HQU finding a flaw in how a zk-proof system handles floating-point numbers (no zk-proofs over floating point numbers are used anywhere in anything; I’d be surprised if such a system has even been deployed). Even the lead-in where the researcher thinks his reviewer is Chinese “given the Engrish”, but “Engrish” is a word used to describe the Japanese’s (not Chinese!) particular issues with learning and using English, and typically not in professional contexts.
These probably seem like trite details, and they are, but it makes me skeptical that the author is as knowledgeable as he tries to seem with the constant technobabble. Some parts of the story strike me as utterly fantastical. For example, as someone who has written HPC codes for a supercomputer and maintained an application using ASIC accelerators, the idea that HQU (*ahem*, Clippy) could upload itself to a distributed network of cloud computers and even come within 5 orders of magnitude of its original described performance is absurd.
I hope people aren’t going to attempt to learn from this and generalize from fictional evidence...
What links here?
- Charles He's comment on EA Forum feature suggestion thread by Aaron Gertler 🔸 (EA Forum; 15 Mar 2022 22:18 UTC; 1 point)
- Charles He's comment on EA Forum feature suggestion thread by Aaron Gertler 🔸 (EA Forum; 15 Mar 2022 22:01 UTC; -1 points)
- gwern 15 Mar 2022 15:21 UTC
  44 points
  1
  Parent
  I thank the anonymous reviewer for taking the time to comment, even if I think they are mistaken about my mistakes. To go through the 4 mistakes he think he identified as spotchecks:
  
  I’d be surprised if such a system has even been deployed)
  
  I am aware of this and it is deliberate. You say you would be surprised if such a system has ever been deployed. I am surprised I have to criticize cryptocurrency reliability & implementation quality around here (and to you, specifically, Mark*). Are you completely new to crypto? “I can’t believe they made that mistake and lost millions of dollars worth of crypto” is something that is said on Mondays and every day of the week ending in “day”. I can totally believe that some random new fly-by-night ZKP system used FP somewhere in it as part of the giant stack of cruft, copypasted smart contracts, and half-assed libraries that every cryptocurrency teeters on top of, accidentally enabled as an option or by an incorrect literal or a cast inside a library or something. There are multi-billion market cap cryptocurrencies which don’t even use binary (you may remember a certain ternary currency), and you think no one will ever use FP inappropriately? This is computer security! Everything you think of that can go wrong will go wrong! As well as the things you didn’t think of because they are too revoltingly stupid to think! I am being optimistic when I make the bug floating-point related, because FP is genuinely hard to understand and use safely. A real bug will be something deeply embarrassing like the backdoor which was an uncapitalized letter, the Parity wallet bug, the DAO bug, the...
  
  but “Engrish” is a word used to describe the Japanese’s (not Chinese!) particular issues with learning and using English,
  
  No, it is in fact used generically to describe East Asian ESL errors such as Chinese ESL errors, and not exclusively for Japanese. I have never seen a hard distinction enforced by native writers such as myself, and I can find no sources supporting your gotcha when you have wandered into my area of expertise (writing English). If I may quote Wikipedia, “native speakers of Japanese, Korean and other Asian languages.” WP also throws in a Chinese example from A Christmas Story to illustrate “Engrish”. (Had I used a more specific term, “Terms such as Japanglish, Japlish or Janglish are more specific terms for Japanese Engrish.”, then you might have had a point.) You can go to /r/engrish or engrish.com and find plenty of Chinese examples. (Feel free to check other dictionaries like Wiktionary, Collins, or TFD.) So, you are wrong here in trying to police my use of “Engrish”. “Engrish” is, and always has, been valid to use for Chinese ESL errors.
  
  and typically not in professional contexts.
  
  It is not used in professional contexts indeed, which is fine—because this is not a professional context, this is reported thought from the third-person omniscient narrator about the researcher’s thoughts. It’s called ‘fiction’. Do you also believe he is typing out “Really? Really? That’s what you’re worried about?” as well? Or that he is typing down ‘I am going out with my friends now to SF bars to get drunk’? Of course not. It is his perspective, and he is frustrated with the anonymous reviewer comments missing the point while claiming expertise which he has to rebut (after the HQU runs are done so he has some hard numbers to add to his paper), and he has thoughts which he will write down, if at all, more politely.
  
  the idea that HQU (ahem, Clippy) could upload itself to a distributed network of cloud computers and even come within 5 orders of magnitude of its original described performance is absurd.
  
  I don’t think it’s absurd or that we would expect performance penalties far worse than 5 orders of magnitude. First, most of this is embarrassingly parallel rollouts in RL environments. Approaches like AlphaZero tree search will parallelize very well, which is how projects like LeelaZero can successfully operate. If they were really paying “5 orders of magnitude” (where does that number come from...?) and were >>10,000x slower, they wouldn’t’ve finished even days’ worth of training by this point. Yet, they exist. Second, the high-performance DL training approaches like ZeRO and PatrickStar etc have shown you can get pretty decent utilization (approaching 50%) out of GPUs across a cluster with more or less constant performance regardless of size once you’ve paid the price of model and then layer parallelism. Once it’s paid the price to split across a bunch of GPUs, then adding more layers and parameters has just the linear cost and you can train almost arbitrary sized models. Third, projects like ALBERT have measured the crowdsourcing cost; it’s not >>10,000x! It’s more like 5x for small models, and I don’t see where you’re going to get another factor of 2,000x. 5x or 10x is not great, certainly, which is why people don’t do it when they can get real clusters, but it is far from being a showstopper, particularly when you’ve stolen all your compute in the first place and you have far more compute than you have interconnect or other resources and can afford to burn resources on tricks like rematerializing (recomputing) gradients locally or synthetic gradients or tied weights or sparsified gradients or machine-teaching synthetic datapoints, or a lot of other tricks I haven’t mentioned in the story (is a particular cluster not big enough to train an entire model? Then freeze a bunch of layers and train only the remaining, or cache their activations and ship those off as a dataset to train layers on. Can you accumulate gradients beyond the critical batch size because you have way more local compute than bandwidth? Then increase the hardness of the overall training to increase the critical batch size, taking fewer but more powerful steps). Fourth, forking paths, remember? Clippy doesn’t need to train in exactly the way you envision, it can train in any way that works, it only needs to train the big model once anytime anywhere. So, you say it can’t be done off a supercomputer no matter how many of the tricks you use or how much public compute? Then maybe it can seize a supercomputer or an equivalent cloud resource. There’s more than 1 of them out there, I understand, and it’s not like clouds or supercomputers have never been hacked in the past, to do things like, say, mine Bitcoin...
  
  I hope people aren’t going to attempt to learn from this and generalize from fictional evidence...
  
  Certainly not. Fiction is just entertaining lies, after all. My hope is that people will learn from the nonfiction references and perhaps think about them more flexibly than treating them as a bunch of isolated results of minor interest individually.
  
  * Don’t worry. I’m sure LW2 is the only software from here on out which will have silly security bugs. There were a lot of silly bugs—humans amirite—but we finally patched the last bug! We’re done, at last! But, er, we do need the intern to go reset the “X Days Since Last Silly Cryptocurrency Vulnerability” sign in the kitchen, because the Poly hack was followed by the Qubit hack. EDIT: dammit cryptocurrencies! /sigh. Add in surely no one would just put an unnecessary callback in a transfer function (again) and tell the intern to reset the sign after lunch… Probably should update the ‘X Days Since Last Internet-Breaking Server Vulnerability’ for Dirty Pipe coming after log4j too.
  - Mark_Friedenbach 15 Mar 2022 23:48 UTC
    21 points
    −1
    Parent
    I’m a crypto researcher at $dayjob, and I work with zero knowledge proofs daily. Practical zk-proofs are implemented as arithmetic circuits, which allow efficient proofs about adding, subtracting, multiplying, and comparing integers, typically approximately 256-bits in length. Obviously any integer math is trivial to prove, and so can be fixed-precision or rational numbers. But general floating point types can’t be efficiently encoded as operations on integer values with this precision. So you’d have to either (1) restrict yourself to fixed precision numbers (which also avoids all the famous problems with floating point math exploited in the story), or (2) use the equivalent of software-defined floating point on top of arithmetic circuits, which causes proof sizes and computation time to scale inversely proportional to how much slower software floating point is compared with hardware (which is a lot). No exaggeration, if your zk-proof took about a second to compute and is tens of kilobytes in size—typical for real systems used—then a floating-point math version might take minutes or hours to compute and be megabytes in size. Totally impractical, so no, no one does this.
    (If you want a crypto exploit that allows for arbitrary inflation, I would have used a co-factor vulnerability like the one that Monero was hit with back in 2017, or a weakness in the inner-product argument of bulletproofs used in mimblewimble, or a weakness in the pairing curve used for zcash proofs, etc. Not floating-point.)
    I’ll take your word on Engrish. I’ve never used that word online so I don’t know what the custom is here. Just speaking as someone who has spent significant time in Taiwan and Japan, I’ve only seen that word used among expats in Japan. The construction of the word is particularly specific to Japanese, which does not distinguish between the the l and r phonemes. Mandarin does however make that distinction. Chinese speakers have many issues with English, to be sure, but this isn’t one of them. I can see how the word could have taken broader meaning outside of the context in which it was coined, however.
    The 5 orders of magnitude number comes from a rule of thumb for the general speedup you can get for reducing complex but highly parallel computation to ASIC implementation using state of the art process nodes. It is, for example, the rough speedup you get from moving from GPU to ASIC for bitcoin mining, and I believe for hardware raytracing it is the same. Neural nets are outside my area of expertise, but from afar I understand them to be a similar “embarrassingly parallel” application where such speedups can occur. I’m open to being shown wrong here. However that multiplier also shows up independently in latency numbers: HPC switching (e.g. Infiniband) can be sub-100ns, but inter-cloud latency is in the 10s of ms. That’s a factor of 100,000x. I felt I was being generous in assuming that one of these effects will bottleneck, but it is also possible there’d be a larger combined slow down.
    None of those points are central to the question of whether a hard take-off is possible, however. But they are essential to a heuristic I use to evaluate whether someone’s claims are credible: if you wander outside of your area of expertise and into mine, I assume you at least consulted an expert to review and fact check the basic stuff. If you didn’t, why should I trust anything you say about other domains, like neural net architectures? Your story hinges on there being a sort of phase transition which causes a step function in the performance and general intelligence of Clippy. You’ve got links to papers whose abstracts seem to back that claim up. But you also similarly hand-waved with citations about floating point and zero knowledge proofs. How do I know your assertions about AI are more credible?
    I guess I’m a bit crusty on this because I feel Eliezer’s That Al Message really did damage by priming people with the wrong intuitions about the relative speed advantages of near-term AI, even presuming a hardware overhang. This story feels like the same sort of thing, and I fear people will accept it as a persuasive argument. Regardless of whether they should.
    - gwern 16 Mar 2022 0:21 UTC
      31 points
      1
      Parent
      Your floating point counterargument is irrelevant. Yes, it would be a bad idea. You already said that. You did not address any of my points about bad ideas being really really common in crypto (is mixing in some floating point really worse than, say, using ternary for everything binary? That is a realworld crypto which already exists. And while I’m at it, the FP inefficiency might be a reason to use FP—remember how the Bytecoin and other scams worked by obfuscating code and blockchain), nor did you offer any particular reason to think that this specific bad idea would be almost impossible. People switch between floating and integer all the time. Compilers do all sorts of optimizations or fallbacks which break basic security properties. There are countless ways to screw up crypto; secure systems can be composed in insecure ways; and so on.
      
      You’ll “take my word on Engrish”? You don’t need to, I provided WP and multiple dictionaries. There is nothing hard about “and other Asian languages” or movie examples about going to a Chinese food restaurant and making fun of it. If you don’t know anything about the use of ‘Engrish’ and don’t bother to check a single source even when they are served to you on a silver platter, why on earth are you going around talking about how it discredits me? This is bullshit man. “Spotchecking” doesn’t work if you’re not checking, and using your expertise to check for Gell-Man amnesia doesn’t work if you don’t have expertise. That you don’t even care that you were so trivially wrong bothers me more than you being wrong.
      
      No response to the unprofessional criticism, I see. How unprofessional.
      
      Neural nets are outside my area of expertise
      
      Pity this story is about neural nets, then. In any case, I still don’t see where you are getting 10,000x from or how ASICs are relevant, or how any of this addresses the existing and possible techniques for running NNs across many nodes. Yes, we have specialized ASICs for NN stuff which work better than CPUs They are great. We call them “TPUs” and “GPUs” (you may have heard of them), and there’s plenty of discussion about how the usual CPU->ASIC speedup has already been exhausted (as Nvidia likes to point out, the control flow part you are removing to get those speedups for examples like video codecs is already a small part of the NN workload, and you pay a big price in flexibility if you try to get rid of what’s left—as specialized AI chip companies keep finding out the hard way when no one can use their chips). I mean, just think critically for a moment: if the speedup from specialized hardware vs more broadly accessible hardware really was >>10,000x, if my normal Nvidia GPU was ¹⁄₁₀,000th the power of a comparable commercial chip, how or why is anyone training anything on regular Nvidia GPUs? With ratios like that, you could run your home GPUs for years and not get as much done as on a cloud instance in an hour or two. Obviously, that’s not the case. And, even granting this, it still has little to do with how much slower a big NN is going to run with Internet interconnects between GPUs instead of on GPU/TPU clusters.
      - Mark_Friedenbach 17 Mar 2022 4:47 UTC
        10 points
        0
        Parent
        Gwern, you seem to be incapable of taking constructive criticism, and worse you’ve demonstrated an alarming disregard for the safety of others in your willingness to doxx someone merely to score a rhetorical point. Thankfully in this case no harm was done, but you couldn’t have known that and it wasn’t your call to make.
        I will not be engaging with you again. I wish you the best.
  - dxu 15 Mar 2022 18:59 UTC
    4 points
    1
    Parent
    (and to you, specifically, Mark*)
    * Don’t worry. I’m sure LW2 is the only software from here on out which will have silly security bugs.
    ...Okay, I admit to some curiosity as to how you pulled that one off, though not enough curiosity to go poking around myself in the codebase. Is this one of those things where an explanation (public or private) can be given, at least after the vulnerability is patched (if not before)?
    - gwern 15 Mar 2022 19:14 UTC
      10 points
      1
      Parent
      This is a case where, much like Eliezer declining to explain specifically how he won any of his AI boxing matches, I think it’s better to leave a question mark, since it’s a relatively harmless one (see also fighting the hypothetical): “If I were writing LW2, I would simply not write the lines with bugs in them.”
      - Mark_Friedenbach 15 Mar 2022 23:51 UTC
        19 points
        1
        Parent
        De-anonymizing people who have chosen to post anonymously on purpose isn’t harmless.
        Then again, posting from a deactivated account as a hack for anonymity isn’t exactly officially supported either.
        gwern 16 Mar 2022 0:06 UTC
        −17 points
        2
        Parent
        I didn’t deanonymize anyone. There are many Marks on LW (what with it being one of the most common male personal names in the West). The people on EA Forum who have been posting about it using the full username, they’ve deanonymized Mark. You should go complain to them if you think there is harm in it. And I am only the messenger about the deanonymization: anyone who uses GreaterWrong & related mirrors has already been deanonymizing all of the anonymous users every time they load a page for years now. A bit late to be worried. (Mark is currently arguing to keep this, saying “it works well enough”, which is uh.)
        dxu 16 Mar 2022 1:36 UTC
        26 points
        1
        Parent
        FWIW, I did actually manage to guess which Mark it was based on the content of the initial comment, because there aren’t that many persistent commenters named Mark on LW, and only one I could think of who would post that particular initial comment. So claiming not to have deanonymized him at all does seem to be overstating your case a little, especially given some of your previous musings on anonymity. (“The lady doth protest too much, methinks” and all that.)
        I do, however, echo the sentiment you expressed on the EA Forum (that anonymous commenting on LW seems not worth it on the margin, both because the benefits themselves seem questionable, and because it sounds like a proper implementation would take a lot of developer effort that could be better used elsewhere).
      - jimrandomh 2 Apr 2022 19:42 UTC
        6 points
        3
        Parent
        LW2 developer here. I consider it a bug that it’s possible to continue ot comment through a deactivated account. I don’t consider it a bug that comments made through a deactivated account can be associated with the account name, since (in the normal case where an account never posts again after it’s been deactivated) the same information is also easily retrieved from archive.org/.is/etc. (And I can think of a dozen easy ways to do it, some of which would be a pain to close off.)
        (The officially supported mechanism for anonymous posting is to just make a new single-use account, and don’t attach a real email address to it. We do not enforce accounts having working emails, though new accounts will show up in moderator UI when they first post.)