Thanks. I think I get it now. (At least one source of) my confusion was conflating a “transformer run” with a “number of FLOPs”.
And I get the thing about cost, that’s what I meant but I articulated it poorly.
Heh, I actually think it’s answered here.
Got it, thanks!
But to process the 1001st input token, you also need to load all 1000 preceding tokens into memory, forming the cache (though that happens in one step). And for each new output token, you surely don’t dump the whole existing KV cache after each generation, only to load it again to append the extra KV vectors for the last generated token. So isn’t the extra work for output tokens just that the KV cache is accessed, generated, and expanded one token at a time, and that’s where the “more work” comes from?
Is there any reason why this would imply the ratio of pricing of output:input tokens being commonly something like 3:1?
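A minimal sketch of the pass structure described above, to make the question concrete. It rests on assumptions that are mine and not from the article: prefill handles the whole prompt in a single forward pass and also yields the first output token, every later pass processes exactly one token and appends one entry to the cache, and `generation_trace` is a hypothetical helper name, not any library's API.

```python
from __future__ import annotations


def generation_trace(n_prompt: int, n_output: int) -> list[tuple[str, int, int]]:
    """List the forward passes as (phase, tokens_processed, cache_len_after)."""
    if n_prompt < 1 or n_output < 1:
        raise ValueError("need at least one prompt token and one output token")
    # Prefill: all prompt tokens go through the model in a single pass,
    # filling the cache and producing the first output token (assumption).
    trace = [("prefill", n_prompt, n_prompt)]
    cache_len = n_prompt
    # Decode: each remaining output token needs its own pass. The pass feeds in
    # the previously generated token and appends its K/V entry to the cache.
    for _ in range(n_output - 1):
        cache_len += 1
        trace.append(("decode", 1, cache_len))
    return trace


if __name__ == "__main__":
    for phase, tokens, cache_len in generation_trace(n_prompt=1000, n_output=4):
        print(f"{phase:8s} tokens_in_pass={tokens:5d} cache_len_after={cache_len}")
```

Under these assumptions the cache is never rebuilt, only extended. What differs between the two phases is that the prompt is handled in one pass while the output needs one pass per token.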
Thanks for the answer, I appreciate it!
Intuitively, it seems that output tokens should be more expensive. The autoregressive model has to run once for each output token, and as these runs progress, output tokens gradually become a part of the input (so the last token is generated with context being all input and almost all output).
I agree with the intuition, but I think that’s where I am confused. Thanks to the KV cache, we do not run the new input sequence (previous sequence + last generated token) through the transformer layers again (as we do for the input sequence during prefill). It’s all cached (from prefill, plus from each earlier token generation for that sequence). So… I don’t know, it doesn’t feel like the output tokens are more expensive in this case (you run “once” for every output token, the same way as you run “once” for every input token)?
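A rough back-of-the-envelope sketch of the “run once per token” comparison. Everything here is a simplifying assumption, not a measurement: roughly 2 FLOPs per parameter per token, attention and KV cache traffic ignored, a single request with no batching across users, 2 bytes per parameter, and hypothetical names (`PhaseCost`, `prefill_cost`, `decode_cost`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseCost:
    flops: int              # matrix-multiply FLOPs only, attention ignored
    weight_bytes_read: int  # bytes of model weights streamed from memory


def prefill_cost(n_params: int, n_tokens: int, bytes_per_param: int = 2) -> PhaseCost:
    # All prompt tokens share one forward pass, so the weights are read once.
    return PhaseCost(2 * n_params * n_tokens, n_params * bytes_per_param)


def decode_cost(n_params: int, n_tokens: int, bytes_per_param: int = 2) -> PhaseCost:
    # One forward pass per generated token, so the weights are read once per token.
    return PhaseCost(2 * n_params * n_tokens, n_params * bytes_per_param * n_tokens)


if __name__ == "__main__":
    n_params, n_tokens = 7_000_000_000, 1000  # illustrative sizes only
    pre = prefill_cost(n_params, n_tokens)
    dec = decode_cost(n_params, n_tokens)
    print("FLOPs equal:", pre.flops == dec.flops)
    print("weight-read ratio (decode / prefill):",
          dec.weight_bytes_read // pre.weight_bytes_read)
```

In this toy model the FLOPs per token come out identical, which matches the intuition above, while the weight reads differ by a factor equal to the number of tokens. Batching many requests together would shrink that gap, so this says nothing definite about any particular price ratio.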
I think they do amortize their costs among all uses. A number of runs (number of output tokens) multiplied by a (varying) cost of each run is unlikely to be close to linear.
Do you mind saying more about this? I am not sure what you mean. Do you mean that some pay more and some pay less (e.g. heavy hitters pay less, while small prompters pay comparatively more per token)?
Even though some commenters mentioned some issues with the article, I really appreciate the attempt and the author being upfront about the estimates. It’s very relevant to the thing I am now trying to figure out. As I have almost no intuitions about this beyond some raw FLOPs, it pointed to important flaws my analysis would have had. There are not many public sources that explain this [that are not a book and don’t require me to read one or more books to understand them].
Yes, but to defend (hehe) OP, he seems to be fully aware of that and addresses that explicitly in the linked article (which is also excellent, like this one):
In part because of those aforementioned stats on the frequency of guilty pleas, public defenders have garnered a reputation for being trial-averse, for pressuring clients to cop a plea just to keep the machine humming along. I think this reputation is ill-deserved. It’s completely counter to my own experience, at least, as few things are talked about with as much awed respect among one’s public-defender peers as the number of trials you have accumulated. It’s the functional equivalent of an attorney’s XP level.
Thanks for the feedback and the encouragement, I will incorporate these.
Btw. for questions 2-4 there is an intentional redundancy.
(slightly tangential) I think people are making a terrible bucket error with competency: they overestimate how competent others are across all dimensions. I.e. it’s often enough for a person to show great competency in providing vision, and we assume that person must also be great at leadership or management, and people are shocked when that’s not the case. Other examples:
scientists are great at research, therefore they must be great teachers or great at college management
doctors are great at diagnosing, therefore they must be great surgeons
programmers are great at programming, therefore they must be great at leading the rest of the team or mentoring …
Hey. I decided on a private school with more of a “democratic approach”. I unfortunately wasn’t able to find suitable tutors etc.
I am also trying to process what ChatGPT-like platforms will do to the landscape. E.g. my partner is coding almost exclusively with ChatGPT and it’s outstanding. Kids are going to follow, IMHO.
Thanks Duncan, I really appreciate you posting this, even though you are unsure about how exactly it all fits together. I am still glad to read it in this version, likely because you are quite clear about it, and not “leaving it as an exercise for the reader” to figure out where things do fit together and where they don’t (or worse, trying to make it more profound).
All of this might be stating the obvious to some of you, but I am trying to clarify my thoughts, and maybe some people will find it useful or correct me. At least part of this relates to an aphorism (?) I endorse, “everyone is a mess”, or less controversially, “everyone is struggling”. Something something hedonic treadmill / adaptation: people will somehow struggle the same regardless of the bubble size, then adapt to what they were used to before by overcoming the main challenges (and learning how to deal with them), and then also “reprioritize” costs. I would definitely self-report that happening during my life. I do think this realization is important for how I relate to others, including e.g. my daughter. The things she struggles with might seem trivial to me, but they are not trivial to her. On the contrary, they probably feel to her about the same magnitude as my “bigger” problems feel to me. Same for everyone, everywhere.
It’s almost as if I had a fixed capacity for how much stuff I can deal with, and I always fill this capacity with the things around me (my bubble?). Something like doing busy work if you don’t choose what to let into your to-do list. Or something closer to “I can always do 10 things, and their size doesn’t matter” (bear with me, I know this is not true, and I do indeed sometimes work on a single project because it’s eating all of my capacity). If I let “bigger things” in, I will be dealing with bigger things, while also leaving some stuff behind or letting it “go wrong” in a way I wouldn’t allow in the N-1 bubble (something like not dealing with every single fuckup in my work, starting to take Uber instead of public transport, etc.).
It doesn’t hold entirely, I have clearly seen people who just couldn’t deal with e.g. a promotion (because it was too much for them at the given time), or, similarly, people who said they want to do more / be more ambitious but can’t for some objective reasons (like a physical disease).
Dunno.
Oh, I really enjoyed reading this, this is such an LW-rationality-curiosity-boggle-at-things post. Thanks!
Thanks, that’s helpful.
Also, kudos to Lily for knowing active listening and being awesome.
I am curious how you introduced money to your kids. Do you have some “framework” for that? I did a bit of research and didn’t end up with any really novel ideas (I am happy to share my findings and conclusions, but it’s a fairly small page in Roam).
Basically, what I want to do with my daughter:
give her an opportunity to earn money by doing age-appropriate jobs (helping me clean the kitchen), but not for things she’s expected to do for “reasonable reasons”, such as tidying up her toys in common areas; alongside that, a rather low allowance (but some)
make it clear that she can earn more by doing more elaborate stuff (for which she might have to learn a few things first, and I am happy to help), and encourage her to come up with new ideas by herself (entrepreneurship!)
introduce a saving mechanism, e.g. paying her interest on money she deposits with me
she can make more through betting and bonds, as per your posts
she can do what she wants with her money (as long as it’s safe, though I can’t really think of anything I would consider “forbidden”), and I let her make mistakes
I checked the marginal revolution repost and saw a few things that could go wrong...
Any other ideas?
This is an excellent post. I have already been making bets with my 4-year-old daughter as well (and I have been following your projects for quite some time)!
Interesting, thanks!