Thanks. I think I get it now. (At least one of) my confusions was mixing up a “transformer run” with the “number of FLOPs”.
And I get the point about cost; that’s what I meant, but I articulated it poorly.
Heh, I actually think it’s answered here.
Got it, thanks!
But to process the 1001st input token, you also need to load all 1000 previous tokens into memory, forming the cache (it does happen in one step, though). And for each new output token, you surely don’t dump the whole existing KV cache after each generation, only to load it again to append the extra KV vectors for the last generated token. So isn’t the extra work for output tokens just that the KV cache is accessed, generated, and expanded one token at a time, and that’s where the “more work” comes from?
Is there any reason why this would imply an output:input price ratio that is commonly something like 3:1?
Thanks for the answer, I appreciate it!
Intuitively, it seems that output tokens should be more expensive. The autoregressive model has to run once for each output token, and as these runs progress, output tokens gradually become a part of the input (so the last token is generated with context being all input and almost all output).
I agree with the intuition, but I think that’s where I am confused. Thanks to the KV cache we do not run the new input sequence (previous sequence + last generated token) through the encoders (as we do for the input sequence during prefill). It’s all cached (from prefill + from the last token generation for that sequence+token). So… I don’t know—it doesn’t feel like the output tokens are more expensive in this case (you run “once”, the same way as you run “once” for every input token)?
I think they do amortize their costs among all uses. A number of runs (number of output tokens) multiplied by a (varying) cost of each run is unlikely to be close to linear.
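A toy way to count this, as a sketch under simplifying assumptions: a KV cache is used, each decode step reads every cached entry exactly once, and the function names `decode_kv_reads` and `closed_form` are made up for illustration. Each output token still needs its own pass, and each pass attends over a slightly longer context, so the total grows faster than linearly in the number of output tokens.

```python
def decode_kv_reads(n_input: int, n_output: int) -> int:
    """Total cached KV entries read over all decode steps (toy model).

    Assumption: step t (0-based) attends over the n_input prompt tokens
    plus the t output tokens generated so far.
    """
    return sum(n_input + t for t in range(n_output))


def closed_form(n_input: int, n_output: int) -> int:
    # Same quantity: a term linear in n_output plus a quadratic term.
    return n_output * n_input + n_output * (n_output - 1) // 2


if __name__ == "__main__":
    # Illustrative prompt length of 1000 tokens (the number used in the thread).
    for m in (10, 100, 1000):
        print(m, decode_kv_reads(1000, m))
```

This only counts cache reads per step; it says nothing about how providers actually set prices.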
Do you mind saying more about this? I am not sure what you mean. I.e., do some pay more and some pay less (e.g. heavy hitters pay less while small prompters pay comparatively more per token)?
Even though some commenters mentioned some issues with the article, I really appreciate the attempt and the author being upfront with the estimates. It’s very relevant to the thing I am now trying to figure out. As I have almost no intuitions about this beyond some raw FLOPS, it pointed to important flaws my analysis would have had. There are not many public sources that explain this [that are not a book or don’t require me to read one-to-many of them to understand it].
Yes, but to defend (hehe) OP, he seems to be fully aware of that and addresses that explicitly in the linked article (which is also excellent, like this one):
In part because of those aforementioned stats on the frequency of guilty pleas, public defenders have garnered a reputation for being trial-averse, for pressuring clients to cop a plea just to keep the machine humming along. I think this reputation is ill-deserved. It’s completely counter to my own experience, at least, as few things are talked about with as much awed respect among one’s public-defender peers as the number of trials you have accumulated. It’s the functional equivalent of an attorney’s XP level.
Thanks for the feedback and the encouragement, I will incorporate these.
Btw, the redundancy in questions 2-4 is intentional.
(slightly tangential) I think people are making a terrible bucket error with competency: they overestimate how competent others are across all dimensions. I.e. it’s often enough for a person to show great competency in providing vision, and we assume that person must also be great at leadership or management, and people are shocked when that’s not the case. Other examples:
scientists are great at research, therefore they must be great teachers or college managers
doctors are great at diagnosing, therefore they must be great surgeons
programmers are great at programming, therefore they must be great at leading the rest of the team or mentoring …
Hey. I decided on a private school with more of a “democratic approach”. I unfortunately wasn’t able to find suitable tutors etc.
I am also trying to process what ChatGPT-like platforms will do to the landscape. E.g. my partner codes almost exclusively with ChatGPT and it’s outstanding. Kids are gonna follow IMHO.
Thanks Duncan, I really appreciate you posting this, even though you are unsure about how exactly it all fits together. I am still glad to read it in this version, likely because you are quite clear about it, and not “leaving it as an exercise for the reader” to figure out where things do fit together and where they don’t (or worse, trying to make it more profound).
All of this might be stating the obvious to some of you, but I am trying to clarify my thoughts, and maybe some people will find it useful or correct me. At least part of this relates to the aphorism (?), which I endorse, that “everyone is a mess”, or less controversially, “everyone is struggling”. Something something hedonic treadmill / adaptation: people will somehow struggle the same regardless of the bubble size, then adapt to what they were used to before by overcoming the main challenges (and learning how to deal with them), and then also “reprioritize” costs. I would definitely self-report that happening during my life. I do think this realization is important for the way I relate to others, including e.g. my daughter: the things she struggles with might seem trivial to me, but they are not trivial to her. On the contrary, they probably feel to her about the same magnitude as my “bigger” problems feel to me. Same for everyone, everywhere.
It’s almost like I had some capacity for how much stuff I can deal with, and I always fill this capacity with the things around me (my bubble?). Something like doing busy work if you don’t choose what to let into your to-do list. Or something closer to “I can always do 10 things, and their size doesn’t matter” (bear with me, I know this is not true, and I do indeed sometimes work on a single project because it’s eating all of my capacity). If I let “bigger things” in, I will be dealing with bigger things, while also leaving some stuff behind or letting it “go wrong” in a way I wouldn’t allow in the N-1 bubble (something like not dealing with every single fuckup in my work, starting to take Uber instead of public transport, etc.).
It doesn’t hold entirely; I have clearly seen people who just couldn’t deal with e.g. a promotion (because it was too much for them at the given time), or, similarly, people who said they wanted to do more / be more ambitious but couldn’t for some objective reasons (like a physical disease).
Dunno.
Oh, I really enjoyed reading this, this is such an LW-rationality-curiosity-boggle-at-things post. Thanks!
Thanks, that’s helpful.
Also, kudos to Lily for knowing active listening and being awesome.
I am curious how you introduced money to your kids. Do you have some “framework” for that? I did a little research and didn’t end up with any really novel ideas (I am happy to share my findings and conclusions, but it’s a fairly small page in Roam).
Basically, what I want to do with my daughter:
give her an opportunity to earn money by doing age-appropriate jobs (helping me clean the kitchen), but not for things that she’s expected to do for “reasonable reasons”, such as tidying up her toys in common areas; for those, rather a lower allowance (but some)
make it clear that she can make more by doing more elaborate stuff (for which she might have to learn a few things first—happy to help), encourage her to come up with new ideas by herself (entrepreneurship!)
introduce a saving mechanism, e.g. paying her interest on money she puts into my pocket
she can make more by betting and bonds as per your posts
she can do what she wants with her money (as long as it’s safe, but I can’t really think of anything I would consider “forbidden”), let her make mistakes
I checked the Marginal Revolution repost and saw a few things that could go wrong...
Any other ideas?
This is an excellent post. I have been doing bets with my 4-year-old daughter already as well (and I have been following your projects for quite some time already)!
Yeah, that’s useful. Agree on the assessment. I want to give it a shot with one of those Bridgelux Vesta Thrive things; it sounds like a good hobby project I would like to try. If that happens, I will write a post about it here.
By the way, I asked about this setup on Reddit. They also recommend some custom COBs, which seem to be the most powerful solution, but aren’t as practical as strips.
These look very promising, and they ship to Europe too. Extra high-CRI, very powerful (up to 2600 lumens/m) and even dimmable? Wow. A bit unfortunate that they are about 5x as expensive as other high-CRI high-power LED strips.
I am glad there are more posts on this. Are there any reasons for not considering LED strips at all? When installed properly with a “milk” diffuser and as indirect lighting, they are IMO quite nice and effective. They seem to be powerful enough (20 W gives about 2k lumens/m), can also be found in high-CRI variants, are less expensive, come in various CCTs, are dimmable, etc. I am considering using them in a new house. Basically, multiple parallel LED strips (with different CCTs, like 3000K, 4500K, 6400K) diffused against a wall/ceiling, controlled via smart relays and incorporated into home automation (so physical switches turn on e.g. just the 3000K one after ~9pm).
I need to validate that in some studio, but my hope is that e.g. 50 000 lumens from 6x4 meters of high-CRI LED strips of 3 different CCTs diffused over a wall would provide a sufficient daylight feeling while not taking up any space or casting many shadows, looking nice, and being OK to look at. If I could make it dimmable, it would be great, but there seem to be tradeoffs (dimmable ones are apparently not that good on CRI).
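As a quick sanity check of the 50 000 lumen figure, here is a back-of-the-envelope sketch. It assumes “6x4 meters” means 6 runs of 4 m each and uses the rough 20 W/m and 2k lumens/m numbers mentioned in this thread; `strip_budget` is a made-up helper name, and diffuser and reflection losses are ignored.

```python
def strip_budget(total_length_m: float, lumens_per_m: float, watts_per_m: float):
    """Return (total lumens, total watts) for a run of LED strip.

    Ignores losses in the diffuser and from bouncing light off a wall,
    so the usable output would be lower in practice.
    """
    return total_length_m * lumens_per_m, total_length_m * watts_per_m


if __name__ == "__main__":
    length_m = 6 * 4  # assumption: 6 runs of 4 m each
    lumens, watts = strip_budget(length_m, lumens_per_m=2000, watts_per_m=20)
    print(f"{length_m} m -> {lumens} lm at {watts} W")
```

Under these assumptions, 24 m of strip comes out at 48 000 lumens for 480 W, so the 50 000 figure is in the right ballpark before losses.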
Btw, there are phone apps that use the light sensor and seem to follow basic physics (like showing a 4x smaller number when 2x farther from the source).
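The “basic physics” here is the inverse-square law. A minimal sketch, assuming an ideal point source (`relative_reading` is a hypothetical helper, not something from any of the apps):

```python
def relative_reading(distance_ratio: float) -> float:
    """Reading at the new distance as a fraction of the original reading.

    Assumes an ideal point source, where illuminance falls off with the
    square of the distance.
    """
    return 1.0 / distance_ratio ** 2


if __name__ == "__main__":
    for ratio in (1, 2, 3):
        print(f"{ratio}x farther -> {relative_reading(ratio):.3f} of the reading")
```

A large diffused wall is not a point source, so close to it the falloff should be gentler than this.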
I just recently made my new ugly but sufficient lumenator (a combination of 6400K and 4500K CRI95+ 1500 lumen bulbs, and I am still adding more) and also managed to integrate a WiFi relay with smart bulbs into my home automation via Home Assistant, with a fallback to a manual switch. I also tried Home Assistant’s “flux” component, which sets the colour and brightness based on the sun position, but dropped that idea in favour of being less sun-dependent. I want to have a lot of light until e.g. 8 pm regardless of the light outside, and only then start dimming to red. A simple variant is plain automation with templating, hard but dynamic.
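The schedule I have in mind looks roughly like the sketch below. This is plain Python for illustration, not Home Assistant’s API or template syntax; only the 8 pm start comes from what I described, while the end of the ramp, the brightness floor, and the morning start are assumed values. The colour shift to red is left out.

```python
from datetime import time

DAY_START = time(6, 0)    # assumed: lights go to full brightness in the morning
DIM_START = time(20, 0)   # "8 pm": full brightness until here
DIM_END = time(22, 0)     # assumed end of the dimming ramp
MIN_BRIGHTNESS = 10       # assumed floor, in percent


def _minutes(t: time) -> int:
    """Minutes since midnight."""
    return t.hour * 60 + t.minute


def brightness_pct(now: time) -> int:
    """Full brightness during the day, then a linear ramp down after DIM_START."""
    if now < DAY_START or now >= DIM_END:
        return MIN_BRIGHTNESS
    if now < DIM_START:
        return 100
    span = _minutes(DIM_END) - _minutes(DIM_START)
    progress = (_minutes(now) - _minutes(DIM_START)) / span
    return round(100 - progress * (100 - MIN_BRIGHTNESS))


if __name__ == "__main__":
    for t in (time(12, 0), time(20, 0), time(21, 0), time(23, 30)):
        print(t, brightness_pct(t))
```

The point is that the brightness depends only on the clock, not on the sun position.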
I am so glad this question is here, as it’s very relevant to my post a few weeks back about Effective Children Education.
By the way, I recommend following Duncan Sabien (referenced in the post below) on Facebook; he has good posts about children’s education, e.g. his speech for sixth-graders (referenced by someone else here, but she picked the good parts).
As mentioned below, Julia Galef also sometimes mentions something related, but I haven’t found much.
Interesting, thanks!