yeah definitely, there could be a possibility for quoting/linking answers from other branches. i haven’t seen any UI that supports something like it, but my guess is that it wouldn’t be too difficult to make one. my thinking was that there would be one main branch and several smaller branches that connect to it, so that some points can be discussed in greater depth. also, the branching probably shouldn’t happen all the time, only occasionally, when both participants agree to it.
Amal
It seems to me that these types of conversations would benefit from being trees rather than chains. When two people have a disagreement or a different point of view, there is usually some root cause behind it. When the conversation is a chain, I think it likely results in one person making several points, the other having to respond to each, and then at some point, to keep the comments from becoming massively long, the participants have to paraphrase, summarise or ignore some of the arguments. If there was an option to split the conversation into several branches at some point, it could make the comments shorter and easier to read, and let them go deeper. Also, a reader not participating in the conversation could follow it more easily and get to the main point influencing their view.
I’m not sure if something like this has been done before, and it would obviously require a lot more work on the UI, but I just wanted to share the idea as it might be worth considering.
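To make the idea a bit more concrete, here is a minimal sketch of how such a conversation could be stored as a tree. Everything in it (the `Comment` class, `main_branch`, `side_branches`, and the convention that the first reply continues the main branch) is a hypothetical illustration, not a description of any existing UI.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    """One comment in a tree-shaped conversation (hypothetical structure)."""
    author: str
    text: str
    replies: List["Comment"] = field(default_factory=list)

    def reply(self, author: str, text: str) -> "Comment":
        """Attach a reply to this comment and return it."""
        child = Comment(author, text)
        self.replies.append(child)
        return child


def main_branch(root: Comment) -> List[Comment]:
    """Follow the first reply at every step (assumed to be the main thread)."""
    path = [root]
    while path[-1].replies:
        path.append(path[-1].replies[0])
    return path


def side_branches(root: Comment) -> List[Comment]:
    """Collect the comments that open a side branch off the main thread."""
    branches = []
    for node in main_branch(root):
        branches.extend(node.replies[1:])
    return branches


# Example: one main thread about point 1, one side branch about point 2.
root = Comment("A", "Opening claim with two separate points.")
first = root.reply("B", "Reply to point 1.")   # continues the main branch
root.reply("B", "Reply to point 2.")           # opens a side branch
first.reply("A", "Follow-up on point 1.")
```

A chain is just the special case where every comment has at most one reply, so a reader who only wants the main point could read `main_branch(root)` and skip the rest.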
My guess is that it will be a scaled-up Gato: https://www.lesswrong.com/posts/7kBah8YQXfx6yfpuT/what-will-the-scaled-up-gato-look-like-updated-with. I think there might be some interesting features when the models are fully multimodal, e.g. being able to play games, perform simple actions on a computer etc. Based on the announcement from Google, I would expect full multimodal training: image, audio, video, text in/out. Based on DeepMind’s hiring needs, I would expect they also want it to generate audio/video and to extend the model to robotics (the brain of something similar to a Tesla Bot) in the near future. Elon claims that training just from video input/output can result in full self-driving, so I’m very curious what training on YouTube videos can achieve. If they’ve managed to make solid progress in long-term planning/reasoning and can deploy the model with sufficiently low latency, it might be quite a significant release that could simplify many office jobs.
sure, I agree that writing is a tough gig and that how much each piece gets read follows a Pareto distribution. still, all the writers contribute to the chance that the top writings, the ones read the most, get better.
I think I’m much less interested in how deeply people benefit and more in how many of them can potentially benefit, and whether this scales roughly with the effort. E.g. professions where spending X effort serves Y people, and serving 2Y people would take 2X effort (chef/teacher/hairdresser...), don’t fall into the same category as writing. Maybe a better way of thinking about it is as follows: if 1000 new people are added to the population with the usual distribution of skills/professions, what portion of the work they do would contribute to the already existing population, versus work needed just for them to sustain and cater to themselves? (Obviously net of substitutions: e.g. if someone from the “new” people cooks food for the “old” ones, but someone from the “old” has to cook for someone from the “new”, this does not count as contributing.)
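One rough way to write down the thought experiment above, purely as a sketch: the function name and the three work quantities are made up for illustration, and the numbers in the example are invented, not data.

```python
def net_contribution_share(new_for_new: float,
                           new_for_old: float,
                           old_for_new: float) -> float:
    """Share of the new group's work that is a net contribution to the old group.

    new_for_new: work the new people do for themselves (self-sustaining)
    new_for_old: work the new people do for the existing population
    old_for_new: work the existing population does for the new people
    Substitutions cancel out: work flowing back from "old" to "new"
    is subtracted before anything counts as a contribution.
    """
    total_new_work = new_for_new + new_for_old
    if total_new_work <= 0:
        raise ValueError("the new group must do some work")
    net = max(0.0, new_for_old - old_for_new)
    return net / total_new_work


# Invented numbers: 1000 units of work by the new people, 300 of them for
# the old population, while the old population does 200 units for them.
share = net_contribution_share(new_for_new=700, new_for_old=300, old_for_new=200)
```

With these made-up inputs, only 100 of the 1000 units count as a net contribution, so `share` is 0.1.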
[Question] Is there any metric measuring ~“proportion of people creating extra value”?
Some of my updates:
at least one version with several trillion parameters and a context window at least 100k tokens long (with embeddings etc., seemingly 1 million). Otherwise, I am quite surprised that I mostly still agree with my predictions regarding multimodal/RL capabilities. I think robotics could still see some latency challenges, but there would still be significant progress in tasks not requiring fast reactions, e.g. picking up things, cleaning a room, etc. Things like superagi might become practically useful, and controlling a computer with text/voice would seem easy.
I believe we can now say with a high level of confidence that the scaled-up Gato will be Google’s Gemini model, to be released in the next few months. Does anyone want to add/update their predictions?
it is fixed now, thanks!
[Linkpost] Scaling Laws for Generative Mixed-Modal Language Models
it could be sparse... a 175B-parameter GPT-4 with 90 percent sparsity could essentially be equivalent to a 1.75T-parameter GPT-3. Also, I am not exactly sure, but my guess is that the scaling laws change if it is multimodal (essentially you get more varied data, instead of always training on text prediction, where the data is repetitive and likely only a small percentage contains new useful information to learn).
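The arithmetic behind the 175B vs 1.75T figures, as a small sketch. It assumes that "175B" counts only the nonzero weights and that a sparse model is compared to the dense model it was pruned from; whether the two are actually equivalent in capability is a guess, not something this calculation shows. The function name is made up for illustration.

```python
def dense_parameter_count(nonzero_params: int, sparsity_percent: int) -> int:
    """Size of the dense model whose pruned version keeps `nonzero_params` weights.

    sparsity_percent is the percentage of weights that are zero (0 to 99).
    Integer arithmetic is used to avoid floating-point rounding.
    """
    if not 0 <= sparsity_percent < 100:
        raise ValueError("sparsity_percent must be in [0, 100)")
    density_percent = 100 - sparsity_percent
    return nonzero_params * 100 // density_percent


# 175B nonzero weights at 90 percent sparsity -> 1.75T weights in the dense model.
dense = dense_parameter_count(175_000_000_000, 90)
```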
[Question] Is ChatGPT TAI?
Stupid beginner question: I noticed that, while interesting, many of the posts here are very long and try to go deep into the topic explored, often without a TL;DR. I’m just curious: how do the writers/readers find time for it? Are they paid? If someone lazy like me wants to participate, is there a more Twitter-like version of LessWrong?
my understanding is that they fully separate computation and memory storage. So while traditional architectures need some kind of cache to store a large amount of data for model partitions, of which only a small portion is used for computation at any single point in time, the CS-2 only requests what it needs, so the bandwidth doesn’t need to be as big.
I am certainly not an expert, but I am still not sure about your claim that it’s only good for running small models. The main advantage they claim to have is “storing all model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off chip memory. weight streaming enables the training of models two orders of magnitude larger than the current state-of-the-art, with a simple scaling model.” (https://www.cerebras.net/product-cluster/ , weight streaming). So they explicitly claim that it should perform well with large models.
Furthermore, in their white paper (https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20Booth%20Docs/CS%20Weight%20Streaming%20White%20Paper%20111521.pdf), they claim that the CS-2 architecture is much better suited for sparse models (e.g. by the Lottery Ticket Hypothesis), and on page 16 they show that a sparse GPT-3 could be trained in 2-5 days.
This would also align with tweets by OpenAI that trillion is the new billion, and with rumors that the new GPT-4 will be as big a jump as GPT-2 → GPT-3 was, with a colossal number of parameters and a sparse paradigm (https://thealgorithmicbridge.substack.com/p/gpt-4-rumors-from-silicon-valley). I could imagine that sparse parameters deliver much stronger results than normal parameters, and this might change scaling laws a bit.
[Question] Is the speed of training large models going to increase significantly in the near future due to Cerebras Andromeda?
oh and besides IQ tests, i predict it would also be able to pass most current CAPTCHA-like tests (though humans would still be better at some)
What are your reasons for AGI being so far away?
Nah... I still believe that the future AGI will invent a time machine and then invent itself before 2022
Why do you think TAI is decades away?
sure, I’m actually not suggesting that it should necessarily be a feature of dialogues on LW; it was just a suggestion for a different format (my comment got almost opposite karma and agreement votes, so maybe this is the reason?). it also depends on how often you use the branching. my guess is that most conversations don’t require it at every point, but using it a few times over the whole conversation might be useful.