http://samuelshadrach.com/?file=/raw/english/about_me_summary.md
samuelshadrach
[Question] Should I fundraise for open source search engine?
I’m unsure what the theory of change associated with your LW post is. If you have a theory of change associated with it that also makes sense to me, my guess is you’d focus a lot more on cultural attitudes and incentives, and a lot less on legality or technical definitions.
The process for getting a certain desirable future is imo likely not going to be that you create the law first and everyone complies with it later when the tech is deployed.
It’ll look more like the biotech companies deploy the tech in a certain way, then a bunch of citizens get used to using it a certain way (and don’t have lots of complaints), and then a certain form of usage gets normalised, and only after that you can make a law codifying what is allowed and not allowed.
Until society has consensus agreement on certain ways of doing things, and experience doing them in practice (not theory), I don’t think it’ll be politically viable to pass a legal ban (that doesn’t, say, get overturned soon after).
The other way around is possible: it is possible to make a law saying something is disallowed before it has actually been done. Historically, societies have often been bad at this sort of thing. (Often you need a mishap to happen before a law banning something is politically viable.) But cultures in general are averse to change, so that alone can be enough to make a ban politically viable.
This makes more sense for a blanket ban, though; it makes less sense for a targeted ban on certain types of interventions. Culture does not already encode the stuff in your post as of 2025; what you’re proposing is novel.
Forum devs including lesswrong devs can consider implementing an “ACK” button on any comment, indicating I’ve read a comment. This is distinct from
a) Not replying—other person doesn’t know if I’ve read their comment or not
b) Replying something trivial like “okay thanks”—other person gets a notification though I have nothing of value to say
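As a hedged sketch of what such a feature might look like server-side (the `AckStore` name and shape here are hypothetical, not any forum’s actual API; a real implementation would sit on the forum’s database and auth layer):

```python
# Minimal sketch of an "ACK" feature for a forum backend.
# AckStore and its methods are hypothetical illustrations.

class AckStore:
    def __init__(self):
        # set of (user_id, comment_id) pairs
        self._acks = set()

    def ack(self, user_id: int, comment_id: int) -> None:
        """Record that user_id has read comment_id.
        Idempotent, and deliberately sends no notification."""
        self._acks.add((user_id, comment_id))

    def has_acked(self, user_id: int, comment_id: int) -> bool:
        """Lets the comment author check whether a reader ACKed."""
        return (user_id, comment_id) in self._acks
```

The key design point is the middle ground: the author can see the ACK, but no notification fires and no reply thread is created.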
I already maybe mentioned this in some earlier discussion so maybe it’s not worth rehashing in detail but…
I strongly feel laws are downstream of culture. Instead of thinking about which laws are best, it seems worthwhile to me to try thinking about which culture is best. The First Amendment in the US is protected by culture rather than just by laws; if the culture changed, then so would the laws. Same here with genomic liberty. Laws can be changed, and their enforcement in day-to-day life can be changed. (Every country has examples of laws that exist on the books but don’t get enforced in practice.)
(And if you do spend time thinking of what the ideal culture looks like, then I’ll have my next set of objections on why you personally can’t decide ideal culture of a civilisation either; how that gets decided is more complicated. But to have that discussion, first we will have to agree culture is important.)
I appreciate you for thinking about these topics. I just think reality is likely to look very different from what you’re currently imagining.
Got it. As of today a common setup is to let the LLM query an embedding database multiple times (or let it do Google searches, which probably has an embedding database as a significant component).
Self-learning seems like a missing piece. Once the LLM gets some content from the embedding database, performs some reasoning, and reaches a novel conclusion, there’s no way to preserve this novel conclusion long-term.
When smart humans use Google we also keep updating our own beliefs in response to our searches.
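The lookup step in such a setup, the part the LLM would invoke repeatedly, can be sketched as a cosine-similarity search over precomputed chunk embeddings. This is a toy illustration; real vectors would come from an embedding model such as text-embedding-3-small:

```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k chunks most similar to the query.
    Assumes vectors are L2-normalised, so a dot product equals
    cosine similarity."""
    sims = chunk_vecs @ query_vec
    return np.argsort(-sims)[:k]
```

In the multi-query setup, the LLM would call this several times with reformulated queries; preserving whatever it concludes from the retrieved chunks is exactly the missing self-learning step.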
P.S. I chose not to build the whole LLM + embedding search setup because I intended this tool for deep research rather than quick queries. For deep research I’m assuming it’s still better for the human researcher to go read all the original sources and spend time thinking about them. Am I right?
Cool!
Useful information that you’d still prefer using ChatGPT over this. Is that true even when you’re looking for book recommendations specifically? If so yeah that means I failed at my goal tbh. Just wanna know.
Since I’m spending my personal funds, I can’t afford to use the best embeddings on this dataset. For example, text-embedding-3-large is ~7x more expensive for generating embeddings and is slightly better quality.
The other cost is hosting, for which I don’t see major differences between the models. OpenAI gives 1536 float32 dims per 1000-char chunk, so around 6 KB of embeddings per 1 KB of plaintext. All the other models are roughly the same. I could put in some effort and quantise the embeddings; will update if I do.
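For reference, a minimal int8 scalar-quantisation sketch, one common approach rather than a committed plan, which would cut the ~6 KB per chunk down to roughly 1.5 KB:

```python
import numpy as np

def quantise_int8(emb: np.ndarray) -> tuple:
    """Scalar-quantise a float32 embedding to int8 (4x smaller).
    Returns the int8 vector and the scale needed to dequantise.
    Assumes the embedding is not all zeros."""
    scale = float(np.abs(emb).max()) / 127.0
    q = np.round(emb / scale).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 embedding."""
    return q.astype(np.float32) * scale
```

The trade-off is a small loss in retrieval quality per chunk in exchange for 4x lower RAM, which matters when the whole index has to fit in CPU memory.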
Concepts that are informed by game theory and other formal models
Strongly in favour of this.
There are people in academia doing this type of work; a lot of them are economists by training studying sociology and political science. See for example Freakonomics by Steven Levitt, or Daron Acemoglu, who recently won a Nobel Prize. Search keywords: neo-institutionalism, rational choice theory. There are a lot of political science papers on rational choice theory; I haven’t read many of them so I can’t give immediate recommendations.
I’d be happy to join you in your search for existing literature, if that’s a priority for you. Or just generally discuss the stuff. I’m particularly interested in applying rational choice models to how the internet will affect society.
AI can do the summaries.
I agree that people behave differently in observed environments.
Thanks this is super helpful! Edited.
usually getting complete information was the hard part of the project
Thoughts on Ray Dalio-style perfect surveillance inside the org? Would that have helped? Basically put everyone on video camera and let everyone inside the org access the footage.
Disclaimer: I have no personal reason to accelerate or decelerate Anthropic. I’m just curious from an org design perspective.
One pager
Can you send the query? Also, can you try typing the query twice into the textbox? I’m using OpenAI text-embedding-3-small, which sometimes seems to work better if you type the query twice. Another thing you can try is retrying the query after 30 minutes; I’m cycling subsets of the data every 30 minutes as I can’t afford to host the entire dataset at once.
Thanks for feedback.
I’ll probably do the title and trim the snippets.
One way of getting a quote would be to do LLM inference and generate it from the text chunk. Would this help?
Update: HTTPS issue fixed. Should work now.
Books Search for Researchers
Thanks for your patience. I’d be happy to receive any feedback. Negative feedback especially.
Update: HTTPS should work now
use http not https
Search engine for books
http://booksearch.samuelshadrach.com
Aimed at researchers
Technical details (you can skip this if you want):
Dataset size: libgen 65 TB, of which unique English epubs 6 TB, of which plaintext 300 GB, from which embeddings 2 TB, hosted on 256+32 GB CPU RAM
Did not do LLM inference after embedding search step because human researchers are still smarter than LLMs as of 2025-03. This tool is meant for increasing quality for deep research, not for saving research time.
Main difficulty faced during the project: disk throughput is a bottleneck, and popular languages like nodejs and python tend to leak memory when dealing with large datasets. Most of my repo is in bash and perl. Scaling this project up further will require a way to increase disk throughput beyond what mdadm on a single machine allows. Increased funds would also have helped me complete the project sooner; it took maybe 6 months part-time, and could have been less.
Got it!
I haven’t spent a lot of time thinking about this myself. But here’s one exercise I’d recommend:
For any idea you have, also imagine 20 other neighbouring ideas, ideas which are superficially similar but ultimately not the same.
The reason I’m suggesting this exercise is that ideas keep mutating. If you try to popularise any set of ideas, people are going to come up with every possible modification and interpretation of them. And eventually some of those are going to become more popular and others less popular.
For example, with the “no removing a core aspect of humanity” principle, imagine someone who values fairness and equality highly, considers that value a core aspect of humanity, and then thinks through its implications. Or, with “parents have a strong right to propagate their own genes”, a hardcore libertarian takes this very seriously and wants to figure out the edge case of exactly how many “bad” genes they are allowed to transmit to their child before they run afoul of the “aimed at giving their child a life of wellbeing” principle.
You can come up with a huge number of such permutations.