related: I’d like to be able to query what’s needed to display a page in a roamlike ui, which would involve a tree walk.
graph traversal: I want to be able to ask what references what efficiently, get shortest path between two nodes given some constraints on the path, etc.
search: I’d like to be able to comfortably query at least 3k vectors (pages), maybe more like 30k (pages + line-level embeddings from lines of editable pages), if not more like 400k (line-level embeddings from all pages). I’ll often want to query vectors while filtering to only relevant types of vector (page vs line, category, etc). milvus claims to have this down pat; weaviate seems shinier and has built-in support for generating the embeddings, but according to a test it is less performant? also it has fewer types of vector relationships and some of the ones milvus has look very useful, eg
sync: I’d like multiple users to be able to open a webclient (or deno/rust/python/something desktop client?) at the same time and get a realtime-ish synced view. this doesn’t necessarily have to be gdocs grade, but it should work for multiple users straightforwardly and so the serverside should know how to push to the client by default. if possible I want this without special setup. surrealdb specifically offers this, and its storage seems to be solid. but no python client. maybe that’s fine and I can use it entirely from javascript, but then how shall I combine with the vector db?
seems like I really need at least two dbs for this because none of them do both good vector search and good realtimeish sync. but, hmm, docs for surrealdb seem pretty weak. okay, maybe not surrealdb then. edgedb looks nice for main storage, but no realtime. I guess I’ll keep looking for that part.
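To make the "search" requirement above concrete, here is a minimal brute-force sketch of filtered vector search in plain Python. It is not how milvus or weaviate work internally (they use approximate indexes); the record layout, the `kind`/`category` metadata fields, and the `search` helper are all assumptions made up for illustration, and the 3-dimensional vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(index, query, k=3, **filters):
    """Return the top-k (id, score) pairs among records whose metadata matches every filter."""
    candidates = [
        item for item in index
        if all(item["meta"].get(key) == value for key, value in filters.items())
    ]
    scored = [(item["id"], cosine(item["vec"], query)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy index: page-level and line-level vectors with metadata.
index = [
    {"id": "page:1", "vec": [1.0, 0.0, 0.0], "meta": {"kind": "page", "category": "db"}},
    {"id": "line:1a", "vec": [0.9, 0.1, 0.0], "meta": {"kind": "line", "category": "db"}},
    {"id": "line:2a", "vec": [0.0, 1.0, 0.0], "meta": {"kind": "line", "category": "misc"}},
    {"id": "page:2", "vec": [0.0, 0.0, 1.0], "meta": {"kind": "page", "category": "misc"}},
]

line_hits = search(index, [1.0, 0.0, 0.0], k=2, kind="line")
```

Filtering before scoring keeps the result count predictable; a dedicated vector db would do the equivalent inside its index rather than scanning every record.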
Yeah, it seems likely you’ll end up with 2 or 3 different store/query mechanisms: something fairly flat and transactional-ish for interactive edits (best-effort is probably fine; no need to resolve long-disconnected edits), and something for search/traversal, which will vary widely based on the depth of the traversals, the cardinality of the graph, etc. That could be a denormalized schema in the same DBMS or a different DBMS. Perhaps also a caching layer for low-latency needs (maybe not a different store/query, just results caching somewhere), and perhaps an analytics store for asynchronous big-data processing.
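As a rough illustration of the traversal side, here is a sketch of a constrained shortest-path query done in memory with breadth-first search. The edge kinds (`child`, `ref`), the `allowed_kinds` and `max_hops` parameters, and the sample graph are all hypothetical; a real store would push these constraints into its query language.

```python
from collections import deque

def shortest_path(edges, start, goal, allowed_kinds=None, max_hops=None):
    """Breadth-first search over typed edges.

    edges maps node -> list of (neighbor, kind). Returns the list of nodes on a
    shortest path from start to goal, or None if no path satisfies the constraints.
    """
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_hops is not None and depth >= max_hops:
            continue  # expanding this node would exceed the hop limit
        for neighbor, kind in edges.get(node, []):
            if allowed_kinds is not None and kind not in allowed_kinds:
                continue  # edge type excluded by the path constraint
            if neighbor in parent:
                continue  # already reached by a path that is no longer
            parent[neighbor] = node
            if neighbor == goal:
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append((neighbor, depth + 1))
    return None

# Hypothetical graph: "child" edges are outline nesting, "ref" edges are links.
edges = {
    "home": [("projects", "child"), ("inbox", "ref")],
    "projects": [("notes-db", "child")],
    "inbox": [("notes-db", "ref")],
    "notes-db": [],
}

via_refs = shortest_path(edges, "home", "notes-db", allowed_kinds={"ref"})
```

Cost grows with how many nodes sit within the hop limit, which is why depth and graph cardinality decide whether this stays in the primary store or moves to something graph-specific.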
Honestly, even if this is pretty big in scope, I’d prototype with Mongo or DynamoDB as my primary store (or a SQL store if you’re into that), using simple adjacency tables for the graph connections. Then layer a GraphQL processor either directly on that store or on a replicated/differently-normalized one.
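For the SQL variant of that prototype, a minimal sketch with SQLite might look like the following. The `blocks` and `edges` tables, their columns, and the `page_tree`/`backlinks` helpers are assumptions for illustration only; the recursive CTE does the tree walk needed to render a page, and a plain lookup on the same adjacency table answers "what references this".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blocks (id TEXT PRIMARY KEY, text TEXT);
-- One adjacency table for every connection; kind separates nesting from links.
CREATE TABLE edges (src TEXT, dst TEXT, kind TEXT, position INTEGER);
""")
conn.executemany("INSERT INTO blocks VALUES (?, ?)", [
    ("page", "Databases"),
    ("b1", "vector search"),
    ("b2", "sync"),
    ("b1a", "milvus vs weaviate"),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?)", [
    ("page", "b1", "child", 0),
    ("page", "b2", "child", 1),
    ("b1", "b1a", "child", 0),
    ("b2", "b1", "ref", 0),
])

def page_tree(conn, root):
    """Walk 'child' edges from root; returns (id, depth, text) rows ordered by depth, then id."""
    return conn.execute("""
        WITH RECURSIVE tree(id, depth) AS (
            SELECT ?, 0
            UNION ALL
            SELECT e.dst, tree.depth + 1
            FROM edges e JOIN tree ON e.src = tree.id
            WHERE e.kind = 'child'
        )
        SELECT tree.id, tree.depth, blocks.text
        FROM tree JOIN blocks ON blocks.id = tree.id
        ORDER BY tree.depth, tree.id
    """, (root,)).fetchall()

def backlinks(conn, target):
    """Ids of blocks that reference target through a 'ref' edge."""
    rows = conn.execute(
        "SELECT src FROM edges WHERE dst = ? AND kind = 'ref' ORDER BY src", (target,)
    )
    return [row[0] for row in rows]

tree_rows = page_tree(conn, "page")
```

This assumes the `child` edges form a tree; a cycle in them would make the recursive query loop, so a real schema would want a guard against that.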
Can you give me some more clues here? I want to help with this. By vectors, are you talking about similarity vectors between e.g. lines of text, paragraphs, etc.? And to optimize this you would want a vector db?
Why is sync difficult? In my experience any regular postgres db will have pretty snappy sync times. I feel like the text generation times will always be the bottleneck? Or are you thinking more of post-generation weaving?
Maybe I also just don’t understand how different these types of dbs are from a regular postgres.
By sync, I meant server-initiated push for changes. Yep, vectors are sentence/document embeddings.
The main differences from postgres I seek are: (1) I can be lazier setting up schema; (2) realtime push built into the db so I don’t have to build messaging; (3) if it could have surrealdb’s alleged “connect direct from the client” feature and not need serverside code at all, that’d be wonderful.
I’ve seen supabase suggested, as well as rethinkdb and kuzzle.
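For a sense of what "building messaging" means when the db does not push changes itself, here is a minimal in-process sketch using asyncio queues. The `ChangeHub` class and the shape of the change record are hypothetical; a real setup would sit behind websockets or a similar transport, and none of this reflects how supabase, rethinkdb, or kuzzle implement their push features.

```python
import asyncio

class ChangeHub:
    """Fan-out of change events to every connected client (one queue per client)."""

    def __init__(self):
        self._subscribers = set()

    def subscribe(self):
        # Each client gets its own queue so a slow reader does not block the others.
        queue = asyncio.Queue()
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        self._subscribers.discard(queue)

    def publish(self, change):
        # Called by the server after a write is committed.
        for queue in self._subscribers:
            queue.put_nowait(change)

async def demo():
    hub = ChangeHub()
    alice, bob = hub.subscribe(), hub.subscribe()
    hub.publish({"block": "b1", "text": "vector search (edited)"})
    # Both clients receive the same server-initiated update.
    return await alice.get(), await bob.get()

received = asyncio.run(demo())
```

The part a db with built-in push would save is exactly this hub plus the code that calls `publish` after every write.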
Good prompts.