Show HN: TeaTime – distributed book library powered by SQLite, IPFS and GitHub
by yoavm
Recently there seems to have been a surge in SQLite-related projects. TeaTime is riding that wave...
A couple of years ago I was intrigued by phiresky's post[0] about querying SQLite over HTTP. It made me think that if anyone can publish a database using GitHub Pages, I could probably build a frontend in which users can decide which database to query. TeaTime is like that - when you first visit it, you'll need to choose your database. Anyone can create additional databases[1]. TeaTime then queries it, and fetches files using an IPFS gateway (I'm looking into using Helia so that users are also contributing nodes in the network). Files are then rendered in the website itself. Everything is done in the browser - no users, no cookies, no tracking. LocalStorage and IndexedDB are used for saving your last readings and your position in each file.
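To give an idea of the flow - this is a sketch, not TeaTime's actual code, and the database URL and schema below are made up - querying a database and fetching a file looks roughly like this:

  import { createDbWorker } from "sql.js-httpvfs";

  // worker and wasm files ship with the library (paths per its README)
  const workerUrl = new URL("sql.js-httpvfs/dist/sqlite.worker.js", import.meta.url);
  const wasmUrl = new URL("sql.js-httpvfs/dist/sql-wasm.wasm", import.meta.url);

  // open a remote SQLite file; only the pages a query touches get fetched
  const worker = await createDbWorker(
    [{ from: "inline", config: {
      serverMode: "full",
      url: "https://example.github.io/books.sqlite3", // hypothetical database URL
      requestChunkSize: 4096,
    } }],
    workerUrl.toString(),
    wasmUrl.toString(),
  );

  // hypothetical schema: look up a file's IPFS CID, then fetch via a gateway
  const rows = await worker.db.query("select title, cid from books limit 10");
  const file = await fetch(`https://w3s.link/ipfs/${rows[0].cid}`);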
Since TeaTime is a static site, it's super easy (and free) to deploy. GitHub repo tags are used for maintaining a list of public instances[2].
Note that a GitHub repository isn't mandatory for storing the SQLite files or the front end - it's only for the configuration file (config.json) of each database, and for listing instances. Both the instances themselves and the database files can be hosted on Netlify, Cloudflare Pages, your Raspberry Pi, or any other server that can host static files.
I'm curious to see what other kinds of databases people can create, and what other types of files TeaTime could be used for.
[0] https://news.ycombinator.com/item?id=27016630
[1] https://github.com/bjesus/teatime-json-database/
[2] https://github.com/bjesus/teatime/wiki/Creating-a-TeaTime-in...
This looks neat!
IPFS contributor here.
I had a look at your code and saw how you handle downloading from multiple gateways.
There's a better way to do this which also leverages direct P2P retrieval in browsers, which is now a thing with IPFS! [0] If you just want to fetch, check out @helia/verified-fetch[1], which gives you a Fetch-like API that accepts CIDs. It handles all the content routing and retrieval P2P magic and can help reduce reliance on gateways.
You can also pass it gateways in case it can connect to providers directly (due to transports)
[0] https://blog.ipfs.tech/2024-shipyard-improving-ipfs-on-the-w... [1] https://github.com/ipfs/helia-verified-fetch/tree/main/packa...
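A minimal sketch of what that looks like (the CID below is a placeholder):

  import { createVerifiedFetch } from "@helia/verified-fetch";

  // spins up a lightweight Helia node under the hood; the listed gateway
  // is only used when no provider can be reached directly
  const verifiedFetch = await createVerifiedFetch({
    gateways: ["https://trustless-gateway.link"],
  });

  // placeholder CID; the response bytes are verified against it client-side
  const res = await verifiedFetch("ipfs://bafybeiexamplecid");
  const blob = await res.blob();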
Thanks! I really wanted to use verified-fetch, but it seemed like it doesn't really support all types of CIDs? In my case, most of the files use blake2b-256, and the last time I checked verified-fetch didn't like that :( Did something change?
Edit: I now realised we previously talked on the Discord channel, thank you for the help there!
With some extra config you can pass custom hashers.
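Roughly like this - I haven't double-checked the exact option name against the current docs, so treat the hashers option and the hasher import as assumptions:

  import { createVerifiedFetch } from "@helia/verified-fetch";
  import { blake2b } from "@multiformats/blake2";

  // assumption: extra hashers can be passed so verified-fetch can verify
  // non-default multihashes such as blake2b-256
  const verifiedFetch = await createVerifiedFetch(
    { gateways: ["https://trustless-gateway.link"] },
    { hashers: [blake2b.blake2b256] },
  );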
Any reason you chose BLAKE2b-256?
BLAKE2/3 are amazing hashing functions and have a lot of cool properties, but they really suck on the web in terms of performance. This came up literally yesterday https://bsky.app/profile/norman.life/post/3lbw2qltokc2i and has been reported by multiple devs recently (who also shared excitement about BLAKE)
The main reason I wanted to support it is that it happens to be what Library Genesis uses. I'll have a look into passing these custom hashers! Would definitely be more elegant than hitting all gateways together and hoping for the best.
Small correction:
You can also pass [@helia/verified-fetch] gateways as a fallback for when it *cannot* connect to providers directly (due to transports).
The Pear P2P framework might be worth a look if you want to get off of GitHub and turn it into a truly distributed system. If the index must be on GitHub then what good is it to have the files on IPFS?
Thanks, I shall have a look into that. Technically the index too can live on IPFS, but the real challenge was how to know which databases exist, and where. This is what I'm using the GitHub topics for - everyone can create a repo and assign a topic to it, and the free GitHub API lets you search for matching repos. In a way, what I'm looking for is to be able to make a request and get a different result when a new database is up, and the immutable nature of IPFS, for example, isn't great for that. GitHub stars and accounts also give some basic concept of content moderation. I should see what Pear offers in this space.
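Concretely, discovery boils down to one GitHub search call, roughly like this (a sketch, not my exact code - the topic name is the one the database repos are tagged with):

  // repos tagged with the topic, sorted by stars (the moderation signal)
  const res = await fetch(
    "https://api.github.com/search/repositories" +
      "?q=topic:teatime-database&sort=stars&order=desc"
  );
  const { items } = await res.json();
  // each item is a repo that's expected to publish a config.json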
Generally, I see the distributed nature of TeaTime as an instrument for being resilient, and less so a goal by itself. My thinking is that the GitHub database repositories are "resilient enough", especially if they can contain just a config.json file pointing to database files living elsewhere. But I'm not a lawyer and could be wrong.
> Technically the index too can live on IPFS
Besides, HTTP requests against indices should be fast enough for a decent user experience and IPFS (and/or its gateways) aren’t great at that in my experience. I think using GitHub (or any other static hosting providers) was a good call in that regard.
> (I'm looking into using Helia so that users are also contributing nodes in the network)
I had to look that term up <https://github.com/ipfs/helia#readme> but while sniffing around in their wiki <https://github.com/ipfs/helia/wiki/Projects-using-Helia> I was reminded of https://github.com/orbitdb/orbitdb#readme which seems like it may involve much less rolling your own parts
Thanks, I hadn't heard of it before. I wonder if it's possible to use OrbitDB to download a file from IPFS though? Based on its CID, I mean, because that was my intention with Helia. I thought that one of the nice things about reading books is that it takes time, and if users could be seeding their files from IndexedDB to the network, they could automatically and effortlessly become "contributing citizens".
Another interesting use case would be to see if this can replace (or be an addition to) SQLite as the database in which the queries are run.
Seeding sounds like what's done by torrenting too, no? How do you prevent people from distributing content they do not own?
Yes it is, and you're absolutely right with raising this concern. I was planning to make seeding (if I get it working at all) an opt-in feature, firstly to prevent excessive data usage on mobile connections.
I see, nice project nonetheless! Hope it gets where you want it :-)
Also see: Show HN: Fully-searchable Library Genesis on IPFS
"Everything is done in the browser - no users, no cookies, no tracking. LocalStorage and IndexedDB are used for saving your last readings, your position in each file."
I love this! Thanks for making this!
> The databases used in TeaTime are GitHub repositories tagged with the teatime-database topic, which are published on GitHub Pages.
Couldn't this be a security issue, with bad actors using this tag?
That's a fair point - I guess it could be abused. Databases are sorted by their number of GitHub stars, so I was hoping that with the power of the crowds it will be possible to minimize the bad effect such an actor might have, by simply not voting them up.
there have been several attacks recently where a bad actor takes over a repo whose original maintainer wants to take a step back, then launches a supply chain attack. in recent cases, the attack came from obfuscated binary files in the repo rather than code. given we are dealing with documents here (books), it would be easy to hide malicious code in a file. pdfs have all sorts of execution vulnerabilities, for example
Interesting - I'm kinda counting on PDF.js, which is used for PDF rendering, to do it safely, but of course that doesn't always have to be the case. Do you have any thoughts on how to make this safer?
some other method of collection where you can have known trust of the files contributed; some method of "registering" a submission to create trust
PDFs are not executables.
Other than the JavaScript, you mean.
https://cloudmersive.com/article/Understanding-Embedded-Java...
You’d be surprised what’s executable with the right attitude.
May I recommend the old 27C3 talk "OMG WTF PDF"?
It does not work for me. For example:
> Access to fetch at 'https://bafykbzacedvzdo4hru3wul5excjthwjuz5ggd6jcs4wg77tg7bk...' (redirected from 'https://w3s.link/ipfs/bafykbzacedvzdo4hru3wul5excjthwjuz5ggd...') from origin 'https://bjesus.github.io' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
The CORS headers seem fine, though I'm a little surprised about the content-type:
$ curl -I https://bafykbzacedvzdo4hru3wul5excjthwjuz5ggd6jcs4wg77tg7bkn3c7hlvohq.ipfs.dweb.link/
HTTP/2 200
date: Thu, 28 Nov 2024 16:34:03 GMT
content-type: application/x-rar-compressed
content-length: 192215114
access-control-allow-headers: Content-Type
access-control-allow-headers: Range
access-control-allow-headers: User-Agent
access-control-allow-headers: X-Requested-With
access-control-allow-methods: GET
access-control-allow-methods: HEAD
access-control-allow-methods: OPTIONS
access-control-allow-origin: *
Can you try with another file? I do think that once I implement Helia's verified-fetch, issues like this should be less common.
Sorry if I missed this, but is there an example instance we can play with?
I don't know what's in the database, so I don't know what to search for.
How about a browse feature?
Ha, that's a great point... I can probably work on a meta "table of contents" for each database, shouldn't be hard to do. But generally the database descriptions say what they have - public domain books, Library Genesis non-fiction, etc.
Hi I’m here looking for a different offering. I have a personal library of physical books that our small group wants to be able to lend out to others.
I want people to be able to see which books are available and reserve books that are out and get notifications when due back.
And preferably self-host it as a small site.
Anyone know a small open source physical library management tool like this?
This is awesome. Would it be able to integrate Pocket bookmarks into this somehow? Would love to be able to keep a cache offline of all the things I wanted to read (esp if I can prevent link-rot), and to be able to query them.
I need to check about Pocket - do you have any relevant documentation? Note however that TeaTime does keep an offline cache, both a list of your recent books and the files themselves, but it does it using LocalStorage and IndexedDB because I really wanted to avoid the need for user logins. Maybe I can add an option to download a "state dump" kind of thing, so that you get a file with your history and can always import it. I too always feel that LocalStorage can be a little flaky, in the sense that you don't always realize that switching browsers would mean losing all your data.
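A rough sketch of what such an export could look like (not implemented yet, and IndexedDB contents would need similar handling):

  // hypothetical "state dump": everything kept in localStorage,
  // serialized into a downloadable JSON file
  function exportState() {
    const state = {};
    for (let i = 0; i < localStorage.length; i++) {
      const key = localStorage.key(i);
      state[key] = localStorage.getItem(key);
    }
    const blob = new Blob([JSON.stringify(state)], { type: "application/json" });
    const link = document.createElement("a");
    link.href = URL.createObjectURL(blob);
    link.download = "teatime-state.json";
    link.click();
  }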
beautiful! i made a much crappier version of the same kind of thing a year or two ago with a manually converted libgen db from annas-archive's index, but the best i could come up with was webtorrenting a sqlite db of hashes and importing it into indexeddb, then distributing the static frontend bundle as an urbit app; it was much uglier than this
Thank you! I've seen quite a lot of similar projects as I was working on this, and they all served as much inspiration and also a proof that this can in fact be done.
Also, thanks to whoever came up with thebookbay.org instance :)
Is this like an open source distributed libgen?
Libgen is one of the databases that are currently available (and was probably a good fit because they already had their files on IPFS), but I think that this architecture, in which the UI is decoupled from the database, the database doesn't hold any copyrighted materials, and the files are downloaded from IPFS, is quite resilient and could be used for serving all sorts of content.
Do static sites built with sphinx-build or jupyter-book or hugo or other jamstack static site generators work with TeaTime?
sphinx-build: https://www.sphinx-doc.org/en/master/man/sphinx-build.html
There may need to be a different Builder or an extension of sphinxcontrib.serializinghtml.JSONHTMLBuilder which serializes a doctree (basically a DOM document object model) to the output representation: https://www.sphinx-doc.org/en/master/usage/builders/#sphinxc...
datasette and datasette-lite can load CSV, JSON, Parquet, and SQLite databases; support full-text search; and support search faceting. datasette-lite is a WASM build of datasette with the pyodide Python distribution.
datasette-lite > Loading SQLite databases: https://github.com/simonw/datasette-lite#loading-sqlite-data...
jupyter-lite is a WASM build of jupyter which also supports sqlite in notebooks in the browser with `import sqlite3` with the python kernel and also with a sqlite kernel: https://jupyter.org/try-jupyter/lab/
jupyterlite/xeus-sqlite-kernel: https://github.com/jupyterlite/xeus-sqlite-kernel
(edit)
xeus-sqlite-kernel > "Loading SQLite databases from a remote URL" https://github.com/jupyterlite/xeus-sqlite-kernel/issues/6#i...
%FETCH <url> <filename> https://github.com/jupyter-xeus/xeus-sqlite/blob/ce5a598bdab...
xlite.cpp > void fetch(const std::string url, const std::string filename) https://github.com/jupyter-xeus/xeus-sqlite/blob/main/src/xl...
> Do static sites built with sphinx-build or jupyter-book or hugo or other jamstack static site generators work with TeaTime?
I guess it depends on what you mean by "work with TeaTime". TeaTime itself is a static site, generated using Nuxt. Nothing it does couldn't be achieved with another stack - at the end it's just HTML, CSS and JS. I haven't tried sphinx-build or jupyter-book, but there isn't a technical reason why Hugo couldn't build a TeaTime-like website, using the same databases.
> datasette-lite > Loading SQLite databases: https://github.com/simonw/datasette-lite#loading-sqlite-data...
I haven't seen datasette before. What are the biggest benefits you think it has over sql.js-httpvfs (which I'm using now)? Is it about the ability to also use other formats, in addition to SQLite? I got the impression that sql.js-httpvfs was a bit more of a POC, and later some possibly better solutions came out, but I haven't really gone down that rabbit hole to figure out which one would be best.
Edit: looking a little more into datasette-lite, it seems like one of the nice benefits of sql.js-httpvfs is that it doesn't download the whole SQLite database in order to query it. This makes it possible to have a 2GB database but still read it in chunks, skipping around efficiently until you find your data.
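Under the hood that's plain HTTP Range requests; roughly (placeholder URL):

  // fetch a small byte range instead of the whole 2GB file - note the
  // gateway headers quoted above already allow Range via CORS
  const res = await fetch("https://example.github.io/books.sqlite3", {
    headers: { Range: "bytes=0-4095" }, // just the first database page
  });
  const firstPage = new Uint8Array(await res.arrayBuffer());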
So the IPFS part is instead of a CDN? Does it replace asset URLs in templates with e.g. cf IPFS gateway URLs with SRI hashes?
datasette-lite > "Could this use the SQLite range header trick?" https://github.com/simonw/datasette-lite/issues/28
From xeus-sqlite-kernel > "Loading SQLite databases from a remote URL" https://github.com/jupyterlite/xeus-sqlite-kernel/issues/6#i... re: "Loading partial SQLite databases over HTTP":
> sql.js-httpvfs: https://github.com/phiresky/sql.js-httpvfs
> sqlite-wasm-http: https://github.com/mmomtchev/sqlite-wasm-http
>> This project is inspired from @phiresky/sql.js-httpvfs but uses the new official SQLite WASM distribution.
Datasette creates a JSON API from a SQLite database, has an optional SQL query editor with canned queries, multi DB query support, docs; https://docs.datasette.io/en/stable/sql_queries.html#cross-d... :
> SQLite has the ability to run queries that join across multiple databases. Up to ten databases can be attached to a single SQLite connection and queried together.
The IPFS part is where the actual files are downloaded from. I wouldn't necessarily say that it's instead of a CDN - it just seems to be a reliable enough way of getting the files (without the pain of setting up a server with all these files myself). Indeed, it fetches them through IPFS Gateways (the Cloudflare IPFS Gateway is unfortunately no longer a thing) using their CID hash.
I absolutely don't understand this.
Doesn't ipfs require a server running? Where is the server code?
It does require an IPFS node, i.e. a server to provide the data. The point is more that data isn't tied to any specific server. By content addressing with CIDs, you can fetch from multiple IPFS nodes and even become a "provider" for a CID once your IPFS node downloads a copy.
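For example, with Helia in the browser (a sketch; the CID is a placeholder):

  import { createHelia } from "helia";
  import { unixfs } from "@helia/unixfs";
  import { CID } from "multiformats/cid";

  // an in-browser node: blocks it fetches are cached locally, and the
  // node can then serve them back to other peers it's connected to
  const helia = await createHelia();
  const fs = unixfs(helia);

  const chunks = [];
  for await (const chunk of fs.cat(CID.parse("bafybeiexamplecid"))) {
    chunks.push(chunk);
  }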
Yes, I'm aware how ipfs works, but that doesn't really answer the question :)
Sorry for misunderstanding.
I think the app was built on the assumption that someone is providing these CIDs.
I'm currently using public IPFS Gateways to fetch the files from IPFS.
super cool
IPFS needs server running or am I wrong?
For fetching files from IPFS, you can use a public IPFS Gateway, and the Gateway is essentially your server. That's what TeaTime is doing.
I'm slightly disappointed that this is not the Tea-Time that formed the comms/sync protocol used in the Croquet VR system (on Squeak Smalltalk) some years back. :)
Any particular reason for choosing IPFS instead of bittorrent or other p2p protocols? It feels like every time I try an IPFS tool it just crashes, whereas I rarely have issues with torrents.
IPFS contributor here.
IPFS certainly has its flaws, but it's been getting much better in the last year. If you want to do P2P retrieval in browsers, it's practically the most advanced of all the protocols.
We just published an update on this topic https://blog.ipfs.tech/2024-shipyard-improving-ipfs-on-the-w...
> If you want to do P2P retrieval in browsers
My understanding is that IPFS always requires an IPFS → HTTP gateway to work in browsers, so I wonder what you mean by "P2P retrieval in browsers".
Verified Fetch is about verifying content retrieved from gateways on the client side, but the retrieval is still very much client-server (vs P2P) today, and Service Worker Gateway seems to rely on gateways too (<trustless-gateway.link> in the demo linked).
Based on IPFS & HTTP: The Blueprint for Radical Interoperability - Lidel [0], I believe the situation is acknowledged by IPFS devs too. In my unsolicited opinion, using a custom protocol (libp2p transports?) for a project targeting web browsers was a mistake, so I'm glad to see that HTTP is being considered now.
> My understanding is that IPFS always requires an IPFS → HTTP gateway to work in browsers, so I wonder what you mean by "P2P retrieval in browsers".
By P2P retrieval, I mean retrieval directly from a peer that has the data without additional hops.
Historically HTTP gateways were the only way, because you couldn't dial (most) peers directly from a browser unless they had a CA-signed TLS certificate (needed in secure contexts).
A couple of things changed that:
- WebTransport and WebRTC-direct allow browser-to-server (peer) communication without a CA-signed certificate. WebTransport is very new and still has some problems in browser implementations.
- We just launched AutoTLS, which helps public IPFS nodes get a wildcard Let's Encrypt TLS certificate, making them "dialable" from browsers
With these, it becomes a whole lot easier to establish connections from browsers to IPFS nodes. But it takes time for the network to upgrade.
Note that Libp2p transports are wrappers over network transports to allow interoperability across runtimes, e.g. TCP, QUIC, WebSockets, WebTransport. These are not custom protocols.
Now, to your point about custom protocols, by which I presume you are referring to data exchange/retrieval protocols.
There are two IPFS data transfer protocols:
- Bitswap: the first and default data transfer protocol, which requires a streaming network transport (so HTTP isn't ideal).
- HTTP Gateways: initially, HTTP gateways were servers that would handle retrieval (over Bitswap) for you if they didn't have the data locally (sometimes we refer to these as Recursive Gateways, like trustless-gateway.link and ipfs.io). For all the reasons in Lidel's talk (caching, interop, composability), we extended this notion and made HTTP Gateway a general interface for data retrieval.
Today, there are large providers in the network (pinning services) that expose their data with an HTTP Gateway, which allows browsers to retrieve from them directly.
We still have more work to do to expose the gateway in Kubo with the new AutoTLS certificate, but for the time being, Bitswap should work well enough even in browsers over a WebSockets connection.
Verified Fetch aims to ensure you get the data if it's available. If it fails to find a direct provider of the data, it will fall back on a recursive gateway like trustless-gateway.link, which is configured in the defaults. As more IPFS nodes upgrade to newer versions, reliance on such centralised recursive gateways will become unnecessary.
TL;DR: We're all in on HTTP. The real constraint on P2P is browser vendors and slow standards bodies.
Yeah - the desire to have TeaTime run as a normal website. BitTorrent doesn't run over HTTP, unless you use WebTorrent, which most BitTorrent clients aren't 100% compatible with. This means you can basically only download from other WebTorrent nodes - and there aren't many.
I do think this might change with the introduction of things like the Direct Sockets API, but for now they are too restricted and not widely supported.
It's my first time working with IPFS and I agree it hasn't always been 100% reliable, but I do hope that if I manage to get TeaTime users to also be contributing nodes, this might actually improve the reliability of the whole network. Once it's possible to use BitTorrent in the browser, I do think it would be a great addition (https://github.com/bjesus/teatime/issues/3).