Forking Graze's Turbostream: Rewriting a Bluesky Firehose Consumer in Rust

Here at Messijo, we are built on the backbone of other networks like Bluesky, Hacker News, and Lobsters. When they go down, we feel the pain too. If you’re interested, we’d love to tell you about our journey from relying on the awesome kindness of Graze.social to running our own self-hosted Turbostream so that we can deliver the best and most reliable Bluesky notifications possible.

Firehoses & Jetstreams & Turbostreams, oh my!

The Bluesky ecosystem is built on the AT Protocol, which is a distributed, decentralized, and secure protocol for building applications. The biggest application on ATProto is the Bluesky app. Some of your users are probably there talking about your product, your company, your concerns, or your competitors.

The Firehose

The AT Protocol is fully open and decentralized. You can even listen for every single event from many different sources and relays. This is called the “firehose”. You can learn more about the Firehose from the official Bluesky docs.

So, you can just connect to it with your favorite websocket client and start seeing all the events on the AT Protocol. But there’s a catch. The data is binary, encoded as CBOR, so you will need to figure out how to parse that in your language of choice. Not the hardest thing in the world, but definitely a hurdle.
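To give a feel for what “parse CBOR” means at the byte level, here is a minimal sketch that decodes only the head of a single CBOR item (its major type and argument) as laid out in RFC 8949. The names `CborHead` and `decode_head` are ours for illustration. This is not a firehose parser, and a real consumer would most likely reach for an existing CBOR crate rather than hand-rolling this.

```rust
/// The head of a CBOR data item: its major type (0..=7) and its argument.
#[derive(Debug, PartialEq)]
struct CborHead {
    major: u8,
    argument: u64,
    /// Number of bytes the head occupied in the input.
    len: usize,
}

/// Decode the head of one CBOR item (RFC 8949, section 3).
/// Returns None for truncated input or for additional-info values this
/// sketch does not handle (28..=31: reserved and indefinite length).
fn decode_head(bytes: &[u8]) -> Option<CborHead> {
    let first = *bytes.first()?;
    let major = first >> 5; // top three bits
    let info = first & 0x1f; // low five bits
    // How many extra bytes carry the argument.
    let extra = match info {
        0..=23 => 0,
        24 => 1,
        25 => 2,
        26 => 4,
        27 => 8,
        _ => return None,
    };
    if extra == 0 {
        // Small arguments live directly in the first byte.
        return Some(CborHead { major, argument: info as u64, len: 1 });
    }
    // Larger arguments follow as a big-endian integer.
    let tail = bytes.get(1..1 + extra)?;
    let argument = tail.iter().fold(0u64, |acc, b| (acc << 8) | *b as u64);
    Some(CborHead { major, argument, len: 1 + extra })
}

fn main() {
    // 0x63 = major type 3 (text string) with length 3, followed by "foo".
    let item = [0x63, b'f', b'o', b'o'];
    let head = decode_head(&item).expect("valid head");
    let end = head.len + head.argument as usize;
    let text = std::str::from_utf8(&item[head.len..end]).expect("valid UTF-8");
    println!("major={} length={} text={}", head.major, head.argument, text);
}
```

Even this tiny piece shows why the binary format is a hurdle: before you can read a single field you have to handle variable-length heads, and that is before maps, arrays, and tags enter the picture.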

The Jetstream

The Bluesky team recognized this hurdle and built the Jetstream to help devs get their feet wet in the ATProto ecosystem. It’s a simple wrapper around the Firehose that provides JSON-encoded events that are easier to work with and read as a developer.

You can subscribe to it with a simple websocket listener just like you can for the firehose, but now you can see the message payloads when console logging them in a language like JS/TS. That’s a great start, but you might realize that it’s lacking some of the metadata you would need to build a proper application. For that metadata, you have to make calls to the Bluesky API. This is a lot of extra work to get the full picture of what’s going on and you are likely to hit rate limits quickly without proper caching.
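As a small illustration of the subscription step, the sketch below only builds the websocket URL and leaves the connection to whichever websocket client you prefer. It assumes Jetstream's `/subscribe` path and its `wantedCollections` query parameter. The helper name `jetstream_url` is hypothetical, and the host in `main` is just an example of a public instance, so substitute the one you actually use.

```rust
/// Build a Jetstream subscribe URL that filters events by collection NSID.
/// `host` is whichever Jetstream instance you connect to (an assumption here).
fn jetstream_url(host: &str, collections: &[&str]) -> String {
    let mut url = format!("wss://{host}/subscribe");
    for (i, nsid) in collections.iter().enumerate() {
        // First parameter starts the query string, the rest are appended.
        url.push(if i == 0 { '?' } else { '&' });
        url.push_str("wantedCollections=");
        url.push_str(nsid);
    }
    url
}

fn main() {
    // Example host only; pick the instance that suits you.
    let url = jetstream_url("jetstream2.us-east.bsky.network", &["app.bsky.feed.post"]);
    println!("{url}");
}
```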

Graze.social’s Turbostream

Luckily, Graze.social ran into this same exact problem and realized they could give even more back to the community. They built a simple wrapper around the Jetstream and provided it to developers for free. They called it the Turbostream. It’s also just a simple websocket connection away. Under the hood, they’re caching data in Redis, SQLite, and even backing data up to S3 for historical records. When logging values from the feed, you can see replies, parent posts, user names and images, and much more that wasn’t in the original Jetstream payload.
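The core idea, enriching each bare event with cached metadata, can be sketched in a few lines. This is not Graze.social's code or data model: `RawPost`, `Profile`, `HydratedPost`, and `Hydrator` are hypothetical names, the fields are heavily simplified, and the `fetch` closure stands in for a real Bluesky API request.

```rust
use std::collections::HashMap;

/// A bare event as it might arrive from the Jetstream (fields simplified).
struct RawPost {
    did: String,
    text: String,
}

/// Author metadata that the raw event does not carry.
#[derive(Clone, Debug, PartialEq)]
struct Profile {
    handle: String,
}

/// The enriched ("hydrated") event handed to consumers.
#[derive(Debug, PartialEq)]
struct HydratedPost {
    did: String,
    text: String,
    author: Profile,
}

/// Cache-aside hydrator: checks memory first and only calls `fetch`
/// (a stand-in for a Bluesky API request) on a cache miss.
struct Hydrator<F: FnMut(&str) -> Profile> {
    profiles: HashMap<String, Profile>,
    fetch: F,
    api_calls: u32,
}

impl<F: FnMut(&str) -> Profile> Hydrator<F> {
    fn new(fetch: F) -> Self {
        Self { profiles: HashMap::new(), fetch, api_calls: 0 }
    }

    fn hydrate(&mut self, post: RawPost) -> HydratedPost {
        let cached = self.profiles.get(&post.did).cloned();
        let author = match cached {
            Some(profile) => profile,
            None => {
                // Miss: pay for one lookup, then remember the result.
                self.api_calls += 1;
                let profile = (self.fetch)(&post.did);
                self.profiles.insert(post.did.clone(), profile.clone());
                profile
            }
        };
        HydratedPost { did: post.did, text: post.text, author }
    }
}

fn main() {
    // Fake lookup so the sketch runs without network access.
    let mut hydrator = Hydrator::new(|did: &str| Profile { handle: format!("handle-for-{did}") });
    for text in ["first post", "second post"] {
        let post = RawPost { did: "did:plc:example".to_string(), text: text.to_string() };
        let hydrated = hydrator.hydrate(post);
        println!("{} says: {}", hydrated.author.handle, hydrated.text);
    }
    println!("lookups performed: {}", hydrator.api_calls);
}
```

Two posts from the same author cost a single lookup, which is the whole point when rate limits are in play.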

Why We Forked

We’ve been using Graze.social’s Turbostream while building Messijo. But it’s not ideal to rely on another company as a middleman for such a core part of our business, both for stability reasons and for “being a good neighbor” reasons.

So we decided to “do it ourselves”. Technically, we built upon the shoulders of giants. Graze.social built a really solid foundation and we forked it, and THEN provided that context to an LLM.

The Actual Fork

We took the basic structure and realized there were a couple of things we wanted to change. The first was the language: the original is written in Python, which depends on an interpreter at runtime. We wanted to use a compiled language instead because it meant fewer dependencies on the underlying OS.

It turns out they were using Docker for more than we thought. They were also using Docker to spin up a Redis server. We didn’t need something as robust as Redis, so we decided to build our own version of that, too.

The thing we built was not_redis. It uses the same API as redis_rs but only runs in-process. Yes, there are some libraries that do similar things, but we decided to vibe up exactly what we needed for this use case.

It’s important to recognize that replacing Redis was not a choice we took lightly. But the more we looked at the problem, the more we realized this use case didn’t benefit from the scalability that the TCP communication layer provided. It could just be an in-process cache. So, we built around the same public API as redis_rs and moved on. This gave us some crazy benchmark improvements.
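To make “an in-process cache with Redis-style commands” concrete, here is a minimal sketch built on nothing but the standard library. It is not the not_redis source and it does not reproduce the redis_rs API: `InProcessStore` and its methods are hypothetical names that only borrow the semantics of SET, SETEX, GET, and DEL.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-process key-value store with Redis-style commands.
/// Each entry holds its value and an optional expiry deadline.
struct InProcessStore {
    entries: HashMap<String, (String, Option<Instant>)>,
}

impl InProcessStore {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Like Redis SET: store a value with no expiry.
    fn set(&mut self, key: &str, value: &str) {
        self.entries.insert(key.to_owned(), (value.to_owned(), None));
    }

    /// Like Redis SETEX: store a value that expires after `ttl`.
    fn set_ex(&mut self, key: &str, value: &str, ttl: Duration) {
        let deadline = Instant::now() + ttl;
        self.entries.insert(key.to_owned(), (value.to_owned(), Some(deadline)));
    }

    /// Like Redis GET: return the value unless it is missing or expired.
    /// Expired entries are removed lazily, on read.
    fn get(&mut self, key: &str) -> Option<String> {
        let expired = match self.entries.get(key) {
            None => return None,
            Some((_, Some(deadline))) => Instant::now() >= *deadline,
            Some((_, None)) => false,
        };
        if expired {
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(value, _)| value.clone())
    }

    /// Like Redis DEL: returns true if a key was actually removed.
    fn del(&mut self, key: &str) -> bool {
        self.entries.remove(key).is_some()
    }
}

fn main() {
    let mut store = InProcessStore::new();
    store.set_ex("profile:did:plc:example", "{\"handle\":\"example\"}", Duration::from_secs(300));
    println!("{:?}", store.get("profile:did:plc:example"));
    println!("deleted: {}", store.del("profile:did:plc:example"));
}
```

Every call here is a plain function call on a `HashMap`, with no socket, serialization, or separate server process involved, which is the trade we were making. A store shared across threads or async tasks would additionally need synchronization, which this sketch leaves out.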

The Result

We are hosting a Turbostream implementation on a single VPS instance. It handles the scale we need and still has room for improvement. We currently only allow connections from our own servers, but with a little encouragement we could open it up to other consumers. We do all of this with a single API token’s session and still maintain higher synchronicity than other providers.
