I Built an Open-Source Firebase Analytics Alternative Because I Hit 1M Events/Day Once Too Many

Why I stopped trusting hosted analytics for the raw layer and built a self-hosted event pipeline instead.

A few years ago, I was the data engineer on a mobile game team going through soft launch. We were watching the DAU graph climb in Firebase Analytics, feeling great about ourselves, when the daily event volume crossed one million.

That's when Firebase quietly started dropping events.

Not all at once. Not loudly. Just enough to make the numbers look slightly suspicious, until we realized that "slightly suspicious" in analytics means "you can no longer trust the dataset." Our retention curves had holes. Our funnels were leaking. The product team was reading charts built on data we could no longer rely on.

The pricing page had warned us, if you read it closely enough. Firebase has a 1M event/day cap on the free tier. Cross it, and you either pay enterprise pricing to enable BigQuery export (which scales fast as the game grows) or your data starts disappearing.

I didn't have enough data engineering experience back then to build us out of that corner. Message brokers like Kafka looked complicated. Self-hosting analytics felt like a project that would consume the whole team for months. We ended up paying.

But the idea stuck with me: why isn't there a small, opinionated, self-hosted analytics stack that a single engineer can deploy in an afternoon? Not a full Amplitude clone with cohort builders and feature flags and session replay. Just the pipeline part, the boring part, done right, owned by you, with no event caps and no vendor lock-in.

A few years later, I built it. It's called Rawbbit, it's Apache 2.0, and this is what it does, why it exists, and why I made the tradeoffs I made.

The architectural principle: own the raw layer

Hosted analytics products are great at giving you prebuilt answers. They are less great at letting you keep the facts behind those answers.

You send events. They ingest them. They aggregate them on the way in.
They show you dashboards built on those aggregates. Meanwhile, the raw events (the actual records of what happened) live on the vendor's servers, behind export features that are often expensive, slow, or down-sampled.

That works as long as your questions match the dashboards the product was designed to answer. The day you want to ask something the dashboard does not know about ("what's the day-1 retention for the cohort that came from the Tuesday update, before we changed the tutorial?"), you find out whether you still have the raw data. If it has already been summarized away, you can't recompute history.

Rawbbit starts from a simple rule: the raw event layer is the durable source of truth, and everything downstream is replaceable.

Dashboards, metric definitions, your warehouse, even your BI tool: all of them are derivatives. If one of them is wrong, you fix the derivation and recompute from the raw layer. If you want to switch tools, you point the new tool at the raw layer. If a new question comes up that you didn't anticipate, you query the raw layer.

So the pipeline is boring on purpose: events come in over HTTP, get buffered through a durable message broker (NATS JetStream), and land as partitioned Parquet files in your own object storage. From there, you query with BigQuery external tables (or any tool that reads Parquet) and model with whatever you prefer. The Parquet layer is the system of record. Everything else is a query against it.

The stack, and why each piece exists

The full pipeline is intentionally short:

Producer (your app or game)
  ↓ HTTP POST events
Collector API (Python, FastAPI)
  ↓ validates and enriches
NATS JetStream (durable buffer)
  ↓ buffered writes
Raw Writer (Python service)
  ↓ partitioned Parquet
Object storage (GCS, S3, or SeaweedFS)
  ↓ queried by
BigQuery external table → SQLMesh modeling → Metabase or any BI tool

Why these pieces:

HTTP collector, not an SDK-first design. The collector exposes a simple HTTP endpoint that accepts batched events.
You can build SDKs for any platform on top of it, and we ship a JavaScript SDK. But the wire protocol is open and small: you can curl an event into Rawbbit. This matters because game teams often want to integrate from native code (Unity, Unreal, custom engines) without waiting on an official SDK.

NATS JetStream as the buffer. Kafka is the obvious choice, but Kafka is operationally heavy for a small self-hosted setup. NATS JetStream gives me the durability I need with a fraction of the operational complexity: a single binary, clean defaults, persistent streams. For a pipeline ingesting analytics events (not running a bank), NATS is the right tradeoff. Decoupling the collector from the writer through a broker is what keeps ingestion online during traffic spikes. The collector accepts and acknowledges fast; the writer drains the queue at its own pace.

Parquet as the storage format. This is the most important design choice. Parquet is columnar, open, and supported by basically every data tool that matters. It's tolerant of schema evolution, so you can add fields without breaking old reads. It compresses well, which keeps storage costs manageable even with billions of events. And critically, Parquet files in your bucket are yours. Switch warehouses, switch BI tools, change your modeling layer: the files don't move and don't change.

BigQuery external tables on top. You point a BigQuery external table at the Parquet files in your bucket, and BigQuery queries them as if they were native tables. You pay BigQuery for compute when you query, not for storing the source of truth. If you want to leave BigQuery later, point another query engine (DuckDB, Trino, ClickHouse) at the same files. Migration becomes a config change, not a data rescue mission.

SQLMesh for downstream modeling. Once you have raw events, you almost always want modeled tables: daily active users, cohort definitions, funnel completions.
SQLMesh handles versioning, lineage, and incremental builds without the operational weight of dbt. The repo includes a starter project; teams extend it for their own metrics.

What's intentionally not in the stack

This matters more than the stack diagram.

Rawbbit doesn't ship with a cohort builder UI. It doesn't have visual funnel construction. There's no session replay, no feature flags, no A/B testing harness, no notification system. I left those out on purpose.

Each of those is a substantial product on its own. Trying to build all of them is how you end up with a hosted product analytics platform, which is exactly what Rawbbit isn't.

My assumption is simple: if you own a clean raw event layer in your warehouse, the cohort builder is a SQL query. The funnel is a SQL query. The retention curve is a SQL query. You probably already have a BI tool (Metabase, Superset, Looker, Sigma) that can render these. If you don't, Metabase is free and runs in a Docker container.

That also means Rawbbit is not the right choice for teams who want a polished product analytics UI out of the box. For those teams, Amplitude or PostHog will be a better fit. Rawbbit is for teams that have grown past the constraints of hosted analytics (too much volume, too much vendor lock-in, too many questions the product UI can't answer) and want to own the pipeline foundation.

What you gain / what you give up

I don't want to pretend this is free. Self-hosting always has a tradeoff.

What you get: ownership of the raw event layer, no event caps, no per-event pricing, no vendor lock-in, the ability to reprocess history when transformations have bugs, and full control over data residency and compliance.

What you give up: time-to-first-dashboard, polished metric-builder UIs, support contracts, and the convenience of someone else's infrastructure.

The way I think about it: hosted product analytics optimizes for time-to-first-insight.
A self-hosted raw layer optimizes for never losing the ability to get an insight later. Early in a product's life, the first matters more. As volume and stakes grow, the second starts to dominate.

A lot of teams with enough infra budget run both. A self-hosted raw layer acts as the durable archive that ML, finance, and ad hoc analysis pull from. A hosted product analytics tool gets a subset of events for the product team's day-to-day dashboards. The two aren't enemies. But only one of them is a contract you actually hold.

Where Rawbbit is now

This is an "I built X" post, not a victory lap, so here is the current state.

The pipeline (collector, NATS, Parquet writer) is solid. I've run it through traffic levels that would have broken the Firebase free tier ten times over, with no event loss. BigQuery external tables on the Parquet files work cleanly. The SQLMesh starter project handles basic modeling: cohorts, daily and weekly metrics, retention skeletons.

The part still in progress is SDKs. There's a JavaScript SDK that works for web events. The big request from game teams is a native Unity SDK and a native Unreal plugin, and that's what I'm building next. The SDK should abstract the HTTP transport and handle batching, retry, and offline buffering: the things you'd expect from a real game analytics SDK.

Why I'm telling this story

If you've hit the same wall I hit (Firebase Analytics event caps, Amplitude pricing at scale, the dread of reading your annual analytics bill), Rawbbit might be useful to you. Even if you don't use Rawbbit, don't outsource the only copy of your raw event data without thinking hard about the exit plan.

And if you've gone down the self-hosting path yourself, I'm curious how you approached it: what pushed you over the edge, what you'd build differently, where the tradeoffs bit you. The architecture decisions here are far from settled, and I'd rather discuss them in the open.
Issues and discussions on the repo are the best place for that.

The repo is at github.com/mirlan-irokez/rawbbit. The architecture notes are on rawbbit.one.

If you've ever looked at an analytics dashboard, known the underlying data was silently broken, and felt powerless to fix it, this is for you.
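
P.S. To make "recompute from the raw layer" a little more concrete, here is a minimal sketch of a day-1 retention calculation over raw events, written in plain Python rather than SQL so it runs anywhere. It is an illustration, not code from the Rawbbit repo: the field names `user_id` and `ts` are assumptions, timestamps are assumed to be ISO 8601 strings with an explicit UTC offset, and "day-1 retained" is taken to mean "had at least one event on the calendar day after the first one."

```python
from collections import defaultdict
from datetime import date, datetime, timedelta, timezone


def day1_retention(events):
    """Return {cohort_date: share of that day's new users active the next day}.

    `events` is an iterable of dicts with the hypothetical fields `user_id`
    and `ts` (ISO 8601 with a UTC offset). Real field names may differ.
    """
    # user_id -> set of UTC calendar days on which the user sent any event
    active_days = defaultdict(set)
    for event in events:
        ts = datetime.fromisoformat(event["ts"]).astimezone(timezone.utc)
        active_days[event["user_id"]].add(ts.date())

    cohort_size = defaultdict(int)  # first-seen day -> number of new users
    retained = defaultdict(int)     # first-seen day -> users back on day + 1
    for days in active_days.values():
        first_day = min(days)
        cohort_size[first_day] += 1
        if first_day + timedelta(days=1) in days:
            retained[first_day] += 1

    return {day: retained[day] / cohort_size[day] for day in sorted(cohort_size)}


# Tiny made-up sample: users "a" and "b" first appear on 2024-05-07,
# and only "a" comes back on 2024-05-08.
sample_events = [
    {"user_id": "a", "ts": "2024-05-07T10:00:00+00:00"},
    {"user_id": "a", "ts": "2024-05-08T09:00:00+00:00"},
    {"user_id": "b", "ts": "2024-05-07T23:30:00+00:00"},
    {"user_id": "c", "ts": "2024-05-08T12:00:00+00:00"},
]
retention_by_cohort = day1_retention(sample_events)
```

In this sample the 2024-05-07 cohort has two users and one of them returns, so its day-1 retention is 0.5. If the definition of "retained" turns out to be wrong, the function changes and the history is recomputed from the same raw events, which is the whole argument for keeping them.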