TraceL: Rust & Vector.dev Logging Framework

When your microservices start speaking ten different log languages, chaos is inevitable 🤯.
To solve this at our company, I built a Rust-based structured logging framework inspired by Zerodha’s tech stack but designed and developed entirely from scratch, powered by Vector.dev, Kafka, and ClickHouse.

Tracel makes distributed logs traceable, enforces structure across services, and ensures that essential metadata, such as user inputs and thread contexts, never goes missing.
All this, while keeping the framework developer-friendly and flexible enough to fit into any service without friction.

Why Our Microservices Needed Structured Logging

In a microservice-based architecture, distributed logs quickly become inconsistent. Each service may log data differently — varying field names, missing context, or inconsistent formatting. For example,

One developer might write error!("Failed: DB connection"), with no metadata at all,
while another might write error!("DB connection failed due to: {e}").

This makes tracing errors extremely difficult, especially when services perform complex calculations in multithreaded environments. For example, if a user request triggers multiple threads processing different subsets of data, it becomes nearly impossible to pinpoint which input caused a failure without well-structured, consistent logs.

Why Not Use the Existing Tech?

Despite the availability of existing logging libraries and frameworks, none fully addressed our needs. Off-the-shelf solutions either lacked enforcement for structured logging or required developers to manually add critical context, which was often forgotten or inconsistently applied. Some frameworks offered rigid schemas, but they were too restrictive, making it difficult to adopt across multiple services with diverse logging needs.

We needed a framework that could enforce consistent, structured logging across all services, automatically capture essential metadata like user inputs and thread contexts, and integrate seamlessly with our streaming and analytical pipeline — all without imposing friction on developers. This gap led me to the design and development of Tracel (Trace-L), a Rust-based asynchronous logging framework powered by Kafka, Vector.dev, and ClickHouse, designed to bring traceability, structure, and reliability to our distributed logs.

How Tracel Works: Architecture & Implementation

At its core, Tracel (Trace-L) is built to bring structure and consistency to distributed logging — without sacrificing performance or developer experience.
Here’s the end-to-end pipeline:
Service (Rust) → Tracel Logging API → Kafka (Streaming) → Vector.dev (Processing) → ClickHouse (Storage & Analytics)

Each component plays a focused role:

Rust + tracing-subscriber: We built Tracel on top of the tracing and tracing-subscriber crates, Rust’s ecosystem for structured, span-based logging. Spans carry context through nested calls, which is crucial for logs in multithreaded workloads, and a subscriber layer lets us enforce a consistent schema across all logs.
Kafka (optional but helpful):
Kafka acts as a reliable buffer for log streaming. While logs could directly flow into Vector.dev, Kafka adds durability and backpressure handling — ensuring no data loss even if Vector or ClickHouse face temporary slowdowns.
Vector.dev:
Our processing powerhouse. Vector ingests JSON-structured logs, transforms and enriches them using VRL (Vector Remap Language), and batches them efficiently before storage. This stage gives us the flexibility to reshape data without touching the application layer.
ClickHouse:
A columnar, high-performance analytical database — perfect for querying billions of logs in seconds. Its compression and indexing help us save storage while enabling lightning-fast searches during debugging.

This combination gives us a streaming, async, and structured logging flow that scales — letting developers trace events across microservices and threads with consistency and minimal overhead.

Inside Tracel’s Design: Structuring Chaos

While Tracel’s high-level architecture focuses on where logs move, its low-level design focuses on how logs are structured, grouped, and correlated across services and threads. The goal was simple: make every log line tell a complete story — without forcing developers to manually stitch context together.

1. Two Log Categories — User & System

Tracel classifies every log into one of two broad categories:

User Logs: Actions directly triggered by a user, such as invoking an API or clicking a button.
System Logs: Background operations, calculations, or processes running as part of a user request.

This separation makes it easy to trace what the user did versus what the system did on their behalf.
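As a rough illustration of the split, the two categories can be modelled as a plain Rust enum. This is a minimal sketch, not Tracel's actual types: LogCategory, label, and only_user_actions are hypothetical names chosen for the example.

```rust
/// Hypothetical model of the two log categories (names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub enum LogCategory {
    /// An action triggered directly by a user, tagged with that user's id.
    User { user_id: String },
    /// Background work the system performs as part of a user request.
    System,
}

impl LogCategory {
    /// Short label that could back a category column in storage.
    pub fn label(&self) -> &'static str {
        match self {
            LogCategory::User { .. } => "user",
            LogCategory::System => "system",
        }
    }
}

/// Keep only the messages of user-triggered logs, i.e. "what the user did".
pub fn only_user_actions<'a>(logs: &'a [(LogCategory, &'a str)]) -> Vec<&'a str> {
    logs.iter()
        .filter(|(category, _)| matches!(category, LogCategory::User { .. }))
        .map(|(_, message)| *message)
        .collect()
}
```

Filtering on the category first and on the message second is what keeps "what the user did" and "what the system did" cleanly separable later on.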

2. Event-Based Grouping

Each incoming user request spawns a unique event_id — a UUID that binds all related operations together. Whether it’s validation, computation, or async tasks running in parallel, every log tied to that request inherits the same event_id.
This event-based grouping lets us replay an entire flow end-to-end in ClickHouse with a single query — incredibly useful when debugging distributed workflows.
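To make the idea concrete, here is a small in-memory sketch of what replaying a flow by event_id amounts to: select the rows that share the id and order them by time. LogRow and replay_event are hypothetical names, and the row is deliberately smaller than the real schema; in production this work is done by the ClickHouse query, not by application code.

```rust
/// Minimal log row for this sketch; the real schema carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct LogRow {
    pub timestamp_ms: u64,
    pub event_id: String,
    pub message: String,
}

/// Return every row that belongs to `event_id`, ordered by timestamp.
/// This mirrors filtering on event_id and sorting by time in the database.
pub fn replay_event(rows: &[LogRow], event_id: &str) -> Vec<LogRow> {
    let mut flow: Vec<LogRow> = rows
        .iter()
        .filter(|row| row.event_id == event_id)
        .cloned()
        .collect();
    // Logs from parallel tasks arrive out of order, so sort before reading.
    flow.sort_by_key(|row| row.timestamp_ms);
    flow
}
```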

3. Service & Module Context

To make multi-service debugging easier, each log also carries its service_name and module_name, automatically attached at runtime.
Example grouping:

service_name: "Payment"
module_name: "Transaction"

This allows analysts and developers to instantly filter logs by business context, not just by technical error or event ID. It’s particularly helpful when multiple services contribute to a single user workflow.

4. Automatic Context Propagation With Spans

One of Tracel’s strongest design choices was leveraging the Span and #[instrument] mechanism from the tracing crate for context propagation.
Developers don’t manually pass identifiers like event_id or service_name. Instead, the active span supplies this context to every log emitted inside it, including across .await points within an instrumented function.

When a request fans out into spawned tasks or threads, the span has to travel with the spawned work (for futures, tracing offers Instrument::instrument and in_current_span). Once it does, every log emitted under that span carries the correct context, with no manual parameter passing.

Here’s a simplified example illustrating how Tracel injects context automatically:

use tracel::LogBuilder::{SysLog, UserLog};

#[tracel::instrument(fields(module = "create_module", event_id = %generate_event_id()), skip_all)]
async fn process_user_request_api(Json(payload): Json<ReqPayload>) {
    // Assumes the request payload carries the caller's user_id.
    let user_id = payload.user_id.clone();
    tracel::log_info!(UserLog(user_id, "REQ: user requested for this..."));
    // Awaited inside the instrumented function, so it runs within the same span.
    compute_heavy_task(payload).await;
}

pub async fn compute_heavy_task(p: ReqPayload) {
    // This log will automatically contain everything (event_id, module_name, etc.)
    tracel::log_info!(SysLog("Starting process"));
}

NOTE: Behind the scenes, each log still gets its own unique UUID, but remains linked to the broader event trace.
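For readers who want to see the mechanism without the macros, the following std-only sketch shows the underlying idea: the active context lives in a thread-local, logs read it from there, and it has to be captured and re-entered when work moves to another thread. This is not how tracing or Tracel is implemented; TraceContext, with_context, log_line, and spawn_with_context are hypothetical helpers written for this example, and the sketch does not restore the context if the closure panics.

```rust
use std::cell::RefCell;
use std::thread;

/// Context a span would carry; field names follow the article's schema.
#[derive(Debug, Clone, PartialEq)]
pub struct TraceContext {
    pub event_id: String,
    pub module_name: String,
}

thread_local! {
    // The context active on the current thread, if any.
    static CURRENT: RefCell<Option<TraceContext>> = RefCell::new(None);
}

/// Run `f` with `ctx` active, then restore whatever was active before.
pub fn with_context<R>(ctx: TraceContext, f: impl FnOnce() -> R) -> R {
    let previous = CURRENT.with(|c| c.borrow_mut().replace(ctx));
    let result = f();
    CURRENT.with(|c| *c.borrow_mut() = previous);
    result
}

/// Format a log line, pulling the context from the thread-local.
pub fn log_line(message: &str) -> String {
    CURRENT.with(|c| match &*c.borrow() {
        Some(ctx) => format!("[{} {}] {}", ctx.event_id, ctx.module_name, message),
        None => format!("[no-context] {}", message),
    })
}

/// Spawn a thread that re-enters the caller's context. A bare
/// `thread::spawn` starts with an empty context, so it is captured here
/// and carried over explicitly.
pub fn spawn_with_context<R: Send + 'static>(
    f: impl FnOnce() -> R + Send + 'static,
) -> thread::JoinHandle<R> {
    let captured = CURRENT.with(|c| c.borrow().clone());
    thread::spawn(move || match captured {
        Some(ctx) => with_context(ctx, f),
        None => f(),
    })
}
```

Calling log_line inside with_context, or inside a thread started through spawn_with_context, yields a line prefixed with the event_id and module name, while a thread started with plain thread::spawn sees no context. That difference is the reason the span must be attached to spawned work.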

5. Developer Experience

Tracel was designed to make structured logging effortless.
Its internal macros enforce schema consistency, attach contextual metadata, and asynchronously emit logs downstream — letting developers focus purely on business logic, not logging syntax or boilerplate.

🧩 How We Designed the Logging Framework

To make Tracel both developer-friendly and observability-rich, we built it around two key components: a Rust SDK for structured logging, and a Vector + ClickHouse pipeline for scalable ingestion and querying.

1. SDK and Log Schema Design

The SDK is designed so developers don’t need to worry about adding repetitive context like service names, module names, or event identifiers.
Each service initializes a subscriber layer (via tracing-subscriber), which automatically attaches contextual data to every log emitted within an active span.

Each log entry follows a structured schema:

{
  "timestamp": "...",
  "level": "INFO | ERROR | WARN",
  "event_id": "...",
  "uuid": "...",
  "service_name": "...",
  "module_name": "...",
  "message": "...",
  "metadata": {...}
}

The event_id is created once per user request and is automatically propagated across async tasks and threads. This allows every piece of work spawned from a single request to share the same trace context — no manual parameter passing needed.
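The schema above maps naturally onto a Rust struct. The sketch below is an assumption about shape only, not Tracel's real type: LogEntry and to_json are hypothetical names, metadata is simplified to flat string pairs, and the JSON is rendered by hand to keep the example free of external crates (a real SDK would more likely use a serialization library).

```rust
/// One log entry following the schema above. `metadata` is kept as flat
/// string pairs for brevity; the real field is an arbitrary JSON object.
#[derive(Debug, Clone)]
pub struct LogEntry {
    pub timestamp: String,
    pub level: String,
    pub event_id: String,
    pub uuid: String,
    pub service_name: String,
    pub module_name: String,
    pub message: String,
    pub metadata: Vec<(String, String)>,
}

/// Escape a string for embedding in a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl LogEntry {
    /// Render the entry as a single JSON line, ready for a log shipper.
    pub fn to_json(&self) -> String {
        let meta: Vec<String> = self
            .metadata
            .iter()
            .map(|(k, v)| format!("\"{}\":\"{}\"", escape_json(k), escape_json(v)))
            .collect();
        format!(
            "{{\"timestamp\":\"{}\",\"level\":\"{}\",\"event_id\":\"{}\",\"uuid\":\"{}\",\"service_name\":\"{}\",\"module_name\":\"{}\",\"message\":\"{}\",\"metadata\":{{{}}}}}",
            escape_json(&self.timestamp),
            escape_json(&self.level),
            escape_json(&self.event_id),
            escape_json(&self.uuid),
            escape_json(&self.service_name),
            escape_json(&self.module_name),
            escape_json(&self.message),
            meta.join(",")
        )
    }
}
```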

Developers simply write:

tracel::log_info!(SysLog("Fetching user payment info"));

and the SDK handles everything else — tagging it with the right event_id, service, and module names.

2. Ingestion and Storage

Logs are streamed to Vector, which acts as the unified transport layer. Vector handles:

Local buffering (for resilience against transient failures),
Batched delivery to ClickHouse,
Schema validation and normalization.

On the storage side, ClickHouse is used for its columnar efficiency — allowing millions of log rows to be queried quickly by event_id, service, or time window.
This makes it easy to trace a single user’s request end-to-end across multiple services and threads.
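Batched delivery is the part of this stage that is easiest to picture in code. The following is a minimal sketch of size-based batching in the same spirit, assuming a hypothetical Batcher type; it is not Vector's implementation, which is driven by its own sink configuration.

```rust
/// Collects rows and hands them back in fixed-size batches, the way a
/// transport layer groups inserts instead of writing row by row.
pub struct Batcher<T> {
    max_batch: usize,
    pending: Vec<T>,
}

impl<T> Batcher<T> {
    pub fn new(max_batch: usize) -> Self {
        assert!(max_batch > 0, "batch size must be positive");
        Batcher { max_batch, pending: Vec::with_capacity(max_batch) }
    }

    /// Add one row; returns a full batch once the size threshold is reached.
    pub fn push(&mut self, row: T) -> Option<Vec<T>> {
        self.pending.push(row);
        if self.pending.len() >= self.max_batch {
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }

    /// Drain whatever is left, e.g. on a timer tick or at shutdown.
    pub fn flush(&mut self) -> Vec<T> {
        std::mem::take(&mut self.pending)
    }
}
```

Columnar stores like ClickHouse generally prefer fewer, larger inserts over many single-row writes, which is why the batching step sits in front of storage.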

⚙️ Challenges & Solutions

Building a centralized logging framework sounds straightforward on paper — until you actually try to make it reliable, fast, and developer-friendly. Tracel’s design evolved through a few key challenges that shaped its architecture.

1. Context Propagation in Async Workflows

Problem:
In Rust’s async ecosystem, context often gets lost when tasks are spawned across threads or async boundaries. This meant that event_id and service_name weren’t consistently preserved.

Solution:
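We leaned on the tracing crate’s span-based context model, where an instrumented future re-enters its span every time it is polled, so context survives .await points and executor thread hops.
By wrapping user-facing endpoints with #[instrument], we ensured that any function awaited under that span carried the same contextual metadata, linking logs across concurrent workloads. Separately spawned tasks do not inherit the current span on their own and need it attached explicitly, for example with .instrument(span).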

2. Performance Overhead in High-Throughput Services

Problem:
Structured logging tends to be more expensive than simple text logs, especially under heavy I/O workloads where logs are emitted frequently.

Solution:
We used batched and asynchronous log emission through Vector, ensuring no blocking I/O on the main request path.
Additionally, we implemented log-level based filtering at compile time — meaning debug or trace logs never even hit the runtime in production builds.
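Both ideas, keeping the request path non-blocking and discarding verbose levels early, can be sketched with the standard library alone. Everything here is an assumption made for illustration: Level, MAX_LEVEL, and Emitter are hypothetical names, and dropping a line when the queue is full is a choice made for this sketch, not a description of Tracel's behaviour.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Log levels ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Most verbose level kept in this build. Because it is a `const`, the
/// comparison in `emit` can be folded away for statically known levels.
pub const MAX_LEVEL: Level = Level::Info;

/// Hands log lines to a bounded queue that a background writer drains.
pub struct Emitter {
    tx: SyncSender<String>,
}

impl Emitter {
    /// Create an emitter plus the receiving end for the writer side.
    pub fn new(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (Emitter { tx }, rx)
    }

    /// Returns true if the line was queued. Never blocks the caller:
    /// filtered lines and lines that do not fit in the queue are dropped.
    pub fn emit(&self, level: Level, line: &str) -> bool {
        if level > MAX_LEVEL {
            return false; // too verbose for this build
        }
        match self.tx.try_send(line.to_string()) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
        }
    }
}
```

In a real service the Receiver would be drained on a dedicated thread or task that writes to the log shipper, so the request handler only ever pays for a queue push.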

3. Schema Drift Across Services

Problem:
With multiple services emitting logs, inconsistent log formats started creeping in — making queries unreliable in ClickHouse.

Solution:
We standardized all log emission through Tracel’s macros and enforced a fixed schema.
Developers couldn’t directly call info! or error! macros from tracing; instead, they used tracel::log_info! and tracel::log_error!, which automatically attached required fields and validated the structure before sending downstream.
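The validation step can be pictured as a check that runs before anything leaves the process. This is a minimal sketch under stated assumptions: the REQUIRED_FIELDS list is taken from the schema shown earlier, and validate is a hypothetical function, not the check Tracel's macros actually perform.

```rust
/// Fields every record must carry before it is sent downstream
/// (assumed here to be the non-metadata fields of the schema).
const REQUIRED_FIELDS: [&str; 7] = [
    "timestamp",
    "level",
    "event_id",
    "uuid",
    "service_name",
    "module_name",
    "message",
];

/// Check that every required field is present and non-empty.
/// On failure, returns the names of the missing fields in schema order.
pub fn validate(record: &[(&str, &str)]) -> Result<(), Vec<&'static str>> {
    let missing: Vec<&'static str> = REQUIRED_FIELDS
        .iter()
        .copied()
        .filter(|name| {
            !record
                .iter()
                .any(|(key, value)| key == name && !value.trim().is_empty())
        })
        .collect();
    if missing.is_empty() {
        Ok(())
    } else {
        Err(missing)
    }
}
```

Rejecting a malformed record at the source is cheaper than discovering a missing column during an incident, which is the point of routing every log through one set of macros.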

4. Querying at Scale

Problem:
ClickHouse performs well, but debugging multi-service workflows often requires complex joins and time-correlated filtering.

Solution:
We created service-aware and event-based indices to make such queries faster.
Engineers could simply run:

SELECT * FROM logs WHERE event_id = '...';

and reconstruct the full trace — from API gateway to downstream async worker — within seconds.

🧭 Closing Thoughts: Building Your Own Logging Framework

If you’re planning to build a structured logging framework for your microservices, start small — but design for evolution.
Tracel didn’t arrive at its current form overnight. The schema, event grouping, and tracing integrations evolved through constant iteration, production feedback, and a fair amount of trial and error.

The most important takeaways:

Keep logs consistent and contextual. Every log should tell a complete story by itself.
Automate the boring parts. Don’t make developers think about event_id or service_name — let the framework handle it.
Treat logs as data, not text. Once you move to structured, queryable logs, you unlock an entirely new level of observability and insight.

If you’re building your own framework, you can refer to the same schema and design principles Tracel followed — and adapt them to your own stack.
The goal isn’t to copy the implementation, but to create a system that grows with your organization’s complexity while keeping observability effortless for developers.

✨ Final Reflection

At its core, TraceL wasn’t just about logging — it was about clarity.
When every service speaks the same log language, debugging stops feeling like archaeology and starts feeling like investigation.
Building it taught me that structured logging isn’t just a tooling choice — it’s a culture shift. You start thinking in context, not in isolated prints.

And if you ever find yourself staring at hundreds of noisy logs across microservices, remember: sometimes, all it takes is a little structure to turn chaos into traceability.
