Distributed Tracing in Go

In a distributed architecture, logs often tell only part of the story: they capture events inside a single service, but they don't show how a request flows across multiple services or where it spends most of its time. That's where distributed tracing comes in.
Why Logs Alone Fall Short
Logs are excellent for recording specific events and debugging within a single process. But in a distributed system, they have limitations:
- Isolated scope: Each service logs independently, so correlating events across services is manual and error-prone.
- Missing sequence: Logs don’t inherently convey the order or timing of operations across boundaries.
- Lost context: Background jobs, retries, and asynchronous workflows often drop the correlation IDs you rely on.
Tracing complements logs by showing how and in what order things happened across the entire system.
How Tracing Works
Distributed tracing uses these three core concepts:
- Trace ID: A unique identifier for the entire request journey. All spans related to this request share the same trace ID.
- Span: A timed unit of work (e.g. “DB query”, “HTTP call to payment service”). Each span knows:
  - Its start and end time
  - Its parent span (except the root)
  - Optional attributes (metadata such as HTTP status, SQL statement, etc.)
- Propagation: Passing trace context (the trace ID plus the current span ID) to downstream calls so they can attach their own spans to the same trace. A minimal propagation sketch follows the diagram below.
Together, spans form a tree-like structure: the root span represents the entry point (e.g. an HTTP request), and child spans represent work done in downstream services.
graph TD
  subgraph Trace[Trace ID: 1234abcd]
    A[Root Span: HTTP Request to Service A] --> B[Span: Process Request in Service A]
    B --> C[Span: HTTP Call to Service B]
    C --> D[Span: Handle Request in Service B]
    D --> E[Span: DB Query in Service B]
    B --> F[Span: HTTP Call to Service C]
    F --> G[Span: Handle Request in Service C]
    G --> H[Span: Cache Lookup in Service C]
  end
In the diagram above:
- The outer box represents a single trace identified by a Trace ID.
- The root span is the first recorded operation.
- Child spans show downstream calls or tasks, each linked to its parent.
- The structure reveals both the sequence and the relationships between operations.
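To make propagation concrete, here is a minimal sketch using OpenTelemetry (the same library as the demo below). It assumes the imports go.opentelemetry.io/otel and go.opentelemetry.io/otel/propagation, and that a propagator has been registered with otel.SetTextMapPropagator(propagation.TraceContext{}); callDownstream and extractTraceContext are hypothetical helpers.

// callDownstream injects the current trace context (trace ID + span ID)
// into the outgoing request headers as a W3C traceparent header.
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
    return http.DefaultClient.Do(req)
}

// extractTraceContext restores the trace context on the receiving side, so spans
// started from the returned context attach to the caller's trace.
func extractTraceContext(r *http.Request) context.Context {
    return otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
}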
The Role of context.Context
context.Context is the standard way to propagate trace information, alongside deadlines and cancellation signals.
When you start a span, the tracing library returns a new context.Context containing the span's metadata. Passing this context through every function call keeps the trace connected; dropping or replacing the context breaks the chain.
Example:
func handler(w http.ResponseWriter, r *http.Request) {
    ctx, span := tracer.Start(r.Context(), "handler")
    defer span.End()

    processRequest(ctx)
}
Here, processRequest receives the same context and can attach its own spans to the trace.
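For illustration, a processRequest like the following continues the same trace (a sketch: the attribute and the queryDatabase helper are hypothetical, and attribute comes from go.opentelemetry.io/otel/attribute):

func processRequest(ctx context.Context) {
    // Starting a span from the incoming ctx makes it a child of "handler".
    ctx, span := tracer.Start(ctx, "processRequest")
    defer span.End()

    // Attributes become searchable metadata on the span.
    span.SetAttributes(attribute.String("user.id", "42"))

    queryDatabase(ctx) // keep passing ctx so deeper spans stay on the same trace
}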
Demo: OTEL + Jaeger
OpenTelemetry is the current standard for instrumentation. Jaeger and (Grafana) Tempo are common backends for storing and visualizing traces.
This example uses OpenTelemetry with Jaeger, but OpenTelemetry supports many other backends: you can swap Jaeger for Grafana Tempo, Datadog APM, New Relic, Honeycomb, or any OTLP-compatible system by changing the exporter. The instrumentation code stays the same; only the exporter configuration changes (a sketch of such a swap follows the full example below). This flexibility is one of OpenTelemetry's key advantages.
A minimal Jaeger setup in Go:
package main

import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// setupTracing configures OpenTelemetry with a Jaeger exporter.
// It returns a shutdown function that should be deferred in main.
func setupTracing(serviceName string) (shutdown func(context.Context) error, err error) {
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://localhost:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
        )),
    )

    // Register the provider globally so otel.Tracer calls use it.
    otel.SetTracerProvider(tp)
    return tp.Shutdown, nil
}

func main() {
    shutdown, err := setupTracing("example-service")
    if err != nil {
        log.Fatalf("failed to setup tracing: %v", err)
    }
    // Flush any buffered spans before the process exits.
    defer func() {
        if err := shutdown(context.Background()); err != nil {
            log.Printf("failed to shutdown tracer: %v", err)
        }
    }()

    tracer := otel.Tracer("example-service")

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        ctx, span := tracer.Start(r.Context(), "root-handler")
        defer span.End()

        doWork(ctx)
        w.Write([]byte("Done"))
    })

    log.Fatal(http.ListenAndServe(":8080", nil))
}

func doWork(ctx context.Context) {
    tracer := otel.Tracer("example-service")
    _, span := tracer.Start(ctx, "doWork")
    defer span.End()
    // Simulated work would go here; the span still records its duration.
}
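As noted above, switching backends only touches the exporter. For example, an OTLP/HTTP exporter could replace the jaeger.New call in setupTracing, with everything else unchanged. A sketch, assuming a collector, Tempo, or a recent Jaeger with OTLP ingestion enabled on localhost:4318, and the import go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp:

// newOTLPExporter builds an exporter that speaks OTLP over HTTP.
// Plug its result into sdktrace.WithBatcher exactly as before.
func newOTLPExporter(ctx context.Context) (sdktrace.SpanExporter, error) {
    return otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint("localhost:4318"), // assumed local OTLP endpoint
        otlptracehttp.WithInsecure(),                 // plain HTTP for local development
    )
}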
To run Jaeger locally (port 16686 serves the UI, 14268 is the collector endpoint used by the exporter above):
docker run -d \
-p 16686:16686 \
-p 14268:14268 \
jaegertracing/all-in-one
Then you can view traces at http://localhost:16686.
Once traces are flowing, visualization is where they shine. In Jaeger:
- Search for traces by service name.
- See the root span at the top, with child spans below.
- Duration bars make bottlenecks visible.
- Span metadata helps diagnose issues (e.g. a slow DB query or a failed HTTP call); a sketch of recording that metadata follows this list.
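As a sketch of how that metadata ends up on a span (the tracer, SQL statement, and db handle are assumed; attribute and codes come from go.opentelemetry.io/otel/attribute and go.opentelemetry.io/otel/codes):

func updateOrders(ctx context.Context, db *sql.DB) error {
    ctx, span := tracer.Start(ctx, "db.update.orders")
    defer span.End()

    const query = "UPDATE orders SET status = 'processed' WHERE status = 'pending'" // hypothetical statement
    span.SetAttributes(attribute.String("db.statement", query))                     // visible in the span details

    if _, err := db.ExecContext(ctx, query); err != nil {
        // RecordError attaches the error as a span event; the Error status flags the span as failed in Jaeger.
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }
    return nil
}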
Common Issues
Traces can have missing spans, often due to:
- Async jobs: If you enqueue work without passing trace context, the worker starts a new trace. The fix: serialize the trace context into the job payload and restore it in the worker (see the sketch after this list).
- Lost spans: Forgetting defer span.End() or cancelling the context too early can cut spans short.
- Background goroutines: Starting a goroutine without passing the context removes it from the trace. Always pass ctx explicitly.
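A minimal sketch of the async-job fix, assuming a registered propagator (otel.SetTextMapPropagator(propagation.TraceContext{})) and a hypothetical job type; the carrier is just a string map that rides along in the payload:

// job is a hypothetical payload; TraceContext carries the serialized trace headers.
type job struct {
    TraceContext map[string]string `json:"trace_context"`
    // ... other job fields
}

// enqueue captures the current trace context into the payload before the job is queued.
func enqueue(ctx context.Context) job {
    carrier := propagation.MapCarrier{}
    otel.GetTextMapPropagator().Inject(ctx, carrier)
    return job{TraceContext: carrier}
}

// process restores the trace context in the worker, so the job span joins the original trace.
func process(j job) {
    ctx := otel.GetTextMapPropagator().Extract(context.Background(),
        propagation.MapCarrier(j.TraceContext))

    _, span := otel.Tracer("worker").Start(ctx, "process-job")
    defer span.End()
    // ... do the actual work, passing ctx onward
}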