OpenTelemetry for Node.js — the wiring that actually works in production

OpenTelemetry has won. The vendor-specific tracing SDKs (Datadog's tracer, New Relic's agent, Lightstep, Honeycomb's Beeline) are all either deprecated or in maintenance mode, with OTel as the recommended path. In 2026 there's no defensible reason to start a new Node service on anything else. The trade is: a single vendor-neutral instrumentation surface, exporting to whichever backend you happen to use this quarter.

The catch is that "just add OpenTelemetry" is more nuanced than the docs suggest. The auto-instrumentation works, but the defaults aren't always what you want; sampling, batching, and the collector layout matter; and a Node app that crashes on startup because its OTLP endpoint is unreachable is not what anyone signed up for.

This is the OpenTelemetry baseline we ship on every Node service in our managed observability stack.

The mental model: SDK, instrumentation, exporter, collector

Four pieces, often conflated. Worth keeping straight.

SDK: @opentelemetry/sdk-node. The thing you initialise. Creates the tracer provider, the meter provider, and the logger provider.
Instrumentation — @opentelemetry/auto-instrumentations-node, plus specific packages like @opentelemetry/instrumentation-http, -express, -pg, -redis. These monkey-patch (politely) the standard libraries and popular packages so spans appear without you writing any tracing code.
Exporter — usually @opentelemetry/exporter-trace-otlp-http or -otlp-grpc. Serialises spans into OTLP and ships them somewhere.
Collector — otelcol, a sidecar or daemon. Receives OTLP from your app, batches it, can transform it (filter PII, sample, enrich), and forwards to your backend (Tempo, Jaeger, Datadog, Honeycomb, Grafana Cloud, whatever).

The minimum viable setup is SDK + auto-instrumentation + OTLP exporter pointed directly at a backend. The production setup adds the collector as a middle layer, which we'll get to.

The minimal init file

The single most important rule: tracing init must happen before any application code runs. Auto-instrumentation works by patching modules as they are loaded, so anything required before init stays unpatched and produces no spans. The conventional pattern is a tracing.js file required as the first thing.

// tracing.js
'use strict';
 
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
 
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.OTEL_SERVICE_NAME || 'unnamed',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || 'dev',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.DEPLOY_ENV || 'unknown',
  }),
  spanProcessor: new BatchSpanProcessor(
    new OTLPTraceExporter({
      // A value passed as `url` is used verbatim, so it must include the /v1/traces path.
      url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
    }),
    { maxQueueSize: 2048, maxExportBatchSize: 512, scheduledDelayMillis: 5000 }
  ),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-fs': { enabled: false },
  })],
});
 
sdk.start();
 
process.on('SIGTERM', () => sdk.shutdown().catch(() => {}).finally(() => process.exit(0)));

Then in package.json:

{
  "scripts": {
    "start": "node --require ./tracing.js dist/index.js"
  }
}

The --require flag (or NODE_OPTIONS='--require ./tracing.js' as an env var) is what guarantees the init happens first. We have seen teams put require('./tracing.js') at the top of index.js; this works in most cases but fails subtly when a bundler or transpiler hoists index.js's other imports above the tracing init.

A few specific choices in that snippet worth justifying:

instrumentation-fs disabled. The filesystem instrumentation creates a span for every single fs.readFile, fs.stat, fs.access call. In a busy Node app that's thousands of spans per request. It's almost never useful and it overwhelms the collector. Disable it.

BatchSpanProcessor with explicit limits. The defaults are reasonable but we set them explicitly to make the back-pressure behaviour visible. If the exporter falls behind and the queue reaches maxQueueSize, new spans are dropped; that's preferable to OOMing the app trying to buffer them.
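To make that trade-off concrete, here is a deliberately simplified model of a bounded batch queue. It is a sketch of the behaviour, not the SDK's implementation; the class name BoundedBatchQueue and its methods are ours, for illustration only.

```javascript
// Simplified model of a bounded span queue. Illustrative only, not SDK code.
class BoundedBatchQueue {
  constructor({ maxQueueSize, maxExportBatchSize }) {
    this.maxQueueSize = maxQueueSize;
    this.maxExportBatchSize = maxExportBatchSize;
    this.queue = [];
    this.dropped = 0;
  }

  // Called when a span ends. Returns false if the span was dropped.
  enqueue(span) {
    if (this.queue.length >= this.maxQueueSize) {
      this.dropped += 1; // queue full: drop rather than grow without bound
      return false;
    }
    this.queue.push(span);
    return true;
  }

  // Called on each export tick: hands at most one batch to the exporter.
  nextBatch() {
    return this.queue.splice(0, this.maxExportBatchSize);
  }
}

// A stalled exporter (nextBatch is never called) while 3000 spans finish.
const q = new BoundedBatchQueue({ maxQueueSize: 2048, maxExportBatchSize: 512 });
for (let i = 0; i < 3000; i++) q.enqueue({ id: i });
console.log(q.queue.length, q.dropped); // 2048 952
```

Memory use is capped by maxQueueSize no matter how long the backend is down. The cost is the 952 spans you never see, which is why the drop count deserves a metric of its own.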

Resource attributes set from env vars. service.name, service.version, and deployment.environment are the three load-bearing attributes for trace search. Anything else can come from auto-detection.

Sampling — the decision that matters most

The naive setup sends 100% of traces. For low-traffic services this is fine. For anything with non-trivial throughput, it'll blow through your tracing backend budget in days.

Three sampling strategies, in order of when to use them:

Head-based, parent-based sampling. Our default. A trace is sampled at its origin (the first service to receive an external request), and that decision propagates downstream. Use TraceIdRatioBasedSampler(0.1) as the root sampler for 10% sampling.

const { TraceIdRatioBasedSampler, ParentBasedSampler } = require('@opentelemetry/sdk-trace-base');
 
new NodeSDK({
  // ...
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(parseFloat(process.env.OTEL_SAMPLE_RATIO || '0.1')),
  }),
});
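If it helps to see why this combination stays consistent across services, the sketch below models the two decisions involved. It is a simplified illustration and not the SDK's algorithm: the function names are ours, and the real sampler derives its number from the trace ID differently. It also shows a defensive parse for the ratio, since a malformed OTEL_SAMPLE_RATIO parses to NaN and you don't want to leave that outcome to chance.

```javascript
// Simplified illustration of ratio + parent-based sampling. Not the SDK's code.

// Parse a ratio from an env-style string, falling back on bad input.
function parseSampleRatio(raw, fallback = 0.1) {
  const n = Number.parseFloat(raw);
  if (!Number.isFinite(n)) return fallback;
  return Math.min(1, Math.max(0, n)); // clamp into [0, 1]
}

// Decide from the trace ID alone, so every service reaches the same answer
// for the same trace without coordinating.
function ratioDecision(traceId, ratio) {
  const bucket = Number.parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  return bucket < ratio * 0x100000000;
}

// Parent-based: inherit the caller's decision; only root spans consult the ratio.
function shouldSample({ traceId, parentSampled }, ratio) {
  if (typeof parentSampled === 'boolean') return parentSampled;
  return ratioDecision(traceId, ratio);
}

// Usage: a root span, then a downstream span whose parent was not sampled.
const ratio = parseSampleRatio(process.env.OTEL_SAMPLE_RATIO);
const traceId = '0af7651916cd43dd8448eb211c80319c';
console.log(shouldSample({ traceId }, ratio));
console.log(shouldSample({ traceId, parentSampled: false }, ratio)); // false
```

The point of the parent check is that a trace is either kept whole or dropped whole. Without it, each service would roll its own dice and you'd get traces with holes in them.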

Tail-based sampling at the collector. More expensive operationally, but you get to make sampling decisions based on the whole trace — "always sample traces with errors, always sample traces over 1 second, sample 1% of everything else." The collector buffers spans for a window (typically 30-60 seconds) before deciding. This is where the collector earns its keep.

Dynamic head sampling. Some backends (Honeycomb's Refinery, Tempo's metrics-generator-driven approach) push sampling decisions back into the head with feedback loops. Worth it at very large scale; overkill at small scale.

We default to 10% parent-based head sampling for everything, then add a collector with tail-based "always sample errors and slow traces" for anything where the customer is paying for full-fidelity error visibility.
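The "always sample errors and slow traces" rule boils down to a per-trace decision made once the buffering window closes. A minimal sketch of that decision, assuming a hypothetical span shape of { startMs, endMs, status } and an injectable random source for testing; the collector's real policy engine is more involved than this:

```javascript
// Sketch of a tail-sampling decision over one fully buffered trace.
// Span shape is hypothetical: { startMs, endMs, status: 'OK' | 'ERROR' | 'UNSET' }
function keepTrace(spans, { slowMs = 1000, baselinePct = 5, random = Math.random } = {}) {
  if (spans.length === 0) return false;

  // Policy 1: any span in the trace errored.
  if (spans.some((s) => s.status === 'ERROR')) return true;

  // Policy 2: the whole trace, earliest start to latest end, was slow.
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  if (end - start > slowMs) return true;

  // Policy 3: a small random baseline of everything else.
  return random() * 100 < baselinePct;
}

// Usage: a fast trace with one failed child span is always kept.
console.log(keepTrace([
  { startMs: 0, endMs: 40, status: 'OK' },
  { startMs: 5, endMs: 12, status: 'ERROR' },
])); // true
```

None of these checks is possible at the head, because at the moment the first span starts nobody knows yet whether the request will fail or be slow. That is the whole case for paying the buffering cost.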

The collector layout we actually deploy

For Node.js workloads on GCP and similar platforms, the collector goes in one of two places:

Sidecar collector (one per pod or per VM). Lowest latency, highest reliability — the app never has to traverse a network to reach its collector. Resource overhead is real, though, especially in containerised environments where you're paying for a collector container per pod.

Daemonset/gateway collector (one per node, or a central pool). Better resource efficiency, but adds a network hop and a shared failure domain. We use this when running dozens of small services on a smaller node count.

A minimal collector config that we actually deploy:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317
 
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
 
exporters:
  otlp/backend:
    endpoint: tempo.observability.svc.cluster.local:4317
    tls:
      insecure: true
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/backend]

The memory_limiter processor is non-negotiable. Without it, a slow downstream backend will cause the collector to OOM, taking down tracing across every app pointing at it. With it, the collector drops spans under pressure — which is the right failure mode.

Failure modes nobody warns you about

A handful of footguns we've hit in production:

The OTLP endpoint is unreachable at startup, and the SDK never recovers. Older versions of the OTLP HTTP exporter would crash the process if the initial connection failed. Newer versions log and retry, but configurations vary. Always wrap sdk.start() so that a failure is logged but doesn't crash the app.
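One way to do that wrapping is sketched below. The helper name startTelemetrySafely is ours, and it assumes only that the SDK object has a start() method which may throw synchronously or, depending on SDK version, return a promise.

```javascript
// Start telemetry without letting a failure take the app down.
// `sdk` is anything with a start() method, such as a NodeSDK instance.
function startTelemetrySafely(sdk, log = console) {
  try {
    const result = sdk.start();
    // Some SDK versions return a promise from start(); attach a handler so a
    // rejection is logged instead of surfacing as an unhandled rejection.
    if (result && typeof result.catch === 'function') {
      result.catch((err) => {
        log.error('OpenTelemetry failed to start; continuing without tracing', err);
      });
    }
    return true;
  } catch (err) {
    log.error('OpenTelemetry failed to start; continuing without tracing', err);
    return false;
  }
}

// Usage in tracing.js, replacing the bare sdk.start() call:
//   startTelemetrySafely(sdk);
```

The call stays synchronous on purpose: the instrumentation patches still register before any application module loads, which is the property the --require setup exists to protect.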

Auto-instrumentation traces internal monitoring traffic. Your liveness probe traces itself. Your scrape endpoint creates a span every 10 seconds. You end up paying to trace your own health checks. Filter these out:

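// Options for the http instrumentation. In tracing.js, pass this object as
// getNodeAutoInstrumentations({ '@opentelemetry/instrumentation-http': httpInstrumentationConfig })
const httpInstrumentationConfig = {
  // Return true to skip creating a span for this incoming request.
  ignoreIncomingRequestHook: (req) => {
    const url = req.url || '';
    return url.startsWith('/healthz') || url.startsWith('/metrics') || url === '/';
  },
};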

PII in span attributes. The default HTTP instrumentation captures the full request URL, including query string. If your URLs contain emails, IDs, or tokens, those are now in your tracing backend. Filter at the collector:

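processors:
  # Regex rewriting is done with the transform processor (OTTL); the attributes
  # processor has no regex-replace action. Add transform/redact to the traces pipeline.
  transform/redact:
    trace_statements:
      - context: span
        statements:
          # $$ escapes $ inside collector config, so $$1$$2 refers to capture groups 1 and 2.
          # Newer HTTP semantic conventions emit url.query instead of http.target; match your SDK.
          - 'replace_pattern(attributes["http.target"], "([?&])(token|api_key|email)=[^&]*", "$$1$$2=REDACTED")'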

Memory growth from unbounded span attributes. A handler that does span.setAttribute('user.list', JSON.stringify(allUsers)) will balloon the span. The exporter then drops the batch because it exceeds the OTLP payload limit. Lots of failed exports, no traces in the backend, no obvious error in the app. Audit your manual instrumentation for attribute size.
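A cheap guard is to route manual attributes through a helper that enforces a size budget. The sketch below is one way to do it; setBoundedAttribute, the 1024-character budget, and the ".truncated" marker attribute are all our own conventions, not part of OpenTelemetry. The only thing it assumes about the span is a setAttribute(key, value) method.

```javascript
const MAX_ATTR_LENGTH = 1024; // hypothetical budget; tune to your backend

// Set a span attribute, truncating long strings and flagging that it happened.
function setBoundedAttribute(span, key, value, maxLength = MAX_ATTR_LENGTH) {
  if (typeof value === 'string' && value.length > maxLength) {
    span.setAttribute(key, value.slice(0, maxLength));
    span.setAttribute(`${key}.truncated`, true); // makes the truncation searchable
    return;
  }
  span.setAttribute(key, value);
}

// Usage: prefer a count or an ID over a serialised payload.
//   setBoundedAttribute(span, 'user.count', allUsers.length);
//   setBoundedAttribute(span, 'user.list', JSON.stringify(allUsers)); // capped
```

The SDK also has span limit settings that can cap attribute length globally; check what your version supports before relying on them. A helper like this has the advantage of showing up in code review.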

Logs and traces not correlated. OpenTelemetry has a logs SDK now, but most teams still use Pino, Winston, or Bunyan. Wire the trace ID into log lines so a click from log to trace works:

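const pino = require('pino');
const { trace, context } = require('@opentelemetry/api');
 
// Pino calls the mixin for every log line and merges the returned fields into it.
const traceMixin = () => {
  const span = trace.getSpan(context.active());
  if (!span) return {}; // no active span (e.g. startup logs): add nothing
  const ctx = span.spanContext();
  return { trace_id: ctx.traceId, span_id: ctx.spanId };
};
 
const logger = pino({ mixin: traceMixin });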

A handful of lines of code, enormous payoff at incident time.

What we ship by default

Every Node service we onboard gets:

tracing.js loaded via --require, with environment-driven endpoint and sample ratio
Auto-instrumentation enabled, instrumentation-fs disabled, healthz/metrics paths ignored
Resource attributes for service.name, service.version, deployment.environment
A sidecar or gateway collector with memory_limiter and batch always on
Tail sampling for errors and slow traces, probabilistic 5-10% baseline
Log/trace correlation via the Pino mixin (or Winston format equivalent)
A dashboard with p50/p95/p99 by route, error rate, and top-N slowest traces
An alert on "tracing pipeline export failures > 5% over 10 minutes", because broken observability is the worst kind: you don't know what you don't know
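The export-failure alert needs a number to alert on. At the collector that comes from its own metrics; inside the app, one option is to wrap the exporter and count outcomes. The sketch below assumes the exporter follows the SDK's export(spans, resultCallback) shape with result.code === 0 meaning success. CountingExporter and failureRatio are hypothetical names, and the ratio is per batch over the process lifetime, so a real alert would window it in your metrics backend.

```javascript
// Wraps a span exporter and counts export outcomes. Illustrative sketch.
class CountingExporter {
  constructor(inner) {
    this.inner = inner;
    this.succeeded = 0;
    this.failed = 0;
  }

  export(spans, resultCallback) {
    this.inner.export(spans, (result) => {
      if (result.code === 0) this.succeeded += 1; // 0 is the success code
      else this.failed += 1;
      resultCallback(result); // always pass the result on unchanged
    });
  }

  forceFlush() {
    return this.inner.forceFlush ? this.inner.forceFlush() : Promise.resolve();
  }

  shutdown() {
    return this.inner.shutdown();
  }

  // Fraction of export attempts that failed; 0 before the first export.
  failureRatio() {
    const total = this.succeeded + this.failed;
    return total === 0 ? 0 : this.failed / total;
  }
}

// Usage in tracing.js:
//   const exporter = new CountingExporter(new OTLPTraceExporter({ /* ... */ }));
//   spanProcessor: new BatchSpanProcessor(exporter, { /* limits */ })
// then publish exporter.failureRatio() wherever your app metrics go.
```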

The whole setup costs a couple of days of engineering on a greenfield app and maybe a week to retrofit onto an existing one. The payoff is the first time you debug a five-service request flow in fifteen minutes instead of two hours.

If your Node fleet is still on a vendor-specific tracer and you've been putting off the migration, that's the kind of work we price as a fixed engagement — usually one to two weeks per service, with the SDK swap on day one and the rest spent on sampling, attribute hygiene, and collector tuning.

Sudhanshu K. is a Senior Platform Engineer at EdgeServers (RemotIQ Pty Ltd, ABN 91 682 628 128). She remembers when distributed tracing meant correlation IDs in log lines and a lot of grep, and is grateful those days are mostly behind us.