IT October 22, 2024 3 min read

Observability for product teams: what logs, metrics, and traces actually give you

A plain walkthrough of the three observability pillars and why the combination matters - written for technical owners who want to understand what they are paying for.

I have had the same conversation at least a dozen times. A founder asks why their engineering team wants to add another tool - Datadog, Grafana, Honeycomb - on top of the cloud provider's built-in dashboards. "We already have logs," they say. "What does this actually add?"

It is a fair question. The answer is not obvious until you have been in a 3 a.m. incident trying to figure out why a feature is slow for some users but not others, with nothing but log lines to look at.

Three signals, not one

Observability in practice means three things:

Logs are discrete records of events: "user X requested page Y at time T, got status 200, took 340ms." They are the most familiar signal. They are also the hardest signal to aggregate. If you want to know "how many requests failed in the last hour that went through the payment service," you need to write a query, hope you logged the right fields, and wait for it to run.
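To make the "write a query and hope you logged the right fields" point concrete, here is a minimal sketch that answers that exact question over a handful of structured log lines. The field names (ts, service, status, duration_ms), the sample records, and the count_failures helper are all assumptions made up for illustration; a real log store would run the equivalent query in its own language.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical structured log lines, one JSON object per line.
RAW_LOGS = """\
{"ts": "2024-10-22T14:05:00+00:00", "service": "payment", "status": 200, "duration_ms": 340}
{"ts": "2024-10-22T14:20:00+00:00", "service": "payment", "status": 502, "duration_ms": 1200}
{"ts": "2024-10-22T14:25:00+00:00", "service": "catalog", "status": 500, "duration_ms": 90}
{"ts": "2024-10-22T12:10:00+00:00", "service": "payment", "status": 500, "duration_ms": 800}
"""


def count_failures(raw, service, now, window=timedelta(hours=1)):
    """Count 5xx records for one service inside the time window ending at `now`."""
    count = 0
    for line in raw.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        ts = datetime.fromisoformat(record["ts"])
        in_window = now - window <= ts <= now
        if record["service"] == service and record["status"] >= 500 and in_window:
            count += 1
    return count


NOW = datetime(2024, 10, 22, 14, 30, tzinfo=timezone.utc)
FAILED = count_failures(RAW_LOGS, "payment", NOW)
print(FAILED)
```

The query only works because every record carries a service and a status field. If either had been buried in a free-text message, the same question would need fragile string matching.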

Metrics are numeric time-series: request rates, error rates, latency percentiles, queue depths. They are cheap to store and fast to query. Dashboards and alerts live here. The tradeoff: they are aggregated, so you can see that latency went up at 14:32, but not which specific requests were affected or why.
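The aggregation tradeoff can be sketched in a few lines. The example below rolls hypothetical per-request observations up into per-minute metrics; the sample data, the nearest-rank percentile helper, and the rollup function are illustrative assumptions, not how any particular metrics backend does it.

```python
import math
from collections import defaultdict

# Hypothetical raw observations: (minute, duration_ms, status).
OBSERVATIONS = [
    ("14:31", 120, 200), ("14:31", 140, 200), ("14:31", 130, 200), ("14:31", 150, 200),
    ("14:32", 135, 200), ("14:32", 900, 200), ("14:32", 1100, 500), ("14:32", 125, 200),
]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def rollup(observations):
    """Collapse individual requests into one metrics point per minute."""
    buckets = defaultdict(list)
    for minute, duration_ms, status in observations:
        buckets[minute].append((duration_ms, status))
    series = {}
    for minute, rows in buckets.items():
        durations = [d for d, _ in rows]
        errors = sum(1 for _, s in rows if s >= 500)
        series[minute] = {
            "requests": len(rows),
            "error_rate": errors / len(rows),
            "p95_ms": percentile(durations, 95),
        }
    return series


SERIES = rollup(OBSERVATIONS)
for minute, point in sorted(SERIES.items()):
    print(minute, point)
```

After the rollup, the jump in p95 at 14:32 is obvious, but the individual requests are gone: nothing in SERIES says which user or which call was slow. That is the price of cheap storage and fast queries.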

Traces follow a single request through the entire system - from the browser to the load balancer to the API service to the database and back. A trace shows you exactly where time was spent in a distributed call chain. This is what you need when the problem is "checkout is slow for 2% of users" and you cannot reproduce it locally.
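As a rough illustration of what "where time was spent" means, a trace can be modeled as a set of spans that point at their parents. The Span class, the span names, and the timings below are hypothetical, and the self-time calculation assumes child spans do not overlap; real tracing systems record more detail than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# One hypothetical slow checkout request, as four spans.
TRACE = [
    Span("a", None, "load-balancer", 0, 1000),
    Span("b", "a", "api: POST /checkout", 10, 990),
    Span("c", "b", "db: load cart", 20, 60),
    Span("d", "b", "payment-service: charge", 70, 950),
]


def self_times(spans):
    """Time spent in each span excluding its direct children.

    Assumes children of the same parent do not overlap in time.
    """
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


SELF = self_times(TRACE)
SLOWEST = max(SELF, key=SELF.get)
print(SLOWEST, SELF[SLOWEST])
```

In this made-up trace the API and the database are both fast, and nearly all of the second is spent inside the payment call. Logs from each service alone would not show that.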

Why all three, not one

Each signal answers different questions. Logs tell you what happened in detail. Metrics tell you whether something is wrong right now, at scale. Traces tell you where the problem is in a specific path through the system.

Without traces, distributed debugging is guesswork. Without metrics, you find out about problems from customers rather than from an alert. Without structured logs, you can see a problem but not investigate the specifics around it.

What this actually costs

Observability tooling has a reputation for being expensive. That reputation is earned when teams instrument everything at maximum verbosity and route all of it to a vendor that charges per ingested gigabyte.

The practical approach: be selective. Trace your critical paths - the flows that handle money, authentication, and core product actions. Keep metrics at a coarse level for most services and detailed only where you have known pain. Log at info level in production, not debug; add structured fields that let you filter meaningfully (user ID, request ID, account tier) rather than writing paragraphs.
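The logging half of that advice can be sketched with Python's standard logging module. The JsonFormatter class, the "checkout" logger name, and the sample values are assumptions for illustration; the field names mirror the ones suggested above.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, keeping only a few filterable fields."""

    FIELDS = ("user_id", "request_id", "account_tier")

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)  # production level: debug records are dropped
logger.addHandler(handler)
logger.propagate = False

logger.debug("cart contents dumped", extra={"request_id": "req-1"})
logger.info(
    "checkout completed",
    extra={"user_id": "u-42", "request_id": "req-1", "account_tier": "pro"},
)

LINES = stream.getvalue().splitlines()
print(LINES)
```

Only the info record is emitted, and it carries three fields you can filter on instead of a paragraph of prose. Dropping debug output at the source is also the simplest way to keep ingestion volume down.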

A mid-size product with disciplined instrumentation can run comfortably in the $500-2,000/month range. I have seen the same product at $15,000/month because no one had ever looked at what was being ingested.
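For a vendor that charges per ingested gigabyte, the arithmetic behind a gap like that is simple. The per-gigabyte price and the daily volumes below are invented to illustrate the calculation; they are not any vendor's actual rates or measurements from a real system.

```python
def monthly_ingest_cost(gb_per_day, price_per_gb, days=30):
    """Monthly bill under a flat per-ingested-gigabyte price (a simplifying assumption)."""
    return gb_per_day * price_per_gb * days


# Hypothetical price and volumes, chosen only to show the scale of the effect.
PRICE_PER_GB = 2.50
DISCIPLINED = monthly_ingest_cost(gb_per_day=20, price_per_gb=PRICE_PER_GB)
EVERYTHING = monthly_ingest_cost(gb_per_day=200, price_per_gb=PRICE_PER_GB)

print(DISCIPLINED, EVERYTHING)
```

Under these assumed numbers, ten times the ingested volume means ten times the bill. The first thing to check on an expensive setup is therefore what is being sent, not which vendor is receiving it.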

The management question

The right question for a technical owner is not "which observability tool" but "what questions do we need to be able to answer when something breaks, and how fast?" That question drives the instrumentation requirements, which in turn drive the tooling choice.

If the answer is "we need to know that a service is down within two minutes and be able to find the broken request within fifteen" - that shapes the entire observability setup. If the answer is "we find out from users and fix it in a day" - then the current situation is a choice, not an accident, and you should be aware of what you are trading away.

Contact

If this resonated, write to me. I reply personally.