Data | January 18, 2013 | 4 min read

From logs to metrics and back: building observability without noise

Not all events are equally useful. How to design system observability so it produces signal rather than just another stream of garbage.

When a system starts behaving badly, the first question is always the same: what is happening inside? Good observability answers it quickly. Poor observability makes the team spend hours digging through logs that record everything without answering anything.

I have noticed that most monitoring setups are built on the principle of "write as much as possible, sort it out later". This creates an illusion of control and a real flood of noise.

The difference between logs, metrics, and tracing

Three concepts are often conflated, even though each has a distinct role.

A log is a record of a specific event with context: what happened, when, with what parameters. Logs are good for incident investigation, when something has gone wrong and you need to reconstruct the sequence of events. Structured logs are a useful data source in their own right, well before any SIEM is in place.
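To make "an event with context" concrete, here is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the `context` attribute, the logger name `payments`, and the field names are all assumptions chosen for illustration, not a prescribed schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object: what, when, with what parameters."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # "context" is a hypothetical attribute supplied by the caller via `extra=`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str = "payments") -> logging.Logger:
    """Return a logger that writes one JSON line per event to stderr."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.error(
        "charge_failed",
        extra={"context": {"order_id": "A-1001", "reason": "card_declined"}},
    )
```

The point of the structure is that an investigator can later filter by `order_id` or `reason` instead of grepping free text.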

A metric is a numerical value that changes over time: requests per second, response time, error rate. Metrics are good for monitoring the normal state of a system and detecting deviations.
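As a rough illustration of how the three metrics named above relate to raw events, the sketch below reduces a window of requests to a handful of numbers. The `Request` type and the `summarize` function are hypothetical names, and a real setup would normally rely on a metrics library rather than hand-rolled aggregation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    timestamp: float    # seconds since the start of the window
    duration_ms: float  # how long the request took
    ok: bool            # False if the request ended in an error


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Collapse individual requests into three numbers for one time window."""
    if not requests or window_seconds <= 0:
        return {"requests_per_second": 0.0, "error_rate": 0.0, "mean_duration_ms": 0.0}
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "requests_per_second": total / window_seconds,
        "error_rate": errors / total,
        "mean_duration_ms": sum(r.duration_ms for r in requests) / total,
    }
```

Note what is lost on purpose: the individual parameters of each request. That is the log's job, not the metric's.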

A trace is the path of a single request through multiple system components. Tracing is needed where the system is distributed and a request is processed sequentially by different services.
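The idea of a trace can be sketched without any tracing library: one identifier is created at the entry point and every component that touches the request records a span under it. The service names (`gateway`, `inventory`, `billing`) and the `Trace` and `Span` types are invented for this example; in a real distributed system the identifier would travel between processes, typically in request headers.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: float


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def record(self, service: str, started: float) -> None:
        """Store how long `service` spent on this request."""
        elapsed_ms = (time.perf_counter() - started) * 1000
        self.spans.append(Span(self.trace_id, service, elapsed_ms))


def inventory(trace: Trace) -> None:
    started = time.perf_counter()
    # ... the hypothetical inventory service would do its work here ...
    trace.record("inventory", started)


def billing(trace: Trace) -> None:
    started = time.perf_counter()
    # ... the hypothetical billing service would do its work here ...
    trace.record("billing", started)


def gateway() -> Trace:
    """Entry point: creates the trace and passes it to each downstream call."""
    trace = Trace()
    started = time.perf_counter()
    inventory(trace)
    billing(trace)
    trace.record("gateway", started)
    return trace
```

Because all spans share one `trace_id`, the question "where did this request spend its time?" can be answered for a single request rather than on average.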

The problem is that without a clear sense of what each layer is for, all three start duplicating each other and multiplying data volume without any benefit.

Where the noise comes from

The most common source of noise in monitoring is DEBUG-level logging in production. A developer added detailed logging during troubleshooting, forgot to remove it, and now the system is writing several gigabytes per day of events no one needs in normal operation.
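One way to keep forgotten DEBUG output out of production is to make the level a deployment setting rather than something edited in code. A minimal sketch, assuming a hypothetical environment variable named `APP_LOG_LEVEL` and a default of INFO:

```python
import logging
import os
from typing import Mapping, Optional


def resolve_level(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the log level from the environment, falling back to INFO."""
    env = os.environ if env is None else env
    # APP_LOG_LEVEL is an assumed variable name, not a standard one.
    name = env.get("APP_LOG_LEVEL", "INFO").upper()
    level = logging.getLevelName(name)
    # getLevelName returns a string such as "Level CHATTY" for unknown names.
    return level if isinstance(level, int) else logging.INFO


if __name__ == "__main__":
    logging.basicConfig(level=resolve_level())
    logging.getLogger(__name__).debug("only visible when APP_LOG_LEVEL=DEBUG")
```

With this in place, detailed logging can be switched on for the duration of troubleshooting and reverts to the default when the variable is removed.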

The second source is alerting on everything. When every deviation from normal generates a notification, the team starts ignoring notifications. This is worse than no monitoring at all: it creates a false sense that everything is under control.
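A common way to reduce notification fatigue is to alert only on sustained deviations, not on every spike. The sketch below is one possible rule, not the only one; the function name `should_alert` and its parameters are assumptions made for this example.

```python
def should_alert(samples: list[float], threshold: float, min_consecutive: int) -> bool:
    """Return True only if the most recent `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])
```

For example, with a 5% error-rate threshold and `min_consecutive=3`, a single bad minute stays silent, while three bad minutes in a row produce one notification that is worth reading.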

The third source is metrics without context. The number of requests on its own says nothing. What matters is knowing what is normal, what the acceptable range of variation is, and what should happen when that range is exceeded.
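Giving a metric context can be as simple as comparing it with its own history. The sketch below derives an acceptable range as the historical mean plus or minus a number of standard deviations. The three-sigma default and the names `normal_range` and `classify` are assumptions; many metrics (anything with daily or weekly cycles, for instance) would need a more careful baseline.

```python
import statistics


def normal_range(history: list[float], tolerance_sigmas: float = 3.0) -> tuple[float, float]:
    """Acceptable range: historical mean +/- `tolerance_sigmas` standard deviations."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return mean - tolerance_sigmas * spread, mean + tolerance_sigmas * spread


def classify(value: float, history: list[float], tolerance_sigmas: float = 3.0) -> str:
    """Label a new measurement relative to what has been normal so far."""
    low, high = normal_range(history, tolerance_sigmas)
    if value < low:
        return "below_normal"
    if value > high:
        return "above_normal"
    return "normal"
```

The label is only half of the answer. What should happen when the result is not "normal" still has to be decided by whoever owns the metric.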

How to design observability

Good observability is designed from questions, not from tools. A useful starting point is a few direct questions about each system or service.

What does "the system is working normally" mean? What numerical indicators confirm that? These are the core metrics - there should be few of them, but they must always be current.

What does "the system is beginning to degrade" look like - before the user notices? These are early-warning metrics. They need to give a signal with enough lead time to react.
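One simple form of early warning is to extrapolate a trend and ask how long it will take to hit a limit. The sketch below fits a least-squares line to equally spaced samples and estimates the remaining time. The function name, the one-minute sampling interval, and the linear-trend assumption are all illustrative; real degradation is often not linear.

```python
from typing import Optional


def minutes_until_limit(
    samples: list[float], limit: float, interval_minutes: float = 1.0
) -> Optional[float]:
    """Estimate minutes until `limit` is reached, or None if the metric is not rising."""
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    # Least-squares slope over sample indices 0..n-1 (units per sample).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    if slope <= 0:
        return None
    remaining = limit - samples[-1]
    if remaining <= 0:
        return 0.0
    return remaining / slope * interval_minutes
```

If response time has been climbing by 10 ms per minute and currently sits 70 ms below the agreed limit, the estimate is about seven minutes, which tells the team whether there is still time to react.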

What is needed to investigate an incident? This determines what belongs in the logs: not everything, but the events and context that allow the cause of a problem to be reconstructed.

Signal, not volume

A good practical rule: if no one looks at a piece of monitoring data regularly and it is not used during incidents, it probably does not need to be collected.

This is not a call for minimalism for its own sake. It is a call for every element of observability to have an owner who understands why it exists, and a usage scenario.
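The rule and the ownership requirement can be turned into a periodic review. Below is a minimal sketch, assuming the team keeps a simple inventory of what it collects. The `Signal` fields and the 90-day staleness cutoff are hypothetical and would need to be adapted to whatever inventory actually exists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signal:
    name: str
    owner: Optional[str]         # who understands why this exists
    purpose: Optional[str]       # the usage scenario
    days_since_last_viewed: int
    used_in_incidents: bool


def review_candidates(signals: list[Signal], stale_after_days: int = 90) -> list[str]:
    """Names of signals with no owner or purpose, or that nobody looks at or uses."""
    flagged = []
    for signal in signals:
        unowned = not signal.owner or not signal.purpose
        unused = (
            signal.days_since_last_viewed > stale_after_days
            and not signal.used_in_incidents
        )
        if unowned or unused:
            flagged.append(signal.name)
    return flagged
```

Being flagged is a prompt for a conversation, not an automatic deletion: some data is looked at rarely and still matters during an incident.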

Systems where this is done well tend to look modest on the surface: a few key dashboards, a small number of alerts with clear thresholds and clear ownership. But these are exactly the systems that let teams respond to incidents quickly, without getting lost in a torrent of data.

Questions to check the state of things

A few questions that help assess the state of observability in a company:

How quickly does the team notice a problem in production - before the customer calls, or after?
How long does diagnosis take once a problem is detected?
Are there any alerts the team has learned to ignore?
If the person who "knows how to read the logs" leaves, can someone else figure it out?
How much monitoring data is stored, and how much of it is actually used?

If most of these questions do not have a good answer, observability is worth designing from scratch rather than layering on top of what exists.

