An oncall engineer gets paged. She checks the dashboard, someone else checks logs, another person asks whether we have traces, and eventually the team discovers that the one field that would have explained the failure was never emitted, never indexed, or already expired from retention.
So the lesson lands hard. We need more telemetry. That part is rational. Missing telemetry hurts. I have been in enough incidents to know the feeling. You stare at a broken system and realize the evidence isn’t there. Classic oh shit moment.
Naturally, a few painful incidents later, teams add more. More logs. More metrics. More labels. More dashboards. More alerts. More everything. Then the second complaint arrives. We have too much telemetry.
Queries are slow. We have way too many dashboards. Alerts are stale. Nobody knows which metrics are still used. Log search returns a dump. The platform team starts talking about cardinality. FinOps comes knocking about the vendor bill. Security folks start asking why customer identifiers, headers, and payload fragments are floating around in log storage.
Then the cleanup phase begins. Let’s kill dimensions, remove debug logs, reduce retention, sample more, stop indexing that field, and so on. Move old data to cold storage. Eventually a platform engineer deletes something and waits to see who screams.
Telemetry Is Treated Like Exhaust
Telemetry often begins as exhaust. The application does the real work. Logs, metrics, traces, profiles, events, and audit records are what comes out the side. They are useful, of course, but they are rarely designed with the same seriousness as the production path.
A database schema, an API contract, a queue, a cache, or a new external dependency will usually get reviewed, and security-sensitive product data gets some kind of review too, at least in a healthy company. Telemetry changes often slide through as an implementation detail. In production, telemetry is not exhaust anymore. It is infrastructure. It consumes CPU, memory, network, and disk. It also costs engineering attention, security review time, and budget. It can get bad enough to interfere with the workload it is supposed to observe.
An application container gets sized for business logic, then an agent, sidecar, collector, logger, or profiler joins the party. Maybe the overhead is tiny per pod, but it adds up across a fleet: say 50 MB of agent memory per pod across twenty thousand pods, and you are holding roughly a terabyte of RAM just for watching. Once you see that, you realize you need to take this more seriously. The mirror is part of the machine.
The Bill Is an Architecture Review
A telemetry bill tells you what your architecture hid under the rug. It exposes the mess that looked harmless while it was spread across services: too many clever components, too many retries, health checks, labels, and debug logs quietly multiplying in the background. Nobody feels the damage at the point of creation because the feedback loop sits far downstream.
The bill does not care about intent. It just multiplies. A developer adds a field today. The reviewer sees useful context. The platform team sees an ingestion spike later. Finance sees the invoice after that. Security finds the accidental data exposure during a review months later.
By then, the original pull request is gone from memory. Maybe the developer moved teams. Maybe service ownership changed. Maybe the metric now feeds a dashboard nobody knows how to replace.
The decision was local, but the cost became global. The person creating the work is not the person paying the queueing cost, and the person who understands the risk may not have the authority to block it. The invoice is the only mirror that never lies.
Cardinality Is Where Context Becomes Cost
The most common technical explanation of telemetry cost is cardinality. It is also one of the easiest to underestimate because the dangerous part looks small in code. A metric is not one number. In a time series database, a metric becomes the combination of its name and its labels. Every unique label set creates a distinct time series.
This is fine when the labels are bounded: service, env, region, status_code, route_template, team, zone. These labels describe stable operational dimensions. They let you group, filter, alert, and compare without creating an unbounded mess.
Then an incident comes along, and a team adds user_id to a metric. The review passes because the context is useful. They needed the specific user_id that caused the hot partition issue. Three weeks later, the platform team sees active series explode, the team's manager is told they got a huge bill, and security realizes customer identifiers are now part of metric storage.
The storage layer sees the cross-product of every possible value. The deeper point is that high-cardinality labels are where engineering intent, database physics, vendor pricing, and ownership gaps collide.
The developer wanted local context. The database got global multiplication. The vendor got billable usage. The platform team got a mess. Finance got a surprise. Security may have received a sensitive-data problem as a bonus. Voila!
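A minimal sketch of the difference, assuming a Python service instrumented with prometheus_client; the metric names, label values, and fleet sizes are made up for illustration:

```python
# Illustrative only: a Python service instrumented with prometheus_client.
from prometheus_client import Counter

# Bounded labels. The series count is the cross-product of label values:
# say 20 services x 3 envs x 4 regions x 6 status codes = 1,440 series. Cheap.
REQUESTS = Counter(
    "http_requests_total",
    "Requests handled, by stable operational dimensions",
    ["service", "env", "region", "status_code"],
)
REQUESTS.labels(service="checkout", env="prod", region="eu-west-1", status_code="500").inc()

# The incident-driven version. One extra label, unbounded values:
# 1,440 series x 200,000 active users is suddenly hundreds of millions of series.
REQUESTS_BY_USER = Counter(
    "http_requests_by_user_total",
    "The same signal, multiplied by every customer who ever shows up",
    ["service", "env", "region", "status_code", "user_id"],  # user_id is the problem
)
```

In the diff, the second counter looks like one extra string in a list. The storage layer sees the multiplication.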
Metrics usually explode through cardinality. Logs usually explode through fear. That fear is rational. Missing one log line during an incident can waste hours. So teams learn the easy lesson and add more logs. Every request gets logged. Every success gets logged.
The problem is logging without a question. A good log explains a state transition, a boundary crossing, a decision, a rejection, a fallback, or a failure. Compliance makes this worse when retention and indexing get treated as the same decision. Maybe you need to keep audit records for ninety days. That does not mean every debug line belongs in a hot searchable index for ninety days.
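A hedged sketch in Python of the difference between logging everything and logging a decision; the payment workflow, threshold, and field names are invented for illustration:

```python
# Hypothetical payment flow; the threshold, states, and field names are invented.
import json
import logging

log = logging.getLogger("payments")
RISK_THRESHOLD = 10_000

def settle(amount_cents: int, payment_id: str) -> str:
    """Decide whether a payment is captured or routed to manual review."""
    if amount_cents > RISK_THRESHOLD:
        # A log with a question behind it: which decision was taken, and why.
        log.warning(json.dumps({
            "event": "payment_routed_to_manual_review",  # the decision
            "reason": "amount_over_threshold",           # why it was made
            "payment_id": payment_id,                    # bounded context to act on
        }))
        return "pending_review"
    # No log on the happy path; request volume already lives in a counter.
    return "captured"
```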
The Real Unit of Telemetry Is a Decision
Most telemetry cost discussions start with data volume. Gigabytes per day. Spans per second. Active series. Indexed logs. Retention days. Cardinality. Query load. Those are important numbers, but they are not the real unit.
The real unit of telemetry is a decision. What decision does this signal support? Does this metric drive an alert? Does this log explain a state transition? Does this trace help diagnose a customer-visible path? Does this event prove a business outcome? Does this audit record satisfy a compliance need? Does this field help security investigate abuse? Does this dashboard influence a rollout decision? Does this attribute help compare normal and abnormal system behaviour?
Take two Spark signals. The first is spark_job_task_failures_total, labelled by failure_type, stage, and application. It results in an alert. When it spikes, the oncall engineer can tell whether jobs are dying from OOM errors, shuffle fetch failures, executor loss, or timeouts. It changes the next move, so it earns its place.
The second started during a bad incident. Someone added executor heartbeat metrics with executor_id, application_name, user_id, and a generated run identifier because the team needed more evidence. Fair enough. Nobody was being stupid.
Three months later, the incident is forgotten. The metric has no alert, no runbook, and one abandoned dashboard. The series count keeps multiplying across every executor, user, application, and run. The original problem disappeared. The anxiety stayed behind and started charging rent. Nobody told the mirror the incident was over. The first signal supports a decision. The second one is incident fear stored at scale.
This is where a lot of organisations lose discipline. It is as if teams forget YAGNI entirely when it comes to telemetry. The difference is that forgetting YAGNI in application code produces technical debt. Forgetting it in telemetry produces a monthly invoice and noisier incidents.
If nobody can articulate the decision a signal supports, that signal is probably engineering anxiety expressed as infrastructure spend. The vendors bill by volume, not by value. The distinction is yours to make.
Tooling Does Not Remove the Political Work
Tooling is the easy argument because it feels technical. Datadog or Grafana. SaaS or self-hosted. OpenTelemetry or vendor agents. But tools do not answer the harder question: who is allowed to create cost, and who is allowed to say no?
The Hard Question Is Who Can Say No
Who is allowed to add a new metric? Who approves an unbounded label? Who decides whether customer_id belongs in a metric, a trace attribute, a structured log, or nowhere at all? Who can say that debug logs expire after seven days? Who can force audit records out of hot search and into cheaper retention? Who owns a dashboard after the incident that created it is over? Who has the authority to say no when the answer is technically unpopular but economically necessary?
That is the political work. Without it, every tool becomes a different kind of landfill.
The Migration Is a Translation Test
A migration is a good time to admit some telemetry should not survive. Do not rebuild every old dashboard in a new query language. Some charts are stale. Some alerts belong to incidents nobody remembers. Some queries only exist because nobody wanted to be the person who deleted them.
The useful question is simple. What decision does this support, and who still depends on it? The hard work is translating intent, not preserving every old shape. If everything survives the migration, the old mess survived too.
Pick the tool after you know the operating model. If you choose SaaS, know which defaults you will override, which limits you will enforce, and which teams will pay for the signals they emit. If you choose self-hosting, count the people, not just the disks. If you choose OpenTelemetry, treat the collector as production infrastructure with its own SLOs, capacity plan, dashboards, and failure modes.
Every observability architecture arrives as an invoice. Sometimes it arrives as payroll, sometimes as pager fatigue, sometimes as query latency, and sometimes as security exposure.
Ownership Has an Address
A new metric dimension should feel closer to a database migration than a logging tweak. That does not mean the platform team reviews every metric pull request. That would create the wrong incentives. Teams would either avoid adding useful telemetry because the process feels heavy, or someone would start rubber-stamping approvals just to keep work moving. Then we would have the worst version of governance: slower delivery, fake approval, and still no real ownership.
The scalable version is guardrails. Give teams standard libraries, approved dimensions, and boring defaults: service, env, region, team, zone, status_code, route_template. Then block the garbage at the edge. CI rejects user_id, session_id, trace_id, container IDs, and generated run IDs. The collector strips unsafe attributes. The gateway drops health checks, debug noise, and low-value junk before it gets to the vendor. Human review is still needed for things like a new global dimension, a sensitive identifier, a new exporter, a longer retention class, and so forth.
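One shape the paved road can take, sketched in Python around prometheus_client; the wrapper, the allowlist, and the denylist are hypothetical, not a real library:

```python
# A hypothetical paved-road helper. Teams get the standard dimensions for free,
# and denied labels fail at build or import time rather than on the invoice.
from typing import Sequence
from prometheus_client import Counter

STANDARD_LABELS = ["service", "env", "region", "team", "zone"]
ALLOWED_EXTRA = {"status_code", "route_template", "failure_type", "stage"}
DENIED = {"user_id", "session_id", "trace_id", "container_id", "run_id"}

def paved_road_counter(name: str, doc: str, extra: Sequence[str] = ()) -> Counter:
    """Create a counter with the standard dimensions; block unbounded or unreviewed ones."""
    denied = set(extra) & DENIED
    if denied:
        raise ValueError(f"{name}: {sorted(denied)} are unbounded or sensitive labels")
    unreviewed = set(extra) - ALLOWED_EXTRA
    if unreviewed:
        raise ValueError(f"{name}: {sorted(unreviewed)} need a platform exception first")
    return Counter(name, doc, STANDARD_LABELS + list(extra))
```

The specific wrapper matters less than the property it creates: the default path is cheap and safe, and the expensive path requires a conversation.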
Retention needs the same discipline. Every signal should be born with a life expectancy. Alert-driving metrics have longer retention. Diagnostic signals get a shorter hot window. Debug signals expire fast. Anything unclassified is short-lived by default. If a metric does not support an alert, dashboard, runbook, SLO, rollout check, compliance workflow, or security investigation, it should not live forever.
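As a rough sketch of what being born with a life expectancy could mean in practice; the classes and the numbers are placeholders, not recommendations:

```python
# Illustrative retention classes; tune the numbers to your own compliance and cost reality.
RETENTION_DAYS = {
    "alert_driving": 400,   # feeds an alert or SLO; long enough to see seasonality
    "diagnostic": 30,       # useful during incidents; short hot window, then cold storage
    "debug": 7,             # expires fast
    "unclassified": 3,      # the default when nobody can name the decision it supports
}
```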
The platform team owns the paved road, the denylist, the collector rules, the exception process, and the budget boundaries. Service teams own the signals they emit and the decisions those signals support.
A Broken Mirror Is Worse Than No Mirror
A useless signal is easy to block at the collector on day one. After six months, though, it has a dashboard, a half-dead alert, and one scary incident story everyone uses to protect it. Then deleting it becomes political because nobody wants to be that person. Bad telemetry does not stay harmless for long. Soon, people treat it like something that must exist.
That is why bad telemetry is worse than no telemetry. No telemetry tells you the ugly truth. You are blind. Bad telemetry gives you fake eyes. It lets people point at pretty graphs, stale dashboards, and noisy alerts as if they prove anything. The mirror is part of the machine. If it lies, the machine teaches everyone to trust the broken thing.