Data January 25, 2018 3 min read

A data pipeline is a production system, not a script

Why companies lose trust in their analytics when they treat data pipelines as one-off tasks rather than operated systems.

There is a story I see again and again in companies that have grown into doing analytics on a regular basis. An analyst wrote a script that pulls data from several sources, normalises it, and refreshes a table that a dashboard depends on. The script works. Everything is fine.

Then the analyst goes on holiday, or leaves the company, or is simply busy with something else. The script fails. Nobody knows why. The dashboard shows three-week-old data. Managers make decisions on stale numbers, or stop trusting the dashboard entirely.

This is not a story about a specific person. It is a story about a company that created a production dependency without creating a production system.

How a data pipeline differs from a script

A script is code that solves a task once or on demand. A data pipeline is a system that runs regularly, whose operation affects business processes, and on which other systems or people depend.

The difference is not in the technology. The difference is that a production system requires:

monitoring - someone must know when it fails;
an owner - a specific person responsible for its operation;
documentation - not academic, but minimally sufficient for another person to understand it;
an update process - what happens when a data source changes.

Most analytical scripts in companies meet none of these criteria.
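
To make the first two criteria concrete, here is a minimal sketch in Python of a job that cannot fail silently: it runs under a wrapper that reports any error to a named owner. Everything in it is an assumption made for illustration, not a recommended tool: the names run_with_alert and fake_notify, the pipeline name and the e-mail address are all hypothetical, and a real notifier would send an e-mail or a chat message rather than append to a list.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RunResult:
    pipeline: str
    ok: bool
    error: str = ""


def run_with_alert(name: str, owner: str, job: Callable[[], None],
                   notify: Callable[[str, str], None]) -> RunResult:
    """Run one pipeline job; if it raises, tell the owner instead of failing silently."""
    try:
        job()
    except Exception as exc:  # broad on purpose: any failure must reach the owner
        notify(owner, f"Pipeline '{name}' failed: {exc!r}")
        return RunResult(name, ok=False, error=repr(exc))
    return RunResult(name, ok=True)


# Stand-in notifier (hypothetical): it only collects the messages it is given.
sent: List[Tuple[str, str]] = []


def fake_notify(owner: str, message: str) -> None:
    sent.append((owner, message))


def broken_job() -> None:
    # Simulates a source change that breaks the script.
    raise ValueError("CRM export has no 'region' column")


failed = run_with_alert("sales_dashboard_refresh", "analyst@example.com",
                        broken_job, fake_notify)
succeeded = run_with_alert("sales_dashboard_refresh", "analyst@example.com",
                           lambda: None, fake_notify)
```

The point of the sketch is not the wrapper itself but the two arguments it forces you to supply: a channel through which failures are reported, and a specific person who receives them.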

Why this has become more acute in recent years

Previously, analytics was a support function - "look at the numbers once a month". Now, in many companies, dashboards have become part of daily management: pricing, inventory management, team performance evaluation. When data has not refreshed, or has refreshed incorrectly, it affects real decisions the same day.

At the same time, the number of data sources has grown. CRM, ERP, advertising platforms, website logs, equipment data - all of this needs to be collected, normalised and kept current. One person maintaining this manually is a risk, not an architecture.

Signs that pipelines have gone out of control

These symptoms allow a quick assessment:

Different people quote different numbers in answer to the same question.
Nobody knows exactly when a given report was last updated.
When something breaks, finding the cause takes hours or days.
One person leaving the team creates a serious risk for analytics.
A change in a data source (say, the CRM was upgraded) causes unexpected failures in several places.

If even two of these five apply, pipelines have long since stopped being just a technical matter.
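
Two of these symptoms can be checked mechanically. The sketch below, in Python, shows one possible shape for such checks: a freshness test against an agreed maximum age, and a comparison of the columns a pipeline expects with the columns a source actually delivers. The function names, the 24-hour limit and the column names are hypothetical; how you obtain the last refresh time or the actual column list depends on your own storage and is not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Set


def is_stale(last_refreshed: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True if the data is older than the agreed maximum age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_refreshed > max_age


def missing_columns(expected: Iterable[str], actual: Iterable[str]) -> Set[str]:
    """Columns the pipeline relies on that the source no longer provides."""
    return set(expected) - set(actual)


# A dashboard table last refreshed three weeks ago, checked against a 24-hour limit.
now = datetime(2018, 1, 25, 9, 0, tzinfo=timezone.utc)
stale = is_stale(now - timedelta(weeks=3), max_age=timedelta(hours=24), now=now)

# A table refreshed an hour ago passes the same check.
fresh = is_stale(now - timedelta(hours=1), max_age=timedelta(hours=24), now=now)

# A CRM export that renamed a column after an upgrade (illustrative column names).
gone = missing_columns(expected=["deal_id", "amount", "region"],
                       actual=["deal_id", "amount", "territory"])
```

Checks like these do not repair anything. They only shorten the time between a failure and someone knowing about it, which is what the symptoms above are really measuring.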

What to do about it

The first step is an inventory. List all processes that regularly move or transform data in the company - including what runs in Excel and is done manually. Often the list itself is surprising.

The second step is to assign an owner to each process. Not a team, not a department - a specific person who understands how it works and is responsible for keeping it running.

The third step is to decide what needs to move into managed infrastructure and what it is enough simply to document. Not everything requires full engineering. But everything requires a conscious decision.
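
The three steps can be kept in something as simple as a list of records. The Python sketch below is one possible way to hold such an inventory and to surface the entries that still lack an owner or a decision. The field names, the two decision values and the sample entries (including the owners' names) are invented for illustration; a spreadsheet with the same columns would serve equally well.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    name: str
    schedule: str
    owner: Optional[str] = None     # a specific person, not a team or department
    decision: Optional[str] = None  # "managed" or "documented"; None means undecided


def needs_attention(inventory: List[Process]) -> List[str]:
    """Names of processes that have no owner or no conscious decision yet."""
    return [p.name for p in inventory if p.owner is None or p.decision is None]


# Hypothetical inventory, including a manual Excel process.
inventory = [
    Process("sales_dashboard_refresh", "daily", owner="Anna", decision="managed"),
    Process("monthly_margin_excel", "monthly", owner="Boris"),
    Process("ad_spend_export", "weekly"),
]

open_items = needs_attention(inventory)
```

In this example, open_items contains the Excel report, which has an owner but no decision, and the advertising export, which has neither.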

A production system is not a question of the technology stack. It is a question of agreements about who is responsible for what.


If this resonated, write to me. I reply personally.