Hadoop in business terms: when a cluster makes sense and when a DWH is enough
Hadoop does not replace a data warehouse and is not the answer to every large-volume question. A breakdown of which tasks justify a cluster and which ones do not.
Over the past two years Hadoop has become the symbol of "working with big data". If a company has data and ambitions, the question often comes up: "Do we need Hadoop?" Sometimes it comes up as a statement: "We need Hadoop."
I am skeptical - not because the technology is bad, but because the question is almost always framed the wrong way. The right question is not "do we need Hadoop" but "what problem do we have and which tool fits it". Buying a platform does not fix underlying data problems, and the same logic applies to choosing a cluster.
What Hadoop is and what it actually does
Hadoop is a framework for distributed processing of large data volumes across a cluster of commodity servers. Its main virtue is horizontal scaling: when data grows, you add nodes.
The cost is complexity. A cluster has to be deployed, configured, monitored, and maintained. Queries are written not in plain SQL but as MapReduce jobs, or in tools layered on top of MapReduce such as Hive and Pig, and either route requires a specific skill set. Response time for a single query is typically minutes, not seconds.
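To make the skill-set gap concrete, here is a minimal sketch in plain Python that mimics the map, shuffle, and reduce phases for a question SQL answers in one line: how many events did each user generate? It illustrates the programming model only and does not use Hadoop's actual API; the function names and the sample records are invented for the example.

```python
from collections import defaultdict

# Hypothetical sample records: (user_id, event_type).
EVENTS = [
    ("alice", "click"),
    ("bob", "view"),
    ("alice", "view"),
    ("carol", "click"),
    ("alice", "click"),
]

# The SQL equivalent of everything below:
#   SELECT user_id, COUNT(*) FROM events GROUP BY user_id;


def map_phase(records):
    """Emit a (key, 1) pair for every input record."""
    for user_id, _event_type in records:
        yield user_id, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_events_per_user(records):
    """Run the three phases in order on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(count_events_per_user(EVENTS))
```

Even in this toy form, a one-line GROUP BY turns into three functions that someone has to write, test, and debug. On a real cluster, job submission, data placement, and failure handling come on top of that.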
For certain tasks this is completely acceptable. For others it is a fatal constraint.
Four different layers that are often confused
When a company says "we need to store and analyze data", there are usually four distinct needs behind that statement:
Archive. Raw data that must be stored for a long time and cheaply - logs, events, transactions. Access is infrequent, the schema may change. Hadoop is a good fit here.
Exploratory layer. Analysts and data scientists work with large samples, form hypotheses, and check correlations. Queries are irregular and unpredictable. Hadoop and its ecosystem - Hive, Pig - are a reasonable choice here when volumes are large.
Operational analytics. Regular reports, dashboards, business metrics. Queries are predictable and response time matters. A classical DWH - a relational database optimized for analytical queries - handles this far better than a cluster.
Production reporting. Data used in real time: stock levels, order statuses, account balances. This is a job for neither Hadoop nor a DWH: it belongs in a transactional database with strict latency requirements.
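The four layers can also be written down as a small lookup table. The Python sketch below only restates the descriptions above; the dictionary keys, the field names, and the `suggested_store` helper are invented for the illustration and do not come from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    typical_data: str
    access_pattern: str
    suggested_store: str


# One entry per layer; the wording paraphrases the article's own descriptions.
LAYERS = {
    "archive": Layer(
        name="Archive",
        typical_data="raw logs, events, transactions",
        access_pattern="infrequent access, schema may change",
        suggested_store="Hadoop",
    ),
    "exploratory": Layer(
        name="Exploratory layer",
        typical_data="large samples for hypotheses and correlations",
        access_pattern="irregular, unpredictable queries",
        suggested_store="Hadoop ecosystem (Hive, Pig)",
    ),
    "operational_analytics": Layer(
        name="Operational analytics",
        typical_data="regular reports, dashboards, business metrics",
        access_pattern="predictable queries, response time matters",
        suggested_store="classical DWH",
    ),
    "production_reporting": Layer(
        name="Production reporting",
        typical_data="stock levels, order statuses, account balances",
        access_pattern="real time, strict latency",
        suggested_store="transactional database",
    ),
}


def suggested_store(layer_key: str) -> str:
    """Return the storage option the article associates with a layer."""
    try:
        return LAYERS[layer_key].suggested_store
    except KeyError:
        raise ValueError(f"unknown layer: {layer_key!r}") from None


if __name__ == "__main__":
    for key, layer in LAYERS.items():
        print(f"{layer.name}: {layer.suggested_store}")
```

The point of the table is that "we need to store and analyze data" maps to four different rows, and only two of them point at a cluster.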
Where the most common mistakes happen
The typical mistake is building a Hadoop cluster for operational analytics. Management wants reports fast, analysts want flexible queries, IT proposes the "modern solution" - and the company ends up with infrastructure that is expensive to maintain and slow to answer simple questions.
The opposite mistake is trying to store terabytes of historical logs in a relational DWH that was never designed for it. That is also expensive and also works poorly.
The problem is not the technology. It is the mismatch between tool and task.
A practical filter
Before making any storage architecture decision, it is worth answering a few questions:
- What is the current data volume and what will it be in two years?
- Who is the primary user: an analyst, a reporting system, or an operational process?
- How fast does the answer need to be: are seconds, minutes, or hours acceptable?
- Does the team have the skills to maintain a cluster?
- What happens if the cluster goes down - is that critical to the business right now?
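The first question on the list is the easiest one to answer with arithmetic. A minimal sketch, assuming data grows at a steady monthly rate (a simplification, since real growth is often lumpy); the function name and the 5% figure are illustrative and not taken from any real system.

```python
def project_volume_tb(current_tb: float, monthly_growth_rate: float, months: int = 24) -> float:
    """Project data volume with compound growth: current * (1 + rate) ** months.

    monthly_growth_rate is a fraction, e.g. 0.05 for 5% per month.
    """
    if current_tb < 0:
        raise ValueError("current_tb must be non-negative")
    if months < 0:
        raise ValueError("months must be non-negative")
    if monthly_growth_rate <= -1:
        raise ValueError("monthly_growth_rate must be greater than -1")
    return current_tb * (1 + monthly_growth_rate) ** months


if __name__ == "__main__":
    # Hypothetical input: 1 TB today, growing 5% per month.
    projected = project_volume_tb(1.0, 0.05)
    print(f"Projected volume in two years: {projected:.1f} TB")
```

With these made-up inputs, 1 TB ends up at roughly 3.2 TB after two years. That is more than three times the starting volume, yet still only a few terabytes.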
If the data volume is under a few terabytes, the users are business analysts, and the answer needs to come back in seconds - a good DWH will cover roughly 90% of requirements. Hadoop here would be an expensive and complex substitute for a simpler solution.
If the volume is large, the work is exploratory, and the team is ready to maintain a cluster - Hadoop is justified. But that is a choice made for specific conditions, not because "everyone is doing it".
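The filter can be sketched as a small decision function. This is a rough encoding of the two rules above plus the transactional case from the layer breakdown, not a sizing tool. The 5 TB threshold is my own stand-in for "a few terabytes", the field names are invented, and the outage question is left out because the text gives no rule for it.

```python
from dataclasses import dataclass

# Assumption: "a few terabytes" is read as 5 TB; the article gives no exact figure.
SMALL_VOLUME_TB = 5.0

VALID_USERS = {"analyst", "reporting_system", "operational_process"}
VALID_LATENCIES = {"seconds", "minutes", "hours"}


@dataclass(frozen=True)
class Workload:
    volume_tb_in_two_years: float
    primary_user: str          # one of VALID_USERS
    acceptable_latency: str    # one of VALID_LATENCIES
    exploratory: bool          # is the work hypothesis-driven and irregular?
    team_can_run_cluster: bool


def recommend(w: Workload) -> str:
    """Map the answers to the filter questions onto a storage recommendation."""
    if w.primary_user not in VALID_USERS:
        raise ValueError(f"unknown primary_user: {w.primary_user!r}")
    if w.acceptable_latency not in VALID_LATENCIES:
        raise ValueError(f"unknown acceptable_latency: {w.acceptable_latency!r}")

    # Operational processes need a transactional database, whatever the volume.
    if w.primary_user == "operational_process":
        return "transactional database"

    small = w.volume_tb_in_two_years < SMALL_VOLUME_TB

    # Small volume and answers in seconds: a DWH covers it.
    if small and w.acceptable_latency == "seconds":
        return "DWH"

    # Large volume, exploratory work, and a team able to run a cluster.
    if not small and w.exploratory and w.team_can_run_cluster:
        return "Hadoop"

    return "no clear fit: revisit the questions"


if __name__ == "__main__":
    print(recommend(Workload(2.0, "analyst", "seconds", False, False)))
    print(recommend(Workload(200.0, "analyst", "hours", True, True)))
```

Note the last branch: a large volume with no team to run the cluster does not return "Hadoop". The function refuses to pick a tool when the conditions for it are not met, which is the whole argument in one line.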