m@ksim.pro

Feature engineering is a business decision in disguise

The variables you feed into a machine learning model are not a purely technical choice. They encode assumptions about your business that deserve explicit review.

When a data scientist talks about feature engineering, they usually mean the process of choosing which variables to include in a model and how to transform raw data into those variables. It sounds technical. In practice, it is one of the places where business logic enters a machine learning system most quietly, and therefore most dangerously.

I have seen this cause real problems. A model built to predict customer churn used "number of support tickets submitted" as a feature. The assumption baked in: more support tickets means a less satisfied customer who is at higher churn risk. That is plausible. It also means the model scored customers who never contacted support as low churn risk, regardless of whether they were satisfied or simply unable to reach a human being. The feature encoded a business process failure as a quality signal.
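To make the failure concrete, here is a minimal Python sketch with made-up customers. The field names (`tickets`, `failed_contact_attempts`), the threshold, and the toy scoring rule are all assumptions for illustration; this is not the model from the project described above.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    tickets: int                  # support tickets submitted
    failed_contact_attempts: int  # hypothetical: abandoned calls, bounced forms


def naive_risk(c: Customer) -> str:
    # Toy rule standing in for the learned pattern: more tickets, more risk.
    return "high" if c.tickets >= 3 else "low"


def reviewed_risk(c: Customer) -> str:
    # Same rule, except that zero tickets combined with failed contact
    # attempts is read as "could not reach us", not "nothing to complain about".
    if c.tickets == 0 and c.failed_contact_attempts > 0:
        return "high"
    return naive_risk(c)


customers = [
    Customer("satisfied", tickets=0, failed_contact_attempts=0),
    Customer("gave_up", tickets=0, failed_contact_attempts=4),
    Customer("frustrated", tickets=5, failed_contact_attempts=0),
]

naive = {c.name: naive_risk(c) for c in customers}
reviewed = {c.name: reviewed_risk(c) for c in customers}
```

With the ticket count alone, "satisfied" and "gave_up" are indistinguishable. The second signal only exists if someone who knows the support process thinks to ask for it.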

What features actually encode

Every feature is a hypothesis about the relationship between some measurable quantity and the outcome you are trying to predict. That hypothesis has to come from somewhere. In most real projects it comes from:

  • Domain expertise ("experienced operators know that X correlates with Y")
  • Historical patterns in the data ("when we look at churned customers, they tended to have Z")
  • Shortcuts and proxies ("we cannot measure what we really want, so we use W as a proxy")

Each of these is a business judgment, not a mathematical one. The data scientist can implement the hypothesis. They cannot verify whether the hypothesis is sound without input from someone who understands the business deeply.

The proxy problem

Proxy features are particularly worth examining. A proxy is a measurable variable that stands in for something you cannot directly measure. "Time since last purchase" is a proxy for engagement. "Days since last login" is a proxy for interest. These proxies are usually reasonable, but they can fail in specific conditions that the model will encounter in production.

Before accepting a proxy feature, the question to ask is: in what circumstances would this proxy become a misleading signal? If you can construct a plausible scenario where a disengaged customer looks engaged by this measure, or an engaged one looks disengaged, the feature needs more careful treatment.
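One way to run that exercise is to write the scenarios down and check them against the proxy. The sketch below is a hedged illustration: the 14-day threshold, the scenario list, and the `truly_engaged` labels are hypothetical and would come from someone who knows the customers, not from the data.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    description: str
    days_since_last_login: int
    truly_engaged: bool  # the business reviewer's judgment, not a measurement


def proxy_says_engaged(days_since_last_login: int, threshold_days: int = 14) -> bool:
    # Hypothetical rule: a recent login counts as engagement.
    return days_since_last_login <= threshold_days


def misleading_scenarios(scenarios: list[Scenario]) -> list[str]:
    """Return descriptions of scenarios where the proxy contradicts the reviewer."""
    return [
        s.description
        for s in scenarios
        if proxy_says_engaged(s.days_since_last_login) != s.truly_engaged
    ]


scenarios = [
    Scenario("logs in weekly and uses the product", 3, True),
    Scenario("logged in yesterday only to cancel", 1, False),
    Scenario("uses the API daily, never opens the dashboard", 90, True),
    Scenario("has not logged in for months", 120, False),
]

flagged = misleading_scenarios(scenarios)
```

Here `flagged` holds the second and third scenarios. If the list comes back non-empty, the proxy needs either a companion feature or a documented caveat.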

Where the business review fits

I am not arguing that every feature needs a board-level sign-off. I am arguing that feature selection should include a structured review by someone with business knowledge before a model goes into production.

The review does not need to be exhaustive. The questions are:

  • Does this feature make intuitive sense as a predictor of the outcome?
  • Can it be gamed or manipulated once the model is in use?
  • Does it encode any assumption about our customers or operations that we might want to revisit?
  • Is the definition of this feature stable over time, or is it likely to change as business processes change?

A practical point

Feature documentation is worth treating as a business document, not just a technical one. For each feature in a model, note what it measures, what assumption it encodes, and what would invalidate it. That document is what allows the model to be maintained, audited, and updated by someone who was not in the room when it was built.
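As a rough sketch, the three notes can live next to the model code as a small record per feature. The class name, field names, and example feature names below are assumptions chosen for illustration; a spreadsheet with the same three columns would serve equally well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDoc:
    name: str
    measures: str        # what the feature measures
    assumption: str      # the business assumption it encodes
    invalidated_by: str  # what would make that assumption false


FEATURE_DOCS = {
    doc.name: doc
    for doc in [
        FeatureDoc(
            name="support_ticket_count",
            measures="Support tickets submitted by the customer",
            assumption="More tickets means lower satisfaction and higher churn risk",
            invalidated_by="Customers being unable to reach support at all",
        ),
    ]
}


def undocumented(model_features: list[str]) -> list[str]:
    """Return model features that have no documentation entry."""
    return [f for f in model_features if f not in FEATURE_DOCS]


missing = undocumented(["support_ticket_count", "days_since_last_login"])
```

A check like `undocumented` can run before a model ships, so that a feature without a written assumption is caught while the people who chose it are still available to explain it.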

This is less glamorous than hyperparameter tuning. It is more important.
