m@ksim.pro
Back to all posts
IT 3 min read

IT system resilience when conditions shift fast

How a manager should think about IT infrastructure resilience when the external environment changes quickly and unpredictably.

There is a difference between reliability and resilience. A reliable system works well under predictable conditions. A resilient system keeps working when conditions change unexpectedly.

In early 2022 this distinction became immediately practical for many companies. Instability in the external environment (economic, regulatory, logistical) forced a review of assumptions that had been baked into IT architecture during quieter times.

I want to break down how a manager should think about system resilience without going into technical details.

What makes a system fragile

Fragility is rarely visible in calm periods. It shows up when something goes differently than planned.

A few typical sources of fragility in IT infrastructure:

  • dependence on a single supplier with no alternative: one cloud provider, one software vendor, one network connection;
  • critical data and processes running on systems that nobody maintains or understands;
  • no tested recovery procedures: backups exist on paper, but restoration has never been tried in practice;
  • key expertise concentrated in one person or one contractor;
  • integrations between systems held together by manual processes that break under pressure.

Each of these is a potential point of failure that is invisible in normal conditions.
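For teams that keep their system inventory in a spreadsheet or a script, the list above can be turned into a simple check. The sketch below is only an illustration: the `SystemRecord` fields, the thresholds, and the sample systems are hypothetical, not a standard schema.

```python
from dataclasses import dataclass


@dataclass
class SystemRecord:
    """One row of a hypothetical inventory of critical systems."""
    name: str
    supplier_count: int         # independent suppliers able to provide this component
    has_maintainer: bool        # someone actively maintains and understands it
    restore_tested: bool        # a restore from backup has been done in practice
    people_with_knowledge: int  # people or contractors able to operate it
    manual_integrations: int    # integrations that depend on manual steps


def fragility_flags(record: SystemRecord) -> list[str]:
    """Return the fragility sources from the list above that apply to one system."""
    flags = []
    if record.supplier_count <= 1:
        flags.append("single supplier")
    if not record.has_maintainer:
        flags.append("no maintainer")
    if not record.restore_tested:
        flags.append("restore never tested")
    if record.people_with_knowledge <= 1:
        flags.append("knowledge held by one person")
    if record.manual_integrations > 0:
        flags.append("manual integrations")
    return flags


# Made-up sample data, for illustration only.
inventory = [
    SystemRecord("billing", 1, True, False, 1, 2),
    SystemRecord("website", 2, True, True, 3, 0),
]

for record in inventory:
    print(record.name, fragility_flags(record))
```

The value is less in the code than in being forced to fill in every field for every critical system. A blank you cannot fill in is itself a finding.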

Three levels of resilience

It helps to think about IT resilience at three levels.

The first is operational resilience. The system continues to function when an individual component fails. This is achieved through redundancy, replication, and load balancing. Most mature companies address this level.

The second is recovery resilience. The system returns to operation after a serious failure within an acceptable time. This requires not only technical tools but also rehearsed procedures: who makes the decision, who does what, where credentials are stored, who notifies customers. Many companies underestimate this level.
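"Within an acceptable time" only means something if the time has been measured. A minimal sketch of how drill results could be compared against a target follows. The function name, the status labels, and the one-year freshness limit are assumptions made for illustration, not an established rule.

```python
from datetime import date, timedelta
from typing import Optional


def drill_status(
    last_drill: Optional[date],
    restore_hours: Optional[float],
    acceptable_hours: float,
    today: date,
    max_age_days: int = 365,  # assumed freshness limit; choose your own
) -> str:
    """Classify one system by its most recent recovery drill."""
    if last_drill is None or restore_hours is None:
        return "never drilled"
    if today - last_drill > timedelta(days=max_age_days):
        return "drill out of date"
    if restore_hours > acceptable_hours:
        return "recovery too slow"
    return "ok"


# Made-up example: the business tolerates 4 hours of downtime,
# but the last drill took 9.5 hours to restore service.
print(drill_status(date(2024, 3, 1), 9.5, 4.0, today=date(2024, 6, 1)))
```

The order of the checks is deliberate: a system that has never been drilled, or was drilled too long ago, is reported as such before any measured time is trusted.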

The third is adaptive resilience. The system allows the business to restructure operations when external conditions change: switch suppliers, move workloads to different infrastructure, restrict or expand functionality. This is the rarest level, and it is the one that becomes critical during unstable periods.

Practical audit questions

A few questions worth asking your team right now:

  1. Which of our systems are truly critical, meaning the business stops without them? When was their resilience last tested?
  2. Do we have alternative suppliers for the key components of our infrastructure?
  3. Has the disaster recovery plan been tested with a real drill, not just documented on paper?
  4. If a key employee or contractor became unavailable today, where are the access credentials and documentation stored?
  5. What happens to our data if an external service shuts down or becomes inaccessible?

These are not paranoid questions. They are standard operational audit questions that most companies defer until the first serious incident.

Where to start

I do not recommend trying to solve everything at once. A good first step is to list your critical systems and assess which of the risks above is most likely in your specific situation.

Resilience is not achieved in a single project. It is a gradual reduction of risk concentration: adding alternatives, documenting procedures, periodically testing recovery plans.

In volatile periods, this is work that pays for itself.
