Whenever you see rusted pipes, chipped-off paint or any unmaintained area in a plant, beware that some fault propagation is in the making. A bigger tragedy almost always grows out of small, local failures, and it takes shape through a collective failure to comprehend and act.

Let me take the most bizarre example of a failure and its propagation: the Bhopal gas tragedy. It happened at a plant producing insecticide, and when the leak happened the plant was in a shutdown condition.

As the output of the plant, ‘Carbaryl’, which was an insecticide, did not sell well, the plant had taken a series of measures to cut costs. From manpower to maintenance hours to replacement of rusted pipes, almost every corner was cut. On the day when the plant was closed, water entered a tank containing methyl isocyanate (MIC) because the water pipes had rusted. The reaction raised the temperature of the tank to 200 degrees Celsius and produced a mixture of toxic gases, 42 tonnes of which were released into the atmosphere. Bad maintenance, leaking valves, the absence of safety systems during shutdown, no drills of the safety systems, large storage tanks instead of several smaller chambers, and the absence of a working MIC tank refrigeration system, which alone could have prevented the disaster, all contributed to the tragedy.

Some have gone so far as to say that these were symptoms of a far bigger problem: the culture of the company, which was itself a breeding ground for fault propagation. Speaking up about things going wrong within an organization can be the most difficult thing to do, and that difficulty may be the root cause of why an error propagates.

The Challenger disaster at NASA is one of the best examples of collective failure, but it is also the best example of a failure that prompted new learning and transformed NASA. A task as complex as a Space Shuttle launch is vulnerable in many respects to individual, group, process and systemic failures, but the most potent of them is the collective failure to arrest the root cause and its propagation.

In this case the O-rings in the solid rocket booster seals simply did not work in the cold of the launch day; never before had NASA launched in such cold, with temperatures about 20 degrees lower than in any previous launch. Some engineers had actually detected this flaw, but it was never easy for that discovery to reach the centers of decision making.

In both these examples a complex web of people and processes was caught up in deciphering several problems at once, but the solution, which could have been found individually or collectively, was never systematically co-created. Instead, a small error propagated through a maze of controls to create some of the most devastating tragedies of our times.

The propagation of a small error, and the absence of systems to detect it and mitigate its impact, pervade all systems, from Space Shuttles to the simplest problems that we encounter in factories and everyday businesses.

Think of the salesman at the counter who is the first to detect a positive or a negative customer experience with a product but never passes this information in a meaningful manner to the teams who are developing products. Or think of the quality assurance person who is the first to encounter the symptoms of a process failure in the raw attributes of the product but never passes this information to the process teams. Each is the starting point of a link failure that propagates through the system.

The system is run by diverse teams, and within teams too there is a diversity of experiences and skills. Diversity acts to mitigate the risk of ‘same-thinking’, the habit of clones who will always accept a hypothesis from a single world view. Whether an error propagates depends on collective identification or its absence, and some teams simply do not have the wherewithal to act because there is no shared belief or understanding. Sometimes there are no shared values either.

Is diversity of beliefs, skills, training and experience therefore good for teams, given that it may also be the reason why collective identification of a fault fails to take shape?

Research has found that teams with reasonable levels of diversity, combined with a stronger presence of collective identification, may be better positioned to steer through their errors. Coordinated individual performance in real time, the ability to adapt to dynamic goals and contingencies, and a capacity for continuous improvement all have the potential to arrest error transmission.

Although complex teams may be more susceptible to errors that occur due to a breakdown in teamwork (e.g., coordination/collaboration), their high level of workflow interdependence may facilitate mutual performance monitoring and create redundant systems for trapping errors. However, when performance demands are high, teams may have trouble detecting errors, because in complex action cycles error signals are often unclear.

But human beings need a safety valve before they will surface problems, because they fear being harmed if they highlight inherent flaws in the way the organization works. Psychological safety, which is defined as “a shared belief that the team is safe for interpersonal risk taking”, is fundamental to collective identification of an error.

The ability to speak up in a non-threatening and respectful manner (deference to expertise) is a hallmark of learning organizations and the teams within them.

The next time you see people hesitating to speak up, beware: it is a breeding ground for fault propagation, and collective failure is in the making. Building a self-correcting, fault-proof system requires an open culture, shared beliefs and diversity in teams.

But above all, leadership must be committed to creating an environment where it is safe to highlight flaws in the systems rather than hide them; that is fundamental to arresting collective failure.

Understanding fault propagation and collective error
