Risk Management – Tailoring Safer Systems

There is a lot of talk about measuring safety, but that is easier said than done. This article shares some reflections on why.

Measuring what?

Before starting to measure, one needs to know what one is measuring. How you define safety will determine what you measure and how you measure it. Let us illustrate the problem with three common views on safety. As you will see, none of them covers the subject entirely, and all have advantages and disadvantages.

Safety as compliance

A very basic way of thinking: safety is following the safety rules, and being compliant with these rules is being safe. This corresponds to the almost automatic reaction many people have after an accident: if only they had followed the rules, this would not have happened. Many investigations therefore focus on breaches of protocol and other deviations. In ‘normal’ situations, too, the emphasis is on compliance: wear the mandatory safety gear, hold the railing. Striving for compliance also appeals to the human tendency towards conformity. We are social creatures, after all.

Safety rules are important. They are a basic way in which we teach safety: “Don’t touch the stove, it’s hot!” “Look left, right, left before crossing the street.” We teach such things to our children, our workers and others. Safety as compliance works reasonably well in fairly simple, ordered and predictable systems. In these situations, you have a reasonable chance to foresee what can happen and to devise actions for dealing with variations. If you are on known territory, you can handle what happens by applying prescribed routines. Following ‘best practice’ means acting safely, while acting outside these scripts is regarded as unsafe.

Safety rules are not perfect, however. We live and work in a world with a lot of variability and we have a limited amount of foresight. This means that we cannot write rules for every eventuality. If we could, the rules would be impossible to handle because of their sheer volume. Besides, rules depend on context. In London it is smarter to look right, left, right before crossing, while this is not the best strategy for Zürich.

Rules are compromises and are sometimes not enough to keep you safe. Even if you follow all the traffic rules, you can still have an accident, for example when others do not follow them. In some situations, following the rules is even the unsafe option. One (in)famous example is the Piper Alpha disaster of 1988, in which the people who followed the emergency procedures died, while those who ignored the procedures and simply jumped overboard survived.

Safety as an absence of accidents

Go out on the street and ask a hundred randomly chosen people, “What is safety?” Chances are that many will answer something along the lines of “Not having any accidents”. This makes intuitive sense to most people. It feels right because in our minds safety and accidents are closely linked: when we have not had any accidents, we have been safe. Or have we? Not necessarily. That nothing has happened does not mean that things are safe. In many cases it only means that nothing has happened yet. Then again, it may well be that nothing ever happens.

A simple test is to reverse the definition and see whether it still works. Is “the absence of accidents is safety” true? The absence of accidents can be achieved in other ways. Randomness or luck is one possible factor. How you define an accident is another. Whether people choose to report accidents is yet another. However, accidents do give an indication of safety, or rather of unsafety. An accident can be regarded as a manifestation of risk, which brings us to the next definition.

Safety as acceptable risk

Whatever you do, there is some risk involved. We cannot avoid this. We even want some risk, but not too much. We need to compromise between various goals (financial, safety, production, quality, etc.), between uncertainty and control. We have only limited resources (money, time, expertise, etc.). Therefore, we must make trade-offs and search for balance.

This view of safety appeals to rational creatures. It suggests deliberation and decisions based on ‘facts’. We will always face risks; we just have to make sure that they are acceptably low. The question is therefore what the right level of risk is. We should obviously try to put sufficient ‘distance’ between ourselves and the hazard, and between ourselves and the negative futures the hazard could lead to. But we do not want too much distance either: it has to be practicable and affordable. Besides, some hazards we actually desire. Just think of drinking coffee. We want our coffee hot, but we do not want to burn ourselves. Therefore, we tend to sip our coffee carefully at first, or blow on it a bit, instead of gulping it down at once.

The view of safety as acceptable risk is useful, but it also has some drawbacks. These include its reliance on knowledge, its tendency to lead to quantitative approaches to risk that look more objective than they are, the danger of a static view of safety, and the difficulty of monitoring the risk level. Then there is, of course, the problem of who decides what is ‘acceptable’, and on what basis. Who determines what is included in the assessment, which factors weigh in and how much? Who is allowed to participate in the process, and how? What language is used during the process and in communicating the results?

One example of the latter is how consequences are selected and expressed. Certain risk assessments focus on fatalities, but those are often not the only bodily consequences. So, what to do with injuries? Should one decide that a certain number of severe injuries equals one fatality? Or should we, as one often sees, translate fatalities and injuries into monetary units? Is that really a good and fair measure? Can you put a number on a human life? And if so, what number? Sure, you can estimate a person’s economic contribution to society and to his or her family, but a person is so much more than that economic contribution.

Challenges

Each of the above views of safety brings its own ways of measuring safety. Regard safety as compliance and you may track citations from the inspectorate, or observations of unsafe acts (e.g. not wearing protective equipment). If safety is seen as the absence of accidents, you will naturally follow up on accident and injury reports. Those who have adopted a risk view of safety may keep some kind of risk register, present the most important risks in a risk matrix or heat map, and follow up on actions to control those risks.

How you define safety will influence your choice of what to measure, and vice versa. What you measure may very well become your definition of safety, consciously or not. If corporate policy, an ISO standard or the regulator requires you to record accidents and near misses as part of your monitoring, it will feel very natural to talk about these metrics when someone asks, “How are we doing on safety?”

Another challenge is that management dashboards and scorecards allow only limited space for presenting how things are going. Managers are busy people, and they would very much like clear, concise and unambiguous answers. However, safety is a complex phenomenon, so we need a variety of measures to give a reasonable description. Every view shows some elements of safety, but never the full picture. A good answer therefore needs rich information and nuance. There is thus a tension between the space and attention available and what is needed to give a high-quality answer.

Dumbing it down into an easy measure, no matter how intuitive, will not do justice to the subject. A fatality- or injury-based metric captures only a tiny part of a very complex phenomenon. It would be like describing a river exclusively by its temperature, which, by the way, depends more on the river’s surroundings, location and season than on the river ‘itself’, just as injury rates may correlate more strongly with the context than with the safety efforts initiated by the organisation. A trade-off between thoroughness and efficiency is inevitable, and addressing it carefully in the management system is essential.

This article is an adapted and abbreviated chapter from the book If You Can’t Measure It… Maybe You Shouldn’t. Reflections on Measuring Safety, Indicators, and Goals.

Today, companies are faced with ever-increasing complexity. On the one hand, companies themselves are complex, socio-technical systems, and on the other hand, they are embedded in a complex environment with numerous known and unknown factors. How can an organization successfully keep up with these constantly increasing demands?

In this article, I will focus on High Reliability Organizations (HRO) and the related High Reliability Theory (HRT). The concept of High Reliability Organizations originally comes from organizations that operate successfully in high-risk industries. Examples include air traffic control, nuclear power plants, aircraft carriers, power grid operators and similar fields of activity. However, I am personally convinced that the insights gained from HROs and the associated operating principles can be transferred to any organization and, as with the classic HROs, make it more successful.

The emergence of High Reliability Theory

In 1984, sociologist Charles Perrow published his book “Normal Accidents: Living with High-Risk Technologies”, in which he introduced Normal Accident Theory (NAT) based on an analysis of the 1979 reactor accident at the Three Mile Island nuclear power plant in the USA. His reasoning was that complex and tightly coupled systems would inevitably lead to a catastrophic accident sooner or later, and that it was therefore irresponsible to operate such systems, for example nuclear power plants. This theory appears entirely plausible, but it has one problem: it cannot be falsified. As Perrow repeatedly replied to his critics, even if there has never been an accident, it is only a matter of time before one happens. Who can refute a forward-looking statement?

NAT has rightly received a lot of attention in the safety sciences, and Perrow has been cited thousands of times. But the question arose why some organizations nevertheless manage to operate complex, tightly coupled systems with virtually no incidents. This question inspired the Berkeley scholars Gene I. Rochlin, Todd R. La Porte and Karlene H. Roberts to study such organizations more closely and, in 1987, to publish an article on High Reliability Organizations in response to Perrow’s NAT, which gave rise to High Reliability Theory (HRT). Using a best-practice approach, the Berkeley scholars examined why flight operations on an aircraft carrier in peacetime, where aircraft move at high speed in a tight space in the presence of dangerous goods such as fuel and weapons, had not led (or at least not yet) to a catastrophic accident.

Like NAT, HRT received a lot of attention. It was followed by numerous publications, and articles and books on the topic are still being published today. Perrow’s response to HRT was not long in coming: while the Berkeley scholars considered their work complementary to NAT, Perrow disagreed and contradicted them directly. From 1987 onward, numerous studies and surveys were conducted in industries such as nuclear power, aircraft carrier operations (in peacetime), power grid operation and air traffic control. These studies contributed to the further development of HRT. In 2001, Karl E. Weick and Kathleen M. Sutcliffe published the first edition of “Managing the Unexpected”, in which they distilled the findings into five HRO principles, which I explain below. A second edition of “Managing the Unexpected” followed in 2007 and a third in 2015.

The five HRO principles

The first three of the five HRO principles are primarily concerned with prevention: the focus is on preventing serious incidents and accidents.

Preoccupation with failure. Being preoccupied with failure means thinking about what could go wrong and looking for weak signals in the system. Effective risk management can carry out these activities, but they should also be firmly anchored in the everyday work of all employees. When analyzing risks, it is important to understand what could happen. It is even more important, however, to get to the bottom of why it could happen.

Reluctance to simplify. We tend to interpret situations in ways that fit our personal worldview. This applies to individuals and entire organizations alike. Reluctance to simplify means questioning the existing worldview and looking at situations from different perspectives.

Sensitivity to operations. In an HRO, operational processes set the pace of the organization. The organization pays close attention to detail and identifies weak signals. Working relationships are strong, and there is a culture of trust that enables employees to speak openly about irregularities and concerns related to operational processes.

But an HRO is also aware that, despite all efforts, mistakes cannot be completely avoided. The last two of the five HRO principles focus primarily on coping with errors so that they do not develop into a crisis.

Commitment to resilience. Errors occur in HROs, too. HROs do not try to be error-free. Instead, they are able to recover quickly from mistakes before a crisis arises and, if a crisis does occur, to get out of it quickly.

Deference to expertise. If an HRO is in a critical situation or even a crisis, the decisions on how to deal with it are made “at the front” by the relevant experts, not necessarily by management. Figuratively speaking, the hierarchy pyramid is turned upside down. Once the crisis has been overcome, the hierarchy returns to normal.

Why become a High Reliability Organization?

The five HRO principles help a company become a mindful organization that is able to focus on operational processes, detect weak signals, learn from irregularities and manage potentially dangerous situations before they turn into a full-blown crisis. The benefits are manifold. For example, losses from a crisis, be it a production shutdown or a far-reaching scandal, can be avoided or limited, and a significant competitive advantage can be gained by optimizing operational processes. If a company is perceived from the outside as an HRO, this can improve its public image and can also play a decisive role in the war for talent.

In conclusion, the concept of the High Reliability Organization can support not only companies in high-risk industries but, in principle, any company seeking to sustain itself in an increasingly complex environment and to build a competitive advantage.
