Tag: Resilience Engineering

  • What it means to tailor a system

    A tailor’s shop, somewhere downtown. A back room, two mirrors, a table covered with bolts of fabric, chalk and pins on a board. A man stands on a low platform, in a rough cut of light wool, and the tailor walks around him. He doesn’t just take measurements. He observes. He sees how the customer shifts his weight, how he holds his shoulders, whether the seam at the back pulls to one side. A chalk mark where it doesn’t yet sit right. Then the measuring tape again, then a stitch, then a fitting. Adjust. Try again.

    What’s happening here isn’t fitting a suit. It’s a conversation between fabric, body, and habit. The tailor knows that no human stands exactly the way the pattern assumes. He knows the seams meant to sit centred will shift the moment the person moves. He plans for it. He builds in reserve at points where he knows the fabric needs room to settle. He isn’t surprised when the customer has to come back twice more. That’s his craft.

    What does this workshop have to do with safety? That’s the question this magazine owes its name to. I made the case in the opening article Three Assumptions We Need to Leave Behind; in short: safety doesn’t arise when people adapt to systems, but when systems are designed so they can be adapted to people. Tailoring Safer Systems. Measure, draft, fit, wear, adjust. The cycle repeats, only with different material. And just as in tailoring, it isn’t a one-off act but a stance.

    What this stance asks of us, in concept and tool, I want to lay out here. Three principles, each with a term you’ll recognise from the literature.

    Measure, don’t assume

    The tailor who doesn’t put down the measuring tape knows something that many safety departments treat as needless effort: that reality isn’t in the pattern.

    Steven Shorrock and Claire Williams, in Human Factors and Ergonomics in Practice, frame the distinction that’s been at the centre of the human-factors tradition since Hollnagel so simply that it works as a test. Work-as-Imagined is the picture designers, auditors, and executives have of how work gets done. Work-as-Done is what people actually do. Between them there’s regularly a gap. The question isn’t whether the gap exists. It always does. The question is whether the organisation knows it.

    Whoever doesn’t know it tailors into the assumption. They design procedures on the basis of what their model says. And the model says what’s convenient, what’s legible by audit standards, what sounds executive-ready. The procedure fits the assumption, not the practice. Within a short time, practice and procedure drift apart without anyone noticing, because no one ever measured how the fabric actually hangs.

    What measuring actually means isn’t spectacular. It means observing. It means walking the floor: what the Lean tradition calls a Gemba Walk, and what the safety world circulates under terms like “operational learning visit”. It means shadowing across more than one shift. It means asking questions open enough that they don’t contain the answer: not “Do you stick to the procedure?”, but “When was the last time the procedure didn’t fit your situation, and what did you do instead?”

    These questions regularly produce answers no one wants to hear. People describe workarounds that look like violations from the compliance side and like the only way through from inside the system, on a day when a tool is missing, a stand-in is new, or the plant has been moody since the update. The temptation is to read these answers as defect. And to close the case there. The work is to read them as finding.

    Whoever measures accepts what they see. What they see is regularly not what’s in the pattern. That’s exactly why they’re there.

    Measuring isn’t compliance on trial. It’s the willingness to see something that contradicts your own assumption.

    Respect the fabric

    Not every fabric can be tailored any way you like. Treat a soft knit like a firm wool and the seam won’t hold. The tailor knows the material’s properties before drafting the cut, and adapts the design to the fabric, not the other way around.

    Transposed to organisations: context, culture, and history are the material with which a system is tailored. What works in an airline where Crew Resource Management has been embedded practice for decades doesn’t translate directly to an industrial organisation where hierarchies are lived differently and “Stop the Line” still has to be explained as a concept. What takes hold on a ward where the unit lead has built a reporting culture over years comes to nothing on another ward, where every report passes through two layers of HR before anyone gets to see it.

    David Snowden’s Cynefin framework helps at this point. Simply put, and leaving its other domains aside, it distinguishes between two kinds of problems: complicated and complex. Complicated problems are those where the link between cause and effect can be made visible with enough expertise: a machine, an accounting system, a construction plan. Best practices work here. Complex problems are those where cause and effect are only readable in hindsight, because the system shifts on every intervention. Culture, risk behaviour, learning capacity belong in this category. Best practices don’t work here. What worked in one organisation isn’t guaranteed to work in the next.

    The most common mistake in safety programmes I work with is mixing these two up. A proven concept from a best-practice collection gets sold as a universal solution, draped over an organisation made of different fabric. And everyone’s surprised when the seam doesn’t hold. What the organisation needed wasn’t the solution. It was the diagnosis: what kind of fabric is in front of us?

    Respecting the fabric doesn’t mean finding everything fine as it is. It means checking the cut against the material before reaching for the scissors. Whoever skips that builds a safety programme that fits the quarterly report, not the practice.

    Build adjustment in

    A good cut has give. The tailor doesn’t pull the fabric so tight that it tears at the first breath. He knows the body changes, the day changes, the fabric settles after the first few wearings. He builds that in. Where he leaves room, where he doesn’t, is craft. Eliminate the give, bind yourself to the exact measurement, and you get a garment that fits exactly once. Not the next moment.

    Erik Hollnagel’s work has circled this insight in safety language for years. In FRAM (the Functional Resonance Analysis Method), he argues against linear incident models that read variation as defect. Variation, Hollnagel writes, isn’t the opposite of function. It’s a condition of function. Complex socio-technical systems work because their components (people, tools, procedures) are flexible enough to respond to conditions that aren’t in the plan. When the plan tries to switch off this variation, it switches off adaptive capacity at the same time.

    In practice this means: a good procedure describes not only the intended path, but makes visible the conditions under which it holds. It knows the assumptions it makes, and it knows the places where it will break if those assumptions fail. A good procedure is aware of its limits. More than that: a good system keeps resources free that aren’t tied to the plan (slack in the staffing plan, time in the shift, room in the communication), because without these no adjustment is possible. What looks like inefficiency is the precondition for the system to make it through the day on which reality departs from the plan. And it departs. Every day.

    Building adjustment in means giving the system permission to adjust. Not afterwards, in the case of damage, but beforehand, in the design. It means shaping room deliberately rather than tolerating it by default. And it means making visible what otherwise stays hidden: that the workarounds no one admits to are often the last adjustments an over-standardised system still allows.

    What tailoring isn’t

    Off-the-rack sits in the warehouse waiting for someone it fits. It’s efficient, it’s cheap, it’s clean in the reporting. It’s a complete solution as long as the measurement is right. When it isn’t, it becomes the source of a quiet compromise: the person adapts to the suit, holds their shoulders differently, breathes shallower, moves as if they belonged in the pattern. For a while, this goes well.

    Safety from the compliance catalogue works on exactly this logic. It comes with finished procedures, standardised KPIs, audit templates that fit everything because they look at nothing. The problem isn’t that it’s structured. The problem is that it takes its own description of the system for the system. When reality departs from it (and it does), no adjustment is provided for in the catalogue. What remains is the admonition to please stick to the procedure.

    In contrast stands the tailor who doesn’t put down the measuring tape. Who knows he’ll have to come back twice. Who respects the give in the fabric. Who doesn’t finish the cut today, but draws it in conversation with what’s in front of him. Who accepts that the end product isn’t perfect on the first attempt, and that adjustment is part of the craft, not an admission of error.

    This is what Tailoring Safer Systems means. Shape room rather than eliminate it. Make adjustment visible so the system can learn from it. This is harder work than a dense catalogue. It’s also the only thing that works under conditions where the next measurement is already a different one.

    Sources

    • Steven Shorrock & Claire Williams (Eds.) – Human Factors and Ergonomics in Practice: Improving System Performance and Human Well-Being in the Real World, CRC Press 2017
    • Erik Hollnagel – FRAM: The Functional Resonance Analysis Method – Modelling Complex Socio-technical Systems, Ashgate 2012
    • David J. Snowden & Mary E. Boone – A Leader’s Framework for Decision Making, Harvard Business Review, November 2007
    • Erik Hollnagel – Safety-II in Practice, Routledge 2018
  • Three Assumptions We Need to Leave Behind

    It is the night of 28 March 1979, shortly after four in the morning. In the control room at Three Mile Island, Unit 2, an indicator says: pressure relief valve closed. It says that because it doesn’t measure position. It displays the control signal, the command that was sent to the valve to close. What the valve is actually doing, nobody in the room knows. It has been open for two minutes and thirteen seconds, and it will stay open for the next two hours.

    In the hours that follow, the operators will do something that the later investigation will identify as the primary cause of the partial meltdown: they throttle back the emergency cooling. They do it because their instruments tell them the system is over-pressurised, and because their training has taught them to avoid exactly that condition. They act rationally given what they see. In the days that follow, the press will speak of “human error.”

    This reflex (the diagnosis of “human error” that follows a scene like this almost automatically) sits behind most of the safety conversations I have in consulting practice. Not because those involved are unwise. But because three assumptions are so deeply embedded in our safety tradition that they pass as common sense. We read them differently. What follows are three counter-positions, one per assumption.

    “Human error” is a diagnosis, not a finding

    Anyone working in this field knows the statistic: 80 to 90 percent of all incidents are attributed to “human error.” The number has been cited since the 1980s in talks, audits, executive reports, and it works: it makes plausible that the answer to safety problems must lie with people. More training, clearer standards, stricter discipline. The logic is clean: if the problem sits in the cockpit, the solution must sit in the cockpit too.

    The problem with this logic isn’t the statistic. It’s the interpretation. Sidney Dekker puts it in his Field Guide so sharply it hurts: “human error” is never the end of an investigation, it is the beginning. Whoever explains incidents this way has stopped asking: they have found a label and settled into it. Local rationality, the concept Dekker keeps sharpening, says: nobody comes to work intending to take a reactor into meltdown, harm a patient, or bring an aircraft down. What looks like failure from the bird’s-eye view of an investigation made sense at the moment of action, given what the person could see, given the pressure, given the training.

    Reconstructing that sense is the actual work.

    Hollnagel adds a second thread. His Safety-II argument runs, simplified: the same thing we call “failure” is the other side of an adaptive capacity without which the system wouldn’t function for an hour. People accomplish daily what procedures cannot accomplish on their own: they interpret context, they improvise when reality diverges from the script (which it does constantly), they fill the gaps that designers and rule-books have left open. Whoever treats people as a weak point cuts themselves off from the only real source of resilience the system has.

    Back in the TMI control room, read through this lens: the operators throttle the emergency cooling because their instruments say the system is over-pressurised, and because their training has sensitised them to exactly that risk. At the moment of action, their decision is the only coherent interpretation of the data available to them. That we know today the valve was open and the system under-pressurised rather than over: that is information the investigation had, not information the operators had. This asymmetry between investigator and actor, which the research vocabulary files under “hindsight bias”, is not a cosmetic flaw of method. It is the structural condition under which every incident investigation operates. Whoever doesn’t reflect on it sees in every past event what those involved could have done. And overlooks what they actually could see.

    In training sessions, I now routinely ask participants: what is the most frequent cause of incidents and accidents in your operation? The answer comes every time, without exception: human error. It comes fast, it comes self-evidently, and it comes before the actual work of the training has begun. Over the hours that follow, there is regularly a moment when something dawns on the participants. It isn’t a new term or an additional tool, but a shift of perspective: their own incident investigations, as they themselves recognise, have ended exactly where they should have begun. What that costs isn’t only a weaker investigation. It is the willingness of employees to report anything at all next time.

    The question that interests us more than “How do we prevent human errors?” is this: How does our system support the adaptive work people have to do for it to function at all?

    Human error is never an explanation. It is a diagnosis that says more about those diagnosing than about the incident.

    Compliance is a minimum, not safety

    The second assumption follows the first like a shadow. If people are the risk, then regulations, audits, and certifications are the instruments of control. Safety becomes a question of whether the right boxes are ticked. Executive teams read safety KPIs (lost-time injury rate, audit findings, training completion rates) and draw conclusions about the state of the organisation. The governance is clear, the reporting is clean, the responsibility is distributed. There is a reason this model survives so robustly: it interfaces well with law, insurance, and corporate reporting.

    The model has just one problem: compliance and safety regularly come apart. Boeing’s 737 MAX held FAA certification, a compliance status that was green by every auditable measure. And an MCAS system whose malfunction cost 346 people their lives. The Bristol Heart Scandal of the 1990s revealed a hospital whose internal safety indicators showed no clear anomalies, while paediatric cardiac surgery mortality had climbed to twice the British average. In both cases the signals were reported, by insiders no one wanted to listen to, because the compliance picture was clean.

    What happens between the audits is the actual safety story. Diane Vaughan, in her study of the Challenger disaster, coined a term for it: “normalisation of deviance.” Drift rarely arises as deliberate rule-breaking. It arises because, under real conditions, the system gradually departs from the norm (a small tolerance here, a step cut short there) and because these deviations mostly turn out fine. Every repetition without consequence widens the bandwidth of the acceptable, without anyone ever having made a conscious decision. From the audit perspective, this drift is invisible: on audit day the picture aligns again, because everyone knows what to show. From the perspective of learning capacity, it would be visible, if the organisation had the mechanisms to see it.

    What these cases share is not a compliance failure. It is a learning failure. Compliance is a property of a moment: it says that at time X rule Y was being followed. Safety is a property of a process: it says that the organisation is able to pick up weak signals, revise assumptions, and correct its own behaviour, before the next audit date comes round. The one is a state, the other is a capability. An organisation can be fully compliant at any given moment and at the same time completely blind to the drift it is in.

    The operational question that follows from this is not “Are we compliant?” It is: Do weaknesses become visible without being punished? Are near-misses treated as learning opportunities, or as reputational risks? Does the system get smarter after every incident, or just more defensive? Just culture, in the precise sense of Reason and Dekker, is the precondition. It is not the poster in the break room.

    It is the lived answer to what happens when someone admits something they could have kept quiet about.

    Standardisation creates brittleness, not resilience

    The third assumption is the most stubborn, because it speaks most directly to the safety reflex. When something goes wrong, we raise the level of standardisation. We write the next step into the SOP, we narrow the latitude, we formalise what used to be a matter of experience. The underlying assumption is clean and mechanical: variation is defect, uniformity is safety. What does not deviate cannot go wrong.

    The assumption holds for simple, linear systems. It does not hold for the systems we deal with in contexts adjacent to high-reliability organisations (HROs). Erik Hollnagel uses a precise word for the consequence of this reflex: brittleness. An over-standardised system loses the capacity to adapt to conditions its designers did not anticipate. It functions exactly as long as reality follows the script. And reality never follows the script all the way. The moment deviation arrives, the system has no reserve, no improvisational capacity, no repertoire other than “continue as planned.”

    What the Human and Organizational Performance (HOP) movement around Todd Conklin and others has been showing since the 2010s is banal and consequential at once: every functioning shift deviates from the script daily. Nurses combine orders that formally were not designed to be combined, because the original procedure does not fit the specific situation. Industrial operators put in small workarounds because a tool is missing or a step has to be skipped under time pressure. Pilots interpret checklists in an order that fits the situation. These deviations are not the problem. They are the safety. They are what carries the system through the day at all.

    Behind this stands a deeper insight from the resilience-engineering tradition: safety is not the absence of variation, but the capacity to absorb it. David Woods calls this “graceful extensibility”: the question of how far a system can be stretched before it breaks, and how it behaves while being stretched. Over-standardisation optimises for the normal case and ignores exactly this question. It makes the system efficient under ideal conditions and prone to brittle failure under real ones.

    What tailoring means is exactly this: shaping the latitude rather than eliminating it. Setting guardrails (the limits beyond which it becomes dangerous) and, within those guardrails, allowing adaptability, making it visible, keeping it learnable. This is more demanding than a thick rule-book, because it requires trust, conversation, and contextual knowledge. It is also the only thing that works under conditions where variation cannot be eliminated. Pilots who set the manual aside can be heroes or culprits. What they are depends on the system, not on themselves.

    What this means for us

    From this follows the position we write from: safety does not arise when people adapt to systems, but when systems are designed so they can be adapted to people: continuously, in operation, not in the audit room. Exactly this tailoring (this ongoing adaptation under real conditions) is the craft we want to lay out here. Not because the New View line is fashionable. It has been established in the literature for more than two decades. But because the operational gap between it and daily practice is still wide.

    In practice this means: We write about incidents to reconstruct conditions: the conditions under which reasonable people made reasonable decisions that turned out, in retrospect, to be consequential. Methods we treat as craft, requiring practice, judgement, and contextual knowledge. Organisations we read as learning-capable (or learning-incapable) systems.

    Back to Three Mile Island, shortly after four in the morning. Three operators stand in front of indicators, one of which shows the control signal rather than the position. They follow their training, they throttle the emergency cooling, because under suspected overpressure the procedure asks for exactly that. We can read them as the weak point of the system, or as the last people that night who acted by the rules they had been given. Which interpretation we choose decides what we build differently next time.

    What we build differently here is not, in the first place, an indicator that shows position rather than control signal. It is the willingness to change the question: not “Who failed?”, but “What made this, in that moment, plausible?” This question is more demanding. It does not lead to a person who can be sanctioned. It leads to a system that has to be rebuilt.

    Sources

    • Sidney Dekker – The Field Guide to Understanding Human Error, 3rd ed., CRC Press 2014
    • Erik Hollnagel – Safety-II in Practice, Routledge 2018
    • Todd Conklin – Pre-Accident Investigations, Ashgate 2012
    • Karl E. Weick & Kathleen M. Sutcliffe – Managing the Unexpected, 3rd ed., Wiley 2015
    • Charles Perrow – Normal Accidents: Living with High-Risk Technologies, Princeton University Press 1999 (on the TMI analysis)
    • Diane Vaughan – The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA, University of Chicago Press 1996
  • Reaching optimal Human Performance through effective System Design

    Designing automation for complex socio-technical systems so that human operators can perform optimally is a challenging endeavour. Especially in safety-critical environments, humans may need to adapt quickly to changing levels of demand, complexity and uncertainty in order to maintain the performance, efficiency and safety of operations. Under these conditions, humans may benefit from automation. In most cases, automation is designed to take over low-value tasks, i.e. tasks that are simple and easy to automate. Designing automation to support cognitively demanding tasks such as problem solving and complex decision-making is harder, for three reasons. First, it requires an understanding of all high-level tasks and their underlying (human) cognitive functions, of the extent to which automation currently supports these tasks, and of the resources humans need to execute them. Second, automating tasks means rethinking, at a higher level, how (cognitive) functions are distributed between humans and automation, what organizational structures are required, and how cognition is shared between the two (i.e. how humans can work effectively with automation). Third, it must be understood how automation should be designed to support humans in managing complex tasks, in particular when decisions have to be made or problems solved under rapidly changing demands, high complexity, and uncertainty. Creating such automation therefore requires a deep understanding of the strategies humans adopt when engaging in complex problem solving and decision-making, and of what they need from automation as support. This article provides an overview of how to tackle these challenges.

    Step 1: Understanding tasks and underlying (cognitive) functions of a system

    In most cases, we do not develop systems from scratch. We build on existing systems to improve safety, efficiency, or other performance dimensions. This means we have to understand which tasks and underlying (cognitive) functions currently exist and which of them are already supported by automation, in order to identify opportunities to automate further tasks or functions, or to improve existing automated functions.

    To identify which automation best supports the human in complex tasks (ensuring human-centric decision-making), we first need to identify all tasks and their corresponding (cognitive) functions. We also need to identify the current allocation of tasks (and underlying cognitive functions) between humans and automation. Some tasks may be allocated to humans, with various levels of automation support; some may be allocated fully to automation; and some may be allocated dynamically to either. It is necessary to understand how changing the allocation of tasks may affect the overall system in terms of interdependencies between humans and automation. A Cognitive Function Analysis (CFA) (Boy, 1998) is an important instrument for Human Factors Engineers and Designers (e.g. UX Engineers) to build an understanding of all tasks and underlying functions of a system, and of the implications of changing the allocation of functions between humans and automation. A CFA should draw on a wide range of techniques, including interviews, observations and documentation study. Interviews and observations matter because people often come to use the system differently than intended, and this is rarely documented.

    Step 2: Understanding the impact of function allocation on system stability

    Changing the allocation of functions between humans and automation may have an impact on system stability (Straussberger et al., 2008). When functions currently allocated to humans are automated, it therefore needs to be assessed what impact this redesign of human and machine cognitive functions will have on the overall stability of a complex socio-technical system. This ultimately determines how resilient the system is in responding to all operational demands. Stability exists on several layers. It is the result of organizational structures linked to procedures and technical systems, and it reflects a system’s ability to recover after a disturbance. The stability of socio-technical systems is defined through two processes (Straussberger et al., 2008):

    • Global socio-cognitive stability
    • Local socio-cognitive stability

    Global socio-cognitive stability is concerned with the appropriateness of functions allocated to humans or automation, the pace of information flows and related coordination, through designing appropriate structures linked to:

    • Authority
    • Responsibility
    • Controllability
    • Ability

    Issues may arise if these structures have not been adequately designed. For example, humans may hold formal responsibility without having the controllability or ability to execute certain tasks or high-level functions. Alternatively, functions may become fully allocated to automation while humans retain formal responsibility for them, without any control over their execution or ability to intervene. Issues may also arise when functions are dynamically allocated to humans or automation, or delegated to the system by humans, and the conditions that must be met for delegation are not transparent to humans or are simply not defined.
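    Once the allocation of functions has been written down, the mismatches described above can be checked mechanically. The following is a minimal sketch in Python, not part of the cited method: the `FunctionAllocation` record, its field names, the example functions and the rules in `stability_issues` are all illustrative assumptions. In particular, the "responsibility without authority" rule is added by analogy with the other structures and is not stated in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FunctionAllocation:
    """Hypothetical record of how one function is allocated (illustrative only)."""
    name: str
    performed_by: str                 # "human", "automation", or "dynamic"
    human_responsible: bool           # human holds formal responsibility
    human_has_authority: bool         # human may decide or override
    human_can_control: bool           # human can intervene in execution
    human_has_ability: bool           # human has the skills/resources to do so
    delegation_conditions: Optional[str] = None  # only relevant for "dynamic"


def stability_issues(f: FunctionAllocation) -> List[str]:
    """Return the structural mismatches found for one function."""
    issues: List[str] = []
    if f.human_responsible:
        if not f.human_has_authority:      # assumed rule, by analogy
            issues.append("responsibility without authority")
        if not f.human_can_control:
            issues.append("responsibility without controllability")
        if not f.human_has_ability:
            issues.append("responsibility without ability")
    if f.performed_by == "dynamic" and not f.delegation_conditions:
        issues.append("delegation conditions undefined")
    return issues


# Hypothetical example allocations
allocations = [
    FunctionAllocation("conflict detection", "automation", True, True, False, False),
    FunctionAllocation("traffic sequencing", "dynamic", True, True, True, True),
    FunctionAllocation("clearance delivery", "human", True, True, True, True),
]

for f in allocations:
    for issue in stability_issues(f):
        print(f"{f.name}: {issue}")
```

    A table like this does not replace the analysis. It only makes visible where formal responsibility and actual means of intervention have drifted apart, so that the conversation can start there.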

    Local socio-cognitive stability refers to humans’ workload, situation awareness, and ability to make appropriate decisions and take action. It relies mainly on humans’ ability to understand automation and to build a mental model of the system. Automated systems need to be designed such that humans are able to predict (anticipate) the responses of automated systems to human input, receive adequate feedback, and regain authority if needed (Boy, 1998). The transparency of automated functions also needs to be considered, so that humans can develop a valid mental model of the system, its functions, and its behaviour.

    Securing both global and local socio-cognitive stability establishes a common frame of reference, supporting joint situation awareness between humans and automated systems.

    Step 3: Design automation to support expert decision-making

    Designing automation to support human macrocognitive functions starts with understanding how human operators respond to high levels of complexity and uncertainty. Humans may need to adapt to changing demands, which requires anticipating, extrapolating into the future, and forming an assessment based on experience. They may also need to plan ahead and build capacity to manage situations in the near future, and to engage in strategies for dealing with future demands and unexpected situations. Such strategies may aim at either reducing or managing complexity and uncertainty. Examples of complexity and uncertainty management strategies include (Corver & Grote, 2016):

    • Anticipatory thinking (extrapolating the current situation into the future based on past experience and observed deviations)
    • Adaptive planning (i.e. creating back-up plans)
    • Weighing pros and cons of different options (comparing alternative solutions)
    • Forestalling (improving readiness, e.g. managing resources for future demands)
    • Reducing uncertainty (e.g. increasing the accuracy and reliability of data by integrating and validating information from different sources)

    Understanding these strategies is the starting point for designing automation that usefully supports operator decision-making and task execution in highly dynamic, highly complex situations. The following questions should be asked: What information is required, from which sources, and with what accuracy? What cues do human operators need in order to be alerted to deviations in time to respond adequately? What do humans consider when analyzing a situation and engaging in complex decision-making? Automated support tools can be designed to help humans filter, cluster and integrate information from different sources where it is needed, to extrapolate into the future, to be alerted when the situation deviates, and to make complex decisions based on operational trade-offs (Corver & Grote, 2016). An understanding of the tasks and information needs is what makes this kind of support possible, and with it quicker and better decision-making.
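    Three of the support functions named above (integrating information from several sources, extrapolating into the future, and alerting on deviation) can be sketched in a few lines. This is a deliberately naive Python illustration, not a design taken from the cited study: the function names, the source names, the use of the median for fusion, the two-point linear extrapolation and all thresholds are assumptions made for the example.

```python
from statistics import median
from typing import Dict, List, Optional


def fuse_sources(readings: Dict[str, float], tolerance: float) -> Dict[str, object]:
    """Integrate readings of one quantity from several sources.

    Uses the median as the fused value and lists every source that
    disagrees with it by more than `tolerance`.
    """
    fused = median(readings.values())
    disagreeing = sorted(
        name for name, value in readings.items() if abs(value - fused) > tolerance
    )
    return {"value": fused, "disagreeing_sources": disagreeing}


def extrapolate(history: List[float], steps_ahead: int) -> float:
    """Linear extrapolation from the last two samples (naive on purpose)."""
    if len(history) < 2:
        raise ValueError("need at least two samples")
    slope = history[-1] - history[-2]
    return history[-1] + slope * steps_ahead


def deviation_alert(expected: float, observed: float, threshold: float) -> Optional[str]:
    """Return an alert text if the observation departs from the expectation."""
    delta = observed - expected
    if abs(delta) > threshold:
        return f"deviation of {delta:+.1f} from expected {expected:.1f}"
    return None


# Hypothetical readings of the same quantity from three sources
fused = fuse_sources(
    {"radar": 350.0, "ads_b": 351.0, "flight_plan": 370.0}, tolerance=5.0
)
expected = extrapolate([340.0, 345.0, 350.0], steps_ahead=2)
alert = deviation_alert(expected, fused["value"], threshold=5.0)
print(fused, expected, alert)
```

    The design choice worth noting is that the sketch reports which sources disagree rather than silently discarding them. That keeps the operator able to judge the data, which is the transparency the previous step asks for.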

    In summary, identifying human macrocognitive strategies allows us to understand how automation can support human needs, and thereby to increase the overall performance of a system.

    References

    • Corver, S.C. & Grote, G. (2016). Uncertainty management in en route air traffic control: a field study exploring controller strategies and requirements for automation. Cognition, Technology & Work.
    • Boy, G. (1998). Cognitive Function Analysis. Westport, CT: Ablex, Greenwood Publishing Group.
    • Straussberger, S., et al. (2008). PAUSA for the future – A synthesis of Phase 1. June 2008. Final Report.