Quantitative Risk Assessment

Probability of Failure

A Probability of Failure PF can be given as;

For systems that run continuously or that deteriorate with time, PF/H makes more sense, whereas for systems that deteriorate with use but not while idle, PF/D makes more sense. We often call one use one "cycle".

Probability of failure should be written as PF/D or PF/H and probability of dangerous failure should be written as PF/DD or PF/HD, but often the / is omitted and the result is written PFD, PFH, PFDD or PFHD.

There are the added complexities that;

When we plot PF/H or PF/D against time or use cycles we end up with what is nicknamed "the bathtub curve"!

We expect that on any safety part, manufacturers will have tested a large enough sample of parts to plot the characteristic bathtub curve.

It's the part of the characteristic between the early-life and wear-out phases that is of interest, because that will determine the Service Life of the part. Of course no parts fail after the wear-out phase, but that's because they are all worn out!

The manufacturer needs to give the customer a guarantee that the PF/H or PF/D will be less than a certain figure for the Service Life of the part.

They could quote a higher PF and a longer Service Life, as marked in green on the graph, or they could quote a lower PF and a shorter Service Life as in blue.

The bathtub curve is rarely flat during the Service Life, but as long as the manufacturer's quoted PF/H or PF/D figure is bettered by the actual figure throughout, the situation is satisfactory.
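The trade-off between the quoted PF and the quoted Service Life can be sketched in a few lines of Python. The curve below is entirely hypothetical (its shape and every number in it are assumptions chosen for illustration, not data for any real part); the function simply finds the longest stretch over which the curve stays at or below a quoted figure.

```python
import math

def bathtub_pf_per_hour(t_hours):
    """Hypothetical bathtub-shaped PF/H curve (illustrative numbers only)."""
    early_life = 5e-6 * math.exp(-t_hours / 500.0)             # decaying early failures
    random_failures = 1e-6                                      # roughly flat middle
    wear_out = 1e-6 * math.exp((t_hours - 20000.0) / 2000.0)    # rising wear-out
    return early_life + random_failures + wear_out

def service_window(quoted_pf, step_hours=100, max_hours=40000):
    """Return (start, end) in hours of the longest run of sample points at
    which the curve is at or below quoted_pf, or None if it never is."""
    best = None
    run_start = None
    for t in range(0, max_hours + 1, step_hours):
        if bathtub_pf_per_hour(t) <= quoted_pf:
            if run_start is None:
                run_start = t                      # a new run begins here
            if best is None or t - run_start > best[1] - best[0]:
                best = (run_start, t)              # longest run seen so far
        else:
            run_start = None                       # run broken
    return best

lower_pf_window = service_window(2e-6)    # lower quoted PF
higher_pf_window = service_window(4e-6)   # higher quoted PF
print("Quoted 2e-6:", lower_pf_window)
print("Quoted 4e-6:", higher_pf_window)
```

With this made-up curve the higher quoted figure is met over a longer window than the lower one, which is the choice the manufacturer faces when deciding what to quote.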

It is up to the user to ensure that a part is replaced at the end of its quoted Service Life as part of the maintenance process.

If the part is not safety related, then of course the user can simply run it until it fails if they like. In some cases a part might run for twice the quoted Service Life if you're lucky, but you should never rely on luck where safety is concerned.

Service Life

A typical value for the PF/HD for a UE12-2FG Safety Relay is PF/HD = 1.58x10^-9 over its Service Life of 20 years.

A typical value for the PF/DD for a 3RT2026-1BB40 Contactor (power relay) is 4x10^-8 over its Service Life of 20 years.

Remember that the Service Life of the part is either the time or the number of cycles over which the value given for PF/H or PF/D is valid.

A rocket engine might have a PF/H of 10^-9 and yet have a Service Life of only a quarter of an hour (0.25 hours).
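To get a feel for what a quoted figure means over the whole Service Life, here is a minimal sketch. It assumes the part runs continuously and that every hour (or cycle) carries the quoted figure as an independent probability; both are simplifying assumptions, and the quoted figure is only an upper limit on the real one.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def prob_failure_over_life(pf_per_unit, units_in_service_life):
    """Probability of at least one failure during the Service Life, assuming
    the same independent probability pf_per_unit for every hour or cycle."""
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny p values keep precision
    return -math.expm1(units_in_service_life * math.log1p(-pf_per_unit))

relay_hours = 20 * HOURS_PER_YEAR                      # 20 years of continuous running
relay_total = prob_failure_over_life(1.58e-9, relay_hours)
print(f"Safety relay over 20 years: {relay_total:.3e}")
```

The same function works for a PF/D figure if the second argument is the number of cycles performed during the Service Life.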

NEVER think you can calculate the Service Life of a part from its PF figure. For safety parts SERVICE LIFE MUST BE QUOTED as well as PF/H or PF/D.

Mean Time/Cycles Between/To Failures

Mean Time Between/To Failures MTBF or MTTF is the total running-time, divided by the number of stoppages due to failures.

Mean Cycles Between/To Failures MCBF or MCTF is the total cycles-performed, divided by the number of stoppages due to failures.

The use of the words "Between Failures" means the part is expected to fail in service and be REPAIRED and put back into service again.

The use of the words "To Failure" means the part is expected to fail in service and be DISCARDED.

Note that in both cases the part is EXPECTED TO FAIL IN SERVICE and so should not be used as a safety part!

B10

B10 refers to 10% of the parts having failed. It will normally be quoted as a given amount of time, distance or cycles.

This figure should be the result of testing a large sample of parts.
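As a sketch of how such figures might be derived from a test, the code below takes the cycles-to-failure of a batch of tested parts and works out both the MCTF and a simple empirical B10. The sample values are invented for illustration, and real B10 figures are normally obtained by fitting a life distribution rather than by this simple counting rule.

```python
def mean_cycles_to_failure(cycles_to_failure):
    """MCTF: total cycles performed divided by the number of failures."""
    return sum(cycles_to_failure) / len(cycles_to_failure)

def b10(cycles_to_failure):
    """Cycles by which 10% of the tested parts have failed: the life of the
    part whose failure takes the failed share up to 10% of the sample."""
    ordered = sorted(cycles_to_failure)
    failed_count = (len(ordered) + 9) // 10      # ceil(10% of the sample size)
    return ordered[failed_count - 1]

# Hypothetical test results for 20 parts (cycles at which each one failed)
sample = [52000, 61000, 75000, 88000, 93000, 97000, 101000, 104000,
          108000, 110000, 113000, 117000, 120000, 124000, 129000,
          133000, 138000, 144000, 151000, 162000]

print("MCTF:", mean_cycles_to_failure(sample))
print("B10: ", b10(sample))
```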

Of course in a safety related system a 10% chance of failure would be catastrophic.

Note that in this case also the part is EXPECTED TO FAIL IN SERVICE and so should not be used as a safety part!

WARNING

IF WE ASSUME THAT PF/H OR PF/D IS CONSTANT FOR EVER, rather than just a quoted figure for a given quoted Service Life for a part, then with some maths we can show that MTTF or MTBF = 1 / (PF/H) and that MCTF or MCBF = 1 / (PF/D).

We can also show MTTF or MTBF = B10 x 9.49 to 3 significant figures but this is generally approximated as 10.

Going back to our rocket example where the PF/H was 10^-9, we can use the formulas above to calculate MTTF = 10^9 hours and B10 of about 10^8 hours. Both are very misleading considering the quoted Service Life is only 15 minutes!

Take a look at the statistics page, where you can build your own bathtub curve and see the resulting MTTF value, to understand why the above is so misleading.
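The relationships in this warning can be checked numerically. The sketch below assumes the standard constant-rate (exponential) life model, in which the fraction of parts surviving to time t is exp(-t / MTTF); the function names are made up for this example.

```python
import math

def mttf_from_constant_pf(pf_per_unit):
    """MTTF (or MCTF) if the PF per hour (or per cycle) never changed."""
    return 1.0 / pf_per_unit

def b10_from_mttf(mttf):
    """Time by which 10% have failed under the exponential model:
    exp(-t / MTTF) = 0.9  gives  t = -ln(0.9) * MTTF."""
    return -math.log(0.9) * mttf

ratio = -1.0 / math.log(0.9)              # MTTF divided by B10
rocket_mttf = mttf_from_constant_pf(1e-9)
rocket_b10 = b10_from_mttf(rocket_mttf)

print(f"MTTF / B10  = {ratio:.3f}")
print(f"Rocket MTTF = {rocket_mttf:.3e} hours")
print(f"Rocket B10  = {rocket_b10:.3e} hours")
```

The numbers come out as the formulas say they should, which is exactly the problem: they describe a rocket that never wears out, not one with a Service Life of a quarter of an hour.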

We know that the MTTF for the rocket is probably between 15 and 20 minutes, by which time all the rockets will be spent. Taking the best figure for MTTF as 20 minutes we could use one of the above formulas to calculate the PF/Minute as 1/20 which is clearly not correct. This illustrates why the assumption of uniform PF/H and PF/D can lead to very wrong results.

A number of manufacturers give figures for safety related products like "100,000 cycles" and that is all they tell you. Do they mean a Service Life of 100,000 cycles, an MCTF or MCBF of 100,000 cycles, or a B10 of 100,000 cycles?

In the first case they don't give a figure for the Probability of Failure during the 100,000 cycles Service Life (is it so close to 0 as not to count? How close to 0 is that?) and in the other two cases they are giving figures that indicate the part will be run until it fails in service!

All safety parts should specify a Probability of Failure per unit Time, Distance or Cycles, during a specified Service Life.

References

Diagnostic Coverage

Average Diagnostic Coverage is written DCavg.

DCavg = Number of detected failures / Number of failures
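The ratio above is simple enough to compute directly. This is only a sketch of the definition as given here, with hypothetical counts; the function name is made up for the example.

```python
def dc_avg(detected_failures, total_failures):
    """Average diagnostic coverage as defined above: detected / total."""
    if total_failures <= 0:
        raise ValueError("total_failures must be positive")
    if not 0 <= detected_failures <= total_failures:
        raise ValueError("detected_failures must lie between 0 and total_failures")
    return detected_failures / total_failures

# Hypothetical counts: 90 of 100 failures were picked up by the diagnostics
print(f"DCavg = {dc_avg(90, 100):.0%}")
```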

Risk Reduction Factor

When two or more safety systems are running in parallel in such a way that all must fail in order for the system to fail as a whole then the PF is PF1 x PF2 as we have already seen. When things are multiplied together in maths we call each thing a "factor". Some people call the last factor they put onto the calculation, especially when it relates to guards and protective things, a Risk Reduction Factor. The change of name is technically unnecessary but perhaps explains something of the role of the factor rather than anything else.

Combined Failure Probabilities

Let 'p' mean either PF/H or PF/D.

If I have a system which has some parts or sub-systems working in parallel such that all have to fail to create a failure of the system then;

     pALL = p1 x p2 x p3 x ...

If I have a system which has some parts or sub-systems working in series such that only one has to fail to create a failure of the system then;

     pALL ≅ p1 + p2 + p3 + ...

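The first of these formulas is direct from probability maths, but the second is an approximation that holds for small PF/H or PF/D, typically 0.001 or less, to give 2 significant figures of accuracy. The exact series result is 1 - (1 - p1) x (1 - p2) x (1 - p3) x ..., and when every p is small the cross terms such as p1 x p2 are negligible, leaving just the sum. Remember that the calculations only apply for the Service Life of the system, which is equal to the shortest Service Life of any part in it.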
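A short sketch comparing the exact and approximate series results may help. It assumes the parts fail independently of one another, which is the assumption behind both formulas; the function names and the example values are made up for illustration.

```python
import math

def p_parallel(probabilities):
    """All parts must fail for the system to fail: multiply."""
    return math.prod(probabilities)

def p_series_exact(probabilities):
    """Any one failure fails the system: 1 - (chance that none fail)."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)

def p_series_approx(probabilities):
    """Small-probability approximation: just add them up."""
    return sum(probabilities)

parts = [0.001, 0.001, 0.001]             # hypothetical values at the 0.001 limit
exact = p_series_exact(parts)
approx = p_series_approx(parts)
print(f"series exact   = {exact:.6e}")
print(f"series approx  = {approx:.6e}")
print(f"relative error = {(approx - exact) / exact:.3%}")
print(f"parallel       = {p_parallel(parts):.3e}")
```

The approximation always errs on the high side, so using the sum is the cautious choice.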

Word based safety statistics! BS EN ISO 13849-1:2015

Safety Integrity Levels

Given that life and reliability in the form of time or cycles and probability of failure etc. are such great ways to express safety, you have to wonder what Safety Integrity Levels SIL or Performance Levels PL, defined in BS EN ISO 13849-1:2015 Safety of machinery - Safety-related parts of control systems - Part 1: General principles for design, really add to the picture.

You can't do maths with them. If you take two parts rated at SIL1 or PLb and build a system out of them, there is no rule for concluding the rating of the system as a whole other than to convert to probabilities of failure, work with those and then convert back. One thing's sure, the combined system won't be SIL1 or PLb. Also Performance Levels have a qualitative approach that seems incongruent with any quantitative approach. Here they are...

Safety Integrity Levels SIL

Safety Integrity Levels SILs have been defined as follows;

SIL     PF/HD
SIL 1   10^-5 > PF/HD > 10^-6
SIL 2   10^-6 > PF/HD > 10^-7
SIL 3   10^-7 > PF/HD > 10^-8
SIL 4   10^-8 > PF/HD > 10^-9

Because it is also necessary to allocate a SIL to systems that deteriorate with use cycles, the Safety Integrity Levels SILs have also been defined by assuming that the part is "demanded upon" (used) once every 10,000 hours (about 417 days).

This is of course a very big assumption and should never be used to ignore the actual demand, but here it is.

SIL     PF/DD
SIL 1   10^-1 > PF/DD > 10^-2
SIL 2   10^-2 > PF/DD > 10^-3
SIL 3   10^-3 > PF/DD > 10^-4
SIL 4   10^-4 > PF/DD > 10^-5
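The two SIL tables can be turned into a small lookup, which also shows the "convert, combine, convert back" route for rating a system. This is a sketch only: the function name is made up, and treating the lower edge of each band as inside it is an assumption, since the tables use strict inequalities on both sides.

```python
def sil_for(pf, per_demand=False):
    """Return the SIL (1 to 4) for a PF/HD figure, or for a PF/DD figure
    when per_demand is True. Returns None outside the tabulated bands."""
    scale = 1e4 if per_demand else 1.0     # PF/DD bands sit 10,000 times higher
    bands = [(4, 1e-9, 1e-8), (3, 1e-8, 1e-7), (2, 1e-7, 1e-6), (1, 1e-6, 1e-5)]
    for level, low, high in bands:
        if low * scale <= pf < high * scale:
            return level
    return None

# Two hypothetical SIL 1 parts in series: add the PF/HD figures, then look up
part_a, part_b = 5e-6, 6e-6
print("Part A:", sil_for(part_a))
print("Part B:", sil_for(part_b))
print("Series system:", sil_for(part_a + part_b))
```

In this made-up case each part is SIL 1 on its own but the series combination falls outside the SIL 1 band altogether.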

Performance Levels

Performance Levels are another system defined in terms of PF/HD and so can be mapped to SILs in this respect. For continuous operation, the Probability of dangerous Failure per Hour PF/HD for PLs and SILs is;

PL    SIL     PF/HD
PLa   ~       10^-4 > PF/HD > 10^-5
PLb   SIL 1   10^-5 > PF/HD > 3x10^-6
PLc   SIL 1   3x10^-6 > PF/HD > 10^-6
PLd   SIL 2   10^-6 > PF/HD > 10^-7
PLe   SIL 3   10^-7 > PF/HD > 10^-8
~     SIL 4   10^-8 > PF/HD > 10^-9

The technical benefits of quoting PL instead of PF/HD are doubtful. The reason for mapping PLb and PLc onto SIL 1 is not clear other than to give a slightly better resolution in this range.
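For anyone who does need to go from a PF/HD figure to a PL, the mapping is a simple band lookup. The sketch below is illustrative: the names are made up, the lower edge of each band is treated as inside it (an assumption), and the PLa band is taken to run down to 10^-5 so that it meets the top of PLb.

```python
# (PL, SIL, lower bound, upper bound) for PF/HD; None means no equivalent
PL_BANDS = [
    (None,  4,    1e-9, 1e-8),
    ("PLe", 3,    1e-8, 1e-7),
    ("PLd", 2,    1e-7, 1e-6),
    ("PLc", 1,    1e-6, 3e-6),
    ("PLb", 1,    3e-6, 1e-5),
    ("PLa", None, 1e-5, 1e-4),   # lower edge assumed to be 1e-5 (see lead-in)
]

def performance_level(pf_hd):
    """Return (PL, SIL) for a PF/HD figure, or None if it is off the table."""
    for pl, sil, low, high in PL_BANDS:
        if low <= pf_hd < high:
            return pl, sil
    return None

print(performance_level(2e-6))
print(performance_level(5e-5))
print(performance_level(1.58e-9))
```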

Performance Level The Qualitative Approach

The performance level approach also has a descriptive way of doing things. It talks in terms of slight or serious injury, seldom or frequent exposure and whether avoiding the injury is possible or improbable. The difficulties in reconciling this approach with PF/H are huge. Personally I would avoid such subjective approaches in favour of estimating the probabilities of specific types of injury.

PL    Injury    Exposure   Avoidance    PF/HD
PLa   slight    seldom     possible     10^-4 > PF/HD > 10^-5
PLb   slight    seldom     improbable   10^-5 > PF/HD > 3x10^-6
PLb   slight    frequent   possible     10^-5 > PF/HD > 3x10^-6
PLc   slight    frequent   improbable   3x10^-6 > PF/HD > 10^-6
PLc   serious   seldom     possible     3x10^-6 > PF/HD > 10^-6
PLd   serious   seldom     improbable   10^-6 > PF/HD > 10^-7
PLd   serious   frequent   possible     10^-6 > PF/HD > 10^-7
PLe   serious   frequent   improbable   10^-7 > PF/HD > 10^-8
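The table above is just a lookup on three yes/no style answers, which the following sketch makes explicit. The dictionary and function names are made up for illustration; the entries are copied from the table.

```python
# (injury, exposure, avoidance) -> required Performance Level
REQUIRED_PL = {
    ("slight",  "seldom",   "possible"):   "PLa",
    ("slight",  "seldom",   "improbable"): "PLb",
    ("slight",  "frequent", "possible"):   "PLb",
    ("slight",  "frequent", "improbable"): "PLc",
    ("serious", "seldom",   "possible"):   "PLc",
    ("serious", "seldom",   "improbable"): "PLd",
    ("serious", "frequent", "possible"):   "PLd",
    ("serious", "frequent", "improbable"): "PLe",
}

def required_pl(injury, exposure, avoidance):
    """Look up the PL for one combination; raises KeyError for unknown words."""
    return REQUIRED_PL[(injury, exposure, avoidance)]

print(required_pl("serious", "seldom", "possible"))
```

Every input is a judgement call between two words, which is where the subjectivity comes in.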

Categories

Categories refer to the required behaviour of the Safety Related Parts of a Control System (SRP/CS) in respect of their resistance to faults, based on the design.

Category B

Behaviour: The occurrence of a fault can lead to the loss of the safety function.

Requirement: SRP/CS and/or their protective equipment, as well as their components, shall be designed, constructed, selected, assembled and combined in accordance with relevant standards so that they can withstand the expected influence. Basic safety principles shall be used.

Structure: Mainly characterized by selection of components.

Rating: MTTF = Low to medium, DCavg = none, CCF = Not relevant.

Category 1

Behaviour: The occurrence of a fault can lead to the loss of the safety function but the probability of occurrence is lower than category B.

Requirement: Requirements of B shall apply. Well-tried components and well-tried safety principles shall be used.

Structure: Mainly characterized by selection of components.

Rating: MTTF = High, DCavg = none, CCF = Not relevant.

Category 2

Behaviour: The occurrence of a fault can lead to the loss of the safety function between the checks. The loss of safety function is detected by the checks.

Requirement: Requirements of B and the use of well-tried safety principles shall apply. The safety function shall be checked at intervals by the machine control system.

Structure: Mainly characterized by the structure.

Rating: MTTF = Low to high, DCavg = Low to medium, CCF = See Annex F of 13849-1.

Category 3

Behaviour: When a single fault occurs the safety function is always performed. Some but not all faults will be detected. Accumulation of undetected faults can lead to the loss of the safety function.

Requirement: Requirements of B and the use of well-tried safety principles shall apply. Safety related parts shall be designed so that a single fault in any of those parts does not lead to the loss of the safety function, and wherever reasonably practicable the single fault is detected.

Structure: Mainly characterized by the structure.

Rating: MTTF = Low to high, DCavg = Low to medium, CCF = See Annex F of 13849-1.

Category 4

Behaviour: When a single fault occurs the safety function is always performed. Detection of accumulated faults reduces the probability of the loss of the safety function (high DC). The faults will be detected in time to prevent the loss of the safety function.

Requirement: Requirements of B and the use of well-tried safety principles shall apply. Safety related parts shall be designed so that a single fault in any of those parts does not lead to the loss of the safety function, and the single fault is detected at or before the next demand upon the safety function, but that if this detection is not possible, an accumulation of undetected faults shall not lead to a loss of the safety function.

Structure: Mainly characterized by the structure.

Rating: MTTF = High, DCavg = High including an accumulation of faults, CCF = See Annex F of 13849-1.

Scoring process and quantification of measures against CCF

BS EN ISO 13849-1:2015 Table F.1, "Scoring process and quantification of measures against CCF" (common cause failures), is another bright idea for a points-based approach to safety, but it does raise some issues worth considering.

You need to score 65 or better to meet the requirements! They state "Where technological measures are not relevant, points attached to this column can be considered in the comprehensive calculation." I have put the scores in (round brackets).

1 Separation/ Segregation (15) Physical separation between signal paths, for example: separation in wiring/piping; detection of short circuits and open circuits in cables by dynamic test; separate shielding for the signal path of each channel; sufficient clearances and creepage distances on printed-circuit boards.

2 Diversity (20) Different technologies/design or physical principles are used, for example: first channel electronic or programmable electronic and second channel electromechanical hardwired, different initiation of safety function for each channel (e.g. position, pressure, temperature), and/or digital and analog measurement of variables (e.g. distance, pressure or temperature) and/or components from different manufacturers.

3 Design/application/experience

3.1 Protection against over-voltage, over-pressure, over-current, over-temperature, etc. (15)

3.2 Components used are well-tried. (5)

4 Assessment/analysis (5) For each part of safety related parts of control system a failure mode and effect analysis has been carried out and its results taken into account to avoid common-cause-failures in the design.

5 Competence/training (5) Training of designers to understand the causes and consequences of common cause failures.

6 Environmental (25)

6.1 For electrical/electronic systems, prevention of contamination and electromagnetic disturbances (EMC) to protect against common cause failures in accordance with appropriate standards (e.g. IEC 61326-3-1). Fluidic systems: filtration of the pressure medium, prevention of dirt intake, drainage of compressed air, e.g. in compliance with the component manufacturers' requirements concerning purity of the pressure medium. NOTE For combined fluidic and electric systems, both aspects should be considered.

6.2 Other influences Consideration of the requirements for immunity to all relevant environmental influences such as, temperature, shock, vibration, humidity (e.g. as specified in relevant standards).
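Totting up the score is straightforward, as the sketch below shows. The short keys are made up for this example, the points are the ones given in round brackets above, and the 25 points for item 6 are treated as a single block because the list above does not split them between 6.1 and 6.2 (an assumption of this sketch).

```python
# Points per measure, as listed above (keys are hypothetical short names)
CCF_POINTS = {
    "separation":    15,   # 1   Separation/Segregation
    "diversity":     20,   # 2   Diversity
    "protection":    15,   # 3.1 Protection against over-voltage etc.
    "well_tried":     5,   # 3.2 Well-tried components
    "assessment":     5,   # 4   Assessment/analysis
    "training":       5,   # 5   Competence/training
    "environmental": 25,   # 6   Environmental (taken as one block here)
}
REQUIRED_SCORE = 65

def ccf_score(measures_met):
    """Add up the points for the measures that have been implemented."""
    return sum(CCF_POINTS[name] for name in set(measures_met))

def meets_ccf_requirement(measures_met):
    """True if the total reaches the required score of 65."""
    return ccf_score(measures_met) >= REQUIRED_SCORE

design_a = ["separation", "diversity", "protection", "environmental"]
design_b = ["separation", "diversity", "well_tried", "assessment", "training"]
print("Design A:", ccf_score(design_a), meets_ccf_requirement(design_a))
print("Design B:", ccf_score(design_b), meets_ccf_requirement(design_b))
```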