Quantitative Risk Assessment

Probability of Failure

A Probability of Failure (PF) can be given as:

PF per Hour, written PF/H, or
PF on Demand (i.e. per use), written PF/D.

For systems that run continuously, or that deteriorate with time, PF/H makes more sense, whereas for systems that deteriorate with use but not when idle, PF/D makes more sense. We often call one use one "cycle".

Strictly, the probability of failure should be written PF/D or PF/H, and the probability of dangerous failure PF/D_D or PF/H_D, but the / is often left out and the result written PFD, PFH, PFD_D or PFH_D.

There is the added complexity that many parts have a much higher rate of failure:

in their very early life, and
at the end of their life, as they wear out.

When we plot PF/H or PF/D against time or use cycles, we end up with what is nicknamed "the bathtub curve"!

[Figure: the bathtub curve, PF/H plotted against time]

For any safety part, we expect the manufacturer to have tested a large enough sample to plot its characteristic bathtub curve.

It is the part of the curve between the early-life and wear-out phases that is of interest, because that determines the Service Life of the part. Of course, no parts fail after the wear-out phase, but that is only because they are all worn out!

The manufacturer needs to give the customer a guarantee that the PF/H or PF/D will be less than a certain figure for the Service Life of the part.

They could quote a higher PF and a longer Service Life, as marked in green on the graph, or a lower PF and a shorter Service Life, as marked in blue.

The bathtub curve is rarely flat during the Service Life, but as long as the actual PF/H or PF/D is better (lower) than the manufacturer's quoted figure, the situation is satisfactory.
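To make the trade-off concrete, here is a minimal Python sketch using an invented bathtub-shaped PF/H curve (the three terms and all their constants are assumptions, not data for any real part). The hypothetical `service_window` function scans the curve and reports the span of hours over which PF/H stays at or below a quoted figure; quoting a higher PF gives a longer window, as in the green-versus-blue comparison above.

```python
import math


def hazard(t_hours: float) -> float:
    """Hypothetical bathtub-shaped PF/H at age t (failures per hour).

    All constants are invented for illustration only.
    """
    early = 5e-6 * math.exp(-t_hours / 2_000)   # early-life failures die away
    floor = 1e-7                                # constant random failures
    wear = 1e-7 * (t_hours / 100_000) ** 6      # wear-out grows with age
    return early + floor + wear


def service_window(bound: float, step: float = 100.0, horizon: float = 500_000.0):
    """Return (start, end) in hours where hazard(t) <= bound on a fixed grid.

    Assumes the curve dips below the bound once and then rises above it once;
    returns (None, None) if it never gets below the bound.
    """
    start = end = None
    t = 0.0
    while t <= horizon:
        if hazard(t) <= bound:
            if start is None:
                start = t
            end = t
        elif start is not None:
            break  # the wear-out phase has pushed PF/H back above the bound
        t += step
    return start, end


for quoted in (1e-6, 3e-6):
    start, end = service_window(quoted)
    print(f"PF/H <= {quoted:.0e}: from {start:,.0f} h to {end:,.0f} h")
```

The start of each window marks where the early-life failures have died away; the end is where wear-out takes PF/H back over the quoted figure.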

It is up to the user to ensure that a part is replaced at the end of its quoted Service Life as part of the maintenance process.

If the part is not safety related then, of course, it can simply be run until it fails. In some cases a part might run for twice its quoted Service Life if you're lucky, but you should never rely on luck where safety is concerned.

Service Life

A typical value for the PF/H_D for a UE12-2FG Safety Relay is PF/H_D = 1.58x10^-9 over its Service Life of 20 years.

A typical value for the PF/D_D for a 3RT2026-1BB40 Contactor (power relay) is 4x10^-8 over its Service Life of 20 years.

Remember that the Service Life of a part is the time, or the number of cycles, over which the value given for PF/H or PF/D is valid.

A rocket engine might have a PF/H of 10^-9 and yet have a Service Life of only a quarter of an hour (0.25 h).

NEVER think you can calculate the Service Life of a part from its PF figure. For safety parts SERVICE LIFE MUST BE QUOTED as well as PF/H or PF/D.

Mean Time/Cycles Between/To Failures

Mean Time Between Failures (MTBF), or Mean Time To Failure (MTTF), is the total running time divided by the number of stoppages due to failure.

Mean Cycles Between Failures (MCBF), or Mean Cycles To Failure (MCTF), is the total number of cycles performed divided by the number of stoppages due to failure.
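As a worked example of these two definitions, the following Python sketch uses invented figures: ten parts each run for 5,000 hours (or 200,000 cycles) with 4 stoppages due to failure in total. The function name `mean_between_failures` is hypothetical.

```python
def mean_between_failures(total_usage: float, failures: int) -> float:
    """Total running time (or cycles performed) divided by the number of
    stoppages due to failure, as defined above."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute a mean")
    return total_usage / failures


# Invented figures: 10 parts, each run for 5,000 h / 200,000 cycles.
mtbf_hours = mean_between_failures(10 * 5_000, failures=4)
mcbf_cycles = mean_between_failures(10 * 200_000, failures=4)
print(f"MTBF = {mtbf_hours:,.0f} h, MCBF = {mcbf_cycles:,.0f} cycles")
# MTBF = 12,500 h, MCBF = 500,000 cycles
```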

The use of the words "Between Failures" means the part is expected to fail in service and be REPAIRED and put back into service again.

The use of the words "To Failure" means the part is expected to fail in service and be DISCARDED.

Note that in both cases the part is EXPECTED TO FAIL IN SERVICE and so should not be used as a safety part!

B₁₀

B₁₀ is the point by which 10% of the parts have failed. It is normally quoted as an amount of time, distance or cycles.

This figure should be the result of testing a large sample of parts.
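A minimal Python sketch of reading B₁₀ straight off a test sample: sort the observed lives and take the one by which 10% of the sample has failed. The 20 cycles-to-failure values are invented, `b10` is a hypothetical helper name, and a real analysis might instead fit a distribution to the data rather than use the raw sample.

```python
def b10(lives: list[float]) -> float:
    """Smallest observed life by which at least 10% of the sample had failed."""
    if not lives:
        raise ValueError("need at least one observation")
    ordered = sorted(lives)
    k = (len(ordered) + 9) // 10  # ceil(10% of n) using integer arithmetic
    return ordered[k - 1]


# Invented test results: cycles to failure for 20 parts.
sample = [410_000, 95_000, 620_000, 150_000, 880_000, 300_000, 72_000,
          510_000, 240_000, 700_000, 130_000, 460_000, 990_000, 350_000,
          560_000, 205_000, 780_000, 275_000, 640_000, 180_000]
print(f"B10 = {b10(sample):,} cycles")  # B10 = 95,000 cycles
```

Integer arithmetic is used for the 10% count so that, for example, a sample of 30 gives exactly 3 failures rather than a floating-point value just above 3.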

Of course, in a safety-related system a 10% chance of failure would be catastrophic.

Note that in this case also the part is EXPECTED TO FAIL IN SERVICE and so should not be used as a safety part!

WARNING

IF WE ASSUME THAT PF/H OR PF/D IS CONSTANT FOREVER, rather than just a quoted figure valid for a quoted Service Life, then with some maths we can show that MTTF or MTBF = 1/(PF/H) and that MCTF or MCBF = 1/(PF/D).

We can also show that MTTF or MTBF = B₁₀ x 9.49 (to 3 significant figures), which is generally approximated as B₁₀ x 10.
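Where the 9.49 comes from: under the constant-rate assumption, the fraction of parts still working at time t is e^(-λt), where λ is the constant PF/H. So MTTF = 1/λ, and B₁₀ is the time at which e^(-λt) = 0.9, i.e. B₁₀ = -ln(0.9)/λ. The ratio MTTF/B₁₀ = -1/ln(0.9) therefore does not depend on λ, as this short Python check with an arbitrary λ shows.

```python
import math

rate = 2.5e-7  # arbitrary constant PF/H (per hour), chosen for illustration

mttf = 1 / rate                     # MTTF = 1 / (PF/H)
b10 = -math.log(0.9) / rate         # time by which 10% have failed
ratio = mttf / b10

print(f"MTTF / B10 = {ratio:.3g}")  # MTTF / B10 = 9.49
```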

Going back to our rocket example where the PF/H was 10^-9 we can use the formulas above to calculate MTTF = 10⁹ hours and B₁₀ = 10⁸ hours. Both are very misleading considering the quoted Service Life is only 15 minutes!

Take a look at the statistics page, where you can "build your own bathtub curve" and see what the MTTF value is, to understand why the above is so misleading.

We know that the MTTF for the rocket is probably between 15 and 20 minutes, by which time all the rockets will be spent. Taking the best figure for MTTF as 20 minutes, we could use one of the above formulas to calculate the PF per minute as 1/20, which is clearly not correct. This illustrates why assuming a constant PF/H or PF/D can lead to very wrong results.
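The rocket figures can be reproduced in a few lines of Python. This deliberately applies the constant-rate formulas that the WARNING cautions against, to show how far they land from the 15-minute Service Life; the x10 rule of thumb for B₁₀ is the approximation quoted above.

```python
pfh = 1e-9                  # quoted PF/H for the rocket engine
service_life_h = 0.25       # quoted Service Life: 15 minutes

mttf_h = 1 / pfh            # constant-rate MTTF: about 10^9 hours
b10_h = mttf_h / 10         # rule-of-thumb B10: about 10^8 hours
print(f"MTTF ~ {mttf_h:.0e} h, B10 ~ {b10_h:.0e} h, "
      f"MTTF is {mttf_h / service_life_h:.0e} times the Service Life")

# Working backwards from a realistic MTTF of 20 minutes instead:
pf_per_minute = 1 / 20      # 0.05 per minute, which is clearly not right
print(f"PF/minute ~ {pf_per_minute}")
```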

A number of manufacturers give figures for safety-related products such as "100,000 cycles", and that is all they tell you. Do they mean:

the Service Life of the part is 100,000 cycles,
the MCTF is 100,000 cycles, or
B₁₀ is 100,000 cycles, i.e. 10% will have failed by then?

In the first case they give no figure for the Probability of Failure during the 100,000-cycle Service Life (is it so close to 0 as not to count? How close to 0 is that?), and in the other two cases they are giving figures which indicate that the part will be run until it fails in service!
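To see how much the three readings differ, this Python sketch converts a bare "100,000 cycles" into an implied PF/D under each one. It leans on the constant-rate assumption from the WARNING above (and treats PF/D as a constant rate per cycle), so the numbers only show how far apart the interpretations are, not a real figure for any product; the dictionary keys are illustrative labels.

```python
import math

quoted_cycles = 100_000

# Reading 1: Service Life = 100,000 cycles -> no PF/D is implied at all.
# Reading 2: MCTF = 100,000 cycles -> PF/D = 1 / MCTF (constant-rate assumption).
# Reading 3: B10 = 100,000 cycles -> 1 - exp(-PF/D * cycles) = 0.1 at that point.
implied_pfd = {
    "Service Life": None,
    "MCTF": 1 / quoted_cycles,
    "B10": -math.log(0.9) / quoted_cycles,
}

for reading, pfd in implied_pfd.items():
    text = "not stated" if pfd is None else f"{pfd:.2e}"
    print(f"{reading:>12}: PF/D = {text}")
```

The MCTF and B₁₀ readings differ by a factor of about 9.5, and the Service Life reading gives no PF/D at all.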

All safety parts should specify a Probability of Failure per unit Time, Distance or Cycles, during a specified Service Life.

Diagnostic Coverage

Average Diagnostic Coverage is written DC_avg:

DC_avg = Number of detected failures / Number of failures
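A minimal Python sketch of this ratio, applied to a hypothetical list of failure modes (for example from an FMEA), each marked as detected by the diagnostics or not. The failure-mode names are invented.

```python
# Hypothetical failure modes and whether the diagnostics would detect each one.
failure_modes = {
    "contact welded": True,
    "coil open circuit": True,
    "wire short to 24 V": True,
    "sensor stuck high": False,
}

detected = sum(failure_modes.values())   # True counts as 1
dc_avg = detected / len(failure_modes)   # detected failures / all failures
print(f"DC_avg = {dc_avg:.0%}")          # DC_avg = 75%
```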

Risk Reduction Factor

When two or more safety systems run in parallel in such a way that all must fail for the system as a whole to fail, the PF is PF₁ x PF₂ x ... (see Combined Failure Probabilities). When things are multiplied together in maths, we call each one a "factor". Some people call the last factor they put into the calculation, especially when it relates to guards and other protective measures, a Risk Reduction Factor. The change of name is technically unnecessary, but it perhaps says something about the role of that factor.

Combined Failure Probabilities

Let 'p' mean either PF/H or PF/D.

If I have a system with parts or sub-systems working in parallel, such that all of them have to fail to cause a failure of the system, then (provided the failures are independent):

p_ALL = p₁ x p₂ x p₃ x ...

If I have a system with parts or sub-systems working in series, such that only one has to fail to cause a failure of the system, then:

p_ALL ≅ p₁ + p₂ + p₃ + ...

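The first of these formulas comes directly from probability maths, but the second is an approximation that holds for small PF/H or PF/D, typically 0.001 or less, to give 2 significant figures of accuracy. This is because the exact result is p_ALL = 1 - (1 - p₁)(1 - p₂)(1 - p₃)..., and the cross terms such as p₁ x p₂ that the simple sum leaves out become negligible when each p is small. Remember that the calculations only apply for the Service Life of the system, which is equal to the shortest Service Life of any part in it.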
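The two rules, and the size of the error in the series approximation, can be checked with a short Python sketch. The three p values are invented, and the parallel rule assumes the failures are independent.

```python
from math import prod


def parallel(ps):
    """All must fail: multiply the probabilities (independent failures)."""
    return prod(ps)


def series_exact(ps):
    """Any one failing is enough: 1 minus the chance that none fail."""
    return 1 - prod(1 - p for p in ps)


def series_approx(ps):
    """The small-p approximation: just add the probabilities."""
    return sum(ps)


ps = [1e-3, 2e-3, 5e-4]
exact, approx = series_exact(ps), series_approx(ps)
print(f"parallel: {parallel(ps):.2e}")
print(f"series exact {exact:.4e}, approx {approx:.4e}, "
      f"error {(approx - exact) / exact:.2%}")
```

The simple sum always slightly overestimates, which errs on the safe side; here the error is about 0.1%.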

Word based safety statistics! BS EN ISO 13849-1:2015

Safety Integrity Levels

Given that life and reliability, in the form of time or cycles and probability of failure, are such great ways to express safety, you have to wonder what the Safety Integrity Levels (SIL) or Performance Levels (PL) defined in

BS EN ISO 13849-1:2015 Safety of machinery - Safety-related parts of control systems - Part 1: General principles for design

really add to the picture.

You can't do maths with them. If you take two parts rated at SIL 1 or PLb and build a system out of them, there is no rule for deriving the rating of the whole system other than to convert to probabilities of failure, work with those, and then convert back. One thing is for sure: you can't simply assume the combined system is SIL 1 or PLb. Performance Levels also come with a qualitative approach that seems incongruent with any quantitative approach. Here they are...

Safety Integrity Levels SIL

Safety Integrity Levels SILs have been defined as follows;

SIL	PF/H_D
SIL 1	10^-5 > PF/H_D > 10^-6
SIL 2	10^-6 > PF/H_D > 10^-7
SIL 3	10^-7 > PF/H_D > 10^-8
SIL 4	10^-8 > PF/H_D > 10^-9
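The "convert to probabilities, combine, convert back" route can be sketched in Python using the PF/H_D bands in the table above. Treating each band as including its lower bound and excluding its upper bound is an assumption (the table uses strict inequalities at both ends), and the part figures are invented. Two parts that are each SIL 1 can give a series system that is still SIL 1, or one that falls outside the table altogether, depending on where in the band they sit.

```python
# (SIL, lower bound inclusive, upper bound exclusive) for PF/H_D, from the table.
SIL_BANDS_PFH = [
    (4, 1e-9, 1e-8),
    (3, 1e-8, 1e-7),
    (2, 1e-7, 1e-6),
    (1, 1e-6, 1e-5),
]


def sil_from_pfh(pfh_d):
    """Return the SIL whose band contains pfh_d, or None if outside the table."""
    for sil, lower, upper in SIL_BANDS_PFH:
        if lower <= pfh_d < upper:
            return sil
    return None


for part_pfh in (2e-6, 6e-6):          # two invented parts, both SIL 1 on their own
    system_pfh = part_pfh + part_pfh    # series: either failing fails the system
    sil = sil_from_pfh(system_pfh)
    label = f"SIL {sil}" if sil is not None else "outside the SIL table"
    print(f"two parts at {part_pfh:.0e} -> system {system_pfh:.1e}: {label}")
```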

Because it is also necessary to allocate a SIL to systems that deteriorate with use cycles, SILs have also been defined by assuming that the part is "demanded upon" (used) once every 10,000 hours (about 417 days, or roughly 14 months).

This is of course a very big assumption and should never be used to ignore the actual demand, but here it is.

SIL	PF/D_D
SIL 1	10^-1 > PF/D_D > 10^-2
SIL 2	10^-2 > PF/D_D > 10^-3
SIL 3	10^-3 > PF/D_D > 10^-4
SIL 4	10^-4 > PF/D_D > 10^-5
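The two SIL tables line up if PF/H_D = PF/D_D ÷ 10,000 h, i.e. one demand every 10,000 hours. The Python sketch below makes that conversion explicit; `pfh_equivalent` is a hypothetical helper, and the once-a-day demand in the last line is an invented case showing why the actual demand rate matters.

```python
HOURS_PER_DEMAND = 10_000   # the assumption behind the PF/D_D table above


def pfh_equivalent(pfd_d, hours_per_demand=HOURS_PER_DEMAND):
    """Spread a PF/D_D over the hours between demands to get an equivalent PF/H_D."""
    return pfd_d / hours_per_demand


print(f"10,000 h is about {HOURS_PER_DEMAND / 24:.0f} days")
# SIL 1 PF/D_D band edges (10^-1, 10^-2) land on the SIL 1 PF/H_D edges (10^-5, 10^-6).
print(f"{pfh_equivalent(1e-1):.0e} {pfh_equivalent(1e-2):.0e}")
# Invented case: a SIL 2 PF/D_D of 10^-3, but demanded once a day instead.
print(f"{pfh_equivalent(1e-3, hours_per_demand=24):.1e} per hour")
```

With a daily demand the same PF/D_D works out at about 4.2x10^-5 per hour, worse than the SIL 1 band of the PF/H_D table.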

Performance Levels

Performance Levels (PL) are another system defined in terms of PF/H_D, so in this respect they can be mapped to SILs. For continuous operation, the probability of dangerous failure per hour, PF/H_D, for PLs and SILs is:

PL	SIL	PF/H_D
PLa	~	10^-4 > PF/H_D > 10^-5
PLb	SIL 1	10^-5 > PF/H_D > 3x10^-6
PLc	SIL 1	3x10^-6 > PF/H_D > 10^-6
PLd	SIL 2	10^-6 > PF/H_D > 10^-7
PLe	SIL 3	10^-7 > PF/H_D > 10^-8
~	SIL 4	10^-8 > PF/H_D > 10^-9

The technical benefits of quoting a PL instead of PF/H_D are doubtful. The reason for mapping both PLb and PLc onto SIL 1 is not clear, other than to give slightly better resolution in this range.

Performance Level The Qualitative Approach

The performance level approach also has a descriptive way of doing things. It talks in terms of slight or serious injury, seldom or frequent exposure, and whether avoiding the injury is possible or improbable. The difficulties in reconciling this approach with PF/H are huge. Personally, I would avoid such subjective approaches in favour of estimating the probabilities of specific types of injury.

PL	Injury	Exposure	Avoidance	PF/H_D
PLa	slight	seldom	possible	10^-4 > PF/H_D > 10^-5
PLb	slight	seldom	improbable	10^-5 > PF/H_D > 3x10^-6
PLb	slight	frequent	possible	10^-5 > PF/H_D > 3x10^-6
PLc	slight	frequent	improbable	3x10^-6 > PF/H_D > 10^-6
PLc	serious	seldom	possible	3x10^-6 > PF/H_D > 10^-6
PLd	serious	seldom	improbable	10^-6 > PF/H_D > 10^-7
PLd	serious	frequent	possible	10^-6 > PF/H_D > 10^-7
PLe	serious	frequent	improbable	10^-7 > PF/H_D > 10^-8
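The table works as a simple lookup from the three answers to a PL. A minimal Python sketch, with the entries copied from the table above (the dictionary and function names are hypothetical):

```python
# (injury, exposure, avoidance) -> required PL, copied from the table above.
RISK_GRAPH = {
    ("slight", "seldom", "possible"): "PLa",
    ("slight", "seldom", "improbable"): "PLb",
    ("slight", "frequent", "possible"): "PLb",
    ("slight", "frequent", "improbable"): "PLc",
    ("serious", "seldom", "possible"): "PLc",
    ("serious", "seldom", "improbable"): "PLd",
    ("serious", "frequent", "possible"): "PLd",
    ("serious", "frequent", "improbable"): "PLe",
}


def required_pl(injury: str, exposure: str, avoidance: str) -> str:
    """Look up the PL for one combination of answers; raises KeyError if invalid."""
    return RISK_GRAPH[(injury, exposure, avoidance)]


print(required_pl("serious", "seldom", "improbable"))  # PLd
```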

Scoring process and quantification of measures against CCF

BS EN ISO 13849-1:2015 Table F.1, "Scoring process and quantification of measures against CCF" (common cause failure), is another bright idea: a rather points-based approach to safety, but it does raise some issues worth considering.

You need to score 65 or better to meet the requirements! They state "Where technological measures are not relevant, points attached to this column can be considered in the comprehensive calculation." I have put the scores in (round brackets).

1 Separation/ Segregation (15) Physical separation between signal paths, for example: separation in wiring/piping; detection of short circuits and open circuits in cables by dynamic test; separate shielding for the signal path of each channel; sufficient clearances and creepage distances on printed-circuit boards.

2 Diversity (20) Different technologies/design or physical principles are used, for example: first channel electronic or programmable electronic and second channel electromechanical hardwired, different initiation of safety function for each channel (e.g. position, pressure, temperature), and/or digital and analog measurement of variables (e.g. distance, pressure or temperature) and/or components from different manufacturers.

3 Design/application/experience

3.1 Protection against over-voltage, over-pressure, over-current, over-temperature, etc. (15)

3.2 Components used are well-tried. (5)

4 Assessment/analysis (5) For each part of safety related parts of control system a failure mode and effect analysis has been carried out and its results taken into account to avoid common-cause-failures in the design.

5 Competence/training (5) Training of designers to understand the causes and consequences of common cause failures.

6 Environmental (25)

6.1 For electrical/electronic systems, prevention of contamination and electromagnetic disturbances (EMC) to protect against common cause failures in accordance with appropriate standards (e.g. IEC 61326-3-1). Fluidic systems: filtration of the pressure medium, prevention of dirt intake, drainage of compressed air, e.g. in compliance with the component manufacturers' requirements concerning purity of the pressure medium. NOTE For combined fluidic and electric systems, both aspects should be considered.

6.2 Other influences Consideration of the requirements for immunity to all relevant environmental influences such as, temperature, shock, vibration, humidity (e.g. as specified in relevant standards).
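The scoring itself is simple addition against the pass mark of 65. This Python sketch uses the scores listed above, with item 6 scored as a whole as listed; which measures the example design claims is invented, and `ccf_score` is a hypothetical helper.

```python
# Points per measure, as listed above (item 6 scored as a whole).
CCF_POINTS = {
    "1 separation/segregation": 15,
    "2 diversity": 20,
    "3.1 over-voltage/pressure/current/temperature protection": 15,
    "3.2 well-tried components": 5,
    "4 assessment/analysis (FMEA)": 5,
    "5 competence/training": 5,
    "6 environmental": 25,
}
PASS_MARK = 65


def ccf_score(claimed):
    """Sum the points for the measures a design actually implements."""
    return sum(CCF_POINTS[measure] for measure in claimed)


# Invented design: every measure except diversity.
claimed = [m for m in CCF_POINTS if m != "2 diversity"]
score = ccf_score(claimed)
print(f"score {score} -> {'meets' if score >= PASS_MARK else 'fails'} the {PASS_MARK} mark")
```

Dropping separation as well would bring the score down to 55, below the mark, so a single measure can decide the outcome.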
