Problems Tracking Downtime Causes With Equipment Failure Codes and Fault Codes

It is almost impossible for a maintenance technician to identify the true root cause of a component failure, because many possible root causes can lead to the same failure along many different paths. But they can identify the failure mode with great confidence, and even take a picture of it!



Hi Mike,

I’m a graduate engineer and as part of my training I’ve been asked to build a simple equipment downtime tracking device that is capable of tracking the sources of downtime on critical equipment in one of our key strategic processing facilities.

A proposal I’ve written suggests a database that keeps track of the causes and durations of all downtime on components vital to the plant’s uptime. The main aim of the downtime tracking device is to use equipment criticality analysis to create a defect elimination strategy that can reduce unplanned maintenance effort and improve our plant’s performance.

Please can you give me some very basic pointers that can aid me in setting up a simple yet effective tracking device?



Hello Friend,

What you have been asked to do has been asked in many companies over the years. If you’re using a CMMS, it will most likely have fields available to enter fault codes for the cause of failures. It is such a common request that most CMMS developers add them in as standard. You can get right down to the component level in some of the CMMS.

Unfortunately the approach is not particularly successful in most organisations for a number of reasons.

The first reason is that the failure codes are typically selected by the repairmen and they are guessing the cause of the failure from what they see when they repair the equipment. Typically there are many reasons that an item can fail, maybe 10, 20 or 30 possible failure paths. If all you have is the failed item to look at then how can you tell what was the actual route of failure and the true root cause of failure? You won’t know that without doing a lot of investigation. A tradesman would never go that far in their enquiry. And there isn’t enough time or money available in companies for the level of investigation required to get to the true root causes of all failures.

The second reason is that repairmen have varying experience, and varying quality of experience, and their knowledge of engineering is unreliable and patchy. Asking them to decide what caused a failure is fraught with errors of judgement.

The third issue is that tradesmen do not understand the value of the information and its long term purpose, so when they record their observations they are not usually exact, clear and specific enough for the input to be used with confidence in reliability analysis.

The fourth reason is that it requires long term management commitment and ownership to do the training and forge the dedicated effort required to instill the practice of rigorous failure analysis into a company’s culture. Most times managers and engineers simply move on to other jobs and the people that take their place do not share the same interest in what was done in the past.

Fifth is the issue of time. Collecting enough usable failure code data to investigate with reliability engineering methods takes some years. The big pay-off only comes when there is enough failure history data to analyse and you can identify patterns and trends indicating the root causes to fix.

Sixth is the issue of human error and forgetfulness. If there is a gap in the data, or incorrect information is included, then the analysis results will always be suspect.

The use of failure cause codes and fault codes can be made to work, but it requires huge commitment and years of sustained effort from management to lift tradesmen’s skills and knowledge to the level where their recording practices and appreciation of the causes of failure are reliable. However, there are other options open to you.

If you capture component failure dates instead of component failure causes you can do some basic reliability analysis. With the failure dates of components you can draw timeline plots to predict component failures. The failure dates of equipment also allow you to use Crow/AMSAA analysis to forecast equipment failures. Knowing when an expensive failure is likely to happen justifies investigating its causes so that you can remove the failure before it happens again. You can then apply Root Cause Analysis (RCA) to target the expensive problems with the experienced resources and skills already available in the organisation. This method will give you what your company wants most quickly, and with confidence of good results.
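To give you a feel for what Crow/AMSAA analysis does with nothing more than failure dates, here is a minimal Python sketch. It assumes the usual power-law form N(t) = lambda x t^beta for cumulative failures and uses the standard maximum likelihood estimates for a time-terminated observation period. Many practitioners fit a straight line on a log-log plot instead. The function names and the failure dates are made up purely for illustration, so treat it as a starting point and not a finished tool.

```python
import math
from datetime import date, timedelta


def crow_amsaa_fit(failure_dates, start, end):
    """Fit N(t) = lam * t**beta to failure dates observed between start and end.

    Uses the maximum likelihood estimates for time-terminated data.
    Time is measured in days since `start`.
    """
    total_days = (end - start).days
    times = sorted((d - start).days for d in failure_dates)
    if not times or times[0] <= 0 or times[-1] > total_days:
        raise ValueError("failures must fall after start and no later than end")
    log_sum = sum(math.log(total_days / t) for t in times)
    if log_sum <= 0:
        raise ValueError("need at least one failure before the end date")
    n = len(times)
    beta = n / log_sum              # beta > 1: failures are getting more frequent
    lam = n / total_days ** beta    # scale so that N(total_days) equals n
    return lam, beta


def expected_failures(lam, beta, t_from, t_to):
    """Expected number of failures between day t_from and day t_to."""
    return lam * (t_to ** beta - t_from ** beta)


# Hypothetical failure history for one pump (illustrative data only).
start = date(2020, 1, 1)
end = date(2022, 1, 1)
pump_failures = [start + timedelta(days=d) for d in (200, 400, 550, 650, 700)]

lam, beta = crow_amsaa_fit(pump_failures, start, end)
observed_days = (end - start).days
next_year = expected_failures(lam, beta, observed_days, observed_days + 365)
print(f"beta = {beta:.2f}, forecast failures in the next 365 days = {next_year:.1f}")
```

A beta above 1 suggests the equipment is failing more often as time goes on, below 1 suggests it is improving, and close to 1 suggests a steady failure rate. With only a handful of failures the forecast is rough, so use it to decide where an RCA is worth the effort, not as a precise prediction.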

Lastly, instead of requesting that failure causes be recorded, which will mostly be wrong, you should ask for failure modes to be recorded. A failure mode is what you see when the failure happens. We detect a failure mode with our senses. Maintenance people are far more likely to select the right failure mode than to identify the right failure cause.

The other wonderful thing about recording failure modes is that you can take a photograph of them! Instead of only getting written maintenance history on maintenance work orders you can also have a photographic record of the failure mode that resulted from the failure cause. The failure mode recorded by a maintenance technician will always be correct, unlike the failure cause, which will always be their best guess.

With complete and plentiful failure mode data a reliability engineer will be able to identify many of the real failure causes with a high degree of confidence.
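If you build your own simple tracking database, the record for each failure might look something like the Python sketch below. The field names, the example equipment tags and the photo file paths are all assumptions for illustration and do not come from any particular CMMS. The point is that each record holds the failure date, the observed failure mode, the downtime duration and an optional photograph, and deliberately has no field for a guessed cause.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FailureRecord:
    equipment_id: str
    component: str
    failure_date: date
    failure_mode: str                 # what was observed, not the guessed cause
    downtime_hours: float
    photo_path: Optional[str] = None  # photograph of the failure mode, if taken


def failure_mode_pareto(records):
    """Count occurrences of each (component, failure mode), most frequent first."""
    counts = Counter((r.component, r.failure_mode) for r in records)
    return counts.most_common()


def downtime_by_mode(records):
    """Total downtime hours for each (component, failure mode)."""
    totals = defaultdict(float)
    for r in records:
        totals[(r.component, r.failure_mode)] += r.downtime_hours
    return dict(totals)


# Hypothetical records (illustrative data only).
records = [
    FailureRecord("P-101", "mechanical seal", date(2021, 3, 4),
                  "leaking at seal faces", 6.0, "photos/p101_seal_0304.jpg"),
    FailureRecord("P-101", "drive end bearing", date(2021, 6, 19),
                  "outer race spalled", 9.0, "photos/p101_brg_0619.jpg"),
    FailureRecord("P-101", "mechanical seal", date(2021, 11, 2),
                  "leaking at seal faces", 4.5),
]

for (component, mode), count in failure_mode_pareto(records):
    hours = downtime_by_mode(records)[(component, mode)]
    print(f"{component}: {mode} x{count}, {hours} h downtime")
```

Ranking the failure modes by how often they occur and by how much downtime they cost shows a reliability engineer where to look first for the real causes.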

You should also read the FAQ response on why 80% of the failure codes reported into a CMMS get wrongly coded as ‘Other’, ‘N/A’ or ‘Failed to Function’.

I hope that you can get some useful ideas from the above.


My best regards to you,

Mike Sondalini
Managing Director
Lifetime Reliability Solutions HQ