Professional Practice Skills

PPS-22: Troubleshooting

(Adapted from Course Notes 4N4, Tom Marlin and MPS 34, Don Woods 2003)


Pre-class assignment

  1. Read the sections What is It?, Why Do It?, New Concepts, How to Do It, and Learning Objectives.
  2. Establish your Baseline on this skill on the Troubleshooting Feedback Form.
  3. Be able to describe how hypotheses are evaluated.


What is It?

Troubleshooting is a specialized form of the Six Step problem solving approach that is designed to help diagnose problems.  It is particularly well suited to working with problems that arise in process industries and in failure analyses.


Troubleshooting by an engineer is analogous to differential diagnosis by a doctor.  In both cases, the expert is called in to deal with a problem or breakdown.  In both cases, determining the root cause is preferred to making the symptoms go away.  If the doc gives you aspirin for shoulder pain, and you die of a heart attack, you are going to be upset.  Well, you are going to be dead, but somebody else might be upset.  


New Concepts

Current State, Ideal State, Safe State, Hypotheses, Diagnostic Action


Why Do It?

A number of mechanical engineers are responsible for maintenance or production.  For these engineers, deviations from normal conditions, or outright breakdowns and failures, are the important issues.  Diagnosing the problem, or troubleshooting, is the critical step before the problem can be solved.


Troubleshooting skills are also important to homeowners, shade tree mechanics, or anyone who regularly uses that most cranky of 21st century conveniences, the personal computer.


How to Do It

Good troubleshooters combine technique and knowledge.  We are going to focus on technique in this unit, since that is applicable to any troubleshooting problem.  The technique is a fine-tuned version of our old friend, the Six Step Method (Engage, Define, Explore, Plan, Do It, Look Back).


This unit lays out the Six Steps as they apply to troubleshooting.  In-class exercises will be used to practice specific skills.  A future unit provides a workshop opportunity to put it all together.



Engage

In this step we will

  • Ask whether the problem is an Emergency
  • Deal with our stress and say “I want to and I can.”


When presented with a problem on the production line (especially an expensive production shutdown), you may feel a rapid increase in pulse and blood pressure with accompanying onset of perspiration.  This fight or flight response may be useful or it may be harmful.


That adrenaline is useful if you have a significant safety problem.  If you have an overpressured boiler or a hazardous chemical discharge, you may need to evacuate the building, call the authorities and run like hell.  If it is a safety-critical issue, you apply your emergency training, and start whatever shutdown procedures you have learned.


If the problem is not safety-critical (perhaps a drilling operation is leaving the holes slightly undersized), then the adrenaline can be harmful.  You need to resist the urge to do the first thing you think of.  Instead, apply your stress reduction training, take a deep breath and remember this Six Step Approach.  Remind yourself that you can get to the root cause if you just stay organized and on track.



Define

In this step we will

  • Define our Current State (where we are)
  • Define our Ideal State (where we want to be)
  • Define (if needed) a temporary Safe State (like Safe Mode in Windows)


The definition of the Current State is like the initial exam at the doctor’s office.  They ask you to describe the problem, measure your temperature and blood pressure, and ask about any recent changes.  You should likewise write a description of the problem, record the conditions when the problem arose (temperature, pressure, loading, etc.), and enquire about any recent changes.


Like the doctor, you will need to do a thorough history and physical exam.  Talk to the people involved, collect data on what happened (especially the sequence of events), and locate and save data histories from computers or data loggers.  Examine the hardware and software involved.


The Ideal State is the goal and should be unambiguous and measurable.  The SMART acronym introduced in PPS-2 is useful here. In that nomenclature, a goal should be

  • Specific
  • Measurable
  • Attainable
  • Realistic
  • Time Constrained

You may want to add cost-effective to the list, especially for production situations.  For the example of the undersized holes, our goal may be to get the machine to make holes within specification by the next shift.  For larger or more chronic problems, the time frame could be much longer and the goal statement more complex.


In some production problems, we may be able to shift to a temporary Safe State in which we know the process will work, but the economics may not be ideal.  This could involve a reduction in speeds and feeds for a machining operation, use of 100% inspection to assure no bad parts get to customers, or substitution of a more expensive but more reliable material or tool. 


This Safe State is a bridge between the Current State (not working right) and the Ideal State (problem fixed), and may keep the bosses off your back long enough to find the root cause.



Explore

In this stage you

  • Review the fundamentals
  • Check information
  • Review trends and relevant changes


This stage is a little harder to describe, since it relies heavily on your knowledge and experience.  It assumes you know the process, have access to information sources, and have a good grasp of engineering fundamentals. 


To review the fundamentals

  • Understand the process and the problem
  • Apply conservation laws and accounting principles
  • Apply empirical relationships from the field
  • Consider Rules of Thumb


To check information

  • Get independent confirmation of gages and people
  • Distinguish between fact and opinion
  • Determine if the data is internally consistent and consistent with principles


To review trends and relevant changes

  • Examine data trends and history to look for relevant patterns
  • Consider all process changes for relevance
  • Consider time sequence and look at cause-effect relationships



Plan

Now it is time to

  • Brainstorm and list possible Hypotheses (problem diagnoses)
  • Compare the Initial (current) evidence with each Hypothesis to eliminate or support the diagnosis
  • Specify Diagnostic Actions to eliminate or support Hypotheses


Use of a spreadsheet type form (shown below) is of particular value in the plan stage.


Marks:  + = support,  X = eliminate,  - = neutral

                    | Initial Evidence  | Diagnostic Action
Working Hypotheses  |  a  |  b  |  c    |  A  |  B  |  C
Hypothesis 1        |     |     |       |     |     |
Hypothesis 2        |     |     |       |     |     |

We start by brainstorming all the things that could cause the problem and list those as our possible hypotheses.  The big danger in this section is to fall in love with the first or best hypothesis that you have.  Try to stay open-minded and let the process lead you to the best hypothesis.


Each significant piece of initial evidence (a temperature or visual markings on a fracture surface) is given a column (lower case a, b, c) in the Initial Evidence category.  For each of the hypotheses, you consider whether that piece of information supports the hypothesis, eliminates the hypothesis, or is neutral.  This is indicated by the appropriate mark in the cell that is the intersection of the hypothesis row and the evidence column.
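The marking scheme described above can be sketched in a few lines of Python.  This is only an illustration, assuming the form is stored as a dictionary; the function name surviving_hypotheses and the marks filled in for Hypotheses 1 to 3 are hypothetical placeholders, not data from a real problem.

```python
# Marks follow the form's legend: "+" = support, "X" = eliminate, "-" = neutral.
# The hypotheses, evidence columns (a, b, c), and marks are hypothetical.

def surviving_hypotheses(matrix):
    """Return the hypotheses that no evidence eliminates, with their support counts."""
    survivors = {}
    for hypothesis, marks in matrix.items():
        if "X" in marks.values():
            continue  # one eliminating mark is enough to drop the hypothesis
        survivors[hypothesis] = sum(1 for mark in marks.values() if mark == "+")
    return survivors


matrix = {
    "Hypothesis 1": {"a": "+", "b": "+", "c": "-"},
    "Hypothesis 2": {"a": "+", "b": "X", "c": "-"},
    "Hypothesis 3": {"a": "-", "b": "+", "c": "-"},
}

result = surviving_hypotheses(matrix)
for name, support in sorted(result.items(), key=lambda item: -item[1]):
    print(f"{name}: {support} supporting item(s)")
```

Counting the supporting marks is just one possible way to rank the survivors.  It treats every piece of evidence as equally important, which your own judgment may overrule.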


Once you have considered the initial evidence, you probably will have more than one viable hypothesis.  This is the point at which the doctor orders more tests and the patient may need to remove their clothes and bend over.  Fortunately, you are the doctor in this case, so it is the workers in the plant who have to worry.


Diagnostic actions often involve changes to production lines, off-line tests, or extra work by fellow employees.  Like medical tests, they may be expensive, time consuming, or annoying.  And like medical tests, they should only be done if they are likely to tell us something that is worth the time and money.


The following spreadsheet table helps us decide what tests to run and in what sequence.


                   |  Price and Timing      |
Diagnostic Action  |  Cost ($)  | Time (hr) |  What will test tell us?
A                  |            |           |
B                  |            |           |
C                  |            |           |

The Diagnostic Actions A, B, C in this table are identical to the Diagnostic Actions in the previous table.  Here we list each action and estimate the cost in dollars and time in hours to accomplish the action.  We also answer the question “What will the test tell us?”  The answer to that question should be in the form “If test results are __, then hypothesis 2 is eliminated and hypotheses 3 and 5 are supported.”  If the test doesn’t serve to eliminate or strongly support a hypothesis, you may want to reconsider it.


So, what about the Sequence column?  Once you have a list of diagnostic tests, you need to consider which to run first.  You may pick the one with the most diagnostic power, or you may want to go with cheap and easy.  The sequence really depends upon the situation.  Remember that administrators hate to see production lines shut down or modified, so there may be costs that can’t be quantified in time and money.
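The two sequencing strategies mentioned above (most diagnostic power first, or cheap and easy first) can be compared with a short Python sketch.  Everything here is assumed for illustration: the DiagnosticAction class, the "resolves" score (how many hypotheses a result could eliminate or strongly support), and the cost and time figures are hypothetical, and the costs that can’t be quantified are left to your judgment.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticAction:
    name: str
    cost: float     # estimated cost in dollars
    time_hr: float  # estimated time in hours
    resolves: int   # hypothetical score: hypotheses the result could eliminate or strongly support


def by_diagnostic_power(actions):
    """Most diagnostic power first; cost and time break ties."""
    return sorted(actions, key=lambda a: (-a.resolves, a.cost, a.time_hr))


def by_cheap_and_easy(actions):
    """Cheapest and quickest first; diagnostic power breaks ties."""
    return sorted(actions, key=lambda a: (a.cost, a.time_hr, -a.resolves))


# Hypothetical estimates for Diagnostic Actions A, B, C
actions = [
    DiagnosticAction("A", cost=500.0, time_hr=8.0, resolves=3),
    DiagnosticAction("B", cost=20.0, time_hr=0.5, resolves=1),
    DiagnosticAction("C", cost=100.0, time_hr=2.0, resolves=2),
]

power_order = [a.name for a in by_diagnostic_power(actions)]
cheap_order = [a.name for a in by_cheap_and_easy(actions)]
print("Most diagnostic power first:", power_order)
print("Cheap and easy first:", cheap_order)
```

With these made-up numbers the two strategies give opposite sequences, which is why the choice depends on the situation.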


You continue to perform diagnostic actions until 1) there is only one hypothesis standing, or 2) you have good enough supporting evidence to select a best hypothesis and move on to implement a solution.  As in science, you cannot prove a hypothesis to be true.  You can only find a best hypothesis.
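The first stopping condition can be sketched as a simple loop in Python.  This is a rough illustration only: the function run_diagnostics and the test results listed are hypothetical, and the second condition (good enough supporting evidence) is a judgment call that the sketch does not attempt to model.

```python
# Marks follow the form's legend: "+" = support, "X" = eliminate, "-" = neutral.
# The hypotheses, actions, and results below are hypothetical.

def run_diagnostics(hypotheses, results):
    """Apply diagnostic results in sequence; stop once one hypothesis is left standing.

    results is a list of (action, {hypothesis: mark}) pairs in the planned order.
    """
    remaining = set(hypotheses)
    performed = []
    for action, marks in results:
        if len(remaining) <= 1:
            break  # only one hypothesis standing, so no further tests are needed
        performed.append(action)
        remaining -= {h for h, mark in marks.items() if mark == "X"}
    return remaining, performed


hypotheses = ["Hypothesis 1", "Hypothesis 2", "Hypothesis 3"]
results = [
    ("A", {"Hypothesis 1": "+", "Hypothesis 2": "X", "Hypothesis 3": "-"}),
    ("B", {"Hypothesis 1": "+", "Hypothesis 3": "X"}),
    ("C", {"Hypothesis 1": "+"}),
]

remaining, performed = run_diagnostics(hypotheses, results)
print("Still standing:", sorted(remaining))
print("Actions performed:", performed)
```

In this made-up case, Action C is never run, because Actions A and B already leave a single hypothesis standing.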


If you have a couple of strong contenders, and no way to choose between them, you may use a shotgun solution.  In a shotgun solution, you address both possible causes at once.  You may never know which of the two was the true cause, but at least the problem is solved.  If your car is not getting sufficient fuel to the cylinders and it is the day before a long trip, you may go ahead and replace both the fuel filter and the fuel pump without trying each in turn.


Do It

In this step, you use your diagnosis or best hypothesis to implement a solution (achieve the Ideal State).  This process is highly dependent on circumstances, so the suggestions are general.  It is even possible that the solution was implemented through the act of diagnosis (e.g., electronic troubleshooting is sometimes lampooned as swapping boards until the device works).


You should

  • Consider how the changes you will make could affect the process (e.g., reinforcing a structure to prevent fatigue failure in one spot can result in a load transfer that causes fatigue in a new spot)
  • Make sure your changes comply with standards (safety, design codes, regulations, legal, ethical)
  • Communicate with appropriate parties (superiors, regulators, affected workers, design library)


Look Back

In the evaluation stage we ask ourselves

  • Did we find and solve the Root Cause?
  • What have we learned?
  • How can we prevent similar problems?


If we did not fix the Root Cause, our solution may be a temporary fix while we address other issues.  Note that some Root Causes may be problems for which you do not have ownership.  Investigations of the Challenger and Columbia accidents showed that hardware problems were the proximal cause of the failures, but management structure and attitudes were at the root of the problems.


Taking stock of what you learned can help anchor the knowledge and convert the incident into valuable experience.  Look at what went well and what you would do differently in the future.


Another way to use what you learned is to try to prevent similar problems.  This may be through

  • Preventive maintenance of similar equipment
  • New monitoring or control equipment
  • Modified operational procedures
  • Improved employee training
  • Dissemination of information to others




Learning Objectives



In-Class Exercise

Exercise 1

Form into groups of 2-4

For the situation (with defined problem) presented by the instructor in class,

  • explore the situation, to understand it
  • brainstorm possible causes (develop hypotheses)
  • refrain from suggesting solutions



Exercise 2

Form into groups of 2-4

  • List the four most likely Hypotheses developed in Exercise 1 (in form below)
  • List the Initial Evidence as individual items (in form below)
  • Relate each piece of evidence to each hypothesis, and indicate whether that piece of information supports, eliminates, or is neutral to the hypothesis.  Make the corresponding mark in the spreadsheet.


Marks:  + = support,  X = eliminate,  - = neutral

                    | Initial Evidence
Working Hypotheses  |  a  |  b  |  c
1.                  |     |     |
2.                  |     |     |
3.                  |     |     |
4.                  |     |     |

Exercise 3

For the surviving hypotheses from Exercise 2

  • List possible diagnostic actions (in form below)
  • For each action, estimate cost, estimate time, and describe how the results will serve to support or eliminate hypotheses (in form below)
  • Prioritize the actions in terms of diagnostic power.


                   |  Price and Timing      |
Diagnostic Action  |  Cost ($)  | Time (hr) |  What will test tell us?
                   |            |           |
                   |            |           |
                   |            |           |
                   |            |           |

In-class exercise, Triads

In this exercise, you will play one of three roles, Troubleshooter, Observer, or Information Expert.  There will be three sessions, so each person will play each role once.


Troubleshooter’s Role

The troubleshooter will

  • Be given a one page description of the situation at the beginning of the session.
  • Read the problem situation aloud
  • Try to diagnose the problem (describing their thoughts aloud)

To do this the troubleshooter will

  • Apply the given information
  • Request diagnostic actions from the information source

Requests for diagnostic action will

  • Be in writing to the information expert (these should be specific, “What is the temperature of the refrigerant before the evaporator?” rather than general, “Is there anything odd about the refrigerant?”)
  • Receive a written response from the information expert (to avoid verbal or body language cues)


Information Expert’s Role

Before the session, the information expert

  • Is given the problem situation along with significant other data on the problem
  • Becomes thoroughly knowledgeable regarding the problem

During the session, the information expert

  • Provides information to the troubleshooter only in writing and only in response to a specific written request (no volunteering information or hinting)
  • Replies “No information available” or “NIA” when the troubleshooter’s question is not answered in the given information
  • Maintains a “poker face”


Observer’s Role

On the ____ Form, the observer will

  • Record time spent in each step of the Six Step Method
  • Record, with an “M”, each time the troubleshooter uses a monitoring statement (e.g., “I have finished that step.”  “Let me check this.”  “How will this help me?”)
  • Record, with an “H” each time the troubleshooter states a hypothesis
  • Record, with a “?” each time the troubleshooter submits a question



At the end of the session

  • The troubleshooter will assess their own performance on their own Troubleshooter Feedback Form.
  • The other two members will also assess the troubleshooter on the troubleshooter’s form (use initials to differentiate)


After all three sessions, turn in

  • All written materials (Problem statements, Question/Response sheets, Feedback forms)

 Feedback Form (long version)


Listener _______________________                  


  1. At the outset of this unit, place a “B” in each category to indicate your self-assessment of your initial, or baseline, skill level.
  2. At the end of the unit, place an “A” in each category to indicate your self-assessment of your skill level after practicing the skill.  Be prepared to provide documentation for your assessment.



Rating scale, from less successful to more successful:

  • (shows few expert behaviors)
  • Good Start (some expert behavior)
  • Getting There (many expert behaviors)
  • Almost There (mostly expert behavior)
  • (shows all expert behavior)


































































Reflection of the Listener

What did I learn from this?




Which of the skills do I do pretty well?  (List Evidence)




Which skills could use some work? (List Evidence)