In this article I am going to go over a few points that I believe make up a good troubleshooting methodology. Troubleshooting is a kind of problem-solving, and I believe it is a skill that can be learned with practice. It is a systematic search for the cause of a problem and there are a variety of logical steps and approaches that can be taken in the process.
Before getting into whatever it is you are trying to troubleshoot, it is important to have an understanding of how the system works. It is better to know how things work beforehand, especially if you are a maintainer or administrator of that system. If not, you can also learn while troubleshooting. You don’t need to be an expert car mechanic to take a few steps to determine why your car is having a certain problem.
What are the facts of the situation? Part of troubleshooting is being a good investigator. Either mentally or in writing, note down the symptoms you are observing or that someone is sharing with you. Think of some questions to ask so you can learn more details and perhaps reveal more symptoms.
What is the size of the problem? Is the problem limited to one individual or is it widespread? A good example is someone complaining that their internet doesn’t work. Is that person the only one having issues connecting, or is the whole office down?
Can you reproduce the problem? This is not always possible, but if you can replicate an issue you’ve gone a long way in narrowing the problem down. Continuing from the previous example, if someone complains their internet doesn’t work, try another computer. If you can’t reproduce the issue, the root cause likely lies with that one device an individual is complaining about.
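The scope and reproduction questions above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `classify_scope`, the device names, and the four labels it returns are all made up for the example.

```python
def classify_scope(results: dict[str, bool]) -> str:
    """Classify a problem's scope from per-device check results.

    `results` maps a (hypothetical) device name to True if it works
    and False if it shows the problem.
    """
    if len(results) < 2:
        return "unknown"  # nothing to compare against yet
    failing = [name for name, ok in results.items() if not ok]
    if not failing:
        return "not reproduced"
    if len(failing) == len(results):
        return "widespread"
    return "isolated"


# Hypothetical observations: only one computer cannot get online.
office = {"front-desk-pc": False, "accounting-pc": True, "lobby-pc": True}
print(classify_scope(office))  # isolated
```

The useful part is the first branch: with only one data point you cannot say anything about scope, which is exactly why trying a second computer is worth the effort.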
Come up with some theories as to what the root cause is. A good theory is one that you can either prove or disprove. Pose a question and find out what the answer to that question is.
Start testing, and do it with one thing at a time. Keep asking and proving/disproving these theories.
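As a rough sketch of testing one theory at a time, here is how the loop might look in Python. The `Theory` class, the `run_theories` function, and the sample questions are hypothetical; the lambdas stand in for whatever real check you would perform by hand or with a tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Theory:
    question: str              # something that can be proved or disproved
    check: Callable[[], bool]  # returns True if the theory is confirmed


def run_theories(theories: list[Theory]) -> Optional[str]:
    """Run one check at a time and stop at the first confirmed theory."""
    for theory in theories:
        confirmed = theory.check()
        outcome = "confirmed" if confirmed else "ruled out"
        print(f"{theory.question} -> {outcome}")
        if confirmed:
            return theory.question
    return None  # every theory was ruled out; time for new theories


# Hypothetical theories with stand-in results.
theories = [
    Theory("Is the network cable unplugged?", lambda: False),
    Theory("Is Wi-Fi switched off on the device?", lambda: True),
    Theory("Is the router down?", lambda: False),
]
found = run_theories(theories)
```

Because each check runs on its own, a confirmed result can be attributed to exactly one theory. Changing several things at once would make that impossible.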
From this point I am going to use a case study to showcase the points I’m laying out in this article.
You get a call one day that a location you maintain has stopped responding. This site is connected to the Internet and can normally be reached remotely, but today it is not accessible. Only this location is having the problem; you cannot reproduce the issue at other sites. You have determined the scope of the problem: one location. Based on your knowledge of how the systems at that site work, you know it has to be either a communications or a power issue cutting off remote access.
So far the facts and evidence you have collected about this issue are that you know this remote site is not responding, it is either a power or communications issue, and the scope of the problem is just one location. Now it is time to gather the equipment, tools, and parts you may possibly need and head out to investigate in person.
Once you get on site, you implement another troubleshooting method: following the chain. In this case study, you know how the systems are interconnected, so it is time to step from A to B to C and so on until you have isolated the root cause that is cutting off remote access.
Recall that before you arrived you were hypothesizing that the problem you are responding to is either a communications or power issue. This location has a router that serves as the access point to the site so it is the first item to look at to determine where the issue may be.
You notice the router is not powered up. Your theory now is that either the router died or it is off because of a power issue. Does anything else have power? You follow the power chain to the next piece of equipment and observe that it is powered off as well. Now you can eliminate the router as the root cause: communications are cut off, but only because the router has no power, so the issue lies somewhere with power.
As you continue investigating you notice that everything on site is powered off, not just the router. The power issue is impacting the entire location. Is it caused by a piece of equipment at the site? You verify that no circuit breakers are tripped, then plug a device you brought into an outlet to see if power is coming in. There is in fact no power at the outlet, and you have ruled out the circuit breakers. Now you know the issue is not with your site or its equipment.
The next step in following the chain is to investigate the power meter. You notice that it is dead with no power coming in. Your testing and following the chain have led you to isolate the issue down to the power meter. It is now the power company’s responsibility to resolve the issue, and once they restore power to the meter the site you have responded to will have power restored as well.
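Following the chain in this case study can be sketched as a short walk from the symptom back toward the source. The function name `follow_chain` and the way I model each link as a name plus a "has power" flag are my own assumptions for illustration; the observations simply mirror what was found on site.

```python
from typing import Optional


def follow_chain(chain: list[tuple[str, bool]]) -> Optional[str]:
    """Walk from the symptom back toward the source.

    `chain` lists (name, has_power) pairs, ordered from the failing
    device upstream. Returns the furthest upstream link that is still
    without power, or None if the very first link has power.
    """
    last_dead = None
    for name, has_power in chain:
        if has_power:
            break  # power is present here, so the fault is downstream of it
        last_dead = name
    return last_dead


# Observations from the case study, ordered from the router upstream.
site = [
    ("router", False),
    ("next piece of equipment", False),
    ("outlet", False),
    ("power meter", False),
]
print(follow_chain(site))  # power meter
```

If the outlet had been live, the same walk would have stopped at the equipment and pointed back at something on site instead of at the power company.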
To summarize, you have followed a troubleshooting methodology by taking a logical progression of steps to isolate the root cause of the problem you are trying to fix. You collected a list of symptoms, determined the scope of the problem, tried to reproduce the problem, tested both remotely and on site, and proved or disproved various questions and theories as to what was wrong by methodically following the chain until the root cause was isolated.