Whether you’re studying for your CWNA-107 or just troubleshooting issues at your day job, it helps to have defined steps for discovering and remediating problems when they arise. The steps laid out in the CWNA-107 blueprint section 7.1 are listed below and are the steps I personally use when working through an issue.
- Identify the Problem
- Discover the Scale of the Problem
- Define Possible Causes
- Narrow to the Most Likely Cause
- Create a Plan of Action or Escalate the Problem
- Perform Corrective Actions
- Verify the Solution
- Document the Results
Let’s take a look into each of these steps and how you can use them when trouble arises in your network.
Identify the Problem
First things first. Usually when we get a trouble ticket it roughly translates to ‘the wifi isn’t working’. Well, that could mean anything – are users having problems connecting, authenticating, using a specific app, WHAT I SAY! GIVE ME THE DETAILS!
Using Emotional Intelligence, connect with the user experiencing the problem, or the help desk technician who escalated the case, to gather as much detail as possible. You’ll want to ask for their username if it’s a WPA2-Enterprise network, which device they are using to connect, whether this has happened before or is a new problem, and whether some things work while others do not.
For example, we spoke with Linda in Accounting, and she is unable to connect her company-issued phone to the WiFi. She is receiving an invalid credentials error when attempting to authenticate, and the credentials have been verified to work on other systems.
Great! Now we have identified the problem and are ready to move on in our process.
Discover the Scale of the Problem
THE WORLD IS ON FIRE! EVERYTHING IS CRASHING! THIS IS THE END! …oh. Never mind.
How many people are having this problem? Sometimes when an executive or C-level user has a problem, the ticket will come to your queue as a high priority, followed by an immediate phone call and an out-of-breath assistant who ran to your desk because ‘nobody’ can work with the wifi down. Again, address the situation with care and Emotional Intelligence. Now, let’s work on finding out how bad this really is.
Continuing with our previous example, we test authentication to the network from a different device and find that it is also failing. Sure, that’s not a good sign – but before we declare this a global outage we need to do a little more testing. Let’s log into the RADIUS server and check the logs. The logs show that we haven’t had any new authentications over the last 2 hours.
OK, now we can reasonably conclude that this is affecting all users and should be handled in accordance with your SLAs.
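To make the "check the last 2 hours of logs" step concrete, here is a minimal sketch in Python. The log format is an assumption made up for illustration (an ISO timestamp, then ACCEPT or REJECT, then free text); your RADIUS server will almost certainly log differently, so treat the parsing as a placeholder.

```python
from datetime import datetime, timedelta

def recent_auth_counts(lines, now, window=timedelta(hours=2)):
    """Count accepts and rejects that fall inside the time window.

    Assumes a hypothetical log format such as:
        2024-05-01T10:30:00 REJECT user=linda
    """
    counts = {"accept": 0, "reject": 0}
    cutoff = now - window
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # not enough fields to be a log entry
        try:
            stamp = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        result = parts[1].lower()
        if cutoff <= stamp <= now and result in counts:
            counts[result] += 1
    return counts
```

If the accept count is zero while rejects keep arriving from many different users, that points toward a widespread problem rather than one person's device.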
Define Possible Causes
This is the time for brainstorming and teamwork. What could be causing the problem at the scale you’re experiencing? Write ideas down, show some logs, do some debugs, work logically – but don’t rush it. The possible causes you run through may be different for a single device with a possible driver or credential issue compared to a global authentication failure.
In our example we know that the problem is authentication, but we aren’t sure where in the chain the problem is occurring. Is the connection between the RADIUS server and Active Directory down? Are our SSL certificates expired? Did a buggy update get applied to clients recently?
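One of those candidate causes, an expired certificate, is quick to rule in or out. The sketch below assumes you already have the certificate's expiry string in the form Python's `ssl` module reports it (the `notAfter` field of `getpeercert()`); the function name and the example dates are made up for illustration.

```python
import ssl

SECONDS_PER_DAY = 86400

def days_until_expiry(not_after, now_epoch):
    """Return the days left before a certificate expires.

    not_after is a string like "Jan 20 00:00:00 2024 GMT".
    A negative result means the certificate has already expired.
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / SECONDS_PER_DAY

# Example with fixed dates so the result does not depend on today's date.
now = ssl.cert_time_to_seconds("Jan 10 00:00:00 2024 GMT")
remaining = days_until_expiry("Jan 20 00:00:00 2024 GMT", now)
```

In real use you would pass `time.time()` as the current time. If the result is comfortably positive, you can cross certificates off the list and move on to the next idea.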
Narrow to the Most Likely Cause
A connectivity issue may start with pings and traceroutes. If they fail, then work down the OSI model. If they are successful, look at the upper layers. With authentication failures, start digging into the logs on the RADIUS server and client to seek any clues. For random client drops, let’s run some debugs on the controllers and try to see if there’s an answer there.
Our example’s global authentication problem seems to be caused by an issue on the RADIUS server. After logging into our RADIUS authentication server we are greeted with a big red error box saying that our LDAP connection has failed.
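The idea of working through checks in a fixed order and stopping at the first one that fails can be sketched in a few lines of Python. This is only an illustration: the check names are placeholders, and `tcp_reachable` is a simple helper that tests whether a TCP port accepts connections (for example LDAP on its conventional port 389), not a replacement for ping or traceroute.

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def first_failure(checks):
    """Run (name, check) pairs in order and return the first failing name.

    Returns None when every check passes.
    """
    for name, check in checks:
        if not check():
            return name
    return None

# Stubbed checks, ordered from the lower layers upward.
# In practice a check might be: lambda: tcp_reachable("ldap.example.internal", 389)
# (that hostname is hypothetical).
checks = [
    ("client has link", lambda: True),
    ("gateway responds", lambda: True),
    ("radius can reach ldap", lambda: False),
    ("user can authenticate", lambda: False),
]
suspect = first_failure(checks)
```

Ordering the checks from the bottom up means the first failure is usually the closest to the root cause; everything after it may only be failing as a consequence.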
Create a Plan of Action or Escalate the Problem
Are you able to resolve the problem or does another team need to be involved? If you are able to apply a fix you should first create your plan of action. Your action plan may be an emergency change request made to your change advisory board, or it may be informing your manager of your findings and the fix you plan to implement. Either way, never take corrective action without a plan. The plan should cover both the fix and how to back out any changes if they do not resolve the problem.
If another teammate or team owns the area where you suspect the problem exists, then escalate the problem to them appropriately. Following your company’s processes, give the team all the information you have gathered to this point and offer assistance if you can be of any.
Since our problem exists between the RADIUS server and LDAP, we will escalate the problem to our Systems team for assistance. After you provide the name of the service account used to connect RADIUS with LDAP, the Systems team tells you that the account has expired.
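One way to keep yourself honest about the "fix plus back-out" rule is to write each change down alongside its undo. The Python sketch below is purely illustrative; `ChangePlan` and its methods are names I made up, not part of any change management tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class ChangePlan:
    """A list of changes, each paired with the action that reverses it."""
    description: str
    steps: List[Tuple[Callable[[], None], Callable[[], None]]] = field(default_factory=list)

    def add_step(self, apply, back_out):
        self.steps.append((apply, back_out))

    def execute(self, verify):
        """Apply every step, then verify. Back out on any failure.

        Returns True if the fix was applied and verified, False if the
        changes were backed out.
        """
        completed = []
        try:
            for apply, back_out in self.steps:
                apply()
                completed.append(back_out)
            if verify():
                return True
        except Exception:
            pass  # treat an error during a step like a failed verification
        # Undo in reverse order so later changes are removed first.
        for back_out in reversed(completed):
            back_out()
        return False
```

The point is less the code than the habit: if you cannot fill in the back-out half of a step, the plan is not ready yet.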
Perform Corrective Actions
Let’s take that plan and put it to action! You have approval to make your change and fix the issue. Be sure to keep to your plan and not make changes on the fly. Keeping calm is very important, especially when dealing with a global outage. You wouldn’t want to make a lot of changes that don’t fix the problem AND create new problems in the process.
If you had to escalate your issue, check in with the teammate assigned to this issue. Again, offer assistance and ask if you can ride-along while they work the problem so you can learn more about their process.
Thankfully, our example’s expired service account is an easy fix for the Systems team. They are able to make a standard change and reinstate access to your service account. We go back and create a plan of action to update the service account information on the RADIUS server while working in tandem with the Systems team.
Verify the Solution
Is everything OK out there? Is the solution you put in place the correct one? Are users happy? If yes, then move on to the next step! Congratulations!
If not, then return to previous steps to track down the possible cause and create a new plan of action. This may be the most frustrating part of our job. We have a fix that we KNOW is going to work, but alas it doesn’t. At this point, don’t get discouraged or frustrated (unless you’re on your 4th go-round, then it becomes understandable). Focus, take a step back, ask someone else to take a look. Maybe the problem is something staring you right in the face, but you can’t see it because you’ve been staring at this screen for hours.
With our example we were able to update the password in RADIUS for our LDAP service account, test, and success! We can now see new client authentications coming through in the logs. We’ve reached out to Linda in Accounting to ensure she was able to access the wifi on her phone and thanked her for her patience.
Document the Results
This is the most important step, and the one that most often gets overlooked. Being able to look back at an issue or build a new knowledge-base article will save you from repeating all that troubleshooting time when the issue comes up again. Going through a postmortem with your team gives you a chance to see what you did well, where you could improve, provide new documentation, and prevent future instances of the same problem.
To finish up our example, we created new alerts that notify the owner when a service account is expired or disabled, added an alert for when RADIUS fails to connect with LDAP, and built a redundant RADIUS server to prevent future global outages like the one we saw here.
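As a rough idea of what the service account alert could look like, here is a small Python sketch. The account records, field names, and the 14-day early warning window are all assumptions for illustration; in practice the data would come from your directory and the alert would go to whatever notification system you use.

```python
from datetime import date, timedelta

def accounts_needing_alert(accounts, today, warn_days=14):
    """Return (name, owner, reason) for accounts that need attention.

    Each account is a dict with hypothetical keys:
    name, owner, expires (a date or None), disabled (bool).
    """
    alerts = []
    for acct in accounts:
        if acct.get("disabled"):
            alerts.append((acct["name"], acct["owner"], "disabled"))
            continue
        expires = acct.get("expires")
        if expires is None:
            continue  # account never expires
        if expires <= today:
            alerts.append((acct["name"], acct["owner"], "expired"))
        elif expires - today <= timedelta(days=warn_days):
            # Early warning is an addition beyond the alerts described above.
            alerts.append((acct["name"], acct["owner"], "expiring soon"))
    return alerts
```

Had something like this been running before the outage, the owner of the RADIUS service account would have heard about the expiry before Linda did.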