When working with data in Excel, you’ll often have the issues of handling outliers in your data set.
Having outliers is quite common in all kinds of data, and it’s important to identify and treat these outliers to make sure that your analysis is correct and more meaningful.
In this tutorial, I’ll show you how to find outliers in Excel, and some of the techniques that I have used in my work to handle these outliers.
This Tutorial Covers:
What are Outliers and Why is it Important to Find these?
An outlier is a data point that is way beyond the other data points in the data set. When you have an outlier in the data, it can skew your data which can lead to incorrect inferences.
Let me give you a simple example.
Let’s say 30 people are traveling in a bus from destination A to destination B. All the people are in a similar weight group and income group. For the purpose of this tutorial, let’s consider the average weight to be 220 pounds and the average yearly income to be $70,000.
Now somewhere in the middle of our route, the bus stops, and Bill Gates hops in.
Now, what do you think this would do to the average weight and the average income of the people on the bus.
While the average weight is not likely to change much, the average income of the people on the bus is going to skyrocket heavily.
That’s because Bill Gates’s income is an outlier in our group, and that gives us a wrong interpretation of the data. The average income for each person on the bus would be a few billion dollars, which is way beyond the actual value.
When working with actual datasets in Excel, you can have outliers in any direction (i.e., a positive outlier or a negative outlier).
And to make sure that your analysis is correct, you somehow need to identify these outliers and then decide how to best treat them.
Now let’s see a couple of ways to find outliers in Excel.
Find Outliers by Sorting the Data
With small datasets, a quick way to identify outliers is to simply sort the data and manually go through some of the values at the top of this sorted data.
And since there could be outliers in both directions, make sure you first sort the data in ascending order and then in descending order and then go through the top values.
Let me show you an example.
Below I have a dataset where I have call durations (in seconds) for 15 customer service calls.
Below are the steps to sort this data so that we can identify the outliers in the dataset:
- Select the Column Header of the column you want to sort (cell B1 in this example)
- Click the Home tab
- In the Editing group, click on the Sort & Filter icon.
- Click on Custom Sort
- In the Sort dialog box, select ‘Duration’ in the Sort by drop-down and ‘Largest to Smallest’ in the Order drop-down
- Click Ok
The above steps would sort the call duration column with the highest values at the top. Now you can manually scan the data and see if there are any outliers.
In our example, I can see that the first two values are way higher than the rest of the values (and the bottom two are way lower).
Note: This method works with small datasets where you can manually scan the data. It’s not a scientific method but works well
Finding Outliers Using the Quartile Functions
Now let’s talk about a more scientific solution that can help you identify whether there are any outliers or not.
In statistics, a quartile is one-fourth of the data set. For example, if you have 12 data points, then the first quartile would be the bottom three data points, the second quartile would be the next three data points, and so on.
Below is the data set where I want to find the outliers. To do this, I will have to calculate the 1st and the 3rd quartile, and then using it calculate the upper and the lower limit.
Below is the formula to calculate the first quartile in cell E2:
and here is the one to calculate the third quartile in cell E3:
Now, I can use the above two calculations to get the Interquartile Range (which is 50% of our data within the 1st and the 3rd quartile)
Now we will use the interquartile range to find the lower and upper limit which would contain most of our data.
Anything which is out of these lower and upper limits would then be considered outliers.
Below is the formula to calculate the lower limit:
=Quartile1 - 1.5*(Inter Quartile Range)
which in our example becomes:
And the formula to calculate the upper limit is:
=Quartile3 + 1.5*(Inter Quartile Range)
which in our example becomes:
Now that we have the upper and lower limit in our data set, we can go back to the original data and quickly identify those values that do not lie in this range.
A quick way to do this would be to check every value and return a TRUE or FALSE in a new column.
I have used the below OR formula to get TRUE for those values that are outliers.
Now you can filter the Outlier column and only show the records where the value is TRUE.
Alternatively, you can also use conditional formatting to highlight all the cells where the value is TRUE
Finding the Outliers Using the LARGE/SMALL functions
If you work with a lot of data (values in multiple columns), you can extract the largest and the smallest 5 or 7 values and see if there are any outliers in it.
If there are any outliers, you will be able to identify them without having to go through all the data in both directions.
Suppose we have the below dataset and we want to know if there are any outliers.
Below is the formula that will give you the largest value in the dataset:
Similarly, the second largest value will be given by
If you’re not using Microsoft 365, which has dynamic arrays, you can use the below formula and it will give you the five largest values from the dataset with one single formula:
Similarly, if you want the smallest 5 values, use the below formula:
or the following in case you don’t have dynamic arrays:
Once you have these values, it’s really easy to find out any outliers in the dataset.
While I have chosen to extract the largest and smallest 5 values, you can choose to get 7 or 10 based on how big your dataset is.
I am not sure if this is an acceptable method for finding outliers in Excel or not, but this is the method that I used when I had to work with a lot of financial data in my job a few years ago. Compared to all the other methods covered in this tutorial, I found this one to be the most effective.
How to Handle Outliers the Right Way
So far, we have seen the methods that will help us find the outliers in our data set. But what to do once you know that there are outliers.
Here are a couple of methods that you can use to handle outliers so that your data analysis is correct.
Delete the Outliers
The easiest way to remove outliers from your data set is to simply delete them. This way it won’t skew your analysis.
It’s a more viable solution when you have large datasets and deleting a couple of outliers won’t impact the overall analysis. And of course, before deleting the data make sure you create a copy and delve into what’s causing these outliers.
Normalize the Outliers (Adjust the Value)
Normalizing the outliers is what I used to do when I was in my full-time job. For all the outlier values, I would simply change them to a value that is slightly higher than the maximum value in the data set.
This made sure that I’m not deleting the data but at the same time I’m not letting it skew my data.
To give you a real-life example, if you’re analyzing the net profit margin of companies, where most of the companies lie within -10% to 30%, and there are a couple of values that are upward of 100%, I would simply change these outlier values to 30% or 35%.
So these are some of the methods that you can use in Excel to find outliers.
Once you have identified the outliers, you can delve into the data and look for what’s causing these, at the same time pick one of the techniques to handle these outliers (which could be removing these or normalizing these by adjusting the value)
I hope you found this tutorial useful.
Other Excel tutorials you may also like: