The How-To Guide for Cleaning and Preparing Data for Analysis
Tidying up your data is part science, part art, and all work. If you’re lucky, you’ll get your hands on some perfectly formatting data (Slack does a nice job, for example). But more often than not, you’ll need to do some data cleaning before it is ready for analysis. Survey data is a particularly common clean-up challenge, but even pulling data from Google Analytics will require some clean-up.
To help you on this journey, we’ve created this opinionated guide to preparing your data for analysis. The goal is to get to a flat table of data — which will serve as the raw materials that need to analyze data. Think of it like the kind of data you’d use as input to a PivotTable in Excel.
One of the seminal discussions of this problem is by Hadley Wickham, who wrote a paper, summarized in this post, called Tidy Data. He says":
Like families, tidy datasets are all alike but every messy dataset is messy in its own way.
A couple of important things to understand as you try to create a beautiful, analyzable block of data.
Each row is an observation about a thing. It is helpful to get a clear idea of what a row represents in your data before you start the cleaning/preparation process. Rows might represent individual people. Or transactions. Or activity for a product for a given day. If you are rows are too aggregated (e.g. “financial results by year”) you won’t be able to do a lot of analysis on the data table.
Columns, then, are the descriptive information about each of the rows. We call them measures (variables that you add up, average, or are otherwise used for calculations) or dimensions (variables that can be put into buckets or describe the things in the rows). Date/time is a special kind of dimension. Often a data set will have lots of columns. It is nice to have a lot of data; it is even nicer to work with only the columns you need.
A cleaned and prepared data table should both look tidy (i.e. a flat table with column headers) and be tidy (i.e. values in each cell are consistent when they represent the same things and don’t include multiple values in a cell).
How do we get there?
While we have outlined a complete data preparation process below, but you can also use the associated Google spreadsheet to practice our growing collection of tips and tricks. Here are the steps to use for getting to a clean, tidy, and analytics-ready data table:
Get organized
Practice time-saving skills
Prep your columns
Fix columns
Fix rows
Restructure data
1. Get Organized (The Easy Stuff)
You want the first row of your data table to be the names for the columns. If there is other stuff in your data table above these data field labels, delete those rows.
Remove grid and color formatting. As a general rule (soon to be broken in the next section), color and grid formatting is distracting. Data for analysis is your raw materials, not the final canvas.
Add a sheet to document your changes. It is possible that you’ll need to clean this data again; you’ll want a record of what you did the first time. This is also a good place to keep lookup tables, references, and links to sources.
Here’s an example:
2. Practice Time-Saving Skills
There are a few spreadsheet skills that will save you a ton of time with cleaning data:
Learn to move quickly around the page. Holding the Command + Arrow key will let you move around the page without smashing on the arrow keys. This key combination jumps you to the next empty cell of a row (left or right arrow) or up and down a column. This is a quick way to move to the top of your table, or jump to the last column or last row
Command + Shift + Arrow moves in the same way as described above, but it also selects/highlights the set of cells that you moved through.
We will be using a lot of formulas to clean the data. However, once you’ve updated the data, you’ll often want to copy the results without the formula. In order to paste just values, learn the key combination: Command + Shift + V
In general, you’ll want to do the following when you clean a data column:
Add a new column to the right of the column you want to clean. Give it the same column label as the original — or a new label if you are constructing new values.
Add a formula at the top of the new column (see examples below) to adjust or revise the values in the original column.
Test the formula on 20-30 rows in the new column to make sure it is giving you the result you want.
Copy the formula to the bottom of the new column. This is where you use the key combinations Command + Arrow key and Shift + Command + Arrow key
Copy the new column in its entirety and then Paste as Values to replace the formulas with the static values.
Delete the original column.
3. Prep Your Columns
The majority of data preparation and cleaning involves evaluating and fixing one column at a time. Therefore, it is useful to get organized about the different types of columns you are working with.
Find the columns that have useful dimensions. Dimensions are ways that you might categorize the things represented by the rows.
Color these in blue.
Find measure columns that are either empty or all the same value. Consider deleting these columns because they don’t provide useful information for analysis.
Change the data field names to give them a label that is around 5-15 characters — abbreviations can be confusing, long labels will be hard to show in your visualizations.
Next, find the columns that have measures. These are the values you might add up, average, or do other calculations on.
Color these in orange.
Find measure columns that are either empty or all the same value. You can probably delete these columns.
Change the data field names to something that isn’t too long (like the whole question from a survey) or too short (a hard-to-decipher acronym).
Select your full range of data and turn on the data filter function. We’ll want to use this for sorting and filtering.
Three types of data cleaning/preparation problems
The majority of data preparation involves “fixing” columns. A column of data can have many types of problems: data formats, extraneous information, missing information. Section 4 provides a list of common problems, a solution, and a link to the example.
Your data may have individual rows that need to be fixed or removed. Section 5 highlights a few common issues here.
Finally, your data may need to be restructured. This is where things get complicated. You may need to turn certain columns into rows (or rows into columns). Data tables can be too granular for analysis (e.g. “transactional-level data”) or too summarized. In Section 6 we’ll provide a few examples of these problems.
4. Fix Columns
Problem: Values are broken into sections, with empty cells that imply values in between.
Solution: IF() function can fill in the implied values. Spreadsheet example
Problem: Cells contain multiple values and require complex logic to break out the values.
Solution: Use MID(), LEFT(), RIGHT() functions to identify break points and parse the field values. Spreadsheet example
Problem: Cells contain multiples that are consistently delineated by a comma or other character
Solution: SPLIT() function. Spreadsheet example
Problem: ALLCAPS or other capitalization of words.
Solution: Change text with LOWER(), UPPER(), or PROPER() functions. Spreadsheet example
Problem: Values for the desired data field are spread across multiple columns, E.g. first name, last name.
Solution: Use CONCATENATE() function to consolidate multiple values into a single column. Spreadsheet example
Problem: Numbers are being treated as text.
Solution: Convert text into numbers with VALUE() function. Spreadsheet example
Problem: You want a date field that uses date values (i.e. year, month, day) from other columns.
Solution: Use DATE() function to construct a date value. Spreadsheet example
Problem: The values in a column include information that is unnecessary.
Solution: Use a combination of CLEAN(), FIND(), and MID() functions to remove extra characters. Spreadsheet example
Problem: Values in a column need to be converted to 0 or 1 values for calculations.
Solution: Use a combination of IF() and ISBLANK() functions. Spreadsheet example
Problem: ZIP codes are being displayed as numbers and therefore losing the preceding ‘0’ for some codes.
Solution: Identify “shortened” ZIP codes and use the CONCATENATE() function to append the missing zero. Spreadsheet example
Problem: The values in a column need to be converted into a label that is more completer or easier to read, e.g. country names rather than country abbreviations.
Solution: Use the LOOKUP() function. Spreadsheet example
Problem: The values in a column need to “bucketed” into groups, e.g. age ranges.
Solution: Either use a nested IF() function and reference a lookup table or use the SWITCH() function. Spreadsheet example
Problem: Some measure values are represented by non-numeric values such as “—” or “..” or “NULL”
Solution: Identify the columns with issues using the COUNT() and COUNTA() functions. Use the data filtering feature to find the problematic values. Spreadsheet example
5. Fix Rows
Problem: The header row is repeated multiple times throughout the table.
Solution: Use data filtering to filter to only these rows and delete all but one. Spreadsheet example
Problem: Some cells are merged. This is not conducive to data analysis.
Solution: In Excel, select all cells in the spreadsheet, then use the menu option for ‘Merge & Center > Unmerge Cells’. Otherwise, Look for patterns as to when cells get merged and use data filter in the column header to filter and manually un-merge cells. Spreadsheet example
Problem: The table has a summary row or rows that adds up or averages values all the values in a column. These rows are not helpful for tidy data.
Solution: Identify how these rows are labeled. Use the data filter in the column header to narrow to only these rows and delete.
7. 🚧 Restructuring Data 🚧
There are three common ways that data needs to get restructured. This is where things get hard and where there isn’t a formula to make the necessary changes.
The same metric is broken out into separate columns. We will often see data tables where a metric, like sales, is broken into separate columns by a dimension like month/quarter, geographic region, product, etc. This structure is not conducive to analysis. The preferred structure is to have a column that represents that dimension and a single column for the metric.
Data has been pre-aggregated. Your data table needs to include enough details to support the slicing and dicing you want to do. If a table has already been summarized, all that information will be hidden behind aggregated values. In this case, you may need to get back to the original data source to access disaggregated data.
Data is too granular. The flip side of the previous problem is when you are dealing with very detailed data. Data that shows individual transactions or activities, like weblog data, maybe too granular for analysis. This data needs to be pre-aggregated so that it is more manageable. In web analytics data, for example, rather than having each row represent an action by a user, each row could be summarized to describe all the actions of each user.