It is important to understand that researchers almost never find the "perfect" dataset, one that contains data for all the variables they must measure in a single download. Multiple datasets from different sources must often be combined. For example, you might find demographic and income information in the US census and then find urban revitalization information from the Department of Housing and Urban Development, producing two spreadsheets of data that you must combine.
Data cleaning, also sometimes referred to as data scrubbing, data munging or, in less academic settings, data wrangling, is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated, formatted inconsistently (ex. one data source uses MM/DD/YYYY date format and another uses DD/MM/YY) or mislabeled (ex. one data source uses "race" as a label and another uses "race/ethnicity"). It is also possible, when combining datasets, that data will be missing (ex. certain variables may be missing data for certain years). If data is incorrect or inconsistent, outcomes and algorithms are unreliable, even though they may look correct.
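The label and date-format mismatches described above can be reconciled before combining the sources. The sketch below uses pandas with small made-up tables (the column names and values are hypothetical, chosen only to mirror the examples in the text):

```python
import pandas as pd

# Hypothetical source A: MM/DD/YYYY dates, "race" label
source_a = pd.DataFrame({
    "date": ["01/15/2021", "02/20/2021"],
    "race": ["A", "B"],
})

# Hypothetical source B: DD/MM/YY dates, "race/ethnicity" label
source_b = pd.DataFrame({
    "date": ["15/03/21", "20/04/21"],
    "race/ethnicity": ["C", "D"],
})

# Normalize the date formats before combining
source_a["date"] = pd.to_datetime(source_a["date"], format="%m/%d/%Y")
source_b["date"] = pd.to_datetime(source_b["date"], format="%d/%m/%y")

# Reconcile the mismatched labels, then stack the rows
source_b = source_b.rename(columns={"race/ethnicity": "race"})
combined = pd.concat([source_a, source_b], ignore_index=True)
```

Normalizing formats and labels first means the combined table has one consistent schema, rather than two half-filled columns that silently refer to the same thing.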
When gathering datasets for your required variables, it is often helpful to keep a chart to record what you have successfully found, where, for what years (when applicable), for what populations (when applicable) and what format the data is available in. A sample chart might look like this:
| Variable | Data source / citation | Years available | Teams | Data format | Other |
|---|---|---|---|---|---|
| Variable 1 (ex. Team ticket revenue) | https://www.statista.com/statistics/193553/revenue-of-national-football-league-teams-in-2010/ | 2021 | all teams | revenue by team in 2021 (in million U.S. dollars) | |
| Variable 2 (ex. Team concession revenue) | https://www.statista.com/statistics/202604/concession-stand-prices-in-the-national-football-league/ | 2006-2021 | all teams | revenue by team in U.S. dollars | |
| Variable 3 (ex. Team parking prices) | https://bolavip.com/en/nfl/nfl-stadiums-list-capacity-parking-spaces-and-rates-in-2021-20210815-0016.html | 2022 | top ten teams | revenue in US dollars per parking space available | Tickets and concessions revenue is included in download. |
| Variable 4 | | | | | |
In the above example, if you were studying the value of hosting an NFL team from 2006-2016 for all host cities, the chart might help you see the data cleaning you will need to do: Variable 1 covers only 2021 and Variable 3 covers only 2022, so data is missing for the years you are studying; Variable 3 covers only the top ten teams rather than all host cities; the variables use different units (millions of U.S. dollars vs. U.S. dollars); and the Variable 3 download duplicates the ticket and concession revenue you already have from Variables 1 and 2.
After you determine the critical data values you need for your analysis, collect the data, and sort and organize it, you must clean it. There is no single, set procedure you will always follow to clean data. What you will need to do depends on the datasets you have located. Some general issues to consider are:
Step 1: Handle missing data
You may need to search for additional datasets, revise your years studied or populations studied, etc. How you manage missing data will depend on the impact the missing data has on your ability to conduct your study.
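As a sketch of the two common options, the hypothetical pandas example below either drops rows with missing values (shrinking the study) or fills the gaps, here with each team's mean revenue. The table and column names are assumptions for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical revenue table with a gap in one year
revenue = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "year": [2006, 2007, 2006, 2007],
    "revenue": [250.0, np.nan, 180.0, 190.0],
})

# Option 1: drop observations with missing values
dropped = revenue.dropna(subset=["revenue"])

# Option 2: fill each gap with that team's mean revenue
filled = revenue.copy()
filled["revenue"] = filled.groupby("team")["revenue"].transform(
    lambda s: s.fillna(s.mean())
)
```

Which option is appropriate depends, as the text says, on the impact the missing data has on your ability to conduct the study: dropping rows can bias the sample, while filling values invents data and should be disclosed.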
Step 2: Remove duplicate or irrelevant observations
When combining datasets, as in the example above, you may obtain the same data from multiple sources. Similarly, some datasets may include information about variables that are not relevant to your study (ex. you are studying NFL team revenue and your dataset includes NFL, MLB and NBA revenue). Duplicate and irrelevant data should be removed.
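Both removals can be sketched in pandas. The league labels and revenue figures below are hypothetical, mirroring the NFL/MLB/NBA example in the text:

```python
import pandas as pd

# Hypothetical combined table: one duplicated row plus irrelevant leagues
df = pd.DataFrame({
    "league": ["NFL", "NFL", "MLB", "NBA"],
    "team": ["A", "A", "X", "Y"],
    "revenue": [250.0, 250.0, 300.0, 280.0],
})

# Remove exact duplicate observations
df = df.drop_duplicates()

# Keep only the observations relevant to the study (NFL teams here)
df = df[df["league"] == "NFL"]
```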
Step 3: Fix structural error and formatting inconsistencies
Fix issues such as capitalization and spelling inconsistencies; ensure that units of measurement and labeling are consistent; remove unnecessary data formats (such as XML); and make sure data that requires specific formatting (numerals vs. text strings vs. dates) is formatted properly.
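A minimal pandas sketch of these fixes, using a made-up team table (the inconsistent values are assumptions chosen to show each problem):

```python
import pandas as pd

# Hypothetical table with inconsistent capitalization, whitespace,
# and numbers/dates stored as text strings
df = pd.DataFrame({
    "team": ["patriots ", "Patriots", "JETS"],
    "revenue": ["250", "250.5", "300"],
    "date": ["01/15/2021", "01/16/2021", "01/17/2021"],
})

# Normalize stray whitespace and capitalization in labels
df["team"] = df["team"].str.strip().str.title()

# Coerce text strings into the numeric and date types the analysis needs
df["revenue"] = pd.to_numeric(df["revenue"])
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
```

After normalization, "patriots " and "Patriots" collapse to one label, so grouping and joining on `team` behaves correctly.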
Step 4: Filter unwanted outliers
Often, there will be data that, at a glance, does not appear to fit. If you have a legitimate reason to remove an outlier, such as an obvious data-entry error, doing so will improve the quality of the data you are working with. However, sometimes an apparent outlier is exactly the evidence that will prove the theory you are working on. Be thoughtful and intentional when removing outliers.
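One common way to flag, rather than blindly delete, outliers is the interquartile-range (IQR) rule, sketched below on hypothetical revenue figures. Note that the rule only flags candidates; inspecting them before dropping is the "thoughtful and intentional" part:

```python
import pandas as pd

# Hypothetical ticket revenues with one suspected data-entry error
revenue = pd.Series([250.0, 260.0, 245.0, 255.0, 2500.0])

# Flag values more than 1.5 * IQR outside the middle 50% of the data
q1, q3 = revenue.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)

# Inspect the flagged values before deciding whether to drop them
flagged = revenue[is_outlier]
cleaned = revenue[~is_outlier]
```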