It is important to understand that researchers almost never find the "perfect" dataset, one that contains data for all the variables they must measure in a single download. Multiple datasets from different sources must often be combined. For example, you might find demographic and income information in the US census and then find urban revitalization information from the Department of Housing and Urban Development, producing two spreadsheets of data that you must combine.
Data cleaning, also sometimes referred to as data scrubbing, data munging or, in less academic settings, data wrangling, is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated, formatted inconsistently (ex. one data source uses MM/DD/YYYY date format and another uses DD/MM/YY) or mislabeled (ex. one data source uses "race" as a label and another uses "race/ethnicity"). It is also possible, when combining datasets, that data will be missing (ex. certain variables may be missing data for certain years). If data is incorrect or inconsistent, outcomes and algorithms are unreliable, even though they may look correct.
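The label and date-format mismatches described above can be reconciled before combining the sources. The sketch below uses pandas with small made-up tables (the column names and values are hypothetical, chosen only to mirror the examples in the text):

```python
import pandas as pd

# Hypothetical source A: MM/DD/YYYY dates, "race" label
source_a = pd.DataFrame({
    "date": ["01/15/2021", "02/20/2021"],
    "race": ["A", "B"],
})

# Hypothetical source B: DD/MM/YY dates, "race/ethnicity" label
source_b = pd.DataFrame({
    "date": ["15/03/21", "20/04/21"],
    "race/ethnicity": ["C", "D"],
})

# Normalize the date formats before combining
source_a["date"] = pd.to_datetime(source_a["date"], format="%m/%d/%Y")
source_b["date"] = pd.to_datetime(source_b["date"], format="%d/%m/%y")

# Reconcile the mismatched labels, then stack the rows
source_b = source_b.rename(columns={"race/ethnicity": "race"})
combined = pd.concat([source_a, source_b], ignore_index=True)
```

Normalizing formats and labels first means the combined table has one consistent schema, rather than two half-filled columns that silently refer to the same thing.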
When gathering datasets for your required variables, it is often helpful to keep a chart to record what you have successfully found, where, for what years (when applicable), for what populations (when applicable) and what format the data is available in. A sample chart might look like this:
| Variable | Data source / citation | Years available | Teams | Data format | Other |
|---|---|---|---|---|---|
| Variable 1 (ex. Team ticket revenue) | https://www.statista.com/statistics/193553/revenue-of-national-football-league-teams-in-2010/ | 2021 | all teams | revenue by team in 2021 (in million U.S. dollars) | |
| Variable 2 (ex. Team concession revenue) | https://www.statista.com/statistics/202604/concession-stand-prices-in-the-national-football-league/ | 2006-2021 | all teams | revenue by team in U.S. dollars | |
| Variable 3 (ex. Team parking prices) | https://bolavip.com/en/nfl/nfl-stadiums-list-capacity-parking-spaces-and-rates-in-2021-20210815-0016.html | 2022 | top ten teams | revenue in US dollars per parking space available | Tickets and concessions revenue is included in download. |
| Variable 4 | | | | | |
In the above example, if you were studying the value of hosting an NFL team from 2006-2016 for all host cities, the chart might help you see the data cleaning you will need to do: Variable 1 covers only 2021 and Variable 3 covers only 2022, so data is missing for the years you are studying; Variable 3 covers only the top ten teams rather than all host cities; the variables use different units (millions of U.S. dollars vs. U.S. dollars); and the Variable 3 download duplicates the ticket and concession revenue you already have from Variables 1 and 2.
After you determine the critical data values you need for your analysis, collect the data, and sort and organize it, you must clean it. There is no single, set procedure you will always follow to clean data. What you will need to do depends on the datasets you have located. Some general issues to consider are:
Step 1: Handle missing data
You may need to search for additional datasets, revise your years studied or populations studied, etc. How you manage missing data will depend on the impact the missing data has on your ability to conduct your study.
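As a sketch of the two common options, the hypothetical pandas example below either drops rows with missing values (shrinking the study) or fills the gaps, here with each team's mean revenue. The table and column names are assumptions for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical revenue table with a gap in one year
revenue = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "year": [2006, 2007, 2006, 2007],
    "revenue": [250.0, np.nan, 180.0, 190.0],
})

# Option 1: drop observations with missing values
dropped = revenue.dropna(subset=["revenue"])

# Option 2: fill each gap with that team's mean revenue
filled = revenue.copy()
filled["revenue"] = filled.groupby("team")["revenue"].transform(
    lambda s: s.fillna(s.mean())
)
```

Which option is appropriate depends, as the text says, on the impact the missing data has on your ability to conduct the study: dropping rows can bias the sample, while filling values invents data and should be disclosed.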
Step 2: Remove duplicate or irrelevant observations
When combining datasets, as in the example above, you may obtain the same data from multiple sources. Similarly, some datasets may include information about variables that are not relevant to your study (ex. you are studying NFL team revenue and your dataset includes NFL, MLB and NBA revenue). Duplicate and irrelevant data should be removed.
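Both removals can be sketched in pandas. The league labels and revenue figures below are hypothetical, mirroring the NFL/MLB/NBA example in the text:

```python
import pandas as pd

# Hypothetical combined table: one duplicated row plus irrelevant leagues
df = pd.DataFrame({
    "league": ["NFL", "NFL", "MLB", "NBA"],
    "team": ["A", "A", "X", "Y"],
    "revenue": [250.0, 250.0, 300.0, 280.0],
})

# Remove exact duplicate observations
df = df.drop_duplicates()

# Keep only the observations relevant to the study (NFL teams here)
df = df[df["league"] == "NFL"]
```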
Step 3: Fix structural error and formatting inconsistencies
Fix issues such as capitalization and spelling inconsistencies; ensure that units of measurement and labeling are consistent; remove unnecessary data formats (such as XML); and make sure data that requires specific formatting (numerals vs. text strings vs. dates) is formatted properly.
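A minimal pandas sketch of these fixes, using a made-up team table (the inconsistent values are assumptions chosen to show each problem):

```python
import pandas as pd

# Hypothetical table with inconsistent capitalization, whitespace,
# and numbers/dates stored as text strings
df = pd.DataFrame({
    "team": ["patriots ", "Patriots", "JETS"],
    "revenue": ["250", "250.5", "300"],
    "date": ["01/15/2021", "01/16/2021", "01/17/2021"],
})

# Normalize stray whitespace and capitalization in labels
df["team"] = df["team"].str.strip().str.title()

# Coerce text strings into the numeric and date types the analysis needs
df["revenue"] = pd.to_numeric(df["revenue"])
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
```

After normalization, "patriots " and "Patriots" collapse to one label, so grouping and joining on `team` behaves correctly.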
Step 4: Filter unwanted outliers
Often, there will be data that, at a glance, does not appear to fit. If you have a legitimate reason to remove an outlier, such as an obvious data-entry error, doing so will improve the quality of the data you are working with. However, sometimes an apparent outlier is exactly the evidence that will prove the theory you are working on. Be thoughtful and intentional when removing outliers.
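One common way to flag, rather than blindly delete, outliers is the interquartile-range (IQR) rule, sketched below on hypothetical revenue figures. Note that the rule only flags candidates; inspecting them before dropping is the "thoughtful and intentional" part:

```python
import pandas as pd

# Hypothetical ticket revenues with one suspected data-entry error
revenue = pd.Series([250.0, 260.0, 245.0, 255.0, 2500.0])

# Flag values more than 1.5 * IQR outside the middle 50% of the data
q1, q3 = revenue.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)

# Inspect the flagged values before deciding whether to drop them
flagged = revenue[is_outlier]
cleaned = revenue[~is_outlier]
```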