There’s an awesome Python package called Scrubadub that can help you remove personally identifiable information (PII) from text data. This is a great step to take before publishing a dataset that may contain PII, in order to prevent inadvertent disclosure.
In this example, we’ll clean up some CSV data using Scrubadub to prepare it for loading into Socrata.
Before you start, make sure you have the following installed on your machine:
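The steps that follow rely on Pandas and Scrubadub, so those are the two packages checked in this small sketch. It only confirms that they can be imported from your Python environment. The helper name `missing_packages` is made up for this example and is not part of either library.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: these are the import names used in the rest of this walkthrough.
required = ["pandas", "scrubadub"]
missing = missing_packages(required)

if missing:
    print("Install these before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Anything reported as missing can usually be installed with `pip`.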
Create a dataframe from your local CSV file with Pandas:
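A minimal sketch of this step follows. The file name `records.csv`, the helper `load_records`, and the sample columns are placeholders for illustration; an in-memory sample stands in for the local file so the snippet runs on its own.

```python
import io

import pandas as pd


def load_records(source):
    """Read a CSV into a DataFrame, keeping every column as text.

    Reading as text avoids pandas guessing types, which can silently
    alter values such as ZIP codes with leading zeros.
    """
    return pd.read_csv(source, dtype=str, keep_default_na=False)


# With a real file you would call: df = load_records("records.csv")
# (hypothetical file name). Here a small in-memory sample is used instead.
sample_csv = io.StringIO(
    "id,notes\n"
    "1,Call Jane at 555-123-4567\n"
    "2,Email john@example.com for details\n"
)
df = load_records(sample_csv)
print(df.head())
```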
Scrubadub is a simple package that will look for names and other identifying information, like email addresses, SSNs, and phone numbers.
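One way to apply it to a dataframe is sketched below, assuming the free-text columns are known in advance. `scrubadub.clean` is the library's text-cleaning function; the helpers `scrub_columns` and `scrub_with_scrubadub` and the column name `notes` are invented for this example. Which kinds of PII are detected by default can differ between Scrubadub versions, so check the output against your own data.

```python
import pandas as pd


def scrub_columns(df, columns, cleaner):
    """Return a copy of df with `cleaner` applied to every string cell in `columns`.

    Non-string cells (for example missing values) are left untouched.
    """
    cleaned = df.copy()
    for col in columns:
        cleaned[col] = cleaned[col].map(
            lambda value: cleaner(value) if isinstance(value, str) else value
        )
    return cleaned


def scrub_with_scrubadub(df, columns):
    """Scrub the given columns using scrubadub.clean with its default settings."""
    import scrubadub  # imported lazily so scrub_columns works without it

    return scrub_columns(df, columns, scrubadub.clean)
```

With Scrubadub installed, a call such as `scrub_with_scrubadub(df, ["notes"])` returns a new dataframe and leaves the original unchanged, which makes it easy to compare before and after.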
Data cleansing is a serious topic and you should always work with your privacy or policy officers within your organization to make sure you are taking the correct steps to protect privacy.
Finally, we’ll write our cleansed records back out to CSV:
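A sketch of the write step, assuming the cleaned data is held in a dataframe. The output file name, the helper `write_records`, and the sample rows are placeholders; a temporary directory is used so the snippet does not overwrite anything on your machine.

```python
import os
import tempfile

import pandas as pd


def write_records(df, path):
    """Write the dataframe to CSV without the pandas row index column."""
    df.to_csv(path, index=False, encoding="utf-8")


# Stand-in for the scrubbed dataframe produced in the previous step.
cleaned = pd.DataFrame(
    {"id": ["1", "2"], "notes": ["Call {{NAME}}", "Email {{EMAIL}} for details"]}
)

# Hypothetical output location; replace with your own path.
out_path = os.path.join(tempfile.mkdtemp(), "records_clean.csv")
write_records(cleaned, out_path)
print("Wrote", out_path)
```

Passing `index=False` keeps the pandas row index out of the file, so the output has the same columns as the input.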
Once you’re done, the cleaned data file can be used to update a dataset via DataSync. For more information, see its detailed documentation.