Data Cleaning
Introduction

The PERSIMUNE Data Warehouse (DWH) contains data from many different sources and is continuously expanding. Data at this scale requires consistent, ongoing cleaning, quality assurance and evaluation before it can be provided to researchers as enriched datasets. As part of the collaborative research infrastructure of PERSIMUNE, we ask you to contribute to the data cleaning if you require data that has not previously been through the cleaning process. When you request access to data, PERSIMUNE will let you know whether your request requires cleaning and will discuss with you how you can contribute.

Cleaning process

The cleaning process takes place at the PERSIMUNE location, where the Data Management Study Interest Group will assist and supervise. We provide the facilities needed to do the work. Briefly, the cleaning process consists of the following steps:

1 – Identification of codes

The variable of interest may be identified by more than one code. You need to identify all relevant codes and where each code is located in the DWH tables.
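As an illustration of step 1, the sketch below searches a small code dictionary for every code whose description mentions the variable of interest and reports the table each code sits in. The table names, codes and descriptions are invented for the example and do not reflect the actual DWH layout; `find_codes` is a hypothetical helper.

```python
# Hypothetical extract of code dictionaries. The table names, codes and
# descriptions are invented for illustration and are not real DWH content.
CODE_ROWS = [
    {"table": "lab_results", "code": "A001", "description": "Haemoglobin, blood"},
    {"table": "lab_results", "code": "A002", "description": "Creatinine, plasma"},
    {"table": "lab_legacy", "code": "HGB", "description": "haemoglobin (old system)"},
]


def find_codes(rows, keyword):
    """Return sorted (table, code) pairs whose description mentions keyword."""
    needle = keyword.casefold()  # case-insensitive matching
    return sorted(
        (row["table"], row["code"])
        for row in rows
        if needle in row["description"].casefold()
    )


matches = find_codes(CODE_ROWS, "haemoglobin")
for table, code in matches:
    print(f"{table}: {code}")
```

A keyword search like this is only a starting point: synonyms, abbreviations and local spellings usually have to be searched separately and the hits reviewed by hand.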

2 – Triangulate against other sources

Check against another source whether the DWH contains the amount of data you would expect for the variable.
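One way to approach step 2 is to compare record counts per period between the DWH and an independent source and flag periods where they diverge. The sketch below assumes yearly counts and a 5% tolerance; the numbers, the tolerance and the `compare_counts` helper are all assumptions made for illustration.

```python
def compare_counts(dwh_counts, reference_counts, tolerance=0.05):
    """Return periods where the DWH count deviates from the reference count
    by more than the given relative tolerance, as {period: (dwh, reference)}."""
    flagged = {}
    for period in sorted(set(dwh_counts) | set(reference_counts)):
        dwh = dwh_counts.get(period, 0)
        ref = reference_counts.get(period, 0)
        if ref == 0:
            # No reference data: any DWH rows for this period are unexpected.
            if dwh != 0:
                flagged[period] = (dwh, ref)
            continue
        if abs(dwh - ref) / ref > tolerance:
            flagged[period] = (dwh, ref)
    return flagged


# Invented yearly counts, for illustration only.
dwh_counts = {2018: 980, 2019: 1010, 2020: 400}
reference_counts = {2018: 1000, 2019: 1000, 2020: 1000}

flagged = compare_counts(dwh_counts, reference_counts)
print(flagged)
```

With these example numbers only 2020 is flagged, which would point to a gap in the DWH data for that year.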

3 – Examine data quality

Check whether the data is what you expect: are the values within the expected range, are they complete, and are they free of duplicates?
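The three checks in step 3 can be sketched as a small summary function. The record fields (`patient_id`, `sampled_at`, `value`), the plausible range and the definition of a duplicate (same patient, time and value) are assumptions chosen for the example, not the DWH schema.

```python
from collections import Counter


def quality_summary(records, low, high):
    """Count missing values, out-of-range values and exact duplicate rows."""
    values = [r["value"] for r in records]
    missing = sum(v is None for v in values)
    out_of_range = sum(v is not None and not (low <= v <= high) for v in values)
    # Assumption: a duplicate is a row repeating patient, time and value.
    keys = Counter((r["patient_id"], r["sampled_at"], r["value"]) for r in records)
    duplicates = sum(n - 1 for n in keys.values() if n > 1)
    return {
        "rows": len(records),
        "missing": missing,
        "out_of_range": out_of_range,
        "duplicates": duplicates,
    }


# Invented example rows, for illustration only.
records = [
    {"patient_id": 1, "sampled_at": "2020-01-05", "value": 8.4},
    {"patient_id": 1, "sampled_at": "2020-01-05", "value": 8.4},   # duplicate
    {"patient_id": 2, "sampled_at": "2020-01-06", "value": None},  # missing
    {"patient_id": 3, "sampled_at": "2020-01-07", "value": 84.0},  # implausible
]

summary = quality_summary(records, low=2.0, high=15.0)
print(summary)
```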

4 – Define clean-up rules


Based on what you learned in step 3, define rules to clean up duplicates, missing values etc. so that the data is uniform and ready for statistical analysis.
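Clean-up rules are easier to review and later automate when they are written down explicitly and each rule records how many rows it affected. The sketch below assumes three example rules (drop missing values, drop exact duplicates keeping the first, flag rather than delete out-of-range values); the rules, field names and range are illustrative assumptions, and the right rules depend on the variable.

```python
def apply_rules(records, low, high):
    """Apply three example clean-up rules and log how many rows each affected."""
    seen = set()
    cleaned = []
    log = {"dropped_missing": 0, "dropped_duplicate": 0, "flagged_out_of_range": 0}
    for r in records:
        # Rule 1: drop rows without a value.
        if r["value"] is None:
            log["dropped_missing"] += 1
            continue
        # Rule 2: drop exact duplicates, keeping the first occurrence.
        key = (r["patient_id"], r["sampled_at"], r["value"])
        if key in seen:
            log["dropped_duplicate"] += 1
            continue
        seen.add(key)
        # Rule 3: keep out-of-range values but flag them for review.
        in_range = low <= r["value"] <= high
        if not in_range:
            log["flagged_out_of_range"] += 1
        cleaned.append({**r, "in_range": in_range})
    return cleaned, log


# Invented example rows, for illustration only.
records = [
    {"patient_id": 1, "sampled_at": "2020-01-05", "value": 8.4},
    {"patient_id": 1, "sampled_at": "2020-01-05", "value": 8.4},
    {"patient_id": 2, "sampled_at": "2020-01-06", "value": None},
    {"patient_id": 3, "sampled_at": "2020-01-07", "value": 84.0},
]

cleaned, log = apply_rules(records, low=2.0, high=15.0)
print(log)
```

Flagging instead of deleting implausible values is a deliberate choice in this sketch: it keeps the original data available for the validation in step 5.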

5 – Validation and suggestions for prospective data quality monitoring


Apply the rules from step 4 and validate the result. Revise and adapt the rules as needed so that they can be implemented automatically in the Data Warehouse, which keeps the data consistent and carries your cleaning effort forward.


Time investment

The time needed to go through the steps of the cleaning process depends on your level of experience with databases. Expect to spend between 2 and 15 days per variable.