The PERSIMUNE Data Warehouse (DWH) contains data from many different sources and is continuously expanding. The vast amounts of stored data require consistent, ongoing cleaning, quality assurance and evaluation in order to provide enriched datasets for researchers. As part of engaging in the collaborative research infrastructure of PERSIMUNE, we ask you to contribute to data cleaning if you require data that has not previously undergone the cleaning process. When you request access to data, PERSIMUNE will let you know whether cleaning is needed for your request and start a dialogue about how you can contribute. You are also welcome to join our DWH meetings, where data cleaning activities are discussed regularly.
The contact person for our data cleaning activities is Jamshed Gill.
Briefly, the data cleaning process consists of the following steps:
1 – Identification of codes
The variable of interest may be identified by more than one analysis code. You need to identify all relevant codes, as well as where each code is located in the DWH tables.
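As a minimal sketch of this step, one variable can be registered under several analysis codes spread over several tables. All names below (the variable, the codes, the tables) are illustrative assumptions, not the actual PERSIMUNE schema:

```python
# Hypothetical code map: one clinical variable ("creatinine") identified
# by several analysis codes, possibly located in different DWH tables.
code_map = {
    "creatinine": [
        {"code": "NPU18016", "table": "lab_results"},   # assumed names
        {"code": "NPU04998", "table": "lab_results"},
        {"code": "LOCAL_CREA", "table": "legacy_lab"},
    ],
}

def codes_for(variable):
    """Return every (code, table) pair registered for a variable."""
    return [(e["code"], e["table"]) for e in code_map.get(variable, [])]

print(codes_for("creatinine"))
```

Keeping such a map in one place makes the later steps reproducible: every query for the variable can draw its codes from the same list.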
2 – Triangulate against another source
Check against another source whether the DWH contains the amount of data expected for the variable.
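A simple way to triangulate is to compare record counts between the DWH and an independent source. The counts and the 95% threshold below are example figures, not PERSIMUNE requirements:

```python
# Hypothetical triangulation: compare the number of DWH records for a
# variable with the count reported by an independent source system.
def coverage(dwh_count, reference_count):
    """Fraction of expected records actually present in the DWH."""
    if reference_count == 0:
        return None  # nothing expected; ratio is undefined
    return dwh_count / reference_count

# Example: 9,200 DWH rows versus 10,000 expected from the source system.
ratio = coverage(9200, 10000)
if ratio is not None and ratio < 0.95:
    print(f"Only {ratio:.0%} of expected records found - investigate.")
```

A large shortfall usually points back to step 1: a relevant analysis code or table may have been missed.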
3 – Examine data quality
Check whether the data is what you expect. Are the values within the expected range? Are they complete and free of duplicates?
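The three questions above (range, completeness, duplicates) can be sketched as simple checks. The field names and the plausible range are assumptions made for the example:

```python
# Illustrative records for one variable; field names are assumptions.
records = [
    {"patient_id": 1, "value": 85.0},
    {"patient_id": 2, "value": None},    # missing value
    {"patient_id": 3, "value": 4500.0},  # outside the plausible range
    {"patient_id": 1, "value": 85.0},    # exact duplicate of the first row
]

LO, HI = 10.0, 2000.0  # assumed plausible range for the example

missing = [r for r in records if r["value"] is None]
out_of_range = [r for r in records
                if r["value"] is not None and not (LO <= r["value"] <= HI)]

seen, duplicates = set(), []
for r in records:
    key = (r["patient_id"], r["value"])
    if key in seen:
        duplicates.append(r)
    else:
        seen.add(key)

print(len(missing), len(out_of_range), len(duplicates))
```

Counting each problem type separately gives you the overview needed to define targeted rules in step 4.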
4 – Define clean-up rules
Based on your findings from step 3, define rules to clean up duplicates, missing values, etc., in order to make the data uniform and ready for statistical analysis.
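Such rules can be expressed as a single, repeatable function. The three rules below (drop exact duplicates, drop implausible values, keep but flag missing values) are example choices, not prescribed PERSIMUNE rules:

```python
# Sketch of clean-up rules applied to illustrative records.
def clean(records, lo=10.0, hi=2000.0):
    seen = set()
    cleaned = []
    for r in records:
        key = (r["patient_id"], r["value"])
        if key in seen:
            continue                 # rule 1: drop exact duplicates
        seen.add(key)
        v = r["value"]
        if v is not None and not (lo <= v <= hi):
            continue                 # rule 2: drop implausible values
        cleaned.append({**r, "missing": v is None})  # rule 3: flag missing
    return cleaned

rows = [
    {"patient_id": 1, "value": 85.0},
    {"patient_id": 1, "value": 85.0},    # duplicate
    {"patient_id": 2, "value": None},    # missing
    {"patient_id": 3, "value": 4500.0},  # implausible
]
print(clean(rows))  # patient 1 kept once; patient 2 kept and flagged
```

Writing the rules as code, rather than applying them by hand, is what makes steps 5 and 6 (validation and automatic implementation) possible.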
5 – Validation and suggestions for prospective data quality monitoring
Apply and validate the rules from step 4. Revise and adapt the rules as needed for automatic implementation in the Data Warehouse, so that your cleaning effort remains consistent and is carried forward.
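Validation can be sketched as a check that the cleaned output actually satisfies the rules before they are proposed for automatic implementation. Field names and the range are the same illustrative assumptions as above:

```python
# Sketch: verify that cleaned records contain no duplicates and no
# implausible values. Names and the range are assumptions for the example.
def validate(cleaned, lo=10.0, hi=2000.0):
    keys = [(r["patient_id"], r["value"]) for r in cleaned]
    assert len(keys) == len(set(keys)), "duplicates remain"
    for r in cleaned:
        v = r["value"]
        assert v is None or lo <= v <= hi, f"implausible value: {v}"
    return True

cleaned = [
    {"patient_id": 1, "value": 85.0},
    {"patient_id": 2, "value": None},  # missing values were kept and flagged
]
print(validate(cleaned))
```

A validation run like this, repeated on new data, is also a natural starting point for the prospective quality monitoring mentioned above.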
6 – Implementation of new rules
Contact PERSIMUNE IT to discuss which rules could be implemented in the Data Warehouse.
The time needed to go through the steps of the cleaning process depends on your level of experience with databases. Expect to spend between 2 and 15 days per variable.