
Cross-functional collaboration is essential in a clinical trial. While data cleaning typically falls within the remit of data managers, the process can also benefit significantly from the involvement of statisticians. In this blog, I’ll share some ways, based on my experience working on a large respiratory trial, in which statisticians can contribute to the data cleaning process.

Why is data cleaning important?

In any clinical trial, we pose a set of questions and then use the generated data and analysis to answer them. So, to get answers that we can trust and to ensure the integrity of the statistical analyses, we need clean, reliable, fit-for-purpose data. While the process of data cleaning sounds straightforward in principle, in practice there are challenges to overcome. The sheer volume of data can be an issue, particularly in larger studies involving many patients and lengthy recruitment timelines. Where multiple sites are involved, different investigators are collecting the data, and differences in their processes can lead to anomalies.

How can statisticians enhance the process?

Typically, the responsibility lies with the data management team to clean the incoming data, based on a set of rules defined before the study starts, for example specifying that measurements should fall within a sensible range. Then, if any data points fall outside these ranges or any other issues are identified, data queries are issued for resolution.
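To make the idea of a pre-defined range rule concrete, here is a minimal Python sketch. It is only an illustration: the field names, the limits in `RANGES` and the `range_queries` helper are all hypothetical, and real limits would come from the study's own validation rules.

```python
# Hypothetical plausibility ranges (assumed for illustration only).
RANGES = {
    "fev1_l": (0.3, 8.0),          # FEV1 in litres
    "heart_rate_bpm": (30, 220),   # heart rate in beats per minute
}

def range_queries(records, ranges=RANGES):
    """Return one (subject, visit, field, value) tuple per out-of-range value."""
    queries = []
    for rec in records:
        for field, (low, high) in ranges.items():
            value = rec.get(field)
            if value is None:
                continue  # missing values would need a separate check
            if not (low <= value <= high):
                queries.append((rec["subject"], rec["visit"], field, value))
    return queries

records = [
    {"subject": "S01", "visit": 1, "fev1_l": 2.4, "heart_rate_bpm": 72},
    {"subject": "S02", "visit": 1, "fev1_l": 24.0, "heart_rate_bpm": 80},
]
queries = range_queries(records)
```

Each tuple returned corresponds to a value that would trigger a data query. Note that a check like this looks at one value at a time, which is what a range rule is designed to do.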

One of the advantages of involving statisticians, in collaboration with data managers, at an early stage in this process is that they are often closer to the final output. As statisticians, we produce the analysis, and generally have more visibility of how unclean data is likely to affect the endpoints and the results. We can also apply techniques drawn from our statistical skills to assist the data cleaning efforts.

Examples of statistical support for data cleaning issues

During the respiratory project I alluded to earlier, the analytical team received data extracts each month. These regular updates gave us an excellent picture of how the incoming data was going to affect our analyses. In response to several issues that we identified from these extracts, we started creating statistical data checks to complement the existing process.   We then used our programming skills to develop visual tools to clearly illustrate the issues at hand to the sites and the data management team. Through this approach, we not only improved the quality of the data but built even closer communication and positive collaborations with our virtual team colleagues. Below, I summarise two of these issues, and how we helped to address them. 

Overlapping exacerbations

Exacerbations are a common respiratory endpoint, defined in our protocol as an adverse event (AE) of worsening symptoms of the disease requiring treatment with either antibiotics or steroids. On the AE page of the case report form (CRF), we asked the sites to check a box to identify whether the adverse event met the exacerbation definition. However, as we reviewed our monthly extracts, we began to notice several overlapping exacerbations. These overlaps occurred where investigators recorded an adverse event under one clinical term and then recorded the same event separately as an exacerbation. This problem wasn’t straightforward to communicate to sites, and we knew that such errors could falsely increase the exacerbation rate. Therefore, we created a tool that generated a profile of each patient’s exacerbations, showing where their durations ran in parallel. This simple visual helped to describe the issue to the sites and our data management colleagues, and substantially reduced the incorrect recording.
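The overlap check behind a profile like this can be sketched in a few lines of Python. This is not the tool we built, just a minimal illustration under some assumptions: the record layout, the field names and the `overlapping_exacerbations` function are hypothetical, and every event is assumed to have both a start and an end date.

```python
from datetime import date
from itertools import combinations

def overlapping_exacerbations(events):
    """Flag pairs of AEs for the same subject whose date ranges overlap
    and where at least one of the pair is ticked as an exacerbation."""
    by_subject = {}
    for ev in events:
        by_subject.setdefault(ev["subject"], []).append(ev)

    flagged = []
    for subject, evs in by_subject.items():
        ordered = sorted(evs, key=lambda e: e["start"])
        for a, b in combinations(ordered, 2):
            # Two closed intervals overlap if each starts before the other ends.
            overlap = a["start"] <= b["end"] and b["start"] <= a["end"]
            if overlap and (a["is_exacerbation"] or b["is_exacerbation"]):
                flagged.append((subject, a["term"], b["term"]))
    return flagged

events = [
    {"subject": "S01", "term": "Chest infection", "is_exacerbation": False,
     "start": date(2020, 3, 1), "end": date(2020, 3, 10)},
    {"subject": "S01", "term": "Exacerbation", "is_exacerbation": True,
     "start": date(2020, 3, 3), "end": date(2020, 3, 12)},
    {"subject": "S01", "term": "Headache", "is_exacerbation": False,
     "start": date(2020, 5, 1), "end": date(2020, 5, 2)},
]
flagged = overlapping_exacerbations(events)
```

A flagged pair is only a candidate for review: the site still has to confirm whether the two records describe the same clinical event.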

Examining data points together for FEV1

FEV1 (forced expiratory volume in one second), another common respiratory endpoint that measures lung function, was also analysed within this trial. We wanted to determine whether the experimental therapy slowed the decline in FEV1 in the subjects. However, we realised that the data management checks weren’t able to pick up all the potential issues, partly because they were designed to examine individual data points rather than look at the data longitudinally. To enhance the data cleaning process, we created another profile tool using SAS and Excel that could summarise all the data points for a specific patient together, and look at features like standard deviations or changes from the patient’s baseline. As with the previous example, the tool generated outputs that sites could easily review, highlighting areas where data needed correction.
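The kind of longitudinal summary described above can be sketched as follows. Our tool was built in SAS and Excel; this Python version is only an illustration of the idea, and the `fev1_profile` function and both thresholds are hypothetical values chosen for the example, not limits from the trial.

```python
from statistics import stdev

def fev1_profile(values, max_change_l=0.5, max_sd_l=0.4):
    """Summarise one patient's FEV1 values (litres, in visit order).

    The first value is treated as baseline. Both thresholds are
    hypothetical and would need clinical input in a real study.
    """
    baseline = values[0]
    changes = [round(v - baseline, 3) for v in values]
    sd = stdev(values) if len(values) > 1 else 0.0
    # Visits whose change from baseline exceeds the allowed magnitude.
    suspect_visits = [i for i, c in enumerate(changes) if abs(c) > max_change_l]
    return {
        "sd": sd,
        "changes": changes,
        "suspect_visits": suspect_visits,
        "high_variability": sd > max_sd_l,
    }

# Every value here could pass a single-point range check,
# but the third visit stands out against the patient's own history.
profile = fev1_profile([2.50, 2.45, 1.20, 2.40])
```

Here the third visit (index 2) is flagged for review even though 1.20 L is a plausible FEV1 in isolation, which is exactly the sort of issue that a check on individual data points cannot see.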

Statistical input benefits everyone 

In summary, statisticians can apply a range of techniques to complement and enhance existing cleaning processes. As statisticians, we also benefit from the outcomes, gaining greater confidence in the data and analysis. In this particular trial, using visualisations helped us to dramatically improve communication and the data cleaning process by aligning data management, statistics and sites on the impact of unclean data.


Watch the webinar recording ‘Statistical input into improving data quality.’

Nick Cowans