Development of Automated Methods for Big Data to Achieve Compliance with IRB, Institutional, and Federal Requirements


Background and Statement of Problem
Institutional, legal, and regulatory requirements that govern data mining of electronic medical records (EMRs) include provisions for maintaining data security, preserving patient confidentiality, and respecting institutional prerogatives. De-identification algorithms must be expressed in a manner that is clear, validatable, and revisable. In proactive pharmacovigilance, much of the information on adverse drug or device events (ADEs) can be obtained only through manual EMR review and data extraction, and is therefore often unobtainable in practice. Automated natural language processing (NLP) provides a practical means of extracting data and detecting safety signals. Because manual EMR review has become prohibitive and important ADE evidence consequently goes largely undetected, we advocate an effective NLP method to help annotate narratives and synthesize structured data.

We developed a corpus of automatically de-identified data from EMRs, including both narratives (e.g., clinical notes, pathology reports, imaging reports) and structured data (e.g., demographics, ICD codes, lab values). Our primary objective was to demonstrate that NLP can reliably detect and extract key ADE data that are secure, de-identified, and IRB-compliant, using prediction and monitoring tools, in order to advance pharmacovigilance capability with big data. To assess NLP capabilities, we used information about known ADEs to evaluate the accuracy, robustness, and completeness of extracted EMR information. We then applied this methodology to test automatic extraction of ADE-related data for romidepsin, a drug recently approved by the Food and Drug Administration (FDA).

From a single-institution data repository of three million individual EMRs, we automatically detected and extracted de-identified data from 52 EMRs (1,581 documents) for patients exposed to the drug. We used these data to evaluate the effectiveness of two independent automated software processes: a de-identification process and a previously trained machine-learning NLP tagger that identifies relevant terms such as medication name, dosage, method of administration, and ADEs.

The NLP tagger provides a basis for using EMR narrative text to augment the structured data fields already present in the EMR, and the automated de-identification program demonstrates advanced pharmacovigilance capability, with well-supported sensitivity and specificity.
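
As an illustration of how sensitivity and specificity might be computed for a de-identification step, the following minimal sketch compares token-level predictions against a manually annotated reference. The function name, the token-level framing, and the toy labels are assumptions made for illustration only; they are not the evaluation procedure or the data used in this work.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    gold: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary PHI labels.

    gold[i] is True if token i is PHI according to the manual reference
    annotation; predicted[i] is True if the automated process flagged it.
    (Hypothetical token-level framing, assumed for illustration.)
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # PHI correctly flagged
    fn = sum(1 for g, p in pairs if g and not p)      # PHI missed
    tn = sum(1 for g, p in pairs if not g and not p)  # non-PHI correctly kept
    fp = sum(1 for g, p in pairs if not g and p)      # non-PHI wrongly flagged

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example with five tokens (invented labels, not study data).
gold_labels = [True, True, False, False, False]
predicted_labels = [True, False, False, False, True]
sens, spec = sensitivity_specificity(gold_labels, predicted_labels)
```

For de-identification, sensitivity is usually the more critical of the two measures, since a missed PHI token (a false negative) is a confidentiality risk, whereas a false positive only removes some non-identifying text.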

While this automated process effectively removes protected health information (PHI), the methodology continues to undergo development and refinement. In the course of this work, we devised a workflow in which time-saving automated de-identification of big data is followed by a final, relatively short manual de-identification step and is combined with manual expert annotation of EMR text.
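
The general shape of such a workflow, an automated de-identification pass followed by a short manual check, can be sketched as follows. This is a simplified illustration under stated assumptions: the regular-expression patterns, the placeholder labels, and the residual-digit heuristic are hypothetical and are not the de-identification program described here, which would need to cover far more PHI categories.

```python
import re

# Hypothetical, deliberately simple PHI patterns (assumptions for illustration).
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

# Heuristic for leftover identifiers: any run of five or more digits.
RESIDUAL_DIGITS = re.compile(r"\d{5,}")


def deidentify(text: str) -> str:
    """Automated pass: replace each matched PHI span with a bracketed label."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def needs_manual_review(text: str) -> bool:
    """Flag text that still contains a long digit run after the automated pass."""
    return RESIDUAL_DIGITS.search(text) is not None


# Invented example note, not taken from any EMR.
note = "Seen 03/14/2012, MRN: 448812. Call 555-123-4567. Romidepsin 14 mg/m2 IV."
clean = deidentify(note)
flagged = needs_manual_review(clean)
```

In this sketch only the documents flagged by the residual check would go to a human reviewer, which is one way the final manual step can stay short relative to full manual de-identification.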

Directions for Future Study
The next steps are to develop fully automated detection, extraction, and de-identification of ADE-related narrative and structured EMR data for research purposes.