Neal D. Goldstein, PhD, MBI

A Researcher's Guide to Using Electronic Health Records
From Planning to Presentation, 2nd edition

This is the official website for the textbook A Researcher's Guide to Using Electronic Health Records.

As webpages are meant to be, this is a living document and will be updated regularly with new content that I believe is useful to researchers working with EHR data. If you have suggestions for content, relevant articles, or other material in this area, please email me and I may share it here for others. Additionally and in the spirit of transparency and openness, I would like to point out several other books relevant to EHR researchers: Secondary Analysis of Electronic Health Records, Pharmacoepidemiology (esp. Part IIIb, Electronic Data Systems, in the 6th ed.), and Clinical Research Informatics. For those new to EHRs and the practice of clinical documentation, I highly recommend the book Electronic Health Records: Understanding and Using Computerized Medical Records. Although out of print, it can be found easily and inexpensively through any used book seller.

For those interested in a more formal treatment of the material, I offer an introductory workshop on EHR-based epidemiology at the Society for Epidemiologic Research annual meeting, as well as a full graduate level course on EHR research at Drexel University.


Use the links below to download a PDF copy of the research planner, source codes in R, and the example dataset.


As errors are discovered in the text, they will be posted here.


This section is used to track updates or additions to the chapters of the book since last publication.

Introduction: Chapters 1-2.

  • Ch. 2: Examples of SOAP notes. Typically, a SOAP note will be generated for each patient encounter with a clinician. Using a shortcut template, many clinicians will standardize their particular style of entry. However, while a SOAP note structures clinical documentation from a clinical standpoint, there is still heterogeneity in styles, which affects the ability to automatically parse pertinent data. The two boxes below show equally well-written SOAP notes, but with varying degrees of machine interpretability.
    Jane Doe is 66 y/o who attended her follow-up of her HTN. She feels well. She does not have dizziness, headache, or fatigue. Jane has no history other than hypertension. Her only medication is HCTZ at 25mg per day. Jane has lost 5 lbs in the past 3 months, following a low-fat diet and walking 10 minutes a day. She drinks two glasses of wine each evening. Jane uses no OTC medications such as cold remedies or herbal remedies.

    Generally, Jane appears well.
    Weight 155lbs, Height 55 inches, BMI ~30, Pulse 76 reg, BP 153/80.
    She has no lower extremity edema.

    Jane is here for a follow-up of her hypertension. It is not well-controlled since blood pressure is above the goal of 135/85. A possible trigger to her poor control of HTN may be her alcohol use or the presence of obesity.

    Continue a low-fat diet and exercise. Consider increasing walking time to 20-30 minutes to assist with weight loss. Discussed alcohol use and its relationship to HTN. Jane agrees to a trial of drinking wine only on weekend evenings. Check home BPs. Check potassium since she is taking a diuretic.
    Follow-up in the clinic in 1 month. Bring a blood pressure diary to that visit. Consider adding ACE inhibitor at the next visit if BP is still elevated.

    S: 42 yo woman presents with pale skin, weakness, dizziness, and epigastric pain. 2 weeks ago she experienced decreased exercise tolerance. She takes frequent doses of antacids and uses ibuprofen 200mg prn headaches. NKDA. She has children age 15, 12, and 1.

    O: T 38 C, RR 18, BP: sitting 118/75, standing 120/60, HR: Sitting 90, standing 110. Hb 8gm/dL, Hct 27%, platelets 300,000/mm3, retics 0.2%, MCV 75, serum iron 40mcg/dL, serum ferritin 9ng/ml, TIBC 450 mcg/dL, guaiac stools. Cheilosis at corners of mouth, and koilonychias at nail beds. PMH: peptic ulcer and preeclampsia with last pregnancy. Dx: iron-deficiency anemia.

    1. Fe: counsel parent on tolerance and side effects
    2. Discuss guaiac stool and ibuprofen, f/u with PCP on GI bleed, possible ulcer. d/c ibuprofen, use APAP prn HA.

    1. Iron sulfate 325mg TID x 6 months - f/u with PCP for retic count after 7 days of therapy. counsel/educate patient on a) take on empty stomach if possible, ok with food if cannot tolerate, b) separate iron dose from antacid dose, c) iron can cause constipation and darken stool color, d) keep iron out of reach from children - toxic.
    2. Make appt with pcp for probable ulcer/GI bleed. d/c ibuprofen, take acetaminophen 500mg po q 4-6 h prn p, NTE 4000mg/24 h (counsel on liver toxicity)

    Researchers who are accustomed to working with sociodemographic, economic, or other "social determinants" typically obtained via a "baseline" survey in an epidemiological or clinical study may find the subjective portion of the SOAP note especially useful. The subjective portion may be further structured as past medical history (conditions, injuries, medications, surgeries, etc.), family medical history (diseases, health factors, genetics, longevity, etc.), and social history (personal and behavioral factors, such as diet, exercise, smoking, drug and alcohol use, etc.). This is also where researchers will find self-reported symptomology at the outset of a current illness as well as outcomes of previous medical treatments. There are several challenges with these data. First, documentation in this section of the SOAP note is more likely to be free text than a discrete, coded entry in the EHR, discussed in detail in chapters 3 and 12. Second, as the 'S' in SOAP implies, these are self-reported data and their accuracy will need to be validated, discussed further in chapters 6 and 9. The figure below depicts the spectrum of subjective (i.e., patient-provided) versus objective (i.e., provider-ascertained) data that may be captured in the EHR. When a patient is under care with a provider, the data tend to be the most objective, while anything ascertained before or after a given encounter is largely going to be self-reported.
  • Ch. 2: Terminology and coding standards. In the U.S., the Centers for Medicare & Medicaid Services is the largest payor of healthcare services: "Nearly 90 million Americans rely on health care benefits through Medicare, Medicaid, and the State Children's Health Insurance Program (SCHIP)." As such, CMS has a large influence over documentation and billing practices in the EHR that relate to reimbursement, such as the use of DRG (inpatient) and APC (outpatient) codes.

    Evaluation and management (E&M) codes, based on CPT codes, are used for professional service reimbursement, such as to bill for a new patient visit or hospitalization, an established patient visit, or a consultation, and take into account the complexity of the encounter. There are four components of E&M codes: 1) patient type whether new or established, 2) patient history, 3) examination findings, and 4) extent of medical decision making. E&M codes are distinct from diagnostic codes but may be billed in tandem with ICD, DRG, or APC codes. For health services researchers, important details on patient encounters may be embedded in the E&M codes but not captured elsewhere in the medical record, which again demonstrates the complexity of operationalizing clinical phenotypes from EHRs.

    Another way to conceptualize the various terminologies used in medicine is by dividing them into codes used for documenting the clinical encounter and order entry and codes used for reimbursement and reporting, although there is some overlap between the two. Codes to document the clinical encounter and order entry tend to be more granular and invisible to the end user. These vocabularies/taxonomies include SNOMED-CT, MEDCIN, LOINC, and an array of medication terminologies including NDC (FDA), ATC (WHO), VA class (VA), and RxNorm (NLM). Codes to document reimbursement and reporting tend to be more general and visible to the end user, which is why many providers can quote specific ICD codes. These include ICD, CPT (especially E&M codes), DRG/APC, and HCPCS (of which CPT is a component). Many EHRs have nomenclature/vocabulary crosswalks that map internal clinical concepts to ICD codes for billing purposes, and the external UMLS/I-MAGIC systems help with mapping between a variety of these vocabularies.
  • Ch. 2: Understanding ICD codes. There are numerous resources available for EHR researchers to find and understand ICD codes; this section provides a high-level overview for ICD-10-CM. ICD-10-CM codes will always be between three and seven alphanumeric characters in length. The first character is always a letter, the second character is always a number, and characters three through seven can be either letters or numbers. In codes longer than three characters, a decimal point always follows the third character, and the more characters in the code, the more specific it is. Coders are required to use the most specific code that is applicable. Broadly, ICD codes are categorized by chapter, grouping diagnoses and clinical findings by subject such as organ system, pathologic process, and other causes of illness and injury. For example, codes C00-D49 relate to neoplasms, codes J00-J99 relate to diseases of the respiratory system, and codes O00-O9A relate to pregnancy and childbirth. Characters four through six define the site, etiology, and manifestation or state of the disease or condition. For example, while C15 is a malignant neoplasm of the esophagus, C15.3 stipulates that the neoplasm was in the upper third of the esophagus. Some codes allow for a seventh character extension that defines aspects of the clinical encounter, such as initial encounter, subsequent encounter, or disease sequela. Researchers operationalizing a clinical phenotype using ICD codes may not always need the detailed specificity inherent in these codes and thus may consider matching only on the first three or four characters in the ICD code. ICD-11, which came into effect worldwide in 2022, substantially changes ICD-10-CM codes, but as of this writing has not been adopted in the U.S. As discussed elsewhere in the book, research projects spanning multiple years may need to reconcile changing standards. Further, regardless of the vocabulary used, matching on ICD codes introduces the possibility of misclassification, discussed further in chapters 6 and 9.
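
    The structure described above can be checked and truncated programmatically. The following is a minimal Python sketch, not code from the book: the function names are hypothetical, and the pattern only tests the shape of a code (letter, digit, alphanumeric, optional decimal and up to four more characters). It does not confirm that a code exists in the official ICD-10-CM code set.

```python
import re

# Shape of an ICD-10-CM code as described in the text: a letter, a digit,
# a third alphanumeric character, then optionally a decimal point followed
# by one to four further alphanumeric characters (seven characters at most).
ICD10CM_PATTERN = re.compile(r"^[A-Z][0-9][A-Z0-9](\.[A-Z0-9]{1,4})?$")


def is_wellformed_icd10cm(code):
    """Return True if the code has the ICD-10-CM shape (not a validity check)."""
    return bool(ICD10CM_PATTERN.match(code.strip().upper()))


def truncate_icd10cm(code, n_chars=3):
    """Keep the first n_chars characters of a code, not counting the decimal."""
    chars = code.strip().upper().replace(".", "")
    kept = chars[:n_chars]
    # Put the decimal back after the third character when more than three remain.
    return kept if len(kept) <= 3 else kept[:3] + "." + kept[3:]


def matches_phenotype(code, prefixes, n_chars=3):
    """Check whether a well-formed code falls under any truncated phenotype code."""
    return is_wellformed_icd10cm(code) and truncate_icd10cm(code, n_chars) in prefixes


# Example: match any esophageal malignancy (C15) regardless of site detail.
print(matches_phenotype("C15.3", {"C15"}))
```

    Matching on three characters trades specificity for coverage; setting n_chars to 4 keeps the first subcategory character when the phenotype requires it.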
  • Ch. 2: EHR concepts. The EHR exists as both a technology and a tool. By technology, we mean that it is a generalized platform for managing the health of a patient and for providers to be reimbursed for the care provided. The EHR aggregates and displays data from multiple areas including people (e.g., healthcare workers and patients), places (e.g., clinical departments such as radiology and pharmacy), and things (e.g., medical devices or wearables) (see Figure below).

    As a tool, the EHR encompasses a host of applications that are embedded within the system. These core modules include scheduling, patient registration (admission, discharge, transfer), clinical documentation and charting, medical billing and charging, clinical decision support systems (CDSS) and computerized physician order entry (CPOE), electronic medicine administration record (i.e., barcode scanning of medications at the time of administration), communication, and ancillary functions such as interfacing (w/ other systems), reporting, and administration. Truly, these are complicated systems, as exemplified by this simplified view of the Veterans Health Administration VistA architecture (Source:

    To expand on two of the more important modules in the EHR, namely CDSS and CPOE, the goals of these tools are to reduce errors, reduce costs, and limit practice variation under the evidence-based medicine paradigm. This is achieved by improving the structure and legibility of medical documentation, identifying pertinent medical details, and providing real-time feedback to providers. Such feedback may incorporate checks for patient or drug allergies, drug interactions, extreme doses, and potential drug-lab problems. While such alerts are the most famous - or perhaps infamous depending on your perspective - aspect of CDSS and CPOE, these technologies also enable computer-assisted interpretation of labs and imaging, efficient consultation and messaging with specialists, critiquing of data entry to ensure proper data are captured, and automated teaching modes for trainees.
  • Ch. 2: Distinction between inpatient and outpatient EHR data. As a general rule, outpatient notes tend to be written mostly by physicians, whereas inpatient notes, due to the team approach and complexity of inpatient medicine, tend to be written mostly by nurses, techs, and allied health professionals, with comparatively fewer physician notes. The volume of data in the inpatient EHR far exceeds that of the outpatient EHR, although as a reminder, the inpatient data present a brief snapshot of a patient's life compared to the longitudinal outpatient record.

Section I, EHR Data for Research: Chapters 3-6.

  • Ch. 3: Responsible conduct of EHR research. Many organizations subscribe to the Collaborative Institutional Training Initiative (CITI Program) to provide training content on responsible conduct of research. Such training must be completed prior to undertaking human subjects research, even if the research uses IRB-exempt de-identified EHR data. A typical course of study for a new researcher would likely include: 1) an introduction to the Belmont report, history, and ethics of research, 2) an overview of risks associated with secondary data research, 3) consideration of special or at-risk populations, 4) HIPAA and privacy considerations, and 5) identifying conflicts of interest.
  • Ch. 3: Data sharing models. An emerging technology, blockchain, may allow for secure and private sharing of EHR data. An overview of this technology and its application to healthcare and EHRs may be found in McGhin et al. (2019) and Han, Zhang, and Vermund (2022). What makes this technology especially appealing and relevant to EHR researchers is the potential to retrieve data across disparate EHRs. This is similar, in promise, to what health information exchanges (HIEs, discussed further in chapter 4) set out to do, at least from a clinical view. Yet blockchain sharing of EHR data can be considered the opposite of the traditional HIE models. Whereas HIEs are centralized networks of information sharing because the HIE acts as the single authority, blockchains are a completely decentralized network for peer-to-peer exchange of health data (see Figure below for comparison). Nevertheless, realizing this promise will require 1) healthcare systems to opt into the blockchain model, and 2) specific provisions for secondary uses of de-identified data for research purposes. This will take great political and economic capital.

    Blockchain technology is in fact already in use in healthcare. Schmeelk et al. (2022) reviewed the published literature for adoption of blockchain technology into EHRs and identified five relevant use cases. Additional use cases may be found in McGhin et al. (2019). To expand on one use case, ACTION-EHR is a blockchain framework for sharing radiation oncology data across disparate EHRs. There are both provider and patient interfaces to the system, and any collaborating entities must install the relevant blockchain technology. This is one of several such frameworks proposed in the literature, e.g., see also Abunadi and Kumar (2021),
  • Ch. 4: Distinction between claims data and EHR data. Several other important distinctions between claims data and EHR data are the quantity and quality of variables for the researcher. In general, claims data are more limited in terms of the number of variables captured compared to the EHR. For example, while claims data may capture whether a patient fulfilled a medication order, it will not indicate whether the patient has taken the medication, which may be captured in EHR data should a clinician inquire during a follow-up visit. On the other hand, the quality of claims data for medications may be superior to EHR data, given the scrutiny that claims data receive from insurers. Yet disease and diagnosis data are likely better in the EHR as medication reimbursement depends less on an accurate diagnosis, especially in the outpatient setting. Below is an updated Table 4.1 to reflect these additions. Further detail on claims (also known as encounter or administrative) databases may be found in Ch. 12 in the 6th ed. of Pharmacoepidemiology (Gerhard et al., 2020), including a table contrasting characteristics of common claims databases in the U.S. (see Table 12.1 in Gerhard et al.). An overview of commonly used population EHR databases may be found in Chs. 13 and 14 in Pharmacoepidemiology.

    One family of claims databases, the Agency for Healthcare Research and Quality's Healthcare Cost and Utilization Project (HCUP), deserves particular mention as the largest collection of longitudinal hospital care data in the U.S. These databases contain encounter-level information for all payers nationwide and participating states from 1988 including inpatients (adult and children), emergency departments, and outpatient surgery and services. With the availability of multistate EHR databases, researchers have explored the overlap between claims databases and EHR databases. One such study reported similar demographic and clinical representation, with the exception of psychiatric/behavioral and obstetrics/gynecology diagnoses, which were reported less frequently in the EHR data. Other researchers have validated linked claims-EHR prescribing data and observed marked similarities. Importantly, neither of these sources should be viewed as the "gold standard" given that these systems exist for different purposes, neither of which is research.
  • Ch. 4: Health information exchanges. In addition to traditional government-funded HIEs that focus on specific jurisdictions, there are also nonprofit HIEs that seek to connect healthcare practices and systems in a jurisdiction-agnostic approach. For example, the CommonWell Health Alliance is a network of over 30,000 clinical sites across all 50 states covering approximately 200 million patients. The U.S. Department of Health and Human Services is also supporting the goal of nationwide health data access through qualified health information networks. In short, this initiative seeks to define the technical infrastructure and common information exchange principles to interconnect HIEs. Encouragingly, one of these principles (Principle #7) specifically encourages the use of HIEs for supporting population health research.
  • Ch. 4: Manual chart review. For those undertaking a manual chart review to derive a research database, Vassar and Holzmann (2013) assembled a list of ten best practices to follow to avoid mistakes in the retrospective chart review. (Source:
  • Ch. 4: Multi-institution EHR studies and Ch. 11: Correlated observations. Each healthcare system has a catchment process that is multifactorial (see catchment definition below). A consequence of this is that patients within a given EHR may be more likely to be similar to each other than patients between multiple EHRs. This introduces a modeling complexity since the errors in an ordinary regression would be correlated. Thus, the analytic modeling strategy should test this hypothesis, and if found to be true, employ a multilevel hierarchical model with patients clustered within their respective EHRs (i.e., healthcare systems).
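
    One simple way to examine whether patients cluster within healthcare systems is the intraclass correlation coefficient (ICC). The following Python sketch is an illustration under stated assumptions rather than the book's method: it assumes a continuous outcome, uses the one-way ANOVA estimator with an adjustment for unequal group sizes, and the function name and data layout are hypothetical. In practice the ICC would more commonly be read from a fitted multilevel model.

```python
def anova_icc(groups):
    """Estimate the intraclass correlation from a one-way ANOVA.

    groups maps a healthcare system label to a list of numeric outcomes
    for the patients abstracted from that system's EHR.
    """
    sizes = [len(values) for values in groups.values()]
    k = len(sizes)
    n_total = sum(sizes)
    if k < 2 or min(sizes) < 1 or n_total <= k:
        raise ValueError("need two or more non-empty groups with replication")

    grand_mean = sum(sum(values) for values in groups.values()) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((y - group_mean) ** 2 for y in values)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective group size when the systems contribute unequal numbers of patients.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)

    denominator = ms_between + (n0 - 1) * ms_within
    if denominator == 0:
        return 0.0  # no variation at all in the outcome
    return (ms_between - ms_within) / denominator


# Hypothetical systolic blood pressure values from three healthcare systems.
example = {
    "system_a": [128.0, 131.0, 126.0],
    "system_b": [142.0, 139.0, 145.0, 141.0],
    "system_c": [134.0, 133.0],
}
print(round(anova_icc(example), 3))
```

    An ICC near zero suggests little clustering by healthcare system, whereas larger values support a multilevel model with patients nested within systems. This estimator can return negative values in small samples, which are usually interpreted as no detectable clustering.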
  • Ch. 4: Accessing non-text EHR data. As more non-text data make their way into the EHR, epidemiologists may be interested in operationalizing these data in their research datasets. Examples of such data include medical images (e.g., radiographs), physiological signals and patient telemetry (e.g., ECG, EEG, EMG), wearable devices, anatomical and annotated drawings (esp. dermatology and ophthalmology), audio, video, scanned reports, and so on. Consider the challenges in operationalizing measures of cardiac function from the preoperative cardiac catheterization figure below (Source:

    There are technological challenges with creating textual representations of these data. Solutions do exist, such as optical character recognition for scanned reports and artificial intelligence algorithms for creating meaning out of images, but these tools must be validated before widespread use. Without these tools, researchers may be left to rely on the meta-data that accompany these images. For example, the Digital Imaging and Communications in Medicine (DICOM) standard defines the digital storage and structure of many types of medical images. Embedded in the DICOM file are meta-data that describe the patient, image, and, in some cases, clinical findings. To obtain these data, researchers would need to interface with the Picture Archiving and Communication System (PACS), query the patient, exam, and corresponding medical image(s), and use a DICOM reader to extract the pertinent meta-data.
  • Ch. 4: Clinical Data Interchange Standards Consortium. Another relevant information exchange standard that EHR researchers may wish to be familiar with is the Clinical Data Interchange Standards Consortium (CDISC). Although this standard is largely specific to regulatory studies, as these types of studies interface with the EHR, data extracts may be created that conform to this standard, potentially useful to researchers.
  • Ch. 4: Problem list. Introduced in chapter 2 was the concept of a problem list that appears at the patient level in the EHR. In fact, a persistent problem list that spans encounters/visits is mandated by The Joint Commission. Researchers need to be aware of the difference between the problem list and diagnoses when abstracting data from the EHR. Namely, the problem list is populated based on diagnoses, and therefore follows them temporally, and not all diagnoses will need to be captured as ongoing problems. When operationalizing a clinical phenotype, both sources may be consulted.
  • Ch. 5: Data linkage. Data linkage needs seem to arise more in the inpatient setting due to the number of electronic systems in a hospital whereas outpatient medicine tends to rely on a single electronic system.
  • Ch. 5: Data linkage: unique identifier. In addition to medical record numbers and financial identifiers used to link patients and encounters, requisition numbers and accession numbers are unique identifiers that link patients to laboratory or imaging orders.
  • Ch. 6: Missing data on confounders and Ch. 10: Multiple imputation. Confounder data, e.g., comorbidity and symptomology, may be commonly missing in EHR data. Occasionally these data may be present in unstructured encounter notes in the patient's record but are infrequently abstracted due to difficulty in parsing free text (discussed further in chapter 12). When confounder data are missing, even in extensive missingness situations, multiple imputation and propensity score calibration can recover missing data with minimal bias. Propensity score calibration is computationally more efficient. For further details, see Vader et al.
  • Ch. 6: Missing data definitions. Missing data defined as MCAR, MAR, or MNAR may not be the most intuitive way to think about missingness in the EHR. Rather, alternate definitions of missing data have been proposed in this context, namely unmeasured, clearly missing, or missing assumed negative. An unmeasured data point is one that the EHR was never designed to capture, such as firearm ownership, and this likely affects many traditional epidemiologic determinants, as brought up in chapter 3. A clearly missing data point is one that the EHR was designed to capture but whose value is not present, for example, a patient declining to identify race or ethnicity on their intake questionnaire. The most pernicious of these is the missing assumed negative data point. In this case, the EHR is only capturing the presence of a health condition, not the absence; however, the researcher assumes that lack of documentation indicates lack of a condition. This fallacy is brought up several times in the book and requires the researcher to reflect upon why certain data points are or are not captured, and how to handle them: whether to discard the variable altogether or to employ a sensitivity analysis in case missing assumed negative is an incorrect assumption. Ideally, we would have access to the clinicians capturing these data for clarification, but for previously abstracted data, this may not be possible. To make this more concrete, consider the following figure. Each step of documenting a patient's self-reported high cholesterol is predicated on a decision point: 1) the EHR must have a space to record this information, 2) a provider must ask about high cholesterol and 3) document it, and 4) the data must be abstracted for research purposes. If there is no documentation of high cholesterol, then we have assumed it is a negative finding, perhaps incorrectly. In short, there are many more steps for a positive finding to wind up in the EHR versus a negative finding, which perhaps suggests why diagnostic codes in the EHR have higher specificity than sensitivity (discussed elsewhere in chapter 6).
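
    The chain of decision points can be made concrete with simple arithmetic. The Python sketch below is illustrative only: the step probabilities are made-up values, not estimates from any study, and the steps are assumed to be independent of one another. Under those assumptions, the chance that a true positive reaches the research dataset is the product of the per-step probabilities.

```python
def probability_documented(step_probabilities):
    """Multiply the probabilities that a true positive survives each step."""
    result = 1.0
    for probability in step_probabilities:
        if not 0.0 <= probability <= 1.0:
            raise ValueError("each step probability must lie between 0 and 1")
        result *= probability
    return result


# Hypothetical probabilities for each decision point (assumed independent).
steps = {
    "EHR has a field for the condition": 1.00,
    "provider asks about the condition": 0.80,
    "provider documents the answer": 0.90,
    "value is abstracted for research": 0.95,
}

sensitivity = probability_documented(steps.values())
print(f"Probability a true positive is captured: {sensitivity:.3f}")
```

    With these made-up values only about 68% of true positives would be captured, even though each individual step looks reliable. A true negative, by contrast, is recorded as negative whenever nothing is documented, which is one way to see why specificity tends to exceed sensitivity.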
  • Ch. 6: Catchment definitions and modeling. An older albeit still relevant definition of catchment identified five categories of patient selection forces: availability, accessibility, affordability, accommodation, and acceptability. In addition to the distance radii and road and transportation network models of measuring catchment, there are numerous other approaches to defining spatial accessibility of healthcare, summarized by Guagliardo (2004). One of the more popular approaches that focuses on the number of healthcare providers in a given area is known as the two-step floating catchment area. This approach first defines a provider-to-population ratio based on a 30-minute drive time to the centroid of a geographical area, such as a ZIP code. This calculation is repeated for all ZIP codes in a given area; as such, there may be overlap of some smaller ZIP codes while larger ZIP codes may have some unserved areas. The second step then sums the ratio measures for a given point, such as the location of the healthcare providers, which forms the measure of spatial accessibility, where the greater the number, the more providers there are relative to the population. Provider data may be freely downloaded in the U.S. via the Centers for Medicare & Medicaid Services' National Plan and Provider Enumeration System: almost all providers in the U.S. are assigned a National Provider Identifier (NPI), a HIPAA-mandated identifier for transactions involving PHI. There have been numerous extensions to the two-step floating catchment model, as enumerated by McGrail (2012), as well as approaches dealing with temporal changes to catchment.
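
    The two-step calculation can be sketched in a few lines of Python. This sketch follows the formulation commonly given in the spatial accessibility literature (a ratio computed around each provider location, then summed for each population location), which may differ in detail from the summary above. The function name, the data layout, and the travel times are hypothetical, and drive times are assumed to have been computed beforehand.

```python
def two_step_fca(providers, populations, travel_time, threshold=30.0):
    """Two-step floating catchment area accessibility scores.

    providers:   dict of provider location id -> number of providers there
    populations: dict of area id (e.g., ZIP code centroid) -> population
    travel_time: dict of (area id, provider id) -> drive time in minutes
    threshold:   catchment size in minutes
    """
    # Step 1: provider-to-population ratio within each provider's catchment.
    ratios = {}
    for provider_id, supply in providers.items():
        reachable_population = sum(
            count
            for area_id, count in populations.items()
            if travel_time[(area_id, provider_id)] <= threshold
        )
        ratios[provider_id] = supply / reachable_population if reachable_population else 0.0

    # Step 2: for each area, sum the ratios of all providers within reach.
    return {
        area_id: sum(
            ratio
            for provider_id, ratio in ratios.items()
            if travel_time[(area_id, provider_id)] <= threshold
        )
        for area_id in populations
    }


# Hypothetical example: two provider sites and two ZIP code centroids.
providers = {"P1": 2, "P2": 1}
populations = {"A": 1000, "B": 3000}
travel_time = {("A", "P1"): 10, ("B", "P1"): 20, ("A", "P2"): 45, ("B", "P2"): 15}
print(two_step_fca(providers, populations, travel_time))
```

    In this example area B can reach both provider sites within 30 minutes and so receives a higher accessibility score than area A, which can reach only one.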
  • Ch. 6: Data accuracy in inpatient medicine. It is also worth noting that in many studies inpatient diagnoses are more reliable than outpatient diagnoses (Segal & Powe, 2004, Lynch et al., 2021, Garza et al., 2021, Kern et al., 2015). This may be related to the importance of accurate codes for reimbursement in the inpatient setting. When using outpatient codes to operationalize a phenotype, investigators may consider requiring multiple (>1) occurrences of a particular code to define a positive finding.
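
    Requiring multiple occurrences of a code is straightforward to implement. The Python sketch below is one possible approach, not a prescribed algorithm: the record layout and function name are hypothetical, and counting distinct encounter dates (rather than raw rows) is an assumption intended to keep a code repeated within one visit from counting twice.

```python
from collections import defaultdict


def patients_meeting_phenotype(records, code_prefix, min_occurrences=2):
    """Return patient ids with the code on at least min_occurrences distinct dates.

    records: iterable of (patient_id, encounter_date, icd_code) tuples
    """
    dates_by_patient = defaultdict(set)
    for patient_id, encounter_date, icd_code in records:
        if icd_code.upper().startswith(code_prefix.upper()):
            dates_by_patient[patient_id].add(encounter_date)
    return {
        patient_id
        for patient_id, dates in dates_by_patient.items()
        if len(dates) >= min_occurrences
    }


# Hypothetical outpatient records using I10 (essential hypertension).
records = [
    ("p1", "2021-01-05", "I10"),
    ("p1", "2021-03-12", "I10"),
    ("p2", "2021-02-01", "I10"),
    ("p2", "2021-02-01", "I10"),  # same visit, coded twice
    ("p3", "2021-04-20", "E11.9"),
]
print(patients_meeting_phenotype(records, "I10"))
```

    Raising min_occurrences favors specificity at the cost of sensitivity, so the threshold is best chosen alongside a validation exercise such as those described in chapter 6.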
  • Ch. 6: Data accuracy in outpatient medicine. A study involving patient review of ambulatory care notes found that 20% of respondents reported an error in the EHR, with over 50% labeling these errors as "serious" or "very serious." Among the very serious errors, patients most commonly reported a mistake in past or current diagnoses (28%), followed by medical history (24%), medications or allergies (14%), and tests or procedures (8%). Several patients even suggested that the notes were ascribed to the wrong patient (7%). Errors tended to be reported more commonly by older and sicker individuals. The 20% overall error rate observed in this study has been corroborated by other surveys, which also note that a majority of self-reported errors in the EHR are found in outpatient practices. Studies such as these prompted Bell et al. (2022) to develop a framework for categorizing patient-reported breakdowns in the diagnosis process during ambulatory care. This framework can be used by EHR researchers as an aid to conceptualize data in the EHR and their potential impact on study validity, especially among those working with outpatient EHRs. As part of this framework, Bell et al. described the mediating role of patient engagement between the healthcare provider and improved EHR data quality, represented in their figure below (Source:
  • Ch. 6: Data accuracy: status of code. As mentioned in chapter 6, false positive diagnoses in the abstracted data are possible if we fail to account for "rule out" codes and the condition is subsequently not diagnosed. There are other possible code modifiers in the EHR that researchers should be aware of. For example, diagnostic codes may have a status of "possible," "suspected," or "probable" and laboratory orders may carry a status of "pending," "preliminary," "final," or "corrected." The decision how to handle these is a trade off between sensitivity and specificity of a given diagnosis or measurement in the EHR.
  • Ch. 6: Duplicate observations. Chapter 5 discussed how duplicate observations can be an artifact of the data abstraction and management process. Duplicate observations also exist in the EHR itself. This can occur via the patient registration process whereby a patient identifier (e.g., name, date of birth, MRN) fails to map to an existing record, for example, due to a typo. Some have estimated that over 20% of records may be duplicates in certain hospitals or healthcare systems. Patient reconciliation and de-duplication is a core function of EHRs. Aside from the potential for patient harm, failure to identify duplicate records leaves researchers with an artificial correlation among observations, which must be handled statistically lest the standard errors be biased.
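
    Records duplicated through a typographical error can sometimes be flagged with approximate string matching. The Python sketch below is a simplified illustration, not how any particular EHR reconciles patients: the field names, the similarity threshold of 0.85, and the rule of comparing names only among records sharing a date of birth are all assumptions. It uses the standard library's difflib.SequenceMatcher, and flagged pairs would still need manual review.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(first_name, second_name):
    """Similarity between two names on a 0 to 1 scale, ignoring case."""
    return SequenceMatcher(
        None, first_name.lower().strip(), second_name.lower().strip()
    ).ratio()


def candidate_duplicates(patients, threshold=0.85):
    """Flag pairs of records with the same date of birth and similar names.

    patients: list of dicts with hypothetical keys 'mrn', 'name', and 'dob'
    """
    pairs = []
    for first, second in combinations(patients, 2):
        if first["dob"] != second["dob"]:
            continue
        if name_similarity(first["name"], second["name"]) >= threshold:
            pairs.append((first["mrn"], second["mrn"]))
    return pairs


# Hypothetical registration records; the second contains a transposition typo.
patients = [
    {"mrn": "001", "name": "Jane Doe", "dob": "1958-03-02"},
    {"mrn": "002", "name": "Jane Deo", "dob": "1958-03-02"},
    {"mrn": "003", "name": "John Smith", "dob": "1958-03-02"},
    {"mrn": "004", "name": "Jane Doe", "dob": "1960-01-01"},
]
print(candidate_duplicates(patients))
```

    Comparing every pair of records scales poorly, so with a large patient roster the comparisons would typically be restricted to blocks of records that already share some attribute, as is done here with date of birth.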
  • Ch. 6: The medical narrative. In 1991, Kathryn Montgomery Hunter published what is arguably the most important ethnographic study of how clinicians approach the art of medicine, interpret patient stories, and create the corresponding medical narrative. Although this study was conducted prior to widespread use of EHRs, the findings reveal the many nuances and subtleties of the clinical documentation process. The book contains numerous eloquent passages that are applicable to the modern day EHR researcher. Hunter writes "every patient is the object of a highly abbreviated and written-to-the-moment entry in an office or hospital chart" (p83). The EHR is a snapshot of medical care delivered to a specific patient at a specific time, and no more. It is a brief window of time into the lifecourse of this patient. In Hunter's words, the chart is a "record of each patients medical course, observed in one or more office visits or, in the hospital, from entry to discharge" (p84). The chart is a "minimalist account" (p91), "the chronicle of an individual's physical condition while under medical care [...] governed by the determination of a diagnosis and the selection of a treatment" (p87). As has been argued in this book, the EHR is not a substitute for a survey of epidemiological relevant factors and determinants of etiology: those must be obtained via data linkage or prospective study. As Hunter explores deeper meaning of the medical narrative, we can appreciate the heterogeneity of the type and quality of documentation in the patient's chart. For example, there are differences in the documentation between outpatient and inpatient medicine. Hunter observes that in the outpatient setting "the chart is a collection of private notes" (p84, emphasis added) whereas in an inpatient setting "it has at least a small audience of those who are its several authors." 
In other words, we tend to see more detail in the inpatient setting to avoid miscommunication about higher-acuity patients. When multiple clinicians are involved in care, the chart becomes more objective and less subjective (p87). However, this is not to say that inpatient notes are uniformly superior to outpatient notes, as more entries in the longitudinal record, with correspondingly greater detail, may also create more uncertainty due to potentially conflicting information or test results, resulting in a "cascade of uncertainty" (p88). When such discrepancies arise, Hunter suggests that notes recorded by consulting specialists are likely the most authoritative, followed by attending physicians, and then residents and trainees (p89). In general, we can also observe in the EHR that more complicated, sicker, or "interesting" patients have more detail in their charts (p91-92). This is related to the notion of informed presence bias discussed in chapter 9. This heterogeneity in narrative detail can also be driven by clinician experience, as Hunter notes that detail is inversely proportional to years in the profession: more junior clinicians document more detail. However, as a whole, "concise notes are more highly valued than long ones" (p85), which has implications for the ability to parse the medical narrative and infer or impute epidemiological determinants that are not discretely coded. This is especially true under the EHR paradigm where traditional narratives are disincentivized (by payors) in favor of coded and discrete documentation using standardized templates. The transition to EHRs and its impact on the clinical encounter is further discussed in a review by Hedian, Greene, and Niessen (2018).
  • Ch. 6: Upcoding and downcoding. The practice of upcoding, or overbilling for diagnoses and services not provided, was introduced in chapter 6. In reality, downcoding, or underbilling, may be more common in the EHR as a way to avoid potential fines.
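
To make the duplicate observations note above more concrete, the following is a minimal sketch of how likely duplicate patient records might be flagged for manual review. It is not the reconciliation logic of any particular EHR. The field names (mrn, name, dob), the example records, and the 0.85 similarity threshold are all assumptions chosen for illustration, and the matching rule (exact date of birth plus a similar name) is deliberately simple.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase and keep letters only, so spacing and punctuation are ignored."""
    return "".join(ch for ch in name.lower() if ch.isalpha())


def likely_duplicates(records, threshold=0.85):
    """Return (mrn_a, mrn_b, score) for record pairs that share a date of
    birth and whose normalized names are similar at or above the threshold."""
    pairs = []
    for a, b in combinations(records, 2):
        if a["dob"] != b["dob"]:
            continue  # require an exact date-of-birth match in this sketch
        score = SequenceMatcher(
            None, normalize(a["name"]), normalize(b["name"])
        ).ratio()
        if score >= threshold:
            pairs.append((a["mrn"], b["mrn"], round(score, 2)))
    return pairs


# Hypothetical records: the first two differ by a single typo in the surname.
records = [
    {"mrn": "1001", "name": "Jane Smith", "dob": "1980-02-14"},
    {"mrn": "2047", "name": "Jane Smyth", "dob": "1980-02-14"},
    {"mrn": "3310", "name": "John Doe", "dob": "1975-07-01"},
]

duplicate_pairs = likely_duplicates(records)
print(duplicate_pairs)
```

Pairs flagged this way are candidates only. In practice a researcher would review them, and either collapse confirmed duplicates or account for the resulting clustering in the analysis.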

Section II, Epidemiology and Data Analysis: Chapters 7-12.

  • Ch. 7: Case-control study design. Despite its elegance and efficiency, the case-control study design is not without controversy in EHR research. Schuemie et al. (2019) argue against using case-control studies in retrospective databases, such as the EHR, due to concerns about the appropriateness of comparison groups, the timing of when covariates are captured relative to exposure (thereby inappropriately adjusting for possible mediators), and lower statistical power. Instead, the authors argue that retrospective cohort studies may be more appropriate when data on exposure and covariates are present in the EHR. When a case-control study is undertaken, negative controls (a condition unrelated to the one under study) may provide a sense of possible bias. Alternatively, one can consider a nested case-control study design/quasi-cohort approach or the incident user study design as a way forward. Importantly, we should note that 1) there is nothing inherently flawed in the case-control study design, and 2) conceiving of the EHR as an open cohort allows for flexibility in the hybrid case-control designs since we sample directly from the study base.
  • Ch. 7: Modeling sample size. A common use of EHR data in a clinical setting is to build a model that may predict clinical outcomes, such as mortality or hospital readmission. In general, these models should be constructed with as large a sample size as is reasonable, but at a minimum there should be 10 events per predictor. For example, a model with 8 predictors would require at least 80 outcome events. More guidance on calculating sample size for predictive modeling using EHR data may be found in Riley et al. (2020).
  • Ch. 9: Confounding by indication. Confounding by indication and protopathic bias can be more readily appreciated through the use of causal diagrams (see the figure below).

    Confounding by indication appears as a traditional confounder, for example, on the basis of the unmeasured covariate patient prognosis. On the other hand, protopathic bias appears more closely aligned to the issue of reverse causality in that the disease itself - more specifically a preclinical or subclinical state - influences the likelihood of treatment which then may subsequently impact the outcome.
  • Ch. 10: Describing the selection process. In addition to a traditional "table one" that describes the study population, researchers may also consider including a "table zero" that details the underlying clinical database from which the EHR data were derived, including the selection process. This not only aids transparency and openness in science but also allows an evaluation of potential sampling or selection biases. Regardless, a well-constructed "table one" will include details on the target of inference along with a comparison to the study sample or population.
  • Ch. 11: Negative controls. Negative controls have been used in an EHR setting to deal with the confounding and selection bias that arise from health-seeking behavior. As is well known and argued in chapter 6, individuals who seek healthcare are fundamentally different from those who do not. If we are conducting an EHR-based study with a target of inference beyond the bounds of the EHR, for example, evaluating the effectiveness of a vaccine, but we are limited to EHR data, we must methodologically deal with this potential threat to validity. Researchers have used negative controls, specifically two models - a negative exposure control and a negative outcome control - to remediate the bias. A negative exposure control seeks to identify an exposure that is not associated with the outcome but is associated with health-seeking behavior, and a negative outcome control seeks to identify an outcome that is not associated with the exposure but is also associated with health-seeking behavior. Then, a two-step regression is used to estimate the desired causal effect. For example, another vaccine may serve as a negative exposure control since its clinical effect will be limited to its antigens, while another disease may serve as a negative outcome control since its cause will be unrelated to the vaccine under study. However, both receipt of another vaccine and diagnosis of another disease relate to health-seeking behavior, thereby "balancing" the groups.
  • Ch. 12: Machine learning and modeling. For readers interested in an overview of deep learning and neural networks, this article describes their application to predictive modeling of health data, including the EHR.
  • Ch. 12: Machine learning, modeling, and confounding. An important part of causal ML modeling is confounder identification. Benasseur et al. (2022) reported successful results on the use of several ML algorithms to identify known confounders in complex EHR datasets, and when confounders were hidden (i.e., unknown to the algorithm), identify proxies to reduce residual confounding. For researchers interested in delving further into confounder identification and adjustment methodologies using EHR data, one can create "plasmode" simulations from empirical EHR data to compare variable selection strategies.
  • Ch. 12: Natural language processing. It is worth noting that there is a spectrum of NLP implementation. That is, one may not need to deploy a full solution that automatically discretizes the textual data. Rather, NLP can be used to retrieve "snippets" of text that a researcher manually reviews for the final determination. By analogy, NLP is more of a decision support tool than a diagnostic tool. The efficiency with which the investigator team classifies these snippets will depend on the quantity of data to review manually, in which case some error may be an acceptable trade-off for savings in time or resources.
  • Ch. 12: EMERSE: Electronic Medical Record Search Engine. EMERSE is another EHR-specific free text processing engine. EMERSE is a fully developed, open source, EHR-agnostic information retrieval engine that processes medical records to identify terms and concepts in free text. While the authors claim it is similar to NLP, one of its key features is its purported ease of use, which is suggested by the more than 500 publications that have used this tool. One possible workflow for the researcher would be to identify patients in the EHR comprising a study population (see chapter 7), retrieve relevant data for those patients (using one of the approaches described in chapter 4), and have EMERSE abstract terms and concepts from the free text in that data extract. Although the software is freely available, users will need to request access to the program.
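
The "snippet" end of the NLP spectrum described in the natural language processing note above can be sketched in a few lines. This is only an illustration of keyword-in-context retrieval for manual review. It is not how EMERSE or any other tool is implemented, and the function name, the search terms, the context window, and the example note are all hypothetical.

```python
import re


def find_snippets(note, terms, window=40):
    """Return (term, snippet) pairs: each whole-word match together with up
    to `window` characters of surrounding context on either side."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    snippets = []
    for match in pattern.finditer(note):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(note))
        snippets.append((match.group(1).lower(), note[start:end].strip()))
    return snippets


# A made-up free-text note, for illustration only.
note = (
    "Patient denies chest pain. Former smoker, quit in 2015. "
    "Reports occasional alcohol use."
)

snippets = find_snippets(note, ["smoker", "alcohol"], window=20)
for term, text in snippets:
    print(term, "->", text)
```

The researcher, not the code, makes the final determination: a snippet such as "Former smoker, quit in 2015" still needs a human to classify smoking status, which is the decision support role described above.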

Section III, Interpretation to Application: Chapters 13-15.

  • Ch. 14: Learning health system. Under the learning health system paradigm, technology, clinical practice, and research form a continuum to systematically refine and improve care delivery while reducing costs. EHRs are a key component of this. First, EHR-derived data are used to generate the evidence-base through analysis. Second, the EHR itself can be modified to help apply the evidence into practice. (Source:

    Post, Burningham, and Halwani (2022) provided a review of challenges and opportunities for using EHR data in the learning health system model. Although their review was specific to oncology, it is applicable to other clinical domains. Notable limitations identified in their review include lack of interoperability, the need for population-level (as opposed to individual-level) real-time reporting, poor data quality, and difficulty in obtaining and operationalizing research datasets from the EHR. This book has addressed many of these issues at length. The authors proceed to identify both clinical and research opportunities to use the EHR in the learning health system. These opportunities include real-time data flow from EHRs combined with the latest evidence-based guidelines for clinical decision making, interoperability between EHRs and third-party applications, increased use of machine learning to aid in prediction models and communication interfaces (e.g., speech recognition), improved clinical trial recruitment, and seamless reporting to external health registries and partners. This last point in fact motivated a research group from Wake Forest University (Winston-Salem, NC) to create a diabetes disease registry from their EHR to enable the learning health system model. In their article, the authors discuss the difficulty in operationalizing a diabetes phenotype from the EHR, creating research databases, and integrating with population health systems. In addressing these challenges, the authors present an approach to the creation of a registry from the EHR using open source tools and a common data model. This is but one example of using EHR data for the learning health system: the U.S. Agency for Healthcare Research and Quality maintains a list of case studies on their website that detail how several healthcare systems have approached this complex task.
