Neal D. Goldstein, PhD, MBI

About | Blog | Books | CV | Data | Lab

A Researcher's Guide to Using Electronic Health Records
From Planning to Presentation, 2nd edition

This is the official website for the textbook A Researcher's Guide to Using Electronic Health Records.

As webpages are meant to be, this is a living document and will be updated regularly with new content that I believe is useful to researchers working with EHR data. If you have suggestions for content, relevant articles, or other material in this area, please email me and I may share it here for others. Additionally and in the spirit of transparency and openness, I would like to point out several other books relevant to EHR researchers: Secondary Analysis of Electronic Health Records, Pharmacoepidemiology (esp. Part IIIb, Electronic Data Systems, in the 6th ed.), and Clinical Research Informatics.

For those interested in a more formal treatment of the material, I offer an introductory workshop on EHR-based epidemiology at the Society for Epidemiologic Research annual meeting, as well as a full graduate level course on EHR research at Drexel University.


Use the links below to download a PDF copy of the research planner, source codes in R, and the example dataset.


As errors are discovered in the text, they will be posted here.


This section is used to track updates or additions to the chapters of the book since last publication.

  • Ch. 2: Examples of SOAP notes. Typically, a SOAP note will be generated for each patient encounter with a clinician. Using a shortcut template, many clinicians will standardize their particular style of entry, however, while a SOAP note structures clinical documentation from a clinic standpoint, there is still heterogeneity in styles impacting the ability to automatically parse pertinent data. The two boxes below show equally well-written SOAP notes, but with varying degrees of machine interpretability.
  • Subjective
    Jane Doe is 66 y/o who attended her follow-up of her HTN. She feels well. She does not have dizziness, headache, or fatigue. Jane has no history other than hypertension. Her only medication is HCTZ at 25mg per day. Jane has lost 53bs in the past 3 months, following a low-fat diet and walking 10 minutes a day. She drinks two glasses of wine each evening. Jane uses no OTC medications such as cold remedies or herbal remedies.

    Generally, Jane appears well.
    Weight 155lbs, Height 55 inches, BMI ~30, Pulse 76 reg, BP 153/80.
    She has no lower extremity edema.

    Jane is here for a follow-up of her hypertension. It is not well-controlled since blood pressure is above the goal of 135/85. A possible trigger to her poor control of HTN may be her alcohol use or the presence of obesity.

    Continue a low-fat diet and exercise. Consider increasing walking time to 20-30 minutes to assist with weight loss. Discussed alcohol use and its relationship to HTN. Jane agrees to a trial of drinking wine only on weekend evenings. Check home BPs. Check potassium since she is taking a diuretic.
    Follow-up in the clinic in 1 month. Bring a blood pressure diary to that visit. Consider adding ACE inhibitor at the next visit if BP is still elevated.

    S: 42 yo woman presents with pale skin, weakness, dizziness, and epigastric pain. 2 weeks ago she experienced decreased exercise tolerance. She takes frequent doses of antacids and uses ibuprofen 200mg prn headaches. NKDA. She has children age 15, 12, and 1.

    O: T 38 C, RR 18, BP: sitting 118/75, standing 120/60, HR: Sitting 90, standing 110. Hb 8gm/dL, Hct 27%, platelets 300,000/mm3, retics 0.2%, MCV 75, serum iron 40mcg/dL, serum ferritin 9ng/ml, TIBC 450 mcg/dL, guaiac stools. Cheilosis at corners of mouth, and koilonychias at nail beds. PMH: peptic uler and preeclampisa with last pregnancy. Dx: iron-deficiency anemia.

    1. Fe: counsel parent on tolerance and side effects
    2. Discuss guaiac stool and ibuprofen, f/u with PCP on GI bleed, possible ulcer. d/c ibuprofen, use APAP prn HA.

    1. Iron sulfate 325mg TID x 6 months - f/u with PCP for retic count after 7 days of therapy. counsel/educate patient on a) take on empty stomach if possible, ok with food if cannot tolerate, b) separate iron dose from antacid dose, c) iron can cause constipation and darken stool color, d) keep iron out of reach from children - toxic.
    2. Make appt with pcp for probable ulcer/GI bleed. d/c ibuprofen, take acetaminophen 500mg po q 4-6 h prn p, NTE 4000mg/24 h (counsel on liver toxicity)

  • Ch. 2: Terminology and coding standards. In the U.S., the Centers for Medicare & Medicaid Services is the largest payor of healthcare services: "Nearly 90 million Americans rely on health care benefits through Medicare, Medicaid, and the State Children's Health Insurance Program (SCHIP)." As such, CMS has a large influence over documentation and billing practices in the EHR that relate to reimbursement, such as the use of DRG (inpatient) and APC (outpatient) codes. Evaluation and management (E&M) codes, a subgroup of CPT codes, are used for professional service reimbursement, such as to bill for a new patient visit or hospitalization, an established patient visit, or a consultation, and take into account the complexity of the encounter. E&M codes are distinct from diagnostic codes (but may be billed in tandem with ICD, DRG, or APC codes). For health services researchers, important details on patient encounters may be embedded in the E&M codes but not captured elsewhere in the medical record, which again demonstrates the complexity of operationalizing clinical phenotypes from EHRs.
  • Ch. 2: EHR concepts. The EHR exists as both a technology and a tool. By technology, we mean that it is a generalized platform for managing the health of a patient and for providers to be reimbursed for the care provided. The EHR aggregates and displays data from multiple areas including people (e.g., healthcare workers and patients), places (e.g., clinical departments such as radiology and pharmacy), and things (e.g., medical devices or wearables) (see Figure below).

    As a tool, the EHR encompasses a host of applications that are embedded within the system. These core modules include schedule, patient registration (admission, discharge, transfer), clinical documentation and charting, medical billing and charging, clinical decision support systems (CDSS) and computerized physician order entry (CPOE), electronic medicine administration record (i.e., barcode scanning of medications at the time of administration), communication, and ancillary functions such as interfacing (w/ other systems), reporting, and administration. Truly, these are complicated systems as exemplified from this simplified view of the Veterans Health Administration VistA architecture (Source:

    To expand on two of the more important modules in the EHR, namely CDSS and CPOE, the goals of these tools are to reduce errors, reduce costs, and limit practice variation under the evidence-based medicine paradigm. This is achieved by improving the structure and legibility of medical documentation, identification of pertinent medical details, and real-time feedback to providers. Such feedback may incorporate checks for patient or drug allergies, drug interactions, extreme doses, or flag potential drug-lab problems. While such alerts are the most famous - or perhaps infamous depending on your perspective - aspect of CDSS and CPOE, these technologies also enable computer assisted interpretation of labs and imaging, efficient consultation and messaging with specialists, critiquing of data entry to ensure proper data are captured, and automated teaching modes for trainees.
  • Ch. 3: Responsible conduct of EHR research. Many organizations subscribe to the Collaborative Institutional Training Initiative Program ( to provide training content on responsible conduct of research. Such training must be completed prior to undertaking human subjects research, even if IRB-exempt de-identified EHR data. A typical course of study for a new researcher would likely include: 1) an introduction to the Belmont report, history, and ethics of research, 2) an overview of risks associated with secondary data research, 3) consideration of special or at-risk populations, 4) HIPAA and privacy considerations, and 5) identifying conflicts of interest.
  • Ch. 3: Data sharing models. An emerging technology, known as blockchains, may allow for secure and private sharing of EHR data. An overview of this technology and its application to healthcare and EHRs may be found in McGhin et al. (2019) and Han, Zhang, and Vermund (2022). What makes this technology especially appealing and relevant to EHR researchers is the ability to potentially retrieve data across disparate EHRs. This is similar, in promise, to what health information exchanges (HIEs, discussed further in chapter 4) set out to do, at least from a clinical view. Yet blockchain sharing of EHR data can be considered the opposite of the traditional HIE models. Whereas HIEs are centralized networks of information sharing because the HIE acts as the single authority, blockchains are a completely decentralized network for peer-to-peer exchange of health data (see Figure below for comparison). Nevertheless for this to be realized will require 1) healthcare systems to opt into the blockchain model, and 2) specific provisions are included for secondary uses of de-identified data for research purposes. This will take great political and economic capital.

    Blockchain technology is in fact already in use in healthcare. Schmeelk et al. (2022) reviewed the published literature for adoption of blockchain technology into EHRs and identified five relevant use cases. Additional use cases may be found in McGhin et al. (2019). To expand on one use case, ACTION-EHR is a blockchain framework for sharing for radiation oncology data across disparate EHRs. There are both provider and patient interfaces to the system, and any collaborating entities must install the relevant blockchain technology. This is one of several such frameworks proposed in the literature, e.g., see also Abunadi and Kumar (2021),
  • Ch. 4: Distinction between claims data and EHR data. Several other important distinctions between claims data and EHR data are the quantity and quality of variables for the researcher. In general, claims data are more limited in terms of the number of variables captured compared to the EHR. For example, while claims data may capture whether a patient fulfilled a medication order, it will not indicate whether the patient has taken the medication, which may be captured in EHR data should a clinician inquire during a follow-up visit. On the other hand, the quality of claims data for medications may be superior to EHR data, given the scrutiny that claims data receive from insurers. Yet disease and diagnosis data are likely better in the EHR as medication reimbursement depends less on an accurate diagnosis, especially in the outpatient setting. Below is an updated Table 4.1 to reflect these additions. Further detail on claims (also known as encounter or administrative) databases may be found in Ch. 12 in the 6th ed. of Pharmacoepidemiology (Gerhard et al., 2020), including a table contrasting characteristics of common claims databases in the U.S. (see Table 12.1 in Gerhard et al.). An overview of commonly used population EHR databases may be found in Chs. 13 and 14 in Pharmacoepidemiology.
  • Ch. 4: Health information exchanges. In addition to traditional government-funded HIEs that focus on specific jurisdictions, there are also nonprofit HIEs that seek to connect healthcare practices and systems in a jurisdiction-agnostic approach. For example, the CommonWell Health Alliance is a network of over 30,000 clinical sites across all 50 states covering approximately 200 million patients. The U.S. Department of Health and Human Services is also supporting the goal of nationwide health data access through qualified health information networks. In short, this initiative seeks to define the technical infrastructure and common information exchange principles to interconnect HIEs. Encouragingly, one of these principles (Principle #7) specifically encourages the use of HIEs for supporting population health research.
  • Ch. 4: Manual chart review. For those undertaking a manual chart review to derive a research database, Vassar and Holzmann (2013) assembled a list of ten best practices to follow to avoid mistakes in the retrospective chart review. (Source:
  • Ch. 4: Multi-institution EHR studies and Ch. 11: Correlated observations. Each healthcare system has a catchment process that is multifactorial (see catchment definition below). A consequence of this is that patients within a given EHR may be more likely to be similar to each other than patients between multiple EHRs. This introduces a modeling complexity since the errors in an ordinary regression would be correlated. Thus, the analytic modeling strategy should test this hypothesis, and if found to be true, employ a multilevel hierarchical model with patients clustered within their respective EHRs (i.e., healthcare systems).
  • Ch. 6: Missing data on confounders and Ch. 10: Multiple imputation. Confounder data, e.g. comorbidity and symptomology, may be commonly missing in EHR data. Occasionally these data may be present in unstructured encounter notes in the patient's record but are infrequently abstracted do to difficulty in parsing free text (discussed further in chapter 12). When confounder data are missing, even in extensive missingness situations, multiple imputation and propensity score calibration can recover missing data with minimal bias. Propensity score calibration is computationally more efficient. For further details, see Vader et al.
  • Ch. 6: Missing data definitions. Missing data defined as MCAR, MAR, or MNAR may not be the most intuitive way to think about missingness in the EHR. Rather, alternate definitions of missing data have been proposed in this context, namely unmeasured, clearly missing, or missing assumed negative. An unmeasured data point is one in which the EHR was never designed to capture, such as firearm ownership, and likely impacts a lot of traditional epidemiologic determinants, as brought up in chapter 3. A clearly missing data point is one in which the EHR was designed to capture, but the values are not present, for example, a patient declining to identify race or ethnicity on their intake questionnaire. The most pernicious of these is the missing assumed negative data point. In this case, the EHR is only capturing the presence of a health condition, not the absence, however, the researcher assumes that lack of documentation indicates lack of a condition. This fallacy is brought up several times in the book and requires the researcher to reflect upon why certain data points are or are not captured, and how we should handle it mechanistically: the decision to discard this variable altogether versus employ a sensitivity analysis if missing assumed negative is an incorrect assumption. Ideally, we would have access to the clinicians capturing these data for clarification, but for previously abstracted data, this may not be possible. To make this more concrete, consider the following figure. Each step of documenting a patient's self-reported high cholesterol is predicating on a decision point wherein the EHR must 1) have a space to record this information, 2) have a provider who asks about high cholesterol and 3) documents it, and 4) can be abstracted for research purposes. If there is no documentation of high cholesterol, then we have assumed it is a negative finding, perhaps incorrectly. In short, there are many more steps for a positive finding to wind up in the EHR versus a negative finding, which perhaps suggests why diagnostic codes in the EHR have higher specificity than sensitivity (discussed elsewhere in chapter 6).
  • Ch. 6: Catchment definitions and modeling. An older albeit still relevant definition of catchment identified five categories of patient selection forces: availability, accessibility, affordability, accommodation and acceptability. Additional to the distance radii and road and transportation network models of measuring catchment, there are numerous other approaches to defining spatial accessibility of healthcare, summarized by Guagliardo (2004). One of the more popular approaches that focuses on the number of healthcare providers in a given area is known as the two-step floating catchment area. This approach first defines a provider-to-population ratio based on a 30 minute drive time to the centroid of a geographical area, such a a ZIP code. This calculation is repeated for all ZIP codes in a given area, as such there may be overlap of some smaller ZIP codes while larger ZIP codes may have some unserved areas. The second step then sums the ratio measures for a given point, such as the location of the healthcare providers, which forms the measure of spatial accessibility, where the greater the number, the more providers there are relative to the population. Provider data may be freely downloaded in the U.S. via the Centers of Medicare & Medicaid Services' National Plan and Provider Enumeration System: almost all providers in the U.S. are assigned a National Provider Identifier (NPI), a HIPAA mandated identifier for transactions involving PHI. There have been numerous extensions to the two-step floating catchment model, as enumerated McGrail (2012), as well as approaches dealing with temporal changes to catchment.
  • Ch. 6: Data accuracy in inpatient medicine. It is also worth noting that in many studies inpatient diagnoses are more reliable than outpatient diagnoses (Segal & Powe, 2004, Lynch et al., 2021, Garza et al., 2021, Kern et al., 2015). This may be related to the importance of accurate codes for reimbursement in the inpatient setting. When using outpatient codes to operationalize a phenotype, investigators may consider requiring multiple (>1) occurrences of a particular code to define a positive finding.
  • Ch. 6: Data accuracy in outpatient medicine. A study involving patient review of ambulatory care notes found an observed 20% reported error rate in the EHR among respondents, with over 50% labeling these errors as "serious" or "very serious." Among the very serious errors, patients most commonly reported a mistake in past or current diagnoses (28%), followed by medical history (24%), medications or allergies (14%), and tests or procedures (8%). Several patients even suggested that the notes were ascribed to the wrong patient (7%). Errors tended to be reported more commonly older and sicker individuals. The 20% overall error rates observed in this study has been corroborated by other surveys, which also note that a majority of self-reported errors in the EHR are found in outpatient practices. Motivated by studies such as these has prompted Bell et al. (2022) to develop a framework for categorizing patient-reported breakdowns in the diagnosis process during ambulatory care. This framework can be used by EHR researchers as an aid to conceptualize data in the EHR and its potential impact study validity, especially among those working with outpatient EHRs. As part of this framework, Bell et al. described the mediating role of patient engagement between the healthcare provider and improved EHR data quality, represented in their figure below (Source:
  • Ch. 6: Duplication observations. Chapter 5 discussed how duplicate observations can be an artifact of the data abstraction and management process. Duplication observations also exist in the EHR itself. This can occur via the patient registration process whereby a patient identifier (e.g., name, date of birth, MRN) fails to map to an existing record, for example, due to a typo. Some have estimated that over 20% of records may be duplicates in certain hospitals or healthcare systems. Patient reconciliation and de-duplication is a core function of EHRs. Aside from potential for patient harm, for researchers failure to identify duplicate records results in an artificial correlation among observations and must be handled statistically lest the standard errors be biased.
  • Ch. 6: The medical narrative. In 1991, Kathryn Montgomery Hunter published what is arguably the most important ethnographic study of how clinicians approach the art of medicine, interpret patient stories, and create the corresponding medical narrative. Although this study was conducted prior to widespread use of EHRs, the findings reveal the many nuances and subtleties of the clinical documentation process. The book contains numerous eloquent passages that are applicable to the modern day EHR researcher. Hunter writes "every patient is the object of a highly abbreviated and written-to-the-moment entry in an office or hospital chart" (p83). The EHR is a snapshot of medical care delivered to a specific patient at a specific time, and no more. It is a brief window of time into the lifecourse of this patient. In Hunter's words, the chart is a "record of each patients medical course, observed in one or more office visits or, in the hospital, from entry to discharge" (p84). The chart is a "minimalist account" (p91), "the chronicle of an individual's physical condition while under medical care [...] governed by the determination of a diagnosis and the selection of a treatment" (p87). As has been argued in this book, the EHR is not a substitute for a survey of epidemiological relevant factors and determinants of etiology: those must be obtained via data linkage or prospective study. As Hunter explores deeper meaning of the medical narrative, we can appreciate the heterogeneity of the type and quality of documentation in the patient's chart. For example, there are differences in the documentation between outpatient and inpatient medicine. Hunter observes that in the outpatient setting "the chart is a collection of private notes" (p84, emphasis added) whereas in an inpatient setting "it has at least a small audience of those who are its several authors." In other words, we tend to see more detail in the inpatient setting to avoid miscommunication of greater acuity patients. When multiple clinicians are involved in care, the chart becomes more objective and less subjective (p87). However, this is not to say that inpatient notes are uniformly superior to outpatient notes as more entries in the longitudinal record, with correspondingly greater detail, may also create more uncertainty due to potentially conflicting information or tests results, resulting in a "cascade of uncertainty" (p88). When such discrepancies arise, Hunter suggests that notes recorded by consulting specialists are likely the most authoritative, followed by attending physicians, and then residents and trainees (p89). In general, we can also observe in the EHR that more complicated, sicker, or "interesting" patients have more detail in the charts (p91-92). This is related to the notion of informed presence bias discussed in chapter 9. This heterogeneity in narrative detail can also be driven by clinician experience, as Hunter notes that detail is inversely proportional to years in the profession: more junior clinicians document more detail. However, as a whole, "concise notes are more highly valued than long ones" (p85), which has implications for the ability to parse the medical narrative and infer or impute epidemiological determinants that are not discretely coded. This is especially true under the EHR paradigm where traditional narratives are disincentivized (by payors) in favor of coded and discrete documentation using standardized templates. The transition to EHRs and its impact on the clinical encounter is further discussed in a review by Hedian, Greene, and Niessen (2018).
  • Ch. 7: Case-control study design. Despite the elegance and efficiency of the case-control study design, these study designs are not without controversy in EHR research. Schuemie et al. (2019) argue against using case-control studies in retrospective databases, such as the EHR, due to the appropriateness of comparison groups, the timing of when covariates are captured relative to exposure (thereby inappropriately adjusting for possible mediators), and lower statistical power. Instead the authors argue that retrospective cohort studies may be more appropriate when data on exposure and covariates are present in the EHR. When a case-control study is undertaken, negative controls (a condition unrelated to the one under study) may provide a sense of possible bias. On the other hand, one can consider a nested case-control study design/quasi-cohort approach or the incident user study design, as a way forward. Importantly, we should note that 1) there is nothing inherently flawed in the case-control study design, and 2) conceiving of the EHR as an open cohort allows for flexibility in the hybrid case-control designs since we sample directly from the study base.
  • Ch. 9: Confounding by indication. Confounding by indication and protopathic bias can more readily be appreciated through the use of causal diagrams (below figure).

    Confounding by indication appears as a traditional confounder, for example, on the basis of the unmeasured covariate patient prognosis. On the other hand, protopathic bias appears more closely aligned to the issue of reverse causality in that the disease itself - more specifically a preclinical or subclinical state - influences the likelihood of treatment which then may subsequently impact the outcome.
  • Ch. 12: Machine learning and modeling. For readers interested in an overview of deep learning and neural networks, this article describes their application to predictive modeling of health data, including the EHR.
  • Ch. 12: Machine learning, modeling, and confounding. An important part of causal ML modeling is confounder identification. Benasseur et al. (2022) reported successful results on the use of several ML algorithms to identify known confounders in complex EHR datasets, and when confounders were hidden (i.e., unknown to the algorithm), identify proxies to reduce residual confounding. For researchers interested in delving further into confounder identification and adjustment methodologies using EHR data, one can create "plasmode" simulations from empirical EHR data to compare variable selection strategies.
  • Ch. 12: Natural language processing. It is worth noting that there is a spectrum of NLP implementation. That is, one may not need to deploy a full solution that automatically discretizes the textual data. Rather NLP can be used to retrieve "snippets" of text that a researcher manual reviews for the final determination. By analogy, NLP is more of a decision support tool than a diagnostic tool. The efficiency of the investigator team to classify these snippets will be influenced by the quantity of manually data to review, in which case some error may be acceptable at the cost of time or resources.
  • Ch. 14: Learning health system. Under the learning health system paradigm, technology, clinical practice, and research form a continuum to systematically refine and improve care delivery while reducing costs. EHRs are a key component of this. First, EHR-derived data are used to generate the evidence-base through analysis. Second, the EHR itself can be modified to help apply the evidence into practice. (Source:

    Post, Burningham, and Halwani (2022) provided a review of challenges and opportunities for using EHR data in the learning health system model. Although their review was specific to oncology, it is applicable to other clinical domains. Notable limitations identified in their review include lack of interoperability, need for population (as opposed to individual) real-time reporting, poor data quality, and difficulty in obtaining and operationalizing research datasets from the EHR. This book has addressed many of these issues at length. The authors proceed to identify both clinical and research opportunities to use the EHR in the learning health system. These opportunities included real-time data flow from EHRs combined with the latest evidence based guidelines for clinical decision making, interoperability between EHRs and 3rd party applications, increased use of machine learning to aid in prediction models and communication interfaces (e.g., speech recognition), improved clinical trial recruitment, and seamless reporting to external health registries and partners. This last point in fact motivated a research group from Wake Forest University (Winston Salem, NC) to create a diabetes disease registry from their EHR to enable the learning health system model. In their article, the authors discuss the difficulty in operationalizing a diabetes phenotype from the EHR, creating research databases, and integrating with population health systems. In addressing these challenges, the authors present an approach to the creation of a registry from the EHR using open source tools and a common data model. This is but one example of using EHR data for the learning health system: the U.S. Agency for Healthcare Research and Quality maintains a list of case studies on their website that detail how several healthcare systems have approached this complex task.

About | Blog | Books | CV | Data | Lab