Neal D. Goldstein, PhD, MBI

About | Blog | Books | CV | Data

Jul 8, 2019

Web Scraping for the Epidemiologist using R

It should come as no surprise that there are copious amounts of data available to the epidemiologist posted publicly on the Internet. These could be traditional health-related data or data that are useful to study as causes (or outcomes) of health-related states. Compare this to a mere generation ago when knowledge was captured in hardcopy, as opposed to electronic, format. For the researcher, in many cases this meant a laborious and time-consuming manual abstraction of the data. Now, with seemingly endless data available to us electronically, those who engage in etiologic research have a great opportunity to search for novel determinants.

(At the outset, I want to make it clear that I do not support pure predictive modeling and automated modeling approaches. A good epidemiologist always brings a sound causal model to the table. We need to fight this temptation in the face of such an abundance of data.)

The question becomes how do I obtain these publicly accessible data for my research? When the data are available to us in an exportable format, such as an Excel spreadsheet, the process to import the data into statistical software for modeling is trivial. Download and import. When the data are visible to the user, but not exportable, the researcher must abstract the data themselves. For small amounts of data, this can be done manually, although this opens up the possibility for transcription errors potentially resulting in information bias. For larger amounts of data, a manual abstract is likely impossible (or will result in a substantial burden to the poor graduate student tasked with this exercise in futility). Fortunately, a newer family of techniques colloquially known as "web scraping" algorithms make this task much less onerous.

Consider this motivating example: suppose we are interested in analyzing notifiable disease conditions reported to the CDC. These data are publicly available via the National Notifiable Diseases Surveillance System (NNDSS). Further suppose we wanted to retrieve annual NNDSS data available for the years between 1996 and 2015 (twenty years of data). These data are clearly visible here: Focusing on the 2015 data in particular, the first table that we encounter has the various notifiable conditions and the number of reported cases for that year. One way we can import these data is to manually transcribe the data directly into the statistical software. That is, we literally key in each and every value in to our statistical software GUI. Of course, this risks the possibility of a typo and is labor intensive. A second, better option is to copy and paste this table directly Excel to clean before we import to the statistical software. This is actually a somewhat useful general approach to data cleaning and import as Excel has a number of interesting data parsing tools (e.g., the ever so useful text to columns function). Since the data are already on a webpage (in an HTML table) Excel should recognize the table during a copy an paste operation, making this process somewhat seamless. This approach may work just fine if you're only interested in the 2015 data. But as was stated earlier, we want the last twenty year's worth of data. Clearly a manual copy and paste of the MMWR tables will take some time. Further, any formatting differences in the tables will create complications.

Now consider an alternative approach: one where we directly interface the statistical software to the Internet. Users of R have access to a package called rvest. In short, this package connects R to an Internet resource (in this case a webpage), saves a local copy of the webpage (converting the HTML to XML format), and allows direct abstraction of the data without ever performing a manual copy and paste. The trick to using this package is familiarity with basic webpage design; a working knowledge of HTML and XML is needed. All webpages rely upon the HTML language for client display; that is, HTML is a display/presentation programming language. On the other hand XML is a data transport language; it defines what the data represent. They both are markup languages that employ "tags" to define what attributes look like or represent. Thus, it is possible to normalize an HTML document to valid XML format. By pointing the html_table function in the rvest package at the correct table on the target webpage, it is possible to directly import the data into R, bypassing any manual copy and paste operation. Then, R can be used to perform any data management. This becomes highly efficient as the size of the data grow. In fact, you may even consider writing a generic function that requires only the webpage and the pointer location as arguments and automate this operation across the twenty years of data. For references, here's some sample code to abstract the 2015 reportable conditions into an R dataframe:

#read the 2015 data
nndss = read_html("")

#the nndss object is an XML document
#extract html "tables" from the nndss webpage
nndss_tables = html_nodes(nndss, "table")

#check the number of html tables

#there are 23 html tables in this document
#this is where a working knowledge of HTML
#is useful to identify which table is the
#target table; this could also be accomplished
#through trial and error

#extract the 2nd table, which corresponds to the target data
nndss2015 = html_table(nndss_tables[2], fill=T)[[1]]

#note that because some cells in this table span multiple
#columns, the table requires further cleaning to be useful
#this is where R packages such as dplyr come in handy
#although this can also be easily accomplished with 
#built in functions

The possibilities are endless, and this approach can readily be adapted to many other applications. The data do not have to be in a table. As long as there is some webpage element (i.e. XML node) that can be identified this approach will work. There are numerous examples of this used in other fields, one such application is found here: A simple web search for "web scraping using R" will find dozens more.

About | Blog | Books | CV | Data