Neal D. Goldstein, PhD, MBI, FCPP

Jun 19, 2017

Sharing your analytic code in open source epidemiology

I've become involved in the open source movement and recently wrote an article in the journal Epidemiology about how we can advance epidemiology by releasing our analytic codes. There are four main benefits to releasing code: 1) transparency, 2) reproducibility, 3) advancement of methods, and 4) education. As epidemiologists, we do not maintain a monopoly on our methods, and information should be free and accessible to all.

For those interested in following suit, I've prepared this brief tutorial. I'm using GitHub as the source code repository, and Zenodo as a way to make the work citable (through a digital object identifier, or DOI). I don't specifically endorse any open source platform, but these two have worked well for me and are respected in the community.

First, you'll need to obtain a free account on GitHub. Then, using your GitHub account, set up a corresponding Zenodo account, configured as documented in the "Login to Zenodo" section in this other blog post.

Before releasing your code, be sure to have redacted any proprietary or personal information, cleaned up your code according to "coding best practices," and provided the citation to the published article in the source code so your audience can learn more about your work. You may also be interested in applying an open source license. I recommend the simple MIT license for all code being released (unless your code inherits other code, in which case a GPL license is likely the best fit).
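As one illustration of putting the citation in the source code, here is a minimal Python sketch that builds a comment header to paste at the top of an analytic script. The function name `citation_header`, the header wording, and the placeholder citation are all hypothetical; adjust the comment prefix to whatever your analysis language uses.

```python
def citation_header(citation, license_name, comment_prefix="#"):
    """Return a commented header block naming the article and the license.

    All arguments are plain strings; nothing here is specific to GitHub or Zenodo.
    """
    lines = [
        "Analytic code accompanying:",
        citation,
        f"Released under the {license_name} license.",
    ]
    # Prefix every line with the comment marker of the target language.
    return "\n".join(f"{comment_prefix} {line}" for line in lines)


if __name__ == "__main__":
    # Placeholder citation; replace with the real reference to your article.
    print(citation_header("Author A. Title of the article. Journal. Year.", "MIT"))
```

For a language with a different comment marker, pass it as `comment_prefix`, for example `"//"`.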

Once you're ready to share your code, follow these five steps:

1. On GitHub, create a new repository specific to your analysis. I prefer having separate repositories per publication for a clean mapping from the published work to the underlying analysis. Fill in the required information, select to make this repository public, and add an open source license as appropriate.
2. Within this newly created repository, upload the analytic code source files and any ancillary data you wish to share, such as fitted models, simulated or de-identified datasets, etc.
3. On Zenodo, go to Home > Account > GitHub and flip the switch to ON for any repository on GitHub for which you would like to create a citable link.
4. Back on GitHub, create a release for the repository. When creating a release, I like using the publication title as the release title, and adding the citation as a description for ease of locating the manuscript. However you choose to do it, be sure it's meaningful to the public. Designating the release a "pre-release" can be useful for a work in progress, for example, a manuscript that is currently under review and not yet published.
5. Back on Zenodo, click the "Synch now" button towards the top of the GitHub page. If all has gone well, your repository will now have a DOI associated with it, ready for sharing with the world.
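The steps above use the GitHub web interface. For those who prefer scripting, the release step can also be done through GitHub's REST API. The following is a minimal sketch, not the procedure I used: it only builds the request for the "create a release" endpoint, and the owner, repository, tag, and token values are placeholders you would need to supply.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_release_request(owner, repo, tag, title, citation, prerelease, token):
    """Build (but do not send) a request that creates a GitHub release.

    title      -> release title (e.g., the publication title)
    citation   -> release description (e.g., the manuscript citation)
    prerelease -> True while the manuscript is still under review
    """
    payload = {
        "tag_name": tag,
        "name": title,
        "body": citation,
        "prerelease": prerelease,
    }
    return urllib.request.Request(
        url=f"{API_ROOT}/repos/{owner}/{repo}/releases",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    request = build_release_request(
        owner="your-github-username",
        repo="your-analysis-repository",
        tag="v0.1",
        title="Title of the publication",
        citation="Author A. Title of the publication. Journal. Year.",
        prerelease=True,
        token="YOUR_PERSONAL_ACCESS_TOKEN",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually create the release, send it with:
    # urllib.request.urlopen(request)
```

Nothing is sent to GitHub unless you uncomment the final line, so the sketch is safe to run while you check the payload.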

[EDIT: Nov 1 2017]

As I originally wrote this, I did not consider the complication of publishing your code during the peer review process of the corresponding manuscript. Now that I have been through that process, I have a few tips/recommendations to share. These are mainly applicable for those using GitHub to archive the code and Zenodo to share the code through a DOI.

When releasing your initial code, I would set it as a "pre-release" in GitHub. There is a simple checkbox when creating a release that flags it as such. Then you include the release-specific DOI in the manuscript. Behind the scenes, Zenodo has actually created two DOIs when you create the initial release: one that is specific to the release, the other that is generic to the repository. This generic one will always link to the most recent release you have on file. This is potentially useful if you just wish to have the manuscript always link to the most recent code. However, this also creates a potential problem because, in theory, the manuscript should link to the code that was used for that iteration of the manuscript.

The alternative, and I think proper, solution is to create a release for each iteration of the manuscript/code. Suppose you get back the initial submission peer reviews and it is a revise-and-resubmit. If, during this process, you needed to run additional analyses or have changed your code, you can create a new release on GitHub that will correspondingly (and automatically) generate a new DOI on Zenodo. The revised manuscript then includes this updated, release-specific DOI.

This solution works fine...until the point that the manuscript is accepted for publication. Since you don't necessarily want to link to pre-release software in your manuscript, you have one final opportunity to update the DOI when you receive the page proofs. At this point, you create the final release on GitHub, uncheck "pre-release", and drop the new DOI into the page proofs for publication. In the simple example here, of an initial submission and a revised (and ultimately accepted) manuscript, there will be three releases and four DOIs, as follows:

1. The initial release on GitHub for the initial submission of the manuscript. Zenodo generates two DOIs: one that is specific to the release, the other that is generic to the repository. The release-specific DOI goes in the manuscript.
2. The second release on GitHub for the revised manuscript. One DOI is generated automatically by Zenodo for this new release. The revised, release-specific DOI goes in the manuscript.
3. The third release on GitHub, moving the code from "pre-release" to official release, corresponding to the in press version of the manuscript. One DOI is generated automatically by Zenodo for this new release. The production release-specific DOI goes in the manuscript.
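The bookkeeping in this example can be sketched in a few lines of Python. This is only a toy model of the counting described above, not a call to Zenodo: the `Release` class, the `doi_summary` function, and the DOI strings are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Release:
    label: str        # which manuscript iteration this release corresponds to
    prerelease: bool  # True while the manuscript is under review
    doi: str          # release-specific DOI (placeholder string)


def doi_summary(repository_doi, releases):
    """Return every DOI for the repository and the most recent release.

    The generic repository DOI is counted once; each release adds its own DOI.
    The generic DOI always points at the most recent release.
    """
    all_dois = [repository_doi] + [release.doi for release in releases]
    latest = releases[-1]
    return all_dois, latest


releases = [
    Release("initial submission", True, "doi-release-1"),
    Release("revised manuscript", True, "doi-release-2"),
    Release("in press", False, "doi-release-3"),
]
all_dois, latest = doi_summary("doi-repository", releases)

print(len(releases), "releases,", len(all_dois), "DOIs")
print("Generic DOI currently resolves to:", latest.label)
```

Running it prints "3 releases, 4 DOIs" and shows that the generic DOI follows the in press release, which is why only the release-specific DOI pins the manuscript to the code that was actually used.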

Of course, you can avoid the burden of these additional steps by including the generic repository DOI in the original manuscript, but if you update the code at any point after the article goes in press, readers will be unsure of exactly what code was used for the manuscript analyses. And isn't that the point of releasing your code?

One final note on updating code and creating new releases on GitHub. If you are truly working in a collaborative environment, you would need to create a pull request to update the code. However, assuming you are the sole author of the analytic code, the original code can simply be updated by editing the current code in your repository. When you update the original code, you can commit the code changes directly to the master branch (effectively bypassing the collaborative nature of GitHub). Previous versions of the code are still accessible to interested parties by clicking the History link. Creating a new release simply takes a snapshot of the current code and archives it for posterity.

Cite: Goldstein ND. Sharing your analytic code in open source epidemiology. Jun 19, 2017. DOI: 10.17918/goldsteinepi.