BreakoutDetection: Using the new Twitter tool for influenza surveillance
Hey world! It’s been a while. Regardless…
The ingenious duders over at Twitter recently released their BreakoutDetection tool (https://blog.twitter.com/2014/breakout-detection-in-the-wild) to detect anomalies and mean shifts in large time-series datasets. The first thing that came to my mind was: will this work for outbreak detection? It should. I regularly use statistical process control (SPC) charts to monitor disease incidence (particularly of healthcare-associated infections), but I have never been that sold on SPC alone. Furthermore, if this works appropriately, it could be applied more broadly to cloud-based surveillance systems to detect outbreaks of anything more accurately (Ebola, anyone? NHSN data?).
Although today is my first day back in the office from a six-week paternity leave, I had to try this out on the CDC historical flu data and compare it to their age-old time series plot with its “epidemic threshold” confidence bands (my brain is only partially working from lack of sleep and lack of R use for a while).
Let us install the R package first:
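A minimal sketch of the install, assuming the package still lives in the twitter/BreakoutDetection GitHub repository and that devtools is an acceptable way to fetch it. `install_breakout` is a made-up wrapper name, used only so nothing gets downloaded until you ask for it:

```r
# Hypothetical wrapper (not part of any package): installs BreakoutDetection
# from GitHub only if it is missing. Assumes the twitter/BreakoutDetection repo.
install_breakout <- function() {
  if (!requireNamespace("BreakoutDetection", quietly = TRUE)) {
    # devtools is only needed to pull the package from GitHub
    if (!requireNamespace("devtools", quietly = TRUE)) {
      install.packages("devtools")
    }
    devtools::install_github("twitter/BreakoutDetection")
  }
  # TRUE if the package can now be loaded, FALSE otherwise
  invisible(requireNamespace("BreakoutDetection", quietly = TRUE))
}
```

Run `install_breakout()` once, then `library(BreakoutDetection)` to load the package.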
Next, get the CDC data from here (I rename it data because I like to reuse my code with generically named datasets, particularly for these test cases): View Chart Data (http://www.cdc.gov/flu/weekly/weeklyarchives2014-2015/data/NCHSData.csv)
Read that business into R:
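Something like the following should do it. This is a sketch that assumes the file was saved as NCHSData.csv in the working directory, and that the percent column's header reads roughly "Percent of Deaths Due to P&I" (that spelling is a guess, worked backwards from the column name used in the breakout call):

```r
# Assumption: the CDC file was saved as "NCHSData.csv" in the working directory.
csv_path <- "NCHSData.csv"
if (file.exists(csv_path)) {
  data <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
  str(data)  # check that Year, Week and the percent column came through
} else {
  message("Download NCHSData.csv from the CDC link first")
}

# read.csv() passes headers through make.names(), which turns spaces and "&"
# into dots. The header text below is assumed, not copied from the file.
make.names("Percent of Deaths Due to P&I")
```

That name mangling is why the column shows up with dots in it rather than spaces and an ampersand.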
Run the breakout function and view the plot. Note I’m using a very small min.size (the minimum number of observations between change points) just for the heck of it. Also, I’m setting percent=0.02, so an additional change point is only added if it improves the goodness-of-fit statistic by at least 2%. This is pretty small – I have tried a bunch of other values and get reasonably similar results. The small percent gives a few more detections, mostly in the first couple of years, which I think is more useful for pinning down what is happening.
res = breakout(data$Percent.of.Deaths.Due.to.P.I, min.size=4, method='multi', percent=0.02, degree=1, plot=TRUE)
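To turn the detected mean shifts into calendar weeks, something like this could work. It is only a sketch: it assumes the `loc` element of the result holds the row positions of the detected breakouts, and that `data` has `Week` and `Year` columns. `breakout_weeks` is a made-up helper name, and the toy data frame is a stand-in, not CDC data.

```r
# Hypothetical helper: translate breakout row indices into "week-year" labels.
# Assumes `loc` are 1-based row positions in `data` (e.g. res$loc) and that
# `data` has Week and Year columns.
breakout_weeks <- function(loc, data) {
  loc <- loc[loc >= 1 & loc <= nrow(data)]  # drop anything out of range
  paste(data$Week[loc], data$Year[loc], sep = "-")
}

# Toy data standing in for the CDC file, just to show the shape of the output
toy <- data.frame(Year = rep(2014, 6), Week = 40:45)
breakout_weeks(c(2, 5), toy)
```

With the real objects, the call would be `breakout_weeks(res$loc, data)`.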
The plot is nice – and the Twitter folks have a much cooler plot on their blog post. Unfortunately they didn’t provide the code for the fancier plot and I don’t have the time right now to recreate it. I also don’t use ggplot a ton (the plot is default ggplot), so I gave up after a couple of seconds trying to get the x tick labels to show up as the week/year. Sue me (please don’t).
Either way, I think this is pretty useful and appears to be accurate (sorry for the poor quality figures, WordPress seems to wreck my images when it compresses them).
Keep up the good work, pals.
# add week/year x tick labels to the breakout plot
library(ggplot2)
# create an object of the plot so I don't have to use $
stuff <- res$plot
# create the x labels
data$wkyr <- paste(data$Week, data$Year, sep="-")
# get every 10th observation and put it into a new vector of just the week/year labels for the plot
sub1 <- data[seq(1, nrow(data), by=10), ]
wkyr2 <- sub1$wkyr
# breaks use the same every-10th-row sequence as the labels so the two vectors stay the same length
stuff + labs(y="Percent of All Deaths Due to \nPneumonia and Influenza",
x="Week-Year") + scale_x_continuous(breaks = seq(from = 1, to = nrow(data), by = 10),
labels = wkyr2) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Assistant Professor of Medicine
Assistant Director of Epidemiology and Biostatistics
University of Louisville School of Medicine
Division of Infectious Diseases
Clinical and Translational Research Support Unit
501 E. Broadway, #120B (not for much longer – moving down the hall on Monday Nov 3)
Louisville, KY 40202