When You Want to be Spied Upon: Big Data, Epidemiologic Intelligence and the Spread of Ebola
The release of information about government surveillance by defense worker Edward Snowden added fuel to already simmering privacy concerns. But what if someone told you that surveillance – Big Data snooping and tracking – could save your life or the life of a loved one?
Herein lies a paradox of the Big Data connection to privacy. The same obtrusive, privacy-wrecking, data-mining, Facebook-snooping technology that many fear could prove indispensable to specialists belonging to the Epidemic Intelligence Service. Big Data could be used to identify and locate individuals who may have been exposed to infected persons, or who may have come into contact with surfaces or materials suspected of promoting the spread of a disease.
If you visited the store that day, you want to be found. If you fall ill, you will want to recognize the illness as potentially serious and promptly seek treatment to improve your chance of survival.
Consider the possibility that a mother became infected, but was asymptomatic for a time. She traveled with an infant who had become symptomatic, but because the child was so young, the symptoms were not diagnosed as a highly contagious illness. Also, the mother was many miles from a clinic, and the child had recently been ill and had recovered without incident. The mother had no way to know this time would be different.
She decided to keep her travel plans, including overseas travel tickets purchased long ago at a substantial discount. Mother and child headed to Denver for a big family wedding to be held in the hall of a local sports venue. She did notice, when hurriedly changing the child’s diaper in a Wal-Mart near an airport motel, that the child’s fever had not improved. Worse, she herself might be running a temperature.
So might begin a major Colorado outbreak, which soon spreads to California and Arizona. When the mother took her infant to the emergency room, the initial diagnosis was incorrect, which delayed proper treatment. By the time the mother returned, she too was quite ill.
Summoned: Big Data Epidemic Intelligence Service
The problem now faced by public health officials is to identify individuals who may have been in contact with the woman and her child once they became symptomatic. The simple data scenarios epidemic gumshoes begin with – airline ticketing data – are well understood. But in this scenario, while there is a concern for fellow passengers, they are less likely to have been infected. The bigger problem, it is believed, is possible contact at the Wal-Mart and the first hospital to treat and subsequently discharge the patients. There would be no records of shoppers entering and leaving the store.
The Big Data public health team might consider alternative data sources:
- Wal-Mart retail receipts on credit and debit card purchases
- Receipts from other retailers in the Wal-Mart shopping mall, whose customers may have visited the big box store that day but not purchased anything
- Cell phone activity in the vicinity of the store
- Text and email activity in the vicinity which referenced Wal-Mart visits
- Twitter and Facebook activity referencing visits to the store
- Employee timecards
- Schedules from Wal-Mart suppliers, whose logistics software could be used to identify contractor personnel likely to have visited that day
A similar collection of data for the sports venue would be needed.
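As a rough illustration of how records from such sources might be combined, the sketch below filters sightings by place and time window, then ranks people by how many independent sources place them there. It is a minimal sketch, not any agency's method: the `Sighting` record, its field names, the 30-minute margin, and both function names are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sighting:
    """One record placing a person at a location (hypothetical schema)."""
    person_id: str     # assumed identifier, e.g. a hashed card or phone number
    source: str        # e.g. "card_receipt", "cell_tower", "timecard"
    location: str
    seen_at: datetime


def candidate_contacts(sightings, location, start, end,
                       margin=timedelta(minutes=30)):
    """Map each person seen at `location` during the exposure window
    (widened by `margin` on both sides) to the sorted list of sources
    that recorded them."""
    lo, hi = start - margin, end + margin
    hits = {}
    for s in sightings:
        if s.location == location and lo <= s.seen_at <= hi:
            hits.setdefault(s.person_id, set()).add(s.source)
    return {pid: sorted(sources) for pid, sources in hits.items()}


def rank_by_corroboration(contacts):
    """Order person ids so those confirmed by more sources come first;
    ties are broken alphabetically for a stable result."""
    return sorted(contacts, key=lambda pid: (-len(contacts[pid]), pid))
```

Ranking by the number of corroborating sources is only one possible design choice; a real investigation would also have to weigh how reliable each source is and how identities are matched across them.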
Beyond the usual local media outlets, outreach resources contemplated might include:
- Direct mail / email lists used by Wal-Mart and other local retailers
- In-store and parking lot surveillance cameras
- Raw interview text from employees, security personnel or others planning or participating in nearby scheduled events
- Information from geolocation systems in the area
- Crowdsourcing of possible store visit data through social media
- Monitoring of existing FluNearYou.org data streams
- Prescriptions issued that day, especially those targeting infant patients
CDC Logo for the Epidemic Intelligence Service
Because of the sheer volume of shoppers and the difficulty in collecting and analyzing this data, it’s possible that the infection could spread to other locations before contacts are traced. Each new infection point adds to the data volume and to the urgency of analyzing it.
A paper published last year envisioned just this sort of scenario. Its authors called for the use of Big Data to develop more systematic techniques for developing infectious disease risk maps. They suggested iterative mapping processes, following Big Data methods. These processes could help develop epidemic intelligence for what specialists call definitive extent, occurrence points, pseudo-absence points, environmental covariates and risk prediction.
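To make one of those terms concrete: pseudo-absence points are background locations, with no reported cases, that a risk model can contrast with known occurrence points. The sketch below shows one simple way such points are sometimes generated, by random sampling inside a bounding box while keeping a minimum distance from every occurrence. It is an illustration under stated assumptions, not the procedure from the paper: the function name, the rectangular `extent`, the planar (not geodesic) distance, and the `min_dist` rule are all choices made for this example.

```python
import math
import random


def pseudo_absence_points(occurrences, extent, n, min_dist,
                          seed=0, max_tries=100_000):
    """Sample up to `n` random points inside `extent` that lie at least
    `min_dist` away from every occurrence point.

    occurrences: list of (x, y) tuples where cases were reported
    extent:      (xmin, ymin, xmax, ymax) bounding box of the study area
    Returns fewer than `n` points if `max_tries` is exhausted first.
    """
    xmin, ymin, xmax, ymax = extent
    rng = random.Random(seed)  # seeded so the sample is reproducible
    points = []
    tries = 0
    while len(points) < n and tries < max_tries:
        tries += 1
        p = (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        # Planar distance; an assumption that only suits small areas.
        if all(math.dist(p, o) >= min_dist for o in occurrences):
            points.append(p)
    return points
```

The occurrence and pseudo-absence points, together with environmental covariates measured at each, would then feed whatever model produces the risk prediction.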
When Hay, George, Moyes and Brownstein published their paper in April 2013, they could not have known about, and did not mention, the Ebola virus that in early 2014 was about to devastate parts of West Africa. But one of the coauthors is involved in a startup, Epidemico. The company’s techniques are taken from the Big Data playbook: aggregation from more than 50,000 disparate data sources, machine learning, visualization, crowdsourcing, data mining, analytics and natural language understanding. This is precisely the space where Syncsort Big Data ETL tools like Splunk-feeding Ironstream and Hadoop-feeding DMX-h play nicely.
Electron micrograph of Ebola virus c/o US Centers for Disease Control and Prevention
Other Battle Lines
As if geolocation, logistics and social network data streams weren’t enough, there’s Big Data work to be done on the battle’s genetics front. A team of researchers performed the difficult field work in Sierra Leone needed to identify infected patients, collect blood samples and ship them to a lab in Boston for deep sequencing. (Deep sequencing requires Big Computation. For example, a project at Northwestern’s Physical Sciences Oncology Center uses a 200-node computational cluster.) By the time results were published in Science in August, the full extent of the hazards in this research had become clear. Five of the paper’s coauthors were already dead from Ebola. Then in October, the UC Santa Cruz Genomics Institute released a new Ebola genome browser.
Some of the weapons on this front, too, are computational. As the death toll continues to mount – nearing 4,000 in Africa as of this writing, with scattered fatalities elsewhere – this work cannot move fast enough.
S. I. Hay, D. B. George, C. L. Moyes, and J. S. Brownstein, “Big data opportunities for global infectious disease surveillance,” PLoS Med, vol. 10, no. 4, e1001413, Apr. 2013. [Online]. Available: http://dx.doi.org/10.1371/journal.pmed.1001413