Posted on July 31st, 2015
An IPython notebook containing some example code written for this project can be found on GitHub.
A quick slideshow summary of the project:
Major health problems have far-reaching effects on the lives of patients. Many of these issues manifest via obvious, often painful symptoms. Other signs, however, including those that may serve as "early warnings" of larger problems, can be much more subtle, going undetected (or ignored) by the individual.
One such potential "red flag," particularly for heart disease, is a person's overall activity level: how and how often they walk, exercise, sleep, etc. Perhaps the best and most universal proxy for "overall activity level" is walking: patterns of when, how, and how much an individual moves through their environment under their own power. While walking is not a perfect proxy for activity (it excludes, for example, disabled persons in wheelchairs, as well as activities such as biking or weightlifting), it does capture in broad strokes the overall "vigor" of a person in their everyday environment. And while various other factors (professional, socio-economic, climate) have large effects on walking patterns, so, of course, does a person's health.
While walking metrics are great proxies for overall activity, it is difficult for physicians to gain access to such data. Gathering it generally requires a doctor or patient to have a priori knowledge of a health issue, and then for the patient to come into the office and perform some sort of physical stress test. Such exams, like the Six-Minute Walk Test, have been shown to have prognostic value for health outcomes in patients with heart disease.
However, the newly-emerged product category of "always-on" wearables, such as wrist-worn walk/sleep trackers from companies like Fitbit, could lead to a paradigm shift in disease diagnosis. Collection and analysis of data from these trackers has the potential to sidestep numerous hurdles that separate doctors from potentially critical information about their patients.
UCSF's Health eHeart study is the first mass effort to combine disparate sorts of medical and digital information to better understand cardiovascular disease.
In collaboration with UCSF, I have specifically addressed the following questions:
My analysis led to the following insights for Health eHeart:
The Health eHeart study is a massive undertaking of the University of California, San Francisco, begun in 2013. The overarching goals of the study, which is open to anyone 18+ (with or without heart disease), are to leverage big data to (the italicized text is taken directly from the study website):
Participants in the study are asked to perform a variety of data-generating tasks, including, to:
My role as a consultant for the project was focused on predicting clinical status based on activity and survey data / medical reports.
As mentioned above, participants in the study may or may not have:
The UCSF team was most interested in those participants for which we have (1) minute-by-minute Fitbit data as well as (2) clinically-linked information. Thus, my first task was to identify the sample comprising the intersection of these two groups.
A first-pass look at the data showed that, of the ~30,000 current enrollees of Health eHeart, slightly under 1,000 have contributed Fitbit data. Of this group, most (~850) contributed at least 100 days' worth of data (though not all of it passed the "smell test" of validity).
On the clinical side, the biggest interest concerned populations that are doctor-diagnosed with serious cardiovascular or pulmonary disorders (congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD)) or that have contributed daily self-reports relevant to such disorders (the primary case here being "shortness of breath," a major flag for cardiovascular problems). Of the total cohort, 641 participants had CHF, while 519 were labelled as having COPD. Unfortunately, the intersections of these patient groups with the Fitbit group were quite small: fewer than 20 people were both disease-positive and came with walking data. Thus, the focus going forward turned to linking walk data to day-by-day symptomology reports.
Let’s take a look at what a randomly-chosen time series of a single day looks like:
In this example, the individual, like most people on most days, walks in bursts, as opposed to consistently throughout the day. We can see clear periods of inactivity, including the wee hours of the morning (below ~400 on the x-axis, i.e. before ~6:30am), when this person was probably asleep. We can zoom in on long periods of walking at a sustained pace (in this case, likely just after waking up):
Looking at another participant’s data, we can also clearly see the need for data filtering/cleaning:
There is something very off with this data: even world-class sprinters have a hard time reaching 300 steps per minute. The fact that the exact same number is recorded for several consecutive minutes adds support to the idea that this is not valid data and needs to be discarded. Removing days due to suspicious-looking outliers or a lack of data (fewer than a few hundred steps) was a major part of the project's processing pipeline.
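A cleaning step like this can be sketched in pandas. The thresholds below (250 steps/minute, 300 steps/day, runs of repeated large values) are illustrative stand-ins, not the exact cutoffs used in the project:

```python
import pandas as pd

# Hypothetical thresholds -- the project's actual cutoffs were tuned by
# inspection, so treat these values as illustrative.
MAX_PLAUSIBLE_STEPS_PER_MIN = 250   # even elite sprinters rarely exceed this
MIN_DAILY_STEPS = 300               # below this, the tracker was likely not worn

def is_valid_day(day: pd.Series) -> bool:
    """Return True if a minute-by-minute step series passes basic sanity checks.

    `day` is a Series of per-minute step counts for one participant-day.
    """
    if day.sum() < MIN_DAILY_STEPS:
        return False                  # too little data to trust
    if (day > MAX_PLAUSIBLE_STEPS_PER_MIN).any():
        return False                  # physiologically implausible rate
    # Long runs of the exact same large value suggest a recording glitch.
    repeats = (day == day.shift()) & (day > 100)
    if repeats.astype(int).rolling(5).sum().max() >= 4:
        return False
    return True
```

Days failing any of these checks would simply be dropped before feature extraction.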
Once we have a daily time series that we trust, we can engineer several features, including:
Other features require additional data-processing stages. An example of such processing is temporal smoothing (averaging values over rolling time windows) to create features like maximum average step rates over various time periods (10 minutes, 1 hour, etc.). While, as we'll see below, some of these features are correlated with one another to various degrees, the hope is that each captures a somewhat different angle on the data. Walking habits likely differ for many people between the work-week and the weekend, so it was desirable to create different feature sets for each. Handily, Python's pandas data-analysis package allows for automatic extraction of day-of-week information via its datetime tools.
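A sketch of this feature engineering, assuming one day of minute-level data in a pandas DataFrame with a datetime index (the feature names here are illustrative, not the project's exact set):

```python
import pandas as pd

def daily_features(day: pd.DataFrame) -> dict:
    """Engineer summary features from one day of minute-level step counts.

    `day` has a DatetimeIndex at 1-minute resolution and a `steps` column.
    """
    steps = day["steps"]
    return {
        "total_steps": steps.sum(),
        "active_minutes": (steps > 0).sum(),
        "pace_variance": steps[steps > 0].var(),
        # Temporal smoothing: best sustained pace over rolling windows.
        "max_rate_10min": steps.rolling(10).mean().max(),
        "max_rate_60min": steps.rolling(60).mean().max(),
        # pandas' datetime machinery gives day-of-week for free
        # (0 = Monday ... 6 = Sunday), enabling a weekday/weekend split.
        "is_weekend": day.index[0].dayofweek >= 5,
    }
```

Running this per participant-day yields the feature table that the visualizations and models below are built on.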
Taking a bird's-eye view of a single feature (daily step totals) for the same participant as above, we can look at long-term time series:
as well as histograms:
This specific participant walks more than is typical of our total cohort, as shown by the full histogram over all participants:
These histograms also drive home the above point about data cleaning: the distributions are bookended on the low end by a spike of days representing 0 steps. These days need to be discarded, not modeled! Separately, the smoothness of the full cohort distribution is interrupted by small peaks at around 10- and 20-thousand steps. These are likely driven by Fitbit’s goal-setting feature: a user can choose a certain target number of daily steps (the default is 10,000) and is given updates throughout the day about how far they’ve gone toward reaching that goal. It looks like this method actually works to promote physical activity!
Using Python's seaborn visualizations package, we can also plot individual features against one another over all days for which we have data:
These features range from being highly correlated (maximum 10- vs. 60-minute pace) to showing essentially no correlation (average total minutes of activity vs. variance of pace). Certain pairs, while having moderate correlation, also contain interesting off-axis clusters, an example shown on the scatter plot being maximum daily 1-minute step rate vs. total daily steps:
The fuzzy border between likely walking and running paces sits somewhere around 150 steps/minute. So the group highlighted in green, while generally quite sedentary (as evidenced by few total steps per day), is at some point running for a short stretch. (My speculation: they are running after the mailman after missing a package delivery!)
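The degrees of correlation described above can be quantified with pandas before plotting anything. A minimal sketch using synthetic stand-ins for the engineered features (the real analysis would use the actual feature table):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: two tightly related pace features plus an
# unrelated variance feature, mimicking the patterns described above.
n_days = 500
max_10min = rng.normal(100, 15, n_days)
features = pd.DataFrame({
    "max_rate_10min": max_10min,
    "max_rate_60min": max_10min * 0.8 + rng.normal(0, 5, n_days),
    "pace_variance": rng.normal(50, 10, n_days),
})

corr = features.corr()  # pairwise Pearson correlations
# seaborn.pairplot(features) would render the scatter grid shown above.
```

Inspecting `corr` first makes it easy to decide which feature pairs deserve a closer look in the scatter grid.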
Now that I have a good bead on what the data look like, let's return to the primary directive:
Can we create a prototype diagnostic tool using Fitbit data, alone?
The specific case I will address is to diagnose the severity of daily symptom reports for shortness of breath purely from the step data generated by Fitbit trackers. Shortness of breath is one of the strongest indicators of cardiovascular problems (including CHF and COPD). Let’s take a look at how many data points we have for each “shortness” rating, where 1 = no symptoms and 5 = severe symptoms:
There are a couple of takeaways here:
While there are lots of data points here, most come from the "no symptoms" category. The main approach I will take to deal with these dramatically unbalanced data sets is to randomly down-sample the larger categories (1 and 2) to match the smallest (3: around 200 days). Using the features discussed in the previous section, I built a model to perform binary classifications (e.g. does walking data from day X belong to symptom report 1 or symptom report 2?). Random forest algorithms were used, with 8-fold cross-validation: training on 7/8ths of the data, testing on the remaining 1/8th, repeating the entire analysis 8 times (leaving out a different testing sample each time), and then averaging the results. As the groups were balanced, classification success was assessed by a simple accuracy score, which ranged between 62% and 64% for the three pairwise analyses. Models built using only single features (e.g. can total steps in a day predict symptom score?) were substantially worse, typically only a couple of percentage points better than the 50% accuracy one would expect from a "coin flip" guess.
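The down-sampling and cross-validation procedure can be sketched with scikit-learn. `balanced_pairwise_accuracy` is a hypothetical helper written for illustration, and the hyperparameters are placeholders, not the project's actual settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def balanced_pairwise_accuracy(X, y, label_a, label_b, seed=0):
    """Down-sample two symptom classes to equal size, then score a
    random forest with 8-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx_a = np.flatnonzero(y == label_a)
    idx_b = np.flatnonzero(y == label_b)
    n = min(len(idx_a), len(idx_b))          # size of the smaller class
    keep = np.concatenate([rng.choice(idx_a, n, replace=False),
                           rng.choice(idx_b, n, replace=False)])
    rng.shuffle(keep)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # cv=8: train on 7/8ths, test on the held-out 1/8th, 8 times over.
    scores = cross_val_score(clf, X[keep], y[keep], cv=8)
    return scores.mean()
```

Because the classes are balanced by construction, a plain accuracy score is a fair summary metric here.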
Finally, these accuracies had to be validated. Is a 64%-accurate model truly "better" than guessing in a statistical sense? To test this, I performed a Monte Carlo-style permutation test (named after the famous casino in Monaco). The point of this analysis was to generate the true "null" distribution by randomly scrambling the symptom report labels for each day and then attempting to build a predictive model using the same random forest algorithm. As I was now trying to model noise, the classifier should have failed, which is exactly what was observed:
All of the observed accuracies from above (dashed red line = lower bound) fell well to the right of this distribution, and all were highly significant (p < 0.001).
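The permutation procedure just described can be sketched as follows (a toy version; the permutation count and model settings here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def null_accuracy_distribution(X, y, n_permutations=100, seed=0):
    """Monte Carlo permutation test: shuffle the symptom labels and
    re-fit the same classifier each time. The resulting accuracies
    form the null distribution against which the real score is compared."""
    rng = np.random.default_rng(seed)
    null_scores = []
    for _ in range(n_permutations):
        y_shuffled = rng.permutation(y)   # break any real label/feature link
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        null_scores.append(cross_val_score(clf, X, y_shuffled, cv=8).mean())
    return np.array(null_scores)

# p-value: fraction of null scores at or above the observed accuracy,
# e.g. p = (np.sum(null >= observed) + 1) / (len(null) + 1)
```

If the observed accuracy falls entirely to the right of this null distribution, as it did here, the model is capturing real signal rather than noise.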
Separately, I conducted an analysis to dissociate the "1" ratings (no symptoms) from all of the other groups lumped together, which yielded accuracy scores in excess of 68%. Meanwhile, the pairwise analyses above confirmed clear differences in walking patterns between levels of symptom severity (i.e. groups 2 and 3 could be dissociated). Thus, I can conclude that walking patterns contain information relevant to both (1) the existence of symptoms and (2) the severity of symptoms.
The major conclusion from the consulting project is that Fitbit data can be used to predict clinical measures. While this is a prototype model built for a specific use-case (prediction of daily symptom self-reports), the pipeline and algorithm should be generalizable to other cases/clinical diagnoses.
While 62-64% decoding accuracies are not earth-shattering, the model should improve with:
With such improvements, Fitbit could quickly become a viable clinical tool for cardiologists: one that is inexpensive, non-invasive, and easy to automate.