Leveraging Fitbit data for automated clinical diagnostics

by Mike E. Klein

Posted on July 31st, 2015

In this post, I outline my project from the Insight Health Data Science program — work done in collaboration with the University of California San Francisco on their Health eHeart study.

An iPython notebook containing some example code written for this project can be found on GitHub.


[1] Overview of the project

Major health problems have global effects on the lives of patients. Many of these health issues manifest via obvious symptoms, often painful. Other signs, however, including those that may serve as "early warnings" of larger problems, may be much more subtle, going undetected (or ignored) by the individual.

One such potential "red flag," particularly for heart disease, is a person's overall activity level: how and how often they walk, exercise, sleep, etc. Perhaps the best and most universal proxy for "overall activity level" is walking: patterns of when, how, and how much an individual moves through their environment under their own power. While walking is not a perfect proxy for activity (e.g. it excludes disabled persons in wheelchairs and activities such as biking, weightlifting, etc.), it does capture in broad strokes the overall "vigor" of a person in their everyday environment. While various other factors (professional, socio-economic, climate) have large effects on walking patterns, so, of course, does a person's health.

While walking metrics are great proxies for overall activity, it is difficult for physicians to gain access to such data. This process generally requires a doctor or patient to have a priori knowledge of a health issue and to then have the patient come into the office and perform some sort of physical stress test. Such exams, like the Six-Minute Walk Test, have been shown to have prognostic value for health outcomes in patients with heart disease.

However, the newly-emerged product category of "always-on" wearables, such as wrist-worn walk/sleep trackers from companies like Fitbit, could lead to a paradigm shift in disease diagnosis. Collection and analysis of data from these trackers has the potential to sidestep numerous hurdles that separate doctors from potentially critical information about their patients.

UCSF's Health eHeart study is the first mass effort to combine disparate sorts of medical and digital information to better understand cardiovascular disease.

In collaboration with UCSF, I have specifically addressed the following questions:

  1. What kind of population is enrolled in the study? Of the many thousands of enrollees, only a sub-population have contributed detailed Fitbit data, with a separate fraction having provided clinical details. How much overlap is there between these two groups and what can we start to do with such a sample?
  2. Can we create a prototype clinical diagnostic tool using Fitbit data, alone?

My analysis led to the following insights for Health eHeart:

  • Fitbit step data contains sufficient information to allow for significantly-above-chance level diagnoses of day-by-day symptom reports relevant to heart disease. Such a tool could be invaluable to doctors: patient compliance is much easier to achieve with a passively-worn device than with a subjective self-assessment that needs to be performed with regularity.
  • In order to achieve full-scale diagnosis of disease (i.e. does patient X have heart disease Y?), larger samples of disease-positive participants who also come with Fitbit data are needed. Following data preparation and cleaning, some of the most interesting clinical populations (e.g. congestive heart failure, chronic obstructive pulmonary disease) turned out to be underpowered. That's likely to change soon, however: Health eHeart is an ongoing study, so new data are constantly flowing in!

[2] What is Health eHeart and who are the participants?

The study

The Health eHeart study is a massive undertaking of the University of California San Francisco, begun in 2013. The overarching goals of the study, which is open to anyone 18+ (with or without heart disease), are to leverage big data to (the list below is taken directly from the study website):

  1. Develop new and more accurate ways to predict heart disease based on measurements, behavior patterns, genetics, and family and medical history
  2. Understand the causes of heart disease (including heart attack, stroke, heart failure, atrial fibrillation, and diabetes) and find new ways to prevent it
  3. Create personalized tools you can use yourself to forecast when you might develop heart disease or, if you have it already, when you might be getting worse

Participants in the study are asked to perform a variety of data-generating tasks, including:

  • Fill out detailed surveys on an approximately bi-annual basis
  • Contribute self-measurements of weight, blood pressure, etc.
  • Wear activity trackers (such as a Fitbit), pair sensors with smartphones
  • Download smartphone apps that record pulse, weight, sleep, activity, behavior
  • Provide medical records, updated information about hospitalizations
  • If possible, visit UCSF for in-person tests

My role as a consultant for the project was focused on predicting clinical status based on activity and survey data / medical reports.

The participants

As mentioned above, participants in the study may or may not have:

  • been diagnosed with a heart disorder
  • allowed for the use of personal activity tracking data
  • contributed other sorts of information (clinical, demographic)

The UCSF team was most interested in those participants for whom we have (1) minute-by-minute Fitbit data as well as (2) clinically-linked information. Thus, my first task was to identify the sample that comprised the intersection of these two groups.
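A minimal sketch of that intersection step using pandas; the frames, IDs, and column name here are hypothetical stand-ins, not the study's actual tables:

```python
import pandas as pd

# Hypothetical frames: one row per participant in each data source.
fitbit_df = pd.DataFrame({"participant_id": [1, 2, 3, 5]})
clinical_df = pd.DataFrame({"participant_id": [2, 3, 4]})

# An inner merge keeps only participants present in BOTH sources.
overlap = fitbit_df.merge(clinical_df, on="participant_id", how="inner")
print(sorted(overlap["participant_id"]))  # [2, 3]
```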

A first-pass look at the data showed that, of the ~30,000 current enrollees of Health eHeart, slightly under 1,000 have contributed Fitbit data. Of this group, most (~850) contributed at least 100 days' worth of data (not all of which passed the "smell test" of validity).

On the clinical side, the biggest interest concerned populations who have been doctor-diagnosed with serious cardiovascular disorders (congestive heart failure (CHF), chronic obstructive pulmonary disorder (COPD)) or who have contributed daily self-reports relevant to such disorders (the primary case here being "shortness of breath," a major flag for cardiovascular problems). Of the total cohort, 641 participants had CHF, while 519 were labelled as having COPD. Unfortunately, the intersections of these patient groups with the Fitbit group were quite small: fewer than 20 people were both disease-positive and came with walking data. Thus, the focus going forward turned to linking walk data to day-by-day symptom reports.

[3] What do the data look like?

The data, over time (-series)

Let’s take a look at a randomly-chosen time series from a single day:

In this example, the individual, like most people on most days, walks in bursts, as opposed to consistently throughout the day. We can see clear periods of inactivity, including the wee hours of the morning (below ~400 on the x-axis, i.e. before ~6:30am), when this person was probably asleep. We can zoom in on long periods of walking at a sustained pace (in this case, likely just after waking up):

Looking at another participant’s data, we can also clearly see the need for data filtering/cleaning:

There is something very off with this data, as even world-class sprinters have a hard time reaching 300 steps per minute. The fact that the exact same number is recorded for several consecutive minutes adds support to the idea that this is not valid data and needs to be discarded. Removing days due to suspicious-looking outliers or a lack of data (less than a few hundred steps) was a major part of the project's processing pipeline.
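A sketch of such a validity check, assuming per-minute counts arrive as a pandas Series; the thresholds and function name are illustrative, not the project's exact pipeline:

```python
import pandas as pd

MAX_PLAUSIBLE_RATE = 300   # steps/minute; beyond even elite sprinters
MIN_DAILY_STEPS = 300      # "less than a few hundred steps" -> discard

def day_is_valid(minute_steps: pd.Series) -> bool:
    """Return False for days with implausible rates or almost no walking.

    `minute_steps` is a hypothetical Series of per-minute step counts
    for one participant-day (1440 entries).
    """
    if minute_steps.sum() < MIN_DAILY_STEPS:
        return False
    if (minute_steps > MAX_PLAUSIBLE_RATE).any():
        return False
    return True

# Example: a day where a stuck sensor logs 330 steps/min repeatedly.
bad_day = pd.Series([0] * 500 + [330] * 6 + [0] * 934)
good_day = pd.Series([0] * 700 + [80] * 60 + [0] * 680)
print(day_is_valid(bad_day), day_is_valid(good_day))  # False True
```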

Feature engineering

Once we have a daily time series that we trust, we can engineer several features, including:

  • Total steps
  • Maximum step rate per minute
  • Total “active” minutes (set at > 5 steps/minute)
  • Mean step rate (only considering “active” minutes)
  • Variance of step rate
  • Autocorrelations

Other features require additional data processing stages. An example of such processing is temporal smoothing (averaging values over rolling time windows) to create features like maximum average step rates over various time periods (10 minutes, 1 hour, etc.). While, as we'll see below, some of these features are correlated with one another to various degrees, the hope is that each captures a somewhat different angle on the data. Walking habits likely differ for many people between the work-week and the weekend, so it was desirable to create separate feature sets for each. Handily, Python's pandas data analysis package can extract day-of-week information automatically via its date-handling tools.
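The feature computations above can be sketched on a toy minute-level Series; the function and feature names are illustrative, not the project's actual code:

```python
import pandas as pd
import numpy as np

def engineer_features(day: pd.Series) -> dict:
    """Compute daily features from a per-minute step-count Series
    indexed by timestamp."""
    active = day[day > 5]  # "active" minutes: > 5 steps/minute
    return {
        "total_steps": day.sum(),
        "max_rate_1min": day.max(),
        "active_minutes": len(active),
        "mean_active_rate": active.mean() if len(active) else 0.0,
        "rate_variance": day.var(),
        # Rolling means give smoothed "best sustained pace" features.
        "max_rate_10min": day.rolling(10).mean().max(),
        "max_rate_60min": day.rolling(60).mean().max(),
        # pandas date handling distinguishes weekdays from weekends.
        "is_weekend": day.index[0].dayofweek >= 5,
    }

idx = pd.date_range("2015-07-25", periods=1440, freq="min")  # a Saturday
steps = pd.Series(np.zeros(1440), index=idx)
steps.iloc[600:660] = 100  # an hour-long walk at 100 steps/minute
feats = engineer_features(steps)
print(feats["total_steps"], feats["is_weekend"])  # 6000.0 True
```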

Taking a bird’s eye view on a certain feature (daily step totals) for the same participant as above, we can look at long-term time series:

as well as histograms:

This specific participant walks more than is typical of our total cohort, as shown by the full histogram over all participants:

These histograms also drive home the above point about data cleaning: the distributions are anchored on the low end by a spike of days representing 0 steps. These days need to be discarded, not modeled! Separately, the smoothness of the full-cohort distribution is interrupted by small peaks at around 10,000 and 20,000 steps. These are likely driven by Fitbit's goal-setting feature: a user can choose a target number of daily steps (the default is 10,000) and is given updates throughout the day about progress toward that goal. It looks like this method actually works to promote physical activity!

Using Python's seaborn visualizations package, we can also plot individual features against one another over all days for which we have data:

These features range from being highly correlated (maximum 10- vs. 60-minute pace) to showing essentially no correlation (average total minutes of activity vs. variance of pace). Certain pairs, while having moderate correlation, also contain interesting off-axis clusters, an example shown on the scatter plot being maximum daily 1-minute step rate vs. total daily steps:

The fuzzy border between likely walking and running paces sits somewhere around 150 steps/minute. So the group highlighted in green, while generally quite sedentary (as evidenced by few total steps per day), was at some point running for a short period of time. (My speculation: they are running after the mailman after missing a package delivery!)
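That off-axis cluster can be picked out programmatically. A sketch with made-up numbers; the 5,000-step cutoff for "sedentary" is my illustrative choice, not a threshold from the project:

```python
import pandas as pd

RUN_THRESHOLD = 150  # steps/minute: rough walking/running boundary

# Hypothetical per-day feature table.
days = pd.DataFrame({
    "total_steps":   [3000, 12000, 2500, 9000],
    "max_rate_1min": [170,  130,   60,   180],
})

# Sedentary days that nonetheless contain a short burst of running.
sedentary_sprinters = days[
    (days["total_steps"] < 5000) & (days["max_rate_1min"] > RUN_THRESHOLD)
]
print(len(sedentary_sprinters))  # 1 day matches
```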

[4] Predictive modeling

Now that I have a good bead on what the data look like, let's return to the primary directive:

Can we create a prototype diagnostic tool using Fitbit data, alone?

The specific case I will address is to diagnose the severity of daily symptom reports for shortness of breath purely from the step data generated by Fitbit trackers. Shortness of breath is one of the strongest indicators of cardiovascular problems (including CHF and COPD). Let’s take a look at how many data points we have for each “shortness” rating, where 1 = no symptoms and 5 = severe symptoms:

There are a couple of takeaways here:

  1. We have lots of data: approximately 6000 days worth. (An additional ~20,000 days of symptom reports were unlinked to Fitbit data.) These data come from 45 participants: 10 of whom have CHF and 5 with COPD (2 patients overlapped between those groups).
  2. Very few days were rated 4's or 5's, so the analyses going forward focus on dissociating those days containing no symptoms (1) from those with mild (2) and/or moderate (3) symptoms, based purely on the step-patterns from those days. These are arguably the most critical populations to dissociate, as:
    • patients who self-rate a 4 or 5 are likely feeling bad enough to take some sort of action on their own (call the doctor, go to the ER, etc.)
    • patients who self-rate a 2 or 3 may be (1) at real risk, but (2) inclined to ignore the warning signs (or simply never notice them to begin with when not explicitly asked to self-rate while enrolled in a study).

While there are lots of data points here, most come from the "no symptoms" category. The main approach I took to deal with these dramatically unbalanced data sets was to randomly down-sample the larger categories (1 and 2) to match the smallest (3: around 200 days). Using the features discussed in the previous section, I built a model to perform binary classifications (e.g. does walking data from day X belong to symptom report 1 or symptom report 2?). Random forest algorithms were used, with 8-fold cross-validation: training on 7/8ths of the data, testing on the remaining 1/8th, repeating the entire analysis 8x (leaving out a different testing sample each time), and then averaging the results. As the groups were balanced, classification success was assessed by a simple accuracy score, which ranged from 62% to 64% for the three pairwise analyses. Models built using only single features (i.e. can total steps in a day predict symptom score?) were substantially worse, typically only a couple of percentage points better than the 50% accuracy one would expect from a "coin flip" guess.
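The down-sampling and cross-validation loop can be sketched with scikit-learn on synthetic data; the feature matrix, class offset, and sample sizes here are stand-ins, not the study's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical feature matrix: rows are days, columns are the
# engineered walking features; labels are symptom ratings (1 vs 2).
X = rng.normal(size=(400, 6))
y = np.repeat([1, 2], [300, 100])
X[y == 2] += 0.5  # give class 2 a slight, learnable offset

# Down-sample the larger class to match the smaller one.
idx_small = np.flatnonzero(y == 2)
idx_large = rng.choice(np.flatnonzero(y == 1), size=len(idx_small),
                       replace=False)
keep = np.concatenate([idx_small, idx_large])
X_bal, y_bal = X[keep], y[keep]

# Random forest with 8-fold cross-validation: train on 7/8 of the
# days, test on the held-out 1/8, rotating through all eight folds.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X_bal, y_bal, cv=8, scoring="accuracy")
print(round(scores.mean(), 2))
```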

Finally, these accuracies had to be validated. Is a 64%-accurate model truly "better" than guessing in a statistical sense? To test this, I performed a Monte Carlo-style permutation test (named after the famous casino in Monaco). The point of this analysis was to generate the true "null" distribution by randomly scrambling the symptom-report labels for each day and then attempting to build a predictive model using the same random forest algorithm. As I was now trying to model noise, the classifier should have failed, which is exactly what was observed:

All of the observed accuracies from above (dashed red line = lower bound) fell well to the right of this distribution, and all were highly significant (p < 0.001).
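The permutation machinery can be sketched as follows. Here the features are pure noise, so the null distribution is all the model can recover; the sizes and permutation count are illustrative (the real test would use many more permutations), and the "observed" accuracy is just the value quoted above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical balanced data (~200 days split across two classes).
X = rng.normal(size=(200, 6))
y = np.repeat([1, 2], 100)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Build the null distribution: shuffle labels, refit, record accuracy.
null_accs = []
for _ in range(10):
    y_perm = rng.permutation(y)
    accs = cross_val_score(clf, X, y_perm, cv=8, scoring="accuracy")
    null_accs.append(accs.mean())

# With labels scrambled, the classifier hovers around chance (0.5).
# The empirical p-value is the fraction of null accuracies that
# meet or exceed the observed accuracy.
observed = 0.64  # illustrative value from the analyses above
p_value = np.mean([a >= observed for a in null_accs])
print(round(np.mean(null_accs), 2), p_value)
```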

Separately, I conducted an analysis to dissociate the "1" ratings (no symptoms) from all of the other groups lumped together. This yielded accuracy scores in excess of 68%, even though the above analyses had confirmed clear differences in walking patterns between levels of symptom severity (i.e. groups 2 and 3 could themselves be dissociated). Thus, I can conclude that walking patterns contain information relevant to both (1) the existence of symptoms and (2) the severity of symptoms.

[5] Conclusions

The major conclusion from the consulting project is that Fitbit data can be used to predict clinical measures. While this is a prototype model built for a specific use-case (prediction of daily symptom self-reports), the pipeline and algorithm should be generalizable to other cases/clinical diagnoses.

While 62-64% decoding accuracies are not earth-shattering, the model should improve with:

  • more data (of course!)
  • additional feature engineering
  • incorporation of demographic info, medical history
  • inclusion of other tracked data (e.g. from smartphone apps)

With such improvements, Fitbit could quickly become a viable clinical tool for cardiologists: one that is inexpensive, non-invasive, and easy to automate.
