Apple Watch’s data ‘black box’ poses research problems

A Harvard biostatistician is rethinking plans to use Apple Watches as part of a research study after finding inconsistencies in the heart rate variability data collected by the devices. Because Apple tweaks the watch’s algorithms as needed, data from the same time period may change without warning.

“These algorithms are what we would call black boxes — they’re not transparent, so it’s impossible to know what’s in them,” said JP Onnela, an associate professor of biostatistics at the Harvard T.H. Chan School of Public Health and developer of the open-source data platform Beiwe.

Onnela does not usually include commercial wearable devices such as the Apple Watch in research studies. For the most part, his teams use research-grade devices designed to collect data for scientific studies. However, as part of a collaboration with the neurosurgery division at Brigham and Women’s Hospital, he was interested in the commercially available products. He knew there were sometimes data issues with those products, and his team wanted to check how serious they were before getting started.

So they checked the heart rate data that his collaborator Hassan Dawood, a researcher at Brigham and Women’s Hospital, had exported from his Apple Watch. Dawood exported his daily heart rate variability data twice: once on September 5th, 2020, and again on April 15th, 2021. For the experiment, they looked at data collected over the same period, from early December 2018 to September 2020.

Because the two exported datasets cover the same time period, the data in both should theoretically be identical. Onnela says he expected some differences; the “black box” nature of wearable algorithms is a constant challenge for researchers. Instead of exposing the raw data collected by a device’s sensors, the products usually let researchers export information only after it has been analyzed and filtered by some algorithm.

Companies change their algorithms regularly and without warning, so the September 2020 export may contain data analyzed with a different algorithm than the April 2021 export. “What was surprising was how different they were,” he says. “This is probably the cleanest example I’ve seen of this phenomenon.” He published the data in a blog post last week.

Comparing the heart rate variability data collected at the two different time points reveals major differences.
Image: Beiwe
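Researchers who want to vet their own devices can run the same check. The sketch below, in Python with pandas, assumes each export has already been converted into a CSV of daily HRV values; the filenames and column names are hypothetical, and a real Apple Health export arrives as an XML file that would need a conversion step first.

```python
import pandas as pd

# Hypothetical filenames and columns ("date", "hrv"); a real Apple Health
# export arrives as XML (export.xml) and must be converted to CSV first.
old = pd.read_csv("hrv_export_2020-09-05.csv", parse_dates=["date"])
new = pd.read_csv("hrv_export_2021-04-15.csv", parse_dates=["date"])

# Align the two exports on date, keeping only days present in both files.
merged = old.merge(new, on="date", suffixes=("_old", "_new"))

# If the device reported consistently, each day would carry the same value
# in both exports.
merged["diff_ms"] = merged["hrv_new"] - merged["hrv_old"]
changed = merged[merged["diff_ms"] != 0]

print(f"{len(changed)} of {len(merged)} overlapping days differ")
print(changed[["date", "hrv_old", "hrv_new", "diff_ms"]].head())
```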

Apple did not respond to a request for comment.

It was striking to see the differences so clearly, says Olivia Walch, a sleep researcher who works with wearable and app data at the University of Michigan. Walch has long advocated for researchers to use raw data, taken directly from a device’s sensors rather than filtered through its software. “It’s validating, because I get on my little soapbox about the raw data, and it’s nice to have a concrete example where it would really matter,” she says.

Constantly changing algorithms make commercial wearables almost prohibitively expensive to use for sleep research, which is already expensive to begin with, Walch says. “Could you strap four Fitbits on someone, each with a different version of the software, and then compare them? Probably not.”

Companies have incentives to change their algorithms to make their products better. “They’re not super incentivized to tell us how they change things,” she says.

That is a problem for research. Onnela compared it to tracking body weight. “If I wanted to jump on a scale every week, I would have to use the same scale every time,” he says. If that scale were adjusted without his knowledge, the week-to-week changes would not be reliable. For someone who is only casually interested in tracking their health, that might be fine, since the differences won’t be large. But in research, consistency is critical. “That’s the concern,” he says.

For example, someone could conduct a study with a wearable and come to a conclusion about how people’s sleep patterns change in response to their environment. But that conclusion might only hold true for that particular version of the wearable’s software. “Maybe you’d have a very different result if you just used a different model,” Walch says.

Dawood’s Apple Watch data wasn’t collected as part of a study; it’s just an informal example. But it shows how careful researchers have to be with commercial devices that don’t allow access to raw data, Onnela says. The discrepancies were enough to deter his team from its plans to use the devices in studies. He believes commercial wearables should only be used in research if raw data is available or, at the very least, if researchers can get a heads-up when an algorithm is going to change.

There may be situations where wearable data can still be useful. The heart rate variability information showed similar trends at both time points: the data rose and fell at the same times. “If you’re concerned about things on that macro scale, then you can make the call that you would continue to use the device,” Walch says. But if the specific heart rate variability value calculated each day matters to a study, the Apple Watch may be riskier to rely on, she says. “It should make people pause on using certain wearables, if the rug is at risk of being pulled out from under their feet.”
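Walch’s distinction between macro-scale trends and day-level values can be quantified the same way. In this minimal sketch, the numbers are invented for illustration: the two exports correlate perfectly as trends while disagreeing on every single day.

```python
import pandas as pd

# Invented example values standing in for a merged pair of exports.
merged = pd.DataFrame({
    "hrv_old": [45.0, 52.0, 38.0, 61.0, 47.0],
    "hrv_new": [49.5, 57.2, 41.8, 67.1, 51.7],
})

# Macro scale: do the two series rise and fall together?
trend_agreement = merged["hrv_old"].corr(merged["hrv_new"])

# Day level: how far apart are the reported values?
mean_abs_diff = (merged["hrv_new"] - merged["hrv_old"]).abs().mean()

print(f"trend correlation: {trend_agreement:.2f}")         # near 1.0 here
print(f"mean absolute difference: {mean_abs_diff:.1f} ms")  # clearly nonzero
```

A study that only needs the macro trend might tolerate that; one that depends on the daily values would not.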