Why Apple Gets It Right Where Hilo and Whoop Go Wrong

18 Sept

The FDA clearance of Apple’s new hypertension feature has got the entire medtech world buzzing. Over the past year, I’ve been highly critical of the marketing and validation strategies of Whoop and Hilo. So am I about to do the same to Apple?

The answer is no. Apple has done this the right way, and there are two distinct reasons why:

Apple has made marketing claims that match what the product can actually do, without overstepping its capabilities.
Apple has designed their algorithm to meet the problem they are trying to help solve within the realistic confines of the product.

There are also some peculiarities with their 510(k) clearance that I will cover, as well as why Apple's product is a welcome addition to the healthcare landscape.

One thing to keep in mind: if you are viewing this as a diagnostic tool like a BP cuff, you may be thinking about it the wrong way. I see it more as a mass screening tool akin to a lateral flow test, and if you think about it this way some of the things in the 510(k) make a bit more sense.

Why Apple's claims are appropriate

With Hilo's clearance, it might seem strange that a company with Apple's resources and capabilities would make a hypertension algorithm, not a blood pressure algorithm. In my opinion that's because Apple knows they can’t. I went in depth into the issues with Hilo’s validation procedure and why it's inappropriate to use (they have found a loophole so they don’t actually validate its accuracy), and it would be easy for Apple to follow suit (I hope they don’t).

But estimating whether someone has hypertension is a different ball game. It's a much coarser-grained task (easier to develop), and can leverage demographic/anthropometric information without over-biasing the algorithm (although over-biasing is still possible: if you're 55, 5 ft 2 and weigh 150 kg, chances are you’re hypertensive). Given the limitations of the form factor, wearability and sensors, this is probably the best use of the product: as a mass screening tool (see the next section), not a diagnostic tool.

The important thing is that Apple are not claiming they can definitively diagnose hypertension. They are also not claiming you can use it to manage your hypertension. Compare this to Whoop, who outright display BP(I) in mmHg without any validation, or to Hilo, which claims it can do so without the right validation.

Why Apple's Algorithm is Better Engineered

When developing algorithms there is always a trade-off between two things: sensitivity and specificity.

Sensitivity is a diagnostic test's ability to correctly identify those with the disease (True Positives / (True Positives + False Negatives)).
Specificity is its ability to correctly identify those without the disease (True Negatives / (True Negatives + False Positives)).
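
The two definitions above can be sketched in a few lines of Python. The counts here are invented purely for illustration and are not taken from any validation study:

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of people with the disease that the test correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of people without the disease that the test correctly clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical confusion-matrix counts, invented for illustration only.
tp, fn = 40, 60    # 100 people who have hypertension
tn, fp = 920, 80   # 1,000 people who do not

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With these made-up counts the test misses most true cases (low sensitivity) but rarely flags a healthy person (high specificity), which is the kind of profile discussed in this section.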

Most companies in this space default to chasing sensitivity, the idea being that missing a disease is worse than raising false alerts. But that mindset doesn’t always make sense, especially when you’re deploying at population scale. Apple has chosen to optimize for specificity, which I think is the right call given the deployment environment and the likely limitations with the data.

Think about what happens if you push sensitivity too far at the cost of more false alarms: suddenly thousands of people get flagged for “possible hypertension” who don’t have it. That creates a cascade of non-trivial consequences:

Extra cost of seeing a GP or buying your own cuff.
Loss of trust in the product if you go to the doctor and they tell you that you don’t have hypertension.
Collective primary care time taken up by people checking whether they have hypertension, potentially blocking people with more serious conditions.
Lack of trust by medical professionals in the quality of all the data from your product.
Potential self-diagnosis or self-medication that is bad for the patient.

Given the reach of things like the Apple Watch, Garmin, etc., even small increases in the number of false positives could cause rapid shifts in inappropriate demand for resources and in product perception. And these increases in false positives are rarely linear. Healthcare data is both sparse and heterogeneous, which means that it's much harder to correctly identify diseases than it is to rule them out. It also means that small boosts in sensitivity often come at the cost of dramatic jumps in false positives.
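
To get a feel for the scale argument, here is a rough sketch comparing two operating points across a large user base. Every number in it (the population size, the prevalence, and both sensitivity/specificity pairs) is a hypothetical assumption chosen for illustration, not a figure from Apple or any other vendor:

```python
def screening_outcomes(population, prevalence, sens, spec):
    """Expected true/false positive counts and PPV for a screening test."""
    with_disease = population * prevalence
    without_disease = population - with_disease
    true_pos = with_disease * sens
    false_pos = without_disease * (1 - spec)
    ppv = true_pos / (true_pos + false_pos)  # chance a flagged user is truly positive
    return {
        "true_pos": round(true_pos),
        "false_pos": round(false_pos),
        "ppv": ppv,
    }


# Hypothetical deployment: 1,000,000 users, 30% of whom have hypertension.
specificity_first = screening_outcomes(1_000_000, 0.30, sens=0.40, spec=0.92)
sensitivity_first = screening_outcomes(1_000_000, 0.30, sens=0.60, spec=0.80)

for name, result in [("specificity-first", specificity_first),
                     ("sensitivity-first", sensitivity_first)]:
    print(f"{name}: {result['false_pos']:,} false alerts, PPV = {result['ppv']:.2f}")
```

Under these assumptions the sensitivity-first setting catches more true cases, but it sends well over twice as many healthy users to their GP, and a smaller share of all alerts turn out to be real.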

That’s why Apple’s specificity-first approach makes sense. They’ve engineered their algorithm not as a diagnostic, but as a screening tool, something that nudges people toward further evaluation, without overwhelming the system. And it’s exactly this kind of restraint that sets them apart from Whoop’s unvalidated and overreaching blood pressure claims.

What's peculiar about the 510(k)

Finding the right predicate is a balancing act with 510(k) submissions. You have to find a device that's not too far away from yours, with similar functionality and intended use. And that's why Apple's 510(k) is interesting. Their predicate is a 12-lead ECG analysis tool that detects hypertrophic cardiomyopathy, requires a prescription, and is used in a clinic. The Apple Watch is, well, a watch that detects hypertension, used OTC by a patient…

Going through the device comparison table, the only thing the two devices have in common is the regulation name “Cardiovascular Machine Learning Based Notification Software”.

I don’t know if this is a reaction to the perceived “wrongdoing” of the FDA with Whoop (they did nothing wrong and I will die on that hill), with the FDA wanting to show they will work well with consumer companies. But this is a massive stretch in the use of a predicate device. The conclusion itself is very confusing…

“HTNF is substantially equivalent to the Viz HCM as they are identical with respect to intended use and there are no differences in technological or performance characteristics that raise new questions of safety and effectiveness.”

Now there is some subtlety here, but on a technological and “intended use” level they are very different. The phrase “no differences in technological or performance characteristics that raise new questions of safety and effectiveness” is doing a lot of work. I think what they are arguing is that they are identical in risk profile? Still, it's a bit hard to square this circle, and it could be setting a worrying precedent or double standard for future FDA submissions.

Criticisms of the hypertension feature and traditional cuff-based BP

The main criticism of the Apple feature is that it can give patients a false sense of security. Based on the likelihood ratios, if the feature says you're not hypertensive (normotensive), you are only 1.6 times more likely to actually be normotensive. This means that you cannot rely on the feature to rule out hypertension. On the flip side, you are 5 times more likely to have hypertension if it says you do.
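
The two multipliers above behave like likelihood ratios, and a short sketch shows how such ratios follow from sensitivity and specificity. The sensitivity, specificity and pre-test probability below are illustrative assumptions chosen to land near the figures quoted here; they are not values taken from Apple's submission. Strictly speaking, a likelihood ratio multiplies the odds rather than the probability:

```python
def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) for a binary test."""
    lr_pos = sens / (1 - spec)   # how much a positive result raises the odds
    lr_neg = (1 - sens) / spec   # how much a negative result lowers the odds
    return lr_pos, lr_neg


def post_test_probability(pre_test_prob, lr):
    """Update a probability with a likelihood ratio by going through odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


# Assumed operating point and pre-test probability, for illustration only.
lr_pos, lr_neg = likelihood_ratios(sens=0.41, spec=0.92)
after_alert = post_test_probability(0.30, lr_pos)
after_no_alert = post_test_probability(0.30, lr_neg)

print(f"LR+ = {lr_pos:.2f}, 1/LR- = {1 / lr_neg:.2f}")
print(f"P(hypertension | alert)    = {after_alert:.2f}")
print(f"P(hypertension | no alert) = {after_no_alert:.2f}")
```

Under these assumptions an alert moves a 30% prior to roughly two in three, while the absence of an alert still leaves about a one in five chance of hypertension, which is why a quiet watch cannot be read as an all-clear.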

This is important and non-trivial, depending on how you market the product. I think Apple are right at the border of marketing the product correctly, but the AI-generated influencer hype train takes it a step further into “Apple now diagnoses you”, which is unhelpful and misleading.

The nuance in understanding this may be beyond consumers, but there is another side to this story. Traditional BP cuffs have very non-trivial friction points in being used as a mass screening tool:

Buying a good BP cuff is not straightforward: most of the devices you can buy on Amazon have not been validated (that's a separate issue), and understanding the nuances of BP validation is not straightforward either.

Taking a good blood pressure measurement is also not trivial: sit still for 10 mins, don’t talk, dark room, etc. How many people would give themselves false readings by not following the protocol?
It's not comfortable. This will have effects on people's willingness to do it or buy it.
It's not passive. How regularly will people check their blood pressure? I imagine initially when they get a BP device they will be religious in checking, but that will fall off over time.

On the whole I think the Apple feature is a welcome one and does something different to the traditional cuff. It addresses problems that traditional cuff-based monitoring has, but doesn’t have the cuff's diagnostic and management capabilities. However, it should come with the warning that users should not change how regularly they get their BP checked at the PCP.

Conclusion

As a package I think this is probably acceptable. Apple tread a fine line in their claims and products related to health, and this is no different. But they never go too far like Whoop. If you think of this as a screening tool in an arsenal of tools that can be used by the healthcare system I think it's a welcome one.

Samamor .

Learn More

Jinora Consulting

Contact
