Three Challenges for Artificial Intelligence in Medicine
Why is the world’s most advanced AI used for cat videos, but not to help us live longer and healthier lives? A brief history of AI in Medicine, and the factors that may help it succeed where it has failed before.
Imagine yourself as a young graduate student in Stanford’s Artificial Intelligence lab, building a system to diagnose a common infectious disease. After years of sweat and toil, the day comes for the test: a head-to-head comparison with five of the top human experts in infectious disease. Over the first expert, your system squeezes a narrow victory, winning by just 4%. It beats the second, third, and fourth doctors handily. Against the fifth, it wins by an astounding 52%.
Would you believe such a system exists already? Would you believe it existed in 1979? This was the MYCIN project, and in spite of the excellent research results, it never made its way into clinical practice. 
In fact, although we’re surrounded by fantastic applications of modern AI, particularly deep learning (self-driving cars, Siri, AlphaGo, Google Translate, computer vision), the effect on medicine has been nearly nonexistent. In the top cardiology journal, Circulation, the term “deep learning” appears only twice. Deep learning has never been mentioned in the New England Journal of Medicine, The Lancet, BMJ, or even JAMA, where the work on MYCIN was published 37 years ago. What happened?
There are three central challenges that have plagued past efforts to use artificial intelligence in medicine: the label problem, the deployment problem, and fear around regulation. Before we get into those, let’s take a quick look at the state of medicine today.
Medicine is life and death. With such high stakes, one could ask: should we really be rocking the boat here? Why not just stick with existing, proven, clinical-grade algorithms?
Well, consider a few examples of the status quo:
- The score that doctors use to prescribe your grandparents blood thinners, CHA2DS2-VASc, is accurate only 67% of the time. It was derived from a cohort that included only 25 people with strokes; out of 8 tested predictor variables, only one was statistically significant. 
- Google updates its search algorithm 550 times per year, but life-saving devices like ICDs are still programmed using simple thresholds — if heart rate exceeds X, SHOCK — and their accuracy is getting worse over time.
- Although cardiologists have invented a life-saving treatment for sudden cardiac death, our current algorithm to identify who needs that treatment will miss 250,000 out of the 300,000 people who will die suddenly this year.
- Pressed for time, doctors can’t make sense of all the raw data being generated: “Most primary care doctors I know, if they get one more piece of data, they’re going to quit.”
To be clear, none of this means medical researchers are doing a bad job. Modern medicine is a miracle; it’s just unevenly distributed. Computer scientists can bring much to the table here. With tools like Apple’s ResearchKit and Google Fit, we can collect health data at scale; with deep learning, we can translate large volumes of raw data into insights that help both clinicians and patients take real actions.
To do that, we must solve three hard problems: one technical, one a matter of political economy, and one regulatory. The good news is that each category has new developments that may let AI succeed where it has failed before.
Modern artificial intelligence is data-hungry. To make speech recognition on your Android phone accurate, Google trains a deep neural network on roughly 10,000 hours of annotated speech. In computer vision, ImageNet contains more than 1,034,908 hand-annotated images. These annotations, called labels, are essential to make techniques like deep learning work.
In medicine, each label represents a human life at risk.
For example, in our study with UCSF Cardiology, labeled examples come from people visiting the hospital for a procedure called cardioversion, a 400-joule electric shock to the chest that resets your heart rhythm. It can be a scary experience to go through. Many of these patients are gracious enough to wear a heart rate sensor (e.g., an Apple Watch) during the whole procedure in the hope of making life better for the next generation, but we know we’ll never get one million participants, and it would be unconscionable to ask.
Can AI work well in situations where we may have 1000x fewer labels than we’re used to? There’s already some promising work in this direction. First, it was unsupervised learning that sparked interest in deep learning in 2006–2009, namely pre-training and autoencoders, which can find structure in data that’s completely unlabeled. More recently, hybrid techniques such as semi-supervised sequence learning have established that you can make accurate predictions with less labeled data if you have a lot of unlabeled data. The most extreme setting is one-shot learning, where the algorithm learns to recognize a new pattern after being given only one label. Fei-Fei Li’s work on one-shot learning for computer vision, for example, “factored” statistical models into separate probability distributions on appearance and position; could similar “factorings” of medical data let us save lives sooner? Recent work on Bayesian program learning (Toronto, MIT, NYU) and one-shot generalization in deep generative models (DeepMind) is promising in this regard.
Underlying many of these techniques is the idea that large amounts of unlabeled data may substitute for labeled data. Is that a real advance? Yes. With the proliferation of sensors, unlabeled data is now cheap. Some medical studies are making use of that already. For Cardiogram, that involved building an Apple Watch app to collect about 8.5 billion data points from HealthKit, and then partnering with UCSF’s Health eHeart project to collect medical-grade labels. Researchers at USC have trained a neural embedding on EMR data. The American Sleep Apnea Association partnered with IBM Watson to develop a ResearchKit app where anybody can contribute their sleep data.
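To make the "unlabeled data can substitute for labels" idea concrete, here is a minimal sketch of self-training (pseudo-labeling) with a nearest-centroid classifier. This is a deliberately simple stand-in, not the method used in the papers mentioned above or in Cardiogram's own work, and the data points, class names, and function names are all made up for illustration.

```python
def centroids(points, labels):
    """Mean of the points in each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(cents, x):
    """Label of the nearest class centroid."""
    return min(cents, key=lambda y: abs(x - cents[y]))


def self_train(labeled_x, labeled_y, unlabeled_x, rounds=5):
    """Repeatedly pseudo-label the unlabeled points and refit the centroids."""
    cents = centroids(labeled_x, labeled_y)
    for _ in range(rounds):
        pseudo_y = [predict(cents, x) for x in unlabeled_x]
        cents = centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
    return cents


# Made-up 1-D data: class "a" lies near 0-4, class "b" near 6-10.
labeled_x, labeled_y = [0.0, 6.0], ["a", "b"]  # only two labels in total
unlabeled_x = [0.5, 1.5, 2.0, 2.5, 7.0, 8.0, 9.0, 10.0]

supervised = centroids(labeled_x, labeled_y)
semi_supervised = self_train(labeled_x, labeled_y, unlabeled_x)

print(predict(supervised, 3.8))       # labels alone put the boundary at 3.0
print(predict(semi_supervised, 3.8))  # pseudo-labels move it to about 4.65
```

With only two labels, the decision boundary sits wherever those two points happen to fall. The unlabeled points pull each centroid toward the middle of its cluster, so a borderline point is classified differently. On real data self-training can also reinforce early mistakes, so treat this as an illustration of the principle and nothing more.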
Let’s say you’ve built a breakthrough algorithm. What happens next? The experience of MYCIN showed that research results aren’t enough; you need a path to deployment.
Historically, deployment has been difficult in fields like healthcare, education, and government. Electronic Medical Records have no real equivalent to an “App Store” that would let an individual doctor install a new algorithm. EMR software is generally installed on-premise, sold in multi-year cycles to a centralized procurement team within each hospital system, and new features are driven by federally-mandated checklists. To get a new innovation through a heavyweight procurement process, you need a clear ROI. Unfortunately, hospitals tend to prioritize what they can bill for, which brings us to our ossified and dirigiste payment system, fee-for-service. Under fee-for-service, the hospital bills for each individual activity; in the case of a misdiagnosis, for example, they may perform and bill for follow-up tests. Perversely, that means a better algorithm may actually reduce the revenue of the hospitals expected to adopt it. How on earth would that ever fly?
Even worse, since a fee is specified for each individual service, innovations are non-billable by default. While in the abstract, a life-saving algorithm such as MYCIN should be adopted broadly, when you map out the concrete financial incentives, there’s no realistic path to deployment.
Fortunately, there is a change we can believe in. First, we don’t “ship” software anymore, we deploy it instantly. Second, the Affordable Care Act creates the ability for startups to own risk end-to-end: full-stack startups for healthcare.
First, shipping. Imagine yourself as an AI researcher building Microsoft Word’s spell checker in the 90s. Your algorithm would be limited to whatever data you could collect in the lab; if you discovered a breakthrough, it would take years to ship. Now imagine yourself ten years later, working on spelling correction at Google. You can build your algorithm based on billions of search queries, clicks, page views, web pages, and links. Once it’s ready, you can deploy it instantly. That leads to a 100x faster feedback loop. More than any one algorithmic advance, systems like Google search perform so well because of this fast feedback loop.
The same thing is quietly becoming possible in medicine. We all have a supercomputer in our pocket. If you can find a way to package up artificial intelligence within an iOS or Android app, the deployment process shifts from an enterprise sales cycle to an app store update.
Second, the unbundling of risk. The Affordable Care Act has created alternatives to fee-for-service. For example, in bundled payments, Medicare pays a fixed fee for a surgery, and if the patient needs to be re-hospitalized within 90 days, the original provider is on the hook financially for the cost. That flips the incentives: if you invent a new AI algorithm 10% better at predicting risk (or better, preventing it), that now drives the bottom line for a hospital. There are many variants of fee-for-value being tested now: Accountable Care Organizations, risk-based contracting, full capitation, MACRA and MIPS, and more.
These two things enable outside-in approaches to healthcare: build up a user base outside the core of the healthcare system (e.g., outside the EMR), but take on risk for core problems within the healthcare system, such as re-hospitalizations. Together, these two factors let startups solve problems end-to-end, much the same way Uber solved transportation end-to-end rather than trying to sell software to taxi companies.
Many entrepreneurs and researchers fear healthcare because it’s highly regulated. The perception is that many regulatory regimes are just an expensive way to say “no” to new ideas.
And that perception is sometimes true. Certificates of Need, risk-based capital requirements, over-burdensome reporting, and fee-for-service sometimes create major barriers to new entrants and innovations, largely to our collective detriment.
But regulations can also be your ally. Take HIPAA. If the authors of MYCIN wanted to make it possible to run their algorithm on your medical record in 1978, there was really no way to do that. The medical record was owned by the hospital, not the patient. HIPAA, passed in 1996, flipped the ownership model: if the patient gives consent, the hospital is required to send the record to the patient or a designee. Today those records are sometimes faxes of paper copies, but efforts like Blue Button, FHIR, and meaningful use are moving them toward machine-readable formats. As my friend Ryan Panchadsaram says, HIPAA often says you can.
If you’re a skilled AI practitioner currently sitting on the sidelines, now is your time to act. The problems that have kept AI out of healthcare for the last 40 years are now solvable. And your impact is large.
Modern research has become so specialized that our notion of impact is sometimes siloed. A world-class clinician may be rewarded for inventing a new surgery; an AI researcher may get credit for beating the world record on MNIST. When two fields cross, there can sometimes be fear, misunderstanding, or culture clashes.
We’re not unique in history. By 1944, the foundations of quantum physics had been laid, and they would soon lead, dramatically, to the detonation of the first atomic bomb. After the war, a generation of physicists turned their attention to biology. In the 1944 book What is Life?, Erwin Schrödinger referred to a sense of noblesse oblige that prevented researchers in disparate fields from collaborating deeply, and “beg[ged] to renounce the noblesse.”
Over the next 20 years, the field of molecular biology unfolded. Schrödinger himself used quantum mechanics to predict that our genetic material had the structure of an “aperiodic crystal.” Meanwhile, Luria and Delbrück (an M.D. and a physics PhD, respectively) discovered the genetic mechanism by which viruses replicate. The next decade, Watson (a biologist) and Crick (a physicist) applied x-rays from Rosalind Franklin (a chemist) to discover the double-helix structure of DNA. Both Luria & Delbrück and Watson & Crick would go on to win Nobel Prizes for those interdisciplinary collaborations. (Franklin herself had passed away by the time the latter prize was awarded.)
If AI in medicine were a hundredth as successful as physics was in biology, the impact would be astronomical. To return to the example of CHA2DS2-VASc, there are about 21 million people on blood thinners worldwide; if a third of those don’t need them, then we’re causing 266,000 extra brain hemorrhages. And that’s just one score for one disease. But we can only solve these problems if we beg to renounce the noblesse, as generations of scientists have done before us. Lives are at stake.
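The blood-thinner estimate above is simple enough to check by hand. The sketch below reproduces it from the figures quoted in this article (21 million patients, the hypothetical "a third", and the warfarin hemorrhage rate of 38 per 1,000 patients given in the notes). The function and variable names are illustrative, and like the original estimate it ignores treatment duration and differences between patients.

```python
# Inputs quoted in the article; the names are illustrative.
PEOPLE_ON_BLOOD_THINNERS = 21_000_000  # "about 21 million people ... worldwide"
FRACTION_NOT_NEEDING_THEM = 1 / 3      # the article's hypothetical "a third"
HEMORRHAGES_PER_1000 = 38              # warfarin figure from the article's notes


def extra_hemorrhages(people, fraction_unneeded, rate_per_1000):
    """Expected hemorrhages among people who are treated unnecessarily."""
    unnecessarily_treated = people * fraction_unneeded
    return unnecessarily_treated * rate_per_1000 / 1000


estimate = extra_hemorrhages(
    PEOPLE_ON_BLOOD_THINNERS, FRACTION_NOT_NEEDING_THEM, HEMORRHAGES_PER_1000
)
print(f"{estimate:,.0f} extra brain hemorrhages")
```

Seven million unnecessarily treated patients at 38 hemorrhages per 1,000 gives the 266,000 figure in the text.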
 An interesting snapshot of history can be found in the 1982 book Artificial Intelligence in Medicine, long out of print. The author’s own exasperated reflection is that many of the ideas from the 80’s are still vibrant, but “medical record systems have moved toward routine adoption so slowly that the authors would have been shocked in 1982 to discover that many of the ideas we described are still immensely difficult to apply in practice because the data they rely on are not normally available in machine-readable form.” A more recent survey is Thirty years of artificial intelligence in medicine (AIME) conferences: A review of research themes (2015).
 Machine Learning in Medicine (Circulation, 2015) is a good summary, particularly the last section on the relationship between precision medicine and layers of representation in deep learning:
“In a deep learning representation of human disease, lower layers could represent clinical measurements (such as ECG data or protein biomarkers), intermediate layers could represent aberrant pathways (which may simultaneously impact many biomarkers), and top layers could represent disease subclasses (which arise from the variable contributions of ≥1 aberrant pathways). Ideally, such subclasses would do more than stratify by risk and would actually reflect the dominant disease mechanism(s). This raises a question about the underlying pathophysiologic basis of complex disease in any given individual: is it sparsely encoded in a limited set of aberrant pathways, which could be recovered by an unsupervised learning process (albeit with the right features collected and a large enough sample size), or is it a diffuse, multifactorial process with hundreds of small determinants combining in a highly variable way in different individuals? In the latter case, the concept of precision medicine is unlikely to be of much utility. However, in the former situation, unsupervised and perhaps deep learning might actually realize the elusive goal of reclassifying patients according to more homogenous subgroups, with shared pathophysiology, and the potential of shared response to therapy.”
 The c-statistic is also equivalent to the area under the ROC curve. Original source for CHA2DS2-VASc: Lip 2010.
 From Risk stratification for sudden cardiac death: current status and challenges for the future and Risk Stratification for Sudden Cardiac Death: A Plan for the Future.
 That’s not to imply the app is just a front for research: we care deeply that Cardiogram stands alone as an engaging, well-designed, and useful app. But deciding to build an app in the first place was driven by a broader purpose. As one doctor put it, “If you want to do world-class sociology research, build Facebook.”
 Warfarin causes brain hemorrhage in 38 out of every 1000 patients.
 A potential future exception: Illumina’s Helix. See Chrissy Farr’s piece in Fast Company on Illumina’s ambition to create an app store for your genome.
Have an Apple Watch? You can help save a life!
[Originally published March 16, 2016]
Today we’re launching the mRhythmStudy, which is a joint study within the Health eHeart effort at UCSF. The goal of this study is to understand how well heart rate data from the Apple Watch compares with data from medical grade devices, and to invite Apple Watch users to donate their heart rate data to help us build an algorithm for atrial fibrillation detection.
The cool thing is that you can help even if you’re completely healthy! Any data you contribute, whether you have atrial fibrillation or not, can help us improve the accuracy of our algorithm.
Atrial fibrillation (AF) is the most common arrhythmia and affects an estimated 2.7 million people in the US. AF increases the risk of stroke by five times over the general population, and stroke, in turn, is a leading cause of death in the US and worldwide. AF is often asymptomatic, which means it is often undiagnosed until it is too late.
If you have an Apple Watch, we invite you to join the mRhythmStudy and start donating your heart rate data today. You can help save a life!
We will be sharing details about how we’re applying deep learning for atrial fibrillation detection at Strata 2016 later this month. Come hear us speak if you’re interested!
 American Stroke Association — Atrial Fibrillation
Cardiogram 1.0: understand your heart rate on Apple Watch
(June 9, 2016)
Apple Watch will record more than 2 trillion heart rate measurements this year. What does all that data mean? Can your watch tell when you’re stressed? Sad? Excited? Can we figure out what types of exercise actually make you healthier? Could it one day detect abnormal heart rhythms?
We built Cardiogram for Apple Watch to help people answer those questions. We’re 0.01% of the way there, but today we’re launching a major product revision, Cardiogram 1.0, to help you understand your heart rate.
Complications show you information directly on your watch face. By default, the Cardiogram complication shows you your most recent heart rate; if you twist the digital crown, you can “time travel” to see how it’s varied throughout the day.
If you tap on the red heart rate icon, it will open the Cardiogram native Apple Watch app, which shows you a graph of your heart rate throughout the day so far.
Pinch-to-zoom is by far our most-requested feature. Now, when you pinch on the graph with two fingers, you can zoom in to see the details of your heart rate minute-to-minute.
In addition, you can tag your peaks using 3D touch. For example, I put my watch in workout mode while watching Game of Thrones and see my heart rate spike whenever the white walkers come on screen. Those guys are scary!
If you don’t have 3D touch, just use the “Add Tag” button below each Cardiogram to add a new tag.
Likewise, you can share a particularly exciting memory with the “share” button, favorite it with the heart, or search for it later with the new “Search Cardiograms” box at the top.
Curious to find out what your heart is telling you? Try Cardiogram for Apple Watch on the app store.
 Estimate based on 4000 average measurements per week (Cardiogram data) x external estimates of at least 11 million Apple Watches sold.
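As a rough check of that estimate, the multiplication can be written out as below. The two inputs come from the note above; the 52-week year and the assumption that every watch sold records at the average rate all year are simplifications added here, so this is a back-of-the-envelope sketch and the names are illustrative.

```python
MEASUREMENTS_PER_WEEK = 4000  # average per watch, from Cardiogram data
WATCHES_SOLD = 11_000_000     # external estimate: at least 11 million
WEEKS_PER_YEAR = 52           # assumption: every watch is worn all year

total = MEASUREMENTS_PER_WEEK * WEEKS_PER_YEAR * WATCHES_SOLD
print(f"{total:,} measurements per year")
```

That comes to roughly 2.3 trillion measurements, consistent with "more than 2 trillion".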
What’s new in Cardiogram 0.9.7
[Originally posted March 4, 2016]
We just released a new update to Cardiogram for Apple Watch today. Here’s what’s new.
Tag your Peaks
You may have seen stories of people recording their heart rate after a breakup, during sex, and even scene-by-scene in Game of Thrones. We think your day is full of those moments, and we want to make it easier to notice what your heart is telling you. That’s why we built a beta feature to let you tag your peak heart rate each day.
That could range from something that lifts your spirit to the… decidedly physiological. For example, I hit 92bpm eating a burrito yesterday (left).
Resting Heart Rate
Your resting heart rate is a key metric of overall cardiovascular health. We’ve redesigned the resting heart rate cardiogram to more clearly show you the three most important things to know:
- What’s your average resting heart rate? Is it within the normal range (60–100)?
- How does it compare to others? We let you compare with the average runner, biker, or couch potato.
- How is it changing over time? By default we’ll show you your resting heart rate over the last month.
Can Apple Watch save lives?
We’ve been stunned at the quality of the heart rate sensor in Apple Watch. We think it can help save lives, and we’ll be asking for your help in a few weeks. Please stay tuned!
What do you want to see in the next version of Cardiogram?
Tell us! email@example.com.
Is Pokémon GO making us exercise more? What Apple Watch heart rate data says.
If you haven’t heard of Pokémon GO by now, read Recode’s summary. It’s already bigger than Tinder, poised to become bigger than Twitter, and implicated in everything from improving mental health to enabling robbery to bringing new foot traffic to restaurants.
It’s also being suggested that Pokémon GO is getting everybody to exercise.
The American Heart Association recommends you get about 30 minutes of exercise per day, and Apple Watch helpfully tracks exercise — which it defines as anything faster than a brisk walk, using the heart rate sensor — as one of its three activity rings.
So what does the data say? We analyzed heart rate and exercise data from a sample of about 35,000 Cardiogram for Apple Watch users to find out whether the anecdotes hold up.
Users tend to get more exercise on weekends than weekdays, so we took a rolling, 7-day average of the number of minutes of exercise per week.
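The smoothing step can be sketched as follows. This is not Cardiogram's actual pipeline: the function names and the sample numbers are made up, and the sketch assumes each user has a list of daily exercise minutes that is compared against the 30-minute daily recommendation. A 7-day window always contains each day of the week exactly once, which is what removes the weekday/weekend swing.

```python
from collections import deque

RECOMMENDED_MINUTES_PER_DAY = 30  # guideline quoted in the post
WINDOW_DAYS = 7


def rolling_mean(daily_minutes, window=WINDOW_DAYS):
    """Trailing mean over `window` days, starting once the window is full."""
    recent = deque(maxlen=window)
    means = []
    for minutes in daily_minutes:
        recent.append(minutes)
        if len(recent) == window:
            means.append(sum(recent) / window)
    return means


def share_meeting_goal(minutes_by_user, goal=RECOMMENDED_MINUTES_PER_DAY):
    """Fraction of users whose latest 7-day mean reaches the goal."""
    latest = [
        rolling_mean(series)[-1]
        for series in minutes_by_user.values()
        if len(series) >= WINDOW_DAYS  # skip users with under 7 days of data
    ]
    if not latest:
        return None
    return sum(mean >= goal for mean in latest) / len(latest)


# Hypothetical users and minutes, for illustration only.
example = {
    "user_a": [10, 20, 45, 30, 60, 15, 40],  # 7-day mean of about 31.4
    "user_b": [0, 5, 10, 0, 20, 90, 60],     # 7-day mean of about 26.4
    "user_c": [35, 40],                      # too little data, excluded
}
print(share_meeting_goal(example))
```

In this toy sample one of the two eligible users reaches the goal, so the share is one half.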
During the 30 days prior to Pokémon GO’s release, anywhere from 37–46% of users got the recommended level of exercise. The day of launch (July 6) seemed like any other: about 45%. Two days later, that number climbed to 50%; today, so far, it’s at 53%. That’s a noticeable uptick!
The chart above is aggregated from the “Metrics” pane of Cardiogram (below, left). We included only users who had at least 7 days of trend data. There are a couple of caveats to keep in mind when drawing conclusions from this, or any other app’s, data:
- These trends are across Cardiogram users as a whole; we have no way of pinpointing which users installed Pokémon GO or use it actively.
- People who buy an Apple Watch are, of course, a biased sample of the population.
- People who download an Apple Watch heart rate app are themselves a biased sub-sample of wearable users.
If we zoom out the chart to cover all of 2016 so far, you can see how the recent uptick compares with other drivers of exercise, such as New Year’s resolutions and seasonal variation (winter / summer / spring). Note that our user base has also shifted over time (from a less exercise-intensive to a more mainstream audience), and we have not done a mix-adjusted analysis, so longer-term shifts represent a combination of seasonality and a shift in user mix.
That said, the uptick in exercise is not far in size from the one we see for New Year’s resolutions (which seem to endure for about one week in our dataset). And given the excellent benefits exercise has shown, both generally and for specific conditions such as atrial fibrillation, that’s nothing to sneeze at.
As with many games, the true test is whether users continue to engage over time. Four days after launch is too soon to conclude, but if the effect on exercise proves to be durable, Pokémon GO may prove to be one of the largest developments in public health this year.