Researchers have concluded that Babylon Health has not offered ‘convincing evidence’ that its AI-powered diagnostic and triage system can perform better than doctors.

In July 2018, Babylon Health claimed a study had demonstrated that its artificial intelligence (AI) system had diagnostic ability 'on-par with human doctors'.

But in a letter to medical journal The Lancet, Hamish Fraser, Enrico Coiera and David Wong explained their review – ‘Safety of patient-facing digital symptom checkers’ – shows there ‘is a possibility that it [Babylon’s service] might perform significantly worse’.

Fraser, Coiera and Wong – respectively a qualified doctor and associate professor of medical science; professor in medical informatics; and lecturer in health informatics – argue that Babylon’s claims have been ‘met with scepticism because of methodological concerns’.

These concerns included the fact that 'data in the trials were entered by doctors' rather than by real-life patients or 'lay users'.

Babylon made its original claim based on feeding a representative sample of questions from the Membership of the Royal College of General Practitioners (MRCGP) exam to its diagnostic and triage system. The company reported that the AI scored 81%.

The researchers commended Babylon for releasing a ‘fairly detailed description of the system’ and said it ‘potentially showed some improvement to the average symptom checker’.

However, the letter states: “The study does not offer convincing evidence that the Babylon Diagnostic and Triage System can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse.”

It adds: “Further clinical evaluation is necessary to ensure confidence in patient safety.”

The letter concludes: “Symptom checkers have great potential to improve diagnostics, quality of care and health system performance worldwide.

“However, systems that are poorly designed or lack rigorous clinical evaluation can put patients at risk and likely increase the load on health systems.”

Babylon Health’s chief scientist, Saurabh Johri, thanked the co-authors for their letter and review.

His statement in full reads:

“We would like to thank the authors for their letter and review: ‘Safety of patient-facing digital symptom checkers’.

“As we outline in the original paper, the goal of our pilot study was to assess the performance of our system against a broad set of independently created vignettes, which represent a diverse range of conditions, including both common and rare diseases. Hence, the purpose of the study was to perform an initial comparison through statistical summaries rather than detailed statistical analysis. This setting contrasts with a ‘real-world’ one, which would strongly favour common conditions at the expense of those of lower incidence. Despite the limited number of vignettes in our study, for increased breadth we test against twice as many as in another similar evaluation (Semigran et al. 2015, BMJ).

“It is also important to remark that we took appropriate care to rigorously ground our scientific findings by stating in our paper that ‘further studies using larger, real-world cohorts will be required to demonstrate the relative performance of these systems to human doctors’.

“The authors raise a number of concerns, some of which were addressed by us previously in our response to the online commentary provided by one of the authors (Prof. Coiera).

“In their correspondence, the authors claim that ‘the study does not offer convincing evidence that the Babylon Diagnostic and Triage System can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse.’ As we indicated in our original study, our intention was not to demonstrate or claim that our AI system is capable of performing better than doctors in natural settings. In fact, we stress that our study adopts a ‘semi-naturalistic role-play paradigm’ to simulate a realistic consultation between patient and doctor, and it is in the context of this controlled experiment that we compare AI and doctor performance.

“We would also like to take this opportunity to remark on a number of factual inaccuracies in the commentary. Firstly, the authors have commented that some of the doctors are outliers in terms of their accuracy. However, regardless of their performance in the study, all doctors are GMC-registered and regularly consult with real patients. Also, even if Doctor B is removed, the Babylon Triage and Diagnostic System’s performance is similar to the performance of Doctor A and Doctor D. Secondly, we would also like to clarify that in a previous paper (Semigran et al. 2015, BMJ), the authors included only the relevant vignettes for each symptom checker tested, rather than all adult vignettes, as suggested in the appendix to the review letter.

“As we emphasise in the conclusion of our paper, the ability to generalise the findings of our pilot will require further studies. We welcome the suggestions of the authors for developing guidelines for robust evaluation of computerised diagnostic decision support systems since they align with our own thinking on how best to perform clinical evaluation. Together with our academic partners, we are currently in the process of performing a larger, real-world study, which we intend to submit for peer-review.”