The Distributed Data Science Unlock in Healthcare

Lauren Kaufman



Jul 11, 2022

Prominent HealthTech journalist-turned-investor Chrissy Farr recently shared an insight via Twitter which has been on our minds here at Bitfount:

Healthcare data veterans have long endured the headache of passing through layers of governance, navigating fractured data collection practices, and adapting to what feels like ancient data infrastructure in order to get access to personal health records. This process is still complex and by no means a breeze. However, the advent of easy-to-use tools for technical privacy enforcement, secure connection between disparate databases, and automated data operations pipelines is allowing the industry to slowly overcome these hurdles to data access and collaboration. The question remains: what now?

If we look to the market for an answer, the use of AI trained on healthcare data to develop better diagnostics, therapeutic treatments, and consumer lifestyle products seems to be gaining steam. Medgadget recently reported that the global market for AI in healthcare is forecast to reach USD 215.53 billion by 2030. Solutions purporting to leverage healthcare data-based AI for better patient outcomes range from DeepMind’s algorithm for predicting protein structures to BenevolentAI’s ability to enable endotype-specific drug discovery. All of these solutions have one thing in common: access to the right data and the right talent to make the best use of it.

But what happens if you have the right healthcare data, but not the right talent (or vice versa)? How can you glean the right insights without physically transferring data externally? Enter remote data science and its most exciting technique: federated machine learning!

What is Federated Machine Learning?

For the uninitiated, machine learning is a category of AI in which computers are trained to “learn” using mathematical models. For example, a large hospital system processes thousands of CT scans per week and employs several radiographers to take the scans and analyse them for anomalies. A machine learning model trained to detect anomalies could “pre-read” CT scans and send those with likely anomalies to the top of the review queue for radiographers to confirm.

The process of training this type of model, however, requires the data scientist to have access to the right data from which their model can learn. This is not always a possibility because data owners may feel there is too much regulatory or commercial risk in transferring data to another party or even giving access to sensitive data to internal users. This is where federated learning comes in!

Federated learning allows a data scientist to send models to where the data lives, without the need to transfer the data externally. For example, in Bitfount’s federated learning architecture, a data custodian connects the relevant data to a “Pod” on the data custodian’s own infrastructure and grants access for a data scientist to send models to the data for training, testing, or fine-tuning (see simplified diagram below). Once the model has run, only the model parameters are delivered back to the data scientist.

Bitfount - architecture schematic
Bitfount’s federated analytics and machine learning architecture sends models and algorithms to the data’s location (Pods), providing control over what gets returned to the data scientist without sharing the data itself.
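To make the flow concrete, here is a minimal sketch of one common federated learning scheme, federated averaging, written in plain Python. It is an illustration only and not Bitfount’s API: the function names (`local_update`, `federated_average`, `run_rounds`), the tiny linear model, and the two example Pods are all assumptions made for this example. Each Pod trains on its own rows and hands back parameters only, and the data scientist’s side averages them.

```python
from typing import List, Tuple

Record = Tuple[float, float]  # (feature x, label y); rows never leave a Pod


def local_update(weights: List[float], data: List[Record],
                 lr: float = 0.05, epochs: int = 20) -> List[float]:
    """Runs inside a Pod: fit y ~ w*x + b by gradient descent on local rows.

    Only the updated parameters [w, b] are returned, never the rows.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def federated_average(updates: List[List[float]], sizes: List[int]) -> List[float]:
    """Runs on the data scientist's side: size-weighted mean of the parameters."""
    total = sum(sizes)
    return [sum(u[i] * s for u, s in zip(updates, sizes)) / total
            for i in range(len(updates[0]))]


def run_rounds(pods: List[List[Record]], rounds: int = 50) -> List[float]:
    """Send the current model to every Pod, then average what comes back."""
    weights = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_update(weights, pod) for pod in pods]
        weights = federated_average(updates, [len(pod) for pod in pods])
    return weights


# Two hypothetical Pods whose rows happen to lie on the line y = 2x + 1.
pod_a = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
pod_b = [(3.0, 7.0), (4.0, 9.0)]
w, b = run_rounds([pod_a, pod_b])
```

In this toy setup the averaged model settles close to w = 2 and b = 1 even though neither Pod ever reveals a row to the other or to the coordinator. A production system would add access control, transport security, and a real model on top of this loop.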

Federated learning can also be used in combination with other privacy-preserving techniques, such as differential privacy, secure multi-party computation (MPC), and homomorphic encryption to protect the privacy of data subjects within the dataset. These properties make federated learning an extremely useful technique for unlocking data silos where access was not previously possible due to regulatory or privacy risk concerns. It also means data owners without vast resources for data science teams can reasonably outsource data analysis capabilities without putting the underlying data at risk.
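As one example of such a combination, the sketch below applies differential privacy to a simple counting query using the Laplace mechanism. It is a generic textbook illustration, not a description of how any particular product implements it; the function names, the `epsilon` value, and the sample ages are assumptions chosen for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of matching records.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [34, 61, 47, 72, 58, 29, 66]  # hypothetical ages held inside one Pod
noisy = private_count(ages, lambda age: age >= 60, epsilon=1.0, rng=rng)
```

A smaller `epsilon` means more noise and stronger protection for any individual in the dataset, at the cost of a less accurate answer for the data scientist.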

Given that this access and talent problem has persisted in healthcare since the beginning of its modern incarnation, it’s no surprise healthcare institutions and tech providers are some of the earliest adopters of federated learning technology. But now that federated learning enables cross-entity access to personal health record data, how can the data be used?

The Power of Federated Learning in Healthcare

There are a wide variety of use cases to which federated learning can be applied. Here we will focus on three use cases for which federated learning not only unlocks access to data for modelling purposes, but also simplifies the data preparation and collection process throughout a drug’s development lifecycle:

  1. Drug discovery & development
  2. Clinical trial recruitment
  3. Post-trial real world data (RWD) studies

1. Drug Discovery & Development

It is common for pharmaceutical companies to use machine learning to predict molecular structures or responses to treatment. They can typically draw on data in their possession to train these models with attributes such as ADME (absorption, distribution, metabolism, excretion) and solubility. However, for competitive reasons, traditional models are limited to learning from the data each pharmaceutical company has collected itself. Wouldn’t we as a society be better off if companies working to treat the same conditions collaborated to develop the best drug?

Drug discovery consortium MELLODDY is pursuing just that. Via federated learning, MELLODDY enables 10 competing pharmaceutical companies to ‘pool’ their data in training drug discovery models without requiring data transfer or access to the explicit values in a given entity’s database. The model trained using federated learning demonstrated a boost in predictive performance over models trained independently on each Pharma company’s own data. MELLODDY is a proof point that federated learning can be leveraged in drug discovery to shorten time to discovery and development through more accurate prediction, without putting competitive data at risk.

2. Clinical Trial Recruitment

Once a new treatment has been developed, eligible patients must be recruited to participate in its multi-phase clinical trials. Given the fragmented nature of healthcare delivery, finding and enrolling clinical trial patients has historically relied on a network of clinics recommending patients to a specific trial in a relatively ad-hoc manner. The clinics typically operate in a silo to identify patients and data collection practices vary across locations, making it difficult for a Pharma company to efficiently find patients. This process takes time and can often be a bottleneck to beginning a trial for a new treatment, leading to 86% of clinical trials being cancelled or significantly delayed.

Increasingly, automated data entry and evaluation solutions are installed clinic-side to help identify eligible patients and enrol them in studies. This helps reduce the burden on clinicians to identify potential patients, but is it the most efficient way to reduce the enrolment bottleneck? Federated inference, or the ability to execute a machine learning model and output its predictions remotely, has emerged as a way to both a) help identify eligible patients quickly and accurately and b) reduce friction resulting from variation in data collection and entry methods. Typically, this is leveraged in combination with remote analytics, where queries to “filter” data prior to model execution are run in order to exclude populations who are not applicable to the trial.

For example, a Pharma company might wish to test a new treatment for macular degeneration of the eye. This type of condition is typically diagnosed via imaging data, scans of the eye that can be read by an ophthalmologist to confirm diagnosis. Traditionally, the diagnostic process is time consuming and laborious, requiring an expert to spend 20-30 minutes reviewing all of the images in order to confirm diagnosis.

These same images provide a powerful training dataset for a model which can be used to identify patients eligible for the Pharma company’s trial directly at the point of care. A federated approach allows different clinics to make imaging data available to the model automatically without the need to transfer data to an external party or reveal the identity of the patient. Once an image has been evaluated by the previously trained model, a patient’s record can be automatically flagged as a good candidate for the trial, saving valuable expert time and effort. The clinic then recommends this patient for the trial, reducing friction for the clinic, the patient, and the Pharma company in the process.
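The recruitment flow described above, a remote “filter” query followed by federated inference, can be sketched roughly as follows. Everything here is hypothetical: the `Scan` fields, the age rule, the 0.8 threshold, and the placeholder `predict` function stand in for a real trial protocol and a real imaging model. The point is the shape of the flow: both steps run inside the clinic, the flagged record IDs stay with the clinic, and only a summary leaves.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Scan:
    record_id: str       # local identifier that never leaves the clinic
    patient_age: int
    lesion_score: float  # stand-in for what a real imaging model would compute


def passes_filter(scan: Scan) -> bool:
    """Remote analytics step: a hypothetical eligibility rule (age 50 or over)."""
    return scan.patient_age >= 50


def predict(scan: Scan) -> float:
    """Placeholder for a previously trained model; returns a score in [0, 1]."""
    return min(1.0, max(0.0, scan.lesion_score))


def run_at_clinic(scans: List[Scan],
                  threshold: float = 0.8) -> Tuple[List[str], Dict[str, int]]:
    """Filter, then score, inside the clinic. Only the summary is sent out."""
    screened = [s for s in scans if passes_filter(s)]
    flagged_ids = [s.record_id for s in screened if predict(s) >= threshold]
    summary = {"screened": len(screened), "flagged": len(flagged_ids)}
    return flagged_ids, summary


clinic_scans = [
    Scan("rec-001", 67, 0.91),
    Scan("rec-002", 42, 0.95),  # excluded by the filter before the model runs
    Scan("rec-003", 71, 0.35),
    Scan("rec-004", 58, 0.84),
]
flagged_for_clinic, summary_for_sponsor = run_at_clinic(clinic_scans)
```

Here the clinic would see two flagged records to review and recommend, while the trial sponsor would learn only that three records were screened and two were flagged.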

3. Post-Trial RWD Studies

What happens after a patient has completed a clinical trial? Often, pharmaceutical companies will wish to continue to analyse personal health record data on trial participants to understand the long-term effects of treatment. However, this can be challenging as “real life” is messy! Patients may drop out of the trial early, move and be difficult to locate, or fail to provide adequate feedback on the effects of treatment. A number of providers have emerged to address these gaps in the post-trial analysis period, and data from various real-world sources, including social determinants of health (SDOH) data such as income and grocery transactions, are linked to try to get a better picture of post-trial activity. However, the management of these data and assurance that they conform to the requirements of trial design can be cumbersome.

Federated analytics in combination with other privacy-preserving techniques gives post-trial studies a way to keep the analysis of these data sources compliant and to add new sources of RWD without hassle. Federated data science can be leveraged to analyse RWD from disparate sources without requiring those sources to share data with one another or even with an external third-party provider, making data more accessible for the purposes of post-trial analysis.
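A minimal sketch of the idea, assuming each RWD source can compute a simple aggregate locally: every source returns only a sum and a count, and the analyst combines them into an overall mean. The source names and readings below are invented for illustration, and a real deployment would typically layer further safeguards (such as minimum group sizes or added noise) on top.

```python
from typing import Dict, List, Tuple


def local_summary(values: List[float]) -> Tuple[float, int]:
    """Runs at each data source: return only (sum, count), never the rows."""
    return sum(values), len(values)


def combine(summaries: List[Tuple[float, int]]) -> float:
    """Runs on the analyst's side: overall mean from per-source aggregates."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


# Hypothetical post-trial blood pressure readings held by two separate sources.
sources: Dict[str, List[float]] = {
    "trial_follow_up": [120.0, 135.0, 128.0],
    "pharmacy_records": [140.0, 122.0],
}
summaries = [local_summary(rows) for rows in sources.values()]
overall_mean = combine(summaries)
```

Adding a new source of RWD in this pattern means adding one more entry that can answer the same aggregate query, rather than negotiating a new bulk data transfer.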

As we can see from these applications for drug discovery & development, clinical trial recruitment, and post-trial analysis, federated data science can be leveraged throughout the drug development lifecycle to safely enable access to data. Remote approaches to data science are not only a means to get access to data; they also represent an unlock for data collaboration in the healthcare sector in service of better patient outcomes. To call back to Chrissy Farr’s original question, what can we do with healthcare data once we get it? With federated learning and privacy-preserving techniques at your disposal, the possibilities are endless.

Want to try it out for yourself? Sign up at
