What is machine learning for medical devices?
Machine learning for medical devices is the discipline of training, validating, and deploying predictive models on sensor data collected from health monitoring hardware. Unlike consumer ML applications, medical device ML must meet a higher bar: every model must be validated against clinical datasets, its performance must be documented across the full distribution of inputs the device will encounter, and any output that informs clinical decisions must be traceable to a defined specification. The most common application in wearable health monitoring is classification: determining whether a sensor event represents a valid measurement or an artefact to be discarded. Off-wrist detection, which identifies when a wearable has been removed from the patient, is one example: if a worn device is misclassified as off-wrist during a tremor episode, the monitoring record shows a gap where clinically relevant data should be. Devsort has developed and evaluated ML-based off-wrist detection as part of the PKG Health algorithm suite.
How does clinical ML validation differ from standard model evaluation?
Standard ML evaluation is statistical: a model that achieves 95% accuracy on a held-out test set is typically considered good. Clinical ML evaluation is individual: a model that is wrong 5% of the time can still cause harm when the error falls on a specific patient at a specific clinical moment, and those errors may be concentrated in one patient subgroup rather than spread evenly. This shifts the validation framework from aggregate metrics to distribution-level coverage, documenting performance not just at the mean but across the full range of patient characteristics, device conditions, and edge-case inputs the system will encounter in deployment.
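The gap between aggregate and subgroup performance can be shown with a small sketch. The data below is invented for illustration, and the subgroup labels (`tremor`, `no_tremor`) and function names are assumptions, not part of any Devsort or PKG Health dataset or tooling.

```python
from collections import defaultdict

def accuracy(records):
    """Fraction of (subgroup, y_true, y_pred) records where the prediction is correct."""
    return sum(1 for _, t, p in records if t == p) / len(records)

def sensitivity_by_subgroup(records):
    """Return {subgroup: sensitivity}, where sensitivity = TP / actual positives.

    A subgroup with no positive cases maps to None, because its
    sensitivity is undefined.
    """
    tp = defaultdict(int)
    pos = defaultdict(int)
    for subgroup, y_true, y_pred in records:
        if y_true == 1:
            pos[subgroup] += 1
            if y_pred == 1:
                tp[subgroup] += 1
    groups = {g for g, _, _ in records}
    return {g: (tp[g] / pos[g] if pos[g] else None) for g in groups}

# Invented example: 100 records, 95 of them classified correctly.
records = (
    [("no_tremor", 1, 1)] * 45
    + [("no_tremor", 0, 0)] * 45
    + [("tremor", 1, 1)] * 5
    + [("tremor", 1, 0)] * 5   # all five errors fall in the tremor subgroup
)

overall = accuracy(records)                   # 0.95
per_group = sensitivity_by_subgroup(records)  # tremor: 0.5, no_tremor: 1.0
print(overall, per_group)
```

In this invented case the model reports 95% accuracy overall while detecting only half of the positive events in the tremor subgroup. The aggregate figure hides exactly the shortfall that distribution-level coverage is meant to expose.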
Regulatory frameworks reinforce this. Under FDA guidance on AI/ML-based software as a medical device (SaMD), models that produce clinical outputs must be evaluated with predetermined performance objectives tied to the device's intended use. A model change — even retraining on a larger dataset — may constitute a modification that requires regulatory assessment before deployment.
Devsort approaches clinical ML with the same discipline as our deterministic algorithm work: documented input specifications, formal validation against clinical datasets, and a complete audit trail from training data to deployed model.
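One way to make an audit trail from training data to deployed model concrete is a manifest of content hashes. The sketch below uses Python's standard `hashlib`. The manifest fields, the specification identifier, and the function names are hypothetical and shown only to illustrate the idea, not to describe Devsort's actual process.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(training_data: bytes, model_artifact: bytes, spec_version: str) -> dict:
    """Record which data produced which model under which specification."""
    return {
        "spec_version": spec_version,  # hypothetical identifier
        "training_data_sha256": sha256_hex(training_data),
        "model_sha256": sha256_hex(model_artifact),
    }

def verify_model(manifest: dict, model_artifact: bytes) -> bool:
    """Check that a deployed artefact is byte-identical to the validated one."""
    return manifest["model_sha256"] == sha256_hex(model_artifact)

# Placeholder bytes stand in for real files read from disk.
manifest = build_manifest(b"accel_x,label\n0.01,0\n", b"model-bytes-v1", "SPEC-1.0")
print(json.dumps(manifest, indent=2, sort_keys=True))
print(verify_model(manifest, b"model-bytes-v1"))  # same bytes as recorded
print(verify_model(manifest, b"model-bytes-v2"))  # any change alters the hash
```

A hash mismatch shows that the artefact has changed. It does not say whether the change matters, which is a question for change impact analysis.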
How do we build clinical ML systems?
1. Define the clinical problem and input space
We begin by precisely specifying what the model must predict, the inputs available at inference time, and the performance requirements tied to the clinical use case. For classification tasks — off-wrist detection, artefact rejection, event classification — we define the positive and negative classes, the acceptable false positive and false negative rates, and the patient subgroups that must be represented in validation.
2. Build and select the model
We implement candidate models appropriate for the data modality and deployment environment. For embedded or edge deployment, this typically means classical ML (gradient boosting, random forest) or lightweight neural architectures rather than large deep learning models, because inference constraints on medical hardware favour models with bounded, predictable compute requirements. We evaluate candidates against the predetermined performance criteria, not arbitrary benchmark metrics.
3. Validate against clinical datasets
We run final model evaluation on clinical datasets that were not used in training or selection — datasets representative of the full intended patient population. Performance is reported against every predetermined criterion. Any criterion not met is investigated: is the data distribution different from what was assumed, or does the model need redesign? The validation record documents both the results and the investigation of any shortfall.
4. Document for regulatory and clinical review
We produce a model card and validation report documenting training data provenance, feature engineering decisions, model architecture, evaluation methodology, performance results, and known limitations. For regulatory submissions, we produce the software documentation artefacts required under FDA guidance on AI/ML SaMD and IEC 62304.
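The link between the predetermined criteria defined at the start and the final validation can be sketched as a small acceptance check. This is an illustration under stated assumptions, not Devsort's validation tooling: the `Criterion` fields, the subgroup labels, the limits, and the data are all invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    metric: str       # "fnr" (false negative rate) or "fpr" (false positive rate)
    subgroup: str     # a subgroup label, or "all" for the pooled population
    max_value: float  # acceptance limit fixed before validation starts

def rates(records):
    """Return (fnr, fpr) for a list of (y_true, y_pred) pairs."""
    fn = sum(1 for t, p in records if t == 1 and p == 0)
    fp = sum(1 for t, p in records if t == 0 and p == 1)
    pos = sum(1 for t, _ in records if t == 1)
    neg = len(records) - pos
    fnr = fn / pos if pos else float("nan")
    fpr = fp / neg if neg else float("nan")
    return fnr, fpr

def evaluate(criteria, records_by_subgroup):
    """Return [(criterion name, observed value, passed)] for every criterion.

    A criterion with no applicable data yields NaN, which compares as
    not-passed, so missing coverage is reported as a shortfall.
    """
    report = []
    for c in criteria:
        if c.subgroup == "all":
            recs = [r for rs in records_by_subgroup.values() for r in rs]
        else:
            recs = records_by_subgroup[c.subgroup]
        fnr, fpr = rates(recs)
        observed = fnr if c.metric == "fnr" else fpr
        report.append((c.name, observed, observed <= c.max_value))
    return report

criteria = [
    Criterion("overall FNR", "fnr", "all", 0.10),
    Criterion("tremor FNR", "fnr", "tremor", 0.10),
    Criterion("overall FPR", "fpr", "all", 0.05),
]
held_out = {
    "tremor": [(1, 1)] * 8 + [(1, 0)] * 2 + [(0, 0)] * 10,
    "no_tremor": [(1, 1)] * 20 + [(0, 0)] * 19 + [(0, 1)] * 1,
}
report = evaluate(criteria, held_out)
for name, observed, passed in report:
    print(f"{name}: {observed:.3f} {'PASS' if passed else 'FAIL'}")
```

With this invented data the pooled false negative rate passes while the tremor subgroup fails its own criterion. That is the kind of shortfall the validation record would then need to investigate and document.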
ML for wearable movement monitoring: the off-wrist detection case
PKG Health's movement disorder monitoring system requires confidence that the device is being worn during recording periods. If an off-wrist event (the device removed from the patient's wrist) goes undetected, the recording from that period enters the movement record as if it were patient data, for example as a spurious stretch of stillness, potentially misrepresenting the patient's motor state to the clinician.
Devsort developed and evaluated ML-based off-wrist detection for the PKG algorithm suite. The detector works alongside the deterministic signal processing algorithms that analyse movement when the device is confirmed on-wrist. The model uses accelerometer signal features to classify wrist state, with thresholds tuned to minimise false negatives (off-wrist periods incorrectly classified as on-wrist) at an acceptable false positive rate.
This work illustrated the integration challenge specific to clinical ML: the model's output feeds into a downstream algorithm whose regulatory validation assumes reliable wrist-state classification. A model that degrades on a specific patient subgroup degrades the clinical output for that subgroup even if the algorithm itself remains unchanged.
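Tuning a threshold to minimise false negatives at an acceptable false positive rate can be sketched as a constrained search over candidate thresholds. This is a generic illustration, not the PKG implementation. It assumes the model emits a score where higher means more likely off-wrist, treats off-wrist as the positive class, and uses invented scores and an invented false positive cap.

```python
def pick_threshold(scores, labels, max_fpr):
    """Choose the threshold with the lowest false negative rate whose
    false positive rate does not exceed max_fpr.

    A sample is predicted off-wrist (positive) when score >= threshold.
    Returns (threshold, fnr, fpr).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    # float('inf') predicts nothing as positive, so one candidate is always feasible.
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for thr in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        fpr, fnr = fp / neg, fn / pos
        if fpr > max_fpr:
            continue
        # Prefer lower FNR; on a tie, prefer lower FPR.
        if best is None or fnr < best[1] or (fnr == best[1] and fpr < best[2]):
            best = (thr, fnr, fpr)
    return best

# Invented validation scores (1 = off-wrist, 0 = on-wrist).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, fnr, fpr = pick_threshold(scores, labels, max_fpr=0.25)
print(threshold, fnr, fpr)
```

The threshold should be chosen on data separate from the final validation set. Otherwise the reported rates are optimistic.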
Frequently asked questions
What kinds of ML models do you build for medical devices?
We primarily build classical ML classifiers — gradient boosting, random forest, support vector machines, and lightweight neural networks — for wearable and IoT medical devices. These architectures are preferred for embedded and edge deployment because their inference compute requirements are bounded and predictable. For cloud-side clinical analytics, we also work with more complex architectures where inference latency constraints are looser. We do not work on generative AI or large language models in clinical contexts.
How do you handle class imbalance in clinical datasets?
Clinical datasets are often severely imbalanced: adverse events, artefacts, and off-wrist periods are inherently far less frequent than normal operation. We address this through a combination of resampling strategies, cost-sensitive learning, and threshold calibration, selecting the approach based on the clinical cost of false positives versus false negatives for the specific use case. All imbalance handling decisions are documented in the validation report because they affect the model's real-world performance characteristics.
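As one illustration of cost-sensitive learning, class weights can be set inversely proportional to class frequency, so that errors on the rare class count for more during training. The sketch below computes such weights in plain Python. The formula matches the "balanced" heuristic used by scikit-learn, and the 95/5 split is invented. In practice the weights would usually be adjusted to reflect the clinical cost of each error type rather than frequency alone.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Weighted misclassification rate: each error costs the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total

# Invented imbalance: 95 normal samples (0) and 5 rare events (1).
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)

# A model that always predicts "normal" has 5% raw error,
# but is penalised heavily once the rare class is up-weighted.
always_normal = [0] * 100
print(weighted_error(labels, always_normal, weights))
```

Under these weights, a model that ignores the rare class altogether scores a weighted error of 0.5 rather than 0.05, which removes the incentive to predict the majority class everywhere.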
Can you build ML models that run on embedded wearable hardware?
Yes. For embedded targets, we design for inference efficiency from the start: feature extraction that runs within the sensor sampling loop, model architectures with bounded memory and compute requirements, and quantisation or fixed-point conversion where the hardware requires it. We validate that the embedded implementation matches the reference model's output within tolerance using the same process we apply to deterministic algorithm ports. See our embedded firmware development service for the full picture of how we handle embedded deployment.
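The tolerance check between a reference model and a fixed-point implementation can be sketched as follows. This is a simplified illustration, not a description of any particular port. It assumes a linear scoring function, inputs and weights scaled into [-1, 1), a signed 16-bit Q15 format, and an invented tolerance of 1e-4.

```python
Q = 15
SCALE = 1 << Q  # 32768: one unit in Q15 represents 1/32768

def to_q15(x):
    """Quantise a float to a signed 16-bit Q15 integer, saturating at the limits."""
    v = int(round(x * SCALE))
    return max(-SCALE, min(SCALE - 1, v))

def dot_float(w, x):
    """Reference implementation in floating point."""
    return sum(wi * xi for wi, xi in zip(w, x))

def dot_q15(wq, xq):
    """Integer-only dot product. Each product is Q30; the sum is shifted back to Q15.

    Python integers do not overflow; on real hardware the accumulator
    width would need to be chosen to hold the worst-case sum.
    """
    acc = sum(wi * xi for wi, xi in zip(wq, xq))
    return acc >> Q

def max_abs_error(w, vectors):
    """Largest absolute difference between reference and fixed-point outputs."""
    wq = [to_q15(v) for v in w]
    worst = 0.0
    for x in vectors:
        reference = dot_float(w, x)
        embedded = dot_q15(wq, [to_q15(v) for v in x]) / SCALE
        worst = max(worst, abs(reference - embedded))
    return worst

# Invented weights and test vectors.
weights_f = [0.25, -0.5, 0.125]
test_vectors = [[0.1, 0.2, 0.3], [-0.9, 0.5, 0.05], [0.0, 0.0, 0.0]]
TOLERANCE = 1e-4  # assumed acceptance limit

error = max_abs_error(weights_f, test_vectors)
print(error, error <= TOLERANCE)
```

A real equivalence test would run over vectors chosen to cover the input specification, including saturation and boundary cases, not just a handful of typical values.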
What does FDA guidance say about ML in medical devices?
FDA's guidance on AI/ML-based software as a medical device (SaMD) distinguishes between locked algorithms — those whose logic is fixed after deployment — and adaptive algorithms that update in use. Locked ML models deployed in cleared devices are subject to the standard software change control process: any modification, including retraining, must be assessed against the device's regulatory documentation before deployment. Devsort produces the technical documentation that supports this assessment: training data provenance, model specification, validation report, and change impact analysis.
Do you work on AI projects outside the medical device context?
Yes. Beyond clinical wearables, we have built AI-powered features for business software, including a CRM system with embedded AI-driven workflow automation and lead scoring. The validation requirements differ — business ML is evaluated on business performance metrics, not clinical safety criteria — but the engineering discipline of specifying inputs, documenting the model, and validating performance against predetermined criteria applies across both contexts.