Why clinical-grade algorithm development is different from regular software engineering
Most software is validated by asking: does it produce the right output for the right input? Clinical software under regulatory certification faces a harder question: does it produce the same output as the certified reference implementation, on the same inputs, to within a defined numerical tolerance?
That distinction sounds subtle. In practice, it changes everything about how you work.
What regulatory certification actually means for your codebase
When the FDA clears a medical device algorithm under 510(k), the cleared artefact is specific. The algorithm, its inputs, its outputs, and the evidence of its clinical performance are all documented as part of the submission. If you change the algorithm — refactor the code, move it to a new language, optimise a loop — you have potentially changed the cleared device.
This is not a theoretical concern. It is the core engineering constraint that governs any maintenance work on a certified codebase.
The implication for engineering teams is significant. You cannot simply run your test suite, see green, and ship. The standard of correctness is not "does it work" but "is it the same" — which is a much harder guarantee to produce.
The refactoring problem that most teams don't anticipate
We encountered this directly during our 2.5-year engagement with PKG Health (now Empatica). PKG had FDA 510(k) cleared algorithms for monitoring motor fluctuations in Parkinson's disease patients — tremor severity, bradykinesia, dyskinesia detection.
The algorithms had been implemented in four different programming languages across a legacy codebase that was effectively frozen. Adding a new wearable device platform meant re-implementing each algorithm in the new environment. The brief was to do this without invalidating the certification.
The constraint: any re-implemented algorithm had to produce output that matched the cleared Python reference implementation within acceptable numerical tolerance, on the same input signals, across the full clinical range of patient data.
That requirement is not a preference. It is what separates a validated algorithm from an unvalidated one in the eyes of the regulator.
The validation methodology that works
The approach we developed over that engagement works as follows:
Establish the reference implementation as ground truth. The cleared Python code is not touched. It becomes the oracle against which every other implementation is compared. No improvements, no optimisations, no bug fixes — not during the validation phase.
Build a shared test harness. Both implementations — reference and candidate — are run from the same test runner on the same input vectors. The harness produces a diff of outputs for every test case.
Test on real signals, especially at the clinical edge cases. Synthetic signals are insufficient. The test vectors must include real patient data (de-identified, from consented recordings) covering the full operating envelope: high tremor, minimal tremor, device off-wrist, ambulation artefact, sleep. Edge cases are where implementations diverge.
Compare outputs numerically with defined tolerances. "Close enough" by eye is not a test. Every output value is compared against explicit tolerance thresholds derived from the clinical significance of the measurement. Any difference outside tolerance is a defect in the re-implementation, not a property of the algorithm.
Document every discrepancy and its resolution. The root cause of every numerical difference must be identified and resolved. This documentation becomes part of the traceability record that supports the regulatory file.
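As a rough illustration of the harness and comparison steps above, here is a minimal Python sketch. It is not the harness used on the engagement: the names `compare_implementations` and `Discrepancy`, the toy RMS functions, and the tolerance of `1e-9` are all hypothetical placeholders. A real tolerance would be derived from the clinical significance of the measurement, and the implementations under test would be the actual reference and candidate algorithms.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical shape of an algorithm: a signal in, a list of output scores out.
Algorithm = Callable[[Sequence[float]], List[float]]


@dataclass
class Discrepancy:
    case_id: str      # which input vector produced the mismatch
    index: int        # position in the output sequence
    reference: float  # value from the reference implementation
    candidate: float  # value from the re-implementation
    abs_error: float  # absolute difference between the two


def compare_implementations(
    reference: Algorithm,
    candidate: Algorithm,
    test_vectors: Dict[str, Sequence[float]],
    abs_tol: float,
) -> List[Discrepancy]:
    """Run both implementations on every vector; list out-of-tolerance outputs."""
    discrepancies: List[Discrepancy] = []
    for case_id, signal in test_vectors.items():
        ref_out = reference(signal)
        cand_out = candidate(signal)
        if len(ref_out) != len(cand_out):
            raise ValueError(
                f"{case_id}: output lengths differ "
                f"({len(ref_out)} vs {len(cand_out)})"
            )
        for i, (r, c) in enumerate(zip(ref_out, cand_out)):
            err = abs(r - c)
            # A NaN error fails this test, so NaN on either side is flagged.
            if not err <= abs_tol:
                discrepancies.append(Discrepancy(case_id, i, r, c, err))
    return discrepancies


# Toy stand-ins for a reference and two candidates (illustrative only).
def reference_rms(signal: Sequence[float]) -> List[float]:
    return [(sum(x * x for x in signal) / len(signal)) ** 0.5]


def candidate_rms(signal: Sequence[float]) -> List[float]:
    total = 0.0
    for x in reversed(signal):  # same maths, different accumulation order
        total += x * x
    return [(total / len(signal)) ** 0.5]


def faulty_rms(signal: Sequence[float]) -> List[float]:
    # Deliberate defect: divides by n - 1 instead of n.
    return [(sum(x * x for x in signal) / (len(signal) - 1)) ** 0.5]


vectors = {
    "low_amplitude": [0.01, -0.02, 0.015, -0.01],
    "high_amplitude": [2.5, -3.1, 2.8, -2.9],
}

ABS_TOL = 1e-9  # placeholder, not a clinically derived threshold

print(len(compare_implementations(reference_rms, candidate_rms, vectors, ABS_TOL)))
for d in compare_implementations(reference_rms, faulty_rms, vectors, ABS_TOL):
    print(d.case_id, d.index, d.abs_error)
```

Returning a list of structured discrepancies, rather than a single pass/fail flag, is a deliberate choice in this sketch: each entry can be carried into the traceability record along with its root cause and resolution.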
This is not unit testing. It is more rigorous, more systematic, and more conservative than the testing most software teams practise. It also produces a much stronger guarantee of correctness.
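The edge-case requirement in the methodology can also be checked mechanically before any comparison run. The sketch below assumes each test vector carries a set of condition labels. The label identifiers, the recording IDs, and the function `missing_conditions` are hypothetical; only the five conditions themselves come from the list above.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence

# The five conditions named in the text; the identifiers are illustrative.
REQUIRED_CONDITIONS = (
    "high_tremor",
    "minimal_tremor",
    "off_wrist",
    "ambulation_artefact",
    "sleep",
)


def missing_conditions(
    vector_labels: Dict[str, Iterable[str]],
    required: Sequence[str] = REQUIRED_CONDITIONS,
    min_per_condition: int = 1,
) -> List[str]:
    """Return required conditions covered by fewer than min_per_condition vectors."""
    counts: Counter = Counter()
    for labels in vector_labels.values():
        counts.update(set(labels))  # count each vector once per condition
    return [c for c in required if counts[c] < min_per_condition]


# Hypothetical labelling of three de-identified recordings.
labels = {
    "rec_001": ["high_tremor"],
    "rec_002": ["minimal_tremor", "sleep"],
    "rec_003": ["off_wrist"],
}

print(missing_conditions(labels))  # ['ambulation_artefact']
```

A gate like this could run at the start of the harness, so that a comparison on an incomplete test set is never mistaken for evidence of equivalence across the full operating envelope.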
Why this discipline is useful beyond regulated software
We have applied elements of this methodology on projects that are not regulated, because the habits it builds are useful in their own right.
Defining the expected output before you write the code. Testing against real inputs at the edge of the operating range. Keeping records of what changed and why. Treating the previous implementation as the specification, not as a draft to be improved along the way.
These practices improve software quality in any domain where the cost of getting it wrong is real. The clinical environment just makes the cost explicit.
If you are working on a medical device software project and dealing with the refactoring problem described here, we would be glad to talk.