Validation · Provotics

The numbers, and what they measure

0.908 macro-F1 on real held-out patients (balanced accuracy 0.911), every site weighted equally

89.5% on 381 independent tumors, 7 studies, different patients, centres, pipelines

8x calibration error cut after temperature scaling (0.091 to 0.011)

90% conformal coverage in-distribution; about 90% resolve to a single confident call

Macro-F1 weights all 25 sites equally, so the common cancers cannot carry the rare ones. We report it, and balanced accuracy, rather than top-line accuracy, which would look higher for the wrong reasons. The 0.908 is measured on real patients only. A higher figure exists on a mixed set that includes easier external samples, and we do not lead with it.
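To make the difference between these metrics concrete, here is a minimal sketch on made-up toy data (two hypothetical sites, not our 25, and not our evaluation code). It shows how a model that ignores a rare site can still post a high top-line accuracy while macro-F1 and balanced accuracy expose the failure.

```python
def per_site_counts(y_true, y_pred):
    """Return {site: (tp, fp, fn)} for every site present in y_true."""
    counts = {}
    for site in sorted(set(y_true)):
        tp = sum(t == site and p == site for t, p in zip(y_true, y_pred))
        fp = sum(t != site and p == site for t, p in zip(y_true, y_pred))
        fn = sum(t == site and p != site for t, p in zip(y_true, y_pred))
        counts[site] = (tp, fp, fn)
    return counts


def accuracy(y_true, y_pred):
    """Fraction of cases predicted correctly; dominated by common sites."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-site recall, so each site counts once."""
    counts = per_site_counts(y_true, y_pred).values()
    recalls = [tp / (tp + fn) for tp, _, fn in counts]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Mean of per-site F1, so each site counts once."""
    f1s = []
    for tp, fp, fn in per_site_counts(y_true, y_pred).values():
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy data (made up): 90 common-site cases, 10 rare-site cases,
# and a lazy model that always predicts the common site.
y_true = ["common"] * 90 + ["rare"] * 10
y_pred = ["common"] * 100

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} balanced_accuracy={bal:.3f} macro_f1={mf1:.3f}")
```

On this toy set, accuracy is 0.90 while balanced accuracy is 0.50 and macro-F1 is about 0.47, which is the "higher for the wrong reasons" effect in miniature.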

How we validate

The validation was built to try to break our own results, which is why the limitations further down are stated plainly rather than hedged.

Held out by study. An external study never appears in both training and test, so the model cannot memorize a cohort and call it generalization.
Batch-origin probe. We train a classifier to tell our data apart from an external cohort. If it can still separate them after correction, the apparent "signal" is a batch artifact and the cohort is rejected.
Shuffled-label control. We re-run with the labels shuffled. If an apparent gain survives the shuffle, it was never real. This caught and excluded a false improvement on one site that was a majority-class shortcut.
Bootstrap intervals and deduplication. Headline numbers carry resampled confidence intervals, and near-duplicate samples are removed so a profile cannot sit on both sides of a split.
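Three of the checks above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the pipeline itself: the function names, the study identifiers, the percentile-bootstrap method, and the toy counts are all hypothetical. The batch-origin probe and deduplication are omitted for brevity.

```python
import random


def split_by_study(study_ids, test_studies):
    """Index split in which every study lands wholly on one side."""
    test_studies = set(test_studies)
    train = [i for i, s in enumerate(study_ids) if s not in test_studies]
    test = [i for i, s in enumerate(study_ids) if s in test_studies]
    return train, test


def shuffled_labels(labels, seed=0):
    """Permuted copy of the labels for a negative-control run."""
    rng = random.Random(seed)
    out = list(labels)
    rng.shuffle(out)
    return out


def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy from 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]


# Hypothetical studies "A", "B", "C"; study "C" is held out entirely.
study_ids = ["A"] * 4 + ["B"] * 3 + ["C"] * 3
train_idx, test_idx = split_by_study(study_ids, test_studies={"C"})
train_studies = {study_ids[i] for i in train_idx}
held_out = {study_ids[i] for i in test_idx}

# Made-up outcomes: 180 correct calls out of 200.
correct = [1] * 180 + [0] * 20
low, high = bootstrap_ci(correct)

# Negative control: same label counts, association with inputs destroyed.
labels = ["Lung", "Breast", "Colon", "Lung", "Breast", "Colon"]
control = shuffled_labels(labels)
```

A model retrained on `control` should fall to chance; any site that still "improves" is exploiting something other than the biology, such as a majority-class shortcut.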

It knows when it does not know

A site prediction is only useful if its confidence is meaningful and the model can decline when it should. Three independent mechanisms make that true.

Calibrated confidence

Raw scores are temperature-scaled so a stated probability matches reality. Calibration error drops from 0.091 to 0.011 on held-out data, roughly an 8x reduction, so "80% confident" means about 80% in practice (in-distribution).
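As a rough illustration of the mechanism, the sketch below fits a single temperature by grid search on negative log-likelihood and measures expected calibration error (ECE) before and after. The grid, the 10-bin ECE, and the toy logits are assumptions made for this example. For brevity it fits and evaluates on the same tiny set, whereas a real setup would fit on a separate calibration split.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels):
    """Pick the temperature on a 0.5..5.0 grid that minimizes NLL."""
    grid = [0.5 + 0.05 * k for k in range(91)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


def expected_calibration_error(logit_rows, labels, temperature=1.0, n_bins=10):
    """Weighted mean gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for row, y in zip(logit_rows, labels):
        probs = softmax(row, temperature)
        conf = max(probs)
        hit = 1.0 if probs.index(conf) == y else 0.0
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, hit))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            avg_acc = sum(h for _, h in b) / len(b)
            ece += len(b) / len(labels) * abs(avg_acc - avg_conf)
    return ece


# Toy data (made up): an overconfident model that says ~98% for class 0
# on every case but is right only 8 times out of 10.
logit_rows = [[4.0, 0.0]] * 10
labels = [0] * 8 + [1] * 2

T = fit_temperature(logit_rows, labels)
ece_before = expected_calibration_error(logit_rows, labels, 1.0)
ece_after = expected_calibration_error(logit_rows, labels, T)
print(f"T={T:.2f} ece_before={ece_before:.3f} ece_after={ece_after:.3f}")
```

Temperature scaling never changes which site ranks first, only how confident the stated probability is, so accuracy is untouched while the confidence gap shrinks.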

Conformal sets and abstention

Instead of forcing a single answer, the model returns a candidate set with a 90% coverage target. About 90% of cases resolve to one confident site; when the evidence is genuinely ambiguous it returns the short list it cannot rule out, and when nothing clears the bar it abstains rather than guess.
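One common way to build such sets is split conformal prediction. The sketch below assumes that method and a simple score of one minus the probability given to the true site. The calibration values and site names are made up, and treating an empty set as an abstention is an assumption of this example rather than a statement of how the production model decides.

```python
import math


def conformal_threshold(cal_true_probs, coverage=0.90):
    """Score threshold from calibration cases (split conformal).

    cal_true_probs holds the probability the model gave to the TRUE site
    for each calibration case. The score is 1 - p, so low is good.
    """
    scores = sorted(1.0 - p for p in cal_true_probs)
    n = len(scores)
    # Finite-sample rank; the small epsilon guards against float round-up.
    k = math.ceil((n + 1) * coverage - 1e-9)
    if k > n:
        return float("inf")  # too few calibration cases: include everything
    return scores[k - 1]


def prediction_set(probs, qhat):
    """Sites whose score clears the threshold, most probable first."""
    kept = [site for site, p in probs.items() if 1.0 - p <= qhat]
    return sorted(kept, key=lambda site: -probs[site])


def decide(probs, qhat):
    """Single call, short list, or abstain (empty set)."""
    sites = prediction_set(probs, qhat)
    if not sites:
        return "abstain", sites
    return ("single" if len(sites) == 1 else "short_list"), sites


# Made-up calibration probabilities for 19 cases.
cal_true_probs = [0.99, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.91, 0.90, 0.89,
                  0.88, 0.85, 0.82, 0.80, 0.75, 0.70, 0.65, 0.30, 0.10]
qhat = conformal_threshold(cal_true_probs, coverage=0.90)

clear = decide({"Lung": 0.90, "Breast": 0.06, "Colon": 0.04}, qhat)
ambiguous = decide({"Lung": 0.50, "Breast": 0.45, "Colon": 0.05}, qhat)
flat = decide({"Lung": 0.28, "Breast": 0.26, "Colon": 0.24, "Kidney": 0.22}, qhat)
```

With these toy numbers the threshold is 0.70, so any site given at least 0.30 probability stays in the set: the clear case resolves to one site, the ambiguous case returns two, and the flat case returns none.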

Novelty and input validity

A distance check flags profiles unlike anything in training, and an input-validity gate catches profiles that are not tumor RNA-seq at all. The gate flagged 100% of held-out normal-tissue samples with no tumor false positives. These are guardrails, not a clinical detector.
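The exact distance check is not specified here, so the following is only one plausible shape for it: a nearest-neighbour distance compared against a threshold learned from the training set. The Euclidean metric, the 95th-percentile cutoff, and the two-dimensional toy points are all assumptions made for illustration; real expression profiles have thousands of dimensions.

```python
import math


def nn_distance(x, reference, skip=None):
    """Euclidean distance from x to its nearest reference point."""
    return min(math.dist(x, r) for j, r in enumerate(reference) if j != skip)


def fit_threshold(train, quantile=0.95):
    """Cutoff from leave-one-out nearest-neighbour distances in training."""
    loo = sorted(nn_distance(x, train, skip=i) for i, x in enumerate(train))
    idx = min(len(loo) - 1, max(0, math.ceil(quantile * len(loo)) - 1))
    return loo[idx]


def is_novel(x, train, threshold):
    """Flag a profile that sits farther from training than the cutoff."""
    return nn_distance(x, train) > threshold


# Toy "profiles" in two dimensions (made up).
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
threshold = fit_threshold(train)

familiar = is_novel((0.4, 0.6), train, threshold)
unfamiliar = is_novel((5.0, 5.0), train, threshold)
```

Here the point near the training cloud passes and the distant point is flagged. A check like this says only that an input is unlike the training data, not what it is.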

The cross-platform gap, in full

The hardest case is a single tumor sequenced on a different pipeline from the one the model was trained on. We publish that gap rather than hide it. On those off-pipeline single tumors, the model abstains on roughly two-thirds of cases, and on the remaining third, where it does commit, it is right about 98% of the time. So it declines rather than confidently misrouting. We do not quote the 90% coverage figure for off-pipeline inputs, because conformal coverage is an in-distribution property and does not hold under platform shift. The honest framing is selective: it answers less, but it is right when it answers.
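The selective framing comes down to two numbers: how often the model answers, and how often it is right when it does. The sketch below computes both, plus the share of all cases that are confidently misrouted. The 300 toy cases are invented to mirror the rough fractions quoted above and are not the evaluation data; the function name is hypothetical.

```python
def selective_report(preds, truths):
    """Answer rate, accuracy on answered cases, and misrouted share.

    A prediction of None means the model abstained.
    """
    answered = [(p, t) for p, t in zip(preds, truths) if p is not None]
    answer_rate = len(answered) / len(preds)
    if answered:
        selective_acc = sum(p == t for p, t in answered) / len(answered)
    else:
        selective_acc = float("nan")
    misrouted = sum(p != t for p, t in answered) / len(preds)
    return answer_rate, selective_acc, misrouted


# Made-up cases: 200 abstentions, 98 correct calls, 2 wrong calls.
truths = ["Lung"] * 300
preds = [None] * 200 + ["Lung"] * 98 + ["Breast"] * 2

answer_rate, selective_acc, misrouted = selective_report(preds, truths)
print(f"answered={answer_rate:.1%} right_when_answered={selective_acc:.1%} "
      f"misrouted_overall={misrouted:.2%}")
```

On these toy counts the model answers about a third of the time, is right on 98% of what it answers, and confidently misroutes under 1% of all cases.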

Where it is weak

The same rigor that produced the numbers above produced these. They are part of the model.

It does not recognize real mesothelioma

On real mesothelioma cases the model scores 0% recall and confidently sends them elsewhere. An earlier per-site figure for Pleura and Mediastinum reflected one external batch's signature rather than the biology; we found it with the per-source split and corrected it. Treat Pleura and Mediastinum output as unreliable.

Rare sites are overconfident and data-starved

On a handful of rare sites (for example, Soft Tissue and Esophagus) the stated confidence runs well above measured accuracy. Those sites have very few examples, close to the entire public universe of their kind, so the data is exhausted and stronger models do not move the number. It is a data ceiling, not a model ceiling.

The guarantees are in-distribution only

Calibration and conformal coverage are measured on a held-out split from the same distribution and do not hold under platform or batch shift. The novelty and input-validity checks exist for exactly that reason, and are themselves reference-only.

No fairness audit, no clinical validation

All evaluation is retrospective on public cohorts. There is no prospective study, no independent clinical-site validation, and no subgroup-equity audit. Demographic representativeness is uncharacterized, and performance on underrepresented groups is unmeasured and may be worse.

Read the rest

See the full model card, the input and output contract in Docs, and the responsible-use notes on Safety.
