Validation
Every number here is from held-out data the model never trained on, scored adversarially. The same page that states the wins states the failures. That is the point.
Macro-F1 weights all 25 sites equally, so the common cancers cannot carry the rare ones. We report it alongside balanced accuracy instead of top-line accuracy, which would look higher for the wrong reasons. The 0.908 figure is measured on real patients only. A higher figure exists on a mixed set that includes easier external samples, and we do not lead with it.
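To make the difference between the metrics concrete, here is a minimal sketch in plain Python. The two-class toy data and every function name are illustrative assumptions, not our evaluation code or our results. It shows a classifier that ignores a rare class still posting high accuracy while macro-F1 and balanced accuracy expose the miss.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases predicted correctly, dominated by common classes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, label):
    """F1 for one class; defined as 0.0 when the class is never hit."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts the same."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    labels = sorted(set(y_true))
    recalls = []
    for lab in labels:
        total = sum(1 for t in y_true if t == lab)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Toy data (hypothetical): 18 common-site cases, 2 rare-site cases,
# and a classifier that always answers "common".
y_true = ["common"] * 18 + ["rare"] * 2
y_pred = ["common"] * 20

toy_accuracy = accuracy(y_true, y_pred)
toy_balanced = balanced_accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
```

On this toy set accuracy is 0.90, while balanced accuracy is 0.50 and macro-F1 is about 0.47, because the rare class contributes a zero that the common class cannot average away.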
The validation was built to try to break our own results, which is why the limitations further down are stated plainly rather than hedged.
A site prediction is only useful if its confidence is meaningful and the model can decline when it should. Three independent mechanisms make that true.
Raw scores are temperature-scaled so a stated probability matches reality. Calibration error drops from 0.091 to 0.011 on held-out data, roughly an 8x reduction, so "80% confident" means about 80% in practice (in-distribution).
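As a rough illustration of the idea, the sketch below fits a single temperature by grid search on negative log-likelihood and measures expected calibration error (ECE) with equal-width confidence bins. The grid, the bin count, the function names, and the toy logits are all assumptions made for the example; they are not our production settings, and the toy fit and measurement share one dataset only for brevity (in practice the temperature is fit on a calibration split and ECE is measured on separate held-out data).

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens scores)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at this temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels):
    """Pick the temperature on an assumed grid (0.5 to 5.0) that minimizes NLL."""
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


def expected_calibration_error(logit_rows, labels, temperature=1.0, n_bins=10):
    """Weighted mean gap between confidence and accuracy across bins."""
    bins = [[] for _ in range(n_bins)]
    for row, y in zip(logit_rows, labels):
        probs = softmax(row, temperature)
        conf = max(probs)
        pred = probs.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pred == y))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            acc = sum(ok for _, ok in members) / len(members)
            ece += len(members) / len(labels) * abs(avg_conf - acc)
    return ece


# Toy data (hypothetical): an overconfident model that says ~96% but is
# right only 7 times out of 10.
logits = [[4.0, 0.0, 0.0]] * 10
labels = [0] * 7 + [1] * 3

fitted_t = fit_temperature(logits, labels)
ece_before = expected_calibration_error(logits, labels, 1.0)
ece_after = expected_calibration_error(logits, labels, fitted_t)
```

Temperature scaling rescales confidence without changing which site ranks first, so it cannot alter accuracy; it only changes how much the stated probability can be trusted.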
Instead of forcing a single answer, the model returns a candidate set with a 90% coverage target, meaning the set is built to contain the true site in about 90% of in-distribution cases. About 90% of cases resolve to one confident site; when the evidence is genuinely ambiguous it returns the short list it cannot rule out, and when nothing clears the bar it abstains rather than guess.
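One common way to build such a set is split conformal prediction. The sketch below assumes the simplest nonconformity score (one minus the probability assigned to the true site) and treats an empty set as an abstention. The score choice, the site names, the helper names, and the toy calibration data are assumptions for illustration and may differ from what the model actually uses.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.10):
    """Score cutoff from a held-out calibration split for 1 - alpha coverage."""
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration cases: include every site
    return scores[k - 1]


def predict_set(probs, qhat, site_names):
    """Every site whose score (1 - probability) is within the cutoff."""
    return [site_names[i] for i, p in enumerate(probs) if 1.0 - p <= qhat]


def describe(candidates):
    """Map a candidate set onto the three behaviours: single, short list, abstain."""
    if not candidates:
        return "abstain"
    if len(candidates) == 1:
        return "single site: " + candidates[0]
    return "short list: " + ", ".join(candidates)


def toy_row(p_true):
    """Hypothetical 4-site probability row with the true site at index 0."""
    rest = (1.0 - p_true) / 3
    return [p_true, rest, rest, rest]


# Illustrative site names and calibration data, not taken from the model.
SITES = ["Lung", "Breast", "Colon", "Kidney"]
cal_p_true = [0.95] * 14 + [0.60] * 2 + [0.30] * 2 + [0.10]
cal_probs = [toy_row(p) for p in cal_p_true]
cal_labels = [0] * len(cal_probs)

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.10)

confident = describe(predict_set([0.90, 0.05, 0.03, 0.02], qhat, SITES))
ambiguous = describe(predict_set([0.45, 0.40, 0.10, 0.05], qhat, SITES))
nothing = describe(predict_set([0.28, 0.26, 0.24, 0.22], qhat, SITES))
```

The coverage guarantee behind this construction assumes new cases are exchangeable with the calibration cases, which is why it is an in-distribution property.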
A distance check flags profiles unlike anything in training, and an input-validity gate catches profiles that are not tumor RNA-seq at all. The gate flagged 100% of held-out normal-tissue samples with no tumor false positives. These are guardrails, not a clinical detector.
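A distance check of this kind can be sketched as follows. This is a deliberately simple stand-in that assumes Euclidean distance to the nearest per-site centroid and a cutoff taken from a quantile of training distances; the metric, the quantile, the two-dimensional toy features, and all names are assumptions, and the model's actual check may work differently.

```python
import math


def centroid(rows):
    """Per-feature mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_distance(profile, centroids):
    """Distance from a profile to the closest site centroid."""
    return min(euclidean(profile, c) for c in centroids.values())


def fit_novelty_check(train_by_site, quantile=0.99):
    """Return centroids and a cutoff set at a quantile of training distances."""
    centroids = {site: centroid(rows) for site, rows in train_by_site.items()}
    dists = sorted(nearest_distance(row, centroids)
                   for rows in train_by_site.values() for row in rows)
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return centroids, dists[idx]


def is_novel(profile, centroids, threshold):
    """Flag a profile that sits farther out than nearly all training data."""
    return nearest_distance(profile, centroids) > threshold


# Toy training data (hypothetical): two tight clusters in a 2-D feature space.
train_by_site = {
    "site_a": [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]],
    "site_b": [[5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.2, 5.2]],
}
centroids, threshold = fit_novelty_check(train_by_site)

familiar_flag = is_novel([0.1, 0.15], centroids, threshold)
unfamiliar_flag = is_novel([2.5, 2.5], centroids, threshold)
```

A check like this only says a profile looks unfamiliar. It says nothing about why, which is one reason to treat it as a guardrail and not a detector.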
The hardest case is a single tumor sequenced on a different pipeline than the model trained on. We publish that gap rather than hide it. On those off-pipeline single tumors, the model abstains on roughly two-thirds of cases, and on the remaining third, where it does commit, it is right about 98% of the time. So it declines rather than confidently misroutes. We do not quote the 90% coverage figure for off-pipeline inputs, because conformal coverage is an in-distribution property and does not hold under platform shift. The honest framing is selective: it answers less, but it is right when it answers.
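The two numbers in that framing, how often the model abstains and how often it is right when it commits, can be computed as in the sketch below. The function name, the use of None to mark an abstention, and the six toy cases are assumptions for illustration; the toy values are not our measured results.

```python
def selective_report(predictions, truths):
    """Abstention rate plus accuracy restricted to the cases that were answered.

    predictions: one entry per case, a site name or None for an abstention.
    truths: the true site for each case.
    """
    answered = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    total = len(predictions)
    abstain_rate = (total - len(answered)) / total
    if answered:
        selective_accuracy = sum(p == t for p, t in answered) / len(answered)
    else:
        selective_accuracy = None  # undefined when nothing was answered
    return {
        "abstain_rate": abstain_rate,
        "answered": len(answered),
        "selective_accuracy": selective_accuracy,
    }


# Toy data (hypothetical): six off-pipeline cases, four abstentions.
toy_predictions = [None, "Lung", None, None, "Colon", None]
toy_truths = ["Lung", "Lung", "Breast", "Kidney", "Colon", "Lung"]

report = selective_report(toy_predictions, toy_truths)
```

Reporting both numbers together matters: selective accuracy alone can be made arbitrarily high by abstaining on almost everything.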
The same rigor that produced the numbers above produced these. They are part of the model.
On real mesothelioma cases the model scores 0% recall and confidently sends them elsewhere. An earlier per-site figure for Pleura and Mediastinum reflected one external batch's signature rather than the biology; we found it with the per-source split and corrected it. Treat Pleura and Mediastinum output as unreliable.
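A per-source split of this kind amounts to computing recall for one site separately for each data source instead of pooling everything. The sketch below uses made-up records and hypothetical source names to show how a pooled figure can hide a source that contributes nothing; it is not our evaluation code or data.

```python
from collections import defaultdict


def recall_by_source(records, site):
    """Recall for one site, computed separately per data source.

    records: iterable of (source, true_site, predicted_site) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for source, truth, pred in records:
        if truth == site:
            totals[source] += 1
            hits[source] += int(pred == site)
    return {source: hits[source] / totals[source] for source in totals}


def pooled_recall(records, site):
    """Recall for one site with all sources lumped together."""
    relevant = [(t, p) for _, t, p in records if t == site]
    return sum(t == p for t, p in relevant) / len(relevant)


# Toy records (hypothetical sources and predictions).
toy_records = (
    [("external_batch", "Pleura", "Pleura")] * 3
    + [("real_patients", "Pleura", "Lung")] * 2
    + [("real_patients", "Lung", "Lung")]
)

split = recall_by_source(toy_records, "Pleura")
pooled = pooled_recall(toy_records, "Pleura")
```

In the toy data the pooled recall is 0.6, yet the split shows 1.0 on one source and 0.0 on the other, which is the pattern that signals a batch signature and not biology.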
On a handful of rare sites the stated confidence runs well above measured accuracy (for example Soft Tissue and Esophagus). Those sites have very few examples, close to the entire public universe of their kind, so the data is exhausted and stronger models do not move the number. It is a data ceiling, not a model ceiling.
Calibration and conformal coverage are measured on a held-out split from the same distribution and do not hold under platform or batch shift. The novelty and input-validity checks exist for exactly that reason, and are themselves reference-only.
All evaluation is retrospective on public cohorts. There is no prospective study, no independent clinical-site validation, and no subgroup-equity audit. Demographic representativeness is uncharacterized, and performance on underrepresented groups is unmeasured and may be worse.
See the full model card, the input and output contract in Docs, and the responsible-use notes on Safety.