When data and methods cannot be traced, science stops being falsifiable
Results can spread without remaining testable, allowing hidden flaws to propagate.
If you find value in #ComplexityThoughts, consider helping it grow by subscribing and sharing it with friends, colleagues or on social media. Your support makes a real difference.
→ Don’t miss the podcast version of this post: click on “Spotify/Apple Podcast” above!

A reproducibility failure in microbiome research and a wave of biomedical AI studies built on unverifiable data point to the same issue: scientific results can circulate and gain authority without remaining testable.
A recent post by the Meren Lab1 describes a failed attempt to reproduce a published microbiome study. The data were technically available, as required by the journal, yet essential metadata were missing. Without information linking samples to conditions, the dataset could not be interpreted, and the results could not be independently verified.
Around the same time, a piece in Nature covers a recent preprint that identified more than one hundred biomedical AI studies trained on datasets with unclear or implausible provenance. These datasets exhibit statistical properties that are difficult to reconcile with real-world data, yet they have been reused across multiple subsequent publications.
These two cases point to the same structural condition: scientific claims can be published, reused, and trusted without remaining practically falsifiable.
The issue extends beyond isolated mistakes or occasional failures in peer review. The system allows results to circulate in a form that cannot be meaningfully challenged. Once this condition is present, errors do not remain local and propagate.
Science as a dependency network
Scientific production can be described as a dependency network: datasets function as inputs, models as derived artifacts, and downstream studies as successive layers of reuse. Properties of the input layer constrain what can be validated at higher levels.
A parallel can be drawn with a software supply-chain vulnerability (CVE-2024-3094), in which a flaw introduced in a low-level dependency propagated upward into widely used infrastructure. The mechanism is straightforward: trust in upstream components allows faults to remain undetected until they reach sensitive layers.

In biomedical AI, datasets play a similar role: when their provenance is uncertain, downstream validation operates on unstable ground.
Mechanisms
Data without traceable origin. The first mechanism is the presence of data without traceable origin.
Some of the datasets described in the Nature piece lack clear documentation of how they were collected, curated, or processed. In some cases, their statistical regularities appear unusually clean, with patterns that are difficult to reconcile with known variability in clinical data. These properties do not demonstrate fabrication, but they prevent independent verification.
Unverifiable inputs set a limit on what can be established downstream: a model trained on such data can be evaluated, benchmarked, and deployed, yet the validity of its outputs remains conditional on assumptions that cannot be examined.
The problem lies in the missing link between the data and the process that generated it.
Validation without source accountability. The second mechanism concerns validation practices that do not extend to data provenance.
Evaluation in biomedical AI focuses on model performance: accuracy, sensitivity, specificity, and related metrics. These measures assess how well a model reproduces patterns present in a dataset, they do not assess whether the dataset itself is a faithful representation of the underlying phenomenon.
The Meren Lab case illustrates a related failure where the dataset was shared, satisfying formal requirements for openness, yet it could not be used. Without complete metadata, no independent researcher could reconstruct the analytical pipeline or test the original claims.
This produces a specific condition: results are accessible, but they cannot be tested. Formal compliance replaces functional verification. The consequence is an epistemic blind spot, where claims appear validated because they satisfy existing criteria, while the layer where the failure originates remains unexamined.
Trust accumulates through reuse. The third mechanism is the propagation of trust through reuse.

In the above picture, all scientific papers subsequent to the top one make use of its results and/or outputs to build new research and justify additional outputs. This is how science works.
The same mechanisms works for datasets of uncertain origin that are reused in subsequent studies, incorporated into aggregated datasets, and used to train new models in AI-based research. Each reuse is typically performed in good faith and, over time, repeated use creates the appearance of reliability. This process operates as a feedback loop between trust and reuse:
→ initial uncertainty is not resolved,
→ it becomes distributed across multiple applications,
→ as the dataset appears in more studies, its status shifts toward acceptance without the introduction of new evidence.
In AI pipelines, this effect is further amplified: models trained on such data can be fine-tuned, combined with other models, or used as benchmarks, and the original uncertainty becomes embedded across multiple layers of derived artifacts.
At that stage, identifying the source of error becomes difficult because it is dispersed across the system.
What is the practical implication? Restoring falsifiability as a system property
This structure has direct implications for validation, since reproducing every published result during peer review is not feasible — particularly in experimental and clinical contexts — and the associated cost and time requirements exceed what the current system can support.
At the same time, allowing results to circulate without enforceable conditions for falsifiability introduces a different type of risk: errors introduced upstream can propagate into applications where their consequences extend beyond research settings.
I do not have a clear solution, at the moment. A more tractable approach is to enforce minimal conditions that preserve the possibility of falsification.
First, data provenance should be treated as a primary object of validation. Datasets need structured metadata describing collection protocols, preprocessing steps, and inclusion criteria. This information should be integral to publication requirements rather than supplementary material.
Second, availability should be distinguished from executability. Data and code need to be usable by third parties: minimal reproducibility checks at publication can verify that pipelines run and outputs can be regenerated. This does not establish correctness, but at least it ensures that claims remain open to scrutiny.
Third, replication and falsification require explicit incentives. Reproducing existing work, particularly when it leads to negative results, is currently under-rewarded. Dedicated funding streams, publication formats for replication studies, and recognition of validation work would alter this incentive structure.
Fourth, verification efforts should be targeted. Priority can be assigned to clinically relevant models, widely reused datasets, and benchmarks that influence large segments of the field. This concentrates resources where propagation risks are highest.
Fifth, data pipelines should be treated as systems exposed to identifiable failure modes. Independent dataset reconstruction, cross-cohort validation, and statistical screening for anomalous patterns introduce redundancy: these measures reduce the likelihood that a single point of failure remains undetected.
Take home
Errors are an expected component of scientific practice. The concern lies in how systems handle them once they occur.
The Meren Lab case shows that a result can enter the literature in a form that cannot be tested. The biomedical AI cases show how such inputs can be reused and extended across multiple studies.
Under these conditions, producing new results is insufficient: scientific systems also need to preserve the practical conditions under which existing claims can still be tested, corrected, or rejected.
→ Please, remind that if you find value in #ComplexityThoughts, you might consider helping it grow by subscribing, or by sharing it with friends, colleagues or on social media. See also this post to learn more about this space.
I thank Thilo Gross for pointing me to it
