On the Path to
Trustworthy AI in Medicine
Holistic Quality Assurance from Development to Auditing
As a project partner in Pillar 2 of MISSION KI, DFKI has researched key foundations for the long-term, safe, and trustworthy use of artificial intelligence (AI) in high-risk areas such as medicine.
Trust is essential for the safe use of AI in high-risk areas such as medicine. Compliance with regulatory requirements such as the EU AI Act forms its foundation. However, the long-term, safe use of trustworthy AI systems requires technically excellent solutions across all phases of the AI life cycle – from development to testing.
The ever-increasing complexity of modern AI systems and the multitude of potential deployment scenarios make it considerably harder to implement and test these requirements in a scalable and transparent way. The technically sound demonstration of the various dimensions of trustworthiness is complex: assessments are context-dependent, thresholds must be defined meaningfully, and unpredictable real-world uses of AI lead to high testing costs.
Complementing the MISSION KI quality standard, the German Research Center for Artificial Intelligence (DFKI) has made various contributions within Pillar 2 of MISSION KI to pave the way for trustworthy medical AI and to simplify the auditing of these dimensions. The project has developed conceptual foundations for two central platforms for creating and verifying the trustworthiness of AI systems – the Quality Platform and the Test Platform.
Both platforms complement the quality standard by supporting high-risk applications and partially automating risk assessment and AI debugging. Development and testing are based on real-world use cases from medical fields such as dermatology, oncology, and psychotherapy, addressing the practical challenges of building trustworthy AI systems in highly regulated environments.
Quality Platform
Operationalization of Requirements and Risks of AI Systems
A major discrepancy between regulatory requirements and their technical feasibility can significantly hinder the development of trustworthy AI systems. Translating vaguely defined, rapidly changing AI regulations into concrete engineering and testing steps poses considerable challenges for practitioners. A common knowledge base that is both human-readable and machine-interpretable is required.
DFKI addresses these challenges by defining a uniform terminology that links requirements for the quality and trustworthiness of AI systems with known risks, testing tools and risk mitigation measures. A continuously growing knowledge base forms the basis for the prototype of a Quality Platform that will help developers identify risks at an early stage and take appropriate measures to mitigate them.
How will the Quality Platform work?
In four steps, the Quality Platform aims to support developers and technical experts in translating abstract requirements into concrete measures for minimizing risk (a data-model sketch follows the four steps below):
1. Structure Requirements
The structured knowledge base allows individual requirements from norms, standards, laws and practice to be entered.
2. Map Context
The platform enables the assignment of project contexts to relevant quality requirements and the identification of suitable test resources.
3. Analyze Risks
Anticipated risks are automatically derived. Complex and critical cases can be forwarded to subject matter experts for review. The system simultaneously suggests appropriate risk mitigation measures.
4. Continuous Improvement
The capabilities of the knowledge base are continuously expanded by documenting user feedback, test results and evaluations.
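To make the four steps concrete, here is a minimal sketch of how such a machine-interpretable knowledge base could link requirements, risks, and mitigations, and how anticipated risks might be derived from a project context (step 3). All class names, fields, and the example entry are illustrative assumptions, not the platform's actual schema.

```python
# Minimal sketch of a machine-readable knowledge base linking requirements,
# risks, and mitigations. All names and the example entry are illustrative
# assumptions, not the Quality Platform's actual schema.
from dataclasses import dataclass, field

@dataclass
class Requirement:
    req_id: str      # e.g. a clause from a norm, standard, or law (step 1)
    source: str
    text: str

@dataclass
class Mitigation:
    name: str
    description: str

@dataclass
class Risk:
    name: str
    context_tags: set        # project contexts where the risk applies (step 2)
    requirements: list = field(default_factory=list)
    mitigations: list = field(default_factory=list)

def anticipated_risks(knowledge_base, project_context):
    """Step 3: derive anticipated risks by matching project context tags."""
    return [r for r in knowledge_base if r.context_tags & project_context]

# Example entry: a dermatology project is matched against the knowledge base.
kb = [Risk(
    name="unfair bias across skin tones",
    context_tags={"dermatology", "imaging"},
    requirements=[Requirement("AIA-Art-10", "EU AI Act", "data governance and bias examination")],
    mitigations=[Mitigation("subgroup evaluation", "evaluate metrics per skin-tone group")],
)]
for risk in anticipated_risks(kb, {"dermatology"}):
    print(risk.name, "->", [m.name for m in risk.mitigations])
```

Documented test results and expert feedback (step 4) would then extend this knowledge base over time.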
The AI Debugger
Tool for supporting the development of trustworthy AI
The AI Debugger aims to help developers associate datasets, models, and pipelines with relevant risks. It also proposes and implements context-specific risk mitigation measures, either fully automatically or under human oversight, in which case the user approves each step. Rather than relying on fixed rules, the AI Debugger draws on the Quality Platform's knowledge base to support the effective development of trustworthy and compliant AI. A simplified sketch of this interaction follows.
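As a rough illustration of the intended workflow, the sketch below matches dataset properties against known risks, with optional user approval. The rule table is invented for illustration; the actual tool queries the knowledge base rather than a hard-coded dictionary.

```python
# Hedged sketch of an AI-Debugger-style check: dataset properties are matched
# against known risks, and mitigations are applied only after approval when
# human oversight is requested. The rule table is invented for illustration.
KNOWN_RISKS = {
    "class_imbalance": "apply re-weighting or resampling",
    "missing_demographics": "collect attributes needed for fairness testing",
}

def debug_dataset(properties, human_oversight=True):
    applied = []
    for risk, mitigation in KNOWN_RISKS.items():
        if not properties.get(risk):
            continue  # risk not present in this dataset
        if human_oversight:
            answer = input(f"Risk '{risk}' found. Apply '{mitigation}'? [y/n] ")
            if answer.strip().lower() != "y":
                continue
        applied.append(mitigation)  # fully automated or user-approved
    return applied

print(debug_dataset({"class_imbalance": True}, human_oversight=False))
```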
Central Features of the Quality Platform
A unified quality platform reduces compliance overhead and strengthens trust in the deployed AI. The AI Debugger accelerates the development of trustworthy AI while simultaneously supporting the creation of valuable knowledge.
Common Language
A unified terminology that brings together legal, product-related, and technical perspectives.
Operationalization of Abstract Requirements
Abstract requirements are transformed into concrete, testable artifacts with end-to-end traceability and verifiability.
Reuse Knowledge Instead of Reinventing It
Risks and solutions are captured centrally once and can be used across projects and domains.
Human Oversight
New knowledge is reviewed by subject matter experts before being incorporated into the knowledge base corpus – ensuring the highest quality and trustworthiness.
Test Platform
Towards Fully Automated Auditing of AI Systems
DFKI has developed a concept for a Test Platform intended to enable partly automated, reproducible, and scalable technical audits of AI systems in the future. In the process, testing tools for the dimensions of Transparency, Non-Discrimination, Reliability, and AI-Specific Cybersecurity were consolidated and further developed.
Testing Tools
From Abstract Requirements to Measurable Metrics
Trustworthy AI systems must be able to present their decisions in a comprehensible manner, be robust against attacks and disturbances, ensure data protection, and deliver fair results for all user groups. The test platform addresses the quantification and verification of these requirements by developing and integrating specialized modules for each of these dimensions.
Transparency - Explainability
AI systems should not only provide explanations for their decisions - the utility and truthfulness of these explanations must also be verifiable. The vXAI Framework structures the evaluation of explainability methods by defining requirements for good explanations and providing a clear categorization scheme for metrics. This allows the quick and automatic selection of appropriate test tools for AI explainability, as well as the identification of gaps in the evaluability of explanations. In addition, a new metric has been developed that enables complementary perspectives for evaluating the truthfulness of attribution-based explanations with greater efficiency.
More information can be found on the vXAI Framework website and in the accompanying publication.
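For intuition, here is a minimal sketch of a standard deletion-style faithfulness check for attribution-based explanations – a well-known baseline, not the new metric from the publication: if an attribution is truthful, removing the features it ranks as most important should degrade the model's score fastest.

```python
# Deletion-style faithfulness check for attribution explanations (a common
# baseline, not the project's new metric): occlude features in order of
# attributed importance and record how the model's score decays.
import numpy as np

def deletion_curve(predict, x, attribution, baseline=0.0, steps=4):
    """predict maps a 1-D feature vector to a scalar class score."""
    order = np.argsort(-attribution)        # most important features first
    x_del = x.copy()
    scores = [predict(x_del)]
    for chunk in np.array_split(order, steps):
        x_del[chunk] = baseline             # "delete" the next feature group
        scores.append(predict(x_del))
    return np.array(scores)                 # a steep early drop suggests faithfulness

# Toy example: a linear scorer whose weight magnitudes double as the attribution.
w = np.array([3.0, -1.0, 0.5, 2.0])
x = np.ones(4)
print(deletion_curve(lambda v: float(w @ v), x, attribution=np.abs(w * x)))
```

The area under such a curve is a common scalar summary for comparing explanation methods.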
Non-Discrimination - Fairness
The more widely AI systems are used in critical decision-making processes, the more serious the effects of unfair bias become: systematic discrimination scales with the technology and can lead to the structural disadvantage of entire population groups. Continuous and systematic fairness testing is therefore essential to ensure that AI decisions remain fair and that trust in these systems is justified. A concept for a Fairness Testing Pipeline represents an initial approach to the automated evaluation of the fairness of AI systems, taking full account of the application context. Automated testing thus covers various definitions of fairness and enables differentiated analyses of the sources of unfair bias. One such check is sketched below.
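The following sketch shows one stage such a pipeline might contain: computing group-wise fairness metrics such as the demographic parity and equal opportunity gaps. Function names and group definitions are assumptions; in practice, the metric choice comes from the application context.

```python
# Hedged sketch of one fairness-testing stage: demographic parity and
# equal opportunity gaps across protected groups. Group labels and data
# are illustrative.
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    sel_rates, tprs = {}, {}
    for g in np.unique(group):
        mask = group == g
        sel_rates[g] = y_pred[mask].mean()                     # selection rate
        pos = mask & (y_true == 1)
        tprs[g] = y_pred[pos].mean() if pos.any() else np.nan  # true positive rate
    return {
        "demographic_parity_gap": max(sel_rates.values()) - min(sel_rates.values()),
        "equal_opportunity_gap": np.nanmax(list(tprs.values())) - np.nanmin(list(tprs.values())),
    }

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
group  = np.array(["A", "A", "A", "B", "B", "B"])
print(fairness_gaps(y_true, y_pred, group))  # gaps near 0 indicate parity
```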
Reliability - Robustness
AI systems must be robust against changes in the system environment, data, and other disturbances to function reliably in safety-critical applications. A prototypical Robustness Toolbox enables systematic tests against adversarial attacks, data perturbations, and out-of-distribution scenarios. This allows vulnerabilities to be identified and rectified early before they cause problems in real-world deployment.
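As a simple illustration of such a test, the sketch below measures how a stand-in classifier's accuracy degrades under increasing input noise; a real toolbox would add adversarial attacks and out-of-distribution checks on top of this.

```python
# Minimal sketch of a robustness check: accuracy under increasing Gaussian
# input noise. The classifier and data are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def accuracy_under_noise(predict, X, y, noise_levels):
    results = {}
    for sigma in noise_levels:
        X_noisy = X + rng.normal(0.0, sigma, X.shape)   # perturb the inputs
        results[sigma] = float((predict(X_noisy) == y).mean())
    return results

# Toy classifier: sign of the first feature.
predict = lambda X: (X[:, 0] > 0).astype(int)
X = rng.normal(0, 1, (200, 4))
y = predict(X)  # clean labels; accuracy at sigma=0 is therefore 1.0
print(accuracy_under_noise(predict, X, y, [0.0, 0.5, 1.0, 2.0]))
```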
AI-Specific Cybersecurity - Privacy
The protection of personal data is of crucial importance, especially in areas such as healthcare where particularly sensitive information is processed. A prototypical Privacy Toolbox provides a collection of tests for quantifying data protection risks such as Membership Inference and Model Inversion Attacks. This ensures that AI systems comply with applicable data protection regulations and that user privacy is maintained.
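A minimal sketch of one such quantification, assuming a simplified loss-threshold membership inference attack: if a model's loss on training samples is systematically lower than on unseen samples, an attacker can guess training membership, and the gap between true and false positive rates quantifies the leak.

```python
# Hedged sketch of a simplified loss-threshold membership inference test.
# The loss distributions are simulated; a real toolbox would compute them
# from the model under audit.
import numpy as np

def membership_inference_advantage(loss_train, loss_test):
    threshold = np.median(np.concatenate([loss_train, loss_test]))
    tpr = (loss_train < threshold).mean()   # members correctly identified
    fpr = (loss_test < threshold).mean()    # non-members misidentified
    return tpr - fpr                        # ~0 = safe, towards 1 = leaky

rng = np.random.default_rng(1)
loss_train = rng.exponential(0.2, 1000)     # overfit model: low training loss
loss_test = rng.exponential(1.0, 1000)
print(f"attack advantage: {membership_inference_advantage(loss_train, loss_test):.2f}")
```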
MedGenAI
Synthetic Data to Support Test Coverage
Alongside the audit dimensions, generative AI methods – particularly image synthesis – were investigated and extended to improve the testing of various aspects of trustworthiness where data coverage is difficult. One initial result is a prototype toolbox that artificially generates missing or underrepresented data, making it easier to test fairness and robustness aspects of AI systems.
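To illustrate the idea, the following sketch fills coverage gaps in a test set before fairness or robustness testing; `generate_images` is a hypothetical stand-in for a conditional image-synthesis model, not the MedGenAI API.

```python
# Sketch of coverage-gap filling with synthetic data. `generate_images` is
# a hypothetical generator interface; here it is replaced by a placeholder.
from collections import Counter

def coverage_gaps(labels, groups, target_per_group):
    counts = Counter(labels)
    return {g: max(0, target_per_group - counts[g]) for g in groups}

def balance_test_set(samples, labels, groups, target, generate_images):
    for group, missing in coverage_gaps(labels, groups, target).items():
        if missing:
            samples.extend(generate_images(group, n=missing))  # synthesize
            labels.extend([group] * missing)
    return samples, labels

# Toy run with a placeholder generator in place of a real synthesis model.
fake_generator = lambda group, n: [f"synthetic_{group}_{i}" for i in range(n)]
samples, labels = balance_test_set(
    ["img1", "img2"], ["light", "light"],
    groups=["light", "dark"], target=2, generate_images=fake_generator)
print(Counter(labels))  # both groups now reach the target count
```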
The Test Platform
From Model Description to Automated Tests
AI systems are complex – the concept for a test platform is therefore designed to break down AI applications into individual components and interfaces that can be tested in a targeted and systematic way. It consists of three central building blocks:
Application Card
Inspired by existing methods for documenting AI systems such as FactSheets, the Application Card describes AI systems in such a way that technical interfaces become transparent while the application context is precisely captured. This is because the technically sound evaluation of trustworthy AI depends largely on the context in which the system is used.
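A sketch of what such a card could contain is shown below; the field names are assumptions inspired by documentation approaches like FactSheets, not the project's actual schema.

```python
# Illustrative Application Card: structured metadata capturing both the
# technical interface and the application context. Field names are assumed.
application_card = {
    "system_name": "skin lesion classifier",
    "application_context": {
        "domain": "dermatology",
        "users": ["clinicians"],
        "criticality": "high-risk (medical)",
    },
    "technical_interface": {
        "input": {"type": "image", "shape": [224, 224, 3]},
        "output": {"type": "class_probabilities", "classes": 8},
        "endpoint": "predict(image) -> probabilities",
    },
    "data": {"protected_attributes": ["age", "skin_tone"]},
}
```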
Test Registry
The results of the Application Card are semi-automatically processed by the Test Registry to identify relevant and executable tests from the available portfolio. The Test Registry not only documents the applicability and requirements of the tests, but also provides interpretation aids for the results.
Execution Engine
The Execution Engine eliminates manual checklists and prevents forgotten test cases. It takes over the execution – efficient, reproducible, and documented.
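A hedged sketch of how the three building blocks could interact: tests in the registry declare the interfaces they require, the registry filters them against an Application Card, and the Execution Engine runs the remainder and documents a verdict for each. All names and structures are illustrative.

```python
# Sketch of registry-driven test execution. Test contents are placeholders;
# a real registry would also carry interpretation aids for each result.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RegisteredTest:
    name: str
    requires: set                   # interface capabilities the test needs
    run: Callable[[dict], str]      # returns an interpretable verdict

def applicable_tests(registry, card_capabilities):
    """Registry step: keep only tests whose requirements the card satisfies."""
    return [t for t in registry if t.requires <= card_capabilities]

def execute(registry, card):
    """Engine step: run every applicable test and document its verdict."""
    return {t.name: t.run(card)
            for t in applicable_tests(registry, set(card["capabilities"]))}

registry = [
    RegisteredTest("noise robustness", {"predict"}, lambda c: "pass"),
    RegisteredTest("subgroup fairness", {"predict", "protected_attributes"},
                   lambda c: "pass"),
]
card = {"capabilities": ["predict", "protected_attributes"]}
print(execute(registry, card))
```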
Testing based on real medical use cases
To ensure the timely deployment of trustworthy AI systems, the developed methods and tools were tested on real medical use cases selected to represent different challenges and requirements for trustworthy AI. In addition to ExAID, SkinDoc, and KITTU – systems previously developed at DFKI – three further use cases from the fields of psychotherapy, anesthesiology, and resource optimization were realized in close cooperation with our project partners.
ExAID & SkinDoc - AI-Supported Skin Cancer Detection
With ExAID and SkinDoc, two AI systems were developed to support dermatological diagnostics; such systems must meet high requirements for explainability, robustness, and fairness to be used in clinical practice. ExAID analyzes dermatoscopic images of skin lesions and provides explainable diagnoses for various skin diseases, while SkinDoc is designed as a mobile application for the early detection of skin cancer by laypersons.
KITTU - AI-assisted therapy support in urological oncology
The KITTU system supports urological oncologists by providing treatment recommendations for urothelial, renal and prostate cancer. It uses clinical information to make accurate predictions and stands out in particular for its ability to draw on external information to explain its decisions.
PsyRAI – Psychological Rater AI for automated feedback in psychotherapist training
In collaboration with the Department of Psychology at the University of Trier, DFKI has developed a multi-agent-based AI system with transparent inter-agent communication for the automated support of evaluation procedures in the training of prospective therapists. The system can automatically generate evaluations of therapists' verbal responses based on the audio from video recordings of therapy situations.
XAIrway - Intelligent complication prediction in the operating room
Intubations lead to complications in approximately 8% of cases; these complications can be fatal, and around 90% of them occur unexpectedly. Together with Saarland University Hospital (UKS), an AI-based system was developed that predicts intubation risks before operations based on endoscopy videos. The system achieved an accuracy of over 90% in internal validation. Mechanisms for concept-based explainability were also implemented to identify previously unknown risk factors.
DRK RESPOND - Resource Estimation & Simulation Platform for Operational Needs and Dispatch
In collaboration with the German Red Cross (DRK), DFKI explored how AI can support smarter emergency response planning. The designed system uses spatio-temporal data and operational context to forecast where and when emergency services are likely to be needed. It also supports data-driven planning by predicting when ambulances on duty will be ready for their next mission. By simulating different dispatch scenarios under real-world conditions, RESPOND enables dispatch strategies to be compared and validated in a straightforward way.
Next Steps on the Path to Trustworthy AI in Medicine
DFKI's work on trustworthy AI is an ongoing commitment. The contributions within MISSION KI are an important step on this journey, but by no means the end. We plan to continuously advance our research and development and further expand the platforms. Feel free to contact us for pilot projects, technical questions, or contributions to further development.
IQZ Community Summit on December 11, 2025
For deeper insights into our work and the opportunity for direct exchange, we cordially invite you to our IQZ Community Summit on December 11, 2025. Learn more about the latest developments, discuss with experts, and become part of our growing community. More information and registration can be found on our website.