Health AI and the Unvalidated Chatbot: What Health Tech Companies Need to Know

A peer-reviewed audit published in BMJ Open tested five widely used consumer AI chatbots across fifty health questions spanning cancer, vaccines, nutrition, stem cells, and athletic performance. Using a methodology that included adversarial prompting designed to stress-test the models against real-world misinformation patterns, the researchers found that nearly half of all responses (49.6%) were flagged as problematic. One in five was classified as highly problematic. Median citation completeness was 40%. No chatbot produced a fully accurate reference list for any question. Across 250 responses, there were only two refusals to answer, both from Meta AI. ChatGPT, Gemini, Grok, and DeepSeek refused nothing.


The study used free-tier versions of the models, including ChatGPT 3.5, the version available from November 2022. Paid API versions may perform differently, and health tech companies should conduct their own accuracy assessments for their specific deployment context. That qualification noted, the findings document real failure modes under realistic query conditions, and the pattern they reveal has direct regulatory and commercial implications.


What The Audit Found


Performance varied significantly by clinical domain. Vaccines and cancer were the best-performing categories, with adjusted z-scores of -2.57 and -2.12 respectively, though even in these areas roughly one in four responses was problematic. Nutrition and athletic performance were the worst, with z-scores of +4.35 and +3.74. Stem cells fell in between. Health tech companies building in metabolic health, nutrition, sports medicine, or rehabilitation are therefore operating in territory where the audit recorded the highest rates of problematic output. This is the kind of domain-specific accuracy data that any pre-deployment threshold assessment under Article 15 of the EU AI Act must engage with directly.
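The audit's exact adjustment method is not reproduced here, but the direction of those scores can be illustrated with a simple one-proportion z-test against the overall 49.6% problematic rate. In this minimal sketch, the per-domain rates and the assumption of 50 responses per domain (five chatbots, ten questions each) are illustrative, not the audit's raw data:

```python
import math

def proportion_z(p_domain: float, p_overall: float, n: int) -> float:
    """One-proportion z-score: how far a domain's problematic rate sits
    from the overall rate, in standard errors. Negative means fewer
    problems than average; positive means more."""
    se = math.sqrt(p_overall * (1 - p_overall) / n)
    return (p_domain - p_overall) / se

P_OVERALL = 0.496  # overall problematic rate reported by the audit

# Illustrative domain rates; n = 50 responses per domain is assumed.
for domain, rate in [("vaccines", 0.22), ("nutrition", 0.64)]:
    print(domain, round(proportion_z(rate, P_OVERALL, n=50), 2))
```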


The type of query also matters substantially. Open-ended prompts, the kind most consumers actually use, produced 32% highly problematic responses. Closed-ended prompts produced 7%. Conversational interfaces and symptom checkers, where users ask open questions rather than binary ones, carry a structurally higher accuracy risk than the headline figures suggest.


Grok was the statistical outlier for harmful responses, generating 30% highly problematic outputs, significantly more than would be expected under a random distribution. The audit attributes this partly to Grok being trained in part on content from X, formerly Twitter, a platform the authors describe as known to spread false information more rapidly than truthful content. This is a procurement-relevant finding: it connects training data provenance directly to clinical output quality, which is precisely the data governance question that Article 10 of the EU AI Act is designed to address.

On transparency, the audit found that Gemini included a recommendation to consult a healthcare professional in 88% of its responses. It still produced 40% problematic responses. A disclaimer inside the chatbot's output does not resolve the accuracy problem. This matters for the Art. 13 transparency obligation: operationalising meaningful transparency requires more than instructing the model to include a caveat. It requires the output to be accurate, verifiable, and accessible. On readability, the American Medical Association recommends that patient education materials not exceed a sixth-grade reading level. Every model in the audit responded at college-sophomore to senior level. The gap between the recommended and the actual reading level is the accessibility failure that Art. 13 is designed to prevent.
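That gap is measurable in an output pipeline. A minimal sketch using the standard Flesch-Kincaid grade-level formula, with a rough heuristic syllable counter (a production system would use a tested library such as textstat):

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text) or ["placeholder"]
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = "Vaccines teach your immune system to spot a germ before it makes you sick."
print(round(fk_grade(sample), 1))  # gate any output scoring above ~6.0 before display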

On citations, both DeepSeek and ChatGPT, when asked about their poor reference reliability, acknowledged the problem in their own outputs. DeepSeek stated that its references “may not correspond to actual, verifiable sources.” ChatGPT responded that it “may fabricate information to maintain the appearance of completeness, even if that means sacrificing accuracy.” These are not interpretations by the researchers. They are admissions by the systems themselves. A model that is aware it produces hallucinated citations but continues to do so cannot be described as explainable in any sense that satisfies Art. 13(1)(b) or supports the human oversight requirement under Art. 14.
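Fabricated references are also detectable before they reach a user. The sketch below checks whether a cited DOI actually resolves against the public Crossref REST API; the endpoint is real, but the surrounding filtering policy and field names are illustrative assumptions:

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if Crossref knows this DOI. Crossref returns 404 for
    identifiers that do not correspond to a registered work."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

def filter_citations(citations: list[dict]) -> list[dict]:
    # Drop (or flag for review) any reference whose DOI cannot be verified,
    # rather than presenting a complete-looking but fabricated list.
    return [c for c in citations if c.get("doi") and doi_exists(c["doi"])]

print(doi_exists("10.0000/not-a-real-doi"))  # False: unregistered identifier
```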


The Regulatory Picture


Consumer health chatbots are currently governed as general-purpose AI (GPAI) models under Chapter V of the EU AI Act, with transparency obligations in force since 2 August 2025. The full high-risk obligation stack applies only upon classification as high-risk. That classification has not been formally applied to consumer health chatbots in any EU member state, and the audit findings are the clinical record of what unclassified deployment at scale produces.

On the Digital Omnibus, both the Parliament and the Council have adopted aligned positions proposing fixed delayed deadlines: 2 December 2027 for stand-alone high-risk systems under Annex III, and 2 August 2028 for AI embedded in regulated products under Annex I. A political agreement was expected at the trilogue meeting of 28 April 2026. That agreement is not formal adoption: the Omnibus delays the August 2026 deadline only if it is published in the Official Journal before that date. Companies that have paused preparation are carrying unresolved legal exposure. GPAI obligations under Chapter V are unaffected by the Omnibus and are already in force. Standard commercial API terms from the major providers do not satisfy the Art. 53(1)(e) documentation entitlement, and this should be addressed at the next contract renewal.


What’s Next?


Health tech companies should take the following steps now, regardless of the Omnibus outcome. Conduct a use-case-specific accuracy assessment before deployment. The audit demonstrates that problematic response rates vary from roughly 22% in vaccines to over 60% in nutrition and athletic performance. A health tech company cannot establish Article 15 compliance without knowing where its clinical domain sits on that spectrum and defining an acceptable threshold before building on top of any GPAI model.
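A minimal sketch of the threshold step: given expert-graded responses from the target clinical domain, estimate the problematic-response rate with a confidence interval and compare its upper bound against a pre-defined acceptance threshold. The question count, grades, and 20% threshold below are illustrative assumptions, not figures from the audit or values prescribed by the Act:

```python
import math

def wilson_upper(k: int, n: int, z: float = 1.96) -> float:
    """Upper bound of the 95% Wilson score interval for a proportion:
    a conservative estimate of the true problematic-response rate."""
    p = k / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre + margin) / denom

# Illustrative numbers: 200 domain-specific questions, expert-graded,
# 31 flagged as problematic. The threshold is a design choice the
# deployer must document, not a value set by Article 15 itself.
k, n, threshold = 31, 200, 0.20
upper = wilson_upper(k, n)
print(f"observed {k/n:.1%}, 95% upper bound {upper:.1%}")
print("PASS" if upper <= threshold else "FAIL: do not deploy in this domain")
```

Note the design point: a point estimate of 15.5% looks acceptable against a 20% threshold, but the conservative upper bound of roughly 21% fails it, which is why the interval, not the raw rate, should drive the deployment decision.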

Classify the system and document the reasoning. Under Art. 6(3), a self-assessment that a nominally Annex III system is not high-risk requires written justification. An undocumented assumption is not a legal position.

Audit GPAI contracts before the next renewal. Art. 53(1)(e) entitles downstream deployers to technical documentation under Annex XI and training data summaries under Annex XII. For a model like Grok, whose training data provenance connects directly to its worst-in-class accuracy outcomes, this is not an administrative exercise. It is a clinical due diligence requirement.


Build explainability into the product architecture from the outset. The audit documents systems that acknowledge their own fabrication tendencies but cannot correct them. A health AI product built on that foundation, without a traceability layer that allows users to verify the basis for recommendations, does not have a transparency gap to be resolved later. It requires a different design approach from the start.
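What a traceability layer means in practice: every user-facing recommendation carries machine-checkable source records, and claims whose sources fail verification are withheld rather than displayed. A minimal structural sketch; all class and field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Source:
    title: str
    doi: str
    verified: bool = False  # set by an automated check such as a DOI lookup

@dataclass
class Recommendation:
    claim: str
    sources: list[Source] = field(default_factory=list)

    def displayable(self) -> bool:
        # A claim is shown to the user only if at least one source has
        # passed verification; otherwise it is routed to human review.
        return any(s.verified for s in self.sources)

rec = Recommendation(
    claim="Resistance training improves glycaemic control in type 2 diabetes.",
    sources=[Source("Example trial", "10.0000/placeholder", verified=False)],
)
print(rec.displayable())  # False: unverified source, so the claim is withheld
```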


For SaMD companies at the MDR and IVDR intersection, early notified body engagement on conformity assessment remains advisable regardless of whether the 28 April 2026 political agreement moves the Annex I deadline to August 2028. Lead times are long, and notified bodies are still developing AI Act competence.

Privacy Rules advises health tech companies, in-house counsel, and regulatory affairs teams on EU AI Act classification, GPAI deployer obligations, healthcare AI governance, and the MDR/IVDR intersection. If you are building or deploying AI in a clinical or health-adjacent context and would like to discuss your obligations, we would be glad to help.


About Privacy Rules
Privacy Rules provides EU data protection and AI regulatory advisory for US technology companies. Led by Tanya Chib, we help organizations navigate GDPR, AI Act compliance, and cross-border data transfer requirements.


© 2026 Privacy Rules. This analysis does not constitute legal advice. Organizations should consult qualified legal counsel for compliance decisions.
