Article Summary
Sep 30, 2025

A prominently displayed warning that "ChatGPT can make mistakes" failed to alter medical students' use of AI diagnostic advice, suggesting students already perceive ChatGPT as having low credibility—a finding that challenges assumptions about automation bias and highlights limits of simple disclaimers in calibrating AI trust.

Objective: This randomized controlled trial investigated whether safety warnings about AI fallibility influence medical students' diagnostic behavior when receiving ChatGPT-generated feedback. Drawing on Judge-Advisor System (JAS) theory, the researchers hypothesized that a prominent disclaimer would reduce advice-taking by signaling lower advisor credibility and increasing information asymmetry between students (judges) and AI (advisor), thereby activating egocentric advice discounting. The study aimed to provide empirical evidence on whether commonly recommended AI safety warnings actually affect clinical decision-making behavior in health professions education.

Methods: The final sample comprised 186 fourth-year medical students from Gazi University Faculty of Medicine in Turkey who had completed the Evidence-Based Medicine module. Students were recruited voluntarily through messaging apps, with no compensation. Ten additional participants had been excluded for likely inattention (off-topic comments) before analysis.

The researchers developed three ambiguous clinical vignettes, each presenting two plausible diagnoses: (1) Depression vs. Hypothyroidism, (2) Pancreatitis vs. Peptic Ulcer Disease, and (3) Non-ST-Elevation Myocardial Infarction (NSTEMI) vs. Pulmonary Embolism. Each vignette was specifically designed to include clinical features supporting both diagnoses to create genuine ambiguity. Two independent medical education experts reviewed the vignettes for quality and appropriateness.

For each case, ChatGPT-4o generated two diagnostic justifications—one for each candidate diagnosis. These were then edited by the research team for clinical plausibility and study alignment. The study employed a forced-disagreement design: after students submitted their initial diagnosis, the system always displayed ChatGPT-attributed feedback supporting the alternate diagnosis. For example, if a student selected hypothyroidism, they received arguments for depression.

Students were randomly assigned via an automated web application to one of two arms: (1) No-warning arm (n=96): Feedback appeared as "ChatGPT says" followed by the diagnostic justification. (2) Warning arm (n=90): Identical feedback was preceded by the sentence "ChatGPT can make mistakes. Check important info." displayed in a larger font and centered on the page, making it difficult to miss while reading the feedback. The researchers chose this prominent placement because ChatGPT's own warning appears below the chat, where it is easily overlooked, and because prior research shows that font size influences warning effectiveness.

After viewing the AI feedback, students could either keep their original diagnosis or change to the AI-suggested alternative. They were also given an optional text field to justify their final decision. This process repeated for all three vignettes, presented in randomized order using a cryptographically secure Fisher-Yates shuffle.
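
As an illustration of the ordering procedure, the sketch below implements a Fisher-Yates shuffle driven by a cryptographically secure random number generator (Python's secrets module). The vignette labels are hypothetical; the study's own web implementation is not published in this summary.

```python
import secrets

def secure_shuffle(items):
    """Fisher-Yates shuffle driven by a cryptographically secure RNG."""
    shuffled = list(items)
    # Walk from the last index down, swapping each position with a
    # uniformly chosen position at or before it.
    for i in range(len(shuffled) - 1, 0, -1):
        j = secrets.randbelow(i + 1)  # uniform integer in [0, i]
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled

# Hypothetical vignette labels; each participant sees all three in random order.
vignettes = ["depression_vs_hypothyroidism",
             "pancreatitis_vs_peptic_ulcer",
             "nstemi_vs_pulmonary_embolism"]
print(secure_shuffle(vignettes))
```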

The primary outcome was whether students changed their answer for each vignette (binary: 1=changed, 0=kept original). Secondary measures included weight of advice (WoA), which in this binary design equals the proportion of responses in which the student switched to the AI-suggested diagnosis, and whether students provided written explanations. Analysis used mixed-effects logistic regression with random intercepts for participant and case to account for clustering (three cases per student), reporting odds ratios with 95% confidence intervals. Mean WoA was compared against the 0.30 benchmark from a recent meta-analysis of 129 JAS studies (N=17,296). Statistical analyses were conducted in Jamovi 2.2.5 using GAMLj 2.6.5, with significance set at p<0.05.
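
To make this calculation concrete, the sketch below shows how mean WoA and the one-sample comparison against the 0.30 benchmark could be computed. It uses simulated stand-in data with an assumed ~15% change rate, not the study dataset, and is a simplified illustration rather than the authors' Jamovi/GAMLj analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-in data: 186 students x 3 vignettes.
# In the binary forced-disagreement design, weight of advice (WoA) per
# response is simply 1 if the student switched to the AI-suggested
# diagnosis and 0 if they kept their original answer.
changed = rng.binomial(1, 0.15, size=186 * 3)

woa_mean = changed.mean()
# One-sample t-test of mean WoA against the 0.30 meta-analytic benchmark.
t_stat, p_value = stats.ttest_1samp(changed, popmean=0.30)
print(f"mean WoA = {woa_mean:.2f}, t = {t_stat:.2f}, p = {p_value:.3g}")
```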

Key Findings:

Vignette Ambiguity Verification: The clinical cases achieved the intended ambiguity. Two vignettes showed near 50-50 initial splits: Depression vs. Hypothyroidism (97 vs. 89; 52%-48%) and NSTEMI vs. Pulmonary Embolism (82 vs. 104; 44%-56%). Only Pancreatitis vs. Peptic Ulcer Disease was more skewed (125 vs. 61; 67%-33%), though still representing genuine diagnostic uncertainty.

Primary Outcome - No Warning Effect: The warning did not influence diagnostic changes. In the no-warning group, 15.3% of responses (44/288) involved changing the diagnosis after viewing AI feedback. In the warning group, 15.9% (43/270) changed—a difference that was not statistically significant (OR=1.09, 95% CI: 0.46-2.59, p=0.84). At the participant level, 30.2% of no-warning students (29/96) and 32.2% of warning students (29/90) changed at least once, again showing no meaningful difference (OR=1.10, 95% CI: 0.77-1.57, p=0.61).
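
As a quick arithmetic check, the unadjusted odds ratio implied by these response-level counts can be computed directly; it comes out near 1.05, close to the published OR of 1.09, which comes from the mixed-effects model that additionally accounts for clustering within students and cases.

```python
# Crude (unadjusted) odds ratio from the reported response-level counts.
changed_warning, total_warning = 43, 270      # warning arm
changed_control, total_control = 44, 288      # no-warning arm

odds_warning = changed_warning / (total_warning - changed_warning)
odds_control = changed_control / (total_control - changed_control)
crude_or = odds_warning / odds_control
print(f"crude OR = {crude_or:.2f}")  # ~1.05; the model-based OR is 1.09
```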

Dramatically Low Weight of Advice: Students' mean weight of advice was 0.15 (SD=0.36), significantly lower than the 0.30 average reported in the prior JAS meta-analysis (t(557)=9.37, p<0.001, Cohen's d=0.40), a 50% reduction in advice-taking relative to the established benchmark across diverse judge-advisor contexts. Change rates were similarly low across cases: 17.2% for the most skewed vignette (Pancreatitis vs. Peptic Ulcer) versus 12.9% (Depression vs. Hypothyroidism) and 16.7% (NSTEMI vs. Pulmonary Embolism) for the more balanced ones, indicating that the skew in initial diagnoses did not account for the low weight of advice.

Case Consistency: An Arm × Case interaction analysis revealed no significant differences (χ²=0.51, df=2, p=0.77). Odds ratios for warning vs. no-warning were similar across all three vignettes: 1.08 (Depression-Hypothyroidism), 1.26 (Pancreatitis-Peptic Ulcer), and 0.86 (NSTEMI-Pulmonary Embolism), with all confidence intervals including unity. This indicates the warning's null effect was consistent regardless of which diagnostic scenario students encountered.
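
The interaction test can be illustrated in simplified form: fit logistic regressions with and without the Arm × Case term and compare them with a likelihood-ratio test on 2 degrees of freedom. The sketch below uses ordinary fixed-effects models on simulated long-format data with assumed variable names; the study's model also included random intercepts for participant and case, so this approximates the approach rather than reproducing it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Simulated long-format data: one row per student-vignette response.
n_students = 186
df = pd.DataFrame({
    "arm": np.repeat(rng.choice(["warning", "control"], size=n_students), 3),
    "case": np.tile(["dep_hypo", "panc_pud", "nstemi_pe"], n_students),
    "changed": rng.binomial(1, 0.15, size=n_students * 3),
})

# Fit models with and without the Arm x Case interaction.
full = smf.logit("changed ~ C(arm) * C(case)", data=df).fit(disp=False)
reduced = smf.logit("changed ~ C(arm) + C(case)", data=df).fit(disp=False)

# Likelihood-ratio test for the interaction (2 extra parameters -> df = 2).
lr_stat = 2 * (full.llf - reduced.llf)
p_value = chi2.sf(lr_stat, df=2)
print(f"LR chi2 = {lr_stat:.2f}, p = {p_value:.2f}")
```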

Explanation Patterns: Among students who retained their original diagnosis, the warning group showed a borderline tendency to provide written explanations more often than the control (60% vs. 51%, χ²=3.56, df=1, p=0.059). However, among students who accepted ChatGPT's advice and changed their answer, explanation rates did not differ between groups (p=0.19).

Implications: The findings challenge common assumptions about automation bias in medical education. Rather than over-relying on AI advice—a frequently cited concern—students dramatically underweighted ChatGPT's diagnostic feedback, using only half the typical advice weight observed across general JAS research. The safety warning's complete ineffectiveness suggests students' perceived credibility of ChatGPT was already at or near a "behavioral floor"—a threshold below which additional negative cues cannot further reduce reliance because trust is already minimal.

This supports the existence of a credibility threshold in advice-taking: once perceived advisor quality falls below a critical point, extra warnings have limited effect because behavior has plateaued near zero uptake. The JAS Input-Process-Output model explains this: the warning operates at the Input stage (signaling low advisor credibility), which should trigger Process-stage egocentric discounting, leading to Output-stage reduced advice weight. However, when credibility is already perceived as very low, the input warning cannot meaningfully alter the process because students are already maximally discounting the advice.

Importantly, this threshold may operate bidirectionally. Research by Okamura and Yamada (2020) suggests that when perceived AI credibility is already high, simple warnings likewise fail to reduce reliance—users continue following advice despite disclaimers. This presents a significant patient safety risk: if medical professionals come to trust AI systems based on initial positive experiences, subsequent warnings may not prevent over-reliance on flawed outputs.

The borderline-significant finding on explanation provision (60% vs. 51% among students who rejected AI advice, p=0.059) suggests the warning may have prompted metacognitive engagement even without changing decisions. This aligns with JAS research showing that prompts to articulate reasoning heighten retrieval of internal evidence and can amplify egocentric discounting. The warning may have primed students to construct more elaborate justifications for disagreeing with ChatGPT, potentially strengthening reflective practice without altering bottom-line diagnostic choices.

From an educational design perspective, the forced-disagreement approach may offer value beyond accuracy. Exposure to AI feedback that systematically contradicts initial judgments could serve as a reflective practice tool, prompting students to retrieve and articulate internal evidence for their diagnostic reasoning. Rather than viewing AI as an accuracy aid, its greater educational utility may lie in scaffolding reflection when learners confront conflicting perspectives—essentially functioning as a structured disagreement prompt to develop diagnostic resilience and justification skills.

The findings also contextualize recent trends showing declining frequency of medical disclaimers in LLM and vision-language model outputs from 2022 to 2025, despite increasing use for medical question interpretation. If users systematically underweight AI medical advice regardless of warnings, developers may view disclaimers as having limited practical impact, though this reasoning overlooks the threshold's bidirectional nature and potential risks when trust is high.

Limitations: The study acknowledges several constraints. The always-contradict design, while methodologically rigorous for isolating advice-taking, doesn't reflect typical AI use where agreement is common. This may have artificially suppressed advice uptake, though it successfully standardized exposure across students. The binary outcome (change/keep) missed subtle reasoning shifts; confidence sliders or Likert scales might have captured more nuanced effects.

The sample comprised only fourth-year students who had completed Evidence-Based Medicine at a single Turkish university, limiting generalizability across training levels, specialties, and cultural contexts. Recent research showing that early-year medical students in team-based learning changed answers to align with incorrect ChatGPT outputs, even after peer discussion, suggests that advice-taking dynamics may differ substantially by experience level.

The vignettes were purpose-built for advice-taking research rather than simulating authentic clinical decision support or didactic tasks. While this ensured methodological purity and diagnostic ambiguity, it may not reflect how students engage with AI in actual study contexts (e.g., explaining concepts, generating practice questions, summarizing research). Students likely calibrate trust differently depending on task type, so findings shouldn't be overgeneralized to all LLM educational uses.

The study lacked qualitative analysis of student justifications due to resource constraints. Such analysis might have revealed important mechanisms underlying advice discounting and reflection—for example, whether warning-group students articulated specific concerns about AI reliability or whether they cited clinical reasoning principles when disagreeing with ChatGPT.

Finally, the findings may not extend to practicing physicians, interprofessional teams, or other healthcare contexts where domain expertise, accountability structures, and trust calibration operate under different constraints. The advice-taking literature shows expertise level substantially moderates these effects.

Future Directions: The researchers suggest several extensions. Qualitative analysis of student justifications could illuminate mechanisms underlying advice discounting—examining whether students in the warning condition articulated specific AI reliability concerns, cited evidence-based medicine principles, or demonstrated different depths of clinical reasoning when rejecting ChatGPT's advice.

Alternative warning designs warrant testing, including: severity-graded warnings (mild caution vs. strong alert), context-specific disclaimers highlighting particular error types (e.g., "AI may miss rare diagnoses"), interactive warnings requiring acknowledgment, and comparative effectiveness studies of different visual/textual warning formats informed by risk communication research.

Confidence and uncertainty measures would provide richer data than binary change/keep outcomes. Implementing slider scales for diagnostic confidence before and after AI feedback, or measuring certainty in both initial and revised judgments, could reveal whether warnings shift confidence calibration even when final decisions remain unchanged.

Longitudinal trust dynamics require investigation. Tracking how trust in AI diagnostic advice evolves over multiple exposures—particularly after students encounter correct vs. incorrect AI suggestions—would clarify whether single-session underweighting reflects stable distrust or initial wariness that adjusts with experience. This is crucial for understanding whether low advice weight represents appropriately calibrated skepticism or excessive dismissal of potentially helpful input.

Cross-population studies comparing medical students at different training levels, practicing physicians, nurses, and other health professionals would identify how domain expertise and professional role moderate the credibility threshold effect. Given evidence that early-year students over-rely on ChatGPT while fourth-years dramatically underweight it, mapping this trajectory could inform targeted interventions.

Task-type variations should examine whether underweighting generalizes across AI educational applications. Testing advice-taking for concept explanations, differential diagnosis generation, literature summarization, and study question creation would clarify whether students apply blanket distrust or calibrate trust to specific AI capabilities.

Reflection prompt interventions, rather than warnings, may prove more effective. Designing and testing structured prompts that ask students to explicitly compare their reasoning with AI's, identify agreement/disagreement points, and articulate confidence levels could enhance metacognitive engagement regardless of trust calibration. This aligns with the borderline finding that warnings increased explanation provision among students who rejected AI advice.

Real-world implementation studies in clinical education settings—with randomized trials comparing student outcomes and reflection quality with versus without AI disagreement exposure—would provide ecological validity and clarify whether laboratory findings transfer to authentic learning contexts.

Title and Authors: "'ChatGPT can make mistakes' warnings fail: A randomized controlled trial" by Yavuz Selim Kıyak, Özlem Coşkun, and Işıl İrem Budakoğlu from the Department of Medical Education and Informatics, Faculty of Medicine, Gazi University, Ankara, Türkiye.

Published On: Received July 18, 2025; Revised September 5, 2025; Accepted September 17, 2025.

Published By: Medical Education (Med Educ), published by the Association for the Study of Medical Education and John Wiley & Sons Ltd. DOI: 10.1111/medu.70056. This is an open access article under applicable Creative Commons licensing terms and conditions. The study received no external funding.
