From: Large language models for generating medical examinations: systematic review
Author | No. of MCQs | Tested Against Human-Written Questions | Medical Field | Questions Evaluated By | Performance Scores |
---|---|---|---|---|---|
Sevgi et al. | 3 | No | Neurosurgery | Evaluated by the author according to current literature | 2 (66.6%) of the questions were accurate |
Biswas | 5 | No | General | N/A | N/A |
Agarwal et al. | 320 | No | Medical Physiology | 2 physiologists | Validity: p < 0.001 (ChatGPT vs. Bing p < 0.001; Bard vs. Bing p < 0.001); difficulty: p < 0.006 (ChatGPT vs. Bing p = 0.010; ChatGPT vs. Bard p = 0.003) |
Ayub et al. | 40 | No | Dermatology | 2 board-certified dermatologists | 16 (40%) of the questions were valid for use in exams |
Cheung et al. | 50 | Yes | Internal Medicine/Surgery | 5 international medical experts and educators | Overall performance: AI 20 (40%) vs. human 30 (60%); mean difference -0.80 ± 4.82; total time required: AI 20 min 25 s vs. human 211 min 33 s |
Totlis et al. | 18 | No | Anatomy | N/A | N/A |
Han et al. | 3 | No | Biochemistry | N/A | N/A |
Klang et al. | 210 | No | Internal Medicine, Surgery, Obstetrics & Gynecology, Psychiatry, Pediatrics | 5 specialist physicians in the tested fields | Problematic questions by field: Surgery 30%; Gynecology 20%; Pediatrics 10%; Internal Medicine 10%; Psychiatry 0% |