From: Large language models for generating medical examinations: systematic review
Author | No. of MCQs | Tested Against Human-Written Questions | Medical Field | Questions Evaluated By | Performance Scores |
---|---|---|---|---|---|
Sevgi et al. | 3 | No | Neurosurgery | Evaluated by the author according to current literature | 2 (66.6%) of the questions were accurate |
Biswas | 5 | No | General | N/A | N/A |
Agarwal et al. | 320 | No | Medical Physiology | 2 physiologists | Validity: p < 0.001 (ChatGPT vs. Bing p < 0.001; Bard vs. Bing p < 0.001); difficulty: p < 0.006 (ChatGPT vs. Bing p = 0.010; ChatGPT vs. Bard p = 0.003) |
Ayub et al. | 40 | No | Dermatology | 2 board-certified dermatologists | 16 (40%) of the questions were valid for use in exams |
Cheung et al. | 50 | Yes | Internal Medicine/Surgery | 5 international medical experts and educators | Overall performance: AI 20 (40%) vs. human 30 (60%); mean difference -0.80 ± 4.82; total time required: AI 20 min 25 s vs. human 211 min 33 s |
Totlis et al. | 18 | No | Anatomy | N/A | N/A |
Han et al. | 3 | No | Biochemistry | N/A | N/A |
Klang et al. | 210 | No | Internal Medicine, Surgery, Obstetrics & Gynecology, Psychiatry, Pediatrics | 5 specialist physicians in the tested fields | Problematic questions by field: Surgery 30%; Gynecology 20%; Pediatrics 10%; Internal Medicine 10%; Psychiatry 0% |