Large language models for generating medical examinations: systematic review

BMC Medical Education

Table 3 Faulty questions generated by the AI

Author	Medically Irrelevant Questions	Invalid for Medical Exam	Inaccurate/Wrong Question	Inaccurate/Wrong Answer or Alternative Answers	Low Difficulty Level
Sevgi et al.	N/A	N/A	N/A	1 (33.3%)	N/A
Biswas	N/A	N/A	N/A	N/A	N/A
Agarwal et al.	N/A	Highly valid	N/A	N/A	Somewhat difficult
Ayub et al.	9 (23%)	24 (60%)	5 (13%)	5 (13%)	10 (25%)
Cheung et al.	32 (64%)	28 (56%)	32 (64%)	29 (58%)	N/A
Totlis et al.	N/A	8 (44.4%)	N/A	N/A	8 (44.4%)
Han et al.	N/A	N/A	N/A	N/A	3 (100%)
Klang et al.	2 (0.95%)	1 (0.5%)	12 (5.7%)	14 (6.6%)	2 (0.95%)
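
The cells above report counts with percentages, so the number of questions each study evaluated can be estimated by dividing each count by its percentage. The following is a minimal sketch of that arithmetic, not part of the review itself. The names `parse_cell` and `implied_total` are hypothetical helpers introduced for illustration, and the sketch assumes every numeric cell has the form `count (percent%)` with a nonzero percentage. Because the published percentages are rounded, the implied totals are approximate.

```python
import re

# Matches cells such as "24 (60%)" or "2 (0.95%)"; assumed cell format.
CELL = re.compile(r"^\s*(\d+)\s*\(\s*([\d.]+)\s*%\s*\)\s*$")


def parse_cell(cell):
    """Return (count, percent) for a 'count (percent%)' cell, else None."""
    m = CELL.match(cell)
    if m is None:
        return None  # "N/A" or a qualitative rating such as "Highly valid"
    return int(m.group(1)), float(m.group(2))


def implied_total(count, percent):
    """Estimate the number of questions evaluated from a count and its percentage."""
    return round(count * 100 / percent)


# Two rows copied from Table 3 (author column omitted).
rows = {
    "Ayub et al.": ["9 (23%)", "24 (60%)", "5 (13%)", "5 (13%)", "10 (25%)"],
    "Cheung et al.": ["32 (64%)", "28 (56%)", "32 (64%)", "29 (58%)", "N/A"],
}

for author, cells in rows.items():
    parsed = [parse_cell(c) for c in cells]
    totals = [implied_total(n, p) for (n, p) in (x for x in parsed if x)]
    print(author, totals)
```

Implied totals that agree across a row suggest a single shared denominator for that study. Small disagreements are expected wherever the percentages were rounded to whole numbers.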

ISSN: 1472-6920