Table 2 Key parameters investigated in each study

From: Large language models for generating medical examinations: systematic review

| Author | No. of MCQs | Tested vs. Human | Medical Field | Questions Evaluated By | Performance Scores |
|---|---|---|---|---|---|
| Sevgi et al. | 3 | No | Neurosurgery | Evaluated by the author according to current literature | 2 (66.6%) of the questions were accurate |
| Biswas | 5 | No | General | N/A | N/A |
| Agarwal et al. | 320 | No | Medical Physiology | 2 physiologists | Validity p < 0.001: Chat-GPT vs. Bing < 0.001; Bard vs. Bing < 0.001. Difficulty p < 0.006: Chat-GPT vs. Bing 0.010; Chat-GPT vs. Bard 0.003 |
| Ayub et al. | 40 | No | Dermatology | 2 board-certified dermatologists | 16 (40%) of questions valid for exams |
| Cheung et al. | 50 | Yes | Internal Medicine/Surgery | 5 international medical experts and educators | Overall performance: AI score 20 (40%) vs. human score 30 (60%); mean difference -0.80 ± 4.82. Total time required: AI 20 min 25 s vs. human 211 min 33 s |
| Totlis et al. | 18 | No | Anatomy | N/A | N/A |
| Han et al. | 3 | No | Biochemistry | N/A | N/A |
| Klang et al. | 210 | No | Internal Medicine; Surgery; Obstetrics & Gynecology; Psychiatry; Pediatrics | 5 specialist physicians in the tested fields | Problematic questions by field: Surgery 30%; Gynecology 20%; Pediatrics 10%; Internal Medicine 10%; Psychiatry 0% |

Summary of key parameters investigated in each study, November 2023.