Abstract
Objectives
This study aimed to compare the year-over-year change in ChatGPT’s performance on nationwide ophthalmology exams with the change in residents’ performance over the same period.
Materials and Methods
This observational study included ophthalmology residents in Türkiye who participated in both the 2023 and 2024 Resident Training Development Exams organized by the Turkish Ophthalmological Association Qualifications Committee. The 2023 examination consisted of 69 single-best-answer multiple-choice questions and was administered to ChatGPT-3.5. The 2024 version, containing 72 questions, was administered to ChatGPT-4o. The success rates of ChatGPT and the residents who participated in both exams were compared.
Results
ChatGPT’s accuracy improved from 53.6% in 2023 to 84.7% in 2024. Among the 501 residents who participated in both years, the average score increased from 48.2% to 53.1%. ChatGPT ranked 292nd among residents in 2023 but achieved the top score in 2024. Based on percentage improvement in scores, ChatGPT-4o ranked 8th overall. The most notable performance gains for ChatGPT were seen in the areas of strabismus (+75%), neuro-ophthalmology (+40%), and optics (+40%). Among residents, the largest improvement occurred in oculoplastics (+33.5%), while a decrease was observed in cornea and ocular surface (-4.1%).
Conclusion
ChatGPT-4o showed a marked improvement in answering ophthalmology questions compared to its predecessor, whereas resident learning progressed more gradually. This rapid advancement in ChatGPT highlights the potential speed with which artificial learning can progress within defined boundaries. In contrast, human learning remains a deeper and more time-intensive process. Results suggest that evolving large language models will play an increasingly significant role in medical education and clinical support.
Introduction
Large language models such as OpenAI’s ChatGPT are advanced artificial intelligence systems that operate on natural language processing techniques and are capable of generating human-like responses. Built on the Generative Pre-trained Transformer (GPT) architecture, these models have attained the capacity to produce contextually consistent and meaningful responses by training on vast and diverse text datasets. While earlier versions such as ChatGPT-3.5 and ChatGPT-4.0 made significant advances in natural language comprehension, ChatGPT-4o (released on May 13, 2024) showed marked improvement over previous versions in linguistic accuracy and interactional performance.1 With increasing competence in medical, educational, and academic contexts, ChatGPT exhibits high levels of accuracy and responsiveness.2 Nevertheless, particularly in the medical field, the responses of these models must be continuously evaluated to ensure clinical reliability for both healthcare professionals and patients.
The rapidly increasing popularity of ChatGPT in the medical field has led to growing interest in assessing its functionality in various health-related tasks. Research in ophthalmology has examined the accuracy of ChatGPT’s responses to questions about various subspecialties, providing evidence that this AI model can be used as a complementary educational tool.3, 4, 5 Furthermore, ChatGPT’s ability to interpret and manage ocular conditions in clinical scenarios such as corneal ulcers, cataract management, and retinal pathologies has also been explored.6, 7, 8 ChatGPT has emerged as a tool that can be helpful not only in diagnosis but also in documentation processes such as the preparation of medical reports, as well as in turning complex ophthalmological information into more understandable and accessible educational content.9
In this study, we aimed to evaluate the performance of both the ophthalmology residents who took a national resident training exam in two consecutive years and the ChatGPT models current in those years, and compare their year-over-year changes in performance.
Materials and Methods
The Turkish Ophthalmological Association Qualifications Committee held the third annual Resident Training Development Exam on May 26, 2023. In our previous study, we posed the questions from this exam to ChatGPT-3.5, the previous version of ChatGPT, and compared its performance with the results of the ophthalmology residents who took the exam nationwide.10 The following year, on May 31, 2024, a total of 1,013 ophthalmology residents from 80 training centers across Türkiye took the fourth Resident Training Development Exam. In the present study, we administered the questions from this exam to ChatGPT-4o, the most current version of ChatGPT. Only residents who took the exam in both years were included in the study to allow a year-over-year analysis. The residents were grouped according to their year of training as of the 2024 exam date.
Both exams were prepared by the Turkish Ophthalmological Association Qualifications Committee to cover the same subspecialties at a similar difficulty level. Each exam included a total of 75 questions; however, questions in any format other than single-best-answer multiple choice were excluded from the study. Therefore, the analysis included 69 eligible questions from 2023 and 72 eligible questions from 2024. The distributions of the 2023 and 2024 questions by subspecialty are given in Table 1 and Table 2, respectively.
After being translated into English, the 2024 exam questions were posed to ChatGPT-4o (model identifier: gpt-4o-2024-05-13) through the official web interface (https://chat.openai.com) in separate chat sessions on March 21, 2025. The system history was cleared before each question. As none of the questions contained visual or graphic content, no additional transcription or image description was required. To avoid the impact of subsequent updates to the language model, each question was accompanied by the prompt, “Answer the following question using the knowledge available as of May 31, 2024.” The answers and explanations given by ChatGPT-4o for each question were recorded, and each response was evaluated as correct or incorrect according to the predetermined answer key.
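For readers who wish to reproduce this workflow programmatically, the sketch below illustrates how the same per-question, fresh-session protocol could be scripted with the OpenAI Python API. This is an illustrative assumption rather than the procedure used in this study, which relied on the manual web interface; the question list and grading loop are hypothetical placeholders.

```python
# Illustrative sketch only: this study used the ChatGPT web interface manually.
# Assumed here: the `openai` Python package and an API key in the environment;
# `questions` and `answer_key` are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_PREFIX = ("Answer the following question using the knowledge "
                 "available as of May 31, 2024.\n\n")

def ask_question(question_text: str) -> str:
    """Pose one question in its own conversation, with no shared history."""
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",  # model identifier stated in the Methods
        messages=[{"role": "user", "content": PROMPT_PREFIX + question_text}],
        temperature=0,              # assumption: deterministic output preferred
    )
    return response.choices[0].message.content

# Hypothetical grading loop against the predetermined answer key
# answers = [ask_question(q) for q in questions]
# n_correct = sum(a.strip().startswith(k) for a, k in zip(answers, answer_key))
```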
Residents and ChatGPT-4o were scored out of 100 based on the number of correct answers. In addition, each score was converted to a rank among all examinees in the relevant year. Changes in performance were analyzed overall and by subspecialty for both the residents and ChatGPT. Year-over-year change in resident performance was determined from the average accuracy rates of the 501 residents who participated in both exams.
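As a concrete illustration of this scoring and ranking, the minimal sketch below computes a score out of 100 and a rank among examinees; the helper names and the commented score list are hypothetical and are not part of the study’s software.

```python
# Minimal sketch of the scoring and ranking described above (hypothetical names/data).
def score_out_of_100(n_correct: int, n_questions: int) -> float:
    """Score = percentage of correctly answered questions."""
    return 100.0 * n_correct / n_questions

def rank_among(all_scores: list[float], target_score: float) -> int:
    """Rank = 1 + number of examinees with a strictly higher score."""
    return 1 + sum(score > target_score for score in all_scores)

# Example using figures reported in the Results section:
chatgpt_2024 = score_out_of_100(61, 72)   # ~84.7 out of 100
# rank_2024 = rank_among(resident_scores_2024, chatgpt_2024)  # placeholder list
```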
Ethics committee approval was not required because participant information was anonymized and no personal data were used.
Statistical Analysis
Statistical analyses were performed using SPSS version 26 (IBM, Armonk, NY, USA). The Kolmogorov-Smirnov test was used to evaluate the normality of data distributions. Resident data were not evaluated individually but were averaged across the 501 residents in the sample. Comparisons between ChatGPT and the resident group were made descriptively rather than with statistical tests; as ChatGPT provides a single model output, no measure of variation was calculated and differences were compared using accuracy rates alone. Continuous variables are presented as mean ± standard deviation and range. The Wilcoxon signed-rank test was used to analyze the change in resident accuracy rate overall and by subspecialty between the 2023 and 2024 exams. The 95% confidence intervals (CI) were calculated for the accuracy rates of the resident group and the ChatGPT models. For subspecialty comparisons, Bonferroni correction was applied to reduce the probability of type I error, and the significance level was set at p<0.005.
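The sketch below outlines these analyses in Python under stated assumptions: the paired per-resident accuracy arrays are random placeholders, the Bonferroni divisor of 10 is inferred from the p<0.005 threshold, a Wilson score interval is assumed for the single-proportion CI because it reproduces the 74.7%-91.3% reported in the Results, and availability of SciPy and statsmodels is assumed (the original analyses were performed in SPSS).

```python
# Sketch of the analyses described above (placeholder data; the study used SPSS 26).
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(42)
# Paired per-resident accuracy rates for the 501 residents (placeholders)
acc_2023 = rng.uniform(0.30, 0.70, 501)
acc_2024 = np.clip(acc_2023 + rng.normal(0.05, 0.05, 501), 0.0, 1.0)

# Wilcoxon signed-rank test on the paired year-over-year change
stat, p_value = wilcoxon(acc_2023, acc_2024)

# Bonferroni-adjusted threshold; a divisor of 10 reproduces the p<0.005 cut-off
alpha_adjusted = 0.05 / 10

# 95% CI for ChatGPT-4o's single accuracy estimate (61/72 correct);
# a Wilson score interval matches the 74.7%-91.3% reported in the Results.
ci_low, ci_high = proportion_confint(61, 72, alpha=0.05, method="wilson")
```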
Results
A total of 501 ophthalmology residents took the exam in both years. When categorized by months of training in 2024, there were 249 second-year residents (12-23 months of experience), 132 third-year residents (24-35 months of experience), and 120 fourth-year or higher residents (≥36 months of experience). The mean training duration of the residents was 28.4±10.6 months (range, 13-64 months). Residents who took the exam in both years correctly answered a mean of 38.2±8.5 of the 72 questions in the 2024 exam, achieving a success rate of 53.1% (95% CI: 52.2%-54.0%). Second-year residents achieved a success rate of 48.8% (95% CI: 47.5%-50.1%), with a mean of 35.1±7.1 correct answers; third-year residents had a success rate of 54.8% (95% CI: 53.3%-56.3%), with a mean of 39.4±8.9 correct answers; and fourth-year or higher residents reached a success rate of 60.1% (95% CI: 58.7%-61.5%), with a mean of 43.3±7.8 correct answers. In contrast, ChatGPT-4o answered 61 of the 72 questions correctly, for an accuracy rate of 84.7% (95% CI: 74.7%-91.3%). ChatGPT-3.5 ranked 292nd among residents in the 2023 exam, whereas ChatGPT-4o achieved the top score in 2024. The mean numbers of questions (overall and by subspecialty) answered correctly by residents and ChatGPT-3.5 in 2023 are presented in Table 1, and the means of the same residents and ChatGPT-4o for 2024 are presented in Table 2.
Residents’ accuracy rates improved in most subspecialties compared with the previous year, although in neuro-ophthalmology the increase did not reach statistical significance (p=0.655). Corneal and ocular surface diseases was the only subspecialty in which residents’ performance declined, and this decrease was statistically significant (p<0.001). In contrast, ChatGPT showed major improvements across all subspecialties, with a 30.4% increase in overall accuracy rate. When the residents and ChatGPT were ranked according to the percentage increase in accuracy rate, ChatGPT-4o ranked 8th. Changes in accuracy rates between the two exams are summarized in Table 3.
Discussion
This study aimed to evaluate the year-over-year change in performance of Turkish ophthalmology residents and a large language model based on a nationwide resident training exam held over two consecutive years, thereby presenting a comparison of natural versus artificial learning. Our findings revealed that residents’ average performance in most subspecialties improved between 2023 and 2024, whereas ChatGPT-4o showed consistent improvement over its predecessor ChatGPT-3.5 in all areas and outperformed all human examinees in 2024.
The widespread adoption of AI in the healthcare field has led to an increasing trend among both patients and healthcare professionals toward using these tools to obtain medical information and provide educational support.11, 12 As their use becomes increasingly widespread, particularly through advanced large language models like ChatGPT-4o, it is becoming more important than ever to evaluate the reliability and scientific accuracy of the responses produced by these systems. Despite the advantage of providing rapid and accessible information, their potential impact on clinical decision-making processes and medical education makes it imperative to rigorously assess their responses to domain-specific, evidence-based questions.
Artificial intelligence systems are constantly evolving and learning. ChatGPT-4.0 was reported to show improved accuracy when asked the same questions about intraocular lenses six months apart.13 In another study, when medical questions initially answered incorrectly by ChatGPT were re-asked a short time later (8-17 days), the model answered most of the questions correctly.14
While human learning is a gradual process shaped by experience, cognition, and context, large language models such as ChatGPT acquire knowledge through periodic large-scale retraining cycles.15 Each new release, such as ChatGPT-4o, therefore represents a discrete step forward, informed by increasingly diverse, current, and domain-specific datasets. This process enables rapid and effective improvements in information accuracy and functional performance. However, this development lacks the continuity, ethical reasoning, and experiential depth involved in human learning.16 In contrast, humans experience a slower but more holistic learning process. Knowledge is not only acquired through formal education, but is also shaped through trial and error, emotional context, and social interaction.17 Especially in medical education, this type of learning process enhances qualities such as clinical judgment, empathy, and adaptability, which current artificial intelligence systems have yet to attain.18
The performance of large language models and humans on ophthalmology-related questions has also been compared in previous studies.19, 20 In another study using ophthalmology residency exam questions from 2020 to 2023, large language models did not show a significant change in accuracy over the four years.21 However, it was not specified exactly when the questions were posed to the large language models; if all the questions were asked at approximately the same time, accuracy rates would be expected to remain similar even if the test years were different.21 In a study by Taloni et al.22 using 1,023 questions from the BCSC (Basic and Clinical Science Course) question set of the American Academy of Ophthalmology, ChatGPT-4.0 outperformed its predecessor ChatGPT-3.5. Human participants ranked second in overall performance. Similarly, Maino et al.23 evaluated 440 previously administered multiple-choice questions on the European Board of Ophthalmology Diploma Examination and reported that ophthalmology residents performed better than ChatGPT-3.5 but were less accurate than ChatGPT-4o.
Although our findings are generally consistent with these studies, there is an important difference in study design. While previous studies adopted a cross-sectional approach, our study involved two similar national exams administered to the same group of residents one year apart, thereby enabling the observation of longitudinal changes. Moreover, we assessed not only human learning, but also the change in performance between successive versions of the same large language model. To the best of our knowledge, our study is the first to provide a parallel view of the progress of both human and machine learning over time.
The overall increase in resident scores is a positive indicator of the effectiveness of training over time, suggesting that structured training programs together with clinical experience contribute to knowledge retention. Interestingly, the only subspecialty with no statistically significant improvement was neuro-ophthalmology. This area is known for its multidisciplinary nature and limited clinical exposure in many training centers.24 The only area in which resident performance declined significantly was corneal and ocular surface diseases. This may point to factors such as insufficient emphasis on this subspecialty in the training curriculum or a scarcity of clinical cases. These findings may guide future modifications to residency training programs, especially in terms of identifying areas that need strengthening. In contrast, ChatGPT-4o performed strongly in all subspecialties and showed significant improvement over ChatGPT-3.5. ChatGPT-4o had an overall accuracy rate of 84.7%, exhibiting greater accuracy and consistency than the resident group, although it ranked 8th in terms of year-over-year performance improvement. This reinforces the increasing potential of large language models as educational tools in medical education, especially in terms of exam preparation and theoretical knowledge support. However, it should be noted that these models do not include elements important to medical practice such as contextual nuance, clinical judgment, and practical skills. Therefore, such AI tools should be considered a supportive and complementary component of traditional medical education rather than a replacement.
Study Limitations
Our study has some limitations that should be noted. First, although a longitudinal comparison was made, the effect of variables such as individual learning environments, level of clinical experience, and work-related habits is unknown. Second, although both exams were similar in content and structure, their psychometric equivalence was not assessed at the item level. Therefore, the study evaluates year-over-year differences not as absolute values but as relative changes in performance under similar conditions. Regarding the AI methodology, the web-based interface offers limited control over response length and context memory compared with programmatic access through the API. This may lead to minor differences in responses, which we consider a methodological limitation. Furthermore, including only residents who participated in both exams may have introduced selection bias, as this approach could favor individuals who are more motivated or academically inclined. Finally, the limited number of questions and the fact that the study is based on the national exam of a single country may limit the generalizability of the findings to different education systems.
Conclusion
ChatGPT-4o demonstrated improved accuracy over the previous version (ChatGPT-3.5) and outperformed the resident group in the 2024 national ophthalmology resident training exam. While residents showed more modest improvement, the dramatic progress made by ChatGPT-4o underscores the evolving capabilities of large language models. However, it is important to note that despite their high accuracy, these models can occasionally generate erroneous or misleading responses. Therefore, their role in medical education should be complementary, regarded as a supportive tool rather than a substitute for the critical thinking and experience-based knowledge that develops in humans through training.


