Evaluation of large language models in breast cancer clinical scenarios: a comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2.

Linfang Deng, Tianyi Wang, Yangzhang, Zhenhua Zhai, Wei Tao, Jincheng Li, Yi Zhao, Shaoting Luo, Jinjiang Xu

International Journal of Surgery 2024 April 2

BACKGROUND: Large language models (LLMs) have garnered significant attention in the AI domain owing to their exemplary context recognition and response capabilities. However, the potential of LLMs in specific clinical scenarios, particularly in breast cancer diagnosis, treatment, and care, has not been fully explored. This study aimed to compare the performances of three major LLMs in the clinical context of breast cancer.

METHODS: In this study, clinical scenarios designed specifically for breast cancer were segmented into five pivotal domains (nine cases): assessment and diagnosis, treatment decision-making, postoperative care, psychosocial support, and prognosis and rehabilitation. The LLMs were used to generate feedback for various queries related to these domains. For each scenario, a panel of five breast cancer specialists, each with over a decade of experience, evaluated the feedback from LLMs. They assessed feedback concerning LLMs in terms of their quality, relevance, and applicability.

RESULTS: There was a moderate level of agreement among the raters (Fleiss' kappa=0.345, P<0.05). Comparing the performance of different models regarding response length, GPT-4.0 and GPT-3.5 provided relatively longer feedback than Claude2. Furthermore, across the nine case analyses, GPT-4.0 significantly outperformed the other two models in average quality, relevance, and applicability. Within the five clinical areas, GPT-4.0 markedly surpassed GPT-3.5 in the quality of the other four areas and scored higher than Claude2 in tasks related to psychosocial support and treatment decision-making.

CONCLUSION: This study revealed that in the realm of clinical applications for breast cancer, GPT-4.0 showcases not only superiority in terms of quality and relevance but also demonstrates exceptional capability in applicability, especially when compared to GPT-3.5. Relative to Claude2, GPT-4.0 holds advantages in specific domains. With the expanding use of LLMs in the clinical field, ongoing optimization and rigorous accuracy assessments are paramount.

Full text links

We have located open access text paper links.

Full Text PDF Full Text Web

Show additional links to paperHide additional links to paper

PubMed PubMed Central #1 Wolters Kluwer #1 PubMed Central #2 PubMed Central #3 Wolters Kluwer #2

Add to Saved Papers

Get 1-tap access

Related Resources

Lung ultrasound for diagnosis and management of ARDS.Marry R Smit, Paul H Mayo, Silvia MongodiIntensive Care Medicine 2024 April 25

Executive Summary: State-of-the-Art Review: Unintended Consequences: Risk of Opportunistic Infections Associated with Long-term Glucocorticoid Therapies in Adults.Daniel B Chastain et al.Clinical Infectious Diseases 2024 April 11

Autoimmune Hemolytic Anemias: Classifications, Pathophysiology, Diagnoses and Management.Melika Loriamini et al.International Journal of Molecular Sciences 2024 April 13

Clinical practice guidelines on the management of status epilepticus in adults: A systematic review.Luca Vignatelli et al.Epilepsia 2024 April 13

Should renin-angiotensin system inhibitors be held prior to major surgery?Matthieu LegrandBritish Journal of Anaesthesia 2024 May

Contrast-induced acute kidney injury: a review of definition, pathogenesis, risk factors, prevention and treatment.Yanyan Li, Junda WangBMC Nephrology 2024 April 23

For the best experience, use the Read mobile app

Get seemless 1-tap access through your institution/university

For the best experience, use the Read mobile app

All material on this website is protected by copyright, Copyright © 1994-2024 by WebMD LLC.
This website also contains material copyrighted by 3rd parties.

By using this service, you agree to our terms of use and privacy policy.

Your Privacy Choices

You can now claim free CME credits for this literature searchClaim now

Get seemless 1-tap access through your institution/university

For the best experience, use the Read mobile app

Evaluation of large language models in breast cancer clinical scenarios: a comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2.

Full text links

Related Resources

Trending Papers

For the best experience, use the Read mobile app