Comparative Study
Journal Article
Add like
Add dislike
Add to saved papers

Evaluation of large language models in breast cancer clinical scenarios: a comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2.

BACKGROUND: Large language models (LLMs) have garnered significant attention in the AI domain owing to their exemplary context recognition and response capabilities. However, the potential of LLMs in specific clinical scenarios, particularly in breast cancer diagnosis, treatment, and care, has not been fully explored. This study aimed to compare the performances of three major LLMs in the clinical context of breast cancer.

METHODS: In this study, clinical scenarios designed specifically for breast cancer were segmented into five pivotal domains (nine cases): assessment and diagnosis, treatment decision-making, postoperative care, psychosocial support, and prognosis and rehabilitation. The LLMs were used to generate feedback for various queries related to these domains. For each scenario, a panel of five breast cancer specialists, each with over a decade of experience, evaluated the feedback from LLMs. They assessed feedback concerning LLMs in terms of their quality, relevance, and applicability.

RESULTS: There was a moderate level of agreement among the raters (Fleiss' kappa=0.345, P<0.05). Comparing the performance of different models regarding response length, GPT-4.0 and GPT-3.5 provided relatively longer feedback than Claude2. Furthermore, across the nine case analyses, GPT-4.0 significantly outperformed the other two models in average quality, relevance, and applicability. Within the five clinical areas, GPT-4.0 markedly surpassed GPT-3.5 in the quality of the other four areas and scored higher than Claude2 in tasks related to psychosocial support and treatment decision-making.

CONCLUSION: This study revealed that in the realm of clinical applications for breast cancer, GPT-4.0 showcases not only superiority in terms of quality and relevance but also demonstrates exceptional capability in applicability, especially when compared to GPT-3.5. Relative to Claude2, GPT-4.0 holds advantages in specific domains. With the expanding use of LLMs in the clinical field, ongoing optimization and rigorous accuracy assessments are paramount.

Full text links

We have located open access text paper links.

Related Resources

For the best experience, use the Read mobile app

Mobile app image

Get seemless 1-tap access through your institution/university

For the best experience, use the Read mobile app

All material on this website is protected by copyright, Copyright © 1994-2024 by WebMD LLC.
This website also contains material copyrighted by 3rd parties.

By using this service, you agree to our terms of use and privacy policy.

Your Privacy Choices Toggle icon

You can now claim free CME credits for this literature searchClaim now

Get seemless 1-tap access through your institution/university

For the best experience, use the Read mobile app