Transformer versus traditional natural language processing: How much data is enough for automated radiology report classification?

OBJECTIVES: Current state-of-the-art natural language processing (NLP) techniques use transformer deep learning architectures, which depend on large training datasets. We hypothesized that traditional NLP techniques may outperform transformers for smaller radiology report datasets.

METHODS: We compared the performance of BioBERT, a deep learning-based transformer model pre-trained on biomedical text, and three traditional machine learning models (gradient boosted tree, random forest, and logistic regression) on seven classification tasks given free-text radiology reports. Tasks included detection of appendicitis, diverticulitis, bowel obstruction, and enteritis/colitis on abdomen/pelvis CT reports, ischemic infarct on brain CT/MRI reports, and medial and lateral meniscus tears on knee MRI reports (7,204 total annotated reports). Model performance on held-out test sets was compared after training on the full training set and on 2.5%, 10%, 25%, 50%, and 75% random subsets of the training data.
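The subsampling protocol above can be sketched in code. This is a minimal illustration, not the authors' pipeline: the feature representation (TF-IDF), classifier settings, and the synthetic report texts below are all assumptions; only logistic regression as a baseline and the 2.5%-100% training fractions come from the abstract.

```python
# Hedged sketch: TF-IDF + logistic regression (one of the traditional baselines
# named in the abstract) evaluated on growing random subsets of the training
# data, mirroring the 2.5%-100% protocol. The report texts are synthetic
# stand-ins; real annotated radiology reports are assumed.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

random.seed(0)

# Toy corpus: "positive" reports mention the finding, "negative" ones do not.
pos = ["ct abdomen shows acute appendicitis with periappendiceal fat stranding"] * 200
neg = ["ct abdomen unremarkable no acute inflammatory process identified"] * 200
texts = pos + neg
labels = [1] * 200 + [0] * 200

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

for frac in (0.025, 0.10, 0.25, 0.50, 0.75, 1.0):
    if frac < 1.0:
        # Stratified subsample so tiny fractions still contain both classes.
        X_sub, _, y_sub, _ = train_test_split(
            X_train, y_train, train_size=frac, random_state=0, stratify=y_train)
    else:
        X_sub, y_sub = X_train, y_train

    # Fit the vectorizer only on the subsample, as a real experiment would.
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_sub), y_sub)
    auc = roc_auc_score(y_test, clf.predict_proba(vec.transform(X_test))[:, 1])
    print(f"train frac {frac:>5.1%}: n={len(y_sub):3d}  AUC={auc:.3f}")
```

On real, noisier report text the small-fraction AUCs would be markedly lower, which is the regime where the abstract reports traditional models beating BioBERT.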

RESULTS: In all tested classification tasks, BioBERT performed poorly at smaller training sample sizes compared to non-deep-learning NLP models. Specifically, BioBERT required training on approximately 1,000 reports to perform comparably to or better than non-deep-learning models. At around 1,250 to 1,500 training samples, test performance for all models began to plateau, with additional training data yielding minimal gains.

CONCLUSIONS: With larger sample sizes, transformer NLP models achieved superior performance in radiology report binary classification tasks. However, with smaller (<1,000 reports) and more imbalanced training sets, traditional NLP techniques performed better.

ADVANCES IN KNOWLEDGE: Our benchmarks can help guide clinical NLP researchers in selecting machine learning models according to their dataset characteristics.
