https://read.qxmd.com/read/38625780/visual-analytics-for-efficient-image-exploration-and-user-guided-image-captioning
#1
JOURNAL ARTICLE
Yiran Li, Junpeng Wang, Prince Aboagye, Chin-Chia Michael Yeh, Yan Zheng, Liang Wang, Wei Zhang, Kwan-Liu Ma
Recent advancements in pre-trained language-image models have ushered in a new era of visual comprehension. Leveraging the power of these models, this paper tackles two issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of data biases within them; (2) the evaluation of image captions and steering of their generation process. On the one hand, by visually examining the captions generated from language-image models for an image dataset, we gain deeper insights into the visual contents, unearthing data biases that may be entrenched within the dataset...
April 16, 2024: IEEE Transactions on Visualization and Computer Graphics
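As a rough illustration of the captioning step this kind of pipeline depends on, the sketch below batch-captions an image folder with the open-source BLIP model via Hugging Face Transformers. The choice of model, file layout, and loop are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: caption every image in a folder with a pretrained
# language-image model (BLIP here; the paper does not commit to a model).
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

name = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(name)
model = BlipForConditionalGeneration.from_pretrained(name)

def caption_dataset(image_dir: str) -> dict:
    """Map file name -> generated caption; the captions can then be
    browsed or clustered to surface biases in the dataset's contents."""
    captions = {}
    for path in Path(image_dir).glob("*.jpg"):
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        captions[path.name] = processor.decode(out[0], skip_special_tokens=True)
    return captions
```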
https://read.qxmd.com/read/38611501/application-of-multimodal-transformer-model-in-intelligent-agricultural-disease-detection-and-question-answering-systems
#2
JOURNAL ARTICLE
Yuchun Lu, Xiaoyi Lu, Liping Zheng, Min Sun, Siyu Chen, Baiyan Chen, Tong Wang, Jiming Yang, Chunli Lv
In this study, an innovative approach based on multimodal data and the transformer model was proposed to address challenges in agricultural disease detection and question-answering systems. This method effectively integrates image, text, and sensor data, utilizing deep learning technologies to analyze and process complex agriculture-related issues in depth. The study achieves technical breakthroughs and provides new perspectives and tools for the development of intelligent agriculture. In the task of agricultural disease detection, the proposed method demonstrated outstanding performance, achieving a precision, recall, and accuracy of 0...
March 28, 2024: Plants (Basel, Switzerland)
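The abstract names image, text, and sensor data as inputs to one transformer. A generic fusion pattern consistent with that description (not the authors' architecture; all dimensions are placeholders) looks like this in PyTorch:

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Project each modality to a shared width, treat each as one token,
    and let a Transformer encoder attend across modalities."""
    def __init__(self, img_dim=2048, txt_dim=768, sensor_dim=16,
                 d_model=256, num_classes=10):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, d_model)
        self.proj_txt = nn.Linear(txt_dim, d_model)
        self.proj_sen = nn.Linear(sensor_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)  # e.g. disease classes

    def forward(self, img, txt, sensor):
        tokens = torch.stack([self.proj_img(img), self.proj_txt(txt),
                              self.proj_sen(sensor)], dim=1)  # (B, 3, d)
        return self.head(self.encoder(tokens).mean(dim=1))
```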
https://read.qxmd.com/read/38560155/xrayswingen-automatic-medical-reporting-for-x-ray-exams-with-multimodal-model
#3
JOURNAL ARTICLE
Gilvan Veras Magalhães, Roney L de S Santos, Luis H S Vogado, Anselmo Cardoso de Paiva, Pedro de Alcântara Dos Santos Neto
The importance of radiology in modern medicine is acknowledged for its non-invasive diagnostic capabilities, yet the manual formulation of unstructured medical reports poses time constraints and error risks. This study addresses the common limitation of Artificial Intelligence applications in medical image captioning, which typically focus on classification problems, lacking detailed information about the patient's condition. Despite advancements in AI-generated medical reports that incorporate descriptive details from X-ray images, which are essential for comprehensive reports, the challenge persists...
April 15, 2024: Heliyon
https://read.qxmd.com/read/38547777/hierarchical-medical-image-report-adversarial-generation-with-hybrid-discriminator
#4
JOURNAL ARTICLE
Junsan Zhang, Ming Cheng, Qiaoqiao Cheng, Xiuxuan Shen, Yao Wan, Jie Zhu, Mengxuan Liu
BACKGROUND AND OBJECTIVES: Generating coherent reports from medical images is an important task for reducing doctors' workload. Unlike traditional image captioning tasks, medical image report generation faces more challenges. Current models often fail to characterize some abnormal findings, and some generate reports of low quality. In this study, we propose a model to generate high-quality reports from medical images. METHODS: In this paper, we propose a model called Hybrid Discriminator Generative Adversarial Network (HDGAN), which combines a Generative Adversarial Network (GAN) with Reinforcement Learning (RL)...
March 21, 2024: Artificial Intelligence in Medicine
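The GAN-plus-RL combination in this abstract is commonly wired up by using the discriminator's score as a sequence-level reward. The sketch below shows that generic pattern (a REINFORCE update with a greedy-decode baseline); it is my assumption about the general idea, not HDGAN's exact formulation:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor,
                   disc_scores: torch.Tensor,
                   baseline: torch.Tensor) -> torch.Tensor:
    """log_probs:   (B,) summed token log-probs of sampled reports.
    disc_scores: (B,) discriminator's 'looks real' score per report.
    baseline:    (B,) score of a greedy decode, to reduce variance."""
    advantage = (disc_scores - baseline).detach()  # reward carries no grad
    return -(advantage * log_probs).mean()
```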
https://read.qxmd.com/read/38545917/improved-image-caption-rating-datasets-game-and-model
#5
JOURNAL ARTICLE
Andrew Taylor Scott, Lothar D Narins, Anagha Kulkarni, Mar Castanon, Benjamin Kao, Shasta Ihorn, Yue-Ting Siu, Ilmi Yoon
How well a caption fits an image can be difficult to assess due to the subjective nature of caption quality. What makes a good caption? We investigate this problem by focusing on image-caption ratings and by generating high-quality datasets from human feedback through gamification. We validate the datasets by showing a higher level of inter-rater agreement and by using them to train custom machine learning models to predict new ratings. Our approach outperforms previous metrics: the resulting datasets are more easily learned and are of higher quality than other currently available datasets for image-caption rating...
April 2023: Extended Abstracts on Human Factors in Computing Systems
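A common way to realize the rating model this abstract describes is to regress the human rating from joint image and caption embeddings. The sketch below assumes precomputed embeddings (e.g. from a CLIP-style encoder) and is illustrative rather than the paper's actual model:

```python
import torch
import torch.nn as nn

class CaptionRater(nn.Module):
    """Regress a scalar image-caption fit rating from embeddings."""
    def __init__(self, emb_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * emb_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, img_emb, cap_emb):
        joint = torch.cat([img_emb, cap_emb], dim=-1)
        return self.mlp(joint).squeeze(-1)  # train with MSE against ratings
```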
https://read.qxmd.com/read/38544059/insights-into-object-semantics-leveraging-transformer-networks-for-advanced-image-captioning
#6
JOURNAL ARTICLE
Deema Abdal Hafeth, Stefanos Kollias
Image captioning is a technique used to generate descriptive captions for images. Typically, it involves employing a Convolutional Neural Network (CNN) as the encoder to extract visual features, and a decoder model, often based on Recurrent Neural Networks (RNNs), to generate the captions. Recently, the encoder-decoder architecture has witnessed the widespread adoption of the self-attention mechanism. However, this approach faces certain challenges that require further research. One such challenge is that the extracted visual features do not fully exploit the available image information, primarily due to the absence of semantic concepts...
March 11, 2024: Sensors
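Since the abstract recaps the textbook CNN-encoder / RNN-decoder pipeline, here is a minimal baseline in that mold (a sketch of the generic architecture the abstract describes, not the paper's proposed model):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Captioner(nn.Module):
    def __init__(self, vocab_size, embed=256, hidden=512):
        super().__init__()
        cnn = models.resnet50(weights="IMAGENET1K_V2")
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop fc
        self.img_proj = nn.Linear(2048, embed)
        self.word_emb = nn.Embedding(vocab_size, embed)
        self.rnn = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, tokens):
        feat = self.encoder(images).flatten(1)      # (B, 2048) visual features
        img_tok = self.img_proj(feat).unsqueeze(1)  # image as the first token
        seq = torch.cat([img_tok, self.word_emb(tokens)], dim=1)
        states, _ = self.rnn(seq)
        return self.out(states)                     # next-token logits
```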
https://read.qxmd.com/read/38539736/style-enhanced-transformer-for-image-captioning-in-construction-scenes
#7
JOURNAL ARTICLE
Kani Song, Linlin Chen, Hengyou Wang
Image captioning is important for improving the intelligence of construction projects and assisting managers in mastering construction site activities. However, there are few image-captioning models for construction scenes at present, and the existing methods do not perform well in complex construction scenes. According to the characteristics of construction scenes, we label a text description dataset based on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, simply called SETCAP...
March 1, 2024: Entropy
https://read.qxmd.com/read/38537293/advancing-medical-imaging-with-language-models-featuring-a-spotlight-on-chatgpt
#8
JOURNAL ARTICLE
Mingzhe Hu, Joshua Yuan Qian, Shaoyan Pan, Yuheng Li, Richard L J Qiu, Xiaofeng Yang
This review paper aims to serve as a comprehensive guide and instructional resource for researchers seeking to effectively implement language models in medical imaging research. First, we present the fundamental principles and evolution of language models, dedicating particular attention to large language models. We then review the current literature on how language models are being used to improve medical imaging, emphasizing a range of applications such as image captioning, report generation, report classification, findings extraction, visual question answering systems, interpretable diagnosis, and so on...
March 27, 2024: Physics in Medicine and Biology
https://read.qxmd.com/read/38524308/hollman-facilitations-a-user-friendly-tool-of-supporting-children-with-visual-impairment-and-their-families-in-daily-life
#9
JOURNAL ARTICLE
Tiziana Battistin, Silvia Trentin, Enrica Polato, Maria Eleonora Reffo
The Robert Hollman Foundation (RHF) designed "Hollman Facilitations" (HF), a user-friendly way of supporting children with visual impairment (VI) and their families on a daily basis. This tool consists of specifically designed pictures on simple A4 sheets, which highlight with images and captions the key aspects of these children's everyday lives. Professionals can easily modify Hollman Facilitations to tailor them to the unique developmental needs of each child with VI and to their individual strengths and weaknesses...
June 2024: MethodsX
https://read.qxmd.com/read/38508675/icga-gpt-report-generation-and-question-answering-for-indocyanine-green-angiography-images
#10
JOURNAL ARTICLE
Xiaolan Chen, Weiyi Zhang, Ziwei Zhao, Pusheng Xu, Yingfeng Zheng, Danli Shi, Mingguang He
BACKGROUND: Indocyanine green angiography (ICGA) is vital for diagnosing chorioretinal diseases, but its interpretation and patient communication require extensive expertise and time-consuming efforts. We aim to develop a bilingual ICGA report generation and question-answering (QA) system. METHODS: Our dataset comprised 213 129 ICGA images from 2919 participants. The system comprised two stages: image-text alignment for report generation by a multimodal transformer architecture, and large language model (LLM)-based QA with ICGA text reports and human-input questions...
March 20, 2024: British Journal of Ophthalmology
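The two-stage design in the abstract (report generation, then LLM-based QA over the report) can be orchestrated as below. Both stage interfaces are placeholders of my own; the actual models are not specified here:

```python
def answer_icga_question(image, question, generate_report, ask_llm):
    """generate_report: image -> report text (the multimodal transformer).
    ask_llm:         prompt -> answer text (any chat-style LLM)."""
    report = generate_report(image)          # stage 1: image-to-report
    prompt = (f"ICGA report:\n{report}\n\n"
              f"Question: {question}\n"
              "Answer using only the report above:")
    return ask_llm(prompt)                   # stage 2: report-grounded QA
```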
https://read.qxmd.com/read/38507381/cross-modal-retrieval-with-noisy-correspondence-via-consistency-refining-and-mining
#11
JOURNAL ARTICLE
Xinran Ma, Mouxing Yang, Yunfan Li, Peng Hu, Jiancheng Lv, Xi Peng
The success of existing cross-modal retrieval (CMR) methods relies heavily on the assumption that the annotated cross-modal correspondence is faultless. In practice, however, the correspondence of some pairs is inevitably contaminated during data collection or annotation, leading to the so-called Noisy Correspondence (NC) problem. To alleviate the influence of NC, we propose a novel method termed Consistency REfining And Mining (CREAM) by revealing and exploiting the difference between correspondence and consistency...
March 20, 2024: IEEE Transactions on Image Processing: a Publication of the IEEE Signal Processing Society
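One standard way to act on noisy correspondence, loosely in the spirit of this abstract (not CREAM's actual formulation), is to down-weight pairs whose cross-modal similarity looks inconsistent within a batch:

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """InfoNCE over a batch, with per-pair weights that shrink the
    contribution of likely-mismatched (noisy) pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau
    targets = torch.arange(len(img_emb), device=logits.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    # Low-loss pairs look consistent; weight them up, noisy ones down.
    weights = torch.softmax(-per_pair.detach(), dim=0) * len(per_pair)
    return (weights * per_pair).mean()
```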
https://read.qxmd.com/read/38506968/leveraging-the-capabilities-of-ai-novice-neurology-trained-operators-performing-cardiac-pocus-in-patients-with-acute-brain-injury
#12
JOURNAL ARTICLE
Jennifer Mears, Safa Kaleem, Rohan Panchamia, Hooman Kamel, Chris Tam, Richard Thalappillil, Santosh Murthy, Alexander E Merkler, Cenai Zhang, Judy H Ch'ang
BACKGROUND: Cardiac point-of-care ultrasound (cPOCUS) can aid in the diagnosis and treatment of cardiac disorders. Such disorders can arise as complications of acute brain injury, but most neurologic intensive care unit (NICU) providers do not receive formal training in cPOCUS. Caption artificial intelligence (AI) uses a novel deep learning (DL) algorithm to guide novice cPOCUS users in obtaining diagnostic-quality cardiac images. The primary objective of this study was to determine how often NICU providers with minimal cPOCUS experience capture quality images using DL-guided cPOCUS as well as the association between DL-guided cPOCUS and change in management and time to formal echocardiograms in the NICU...
March 20, 2024: Neurocritical Care
https://read.qxmd.com/read/38504017/a-visual-language-foundation-model-for-computational-pathology
#13
JOURNAL ARTICLE
Ming Y Lu, Bowen Chen, Drew F K Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, Anil V Parwani, Andrew Zhang, Faisal Mahmood
The accelerated adoption of digital pathology and advances in deep learning have enabled the development of robust models for various pathology tasks across a diverse array of diseases and patient cohorts. However, model training is often difficult due to label scarcity in the medical domain, and a model's usage is limited by the specific task and disease for which it is trained. Additionally, most models in histopathology leverage only image data, a stark contrast to how humans teach each other and reason about histopathologic entities...
March 2024: Nature Medicine
https://read.qxmd.com/read/38470582/towards-video-anomaly-retrieval-from-video-anomaly-detection-new-benchmarks-and-model
#14
JOURNAL ARTICLE
Peng Wu, Jing Liu, Xiangteng He, Yuxin Peng, Peng Wang, Yanning Zhang
Video anomaly detection (VAD) has received increasing attention due to its potential applications. Its current dominant tasks focus on detecting anomalies online, which can be roughly interpreted as binary or multi-class event classification. However, such a setup, which maps complicated anomalous events to single labels, e.g., "vandalism", is superficial, since single labels are insufficient to characterize anomalous events. In reality, users tend to search for a specific video rather than a series of approximate videos...
March 12, 2024: IEEE Transactions on Image Processing: a Publication of the IEEE Signal Processing Society
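Retrieval as framed by the abstract reduces, at inference time, to ranking videos by similarity to a free-text query in a shared embedding space. A minimal sketch (the encoders that produce the embeddings are assumed):

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, video_embs: torch.Tensor, k: int = 5):
    """query_emb: (d,) text embedding; video_embs: (N, d) video embeddings.
    Returns the top-k (index, cosine similarity) pairs."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), video_embs, dim=-1)
    scores, idx = sims.topk(k)
    return list(zip(idx.tolist(), scores.tolist()))
```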
https://read.qxmd.com/read/38446647/bridging-the-cross-modality-semantic-gap-in-visual-question-answering
#15
JOURNAL ARTICLE
Boyue Wang, Yujian Ma, Xiaoyan Li, Junbin Gao, Yongli Hu, Baocai Yin
The objective of visual question answering (VQA) is to adequately comprehend a question and identify relevant contents in an image that can provide an answer. Existing approaches in VQA often combine visual and question features directly to create a unified cross-modality representation for answer inference. However, this kind of approach fails to bridge the semantic gap between visual and text modalities, resulting in a lack of alignment in cross-modality semantics and the inability to match key visual content accurately...
March 6, 2024: IEEE Transactions on Neural Networks and Learning Systems
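A common device for the alignment problem this abstract raises is question-guided cross-attention, where question tokens query image-region features. The sketch below shows that generic mechanism, not this paper's specific design:

```python
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, question_tokens, region_feats):
        # Query = question, Key/Value = image regions; the attention
        # weights expose which regions each question token matched.
        attended, weights = self.attn(question_tokens, region_feats, region_feats)
        return attended, weights
```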
https://read.qxmd.com/read/38421845/zeronlg-aligning-and-autoencoding-domains-for-zero-shot-multimodal-and-multilingual-natural-language-generation
#16
JOURNAL ARTICLE
Bang Yang, Fenglin Liu, Yuexian Zou, Xian Wu, Yaowei Wang, David A Clifton
Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. As a result, it is necessary to collect and label data-text pairs for training, which is both costly and time-consuming. To relax the dependency on labeled data of downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework...
February 29, 2024: IEEE Transactions on Pattern Analysis and Machine Intelligence
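Conceptually, the zero-shot trick the abstract describes is to align all encoders into one latent space and train the text decoder only by autoencoding unlabeled text, so that at test time any modality's embedding can be decoded. A schematic sketch under that reading (interfaces assumed, not the released code):

```python
def zero_shot_caption(image, vision_encoder, text_decoder):
    """vision_encoder: image -> vector in the shared latent space.
    text_decoder:   latent vector -> text, trained by text autoencoding."""
    latent = vision_encoder(image)   # no paired image-text data was needed
    return text_decoder(latent)
```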
https://read.qxmd.com/read/38373123/zero-shot-video-grounding-with-pseudo-query-lookup-and-verification
#17
JOURNAL ARTICLE
Yu Lu, Ruijie Quan, Linchao Zhu, Yi Yang
Video grounding, the process of identifying a specific moment in an untrimmed video based on a natural language query, has become a popular topic in video understanding. However, fully supervised learning approaches for video grounding that require large amounts of annotated data can be expensive and time-consuming. Recently, zero-shot video grounding (ZS-VG) methods that leverage pre-trained object detectors and language models to generate pseudo-supervision for training video grounding models have been developed...
February 19, 2024: IEEE Transactions on Image Processing: a Publication of the IEEE Signal Processing Society
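The pseudo-supervision recipe the abstract sketches (pretrained detector plus language model) can be outlined as follows; the component interfaces are placeholders of mine:

```python
def build_pseudo_pairs(video_clips, detect_objects, make_query):
    """detect_objects: clip -> list of object names (pretrained detector).
    make_query:     object names -> pseudo sentence (language model).
    Returns (clip index, pseudo query) pairs usable as training data."""
    pairs = []
    for t, clip in enumerate(video_clips):
        objects = detect_objects(clip)              # e.g. ["person", "bicycle"]
        if objects:
            pairs.append((t, make_query(objects)))  # e.g. "a person on a bicycle"
    return pairs
```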
https://read.qxmd.com/read/38366336/characterizing-anti-vaping-posts-for-effective-communication-on-instagram-using-multimodal-deep-learning
#18
JOURNAL ARTICLE
Zidian Xie, Shijian Deng, Pinxin Liu, Xubin Lou, Chenliang Xu, Dongmei Li
INTRODUCTION: Instagram is a popular social networking platform for sharing photos, with a large proportion of youth and young adult users. We aim to use artificial intelligence to identify key features in anti-vaping Instagram image posts associated with high social media user engagement. AIMS AND METHODS: We collected 8972 anti-vaping Instagram image posts and hand-coded 2200 Instagram images to identify nine image features, such as warning signs and whether a person is shown vaping...
February 15, 2024: Nicotine & Tobacco Research
https://read.qxmd.com/read/38349824/smart-syntax-calibrated-multi-aspect-relation-transformer-for-change-captioning
#19
JOURNAL ARTICLE
Yunbin Tu, Liang Li, Li Su, Zheng-Jun Zha, Qingming Huang
Change captioning aims to describe the semantic change between two similar images. In this process, viewpoint change is the most typical distractor: it introduces pseudo changes in the appearance and position of objects, thereby overwhelming the real change. Besides, since the visual signal of change appears in a local region with weak features, it is difficult for the model to directly translate the learned change features into a sentence. In this paper, we propose a syntax-calibrated multi-aspect relation transformer to learn effective change features under different scenes and build reliable cross-modal alignment between the change features and linguistic words during caption generation...
February 13, 2024: IEEE Transactions on Pattern Analysis and Machine Intelligence
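At its core, change captioning conditions the decoder on features of both images plus some explicit change signal. A toy version of that input construction (an illustration, not the paper's calibrated multi-aspect relations):

```python
import torch

def change_features(feat_before: torch.Tensor,
                    feat_after: torch.Tensor) -> torch.Tensor:
    """Concatenate both images' features with their difference, a crude
    change signal a caption decoder could attend over."""
    diff = feat_after - feat_before
    return torch.cat([feat_before, feat_after, diff], dim=-1)
```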
https://read.qxmd.com/read/38308094/content-analysis-of-oral-mouth-cancer-related-posts-on-instagram
#20
JOURNAL ARTICLE
Omar Al Karadsheh, Alaa Atef, Dua'a Alqaisi, Siraj Zabadi, Yazan Hassona
OBJECTIVE: To examine the content of Instagram posts about oral cancer and assess its usefulness in promoting oral cancer awareness and early detection practices. METHODS: A systematic search of Instagram for posts about oral (mouth) cancer was conducted using the hashtags #oralcancer and #mouthcancer. The posts' usefulness in promoting awareness and early detection was assessed using the early detection usefulness score, and caption readability was assessed using the Flesch-Kincaid readability score...
February 2, 2024: Oral Diseases
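The Flesch-Kincaid score this abstract relies on is a simple closed-form readability formula. Below is the standard grade-level variant with a rough vowel-group syllable counter (the heuristic counter is mine; published tools count syllables more carefully):

```python
import re

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentences)
                             + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59
```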