https://read.qxmd.com/read/38300778/emotional-video-captioning-with-vision-based-emotion-interpretation-network
#21
JOURNAL ARTICLE
Peipei Song, Dan Guo, Xun Yang, Shengeng Tang, Meng Wang
Effectively summarizing and re-expressing video content in natural language, in a more human-like fashion, is one of the key topics in the field of multimedia content understanding. Despite good progress in recent years, existing efforts usually overlook the emotions in user-generated videos, making the generated sentences flat and impersonal. To fill this research gap, this paper presents a novel emotional video captioning framework in which a Vision-based Emotion Interpretation Network is designed to effectively capture the emotions conveyed in videos and to describe the visual content in both factual and emotional language...
February 1, 2024: IEEE Transactions on Image Processing: a Publication of the IEEE Signal Processing Society
https://read.qxmd.com/read/38261479/every-problem-every-step-all-in-focus-learning-to-solve-vision-language-problems-with-integrated-attention
#22
JOURNAL ARTICLE
Xianyu Chen, Jinhui Yang, Shi Chen, Louis Wang, Ming Jiang, Qi Zhao
Integrating information from vision and language modalities has sparked interesting applications in the fields of computer vision and natural language processing. Existing methods, though promising in tasks like image captioning and visual question answering, face challenges in understanding real-life issues and offering step-by-step solutions. In particular, they typically limit their scope to solutions with a sequential structure, thus ignoring complex inter-step dependencies. To bridge this gap, we propose a graph-based approach to vision-language problem solving...
January 23, 2024: IEEE Transactions on Pattern Analysis and Machine Intelligence
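The abstract above argues that flat, sequential solution formats miss inter-step dependencies. As a hedged illustration of why a graph representation helps (not the paper's actual model), the sketch below encodes invented troubleshooting steps as a directed graph with networkx; any ordering that respects the edges is a valid solution sequence.

```python
# A hedged sketch of representing solution steps as a dependency graph
# rather than a flat sequence. The step names are illustrative only.
import networkx as nx

steps = nx.DiGraph()
steps.add_edges_from([
    ("read the error message", "identify the failing component"),
    ("check the cable connection", "identify the failing component"),
    ("identify the failing component", "replace the component"),
])

# A valid step ordering must respect every dependency edge.
print(list(nx.topological_sort(steps)))
```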
https://read.qxmd.com/read/38249011/images-words-and-imagination-accessible-descriptions-to-support-blind-and-low-vision-art-exploration-and-engagement
#23
JOURNAL ARTICLE
Stacy A Doore, David Istrati, Chenchang Xu, Yixuan Qiu, Anais Sarrazin, Nicholas A Giudice
The lack of accessible information conveyed by descriptions of art images presents significant barriers for people with blindness and low vision (BLV) to engage with visual artwork. Most museums cannot easily provide accessible image descriptions that help BLV visitors build a mental representation of artwork, due to the vastness of collections, limitations in curator training, and current measures of what constitutes an effective automated caption. This paper reports the results of two studies investigating the types of information that should be included to produce high-quality accessible artwork descriptions, based on input from BLV description evaluators...
January 18, 2024: Journal of Imaging
https://read.qxmd.com/read/38235175/dataset-of-clinical-cases-images-image-labels-and-captions-from-open-access-case-reports-from-pubmed-central-1990-2023
#24
JOURNAL ARTICLE
Mauro Andrés Nievas Offidani, Claudio Augusto Delrieux
This paper details the acquisition, structure, and preprocessing of the MultiCaRe Dataset, a multimodal case report dataset containing data from 75,382 open-access PubMed Central articles spanning 1990 to 2023. The dataset includes 96,428 clinical cases, 135,596 images, and their corresponding labels and captions. Data extraction was performed using several APIs and packages, including Biopython, requests, BeautifulSoup, the BioC API for PMC, and the Europe PMC RESTful API. Image labels were created from the contents of the corresponding captions, using Spark NLP for Healthcare and manual annotation...
February 2024: Data in Brief
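The abstract above names the Europe PMC RESTful API among its data sources. Below is a minimal sketch of querying that public API for open-access case reports with requests; the query fields and response keys follow the public API documentation as I recall it, so verify them against the live service before relying on this.

```python
# Minimal sketch of querying the Europe PMC RESTful API for open-access
# case reports. Field names (OPEN_ACCESS, PUB_TYPE, resultList.result)
# are assumptions based on the public API docs; verify before use.
import requests

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def search_case_reports(term: str, page_size: int = 25) -> list[dict]:
    """Return basic metadata for open-access case reports matching `term`."""
    params = {
        "query": f'OPEN_ACCESS:Y AND PUB_TYPE:"Case Reports" AND {term}',
        "format": "json",
        "pageSize": page_size,
    }
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["resultList"]["result"]

for rec in search_case_reports("pneumonia")[:5]:
    print(rec.get("pmcid"), rec.get("title"))
```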
https://read.qxmd.com/read/38227993/contact-lens-sensor-for-ocular-inflammation-monitoring
#25
JOURNAL ARTICLE
Yuqi Shi, Lin Wang, Yubing Hu, Yihan Zhang, Wenhao Le, Guohui Liu, Michael Tomaschek, Nan Jiang, Ali K Yetisen
Contact lens sensors have emerged as point-of-care devices in recent healthcare developments for monitoring and diagnosing ocular physiological conditions. Fluorescence sensing technologies are widely applied in contact lens sensors due to their accuracy, high sensitivity, and specificity. Because the ascorbic acid (AA) level in tears is closely related to ocular inflammation, a fluorescent contact lens sensor incorporating a BSA-Au nanocluster (NC) probe is developed for in situ tear AA detection. The NCs are first synthesized to obtain a fluorescent probe, which exhibits high reusability through the quench/recover (KMnO4/AA) process...
January 6, 2024: Biosensors & Bioelectronics
https://read.qxmd.com/read/38203152/enhancing-surveillance-systems-integration-of-object-behavior-and-space-information-in-captions-for-advanced-risk-assessment
#26
JOURNAL ARTICLE
Minseong Jeon, Jaepil Ko, Kyungjoo Cheoi
This paper presents a novel approach to risk assessment that incorporates image captioning as a fundamental component to enhance the effectiveness of surveillance systems. The proposed system uses image captioning to generate descriptive captions that portray the relationships among objects, actions, and spatial elements within the observed scene, and then evaluates the risk level based on the content of these captions. After defining the risk levels to be detected in the surveillance system, we constructed a dataset consisting of [Image-Caption-Danger Score]...
January 3, 2024: Sensors
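To make the caption-to-risk step above concrete, here is a deliberately simple keyword-lookup sketch; the keyword table and scores are illustrative assumptions, not the paper's learned [Image-Caption-Danger Score] mapping.

```python
# A hedged sketch of scoring a generated caption by the riskiest keyword
# it contains. Keywords and scores are invented for illustration.
RISK_KEYWORDS = {
    "fighting": 3, "weapon": 3, "falling": 2, "running": 1, "walking": 0,
}

def risk_score(caption: str) -> int:
    words = caption.lower().split()
    return max((RISK_KEYWORDS.get(w, 0) for w in words), default=0)

print(risk_score("two people fighting near the entrance"))  # -> 3
```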
https://read.qxmd.com/read/38190676/user-unified-semantic-enhancement-with-momentum-contrast-for-image-text-retrieval
#27
JOURNAL ARTICLE
Yan Zhang, Zhong Ji, Di Wang, Yanwei Pang, Xuelong Li
As a fundamental and challenging task in bridging the language and vision domains, Image-Text Retrieval (ITR) aims to search for target instances that are semantically relevant to a given query from the other modality; its key challenge is measuring semantic similarity across modalities. Although significant progress has been achieved, existing approaches typically suffer from two major limitations: (1) directly exploiting bottom-up-attention-based region-level features, in which every region is treated equally, hurts the accuracy of the representation...
January 5, 2024: IEEE Transactions on Image Processing: a Publication of the IEEE Signal Processing Society
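The key challenge named in the abstract above, measuring semantic similarity across modalities, usually reduces to cosine similarity in a shared embedding space. The sketch below uses random vectors as stand-ins for the paper's learned encoders, so only the retrieval mechanics are real.

```python
# A minimal sketch of cross-modal retrieval by cosine similarity in a
# shared embedding space. Random projections stand in for real encoders.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pretend embeddings: 4 images and 4 captions in a shared 256-d space.
image_emb = l2_normalize(rng.normal(size=(4, 256)))
text_emb = l2_normalize(rng.normal(size=(4, 256)))

# Cosine similarity matrix: sim[i, j] = <image_i, text_j>.
sim = image_emb @ text_emb.T

# Text-to-image retrieval: for each caption, rank images by similarity.
ranks = np.argsort(-sim.T, axis=1)
print(ranks[0])  # image indices, best match first, for caption 0
```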
https://read.qxmd.com/read/38124859/retracted-medical-image-captioning-using-optimized-deep-learning-model
#28
Computational Intelligence and Neuroscience
[This retracts the article DOI: 10.1155/2022/9638438.]
2023: Computational Intelligence and Neuroscience
https://read.qxmd.com/read/38109234/enhancing-visual-grounding-in-vision-language-pre-training-with-position-guided-text-prompts
#29
JOURNAL ARTICLE
Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan
Vision-Language Pre-Training (VLP) has demonstrated remarkable potential in aligning image and text pairs, paving the way for a wide range of cross-modal learning tasks. Nevertheless, we have observed that VLP models often fall short in terms of visual grounding and localization capabilities, which are crucial for many downstream tasks, such as visual reasoning. In response, we introduce a novel Position-guided Text Prompt (PTP) paradigm to bolster the visual grounding abilities of cross-modal models trained with VLP...
December 18, 2023: IEEE Transactions on Pattern Analysis and Machine Intelligence
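As a rough illustration of the position-guided prompt idea above, the sketch below partitions an image into a grid and phrases an object's location as a fill-in-the-blank sentence; the template string and grid size are assumptions, not necessarily the paper's exact design.

```python
# A hedged sketch of position-guided text prompts: map an object's pixel
# location to a grid block and verbalize it. Template and grid size are
# illustrative assumptions.
def position_prompt(block_id: int, obj: str) -> str:
    return f"The block {block_id} has a {obj}."

def block_of(x: float, y: float, w: int, h: int, grid: int = 3) -> int:
    """Map a pixel coordinate to a block index in a grid x grid partition."""
    col = min(int(x / w * grid), grid - 1)
    row = min(int(y / h * grid), grid - 1)
    return row * grid + col

# An object detected at pixel (400, 120) in a 640x480 image:
bid = block_of(400, 120, w=640, h=480)
print(position_prompt(bid, "dog"))  # -> "The block 1 has a dog."
```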
https://read.qxmd.com/read/38083226/image-captioning-for-the-visually-impaired-and-blind-a-recipe-for-low-resource-languages
#30
JOURNAL ARTICLE
Batyr Arystanbekov, Askat Kuzdeuov, Shakhizat Nurgaliyev, Huseyin Atakan Varol
Visually impaired and blind people often face a range of socioeconomic problems that can make it difficult for them to live independently and participate fully in society. Advances in machine learning open new avenues for implementing assistive devices for the visually impaired and blind. In this work, we combined image captioning and text-to-speech technologies to create an assistive device for the visually impaired and blind. Our system provides the user with descriptive auditory feedback, in the Kazakh language, on a scene acquired in real time by a head-mounted camera...
July 2023: Annual International Conference of the IEEE Engineering in Medicine and Biology Society
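The pipeline described above reduces to two stages: caption a camera frame, then speak the caption. A minimal sketch follows; caption_frame is a hypothetical placeholder for the captioning model, and pyttsx3 stands in for the paper's Kazakh text-to-speech component.

```python
# A minimal caption-then-speak sketch. caption_frame is a hypothetical
# stand-in for a captioning model; pyttsx3 is an offline TTS engine used
# here only as a generic substitute for the paper's Kazakh TTS.
import pyttsx3

def caption_frame(frame) -> str:
    """Placeholder for an image-captioning model inference call."""
    return "a person is standing near a doorway"

def describe_aloud(frame) -> None:
    caption = caption_frame(frame)
    engine = pyttsx3.init()
    engine.say(caption)
    engine.runAndWait()
```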
https://read.qxmd.com/read/38062184/refcap-image-captioning-with-referent-objects-attributes
#31
JOURNAL ARTICLE
Seokmok Park, Joonki Paik
In recent years, significant progress has been made in visual-linguistic multimodality research, leading to advancements in visual comprehension and its applications in computer vision tasks. One fundamental task in visual-linguistic understanding is image captioning: generating a human-understandable textual description of an input image. This paper introduces a referring-expression image captioning model that incorporates supervision from objects of interest. Our model uses user-specified object keywords as a prefix to generate captions specific to the target object...
December 7, 2023: Scientific Reports
https://read.qxmd.com/read/38048244/protoclip-prototypical-contrastive-language-image-pretraining
#32
JOURNAL ARTICLE
Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Shaoqiu Zheng, Ying Tan, Erjin Zhou
Contrastive language image pretraining (CLIP) has received widespread attention since its learned representations transfer well to various downstream tasks. During training of the CLIP model, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation-grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerging within-modal anchors. Based on this understanding, this article introduces prototypical contrastive language image pretraining (ProtoCLIP), which enhances such grouping by boosting its efficiency and increasing its robustness to the modality gap...
December 4, 2023: IEEE Transactions on Neural Networks and Learning Systems
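For readers unfamiliar with the InfoNCE objective the abstract builds on, here is a minimal PyTorch sketch: matched image-text pairs form the diagonal of a batch similarity matrix and are pulled together, while off-diagonal pairs are pushed apart. This is the standard CLIP-style loss, not ProtoCLIP's prototype extension.

```python
# A minimal sketch of the symmetric InfoNCE loss used in CLIP-style
# pretraining: pair i's image and text are positives; all other batch
# pairs are negatives.
import torch
import torch.nn.functional as F

def info_nce(image_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0))           # pair i matches pair i
    # Symmetric cross-entropy over both retrieval directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```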
https://read.qxmd.com/read/38042601/radiology-report-generation-with-medical-knowledge-and-multilevel-image-report-alignment-a-new-method-and-its-verification
#33
JOURNAL ARTICLE
Guosheng Zhao, Zijian Zhao, Wuxian Gong, Feng Li
Medical report generation is an integral part of computer-aided diagnosis, aimed at reducing the workload of radiologists and physicians and alerting them to misdiagnosis risks. In general, medical report generation is an image captioning task. Because medical reports are long sequences with data bias, existing medical report generation models lack medical knowledge and ignore the interactive alignment between the two modalities of reports and images. This paper attempts to mitigate these deficiencies by proposing an approach based on knowledge enhancement with multilevel alignment (MKMIA)...
December 2023: Artificial Intelligence in Medicine
https://read.qxmd.com/read/38035197/exsclaim-harnessing-materials-science-literature-for-self-labeled-microscopy-datasets
#34
JOURNAL ARTICLE
Eric Schwenker, Weixin Jiang, Trevor Spreadbury, Nicola Ferrier, Oliver Cossairt, Maria K Y Chan
This work introduces the EXSCLAIM! toolkit for the automatic extraction, separation, and caption-based natural language annotation of images from the scientific literature. EXSCLAIM! is used to show how rule-based natural language processing and image recognition can be leveraged to construct an electron microscopy dataset containing thousands of keyword-annotated nanostructure images. Moreover, it is demonstrated how a combination of statistical topic modeling and semantic word-similarity comparisons can increase the number and variety of keyword annotations beyond the standard annotations from EXSCLAIM! With large-scale imaging datasets constructed from the scientific literature, users are well positioned to train neural networks for classification and recognition tasks specific to microscopy, tasks often otherwise inhibited by a lack of sufficient annotated training data...
November 10, 2023: Patterns
https://read.qxmd.com/read/38034836/dense-captioning-and-multidimensional-evaluations-for-indoor-robotic-scenes
#35
JOURNAL ARTICLE
Hua Wang, Wenshuai Wang, Wenhao Li, Hong Liu
The field of human-computer interaction is expanding, especially within the domain of intelligent technologies. Scene understanding, which entails the generation of advanced semantic descriptions from scene content, is crucial for effective interaction. Despite its importance, it remains a significant challenge. This study introduces RGBD2Cap, an innovative method that uses RGBD images for scene semantic description. We utilize a multimodal fusion module to integrate RGB and Depth information for extracting multi-level features...
2023: Frontiers in Neurorobotics
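The multimodal fusion module mentioned above is not specified in this excerpt; the sketch below shows the generic pattern such modules follow, channel-concatenating RGB and depth feature maps and mixing them with a 1x1 convolution. Treat the layer sizes as assumptions, not RGBD2Cap's actual architecture.

```python
# A hedged sketch of a generic RGB-D fusion module: concatenate the two
# feature maps along channels and mix with a 1x1 convolution. Channel
# counts are illustrative assumptions.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, rgb_ch: int, depth_ch: int, out_ch: int):
        super().__init__()
        self.mix = nn.Conv2d(rgb_ch + depth_ch, out_ch, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor):
        fused = torch.cat([rgb_feat, depth_feat], dim=1)  # channel concat
        return torch.relu(self.mix(fused))

fusion = ConcatFusion(rgb_ch=256, depth_ch=64, out_ch=256)
out = fusion(torch.randn(1, 256, 14, 14), torch.randn(1, 64, 14, 14))
```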
https://read.qxmd.com/read/38018504/the-readability-of-patient-facing-social-media-posts-on-common-otolaryngologic-diagnoses
#36
JOURNAL ARTICLE
Elliot Morse, Eseosa Odigie, Helen Gillespie, Anaïs Rameau
OBJECTIVE: To assess the readability of patient-facing educational information about the most common otolaryngology diagnoses on popular social media platforms.
STUDY DESIGN: Cross-sectional study.
SETTING: Social media platforms.
METHODS: The top 5 otolaryngologic diagnoses were identified from the National Ambulatory Medical Care Survey database. Facebook, Twitter, TikTok, and Instagram were searched using these terms, and the top 25 patient-facing posts from unique accounts for each search term and poster type (otolaryngologist, other medical professional, layperson) were identified...
November 29, 2023: Otolaryngology—Head and Neck Surgery
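Readability of text like these posts is typically scored with grade-level formulas. A minimal sketch using the textstat package follows; the excerpt does not say which metrics the authors used, and the sample post is invented.

```python
# A minimal readability-scoring sketch with the textstat package. The
# sample post is invented; the study's actual metrics are not stated in
# this excerpt.
import textstat

post = ("Tonsillitis is an inflammation of the tonsils, usually caused by "
        "a viral or bacterial infection, and often improves on its own.")

print("Flesch Reading Ease:", textstat.flesch_reading_ease(post))
print("Flesch-Kincaid Grade:", textstat.flesch_kincaid_grade(post))
```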
https://read.qxmd.com/read/37952385/self-supervised-multi-modal-training-from-uncurated-images-and-reports-enables-monitoring-ai-in-radiology
#37
JOURNAL ARTICLE
Sangjoon Park, Eun Sun Lee, Kyung Sook Shin, Jeong Eun Lee, Jong Chul Ye
The escalating demand for artificial intelligence (AI) systems that can monitor and supervise human errors and abnormalities in healthcare presents unique challenges. Recent advances in vision-language models reveal the challenges of monitoring AI by understanding both visual and textual concepts and their semantic correspondences. However, there has been limited success in applying vision-language models in the medical domain. Current vision-language models and learning strategies for photographic images and captions call for a web-scale corpus of image-text pairs, which is often not feasible in the medical domain...
November 7, 2023: Medical Image Analysis
https://read.qxmd.com/read/37935806/your-smartphone-could-act-as-a-pulse-oximeter-and-as-a-single-lead-ecg
#38
JOURNAL ARTICLE
Ahsan Mehmood, Asma Sarouji, M Mahboob Ur Rahman, Tareq Y Al-Naffouri
In the post-COVID-19 era, every new wave of the pandemic heightens public concern and interest in learning more about one's state of well-being. It is therefore timely to develop ubiquitous, low-cost, non-invasive tools for rapid and continuous monitoring of the body vitals that reflect overall health. Against this backdrop, this work proposes a deep learning approach to turn a smartphone, the popular hand-held personal gadget, into a diagnostic tool that measures and monitors the three most important body vitals, i...
November 6, 2023: Scientific Reports
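The excerpt does not detail the paper's deep learning pipeline, but the classical signal-processing core of camera-based pulse estimation is easy to sketch: the mean red-channel intensity of fingertip video frames forms a photoplethysmogram (PPG) whose dominant frequency gives the pulse rate. The signal below is synthetic and the fixed frame rate is an assumption.

```python
# A hedged sketch of pulse estimation from a camera-derived PPG signal:
# find the dominant frequency in the physiologically plausible band.
import numpy as np

def heart_rate_bpm(red_means: np.ndarray, fps: float) -> float:
    """Estimate pulse rate from per-frame mean red-channel values."""
    signal = red_means - red_means.mean()          # remove DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)         # 42-240 bpm band
    peak = freqs[band][np.argmax(spectrum[band])]
    return peak * 60.0

# Synthetic 30 fps, 10 s recording with a 1.2 Hz (72 bpm) pulse component.
t = np.arange(300) / 30.0
print(heart_rate_bpm(100 + np.sin(2 * np.pi * 1.2 * t), fps=30.0))  # 72.0
```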
https://read.qxmd.com/read/37935698/deeppatent2-a-large-scale-benchmarking-corpus-for-technical-drawing-understanding
#39
JOURNAL ARTICLE
Kehinde Ajayi, Xin Wei, Martin Gryder, Winston Shields, Jian Wu, Shawn M Jones, Michal Kucer, Diane Oyen
Recent advances in computer vision (CV) and natural language processing have been driven by exploiting big data in practical applications. However, these research fields are still limited by the sheer volume, versatility, and diversity of the available datasets. CV tasks such as image captioning, which has primarily been carried out on natural images, still struggle to produce accurate and meaningful captions for the sketched images often included in scientific and technical documents. The advancement of other tasks, such as 3D reconstruction from 2D images, requires larger datasets with multiple viewpoints...
November 7, 2023: Scientific Data
https://read.qxmd.com/read/37930907/image-captioning-with-controllable-and-adaptive-length-levels
#40
JOURNAL ARTICLE
Ning Ding, Chaorui Deng, Mingkui Tan, Qing Du, Zhiwei Ge, Qi Wu
Image captioning is one of the fundamental problems of computer vision and has drawn great attention over the years. However, most existing image captioning methods focus on improving the quality of the captions while ignoring control over caption style. In this work, we aim to improve the controllability of image captioning methods, specifically the ability to describe an image either roughly or in detail. We find this can be achieved by adding a simple length-level embedding to existing models, which enables them to generate length-controllable captions describing the image at a specified level of detail, and further improves diversity...
November 6, 2023: IEEE Transactions on Pattern Analysis and Machine Intelligence
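The abstract's central mechanism, a length-level embedding added to an existing captioner, can be sketched in a few lines of PyTorch. The dimensions, vocabulary size, and number of levels below are illustrative assumptions, not the paper's configuration.

```python
# A minimal sketch of conditioning a caption decoder on a learned
# "length level": one level vector is broadcast over every token input.
# All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class LengthConditionedEmbedding(nn.Module):
    def __init__(self, vocab_size: int = 10000, d_model: int = 512,
                 num_length_levels: int = 4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.length_level = nn.Embedding(num_length_levels, d_model)

    def forward(self, token_ids: torch.Tensor, level: int) -> torch.Tensor:
        # Broadcast one length-level vector over every caption position.
        lvl = self.length_level(torch.tensor(level))
        return self.tok(token_ids) + lvl

emb = LengthConditionedEmbedding()
x = emb(torch.randint(0, 10000, (1, 12)), level=3)  # level 3 = most detailed
```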