Journal of Jinggangshan University (Natural Science Edition)

Article Abstract

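LI Jiaxin, TANG Pengjie, TAN Yunlan, ZHANG Li. CIC-CGT: Comic image captioning and description with multimodal large scale model[J]. Journal of Jinggangshan University (Natural Science Edition), 2024, 45(6): 77-86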

CIC-CGT: Comic Image Caption and Description Generation Based on a Multimodal Large Model

CIC-CGT: COMIC IMAGE CAPTIONING AND DESCRIPTION WITH MULTIMODAL LARGE SCALE MODEL

Received: 2024-06-12  Revised: 2024-07-23

DOI: 10.3969/j.issn.1674-8085.2024.06.011

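Keywords (translated from the Chinese): large model; comic image; caption generation and description; cross-modal learning; natural language processing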

Keywords (English): large scale model; comic images; image captioning; multimodal learning; natural language processing

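Funding: National Natural Science Foundation of China (62362041, 62062041, 62362003); Key Science and Technology Project of the Jiangxi Provincial Department of Education (GJJ211009)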

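Author	Affiliation	E-mail
LI Jiaxin	School of Electronics and Information Engineering, Jinggangshan University, Ji'an, Jiangxi 343009
TANG Pengjie	School of Electronics and Information Engineering, Jinggangshan University, Ji'an, Jiangxi 343009; Jiangxi Provincial Key Laboratory of Electronic Data Control and Forensics, Ji'an, Jiangxi 343009	tangpengjie@jgsu.edu.cn
TAN Yunlan	School of Electronics and Information Engineering, Jinggangshan University, Ji'an, Jiangxi 343009; Jiangxi Provincial Key Laboratory of Electronic Data Control and Forensics, Ji'an, Jiangxi 343009
ZHANG Li	School of Electronics and Information Engineering, Jinggangshan University, Ji'an, Jiangxi 343009; Jiangxi Provincial Key Laboratory of Electronic Data Control and Forensics, Ji'an, Jiangxi 343009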

Abstract views: 730

Full-text downloads: 1025

Abstract (translated from the Chinese):

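Unlike traditional image description tasks, comic image description involves not only image recognition and natural language processing but also requires the model to understand the humor, culture, and emotional attributes particular to comics. To address these challenges, this study proposes the task of comic image caption and description generation and, building on multimodal large models, designs a new framework for it (CIC-CGT). First, the CLIP model extracts comic image features, which are fed into a prefix embedding mapping module to obtain a vision-language aligned semantic representation. This representation is then fed into the GPT2 model and, combined with the CLIP visual features, used to generate a rough language description. Finally, the rough description is sent to the T5 model for language feature encoding and decoded into the final comic caption and description. Results on the comic image description dataset NYCCB show that the proposed model can generate comic captions and descriptions in different styles and can accurately capture and express the humor and emotional depth unique to comics.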

English abstract:

Unlike traditional image description tasks, comic image description involves not only image recognition and natural language processing but also requires the model to deeply understand the humor, culture, and emotional attributes unique to comics. In response to these challenges, this work proposes the task of comic image captioning and description and develops a novel framework, based on multimodal large models, for generating comic captions and descriptions (CIC-CGT). Comic image features are first extracted by the CLIP model and fed into a prefix embedding mapping module. The output of this module is then fed into the GPT2 model, which, combined with the CLIP visual features, generates a rough language description. Finally, the rough description is sent to the T5 model for language feature encoding and decoded into the final comic title and description. Results on the comic image description dataset NYCCB show that the proposed model can generate comic titles and descriptions in different styles and can accurately capture and express the unique humor and emotional depth of comics.
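
The prefix embedding mapping step can be illustrated with a toy sketch. The abstract does not say how the mapping module is built, so the single linear projection below is an assumption, as are all names (`map_to_prefix`, `build_decoder_input`) and the tiny dimensions. Random lists stand in for the CLIP image feature and the GPT2 token embeddings, and no real model is loaded.

```python
import random

CLIP_DIM = 8      # assumed toy size; real CLIP image features are much larger
GPT_DIM = 4       # assumed toy size of the language model's embedding space
PREFIX_LEN = 3    # assumed number of prefix vectors passed to the language model


def make_linear(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vector):
    """Multiply the weight matrix by the vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def map_to_prefix(image_feature, weights):
    """Project one image feature into PREFIX_LEN vectors of size GPT_DIM."""
    flat = apply_linear(weights, image_feature)
    return [flat[i * GPT_DIM:(i + 1) * GPT_DIM] for i in range(PREFIX_LEN)]


def build_decoder_input(prefix, token_embeddings):
    """Place the visual prefix in front of the text token embeddings."""
    return prefix + token_embeddings


rng = random.Random(0)
image_feature = [rng.random() for _ in range(CLIP_DIM)]      # stand-in for a CLIP feature
weights = make_linear(CLIP_DIM, PREFIX_LEN * GPT_DIM, rng)    # untrained mapping weights
prefix = map_to_prefix(image_feature, weights)
tokens = [[0.0] * GPT_DIM for _ in range(2)]                  # stand-in for two token embeddings
decoder_input = build_decoder_input(prefix, tokens)
print(len(decoder_input), len(decoder_input[0]))              # prints "5 4"
```

In this sketch the language model would see three visual prefix vectors followed by two text embeddings, which is one common way to condition a text decoder on image features. How CIC-CGT additionally combines the CLIP visual features inside GPT2, and how T5 refines the rough description, is not specified in the abstract and is not modeled here.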
