Feature Decoupling and Cross-Modal Deep Interaction Enhancement for Multimodal Sentiment Analysis
Affiliation:

1 School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China; 2 School of Computer, Huainan Normal University, Huainan 232038, China; 3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230026, China

Corresponding author:

张顺香 (Zhang Shunxiang), male, professor, doctoral supervisor, E-mail: sxzhang@aust.edu.cn.

CLC number:

TP391.1

Fund Project:

General Program of the National Natural Science Foundation of China (62476005, 62076006); Open Project of the State Key Laboratory of Cognitive Intelligence (COGOS-2023HE02); University Synergy Innovation Program of Anhui Province (GXXT-2021-008).

Abstract:

Mainstream multimodal sentiment analysis (MSA) models rely on cross-modal attention mechanisms to process features from different modalities. These methods do not account for the similarities and differences among modality features, so cross-modal interaction tends to produce redundant shared information and added noise, which degrades model performance. This paper proposes an MSA model based on feature decoupling and cross-modal deep interaction enhancement (FD-CMDIE). First, in the feature extraction module, NeoBERT extracts high-quality textual features, and stacked long short-term memory (LSTM) networks extract visual and acoustic features. Next, a common encoder and private encoders decouple the features of the three modalities into similarity features and dissimilarity features. Contrastive learning, with the textual similarity features as anchors, pulls similarity features closer together in the feature space and pushes dissimilarity features further apart. Finally, a cross-modal interaction enhancement network performs deep interaction and fusion of the decoupled features, and a gated attention pooling module filters out the noise produced by the interaction. In experiments on two benchmark datasets, the proposed method outperforms several state-of-the-art methods on the vast majority of metrics, which supports its effectiveness.
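
The contrastive step described in the abstract, with the textual shared feature as the anchor, can be illustrated with a minimal sketch. This is not the paper's loss: the abstract does not give its formula, so the InfoNCE-style form, the cosine similarity, the `temperature` value, and the function names `cosine` and `anchor_contrastive_loss` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    anchor    -- shared feature of the text modality
    positives -- shared features of the other modalities (to be pulled closer)
    negatives -- modality-specific features (to be pushed away)
    """
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    # Average the negative log-probability of each positive among all candidates.
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Hypothetical 2-D features, for illustration only.
text_shared = [1.0, 0.0]
audio_shared, visual_shared = [0.9, 0.1], [0.8, 0.2]
audio_private, visual_private = [0.0, 1.0], [-1.0, 0.2]

loss = anchor_contrastive_loss(
    text_shared,
    positives=[audio_shared, visual_shared],
    negatives=[audio_private, visual_private],
)
print(loss)
```

Under these assumptions the loss is small when the audio and visual shared features lie near the text anchor and the modality-specific features lie far from it, and it grows when those roles are reversed.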

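Cite this article

赵智伟, 张顺香, 孙亮, 魏可欣, 陈梦. Feature decoupling and cross-modal deep interaction enhancement for multimodal sentiment analysis[J]. 南京航空航天大学学报 (Journal of Nanjing University of Aeronautics & Astronautics), 2026, 58(3): 666-681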

History
  • Received: 2025-08-15
  • Last revised: 2026-10-06
  • Published online: 2026-06-18