Introduction
With the diversification of Internet content, single-modal data struggles to fully characterize user preferences and item attributes and cannot meet the diverse demands of complex scenarios. Consequently, multimodal recommendation systems \cite{bib1,bib2,bib3} have emerged and been widely applied in domains such as e-commerce, social media, and streaming media. Unlike single-modal recommendation systems \cite{bib4}, multimodal recommendation systems integrate multiple types of information (e.g., text, images, and audio), enabling more comprehensive modeling of users and items.
Despite their significant potential, multimodal recommendation systems still face several unresolved issues in practice. In the field of multimodal denoising recommendation, most current methods are built around cross-modal attention mechanisms \cite{bib5}. The computational cost of such methods grows rapidly with feature dimensionality and sequence length, making it difficult to meet response-time requirements on large-scale data. Moreover, these methods overlook the hierarchical structure inherent in semantic information and often align only surface-level features rather than accurately matching deep-level semantics, which impairs recommendation performance. When dealing with multimodal noise, traditional contrastive learning \cite{bib6,bib7} lacks a confidence evaluation of denoised features, making it hard to distinguish residual noise from effective semantics. This easily leads to two problems during denoising: key semantics are mistakenly filtered out as noise, or excessive noise is retained; both compromise the robustness of the features.
To overcome these limitations, we propose Multimodal Denoising Recommendation based on Confidence and Hierarchical Cross-Modal Alignment (MDR-CHCA). The core idea of our approach is to balance computational efficiency against fine-grained cross-modal semantic alignment, while also enhancing the discriminative power of modal features against noise.
Related works
The field of multimodal denoising recommendation has matured over years of development, yielding a diverse range of high-performing models. Researchers have improved recommendation effectiveness through various technical approaches. This section reviews the relevant literature in two interconnected areas: multimodal recommendation and denoising recommendation.
Multimodal Recommendation
Multimodal recommendation, which integrates multi-source information (e.g., item images and text), enriches item representations and enhances the modeling of user preferences, and has therefore emerged as a key direction for improving recommendation performance. Research has focused on the effective extraction and fusion of multimodal features, yielding a series of representative results. Some studies embed single-modal features into traditional recommendation frameworks: VBPR \cite{bib8} embeds visual features into a matrix factorization framework, while ConvMF \cite{bib9} combines convolutional neural networks (CNNs) for text feature extraction with probabilistic matrix factorization. Others fuse modalities via graph structures: MMGCN \cite{bib10} constructs a user-item bipartite graph for each modality, and MEGCF \cite{bib11} integrates semantic entities from multimodal data into user-item interaction graphs and propagates information with graph convolutional networks (GCNs). Still others rely on disentangled representation learning or attention mechanisms: DMRL \cite{bib12} disentangles the factors influencing user preferences within each modality and employs a multimodal attention network to model user-specific factor weights across modalities. These algorithms fuse information from multiple perspectives and improve recommendation performance to a certain extent. However, in multimodal recommendation, multi-source modal information may contain noise; this noise impairs the accuracy of feature extraction and the effectiveness of fusion, reducing recommendation performance. Consequently, denoising technology has become a critical component of multimodal recommendation systems.
Denoising Recommendation
The problem of data noise remains a critical challenge that constrains system performance. As recommendation scenarios grow more complex and data sources more diverse, effective denoising techniques are essential for enhancing the accuracy and robustness of recommender systems.
Research on denoising has been conducted across various recommendation domains. To address noise in self-attentive sequential recommenders, Chen et al. \cite{bib13} proposed Rec-Denoiser, which employs differentiable masking and Jacobian regularization to mitigate the impact of noisy items. For session-based sequential recommendation, where noise arises from dynamic user intentions and behavioral uncertainty, Zhang et al. \cite{bib14} introduced a Dynamic Intention-Aware Iterative Denoising Network (DIDN) to construct dynamic item embeddings and explicitly filter noisy clicks. Also targeting session recommendation, Feng et al. \cite{bib15} proposed a GNN with Global Noise Filtering (GNN-GNF) framework, which uses item-level and session-level filtering modules to remove irrelevant data while capturing local and global user interests via GNNs. Similarly, Luo et al. \cite{bib16} designed a Dual-Perspective Denoising Model (DPDM) that identifies noisy items and reconstructs noisy graph structures, addressing the inability of traditional methods to capture true user intention transitions. Recent research has confirmed the advantages of multimodal alignment in multimodal recommendation systems \cite{bib17,bib18,bib19}. The mainstream alignment methods in this field mostly adopt item-level alignment based on contrastive learning: SLMRec \cite{bib17} and MICRO \cite{bib19} align content across different modalities, while BM3 \cite{bib18} aligns behavior representations with multimodal content. These methods pull together representations of the same item learned under different views while pushing apart representations of different items \cite{bib17,bib18,bib19}. In multimodal recommendation specifically, which is most relevant to our work, denoising must handle noise across multiple modalities. Xu et al. \cite{bib20} tackled three key issues: multimodal content noise, user feedback noise, and their misalignment.
Their DA-MRS system reduces multimodal noise by constructing modality-specific item-item semantic and behavioral graphs. To address the strong dependence on graph structure and excessive interaction noise in multimodal recommendation, Xu et al. \cite{bib21} proposed a Collaborative Denoising Graph Contrastive Learning (CDGCL) framework, which uses a modality-aware contrastive learning module to capture inter-modal and intra-modal collaborative relationships and a multi-strategy denoising module to filter irrelevant interactions, thereby improving recommendation performance. Yuan et al. \cite{bib22} proposed a Contrastive Learning-based Cross-Modal Feature Alignment and Fusion Model (CLAM), which uses item ID embeddings as anchors for indirect contrastive learning to align cross-modal features, and combines multi-task learning to address information loss or noise introduced during feature alignment, as well as insufficient modal fusion caused by over-reliance on interaction data.
While these existing methods have made significant progress in denoising, they often lack targeted mechanisms for handling fine-grained, local noise (e.g., in specific image regions or words), which is crucial for precise cross-modal alignment. To bridge this gap, this study proposes hierarchical alignment: phrase-image region alignment compresses token-level text features and calculates similarity, reducing unnecessary computation and laying the foundation for screening more relevant image regions in subsequent steps; token-image region alignment adopts a multi-head attention mechanism to aggregate image region features, preserving fine-grained semantic interaction information. Additionally, a confidence-weighted contrastive loss module is designed: it calculates the contrastive loss from confidence evaluations over positive and negative sample pairs, and optimizes the alignment parameters via backpropagation, ultimately enhancing the discriminative ability of modal features.
MDR-CHCA Model
The core idea of this model is to perform preliminary denoising on modal features via a self-attention mechanism, then utilize a hierarchical alignment mechanism to gradually explore the semantic correlations between text descriptions and visual content, thereby addressing the noise caused by insufficient modal alignment. Subsequently, the model optimizes the aligned features by calculating the confidence-weighted contrastive loss, and finally generates fused features with strong noise discriminability for rating prediction. As illustrated in Figure 1, the overall model architecture consists of three key components: (1) Multimodal feature encoding and denoising: Preprocessing and preliminary denoising of data are implemented using graph convolutional networks (GCNs) and an attention mechanism; (2) Hierarchical cross-modal alignment: Semantic alignment between text and images (from coarse to fine granularity) is achieved through phrase-image and token-image alignment; (3) Confidence-weighted contrastive loss: Alignment parameters and aligned features are optimized by calculating the confidence loss function. Details are as follows:
Multimodal feature encoding and denoising: Text and image data are encoded to generate corresponding embedding matrices. Then, the k-nearest neighbors (KNN) algorithm is used to construct adjacency matrices for text and images. After that, a graph convolutional network (GCN) is employed to aggregate neighbor features, and a self-attention mechanism performs preliminary denoising on the modalities, outputting purified text features $\tilde{H}_t$ and image features $\tilde{H}_v$.
Hierarchical cross-modal alignment: For phrase-level alignment, text tokens are compressed into phrase features via average pooling, and a similarity matrix is calculated with the image patches; the top-$k_r$ regions with the highest similarity are retained to generate a mask matrix. For token-level alignment, based on the phrase-level mask, tokens inherit the mask of the phrases they belong to, extending it to the token level so as to focus on semantically relevant regions. The alignment weights are then calculated with a multi-head attention mechanism, and the top-$k_s$ regions are retained again. Finally, fine-grained semantic alignment is achieved by weighted aggregation of image region features, effectively removing the noise caused by semantic misalignment.
Confidence-weighted contrastive loss: Alignment parameters and aligned features are optimized through "confidence evaluation → selection of positive and negative samples → calculation of InfoNCE loss function", which maximizes the similarity of high-confidence pairs and minimizes the similarity of negative sample pairs, thereby enhancing the discriminability of denoised features and effectively distinguishing noise residues.
Problem Definition
The goal of a recommendation system is to provide users with personalized item recommendations. Let $\mathcal{U}$ denote the user set and $\mathcal{I}$ denote the item set. The user-item interaction matrix is $R \in \{0,1\}^{M \times N}$, where $M$ represents the number of users and $N$ represents the number of items. The matrix element $r_{ui}$ indicates whether there is an interaction (such as a click or purchase) between user $u$ and item $i$: if there is an interaction, $r_{ui} = 1$; otherwise, $r_{ui} = 0$. The normalized adjacency matrix of the user-item interaction graph is $\tilde{A}$. The multimodal features of each item $i$ are extracted through pre-trained models, where $E^t$ represents the feature matrix of text and $E^v$ represents the feature matrix of images, with $d$ being the embedding dimension.
Multi-modal feature encoding and denoising
Incorporating multimodal features offers significant advantages for enhancing the performance of recommendation systems, especially when processing different types of information. First, the embedding representations of users and items are updated through propagation via multi-layer graph convolutions:
\begin{equation}
e_u^{(l+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}} e_i^{(l)}, \qquad
e_i^{(l+1)} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i|}\sqrt{|\mathcal{N}_u|}} e_u^{(l)}
\end{equation}
Here, $e_u^{(l)}$ and $e_i^{(l)}$ denote the embedding representations of user $u$ and item $i$ at the $l$-th layer, $\mathcal{N}_u$ and $\mathcal{N}_i$ represent the neighborhood sets of user $u$ and item $i$, respectively, and $\frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}}$ is a symmetric normalization term.
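As a concrete illustration, the symmetrically normalized propagation above can be sketched in a few lines of NumPy. The function and variable names here are illustrative placeholders, not the authors' code:

```python
import numpy as np

def propagate(R, user_emb, item_emb):
    """One layer of symmetrically normalized user-item propagation,
    following the 1/sqrt(|N_u||N_i|) aggregation described above."""
    deg_u = R.sum(axis=1, keepdims=True)   # |N_u| for each user
    deg_i = R.sum(axis=0, keepdims=True)   # |N_i| for each item
    norm = R / np.sqrt(np.maximum(deg_u, 1.0) * np.maximum(deg_i, 1.0))
    new_user = norm @ item_emb             # aggregate each user's item neighbors
    new_item = norm.T @ user_emb           # aggregate each item's user neighbors
    return new_user, new_item

# toy interaction matrix: user 0 -> items 0 and 2, user 1 -> item 1
R = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
eu = np.arange(8.0).reshape(2, 4)
ei = np.arange(12.0).reshape(3, 4)
eu1, ei1 = propagate(R, eu, ei)
```

Stacking several such layers (and averaging or summing their outputs) yields the multi-layer propagation used in the model.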
For each word in a sentence, we use a Transformer \cite{bib23,bib24} to extract Token features, obtaining the text Token embedding matrix $E^t = [t_1, t_2, \ldots, t_n]$, where $t_i$ is the feature vector of the $i$-th Token. For the input image, non-overlapping patches of $16 \times 16$ pixels are adopted to generate 256 image Patches, each serving as an independent visual unit. The pre-trained ViT-B/16 is then used to extract Patch features, yielding the image Patch embedding matrix $E^v = [v_1, v_2, \ldots, v_m]$, where $v_p$ is the feature vector of the $p$-th Patch and $d$ is the embedding dimension. Then, KNN adjacency matrices $A^t$ and $A^v$ are constructed for the text and image features, respectively, and features are extracted via graph convolution:
\begin{equation}
H^{(l+1)} = \sigma\!\left(D^{-\frac{1}{2}} A D^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)
\end{equation}
where $H_t^{(l+1)}$ and $H_v^{(l+1)}$ denote the text and image feature matrices after the $(l+1)$-th graph convolution (with $A$ instantiated as $A^t$ or $A^v$), $\sigma$ is the ReLU activation function, $D$ is the diagonal degree matrix, and $W^{(l)}$ is a learnable weight matrix that maps features from the $l$-th layer to the $(l+1)$-th layer.
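The KNN graph construction step can be sketched as follows. This is a minimal illustration of how a modality graph such as $A^t$ or $A^v$ could be formed from cosine similarity; the function name and details are assumptions, not the authors' implementation:

```python
import numpy as np

def knn_adjacency(feats, k):
    """Build a k-nearest-neighbour adjacency matrix from cosine similarity,
    one row of k ones per node (a sketch of the modality graphs A^t / A^v)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    adj = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, :k]        # top-k neighbours per node
    np.put_along_axis(adj, idx, 1.0, axis=1)
    return adj

feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A = knn_adjacency(feats, k=1)
```

The resulting binary adjacency would then be degree-normalized before being used in the graph convolution above.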
Then, self-attention is applied to the high-level text features $H_t$ extracted via graph convolution to capture the dependencies between feature vectors:
\begin{equation}
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V
\end{equation}
where $Q = H_t W_Q$, $K = H_t W_K$, and $V = H_t W_V$ all come from the features of the same modality. Then, a residual coefficient $\lambda$ is used to fuse the original features with the self-attention output, performing preliminary denoising on the text features:
\begin{equation}
\tilde{H}_t = \lambda H_t + (1 - \lambda)\,\mathrm{Attn}(Q, K, V)
\end{equation}
where $\lambda$ is a trainable parameter. This retains the valid information in the original features while exploiting the noise-suppression effect of the self-attention mechanism. The preliminarily denoised image features $\tilde{H}_v$ are obtained in the same way.
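The residual-fused self-attention denoising step can be sketched as below. All parameter names are illustrative placeholders, and $\lambda$ is shown as a plain argument rather than a trained parameter:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attn_denoise(H, Wq, Wk, Wv, lam):
    """Preliminary denoising sketch: self-attention over one modality's
    features, fused with the input through the residual coefficient lam."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    A = softmax(Q @ K.T / np.sqrt(H.shape[1]))   # token-token dependencies
    return lam * H + (1.0 - lam) * (A @ V)

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attn_denoise(H, Wq, Wk, Wv, lam=0.6)
```

Note that $\lambda = 1$ recovers the original features unchanged, while smaller values lean more on the attention output.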
Hierarchical Cross-Modal Alignment
Most existing studies on multimodal denoising recommendation are based on cross-modal attention mechanisms. Current methods achieve alignment via full attention, which leads to high computational complexity and overlooks the hierarchical nature of semantics. In this section, hierarchical cross-modal alignment is adopted to accurately match the semantics between different modalities, thereby reducing the noise interference caused by insufficient modal alignment.
Phrase-Image Region Alignment
The preliminarily denoised token-level text features $\tilde{H}_t$ are divided into $K$ phrases according to the Token sequence, and average pooling is performed on the Token features within each phrase:
\begin{equation}
p_k = \frac{1}{e_k - b_k + 1} \sum_{j = b_k}^{e_k} \tilde{t}_j
\end{equation}
where $b_k$ and $e_k$ are the start and end indices of the $k$-th phrase, and $K$ is the total number of phrases. The cosine similarity between phrase features and image region features is then calculated:
\begin{equation}
s_{kp} = \frac{p_k^{\top} \tilde{v}_p}{\lVert p_k \rVert \, \lVert \tilde{v}_p \rVert}
\end{equation}
where $\tilde{v}_p$ is the Patch feature after preliminary denoising. For each phrase $p_k$, the $k_r$ image regions with the highest similarity to it are selected to generate a binary mask matrix:
\begin{equation}
M^{t2v}_{kp} = \begin{cases} 1 & \text{if } s_{kp} \in \text{Top-}k_r(s_k) \\ 0 & \text{else} \end{cases}
\label{eq:binary_mask}
\end{equation}
$M^{t2v}_{kp} = 1$ indicates that the $p$-th image region ranks among the top $k_r$ in the similarity list of the $k$-th phrase. The binary mask is used for region feature screening in the subsequent token-image region alignment stage, reducing the computational complexity of the fine-grained stage from full token-region interaction to only the retained $k_r$ regions per phrase.
Similarly, by using the image region features $\tilde{H}_v$ and the text phrase features to calculate a similarity matrix and screen the Top-$k_r$ phrases for each region, we generate a mask $M^{v2t}$.
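The coarse phrase-image alignment above (pooling, cosine similarity, top-$k_r$ masking) can be sketched compactly. Phrase spans and shapes are invented for illustration:

```python
import numpy as np

def phrase_region_mask(tokens, spans, patches, k_r):
    """Coarse-alignment sketch: average-pool tokens into phrase features,
    compute cosine similarity with image patches, and keep the top-k_r
    regions per phrase as the binary mask M^{t2v}."""
    phrases = np.stack([tokens[b:e + 1].mean(axis=0) for b, e in spans])
    p = phrases / np.linalg.norm(phrases, axis=1, keepdims=True)
    v = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sim = p @ v.T                                # s_{kp}
    mask = np.zeros_like(sim)
    top = np.argsort(-sim, axis=1)[:, :k_r]
    np.put_along_axis(mask, top, 1.0, axis=1)
    return sim, mask

rng = np.random.default_rng(1)
tokens = rng.standard_normal((6, 4))             # 6 tokens
patches = rng.standard_normal((10, 4))           # 10 image regions
spans = [(0, 2), (3, 5)]                         # two phrases
sim, mask = phrase_region_mask(tokens, spans, patches, k_r=3)
```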
Token-Image Region Alignment
On the basis of coarse alignment, if the $w$-th word belongs to the $k$-th phrase, the Token-level mask inherits the phrase-level mask, directly mapping the phrase-level mask to the token level:
\begin{equation}
M^{tok}_{wp} = M^{t2v}_{kp}
\end{equation}
where $M^{tok}$ is the token-level mask. The expanded mask allows each word to interact with multiple image regions while inheriting the screening results of phrase-image region alignment. Then, a masked multi-head attention mechanism is used to calculate the token-region attention weights:
\begin{equation}
Attn_{wp} = \mathrm{softmax}\!\left(\frac{(\tilde{t}_w W_Q)(\tilde{v}_p W_K)^{\top}}{\sqrt{d_h}} \odot M^{tok}\right)
\end{equation}
where $W_Q$, $W_K$, and $W_V$ are learnable parameters, $h$ is the number of attention heads, $d_h$ represents the dimension of each attention head, and $\odot$ denotes element-wise multiplication. A secondary screening is performed on the attention weights, retaining only the top $k_s$ high-confidence connections for each word:
\begin{equation}
Attn_p = \begin{cases} Attn_{wp} & \text{if } Attn_{wp} \in \text{Top-}k_s(Attn_w) \\ 0 & \text{else} \end{cases}
\label{eq:attention_screen}
\end{equation}
$Attn_{wp}$ represents the attention weight between word $w$ and image region $p$. Finally, a sparse alignment weight matrix $Attn_p$ is generated and used to weight and aggregate the image region features:
\begin{equation}
\hat{t}_w = \sum_{p} Attn_p \, \tilde{v}_p
\end{equation}
where $\hat{t}_w$ is the aligned text feature, which retains fine-grained semantic interaction information and effectively reduces the noise introduced by insufficient semantic interaction. The aligned image features $\hat{H}_v$ are obtained in the same way.
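The fine-grained step (mask inheritance, masked attention, top-$k_s$ screening, weighted aggregation) can be sketched with a single attention head for brevity; all names and the single-head simplification are assumptions for illustration:

```python
import numpy as np

def masked_token_attention(tokens, patches, token_mask, k_s):
    """Fine-alignment sketch: attention from each token to the regions
    permitted by the inherited phrase mask, followed by a second top-k_s
    screening before weighted aggregation of region features."""
    d = tokens.shape[1]
    logits = tokens @ patches.T / np.sqrt(d)
    logits = np.where(token_mask > 0, logits, -1e9)   # inherit M^{t2v}
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    keep = np.argsort(-attn, axis=1)[:, :k_s]         # secondary screening
    sparse = np.zeros_like(attn)
    np.put_along_axis(sparse, keep, np.take_along_axis(attn, keep, axis=1), axis=1)
    sparse /= sparse.sum(axis=1, keepdims=True)       # re-normalize kept weights
    return sparse @ patches                           # aligned text features

rng = np.random.default_rng(2)
tokens = rng.standard_normal((6, 4))
patches = rng.standard_normal((10, 4))
mask = np.ones((6, 10))                               # all regions allowed here
aligned = masked_token_attention(tokens, patches, mask, k_s=2)
```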
Confidence-Weighted Contrastive Loss
In multimodal denoising recommendation, traditional contrastive learning provides insufficient discriminative learning for denoised features, making it difficult to distinguish noise residues from valid semantics. Designing a denoising-oriented contrastive learning strategy that enhances the discriminability of modal features is therefore key to capturing user preferences accurately in noisy scenarios. As shown in Figure 2, this paper adopts a confidence-weighted contrastive loss to optimize features and improve their discriminability.
Confidence Evaluation
To quantify the quality of features after cross-modal alignment and distinguish valid semantics from noise interference, we first define the text confidence based on feature similarity and attention weights:
\begin{equation}
c^{t}_{w} = \mathrm{Sigmoid}\!\left(\max_{p} Attn_{wp}\right)
\end{equation}
The maximum cross-modal attention weight $\max_{p} Attn_{wp}$ is normalized by a Sigmoid to measure the alignment quality between text word $w$ and image region $p$. Combining the phrase-image region alignment similarity and the fine-grained attention, the image confidence is defined as:
\begin{equation}
c^{v}_{p} = \mathrm{Softmax}\!\left(\frac{\max_{k} s_{kp}}{\tau}\right)
\end{equation}
By normalizing the maximum phrase-image region similarity $\max_{k} s_{kp}$ with a Softmax, the global importance of image region $p$ is measured. Here, $\tau$ is a temperature parameter used to control the sharpness of the confidence distribution.
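The two confidence definitions can be sketched directly from the formulas above; array shapes and names are illustrative:

```python
import numpy as np

def confidence_scores(attn, sim, tau):
    """Confidence sketch: text confidence is the sigmoid of each word's
    maximum attention weight; image confidence is a softmax over each
    region's maximum phrase similarity, sharpened by temperature tau."""
    c_text = 1.0 / (1.0 + np.exp(-attn.max(axis=1)))   # one score per word
    z = sim.max(axis=0) / tau                          # one score per region
    e = np.exp(z - z.max())
    c_img = e / e.sum()
    return c_text, c_img

rng = np.random.default_rng(3)
attn = rng.random((6, 10))      # word-region attention weights
sim = rng.random((2, 10))       # phrase-region similarities
c_text, c_img = confidence_scores(attn, sim, tau=0.5)
```

A smaller $\tau$ concentrates the image-confidence mass on the best-matched regions; a larger one flattens it.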
Generation of Positive and Negative Sample Pairs
High-quality positive samples and diversified negative samples are selected based on confidence to enhance the effect of contrastive learning. First, high-confidence cross-modal feature pairs are screened to generate positive sample pairs, requiring both the text and image confidences to exceed the threshold:
\begin{equation}
\mathcal{P} = \left\{ (w, p) \;\middle|\; c^{t}_{w} > \theta^{+} \ \text{and} \ c^{v}_{p} > \theta^{+} \right\}
\end{equation}
where $\theta^{+}$ is a trainable parameter representing the positive-sample screening threshold.
Negative sample pairs include hard sample pairs and random negative sample pairs. Hard sample pairs are cross-modal feature pairs with low confidence, requiring that at least one of the text confidence and image confidence falls below the threshold:
\begin{equation}
\mathcal{N}_{h} = \left\{ (w, p) \;\middle|\; c^{t}_{w} < \theta^{-} \ \text{or} \ c^{v}_{p} < \theta^{-} \right\}
\end{equation}
where $\theta^{-}$ represents the negative-sample screening threshold, filtering out low-confidence noise samples. In addition, $N_r$ random negative samples are randomly selected from the non-corresponding regions.
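The confidence-based pair selection can be sketched as follows; thresholds and the exhaustive enumeration are illustrative simplifications:

```python
import numpy as np

def select_pairs(c_text, c_img, theta_pos, theta_neg, n_rand, rng):
    """Pair-selection sketch: positives need both confidences above
    theta_pos, hard negatives have at least one below theta_neg, and
    n_rand extra pairs are drawn uniformly at random."""
    W, P = len(c_text), len(c_img)
    pos = [(w, p) for w in range(W) for p in range(P)
           if c_text[w] > theta_pos and c_img[p] > theta_pos]
    hard = [(w, p) for w in range(W) for p in range(P)
            if c_text[w] < theta_neg or c_img[p] < theta_neg]
    rand = [(int(rng.integers(W)), int(rng.integers(P))) for _ in range(n_rand)]
    return pos, hard, rand

c_text = np.array([0.9, 0.2])
c_img = np.array([0.8, 0.1, 0.7])
pos, hard, rand = select_pairs(c_text, c_img, 0.6, 0.3, 2, np.random.default_rng(4))
```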
Confidence-Weighted Contrastive Loss Function
The influence of high-confidence samples is enhanced and noise interference is suppressed through a weighted contrastive loss. We adopt an improved InfoNCE loss in which the similarities of positive and negative sample pairs are weighted by confidence, so that high-confidence samples contribute more to the loss and low-confidence samples contribute less:
\begin{equation}
\mathcal{L}_{c} = -\frac{1}{|\mathcal{P}|} \sum_{(w,p) \in \mathcal{P}} \log
\frac{ c_{wp} \exp\!\left( s^{+}_{wp} / \tau \right) }
{ c_{wp} \exp\!\left( s^{+}_{wp} / \tau \right) + \sum_{h \in \mathcal{N}_h} \exp\!\left( s^{h} / \tau \right) + \sum_{r \in \mathcal{N}_r} \exp\!\left( s^{r} / \tau \right) }
\end{equation}
where $\tau$ is a temperature parameter, $|\mathcal{P}|$ is the total number of positive sample pairs, $s^{+}_{wp}$ represents the similarity of a positive sample pair (weighted by its confidence $c_{wp}$), $s^{h}$ represents the similarity of a hard sample pair, and $s^{r}$ represents the similarity of a random negative sample pair. The gradient of the contrastive loss is backpropagated to the hierarchical alignment module to optimize its alignment parameters, and the statistics in the confidence calculation are updated:
\begin{equation}
\theta^{+} \leftarrow \theta^{+} + \eta_1 \left( \bar{c} - \theta^{+} \right), \qquad
\theta^{-} \leftarrow \theta^{-} - \eta_2 \left( \rho - \theta^{-} \right)
\end{equation}
where $\eta_1$ and $\eta_2$ are learning rates used to balance update speed and stability; $\bar{c}$ is the average confidence (if it is higher than the current threshold, the threshold is increased to screen positive samples more strictly; otherwise the threshold is decreased to retain more samples); and $\rho$ is the noise ratio (if the actual noise ratio is higher than the current threshold, the threshold is decreased to filter more noise; otherwise the threshold is relaxed).
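One plausible instantiation of the confidence-weighted InfoNCE described above is sketched below; the exact weighting scheme is an assumption for illustration, with positive similarities scaled by their confidences:

```python
import numpy as np

def weighted_infonce(s_pos, c_pos, s_neg, tau):
    """Confidence-weighted InfoNCE sketch: each positive similarity is
    scaled by its confidence so high-confidence pairs dominate the loss,
    while all negatives share the denominator."""
    s_neg = np.asarray(s_neg)
    loss = 0.0
    for s, c in zip(s_pos, c_pos):
        num = c * np.exp(s / tau)
        den = num + np.exp(s_neg / tau).sum()
        loss -= np.log(num / den)
    return loss / len(s_pos)

low = weighted_infonce([0.3], [0.9], [0.2, 0.1], tau=0.5)
high = weighted_infonce([0.9], [0.9], [0.2, 0.1], tau=0.5)
```

As expected, a more similar positive pair yields a smaller loss, and down-weighting a low-confidence positive shrinks its gradient contribution.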
Modal Feature Fusion and Prediction
The aligned text features and image features are fused: weights are assigned to the two modalities, and the fused features are obtained by weighted summation:
\begin{equation}
H_f = \alpha \hat{H}_t + (1 - \alpha) \hat{H}_v
\end{equation}
where $\alpha$ represents the weight, obtained through model training.
Finally, the predicted rating of the user for the item is computed as the inner product of the fused item feature and the user embedding:
\begin{equation}
\hat{r}_{ui} = e_u^{\top} h^{f}_{i}
\end{equation}
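The fusion-and-prediction step is simple enough to sketch in full; names are illustrative:

```python
import numpy as np

def predict_rating(user_emb, text_feat, img_feat, alpha):
    """Fusion-and-prediction sketch: weighted sum of the aligned modal
    features, then an inner product with the user embedding."""
    fused = alpha * text_feat + (1.0 - alpha) * img_feat
    return float(user_emb @ fused)

u = np.array([1.0, 0.0, 2.0])
t = np.array([0.5, 0.5, 0.5])
v = np.array([1.0, 1.0, 1.0])
r = predict_rating(u, t, v, alpha=1.0)
```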
Experiments
Dataset Description
Experiments were conducted on three commonly used Amazon datasets: Baby, Sports, and Clothing. These datasets belong to different categories, and significant differences may exist in product forms and user preferences, which helps test the robustness and generalization ability of the model under various scenarios. Because these datasets contain both visual and textual features, they are well suited to demonstrating the model's multimodal capability. In the experiments, each dataset was split into training, validation, and test sets at a ratio of 8:1:1. Detailed statistics of the three datasets are presented in Table 1.
\begin{table}[h]
\centering
\caption{The Statistics of Datasets}
\label{tab:dataset_stats}
\begin{tabular}{|c|c|c|c|c|}
\hline
Dataset & Users & Items & Interactions & Sparsity \\
\hline
Baby & 19445 & 7050 & 139110 & 99.88\% \\
\hline
Sports & 35598 & 18357 & 256308 & 99.95\% \\
\hline
Clothing & 39387 & 23033 & 237488 & 99.97\% \\
\hline
\end{tabular}
\end{table}
Baby: Contains users' reviews on various baby products and images of these products.
Sports: Involves users' reviews on various sports goods and images of these sports products.
Clothing: Includes users' reviews on clothing items and images of the corresponding products.
Evaluation Metrics
We adopted three widely used evaluation metrics: Recall@K, Precision@K and Normalized Discounted Cumulative Gain (NDCG@K). K is set to 20. Recall@K represents the percentage of a user's rated items that appear in the top K recommended items. Precision@K measures how many of the top K recommended items are actually items that users are interested in, and it is used to demonstrate the accuracy of recommendations. NDCG@K is an evaluation metric for ranking results and is used to assess the accuracy of ranking.
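For reference, Recall@K and NDCG@K for a single user follow the standard definitions below; this helper is illustrative, not the paper's evaluation code:

```python
import numpy as np

def recall_ndcg_at_k(ranked, relevant, k):
    """Recall@K and NDCG@K for one user: hit rate over the relevant set,
    and position-discounted gain normalized by the ideal ranking."""
    topk = ranked[:k]
    hits = [1.0 if item in relevant else 0.0 for item in topk]
    recall = sum(hits) / len(relevant)
    dcg = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return recall, dcg / idcg

# a perfect ranking places both relevant items first
rec, ndcg = recall_ndcg_at_k([3, 7, 1, 5], {3, 7}, k=2)
```

Dataset-level scores are the averages of these per-user values.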
Baseline Methods
To evaluate the performance of MDR-CHCA, we conducted a comparative analysis with eight advanced recommendation models.
VBPR: Combines ID embeddings and visual features to predict interactions.
MMGCN: Constructs a modality-specific graph for each modality and refines modality-specific representations for users and items. Subsequently, it aggregates all these representations to predict interactions.
LATTICE \cite{bib25}: Learns the latent semantic item-item structure from multimodal features for recommendation tasks.
GRCN \cite{bib26}: Focuses on refining the user-item interaction graph by identifying false positive feedback and then pruning the corresponding noisy edges. This process aims to improve the quality of the graph and enhance the reliability of information used for recommendations.
SLMRec: A multimedia recommendation framework based on self-supervised learning, which explores multimodal patterns through data augmentation techniques such as feature dropping, feature masking, and feature granularity and spatiality, effectively enhancing model robustness.
BM3: A self-supervised multimodal recommendation model based on bootstrap latent representation, which generates contrastive views through latent embedding dropout, without requiring negative samples or auxiliary graph structures.
MICRO: Adopts a multimodal contrastive framework to model modality-specific representation spaces and modality-shared representation spaces based on a multimodal latent item graph.
CDGCL: Captures cross-modal and intra-modal collaborative relationships through a modality-aware contrastive learning module, and filters irrelevant interactions with the help of a multi-strategy denoising module.
Experimental Setup
The operating system, GPU, and experimental environment used for the model and algorithms in this study are as follows: the operating system is Windows 10, the GPU is an Nvidia RTX 3090, and the experimental environment is built on Python 3.8 and PyTorch 1.12.0. After repeated experiments, the hyperparameters are set as follows: Top-$k_r$ (the number of retained image regions per phrase) is set to 10; Top-$k_s$ (the number of retained fine-grained connections per word) is set to 5; the temperature hyperparameter $\tau$ is set to 0.5; the Xavier method is adopted for parameter initialization; the optimizer is Adam with a learning rate of 0.001; the embedding size is set to 64; and the batch size for training is set to 2048.
\begin{table}[htbp]
\centering
\setlength{\tabcolsep}{2pt}
\caption{The performance of different methods}
\label{tab:performance}
\begin{tabular}{l *{9}{c}}
\hline
\multirow{2}{*}{Model} & \multicolumn{3}{c}{Baby} & \multicolumn{3}{c}{Sports} & \multicolumn{3}{c}{Clothing} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7} \cmidrule(lr){8-10}
& R@20 & NDCG@20 & P@20 & R@20 & NDCG@20 & P@20 & R@20 & NDCG@20 & P@20 \\
\hline
VBPR & 0.0592 & 0.0234 & 0.0032 & 0.0719 & 0.0325 & 0.0039 & 0.0433 & 0.0185 & 0.0022 \\
MMGCN & 0.0622 & 0.0270 & 0.0033 & 0.0634 & 0.0270 & 0.0035 & 0.0426 & 0.0184 & 0.0021 \\
LATTICE & 0.0837 & 0.0367 & 0.0046 & 0.0928 & 0.0419 & 0.0050 & 0.0710 & 0.0316 & 0.0036 \\
GRCN & 0.0799 & 0.0348 & 0.0044 & 0.0856 & 0.0387 & 0.0047 & 0.0650 & 0.0283 & 0.0034 \\
SLMRec & 0.0878 & 0.0393 & 0.0047 & 0.1042 & 0.0468 & 0.0057 & 0.0728 & 0.0325 & 0.0038 \\
BM3 & 0.0864 & 0.0378 & 0.0047 & 0.0980 & 0.0442 & 0.0054 & 0.0662 & 0.0304 & 0.0035 \\
MICRO & 0.0882 & 0.0398 & 0.0047 & 0.0997 & 0.0462 & 0.0052 & 0.0803 & 0.0361 & 0.0041 \\
CDGCL & 0.0946 & 0.0409 & 0.0050 & 0.1061 & 0.0483 & 0.0053 & 0.0853 & 0.0392 & 0.0044 \\
MDR-CHCA & 0.0970 & 0.0422 & 0.0053 & 0.1093 & 0.0491 & 0.0055 & 0.0908 & 0.0413 & 0.0047 \\
Improve\% & 2.5\% & 3.2\% & 6.0\% & 3.0\% & 1.6\% & 3.8\% & 6.4\% & 5.4\% & 6.4\% \\
\hline
\end{tabular}
\end{table}
Performance Analysis
The experimental results comparing our method with advanced baseline models are presented in Table 2, from which the following conclusions can be drawn:
Compared to the baseline models, our proposed method achieves superior performance across all metrics on the three datasets. Specifically, relative to the current strongest baseline, CDGCL, MDR-CHCA achieves average gains of 3.9%, 2.8%, and 6.1% on the Baby, Sports, and Clothing datasets, respectively. This demonstrates that MDR-CHCA effectively captures user preferences to deliver more accurate recommendation results. In-depth analysis of the baselines reveals that traditional multimodal recommendation models such as VBPR and MMGCN exhibit limited performance due to the absence of effective denoising mechanisms. SLMRec and BM3 enhance model robustness to some extent through self-supervised learning, but they primarily focus on item-level representation learning and fail to effectively address fine-grained cross-modal semantic alignment. MICRO achieves significant performance gains by constructing a multimodal latent item graph and conducting contrastive learning across dual spaces, establishing itself as a strong baseline; however, it still exhibits limitations in fine-grained semantic matching. CDGCL, an advanced model specifically designed for multimodal denoising, employs modality-aware contrastive learning and multi-strategy denoising modules to achieve outstanding performance across the three datasets, positioning it as the current state-of-the-art baseline. However, CDGCL primarily denoises at the graph-structure level and fails to fully exploit the hierarchical semantic information within each modality. In contrast, MDR-CHCA emphasizes cross-modal semantic alignment and intra-modal feature optimization; its hierarchical approach enables fine-grained matching of data across modalities. Compared to CDGCL, MDR-CHCA maintains robust denoising capabilities while achieving consistent improvements across all metrics through its refined semantic alignment mechanism.
This enhancement is particularly pronounced on the semantically complex Clothing dataset. Furthermore, the losses of traditional models (e.g., VBPR's Bayesian personalized ranking loss, MMGCN's standard loss) weight positive and negative samples uniformly, struggling to distinguish high-quality samples from noise interference. MDR-CHCA, in contrast, assigns greater weight to high-confidence positive samples, enabling the model to focus on the user preference information embedded in high-quality samples. Simultaneously, it performs targeted optimization on negative samples, filtering out high-noise negatives through confidence assessment. This addresses the failure of baseline models to emphasize high-quality samples under uniform sampling. These innovations drive MDR-CHCA's progress in multimodal recommendation.
Ablation Study
This paper conducts an in-depth study of the model's key components by comparing it with two variants, examining the impact of each main component on the final performance. The two variants are defined as follows:
MDR-CHCA-A: Removes the hierarchical cross-modal alignment module and adopts full-attention cross-modal alignment instead;
MDR-CHCA-B: Removes the confidence evaluation mechanism and uses standard contrastive loss instead.
From the results in Figure 3, the following conclusions can be drawn. Both modules affect model performance. Removing the hierarchical alignment module causes a significant performance drop, indicating that hierarchical alignment substantially improves the efficiency of semantic matching and suppresses noise interference. The results of MDR-CHCA-B show that screening high-quality samples through dynamic thresholds effectively enhances feature discriminability in noisy scenarios. Overall, all three metrics of MDR-CHCA-A are significantly lower than those of MDR-CHCA-B, demonstrating that the hierarchical cross-modal alignment module has the larger impact on model performance.
Parameter Analysis
In this study, we focus on investigating the impact of hyperparameters on the performance of the recommendation model, including $k_r$ (the number of retained regions in phrase-image region alignment) and the temperature parameter $\tau$.
To comprehensively evaluate the effects of $k_r$ and $\tau$, we adjusted the values of these two hyperparameters while keeping all other settings unchanged. By plotting the Recall@20 and NDCG@20 scores on the datasets, we can clearly observe their specific impact on model performance.
Figure 4 presents the results of adjusting the number of retained regions $k$ on the three datasets, with $k$ varied over [5, 10, 15, 20]. We found that with the smallest setting, $k = 5$, the model performed poorly, which may be attributed to insufficient semantic coverage; performance peaked at an intermediate value of $k$; and as $k$ increased further, performance declined again. This is because an excessively large $k$ causes the model to retain more low-quality regions, which harms cross-modal alignment.
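The retention step itself is a simple top-$k$ selection over phrase-region relevance scores. The helper below is a sketch under assumed inputs (a list of `(region_id, relevance)` pairs, e.g. phrase-region attention scores); the name and signature are illustrative.

```python
def retain_top_k_regions(region_scores, k):
    """Keep the k image regions most relevant to a phrase (sketch).

    region_scores: list of (region_id, relevance) pairs.
    Too small a k loses semantic coverage; too large a k admits
    low-quality regions, matching the trade-off observed in Figure 4.
    """
    ranked = sorted(region_scores, key=lambda pair: pair[1], reverse=True)
    return [region_id for region_id, _ in ranked[:k]]
```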
Furthermore, we conducted a detailed investigation of the temperature parameter $\tau$ over the values [0.1, 0.3, 0.5, 0.7]. As can be clearly observed from Figure 5, the model achieved its best performance at $\tau = 0.5$; deviating from this value degraded performance, confirming that the temperature parameter $\tau$ requires precise tuning to balance the "discriminability" and "generalization" of the confidence distribution.
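The trade-off governed by the temperature can be seen directly in a softmax: a small $\tau$ sharpens the distribution (more discriminative, but brittle), while a large $\tau$ flattens it (more generalized, but less selective). A minimal demonstration, independent of the model itself:

```python
import math

def softmax_with_temperature(scores, tau):
    """Softmax over raw scores, scaled by temperature tau (illustration)."""
    exps = [math.exp(s / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# A low temperature concentrates probability mass on the top score;
# a high temperature spreads it out across all scores.
sharp = softmax_with_temperature([1.0, 0.0], tau=0.1)
flat = softmax_with_temperature([1.0, 0.0], tau=0.7)
```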
References
Zhou, Xin and Zhou, Hongyu and Liu, Yong and Zeng, Zhiwei and Miao, Chunyan and Wang, Pengwei and You, Yuan and Jiang, Feijun. (2023) Bootstrap latent representations for multi-modal recommendation. Proceedings of the ACM Web Conference: 845--854
Lei, Fei and Cao, Zhongqi and Yang, Yuning and Ding, Yibo and Zhang, Cong. (2023) Learning the user's deeper preferences for multi-modal recommendation systems. ACM Transactions on Multimedia Computing, Communications and Applications, 19(3s): 1--18
Zhang, Jinghao and Liu, Qiang and Wu, Shu and Wang, Liang. (2023) Mining stable preferences: Adaptive modality decorrelation for multimedia recommendation. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval: 443--452
Ko, Hyeyoung and Lee, Suyeon and Park, Yoonseo and Choi, Anna. (2022) A survey of recommendation systems: recommendation models, techniques, and application fields. Electronics, 11(1): 141
Zhou, Yan and Guo, Jie and Sun, Hao and Song, Bin and Yu, Fei Richard. (2023) Attention-guided multi-step fusion: A hierarchical fusion network for multimodal recommendation. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval: 1816--1820
Oord, Aaron Van Den and Li, Yazhe and Vinyals, Oriol. (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya. (2021) Learning transferable visual models from natural language supervision. International Conference on Machine Learning: 8748--8763
He, Ruining and McAuley, Julian. (2016) VBPR: Visual Bayesian personalized ranking from implicit feedback. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 30(1)
Kim, Donghyun and Park, Chanyoung and Oh, Jinoh and Lee, Sungyoung and Yu, Hwanjo. (2016) Convolutional matrix factorization for document context-aware recommendation. Proceedings of the 10th ACM Conference on Recommender Systems: 233--240
Wei, Yinwei and Wang, Xiang and Nie, Liqiang and He, Xiangnan and Hong, Richang and Chua, Tat-Seng. (2019) MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. Proceedings of the 27th ACM International Conference on Multimedia: 1437--1445
Liu, Kang and Xue, Feng and Guo, Dan and Wu, Le and Li, Shujie and Hong, Richang. (2023) MEGCF: Multimodal entity graph collaborative filtering for personalized recommendation. ACM Transactions on Information Systems, 41(2): 1--27
Liu, Fan and Chen, Huilin and Cheng, Zhiyong and Liu, Anan and Nie, Liqiang and Kankanhalli, Mohan. (2022) Disentangled multimodal representation learning for recommendation. IEEE Transactions on Multimedia, 25: 7149--7159
Chen, Huiyuan and Lin, Yusan and Pan, Menghai and Wang, Lan and Yeh, Chin-Chia Michael and Li, Xiaoting and Zheng, Yan and Wang, Fei and Yang, Hao. (2022) Denoising self-attentive sequential recommendation. Proceedings of the 16th ACM Conference on Recommender Systems: 92--101
Zhang, Xiaokun and Lin, Hongfei and Xu, Bo and Li, Chenliang and Lin, Yuan and Liu, Haifeng and Ma, Fenglong. (2022) Dynamic intent-aware iterative denoising network for session-based recommendation. Information Processing & Management, 59(3): 102936
Feng, Lixia and Cai, Yongqi and Wei, Erling and Li, Jianwu. (2022) Graph neural networks with global noise filtering for session-based recommendation. Neurocomputing, 472: 113--123
Luo, Zhen and Sheng, Zhenzhen and Zhang, Tao. (2024) Dual perspective denoising model for session-based recommendation. Expert Systems with Applications, 249: 123845
Tao, Zhulin and Liu, Xiaohao and Xia, Yewei and Wang, Xiang and Yang, Lifang and Huang, Xianglin and Chua, Tat-Seng. (2022) Self-supervised learning for multimedia recommendation. IEEE Transactions on Multimedia, 25: 5107--5116
Zhou, Xin and Zhou, Hongyu and Liu, Yong and Zeng, Zhiwei and Miao, Chunyan and Wang, Pengwei and You, Yuan and Jiang, Feijun. (2023) Bootstrap latent representations for multi-modal recommendation. Proceedings of the ACM Web Conference: 845--854
Zhang, Jinghao and Zhu, Yanqiao and Liu, Qiang and Zhang, Mengqi and Wu, Shu and Wang, Liang. (2022) Latent structure mining with contrastive modality fusion for multimedia recommendation. IEEE Transactions on Knowledge and Data Engineering, 35(9): 9154--9167
Xv, Guipeng and Li, Xinyu and Xie, Ruobing and Lin, Chen and Liu, Chong and Xia, Feng and Kang, Zhanhui and Lin, Leyu. (2024) Improving multi-modal recommender systems by denoising and aligning multi-modal content and user feedback. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining: 3645--3656
Xu, Fuyong and Zhu, Zhenfang and Fu, Yixin and Wang, Ru and Liu, Peiyu. (2024) Collaborative denoised graph contrastive learning for multi-modal recommendation. Information Sciences, 679: 121017
Yuan, Xu and Qi, Ange and Wu, Huinan and Wang, Jiaqiang and Guo, Yi and Li, Shijin and Zhao, Liang. (2025) Cross-modal feature alignment and fusion with contrastive learning in multimodal recommendation. Knowledge-Based Systems: 114020
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia. (2017) Attention is all you need. Advances in Neural Information Processing Systems, 30
Liu, Han and Wei, Yinwei and Song, Xuemeng and Guan, Weili and Li, Yuanfang and Nie, Liqiang. (2024) MMGRec: Multimodal generative recommendation with Transformer model. arXiv preprint arXiv:2404.16555
Zhang, Jinghao and Zhu, Yanqiao and Liu, Qiang and Wu, Shu and Wang, Shuhui and Wang, Liang. (2021) Mining latent structures for multimedia recommendation. Proceedings of the 29th ACM International Conference on Multimedia: 3872--3880
Wei, Yinwei and Wang, Xiang and Nie, Liqiang and He, Xiangnan and Chua, Tat-Seng. (2020) Graph-refined convolutional network for multimedia recommendation with implicit feedback. Proceedings of the 28th ACM International Conference on Multimedia: 3541--3549