A Large-Scale Text Semantic Matching Method Based on an Improved BERT Pre-Training Model
Author: Zhou Xiaofei
Affiliation:

School of Electronic Information, Lishui Vocational & Technical College, Lishui 323000, Zhejiang, China

Fund Project:

Zhejiang Province Higher Vocational Education "14th Five-Year Plan" Teaching Reform Project (jg20240348).

    Abstract:

    Large-scale text data is voluminous, and the same word may carry completely different meanings in different contexts. Methods that rely solely on fixed rules or models adapt poorly to such shifts in meaning, which leads to information loss and incomplete semantics. Deep semantic information and contextual relationships then go uncaptured, and the accuracy of semantic matching suffers. To address this issue, this paper proposes a large-scale text semantic matching method based on an improved bidirectional encoder representations from transformers (BERT) pre-training model. The improved BERT model enhances the contextual features of text through positional encoding of the word vectors, thereby capturing contextual information effectively. In addition, an attention mechanism dynamically computes the feature fusion weights, and a weighted fusion method generates the fused semantic features of the text. Semantic consistency between the texts to be matched is then evaluated in four steps: text feature extraction, multi-dimensional knowledge encoding, fused semantic label generation, and semantic matching relationship prediction. The consistency threshold is set to 0.8: when the predicted value exceeds 0.8, the texts to be matched are considered to have high semantic consistency, which enables accurate text semantic matching. Experimental results show that the mean reciprocal rank (MRR) obtained on large-scale text sample data is higher than 0.7 and that the matching results are more accurate than those of the comparison methods.
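The abstract names three quantitative ingredients without giving formulas: attention-derived fusion weights, a 0.8 consistency threshold, and MRR as the evaluation metric. The sketch below illustrates one plausible reading of each and is not the paper's implementation. It assumes the fusion weights are a softmax over per-feature attention scores. All function names (`softmax`, `fuse_features`, `is_match`, `mean_reciprocal_rank`) are hypothetical. The MRR function follows the standard definition, with queries that retrieve no correct candidate contributing 0.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, scores):
    """Weighted sum of equal-length feature vectors.

    Assumption: the fusion weights are softmax(scores); the abstract only
    says the weights are computed dynamically by an attention mechanism.
    """
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def is_match(predicted_consistency, threshold=0.8):
    """A pair counts as matched only when the prediction exceeds the threshold."""
    return predicted_consistency > threshold


def mean_reciprocal_rank(first_correct_ranks):
    """Standard MRR over queries.

    Each entry is the 1-based rank of the first correct candidate for one
    query, or None when no correct candidate was retrieved (contributes 0).
    """
    if not first_correct_ranks:
        return 0.0
    total = sum(1.0 / r for r in first_correct_ranks if r is not None)
    return total / len(first_correct_ranks)
```

For example, `mean_reciprocal_rank([1, 2, None, 4])` averages 1, 0.5, 0 and 0.25 over four queries. Under this definition, an MRR above 0.7 means the first correct match usually sits at or very near the top of the ranked list.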

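Cite this article

Zhou Xiaofei. A large-scale text semantic matching method based on an improved BERT pre-training model[J]. Journal of Xichang University (Natural Science Edition), 2026, 40(1): 88-97.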

History
  • Received: 2025-10-14
  • Last revised: 2025-10-14
  • Accepted: 2025-12-01
  • Published online: 2026-04-16