Method for Transforming Whisper to Normal Speech with Feature Fusion

doi:10.16356/j.1005-2615.2020.05.014

2020, Vol. 52, No. 5, pp. 777-782

Authors: PANG Cong, LIAN Hailun, ZHOU Jian, WANG Huabin, TAO Liang
Affiliation: Key Laboratory of Computational Intelligence and Signal Processing, Ministry of Education, Anhui University, Hefei 230039, China
Corresponding author: ZHOU Jian, male, associate professor. E-mail: jzhou@ahu.edu.cn
CLC number: TN912.3
Funding: National Natural Science Foundation of China (61301295); Natural Science Foundation of Anhui Province (1708085MF151); Natural Science Foundation of Anhui Higher Education Institutions (KJ2018A0018); Anhui University Research Training Program (J10118520444).


Abstract:

In neural-network-based reconstruction of normal speech from whispered speech, the spectral envelope of the whisper is commonly used to estimate the fundamental frequency (F₀) of the normal speech. Such algorithms predict F₀ with limited accuracy, so the synthesized speech is noticeably lacking in naturalness and sometimes exhibits pitch distortion. This paper proposes an acoustic feature fusion method that predicts the F₀ of normal speech frame by frame with a bidirectional long short-term memory (BLSTM) deep network. First, the STRAIGHT model and related code are used to preprocess the whispered and normal speech corpora, extracting the Mel-scale frequency cepstral coefficients (MFCC), prosodic features and spectral envelope of the whispered speech, and the F₀ and spectral envelope of the normal speech. Second, BLSTM deep networks are used to establish two mappings: one from the spectral envelope of whispered speech to that of normal speech, and one from the MFCC, prosodic and spectral envelope features of whispered speech to the F₀ of normal speech. Finally, the F₀ and spectral envelope of the corresponding normal speech are obtained from the MFCC, prosodic and spectral envelope features of the whispered speech, and the normal speech is synthesized with the STRAIGHT model. Experimental results show that, compared with estimating F₀ from the spectral envelope alone, the fused prosodic and MFCC features are a good complement for F₀ estimation: they resolve the pitch distortion, and the converted speech is closer to normal speech in prosody.
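The frame-by-frame F₀ prediction from fused features described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state feature dimensions, network depth, hidden sizes or training details, so every name and number below (`fuse_features`, `LSTMCell`, `predict_f0`, the 3/1/5 feature widths, the hidden size of 4) is a hypothetical choice for illustration. Fusion is assumed here to be per-frame concatenation, the weights are random and untrained, and STRAIGHT analysis and synthesis are not shown.

```python
import math
import random
from typing import List, Sequence, Tuple

Frame = List[float]


def fuse_features(mfcc: Sequence[Sequence[float]],
                  prosody: Sequence[Sequence[float]],
                  envelope: Sequence[Sequence[float]]) -> List[Frame]:
    """Concatenate the three whisper feature streams frame by frame (assumed fusion)."""
    if not (len(mfcc) == len(prosody) == len(envelope)):
        raise ValueError("all feature streams must have the same number of frames")
    return [list(m) + list(p) + list(e) for m, p, e in zip(mfcc, prosody, envelope)]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A plain LSTM cell with random, untrained weights (illustration only)."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random) -> None:
        self.hidden_dim = hidden_dim
        width = input_dim + hidden_dim + 1  # input, previous hidden state, bias
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_dim)]
            for gate in ("input", "forget", "output", "cell")
        }

    def _gate(self, name: str, z: Frame, activation) -> Frame:
        return [activation(sum(w * v for w, v in zip(row, z)))
                for row in self.weights[name]]

    def step(self, x: Frame, h: Frame, c: Frame) -> Tuple[Frame, Frame]:
        z = list(x) + h + [1.0]
        i = self._gate("input", z, _sigmoid)
        f = self._gate("forget", z, _sigmoid)
        o = self._gate("output", z, _sigmoid)
        g = self._gate("cell", z, math.tanh)
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run_direction(cell: LSTMCell, frames: Sequence[Frame]) -> List[Frame]:
    """Run the cell over the frames in the given order, returning one hidden state per frame."""
    h = [0.0] * cell.hidden_dim
    c = [0.0] * cell.hidden_dim
    states = []
    for x in frames:
        h, c = cell.step(x, h, c)
        states.append(h)
    return states


def predict_f0(frames: Sequence[Frame], forward_cell: LSTMCell,
               backward_cell: LSTMCell, readout: Sequence[float]) -> List[float]:
    """Bidirectional pass: each frame's F0 estimate sees past and future context."""
    forward = run_direction(forward_cell, frames)
    backward = run_direction(backward_cell, list(frames)[::-1])[::-1]
    return [sum(w * v for w, v in zip(readout, hf + hb + [1.0]))
            for hf, hb in zip(forward, backward)]


# Toy usage with random stand-in features (hypothetical dimensions).
rng = random.Random(0)
n_frames, hidden = 4, 4
mfcc = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n_frames)]
prosody = [[rng.gauss(0, 1)] for _ in range(n_frames)]
envelope = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n_frames)]

fused = fuse_features(mfcc, prosody, envelope)
forward_cell = LSTMCell(len(fused[0]), hidden, rng)
backward_cell = LSTMCell(len(fused[0]), hidden, rng)
readout = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden + 1)]
f0_track = predict_f0(fused, forward_cell, backward_cell, readout)
print(len(fused), len(fused[0]), len(f0_track))
```

Each fused frame has 3 + 1 + 5 = 9 values, and the network emits one F₀ value per input frame. Because the backward pass reads the utterance from the end, the estimate for an early frame also depends on later frames, which is the property that distinguishes a BLSTM from a unidirectional LSTM. In a real system the random weights would be replaced by trained parameters, typically through a deep learning framework.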

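Cite this article:

PANG Cong, LIAN Hailun, ZHOU Jian, WANG Huabin, TAO Liang. Method for Transforming Whisper to Normal Speech with Feature Fusion[J]. Journal of Nanjing University of Aeronautics & Astronautics, 2020, 52(5): 777-782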

History

Received: 2019-06-06
Last revised: 2020-01-05
Published online: 2020-10-05
