
Analyzing lip dynamics using sparrow search optimized BiLSTM classifier

Journal: International Journal of Advanced Technology and Engineering Exploration (IJATEE) (Vol.11, No. 119)

Publication Date:

Authors : ; ;

Page : 1430-1448

Keywords : Visual speech synthesis; Automatic speech recognition; Lip dynamics; Sparrow search algorithm; Bidirectional long short-term memory.

Source : Download | Find it from : Google Scholar

Abstract

Applications based on voice-driven automatic speech recognition (ASR) have recently gained popularity. However, voice-based applications fail in noisy backgrounds, with overlapping speech, or when the speech signal is heavily distorted. Speech information can also be recovered from the mouth region and facial expressions. Visual speech synthesis (VSS) is an effective alternative to ASR because it derives the uttered word from lip dynamics. The proposed methodology generates speech directly from lip motion, without text as an intermediate representation. A visual-voice embedding is introduced to store vital acoustic knowledge, enabling the production of audio from different speakers. The proposed sparrow search optimized bidirectional long short-term memory (BiLSTM) model takes lip movements and the associated acoustic information as input, which are utilized during training. The major contributions are: (1) a visual-voice embedding that supplies additional audio information and enhances the visual features, yielding superior speech from lip movements; (2) the sparrow search algorithm (SSA), employed to search the solution space for the best parameters when generating audio samples, with the aim of reducing loss; and (3) an autoregressive model that produces speech from silent video without requiring audio transcription. The effectiveness of the model is evaluated on the GRID corpus. Performance is assessed by comparing the generated speech against ground-truth signals in terms of mean squared error (MSE), root mean square error (RMSE), signal-to-noise ratio (SNR), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). The proposed methodology outperforms existing approaches on the PESQ and STOI measures: the PESQ score shows a significant improvement of 4.06 over the generative adversarial network (GAN), while the STOI score improves by 0.202.
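The waveform-level metrics mentioned in the abstract (MSE, RMSE, SNR) have standard definitions; STOI and PESQ, by contrast, require dedicated perceptual toolkits and are not shown here. A minimal sketch of the waveform-level comparisons, assuming the generated and ground-truth signals are equal-length arrays that are already time-aligned (the function names are illustrative, not from the paper):

```python
import numpy as np

def mse(ref, gen):
    """Mean squared error between reference and generated waveforms."""
    ref, gen = np.asarray(ref, dtype=float), np.asarray(gen, dtype=float)
    return float(np.mean((ref - gen) ** 2))

def rmse(ref, gen):
    """Root mean square error: square root of the MSE."""
    return float(np.sqrt(mse(ref, gen)))

def snr_db(ref, gen):
    """Signal-to-noise ratio in dB, treating (ref - gen) as the noise."""
    ref, gen = np.asarray(ref, dtype=float), np.asarray(gen, dtype=float)
    noise = ref - gen
    return float(10.0 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2)))
```

For example, a generated signal that is everywhere 10% below a unit-amplitude reference gives MSE = 0.01, RMSE = 0.1, and SNR = 20 dB.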

Last modified: 2024-11-07 22:50:30