Analyzing lip dynamics using sparrow search optimized BiLSTM classifier
Journal: International Journal of Advanced Technology and Engineering Exploration (IJATEE) (Vol.11, No. 119)
Publication Date: 2024-10-30
Authors : Shilpa Sonawane; P. Malathi
Page : 1430-1448
Keywords : Visual speech synthesis; Automatic speech recognition; Lip dynamics; Sparrow search algorithm; Bidirectional long short-term memory
Abstract
Applications built on voice-based automatic speech recognition (ASR) have recently gained popularity. However, voice-based applications fail in noisy backgrounds, with overlapping speech, and when the speech signal is completely distorted. In such cases, speech information can be recovered from the mouth region and facial emotions. Visual speech synthesis (VSS) is an effective alternative to ASR, as it provides information about the uttered word from lip dynamics. The proposed methodology aims to generate speech directly from lip motion, without text as an intermediate representation. A visual-voice embedding is introduced to store vital acoustic knowledge, enabling the production of audio from different speakers. The proposed sparrow search optimized bidirectional long short-term memory (BiLSTM) model takes as input lip movements and the related acoustic information, which are utilized during training. Our major contributions are: (1) a visual-voice embedding that provides additional audio information and enhances the visual aspects, thus generating superior speech from lip movements; (2) use of the sparrow search algorithm (SSA) to optimize the search for the best solution in generating audio samples from the search space, with the aim of reducing loss; and (3) an autoregressive model that produces speech from silent video without requiring a transcription of the audio. The effectiveness of the model is evaluated on the GRID corpus. Performance is analyzed by comparing the generated speech with the ground-truth signals in terms of mean squared error (MSE), root mean square error (RMSE), signal-to-noise ratio (SNR), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). The proposed methodology performs better in terms of PESQ and STOI. The PESQ score shows a significant improvement of 4.06 over the generative adversarial network (GAN), while the STOI score improves by 0.202.
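The three simplest metrics named in the abstract (MSE, RMSE, SNR) can be illustrated with a minimal sketch. It uses the textbook definitions and treats the difference between the ground-truth and generated waveforms as noise. The abstract does not give the paper's exact formulas, so the function names, the toy sample values, and the SNR convention below are assumptions for illustration only. STOI and PESQ need dedicated implementations and are omitted.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equal-length sample sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def rmse(reference, estimate):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(reference, estimate))


def snr_db(reference, estimate):
    """SNR in decibels, assuming noise = reference - estimate."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_power == 0:
        return math.inf   # identical signals: no noise at all
    if signal_power == 0:
        return -math.inf  # silent reference: ratio is zero
    return 10.0 * math.log10(signal_power / noise_power)


# Toy waveforms (hypothetical values, not data from the paper).
ground_truth = [0.0, 1.0, 0.0, -1.0]
generated = [0.0, 0.5, 0.0, -0.5]

print(mse(ground_truth, generated))
print(rmse(ground_truth, generated))
print(snr_db(ground_truth, generated))
```

Lower MSE and RMSE and higher SNR indicate that the generated speech is closer to the ground truth at the sample level. These numbers do not capture intelligibility or perceived quality, which is why STOI and PESQ are reported alongside them.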