OPTIMIZING SPEECH-TO-TEXT CONVERSION: DEVELOPING AN EFFICIENT MATLAB-BASED SPEECH RECOGNITION SYSTEM

Journal: International Journal of Mechanical Engineering and Technology (IJMET) (Vol. 9, No. 2)

Publication Date:

Authors : ;

Page : 954-963

Keywords : speech-to-image translation; voice signals; visuals; teacher-student learning; generative adversarial models; embedding feature; adversarial generative network;


Abstract

Speech-to-image translation without an intermediate text representation is an interesting and valuable problem, with potential applications in human-computer interaction, artistic creation, computer-aided design, and related areas. It is also relevant because many languages have no written form. To the best of our knowledge, however, how to convert voice signals directly into images, and how well such a conversion can work, has not yet been thoroughly studied. In this work, we draw on teacher-student learning and generative adversarial models to convert voice signals into image signals without a transcription stage. To improve generalization to new classes, a speech encoder is designed to represent the input speech signal as an embedding feature. The embedding produced by the speech encoder is then used to train a stacked generative adversarial network that synthesizes high-quality images. The overall process comprises input, pre-processing, feature extraction, and classification stages. The final output shows the identified voice signal together with its related item. Experimental findings on dataset signals indicate that the proposed technique effectively converts raw voice signals into images without the intermediate text representation. An ablation study offers further insight into the approach. The basic goal of the technique is to identify speech from audio and then transform the identified speech into a text picture; a further objective is to increase the precision of the process.
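The four stages named in the abstract (input, pre-processing, feature extraction, classification) can be illustrated with a minimal sketch. This is not the authors' system: the abstract describes a speech encoder and a stacked generative adversarial network implemented in MATLAB, whereas the sketch below is written in Python and swaps in much simpler stand-ins. The synthetic sine tones used as "voice signals", the zero-crossing-rate and log-energy features, the nearest-centroid classifier, and every name (`tone`, `preprocess`, `extract_features`, `classify`, `recognize`, `low_tone`, `high_tone`) are assumptions made for illustration only.

```python
import math

def tone(freq, sr=8000, duration=0.25, phase=0.0):
    """Input stage (assumed): a synthetic sine tone standing in for a voice signal."""
    n = int(sr * duration)
    return [math.sin(2 * math.pi * freq * t / sr + phase) for t in range(n)]

def preprocess(signal, alpha=0.97):
    """Pre-processing: peak-normalize, then apply a first-order pre-emphasis filter."""
    peak = max(abs(s) for s in signal) or 1.0
    x = [s / peak for s in signal]
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def extract_features(signal, frame_len=200, hop=100):
    """Feature extraction: mean zero-crossing rate and mean log energy over frames."""
    zcrs, energies = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        zcrs.append(crossings / (frame_len - 1))
        energies.append(math.log(sum(v * v for v in frame) / frame_len + 1e-12))
    return (sum(zcrs) / len(zcrs), sum(energies) / len(energies))

def classify(features, centroids):
    """Classification: return the label whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))

# "Training": one reference feature vector per hypothetical class.
centroids = {
    "low_tone": extract_features(preprocess(tone(200))),
    "high_tone": extract_features(preprocess(tone(1500))),
}

def recognize(signal):
    """Run the full pipeline on one input signal and return the predicted label."""
    return classify(extract_features(preprocess(signal)), centroids)

if __name__ == "__main__":
    print(recognize(tone(250, phase=0.3)))
    print(recognize(tone(1400, phase=1.0)))
```

A real recognizer would replace the hand-picked features with learned or spectral features (for example, the embedding produced by the speech encoder described above) and the centroid lookup with a trained model; the sketch only shows how the stages hand data to one another.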

Last modified: 2023-06-09 16:45:14