Evaluation of TIMIT Corpus with Hybrid VAD Methods

Abstract
This paper presents a speech recognition model that achieves 99.6% accuracy on the Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus and 92.50% on the LibriSpeech dataset, establishing a new benchmark. The model combines convolutional neural networks, transformers, and bidirectional Long Short-Term Memory (LSTM) layers for efficient speech processing. Its distinguishing feature is a feature extraction algorithm based on Mel-frequency cepstral coefficients (MFCCs) and their delta coefficients, computed with a 25 ms frame length, a 10 ms step, and a 40 ms window with 30 ms overlap. The system is highly resistant to acoustic interference and maintains strong performance in noise: it achieves 96.0% accuracy at -5 dB SNR, 22.3 percentage points above the 73.7% baseline, with similar margins at 0 dB (97.8% vs. 86.1%), 5 dB (98.6% vs. 91.5%), and 10 dB (99.5% vs. 92.1%). Data augmentation methods, including time stretching (0.8-1.2), pitch shifting (±3 steps), and room reverberation, improve generalization. A key observation is that the method discards old frame parameters, i.e., it removes features extracted from earlier audio frames so that the voice activity detection (VAD) decision relies only on the most recent speech information; this yields marked improvements, indicating that the gains stem from architectural changes. The model is also robust in non-speech hit rate at low SNRs, reaching 92.0% at -5 dB compared with the baseline's 61.2%. This work substantially advances noise-robust speech recognition for difficult acoustic environments where traditional systems deteriorate.
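The framing parameters stated in the abstract (25 ms frames advanced in 10 ms steps) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the 16 kHz sample rate is an assumption (it matches the TIMIT corpus), and the function name is hypothetical.

```python
def frame_signal(signal, sample_rate=16000, frame_ms=25, step_ms=10):
    """Slice a 1-D list of samples into overlapping frames.

    Frame length and step follow the abstract's parameters:
    25 ms frames with a 10 ms step. The 16 kHz sample rate is
    an assumption (TIMIT audio is sampled at 16 kHz).
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    step = int(sample_rate * step_ms / 1000)        # 160 samples at 16 kHz
    frames = []
    start = 0
    # Keep only full-length frames; trailing samples are dropped.
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += step
    return frames

# One second of audio at 16 kHz yields 98 full 25 ms frames.
frames = frame_signal([0.0] * 16000)
```

Each frame would then be windowed and passed to the MFCC computation; the delta coefficients mentioned in the abstract are first-order differences of the MFCCs across successive frames.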
Keywords: Acoustic-Phonetic Models, Deep Learning in Speech Processing, Signal-to-Noise Ratio (SNR), TIMIT Corpus, Voice Activity Detection (VAD).

Author(s): Parshotam, Shilpa Sharma*
Volume: 7 Issue: 1 Pages: 1701-1720
DOI: https://doi.org/10.47857/irjms.2026.v07i01.07822