On Architectural Issues of Neural Networks in Speech Recognition

Recently, artificial neural networks (ANNs) have dramatically improved the performance of speech recognition systems. Neural networks in speech recognition have been the subject of extensive research for more than 25 years. Despite this huge effort, a number of issues concerning the architecture of ANN-based systems for speech recognition remain open. Examples of such issues are: 1) Unlike the hybrid approach, which replaces the emission probability function by an ANN, a direct approach is possible that models the posterior probability of the state sequence of (phonetic) labels directly, without using the generative concepts of classical hidden Markov models (HMMs). 2) In the CTC approach (connectionist temporal classification), the HMM is simplified by using only a single label per phoneme (or per character in handwriting recognition). The CTC training criterion sums the posterior probabilities of all frame-level label sequences that are consistent with the reference transcription. 3) Recently, so-called attention-based approaches have been proposed that replace the conventional HMM formalism by a recurrent neural network. In all three cases, we face the question of how these ANN-based approaches compare with the conventional discriminative framework of hybrid HMMs. We will discuss the advantages and disadvantages of each approach in more detail.
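To make the CTC criterion of point 2 concrete, the following is a minimal sketch of the usual forward recursion, checked against a brute-force enumeration of all frame-level paths. It assumes the standard CTC formulation with an additional blank symbol (index 0 here), which the abstract does not spell out; the function names `ctc_prob`, `ctc_prob_brute`, and `collapse` are hypothetical and chosen for illustration only.

```python
from itertools import product


def ctc_prob(probs, target, blank=0):
    """Sum of path probabilities over all frame-level paths that collapse
    to `target`. probs[t][k] is the posterior of label k at frame t.
    Assumes the standard CTC topology with a blank symbol (an assumption,
    not stated in the text above)."""
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    T, S = len(probs), len(ext)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]                      # stay in the same position
            if s >= 1:
                a += alpha[s - 1]             # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new

    # A valid path ends in the last label or in the final blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def collapse(path, blank=0):
    """Merge repeated labels, then remove blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out


def ctc_prob_brute(probs, target, blank=0):
    """Reference implementation: enumerate every frame-level path."""
    num_labels = len(probs[0])
    total = 0.0
    for path in product(range(num_labels), repeat=len(probs)):
        if collapse(path, blank) == list(target):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total
```

The recursion needs on the order of T times S operations for T frames and an extended sequence of length S, whereas the enumeration grows exponentially in T; the brute-force version is included only to show what the sum ranges over.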