Dahl, G. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82-97. (WARREN BUZANKO)

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., . . . Kingsbury, B. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82-97. doi:10.1109/MSP.2012.2205597

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

The goal of speech recognition software is twofold: to interpret audio input with as close to 100% accuracy as possible, and to translate that audio input into textual output with the same accuracy. Most current automatic speech recognition systems use hidden Markov models and Gaussian mixture models to understand acoustic input and determine the appropriate response. For the purposes of this article, this process will be referred to as the “traditional model.” When speech is input, the software uses statistical analysis to interpret the spoken language and to predict the “real,” or “actual,” meaning the user intended to associate with the input. Every new data input creates a new “occurrence.” New occurrences are recorded and stored in the computer’s memory, where outcomes are aggregated and called upon to inform the computer about how it should respond to the next occurrence. The software immediately begins comparing input against known language patterns and against information recorded from previous experience with that specific individual’s speech. The computer then uses a predictive model to forecast the most likely output, returning information tailored to each individual user as text.
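To make the “traditional model” concrete, the following is a minimal, hypothetical sketch (not code from the paper) of how a tiny hidden Markov model with Gaussian emissions might score a sequence of acoustic frames and recover the most likely state sequence with the Viterbi algorithm; the states, probabilities, and frame values are all invented for illustration.

# Minimal, hypothetical sketch of the "traditional model": a tiny hidden Markov
# model with single-Gaussian emissions, decoded with the Viterbi algorithm.
# All states, probabilities, and acoustic frames are invented for illustration.
import numpy as np
states = ["/s/", "/i/", "/t/"]                    # toy phone-like hidden states
log_init = np.log(np.array([0.8, 0.1, 0.1]))      # initial state probabilities
log_trans = np.log(np.array([[0.6, 0.3, 0.1],     # state transition probabilities
                             [0.1, 0.6, 0.3],
                             [0.1, 0.1, 0.8]]))
means = np.array([0.0, 2.0, 4.0])                 # Gaussian emission mean per state
stds = np.array([1.0, 1.0, 1.0])                  # Gaussian emission std dev per state
def log_emission(frame):
    """Log-likelihood of one 1-D acoustic frame under each state's Gaussian."""
    return -0.5 * ((frame - means) / stds) ** 2 - np.log(stds * np.sqrt(2 * np.pi))
def viterbi(frames):
    """Most likely hidden state sequence for a sequence of acoustic frames."""
    log_delta = log_init + log_emission(frames[0])
    backptr = []
    for frame in frames[1:]:
        scores = log_delta[:, None] + log_trans   # every previous-to-current transition
        backptr.append(scores.argmax(axis=0))     # remember best predecessor per state
        log_delta = scores.max(axis=0) + log_emission(frame)
    path = [int(log_delta.argmax())]
    for bp in reversed(backptr):                  # trace the best path backwards
        path.append(int(bp[path[-1]]))
    return [states[i] for i in reversed(path)]
frames = np.array([0.1, 0.3, 2.1, 1.9, 3.8, 4.2]) # synthetic "audio" frames
print(viterbi(frames))                            # expected: /s/, /s/, /i/, /i/, /t/, /t/

In a real recognizer the frames would be vectors of acoustic coefficients, each state would use a mixture of many Gaussians rather than one, and the decoded states would then be mapped to words by pronunciation and language models.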

Google Inc. is a significant contributor to several advanced projects concerning the development of speech recognition software, and researchers working for Google include some of the most published and most cited individuals currently in the field. In 2012, researcher George Dahl and ten co-authors from four research groups published a paper in IEEE Signal Processing Magazine titled “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups.”

Working collaboratively, the four teams, from the University of Toronto, Google Inc., Microsoft Research, and IBM Research, identified a new alternative method for acoustic modeling in speech recognition, one that uses feed-forward Deep Neural Networks to move the interpretation of audio input, and its translation into textual output, closer to the goal of 100% accuracy. The authors claim that Deep Neural Networks outperform speech recognition systems based on the “traditional model” on a variety of benchmarks, sometimes by a large margin.

The researchers claim that, despite all its advantages, the traditional model has serious shortcomings, chiefly that its Gaussian mixture models are statistically inefficient for modeling data that lie on or near a non-linear manifold. The system’s complicated procedures also slow down processing, which increases the probability of the software producing an inaccurate response. Deep Neural Networks address these inefficiencies by drawing on recent advances in both machine learning algorithms and computer hardware. The combination of new algorithms and greater computational power led to modern training methods for Deep Neural Networks, which achieve levels of system efficiency beyond what researchers previously considered possible.

A Deep Neural Network is an artificial neural network whose model contains several hidden layers between the input layer and the output layer. DNNs are typically used to model complex non-linear relationships. Each hidden middle layer extracts distinctive information from the layer below it, which is used to inform the decisions made at the next layer. DNNs are known as feed-forward networks because data typically flows from the input layer to the output layer without looping back for multiple assessments. Recurrent Neural Networks, by contrast, allow data to loop back, so that earlier frames of a conversation can influence the current output. In applications such as language modeling, long short-term memory networks are particularly effective for building individualised speech patterns. These customized networks learn how individual users interact, tracking their speech patterns to develop an accurate prediction of what that person is likely to say while also taking into account the context of the conversation.
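To make the layered, feed-forward structure concrete, here is a minimal sketch (an illustration, not code from the paper) of a forward pass through a small DNN: each hidden layer applies a weight matrix, a bias, and a nonlinearity to the output of the layer below, and a final softmax layer turns the top-layer activations into probabilities over output classes (in acoustic modeling these would correspond to HMM states). All layer sizes and weights here are arbitrary placeholders.

# Minimal sketch of a feed-forward Deep Neural Network: data flows from the
# input layer through several hidden layers to a softmax output layer without
# looping back. Layer sizes and random weights are arbitrary placeholders.
import numpy as np
rng = np.random.default_rng(0)
layer_sizes = [39, 256, 256, 61]   # e.g. 39 acoustic features in, 61 output classes out
params = [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
def forward(x):
    """Propagate one input vector through the network; return class probabilities."""
    h = x
    for i, (W, b) in enumerate(params):
        z = h @ W + b
        if i < len(params) - 1:
            h = 1.0 / (1.0 + np.exp(-z))   # hidden layers: logistic units
        else:
            e = np.exp(z - z.max())        # output layer: softmax over classes
            h = e / e.sum()
    return h
x = rng.normal(size=39)                    # a fake frame of acoustic features
probs = forward(x)
print(probs.shape, round(float(probs.sum()), 3))   # (61,) 1.0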

When neural networks were first used for acoustic modeling, they were trained purely discriminatively. It was only as recently as this publication that researchers showed significant gains could be achieved by adding an initial stage of generative pretraining. This pretraining stage reduces overfitting, and the networks further benefit by exploiting information in neighboring frames, or “occurrences,” of acoustic input. Pretraining also reduces the time required for discriminative fine-tuning with backpropagation, which had been one of the main impediments to using DNNs when neural networks were first tried in place of the traditional model. The researchers also found that similar reductions in training time can be achieved by carefully adjusting the scales of the initial random weights in each layer, as sketched below.
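As one illustration of that last point, the sketch below uses a common scaling heuristic (an assumption for illustration, not necessarily the authors’ exact recipe): each layer’s initial random weights are drawn with a standard deviation proportional to one over the square root of the layer’s fan-in, so that activations and early backpropagation updates stay in a reasonable range.

# Hypothetical sketch of "carefully adjusting the scales of the initial random
# weights in each layer": the standard deviation of each weight matrix shrinks
# as the number of incoming units (the fan-in) grows.
import numpy as np
rng = np.random.default_rng(0)
layer_sizes = [39, 512, 512, 512, 2000]    # arbitrary example architecture
def scaled_init(n_in, n_out):
    """Gaussian weights with standard deviation 1/sqrt(fan-in)."""
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
weights = [scaled_init(n_in, n_out)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
for W in weights:
    print(W.shape, round(float(W.std()), 4))   # observed std shrinks as fan-in grows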

Understanding AI systems, and acoustic modeling in speech recognition specifically, has significant implications for those studying mass communication and digital culture. As products integrate these technologies more seamlessly into daily life, consumers need to be aware of how reliance on such systems can change their behaviour. The technology is found in almost every smartphone, building a profile of its owner’s language patterns so the device can transcribe text and perform a variety of other tasks using only voice commands. In addition to applications in speech recognition, DNNs are also being used in image recognition software, image restoration software, drug discovery and toxicology software, customer relationship management software, and mobile advertising software.

Warren Buzanko 5750021
