
From Digital Culture & Society


End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow

Variani, E., Bagby, T., McDermott, E., & Bacchiani, M. (2017). End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow. Interspeech 2017. doi:10.21437/interspeech.2017-1284

Good speech recognition software is defined by how quickly it delivers accurate results tailored to individual users. The technology now fits in a package as small as a smartphone, and consumers are increasingly familiar with its practical, everyday applications.

However, to develop speech recognition algorithms compact enough to run on a cell phone or in a car, companies like Google train deep neural networks (DNNs) on computing clusters that are vastly more powerful. DNNs provide the structure computers rely on to interpret human language input.

These clusters can be thought of as large, one-of-a-kind machines, built to run the deep neural networks that handle the complex computations speech recognition involves. They are limited by the physical hardware powering the network. In this application, the graphics processing unit (GPU) can be thought of as the computer's engine: it is the hardware component responsible for processing information. The better the GPU, the more processing power the machine has and the faster and more accurate the network becomes, resulting in more reliable, user-friendly speech recognition.

Large vocabulary continuous speech recognition (LVCSR) systems developed by industry leaders such as Google, Microsoft, and IBM, along with researchers at the University of Toronto, all employ DNNs to outperform speech recognition systems built on the traditional model across a variety of benchmarks. In their 2012 paper, Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, George Dahl and his co-authors argue that more advanced generative pre-training should reduce the time networks require to analyze multidimensional data arrays. But because the machines running these DNNs are one of a kind, the upgraded hardware that will enable such advances needs to be optimized specifically for speech recognition workloads.

Building on the research conducted by George Dahl and his colleagues in 2012, a second research team from Google recently published a paper describing new training methods, built with the TensorFlow software framework, that deliver greater efficiency and increased processing speed. Ehsan Variani, Tom Bagby, Erik McDermott, and Michiel Bacchiani presented their paper, End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow, at the 2017 Interspeech conference in Stockholm, Sweden. TensorFlow was released by the Google Brain team in 2015 as an open-source dataflow programming framework for machine learning; its core data structure is the multidimensional data array, or tensor, and Google applies it to tasks including large vocabulary continuous speech recognition. TensorFlow coordinates communication between the components of Google's custom-built clusters, maximizing the efficiency of its speech recognition training. The researchers describe an implementation that dramatically reduced the training time for Google's LVCSR software. They claim their approach makes it possible to take advantage of both data parallelism and high-speed computation on GPUs for state-of-the-art sequence training of acoustic models. The effectiveness of the design was evaluated for different training schemes and model sizes on a 20,000-hour Voice Search task.
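The data parallelism the researchers exploit can be sketched in a few lines. The following is an illustrative toy, not code from the paper or the TensorFlow API: each replica (a plain function standing in for a GPU worker) computes a gradient on its shard of the minibatch, and the shard gradients are averaged before the shared model is updated. The model here is a hypothetical one-parameter least-squares fit; all names are invented for illustration.

```python
def shard_gradient(w, shard):
    """Gradient of 0.5*(w*x - y)^2 averaged over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def parallel_sgd_step(w, minibatch, num_replicas, lr=0.01):
    """Split the minibatch across replicas, average their gradients, update."""
    size = len(minibatch) // num_replicas
    shards = [minibatch[i * size:(i + 1) * size] for i in range(num_replicas)]
    grads = [shard_gradient(w, s) for s in shards]  # would run concurrently on GPUs
    avg_grad = sum(grads) / num_replicas
    return w - lr * avg_grad

# Fit y = 2x from noise-free data; because the shards partition the batch,
# the averaged update equals the full-batch update -- the point of
# synchronous data parallelism.
data = [(x, 2.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(50):
    w = parallel_sgd_step(w, data, num_replicas=4)
print(round(w, 3))
```

In a real system each shard's gradient would be computed concurrently on a separate GPU; the arithmetic of splitting and averaging is the same.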

After optimizing new hardware upgrades for the demands of speech recognition, the researchers were able to devote significantly more processing power to TensorFlow's core operations. The increased power sped up fundamental tasks, chiefly building comparative acoustic models for large vocabulary continuous speech recognition. TensorFlow enabled the machine to store and retrieve information while running separate analyses on its GPUs simultaneously, a combination that had previously not been practical. This technique, comparing acoustic models in response to user input in real time, has been described as multidimensional data array analysis; the multidimensional data array, or tensor, is what gives TensorFlow its name.

Multidimensional data array analysis is the process through which audio input is broken into multiple frames, analyzed, modelled as dataflow graphs, and compared both with known language patterns and with information the system has learned about the user. During this analysis, the network's GPUs retrieve input data simultaneously and, working separately, produce different dataflow models for direct comparison. When TensorFlow is optimized to take advantage of upgraded hardware, this additional layer of comparative analysis significantly outperforms LVCSR software built on earlier DNNs.
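The first step of that pipeline, breaking audio into frames, can be sketched as follows. This is a hedged illustration, not code from the paper: an audio signal is sliced into overlapping fixed-length frames, and stacking the frames for a batch of utterances yields a 3-D data array of shape (utterances, frames, samples per frame), exactly the kind of multidimensional array, or tensor, the previous paragraph refers to.

```python
def frame_signal(samples, frame_len, hop):
    """Split a 1-D list of audio samples into overlapping fixed-length frames."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop  # overlapping frames: hop < frame_len
    return frames

signal = list(range(10))  # stand-in for 10 audio samples
frames = frame_signal(signal, frame_len=4, hop=2)
print(frames)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Real systems use frames of roughly 10-25 ms of audio and convert each frame to spectral features before it reaches the network, but the slicing logic is the same.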

The TensorFlow framework pushes the limits of the most advanced hardware components available. When optimized, TensorFlow uses the computer's powerful GPUs to complete fundamental processes in less time than earlier DNN pipelines. Combined with advanced end-to-end training, LVCSR software built on TensorFlow-based DNNs achieves higher final word accuracy and shorter training times than earlier DNNs without TensorFlow's architecture.
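What the trained acoustic model actually produces per frame can be sketched minimally. The following is an assumption-laden toy, not the paper's model: for each frame of features the model emits a raw score per speech unit, and a softmax normalizes those scores into a probability distribution; end-to-end training adjusts the model so the units of the correct transcript score highest. The unit labels and scores below are invented for illustration.

```python
import math

def softmax(scores):
    """Normalize raw acoustic scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one audio frame over three speech units.
units = ["ah", "eh", "sil"]
probs = softmax([2.0, 0.5, 0.1])
best = units[probs.index(max(probs))]
print(best)  # "ah"
```

A decoder then combines these per-frame distributions with language patterns to choose the most likely word sequence.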

Warren Buzanko 5750021
