April 24, 2014

Speech technology at Google: teaching machines to talk and listen

This is the latest post in our series profiling entrepreneurial Googlers working on products across the company and around the world. Here, you’ll get a behind-the-scenes look at how one Googler built an entire R&D team around voice technology that has gone on to power products like YouTube transcriptions and Voice Search. – Ed.

When I first interviewed at Google during the summer of 2004, mobile was just making its way onto the company’s radar. My passion was speech technology, the field in which I’d already worked for 20 years. After 10 years of speech research at SRI, followed by 10 years helping build Nuance Communications, the company I co-founded in 1994, I was ready for a new challenge. I felt that mobile was an area ripe for innovation, one that needed speech technology and was destined to become a key platform for delivering services.

During my interview, I shared my desire to pursue the mobile space and mentioned that if Google didn’t have any big plans for mobile, then I probably wouldn’t be a good fit for the company. Well, I got the job, and I started soon after, without a team or even a defined role. In classic Google fashion, I was encouraged to explore the company, learn about what various teams were working on and figure out what was needed.

After a few months, I presented an idea to senior management to build a telephone-based spoken interface to local search. Although there was a diversity of opinion at the meeting about what applications made the most sense for Google, all agreed that I should start to build a team focused on speech technology. With help from a couple of Google colleagues who also had speech backgrounds, I began recruiting, and within a few months people were busily building our own speech recognition system.

Six years later, I’m excited by how far we’ve come and, in turn, how our long-term goals have expanded. When I started, I had to sell other teams on the value of speech technology to Google’s mission. Now, other teams constantly approach me with ideas and needs for speech. The biggest challenge is scaling our effort to meet the opportunities. We’ve advanced from GOOG-411, our first speech-driven service, to Voice Search, Voice Input, Voice Actions, a Voice API for Android developers, automatic captioning of YouTube videos, automatic transcription of voicemail for Google Voice and speech-to-speech translation, among others. In the past year alone, we’ve ported our technology to more than 20 languages.

Speech technology requires an enormous amount of data to feed our statistical models and lots of computing power to train our systems—and Google is the ideal place to pursue such technical approaches. With large amounts of data, computing power and an infrastructure focused on supporting large-scale services, we’re encouraged to launch quickly and iterate based on real-time feedback.

I’ve been exploring speech technology for nearly three decades, yet I see huge potential for further innovation. We envision a comprehensive interface for voice and text communication that defies all barriers of modality and language and makes information truly universally accessible. And it’s here at Google that I think we have the best chance to make this future a reality.

Update 9:39 PM: Changed title of post to clarify that speech technology is not only used on mobile phones but also for transcription tasks like YouTube captioning and voicemail transcription. – Ed.

Posted by Mike Cohen, Manager, Speech Technology