I’ve been playing around with my new 3GS for a few days now. As you might know, it has some built-in automatic speech recognition (ASR) capability that Apple brands as Voice Control, good for making calls and operating the iPod. Right now there is no programming interface that lets it control third-party apps – bummer. I’d bet, however, that we’ll soon see Voice Control as a core iPhone capability for all apps – perhaps next year. When will it come to the PC? The future is murky. Importantly, though, the basic iPhone technology works pretty well: better than anything else I’ve used, and a thousand times better than the built-in Bluetooth voice dialer my last car had.
For the first time in years, I found myself exploring my music collection. “Play songs by Johnny Cash” begets “Playing songs by Johnny Cash,” and then the classic crooner laments about love gone bad in Memphis. This was even safer than using the in-car stereo, because I kept my eyes on the road the whole time. I also found myself calling people. Of course, driving while distracted is bad, but it was actually practical to call the house and tell the kids that I was on my way home without having to take my eyes off the road. When ASR works well enough, you’ll want to use it. And use it in circumstances that were previously off-limits or much more dangerous.
There is a related technology in the iPhone called VoiceOver. Designed for people with vision impairments, it uses a text-to-speech (TTS) voice synthesizer and basically verbalizes whatever the person’s finger is touching. When you swipe between app screens, you’ll hear “Screen 3 of 11” followed by a listing of every app on that screen. Touching an app like PowerNap prompts “Power Nap, double tap to launch.” A double tap anywhere on the screen then launches the app. This system works too, but not well enough. Among its various drawbacks, it still requires a lot of tapping. So, although you don’t need to use your eyes, you still need to use your hands.
Key Technologies
- Automatic Speech Recognition (ASR)
- Text To Speech (TTS)
- Automatic or Semi-Automatic Voice Transcription – for recording voice memos and converting them into text.
Implications
… for the application producer
- If your app would likely be used by somebody walking around, then start thinking about Voice Control now, while still in your design phase.
- Improved voice control, with high quality voices and better recognition, may be worth an in-app purchase/upgrade.
- The Silver Market (retiring baby boomers) is growing, and as their eyes get worse and their fingers hurt more, Voice Control may be the way to bring or extend apps to that huge market.
… for the developers
- Think about Voice Control early in the design phase. Don’t count on being able to elegantly retrofit your eye & finger focused app into both eye & finger and ear & voice. Don’t believe me? Then ask yourself how many of the mouse & click focused apps were easily ported to the iPhone’s finger-based system?
- Jump-start your mind by testing out some existing apps with Accessibility turned on. It was an enlightening experience for me.
- Open-source SDKs here are seriously lagging behind the commercial implementations, but they might be good enough to act as a preview or teaser, leaving the higher-quality upgrade for serious users to purchase.
- The commercial implementations are just now starting to think about licensing their SDKs.
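One way to picture the "design for voice early" advice is to keep each command separate from the way the user triggers it. The sketch below is only an illustration under my own assumptions: `Command`, `CommandRegistry`, and the `call_home` example are hypothetical names, and nothing here is an Apple API. It shows a tap and a recognized phrase both resolving to the same action, which returns the text a TTS voice could speak back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Command:
    """One user intent, independent of how the user expresses it."""
    name: str
    action: Callable[[], str]                          # returns a confirmation to display or speak
    phrases: List[str] = field(default_factory=list)   # spoken triggers (hypothetical)
    touch_target: Optional[str] = None                 # on-screen control id (hypothetical)


class CommandRegistry:
    """Maps both touch targets and spoken phrases onto the same commands."""

    def __init__(self) -> None:
        self._by_phrase: Dict[str, Command] = {}
        self._by_target: Dict[str, Command] = {}

    def register(self, command: Command) -> None:
        for phrase in command.phrases:
            self._by_phrase[phrase.lower().strip()] = command
        if command.touch_target is not None:
            self._by_target[command.touch_target] = command

    def handle_touch(self, target_id: str) -> Optional[str]:
        command = self._by_target.get(target_id)
        return command.action() if command else None

    def handle_speech(self, transcript: str) -> Optional[str]:
        # Assumes some recognizer has already turned audio into text.
        command = self._by_phrase.get(transcript.lower().strip())
        return command.action() if command else None


registry = CommandRegistry()
registry.register(Command(
    name="call_home",
    action=lambda: "Calling home",
    phrases=["call home", "call the house"],
    touch_target="call_home_button",
))
```

With this shape, adding a voice front end later means registering phrases rather than rewriting screens, which is the retrofit problem the bullets above warn about.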
… for R&D and DoD
- Imagine mobile applications that are useful without hands or screens. We’re used to thinking about the field utility of such systems, but what about non-combatants, or at least those not all the way on the front line?
- To follow this line of thought, haptic feedback is just coming to the consumer space – so if you can’t imagine soldiers chatting away all day to their PDAs, can you imagine them gesturing to one, without ever seeing the screen?
(Image caption: Not sure these are quite the right commands)
- For application survivability, if the screen is damaged, say, by a bullet, shouldn’t the software still be able to function via alternative input methods, like voice? It might not be as efficient in some cases, but perhaps it would be better than nothing.
- Low power consumption: software capable of voice control wouldn’t need a screen.
- Weight: No screen equals lower weight.