Look Ma Bell, No Hands! - VoiceXML, X+V, and the Mobile Device

The emerging world without wires has fostered a growing number of small and mobile devices (everything from PDAs to smart phones) capable of accessing data and running applications. The trouble is, while devices are getting smaller, human hands and fingers are not.

To assist users in managing their devices, user interface designers have begun to combine the traditional keyboard-input model with such interactive technologies as voice-directed input. This type of interaction, in which the user has more than one means of accessing data in his or her device, is sometimes called multimodal interaction. It is fast becoming the norm in the world of wireless mobile computing.

If asked, most developers will cite speed and efficiency as the main reasons for developing multimodal interfaces. Parallel input (for example, the ability to both key in commands and voice them) allows users to access and respond to information delivered by their devices more quickly. Multimodal systems don't just enable faster interactions, however; they also add value to the overall experience of interaction. Multimodal interfaces allow more room for user preference (giving users a choice of how they interact with the system) and reduce the overexertion that can result from single-modality interaction. Being able to switch between modes of interaction can lead to a lower incidence of error (because users can choose the mode best suited to each activity), as well as easier error recovery. Finally, multimodal interfaces can accommodate a wider range of tasks and environments.

Speech adds tremendous value to small mobile devices. At the same time, mobility and wireless connectivity are moving computing into new physical environments: wireless networks now provide connectivity anywhere and anytime, linking mobile devices to back-end data wherever the user happens to be. If the need for multimodal interaction extends to the network, then the Internet needs new technologies and standards to enable that functionality. Increasingly, Web developers are seeking ways to turn existing visually oriented Web pages into multimodal ones. And that's where X+V comes in.

The XHTML+Voice profile brings spoken interaction to standard Web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific Document Object Model (DOM) events, thereby reusing the event model familiar to Web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content.

X+V promises to deliver the feature set, flexibility, and ease of use that developers need to write one application that supports visual-only, voice-only, and multimodal interaction. The versatility of the Web and XML is reflected in the fact that X+V nicely integrates VoiceXML into the Web by marrying it with XHTML. X+V brings voice markup to the presentation layer, allowing you to speech-enable each component of the application interface.

Combining Voice and Visual Markup
Visual markup tells a Web browser what you want the user interface to look like and how you want it to behave when the user types, points, or clicks. Similarly, voice markup tells the Web browser what you want it to do when the user speaks to it. For visual markup, the browser uses a graphics engine; for voice markup, the browser uses a speech engine.

While both X+V and SALT (Speech Application Language Tags) use W3C standards for grammar and speech synthesis, only X+V is based entirely on standardized languages. X+V's modular architecture makes it simple to separate an X+V application into components. As a result, X+V applications can be coded in parts, with experts in voice programming developing the voice elements and experts in visual programming developing the visual ones. X+V's modularity also makes it adaptable to stand-alone voice application development: VoiceXML used in an X+V application can be reused inside a stand-alone VoiceXML application. SALT's reliance on the containing environment makes it very difficult to separate out its coding functions, and also makes the language insufficient for stand-alone application development.

Richness is another factor that differentiates the two languages. Whereas SALT defines three tags - Prompt, Listen, and Bind - as its tag set for speech, X+V is based on the mature and tested VoiceXML standard. Because it uses VoiceXML's Form construct for its speech tag set, X+V includes all the utility of "prompt, listen, and bind," and more.

Just as visual markup specifies the visual interface items, voice markup specifies the voice interface items. Speech-enabling an application interface is a matter of first breaking the visual interface into its basic components (for example, an input field for a time of day and a checkbox for "a.m." or "p.m."), creating snippets of voice markup for each component, and then associating the snippets with the existing visual markup for each component. For each component, consider the following questions:

  • What words should the speech engine speak or synthesize?
  • What words and phrases should the speech engine listen for?
  • What should the browser do if the speech engine doesn't recognize a word or phrase?
  • What will be the result of the speech engine recognizing a word or phrase that has been spoken?
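
As a rough illustration, the four questions line up with four VoiceXML elements: prompt (what to speak), grammar (what to listen for), nomatch (what to do when recognition fails), and filled (what happens on success). The sketch below is not taken from the original article. The form id voice_ampm, the field name ampm, the prompt wording, and the grammar file ampm.grxml are all hypothetical. Python's standard xml.etree.ElementTree module is used only to confirm that the snippet is well formed and to read the four answers back out of it.

```python
import xml.etree.ElementTree as ET

VXML = "http://www.w3.org/2001/vxml"  # VoiceXML 2.0 namespace

# Hypothetical voice snippet for the "a.m./p.m." component.
# ampm.grxml is an assumed external grammar file; it is not loaded here.
SNIPPET = """<vxml:form xmlns:vxml="http://www.w3.org/2001/vxml" id="voice_ampm">
  <vxml:field name="ampm">
    <vxml:prompt>Morning or afternoon?</vxml:prompt>
    <vxml:grammar src="ampm.grxml" type="application/srgs+xml"/>
    <vxml:nomatch>Sorry, please say a.m. or p.m.</vxml:nomatch>
    <vxml:filled>
      <vxml:prompt>You said <vxml:value expr="ampm"/>.</vxml:prompt>
    </vxml:filled>
  </vxml:field>
</vxml:form>"""


def describe_field(snippet):
    """Return the snippet's answers to the four design questions."""
    ns = {"vxml": VXML}
    form = ET.fromstring(snippet)
    field = form.find("vxml:field", ns)
    return {
        # What should the speech engine speak?
        "speaks": field.find("vxml:prompt", ns).text,
        # What should it listen for? (reference to the grammar file)
        "listens_for": field.find("vxml:grammar", ns).get("src"),
        # What happens when nothing is recognized?
        "on_no_match": field.find("vxml:nomatch", ns).text,
        # What happens on success? Here: echo the recognized variable.
        "echoes_variable": field.find(
            "vxml:filled/vxml:prompt/vxml:value", ns
        ).get("expr"),
    }


if __name__ == "__main__":
    for question, answer in describe_field(SNIPPET).items():
        print(f"{question}: {answer}")
```

A real page would more likely copy the recognized value into the visual field than speak it back; the echo is used here only to keep the sketch short.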

Correlating Voice and Visual Input/Output
Given an application's visual markup plus a collection of voice markup snippets, you have almost everything you need to create the presentation layer of a multimodal Web application. In fact, the only thing you still need is a way to tell the browser which snippets of voice markup go with which visual elements, and (because a speech engine can only have one snippet active at a time) when to activate each snippet of voice markup.

Given that the Web application environment is event-driven, X+V incorporates the DOM eventing framework used in the XML-Events standard. Using this framework, X+V defines the familiar event types from HTML such as "on mouse-over" or "on input focus" to create the correlation between visual and voice markup. Using XML-Events provides X+V with a uniform and standards-based eventing model that enables event integration between XML languages.
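
To make the correlation concrete, here is a minimal sketch of a page in which an XHTML input field activates a voice form when it receives focus, using the XML-Events event and handler attributes. The page itself, the ids city and voice_city, and the grammar file cities.grxml are assumptions made for illustration, not content from the original article. The Python around the markup simply lists each binding and checks that the handler it names exists in the same document.

```python
import xml.etree.ElementTree as ET

EV = "http://www.w3.org/2001/xml-events"  # XML-Events namespace
VXML = "http://www.w3.org/2001/vxml"      # VoiceXML namespace

# Hypothetical X+V page: one visual field, one voice form.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Departure city</title>
    <vxml:form id="voice_city">
      <vxml:field name="city">
        <vxml:prompt>Which city are you leaving from?</vxml:prompt>
        <vxml:grammar src="cities.grxml" type="application/srgs+xml"/>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <form action="">
      <input type="text" id="city" ev:event="focus" ev:handler="#voice_city"/>
    </form>
  </body>
</html>"""


def voice_bindings(page):
    """List (element id, event, handler id, handler exists) tuples."""
    root = ET.fromstring(page)
    # Ids of every VoiceXML form defined in the page.
    voice_forms = {f.get("id") for f in root.iter(f"{{{VXML}}}form")}
    bindings = []
    for element in root.iter():
        handler = element.get(f"{{{EV}}}handler")
        if handler is None:
            continue  # element has no voice handler attached
        target = handler.lstrip("#")
        event = element.get(f"{{{EV}}}event")
        bindings.append((element.get("id"), event, target, target in voice_forms))
    return bindings


if __name__ == "__main__":
    for element_id, event, target, found in voice_bindings(PAGE):
        status = "ok" if found else "missing handler"
        print(f"{element_id}: on {event} -> {target} ({status})")
```

Because only one voice snippet can be active at a time, tying activation to an event such as focus gives the browser an unambiguous rule for which snippet to run.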

Separate Files and Reuse
Because all the parts of X+V are XML-compliant, the voice markup can be packaged in two ways: in the same file as the XHTML or in separate files. Separating voice markup from visual markup gives you more flexibility in developing your applications. For example, you can develop the voice markup separately from the visual markup and combine the two later.

Another advantage of keeping the files separate is reuse, such as the ability to reuse snippets of VoiceXML in numerous XHTML pages. In the example of a flight-reservation application, when users make the reservation they will be asked if they want a one-way, round-trip, or multi-leg reservation. For each answer, the system will call up a different form. While the three forms differ with regard to the type of trip desired, each one has the same departure city. If you have separated the voice snippet for the departure city you can reuse it in each of the three different XHTML forms, or containers.

The final advantage of keeping the VoiceXML separate from the XHTML is that it allows the snippets of VoiceXML to be reused in containers other than XHTML. In this case, X+V can utilize the VoiceXML notion of documents and forms, wherein a VoiceXML document contains one or more forms. You already know that VoiceXML forms can be linked to XHTML to create multimodal applications. But such forms can also be stitched together in a VoiceXML document (or container) to create voice-only applications. The end result is that you can (by reuse) create a single application that simultaneously supports multimodal browsers, GUI-only browsers, and voice-only systems such as IVRs.
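
The reuse described above can be sketched as follows: voice forms kept as separate snippets are wrapped in a single VoiceXML document element to produce a voice-only container. The snippet contents, ids, and grammar file names are hypothetical, and the helper build_voice_only_document is invented for this illustration. A working voice-only application would also need transitions between the forms (for example, VoiceXML's goto element), which are omitted here.

```python
import xml.etree.ElementTree as ET

VXML = "http://www.w3.org/2001/vxml"  # VoiceXML namespace

# Hypothetical reusable voice snippets, as they might be stored separately.
TRIP_TYPE = """<vxml:form xmlns:vxml="http://www.w3.org/2001/vxml" id="voice_trip_type">
  <vxml:field name="trip_type">
    <vxml:prompt>One way, round trip, or multi-leg?</vxml:prompt>
    <vxml:grammar src="trip_type.grxml" type="application/srgs+xml"/>
  </vxml:field>
</vxml:form>"""

DEPARTURE = """<vxml:form xmlns:vxml="http://www.w3.org/2001/vxml" id="voice_departure">
  <vxml:field name="departure_city">
    <vxml:prompt>Which city are you leaving from?</vxml:prompt>
    <vxml:grammar src="cities.grxml" type="application/srgs+xml"/>
  </vxml:field>
</vxml:form>"""


def build_voice_only_document(snippets):
    """Wrap separately stored VoiceXML forms in one <vxml> document."""
    # Serialize the VoiceXML namespace as the default (unprefixed) one.
    ET.register_namespace("", VXML)
    document = ET.Element(f"{{{VXML}}}vxml", {"version": "2.0"})
    for snippet in snippets:
        document.append(ET.fromstring(snippet))
    return ET.tostring(document, encoding="unicode")


if __name__ == "__main__":
    print(build_voice_only_document([TRIP_TYPE, DEPARTURE]))
```

The same DEPARTURE snippet could equally be placed in the head of each XHTML reservation form, which is the reuse across visual containers that the flight-reservation example describes.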

Conclusion
X+V is the latest addition to the XML family of technologies for user interface development. Whereas XHTML is for developing visual interfaces, and VoiceXML focuses entirely on voice-based development, X+V is a hybrid, dedicated to developing multimodal application interfaces. X+V is particularly well suited to wireless development, where developers are faced with small visual interfaces and increasing user demand for voice input and output.

X+V's foundation in existing XML standards lends it tremendous strength and versatility. Interfaces developed using X+V are portable to a wide range of applications and development environments, can be easily developed in teams, and are highly scalable over time. Developers working with X+V can access the numerous resources that come with a well-developed standard such as XML. X+V also spares developers from learning a new development language such as SALT, or adapting to the constraints of a more visually oriented development environment. Perhaps best of all, X+V does not require training in voice user interfaces or linguistics; a basic knowledge of XML and related standards is sufficient to get started.

About Les Wilson
Les Wilson is an IBM senior technical staff member. He has been responsible for a variety of research and development projects related to man-machine interfaces, graphics, network computing, and user-interface technology. Les is currently the multimodal architect for IBM's Pervasive Computing Division.

Reader Feedback

quEzztion wrote: I understand that both X+V and SALT use W3C standards for grammar and speech synthesis, but is X+V the *only* one of the two of them based entirely on standardized languages?

Les Wilson wrote: Short answer: yes. Long answer:
1) X+V uses a standardized (W3C term is "Recommended") language for the voice markup whereas SALT does not.
2) X+V specifies XHTML as the "containing" GUI language.
3) X+V uses XMLEvents (another W3C "Rec") as the syntax for the application developer to specify the events that activate voice handlers. SALT leaves this up to the language into which it is being integrated.
That is, in addition to the Synthesis and Grammar formats, the "X", the "+", and the "V" of X+V are all W3C recommendations. One advantage of this characteristic of X+V is that it specifies a platform that in turn enables portability of the application. Additionally, specifying the linkage between visual and voice languages using XMLEvents enables standards-based interoperability between devices and servers for platform implementations that choose to distribute function across that boundary (e.g., distributing voice processing to a server but doing the GUI in the client).

Les Wilson wrote: Interesting