VIXML - Video Interactive XML
VIXML, or Video Interactive XML, is a scripting language
designed for IVR sessions with 3G video calls. It also works with
standard voice calls.
A VIXML page is specified within an XML document by a list of objects
between <vixml> and </vixml> tags. The VIXML interpreter
works its way sequentially through the objects, rendering graphics
and playing sound as necessary.
<vixml>
objects ...
</vixml>
To ensure correct syntax, VIXML documents can also be created from, and
validated against, the XML schema at http://www.vivatel.com.au/VIXMLSchema.xsd
A video IVR session has an audio and a video channel,
while a voice session only has an audio channel. Some VIXML objects
attempt to lock these channels, sometimes instantaneously, like when
an image is specified, or for a period of time, like when text is
spoken by the speech synthesizer. The VIXML interpreter will stop at
any object that is unable to obtain a lock, then proceed once the
channel becomes available.
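The locking behaviour described above can be pictured with a short page. This is only a sketch: the file name welcome.jpg is hypothetical, and the comments restate the locking rules from this document rather than anything measured.

```xml
<vixml>
  <!-- Locks the video channel only momentarily, while the image is drawn -->
  <image src="welcome.jpg" />
  <!-- Locks the audio channel until the sentence has been spoken -->
  <speech>Welcome to the demonstration service</speech>
  <!-- The interpreter waits here until the audio channel is free again -->
  <speech>Press 1 to continue</speech>
</vixml>
```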
Each object has optional attributes, and some may include a text value.
For example, the following XML tag sets the background colour to red.
<defaults bgcolour="red" />
while this object will say hello my friend with a male Scottish
accent.
<speech voice="cmu_us_awb_arctic_hts">hello my friend</speech>
Note that whenever a voice is set for the speech synthesizer, it will
persist for the duration of the call. This is one of the few attributes
that is not reset when a new page is loaded.
URLs within VIXML objects may be either absolute (e.g.
http://www.darkside.com/test/image.jpg) or relative
(image.jpg or /test/image.jpg). However, for relative
addresses to work, the URL of the referring page must contain a trailing
slash if it's the default file for its directory (e.g.
http://www.darkside.com/test/).
There are also special URLs that can be referenced from link and
timeout objects.
- system://hangup This causes the call to terminate immediately.
- system://signal?dtmf=NNN When connected to another party
via the call command, this will send them a sequence of one
or more DTMF digits. This is useful when the other party is an IVR.
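As a rough illustration of the special URLs, the page below hangs up on a key press or after a period of inactivity. The key assignments, the 60 second delay, and the digits 1234 are arbitrary choices for this sketch, and the signal link is only meaningful while the call is connected to another party via the call command.

```xml
<vixml>
  <!-- Pressing * terminates the call immediately -->
  <link keys="*" url="system://hangup" />
  <!-- Pressing 1 sends the DTMF digits 1234 to the connected party -->
  <link keys="1" url="system://signal?dtmf=1234" />
  <!-- With no key press, hang up after 60 seconds -->
  <timeout delay="60000" url="system://hangup" />
</vixml>
```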
All pages loaded by VIXML will have a duration variable appended
to their URL. This contains the duration of the call so far, in
milliseconds. It can be useful in situations where a call has to be
terminated after a certain length of time.
In addition, all pages will have a session variable appended.
This is an integer, unique to every call.
Start page
The start page of a VIXML session will be passed up to three HTTP GET
variables: type, localPhone, and, if it is known, remotePhone.
The value of type will be either video
or voice. If caller-ID is suppressed, then remotePhone
will not be specified. The phone numbers will be in E.164 format where
possible (e.g. 61410402221), but for numbers that aren't
accessible internationally the local format will be used (e.g.
1800555236).
The duration and session variables will also be set for
the start page. Note, however, that the duration on the start page will
always be zero.
Since the start page will probably need to be a dynamic script to
handle these variables, it is unlikely to have a .xml suffix.
To be safe, the generated page should have a Content-Type of
text/xml or application/xml, although this isn't
enforced.
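For illustration, a start-page script might return something like the page below for a video call. The request URL in the comment is hypothetical: the host, script name, parameter order, and session value are assumptions, and only the parameter names are taken from the description above.

```xml
<!-- Hypothetical request:
     http://www.mysite.com/start.php?type=video&localPhone=1800555236&remotePhone=61410402221&duration=0&session=12345 -->
<vixml>
  <!-- The script has copied remotePhone into the greeting -->
  <text x="0" y="16" wordWrap="true">Welcome, caller 61410402221</text>
  <speech>Welcome to the service</speech>
</vixml>
```

A script that receives type=voice would presumably omit the text object, since a voice session has no video channel.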
VIXML objects
The complete list of VIXML objects is as follows.
| Name | Attributes | Text value | Lock video channel | Lock audio channel |
| defaults | bgcolour, wordWrap, font, fontSize, fontColour, beepDuration, beepFrequency, beepSRC, voice, hangupURL | No | If bgcolour is set | No |
| image | src, x, y, width, height, animated, windowed | No | Yes | No |
| text | wordWrap, font, fontSize, fontColour, x, y, width, height, halign, valign | Yes | Yes | No |
| speech | voice | Yes | No | Yes |
| link | keys, url | No | No | No |
| timeout | delay, url, background | No | No | No |
| input | name, validKeys, deleteKeys, clearKeys, acceptLength, acceptKeys, acceptURL, cancelKeys, cancelURL, timeoutDelay, timeoutURL, visible, wordWrap, font, fontSize, fontColour, x, y, width, height, halign, valign, passwordChar | No | If visible is true | No |
| stream | asrc, vsrc, avsrc, bytesPerSecond | No | If vsrc or avsrc is set | If asrc or avsrc is set |
| capture | url, contentType, realtime, audioDirection, videoDirection | No | No | No |
| call | phone, conference, phoneConference, conferenceType, type, fromPhone, successURL, connectedURL, failedURL, engagedURL, rejectedURL, preambleURL | No | If type is video | Yes |
| application | type, src | No | Yes | Yes |
defaults
The defaults tag is used to set the background colour and
default attributes for the objects that follow.
| Attribute | Default value | Description |
| bgcolour | white | Fills the screen with the specified colour. Colours can be specified by name or in the hexadecimal #rrggbb format; in other words, all HTML colours are valid. For example, white could also be specified as #ffffff. Note that this colour will overwrite anything that was previously on the screen. |
| wordWrap | false | If true, text will break at whitespace where possible when wrapping to the next row. Otherwise it breaks at the last character. |
| font | | The default font for all rendered text. A blank value means a regular (non-bold, non-italic, proportional) Lucida font. Other valid fonts are bold, italic, bold-italic, fixed, fixed-bold, fixed-italic, and fixed-bold-italic. Note that fixed means a fixed-width font. |
| fontSize | 16 | The height of the font, in pixels. More precisely, this is the spacing between rows of text; the characters themselves are slightly smaller. |
| fontColour | black | The colour of the font. As with the background colour, this can be a name or a hexadecimal value. |
| beepDuration | 0 | The duration of the audio beep when a DTMF key is pressed, in milliseconds. A value of zero disables the beep. A value of 250 is a reasonable choice. |
| beepFrequency | 440 | The frequency of the DTMF beep in Hertz. |
| beepSRC | | The URL of an audio file to play whenever a DTMF key is pressed. This overrides the beepDuration value. Valid file formats are wav, mp3 and raw audio. See the stream object for details. |
| voice | cmu_us_awb_arctic_hts | The default voice for synthesized speech. See the speech object for more details. |
| hangupURL | | This URL will be notified when the call ends. A duration variable is appended, containing the duration of the call in milliseconds. To indicate successful notification, the output of the URL should be OK. Note that this setting will persist for the duration of the call. |
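A defaults object combining several of these attributes might look like the following sketch. The hangup script URL and the particular colours and sizes are illustrative assumptions.

```xml
<vixml>
  <!-- Light grey background, word-wrapped bold 20-pixel navy text, 250 ms key beeps -->
  <defaults bgcolour="#e0e0e0" wordWrap="true" font="bold" fontSize="20"
            fontColour="navy" beepDuration="250" beepFrequency="440"
            hangupURL="http://www.mysite.com/hangup.php" />
  <!-- This text object inherits the font, size, colour and wrapping set above -->
  <text x="0" y="20">Main menu</text>
</vixml>
```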
image
The image tag is used to load an image from a URL and display it
on the screen at the specified position at the specified size. Images
that are not the same resolution as the specified size will be scaled
to fit. Supported image formats are JPEG, GIF, PBM,
PGM, and PPM. For images from remote servers, JPEG format
is preferred because it uses less bandwidth. For images hosted locally,
PPM is preferred because it requires less processing.
| Attribute | Default value | Description |
| src | | The URL of the image to be loaded. |
| x | 0 | The x position, in pixels, of the left edge of the image. |
| y | 0 | The y position, in pixels, of the top edge of the image. |
| width | 176 | The width of the image, in pixels. |
| height | 144 | The height of the image, in pixels. |
| animated | false | Only relevant if the image is an animated GIF. Note that the video channel will remain locked while the animation runs. |
| windowed | false | When this is set to true, the user can zoom/pan/tilt through the image with DTMF keys. The image is initially shown at the smallest zoom that fills the screen while preserving its aspect ratio. 5 zooms in, 0 zooms out, 4 and 6 pan left/right, 2 and 8 move up/down, and 1, 3, 7, and 9 move diagonally. Note that these override any DTMF links, so to exit this page, links using the # or * keys must be specified. Also, once a windowed image is loaded, the video channel will remain permanently locked. |
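A windowed image page could be laid out roughly as follows. The file names map.jpg and menu.xml are hypothetical; the links are placed first as a precaution, so that they have been processed before the image takes over the digit keys.

```xml
<vixml>
  <!-- # and * are not used by the windowed viewer, so they can serve as exits -->
  <link keys="#" url="menu.xml" />
  <link keys="*" url="system://hangup" />
  <!-- The user can now zoom and pan with the digit keys -->
  <image src="map.jpg" windowed="true" />
</vixml>
```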
text
The text tag displays text on the screen. The text is contained
within a bounding box, and when the width of the text exceeds the width
of the bounding box it wraps to the next row.
Note that if the text contains the HTML special characters &,
<, or >, or a character value greater than 127,
then those characters will have to be replaced with HTML escape
sequences. The relevant sequences are &amp;, &lt;,
&gt;, and a numeric character reference (&#nnn;, where nnn is the
character's decimal code) respectively.
| Attribute | Default value | Description |
| wordWrap | default word wrap | The text wrapping policy. See the defaults object for more details. |
| font | default font | The font for the text. See the defaults object for more details. |
| fontSize | default font size | The font size. See the defaults object for more details. |
| fontColour | default font colour | The font colour. See the defaults object for more details. |
| x | 0 | The x position, in pixels, of the left edge of the text bounding box. |
| y | 0 | The y position, in pixels, of the top edge of the text bounding box. Position zero is at the top of the screen, while 144 is at the bottom. Note that the y position determines the baseline for the text, so for top-aligned text at position zero only the descenders will be visible. |
| width | 176 | The width of the bounding box, in pixels. |
| height | 144 | The height of the bounding box, in pixels. |
| halign | left | The horizontal alignment of text within the bounding box. Valid alignments are left, centre, and right. |
| valign | top | The vertical alignment of text within the bounding box. Valid alignments are top, centre, and bottom. |
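The sketch below shows two text objects with explicit bounding boxes and escaped special characters. The wording and coordinates are made up for illustration; y is set to 16 for the first object because the y position acts as the baseline of the first row.

```xml
<vixml>
  <defaults bgcolour="white" />
  <!-- Centred bold heading; the ampersand must be written as &amp; -->
  <text x="0" y="16" width="176" height="32" halign="centre" font="bold">
    Terms &amp; Conditions
  </text>
  <!-- Word-wrapped body text; the less-than sign must be written as &lt; -->
  <text x="8" y="48" width="160" height="96" wordWrap="true">
    Calls cost &lt; $1 per minute. Press 1 to accept.
  </text>
</vixml>
```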
speech
The speech tag causes the text to be spoken in a synthesized
voice.
| Attribute | Default value | Description |
| voice | default voice | Supported voices are kal_diphone (American English male), ked_diphone (American English male), cmu_us_awb_arctic_hts (Scottish English male), cmu_us_jmk_arctic_hts (Canadian English male), cmu_us_rms_arctic_hts (American English male), cmu_us_bdl_arctic_hts (American English male), cmu_us_slt_arctic_hts (American English female), and cmu_us_clb_arctic_hts (American English female). Note that this setting will persist for the duration of the call. |
link
The link tag is used to specify links to other VIXML documents.
These are triggered by DTMF key presses. Note that the VIXML
interpreter goes through the list of objects sequentially, so links
should appear at the start of the script if they are to be active the
whole time.
| Attribute | Default value | Description |
| keys | | One or more DTMF keys that will activate this link. |
| url | | The URL that the VIXML interpreter will go to when the link is activated. |
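A simple menu page might be arranged as in the sketch below, with the links ahead of the prompt so that they are already active while it is displayed and spoken. The target file names are hypothetical, and writing several keys as one string (keys="0#") is an assumption based on the format of the validKeys attribute of the input object.

```xml
<vixml>
  <!-- Links come first so they are active for the whole page -->
  <link keys="1" url="balance.xml" />
  <link keys="2" url="topup.xml" />
  <link keys="0#" url="operator.xml" />
  <text x="0" y="16" wordWrap="true">1 Balance, 2 Top up, 0 Operator</text>
  <speech>Press 1 for your balance, 2 to top up, or 0 for an operator</speech>
</vixml>
```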
timeout
The timeout tag will cause the interpreter to pause for the
specified number of milliseconds. If there is a URL specified, it will
then attempt to jump to that page.
If the background attribute is set to true, the interpreter
will immediately jump to the next command and run the timeout in the
background. When the timeout expires, the url (if specified) will
be called in background mode. VIXML scripts called in background
mode do not interfere with the main application, but run independently.
They are useful for playing periodic beeps and hanging up calls at a
pre-scheduled time. Note that background mode commands that try to
display visual information are ignored. If a timeout command is
executed during background mode with its background attribute set
to false, its URL will take control from the main application.
| Attribute | Default value | Description |
| delay | 0 | The delay in milliseconds. Must be greater than zero. |
| url | | The URL that the VIXML interpreter jumps to when the timeout expires. |
| background | false | When this is set to true, the url will be run in the background. |
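The two timeout modes can be combined on one page, as in this sketch. The delays and the file names next.xml and menu.xml are illustrative, and using system://hangup as a background timeout URL is assumed to work on the basis of the "hanging up calls at a pre-scheduled time" use case mentioned above.

```xml
<vixml>
  <!-- Background timeout: the interpreter moves on at once; the call is hung up 5 minutes later -->
  <timeout delay="300000" url="system://hangup" background="true" />
  <link keys="1" url="next.xml" />
  <speech>Press 1 to continue</speech>
  <!-- Foreground timeout: pause here for 10 seconds, then reload the menu -->
  <timeout delay="10000" url="menu.xml" />
</vixml>
```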
input
The input tag is used to collect a sequence of DTMF key presses
that will be passed to another VIXML script via HTTP GET. Note that
the destination URLs can contain variables (e.g.
http://www.vivatel.com.au/scripts/test.php?user=mkwan), in which
case the input variable and its value will be appended to the URL.
| Attribute | Default value | Description |
| name | | The name of the variable. |
| validKeys | 0123456789 | The DTMF keys that are accepted as valid input. Note that these keys can be overridden by other key attributes below. |
| deleteKeys | | When these DTMF keys are pressed, the last input key value will be deleted. Conceptually, these are backspace keys. |
| clearKeys | | When these DTMF keys are pressed, all input entered so far will be erased. |
| acceptLength | 0 | When this value is non-zero, the input will be automatically accepted when it reaches the specified length. For example, a value of 4 could be used when requesting a credit card expiry date. |
| acceptKeys | | When these keys are pressed, the input will be accepted. |
| acceptURL | | When the input is accepted, the interpreter will jump to this URL, passing through the input variable and value as an HTTP GET parameter. |
| cancelKeys | | When these keys are pressed, the input will be cancelled. |
| cancelURL | | When the input is cancelled, the interpreter will jump to this URL. Note that the variable and value entered will still be passed through. |
| timeoutDelay | 0 | The timeout delay in milliseconds. The counter is reset whenever a DTMF key is pressed. |
| timeoutURL | | When the timeout counter expires, the interpreter will jump to this URL. Note that the variable and value entered will still be passed through. |
| visible | false | Whether the entered DTMF values are displayed on the screen. If true, the attributes below determine how they will be displayed. |
| wordWrap | default word wrap | The text wrapping policy. See the defaults object for more details. |
| font | default font | The font for the visible text. See the defaults object for more details. |
| fontSize | default font size | The font size. See the defaults object for more details. |
| fontColour | default font colour | The font colour. See the defaults object for more details. |
| x | 0 | The x position of the text bounding box. See the text object for more details. |
| y | 0 | The y position of the text bounding box. See the text object for more details. |
| width | 176 | The width of the text bounding box. See the text object for more details. |
| height | 144 | The height of the text bounding box. See the text object for more details. |
| halign | left | The horizontal alignment of the text. See the text object for more details. |
| valign | top | The vertical alignment of the text. See the text object for more details. |
| passwordChar | | If set, display this character instead of the keys entered by the user. This can be used to conceal passwords. |
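As a sketch of a variable-length entry, the page below collects a PIN that is terminated with the # key and masked on screen. The script URL, the variable name pin, and the key assignments are assumptions made for this example.

```xml
<vixml>
  <text x="0" y="16" wordWrap="true">Enter your PIN, then press #</text>
  <speech>Please enter your PIN followed by the hash key</speech>
  <!-- # accepts, * deletes the last digit; each digit is shown as an asterisk -->
  <input name="pin" acceptKeys="#" acceptURL="http://www.mysite.com/login.php?user=mkwan"
         deleteKeys="*" timeoutDelay="15000" timeoutURL="http://www.mysite.com/loginTimeout.php"
         visible="true" passwordChar="*" y="64" halign="centre" />
</vixml>
```

If the caller enters 1234 and presses #, the interpreter would be expected to request the acceptURL with pin=1234 appended to the existing user=mkwan variable, along with the usual duration and session variables.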
stream
The stream tag is used to play an audio stream, a video stream,
both an audio stream and a video stream, or an audio-video stream. Note
that streams will be buffered for a quarter second before playing. If
both an audio and video stream are specified, they will be synchronized.
Audio streams can be read from wav files (MIME type
audio/x-wav), mp3 files (MIME type audio/mpeg),
and raw format au files (MIME type audio/basic). Note
that the raw format is 16-bit signed 16kHz samples in little-endian
order, and is specific to this application. This is not the
Sun Microsystems 8kHz u-law format.
| Attribute | Default value | Description |
| asrc | | The URL of the audio stream. Supported formats are wav, mp3, and raw audio. |
| vsrc | | The URL of the video stream. Supported format is QCIF mpeg4. |
| avsrc | | The URL of the audio-video stream. |
| bytesPerSecond | | The playback speed of the video stream. Only applies to QCIF mpeg4 streams. |
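A page that plays separate audio and video streams together might look like the sketch below. Both media URLs are hypothetical, and the file extension of the video clip is a guess, since this document only states that the video must be QCIF mpeg4.

```xml
<vixml>
  <!-- Let the caller skip the clip -->
  <link keys="#" url="menu.xml" />
  <!-- Audio and video are given separately, so they will be synchronized -->
  <stream asrc="http://www.mysite.com/media/promo.mp3"
          vsrc="http://www.mysite.com/media/promo.m4v" />
</vixml>
```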
capture
The capture tag is used to capture the stream of audio/video
being sent to/from the remote phone. It is usually used to capture
audio and video being sent from the phone, but can also record what
the phone is receiving, or the audio in both directions. The only
limitation is that transcoded video being sent to the phone cannot
be captured because it is never available as decoded frames within
the session.
When the capture tag is encountered, the address of the remote
capture script specified in the url attribute is called with the
following HTTP GET parameters.
| url | The URL from which to read the captured inbound stream. |
| contentType | The content type of the captured inbound stream. |
When called, the remote capture script should immediately start reading
from the address passed through in the url parameter. If it fails
to connect to the URL within 60 seconds the capture will be aborted.
By default the captured inbound stream is encoded as MPEG video, but
different formats, including audio-only, can be requested with the
contentType attribute.
Normally when capture begins, all input is buffered then sent as fast
as the connection will allow once the remote capture script connects
and starts downloading the stream. This is ideal for recorded-message
applications where all input must be captured. However, for applications
where the stream must be processed in real time, setting the realtime
attribute to true will disable buffering and ensure that stream data will
be sent as soon as it is received from the remote handset. Note that this
will result in the loss of some data if the capture script is slow to
connect or the connection is too slow to support the stream.
Capture ends when another capture tag is encountered with a different URL
or the call ends. If the new capture URL is not an empty string, then a new
capture will begin immediately to that URL.
Note that the captured data does not always strictly conform to the
format of the specified content type. Many audio and video formats
have headers that specify the length of the clip, but because the data
is streamed the header must be sent before the length is known, so dummy
values are used. Some players can cope with this, others can't. There may
be some trial-and-error involved in finding a format that works.
| Attribute | Default value | Description |
| url | | The URL of the remote capture script. |
| contentType | video/mpeg | The format of the captured stream. Supported formats are video/mpeg, video/x-msvideo, video/x-ms-wmv, audio/amr, audio/mpeg, audio/x-wav, and audio/x-pn-realaudio. |
| realtime | false | Whether the stream is captured in real time, that is, sent unbuffered as soon as it is received from the handset. |
| audioDirection | fromHandset | Which audio stream to capture. Valid options are fromHandset, toHandset, both, and none. When the both option is selected, both sides of the conversation are combined and captured. |
| videoDirection | fromHandset | Which video stream to capture. Valid options are fromHandset, toHandset, and none. |
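A recorded-message page could be sketched as follows. The capture script URL and the file name done.xml are hypothetical; the default buffered mode is kept because all input should be captured.

```xml
<vixml>
  <speech>Please leave a message, then press hash</speech>
  <!-- Capture audio from the handset only, delivered as a wav stream -->
  <capture url="http://www.mysite.com/record.php" contentType="audio/x-wav"
           audioDirection="fromHandset" videoDirection="none" />
  <link keys="#" url="done.xml" />
  <!-- Stop waiting after 30 seconds -->
  <timeout delay="30000" url="done.xml" />
</vixml>
```

Following the rule above, the done.xml page would end the recording with a capture object whose url attribute is an empty string.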
call
The call tag is used to link a call to another handset, a
SIP-capable softphone, or a streaming media source such as a webcam.
A number of URL callbacks are provided to handle the various outcomes.
When the phone parameter is specified, an outbound call is made
to that phone number. This call is charged at the usual outbound rate.
When that call is answered, the two phone calls are tied together like
a regular person-to-person call.
Alternatively, the conference parameter can be used to tie two
or more existing calls together. The parameter can be any string, and
when two calls specify the same value they will be linked together.
Note that the engagedURL and rejectedURL parameters are
not used in this case.
A third option is to specify the phoneConference parameter. This
will initiate an outbound call to the phone number, unless a call to that
number is already in progress. In that case, it will connect immediately
to that call in conference mode.
Note that when a remote party is called using the phoneConference
parameter they will always be the first party in the resulting conference.
This differs from the phone parameter, where the caller is always
the first party. This is important when the conference type is
Broadcast or OneToMany.
Five different conference types are possible, which affect the behaviour
of video in a conference with three or more parties.
- Normal. The first party to connect will send video to all
the other parties. The second party to connect will send to the first.
- OneToMany. Video works the same as Normal mode.
- FollowSpeaker. The current speaker will send video to all
the other parties. The previous speaker will send to the current one.
- SplitScreen. The screen is divided up to show all other
parties simultaneously, up to a maximum of 31.
- Broadcast. The first party to connect will send video to all
the other parties. Nothing will be sent to the first party. This is
the recommended conference type when the first party is a streaming
media URL.
In Broadcast mode, audio is sent from the first party to all other
parties. No audio is sent to the first party or between the other parties.
OneToMany mode is similar, except that audio from the second party
is sent to the first party. In all other modes audio will be shared among
participants. They will hear audio from all parties except themselves.
When a participant leaves the conference, all parties move up one place
unless there is only one participant left, in which case the conference
ends. The conference also ends if the type is OneToMany or
Broadcast and the first party leaves.
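Joining a named conference might look like the sketch below. The conference name and callback URLs are hypothetical, and it is an assumption that every participant specifies the same conferenceType. The engagedURL and rejectedURL attributes are left out because they are not used with the conference parameter.

```xml
<vixml>
  <speech>Joining the conference</speech>
  <!-- Every call that names the same conference string is linked together -->
  <call conference="support-room-1" conferenceType="SplitScreen" type="video"
        connectedURL="http://www.mysite.com/inConference.xml"
        failedURL="http://www.mysite.com/conferenceFailed.xml" />
</vixml>
```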
A call ends when another call tag is encountered with a different phone
number or conference, or the conference ends. If the new phone number
or conference is not an empty string, then a new call will begin immediately.
Most of the URL callbacks are self-explanatory, except for
preambleURL. This is called when the call connects and, unlike
other VIXML scripts, is played to the other party, not the caller.
It plays until it reaches the end, then connectedURL is called
and the call proceeds as usual. This callback is designed for situations
when the called party needs to be briefed about the caller before speaking
to them.
Note that the VIXML preamble script should only play audio and display
static images. Pre-encoded mpeg4 video clips will not work, and
other video formats will not play reliably.
| Attribute | Default value | Description |
| phone | | The phone number to call. This may be a phone number (e.g. 0401234567), a SIP URL (e.g. sip:123@voip.com), or a streaming media URL (e.g. http://www.whatever.com/webcam.asf). |
| phoneConference | | The phone number to call, or to join in an existing conference. |
| conference | | The name of the conference to connect to. |
| conferenceType | Normal | The conference type. Valid values are Normal, OneToMany, FollowSpeaker, SplitScreen, and Broadcast. |
| type | video | The call type. Valid values are video, voice, and silent. The silent option should be used when calling a streaming media source that contains no audio. |
| fromPhone | | The phone number to call from. By default, calls are made with caller-ID suppressed, but this parameter allows the calling number to be explicitly set. When a phone number is being called, valid values are in the range 03991093xx. When a SIP address is being called, the calling address will be sip:fromPhone@210.11.56.34 |
| successURL | | The URL to jump to when the call has successfully completed. |
| connectedURL | | The URL to jump to when the call has connected. |
| engagedURL | | The URL to jump to if the call is engaged. |
| failedURL | | The URL to jump to if the phone number could not be dialled. Possible reasons include a non-existent or unreachable phone number. |
| rejectedURL | | The URL to jump to if the call is rejected. Sometimes calls are rejected by the recipient because they don't want to answer the call; sometimes they are rejected by the network because the phone is switched off or the recipient is not able to accept video calls. |
| preambleURL | | If specified, this URL is called when the call connects. The contents of the URL are played only to the other party, and are not visible to the caller. Once the script ends, connectedURL is called and the call proceeds as normal. |
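An outbound video call with a callback for each outcome could be written roughly as below. All callback URLs are hypothetical; the fromPhone value is simply one number taken from the 03991093xx range quoted above.

```xml
<vixml>
  <speech>Connecting you to an operator</speech>
  <call phone="0401234567" type="video" fromPhone="0399109300"
        preambleURL="http://www.mysite.com/preamble.xml"
        connectedURL="http://www.mysite.com/connected.xml"
        engagedURL="http://www.mysite.com/busy.xml"
        rejectedURL="http://www.mysite.com/busy.xml"
        failedURL="http://www.mysite.com/failed.xml"
        successURL="http://www.mysite.com/thanks.xml" />
</vixml>
```

The preamble page here should contain only audio and static images, in line with the restriction described above.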
application
The application tag is used to run applications written in
third-party scripting languages. Currently supported formats are
Java and Flash.
| Attribute | Default value | Description |
| type | | The application type: Java or Flash. |
| src | | The URL of the application's source. |
Examples
Here is a simple script that requests a 16-digit credit card number.
<vixml>
<defaults bgcolour="#e0e0e0" beepDuration="250" fontColour="green" />
<text font="bold" x="0" y="16" halign="centre" wordWrap="true">
Enter your credit card number
</text>
<text x="0" y="56" halign="centre" wordWrap="true">Press # to delete, * to cancel</text>
<speech voice="cmu_us_slt_arctic_hts">Please type in your credit card number</speech>
<input name="ccnumber" acceptLength="16" acceptURL="http://www.mysite.com/cc.php"
deleteKeys="#" cancelKeys="*" cancelURL="http://www.mysite.com/ccCancel.php?reason=user"
timeoutDelay="20000" timeoutURL="http://www.mysite.com/ccCancel.php?reason=timout"
visible="true" y="112" halign="left" />
</vixml>
Here is the same script generated using the VIXML schema.
<?xml version="1.0" encoding="UTF-8"?>
<tns:vixml xmlns:tns="http://www.vivatel.com.au/VIXMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.vivatel.com.au/VIXMLSchema VIXMLSchema.xsd ">
<tns:defaults bgcolour="#e0e0e0" beepDuration="250" fontColour="green" />
<tns:text font="bold" x="0" y="16" halign="centre" wordWrap="true">
Enter your credit card number
</tns:text>
<tns:text x="0" y="56" halign="centre" wordWrap="true">Press # to delete, * to cancel</tns:text>
<tns:speech voice="cmu_us_slt_arctic_hts">Please type in your credit card number</tns:speech>
<tns:input name="ccnumber" acceptLength="16" acceptURL="http://www.mysite.com/cc.php"
deleteKeys="#" cancelKeys="*" cancelURL="http://www.mysite.com/ccCancel.php?reason=user"
timeoutDelay="20000" timeoutURL="http://www.mysite.com/ccCancel.php?reason=timout"
visible="true" y="112" halign="left" />
</tns:vixml>