Javax Speech Download


Chapter 5

A link from the Desktop Java page for the Java Speech API leads to the SourceForge page for FreeTTS. The FreeTTS FAQ says: 'The Java Speech API (JSAPI) is not part of the JDK and Sun does not ship an implementation of JSAPI. Instead, we work with third party speech companies to encourage the availability of multiple implementations.'

Speech input using a microphone, and translation of speech to text: for external microphones it is advisable to identify the microphone explicitly in the program to avoid glitches. Type lsusb in a terminal and a list of connected devices will appear; the microphone's entry would look something like USB Device 0x46d:0x825: Audio (hw:1, 0).

A speech synthesizer is a speech engine that converts text to speech. The javax.speech.synthesis package defines the Synthesizer interface to support speech synthesis plus a set of supporting classes and interfaces. The basic functional capabilities of speech synthesizers, some of the uses of speech synthesis and some of the limitations of speech synthesizers are described in Section 2.1.

As a type of speech engine, much of the functionality of a Synthesizer is inherited from the Engine interface in the javax.speech package and from other classes and interfaces in that package. The javax.speech package and generic speech engine functionality are described in Chapter 4.

This chapter describes how to write Java applications and applets that use speech synthesis. We begin with a simple example, and then review the speech synthesis capabilities of the API in more detail.

  • 'Hello World!': a simple example of speech synthesis
  • Synthesizer as an Engine: the functionality a synthesizer inherits from the generic Engine interface
  • Speaking Text: the methods for submitting plain text and JSML text to a synthesizer
  • Speech Output Queue: managing the queue of objects to be spoken
  • Monitoring Speech Output: the events issued as queued objects are spoken
  • Synthesizer Properties: run-time control of voices and prosody

5.1 'Hello World!'

The following code shows a simple use of speech synthesis to speak the string 'Hello World'.
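
A minimal version, assembled from the four steps described below, might look like this (the class name is illustrative):

    import java.util.Locale;
    import javax.speech.Central;
    import javax.speech.synthesis.Synthesizer;
    import javax.speech.synthesis.SynthesizerModeDesc;

    public class HelloWorld {
        public static void main(String[] args) {
            try {
                // Create: request a synthesizer that speaks English.
                Synthesizer synth = Central.createSynthesizer(
                        new SynthesizerModeDesc(Locale.ENGLISH));

                // Allocate and Resume: acquire engine resources and
                // put the synthesizer in the RESUMED state.
                synth.allocate();
                synth.resume();

                // Generate: queue a plain text string for speaking.
                synth.speakPlainText("Hello World!", null);

                // Deallocate: wait for the queue to empty, then
                // free the synthesizer's resources.
                synth.waitEngineState(Synthesizer.QUEUE_EMPTY);
                synth.deallocate();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }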

This example illustrates the four basic steps which all speech synthesis applications must perform. Let's examine each step in detail.

  • Create: The Central class of javax.speech package is used to obtain a speech synthesizer by calling the createSynthesizer method. The SynthesizerModeDesc argument provides the information needed to locate an appropriate synthesizer. In this example a synthesizer that speaks English is requested.
  • Allocate and Resume: The allocate and resume methods prepare the Synthesizer to produce speech by allocating all required resources and putting it in the RESUMED state.
  • Generate: The speakPlainText method requests the generation of synthesized speech from a string.
  • Deallocate: The waitEngineState method blocks the caller until the Synthesizer is in the QUEUE_EMPTY state - until it has finished speaking the text. The deallocate method frees the synthesizer's resources.

5.2 Synthesizer as an Engine

The basic functionality provided by a Synthesizer is speaking text, management of a queue of text to be spoken and producing events as these functions proceed. The Synthesizer interface extends the Engine interface to provide this functionality.

The following is a list of the functionality that the javax.speech.synthesis package inherits from the javax.speech package and outlines some of the ways in which that functionality is specialized.

  • The properties of a speech engine defined by the EngineModeDesc class apply to synthesizers. The SynthesizerModeDesc class adds information about synthesizer voices. Both EngineModeDesc and SynthesizerModeDesc are described in Section 4.2.
  • Synthesizers are searched, selected and created through the Central class in the javax.speech package as described in Section 4.3. That section explains default creation of a synthesizer, synthesizer selection according to defined properties, and advanced selection and creation mechanisms.
  • Synthesizers inherit the basic state system of an engine from the Engine interface. The basic engine states are ALLOCATED, DEALLOCATED, ALLOCATING_RESOURCES and DEALLOCATING_RESOURCES for allocation state, and PAUSED and RESUMED for audio output state. The getEngineState method and other methods are inherited for monitoring engine state. An EngineEvent indicates state changes. The engine state systems are described in Section 4.4. (The QUEUE_EMPTY and QUEUE_NOT_EMPTY states added by synthesizers are described in Section 5.4.)
  • Synthesizers produce all the standard engine events (see Section 4.5). The javax.speech.synthesis package also extends the EngineListener interface as SynthesizerListener to provide events that are specific to synthesizers.
  • Other functionality inherited as an engine includes the runtime properties (see Section 4.6.1 and Section 5.6), audio management (see Section 4.6.2) and vocabulary management (see Section 4.6.3).

5.3 Speaking Text

The Synthesizer interface provides four methods for submitting text to a speech synthesizer to be spoken. These methods differ according to the formatting of the provided text and the type of object from which the text is produced. All four methods share one feature: they allow a listener to be passed that will receive notifications as output of the text proceeds.

The simplest method - speakPlainText - takes text as a String object. This method is illustrated in the 'Hello World!' example at the beginning of this chapter. As the method name implies, this method treats the input text as plain text without any of the formatting described below.

The remaining three speaking methods - all named speak - treat the input text as being specially formatted with the Java Speech Markup Language (JSML). JSML is an application of XML (eXtensible Markup Language), a data format for structured document interchange on the internet. JSML allows application developers to annotate text with structural and presentation information to improve the speech output quality. JSML is defined in detail in a separate technical document, 'The Java Speech Markup Language Specification.'

The three speak methods retrieve the JSML text from different Java objects. The three methods are:
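
In outline, their signatures in the Synthesizer interface are approximately as follows (exception lists abbreviated):

    void speak(Speakable text, SpeakableListener listener)
        throws JSMLException;
    void speak(URL jsmlURL, SpeakableListener listener)
        throws JSMLException, MalformedURLException, IOException;
    void speak(String jsmlText, SpeakableListener listener)
        throws JSMLException;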

The first version accepts an object that implements the Speakable interface. The Speakable interface is a simple interface defined in the javax.speech.synthesis package that contains a single method: getJSMLText. This method should return a String containing text formatted with JSML.

Virtually any Java object can implement the Speakable interface by implementing the getJSMLText method. For example, the cells of a spreadsheet, the text of an editing window, or extended AWT classes could all implement the Speakable interface.

The Speakable interface is intended to provide the spoken version of the toString method of the Object class. That is, Speakable allows an object to define how it should be spoken. For example:
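
A sketch of such a class (the class name, fields and JSML text are illustrative):

    import javax.speech.synthesis.Speakable;

    public class StockHolding implements Speakable {
        private final String symbol;
        private final int shares;

        public StockHolding(String symbol, int shares) {
            this.symbol = symbol;
            this.shares = shares;
        }

        // The spoken equivalent of toString: return JSML text
        // describing this object.
        public String getJSMLText() {
            return "You hold <EMP>" + shares + "</EMP> shares of "
                    + symbol + ".";
        }
    }

It could then be spoken with synth.speak(new StockHolding("Acme", 1500), null).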

The second variant of the speak method allows JSML text to be loaded from a URL to be spoken. This allows JSML text to be loaded directly from a web site and be spoken.

The third variant of the speak method takes a JSML string. Its use is straightforward.

For each of the three speak methods that accept JSML formatted text, a JSMLException is thrown if any formatting errors are detected. Developers familiar with editing HTML documents should note that XML is stricter about syntax; it is generally advisable to check XML documents (such as JSML) with XML tools before publishing them.

The following sections describe the speech output queue onto which objects are placed with calls to the speak methods, and the mechanisms for monitoring and managing that queue.

5.4 Speech Output Queue

Each call to the speak and speakPlainText methods places an object onto the synthesizer's speech output queue. The speech output queue is a FIFO queue: first-in-first-out. This means that objects are spoken in the order in which they are received.

The item at the top of the queue is the head of the queue: it is the item currently being spoken, or the item that will be spoken next when a paused synthesizer is resumed.

The Synthesizer interface provides a number of methods for manipulating the output queue. The enumerateQueue method returns an Enumeration object containing a SynthesizerQueueItem for each object on the queue. The first object in the enumeration is the top of queue. If the queue is empty the enumerateQueue method returns null.

Each SynthesizerQueueItem in the enumeration contains four properties. Each property has an accessor method:

  • getSource returns the source object for the queue item. The source is the object passed to the speak or speakPlainText method: a Speakable object, a URL or a String.
  • getText returns the text representation for the queue item. For a Speakable object it is the String returned by the getJSMLText method. For a URL it is the String loaded from that URL. For a string source, it is that string object.
  • isPlainText allows an application to distinguish between plain text and JSML objects. If this method returns true the string returned by getText is plain text.
  • getSpeakableListener returns the listener object to which events associated with this item will be sent. If no listener was provided in the call to speak or speakPlainText then this method returns null.
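
Putting these accessors together, a sketch that prints the current contents of the queue might look like this (assuming synth is an ALLOCATED Synthesizer):

    import java.util.Enumeration;
    import javax.speech.synthesis.Synthesizer;
    import javax.speech.synthesis.SynthesizerQueueItem;

    static void printQueue(Synthesizer synth) {
        Enumeration queue = synth.enumerateQueue();
        if (queue == null) {
            System.out.println("(queue is empty)");
            return;
        }
        while (queue.hasMoreElements()) {
            SynthesizerQueueItem item =
                    (SynthesizerQueueItem) queue.nextElement();
            String kind = item.isPlainText() ? "plain text" : "JSML";
            System.out.println(kind + ": " + item.getText());
        }
    }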

The state of the queue is an explicit state of the Synthesizer. The Synthesizer interface defines a state system for QUEUE_EMPTY and QUEUE_NOT_EMPTY. Any Synthesizer in the ALLOCATED state must be in one and only one of these two states.

The QUEUE_EMPTY and QUEUE_NOT_EMPTY states are parallel states to the PAUSED and RESUMED states. These two state systems operate independently as shown in Figure 5-1 (an extension of Figure 4-2).

The SynthesizerEvent class extends the EngineEvent class with the QUEUE_UPDATED and QUEUE_EMPTIED events which indicate changes in the queue state.

The 'Hello World!' example shows one use of the queue status. It calls the waitEngineState method to test when the synthesizer returns to the QUEUE_EMPTY state. This test determines when the synthesizer has completed output of all objects on the speech output queue.

The queue status and transitions in and out of the ALLOCATED state are linked. When a Synthesizer is newly ALLOCATED it always starts in the QUEUE_EMPTY state since no objects have yet been placed on the queue. Before a synthesizer is deallocated (before leaving the ALLOCATED state) a synthesizer must return to the QUEUE_EMPTY state. If the speech output queue is not empty when the deallocate method is called, all objects on the speech output queue are automatically cancelled by the synthesizer. By contrast, the initial and final states for PAUSED and RESUMED are not defined because the pause/resume state may be shared by multiple applications.

The Synthesizer interface defines three cancel methods that allow an application to request that one or more objects be removed from the speech output queue:
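
Their signatures are approximately:

    void cancel();              // cancel the item at the top of the queue
    void cancel(Object source); // cancel a specified item on the queue
    void cancelAll();           // cancel all items, emptying the queue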

The first of these three methods cancels the object at the top of the speech output queue. If that object is currently being spoken, the speech output is stopped and then the object is removed from the queue. The SpeakableListener for the item receives a SPEAKABLE_CANCELLED event. The SynthesizerListener receives a QUEUE_UPDATED event, unless the item was the last one on the queue in which case a QUEUE_EMPTIED event is issued.

The second cancel method requires that a source object be specified. The object should be one of the items currently on the queue: a Speakable, a URL, or a String. The actions are much the same as for the first cancel method except that if the item is not top-of-queue, then speech output is not affected.

The final cancel method - cancelAll - removes all items from the speech output queue. Each item receives a SPEAKABLE_CANCELLED event and the SynthesizerListener receives a QUEUE_EMPTIED event. The SPEAKABLE_CANCELLED events are issued to items in the order of the queue.

5.5 Monitoring Speech Output

All the speak and speakPlainText methods accept a SpeakableListener as the second input parameter. To request notification of events as the speech object is spoken, an application provides a non-null listener.

Unlike a SynthesizerListener that receives synthesizer-level events, a SpeakableListener receives events associated with output of individual text objects: output of Speakable objects, output of URLs, output of JSML strings, or output of plain text strings.

The mechanism for attaching a SpeakableListener through the speak and speakPlainText methods is slightly different from the normal attachment and removal of listeners. There are, however, addSpeakableListener and removeSpeakableListener methods on the Synthesizer interface. These add and remove methods allow listeners to be provided to receive notifications of events associated with all objects being spoken by the Synthesizer.

The SpeakableEvent class defines eight events that indicate progress of spoken output of a text object. For each of these eight event types, there is a matching method in the SpeakableListener interface. For convenience, a SpeakableAdapter implementation of the SpeakableListener interface is provided with trivial (empty) implementations of all eight methods.

The normal sequence of events as an object is spoken is as follows:

  • TOP_OF_QUEUE: the object has reached the top of the speech output queue and is the next object to be spoken.
  • SPEAKABLE_STARTED: audio output has commenced for this text object.
  • WORD_STARTED: audio output has reached the start of a word. The event includes information on the location of the word in the text object. This event is issued for each word in the text object. This event is often used to highlight words in a text document as they are spoken.
  • MARKER_REACHED: audio output has reached the location of a MARKER tag explicitly embedded in the JSML text. The event includes the marker text from the tag. For container JSML elements, a MARKER_REACHED event is issued at both the start and end of the element. MARKER_REACHED events are not produced for plain text because JSML formatting is required to add the markers.
  • SPEAKABLE_ENDED: audio output has been completed and the object has been removed from the speech output queue.

The remaining event types are modifications to the normal event sequence.

  • SPEAKABLE_PAUSED: the Synthesizer has been paused so audio output of this object is paused. This event is only issued to the text object at the top of the speech output queue.
  • SPEAKABLE_RESUMED: the Synthesizer has been resumed so audio output of this object has resumed. This event is only issued to the text object at the top of the speech output queue.
  • SPEAKABLE_CANCELLED: the object has been removed from the speech output queue. Any or all objects in the speech output queue may be removed by one of the cancel methods (described in Section 5.4).

The following is an example of the use of the SpeakableListener interface to monitor the progress of speech output. It shows how a training application could synchronize speech synthesis with animation.

It places two JSML string objects onto the output queue and requests notifications to itself; the two strings are then spoken in sequence.
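
A sketch of this pattern (the class name and the two strings are illustrative):

    import javax.speech.synthesis.SpeakableAdapter;
    import javax.speech.synthesis.SpeakableEvent;
    import javax.speech.synthesis.Synthesizer;

    public class TrainingNarrator extends SpeakableAdapter {
        String intro = "Welcome to the training course.";
        String lesson = "Lesson <EMP>one</EMP>: getting started.";

        public void narrate(Synthesizer synth) throws Exception {
            // Queue both JSML strings, naming this object as listener.
            synth.speak(intro, this);
            synth.speak(lesson, this);
        }

        // Called as audio output begins for each queued object.
        public void speakableStarted(SpeakableEvent e) {
            if (e.getSource() == intro) {
                // start the introduction animation
            } else if (e.getSource() == lesson) {
                // start the lesson animation
            }
        }
    }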

At the start of the output of each string, the speakableStarted method will be called. By checking the source of the event, we determine which text is being spoken, so the appropriate animation code can be triggered.

5.6 Synthesizer Properties


The SynthesizerProperties interface extends the EngineProperties interface described in Section 4.6.1. The JavaBeans property mechanisms, the asynchronous application of property changing, and the property change event notifications are all inherited engine behavior and are described in that section.

The SynthesizerProperties object is obtained by calling the getEngineProperties method (inherited from the Engine interface) or the getSynthesizerProperties method. Both methods return the same object instance, but the latter is more convenient since it is an appropriately cast object.

The SynthesizerProperties interface defines five synthesizer properties that can be modified during operation of a synthesizer to affect speech output.
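
The setter methods are outlined below; each may throw a java.beans.PropertyVetoException if the synthesizer rejects the requested value (signatures abbreviated):

    void setVoice(Voice voice);       // the speaking voice
    void setVolume(float volume);     // 0.0 (silence) to 1.0 (loudest)
    void setSpeakingRate(float wpm);  // output rate in words per minute
    void setPitch(float hertz);       // baseline pitch
    void setPitchRange(float hertz);  // variation around the baseline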


The voice property is used to control the speaking voice of the synthesizer. The set of voices supported by a synthesizer can be obtained by the getVoices method of the synthesizer's SynthesizerModeDesc object. Each voice is defined by a voice name, gender, age and speaking style. Selection of voices is described in more detail in Selecting Voices.

The remaining four properties control prosody. Prosody is a set of features of speech including the pitch and intonation, rhythm and timing, stress and other characteristics which affect the style of the speech. The prosodic features controlled through the SynthesizerProperties interface are:

  • Volume: a float value that is set on a scale from 0.0 (silence) to 1.0 (loudest).
  • Speaking rate: a float value indicating the speech output rate in words per minute. Higher values indicate faster speech output. Reasonable speaking rates depend upon the synthesizer and the current voice (voices may have different natural speeds). Speaking rate is also dependent upon the language because of different conventions for what constitutes a 'word'. For English, a typical speaking rate is around 200 words per minute.
  • Pitch: the baseline pitch is a float value given in Hertz. Different voices have different natural-sounding ranges of pitch. Typical male voices are between 80 and 180 Hertz; female pitches typically vary from 150 to 300 Hertz.
  • Pitch range: a float value indicating a preferred range for variation in pitch above the baseline setting. A narrow pitch range produces monotonous output, while a wide range produces a more lively voice. The pitch range is typically between 20% and 80% of the baseline pitch.

The following code shows how to increase the speaking rate for a synthesizer by 30 words per minute.
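
One way to write this (error handling omitted; setSpeakingRate may throw a PropertyVetoException):

    import javax.speech.synthesis.SynthesizerProperties;

    SynthesizerProperties props = synth.getSynthesizerProperties();
    props.setSpeakingRate(props.getSpeakingRate() + 30.0f);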

As with all engine properties, changes to synthesizer properties are not necessarily instant. The change should take effect as soon as the synthesizer can apply it. Depending on the underlying technology, a property change may take effect immediately, or at the next phoneme, word, phrase or sentence boundary, or at the beginning of output of the next item in the synthesizer's queue.

So that an application knows when the change has actually taken effect, the synthesizer generates a property change event for each call to a set method in the SynthesizerProperties interface.
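
A sketch of listening for such events, reusing the props object from the previous example (the printed format is illustrative):

    import java.beans.PropertyChangeEvent;
    import java.beans.PropertyChangeListener;

    props.addPropertyChangeListener(new PropertyChangeListener() {
        public void propertyChange(PropertyChangeEvent evt) {
            System.out.println(evt.getPropertyName() + ": "
                    + evt.getOldValue() + " -> " + evt.getNewValue());
        }
    });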

5.6.1 Selecting Voices

Most speech synthesizers are able to produce a number of voices. In most cases voices attempt to sound natural and human, but some voices may be deliberately mechanical or robotic.

The Voice class is used to encapsulate the four features that describe each voice: voice name, gender, age and speaking style. The voice name and speaking style are both String objects and the contents of those strings are determined by the synthesizer. Typical voice names might be 'Victor', 'Monica', 'Ahmed', 'Jose', 'My Robot' or something completely different. Speaking styles might include 'casual', 'business', 'robotic' or 'happy' (or similar words in other languages) but the API does not impose any restrictions upon the speaking styles. For both voice name and speaking style, synthesizers are encouraged to use strings that are meaningful to users so that they can make sensible judgements when selecting voices.

By contrast the gender and age are both defined by the API so that programmatic selection is possible. The gender of a voice can be GENDER_FEMALE, GENDER_MALE, GENDER_NEUTRAL or GENDER_DONT_CARE. Male and female are hopefully self-explanatory. Gender neutral is intended for voices that are not clearly male or female such as some robotic or artificial voices. The 'don't care' values are used when selecting a voice and the feature is not relevant.

The age of a voice can be AGE_CHILD (up to 12 years), AGE_TEENAGER (13-19), AGE_YOUNGER_ADULT (20-40), AGE_MIDDLE_ADULT (40-60), AGE_OLDER_ADULT (60+), AGE_NEUTRAL, and AGE_DONT_CARE.

Both gender and age are OR'able values for both applications and engines. For example, an engine could specify a voice as:
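
For instance, a hypothetical robotic voice suitable for either children or teenagers might be described as:

    Voice robot = new Voice("Robbie", Voice.GENDER_NEUTRAL,
            Voice.AGE_CHILD | Voice.AGE_TEENAGER, "robotic");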

In the same way that mode descriptors are used by engines to describe themselves and by applications to select from amongst available engines, the Voice class is used both for description and selection. The match method of Voice allows an application to test whether an engine-provided voice has suitable properties.

The following code shows the use of the match method to identify voices of a synthesizer that are either male or female voices and that are younger or middle adults (between 20 and 60). The SynthesizerModeDesc object may be one obtained through the Central class or through the getEngineModeDesc method of a created Synthesizer.
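
A sketch of this test (assuming synth is a created Synthesizer):

    import javax.speech.synthesis.SynthesizerModeDesc;
    import javax.speech.synthesis.Voice;

    SynthesizerModeDesc desc =
            (SynthesizerModeDesc) synth.getEngineModeDesc();
    Voice[] voices = desc.getVoices();

    // Required features: male or female, roughly 20 to 60 years old.
    Voice required = new Voice(null,
            Voice.GENDER_MALE | Voice.GENDER_FEMALE,
            Voice.AGE_YOUNGER_ADULT | Voice.AGE_MIDDLE_ADULT,
            null);

    for (int i = 0; i < voices.length; i++) {
        if (voices[i].match(required)) {
            System.out.println("Suitable voice: " + voices[i].getName());
        }
    }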

The Voice object can also be used in the selection of a speech synthesizer. The following code illustrates how to create a synthesizer with a young female Japanese voice.
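
Along these lines (the locale string and the choice of AGE_YOUNGER_ADULT for 'young' are assumptions):

    import java.util.Locale;
    import javax.speech.Central;
    import javax.speech.synthesis.Synthesizer;
    import javax.speech.synthesis.SynthesizerModeDesc;
    import javax.speech.synthesis.Voice;

    SynthesizerModeDesc required = new SynthesizerModeDesc();
    required.setLocale(new Locale("ja", ""));
    required.addVoice(new Voice(null, Voice.GENDER_FEMALE,
            Voice.AGE_YOUNGER_ADULT, null));
    Synthesizer synth = Central.createSynthesizer(required);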

5.6.2 Property Changes in JSML

In addition to control of speech output through the SynthesizerProperties interface, all five synthesizer properties can be controlled in JSML text provided to a synthesizer. The advantage of control through JSML text is that property changes can be finely controlled within a text document. By contrast, control of the synthesizer properties through the SynthesizerProperties interface is not appropriate for word-level changes but is instead useful for setting the default configuration of the synthesizer. Control of the SynthesizerProperties interface is often presented to the user as a graphical configuration window.

Applications that generate JSML text should respect the default settings of the user. To do this, relative settings of parameters such as pitch and speaking rate should be used rather than absolute settings.

For example, users with vision impairments often set the speaking rate extremely high - up to 500 words per minute - so high that most people cannot understand the synthesized speech. If a document applies an absolute speaking rate change (say, to 200 words per minute, which is fast for most users), such a user will be frustrated.

Changes made to the synthesizer properties through the SynthesizerProperties interface are persistent: they affect all succeeding speech output. Changes in JSML are explicitly localized (all property changes in JSML have both start and end tags).

5.6.3 Controlling Prosody

The prosody and voice properties can be used within JSML text to substantially improve the clarity and naturalness of the speech output. For example, one time to change prosodic settings is when providing new, important or detailed information. In this instance it is typical for a speaker to slow down, emphasise more words and often add extra pauses. Putting equivalent changes into synthetic speech will help a listener understand the message.

For example, in response to the question 'How many Acme shares do I have?', the answer might be 'You currently have 1,500 Acme shares.' The number will be spoken more slowly because it is new information. To represent this in JSML text the <PROS> element is used:
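
A relative rate change keeps the user's default rate as the baseline (the exact attribute value is illustrative; see the JSML specification):

    You currently have <PROS RATE="-20%">1,500</PROS> Acme shares.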

The following example illustrates how an email message header object can implement the Speakable interface and generate JSML text with prosodic controls to improve understandability.
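
A sketch of such a class (the names and the JSML markup are illustrative):

    import javax.speech.synthesis.Speakable;

    public class MailHeader implements Speakable {
        String sender;
        String subject;

        public MailHeader(String sender, String subject) {
            this.sender = sender;
            this.subject = subject;
        }

        // Slow down for the detailed fields so a listener can
        // absorb the new information.
        public String getJSMLText() {
            return "Message from <PROS RATE=\"-20%\">" + sender
                    + "</PROS>, subject: <PROS RATE=\"-20%\">"
                    + subject + "</PROS>";
        }
    }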

The JavaFX Software Development Kit (SDK) provides the command-line tools and technologies to develop expressive content for applications deployed to browsers, desktops, and mobile devices.

  • JavaFX Desktop runtime
  • JavaFX Mobile Emulator and runtime (Windows only)
  • JavaFX API documentation
  • Samples

The JavaFX SDK runs on Windows and Mac OS X. A beta release of the JavaFX SDK is provided for Ubuntu Linux and the OpenSolaris operating systems.

System Requirements

The system requirements for the JavaFX SDK, including the recommended version of the Java SE Development Kit (JDK), are listed in the JavaFX System Requirements document.

Installing the JavaFX SDK on Windows or Mac

  1. Download the latest JavaFX SDK installer file for Windows (an EXE extension) or Mac OS X (a DMG extension).
  2. Download older versions of the JavaFX SDK installer from the Previous Releases download page.
  3. After the download is complete, double-click the EXE or DMG file to run the installer. On the Mac OS X platform, double-click the open-box icon that appears after you start the installer.
  4. Complete the steps in the installation wizard.
  5. Note the default installation location:
  • For Windows. The default installation location is C:\Program Files\JavaFX\javafx-sdk-version.
  • For Mac OS X. The default installation directory is /Library/Frameworks/JavaFX.framework/Versions/version-number.

On Mac OS X, the installation procedure also creates additional supporting directories.

For information about samples and documentation in the SDK, see the README file in the top level of the SDK directory.


Installing the JavaFX SDK on Ubuntu Linux or OpenSolaris

  1. Download and save the JavaFX shell script for the Linux or OpenSolaris operating system.
  2. Download older versions of the JavaFX SDK installer from the Previous Releases download page.
  3. Run the .sh file; see the example command after this list.
  4. Accept the license terms.
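
For example, if the downloaded file were named javafx_sdk-linux-i586.sh (the actual file name depends on the release you downloaded):

    sh javafx_sdk-linux-i586.sh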

The shell script installs the JavaFX SDK in the current directory.

For information about samples and documentation in the SDK, see the README file in the top level of the SDK directory.
