More Newton Text to Speech

Note: this file originally appeared on the Newton Underground site (http://www.newton-underground.com/dev/a0000004.shtml, later moved to http://resources.pdadash.com/newtund/NU/dev/). Since it no longer seems to be available, I have uploaded it here, and commented out or modified links as appropriate. Steve

Article 00004
More Newton Text to Speech Contributed by Jim Bailey <jdb@shore.net>

This article builds on William Nelson's article on "How to work with the Text-to-Speech extension". Like Will's article, this article is presented for informational purposes only. I don't have any information on where to find the MacinTalk and SpeechText transport extensions. Apple hasn't released them. Please contact Apple and request that they make releasing the Newton Text to Speech extensions a priority.

Unlike Will's article, this article is less of a developer's tech note and more likely to be useful to users. I won't be showing any Newton source code or talking about how to incorporate the Text to Speech examples into your own applications. Will's article does a good job of explaining the ins and outs of working with the PlaySound() function to produce spoken results. This article talks about how to control the Text to Speech engine through embedded speech commands. Will's article covered some of the commands, but this article covers them more comprehensively, including how to calculate the effects of pitch and rate changes and how to use phonemes. Pitch and rate changes allow you to change the way the voices sound or provide emphasis when needed. Phonemes allow you to change the default behavior of the speech synthesizer to correct pronunciation mistakes.

The unreleased Newton Text to Speech extensions currently consist of two packages. The speech synthesizer itself is in the "MacinTalk" package. The name is an indication of its origin as the Macintosh text to speech software. The name gave me the idea to look for information on the Macintosh version of MacinTalk and see if it also applied to the Newton version. Luckily, and not surprisingly considering its origin, the Newton speech drivers largely accept the same embedded speech commands as the Macintosh version.

The other package that makes up the Newton Text to Speech software is a transport that allows a user to speak the text from any Newton application that supports routing of text. This is a clever way to support existing applications without requiring software changes, but if you are writing new software and want to support the text to speech software you can work without the transport. For more details on how to support the Text to Speech driver in your application, see Will's article.

You can experiment with the examples given here with the SpeakText transport. All the embedded commands work with any text. When the speech synthesizer is analyzing a block of text to speak, it also looks for special sequences of characters called delimiters. There are begin delimiters and end delimiters. When a begin delimiter is found in the text everything between the begin delimiter and the end delimiter is considered a list of commands to the synthesizer. The standard delimiters for the Newton Text to Speech synthesizer are [[ and ]], to begin and end respectively. These delimiters can be changed if they are inconvenient for some reason.

Here is an example of a text block with some embedded commands:

"[[vers 1; svox ralf; pmod 0; pbas 50]]start talking in monotone and then be silent for 2 seconds [[slnc 2000]] and then talk again."

If you play this with the synthesizer you will hear a very mechanical voice talking in monotone, a pause of two seconds and then the rest of the sentence. Everything between the [[ and ]] delimiters is treated as commands and is not spoken aloud. Instead the commands are interpreted by the synthesizer and cause various actions to occur. The rest of this article will go into detail about what the commands do and how to use them effectively.

Embedded Commands

The syntax for embedded commands is pretty flexible. You can group commands together by separating each command with a semicolon or you can put each individual command in its own begin/end delimiter command block.
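The delimiter and semicolon rules described so far can be sketched in a few lines of Python. This is only an illustration of the syntax, not part of any Newton or MacinTalk software: the function name split_speech_text is made up for this example, it assumes the standard [[ and ]] delimiters, and it does not model commands that change how a block is read (such as a comment that swallows the rest of a block or a change of delimiters).

```python
import re

# Hypothetical helper for illustration only. It assumes the standard
# [[ and ]] delimiters and treats every semicolon as a command separator.
COMMAND_BLOCK = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)

def split_speech_text(text):
    """Return (spoken_text, commands) for text using the standard delimiters."""
    commands = []
    for block in COMMAND_BLOCK.findall(text):
        # Commands grouped in one block are separated by semicolons.
        commands.extend(c.strip() for c in block.split(";") if c.strip())
    # Remove the command blocks, then collapse the leftover whitespace.
    spoken = COMMAND_BLOCK.sub("", text)
    spoken = re.sub(r"\s+", " ", spoken).strip()
    return spoken, commands

sample = ("[[vers 1; svox ralf; pmod 0; pbas 50]]start talking in monotone "
          "and then be silent for 2 seconds [[slnc 2000]] and then talk again.")
spoken, commands = split_speech_text(sample)
print(commands)  # ['vers 1', 'svox ralf', 'pmod 0', 'pbas 50', 'slnc 2000']
print(spoken)
```

Running it on the sample text lists the five commands and leaves only the words that would be spoken.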

Here is a list of embedded speech commands that seem to work with the Newton Text to Speech software. Each command is shown with the standard begin/end delimiters and with a representative set of parameters. You may be able to substitute different parameters from the ones shown; see the description of each command for more details on what parameters are accepted.

Version
[[vers 1]] The version command informs the synthesizer of the version number for the subsequent commands. I haven't tested anything but [[vers 1]] but the MacinTalk documentation highly recommends using this command to ensure compatibility with newer versions. It may be possible to substitute other numbers besides 1 as a parameter but that would defeat the purpose of the command.

Example: [[vers 1; svox ralf]]

Delimiter
[[dlim (* *)]] This command specifies the begin and end delimiters used for all subsequent commands. The new delimiters take effect when the next end delimiter is processed. The begin and end delimiters must be one or two characters. You must specify both the begin and end delimiters and they must be different from the current delimiters. You can use any one or two character sequence for the begin and end delimiters.

Example: [[dlim (( *; cmnt set begin to (( and end to a single *]]

Comment
[[cmnt blah blah blah]] The comment command allows you to insert comments in the text that won't be spoken. This can be handy to remind yourself what a certain command does or what word a phoneme string represents. You can put any text after the cmnt command as long as you follow it with the ending delimiter. Subsequent commands in the same command block are considered comments and aren't processed.

Example: [[cmnt comment everything svox zarv; rate 50; volm 0.5]]

Reset
[[rset 0]] The reset command puts the pbas and pmod back to the defaults set by the SpeakText transport. It also puts the delimiter back to standard, the input mode back to text [[inpt TEXT]], the character mode back to normal [[char NORM]], and the number mode back to normal as well [[nmbr NORM]]. It doesn't change the current voice or the rate of the voice. The current volume seems to be set back to full [[volm 1.0]].

Example: [[rset 0]]

Silence
[[slnc 2000]] The silence command tells the synthesizer to be quiet for the specified amount of time. The silence parameter is the number of milliseconds to wait. A 2000 will give 2 seconds, a 500 will pause for half a second.

Example: [[slnc 1000; cmnt be quiet for 1 second]]

Emphasis
[[emph +]] or [[emph -]] The emphasis command causes the next word to be spoken with either greater emphasis if the + is used or less emphasis if the - is used.

Example: [[emph -]]deemphasize [[emph +]]emphasize

Input mode
[[inpt TEXT]] or [[inpt PHON]] The input mode command switches between reading text with [[inpt TEXT]] or processing raw phonemes with [[inpt PHON]]. Phonemes are discussed in detail later in the article. Note that the Macintosh documentation says that [[inpt TX]] and [[inpt PH]] are equivalent to TEXT and PHON mode respectively, but those commands don't seem to work with the Newton version of the software.

Example: [[inpt PHON]]krIHstAXl[[inpt TEXT]]lattice

Character mode
[[char NORM]] or [[char LTRL]] The character mode command switches between speaking whole words with [[char NORM]] and speaking each letter one at a time with [[char LTRL]].

Example: Now I know my [[char LTRL]]AB[[char NORM]]seas

Number Mode
[[nmbr NORM]] or [[nmbr LTRL]] The number mode command switches between the number speaking mode when using [[nmbr NORM]] where each number is spoken as a whole, for example 1000 is said as one-thousand, or with [[nmbr LTRL]] where each number is spoken digit by digit as in 1000 said as one-zero-zero-zero.

Example: count [[nmbr LTRL]]from 123[[nmbr NORM]]10 times
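To make the difference between the two number modes concrete, here is a small Python sketch of what the digit by digit reading amounts to. It is only a model of the spoken result described above, not synthesizer code, and the name spell_digits is invented for this example.

```python
# Hypothetical illustration of the digit by digit reading used by
# [[nmbr LTRL]]; the synthesizer itself does this work internally.
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spell_digits(number_text):
    """Return the digits of number_text as hyphen-joined words."""
    return "-".join(DIGIT_NAMES[int(ch)] for ch in number_text
                    if ch in "0123456789")

print(spell_digits("1000"))  # one-zero-zero-zero
print(spell_digits("123"))   # one-two-three
```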

Set Voice
[[svox fred]] The set voice command changes the current voice to the one specified. It also resets the voice back to default values as in the reset command [[rset 0]]. You can use any of the following voices:

Male voices
[[svox fred]] Name: Fred; middle pitch male voice.
[[svox ralf]] Name: Ralph; low pitch male voice.
[[svox junr]] Name: Junior; high pitch male voice.

Female voices
[[svox kath]] Name: Kathy; middle pitch female voice.
[[svox prin]] Name: Princess; high pitch female voice.

Special effects voices
[[svox zarv]] Name: Zarvox; echoing computer voice.
[[svox whis]] Name: Whisper; whispering male voice.

Singing voices
[[svox gnws]] Name: Good News; Sings a happy song.
[[svox bnws]] Name: Bad News; Sings a sad song.

The singing voices seem to be incompatible with pitch modulation changes.
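If you generate speech text from a script, a lookup table of the voice codes listed above helps catch typos before the text reaches the synthesizer. The Python sketch below is one way to do that. The names VOICES and svox_command are made up for this example, and the table only contains the voices named in this article.

```python
# Voice codes and names as listed in this article. The helper below is a
# hypothetical convenience, not part of the Text to Speech software.
VOICES = {
    "fred": "Fred", "ralf": "Ralph", "junr": "Junior",
    "kath": "Kathy", "prin": "Princess",
    "zarv": "Zarvox", "whis": "Whisper",
    "gnws": "Good News", "bnws": "Bad News",
}

def svox_command(code):
    """Return a set voice command block, rejecting codes not in the list."""
    if code not in VOICES:
        raise ValueError("unknown voice code: " + repr(code))
    return "[[svox " + code + "]]"

print(svox_command("ralf") + "Hello from " + VOICES["ralf"])
# [[svox ralf]]Hello from Ralph
```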

Volume
[[volm 0.3]] or [[volm +0.3]] or [[volm -0.3]] The volume command changes the speaking volume of the synthesizer. The default volume without this command seems to be full volume, even if the system volume is set differently or turned off altogether. There are two forms of this command. The first sets the volume absolutely. The second sets the volume as an offset from the current volume: a + raises the current volume and a - lowers it. The volume is a number between 0.0 for silence and 1.0 for full volume; half volume would be 0.5. The volume is capped at 1.0, so any value over 1.0, or any offset that would raise the total above 1.0, leaves the volume at 1.0.

Example: [[volm 0.75; cmnt set the volume to 3/4]]
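The absolute and relative forms of the volume command can be modeled with a few lines of Python. This is a sketch of the behavior described above rather than the synthesizer's actual code. The function name apply_volm is invented, and the floor at 0.0 is an assumption on my part, since only the cap at 1.0 was observed.

```python
def apply_volm(current, argument):
    """Return the volume after a volm command, given its argument as text.

    A leading + or - makes the argument an offset from the current volume;
    otherwise it is an absolute value.
    """
    value = float(argument)
    if argument.startswith(("+", "-")):
        value = current + value
    # Cap at 1.0 as described in the article. Flooring at 0.0 is assumed.
    return min(1.0, max(0.0, value))

print(apply_volm(1.0, "0.75"))   # 0.75 (absolute form)
print(apply_volm(0.5, "+0.25"))  # 0.75 (offset from 0.5)
print(apply_volm(0.9, "+0.3"))   # 1.0  (capped at full volume)
```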

Speaking Rate
[[rate 100]] or [[rate +50]] or [[rate -50]] The speaking rate command sets the speaking rate in words per minute. There are two forms of this command. The first form sets the speaking rate to an absolute words-per-minute value. The second form sets the rate as an offset from the current value: a + increases the current rate and a - decreases it. The speaking rate can be set as low as 1 and as high as you wish, though 500 seems like a practical upper limit, and the synthesizer seems to limit the slowest rate to about 50 words per minute.
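Because the rate is expressed in words per minute, you can make a rough estimate of how long a passage will take to speak. The Python sketch below does only that simple arithmetic. The name estimate_seconds is invented for this example, and the real duration will differ since it depends on word length, punctuation pauses and any silence commands. Pass in only the spoken words, not the embedded commands.

```python
def estimate_seconds(spoken_text, rate_wpm):
    """Rough speaking time in seconds: word count divided by words per minute."""
    if rate_wpm <= 0:
        raise ValueError("rate must be positive")
    word_count = len(spoken_text.split())
    return 60.0 * word_count / rate_wpm

# Eight words at 100 words per minute.
print(estimate_seconds("start talking in monotone and then be silent", 100))  # 4.8
```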

Baseline Pitch
[[pbas 50]] or [[pbas +30]] or [[pbas -30]] The baseline pitch command changes the current pitch that the synthesizer uses for a voice. The command has two forms: the first sets an absolute pitch value and the second sets a pitch that is an offset from the current pitch value. The default pitch is set by the SpeakNewton transport preferences. The range of values is between 1 and 100. The effect of changing the baseline pitch is related to the pitch modulation and is discussed in the next section.

Pitch modulation
[[pmod 50]] or [[pmod +20]] or [[pmod -20]] The pitch modulation command changes the modulation range that the synthesizer uses for a voice. There are two forms to the pitch modulation command. The first sets the modulation to an absolute value. The second form sets the modulation as an offset from the current value. Since the default pmod value is not specified or set by a transport default, the second form is only useful if a known value is set with the absolute form first. The pitch modulation can be set between 0 and 100. A value of zero causes the synthesizer to speak in a monotone. The effect of changing the pitch modulation on the voice is related to the baseline pitch and is discussed in the next section.

Setting Pitch

There is a somewhat complicated mathematical relationship between the baseline pitch (pbas) and the pitch modulation (pmod). You can set the values of each of these parameters by using embedded commands but if you don't know the mathematical relationship their effects can seem somewhat mysterious. The baseline pitch is what causes a voice to sound high or low. The value is entered and produces a particular frequency based on the following formula:

baseHertz = 440 * 2^((pbas-69)/12)

A pbas of 50 will produce a baseHertz of 147. This number is used by the pitch modulation calculation to find the minimum and maximum frequency. The pitch modulation value is used by the following equations:

maximum pitch = pbas+pmod
minimum pitch = pbas - pmod

maximum Hertz = baseHertz*2^(+pmod/12)
minimum Hertz = baseHertz*2^(-pmod/12)

A pbas of 50 with a pmod of 20 produces:

maximum pitch = 70
minimum pitch = 30
maximum Hertz = 466
minimum Hertz = 46

It isn't easy to see the relationship between the pitch modulation and the baseline pitch. The best advice is that a pitch modulation of 0 produces a monotone. The greater the difference between the minimum and maximum values the more inflection is heard from the synthesizer. I suggest setting up a spreadsheet to calculate the values if you are interested in the numeric relationships or experiment with different values to get a better understanding about the audible effects.
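If you would rather not set up a spreadsheet, the same formulas can be sketched in Python. The function names base_hertz and pitch_range are made up for this example; the arithmetic is the baseHertz and minimum/maximum Hertz formulas given above. Note that with this formula a pbas of 69 gives exactly 440 Hz, the same mapping MIDI uses for note numbers.

```python
def base_hertz(pbas):
    """Frequency in Hz for a baseline pitch value, using the formula above."""
    return 440.0 * 2.0 ** ((pbas - 69) / 12.0)

def pitch_range(pbas, pmod):
    """Return (minimum Hz, base Hz, maximum Hz) for a pbas and pmod pair."""
    base = base_hertz(pbas)
    low = base * 2.0 ** (-pmod / 12.0)
    high = base * 2.0 ** (pmod / 12.0)
    return low, base, high

low, base, high = pitch_range(50, 20)
print(round(low), round(base), round(high))  # 46 147 466
```

Trying a pmod of 0 returns the same value three times, which matches the monotone you hear.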

To get an idea of how pitch effects can be used, here is a brief bit of "Joy to the World" sung with varying pitch. The text is made of phonemes (discussed below), but you can see that the pmod value is set to 0 and the pbas is varied to produce different pitches during the song. This is an amusing example, but it is also a good example of the compatibility between the Macintosh and Newton versions of the synthesizers.

[[pmod 0;pbas -20]]
[[pmod 0;pbas +14;inpt PHON]]~J>>>>OY [[pbas +3]]t>>>>UW [[pbas +2]]D>>UX [[pbas +2]]w>>>>>UHrld [[pbas +1]]D>>UX [[pbas -1]]l>>>>OWrd [[pbas -3]]>>>>>>IHz [[pbas -3]]k>>>>>>>UHm [[pbas -4]]l>>>EHt [[pbas +2]]>>>>>>>UXrT r>>>IY [[pbas +2]]s>>>>>>>>IYv h>>>UXr [[pbas +1]]k>>>>>>>IHN.

[[pbas -3;inpt PHON]]~l>>EHt >>>EH [[pbas -1]]>>>EH [[pbas +1]]vr>>>IY [[pbas +2]]>>>IY h>>>AA [[pbas -5]]>AA [[pbas +5]]>>>AArt [[pbas +3]]pr>>IY p>>>EH [[pbas +2]]>>>EHr [[pbas +2]]h>>>IH [[pbas +1]]>>>IHm r>>>UW [[pbas -1]]>UW [[pbas -2]]>>>UWm.

[[pbas -3;inpt PHON]]~>AEnd h>>EH vn>>AEnd n>>EY ty>UH [[pbas -4]]>UHr [[pbas +2]]s>>>>>>>>IHN [[pbas +2]]>AE [[pbas +1]]>AEnd [[pbas -1]]h>>>EH vn>>>AEnd n>>>EY ty>UH [[pbas -2]]>UHr [[pbas -1]]s>>>>>>>IHN [[pbas -2]]>AE [[pbas +2]]>AEnd [[pbas +1]]h>>>EH [[pbas +10 ]]>>>>>EHvn [[pbas -1]]>>AEnd [[pbas -2]]h>>>EH [[pbas -2]]EH [[pbas -3]]>>>>>EHvn [[pbas -4]]>>>AEnd [[pbas +2]]n>>>>EY [[pbas +2]]ty>>>UHr [[pbas +1]]s>>>>>>>IHN.

You can cut and paste the song into a Works paper or a notepad note and play it to hear the effects. If you are viewing this with a Newton-based browser you may be able to play the song from your browser by using the Speak Text routing menu to play directly off the web page.
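Since the song uses relative pbas commands almost everywhere, it can be hard to tell what pitch each note ends up at. The Python sketch below adds up the offsets. It is an illustration only: the name pbas_track is invented, the starting pitch of 50 is an assumed value (the real default comes from the transport preferences), and the 1 to 100 range limit is not modeled.

```python
import re

# Matches a pbas command and captures an optional sign plus the number.
PBAS = re.compile(r"pbas\s*([+-]?)\s*(\d+(?:\.\d+)?)")

def pbas_track(text, start):
    """Return the baseline pitch in effect after each pbas command in text."""
    current = float(start)
    track = []
    for sign, number in PBAS.findall(text):
        value = float(number)
        if sign == "+":
            current += value      # relative form, raise the pitch
        elif sign == "-":
            current -= value      # relative form, lower the pitch
        else:
            current = value       # absolute form
        track.append(current)
    return track

opening = ("[[pmod 0;pbas -20]] [[pmod 0;pbas +14;inpt PHON]]~J>>>>OY "
           "[[pbas +3]]t>>>>UW [[pbas +2]]D>>UX [[pbas +2]]w>>>>>UHrld")
print(pbas_track(opening, 50))  # [30.0, 44.0, 47.0, 49.0, 51.0]
```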

Phonemes

The speech synthesizer has many rules for automatically converting text into the correct English pronunciation. For example, this phrase uses two different pronunciations for the word "object":

Should he object about the object?

The first instance of the word object is a verb; the second is a noun. In these contexts the word is spoken with the stress on a different syllable. The only way to really know if a phrase will be spoken correctly is to listen to it. In some cases the synthesizer will not produce the results you want. The use of raw phonemes allows very precise control over the spoken output. The rules for phonetic text are shown in the following tables. You turn phonetic mode on and off using the [[inpt PHON]] embedded command.

American English phoneme symbols

Symbol	Example	Opcode
AE	bat	2
EY	bait	3
AO	caught	4
AX	about	5
IY	beet	6
EH	bet	7
IH	bit	8
AY	bite	9
IX	roses	10
AA	cot	11
UW	boot	12
UH	book	13
UX	bud	14
OW	boat	15
AW	bout	16
OY	boy	17
b	bin	18
C	chin	19
d	din	20
D	them	21
f	fin	22
g	gain	23
h	hat	24
J	gin	25
k	kin	26
l	limb	27
m	mat	28
n	nat	29
N	tang	30
p	pin	31
r	ran	32
s	sin	33
S	shin	34
t	tin	35
T	thin	36
v	van	37
w	wet	38
y	yet	39
z	zen	40
Z	genre	41
%	silence	0
@	breath intake	1
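When reading a phoneme string it helps to split it into the symbols from the table above. The Python sketch below does that by trying the two-letter vowel symbols first and then single characters. It is a reading aid, not the synthesizer's parser: the names PHONEME_OPCODES and tokenize_phonemes are invented, and any character that is not in the table (for example a prosodic mark) is passed through with an opcode of None.

```python
# Symbol to opcode table copied from the phoneme table in this article.
PHONEME_OPCODES = {
    "%": 0, "@": 1,
    "AE": 2, "EY": 3, "AO": 4, "AX": 5, "IY": 6, "EH": 7, "IH": 8, "AY": 9,
    "IX": 10, "AA": 11, "UW": 12, "UH": 13, "UX": 14, "OW": 15, "AW": 16,
    "OY": 17, "b": 18, "C": 19, "d": 20, "D": 21, "f": 22, "g": 23, "h": 24,
    "J": 25, "k": 26, "l": 27, "m": 28, "n": 29, "N": 30, "p": 31, "r": 32,
    "s": 33, "S": 34, "t": 35, "T": 36, "v": 37, "w": 38, "y": 39, "z": 40,
    "Z": 41,
}

def tokenize_phonemes(text):
    """Split a phoneme string into (symbol, opcode) pairs.

    Characters not found in the table get an opcode of None.
    """
    tokens = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in PHONEME_OPCODES:
            tokens.append((pair, PHONEME_OPCODES[pair]))  # two-letter vowel
            i += 2
        else:
            symbol = text[i]
            tokens.append((symbol, PHONEME_OPCODES.get(symbol)))
            i += 1
    return tokens

print([symbol for symbol, _ in tokenize_phonemes("krIHstAXl")])
# ['k', 'r', 'IH', 's', 't', 'AX', 'l']
```

Symbols are case sensitive, so d (din) and D (them) are different phonemes.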

Prosodic control symbols

Type	Symbol	Description of Effect
Lexical Stress:		Marks stress within a word
Primary stress	1	AEnt2IHsIXp1EYSAXn ("anticipation")
Secondary stress	2	AEnt2IHsIXp1EYSAXn ("anticipation")
Syllable breaks:		Marks syllable breaks within a word
Syllable mark	= (equal)	AEn=t2IH=sIX=p1EY=SAXn ("anticipation")
Word prominence:		Marks the beginning of a word (required)
Unstressed	~ (asciitilde)	Used for words with minimal information content
Normal stress	_ (underscore)	Used for information-bearing words
Emphatic stress	+ (plus)	Special emphasis for a word
Prosodic		Placed before the affected phoneme
Pitch rise	/ (slash)	pitch will rise on the following phoneme
Pitch fall	\ (backslash)	pitch will fall on the following phoneme
Lengthen phoneme	> (greater)	lengthen the duration of the following phoneme
Shorten phoneme	< (less)	shorten the duration of the following phoneme
Punctuation:	Pitch effect	Timing effect
. (period)	Sentence final fall	Pause follows
? (question)	Sentence final rise	Pause follows
! (exclam)	Sentence final sharp fall	Pause follows
... (ellipsis)	Clause final level	Pause follows
, (comma)	Continuation rise	Short pause follows
; (semicolon)	Continuation rise	Short pause follows
: (colon)	Clause final level	Short pause follows
( (parenleft)	Start reduced range	Short pause precedes
) (parenright)	End reduced range	Short pause follows
" or ' (quote dbl left, quote single left)	Varies	Varies
" or ' (quote dbl right, quote single right)	Varies	Varies
- (hyphen)	Clause-final level	Short pause follows
& (ampersand)		Forces no addition of silence between phonemes