Note: this file originally appeared on the Newton Underground site (http://www.newton-underground.com/dev/a0000004.shtml, later moved to http://resources.pdadash.com/newtund/NU/dev/). Since it no longer seems to be available, I have uploaded it here, and commented out or modified links as appropriate. Steve
Article 00004
More Newton Text to Speech
Contributed by Jim Bailey <jdb@shore.net>
© 1997 Jim Bailey
This article builds on William Nelson's article on "How to work with the Text-to-Speech extension". Like Will's article, this article is presented for informational purposes only. I don't have any information on where to find the MacinTalk and SpeechText transport extensions. Apple hasn't released them. Please contact Apple and request that they make releasing the Newton Text to Speech extensions a priority.
Unlike Will's article, this article is less of a developer's tech note and more likely to be useful to users. I won't be showing any Newton source code or talking about how to incorporate the Text to Speech examples into your own applications. Will's article does a good job of explaining the ins and outs of working with the PlaySound() function to produce spoken results. This article talks about how to control the Text to Speech engine through embedded speech commands. Will's article covered some of the commands, but this article covers them more comprehensively, including how to calculate the effects of pitch and rate changes and how to use phonemes. Pitch and rate changes allow you to change the way the voices sound or provide emphasis when needed. Phonemes allow you to change the default behavior of the speech synthesizer to correct pronunciation mistakes.
The unreleased Newton Text to Speech extensions currently consist of two packages. The speech synthesizer itself is in the "MacinTalk" package. The name is an indication of its origin as the Macintosh text to speech software, which gave me the idea to look for information on the Macintosh version of MacinTalk and see if it also applied to the Newton version. Luckily, and not surprisingly considering its origin, the Newton speech drivers largely accept the same embedded speech commands as the Macintosh version.
The other package that makes up the Newton Text to Speech software is
a transport that allows a user to speak the text from any Newton application
that supports routing of text. This is a clever way to support existing
applications without requiring software changes, but if you are writing new
software and want to support the text to speech software you can work without
the transport. For more details on how to support the Text to Speech driver
in your application, see Will's article.
You can experiment with the examples given here with the SpeakText transport.
All the embedded commands work with any text. When the speech synthesizer
is analyzing a block of text to speak, it also looks for special sequences
of characters called delimiters. There are begin delimiters and end delimiters.
When a begin delimiter is found in the text everything between the begin
delimiter and the end delimiter is considered a list of commands to the
synthesizer. The standard delimiters for the Newton Text to Speech synthesizer
are [[ and ]], to begin and end respectively. These delimiters can be changed
if they are inconvenient for some reason.
Here is an example of a text block with some embedded commands:
"[[vers 1; svox ralf; pmod 0; pbas 50]]start talking in monotone and then be silent for 2 seconds [[slnc 2000]] and then talk again."
If you play this with the synthesizer you will hear a very mechanical voice talking in monotone, a pause of two seconds and then the rest of the sentence. Everything between the [[ and ]] delimiters is a command and is not spoken aloud. Instead the commands are interpreted by the synthesizer and cause various actions to occur. The rest of this article will go into detail about what the commands do and how to use them effectively.
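To make the delimiter idea concrete, here is a minimal Python sketch (not Newton code, and not how the synthesizer is actually implemented) that splits a text block into spoken pieces and command blocks. The function name split_segments and the handling of an unterminated block are my own assumptions for illustration.

```python
def split_segments(text, begin="[[", end="]]"):
    """Split text into ("speak", str) and ("command", str) pieces."""
    segments = []
    pos = 0
    while pos < len(text):
        start = text.find(begin, pos)
        if start == -1:
            # No more command blocks: the rest is spoken text.
            segments.append(("speak", text[pos:]))
            break
        if start > pos:
            segments.append(("speak", text[pos:start]))
        stop = text.find(end, start + len(begin))
        if stop == -1:
            # Unterminated block: the real synthesizer's behavior is unknown,
            # so this sketch simply treats the remainder as commands.
            segments.append(("command", text[start + len(begin):]))
            break
        segments.append(("command", text[start + len(begin):stop]))
        pos = stop + len(end)
    return segments


sample = ("[[vers 1; svox ralf; pmod 0; pbas 50]]start talking in monotone "
          "and then be silent for 2 seconds [[slnc 2000]] and then talk again.")
for kind, body in split_segments(sample):
    print(kind, repr(body))
```

Run on the example above, this should list one command block, a stretch of spoken text, the slnc command, and the final spoken text.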
Embedded Commands
The syntax for embedded commands is pretty flexible. You can group commands together by separating each command with a semicolon or you can put each individual command in its own begin/end delimiter command block.
Here is a list of embedded speech commands that seem to work with the Newton Text to Speech software. Each command is shown with the standard begin/end delimiters and with a representative set of parameters. You may be able to substitute different parameters from the ones shown; see the description of each command for more details on what parameters are accepted.
Version
[[vers 1]] The version command informs the synthesizer of the version number
for the subsequent commands. I haven't tested anything but [[vers 1]] but
the MacinTalk documentation highly recommends using this command to ensure
compatibility with newer versions. It may be possible to substitute other
numbers besides 1 as a parameter but that would defeat the purpose of the
command.
Example: [[vers 1; svox ralf]]
Delimiter
[[dlim (* *)]] This command specifies the begin and end delimiters used
for all subsequent commands. The new delimiters take effect when the next
end delimiter is processed. You must specify both the begin and end
delimiters, and they must be different from the current delimiters. Each
delimiter can be any sequence of one or two characters.
Example: [[dlim (( *; cmnt set begin to (( and end to a single *]]
Comment
[[cmnt blah blah blah]] The comment command allows you to insert comments
in the text that won't be spoken. This can be handy to remind yourself what
a certain command does or what word a phoneme string represents. You can
put any text after the cmnt command as long as you follow it with the ending
delimiter. Subsequent commands in the same command block are considered
part of the comment and aren't processed.
Example: [[cmnt comment everything svox zarv; rate 50; volm 0.5]]
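The grouping and comment rules can be sketched in a few lines of Python. This is only an illustration of the rules as described in this article, not the synthesizer's real parser; the name parse_block is hypothetical, and it expects the text found between one pair of delimiters.

```python
def parse_block(block):
    """Split one command block into (selector, argument) pairs."""
    commands = []
    rest = block
    while rest.strip():
        head, _, tail = rest.partition(";")
        selector, _, argument = head.strip().partition(" ")
        if selector == "cmnt":
            # A comment swallows everything up to the end delimiter,
            # including semicolons and any later commands.
            commands.append(("cmnt", rest.strip()[len("cmnt"):].strip()))
            break
        if selector:
            commands.append((selector, argument.strip()))
        rest = tail
    return commands


print(parse_block("slnc 1000; cmnt be quiet for 1 second"))
print(parse_block("cmnt comment everything svox zarv; rate 50; volm 0.5"))
```

The second call returns a single cmnt entry, which matches the description above: the rate and volm commands after the comment are never processed.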
Reset
[[rset 0]] The reset command puts the pbas and pmod back to the defaults
set by the SpeakText transport. It also puts the delimiter back to standard,
the input mode back to text [[inpt TEXT]], the character mode back to normal
[[char NORM]], and the number mode back to normal as well [[nmbr NORM]].
It doesn't change the current voice or the rate of the voice. The current
volume seems to be set back to full [[volm 1.0]].
Example: [[rset 0]]
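One way to keep track of what the reset command does and does not touch is to model the settings as a small record. The Python sketch below is purely illustrative: the class name SynthState and all of the default numbers are placeholders I made up, since the real pbas and pmod defaults come from the transport preferences and are not given here.

```python
from dataclasses import dataclass

# Placeholder defaults: the real values are set by the transport
# preferences and are not documented in this article.
DEFAULT_PBAS = 50
DEFAULT_PMOD = 20


@dataclass
class SynthState:
    voice: str = "fred"
    rate: int = 150            # assumed value, in words per minute
    pbas: int = DEFAULT_PBAS
    pmod: int = DEFAULT_PMOD
    volume: float = 1.0
    begin: str = "[["
    end: str = "]]"
    input_mode: str = "TEXT"
    char_mode: str = "NORM"
    number_mode: str = "NORM"

    def reset(self):
        """Mimic [[rset 0]] as described above: voice and rate are kept."""
        fresh = SynthState(voice=self.voice, rate=self.rate)
        self.__dict__.update(fresh.__dict__)


state = SynthState(voice="zarv", rate=90, pbas=10, volume=0.2, input_mode="PHON")
state.reset()
print(state)
```

After the reset, the voice and rate are unchanged while everything else is back at its starting value.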
Silence
[[slnc 2000]] The silence command tells the synthesizer to be quiet for
the specified amount of time. The silence parameter is the number of milliseconds
to wait. A 2000 will give 2 seconds, a 500 will pause for half a second.
Example: [[slnc 1000; cmnt be quiet for 1 second]]
Emphasis
[[emph +]] or [[emph -]] The emphasis command causes the next word to be
spoken with either greater emphasis if the + is used or less emphasis if
the - is used.
Example: [[emph -]]de-emphasize [[emph +]]emphasize
Input mode
[[inpt TEXT]] or [[inpt PHON]] The input mode command switches between reading
text with [[inpt TEXT]] or processing raw phonemes with [[inpt PHON]]. Phonemes
are discussed in detail later in the article. Note that the Macintosh documentation
says that [[inpt TX]] and [[inpt PH]] are equivalent to TEXT and PHON mode
respectively, but those commands don't seem to work with the Newton version
of the software.
Example: [[inpt PHON]]krIHstAXl[[inpt TEXT]]lattice
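If you find yourself correcting the same words over and over, the substitution can be automated before the text is handed to the synthesizer. Here is a minimal Python sketch of that idea. The function name apply_pronunciations and the override table are hypothetical, and the table keys are assumed to be lowercase words.

```python
import re


def apply_pronunciations(text, overrides):
    """Replace whole words with phoneme strings wrapped in PHON/TEXT switches."""
    if not overrides:
        return text

    def swap(match):
        phonemes = overrides[match.group(0).lower()]
        return "[[inpt PHON]]" + phonemes + "[[inpt TEXT]]"

    # Match any of the listed words, whole words only, ignoring case.
    pattern = r"\b(" + "|".join(re.escape(word) for word in overrides) + r")\b"
    return re.sub(pattern, swap, text, flags=re.IGNORECASE)


overrides = {"crystal": "krIHstAXl"}   # keys must be lowercase
print(apply_pronunciations("crystal lattice", overrides))
# [[inpt PHON]]krIHstAXl[[inpt TEXT]] lattice
```

The sketch always switches back to [[inpt TEXT]] after each replaced word so the rest of the text is read normally.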
Character mode
[[char NORM]] or [[char LTRL]] The character mode command switches between
speaking whole words with [[char NORM]] and speaking each letter one at a
time with the letter by letter mode, [[char LTRL]].
Example: Now I know my [[char LTRL]]AB[[char NORM]]seas
Number Mode
[[nmbr NORM]] or [[nmbr LTRL]] The number mode command switches between
[[nmbr NORM]], where each number is spoken as a whole (for example, 1000
is said as one-thousand), and [[nmbr LTRL]], where each number is spoken
digit by digit (1000 is said as one-zero-zero-zero).
Example: count [[nmbr LTRL]]from 123[[nmbr NORM]]10 times
Set Voice
[[svox fred]] The set voice command changes the current voice to the one
specified. It also resets the voice back to default values as in the
reset command [[rset 0]]. You can use any of the following voices:
Male voices
[[svox fred]] Name: Fred; middle pitch male voice.
[[svox ralf]] Name: Ralph; low pitch male voice.
[[svox junr]] Name: Junior; high pitch male voice.
Female voices
[[svox kath]] Name: Kathy; middle pitch female voice.
[[svox prin]] Name: Princess; high pitch female voice.
Special effects voices
[[svox zarv]] Name: Zarvox; echoing computer voice.
[[svox whis]] Name: Whisper; whispering male voice.
Singing voices
[[svox gnws]] Name: Good News; Sings a happy song.
[[svox bnws]] Name: Bad News; Sings a sad song.
The singing voices seem to be incompatible with pitch modulation changes.
Volume
[[volm 0.3]] or [[volm +0.3]] or [[volm -0.3]] The volume command changes
the speaking volume of the synthesizer. The default volume without this
command seems to be full volume, even if the system volume is set differently
or turned off altogether. There are two forms of this command. The first
sets the volume absolutely. The second sets the volume as an offset from
the current volume: an increase when a + is used and a reduction when a -
is used. The volume is a number between 0.0 for silence and 1.0 for full
volume. Half volume would be 0.5. The volume is capped at 1.0, even if an
absolute value or an offset would raise it above 1.0.
Example: [[volm 0.75; cmnt set the volume to 3/4]]
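The absolute and offset forms can be worked through with a small Python sketch. The function name apply_volume is hypothetical. The cap at 1.0 comes from the description above, while the floor at 0.0 is my assumption, since I haven't confirmed what the synthesizer does with a negative result.

```python
def apply_volume(current, argument):
    """Return the new volume after a volm argument such as "0.3" or "+0.3"."""
    value = float(argument)
    if argument.strip()[0] in "+-":
        new = current + value       # offset form: relative to current volume
    else:
        new = value                 # absolute form
    # Cap at 1.0 as described above; the floor at 0.0 is an assumption.
    return min(1.0, max(0.0, new))


print(apply_volume(0.5, "0.75"))   # absolute: 0.75
print(apply_volume(0.9, "+0.3"))   # offset that would exceed full volume: 1.0
```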
Speaking Rate
[[rate 100]] or [[rate +50]] or [[rate -50]] The speaking rate command sets
the speaking rate in words per minute. There are two forms of this command.
The first form sets the speaking rate to an absolute words per minute value.
The second form sets the rate as an offset from the current value: an increase
with a + and a decrease with a -. The speaking rate can be set as low as 1
and as high as you wish, though 500 seems like a practical upper limit, and
the synthesizer seems to limit the slowest rate to about 50 words per minute.
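Because the rate is given in words per minute, you can roughly estimate how long a passage will take to speak. The Python sketch below is only a back-of-the-envelope calculation: the function name is hypothetical, it counts whitespace-separated words, it ignores pauses and embedded commands, and the floor of 50 is just the apparent lower limit mentioned above.

```python
OBSERVED_MIN_RATE = 50   # apparent slowest rate, per the observation above


def estimated_seconds(text, rate_wpm):
    """Estimate speaking time in seconds at the given words-per-minute rate."""
    words = len(text.split())
    effective_rate = max(rate_wpm, OBSERVED_MIN_RATE)
    return words * 60.0 / effective_rate


print(estimated_seconds("one two three four five", 100))   # 3.0
```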
Baseline Pitch
[[pbas 50]] or [[pbas +30]] or [[pbas -30]] The baseline pitch command changes
the current pitch that the synthesizer uses for a voice. The command has
two forms. The first sets an absolute pitch value and the second sets a
pitch that is an offset from the current pitch value. The default pitch
is set by the SpeakNewton transport preferences. The range of values is
between 1 and 100. The effect of changing the baseline pitch is related
to the pitch modulation and is discussed in the Setting Pitch section below.
Pitch modulation
[[pmod 50]] or [[pmod +20]] or [[pmod -20]] The pitch modulation command
changes the modulation range that the synthesizer uses for a voice. There
are two forms to the pitch modulation command. The first sets the modulation
to an absolute value. The second form sets the modulation as an offset from
the current value. Since the default pmod value is not specified or set
by a transport default, the second form is only useful if a known value
is set with the absolute form first. The pitch modulation can be set between
0 and 100. A value of zero causes the synthesizer to speak in a monotone.
The effect of changing the pitch modulation on the voice is related to the
baseline pitch and is discussed in the next section.
Setting Pitch
There is a somewhat complicated mathematical relationship between the baseline pitch (pbas) and the pitch modulation (pmod). You can set the values of each of these parameters by using embedded commands but if you don't know the mathematical relationship their effects can seem somewhat mysterious. The baseline pitch is what causes a voice to sound high or low. The value is entered and produces a particular frequency based on the following formula:
baseHertz = 440 * 2^((pbas-69)/12)
A pbas of 50 will produce a baseHertz of 147. This number is used by the pitch modulation calculation to find the minimum and maximum frequency. The pitch modulation value is used by the following equations:
maximum pitch = pbas+pmod
minimum pitch = pbas - pmod
maximum Hertz = baseHertz*2^(+pmod/12)
minimum Hertz = baseHertz*2^(-pmod/12)
A pbas of 50 with a pmod of 20 produces:
maximum pitch = 70
minimum pitch = 30
maximum Hertz = 466
minimum Hertz = 46
It isn't easy to see the relationship between the pitch modulation and the baseline pitch. The best advice is that a pitch modulation of 0 produces a monotone. The greater the difference between the minimum and maximum values the more inflection is heard from the synthesizer. I suggest setting up a spreadsheet to calculate the values if you are interested in the numeric relationships or experiment with different values to get a better understanding about the audible effects.
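If you would rather not set up a spreadsheet, the same formulas fit in a few lines of Python. This is just a direct transcription of the equations given above; the function names base_hertz and pitch_range are my own.

```python
def base_hertz(pbas):
    """Frequency in Hertz for a baseline pitch value."""
    return 440.0 * 2 ** ((pbas - 69) / 12)


def pitch_range(pbas, pmod):
    """Return (minimum Hertz, base Hertz, maximum Hertz)."""
    base = base_hertz(pbas)
    return base * 2 ** (-pmod / 12), base, base * 2 ** (pmod / 12)


low, base, high = pitch_range(50, 20)
print(round(low), round(base), round(high))   # 46 147 466
```

Trying a few pmod values with the same pbas shows how quickly the range widens: every 12 steps of pmod doubles the maximum and halves the minimum.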
To get an idea of how pitch effects can be used, here is a brief bit of "Joy to the World" sung with varying pitch. The text is made of phonemes (discussed below), but you can see that the pmod value is set to 0 and the pbas is varied to produce different pitches during the song. This is an amusing example but it is also a good example of the compatibility between the Macintosh and Newton versions of the synthesizers.
[[pmod 0;pbas -20]]
[[pmod 0;pbas +14;inpt PHON]]~J>>>>OY [[pbas +3]]t>>>>UW
[[pbas +2]]D>>UX [[pbas +2]]w>>>>>UHrld [[pbas +1]]D>>UX
[[pbas -1]]l>>>>OWrd [[pbas -3]]>>>>>>IHz
[[pbas -3]]k>>>>>>>UHm [[pbas -4]]l>>>EHt
[[pbas +2]]>>>>>>>UXrT r>>>IY [[pbas +2]]s>>>>>>>>IYv
h>>>UXr [[pbas +1]]k>>>>>>>IHN.
[[pbas -3;inpt PHON]]~l>>EHt >>>EH [[pbas -1]]>>>EH [[pbas +1]]vr>>>IY [[pbas +2]]>>>IY h>>>AA [[pbas -5]]>AA [[pbas +5]]>>>AArt [[pbas +3]]pr>>IY p>>>EH [[pbas +2]]>>>EHr [[pbas +2]]h>>>IH [[pbas +1]]>>>IHm r>>>UW [[pbas -1]]>UW [[pbas -2]]>>>UWm.
[[pbas -3;inpt PHON]]~>AEnd h>>EH vn>>AEnd n>>EY ty>UH [[pbas -4]]>UHr [[pbas +2]]s>>>>>>>>IHN [[pbas +2]]>AE [[pbas +1]]>AEnd [[pbas -1]]h>>>EH vn>>>AEnd n>>>EY ty>UH [[pbas -2]]>UHr [[pbas -1]]s>>>>>>>IHN [[pbas -2]]>AE [[pbas +2]]>AEnd [[pbas +1]]h>>>EH [[pbas +10 ]]>>>>>EHvn [[pbas -1]]>>AEnd [[pbas -2]]h>>>EH [[pbas -2]]EH [[pbas -3]]>>>>>EHvn [[pbas -4]]>>>AEnd [[pbas +2]]n>>>>EY [[pbas +2]]ty>>>UHr [[pbas +1]]s>>>>>>>IHN.
You can cut and paste the song into a Works paper or a notepad note and play it to hear the effects. If you are viewing this with a Newton-based browser you may be able to play the song from your browser by using the Speak Text routing menu to play directly off the web page.
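Since every pbas command in the song is an offset, the actual pitch at any point depends on where the voice started. The Python sketch below follows the running pbas value through the opening words and converts each value to Hertz with the formula from the Setting Pitch section. The function name trace_pbas is hypothetical, and the starting value of 60 is an arbitrary assumption, because the real starting pitch comes from the transport preferences.

```python
import re


def trace_pbas(text, start_pbas):
    """Return the running pbas value after each pbas command in the text."""
    current = start_pbas
    trace = []
    # Only look inside [[ ... ]] command blocks, not at the spoken text.
    for block in re.findall(r"\[\[(.*?)\]\]", text):
        for match in re.finditer(r"\bpbas\s*([+-]?\d+)", block):
            argument = match.group(1)
            if argument[0] in "+-":
                current += int(argument)    # offset form
            else:
                current = int(argument)     # absolute form
            trace.append(current)
    return trace


opening = ("[[pmod 0;pbas -20]] [[pmod 0;pbas +14;inpt PHON]]~J>>>>OY "
           "[[pbas +3]]t>>>>UW [[pbas +2]]D>>UX [[pbas +2]]w>>>>>UHrld")
for pbas in trace_pbas(opening, start_pbas=60):   # 60 is an arbitrary start
    hertz = 440.0 * 2 ** ((pbas - 69) / 12)
    print(pbas, round(hertz, 1))
```

The sketch does not clamp the result to the 1 to 100 range, since I don't know how the synthesizer treats offsets that go outside it.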
Phonemes
The speech synthesizer has many rules for automatically converting text into the correct English pronunciation. For example, this phrase uses two different pronunciations for the word "object":
Should he object about the object?
The first instance of the word object is a verb; the second is a noun. In these contexts the word is spoken with the stress on a different syllable. The only way to really know if a phrase will be spoken correctly is to listen to it. In some cases the synthesizer will not produce the results you want. The use of raw phonemes allows very precise control over the spoken output. The rules for phonetic text are shown in the following tables. You turn phonetic mode on with the [[inpt PHON]] embedded command and back off with [[inpt TEXT]].
Symbol | Example | Opcode |
AE | bat | 2 |
EY | bait | 3 |
AO | caught | 4 |
AX | about | 5 |
IY | beet | 6 |
EH | bet | 7 |
IH | bit | 8 |
AY | bite | 9 |
IX | roses | 10 |
AA | cot | 11 |
UW | boot | 12 |
UH | book | 13 |
UX | bud | 14 |
OW | boat | 15 |
AW | bout | 16 |
OY | boy | 17 |
b | bin | 18 |
C | chin | 19 |
d | din | 20 |
D | them | 21 |
f | fin | 22 |
g | gain | 23 |
h | hat | 24 |
J | gin | 25 |
k | kin | 26 |
l | limb | 27 |
m | mat | 28 |
n | nat | 29 |
N | tang | 30 |
p | pin | 31 |
r | ran | 32 |
s | sin | 33 |
S | shin | 34 |
t | tin | 35 |
T | thin | 36 |
v | van | 37 |
w | wet | 38 |
y | yet | 39 |
z | zen | 40 |
Z | genre | 41 |
% | silence | 0 |
@ | breath intake | 1 |
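To check a phoneme string against the table above, it helps to break it into individual symbols. Here is a minimal Python sketch that does this using the table's symbols and opcodes. The function name tokenize_phonemes is hypothetical, and the sketch simply skips anything that is not in the table (stress digits, prosody marks, spaces and punctuation), which is my simplification rather than the synthesizer's behavior.

```python
PHONEME_OPCODES = {
    "%": 0, "@": 1,
    "AE": 2, "EY": 3, "AO": 4, "AX": 5, "IY": 6, "EH": 7, "IH": 8, "AY": 9,
    "IX": 10, "AA": 11, "UW": 12, "UH": 13, "UX": 14, "OW": 15, "AW": 16,
    "OY": 17, "b": 18, "C": 19, "d": 20, "D": 21, "f": 22, "g": 23, "h": 24,
    "J": 25, "k": 26, "l": 27, "m": 28, "n": 29, "N": 30, "p": 31, "r": 32,
    "s": 33, "S": 34, "t": 35, "T": 36, "v": 37, "w": 38, "y": 39, "z": 40,
    "Z": 41,
}


def tokenize_phonemes(phonemes):
    """Split a phoneme string into table symbols; other characters are skipped."""
    symbols = []
    i = 0
    while i < len(phonemes):
        pair = phonemes[i:i + 2]
        if len(pair) == 2 and pair in PHONEME_OPCODES:
            symbols.append(pair)            # two-letter vowel symbol
            i += 2
        elif phonemes[i] in PHONEME_OPCODES:
            symbols.append(phonemes[i])     # one-character symbol
            i += 1
        else:
            i += 1                          # not in the table: skip it
    return symbols


symbols = tokenize_phonemes("krIHstAXl")
print(symbols)
print([PHONEME_OPCODES[s] for s in symbols])
```

Trying the two-letter symbols first works because none of them begins with a letter that is also a one-character symbol.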
Type | Symbol | Description of Effect |
Lexical stress: | | Marks stress within a word |
Primary stress | 1 | AEnt2IHsIXp1EYSAXn ("anticipation") |
Secondary stress | 2 | AEnt2IHsIXp1EYSAXn ("anticipation") |
Syllable breaks: | | Marks syllable breaks within a word |
Syllable mark | = (equal) | AEn=t2IH=sIX=p1EY=SAXn ("anticipation") |
Word prominence: | | Marks the beginning of a word (required) |
Unstressed | ~ (asciitilde) | Used for words with minimal information content |
Normal stress | _ (underscore) | Used for information-bearing words |
Emphatic stress | + (plus) | Special emphasis for a word |
Prosodic: | | Placed before the affected phoneme |
Pitch rise | / (slash) | Pitch will rise on the following phoneme |
Pitch fall | \ (backslash) | Pitch will fall on the following phoneme |
Lengthen phoneme | > (greater) | Lengthen the duration of the following phoneme |
Shorten phoneme | < (less) | Shorten the duration of the following phoneme |
Punctuation | Pitch effect | Timing effect |
. (period) | Sentence final fall | Pause follows |
? (question) | Sentence final rise | Pause follows |
! (exclam) | Sentence final sharp fall | Pause follows |
... (ellipsis) | Clause final level | Pause follows |
, (comma) | Continuation rise | Short pause follows |
; (semicolon) | Continuation rise | Short pause follows |
: (colon) | Clause final level | Short pause follows |
( (parenleft) | Start reduced range | Short pause precedes |
) (parenright) | End reduced range | Short pause follows |
" or ' (quote dbl left, quote single left) | Varies | Varies |
" or ' (quote dbl right, quote single right) | Varies | Varies |
- (hyphen) | Clause final level | Short pause follows |
& (ampersand) | Forces no addition of silence between phonemes | |