Requirements for Maintaining Web Access for Hearing-Impaired Individuals

Daniel M. Berry
Computer Science Department, University of Waterloo
Waterloo, Ontario N2L 3G1, Canada
Phone: None, use fax or e-mail
FAX: +1-519-746-5422
dberry@uwaterloo.ca

reprinted from

Proceedings of the Third International Workshop on Web Site Evolution, Florence, Italy, 10 November 2001, published by IEEE Computer Society.

© 2001 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Below is the paper as it appeared in the proceedings, converted to HTML form in a way that exhibits its structure:


Requirements for Maintaining Web Access for Hearing-Impaired Individuals [footnote:1]

Daniel M. Berry
Computer Science Department, University of Waterloo
Waterloo, Ontario N2L 3G1, Canada
Phone: None, use fax or e-mail
FAX: +1-519-746-5422
dberry@uwaterloo.ca

(All footnotes are gathered in a section at the end of the document)

Abstract

The current textual and graphical interfaces to computing, including the Web, are a dream come true for the hearing impaired. However, improved technologies for voice and audio interfaces threaten to end this dream. Requirements are identified for continued access to computing for the hearing impaired. Consideration is also given to improving access for the sight impaired.

Keywords:

access
closed captioning
e-mail
fax
hearing impaired
lipreading
lipsynching
movies
sight impaired
talking head
telephone
textual and graphical interfaces
TTY
TV
video phone
voice and audio interfaces
voice synthesis

1: Introduction

I have been hearing impaired (HI) from birth and understand spoken language mostly by reading lips. I have always had problems using a telephone; it is hard to read lips on it. I have always been more comfortable with written communication. I have been using computers since 1965 and have been using the ARPA Net and later the Internet for communication since 1979. Computers, up to now, have been a boon to me, and for that matter to the rest of the HI world. In particular, they allow me to communicate with nearly all of my circle of acquaintances, a large fraction of whom are in the computer business, by textual and graphical means, i.e., by e-mail, by Web page interaction, etc. For the few acquaintances that do not have e-mail, [footnote:2] fax usually is available.

More recently, telephones have gotten even more difficult to use. The equipment available today is of markedly lower quality than the equipment we used to rent from Western Electric, and there is more distortion when the sound is amplified. In addition, the increased use of answering machines, voice mail, and voice-directed menu selection [footnote:3] has taken away the possibility of my asking the person on the other end of a call if I understood her [footnote:4] or of my requesting her to repeat what she just said. In essence, I have become disenfranchised from the telephone, so much so that I do not give out my phone number any more. [footnote:5] This disenfranchisement was not so bad, since it was always difficult to use the telephone, and in any case, computers provided an alternative communication means that has become almost as universal as the telephone, at least among those with whom I want and need to communicate. Quite naturally, I have a vested interest in keeping things the way they are.

Therefore, when I read about work being done to build voice interfaces to computers, [footnote:6] I panic. I see that computers may be going the way of telephones, towards my disenfranchisement. I watch Star Trek, which takes place some 250 years in the future, and see people interacting with the shipboard computer by talking with it. I personally would prefer that computers stay with entirely textual and graphical interfaces (TGIs). Of course, I cannot stop the trend. Also, strictly TGIs are a problem for sight-impaired (SI) people, who naturally prefer voice and audio interfaces, i.e., sound interfaces. Therefore, by this paper, I attempt to prevent my total disenfranchisement by recommending changes to the future directions that will make it possible for me, and the rest of the HI world, to continue to work with computers and to use computers for communication.

I get the feeling that my disenfranchisement from the phone happened partially because people like me did not complain enough, probably because an alternative was becoming more usable at the same time. Thus, I feel that it is necessary for me and people like me to take active steps to prevent disenfranchisement from the computer, the Internet, and the Web, that is, to maintain Web access for the HI individual.

Lest the reader believe that the problem is entirely mine, consider that according to 1990 and 1991 surveys by the National Center for Health Statistics, approximately 8.6% of the U.S. population three years and older have hearing problems, and that among these, 2.75% are profoundly deaf. [footnote:7]

In order to understand the reasoning behind the proposals, it is necessary to understand what an HI individual can and cannot do and why. Section 2 tries to give this background. In case Section 2 is too abstract, the appendix gives details about one specific HI person, me. While I am unique and atypical in many ways, I share many attributes, problems, limitations, solutions, needs, and hopes with all HI people. Section 3 observes that the HI and the SI have conflicting requirements. The proposals are presented in Section 4. Section 5 describes other work towards the same goal. Section 6 concludes the paper.

2: Abilities and Classifications of HI Persons

According to traditional audiology, understanding speech requires being able to hear with no more than a 75 decibel (db) loss in the range of 500 to 2000 Hertz (Hz). Figure 1 shows my audiogram with this requirement represented as a rectangle bounded by a dotted line.

Figure 1: Audiogram

Below is a textual description of the picture in the PDF file:

The left-ear plot goes through:
(125 Hz, 30 db loss)
(250 Hz, 50 db loss)
(500 Hz, 75 db loss)
(1000 Hz, 85 db loss)
(2000 Hz, 120 db loss)
(4000 Hz, 120 db loss)

The right-ear plot goes through:
(125 Hz, 15 db loss)
(250 Hz, 30 db loss)
(500 Hz, 55 db loss)
(1000 Hz, 80 db loss)
(2000 Hz, 110 db loss)
(4000 Hz, 110 db loss)

An audiogram shows two plots, one for each ear. The plot for an ear shows, for each frequency, the hearing loss of the ear at that frequency. The loss of an ear at a frequency is measured by determining the minimum volume required for the ear to hear a tone of that frequency. The more of the speech-understanding rectangle that lies below the plots for an ear, the more that ear can help understand human speech. More recently, the regions required for hearing vowels and consonants have been mapped. They give a more accurate way to determine whether or not a person can understand speech and to identify which parts of it he does. The more of these regions that lie below the plots for an ear, the more that ear can help understand the vowels and consonants, respectively. Note that the vowel region is entirely contained within the consonant region, since some consonants, e.g., ``m'', are not just explosions and have a voice component, as do all vowels. Note also that according to the speech-understanding rectangle, I appear to understand much less than I know I do; the vowel and consonant regions model my understanding more accurately.

There are several independent ways to classify an HI person, by

  1. severity of his hearing loss,
  2. length of time he has had the hearing loss, and
  3. kind of input he requires in place of pure voice.
These classifications are at best a guide to an initial guess as to what the HI person is able to do. Many individuals do not fit exactly into the classifications, and the capabilities of many individuals differ from what I claim is typical for persons in each classification. Nevertheless, the reader should gain an appreciation for what is possible and what is needed in Web interfaces to accommodate the HI.

2.1: Severity-of-Loss Classification

There are three basic groups of HI, according to severity of hearing loss:

  1. A person in the first group has less than a 50db loss in all frequencies; that is, he has some usable hearing in all frequencies.
  2. A person in the second group has greater than 100 db loss in all frequencies; that is, he is considered totally deaf.
  3. A person in the third group is in neither the first nor the second group. He has usable hearing in some ranges of frequencies and is totally deaf in other ranges of frequencies.
I happen to be in the third group.

Typically, a person in the first group speaks fairly well and wears a hearing aid that amplifies all frequencies. With such an aid, the person functions about as well as a non-HI person. Typically, a person in the second group only signs and does not wear an aid, which is actually quite useless for his hearing. However, very rarely, a person in the second group has been trained to make use of the very tiny residual hearing he does have with the help of a hearing aid and with or without lipreading. In the third group, a majority, though a smaller one, only sign. Less rarely than in the second group, a person of the third group uses the hearing he does have with the help of an aid and with or without lipreading. The reason that most of the second and third groups sign is that for historical and traditional reasons, most of them are sent to schools for the deaf in which they learn signing and are not taught to make use of the hearing they do have.

A person in the first group may be functionally not HI, especially if he is using a good hearing aid.

2.2: Length-of-Time-of-Loss Classification

When classifying an HI person by the length of time he has had the hearing loss, two groupings emerge.

  1. A person in the first group lost his hearing before he could talk, i.e., during birth or infancy.
  2. A person in the second group lost his hearing after he learned to talk, i.e., during youth or adulthood.
I am in the first group.

This classification is fuzzier than most, but the keys are whether at the time the person loses his hearing,

  1. he has already learned to speak normally and can continue to make the sounds correctly even though he can no longer hear what he is supposed to be imitating, and
  2. he already knows what speech normally sounds like and thus knows what he is missing.
Someone in the first group answers ``no'' to both questions and someone in the second group answers ``yes'' to both questions.

The typical person in the second group speaks quite well but has difficulty understanding speech because he has had to relearn hearing or to learn lipreading or signing at an age in which acquisition of a new language or even a new form of input for a familiar language is very difficult. This difficulty seems to be independent of the severity of loss. The typical person in the first group behaves as predicted according to the severity-of-loss classification.

A person in the second group may be functionally not HI, especially if his hearing loss is not severe or he is wearing a good hearing aid.

2.3: Kind-of-Input Classification

There are three groups, when classifying a person according to the input he requires.

  1. In the first group, the person requires signing.
  2. In the second group, the person uses a combination of residual hearing and lipreading to understand speech as it is spoken.
  3. In the third group, the person uses only residual hearing.
I am in the second group.

A person in the third group typically has a mild loss that is uniform over the spectrum. He can generally get by in the hearing world if he is assisted by a hearing aid that corrects the loss. A person in the first group has never really learned to handle arbitrary speech, and even a hearing aid does not make it possible for him to understand speech without the use of an alternative input medium such as signing or text. A person in the second group generally wears an aid. Usually, he also signs, particularly if he has a lot of acquaintances who are also HI.

A person in the third group may be functionally not HI, especially if his hearing loss is not severe or he is wearing a good hearing aid.

Many signers cannot read lips at all. Among those that do read lips, many do so poorly and could not rely on lipreading for total and accurate input. Statistically, these signers are the largest group of HI that have to be accommodated on the Web. Therefore, the next paragraph describes the situation of the typical signer. As mentioned in the Introduction, there are exceptions to this description.

The typical signer communicates only by signing. He has very poor speech, which is very difficult for a non-HI person to understand without getting used to it. He interacts only with other signers, whether they are HI or non-HI people who have learned signing, e.g., his non-HI close relatives and friends. He is not able to hear on the telephone and uses TTY [footnote:8] in place of the telephone to communicate with his HI acquaintances, with relatives and close friends who have gotten TTY units, and with organizations offering TTY lines. He reads and writes and can use computers, e-mail, and fax. He requires captions or subtitles on TV shows or movies.

2.4: Summary

However different the abilities of HI persons are, for any given HI person, unless he is functionally not HI, the basic fact is that he cannot depend on auditory input, and such auditory input must be replaced by or augmented by visual input.

3: The HI and the SI

It should be clear that what is good for the HI is not good for the SI and vice versa. Right now, the Web is perfect for the HI and not so good for the SI. Since the HI have it good on the Web, they are not complaining. However, the SI are complaining, and legitimately. As a result of the complaints of the SI, R&D exists that is directed towards enfranchising the SI. That enfranchisement can easily come at the expense of the HI, possibly even disenfranchising the HI. There is no need for the HI and SI to be competing. Therefore, this paper recommends ways to prevent the disenfranchisement of the HI without impeding progress to enfranchise the SI.

I give recommendations that are valid for all HI, providing, when possible, for those who do have some auditory input and oral output. I also try to take into account the SI who cannot use text and pictures directly, but can use text converted to voice or textures, e.g., in the form of Braille.

For my recommendations on behalf of the SI, I am using the experiences of a blind student who took one of my courses recently. He had difficulty with the electronic copies of my slides and the course Web page, particularly when these involved pictures and diagrams. He was able to read the text of these through a device with earphones that could read ASCII or scanned text and pronounce what it read.

4: Recommendations for Sound-Based Human-Computer Interfaces

At the highest level, my recommendations are:

  1. When the computer speaks to the user, it should do so both by sound and by text or pictures, and the sound and the text should be synchronized, to minimize the cognitive interference that happens when captions are shifted too far from the video that they caption. An added nicety would be to have a visible talking head mouthing out the sound, to allow those who read lips to do so rather than having to read the text. (A minimal sketch of such synchronized output follows this list.)
  2. When the computer is to accept input from the user, it should accept both voice and textual input. Many HI people are not able to speak well or consistently, and many SI people find that typing is difficult.
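To make recommendation 1 concrete, the following is a minimal sketch, in Python, of output delivered cue by cue in both media at once. The speech synthesizer and the caption display here are hypothetical stand-ins rather than any particular product's interface; a real system would substitute its own synthesizer and, where available, a lipsynched talking head driven by the same cues.

    # Minimal sketch of recommendation 1: every spoken output is also shown as
    # text, and the two are kept synchronized cue by cue.  The synthesizer and
    # caption display are hypothetical stubs, not a real product's API.

    import time

    def speak(text, seconds):
        """Hypothetical synthesizer: pretend to speak `text` for `seconds`."""
        time.sleep(seconds)

    def show_caption(text):
        """Hypothetical caption display: here, just print to the screen."""
        print(text, flush=True)

    def present(cues):
        """Present a message as synchronized speech and captions.

        `cues` is a list of (text, duration-in-seconds) pairs.  Each caption
        is shown at the instant its audio begins, so an HI user can read
        while a non-HI or SI user listens, with no drift between the media.
        """
        for text, seconds in cues:
            show_caption(text)      # visual channel, shown as the cue starts
            speak(text, seconds)    # audio channel for the same cue

    if __name__ == "__main__":
        present([("Welcome to the library catalogue.", 2.0),
                 ("Say or type the title you are looking for.", 3.0)])

The point of the sketch is only the structure: because both channels are driven from the same cue list, the text cannot lag or lead the sound.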

4.1: Output From the Computer

As mentioned, when the computer outputs to the user, it should do so both in sound and in text or pictures. The specifics of this recommendation depend on which medium is the original source and thus on which other media have to be generated from the source.

4.1.1: Source is Text

If the source is text, then the sound can be generated by a voice synthesizer operating on the text, such as the one my blind student used to read ASCII files. Providing a talking lipreadable head synchronized with the generated sound would require use of the technology of lipsynching. [footnote:9] Lipsynching allows animation of faces having lipreadable mouths synchronized with sound. However, the talking lipreadable head is not essential if the source is already text.

If the source is text in a phonetic alphabet designed to make voice synthesis easier, then this phonetic text should be displayed. HI people who watch real-time closed captioning are used to dealing with incorrect spellings that yield correct pronunciations. It would take such a person a short time to get used to reading the phonetic alphabet.

4.1.2: Source is Real Person's Voice

If the source is the voice of a real person, then a video of that person can be made as he is being recorded. This video would provide the lipreadable talking head. In this case, captioning is necessary to augment the video and sound. If the person is reading a script, then the script can be displayed, as is done with closed captioned pre-recorded shows. The captions should be synchronized with the sound.

For live video, presenting the text requires real-time captioning by a person with the skills of a court-room stenographer, as is done for closed captioning of live television, e.g., the news or sporting events. Perhaps in the future, automatic voice and speech recognition will have advanced to the stage that this software can provide captions in real time.

For previously recorded video such as of movies and pre-recorded TV shows, captions, if available, should be shown. If captions are not already in the video, then they need to be added. In any case, the captions should be synchronized with the sound.

4.2: Input from User

The computer should be prepared to accept input by a variety of means, without the user having to announce beforehand the preferred form of input. That is, the way the user replies to any query output by the computer should determine the actual medium of input on the fly.

The means of input that can be accepted are

  1. voice, powered by voice recognition technology,
  2. keyboard, typing a direct response, and
  3. mouse, clicking on buttons or menu entries or making gestures.
If the user has difficulty speaking clearly and consistently, as do many HI people, voice input may not work reliably, and the other means of input will be needed.
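As a minimal sketch of this recommendation, the following Python fragment reduces a reply, in whichever of the media just listed it arrives, to one canonical textual answer. The recognizer functions are hypothetical stubs standing in for real voice, keyboard, and mouse handling; the point is only that dispatch happens after the reply arrives, so the user never has to declare a preferred medium.

    # Minimal sketch of recommendation 2: accept a reply in whatever medium
    # the user actually chose.  The per-medium handlers are hypothetical stubs.

    def from_voice(audio):
        """Hypothetical recognizer: may fail for users who cannot speak clearly."""
        return audio.get("transcript")          # None if recognition failed

    def from_keyboard(typed):
        return typed.strip() if typed else None

    def from_mouse(clicked_label):
        return clicked_label                    # e.g., the label of the clicked button

    def interpret_reply(event):
        """Reduce a reply, in whatever medium it arrived, to one textual answer."""
        handlers = {"voice": from_voice, "keyboard": from_keyboard, "mouse": from_mouse}
        answer = handlers[event["medium"]](event["data"])
        if answer is None:
            # Fall back gracefully, offering the textual media explicitly.
            return "Sorry, I did not catch that.  Please type or click your answer."
        return answer

    if __name__ == "__main__":
        print(interpret_reply({"medium": "keyboard", "data": "home sales"}))
        print(interpret_reply({"medium": "voice", "data": {"transcript": None}}))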

4.3: Summary

Looking back over the recommendations, it appears that a textual interface is the key. The HI person who is not SI can function with text. Moreover, from text, one can synthesize other representations, such as large letters, braille, and voice, that can help the SI. While generating other media from text is straightforward, generating text from other media is not even algorithmic in many cases. We still cannot generate text reliably from voice. Thus, text is the simplest basis representation.

5: Other Work

Just as this paper was accepted for publication, ACM's Interactions published W3C's ``Web Content Accessibility Guidelines 1.0'', dated 5 May 1999 [reference:1]. I was completely unaware of the effort, but will endeavor to participate in the future. The report is noteworthy to me because it goes to the heart of my own recommendation.

The W3C report's main recommendation is that text should always be available for any artifact. ``The guidelines do not suggest avoiding images as a way to improve accessibility. Instead, they explain that providing a text equivalent of the image will make it accessible.... Text content can be presented to the user as synthesized speech, braille, and visually-displayed text. Each of these three mechanisms uses a different sense—ears for synthesized speech, tactile for braille, and eyes for visually-displayed text—making the information accessible to groups representing a variety of sensory and other disabilities.... While Web content developers must provide text equivalents for images and other multimedia content, it is the responsibility of user agents (e.g., browsers and assistive technologies such as screen readers, braille displays, etc.) to present the information to the user.''

If an artifact is not readily textual, a functionally equivalent textual representation should be available. That is, if the artifact is a digitized photograph of a house,

  1. and the purpose of the picture is to show the viewer a pleasant scene containing a house, the alternative text for the picture should be something like ``photograph of a pleasant scene containing a house''
  2. and the purpose of the picture is to be an icon for transferring to the home sales department, the alternative text for the picture should be something like ``transfer to the home sales department''
  3. and the purpose of the picture is to sell the specific house pictured, the alternative text for the picture should be a detailed description of the house, for example, ``picture of a newly painted wood-frame house with three bedrooms, two and a half bathrooms, large kitchen, two-car garage....'' (A minimal markup sketch of these three cases follows this list.)
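Following up the house example, here is a minimal sketch, in Python, of how a page generator might attach the three purpose-dependent alternative texts to the same image through the HTML alt attribute. The file name and the surrounding page fragments are hypothetical; they are there only to show that the text equivalent follows the image's function, not its pixels.

    # Minimal sketch of the W3C guideline applied to the house photograph
    # above.  The same image file gets different alternative text depending
    # on the purpose it serves; file name and link target are hypothetical.

    from html import escape

    def img(src, alt):
        """Return an HTML img element carrying a functionally equivalent text."""
        return '<img src="%s" alt="%s">' % (escape(src, quote=True),
                                            escape(alt, quote=True))

    # 1. The picture shows a pleasant scene: describe the scene briefly.
    scene = img("house.jpg", "photograph of a pleasant scene containing a house")

    # 2. The picture is a navigation icon: describe its function, not its content.
    link = '<a href="home-sales.html">%s</a>' % img(
        "house.jpg", "transfer to the home sales department")

    # 3. The picture is the content itself: describe the house in detail.
    listing = img("house.jpg",
                  "picture of a newly painted wood-frame house with three "
                  "bedrooms, two and a half bathrooms, a large kitchen, and "
                  "a two-car garage")

    if __name__ == "__main__":
        print(scene, link, listing, sep="\n")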

The reader is urged to consult the published report or the Web page for more details.

Finally, just as the final copy of this paper was being prepared for inclusion in these proceedings, I learned of two organizations dealing with Internet access for disabled people,

  1. the Special Needs Working Group of the Internet Societal Task Force (ISTF) part of the Internet Society (ISOC) (http://www.istf.org/wg/special-needs/index.html), and
  2. the International Center for Disability Resources on the Internet (ICDRI) (http://www.icdri.org/).
I also learned of a company, Signtel (http://www.signtelinc.com), that builds assistive technology for the hearing impaired for use by on-line organizations. The company has developed some of the technology that is needed to implement the suggestions of Section 4. In particular, it has developed software to map among the various media, and to do so synchronously, so that the various media can be used to complement each other.

6: Conclusions

In this paper, I have given some recommendations of things that will help keep computers accessible to the HI population while affording more opportunity for the SI population to use computers. I have described the various kinds of hearing impairment, including my own, to motivate and explain my recommendations.

The recommendations do not require any new technology or research. They require only understanding the problem and the solutions, being aware of opportunities to solve the problem, and being careful to apply the recommendations as Web page structure and content are planned.

Acknowledgments

I thank Nitza Avni, Helen Beebe, David Brown, Mike Burks, Vint Cerf, Michael Comet, Igor Finestein, Emil Froeschels, Craig Gotsman, Antoinette Goffredo, Naomi Heller, and Mike Melkanoff for their teaching, discussions, and input. I thank the anonymous referees of the first draft of this paper for criticisms that allowed me to improve the presentation considerably. I was supported in part by NSERC grant NSERC-RGPIN227055-00.

References

Chisholm, W., Vanderheiden, G., and Jacobs, I., ``Web Content Accessibility Guidelines 1.0'', ACM Interactions 8:4, pp. 35–53, July + August 2001. Also at http://www.w3.org/TR/1999/WAI-WEBCONTENT-19990505/

A: Appendix—My Hearing, Speech, and Communication

This is a personally motivated position paper. Therefore, a little background about me is useful. Also, I am a concrete example of the general HI person described in Section 2.

A.1: My Hearing

I have been HI since birth. I do not sign, but I do read lips. I read lips well enough that people forget that I do not hear very well and that I cannot understand speech over sound equipment that does not allow me to see the speaker's lips, such as the telephone. Notice that in the author's address information in this paper, I explicitly list no telephone number; instead I direct people to fax or e-mail.

I hear a little, with a 50 db loss, at frequencies below 500 Hz. Thus, I can hear vowels and sounded consonants such as ``m'' and ``b''. I am essentially totally deaf, with a 110 db loss, at frequencies above 1000 Hz. Thus, I cannot hear non-sounded consonants such as ``s'' and ``p''. My audiogram, shown in Figure 1, shows that my hearing misses most of the rectangular region considered essential for understanding speech. Clearly, I cannot follow normal speech because so many of the sounds are missing. That is, with the sound that I hear, the language is too ambiguous. To me, with sound alone, each of ``cam'', ``fam'', ``ham'', ``kam'', ``pam'', ``qam'', ``ram'', ``sam'', ``tam'', ``wam'', and ``xam'' sounds like ``am''.

I wear a hearing aid to help me make better use of the little hearing I do have. A hearing aid that amplifies every frequency would be counterproductive, since it would amplify beyond comfort that which I can hear without it, and it would amplify low-frequency background noise to the point of distraction. Therefore, I wear a special, prescription hearing aid. The amount of amplification at any frequency below 1000 Hz decreases with the frequency. Since I have no hearing at all above 1000 Hz, it does nothing to those frequencies. Also, since my hearing decreases with increasing frequency, it shifts frequencies below 1000 Hz a bit lower, although not enough to cause me to lose the ability to distinguish voice tones sufficiently to read emotions.

The hearing aid also has a telephone coil. This coil is actually a radio receiver that picks up the radio waves generated by the electromagnetic oscillator in a good handset speaker. Because the signal is picked up as radio waves, the sound I hear has not suffered any distortion by transmission through the air; the sound is generated inside the hearing aid. Unfortunately, there are handsets that do not work with the phone coil; they use carbon oscillators that do not generate electromagnetic waves in addition to the sound waves. Carbon oscillators are found on the cheaper handsets and cellular phones.

A.2: My Lipreading

I read lips to fill in the missing sounds. I learned to read lips the same way that most people learn to understand spoken language. As a toddler, I began to notice patterns of lip movements and the sounds that I heard that were highly correlated with meaning, just as the average person notices patterns of sounds that are highly correlated with meaning. To the average person, the sound patterns are sufficiently unambiguous that lip movements are not needed to disambiguate. In my case, with the addition of lipreading, all of the words above that sound like ``am'' are distinguishable from ``am'' and from each other.

Lipreading itself is not unambiguous. It is a lot less ambiguous than the portion of speech that I hear, but it is a bit more ambiguous than speech is for the hearing person. Specifically, some letters that sound different appear the same on the lips. For example, ``m'', ``b'', and ``p'' appear the same, and so do ``d'' and ``t''. I said that this is a slight ambiguity, because even hearing people deal with this sort of ambiguity; ``k'' and ``c'' followed by ``a'', ``o'', or ``u'' sound the same, but people distinguish words containing them by context. In my case, I am able to hear ``m'' and ``b'', but I cannot hear ``p''. So if the lips appear like one of them, and I cannot hear the sound, the letter must be a ``p''. This decision is carried out entirely subconsciously, just as is distinguishing the different meanings of a homonym. Therefore, I need the sounds I hear to disambiguate the lookalike letters. Thus, I cannot read lips when there is no voice or in noisy rooms, because I am lacking some important disambiguating information.

This need of voice to disambiguate lipreading lookalikes is quite personal and is language dependent. Other HI people with less hearing do not hear even ``m'' and ``b'', but they have learned, as effortlessly as the hearing person learns to distinguish homonyms, to use language knowledge and context to distinguish between ``m'', ``b'', and ``p''. The lips for ``micro'' are definitely saying ``micro'', because ``bicro'' and ``picro'' are not words, and knowledge of the context tells the listener whether the word is ``Mom'', ``Bob'', ``Pop'', ``mop'', ``mob'', ``bomb'', or ``pomp'' after language knowledge has eliminated the other combinations. In Hebrew, there is a group of eight letters that appear the same and have sounds that are outside of my hearing range. So I have trouble with Hebrew. There are native Hebrew lip-readers. Thus, the ambiguity introduced by these eight letters must be manageable for the native speaker.

I am able to read lips from the side, and the lips of a non-native speaker of English speaking with a heavy accent seem not to faze me. However, I do have problems reading lips and understanding native speakers of Australian English, known as Strine (spelled ``Australian''), and of the Scottish brogue.

A.3: My Speech

My native, natural speech is a reflection of what I hear and lipread, just as the hearing person's natural speech is a reflection of what he or she hears. I do not hear the letter ``s'' at all and recognize it only by its lip and teeth configuration. Thus, in my natural speech, when I intend to say ``s'', my lips and teeth go to the right places, but there is no sound. My pronunciation of ``Sam'' is ``am'' preceded by my lips and teeth being right for ``s'' for the right amount of time, but with no sound. Later, as a teenager, I was trained to make sounds I cannot hear. However, since I cannot hear them, I cannot be sure that I make them correctly or even at all. I am quite sure that I sometimes do not.

A.4: My Communication

My hearing, lipreading, and speech contribute to a particular pattern of communication in which I do certain things to ensure understanding of speech and in which I avoid things I cannot do.

A.4.1: My Conversations

In order for me to listen to or converse with someone, I need to position myself so that I can both hear her voice and see her lips. Lectures, when I can sit close enough to the speaker, and one-on-one conversations are easiest. When the number of people in a conversation is more than three and the conversation moves randomly around the group, I get lost. By the time I have found the person who is speaking to read her lips, I have missed the first sentence or so. I end up missing portions of the conversation that are essential for following the conversation. Hence, I shy away from large groups and parties.

When I follow the conversation by lipreading, I interact well enough that people forget that I am HI. I sometimes have to remind people to face me or to not cover their lips.

A.4.2: Other Languages

I read, write, and speak several languages besides English, namely French, German, Hebrew, Portuguese, and Spanish. However, I am not able to understand any of them spoken. I speak them well enough that people answer me in the language I speak. Therefore, it is dangerous for me to speak these languages, because I quickly get responses that lose me. The reason I cannot understand these spoken is that I cannot read lips in them. I have tried to learn to read lips in Hebrew by taking lessons and living in a Hebrew-speaking environment, in Israel, but even after three years of lessons and eleven years living in Israel, I was not able to break loose from the low plateau on which I was stuck. I later learned that learning to read lips in anything but one's native language after the age of 5 is virtually impossible.

A.4.3: Signing

I do not sign. Therefore, a signing interpreter is of no use to me. As a side effect of not signing, I have very few HI acquaintances, the number of which I can count on one hand. [footnote:10]

A.4.4: Telephone Use

I generally cannot understand what the person on the other end of a telephone conversation is saying, because I cannot see her lips. If, however, I am controlling the phone conversation and have constrained the subject or am asking yes-or-no questions, then I can follow what the other person is saying. In the first case, the possible answers are constrained enough that I can often hear enough of the words to tell which of the possible answers it might be. Then I can ask yes-or-no questions to confirm that I have heard them correctly. My hearing is good enough that I am able to distinguish ``Yes'' from ``No'' without reading lips; the vowels, which I can hear, are quite distinctive. I have learned to structure many conversations so that I can get all the information I need by asking strategic yes-or-no questions. While numbers are difficult to distinguish, I can ask the other person to count up to each digit.

Apart from these highly constrained situations, I cannot understand the other person, particularly if I am not expecting such a call and have no idea what the call might be about. I am often not even able to understand the name of the person who is calling.

Therefore, I generally do not answer my telephone. I use the telephone mostly only for incoming and outgoing faxes and outgoing phone calls that I can control. I have caller ID allowing me to see who is calling if she has not disabled my seeing that information. I make an exception and answer an incoming call when I can identify the caller and it is someone that I know well and can thus guess what the conversation might be about. On my home phone, so that people do not assume that I am not at home for long periods, I have a recording saying that even if I am at home, I do not answer the phone and to please send a fax to the same number. [footnote:11]

I cannot use a cellular phone or remote handset, even when I am controlling the call. Unfortunately, most such equipment does not have the required volume or, if it does, it distorts too much at high volumes, so that I cannot even understand ``Yes'' and ``No''. Many of them have only carbon oscillators that do not broadcast to the phone coil in my hearing aid. In fact, the only telephones I can use are the old Western Electric 600 standard telephones. The handsets have such good, undistorted sound that I can hear what I do hear even without amplification, so long as I am using the phone coil on my hearing aid. It seems that because these phones were built for rental, and AT&T had to replace them free of charge if there were any damage, they were built so well and so far beyond the minimum threshold that even with maximum amplification they are not near the equipment's limits. Since the so-called liberation of the phone services, when we had to start buying our own equipment, the quality has gone downhill. Fortunately for me, these old phones are indestructible. So, I have saved them and continue to use them.

If I am in a situation in which I need to make a phone call and I do not have the right equipment and I cannot be in control of the conversation, I ask someone else to be my ear, even when I am asking for a date!

A.4.5: Recording and IVR

The bane of my life is recorded messages, left for me in hotel rooms or played at numbers that I have called. Even if the subject is controlled, I have no way to confirm with the recording that I have heard it correctly. Moreover, the quality of the recorded voice is never as good as a real voice. What I hate the most is Interactive Voice Response (IVR), namely the automatic, recording-directed menu selection regime that is so common these days when one calls an institution. I am referring to those recordings that say ``Welcome to XXX. If you want to deal with AAA, press 1 now. If you want to deal with BBB, press 2 now, … and if you wish to speak to a customer service representative, please stay on the line.''

Not only do I have all the problems of understanding the recording and not being able to ask if I understood correctly, but also if I take a chance and hit the wrong key, I tend to get into a state that I cannot escape, because I do not always understand what is being said to me. Moreover, it seems like I am put on hold forever when I choose to stay on the line to speak to a human being. I am not even sure that there is a human being, because I cannot be sure that the recording did say, ``Please stay on the line to speak to a customer service representative.''

A.4.6: E-mail and Fax

Thus, for telephone-like communication with others, I use mainly e-mail and fax. Most of my acquaintances are computer people or their relatives. So, most people I know have e-mail and have had it for years. With the popularity of the Internet these days, more and more of my other acquaintances have e-mail. It has gotten to the point that when I meet a new acquaintance, female or male, I ask for an e-mail address instead of a phone number and I usually get it. These days the few acquaintances that do not have e-mail are businesses that have not computerized. Almost all of these have fax. So it is very rare indeed that I have to use the phone.

A.4.7: TTY

Many HI people use TTY units with the telephone in order to be able to communicate with others via a telephone with text. Two people with TTY units at the opposite ends of a call connection type to each other in real time, much as with the UNIX talk command, except that the screen is not split into send and receive windows. The sent and received text are interleaved. Hence, the conversers have to set up a protocol to prevent the two from talking at once.

Many institutions provide TTY numbers and operators to allow HI people to interact with them. A TTY unit consists basically of an old-fashioned hard-copy (key and ribbon) terminal together with a 150 baud modem operating with an ancient 5-bit character code called Baudot. Baudot was the code used before ASCII, and it was adopted for TTY so that the HI community could get discarded equipment cheaply as the rest of the world adopted ASCII. [footnote:12] I do not use TTY because no one I communicate with has a unit. There are fewer than a handful of hearing-impaired people in my circle of acquaintances, perhaps because I do not sign. Each of these HI people happens to use e-mail like I do.

A.4.8: Watching TV or Movies

I cannot watch TV or movies by lipreading alone, since the person speaking is not always facing the camera. Some TV shows and movies have narration from off screen. I watch only TV shows and movies that are subtitled or that have closed captioning. I do not go to theaters except for subtitled movies. I wait until movies appear on video tape or DVD, and I boycott movies and producers that make non-captioned videos.

When I go to a place in which French, German, Portuguese, or Spanish is spoken, I am able to follow English-speaking movies that are subtitled in these languages. I can read these languages fast enough. While I can read Hebrew, because of its non-Latin alphabet, I cannot read it fast enough to be able to follow Hebrew-subtitled English-speaking movies. A subtitle disappears before I have finished reading it.

A.4.9: Video Conferencing

Quite clearly, it is impossible for me to participate in meetings conducted with a conference call or with a speaker phone. Assuming that a face-to-face meeting is not possible, only video conferencing has a possibility of working for me, as it offers the possibility of reading lips. I have been in a meeting in which the video was transmitted over a high-speed dedicated line that cost a fortune, and the update of the video was at standard TV rate, often enough that it was possible to read lips. So long as the speaker arranged to be facing the camera, I fared well. However, most of the time, video conferencing is done over a cheaper standard phone call connection or over the Internet, and the update of the picture is not frequent enough for smooth lip movement. Consequently, it is impossible to read lips. As the bandwidth of phone lines increases, this problem will solve itself.

A.5: Technology That I Would Love To Have

I am waiting for the day when video phone use is widespread enough that everyone with whom I interact has one. Then I would get one and would be able to lipread over the phone. There are videophones available now. Even ignoring the fact that not enough people have them, there is a problem inhibiting their use for lipreading. The current bandwidth available for videophones allows the video to be updated less frequently than is required for live action. The consequence is that the picture is updated infrequently enough that the video is really a sequence of disjoint stills rather than a continuous stream in which the lips appear to move. If I understand correctly, the designers of the video phone had a choice as to what to allow to degrade, the video or the audio. Based on the needs of most of the population, which hears well enough, it was decided that audio quality is more critical and that, to see the person to whom one is talking and to see where that person is, stop-motion video is sufficient. Stop-motion video might even be enough to read body language. However, for me and other HI people, the opposite choice should be made. That is, it would be preferable to me and to them that the audio degrade to preserve video quality. I could probably get enough of the voice to disambiguate lipreading from degraded audio.

Since each user is different, the best solution would be to give the user a means to choose what to degrade and by how much, perhaps with a slider stretching from 100% video quality to 100% audio quality.
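A minimal sketch of such a slider follows, in Python, assuming a fixed total bit rate; the floor values are hypothetical, there only to keep either channel from being shut off entirely. An HI user would push the slider toward video so that lip movement stays smooth; an SI user would push it toward audio.

    # Minimal sketch of the proposed degradation slider.  A slider position of
    # 0.0 gives all spendable bits to audio quality and 1.0 gives them all to
    # video quality.  Total rate and floor values are illustrative assumptions.

    def split_bandwidth(total_bps, slider, video_floor_bps=8000, audio_floor_bps=4000):
        """Divide `total_bps` between video and audio according to `slider` in [0, 1]."""
        if not 0.0 <= slider <= 1.0:
            raise ValueError("slider must be between 0 (all audio) and 1 (all video)")
        spendable = total_bps - video_floor_bps - audio_floor_bps
        video_bps = video_floor_bps + slider * spendable
        audio_bps = audio_floor_bps + (1.0 - slider) * spendable
        return round(video_bps), round(audio_bps)

    if __name__ == "__main__":
        # On a 64 kbps connection, an HI user might choose 0.9: most bits go to video.
        print(split_bandwidth(64000, 0.9))   # -> (54800, 9200)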

Voice recognition is improving steadily, to the point that there are products that can be taught to translate one user's voice into ASCII text. Perhaps in the near future, software will be able to translate an arbitrary voice, or a voice in a set of hundreds of previously trained voices, into ASCII text. When such technology is available, it should be utilized to provide real-time captioning of voices, both on TV and in voice-based user interfaces. Even if the accuracy were not perfect, but were only 95%, it would be usable by the HI. We are quite used to the sloppy, slightly delayed captions produced by human courtroom-style stenographers in real time during live news and sporting event broadcasts. The mistakes are plentiful and sometimes amusing. Most often the mistake is to a sound-alike sequence of words, e.g., ``eye deal'' instead of ``ideal'', and the listener has to listen to herself speak the words mentally. My feeling is that the technology will be no worse than the current real-time captioning. [footnote:13]


Footnotes

  1. In order to follow my own recommendations, a purely ASCII copy of the text of this paper is available at http://se.uwaterloo.ca/~dberry/FTP_SITE/reprints.journals.conferences/WSE_paper.txt.
  2. It's hard to believe that there are any left these days!
  3. I understand that voice-directed menu selection is universally disdained, even among the non-HI population.
  4. To avoid heavy usage of ``he or she'' as a third person singular personal pronoun, this paper alternates, on a section-by-section basis, the gender of the arbitrary persons introduced by quantifier equivalents.
  5. Please note the lack of phone number in the author's address at the beginning of this paper.
  6. I know also of proprietary research being done in a start up to provide voice recognition technology for use by e-commerce applications.
  7. Cited after http://www.signtelinc.com/dia-1.htm
  8. A TTY unit is a keyboard plus modem that communicates directly with other TTY devices over telephone lines using the 5-bit Baudot code at 150 baud. Consequently, it is incompatible with ASCII and the e-mail world, and TTY users form a closed world.
  9. See:
    1. http://www.garycmartin.com/ by Gary C. Martin
    2. http://www.thirdwishsoftware.com/magpie.html by Third Wish Software
    3. http://www.comet-cartoons.com/toons/3ddocs/lipsync/lipsync.html by Michael Comet
  10. For reasons beyond the scope of this paper, I believe that teaching signing, or even signing and speaking, is the worst thing that can be done to a HI person. He learns to sign, does not learn to speak, and can interact only with other signing people. Not teaching signing leaves the HI person no choice but to learn to read lips and to utilize the residual hearing he has. He does so with no more effort than hearing people spend learning to understand spoken language and than HI people spend learning to sign.
  11. Not one of those #%&! solicitors has been willing to take the effort to send me a fax!
  12. Of course, now the HI community is cut off even more from the rest of the world, which has gone ASCII and into bandwidths in the thousands of baud.
  13. Of course, for previously recorded TV shows, series, etc., it is possible to do perfect and synchronized captioning. Since the code used by the closed-captioning system is ASCII, often an ASCII rendition of the script is used. In this case, sometimes the captions do not agree with what is actually said. The actor said something that meant the same thing and the director accepted the change. However, the captions remain a copy of the script.