Usability Evaluation of Captions for People who are Deaf or Hard of Hearing
Sushant Kafle, Rochester Institute of Technology, sushant@mail.rit.edu
Matt Huenerfauth, Rochester Institute of Technology, matt.huenerfauth@rit.edu
Abstract
Many Deaf and Hard-of-Hearing (DHH) individuals across the world benefit from various captioning services for accessing information that exists in the form of speech. Today, automatic speech recognition (ASR) technology has the potential to replace existing human-provided captioning services due to its lower cost of operation and ever-increasing accuracy. However, as with most automatic systems, ASR technology is still imperfect, which raises issues of trust and acceptance when building a fully automatic communication service for these users. Thus, there is a need for evaluating the usability of these systems with users before deploying them in the real world. Yet, most researchers lack access to sufficient DHH users for extrinsic, empirical studies of these automatic captioning systems. This article presents our work on the development of an automatic caption-quality evaluation metric, which we design and validate through studies and real-world observations with DHH users.
1. Introduction
People who are Deaf and Hard-of-Hearing (DHH) make use of offline captioning (e.g., for pre-recorded television programming) or real-time captioning services (e.g., in classrooms, meetings, and live events) to access aural information in various mainstream communication settings. Today, such services are most commonly provided by a human professional who transcribes speech audio or other sounds into visual text. Although well-trained transcriptionists can produce accurate real-time captions at speeds of over 200 words per minute, services that rely on trained transcriptionists, e.g., Communication Access Real-time Translation (CART) or similar services [21], are not suitable for impromptu meetings or extremely brief conversational interactions, given the overhead cost of arranging a transcriptionist. With recent improvements in the accuracy and speed of automatic speech recognition (ASR), many ASR-based applications are now seeing wide commercial use in a variety of consumer applications. Due to their low cost and scalability, ASR systems have great potential for the task of automatic captioning.
However, accurate, large-vocabulary, continuous speech recognition is still considered an unsolved problem. Although there have been recent leaps in the performance of these systems [23], ASR performance is generally not on par with that of humans, who currently provide most caption text for DHH users. An ASR system may produce errors due to noise in the input audio, the ambiguity of human speech, or unforeseen speaker characteristics (e.g., a strong accent). As researchers continue to improve ASR accuracy, they generally report the performance of their systems using a metric called Word Error Rate (WER). Given the ubiquity of this metric, it is reasonable to assume that reducing WER is a goal of many ASR research efforts (implicitly, if not explicitly).

WER is calculated by comparing the "hypothesis text" (the output of the ASR system) to the "reference text" (what the human actually said in the audio recording). As shown in Fig. 1, the metric considers the number of misrecognition mistakes in the hypothesis text, normalized by the word-length of the reference text.
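For concreteness, a minimal Python sketch of this computation (an edit-distance alignment counting substitutions, deletions, and insertions, normalized by the reference word count) might look like the following; the function name and the toy sentence pair are illustrative only.

```python
# A minimal sketch of the standard WER computation: align the ASR hypothesis
# against the reference transcript with word-level edit distance, then divide
# the total number of substitutions, deletions, and insertions by the number
# of words in the reference.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edit operations to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution (or match)
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("sat" -> "sit") and one deletion ("the") in a
# six-word reference yield WER = 2 / 6 ≈ 0.33.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```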
Notably, WER does not consider whether some words are more important than others to the meaning of the message. This is a concerning limitation, because researchers have previously found that humans perceive different ASR errors as having different degrees of impact on a text – i.e., some errors distort the meaning of the text more severely than others [17]. Other researchers have found that the impact of errors may depend on the specific application in which ASR is being used [4, 19].
1.1 Metric of ASR Quality for DHH Readers
There are also reasons to believe that it is important to create and evaluate metrics for measuring ASR output quality specifically for DHH users. Anecdotally, some accessibility researchers have argued that ASR-generated errors in captions are harder to comprehend than errors produced by humans [1, 2]. Further, prior research has characterized differences in literacy rates and reading mechanisms between DHH readers and their hearing peers: standardized testing in the U.S. has measured lower English literacy rates for deaf adults [8, 15]. Furthermore, literacy researchers have hypothesized that the basic mechanism employed by many deaf adults to understand written sentences differs from that of hearing readers: specifically, deaf readers may identify the most frequent content words and derive a complete representation of the meaning of the sentence while ignoring other words [2, 3]. This reading strategy is often referred to as a "keyword" strategy, and it suggests that a subset of the words in a caption text might be of very high importance to DHH users (for text understandability). Following this same reasoning, it might be disadvantageous to penalize each error in a caption text equally. Some errors may be very consequential to the understandability of the text (with the potential to mislead or confuse readers), while other errors may have little impact (perhaps being easily ignored by readers). Our goal is to develop a metric that can predict the quality of an ASR text output based on the usability of that text as a caption for DHH users. Unlike WER, we want our metric to distinguish between harmful errors in the caption (likely to degrade its quality for DHH users) and less harmful errors; the metric should use this distinction when penalizing a text for each type of error.
2. Related Work
While WER is the most commonly used metric for evaluating speech recognition performance, researchers have argued for alternative evaluation measures that would better predict human performance on tasks that depend on the usability of ASR text output [16, 18]. Researchers have also argued that WER is ideally suited to evaluating ASR quality only for those applications in which a human corrects errors by typing, since the WER metric is based on counting errors – which directly relates to the cost of restoring the output word sequence to the original input sequence [16]. Moreover, researchers have observed a weak relationship between WER and human task performance across various applications. For example, in the task of spoken document retrieval (in which a human searches for a speech audio file, which has been transcribed by ASR, by typing search terms for desired information), researchers have found that the WER of the ASR system has little correlation with retrieval performance [5, 6]. Furthermore, researchers in [22] saw improvements in a spoken language understanding task even when WER increased significantly.
Several researchers have proposed alternatives to WER for evaluating ASR performance in specific applications, ranging from differential treatment of errors based on their linguistic properties (e.g., the relevance of a word in the text or its word category) [7, 17, 19] to re-engineered metric structures that better represent how errors affect the quality of text transcripts [18, 16]; these alternatives have been shown to work well in various application settings. In contrast, we propose a new captioning-focused evaluation framework, called the Automatic-Caption Evaluation (ACE) framework, to more accurately model the impact of an error on the understandability of a caption text. To measure this impact, the framework considers the importance of the affected word and the semantic deviation introduced by the error. Using this framework, this article discusses the design of several caption-quality evaluation metrics and evaluates their performance through studies with DHH users.
3. Automatic Caption Evaluation Framework
In our work, we are interested in the potential for ASR systems to be used as real-time captioning tools, especially for settings such as impromptu meetings where arranging a transcriptionist is not always feasible. There are many commercial and research ASR systems available, each with different capabilities, e.g., adapting to the voice of specific speakers, operating in contexts with different types of background noise, or recognizing different vocabularies or genres [7, 13, 14]. A natural question is how to compare ASR systems to determine their suitability for use in this context. Given the limitations of WER discussed in the previous section, we present a new framework, called the Automatic-Caption Evaluation (ACE) framework, which supports the design of better evaluation metrics for carefully assessing the efficacy of these tools.
The framework considers two primary factors for evaluating the impact of an error in a caption text: (a) the importance of the spoken word (reference) in understanding the meaning of the message and (b) the semantic deviation between the error word and reference word. These two factors are used to predict the impact of an error in a caption text as follows:

I(Wr, Wh) = α · IMP(Wr) + (1 − α) · D(Wr, Wh)     (Eq. 1)
where the (Wr, Wh) pair represents a recognition pair obtained by comparing (aligning) the automatic caption text with the actual human transcription of the spoken message, such that Wr ≠ Wh. IMP(Wr) represents the importance score of the reference word (Wr) in the meaning of the spoken message, D(Wr, Wh) represents the semantic distance of the aligned pair (Wr, Wh), and I(Wr, Wh) represents the impact of the error. Alpha (α) is the interpolation weight, which determines how much each of the two factors (word importance or semantic distance) contributes to the overall impact score: the overall impact of an error is a weighted combination of the importance of the reference word and the semantic distance between the error word and the reference word. In the following sections, we explain the key components of this framework in more detail and describe how we use them to create different automatic caption evaluation metrics.
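As an illustration only, Eq. 1 could be implemented as a short Python function; the callables word_importance and semantic_distance stand in for the sub-scores described in Sections 3.1 and 3.2, and the default value of alpha is an arbitrary placeholder.

```python
# A minimal sketch of Eq. 1, assuming the two sub-scores have each been
# scaled to the [0, 1] range; `word_importance` and `semantic_distance`
# are placeholders for the components described in Sections 3.1 and 3.2.

def error_impact(w_ref: str,
                 w_hyp: str,
                 word_importance,      # callable: IMP(Wr), in [0, 1]
                 semantic_distance,    # callable: D(Wr, Wh), in [0, 1]
                 alpha: float = 0.5) -> float:
    """Weighted combination of word importance and semantic distance."""
    return alpha * word_importance(w_ref) + (1.0 - alpha) * semantic_distance(w_ref, w_hyp)
```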
3.1 Word Importance Sub-Score
The word importance measure, represented by IMP(Wr) in Eq. 1, attempts to quantify the semantic contribution of a word to the user's understanding of a text. This contribution could be based on various complex underlying factors (e.g., the word's part of speech, its semantic role, and the prior context of the conversation). It could also be a very subjective measure, easily influenced by factors such as a user's literacy level in the language being spoken, previous experience with the topic of discussion, etc. Furthermore, there could be challenges specific to the application of captioning impromptu meetings for DHH users.

Despite this complexity, several prior researchers have proposed measures of word importance in a text [16, 10] – for instance, statistical measures like TF-IDF are commonly used. Apart from these measures, we also take inspiration from prior eye-tracking studies of the sentence-comprehension strategies of various readers, in which researchers found the predictability of a word to be one of the important predictors of its importance (in terms of eye-gaze duration during reading). In general, highly predictable words are read faster and skipped more often than unpredictable words by most readers [20], and especially by less-skilled readers [2]. To use the predictability of a word as a proxy for its importance, we investigate various methods for modelling the probability of occurrence of words, based on a large corpus of text, and predicting the predictability of each word given its context. Specifically, we consider a count-based n-gram distribution model and a neural-network-based word prediction model for word predictability estimation. An example of word importance scoring is shown in Fig. 2.
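As a simplified illustration of the count-based approach, the following Python sketch trains a bigram model on a text corpus and treats low predictability as a proxy for high importance; the add-one smoothing and the 1 − P(word | previous word) mapping are illustrative choices, not the exact models used in our work.

```python
# A simplified sketch: estimate how predictable each word is from a bigram
# model trained on a corpus, and treat low predictability as high importance.

from collections import Counter

def train_bigram_model(corpus_sentences):
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def word_predictability(prev_word, word, unigrams, bigrams):
    # Add-one smoothed conditional probability P(word | prev_word).
    vocab_size = len(unigrams)
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + vocab_size)

def importance_scores(sentence, unigrams, bigrams):
    tokens = ["<s>"] + sentence.lower().split()
    # Less predictable words receive higher importance scores.
    return [(word, 1.0 - word_predictability(prev, word, unigrams, bigrams))
            for prev, word in zip(tokens, tokens[1:])]
```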
3.2 Semantic Distance Sub-Score

Misrecognition errors in automatic captioning systems may be identified by comparing the caption text with a human transcription of what was actually said by the speaker. This comparison is typically conducted through a process called "text alignment." The semantic distance sub-score D(Wr, Wh) measures the quality of an aligned unit by measuring how far a prediction (Wh) is from the actual word (Wr). Notably, in contrast to the word importance sub-score described above, the semantic distance sub-score considers the quality of the transcribed word itself, without regard to its importance in context. For example, an error on an important word could be even more harmful to a text's understandability if the erroneous word that is displayed (in place of the correct word) is especially misleading or confusing.
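As an illustration of the alignment step, the following Python sketch uses the standard-library difflib module as a stand-in for the edit-distance alignment typically used in ASR scoring, collecting the (Wr, Wh) pairs that differ between the reference and the caption.

```python
# An illustrative word-level alignment using Python's difflib: it pairs
# reference words with hypothesis words and collects the mismatched pairs.

import difflib

def align_errors(reference: str, hypothesis: str):
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    error_pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":                    # substituted words: (Wr, Wh) pairs
            error_pairs.extend(zip(ref[i1:i2], hyp[j1:j2]))
        elif op == "delete":                   # reference words missing from the caption
            error_pairs.extend((w, None) for w in ref[i1:i2])
        elif op == "insert":                   # spurious words inserted by the ASR
            error_pairs.extend((None, w) for w in hyp[j1:j2])
    return error_pairs

print(align_errors("the cat sat on the mat", "the cat sit on mat"))
# [('sat', 'sit'), ('the', None)]
```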
Among the various possible strategies for computing this semantic disagreement between the error word and the actual reference word, we investigated Google's pre-trained word2vec model for our ACE-based evaluation. The word2vec tool provides a vector representation of a word, which can subsequently be used in many natural language processing applications. We use word2vec to compute the semantic distance between two words based on the cosine similarity of the vectors representing each word. Fig. 3 provides an example of word2vec-based semantic distance scoring.
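A minimal sketch of this computation, assuming the gensim library and the publicly released Google News word2vec vectors, is shown below; treating out-of-vocabulary words as maximally distant is an illustrative choice rather than a description of our exact implementation.

```python
# A minimal sketch of word2vec-based semantic distance using gensim and the
# pretrained 300-dimensional Google News vectors.

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def semantic_distance(w_ref: str, w_hyp: str) -> float:
    """Return 1 - cosine similarity, clipped to the [0, 1] range."""
    if w_ref not in vectors or w_hyp not in vectors:
        return 1.0                      # unknown word: assume maximal distance
    similarity = float(vectors.similarity(w_ref, w_hyp))
    return min(max(1.0 - similarity, 0.0), 1.0)

print(semantic_distance("doctor", "nurse"))   # semantically close -> small distance
print(semantic_distance("doctor", "carrot"))  # semantically distant -> large distance
```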
3.3 The Weighting Variable
The word importance sub-score IMP(Wr) and the semantic distance sub-score D(Wr,Wh) are combined using a weighted sum to produce an error impact score (as shown in Eq. 1), but that equation requires us to select a tuning parameter alpha (α) to specify how much each sub-score contributes to the overall error impact score.
To learn the appropriate value of alpha (α), we "fit" this parameter using a dataset of texts, each labelled with its overall understandability. In this case, we calculated these understandability labels from response data we had collected during a prior study with DHH participants [10]. The 30 participants (12 men and 18 women) were aged 20 to 32 (mean = 22.63, standard deviation = 2.63) and identified as DHH (26 self-identified as Deaf and 4 as Hard-of-Hearing). In the study, participants were presented with imperfect English text passages (containing errors that had been produced by an ASR system when the text had been spoken aloud and then automatically recognized), and they were asked to answer questions that required understanding the information content of each passage. Each question was based on information from a single sentence in the passage, and each sentence contained 0 or 1 ASR errors. Each text received a score of 1 if the participant answered the question correctly and 0 if the participant answered incorrectly.
From this previously collected dataset, we examined the subset of question responses corresponding to English sentences that contained ASR errors. This data was used to calculate an aggregate "comprehension score" for each sentence, by averaging the scores from the 30 participants on questions about that sentence.
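One plausible way to fit alpha from such data, shown below as a sketch, is a simple grid search that keeps the value whose predicted impact scores correlate most strongly (and negatively) with the per-sentence comprehension scores; the grid search and the use of Pearson correlation here are illustrative choices rather than a description of our exact procedure.

```python
# A sketch of fitting the interpolation weight: grid-search alpha and keep the
# value whose predicted error-impact scores are most negatively correlated with
# the comprehension scores (higher impact should mean lower comprehension).

import numpy as np
from scipy.stats import pearsonr

def fit_alpha(importance, distance, comprehension, step: float = 0.01) -> float:
    """importance[i], distance[i]: sub-scores for the error in sentence i;
    comprehension[i]: mean comprehension score (0..1) for that sentence."""
    importance = np.asarray(importance)
    distance = np.asarray(distance)
    comprehension = np.asarray(comprehension)
    best_alpha, best_corr = 0.0, np.inf
    for alpha in np.arange(0.0, 1.0 + step, step):
        impact = alpha * importance + (1.0 - alpha) * distance   # Eq. 1 per sentence
        corr, _ = pearsonr(impact, comprehension)
        if corr < best_corr:
            best_alpha, best_corr = alpha, corr
    return best_alpha
```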
4. Our Findings and Future Directions
In our recent paper [11], which received the Best Paper Award at the ASSETS '17 conference, we used the ACE framework to compare different ASR evaluation metrics (ACE and WER) on how well they evaluate the usability of automatically generated captions, using subjective judgment ratings gathered from DHH users. That paper presented the results of an evaluation study with DHH participants, which found that participants' subjective preferences among text captions correlated better with our new ACE metric than with the standard Word Error Rate (WER) metric.
While we identified word predictability and semantic distance as useful predictors of the usability of an automatically generated caption text, our next step is to conduct a closer evaluation of other, unexplored strategies for building an automatic caption evaluation metric. The data collected during our first study will enable us to perform the empirical evaluations needed to explore various trade-offs between our original ACE metric and other, alternative metric designs for evaluating ASR output.
5. Discussion and Conclusions
The long-term goal of our research is to investigate the use of ASR technology to provide captioning services for DHH users, especially in real-time contexts such as meetings with hearing colleagues. Current metrics used for evaluating (and sometimes optimizing) the performance of ASR systems rely on counting the number of recognition errors, without regard to what the errors are or where they occur in the transcription. However, this approach to evaluating ASR performance has previously been shown to be only loosely connected with actual human judgments in various application settings.
As part of our analysis, we examined various approaches to estimating predictors of the impact of errors in a caption text. Ultimately, our findings reveal that users' subjective evaluation of the quality of captions is not well correlated with a simple count of the errors in the text, and we have described and evaluated a metric that future researchers can use to evaluate the suitability of ASR systems for generating captions for DHH users. Such a metric can be used for an initial investigation of caption quality under various environmental conditions or speakers, and it could be used to compare various ASR systems for this application – prior to conducting a study with DHH users. We also see potential for such metrics to drive the development of ASR-based captioning systems, in place of currently popular metrics, such as WER, which had very little correlation with DHH users' judgments of text quality.
6. Acknowledgments
We are grateful to Peter Yeung, Abraham Glasser, Larwan Berke and Christopher Caulfield, who assisted with the data collection for this study. We would also like to thank our collaborators Michael Stinson, Lisa Elliot, Donna Easton and James Mallory.
7. References
[1] Keith Bain, Sara H. Basson, and Mike Wald. 2002. Speech recognition in university classrooms: liberated learning project. In Proceedings of the ACM Conference on Assistive Technologies, ASSETS 2002, Edinburgh, Scotland, UK, July 8-10, 2002. 192–196. https://doi.org/10.1145/638249.638284
[2] Nathalie N. Bélanger and Keith Rayner. 2013. Frequency and predictability effects in eye fixations for skilled and less-skilled deaf readers. Visual Cognition 21, 4 (2013), 477–497.
[3] Ana-Belén Domínguez and Jesus Alegria. 2009. Reading mechanisms in orally educated deaf adults. Journal of Deaf Studies and Deaf Education 15, 2 (2009), 136–148.
[4] Benoît Favre, Kyla Cheung, Siavash Kazemian, Adam Lee, Yang Liu, Cosmin Munteanu, Ani Nenkova, Dennis Ochei, Gerald Penn, Stephen Tratz, Clare R. Voss, and Frauke Zeller. 2013. Automatic human utility evaluation of ASR systems: does WER really predict performance? In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, August 25-29, 2013. 3463–3467. http://www.isca-speech.org/archive/interspeech_2013/i13_3463.html
[5] John S. Garofolo, Cedric G. P. Auzanne, and Ellen M. Voorhees. 2000. The TREC Spoken Document Retrieval Track: A Success Story. In Computer-Assisted Information Retrieval (Recherche d'Information et ses Applications) - RIAO 2000, 6th International Conference, College de France, France, April 12-14, 2000. Proceedings. 1–20.
[6] David Grangier, Alessandro Vinciarelli, and Hervé Bourlard. 2003. Information retrieval on noisy text. Technical Report. IDIAP.
[7] Sharmistha S. Gray, Daniel Willett, Jianhua Lu, Joel Pinto, Paul Maergner, and Nathan Bodenstab. 2014. Child automatic speech recognition for US English: child interaction with living-room-electronic-devices. In the 4th Workshop on Child, Computer and Interaction, WOCCI 2014, Singapore, September 19, 2014. 21–26. http://www.isca-speech.org/archive/wocci_2014/wc14_021.html
[8] Dorothy W. Jackson, Peter V. Paul, and Jonathan C. Smith. 1997. Prior knowledge and reading comprehension ability of deaf adolescents. Journal of Deaf Studies and Deaf Education (1997), 172–184.
[9] Sushant Kafle and Matt Huenerfauth. 2018. A Corpus for Modeling Word Importance in Spoken Dialogue Transcripts. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7 – May 12, 2018.
[10] Sushant Kafle and Matt Huenerfauth. 2016. Effect of Speech Recognition Errors on Text Understandability for People who are Deaf or Hard of Hearing. In Proceedings of the 7th Workshop on Speech and Language Processing for Assistive Technologies (SLPAT), INTERSPEECH 2016, San Francisco, CA, USA.
[11] Sushant Kafle and Matt Huenerfauth. 2017. Evaluating the Usability of Automatically Generated Captions for People who are Deaf or Hard of Hearing. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS 2017, Baltimore, MD, USA, October 29 - November 01, 2017. 165–174. https://doi.org/10.1145/3132525.3132542
[12] Raja S. Kushalnagar, Walter S. Lasecki, and Jeffrey P. Bigham. 2014. Accessibility Evaluation of Classroom Captions. TACCESS 5, 3 (2014), 7:1–7:24. https://doi.org/10.1145/2543578
[13] Xin Lei, Andrew W. Senior, Alexander Gruenstein, and Jeffrey Sorensen. 2013. Accurate and compact large vocabulary speech recognition on mobile devices. In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, August 25-29, 2013. 662–665. http://www.isca-speech.org/archive/interspeech_2013/i13_0662.html
[14] Jinyu Li, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach. 2014. An Overview of Noise-Robust Automatic Speech Recognition. IEEE/ACM Trans. Audio, Speech & Language Processing 22, 4 (2014), 745–777. https://doi.org/10.1109/TASLP.2014.2304637
[15] John L. Luckner and C. Michele Handley. 2008. A summary of the reading comprehension research undertaken with students who are deaf or hard of hearing. American Annals of the Deaf 153, 1 (2008), 6–36.
[16] Iain A. McCowan, Darren Moore, John Dines, Daniel Gatica-Perez, Mike Flynn, Pierre Wellner, and Hervé Bourlard. 2004. On the use of information retrieval measures for speech recognition evaluation. Technical Report. IDIAP.
[17] Taniya Mishra, Andrej Ljolje, and Mazin Gilbert. 2011. Predicting Human Perceived Accuracy of ASR Systems. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011. 1945–1948. http://www.isca-speech.org/archive/interspeech_2011/i11_1945.html
[18] Andrew Cameron Morris, Viktoria Maier, and Phil D. Green. 2004. From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. In INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004. http://www.isca-speech.org/archive/interspeech_2004/i04_2765.html
[19] Hiroaki Nanjo and Tatsuya Kawahara. 2005. A New ASR Evaluation Measure and Minimum Bayes-Risk Decoding for Open-domain Speech Understanding. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '05, Philadelphia, Pennsylvania, USA, March 18-23, 2005. 1053–1056. https://doi.org/10.1109/ICASSP.2005.1415298
[20] Keith Rayner. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin 124, 3 (1998), 372.
[21] Michael S. Stinson, Pamela Francis, Lisa B. Elliot, and Donna Easton. 2014. Real-time caption challenge: C-Print. In Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility, ASSETS '14, Rochester, NY, USA, October 20-22, 2014. 317–318. https://doi.org/10.1145/2661334.2661337
[22] Ye-Yi Wang, Alex Acero, and Ciprian Chelba. 2003. Is word error rate a good indicator for spoken language understanding accuracy? In Automatic Speech Recognition and Understanding, 2003. ASRU '03. 2003 IEEE Workshop on. IEEE, 577–582.
[23] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. 2016. Achieving Human Parity in Conversational Speech Recognition. CoRR abs/1610.05256 (2016). arXiv:1610.05256 http://arxiv.org/abs/1610.05256