PROGRAMMING BY VOICE

Sadia Nowrin, Michigan Technological University snowrin@mtu.edu

Abstract

Using a keyboard and mouse is difficult or even impossible for many programmers due to motor impairments such as Repetitive Strain Injury (RSI). Programming by voice has enormous potential for helping motor-impaired programmers to input code and continue their careers. It also has the potential to allow motor-impaired students to more easily participate in computer science courses. By decreasing keyboard and mouse usage, programming by voice can also aid people who are at risk of acquiring RSI. This work explores how different programmers naturally speak code without having to learn a prescriptive grammar. The aim of this research is to develop a system that allows programmers to speak code in a natural manner.

Background

Programming is a text-intensive job that is usually done with a keyboard and mouse. Programmers who have a permanent or temporary disability can use speech as a substitute for typing. Some prior work has investigated speech alternatives for programming. In 2000, Arnold et al. proposed a programming by voice system based on a syntax-directed programming environment, but it is no longer being developed [1]. VoiceGrip is a system that enables users to speak code using a pseudo syntax and then translates it into native code [4], but the system was not as productive as a keyboard and mouse. Price et al. [10] conducted a Wizard of Oz study to explore how people use a voice interface for programming. Begel and Graham [2] conducted a study investigating how programmers read code aloud from a piece of paper and, based on that, developed a tool called Spoken Java. The Spoken Java interface requires users to learn a series of commands [3]. VoiceCode also requires users to learn a complex set of commands to program by voice [5]. Myna [14] is a voice-driven interface that enables motor-impaired children to learn programming within Scratch, a block-based visual programming language. At PyCon 2013 [12], Tavis Rudd demonstrated how coding by voice can be more efficient than typing, but his system consists of over 1,000 commands. HyperCode can create a Java object along with its attributes and methods inside the IntelliJ IDEA [7]. Rosenblatt et al. conducted a Wizard of Oz study to explore how programmers give commands to code by voice [11]. Based on the results of the study, they developed a voice-to-code editor named VocalIDE in which writing and editing code are done through a set of commands. Mancodev is an IDE-like tool that can be used to write and run basic JavaScript programs by transcribing voice input and identifying keywords [8]. Recently, Van Brummelen et al. [13] developed a system that allows users to write programs by conversing with a conversational agent in natural language, but users need to state commands exactly for the system to understand them. Talon Voice is a freely available voice interface that allows users to customize commands.

In summary, earlier voice programming methods require users to learn a complex set of commands, and it takes time to learn these commands and become accustomed to programming by voice. The value of speaking code was highlighted in a recent article [9] in which two motor-impaired programmers noted that programming by voice has great potential for all programmers, as voice coding can be relaxing.

Research Questions

This research aims to develop a system that takes spoken code utterances and converts them into actual code. This requires a language model that captures the wide range of variability in programmers’ spoken code. Currently, no dataset is available to train such a language model, which motivated me to investigate how to efficiently collect a large amount of naturally spoken code that corresponds to actual code. I will address the following research questions in this thesis:

  1. RQ1: How do different programmers speak code in a natural manner without a prescriptive grammar?
  2. RQ2: What is an efficient and realistic way to collect a large training set of spoken code?
  3. RQ3: How accurate are existing speech recognizers on naturally spoken code and how can we make them better?
  4. RQ4: How can the literal text of recognized spoken code be translated into actual code?

Proposed Solution and Methodology

To address RQ1 and RQ2, we look into ways to collect spoken code utterances that are analogous to actual code. Reading a line of code aloud would be an easier way to collect spoken code, but in a real programming by voice system, users would invent code from scratch. Thus, data will be collected in two ways: speaking a highlighted line of code and speaking a missing line of code. The goal is to determine whether programmers speak differently in the two conditions and which method is preferable for collecting a large amount of data. This work aims to analyze the variations in the collected data, such as the verbalization of punctuation, capitalization, and user-defined symbols (e.g., variable names, function names). Additionally, I plan to explore whether different programmers (e.g., novices and experts) speak differently, and whether people speak blocks of code with a similar construct (e.g., a for loop has a similar construct in both C and Java) differently in two different programming languages.

RQ3 will be addressed by first measuring the accuracy of state-of-the-art commercial speech recognizers, as well as research recognizers, on the collected audio. This work will also explore whether language model adaptation improves the accuracy of the speech recognizers in recognizing the literal words of spoken code.
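
As a concrete illustration of this step, the sketch below sends one collected recording to Google Cloud Speech-to-Text and prints the recognized literal text. This is a minimal sketch rather than the thesis pipeline; the file name, LINEAR16 encoding, and 16 kHz sample rate are assumptions, and it presumes Google Cloud credentials are already configured.

```python
# Minimal sketch: transcribe one spoken-code recording with Google Cloud
# Speech-to-Text. File name, encoding, and sample rate are assumptions.
from google.cloud import speech

client = speech.SpeechClient()

with open("utterance.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # e.g. "else if scan dot has next double"
    print(result.alternatives[0].transcript)
```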

To address RQ4, I plan to investigate state-of-the-art neural machine translation approaches using the transcribed output from the speech recognizer. I also plan to explore sequence-to-sequence machine translation approaches to translate the spoken utterances into actual code.

To build a complete programming by voice system, it is necessary to know the target users and their perceptions of such a system. This thesis aims to explore how motor-impaired programmers currently program, what challenges they face when programming by voice, and how they imagine a future voice programming system.

Research Status and Preliminary Findings

I conducted semi-structured interviews with six programmers who have motor impairments. Details about the programmers are presented in Table 1. We asked the participants about their experience using speech recognition tools to write programs, how they imagine a future voice programming system, the challenges of coding by voice, and whether they have any privacy or social concerns, among other topics.

Table 1. Details about the programmers with motor impairments we interviewed.

| Participant | Gender | Current Job | Physical Condition | Programming Experience |
|---|---|---|---|---|
| P1 | female | graduate student | chronic musculoskeletal pain | 8 years |
| P2 | male | web developer | neurological accident | 6-7 years |
| P3 | female | graduate student | spinal muscular atrophy | 20 years |
| P4 | female | academic researcher | small fiber neuropathy | 16 years |
| P5 | male | professor | congenital upper limb deficiency | 17 years |
| P6 | male | academic researcher | upper limb musculoskeletal disorder | 6 years |

Participants mentioned that they had to learn a lot of commands and that it took time to get used to programming by voice: “The most challenging thing is editing and navigation (P1)”. The majority of the participants wished they could write code via natural language: “I guess it would be more similar to the experience of pair programming with someone (P4)”. They also identified challenges in dictating code, such as speaking variable names and comments: “There are some real challenges with variable names and comments (P3)”. All participants appreciated the benefits of programming by voice for people with motor impairments like themselves: “People won’t be intimidated by the task of typing by hand (P2)”. I am currently performing thematic analysis on the interview transcripts to gain a better understanding of participants’ viewpoints.

I designed a study to collect examples of novice programmers speaking code. First, I conducted a pilot study where participants were asked to speak either a highlighted line or a missing line of code. I created two PowerPoint files, each containing the same 20 Java programs. Sixteen undergraduate students participated in the study. I collected 320 recordings in total and then transcribed all the audio files. I analyzed the variations in the transcripts and found that participants tended to omit punctuation, especially in the case of missing lines.

Later, I conducted the main study, in which a web-based application was used to collect spoken data. This allowed me to track the duration of the experiment as well as the time users spent on specific actions. The web application contained 20 C programs and 20 Java programs. In each language, 16 programs had a single missing or highlighted line of code, and 4 programs had multiple missing or highlighted lines. A total of 44 people completed the C version and 15 people completed the Java version. In total, I collected 840 audio files for C and 280 for Java. Similar to the pilot study, I compared how participants spoke in the two conditions and analyzed the variations in the collected data. The results show that participants spoke faster in the highlighted condition than in the missing condition. In the missing condition, however, I found much more variation in the speech; for example, participants seemed to skip symbols and punctuation when they could not see the line. I will further explore how to collect large amounts of data efficiently. I will also investigate whether participants speak differently in different programming languages.

Table 2. Details about the data in each study and the word error rate (WER) using the IBM and Google speech recognizers.

| Language | Study | Participants | Lines | Reference words | IBM WER (%) | Google WER (%) |
|---|---|---|---|---|---|---|
| Java | pilot | 16 | 255 | 5537 | 37.98 | 31.03 |
| Java | main | 14 | 280 | 4131 | 40.64 | 35.75 |
| C | main | 42 | 840 | 12272 | 42.33 | 36.38 |

I conducted offline recognition experiments with both the IBM Watson Speech-to-Text and Google Cloud Speech-to-Text services on the audio from both user studies. I first tested the baseline models and calculated the word error rate (WER) of each using the human transcripts as the reference. As shown in Table 2, the baseline IBM and Google models exhibited high WERs on the audio from both the pilot and main studies.
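
For reference, WER is the word-level edit distance between the recognizer output and the human transcript, divided by the number of reference words. The sketch below is a minimal implementation, assuming whitespace-tokenized transcripts; it is illustrative rather than the exact scoring tool used in the experiments.

```python
# Minimal sketch of word error rate (WER) via word-level edit distance.
# Assumes both strings are plain, whitespace-tokenized transcripts.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Example from Table 3: "than" substituted for "then".
print(wer("else if scan dot has next double then",
          "else if scan dot has next double than"))  # 0.125
```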

After experimenting with the baseline IBM model, I explored the language customization feature of IBM Watson Speech-to-Text. I used all the Java transcripts from the main study to adapt the language model and found that the WER dropped by 52% compared to the baseline model. This demonstrates the potential for improving recognition accuracy by altering the underlying language model of the speech recognizer. Table 3 shows some examples of human transcriptions and recognition results from IBM’s custom language model. In another experiment, I investigated whether language model adaptation works across programming languages: I used the C transcripts as training data to adapt the language model, and on Java utterances the WER was lowered by 48% relative. As a next step, I will further explore research recognizers and language model adaptation for recognizing spoken code.
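
A minimal sketch of this adaptation step, using the IBM Watson Python SDK, is shown below. The API key, service URL, and file names are placeholders, and in practice the corpus must finish processing before training; this illustrates the workflow, not the exact scripts used in the experiments.

```python
# Sketch: adapt IBM Watson Speech-to-Text with a custom language model built
# from the spoken-code transcripts. Credentials and file names are placeholders.
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("YOUR_SERVICE_URL")

# Create a custom language model on top of a US English base model.
custom = stt.create_language_model(
    "spoken-code-java", "en-US_BroadbandModel").get_result()
custom_id = custom["customization_id"]

# Add the human transcripts of spoken Java as a training corpus, then train.
# (In practice, poll the corpus status until processing finishes before training.)
with open("java_main_transcripts.txt", "rb") as corpus:
    stt.add_corpus(custom_id, "java-transcripts", corpus)
stt.train_language_model(custom_id)

# Recognize an utterance with the adapted model.
with open("utterance.wav", "rb") as audio:
    result = stt.recognize(audio=audio, content_type="audio/wav",
                           model="en-US_BroadbandModel",
                           language_customization_id=custom_id).get_result()
print(result["results"][0]["alternatives"][0]["transcript"])
```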

I performed a small experiment with a neural machine translation system [6]. Even with a limited amount of parallel training text, I was able to translate at least some of the recognized transcripts into code. Further research is needed to explore more sophisticated machine translation systems, for example tree-based neural machine translation approaches versus a straightforward sequence-to-sequence approach.
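
To make the setup concrete, the sketch below prepares a tiny parallel corpus in the line-aligned format that seq2seq toolkits such as OpenNMT [6] consume: the source side is the literal spoken transcript and the target side is the tokenized code. The file names and tokenization scheme are illustrative assumptions rather than the configuration used in the experiment; the two example pairs are taken from Table 3.

```python
# Sketch: build line-aligned source/target files for seq2seq training.
# Source = literal spoken transcript, target = tokenized line of code.
import re

pairs = [
    ("else if scan dot has next double",
     "else if(scan.hasNextDouble())"),
    ("private string email semicolon",
     "private String email;"),
]

def tokenize_code(line: str) -> str:
    # Separate identifiers from punctuation so the decoder emits short tokens.
    return " ".join(re.findall(r"\w+|[^\w\s]", line))

with open("train.src", "w") as src, open("train.tgt", "w") as tgt:
    for spoken, code in pairs:
        src.write(spoken + "\n")
        tgt.write(tokenize_code(code) + "\n")

# With OpenNMT-py, training and decoding then typically proceed with its
# console scripts, e.g.:
#   onmt_build_vocab -config config.yaml -n_sample -1
#   onmt_train -config config.yaml
#   onmt_translate -model model.pt -src test.src -output pred.txt
# where config.yaml points at train.src/train.tgt.
```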

Table 3. Examples of how participants spoke lines of code in the pilot study. Recognition results are from IBM Watson adapted on the Java transcripts from the main study; recognition errors can be seen by comparing each IBM row with the human transcript directly above it.

| Target code | Transcript | Text |
|---|---|---|
| else if(scan.hasNextDouble()) | human | else if scan dot function has next double |
| | IBM | else if scan dot function has next double |
| | human | else if scan dot has next double then |
| | IBM | else if scan dot has next double than |
| private String email; | human | private uppercase s lowercase t r i n g email semicolon |
| | IBM | private upper case s lower case e r i n g email semicolon |
| | human | private string email semicolon |
| | IBM | I haven’t string email semicolon |
| case "two": | human | case quotation marks t w o end quotation marks colon |
| | IBM | case quotation marks t w o in quotation marks colon |
| | human | case quote two quote colon |
| | IBM | case quote to quote colon |

Expected Contributions

Motor impairments can disrupt the careers of programmers who code as part of their profession. Students in introductory programming courses may also experience difficulties learning to write programs due to motor impairments. People who program on a regular basis may develop RSI. Current programming by voice systems require learning a large set of commands, which may hinder people’s ability to get used to programming by voice. I expect my dissertation research will allow programmers with disabilities to write programs by voice in a natural manner. This research explores methodologies for building a complete programming by voice system. The user study results, algorithmic improvements, crowdsourcing procedures, and design guidelines produced by this research will be useful in developing future systems.

References

  1. Stephen C Arnold, Leo Mark, and John Goldthwaite. 2000. Programming by voice, Vocal Programming. In Proceedings of the fourth international ACM conference on Assistive technologies. ACM, 149–155.
  2. Andrew Begel and Susan L Graham. 2005. Spoken programs. In 2005 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC’05). IEEE, 99–106.
  3. Andrew Begel and Susan L Graham. 2006. An Assessment of a Speech-Based Programming Environment. In Visual LAnguages and Human-Centric Computing (VL/HCC’06). IEEE, 116–120.
  4. Alain Désilets. 2001. VoiceGrip: A Tool for Programming-by-Voice. International Journal of Speech Technology 4, 2 (2001), 103–116.
  5. Alain Désilets, David C Fox, and Stuart Norton. 2006. VoiceCode: An innovative speech interface for programming-by-voice. In CHI’06 Extended Abstracts on Human Factors in Computing Systems. ACM, 239–242.
  6. Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of ACL 2017, System Demonstrations. Association for Computational Linguistics, Vancouver, Canada, 67–72.
  7. Rinor S Maloku and Besart Xh Pllana. 2016. HyperCode: Voice aided programming. IFAC-PapersOnLine 49, 29 (2016), 263–268.
  8. Jonathan Giovanni Soto Muñoz, Arturo Iván de Casso Verdugo, Eliseo Geraldo González, Jesús Andrés Sandoval Bringas, and Miguel Parra Garcia. 2019. Programming by Voice Assistance Tool for Physical Impairment Patients Classified in to Peripheral Neuropathy Centered on Arms or Hands Movement Difficulty. In 2019 International Conference on Inclusive Technologies and Education (CONTIE). IEEE, 210–2107.
  9. Anna Nowogrodzki. 2018. Speaking in code: how to program by voice. Nature 559, 2 (2018), 141–142.
  10. David E Price, DA Dahlstrom, Ben Newton, and Joseph L Zachary. 2002. Off to See the Wizard: using a "Wizard of Oz" study to learn how to design a spoken language interface for programming. In 32nd Annual Frontiers in Education, Vol. 1. IEEE, T2G–T2G.
  11. Lucas Rosenblatt, Patrick Carrington, Kotaro Hara, and Jeffrey P Bigham. 2018. Vocal Programming for People with Upper-Body Motor Impairments. In Proceedings of the Internet of Accessible Things. ACM, 30.
  12. Tavis Rudd. 2013. Using Python to Code by Voice. http://pyvideo.org/video/1735/using-python-to-code-by-voice.
  13. Jessica Van Brummelen, Kevin Weng, Phoebe Lin, and Catherine Yeo. 2020. CONVO: What does conversational programming need? In 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, 1–5.
  14. Amber Wagner, Ramaraju Rudraraju, Srinivasa Datla, Avishek Banerjee, Mandar Sudame, and Jeff Gray. 2012. Programming by Voice: A Hands-Free Approach for Motorically Challenged Children. In CHI ’12 Extended Abstracts on Human Factors in Computing Systems (Austin, Texas, USA) (CHI EA ’12). Association for Computing Machinery, New York, NY, USA, 2087–2092.

About the Authors

Sadia Nowrin is a PhD candidate in the Department of Computer Science at Michigan Technological University. She is advised by Dr. Keith Vertanen, an associate professor at Michigan Technological University. Her main research interests are in the fields of Human-Computer Interaction, Accessibility, and Natural Language Processing.