PROGRAMMING BY VOICE
Sadia Nowrin, Michigan Technological University, snowrin@mtu.edu
Abstract
Using a keyboard and mouse is difficult or even impossible for many programmers due to motor impairments such as Repetitive Strain Injury (RSI). Programming by voice has enormous potential for helping motor-impaired programmers input code and continue their careers. It also has the potential to allow motor-impaired students to participate more easily in computer science courses. By decreasing keyboard and mouse usage, programming by voice can also aid people who are at risk of developing RSI. This work explores how different programmers naturally speak code without having to learn a prescriptive grammar. The aim of this research is to develop a system that allows programmers to speak code in a natural manner.
Background
Programming is a text-intensive task that is usually done with a keyboard and mouse. Programmers who have a permanent or temporary disability can use speech as a substitute for typing, and some prior work has investigated speech alternatives for programming. In 2000, Arnold et al. proposed a programming by voice system based on a syntax-directed programming environment, but it is no longer being developed [1]. VoiceGrip enables users to speak code using a pseudo-syntax that is then translated into native code [4], but the system is not as productive as a keyboard and mouse. Price et al. [10] conducted a Wizard of Oz study to explore how people use a voice interface for programming. Begel and Graham [2] studied how programmers read code written on paper and, based on the results, developed a tool called Spoken Java; the Spoken Java interface requires users to learn a series of commands [3]. VoiceCode also requires users to learn a complex set of commands to program by voice [5]. Myna [14] is a voice-driven interface that enables motor-impaired children to learn programming within Scratch, a block-based visual programming language. At PyCon 2013 [12], Tavis Rudd demonstrated how coding by voice can be more efficient than typing, but his system consists of a set of over 1,000 commands. HyperCode can create a Java object along with its attributes and methods inside the IntelliJ IDEA [7]. Rosenblatt et al. conducted a Wizard of Oz study to explore how programmers give commands to code by voice [11]; based on the results, they developed a voice-to-code editor named VocalIDE in which writing and editing code are done through a set of commands. Mancodev is an IDE-like tool that can be used to write and run basic JavaScript programs by transcribing voice input and identifying keywords [8]. Recently, Van Brummelen et al. [13] developed a system that allows users to write programs by conversing with a conversational agent in natural language, but users must state commands exactly for the system to understand them. Talon Voice is a freely available voice interface that allows users to customize commands.
In summary, earlier voice programming approaches require users to learn a complex set of commands, and it takes time to learn these commands and become accustomed to programming by voice. The need to speak code naturally was highlighted in a recent article [9] in which two motor-impaired programmers noted that programming by voice has great potential for all programmers, as voice coding can be relaxing.
Research Questions
This research aims to develop a system that takes spoken code utterances and converts them to actual code. Such a system requires a language model that captures the wide range of variability in programmers' spoken code, but currently no dataset is available to train such a language model. This motivated me to investigate how to efficiently collect a large amount of naturally spoken code that closely matches actual code. I will address the following research questions in this thesis:
- RQ1: How do different programmers speak code in a natural manner without a prescriptive grammar?
- RQ2: What is an efficient and realistic way to collect a large training set of spoken code?
- RQ3: How accurate are existing speech recognizers on naturally spoken code and how can we make them better?
- RQ4: How can the literal text of recognized spoken code be translated into actual code?
Proposed Solution and Methodology
To address RQ1 and RQ2, we look into ways to collect spoken code utterances that are analogous to actual code. Reading a line of code aloud would be an easy way to collect spoken code, but in a real programming by voice system users would invent code from scratch. Thus, data will be collected in two ways: speaking a highlighted line of code and speaking a missing line of code. The goal is to determine whether programmers speak differently in the two conditions and which method is preferable for collecting a large amount of data. This work aims to analyze the variations in the collected data, such as the verbalization of punctuation, capitalization, and user-defined symbols (e.g., variable and function names). Additionally, I plan to explore whether different programmers (e.g., novices and experts) speak differently, and whether people speak blocks of code with similar constructs (e.g., a for loop has a similar structure in both C and Java) differently in the two languages.
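As a hypothetical illustration (not data from the studies), the sketch below shows the kinds of variation this analysis targets: the same target line can be verbalized with or without punctuation, with different wordings for operators, and with different treatments of identifiers.

```python
# Hypothetical verbalizations of a single target line, illustrating variation
# in how punctuation, operators, and identifiers might be spoken.
# These examples are invented for illustration only.
target_line = "if (count == 10) {"
verbalizations = [
    "if open paren count equals equals ten close paren open brace",
    "if count is equal to ten",  # punctuation and brace omitted
    "if open parenthesis count double equals one zero close parenthesis curly brace",
]
```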
RQ3 will be addressed by first measuring the accuracy of state-of-the-art commercial speech recognizers, as well as research recognizers, on the collected audio. This work will also explore whether language model adaptation improves the recognizers' accuracy on the literal words of spoken code.
To address RQ4, I plan to investigate state-of-the-art neural machine translation approaches using the transcribed output of the speech recognizer. In particular, I plan to explore sequence-to-sequence machine translation for translating spoken utterances into actual code.
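For illustration, the translation step can be framed as learning from parallel pairs of literal transcripts and target code. The pairs below are illustrative only (the second and third are adapted from Table 3, the first is hypothetical); the actual training data comes from the user studies.

```python
# Parallel (spoken transcript, target code) pairs of the kind a
# sequence-to-sequence model would be trained on.
parallel_data = [
    ("for int i equals zero i less than n i plus plus",  # hypothetical pair
     "for (int i = 0; i < n; i++)"),
    ("private string email semicolon",
     "private String email;"),
    ("else if scan dot has next double",
     "else if (scan.hasNextDouble())"),
]
```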
To build a complete programming by voice system, it is necessary to know the target users and their perceptions of such a system. This thesis aims to explore how motor-impaired programmers currently program, what challenges they face when programming by voice, and how they imagine a future voice programming system.
Research Status and Preliminary Findings
I conducted semi-structured interviews with six programmers who have motor impairments; details about the participants are presented in Table 1. I asked the participants about their experience using speech recognition tools to write programs, how they imagine a future voice programming system, the challenges of coding by voice, and whether they have any privacy or social concerns.
Table 1: Interview participants with motor impairments.

| Participant | Gender | Current job | Physical condition | Programming experience |
|---|---|---|---|---|
| P1 | female | graduate student | chronic musculoskeletal pain | 8 years |
| P2 | male | web developer | neurological accident | 6-7 years |
| P3 | female | graduate student | spinal muscular atrophy | 20 years |
| P4 | female | academic researcher | small fiber neuropathy | 16 years |
| P5 | male | professor | congenital upper limb deficiency | 17 years |
| P6 | male | academic researcher | upper limb musculoskeletal disorder | 6 years |
Participants mentioned they had to learn a lot of commands and it took time to get used to programming by voice: “The most challenging thing is editing and navigation (P1)”. The majority of the participants wished they could write code via natural language: “I guess it would be more similar to the experience of pair programming with someone (P4)”. They also identified the challenges in dictating code such as speaking variable names and comments: “There are some real challenges with variable names and comments (P3)”. All participants appreciated the benefits of programming by voice for people with motor impairments like themselves: “People won’t be intimidated by the task of typing by hand (P2)”. I am currently performing thematic analysis on the interview transcripts to gain a better understanding of participants’ viewpoints.
I designed a study to collect examples of novice programmers speaking code. First, I conducted a pilot study in which participants were asked to speak either a highlighted line or a missing line of code. I created two PowerPoint files, each containing the same 20 Java programs. Sixteen undergraduate students participated in the study. I collected a total of 320 recordings and transcribed all of the audio files. Analyzing the variations in the transcripts, I found that participants tended to omit punctuation, especially for missing lines.
Later, I conducted the main study, in which a web-based application was used to collect spoken data. This allowed me to track the duration of the experiment as well as the time participants spent on specific actions. The web application contains 20 programs for C and 20 programs for Java; 16 programs had a single missing or highlighted line of code, and 4 programs had multiple missing or highlighted lines. 44 people completed the C version and 15 completed the Java version, yielding 840 audio recordings for C and 280 for Java. As in the pilot study, I compared how participants speak in the two conditions and analyzed the variations in the collected data. Participants spoke faster in the highlighted condition than in the missing condition, but the missing condition produced much more variation in the speech: for example, participants tended to skip symbols and punctuation when they could not see the line. I will further explore how to collect large amounts of data efficiently and will investigate whether participants speak differently in different programming languages.
Table 2: Word error rate (WER) of the baseline IBM and Google speech recognizers on the collected spoken code.

| Language | Study | Participants | Lines | Reference words | IBM WER (%) | Google WER (%) |
|---|---|---|---|---|---|---|
| Java | pilot | 16 | 255 | 5537 | 37.98 | 31.03 |
| Java | main | 14 | 280 | 4131 | 40.64 | 35.75 |
| C | main | 42 | 840 | 12272 | 42.33 | 36.38 |
I conducted offline recognition experiments with the IBM Watson Speech-to-Text and Google Cloud Speech-to-Text services on the audio from both user studies, calculating the word error rate (WER) of all models using the human transcripts as the reference. As shown in Table 2, the baseline models for IBM and Google exhibited a high WER on audio from both the pilot and main studies.
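For reference, WER is the word-level edit distance between the recognizer output and the human reference transcript, divided by the number of reference words. A minimal sketch of the computation (not the evaluation code used in these experiments):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Example adapted from Table 3: "than" substituted for "then" (1 of 8 words).
print(wer("else if scan dot has next double then",
          "else if scan dot has next double than"))  # 0.125
```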
After experimenting with IBM's baseline model, I explored the language customization feature of IBM Watson Speech-to-Text. I used all the Java transcripts from the main study to adapt the language model and found that WER dropped by 52% relative to the baseline model. This demonstrates the potential for improving recognition accuracy by altering the underlying language model of the speech recognizer. Table 3 shows some examples of human transcriptions and recognition results from IBM's custom language model. In another experiment, I investigated whether language model adaptation works across programming languages: using the C transcripts as training data to adapt the language model, WER on Java utterances was lowered by 48% relative. In the next step, I will further explore research recognizers and language model adaptation for recognizing spoken code.
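The customization workflow roughly follows the steps sketched below. This is only a sketch assuming the ibm-watson Python SDK, with placeholder credentials, model names, and file names; it is not the exact script used in the experiments, and method details may vary across SDK versions.

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and service URL.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("API_KEY"))
stt.set_service_url("SERVICE_URL")

# Create a custom language model on top of a base English model.
model = stt.create_language_model(
    "spoken-code-java", "en-US_BroadbandModel").get_result()
custom_id = model["customization_id"]

# Add a corpus of spoken-code transcripts; in practice, poll the corpus
# status until processing finishes before training.
with open("java_main_transcripts.txt", "rb") as corpus:
    stt.add_corpus(custom_id, "java-transcripts", corpus)
stt.train_language_model(custom_id)

# Recognize an utterance with the adapted language model.
with open("utterance.wav", "rb") as audio:
    result = stt.recognize(
        audio=audio,
        content_type="audio/wav",
        language_customization_id=custom_id).get_result()
print(result["results"][0]["alternatives"][0]["transcript"])
```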
I performed a small experiment with a neural machine translation system [6]. Even with a limited amount of parallel training text, I was able to translate at least some of the recognized transcripts into code. Further research is needed to explore more sophisticated machine translation approaches, for example comparing tree-based neural machine translation against a straight sequence-to-sequence approach.
Table 3: Examples of human transcriptions and recognition results from IBM's custom language model.

| Target code | Source | Transcript |
|---|---|---|
| else if(scan.hasNextDouble()) | human | else if scan dot function has next double |
| | IBM | else if scan dot function has next double |
| | human | else if scan dot has next double then |
| | IBM | else if scan dot has next double than |
| private String email; | human | private uppercase s lowercase t r i n g email semicolon |
| | IBM | private upper case s lower case e r i n g email semicolon |
| | human | private string email semicolon |
| | IBM | I haven't string email semicolon |
| case "two": | human | case quotation marks t w o end quotation marks colon |
| | IBM | case quotation marks t w o in quotation marks colon |
| | human | case quote two quote colon |
| | IBM | case quote to quote colon |
Expected Contributions
Motor impairments can disrupt the careers of programmers who code as part of their profession. Students in introductory programming courses may also experience difficulties learning to write programs due to motor impairments, and people who program on a regular basis may develop RSI. Current programming by voice systems require learning a large set of commands, which may hinder people's ability to get used to programming by voice. I expect my dissertation research will allow programmers with disabilities to write programs by voice in a natural manner. This research explores methodologies to build a complete system for programming by voice. The results from the user studies, improvements to the algorithms, crowd-sourcing procedures, and design guidelines realized in this research will be useful in developing future systems.
References
- Stephen C Arnold, Leo Mark, and John Goldthwaite. 2000. Programming by voice, Vocal Programming. In Proceedings of the fourth international ACM conference on Assistive technologies. ACM, 149–155.
- Andrew Begel and Susan L Graham. 2005. Spoken programs. In 2005 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC’05). IEEE, 99–106.
- Andrew Begel and Susan L Graham. 2006. An Assessment of a Speech-Based Programming Environment. In Visual LAnguages and Human-Centric Computing (VL/HCC’06). IEEE, 116–120.
- Alain Désilets. 2001. VoiceGrip: A Tool for Programming-by-Voice. International Journal of Speech Technology 4, 2 (2001), 103–116.
- Alain Désilets, David C Fox, and Stuart Norton. 2006. VoiceCode: An innovative speech interface for programming-by-voice. In CHI’06 Extended Abstracts on Human Factors in Computing Systems. ACM, 239–242.
- Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of ACL 2017, System Demonstrations. Association for Computational Linguistics, Vancouver, Canada, 67–72.
- Rinor S Maloku and Besart Xh Pllana. 2016. HyperCode: Voice aided programming. IFAC-PapersOnLine 49, 29 (2016), 263–268.
- Jonathan Giovanni Soto Muñoz, Arturo Iván de Casso Verdugo, Eliseo Geraldo González, Jesús Andrés Sandoval Bringas, and Miguel Parra Garcia. 2019. Programming by Voice Assistance Tool for Physical Impairment Patients Classified in to Peripheral Neuropathy Centered on Arms or Hands Movement Difficulty. In 2019 International Conference on Inclusive Technologies and Education (CONTIE). IEEE, 210–2107.
- Anna Nowogrodzki. 2018. Speaking in code: how to program by voice. Nature 559, 2 (2018), 141–142.
- David E Price, DA Dahlstrom, Ben Newton, and Joseph L Zachary. 2002. Off to See the Wizard: using a "Wizard of Oz" study to learn how to design a spoken language interface for programming. In 32nd Annual Frontiers in Education, Vol. 1. IEEE, T2G–T2G.
- Lucas Rosenblatt, Patrick Carrington, Kotaro Hara, and Jeffrey P Bigham. 2018. Vocal Programming for People with Upper-Body Motor Impairments. In Proceedings of the Internet of Accessible Things. ACM, 30.
- Tavis Rudd. 2013. Using Python to Code by Voice. http://pyvideo.org/video/1735/using-python-to-code-by-voice.
- Jessica Van Brummelen, Kevin Weng, Phoebe Lin, and Catherine Yeo. 2020. CONVO: What does conversational programming need? In 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, 1–5.
- Amber Wagner, Ramaraju Rudraraju, Srinivasa Datla, Avishek Banerjee, Mandar Sudame, and Jeff Gray. 2012. Programming by Voice: A Hands-Free Approach for Motorically Challenged Children. In CHI ’12 Extended Abstracts on Human Factors in Computing Systems (Austin, Texas, USA) (CHI EA ’12). Association for Computing Machinery, New York, NY, USA, 2087–2092.