FACILITATING SHARING AND RE-USE OF ACCESSIBILITY DATASETS: BENEFITS AND RISKS
Rie Kamikubo, University of Maryland, College Park, rkamikub@umd.edu
Abstract
While advances in technologies such as artificial intelligence promise many possibilities for the disability community, they are centered on data-driven approaches. Datasets and data sharing play an important role in training and testing machine learning models and in helping deployed systems work better in the real world. However, sharing data sourced from people with disabilities or older adults poses ethical and privacy concerns, which significantly limit the availability and re-use of accessibility datasets. Given this tension between making such data accessible and restricting access to protect the people represented in the data, this paper serves as a starting point to call for action in developing guidelines and frameworks for the ethical use and sharing of accessibility datasets. The work proposes a mixed-method research approach to gain a deep understanding of the needs and challenges of shared resources in this field. The insights gained will facilitate discussions on the future of data sharing and ownership in accessibility research and help inform the development of inclusive AI applications and assistive technologies.
Introduction
Emerging technologies benefiting from advances in artificial intelligence and robotics have the potential to improve the lives of people with disabilities and older adults. Human-computer interaction and accessibility researchers have quickly leveraged computer vision, natural language processing, and speech recognition for innovative assistive technologies [3,4,6,15]. At the core of these efforts is the need for data, which are inherently necessary to employ cutting-edge and data-hungry machine learning approaches [17]. Accessibility datasets can play an important role in training, testing, and benchmarking machine learning models -- ultimately, helping AI-infused systems work better when deployed in real-world scenarios both for assistive and general purpose contexts [17,24]. By accessibility datasets in this work we refer to data sourced from people with disabilities and older adults that can be used in machine learning models. For example, as shown in Figure 1, these can be photos taken by blind users [12,21], sign language videos from Deaf signers [14,15], or voice recordings of people with speech impairments [5,31].
Despite the critical role these datasets play, they are scarce [4,17,18]. While this is partly due to the smaller populations typical of underrepresented groups [26], other factors include disparate characteristics within a given disability or age group; data often being constrained to specific tasks or applications, which makes it difficult to aggregate; and the cost of data annotation, which requires domain knowledge [19]. Such patterns can further marginalize people with disabilities and older adults in data-driven innovation: models may perform poorly for them [11] or, worse, discriminate against them [29,30]. Moreover, increasing their representation can in turn amplify ethical concerns. People who have distinct data patterns may be more susceptible to data abuse and misuse [1,11,13,28].
Prior work focusing on the CHI community highlighted that data sharing is very sparse among human-computer interaction researchers, partially due to privacy concerns and the associated risks of disclosing sensitive information about individuals contributing data [1]. Other concerns relate to the lack of frameworks for guaranteeing that data use is aligned with the original purpose of data collection and for documenting the data collection in a consistent way, as most documentation efforts are currently ad hoc in many computing communities. Efforts such as "datasheets for datasets" [9] have not been broadly adopted.
Through a mixed-method approach, this work aims to build a better understanding of the benefits and risks unique to many accessibility datasets, which make discussions around data practices more pressing, especially when such datasets can be used in AI-infused applications. Specifically, we intend to gain better insights into the need for shared resources in this field, surface perspectives from the disability community on the best ways to balance benefits and risks, and understand the challenges and sociotechnical frameworks that can facilitate sharing of accessibility datasets.
Methodology
Motivated by an increasing interest in open data science and ongoing discussions around data artifacts (e.g., [9,16,23]), this research aims to improve the norms for collecting and sharing data related to wellness, accessibility, and aging. Prior efforts [16,24] investigated better technical, legal, and institutional privacy frameworks by understanding the benefits and harms for social data collection (e.g., how people in archives collect and provide access to data [16]). We build upon these prior efforts to pursue effective and ethical use of accessibility data, with the following research questions:
RQ1 What are the current practices being carried out for data collection, reporting and sharing within and across disability groups?
RQ2 What are the contributing reasons and harms that lead to certain communities of interest deciding to share or not to share data resources?
RQ3 What intervention is effective for mitigating the risks and promoting the ethical use and sharing of accessibility datasets?
A Multi-step Plan of Action
Step 1. Systematic review of accessibility datasets. First, we analyze available accessibility datasets that include data generated by people with disabilities and that have potential for training machine learning models. Building on the search and collection of accessibility datasets in Kacorri et al. [18], we systematically review the manually extracted metadata of the datasets to gain insights into current data sharing practices and reporting standards across different populations and data collection tasks in the communities of focus.
Step 2. Surveys and interviews with data scientists and the communities of focus. Second, we explore the attitudes, interests, and challenges of data scientists towards machine learning on data generated by people with disabilities and older adults. We employ survey methods followed by interviews in order to gain deeper insights into self-reported attitudes, intent, and interactions. In addition to the insights gained from data scientists, we integrate two-way perspectives by conducting interviews with people contributing data to understand their point of view on datasets and data sharing for social good. We synthesize the findings into implications to uncover effective ways to attract and support data scientists in including people with disabilities and older adults in their work.
Step 3. Evaluation of intervention. The final step will involve implementing and assessing an intervention that enables data scientists to discover and contribute accessibility datasets, while nurturing ethical practices for data use and sharing. We will combine the insights gained from the above two steps with a concrete resource that achieves the following objectives: (i) increase the availability of data collected from populations for the purpose of accessibility research and engineering and (ii) promote transparency and accountability in the use of accessibility data.
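As an illustration of the metadata review described in Step 1, the following is a minimal sketch of how manually extracted dataset metadata might be tallied by community and over time. The record fields (`community`, `year`, `data_type`, `shared`), the function names, and the example rows are all assumptions made for illustration; they are not the actual schema or data used in this work.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """One manually extracted metadata row (all field names are hypothetical)."""
    name: str
    community: str   # e.g. "vision", "hearing", "speech"
    year: int        # publication year
    data_type: str   # e.g. "image", "video", "audio"
    shared: bool     # whether others can obtain the data


def sharing_rate_by_community(records):
    """Return {community: fraction of its datasets that are shared}."""
    total = Counter(r.community for r in records)
    shared = Counter(r.community for r in records if r.shared)
    # Counter returns 0 for communities with no shared datasets.
    return {c: shared[c] / n for c, n in total.items()}


def count_by_decade(records):
    """Return {decade start year: number of datasets published in it}."""
    return dict(Counter((r.year // 10) * 10 for r in records))


# Placeholder rows for illustration only; these are not real datasets.
EXAMPLE_RECORDS = [
    DatasetRecord("placeholder-a", "vision", 2018, "image", True),
    DatasetRecord("placeholder-b", "vision", 2012, "image", False),
    DatasetRecord("placeholder-c", "hearing", 2020, "video", True),
    DatasetRecord("placeholder-d", "speech", 1995, "audio", False),
]

if __name__ == "__main__":
    print(sharing_rate_by_community(EXAMPLE_RECORDS))
    print(count_by_decade(EXAMPLE_RECORDS))
```

Keeping each dataset as a flat record makes it straightforward to add further descriptive breakdowns (for example, by data type) without changing the extraction step.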
Potential Limitations
There are some inherent challenges with the proposed research activities. For example, a systematic review of accessibility datasets assumes that there are ways to search for and identify all accessibility datasets. However, given that such datasets are dispersed across different research areas with no consistent reporting, it is hard to do so. Dataset search is indeed a challenging problem with a research area of its own [6]. In this research, we propose to do the detective work with a staggering strategy [20]. However, this may still result in missing datasets.
We also anticipate that using a non-probability sampling method when recruiting researchers and data scientists working with accessibility datasets can limit representation and introduce sampling bias. There could also be a response bias, where those participating may be more or less willing to share datasets. We will include and report their responses to experience and attitude questions that may capture such dispositions.
Last, in the overall analysis of datasets and the communities of focus, rigid categorization of disability is difficult [2,3] and perhaps a questionable task, as conditions are diverse and fluid [30]. The proposed intervention for mitigating risks in sharing accessibility datasets will not be a one-size-fits-all solution.
Current Progress and Future Work
Currently, we have completed major portions of the analysis in answering RQ1, and this work is published as a technical paper at ASSETS 2021 [20]. It presents a qualitative and descriptive analysis of the identified datasets spanning over 35 years (1984-2020, N=137 datasets) that represent populations from different communities in wellness, accessibility, and aging. This first step of the research, the systematic review, considered how the accessibility community navigates the tension between data needs and concerns by exploring existing data practices in terms of the communities of focus, the current distribution of the datasets across communities and its changes over the years, and data collection purposes and methods. So far, we have observed multiple interdependent factors leading to the current status of accessibility datasets and data sharing standards. We will take a deeper dive into these phenomena in the second step of the research, with surveys and interviews, to gain better insights into potential interest and challenges in working with accessibility datasets.
IncluSet [18], a data surfacing repository for accessibility datasets, provides a platform that allows researchers and the disability community to discover and link to accessibility datasets (Figure 2). Those linking their dataset to this repository fill out a submission form, which facilitates the process of documenting the sources. The third step in the research plan will extend this effort to nurture ethical data sharing by incorporating sociotechnical guidelines and frameworks that promote greater transparency and less data misuse.
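To make the idea of documentation through a submission form more concrete, the sketch below checks a submitted record for missing documentation fields. The field names in `REQUIRED_FIELDS` and the helper `missing_fields` are hypothetical; they do not reflect the actual IncluSet form, and the URL is a placeholder.

```python
# Hypothetical documentation fields; not the actual IncluSet submission form.
REQUIRED_FIELDS = (
    "title",
    "source_url",
    "community",
    "data_type",
    "access_conditions",
)


def missing_fields(submission):
    """Return the required fields that are absent or blank, in declared order."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = submission.get(field)
        # Treat None and whitespace-only strings as undocumented.
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


if __name__ == "__main__":
    example = {
        "title": "placeholder dataset",
        "source_url": "https://example.org/placeholder",
        "community": "vision",
        "data_type": "   ",
    }
    print(missing_fields(example))
```

A check of this kind only encourages consistent reporting; it does not by itself address whether the documented access conditions are appropriate for the people represented in the data.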
Conclusion
Datasets directly sourced from underrepresented communities such as people with disabilities and older adults can contribute to more inclusive AI applications as well as innovative assistive technologies. However, there are challenges in data collection and sharing practices as they pertain to inclusivity, bias, and privacy [24]. Given the increasing concerns about fairness and ethics in machine learning, the overarching goal of this research is to extend data science methods and data-driven technologies to be inclusive in order to accelerate technological innovations for everyone, regardless of disability and health conditions.
Despite its limitations, this research has a broader impact in better equipping the accessibility research community for open innovation and effective use of accessibility data. Moreover, the limitations are not unique to this research, which adds to the significance of this work. Accessibility data are difficult to collect, as the populations are small and encounter physical or information barriers that limit their representation in society. The ambiguity and complexity of disability "classification" also complicate the data collection and analysis process as well as the solutions to address technical, legal, and institutional issues in these populations. It is thus important to call for better data practices that are more attuned to the concerns of these communities and of the researchers and practitioners involved.
Acknowledgments
I am grateful to my Ph.D. advisor, Dr. Hernisa Kacorri, whose work and ideas provided much of the inspiration behind this research. I am also thankful to fellow students Utkarsh Dwivedi and Amnah Mahmood for their support and contribution, as well as prior students Sravya Amancherla, Mayanka Jha, and Riya Chanduka for their preliminary contributions to the project. This work is supported by the National Institute on Disability, Independent Living, and Rehabilitation Research (NIDILRR), ACL, HHS (#90REGE0008).
References
- Jacob Abbott, Haley MacLeod, Novia Nurain, Gustave Ekobe, and Sameer Patil. 2019. Local Standards for Anonymization Practices in Health, Wellness, Accessibility, and Aging Research at CHI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI ’19, 1–14. https://doi.org/10.1145/3290605.3300692
- Robert L Berg and Joseph S Cassells. 1992. The second fifty years: Promoting health and preventing disability. https://doi.org/10.17226/1578
- Brianna Blaser and Richard E. Ladner. 2020. Why is Data on Disability so Hard to Collect and Understand? In Proceedings of the 5th International Conference on Research in Equity and Sustained Participation in Engineering, Computing, and Technology (RESPECT).
- Danielle Bragg, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt Huenerfauth, Hernisa Kacorri, Tessa Verhoef, Christian Vogler, and Meredith Ringel Morris. 2019. Sign Language Recognition, Generation, and Translation: An Interdisciplinary Perspective. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS ’19), 16–31. https://doi.org/10.1145/3308561.3353774
- Ugo Cesari, Giuseppe De Pietro, Elio Marciano, Ciro Niri, Giovanna Sannino, and Laura Verde. 2018. A new database of healthy and pathological voices. Computers & Electrical Engineering 68: 310–321. https://doi.org/10.1016/j.compeleceng.2018.04.008
- Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez, Emilia Kacprzak, and Paul Groth. 2020. Dataset search: a survey. The VLDB Journal 29, 1: 251–272.
- Huiyu Duan, Guangtao Zhai, Xiongkuo Min, Zhaohui Che, Yi Fang, Xiaokang Yang, Jesús Gutiérrez, and Patrick Le Callet. 2019. A dataset of eye movements for the children with autism spectrum disorder. In Proceedings of the 10th ACM Multimedia Systems Conference, 255–260.
- Raymond Fok, Harmanpreet Kaur, Skanda Palani, Martez E Mott, and Walter S Lasecki. 2018. Towards more robust speech interactions for deaf and hard of hearing users. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, 57–67.
- Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna M. Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for Datasets. CoRR abs/1803.09010. Retrieved from http://arxiv.org/abs/1803.09010
- João Guerreiro, Daisuke Sato, Saki Asakawa, Huixu Dong, Kris M Kitani, and Chieko Asakawa. 2019. Cabot: Designing and evaluating an autonomous navigation robot for blind people. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility, 68–82.
- Anhong Guo, Ece Kamar, Jennifer Wortman Vaughan, Hanna Wallach, and Meredith Ringel Morris. 2019. Toward Fairness in AI for People with Disabilities: A Research Roadmap. In arXiv preprint arXiv:1907.02227.
- Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018. VizWiz Grand Challenge: Answering Visual Questions from Blind People. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr.2018.00380
- Foad Hamidi, Kellie Poneres, Aaron Massey, and Amy Hurst. 2018. Who Should Have Access to My Pointing Data?: Privacy Tradeoffs of Adaptive Assistive Technologies. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS ’18), 203–216. https://doi.org/10.1145/3234695.3239331
- Saad Hassan, Larwan Berke, Elahe Vahdani, Longlong Jing, Yingli Tian, and Matt Huenerfauth. 2020. An Isolated-Signing RGBD Dataset of 100 American Sign Language Signs Produced by Fluent ASL Signers. In Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 89–94. Retrieved from https://www.aclweb.org/anthology/2020.signlang-1.14
- Matt Huenerfauth and Hernisa Kacorri. 2014. Release of experimental stimuli and questions for evaluating facial expressions in animations of American Sign Language. In Proceedings of the 6th Workshop on the Representation and Processing of Sign Languages: Beyond the Manual Channel, The 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland.
- Eun Seo Jo and Timnit Gebru. 2020. Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency: 306–316. https://doi.org/10.1145/3351095.3372829
- Hernisa Kacorri. 2017. Teachable Machines for Accessibility. In SIGACCESS Access. Comput., 10–18. https://doi.org/10.1145/3167902.3167904
- Hernisa Kacorri, Utkarsh Dwivedi, Sravya Amancherla, Mayanka Jha, and Riya Chanduka. 2020. IncluSet: A Data Surfacing Repository for Accessibility Datasets. In The 22nd International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS ’20). https://doi.org/10.1145/3373625.3418026
- Hernisa Kacorri, Utkarsh Dwivedi, and Rie Kamikubo. 2020. Data Sharing in Wellness, Accessibility, and Aging. In NeurIPS Workshop on Dataset Curation and Security.
- Rie Kamikubo, Utkarsh Dwivedi, and Hernisa Kacorri. 2021. Sharing Practices for Datasets Related to Wellness, Accessibility, and Aging. In The 23rd International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS ’21).
- Kyungjun Lee and Hernisa Kacorri. 2019. Hands Holding Clues for Object Recognition in Teachable Machines. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–12. https://doi.org/10.1145/3290605.3300566
- Daniel Leightley, Moi Hoon Yap, Jessica Coulson, Yoann Barnouin, and Jamie S McPhee. 2015. Benchmarking human motion analysis using kinect one: An open source dataset. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 1–7.
- Association for Computing Machinery. 2020. Artifact Review and Badging Version 1.1. Retrieved from https://www.acm.org/publications/policies/artifact-review-and-badging-current
- Meredith Ringel Morris. 2020. AI and Accessibility. In Commun. ACM, 35–37. https://doi.org/10.1145/3356727
- Luz Rello, Ricardo Baeza-Yates, and Joaquim Llisterri. 2014. DysList: An Annotated Resource of Dyslexic Errors. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), 1289–1296. Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/612_Paper.pdf
- Andrew Sears and Vicki L. Hanson. 2012. Representing Users in Accessibility Research. In ACM Trans. Access. Comput., 7:1-7:6. https://doi.org/10.1145/2141943.2141945
- Majid Shishehgar, Donald Kerr, and Jacqueline Blake. 2018. A systematic review of research into how robotic technology can help older people. Smart Health 7: 1–18.
- Jutta Treviranus. 2019. The Value of Being Different. In Proceedings of the 16th Web For All 2019 Personalization - Personalizing the Web (W4A ’19), 1:1-1:7. https://doi.org/10.1145/3315002.3332429
- Shari Trewin. 2018. AI fairness for people with disabilities: Point of view. arXiv preprint arXiv:1811.10670.
- Meredith Whittaker, Meryl Alper, Cynthia L Bennett, Sara Hendren, Liz Kaziunas, Mara Mills, Meredith Ringel Morris, Joy Rankin, Emily Rogers, Marcel Salas, and others. 2019. Disability, Bias, and AI. AI Now Institute, November.
- Emre Yilmaz, MS Ganzeboom, LJ Beijer, Catia Cucchiarini, and Helmer Strik. 2016. A Dutch dysarthric speech database for individualized speech therapy research.