EMPIRICAL WEB ACCESSIBILITY EVALUATION FOR BLIND WEB USERS
Alexander Hambley, The University of Manchester, alexander.hambley@manchester.ac.uk
Abstract
The project intends to produce a novel, accurate and holistic system that assesses the accessibility of an entire website based on a sample of webpages. The project also strives to bring statistical rigour to the sampling and clustering of webpages. By orchestrating pre-existing evaluative tools alongside developer testing, we can produce a combined accessibility assessment. A major benefit to the accessibility field is the ability to evaluate large websites effectively and at scale. The approach strives to improve automated testing coverage by combining evaluative tools, and to minimise inaccuracy through effective clustering and sampling.
Problem addressed
This project addresses three main problems: the limited coverage of automated testing; sampling in accessibility evaluation; and the orchestration of different evaluative tools.
Automated evaluation tools assess websites for accessibility issues. These tools are not completely effective, as they are limited in their coverage of accessibility problems [1]. The Web Content Accessibility Guidelines (WCAG) are the standard for accessibility used by the community, but the problems that users encounter when using a website are not always covered by the WCAG criteria. In the benchmarking study of [1], none of the automated evaluation tools covered more than 50% of the success criteria. Moreover, accessibility problems on well-developed websites can be subtle, and thus harder to catch.
WCAG 2.1 has a set of success criteria that fall under four principles: perceivable, operable, understandable, and robust. Each criterion has an associated conformance level of A (the lowest), AA, or AAA (the highest). A higher conformance level includes the lower ones: if a site meets conformance level AAA, then it implicitly meets levels AA and A.
A benefit of automated testing is that it is scalable. It is much easier to improve system performance than it is to employ more human evaluators.
The W3C states that including users in testing, thus taking a combinatorial approach, helps create a better site for users, improves the efficiency of development, and enhances the motivation of the workforce [2]. An issue with user testing is that human evaluators are not consistent in their assessments: in one study, experienced evaluators agreed on only 55% of test page ratings, and their accuracy was determined to be 76% [3]. Despite this, user testing allows for the evaluation of criteria that automated testing cannot cover, and it provides developers with rich qualitative feedback. This highlights the importance of a combinatorial solution consisting of automated and user testing.
Websites can be very large and contain many distinct types of page. We need to devise a method of sampling these pages in a way that minimises inaccuracy. We intend to produce a solution where numerous tools are used in a modular way to increase our evaluation coverage. Resources such as time, processing capacity, and disk space are limited, and the performance impact of running several evaluation tools on a large volume of webpages is significant. To address this, we aim to draw a statistically relevant sample from the set of webpages. Sampling and clustering approaches are discussed in the next section.
Motivation
Population sampling approaches are common, particularly when dealing with human participants. We aim to use similar approaches to sample webpages.
In our statistical analysis, our target population is the complete set of webpages, and we aim to draw a sample that is large enough to be statistically representative for use in our evaluation.
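To make this concrete, one possible way to decide how many pages to evaluate is a standard finite-population sample size calculation (Cochran's formula with a finite population correction). The sketch below is illustrative only; the confidence level, margin of error, and population size are assumed values rather than figures from this project.

import math

def sample_size(population, confidence_z=1.96, margin_of_error=0.05, p=0.5):
    """Cochran's formula with a finite population correction.

    population      -- total number of crawled webpages (N)
    confidence_z    -- z-score for the desired confidence level (1.96 ~ 95%)
    margin_of_error -- acceptable error (e), e.g. 0.05 for +/- 5%
    p               -- estimated proportion; 0.5 maximises the sample size
    """
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# e.g. a site with 20,000 crawled pages needs roughly 377 pages at 95% / 5%.
print(sample_size(20_000))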
Limitations to current approaches to sampling
A vital part of this project is to effectively sample the dataset to reduce the number of webpages to evaluate. In the accessibility sector, the range of sampling approaches used is limited. A common approach is stratified sampling.
Several stratified sampling approaches have been reviewed, each differing in their metrics [4]. The approaches used the Euclidean, Manhattan and Cosine distances, combined with several error profiles computed by accessibility testing software. Euclidean distance is the straight-line distance between two points: the hypotenuse of a right-angled triangle drawn in Euclidean space. It loses discriminative power as the dimensionality of the data increases. Manhattan distance, often referred to as "taxicab geometry", sums the absolute differences along each dimension rather than taking the hypotenuse. Cosine distance measures the angle between the two vectors rather than their magnitude.
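To illustrate how these metrics differ, the following sketch computes all three distances over two hypothetical per-page error-profile vectors; the feature values are invented for the example.

import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # "Taxicab" distance: sum of absolute differences along each dimension.
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity: depends on the angle between the vectors,
    # not on their magnitudes.
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical error profiles for two pages (e.g. failure counts per check).
page_a = np.array([3.0, 0.0, 7.0, 1.0])
page_b = np.array([1.0, 2.0, 5.0, 0.0])

print(euclidean(page_a, page_b), manhattan(page_a, page_b), cosine_distance(page_a, page_b))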
The accuracy of stratified sampling is lacking: the best approach still has an inaccuracy of 24.49% for WCAG 1.0 conformance. Moreover, using the Cosine distance with the "FailRate" error profile is computationally expensive, because a PAM (k-medoids) clustering algorithm must be run in the background.
Ad-hoc sampling is another common approach. Rather than being probabilistic, this method is based on a "rule", such as using a precomputed lexicon to generate search engine queries [5]. The approach removes the need for human intervention in selecting webpages, but significant biases remain. Firstly, the words chosen for the lexicon can bias the search engine results. Secondly, a search engine prioritises pages in a proprietary way that we cannot control. Thirdly, pages containing many of the query terms will often appear higher in the results, which might not yield the samples we want. Ad-hoc approaches are also often situational: many have been designed for a specific context, and their use may not extend beyond the one they were developed for. It is clear that ad-hoc methods are not an optimal solution. When compared to stratified sampling approaches, the ad-hoc sampling method produced the most inaccurate results for both WCAG and WAQM [4].
Another non-probabilistic approach is quota sampling. The population is segmented into mutually exclusive subgroups, but in contrast to stratified sampling, a non-random and targeted sample is taken. This approach is useful if we wish to prioritise specific types of webpages. The concern with this approach is that care must be taken to manage selection biases.
Limitations to current approaches to clustering
Cluster analysis (clustering) is a form of unsupervised learning. By using clustering, we can group objects by similarity. There are many clustering algorithms, each with their benefits and drawbacks. There are also different implementations of these clustering algorithms; some algorithms can be used with browser session data, others with extracted DOM data.
Hierarchical clustering is a common approach and comes in two distinct forms: divisive and agglomerative. Divisive clustering is infrequently used, so we will only consider hierarchical agglomerative clustering (HAC).
HAC produces a tree known as a dendrogram. The dendrogram is a significant advantage of hierarchical clustering, as it lets us see diagrammatically the relationships the algorithm has produced between clusters [6]. By observing the dendrogram, we can see that similar clusters are merged first, with dissimilar clusters merged later.
HAC also allows us to infer an appropriate number of clusters by cutting the dendrogram. Its most significant drawback is that the process is iterative: the two most similar groups are merged at each step until a single cluster remains, which means that a standard implementation of HAC has O(n³) time complexity.
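As a minimal sketch of HAC, the following example clusters a handful of hypothetical page feature vectors using SciPy's hierarchical clustering; the feature matrix, linkage method, and cut threshold are illustrative assumptions rather than choices made in this project.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical feature vectors, one row per webpage (e.g. tag frequencies).
pages = np.array([
    [10, 2, 0, 5],
    [11, 2, 1, 5],
    [ 3, 9, 7, 0],
    [ 4, 8, 6, 1],
    [20, 1, 0, 9],
], dtype=float)

# Agglomerative clustering: repeatedly merge the two closest groups.
# 'average' linkage uses the mean pairwise distance between groups.
merges = linkage(pages, method="average", metric="euclidean")

# Cutting the dendrogram at a distance threshold yields flat cluster labels.
labels = fcluster(merges, t=5.0, criterion="distance")
print(labels)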
There are faster implementations of HAC, but these come at the cost of being limited to specific distance methods. One example is SLINK, which has a complexity of O(n²) [7]. SLINK uses single-link clustering, also known as nearest neighbour clustering. A major drawback of this approach is that because the algorithm searches for the minimum distance, it produces long chains whereby a distant group may not be clustered until the very end. In other clustering techniques, that distant group will often be clustered with another distant group earlier in the process. The most significant problem is that SLINK, and other similar algorithms, are not appropriate for the large data sets that will be used in this project.
The Distributed Hierarchical Clustering (DHC) algorithm addresses this issue [8]. DHC first partitions the data set using a k-dimensional (k-d) tree. By calculating the nearest neighbour boundary regions in this tree, the algorithm groups the data; the nearest neighbour pairs are then found, and the globally mutual nearest neighbour pairs are merged to create new clusters. A major benefit of DHC is that numerous distance methods, including single-link clustering and centroid linkage, are supported. That said, the algorithm is only appropriate for very large data sets, because it has a significant start-up cost in building the k-d tree. It is also distributed, and as such comes with concurrency considerations.
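The sketch below is not the DHC algorithm itself, and it is not distributed; it only illustrates the underlying idea of one merging round, assuming that mutual nearest-neighbour pairs are found via a k-d tree over hypothetical page feature vectors.

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical page feature vectors.
points = np.random.rand(1000, 8)

# Build a k-d tree so nearest-neighbour queries are cheap.
tree = cKDTree(points)

# For each point, find its nearest neighbour (k=2: itself plus one other).
_, neighbours = tree.query(points, k=2)
nearest = neighbours[:, 1]

# Mutual nearest-neighbour pairs: i's nearest is j and j's nearest is i.
pairs = {tuple(sorted((i, j))) for i, j in enumerate(nearest) if nearest[j] == i}
print(f"{len(pairs)} mutual nearest-neighbour pairs to merge in the first round")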
Proposed solution and methodology
This project will produce a combined assessment of the projected accessibility of a website. As Figure 1 outlines, a crawler will first crawl a site. The dataset produced by the crawler then needs to be reduced through clustering and sampling to produce an accurate subset of the website. Representing each page's DOM as XML is useful here, as it lets us compare pages by tree structure, number of nodes, or metadata. The sample is then used as the input for evaluation, where we orchestrate pre-existing evaluation tools to assess the conformity of the sample. We will then produce a combined output of these tools so that developers can use this feedback to improve the accessibility of their webpages.
The developed web crawler differs from ordinary crawlers. It prioritises the high-level subdirectories of a URL. Consider the URL: www.bbc.co.uk/
The web crawler first looks at all of the anchor tags on that page and adds them to a list. Then, rather than downloading all of the links in this list, the crawler prioritises the URLs with only a one-level subdirectory.
This includes links such as www.bbc.co.uk/news and www.bbc.co.uk/sport. Only afterwards will the crawler download deeper pages such as www.bbc.co.uk/news/uk and www.bbc.co.uk/news/technology. The logic behind this approach is that we wish to download a large variety of pages, rather than many similar pages.
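A minimal sketch of this prioritisation is given below, assuming the frontier is a simple list of discovered URLs and that path depth (the number of subdirectory levels) is the only ranking signal; the URLs and helper names are illustrative.

from urllib.parse import urlparse

def path_depth(url):
    # Number of non-empty path segments: bbc.co.uk/news/uk has depth 2.
    return len([segment for segment in urlparse(url).path.split("/") if segment])

discovered = [
    "https://www.bbc.co.uk/news/uk",
    "https://www.bbc.co.uk/news",
    "https://www.bbc.co.uk/sport",
    "https://www.bbc.co.uk/news/technology",
]

# Shallow URLs (one-level subdirectories) are downloaded before deeper ones,
# so the crawler covers a wide variety of page types early on.
frontier = sorted(discovered, key=path_depth)
print(frontier)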
Status of research
This PhD project is in its first year, and it builds on an MSc thesis and associated paper [9]. The MSc project created new pages, called "representational pages", out of similarly clustered pages. Firstly, a web crawler was built that outputs a list of pages to download. These pages were then downloaded and processed to produce the DOM. Because the DOM is stored as an XML file, any broken HTML must first be repaired; this can be problematic, as broken HTML is an accessibility problem in itself.
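As an illustration, a lenient HTML parser can be used to repair such markup before it is serialised as well-formed XML; the sketch below uses lxml and an invented fragment of broken HTML, and is not the MSc project's actual repair step.

from lxml import etree, html

# A deliberately broken fragment: unclosed tags and missing structure.
broken = "<html><body><p>Unclosed paragraph<div>Orphaned div</body>"

# lxml's HTML parser is lenient and still builds a complete element tree.
document = html.fromstring(broken)

# Serialising that tree as XML yields well-formed markup that downstream
# DOM-comparison and clustering steps can rely on.
repaired = etree.tostring(document, method="xml", pretty_print=True).decode()
print(repaired)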

Pages are then clustered by DOM similarity, producing groups referred to as demographics. The comparison used to produce these groups is strict: demographics that differ by even a single feature are placed into distinct groups. Following this, pages are sampled for the output. The sample size is calculated for the entire website; based on this value, sample sizes are allocated to the larger groups, while smaller groups below a threshold are merged.
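A minimal sketch of this allocation step is given below, assuming proportional allocation across clusters, a hypothetical overall sample size, and an arbitrary minimum-size threshold below which clusters are pooled; the cluster names and sizes are invented.

import math

def allocate_samples(cluster_sizes, total_sample, min_cluster_size=10):
    """Proportionally allocate a website-wide sample across clusters.

    Clusters smaller than `min_cluster_size` are pooled into one merged
    group before allocation, mirroring the merge step described above.
    """
    large = {name: size for name, size in cluster_sizes.items() if size >= min_cluster_size}
    pooled = sum(size for size in cluster_sizes.values() if size < min_cluster_size)
    if pooled:
        large["merged_small_clusters"] = pooled

    population = sum(large.values())
    return {name: math.ceil(total_sample * size / population) for name, size in large.items()}

# Hypothetical cluster sizes (pages per template group) and overall sample size.
print(allocate_samples({"news": 5000, "sport": 3000, "weather": 40, "misc": 6}, total_sample=377))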
In comparison to the MSc project, this PhD project aims to bring greater statistical rigour to the sampling and clustering of the dataset. To date, a literature review has been conducted on various clustering and sampling techniques, and the web crawler has been created. The crawler scrapes webpages in a way intended to produce a more representative dataset for templated sites. The project is in its early stages, and as such, this crawling technique will need to be evaluated to see whether the set of pages it downloads results in a better dataset.
Future work for this project will include comparing and evaluating various techniques for clustering and sampling on test data sets. In no particular order, the considerations for choosing a clustering methodology will include how effective the technique is on specific data types, the size of the average data set, and the speed and accuracy of the clustering. We then need to produce a tool to cluster and sample the data output by the crawler. Alongside clustering, active learning techniques are also being considered, but it is too early to state whether they will be used in this project.
Automated tools exist, but they do not evaluate websites empirically. The key contribution of this project is that it demonstrates how sampling can be done empirically, developing a unified report from the orchestration of numerous interoperable evaluation tools. This project also demonstrates how this can be done using a minimal number of pages through effective sampling, and it aims to demonstrate the effectiveness of a combinatorial approach to evaluation. More accurate accessibility evaluation is critical if we wish to improve the web experience for disabled people.
References
- M. Vigo, J. Brown, and V. Conway, “Benchmarking web accessibility evaluation tools: Measuring the harm of sole reliance on automated tests,” W4A 2013 - Int. Cross-Disciplinary Conf. Web Access., 2013, doi: 10.1145/2461121.2461124.
- W3C Web Accessibility Initiative (WAI), “Involving Users in Web Projects for Better, Easier Accessibility,” Jan. 2019. [Online]. Available: https://www.w3.org/WAI/planning/involving-users/.
- G. Brajnik, Y. Yesilada, and S. Harper, “Is accessibility conformance an elusive property? A study of validity and reliability of WCAG 2.0,” ACM Trans. Access. Comput., vol. 4, no. 2, 2012, doi: 10.1145/2141943.2141946.
- G. Brajnik, A. Mulas, and C. Pitton, “Effects of sampling methods on web accessibility evaluations,” ASSETS’07: Proc. Ninth Int. ACM SIGACCESS Conf. Comput. Access., pp. 59–66, 2007, doi: 10.1145/1296843.1296855.
- K. Bharat and A. Broder, “A technique for measuring the relative size and overlap of public Web search engines,” Comput. Networks ISDN Syst., vol. 30, no. 1, pp. 379–388, 1998, doi: 10.1016/S0169-7552(98)00127-5.
- M. Piernik, D. Brzezinski, T. Morzy, and A. Lesniewska, “XML clustering: a review of structural approaches,” vol. 30, no. 3, pp. 297–323, 2015.
- R. Sibson, “SLINK: An optimally efficient algorithm for the single-link cluster method,” Comput. J., vol. 16, no. 1, pp. 30–34, 1973, doi: 10.1093/comjnl/16.1.30.
- W. Zhang, G. Zhang, X. Chen, Y. Liu, X. Zhou, and J. Zhou, “DHC: A Distributed Hierarchical Clustering Algorithm for Large Datasets,” J. Circuits, Syst. Comput., vol. 28, pp. 1950065:1-1950065:26, 2019.
- S. Harper, A. Moon, M. Vigo, G. Brajnik, and Y. Yesilada, “DOM block clustering for enhanced sampling and evaluation,” 2015, pp. 1–10, doi: 10.1145/2745555.2746649.