Tianshi Xie, Computer Science and Software Engineering Department, Auburn University, Auburn, USA
Cheryl D. Seals, Computer Science and Software Engineering Department, Auburn University, Auburn, USA


Blind and visually impaired people cannot accurately judge changes in their surroundings because of their limited eyesight, which increases the risk of accidents indoors and outdoors. We propose ARML, a mobile system based on MobileNet Single-Shot Detection (MobileNet-SSD), Augmented Reality (AR), and a voice interaction system for object detection and distance calculation. ARML aims to (i) help visually impaired people avoid obstacles in daily life, (ii) quickly locate user-specified items using computer vision, and (iii) integrate Lidar for AR/VR experiences, which reduces the need for additional equipment and improves the accuracy of the detected distance between users and obstacles. The system improves users' ability to identify obstacles in their environment and thereby improves the quality of life of visually impaired people. Experimental results indicate a distance accuracy of 96% within a five-meter range, outpacing other research (Chen et al., 2019), and a frame rate of more than forty-four frames per second, surpassing similar projects (Srinivasan et al., 2020).


According to a 2019 World Health Organization report, more than 2.2 billion people worldwide are visually impaired or blind. Researchers have investigated navigation tools to help the visually impaired (e.g., canes and guide dogs, depth cameras, radio frequency, ultrasonic sensors, or infrared sensors), but most of these tools are only marginally effective indoors and at close range. AI-Guide (Aldas et al., 2020), one of the most popular solutions, is based on an Apple SDK (ARKit) and helps visually impaired people locate and grasp objects. The problem is that this system cannot identify everyday items such as a shoe, an apple, or a bottle. Our research project improves small object recognition by incorporating a machine learning model that recognizes small 3D objects. We also investigated Microsoft's Seeing-AI, which is based on Augmented Reality and Lidar technology and supports identification of text, color, currency, and other content. Our research project extends this functionality to support the detection of general 3D objects in real time.

A convolutional neural network (CNN) is used to identify obstacles in the visually impaired person's environment so that the person can navigate safely. Lin et al. (2017) adopted Faster R-CNN in the image recognition module of the server. The network protocol is RTP in fast mode, and YOLO is used as the image recognition module on the server. In the direction and distance module, the authors use the Mono-SLAM formula to calculate the distance between the camera and the target. Chen et al. (2019) used two neural networks: the Vision Disparity Network and the Semantic Segmentation Network. The former generates the disparity map used to calculate the distance between the camera and the target, and the latter detects obstacles. Srinivasan et al. (2020) used the SSDLite MobileNetV2 model to detect obstacles and stereo vision to estimate the true distance from camera parameters and disparity. The authors classify objects and obstacles with a fuzzy algorithm and send the object and obstacle information back to the user via a text-to-speech engine. Their experiment shows that too high a frame rate increases computation, so they set the frame rate to 15 FPS. Jason et al. (2019) presented an obstacle detection method based on the AlexNet neural network; this method effectively overcomes the influence of illumination on image recognition.

This paper discusses the design of a system in which a mobile device interacts with users through a virtual assistant. The virtual assistant provides modules such as obstacle detection, distance estimation, and an alert system to analyze the environment in real time and remind users to avoid obstacles. To improve the accuracy of the distance between the obstacle and the user, we use the Lidar sensor instead of the ARKit API (ARAnchor, 2021), so that the system actively senses the environment without being affected by ambient light. Experimental results show that our system estimates distances with 96% accuracy within a five-meter range, outpacing other research projects (Chen et al., 2019; Lin et al., 2019; Kang et al., 2019). The frame rate of our experimental system exceeds forty-four frames per second, surpassing similar projects (Srinivasan et al., 2020).


Our contributions address the following fundamental research questions for visually impaired people: (i) How can the system be designed so that it is easy for users to carry? (ii) How can the user be notified of the nearest obstacle in real time? (iii) How can the accuracy of the distance between the obstacle and the user be improved? (iv) How can more common items be recognized than with existing approaches? (v) How can user-specified items be found within a certain area? (vi) What information should the system give users to make it easier to use? To answer these questions, the proposed solution uses a single mobile device as its only hardware, which makes it convenient for visually impaired people to go out without any additional burden. We also provide a speech recognition module that makes the system convenient to operate. The system analyzes the environment around the user in real time through Augmented Reality, accurately locates objects using the Lidar sensor of the mobile device, and uses a lightweight neural network module to avoid excessive computation. It performs obstacle recognition, calculates the distance from the user to each obstacle, and maintains a list of obstacles. From this list, the user can obtain the nearest obstacle and the user-specified items within a certain area. The system also describes the environment around the user in real time with different types of feedback, including visual, audio, and vibration modes of different frequencies. For visually impaired users, the system uses the augmented reality module to visualize the data; for example, it adds a 2D or 3D model with different color markers to highlight obstacles. For blind users, the system describes obstacles (category, distance, orientation) in detail in response to voice commands. Users can also set different safety distances according to their level of vision impairment to reduce the risk of accidents.
The average distance accuracy of the ARML system is over 96% in complex environments. The system has two modes. (i) In safe mode, the virtual assistant quickly detects the nearest obstacle (a laptop computer) among many obstacles within the user-defined distance (less than 1 meter) and simultaneously sends a reminder to the user through both visual and audio methods. The nearest obstacle is marked with a red icon on the main interface, and the detected distance (0.68 m) is displayed below the main interface. (ii) In query mode, users click the query button on the main interface or enter query mode by voice to search for the desired item (a TV); the system broadcasts the item and its distance in real time and displays the detected distance (4.22 m) below the main interface.
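The two modes can be illustrated with a small sketch over the obstacle list. This is not the ARML implementation: the `Obstacle` record and the `nearest_within` and `find_item` helpers are hypothetical names, the chair entry is invented for illustration, and the choice to treat a distance exactly equal to the threshold as inside it is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Obstacle:
    """One detected object: its class label and measured distance in meters."""
    label: str
    distance_m: float


def nearest_within(obstacles: List[Obstacle],
                   safety_distance_m: float) -> Optional[Obstacle]:
    """Safe mode: closest obstacle at or inside the user-defined distance."""
    close = [o for o in obstacles if o.distance_m <= safety_distance_m]
    return min(close, key=lambda o: o.distance_m, default=None)


def find_item(obstacles: List[Obstacle], label: str,
              max_range_m: float = 5.0) -> Optional[Obstacle]:
    """Query mode: closest detection of the requested item within range."""
    matches = [o for o in obstacles
               if o.label.lower() == label.lower()
               and o.distance_m <= max_range_m]
    return min(matches, key=lambda o: o.distance_m, default=None)


# Illustrative detections; the laptop and TV distances echo the examples
# in the text, and the chair is a made-up entry.
detected = [Obstacle("laptop", 0.68), Obstacle("chair", 1.90), Obstacle("tv", 4.22)]

alert = nearest_within(detected, safety_distance_m=1.0)  # safe mode
found = find_item(detected, "TV")                        # query mode
```

Keeping both modes as queries over one shared list means that detection and distance estimation run once per frame, and each mode only filters the result.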

Requirements and Design

In the proposed system, users interact with a virtual assistant supported by deep learning and AR, and the mobile device gives users feedback through sound, visuals, and vibration to help them avoid obstacles. We developed ARML as an Apple app and ran our experiments on an iPad Pro 11 with a Lidar scanner.

System Design

The system consists of a main logic core module and an object detection module, and it provides a virtual voice assistant to interact with users. The main logic core module offers a safe mode (i.e., it detects obstacles closer than 1 meter) and a search mode (i.e., the user can choose the item's name from a list). The system indicates when the object is found, its distance from the user, and its relative direction. There are several subsystems. One is the vibration system: in safe mode, for example, when the user is less than 1 meter from the obstacle, the device vibrates at Frequency A, and it vibrates at other frequencies at other distances. Another is the visual display system, which shows the name and bounding box of the target obstacle or the searched object in 2D and adds a 3D indicator model at the center of the bounding box. In this way, the target object is displayed more prominently on the mobile device so that the user can identify it more clearly.

The object detection module includes target detection and distance estimation. Target detection is performed by the Machine Learning module. First, the real-time video is passed frame by frame to the target detection neural network, whose final output is the object type and the bounding box. Distance estimation is performed by the AR module. From the 2D coordinates of the bounding box output by the neural network, we take the center point of the bounding box as the input to the AR module and cast a ray from that point into 3D space. At the point where the ray hits the object, we calculate the distance from user A (x1, y1, z1) to object B (x2, y2, z2) as the Euclidean distance.
This distance is then returned to the main logic core module, which continues to execute the user's instructions in the corresponding mode and finally presents the data to the user in visual, tactile, and acoustic form. This can greatly improve the user's ability to identify obstacles in the surrounding environment and reduce the risks of walking indoors or outdoors. Using the camera's right vector and the vector from the camera to the target object in the AR system, we obtain the angle of the target object relative to the user. The angle range is divided into three categories: left, right, and forward. For the distance setting, we use three levels: normal distance (2 m <= d <= 5 m), safe distance (1 m < d <= 2 m), and warning distance (d <= 1 m).
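The geometric steps described above (bounding-box center, Euclidean distance, direction from the camera's right vector, and distance level) can be sketched as follows. This is a minimal illustration rather than the authors' Unity code. All function names are hypothetical. The 15-degree half-width of the "forward" band is an assumed value, since the text does not give the angle thresholds. Because the normal and safe ranges both include d = 2 m, the sketch assigns that boundary to the safe level, which is also an assumption.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def bbox_center(x_min: float, y_min: float,
                x_max: float, y_max: float) -> Tuple[float, float]:
    """Center of a 2D bounding box; used as the starting point of the ray."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def euclidean_distance(a: Vec3, b: Vec3) -> float:
    """Distance between user A (x1, y1, z1) and hit point B (x2, y2, z2)."""
    return math.sqrt(sum((q - p) ** 2 for p, q in zip(a, b)))


def _norm(v: Vec3) -> float:
    return math.sqrt(sum(c * c for c in v))


def direction(camera_pos: Vec3, camera_right: Vec3, target_pos: Vec3,
              forward_half_angle_deg: float = 15.0) -> str:
    """Classify the target as left, right, or forward of the camera.

    The angle between the camera's right vector and the camera-to-target
    vector is 0 degrees for a target fully to the right, 90 degrees for a
    target straight ahead, and 180 degrees for a target fully to the left.
    The width of the forward band is an assumed parameter.
    """
    to_target = tuple(t - c for c, t in zip(camera_pos, target_pos))
    denom = _norm(camera_right) * _norm(to_target)
    if denom == 0.0:
        return "forward"  # degenerate input: target at the camera position
    dot = sum(r * t for r, t in zip(camera_right, to_target))
    cos_angle = max(-1.0, min(1.0, dot / denom))  # guard against rounding
    angle = math.degrees(math.acos(cos_angle))
    if angle < 90.0 - forward_half_angle_deg:
        return "right"
    if angle > 90.0 + forward_half_angle_deg:
        return "left"
    return "forward"


def distance_level(d: float) -> str:
    """Map a distance in meters to the three levels described in the text."""
    if d <= 1.0:
        return "warning"
    if d <= 2.0:  # d = 2 m is assigned to "safe" here (assumption)
        return "safe"
    if d <= 5.0:
        return "normal"
    return "out of range"


# Example: camera at the origin, right vector +x, target ahead and to the right.
user = (0.0, 0.0, 0.0)
hit_point = (1.0, 0.0, -1.0)
d = euclidean_distance(user, hit_point)
side = direction(user, (1.0, 0.0, 0.0), hit_point)
level = distance_level(d)
```

In an AR engine the hit point would come from the engine's own raycast against the Lidar-backed scene geometry; the sketch only covers the arithmetic applied to that result.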

Data, Evaluation and Results

We used a pre-trained MobileNet-SSD model trained on the COCO dataset, which contains 300k images, 1.5 million object instances, and 90 different categories of items. The input image size is 300×300×3. We integrated the MobileNet-SSD model and the AR system into Unity. We captured more than 50 images while testing the system and retrieved the data to calculate the experimental results. The experiment is divided into four groups. In the first group, we evaluate the accuracy of the estimated distance between the user and different objects (refrigerator, TV, and chair).

$$\frac{V_o - V_A}{V_A} \times 100 \qquad (1)$$

For example, when a user is less than one meter from an object, the system estimates the distance Vo between the user and the obstacle. We set the real distance as VA and use equation (1) to calculate the percent accuracy of the distance. We repeated the process at different distance levels and found that the distance accuracy of the system is best between three and four meters. Although our experiments were conducted in a real user's room rather than in a simulated lab, our average distance accuracy in this complex environment is over 96%, indicating that the system maintains high ranging accuracy within a five-meter indoor range.

In the second set of experiments, the system's virtual assistant gives a speech prompt for each recognized obstacle. We tested the prompts for the category of the item and the distance between the item and the user. If the prompt is correct, we mark it as YES; otherwise, we mark it as NO. For example, when the system detects a TV 2.1 meters in front of the user, the virtual assistant tells the user that "the TV is more than two meters in front of you." For three common household appliances, the system's voice broadcast accuracy reached 100%.

In the third experiment, we evaluated the reaction time of the system when identifying an object. Robert Billers argued that, ideally, users should get feedback on their actions within 100 milliseconds: the blink of an eye, one of the fastest subliminal movements, lasts between 100 and 150 milliseconds, so 100 milliseconds feels like an instant. We measured an average response time of 19 milliseconds, which greatly improves the user experience.

In the fourth experiment, we tested the accuracy of the system for object recognition. Averaged over the tested ranges, the object recognition accuracy is 73% within a five-meter range. There are two possible reasons for this. The first is lighting: insufficient indoor illumination lowers the object recognition rate. Although Lidar is not affected by illumination, image recognition based on deep learning is. Ou et al. (2020) proposed that this problem can be solved by using depth maps. The second is the training data. For example, if the training images were taken within two meters of the object, recognition is more accurate at that range. In our tests, the highest precision was concentrated within two meters, and the lower precision occurred between two and five meters. We can increase the accuracy of object recognition by adding more extensive training data.
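As a worked illustration of equation (1), the sketch below computes the signed percentage deviation of an estimated distance Vo from the real distance VA. The text reports the result as a percent accuracy, so the sketch also derives an accuracy as 100 minus the absolute value of equation (1); that conversion is an assumption, not a formula stated in the paper. The function names are hypothetical and the sample measurements are invented for illustration.

```python
from typing import Iterable, Tuple


def percent_deviation(estimated_m: float, actual_m: float) -> float:
    """Equation (1): (Vo - VA) / VA * 100, with Vo estimated and VA real."""
    if actual_m <= 0:
        raise ValueError("the real distance must be positive")
    return (estimated_m - actual_m) / actual_m * 100.0


def percent_accuracy(estimated_m: float, actual_m: float) -> float:
    """Assumed conversion: accuracy = 100 - |equation (1)|."""
    return 100.0 - abs(percent_deviation(estimated_m, actual_m))


def mean_accuracy(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average accuracy over (estimated, real) distance pairs."""
    values = [percent_accuracy(e, a) for e, a in pairs]
    if not values:
        raise ValueError("at least one measurement is required")
    return sum(values) / len(values)


# Hypothetical measurements in meters (estimated, real); not data from the paper.
samples = [(0.96, 1.00), (2.05, 2.00), (3.98, 4.00)]
average = mean_accuracy(samples)
```

Taking the absolute value treats overestimates and underestimates the same way, so an estimate that is 4% too short and one that is 4% too long both count as 96% accurate.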

Conclusion and Future Work

In this paper, we proposed a visual obstacle recognition framework based on object detection and augmented reality (AR) for obstacle recognition and object retrieval. The framework is extensible and general: its modules are flexible, and a system built on it can be applied to multiple platforms. The system requires only one mobile phone, with no additional detection equipment and no network connection, which makes it more portable than systems that depend on extra hardware. It also provides a virtual assistant that interacts with the user in real time, which greatly simplifies operation. In future work, we will improve the user interaction design of the virtual assistant and provide richer real-time feedback and more intelligent operation to help users quickly assess the surrounding environment and reduce the risk of accidents.


I want to thank Dr. Seals (my advisor) for her support and valuable suggestions for this thesis. Many thanks also to the Department of Computer Science & Software Engineering at Auburn University for its computing power support.


  1. Huang, J., Kinateder, M., Dunn, M.J., Jarosz, W., Yang, X., & Cooper, E.A. (2019). An augmented reality sign-reading assistant for users with reduced vision. PLoS ONE, 14
  2. H. Chen, W. Chiu, J. Yu, H. Chen and W. Wang, "The obstacles detection for outdoor robot based on computer vision in deep learning," 2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin), Berlin, Germany, 2019, pp. 184-188, doi: 10.1109/ICCEBerlin47944.2019.8966217.
  3. Ou, S., Park, H., & Lee, J. (2020). Implementation of an obstacle recognition system for the blind. Applied Sciences, 10(1), 282.
  4. Lin, B. S., Lee, C. C., & Chiang, P. Y. (2017). Simple smartphone-based guiding system for visually impaired people. Sensors, 17(6), 1371.
  5. Zhu, J., & Fang, Y. (2019). Learning Object-Specific Distance From a Monocular Image. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3839-3848).
  6. Hua, M., Nan, Y., & Lian, S. (2019). Small Obstacle Avoidance Based on RGB-D Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops (pp. 0-0).
  7. Lin, Y., Wang, K., Yi, W., & Lian, S. (2019). Deep Learning based Wearable Assistive System for Visually Impaired People. In Proceedings of the IEEE International Conference on Computer Vision Workshops (pp. 0-0).
  8. A. K. Srinivasan, S. Sridharan and R. Sridhar, "Object Localization and Navigation Assistant for the Visually challenged," 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 2020, pp. 324-328, doi: 10.1109/ICCMC48092.2020.ICCMC-00061.
  9. Jason, Jiang, Z., Zhao, Q., & Tomioka, Y. (2019, July). Depth Image-Based Obstacle Avoidance for an In-Door Patrol Robot. In 2019 International Conference on Machine Learning and Cybernetics (ICMLC) (pp. 1-6). IEEE.
  10. Kang, H., Lee, G., & Han, J. (2019, November). Obstacle Detection and Alert System for Smartphone AR Users. In 25th ACM Symposium on Virtual Reality Software and Technology (pp. 1-11).
  11. Troncoso Aldas, N. D., Lee, S., Lee, C., Rosson, M. B., Carroll, J. M., & Narayanan, V. (2020, October). AIGuide: An Augmented Reality Hand Guidance Application for People with Visual Impairments. In The 22nd International ACM SIGACCESS Conference on Computers and Accessibility (pp. 1-13).
  12. ARAnchor. Retrieved June 15, 2021, from

About the Authors

Tianshi Xie is a Ph.D. student in the Department of Computer Science & Software Engineering at Auburn University, where he works with Dr. Cheryl Seals. His research focuses on human-computer interaction, accessibility, computer vision, and machine learning. He hopes to use these technologies to help visually impaired people reduce environmental risks in real life and improve their quality of life. He believes that integrating these technologies into portable mobile devices will make them more convenient for visually impaired people to use.

Dr. Cheryl Seals is a professor in Auburn University's Department of Computer Science and Software Engineering. Her research areas of expertise are human-computer interaction, user interface design, usability evaluation, and educational gaming technologies. Seals also works with outreach initiatives to improve computer science education at all levels. These programs focus on increasing the computing pipeline by getting students interested in STEM disciplines and future technology careers.