GesturAR: An Authoring System for Creating Freehand Interactive Augmented Reality Applications
Part of the push towards exploring and creating more useful AR/VR applications involves designing authoring systems that are easy to use. With an easier interface, more people can take part in the prototyping process, increasing the number of applications created and making AR/VR design more accessible.
GesturAR is one of the systems created to simplify the AR authoring process. This paper marries three timely research topics: AR authoring systems, freehand interaction, and machine-learning-enhanced user experience.
This Week’s Paper
Wang, Tianyi, Xun Qian, Fengming He, Xiyun Hu, Yuanzhi Cao, and Karthik Ramani. 2021. “GesturAR: An Authoring System for Creating Freehand Interactive Augmented Reality Applications.” In The 34th Annual ACM Symposium on User Interface Software and Technology, 552–67. UIST ’21. New York, NY, USA: Association for Computing Machinery.
DOI Link: 10.1145/3472749.3474769
Summary
GesturAR is an authoring system that focuses on helping users easily prototype custom hand interactions with virtual objects through demonstration.
Users start by creating the virtual content to be interacted with. They then demonstrate the gesture used to interact with that content and define an action the content can perform (e.g., re-scaling or rotating). The gesture and action are then linked through visual programming (see (d) in the image above). Once the interactions are complete, GesturAR can enter play mode, where anyone can try out the application (see (e-1, e-2) in the image above).
Design
Gestures
Gestures are the focus of this paper. Drawing on prior work, the authors designed a system that supports the range of gestures a typical authoring system would require. In particular, their freehand interaction model can be broken down into four categories:
Static-provoking - Interactions where virtual content will respond after a static gesture is detected, e.g., pressing a button.
Manipulating - Interactions where virtual content continues to react to the static gesture, e.g., holding a mug.
Dynamic-provoking - Interactions where the virtual content responds after a dynamic gesture, e.g., waving hands.
Synchronous - Interactions where the virtual content responds as the dynamic gesture is being performed, e.g., pulling a door open.
These four categories were derived from a combination of two models for categorizing gestures. The first separates gestures into input and output, similar to prior interaction prototyping systems. The second categorizes gestures along six dimensions: four of the six are concerned with a gesture’s semantic meaning, while the other two define a gesture’s relation to time and when the content should respond (before, during, or after the gesture is performed).
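To make the two dimensions concrete, here is a minimal sketch of the resulting taxonomy as a 2x2 data structure. The code is my own illustration in Python; the names are not from the paper.

```python
from dataclasses import dataclass
from enum import Enum, auto


class GestureType(Enum):
    STATIC = auto()   # a held hand pose, e.g., a pinch
    DYNAMIC = auto()  # a pose that changes over time, e.g., a wave


class ResponseTiming(Enum):
    AFTER = auto()    # content reacts once the gesture has been recognized
    DURING = auto()   # content reacts continuously while the gesture is performed


@dataclass(frozen=True)
class InteractionCategory:
    name: str
    gesture: GestureType
    timing: ResponseTiming


# The four categories are the 2x2 combinations of the two dimensions.
CATEGORIES = [
    InteractionCategory("static-provoking", GestureType.STATIC, ResponseTiming.AFTER),
    InteractionCategory("manipulating", GestureType.STATIC, ResponseTiming.DURING),
    InteractionCategory("dynamic-provoking", GestureType.DYNAMIC, ResponseTiming.AFTER),
    InteractionCategory("synchronous", GestureType.DYNAMIC, ResponseTiming.DURING),
]
```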
The authors trained a neural network to detect 18 different hand poses drawn from prior studies. To demonstrate a static gesture, the user holds the pose for 2 seconds so the system obtains multiple samples of the same pose. If the user’s pose changes, the system categorizes the demonstration as a dynamic gesture and only stops recording once the pose has remained the same for 2 seconds; that final 2-second hold is then trimmed from the recorded dynamic gesture.
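A minimal sketch of how that static-vs-dynamic segmentation could work, assuming a stream of per-frame pose labels from the pose detector (this is my approximation of the logic described above, not the authors' code):

```python
HOLD_SECONDS = 2.0  # a pose held this long ends (or constitutes) a demonstration


def segment_demonstration(pose_labels, fps=60, hold_seconds=HOLD_SECONDS):
    """Classify a recorded demonstration as a static or dynamic gesture.

    pose_labels: per-frame pose IDs, e.g., the output of the pose classifier.
    Returns ("static", pose_id) or ("dynamic", frames), where the trailing
    2-second hold is trimmed from a dynamic recording.
    """
    hold_frames = int(hold_seconds * fps)

    # Only one pose throughout: the user held a single pose, so it is static.
    if len(set(pose_labels)) == 1 and len(pose_labels) >= hold_frames:
        return "static", pose_labels[0]

    # The pose changed at some point: dynamic. Recording stops once the final
    # pose has stayed the same for 2 seconds, and that hold is removed.
    if len(pose_labels) > hold_frames and len(set(pose_labels[-hold_frames:])) == 1:
        return "dynamic", pose_labels[:-hold_frames]

    return "dynamic", pose_labels
```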
Creating Virtual Content
There are four ways GesturAR allows users to create virtual content in situ:
Import a 3D model from a provided set of common objects.
Sketch in mid-air, where the user can change the color and width of the brush.
3D scan an object in “real-time” by using their fingertip to color over its surface. This is not actually implemented by GesturAR; it is only simulated.
Add pre-defined mechanical constraints to the virtual content (like a hinge joint that limits the range of motion for the model).
Visual Programming
The final piece connects the created gestures (i.e., triggers) to different actions. Actions can be defined either by manipulating a virtual content’s bounding box to re-scale and/or rotate the asset, or by selecting from a list of pre-defined actions (such as appear/disappear). A connection between a trigger and an action is made by drawing a line between the associated labels (see (d) in the image above).
Each gesture comes with a bounding box that the user can define; the gesture is only recognized if it is performed within that box. The bounding box can be anchored to world coordinates, follow the user, or be attached to the relevant object.
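As a rough mental model of this trigger-action mapping, here is a small Python sketch. The class and function names are hypothetical, not GesturAR's API; it only illustrates the idea of gestures firing linked actions when performed inside their bounding boxes.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Tuple


class AnchorMode(Enum):
    WORLD = auto()   # bounding box fixed in world coordinates
    USER = auto()    # bounding box follows the user
    OBJECT = auto()  # bounding box attached to a virtual object


@dataclass
class GestureTrigger:
    gesture_id: str
    bounds_center: Tuple[float, float, float]
    bounds_size: Tuple[float, float, float]
    anchor: AnchorMode

    def contains(self, hand_position: Tuple[float, float, float]) -> bool:
        # The gesture only counts if the hand is inside its bounding box.
        return all(
            abs(p - c) <= s / 2
            for p, c, s in zip(hand_position, self.bounds_center, self.bounds_size)
        )


@dataclass
class TriggerActionLink:
    trigger: GestureTrigger
    action: Callable[[], None]  # e.g., re-scale, rotate, appear/disappear


def dispatch(links, recognized_gesture_id, hand_position):
    """Fire every action whose trigger matches the recognized gesture."""
    for link in links:
        if (link.trigger.gesture_id == recognized_gesture_id
                and link.trigger.contains(hand_position)):
            link.action()
```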
Results
The authors conducted a user study with three sessions: the first evaluated the hand detection model’s accuracy, the second the feasibility of authoring with the system, and the third the overall system usability. The 12 participants all had a basic understanding of AR/VR concepts, and only one had never tried an AR/VR application before. Each session involved two participants. Here are some highlights from each session:
Hand Detection Model Accuracy
Participants were asked to demonstrate 8 static gestures and 7 dynamic gestures chosen by the authors.
To evaluate the system’s gesture recognition performance, each participant was asked to perform the target gesture with each hand and to perform 2 gestures they considered different from the target.
Each participant tested against their own and their partner’s data.
The system correctly differentiated between static gestures.
For static gestures, the detection model was accurate regardless of who authored the gesture.
Accuracy for dynamic gesture detection was slightly lower, with a larger drop for other-authored gestures.
Authoring Feasibility
The participants were asked to author different types of interactions for a set of pre-created models. The researchers then recorded whether participants could successfully use their self-authored interactions and their partner’s interactions on the first try.
94.44% success rate for both self-authored and other-authored interactions
The interview and survey showed that participants agreed that customized hand gestures are necessary, were receptive to the demonstration approach, and were content with the capabilities of the hand-object interactions in the system.
Participants particularly mentioned that dynamic and synchronous gesture support was helpful.
Overall System Usability
Participants were tasked with creating a table-top AR puzzle game from scratch. The authors provided the premise and components of the game, but the participants had to author the interactions themselves. The system was, overall, positively received, with a SUS score of 86/100. Participants found the trigger-action metaphor helpful for this kind of authoring and found the visual programming interface intuitive.
Related Work
GesturAR’s novelty is that it offers a real-time authoring process and allows end users to author customized freehand input. It heavily references past literature for most of its design decisions. Here are some of the interesting prior works mentioned:
A highly influential study that elicited a set of hand gestures from participants. This study defined the six dimensions of gestures and the gesture set used to train GesturAR’s neural network. (User-Defined Gestures for Augmented Reality)
A study that asks participants to demonstrate how they grasp objects. This then informs the object-gesture mapping in their design. (VirtualGrasp: Leveraging Experience of Interacting with Physical Objects to Facilitate Digital Object Retrieval)
A system that allows users to use hand gestures for authoring animations in VR. This paper did not allow users to author custom hand gestures, however. (MagicalHands: Mid-Air Hand Gestures for Animating in VR)
This book introduced the metaphor of programming by demonstration. (End-User Development: An Emerging Paradigm)
Technical Implementation
Since the system was built on the HoloLens 2, the researchers had to work around its technical limitations. Here I highlight some of the implementation decisions that let the system run smoothly on the device.
Hand gesture detection
Static hand gesture recognition uses one-shot learning, so the system can identify a hand gesture after only one demonstration.
A siamese neural network with fully connected layers is trained to distinguish between different gestures; with this network, the HoloLens 2 can process hand poses at 60 frames per second (a rough sketch of this matching idea is shown after this list).
The authors used the hand joint data from MRTK as input to the siamese network.
The authors also proposed a changing-state method for dynamic gesture recognition: the dynamic gesture is divided into frames, each containing hand information (e.g., joint positions), and these frames are summarized into a short list of states using the neural network to reduce the computational load on the HoloLens 2 (also sketched below).
To support synchronous interactions, the authors pre-defined how a numeric value increases or decreases with a type of gesture (e.g., rotating a hand clockwise increases the value).
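To illustrate the one-shot matching step, here is a minimal NumPy sketch. The embedding network and its weights are stand-ins for the authors' fully connected siamese network, and the joint features stand in for the MRTK hand joint data; this is not the actual implementation.

```python
import numpy as np


def embed(joint_features, weights):
    """Toy stand-in for the fully connected embedding branch of a siamese
    network: two dense layers with a ReLU, mapping flattened joint data to a
    fixed-size vector."""
    w1, b1, w2, b2 = weights
    hidden = np.maximum(0.0, joint_features @ w1 + b1)
    return hidden @ w2 + b2


def recognize(query_joints, templates, weights, threshold=1.0):
    """One-shot recognition: compare the query embedding against one stored
    template embedding per user-demonstrated gesture and return the closest
    match, or None if nothing is close enough."""
    query_emb = embed(query_joints, weights)
    best_name, best_dist = None, np.inf
    for name, template_joints in templates.items():
        dist = np.linalg.norm(query_emb - embed(template_joints, weights))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None
```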
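And a similarly rough sketch of the changing-state idea for dynamic gestures, plus a pre-defined synchronous mapping (again my own approximation; thresholds and names are made up):

```python
import numpy as np


def compress_to_states(frame_embeddings, min_dist=0.5):
    """Keep a frame's embedding only when it differs enough from the last kept
    state, collapsing a long recording into a short list of key states."""
    states = []
    for emb in frame_embeddings:
        if not states or np.linalg.norm(emb - states[-1]) > min_dist:
            states.append(emb)
    return states


def synchronous_value(hand_rotation_deg, base_value=0.0, gain=0.01):
    """Pre-defined synchronous rule: a numeric value tracks the ongoing
    gesture, e.g., clockwise hand rotation increases the value."""
    return base_value + gain * hand_rotation_deg
```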
Tools
The system is built using Unity3D.
The user interface uses Microsoft’s Mixed Reality Toolkit (MRTK), FinalIK, and several mesh effect libraries from the Unity asset store.
The neural network runs on the HoloLens 2 through Unity Barracuda.
Discussion
Why is this interesting?
Hands are incredibly flexible and dexterous. They have the potential to make interactions more intuitive and enriched with metaphors. Over the past couple of years, hand tracking has gradually improved and has even been introduced in some commercial AR/VR devices.
GesturAR is the first system that allows users to create AR applications with custom hand gestures. Thus, this paper covers several interesting design choices the researchers made to build a usable yet flexible system. It also tackles many key questions, such as how to categorize hand gestures, how to differentiate hand gestures in a user-friendly way, how to build a recognition model that is accurate yet computationally efficient, and what the key components of an in-situ authoring system are.
Limitations
Most importantly, the number of participants in their study is small. With only 12 participants, we cannot be sure the system would remain as usable and accurate with a larger participant pool.
Additionally, the authors did not explain the rationale behind several design decisions. For example, how did they decide which pre-defined actions GesturAR should support, and why was authoring an AR table-top puzzle game the chosen task for evaluating system usability?
Opportunities
GesturAR opens room for future research into better in-situ AR authoring systems. The authors mention an interesting direction: investigating how to author freehand interactions that can be easily used by others. As the results showed, hand detection accuracy was lower in other-authored scenarios, and hand gesture design itself can be ambiguous and personal. A deeper analysis in this direction would help create more usable freehand interactive AR applications.