Videoconference and Embodied VR: Communication Patterns Across Task and Medium
This Week’s Paper
Abdullah, Ahsan, Jan Kolkmeier, Vivian Lo, and Michael Neff. 2021. “Videoconference and Embodied VR: Communication Patterns Across Task and Medium.” Proc. ACM Hum.-Comput. Interact. 5 (CSCW2), Article 453: 1–29.
DOI Link: 10.1145/3479597
Why this?
“Zoom fatigue” has been tossed around so frequently over the past two years that you may feel tired just hearing about it. But it is a genuine concern, and with the future of work tending toward more remote work and mobility, we need better ways to collaborate remotely.
Enter: XR. Remote collaboration is one of the active research fields of XR. The ability to convey hand gestures, spatial audio, and gaze is among the reasons why it’s poised to be a better alternative to videoconferencing. But how is VR better than Zoom for collaboration, and by how much? This paper aims to answer these questions by running a large-scale, 210-participant study.
Two of the researchers are also part of Facebook Reality Labs, which makes me wonder whether some of these findings made their way into Facebook’s recent Workrooms app.
Summary
Study Design
This paper analyzed performance, subjective measures, and behavior in videoconferencing and embodied VR for work meetings. The authors’ goal was to understand where the two mediums differ and to gauge embodied VR’s true potential for work meetings.
Mediums
All factors were kept as similar as possible across the two mediums so that the results would reflect only the technological differences. Many of the design decisions were also informed by best practices derived from related work, which I will cover later.
Videoconference
The software used was Zoom. All participants were displayed in same-sized windows with their self-view turned off (to mimic embodied VR). The video frame was wide enough so that the hands were visible. Participants could also point using a mouse.
Embodied VR
Participants wore Oculus Rift headsets and conducted the meeting in a VR office meeting room. Their face, finger, and body movements were tracked and reflected by their virtual avatars. The researchers built 36 avatars for this study, covering all 6 racial groups. Although there were male and female avatars, the paper did not mention whether they accounted for the full range of gender identities.
Tasks
Each meeting was conducted as a triad (a group of 3 participants). Each participant was assigned 1 medium but had to complete all 4 tasks, each representing a different form of group interaction. The four task types were:
Estimation - an intellectual task where participants had to solve problems with correct answers
Bribery - a decision-making task where participants had to reach a subjective consensus
Party planning - a mixed-motive negotiation task, where participants had different motives and needed to reach a consensus
Floor planning - another mixed-motive negotiation task but with a shared visual artifact. The addition of a shared visual artifact can induce additional gestures for nonverbal interaction.
Metrics
The researchers evaluated the study using 3 overarching measures: performance, qualitative metrics, and behavioral patterns. The performance analysis was not detailed in this paper.
Qualitative metrics
Participants had to complete surveys that covered: satisfaction with medium, mutual understanding, co-presence (the sense of being together in a virtual space), and clear communication of affect (emotion).
Behavioral Patterns
The researchers measured gaze, conversational turns, and gestures to analyze participant behavior. Each session was recorded, and multiple annotators labeled the gestures and conversational turns in the videos. The annotation process included training, trials, spot checks, and vetting to ensure quality.
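To make the turn-taking measures more concrete, here is a minimal sketch of how overlap and backchannel counts could be derived from annotated speaking intervals. It is not the authors' pipeline: the `Turn` record, its fields, and the sample session are all hypothetical, and the paper's definitions of overlaps and failed interruptions are more nuanced than a simple interval intersection.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Turn:
    """One annotated speaking interval (hypothetical format, times in seconds)."""
    speaker: str
    start: float
    end: float
    is_backchannel: bool = False


def overlap_seconds(a: Turn, b: Turn) -> float:
    """Length of time during which both intervals are active."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def summarize(turns: list[Turn]) -> dict[str, float]:
    """Aggregate simple turn-taking statistics for one session."""
    full_turns = [t for t in turns if not t.is_backchannel]
    # Only count overlap between full turns of different speakers.
    overlap = sum(
        overlap_seconds(a, b)
        for a, b in combinations(full_turns, 2)
        if a.speaker != b.speaker
    )
    speaking = sum(t.end - t.start for t in full_turns)
    return {
        "turns": len(full_turns),
        "backchannels": sum(t.is_backchannel for t in turns),
        "mean_turn_s": speaking / len(full_turns) if full_turns else 0.0,
        "overlap_s": overlap,
    }


# Made-up session for illustration only.
session = [
    Turn("A", 0.0, 4.0),
    Turn("B", 3.5, 6.0),        # starts before A finishes
    Turn("C", 4.5, 5.0, True),  # a short "mm-hm" while B talks
    Turn("A", 6.0, 9.0),
]
stats = summarize(session)
```

Comparing such per-session summaries across the two mediums and four tasks is the kind of analysis the behavioral results below rest on, although the actual study relied on human annotators rather than automatic interval arithmetic.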
Results and their interpretations
Interestingly, the researchers did not find significant differences between the two mediums in performance or subjective metrics. All subjective metrics scored high in both mediums. The differences are evident, however, in the behavioral patterns.
Because behavior results can only be observations, I will cover the researchers’ interpretations followed by the results that support them. The overarching finding is that embodied VR is more similar to face-to-face interaction than videoconferencing is. The paper breaks down possible explanations for each finding, but I’ll detail only the conclusions for brevity.
Communication in embodied VR is less formal than in videoconferencing
There were more conversational overlaps and failed interruptions in VR than in videoconferencing, showing that participants adhered less to social turn-taking rules in VR.
There were also fewer backchannels in VR (i.e., verbal and non-verbal feedback provided by listeners to show that they’re listening), implying less need to affirm others.
People may feel a greater need to actively maintain the social connection in videoconferencing over embodied VR.
Participants gazed at each other more in videoconferencing than in embodied VR. The size of this increase (56%) matches that reported in another paper, which compared videoconferencing to face-to-face interaction.
Despite the wide framing of each participant, gaze was still primarily directed toward the eyes in videoconferencing, as opposed to in VR, where gaze was more distributed between the head, torso, and hands.
There were more backchannels in videoconferencing.
Participants displayed more self-adaptors (self-comforting gestures) in videoconferencing; such gestures have been found to signal anxiety. This may imply lower comfort, brought on by the increased effort needed to maintain the connection.
Nonverbal behavior may be more effective in embodied VR.
Gestures were more frequent, and participants were more likely to replace a word with a reference in VR.
In both mediums, participants seemed to feel the need to show that they were listening attentively in tasks that required a lot of coordination and subjective discussion to reach a consensus.
Those tasks involved the longest speaking turns, the longest and most frequent backchannels, and the most person-directed gaze.
Related Work
The selected related work below can be split into those that explain the researchers’ design decisions and those that support the interpretations derived from the results.
Design decisions
Social presence and behavioral patterns were similar between motion-tracked avatars in VR and face-to-face communication, but presence decreased and communication patterns changed when the avatar was no longer present (Communication Behavior in Embodied Virtual Reality)
McGrath’s circumplex: a taxonomy for tasks that represent different forms of group interaction (Groups: Interaction and Performance)
Upper-body video framing, as opposed to head-only, induced more empathy (More than face-to-face: empathy effects of video framing)
Work supporting the interpretation of results
Conversational turn-taking in videoconferencing was found to be more formal than in face-to-face interaction (i.e., it adhered more closely to conventional turn-taking rules) (Conversations Over Video Conferences: An Evaluation of the Spoken Aspects of Video-Mediated Communication)
Another paper found that interruptions happened twice as often in face-to-face interaction as in videoconferencing (Face-to-face and video-mediated communication: A comparison of dialogue structure and task performance)
Participants using video gazed at each other much more than those who were interacting face-to-face (Comparison of face-to-face and video-mediated interaction)
Technical Implementation
Used Tobii Pro Nanos for eye tracking.
Used UseTogether, software that allows each participant to move their own mouse cursor on a shared artifact.
A single Kinect sensor was used to track the body. Then, a custom motion solver estimated the skeleton pose.
Finger pose was calculated using the 4 cameras on the headset. The custom solver for this drew on MEgATrack (MEgATrack: monochrome egocentric articulated hand-tracking for virtual reality).
Cameras were placed inside and outside the headset to track the participant’s eyes and mouth and estimate facial deformations. These were also used to track gaze.
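The gaze results above (gaze spread across head, torso, and hands in VR) depend on mapping a tracked gaze direction to the body part being looked at. The summary does not describe how the authors did this, so the following is only an illustrative sketch under assumed inputs: a gaze origin and direction, a single 3D point per body part, and a hypothetical 10-degree tolerance.

```python
import math

Vec3 = tuple[float, float, float]


def angle_between_deg(u: Vec3, v: Vec3) -> float:
    """Angle between two non-zero 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0.0:
        raise ValueError("vectors must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def classify_gaze(eye, direction, targets, max_angle_deg=10.0):
    """Return the label of the target closest to the gaze ray, or None.

    `targets` maps a label to a 3D point; `max_angle_deg` is an assumed
    tolerance, not a value taken from the paper.
    """
    best_label = None
    best_angle = max_angle_deg
    for label, position in targets.items():
        to_target = tuple(p - e for p, e in zip(position, eye))
        angle = angle_between_deg(direction, to_target)
        if angle <= best_angle:
            best_label, best_angle = label, angle
    return best_label


# Made-up positions (meters) of another participant's body parts.
partner = {
    "head": (0.0, 0.0, 2.0),
    "torso": (0.0, -0.5, 2.0),
    "hands": (0.3, -0.7, 1.5),
}
eye = (0.0, 0.0, 0.0)
looking_at_head = classify_gaze(eye, (0.0, 0.0, 1.0), partner)
looking_at_torso = classify_gaze(eye, (0.0, -0.25, 1.0), partner)
looking_away = classify_gaze(eye, (1.0, 0.0, 0.0), partner)
```

Tallying these labels per frame would give a gaze distribution over body parts. A real system would more likely test the gaze ray against body-part volumes rather than single points.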
Discussion
Why is this interesting?
Until this paper, the benefits of embodied VR over videoconferencing were assumed. This large-scale study provides evidence of how the two mediums actually differ. Future research can build on this paper’s findings to justify design choices and interpret results. As we’ve seen with the creation of Workrooms, the industry also benefits: it now has evidence that embodied VR has the potential to host more natural remote meetings.
I also think the findings from this study are quite surprising. I had hypothesized that there would be significant differences in the subjective measures (like co-presence and communication of affect) between the two mediums. Instead, it seems that people regulate their behavior to compensate for what is missing. The differences between the two mediums are thus rather subtle: if behavior had not been analyzed, embodied VR would seem to provide no additional benefit. We have yet to understand the extent to which behavioral differences impact user comfort and preference.
Limitations/Concerns
All in all, I thought this was a comprehensive paper that considered alternative interpretations of its results and aimed to be as fair as possible throughout. It is important to note that we cannot conclude that embodied VR is better than videoconferencing. Other unmeasured factors, such as the level of fatigue a user feels in VR, could affect the overall user experience. Preference is also a fickle thing. Despite evidence suggesting that embodied VR feels more like face-to-face interaction, users may still prefer videoconferencing for various reasons (such as familiarity).
Another point to consider is that the avatar used in embodied VR provides a layer of anonymity. This may have affected the study results, which may therefore not transfer when avatar fidelity changes.
Opportunities
I think the relation between embodied VR and avatar fidelity can be further explored. For example, does an increase in avatar fidelity lead to behavioral changes between participants? Which type of avatar, cartoonish or more realistic, would participants feel more comfortable interacting with?
Longitudinal studies can be run to understand how embodied VR performs over time. For example, would there be an “embodied VR fatigue”? How does user preference change over time? Since this paper covered only seated meetings, future studies can also incorporate additional benefits that VR provides, such as walking around in virtual space and having side conversations.