Holoportation: Virtual 3D Teleportation in Real-time
This week, we're tackling something that feels almost like science fiction. Published in 2016, this paper covers the essential innovations we need to get closer to a Jedi council holo meeting :). But in all seriousness, true telepresence has the potential to save travel time, cut the enormous cost of business travel, reduce carbon emissions, and keep relationships close despite physical distance.
Holoportation is a new end-to-end technology that captures, transmits, and renders 3D models of any person or object in high quality and in real time. Most significantly, this paper describes the optimized pipeline that enables high-quality, real-time transmission and reconstruction, as well as an active stereo depth camera design that enables high-quality 3D capture.
This Week’s Paper
Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L. Davidson, Sameh Khamis, Mingsong Dou, Vladimir Tankovich, Charles Loop, Qin Cai, Philip A. Chou, Sarah Mennicken, Julien Valentin, Vivek Pradeep, Shenlong Wang, Sing Bing Kang, Pushmeet Kohli, Yuliya Lutchyn, Cem Keskin, and Shahram Izadi. 2016. Holoportation: Virtual 3D Teleportation in Real-time. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST '16). Association for Computing Machinery, New York, NY, USA, 741–754.
DOI Link: 10.1145/2984511.2984517
Summary
System Overview
The authors posit that spatial audio, high-quality full-body 3D reconstruction, and shared virtual space are the key components to creating a sense of co-presence. Thus, the Holoportation system was designed to support these requirements.
I will first summarize the entire Holoportation pipeline and then dive into specific stages. Here is the summary:
8 custom capture pods surround the capture space. A user wearing a headset inside this space sees the live 3D stream from the other site. The system aligns the two sites so that the local and remote users share a virtual space.
For each frame:
Any person or object that enters the space has its depth estimated and RGB images captured. Audio, along with the headset's pose, is also captured.
Any person/object in the scene is segmented from the background.
The depth data from each pod are fused to derive a resultant 3D volumetric mesh.
The mesh, color, and audio streams are then transmitted over the network.
The receiving PC renders color onto the mesh.
The textured mesh is then remotely rendered and streamed to the untethered headset over WiFi. Audio is played back relative to the transmitted pose to simulate spatial audio.
Capture Pods and 3D Depth Reconstruction
Each capture pod comprises two near-infrared (NIR) cameras, a color camera, and an IR projector (a diffractive optical element (DOE) paired with a laser). To keep them stable, the pods are mounted on an optical bench (a platform that dampens vibration).
The IR projector casts a random IR dot pattern onto the scene. This pattern adds texture so that the depth of textureless surfaces, like a white wall, can still be estimated. After weighing multiple depth-estimation methods, the authors chose active stereo, with a matching algorithm based on PatchMatch stereo. Its runtime does not grow with the number of candidate depth values, while still producing high-quality depth maps.
Another key advantage of this method is that it does not need to know the projected pattern. Overlapping projectors therefore do not interfere with each pod's depth estimation; rather, they add more texture to the scene, improving depth-estimation quality.
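To make the PatchMatch idea concrete, here is a minimal, single-threaded sketch of PatchMatch-style stereo on rectified grayscale images, using a simple sum-of-absolute-differences patch cost: disparities are initialized randomly, good guesses are propagated from neighbors, and each estimate is refined by random search. Because only a handful of candidates are evaluated per pixel, the runtime is largely independent of the disparity range, which is the property the paper exploits. This is only a toy illustration of the principle; the paper's version runs on the GPU with a more sophisticated matching cost, and the patch size, disparity range, and iteration count below are arbitrary.

```python
# Toy PatchMatch-style stereo: random init + neighbour propagation + random
# refinement with a SAD patch cost. Suitable only for small test images.
import numpy as np

def patch_cost(left, right, y, x, d, half=3):
    """SAD between a patch centred at (y, x) in the left image and the patch
    shifted by disparity d in the right image; inf if out of bounds."""
    h, w = left.shape
    xr = x - int(round(d))
    if (y - half < 0 or y + half >= h or x - half < 0 or x + half >= w
            or xr - half < 0 or xr + half >= w):
        return np.inf
    pl = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.float32)
    pr = right[y - half:y + half + 1, xr - half:xr + half + 1].astype(np.float32)
    return float(np.abs(pl - pr).sum())

def patchmatch_stereo(left, right, max_disp=64, iters=3, seed=0):
    rng = np.random.default_rng(seed)
    h, w = left.shape
    disp = rng.uniform(0, max_disp, size=(h, w))          # random initialization
    cost = np.array([[patch_cost(left, right, y, x, disp[y, x])
                      for x in range(w)] for y in range(h)])

    for it in range(iters):
        # Alternate scan direction so good guesses spread across the image.
        forward = (it % 2 == 0)
        ys = range(h) if forward else range(h - 1, -1, -1)
        xs = range(w) if forward else range(w - 1, -1, -1)
        dy = dx = -1 if forward else 1
        for y in ys:
            for x in xs:
                # Propagation: adopt a neighbour's disparity if it fits better.
                for ny, nx in ((y + dy, x), (y, x + dx)):
                    if 0 <= ny < h and 0 <= nx < w:
                        c = patch_cost(left, right, y, x, disp[ny, nx])
                        if c < cost[y, x]:
                            disp[y, x], cost[y, x] = disp[ny, nx], c
                # Random refinement in a shrinking window around the estimate.
                radius = max_disp / 2.0
                while radius >= 1.0:
                    cand = disp[y, x] + rng.uniform(-radius, radius)
                    if 0 <= cand < max_disp:
                        c = patch_cost(left, right, y, x, cand)
                        if c < cost[y, x]:
                            disp[y, x], cost[y, x] = cand, c
                    radius /= 2.0
    return disp
```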
Foreground Segmentation
Separating the contents of a scene from its background reduces the amount of data that needs to be transmitted over the network and helps achieve temporally consistent 3D reconstructions. Each pixel is labeled either foreground or background based on an energy function that considers its hue and saturation values, a function of its depth value, and RGB Gaussian edge potentials between it and all other pixels. The HSV color space was used because it is robust to lighting changes.
The labeling problem is modeled as a fully connected Conditional Random Field (CRF), and the authors use an efficient mean-field inference algorithm from prior work to keep the computation tractable.
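Below is a heavily simplified sketch of what such an energy and inference might look like: a per-pixel unary cost built from an HSV background model plus a depth prior, smoothed by a few mean-field-style updates over a local 4-neighborhood Potts model. The fully connected CRF with Gaussian edge potentials used in the paper couples every pixel pair and needs the dedicated efficient inference mentioned above; the weights, sigmas, and depth range here are made-up values, and the local pairwise term is only a stand-in.

```python
# Simplified foreground/background labelling: unary costs from an HSV
# background model and a depth prior, plus a toy local mean-field smoothing.
import numpy as np

def unary_costs(hsv, depth, bg_hs, sigma=0.1, near=0.5, far=3.0):
    """hsv: HxWx3 in [0, 1]; depth: HxW in metres; bg_hs: HxWx2 background
    hue/saturation image. All weights and thresholds are illustrative."""
    hs_diff = np.linalg.norm(hsv[..., :2] - bg_hs, axis=-1)
    looks_like_bg = np.exp(-(hs_diff ** 2) / (2 * sigma ** 2))     # in (0, 1]
    in_volume = ((depth > near) & (depth < far)).astype(float)
    # Foreground is costly where the colour matches the background model or the
    # depth falls outside the working volume; background is the mirror image.
    unary_fg = looks_like_bg + (1.0 - in_volume)
    unary_bg = (1.0 - looks_like_bg) + in_volume
    return unary_fg, unary_bg

def mean_field_segment(unary_fg, unary_bg, iters=5, lam=2.0):
    """A few mean-field-style updates with a 4-neighbour Potts pairwise term
    (a local stand-in for the paper's fully connected Gaussian CRF)."""
    q_fg = np.full(unary_fg.shape, 0.5)
    for _ in range(iters):
        # Average neighbour belief in "foreground" (np.roll wraps at borders,
        # which is fine for a toy example).
        nb = sum(np.roll(q_fg, (dy, dx), axis=(0, 1))
                 for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))) / 4.0
        e_fg = unary_fg + lam * (1.0 - nb)     # cost of disagreeing with neighbours
        e_bg = unary_bg + lam * nb
        q_fg = np.exp(-e_fg) / (np.exp(-e_fg) + np.exp(-e_bg))
    return q_fg > 0.5                          # final hard labels
```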
Temporally Consistent 3D Reconstruction
A common strategy for 3D reconstruction is to fuse the data from all cameras into a mesh for each frame independently. However, this does not account for differences between the meshes of consecutive frames caused by noise or holes in the depth maps, so the resulting stream flickers. Temporal consistency means minimizing this flicker. To achieve it, the authors employ Fusion4D, a state-of-the-art method that estimates a deformation model between frames and fuses the data of tracked parts according to that model, denoising the surface over time.
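As a rough intuition for the "fuse over time to denoise" part, here is a toy truncated signed distance function (TSDF) volume that blends each frame's observations into a weighted running average; surfaces extracted from the averaged volume flicker far less than per-frame meshes. Fusion4D's key contribution, the non-rigid deformation field that keeps moving surfaces aligned before fusing, is deliberately omitted here, and all parameters are illustrative.

```python
# Toy weighted TSDF fusion: each frame's (already voxelised) signed distances
# are blended into a running average, which suppresses frame-to-frame noise.
import numpy as np

class TSDFVolume:
    def __init__(self, shape, trunc=0.05, max_weight=64.0):
        self.tsdf = np.zeros(shape, dtype=np.float32)    # truncated signed distance
        self.weight = np.zeros(shape, dtype=np.float32)  # accumulated confidence
        self.trunc = trunc
        self.max_weight = max_weight                     # cap so old data can still be replaced

    def integrate(self, frame_sdf, frame_weight):
        """frame_sdf: per-voxel signed distance for this frame (NaN where the
        voxel was not observed); frame_weight: per-voxel confidence (> 0 where
        observed)."""
        sdf = np.clip(frame_sdf, -self.trunc, self.trunc)
        observed = ~np.isnan(frame_sdf)
        w_old = self.weight[observed]
        w_new = frame_weight[observed]
        self.tsdf[observed] = (w_old * self.tsdf[observed] + w_new * sdf[observed]) / (w_old + w_new)
        self.weight[observed] = np.minimum(w_old + w_new, self.max_weight)
```

A mesh would then be extracted from the fused volume each frame (e.g., with marching cubes).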
Color is also added through projective texturing, where each point's color is computed by blending the RGB camera images. The weight of each camera's contribution depends on the surface normal and that camera's viewing direction; the weighting favors the frontal view by a constant factor and is non-zero only when the point is visible from at least one view.
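A minimal sketch of what such view-dependent blending could look like for a single surface point is below: each camera's color is weighted by how head-on it sees the surface, boosted for the view that best matches the current rendering direction, and zeroed when the point is occluded in that camera. The exponent and boost factor are invented for illustration, not the paper's values.

```python
# Illustrative view-dependent texture blending for one surface point.
import numpy as np

def blend_point_color(point, normal, cam_centers, cam_colors, visible,
                      render_dir, alpha=2.0, frontal_boost=4.0):
    """point, normal: 3-vectors; cam_centers: Nx3 camera positions;
    cam_colors: Nx3 RGB samples of the point in each camera; visible: length-N
    bool mask (True if unoccluded); render_dir: unit vector from the point
    towards the current rendering viewpoint."""
    cam_colors = np.asarray(cam_colors, dtype=float)
    normal = normal / np.linalg.norm(normal)
    weights = np.zeros(len(cam_centers))
    for i, (center, vis) in enumerate(zip(cam_centers, visible)):
        if not vis:
            continue                                      # occluded views contribute nothing
        view_dir = center - point
        view_dir = view_dir / np.linalg.norm(view_dir)
        w = max(0.0, float(normal @ view_dir)) ** alpha   # favour head-on views
        # Extra weight for the camera whose direction best matches the viewpoint
        # the scene is currently rendered from.
        w *= 1.0 + frontal_boost * max(0.0, float(view_dir @ render_dir))
        weights[i] = w
    if weights.sum() == 0.0:
        return None                                       # point not visible in any view
    return (weights[:, None] * cam_colors).sum(axis=0) / weights.sum()
```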
Additionally, to minimize “ghosting” artifacts, where imperfect geometry can cause the wrong color to be projected to surfaces behind it, the authors implemented their own approach that assumes this surface error is bounded. This approach involves a combination of depth discontinuity search and a color voting system. Their approach was found to work well in most cases but is a compromise between quality and performance.
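As a simplified stand-in for the voting step, one could reject candidate views whose color disagrees with the consensus of all views that see the point, on the assumption that misprojected (ghosted) colors show up as outliers. The authors' actual approach also searches for nearby depth discontinuities; the threshold below is made up.

```python
# Toy colour voting: drop views whose colour is far from the per-channel median.
import numpy as np

def vote_colors(candidate_colors, max_dist=0.15):
    """candidate_colors: Nx3 RGB in [0, 1], one row per view that sees the point.
    Returns a bool mask of views that survive the vote."""
    candidate_colors = np.asarray(candidate_colors, dtype=float)
    median = np.median(candidate_colors, axis=0)
    dist = np.linalg.norm(candidate_colors - median, axis=1)
    keep = dist < max_dist
    if not keep.any():
        keep[np.argmin(dist)] = True   # never reject every view
    return keep
```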
Spatial Audio
Audio samples and the head pose of the transmitting user are sent in an interleaved stream. On the receiving side, the transmitted head pose determines where the virtual audio source that plays the samples should be placed. The source is spatialized by filtering the audio signal with a head-related transfer function (HRTF). A cardioid radiation pattern is also applied to the source's amplitude to simulate a muffled sound when the speaking user faces away from the listener.
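The cardioid directivity itself is simple to write down: the gain is 0.5 * (1 + cos θ), where θ is the angle between the speaker's facing direction and the direction from the speaker to the listener. Here is a small sketch; the HRTF filtering is left to the audio engine, and the function and variable names are my own.

```python
# Cardioid gain applied to the remote speaker's audio based on head orientation.
import numpy as np

def cardioid_gain(speaker_pos, speaker_forward, listener_pos):
    """speaker_pos, listener_pos: 3-vectors; speaker_forward: unit vector of the
    transmitted head pose's facing direction. Returns a gain in [0, 1]."""
    to_listener = listener_pos - speaker_pos
    to_listener = to_listener / np.linalg.norm(to_listener)
    cos_theta = float(np.dot(speaker_forward, to_listener))
    return 0.5 * (1.0 + cos_theta)   # 1 facing the listener, 0 facing directly away

# Example: speaker at the origin facing +x, listener 2 m away along +z
# (90 degrees off-axis) -> gain 0.5.
gain = cardioid_gain(np.array([0.0, 0.0, 0.0]),
                     np.array([1.0, 0.0, 0.0]),
                     np.array([0.0, 0.0, 2.0]))
```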
Additional key optimizations employed
To keep bandwidth manageable without sacrificing rendering quality, the authors performed lightweight real-time compression. They deduplicated vertices in the mesh data, and applied LZ4 compression to the mesh index data and to the color imagery from all 8 color cameras.
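A sketch of these two steps might look like the following: shared vertices are stored once, with triangles indexing into the unique set, and the index and color buffers are then run through LZ4. This assumes the third-party `lz4` Python bindings purely for illustration; the paper's implementation is native code tuned for real-time rates.

```python
# Vertex deduplication followed by LZ4 compression of index and colour data.
import numpy as np
import lz4.frame   # third-party `lz4` bindings, used here only for illustration

def deduplicate_vertices(triangles):
    """triangles: (T, 3, 3) array of triangle corner positions. Returns the
    unique vertices and a (T, 3) index array referencing them."""
    flat = triangles.reshape(-1, 3)
    unique_vertices, inverse = np.unique(flat, axis=0, return_inverse=True)
    indices = inverse.reshape(-1, 3).astype(np.uint32)
    return unique_vertices, indices

def pack_frame(triangles, color_images):
    """Build the per-frame payload: raw vertex positions, LZ4-compressed
    indices, and LZ4-compressed colour images from each camera."""
    vertices, indices = deduplicate_vertices(triangles)
    return {
        "vertices": vertices.astype(np.float32).tobytes(),
        "indices": lz4.frame.compress(indices.tobytes()),
        "colors": [lz4.frame.compress(img.tobytes()) for img in color_images],
    }
```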
To keep audio running smoothly and unaffected by network jitter, the authors buffered the playback. The authors also ensured that the playback rate was slightly faster than the transmission rate to reduce the number of samples in the buffer at any given time. The audio communication system is separate from the visual communication system, and there is no need to synchronize the two since any difference is unnoticeable.
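A toy version of that playback strategy is sketched below: incoming packets are queued, and the consumer drains the queue slightly faster than the nominal sample rate so that the buffer depth (and therefore latency) keeps shrinking back toward a small cushion. The 0.5% speed-up, tick size, and cushion are illustrative numbers, not the paper's.

```python
# Toy audio jitter buffer whose consumer runs slightly faster than real time.
from collections import deque

class JitterBuffer:
    def __init__(self, sample_rate=48000, speedup=1.005, min_depth=480):
        self.samples = deque()
        # ~10 ms of audio per playback tick, drained 0.5% faster than nominal.
        self.per_tick = int(sample_rate / 100 * speedup)
        self.min_depth = min_depth            # small cushion against network jitter

    def push(self, packet_samples):
        """Called whenever an audio packet arrives from the network."""
        self.samples.extend(packet_samples)

    def pull(self):
        """Return the next chunk for playback, padding with silence on underrun."""
        if len(self.samples) < self.min_depth:
            return [0.0] * self.per_tick      # underrun: play silence, don't stall
        chunk = [self.samples.popleft()
                 for _ in range(min(self.per_tick, len(self.samples)))]
        return chunk + [0.0] * (self.per_tick - len(chunk))
```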
All in all, the average transmission bandwidth is 1-2 Gbps for a 30 fps capture stream, i.e., roughly 4-8 MB per frame. Compression takes less than 10 ms.
Results
The authors ran a two-part study with 5 pairs of participants (10 in total). The first part was a social interaction task in which each participant told two truths and a lie; deception and its detection push participants to attend to both verbal and non-verbal communication, making it a good stress test of Holoportation's 3D reconstruction quality. The second part was a physical interaction task in which participants, one in AR and one in VR, collaborated to arrange six 3D objects into a given configuration. Each participant had three of the objects physically present and saw the other three virtually; only one participant had access to the target configuration.
Researchers observed and recorded participant behavior and conducted a semi-structured interview afterward. Here are some of the result highlights:
Participants felt a strong sense of interpersonal space awareness. The authors noted that participants adopted non-verbal cues common in real-life conversations. Participants also reported a sense of presence.
Participants had to adapt to use more gestures instead of simply giving verbal instructions and pointing.
A shared spatial frame of reference is an advantage compared to other forms of telecommunication.
Interestingly, AR made users feel like their partner was entering their space while VR made users feel like they were in their partner’s space or another space entirely.
Participants also liked that they could move freely without constraints and explore objects or environments independently.
7 of the 10 participants felt that their conversational partner looked like a real person. Some noted that the reconstruction was imperfect at the edges of the capture volume.
Lack of eye contact was clearly noted as a limitation. To combat this, the authors built a prototype for synthetic headset removal.
Related Work
While there has been a huge amount of work on immersive 3D telepresence systems, much of it focuses on one specific aspect or introduces constraints. Here are a few works I find worth noting:
3D capture
A taxonomy and evaluation of the many proposed algorithms for stereo depth estimation. (A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms)
Offline noise-free and complete 3D capture of a room-sized scene using multiple Kinects. (Temporally enhanced 3D capture of room-sized dynamic scenes with commodity depth cameras)
3D telepresence systems
A 3D real-time telepresence system using Kinect cameras to record and a projector to project the 3D capture of the remote user to the local user’s space. (Room2Room: Enabling Life-Size Telepresence in a Projected Augmented Reality Environment)
A system that is most similar to Holoportation. However, it lacked the high-quality display and depth cameras that Holoportation has. (General-purpose telepresence with head-worn optical see-through displays and projector-based lighting)
Others
2014 overview of immersive 3D telepresence work from a joint effort between NTU, ETH Zurich, and UNC. (Immersive 3D Telepresence)
Holographic display system for quasi-real-time 3D telepresence. (Holographic three-dimensional telepresence using large-area photorefractive polymer)
Technical Implementation
The system can display graphics at a native refresh rate of 60 Hz on the HoloLens and 75 Hz on the Oculus Rift DK2. The authors describe in detail the hardware and implementation choices that make real-time capture, transmission, and reconstruction possible. Here are some highlights:
Each capture site contains 4 PCs (each with an Intel Core i7 3.4 GHz CPU and 2 NVIDIA Titan X GPUs), each handling 2 capture pods, to compute depth and foreground segmentation.
Each headset offloads rendering to another dedicated dual-GPU machine of the same specs, which fuses the depth maps and textures the mesh.
The rendering PC is connected to the headset via WiFi and continuously receives the user's 6DoF pose. The system predicts the pose at render time to determine the view to render. To accommodate pose mispredictions, it renders into a larger FoV (a rough sketch follows this list).
The authors parallelized the PatchMatch stereo implementation.
For the temporal reconstruction stage, the authors split the algorithm into 2 parts, with one GPU handling each. The first GPU estimates a low-resolution per-frame motion field, and the second refines the motion field and performs the volumetric data fusion. Because the first GPU does not need input from the second, the two can run in parallel.
Stalls between CPU and GPU are avoided by using pinned memory for DMA data transfers between the two, organized as ring buffers.
Spatial audio uses the HRTF audio processing object in the Windows 10 XAudio2 framework.
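Referring to the pose-prediction item above, a bare-bones version of the idea might look like this: extrapolate the head position a short latency interval ahead with a constant-velocity model and render into a frustum somewhat wider than the display's, so small mispredictions still fall inside the transmitted frame. The 40 ms horizon and 20% field-of-view margin are invented numbers, and orientation prediction (which the real system also needs) is omitted.

```python
# Constant-velocity pose extrapolation plus an enlarged rendering FoV margin.
import numpy as np

def predict_position(p_prev, p_curr, dt, latency=0.040):
    """Extrapolate the head position `latency` seconds ahead from the last two
    samples, taken `dt` seconds apart."""
    velocity = (p_curr - p_prev) / dt
    return p_curr + velocity * latency

def render_fov(display_fov_deg, margin=1.2):
    """Render a wider frustum than the display needs; at display time the
    headset shows the sub-region matching its actual pose."""
    return display_fov_deg * margin

# Example: head moving 0.5 m/s along x, predicted 40 ms ahead; a 90-degree
# display is rendered with a 108-degree frustum.
predicted = predict_position(np.array([0.0, 1.6, 0.0]),
                             np.array([0.005, 1.6, 0.0]), dt=0.01)
wide_fov = render_fov(90.0)
```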
Discussion
Why is this interesting?
I think this paper has a lot to offer. Although the idea of 3D telepresence is not novel, this paper innovates in both method and technical implementation to create an end-to-end system with minimal compromises, so users can truly feel a sense of co-presence. Not only did it push the boundaries of co-presence in telecommunication systems, it also opened ample space for future research, such as optimizing the computation of each pipeline stage and the overall system throughput, or exploring different use cases.
Limitations/Concerns
Of course, the biggest limitation is the amount of high-end hardware required to run the system; it is currently hard to reproduce, let alone scale. The 3D reconstruction is also still imperfect: it cannot yet support fine-grained interaction tasks that require reconstructing small geometry, such as a finger. As mentioned above, the lack of eye contact is another clear limitation that inhibits collaboration.
Opportunities
Regardless, there are still many opportunities to improve and extend Holoportation. Beyond improvements to the pipeline's technical implementation, Holoportation could be applied to a variety of use cases. For example, design exploration studies could examine how it supports doctor-caregiver interactions, family gatherings, and co-design meetings.
The authors also proposed using Holoportation to capture one’s body and insert it into the user’s VR world in real-time. They theorize that having the user see their own body in VR can increase the sense of presence in the space. This idea can be further explored in games and meetings in VR.