One unit of length in the world coordinates represents 1 cm.
Due to the large data size, we can only release compressed videos. We use ffmpeg's H.264 encoder with CRF 10 for the VGA videos and CRF 20 for the HD videos.
We use ffmpeg to extract individual image files from the videos, using the scripts vgaImgsExtractor.sh and hdImgsExtractor.sh (these scripts are downloaded by getData.sh). See the GitHub page for more details.
If you don't use the provided extraction scripts, note that the first frame of each video should have index 0 to be compatible with our frame indexing rule. In ffmpeg, this can be done with the `-start_number 0` output option.
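As an illustration, the following sketch builds an ffmpeg command line with zero-based frame numbering. The file names and output pattern are placeholders; only the `-start_number 0` option matters for the indexing rule.

```python
# Build an ffmpeg command that extracts frames with indices starting at 0,
# matching the dataset's frame-indexing rule. Paths and the output pattern
# are illustrative placeholders.
def build_extract_cmd(video_path, out_pattern="frames/%08d.jpg"):
    return [
        "ffmpeg",
        "-i", video_path,        # input video
        "-q:v", "1",             # high JPEG quality
        "-start_number", "0",    # first extracted frame gets index 0
        out_pattern,             # e.g. frames/00000000.jpg
    ]

# Executing the command requires ffmpeg on PATH, e.g.:
# import subprocess
# subprocess.run(build_extract_cmd("hd_00_00.mp4"), check=True)
cmd = build_extract_cmd("hd_00_00.mp4")
```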
Videos from the same type of sensors (i.e., all HD or all VGA) are already synchronized by hardware clocks, meaning that images with the same frame index are taken at the same moment.
However, the frame rates of the different sensor types differ: VGAs capture at 25 Hz and HDs at 29.97 Hz, so their frame indices are independent. We provide additional synchronization tables with the precise time alignment between them.
Finally, the Kinect image streams are neither synchronized with the other sensors nor captured at a constant frame rate. The synchronization tables must be used with this data.
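To illustrate why frame indices diverge across sensor types, the nominal frame rates can be used to map a VGA frame index to the nearest HD frame index. This is only a rough approximation assuming both streams start at time zero; the released synchronization tables, not this formula, should be used for accurate alignment.

```python
VGA_FPS = 25.0   # nominal VGA frame rate (Hz)
HD_FPS = 29.97   # nominal HD frame rate (Hz)

def vga_to_hd_frame(vga_idx):
    """Approximate HD frame index for a VGA frame index, assuming both
    streams start at time 0 and run exactly at their nominal rates."""
    t = vga_idx / VGA_FPS      # capture time of the VGA frame, in seconds
    return round(t * HD_FPS)   # nearest HD frame at that time

# One second of VGA video (frame 25) lands near HD frame 30.
```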
Skeleton Reconstruction Results
We reconstruct 3D motion of people using the method of [Joo et al. 2016] (under submission), which is an extension of [Joo et al. 2015].
The reconstruction results are generated using the 480 VGA camera views.
The outputs are saved as JSON files. Each file contains 3D skeletons at a single time instance. A skeleton is composed of 15 joints.
An array "bodies" holds the skeletons, with one element per reconstructed subject.
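A minimal sketch of reading such a file is below. The field names "id" and "joints15" (a flat list of 15 joints, each stored as x, y, z, confidence) are assumptions based on the 15-joint skeleton described above; verify them against a released JSON file.

```python
import json

# Field names "bodies", "id", and "joints15" are assumptions; check them
# against an actual released file before use.
def parse_bodies(text):
    """Return {body_id: [(x, y, z, confidence), ...]} from one JSON file."""
    data = json.loads(text)
    skeletons = {}
    for body in data["bodies"]:
        flat = body["joints15"]  # assumed flat list: 15 joints x 4 values
        skeletons[body["id"]] = [
            tuple(flat[i:i + 4]) for i in range(0, len(flat), 4)
        ]
    return skeletons

# Tiny synthetic example: one body with all-zero joint values.
sample = '{"bodies": [{"id": 0, "joints15": [%s]}]}' % ", ".join(["0"] * 60)
skeletons = parse_bodies(sample)
```

For a real capture, the same call would be applied to each per-frame JSON file in turn.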