CMU Panoptic Dataset

PanopticStudio Toolbox

Download the PanopticStudio Toolbox on GitHub (Matlab and Python usage examples included).
With the PanopticStudio Toolbox, you can:
- Download the data as compressed video files
- Extract images from downloaded videos
- Load camera calibration parameters
- Load 3D pose reconstruction results
- Project 3D pose to 2D camera views

Downloading the Data

You can easily download our dataset using the ./getData.sh script, which can be obtained from GitHub.
getData.sh works on Linux and Mac, or on Windows using Cygwin.
The basic usage is: ./getData.sh {sequence name} {number of VGA cameras} {number of HD cameras}
For example, you can download a small sample dataset (~5 MB), named sampleData, with a default of 10 VGA cameras and 1 HD camera, via
```
./getData.sh sampleData
```
Browse the sequences on this website, select the ones you are interested in, and download them by sequence name. For example,
```
./getData.sh 160422_ultimatum
```
You can also specify the number of cameras you want to download. For example, to download 240 VGA videos and 5 HD videos:
```
./getData.sh 160422_ultimatum 240 5
```
The script sorts the VGA camera order so that, for any requested number of views, the downloaded views are uniformly distributed.

Camera Naming Rule

Camera names are given by {sensorIdx}_{nodeIdx}.
The {sensorIdx} is represented as a two digit number, and can be one of the following:

00: for HD cameras
01-20: for VGA cameras, where the number denotes a VGA module (panel) index
50: for Kinect RGB cameras

The nodeIdx represents a camera index within each sensor type (or each module in VGAs).
Each VGA module has 24 cameras, so nodeIdx in each VGA module ranges from 1 to 24.
In summary,

HD (31 cameras): 00_00 ~ 00_30
VGA (24 cameras/module): 01_01 ~ 01_24 through 20_01 ~ 20_24
Kinect (10 cameras): 50_01 ~ 50_10

The HD nodeIdx is zero-based, while the VGA and Kinect nodeIdx values are one-based.
Note that the order of the camera indices has nothing to do with their locations. VGA module 1 and VGA module 2 may not be neighboring panels.
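The naming rule above can be sketched in a few lines of Python. This is only an illustration, not part of the toolbox: the helper names camera_name and all_camera_names are hypothetical, and the ranges are taken directly from the summary above.

```python
def camera_name(sensor_idx: int, node_idx: int) -> str:
    """Format a camera name as {sensorIdx}_{nodeIdx}, two digits each."""
    return f"{sensor_idx:02d}_{node_idx:02d}"


def all_camera_names() -> dict:
    """Enumerate every camera name implied by the naming rule (hypothetical helper)."""
    # HD: sensorIdx 00, nodeIdx is zero-based (00_00 .. 00_30)
    hd = [camera_name(0, n) for n in range(0, 31)]
    # VGA: sensorIdx is the panel index 01..20, nodeIdx is one-based (01..24)
    vga = [camera_name(p, n) for p in range(1, 21) for n in range(1, 25)]
    # Kinect: sensorIdx 50, nodeIdx is one-based (50_01 .. 50_10)
    kinect = [camera_name(50, n) for n in range(1, 11)]
    return {"hd": hd, "vga": vga, "kinect": kinect}


names = all_camera_names()
print(len(names["hd"]), len(names["vga"]), len(names["kinect"]))  # 31 480 10
```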

Calibration Data

Calibration parameters for all cameras (VGAs, HDs, and Kinects) are provided as a JSON file.

Each camera is an element in the "cameras" array, with the following information:

"cameras": [
	{
		"name": "01_01",
		"type": "vga",
		"resolution": [640,480],
		"resolution": [640,480],
		"panel": 1,
		"node": 1,
		"K": [
			[745.716,0,374.297],
			[0,746.048,226.517],
			[0,0,1]
		],
		"distCoef": [-0.318745,0.0454429,-0.000811973,0.000847189,0.0799718],
		"R": [
			[0.969466296,0.02846943647,-0.2435664017],
			[-0.04833552526,0.9959371934,-0.07597883721],
			[0.2404137638,0.08543183185,0.9669036272]
		],
		"t": [
			[-51.22735213],
			[142.8763812],
			[289.9330519]
		]
	},
	...

The camera names follow the naming rule described above.
K,R,t are the camera intrinsics, rotation matrix, and translation respectively.
If X is a 3x1 point in world coordinates, then the camera transform is x = K*(R*X + t), up to perspective division and lens distortion.
distCoef represents lens distortion parameters, [k1,k2,p1,p2,k3], as in the OpenCV calibration format.
One unit of length in world coordinates corresponds to 1 cm.
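To make the camera transform concrete, here is a minimal pure-Python sketch that parses the calibration entry shown above and projects a world point into the image. It assumes the usual OpenCV pinhole model with radial and tangential distortion, which is our reading of the distCoef description. The function name project_point and the inline CALIB_JSON string are illustrative only; the toolbox ships its own Matlab and Python examples.

```python
import json

# Calibration entry copied from the example above, wrapped as a complete JSON object.
CALIB_JSON = """
{"cameras": [{
  "name": "01_01", "type": "vga", "resolution": [640, 480], "panel": 1, "node": 1,
  "K": [[745.716, 0, 374.297], [0, 746.048, 226.517], [0, 0, 1]],
  "distCoef": [-0.318745, 0.0454429, -0.000811973, 0.000847189, 0.0799718],
  "R": [[0.969466296, 0.02846943647, -0.2435664017],
        [-0.04833552526, 0.9959371934, -0.07597883721],
        [0.2404137638, 0.08543183185, 0.9669036272]],
  "t": [[-51.22735213], [142.8763812], [289.9330519]]
}]}
"""

# Index cameras by name so they can be looked up with the naming rule.
cameras = {cam["name"]: cam for cam in json.loads(CALIB_JSON)["cameras"]}


def project_point(X, cam):
    """Project a 3D world point X (in cm) to pixel coordinates (u, v).

    Assumes the OpenCV distortion model with distCoef = [k1, k2, p1, p2, k3].
    """
    K, R, t = cam["K"], cam["R"], cam["t"]
    k1, k2, p1, p2, k3 = cam["distCoef"]
    # World -> camera coordinates: Xc = R*X + t
    xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i][0] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division to normalized image coordinates
    x, y = xc[0] / xc[2], xc[1] / xc[2]
    # Radial and tangential lens distortion
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    xd = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    yd = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    # Apply the intrinsics K
    u = K[0][0] * xd + K[0][1] * yd + K[0][2]
    v = K[1][1] * yd + K[1][2]
    return u, v


# Project the world origin into camera 01_01.
print(project_point([0.0, 0.0, 0.0], cameras["01_01"]))
```

A point on the camera's optical axis should land on the principal point (the last column of K), which is a quick sanity check for any implementation of this transform.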

Video Data

Due to the large data size, we can only release the compressed videos. We use ffmpeg's H.264 encoder with CRF 10 for VGAs and CRF 20 for HDs.
We use ffmpeg to extract individual image files from the videos, using the scripts vgaImgsExtractor.sh and hdImgsExtractor.sh (these scripts are downloaded by getData.sh). See the GitHub page for more details.
If you don't use the provided extraction scripts, note that the first frame of the video must have index 0 to be compatible with our frame indexing rule. In ffmpeg, this can be done as
```
ffmpeg -i "videoName.mp4" -f image2 -start_number 0 "frame_%05d.png"
```
Videos from the same type of sensors (i.e., all HD or all VGA) are already synchronized by hardware clocks, meaning that images with the same frame index are taken at the same moment.
However, different types of sensors have different frame rates. For example, VGAs capture at 25 Hz and HDs at 29.97 Hz, so their frame numbers are independent. We provide additional synchronization tables with the precise time alignment between them.
Finally, the Kinect image streams are not synchronized and do not have a constant frame rate. The synchronization tables must be used with this data.
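As a rough illustration of why VGA and HD frame indices cannot be compared directly, the sketch below converts a frame index to a nominal timestamp using only the frame rates quoted above. It assumes both streams start at time zero and run at exactly their nominal rates, which the real recordings do not guarantee, so it is not a substitute for the synchronization tables. The helper names are hypothetical, and frame_filename simply mirrors the ffmpeg pattern shown earlier.

```python
VGA_FPS = 25.0   # nominal VGA frame rate quoted above
HD_FPS = 29.97   # nominal HD frame rate quoted above


def frame_filename(frame_idx: int) -> str:
    """File name for a zero-based frame index, matching the pattern frame_%05d.png."""
    return f"frame_{frame_idx:05d}.png"


def nominal_time_sec(frame_idx: int, fps: float) -> float:
    """Nominal capture time of a frame, ASSUMING the stream starts at t = 0."""
    return frame_idx / fps


def nearest_frame(time_sec: float, fps: float) -> int:
    """Index of the frame nominally closest to the given time."""
    return round(time_sec * fps)


vga_idx = 100
t = nominal_time_sec(vga_idx, VGA_FPS)
hd_idx = nearest_frame(t, HD_FPS)
print(f"VGA {frame_filename(vga_idx)} is nominally at {t:.2f} s, "
      f"closest to HD {frame_filename(hd_idx)}")
```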

Skeleton Reconstruction Results

We reconstruct 3D motion of people using the method of [Joo et al. 2016] (under submission), which is an extension of [Joo et al. 2015].
The reconstruction results are generated by using the 480 VGA camera views.
The outputs are saved as JSON files. Each file contains 3D skeletons at a single time instance. A skeleton is composed of 15 joints.
The "bodies" array holds one element per skeleton, of the form

"bodies" : [
	{
		"id": 1,
		"joints15": [82.8466, -144.961, 23.0948, 0.495789, 77.4016, -169.599, 18.2888, 0.477661, ...]
	},
	...

id: a unique subject index within a sequence. Skeletons with the same id across time represent temporally associated moving skeletons (an individual).
joints15: fifteen 3D joint locations, formatted as [x1,y1,z1,c1,x2,y2,z2,c2,...] where each c is a per-joint confidence score.
The order of joints is as follows (see this example for an illustration):

Neck, HeadTop, BodyCenter, lShoulder, lElbow, lWrist, lHip, lKnee, lAnkle, rShoulder, rElbow, rWrist, rHip, rKnee, rAnkle
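The flat joints15 layout can be unpacked into named joints as in the following sketch. The helper parse_joints15 is hypothetical, and the sample array reuses only the two joints printed in the example above; the remaining 13 joints are placeholder zeros, not real data.

```python
JOINT_NAMES = [
    "Neck", "HeadTop", "BodyCenter",
    "lShoulder", "lElbow", "lWrist", "lHip", "lKnee", "lAnkle",
    "rShoulder", "rElbow", "rWrist", "rHip", "rKnee", "rAnkle",
]


def parse_joints15(flat):
    """Split [x1,y1,z1,c1,x2,y2,z2,c2,...] into a dict keyed by joint name."""
    if len(flat) != 4 * len(JOINT_NAMES):
        raise ValueError(f"expected {4 * len(JOINT_NAMES)} values, got {len(flat)}")
    return {
        name: {"xyz": tuple(flat[4 * i:4 * i + 3]), "conf": flat[4 * i + 3]}
        for i, name in enumerate(JOINT_NAMES)
    }


# First two joints come from the example above; the rest are PLACEHOLDER zeros.
sample_body = {
    "id": 1,
    "joints15": [82.8466, -144.961, 23.0948, 0.495789,
                 77.4016, -169.599, 18.2888, 0.477661] + [0.0] * 52,
}

skeleton = parse_joints15(sample_body["joints15"])
print(skeleton["Neck"])
print(skeleton["HeadTop"])
```

Since the coordinates are in the same world frame as the calibration data, each "xyz" triple can be passed to a projection routine to obtain 2D joint locations in any camera view.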