UrbanPose: A New Benchmark for VRU Pose Estimation in Urban Traffic Scenes
This work is joint research by Tsinghua University and Daimler, and was published at the IEEE Intelligent Vehicles Symposium (IV) 2021.
Abstract
Human pose, serving as a robust appearance-invariant mid-level feature, has proven to be effective and efficient for human action recognition and intention estimation. Pose features also have great potential to improve trajectory prediction for Vulnerable Road Users (VRUs) in ADAS or automated driving applications. However, the lack of highly diverse and large-scale VRU pose datasets makes transferring and applying these techniques to VRUs rather difficult. This paper introduces the Tsinghua-Daimler Urban Pose dataset (TDUP), a large-scale 2D VRU pose image dataset collected from a moving vehicle in Chinese urban traffic environments. The TDUP dataset contains 21k images with more than 90k high-quality, manually labeled VRU bounding boxes with pose keypoint annotations and additional tags. We optimize four state-of-the-art deep learning approaches (AlphaPose, Mask R-CNN, Pose-SSD and PifPaf) to serve as baselines for the new pose estimation benchmark. We further analyze the effect of large pre-training datasets, of different data proportions and of optional labeled information during training. Our new benchmark is expected to lay the foundation for further VRU pose studies and to empower the development of accurate VRU trajectory prediction methods in complex urban traffic scenes. The dataset is available for non-commercial scientific use.
Download
Please log in before you begin to download the images and annotations.
Dataset Statistics
Leaderboard
| Methods | LAMR ↓ (Reasonable) | LAMR ↓ (Small) | LAMR ↓ (Occluded) | LAMR ↓ (Combined) | AP ↑ (Reasonable) | AP ↑ (Small) | AP ↑ (Occluded) | AP ↑ (Combined) | Inference time* [ms] |
|---|---|---|---|---|---|---|---|---|---|
| AlphaPose (GT bbox) | 11.10 | 15.76 | 27.82 | 12.05 | 79.92 | 68.37 | 41.44 | 71.07 | 30.5 |
| AlphaPose (YOLOv3) | 26.33 | 38.43 | 58.75 | 34.03 | 70.41 | 52.33 | 24.27 | 59.37 | 230.2 |
| Mask R-CNN | 26.91 | 39.57 | 59.61 | 34.72 | 65.93 | 45.37 | 19.27 | 54.54 | 62.8 |
| SimpleBaseline (YOLOv3) | 28.30 | 43.44 | 60.96 | 36.34 | 69.84 | 49.16 | 23.01 | 58.40 | 172.5 |
| HRNet (YOLOv3) | 28.61 | 44.45 | 60.47 | 36.57 | 70.50 | 49.16 | 23.71 | 59.00 | 196.8 |
| Pose-SSD | 31.27 | 47.47 | 61.47 | 38.92 | 62.72 | 40.08 | 18.01 | 51.51 | 48.5 |
| HigherHRNet | 33.91 | 42.84 | 64.40 | 40.90 | 53.39 | 34.90 | 13.71 | 43.61 | 2110 |
| PifPaf | 36.49 | 56.63 | 64.55 | 44.12 | 57.15 | 29.30 | 17.75 | 46.49 | 79.3 |
*Inference time per image for Pose-SSD (TensorFlow) was measured on an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz and an NVIDIA GeForce GTX 1070 with 8 GB memory, while the other methods (PyTorch) were measured on an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz and an NVIDIA GeForce GTX TITAN V with 12 GB memory.
You are welcome to submit your results on the TDUP test set. We accept a JSON file in the MSCOCO keypoint format for evaluation; a minimal sketch of the expected format is shown below.
Please log in before your submission.
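For orientation, here is a minimal sketch of one entry in such a file, assuming the standard MSCOCO keypoint result fields (`image_id`, `category_id`, a flat `keypoints` list and a `score`) and a 17-keypoint COCO-style skeleton. All numeric values are illustrative only; use the keypoint layout defined by the TDUP annotations.

```python
import json

# One detected person in the standard MSCOCO keypoint result format.
# "keypoints" is a flat list x1, y1, s1, ..., xK, yK, sK (one
# (x, y, score) triplet per keypoint). Values here are made up purely
# for illustration, assuming a 17-keypoint skeleton.
example_detection = {
    "image_id": 1,            # id of the test image
    "category_id": 1,         # person
    "keypoints": [640.0, 360.0, 0.92] + [0.0, 0.0, 0.0] * 16,
    "score": 0.95,            # overall detection confidence
}

# The submission is a JSON array of such detections.
with open("tdup_results.json", "w") as f:
    json.dump([example_detection], f)
```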
Evaluation metrics
Object keypoint similarity (OKS) is used for object-level keypoint matching:
$$\mathrm{OKS} = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 \kappa_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
where $i$ denotes the index of the keypoint; $d_i$ is the Euclidean distance between the estimated keypoint and the ground truth; $s^2$ refers to the enclosed area of a person, empirically approximated as $0.53^2$ times the area of the person bounding box; $\kappa_i = 2\sigma_i$, where $\sigma_i$ is the per-keypoint normalized standard deviation obtained during the annotation process from the responses of multiple labelers marking the same keypoints; $\delta(v_i > 0)$ equals one if keypoint $i$ is labeled, and zero otherwise.
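For reference, a minimal NumPy sketch of this computation for one predicted/ground-truth pair. The per-pair interface and argument names are our own illustration, not the official evaluation code; the $0.53^2$ bounding-box approximation follows the definition above.

```python
import numpy as np

def oks(pred, gt, vis, bbox_area, sigmas):
    """Object keypoint similarity for one predicted/ground-truth pair.

    pred, gt  : (K, 2) arrays of predicted / ground-truth keypoint coordinates.
    vis       : (K,) array; v_i > 0 means keypoint i is labeled.
    bbox_area : area of the person bounding box.
    sigmas    : (K,) per-keypoint normalized standard deviations sigma_i.
    """
    labeled = vis > 0
    if not labeled.any():
        return 0.0
    d2 = np.sum((pred - gt) ** 2, axis=1)   # squared distances d_i^2
    s2 = 0.53 ** 2 * bbox_area              # s^2, approximated from the bbox area
    kappa2 = (2.0 * sigmas) ** 2            # kappa_i^2 with kappa_i = 2 * sigma_i
    similarity = np.exp(-d2 / (2.0 * s2 * kappa2))
    return float(similarity[labeled].mean())
```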
1. LAMR
Miss rate ($mr$) and false positives per image ($fppi$) are defined as

$$mr(c) = \frac{fn(c)}{tp(c) + fn(c)}, \qquad fppi(c) = \frac{fp(c)}{\#\mathrm{img}}$$

where $tp(c)$, $fp(c)$ and $fn(c)$ denote the number of true positives, false positives and false negatives, respectively, when the confidence value $c$ is used as the threshold.
The log-average miss rate (LAMR) is calculated as

$$\mathrm{LAMR} = \exp\left\{\frac{1}{9} \sum_f \log\, mr\Big(\operatorname*{argmax}_{fppi(c) \le f} fppi(c)\Big)\right\}$$

where the nine reference points $f \in \{10^{-2}, 10^{-1.75}, \ldots, 10^{0}\}$ are equally spaced in log space. For each reference point $f$, the miss rate is taken at the threshold $c$ that yields the largest $fppi$ not exceeding $f$.
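A minimal sketch of how LAMR can be computed from aligned $mr$/$fppi$ curves. The fallback to a worst-case miss rate of 1.0 when no operating point lies at or below a reference point is our assumption, not part of the protocol stated above.

```python
import numpy as np

def lamr(mr, fppi):
    """Log-average miss rate over nine log-spaced reference points.

    mr, fppi : arrays of miss rate / false positives per image obtained
    by sweeping the confidence threshold c (element-wise aligned).
    """
    refs = np.logspace(-2, 0, 9)   # f in {10^-2, 10^-1.75, ..., 10^0}
    picked = []
    for f in refs:
        below = np.flatnonzero(fppi <= f)
        if below.size == 0:
            picked.append(1.0)     # no operating point at or below f: worst case
        else:
            # mr at the largest fppi not exceeding the reference point f
            picked.append(mr[below[np.argmax(fppi[below])]])
    return float(np.exp(np.mean(np.log(np.maximum(picked, 1e-10)))))
```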
2. AP
Precision ($pr$) and recall ($re$) are defined as

$$pr(c) = \frac{tp(c)}{tp(c) + fp(c)}, \qquad re(c) = \frac{tp(c)}{tp(c) + fn(c)} = 1 - mr(c)$$

Average precision (AP) is defined as

$$\mathrm{AP} = \frac{1}{11} \sum_r \max_{re(c) \ge r} pr(c)$$

where the eleven reference points $r \in \{0, 0.1, \ldots, 1\}$ are equally distributed in $[0, 1]$.
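A minimal sketch of this 11-point interpolated AP from aligned precision/recall curves (interface and names are our own illustration):

```python
import numpy as np

def average_precision(pr, re):
    """11-point interpolated average precision.

    pr, re : arrays of precision / recall obtained by sweeping the
    confidence threshold c (element-wise aligned).
    """
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):   # r in {0, 0.1, ..., 1}
        mask = re >= r
        # max precision among operating points with recall at least r
        ap += pr[mask].max() if mask.any() else 0.0
    return ap / 11.0
```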
3. Subsets and combined performance
Since persons in the vicinity of the ego vehicle are usually at higher risk, and occluded persons require particular attention in urban environments, we define different evaluation subsets (Reasonable, Small and Occluded). The final Combined metric (for both LAMR and AP) is a weighted sum of the metric results over the three evaluation subsets, as sketched below.
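Illustratively, the combination can be written as follows. The weights here are placeholders only, not the official TDUP weights, which are defined by the benchmark's evaluation protocol.

```python
# Hypothetical illustration of the Combined metric: a weighted sum over
# the three subsets. These weights are placeholders, NOT the official
# TDUP weights.
WEIGHTS = {"reasonable": 0.5, "small": 0.25, "occluded": 0.25}

def combined_metric(per_subset):
    """per_subset: dict mapping subset name -> LAMR (or AP) on that subset."""
    return sum(WEIGHTS[s] * per_subset[s] for s in WEIGHTS)
```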
Qualitative performance of algorithms on TDUP test set
Listed here are cutouts of success cases (first two rows) and failure cases (last row) produced by the four selected baseline methods. From left to right: AlphaPose (YOLOv3) (columns 1-3), Mask R-CNN (columns 4-6), Pose-SSD (columns 7-9), and PifPaf (columns 10-12). For each method, swap, redundancy and miss errors are displayed in that order.
Presentation at IV 2021
License
By using the dataset, you accept the terms and conditions set forth by its License.
Contact
Sijia Wang (wsj17@mails.tsinghua.edu.cn)