
UrbanPose: A New Benchmark for VRU Pose Estimation in Urban Traffic Scenes

2021-07-20

This work is joint research by Tsinghua University and Daimler, and was published at the IEEE Intelligent Vehicles Symposium (IV) 2021.


Abstract

Human pose, serving as a robust appearance-invariant mid-level feature, has proven to be effective and efficient for human action recognition and intention estimation. Pose features also have great potential to improve trajectory prediction for Vulnerable Road Users (VRUs) in ADAS or automated driving applications. However, the lack of large, highly diverse VRU pose datasets makes transferring and applying such methods to VRUs rather difficult. This paper introduces the Tsinghua-Daimler Urban Pose dataset (TDUP), a large-scale 2D VRU pose image dataset collected in Chinese urban traffic environments from on board a moving vehicle. The TDUP dataset contains 21k images with more than 90k high-quality, manually labeled VRU bounding boxes with pose keypoint annotations and additional tags. We optimize four state-of-the-art deep learning approaches (AlphaPose, Mask R-CNN, Pose-SSD and PifPaf) to serve as baselines for the new pose estimation benchmark. We further analyze the effect of using large pre-training datasets and different data proportions, as well as optional labeled information, during training. Our new benchmark is expected to lay the foundation for further VRU pose studies and to empower the development of accurate VRU trajectory prediction methods in complex urban traffic scenes. The dataset is available for non-commercial scientific use.


Download

Please log in before you begin to download the images and annotations.


Dataset Statistics


[Figures: dataset statistics]


Leaderboard


Methods | LAMR ↓ (Reasonable / Small / Occluded / Combined) | AP ↑ (Reasonable / Small / Occluded / Combined) | Inference time* [ms]

AlphaPose (GT bbox) | 11.10 / 15.76 / 27.82 / 12.05 | 79.92 / 68.37 / 41.44 / 71.07 | 30.5
AlphaPose (YOLOv3) | 26.33 / 38.43 / 58.75 / 34.03 | 70.41 / 52.33 / 24.27 / 59.37 | 230.2
Mask R-CNN | 26.91 / 39.57 / 59.61 / 34.72 | 65.93 / 45.37 / 19.27 / 54.54 | 62.8
SimpleBaseline (YOLOv3) | 28.30 / 43.44 / 60.96 / 36.34 | 69.84 / 49.16 / 23.01 / 58.40 | 172.5
HRNet (YOLOv3) | 28.61 / 44.45 / 60.47 / 36.57 | 70.50 / 49.16 / 23.71 / 59.00 | 196.8
Pose-SSD | 31.27 / 47.47 / 61.47 / 38.92 | 62.72 / 40.08 / 18.01 / 51.51 | 48.5
HigherHRNet | 33.91 / 42.84 / 64.40 / 40.90 | 53.39 / 34.90 / 13.71 / 43.61 | 2110
PifPaf | 36.49 / 56.63 / 64.55 / 44.12 | 57.15 / 29.30 / 17.75 / 46.49 | 79.3

*The inference time per image for Pose-SSD (TensorFlow) was measured on an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz and an NVIDIA GeForce GTX 1070 with 8 GB memory, while the other methods (PyTorch) were measured on an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz and an NVIDIA GeForce GTX TITAN V with 12 GB memory.


You are welcome to submit your results on the TDUP test set. A JSON file in MS COCO keypoint format is accepted for evaluation.

Please log in before your submission.
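For illustration, below is a minimal sketch of how such a results file could be assembled. The image id, keypoint values, scores and the output file name are placeholders rather than real predictions; the 17-keypoint layout follows the MS COCO convention of a flattened (x, y, score) triple per keypoint.

```python
import json

# One entry per detected person, in MS COCO keypoint results format.
# "keypoints" is a flat list [x1, y1, s1, ..., x17, y17, s17] over the 17 COCO keypoints.
results = [
    {
        "image_id": 1,                           # placeholder: id of the test image
        "category_id": 1,                        # person category
        "keypoints": [100.0, 200.0, 1.0] * 17,   # placeholder coordinates and per-keypoint scores
        "score": 0.95,                           # placeholder instance confidence
    },
]

with open("tdup_test_results.json", "w") as f:   # placeholder file name
    json.dump(results, f)
```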


Evaluation metrics

Object keypoint similarity (OKS) is used for object-level keypoint matching:

OKS = \frac{\sum_i \exp\left(-d_i^2 / (2 s^2 \kappa_i^2)\right) \, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}

where i denotes the index of the keypoint; d_i is the Euclidean distance between the estimated keypoint and the ground truth; s² refers to the enclosed area of a person, empirically approximated as 0.53 times the area of the person bounding box; κ_i = 2σ_i, where σ_i is the per-keypoint normalized standard deviation obtained during annotation from the responses of multiple labelers marking the same keypoints; δ(v_i > 0) is one if keypoint i is labeled and zero otherwise.
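As an illustration, here is a minimal NumPy sketch of this OKS computation. The function and argument names are only for this sketch, the 0.53 bounding-box area factor follows the description above, and the per-keypoint sigmas would come from the dataset's own annotation statistics.

```python
import numpy as np

def oks(pred_kpts, gt_kpts, gt_vis, bbox_area, sigmas):
    """Object keypoint similarity between one prediction and one ground truth.

    pred_kpts, gt_kpts: (K, 2) arrays of keypoint coordinates.
    gt_vis:             (K,) array, > 0 where the keypoint is labeled.
    bbox_area:          area of the ground-truth person bounding box.
    sigmas:             (K,) per-keypoint normalized standard deviations.
    """
    d2 = np.sum((pred_kpts - gt_kpts) ** 2, axis=1)   # squared distances d_i^2
    s2 = 0.53 * bbox_area                             # s^2: approximated person area (assumption per text above)
    kappa2 = (2.0 * sigmas) ** 2                      # kappa_i^2 with kappa_i = 2 * sigma_i
    labeled = gt_vis > 0                              # delta(v_i > 0)
    if not labeled.any():
        return 0.0
    e = np.exp(-d2[labeled] / (2.0 * s2 * kappa2[labeled]))
    return float(e.mean())                            # average over labeled keypoints only
```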

1. LAMR

Miss rate (mr) and false positives per image (fppi) are defined as

mr(c) = fn(c) / (tp(c)+fn(c))

fppi(c) = fp(c) / #img

where tp(c), fp(c) and fn(c) respectively denote the number of true positives, false positives and false negatives with a given confidence value c as the threshold.

Log average miss rate (LAMR) is calculated with

LAMR = \exp\left(\frac{1}{9} \sum_{f} \log\, mr\Big(\underset{fppi(c) \le f}{\arg\max}\, fppi(c)\Big)\right)

where the nine reference points are equally spaced in log space, such that f ∈ \{10^{-2}, 10^{-1.75}, \ldots, 10^{0}\}. For each reference point f, the miss rate is evaluated at the confidence threshold c whose fppi(c) is largest while not exceeding f.
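A minimal sketch of this computation, assuming miss rate and fppi have already been evaluated over a sweep of confidence thresholds and sorted by increasing fppi (the function and variable names are illustrative):

```python
import numpy as np

def lamr(mr, fppi):
    """Log-average miss rate over nine reference points f in {1e-2, ..., 1e0}.

    mr, fppi: arrays with one entry per confidence threshold c,
              sorted so that fppi is increasing.
    """
    refs = np.logspace(-2.0, 0.0, num=9)       # 10^-2, 10^-1.75, ..., 10^0
    logs = []
    for f in refs:
        idx = np.where(fppi <= f)[0]           # thresholds with fppi(c) <= f
        m = mr[idx[-1]] if idx.size else 1.0   # largest fppi not exceeding f; treat as full miss otherwise
        logs.append(np.log(max(m, 1e-10)))     # guard against log(0)
    return float(np.exp(np.mean(logs)))
```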

2. AP

Precision (pr) and recall (re) are defined as

pr(c) = tp(c) / (tp(c)+fp(c))

re(c) = tp(c) / (tp(c)+fn(c)) = 1-mr(c)

Average precision (AP) is defined as

AP = \frac{1}{11} \sum_{r} \max_{re(c) \ge r} pr(c)

where the eleven reference points r ∈ \{0, 0.1, \ldots, 1\} are equally distributed in [0, 1].
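And a corresponding sketch of the 11-point AP, under the same assumption that precision and recall have been evaluated over all confidence thresholds (names are illustrative):

```python
import numpy as np

def average_precision(pr, re):
    """11-point interpolated average precision.

    pr, re: arrays of precision and recall, one entry per confidence threshold c.
    """
    refs = np.linspace(0.0, 1.0, num=11)      # r in {0, 0.1, ..., 1}
    ap = 0.0
    for r in refs:
        mask = re >= r                        # thresholds with re(c) >= r
        ap += pr[mask].max() if mask.any() else 0.0
    return float(ap / 11.0)
```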

3. Subsets and combined performance

Since persons in the vicinity of the ego vehicle are usually at higher risk and occluded persons require particular attention in urban environments, we define different evaluation subsets (Reasonable, Small, Occluded). The final Combined LAMR and AP metrics are weighted sums of the results over the three evaluation subsets.



Qualitative performance of algorithms on the TDUP test set



Cutouts of success cases (first two rows) and failure cases (last row) produced by the four selected baseline methods. From left to right: AlphaPose (YOLOv3) (columns 1-3), Mask R-CNN (columns 4-6), Pose-SSD (columns 7-9), and PifPaf (columns 10-12). Swap, redundancy and miss errors are shown in that order for each method.



Presentation at IV 2021


License

By using the dataset, you accept the terms and conditions set forth by its License.


Contact

Sijia Wang (wsj17@mails.tsinghua.edu.cn)

Laboratory Director

Diange Yang (ydg@tsinghua.edu.cn)

Deputy Laboratory Director

Kun Jiang (jiangkun@tsinghua.edu.cn)