UrbanPose: A New Benchmark for VRU Pose Estimation in Urban Traffic Scenes
This work is joint research by Tsinghua University and Daimler, and was published at the IEEE Intelligent Vehicles Symposium (IV) 2021.
Abstract
Human pose, serving as a robust appearance-invariant mid-level feature, has proven to be effective and efficient for human action recognition and intention estimation. Pose features also have great potential to improve trajectory prediction for Vulnerable Road Users (VRUs) in ADAS or automated driving applications. However, the lack of highly diverse and large-scale VRU pose datasets makes transferring and applying these techniques to VRUs rather difficult. This paper introduces the Tsinghua-Daimler Urban Pose dataset (TDUP), a large-scale 2D VRU pose image dataset collected from a moving vehicle in Chinese urban traffic environments. The TDUP dataset contains 21k images with more than 90k high-quality, manually labeled VRU bounding boxes with pose keypoint annotations and additional tags. We optimize four state-of-the-art deep learning approaches (AlphaPose, Mask R-CNN, Pose-SSD and PifPaf) to serve as baselines for the new pose estimation benchmark. We further analyze the effect of large pre-training datasets, of different data proportions and of optional labeled information during training. Our new benchmark is expected to lay the foundation for further VRU pose studies and to empower the development of accurate VRU trajectory prediction methods in complex urban traffic scenes. The dataset is available for non-commercial scientific use.
Download
Please log in before you begin to download the images and annotations.
Dataset Statistics
Leaderboard
| Methods | LAMR ↓ (Reasonable) | LAMR ↓ (Small) | LAMR ↓ (Occluded) | LAMR ↓ (Combined) | AP ↑ (Reasonable) | AP ↑ (Small) | AP ↑ (Occluded) | AP ↑ (Combined) | Inference time* [ms] |
|---|---|---|---|---|---|---|---|---|---|
| AlphaPose (GT bbox) | 11.10 | 15.76 | 27.82 | 12.05 | 79.92 | 68.37 | 41.44 | 71.07 | 30.5 |
| AlphaPose (YOLOv3) | 26.33 | 38.43 | 58.75 | 34.03 | 70.41 | 52.33 | 24.27 | 59.37 | 230.2 |
| Mask R-CNN | 26.91 | 39.57 | 59.61 | 34.72 | 65.93 | 45.37 | 19.27 | 54.54 | 62.8 |
| SimpleBaseline (YOLOv3) | 28.30 | 43.44 | 60.96 | 36.34 | 69.84 | 49.16 | 23.01 | 58.40 | 172.5 |
| HRNet (YOLOv3) | 28.61 | 44.45 | 60.47 | 36.57 | 70.50 | 49.16 | 23.71 | 59.00 | 196.8 |
| Pose-SSD | 31.27 | 47.47 | 61.47 | 38.92 | 62.72 | 40.08 | 18.01 | 51.51 | 48.5 |
| HigherHRNet | 33.91 | 42.84 | 64.40 | 40.90 | 53.39 | 34.90 | 13.71 | 43.61 | 2110 |
| PifPaf | 36.49 | 56.63 | 64.55 | 44.12 | 57.15 | 29.30 | 17.75 | 46.49 | 79.3 |
*Inference time per image for Pose-SSD (TensorFlow) was measured on an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz and an NVIDIA GeForce GTX 1070 with 8 GB memory, while the other methods (PyTorch) were measured on an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz and an NVIDIA GeForce GTX TITAN V with 12 GB memory.
You are welcome to submit your results on the TDUP test set. We accept a JSON file in the MSCOCO keypoint format for evaluation; a minimal sketch of the expected format is shown below.
Please log in before your submission.
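For orientation, here is a minimal sketch of one entry in such a file, assuming the standard MSCOCO keypoint result fields (`image_id`, `category_id`, a flat `keypoints` list and a `score`) and a 17-keypoint COCO-style skeleton. All numeric values are illustrative only; use the keypoint layout defined by the TDUP annotations.

```python
import json

# One detected person in the standard MSCOCO keypoint result format.
# "keypoints" is a flat list x1, y1, s1, ..., xK, yK, sK (one
# (x, y, score) triplet per keypoint). Values here are made up purely
# for illustration, assuming a 17-keypoint skeleton.
example_detection = {
    "image_id": 1,            # id of the test image
    "category_id": 1,         # person
    "keypoints": [640.0, 360.0, 0.92] + [0.0, 0.0, 0.0] * 16,
    "score": 0.95,            # overall detection confidence
}

# The submission is a JSON array of such detections.
with open("tdup_results.json", "w") as f:
    json.dump([example_detection], f)
```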
Evaluation metrics
Object keypoint similarity (OKS) is used for object-level keypoint matching:
$$\mathrm{OKS} = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 \kappa_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
where $i$ denotes the index of the keypoint; $d_i$ is the Euclidean distance between the estimated keypoint and the ground truth; $s^2$ refers to the enclosed area of a person, empirically approximated as $0.53^2$ times the area of the person bounding box; $\kappa_i = 2\sigma_i$, where $\sigma_i$ is the per-keypoint normalized standard deviation obtained during the annotation process from the responses of multiple labelers marking the same keypoints; $\delta(v_i > 0)$ equals one if keypoint $i$ is labeled, and zero otherwise.
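For reference, a minimal NumPy sketch of this computation for one predicted/ground-truth pair. The per-pair interface and argument names are our own illustration, not the official evaluation code; the $0.53^2$ bounding-box approximation follows the definition above.

```python
import numpy as np

def oks(pred, gt, vis, bbox_area, sigmas):
    """Object keypoint similarity for one predicted/ground-truth pair.

    pred, gt  : (K, 2) arrays of predicted / ground-truth keypoint coordinates.
    vis       : (K,) array; v_i > 0 means keypoint i is labeled.
    bbox_area : area of the person bounding box.
    sigmas    : (K,) per-keypoint normalized standard deviations sigma_i.
    """
    labeled = vis > 0
    if not labeled.any():
        return 0.0
    d2 = np.sum((pred - gt) ** 2, axis=1)   # squared distances d_i^2
    s2 = 0.53 ** 2 * bbox_area              # s^2, approximated from the bbox area
    kappa2 = (2.0 * sigmas) ** 2            # kappa_i^2 with kappa_i = 2 * sigma_i
    similarity = np.exp(-d2 / (2.0 * s2 * kappa2))
    return float(similarity[labeled].mean())
```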
1. LAMR
Miss rate ($mr$) and false positives per image ($fppi$) are defined as

$$mr(c) = \frac{fn(c)}{tp(c) + fn(c)}, \qquad fppi(c) = \frac{fp(c)}{\#\mathrm{img}}$$

where $tp(c)$, $fp(c)$ and $fn(c)$ denote the number of true positives, false positives and false negatives, respectively, when the confidence value $c$ is used as the threshold.
The log-average miss rate (LAMR) is calculated as

$$\mathrm{LAMR} = \exp\left\{\frac{1}{9} \sum_f \log\, mr\Big(\operatorname*{argmax}_{fppi(c) \le f} fppi(c)\Big)\right\}$$

where the nine reference points $f \in \{10^{-2}, 10^{-1.75}, \ldots, 10^{0}\}$ are equally spaced in log space. For each reference point $f$, the miss rate is taken at the threshold $c$ that yields the largest $fppi$ not exceeding $f$.
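A minimal sketch of how LAMR can be computed from aligned $mr$/$fppi$ curves. The fallback to a worst-case miss rate of 1.0 when no operating point lies at or below a reference point is our assumption, not part of the protocol stated above.

```python
import numpy as np

def lamr(mr, fppi):
    """Log-average miss rate over nine log-spaced reference points.

    mr, fppi : arrays of miss rate / false positives per image obtained
    by sweeping the confidence threshold c (element-wise aligned).
    """
    refs = np.logspace(-2, 0, 9)   # f in {10^-2, 10^-1.75, ..., 10^0}
    picked = []
    for f in refs:
        below = np.flatnonzero(fppi <= f)
        if below.size == 0:
            picked.append(1.0)     # no operating point at or below f: worst case
        else:
            # mr at the largest fppi not exceeding the reference point f
            picked.append(mr[below[np.argmax(fppi[below])]])
    return float(np.exp(np.mean(np.log(np.maximum(picked, 1e-10)))))
```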
2. AP
Precision ($pr$) and recall ($re$) are defined as

$$pr(c) = \frac{tp(c)}{tp(c) + fp(c)}, \qquad re(c) = \frac{tp(c)}{tp(c) + fn(c)} = 1 - mr(c)$$

Average precision (AP) is defined as

$$\mathrm{AP} = \frac{1}{11} \sum_r \max_{re(c) \ge r} pr(c)$$

where the eleven reference points $r \in \{0, 0.1, \ldots, 1\}$ are equally distributed in $[0, 1]$.
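A minimal sketch of this 11-point interpolated AP from aligned precision/recall curves (interface and names are our own illustration):

```python
import numpy as np

def average_precision(pr, re):
    """11-point interpolated average precision.

    pr, re : arrays of precision / recall obtained by sweeping the
    confidence threshold c (element-wise aligned).
    """
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):   # r in {0, 0.1, ..., 1}
        mask = re >= r
        # max precision among operating points with recall at least r
        ap += pr[mask].max() if mask.any() else 0.0
    return ap / 11.0
```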
3. Subsets and combined performance
Since persons in the vicinity of the ego vehicle are usually at higher risk, and occluded persons require particular attention in urban environments, we define different evaluation subsets (Reasonable, Small and Occluded). The final Combined metric (for both LAMR and AP) is a weighted sum of the metric results over the three evaluation subsets, as sketched below.
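Illustratively, the combination can be written as follows. The weights here are placeholders only, not the official TDUP weights, which are defined by the benchmark's evaluation protocol.

```python
# Hypothetical illustration of the Combined metric: a weighted sum over
# the three subsets. These weights are placeholders, NOT the official
# TDUP weights.
WEIGHTS = {"reasonable": 0.5, "small": 0.25, "occluded": 0.25}

def combined_metric(per_subset):
    """per_subset: dict mapping subset name -> LAMR (or AP) on that subset."""
    return sum(WEIGHTS[s] * per_subset[s] for s in WEIGHTS)
```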
Qualitative performance of algorithms on TDUP test set
Listed here are cutouts of success cases (first two rows) and failure cases (last row) produced by the four selected baseline methods. From left to right: AlphaPose (YOLOv3) (columns 1-3), Mask R-CNN (columns 4-6), Pose-SSD (columns 7-9), and PifPaf (columns 10-12). For each method, swap, redundancy and miss errors are displayed in that order.
Presentation at IV 2021
License
By using the dataset, you accept the terms and conditions set forth by its License.
Contact
Sijia Wang (wsj17@mails.tsinghua.edu.cn)