June 2020
tl;dr: Pedestrian intension estimation dataset and achieves ~80% of accuracy and ~90% F1 score.
Intention and trajectory prediction are different. Trajectory prediction are only effective when the pedestrians have started crossing or about to do so, so basically the algorithm react to an action instead of anticipating it.
Intention helps trajectory prediction. This is similar conclusion as in IntentNet.
The key question is "does this pedestrian want to cross the street?"
- Intention estimation allows one to predict a future situation using expected behaviors rather than merely reply on scene dynamics.
- Annotation:
- Labels walking, standing, looking, not looking, crossing, not crossing.
- Crossing intention confidence from 1 to 5.
- Bbox with occlusion level.
- Annotators are allowed to show a clip ~3 sec long before the vehicle reaches 1.5-3 sec time to event, so the labeler does not see the final event.
- Inter-rater consistency is high.
- Humans are good at telling if a pedestrian is going to cross. There are 2/1800 (0.1%) samples that crossed the street but the annotators indicate otherwise.
- LSTM is used for intention and trajectory prediction.
- The trajectory prediction is onboard bbox prediction similar to hevi and Nvidia's demo.
- Uses AMT to gather 10 annotations per video clip.
- Feeding bbox and image patches around the bbox (context info) helps.
- It would have been very interesting to know the average performance of a human annotator.