[Article] Multiview transformers for video recognition.

Summary: This is a follow-up study to the Video Vision Transformer (ViViT). For multiscale modelling in video recognition, the authors use multiview tubelets, encode each view with a separate transformer model, and apply cross-view attention across those models. At the time of publication it reported state-of-the-art results on Kinetics-600 and five other standard video benchmarks. It was presented at CVPR 2022. Official code is available in the Scenic project.
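
To make the multiview idea concrete, here is a minimal sketch of how the number of tokens changes with tubelet size when the same clip is tokenized several ways. The clip shape, the three tubelet shapes, and the helper name `tokens_per_view` are assumptions chosen for illustration; they are not taken from this note or from the Scenic code.

```python
from math import prod


def tokens_per_view(video_shape, tubelet_shape):
    """Count the non-overlapping tubelets (tokens) that tile a video clip.

    Both shapes are (frames, height, width). Raises ValueError if the
    tubelet does not evenly divide the clip along every axis.
    """
    if any(v % t for v, t in zip(video_shape, tubelet_shape)):
        raise ValueError(f"{tubelet_shape} does not evenly tile {video_shape}")
    return prod(v // t for v, t in zip(video_shape, tubelet_shape))


# Hypothetical clip: 32 frames of 224x224 pixels.
clip = (32, 224, 224)

# Hypothetical views: same spatial patch, increasing temporal extent.
views = [(2, 16, 16), (4, 16, 16), (8, 16, 16)]

token_counts = {view: tokens_per_view(clip, view) for view in views}
for view, count in token_counts.items():
    print(f"tubelet {view}: {count} tokens")
```

Under these assumed shapes, each doubling of the tubelet's temporal extent halves the token count, so each view presents the clip to its transformer at a different temporal granularity.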

Yan, Shen, et al. “Multiview transformers for video recognition.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.