[Article] Multiview Transformers for Video Recognition.
Summary: This is a follow-up study to the Video Vision Transformer (ViViT). For multiscale modelling in video recognition, the authors tokenize the input into multiview tubelets of different sizes, process each view with a separate transformer encoder, and fuse the views with cross-view attention. At the time of publication, it reported state-of-the-art results on Kinetics-600 and five other standard video benchmarks. It was presented at CVPR 2022. Official code is available in the Scenic project.
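
To make the two ideas in the summary concrete, here is a minimal pure-Python sketch, not the official Scenic implementation. It shows how different tubelet sizes give each view a different number of tokens, and how tokens from one view can attend to tokens from another. The clip size, the tubelet sizes, the view names, and the helper names `tubelet_token_count` and `cross_view_attention` are all assumptions made for illustration. The attention has no learned projections and assumes both views share the same hidden size.

```python
import math


def tubelet_token_count(video_shape, tubelet_shape):
    """Count non-overlapping tubelets (tokens) in a clip.

    Both arguments are (frames, height, width) tuples.
    """
    count = 1
    for size, step in zip(video_shape, tubelet_shape):
        if size % step != 0:
            raise ValueError("clip size must be divisible by tubelet size")
        count *= size // step
    return count


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(query_tokens, context_tokens):
    """Let each token of one view attend over the tokens of another view.

    Simplified: scaled dot-product attention without learned projections,
    followed by a residual addition to the query tokens.
    """
    dim = len(query_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in query_tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context_tokens]
        weights = softmax(scores)
        attended = [
            sum(w * k[d] for w, k in zip(weights, context_tokens))
            for d in range(dim)
        ]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# Hypothetical setup: a 32-frame 224x224 clip and two illustrative views.
video = (32, 224, 224)
views = {"fine": (2, 16, 16), "coarse": (8, 16, 16)}
token_counts = {name: tubelet_token_count(video, t) for name, t in views.items()}

# Toy tokens (hidden size 2): the coarse view queries the fine view.
coarse_tokens = [[1.0, 0.0]]
fine_tokens = [[1.0, 0.0], [1.0, 0.0]]
fused_tokens = cross_view_attention(coarse_tokens, fine_tokens)
```

With these assumed sizes, the fine view (short tubelets) yields four times as many tokens as the coarse view, which is what makes the representation multiscale. The output of `cross_view_attention` has the same number of tokens as the querying view, so each encoder keeps its own sequence length after fusion.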