[Article] Expression snippet transformer for robust video-based facial expression recognition

Summary: Although Transformers can be powerful for modeling visual relations and describing complicated patterns, they can still perform unsatisfactorily on video-based facial expression recognition, because the expression movements in a video can be too small to reflect meaningful spatial-temporal relations. The authors propose decomposing the modeling of a video's expression movements into the modeling of a series of expression snippets, each of which contains a few frames. Their proposed model, the Expression Snippet Transformer (EST), processes intra-snippet and inter-snippet information separately and then combines them. Code is available on GitHub.
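
To make the snippet idea concrete, here is a minimal sketch of splitting a frame sequence into short snippets and then summarizing at two levels. It is an illustration only, not the authors' implementation: the function name `split_into_snippets`, the non-overlapping split, the snippet length of 3, and the scalar per-frame "features" are all assumptions, and the simple averages merely stand in for the intra-snippet and inter-snippet Transformer modules described in the paper.

```python
from typing import List, Sequence


def split_into_snippets(frames: Sequence, snippet_len: int) -> List[list]:
    """Split a frame sequence into consecutive, non-overlapping snippets.

    Hypothetical helper: a trailing group shorter than snippet_len is kept
    as its own snippet. The paper may sample snippets differently.
    """
    if snippet_len <= 0:
        raise ValueError("snippet_len must be positive")
    return [list(frames[i:i + snippet_len])
            for i in range(0, len(frames), snippet_len)]


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Toy video: one scalar "feature" per frame (made-up data for illustration).
frame_features = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
snippets = split_into_snippets(frame_features, snippet_len=3)

# Stand-in for intra-snippet modeling: one summary value per snippet.
snippet_summaries = [mean(s) for s in snippets]

# Stand-in for inter-snippet modeling: one summary value for the whole video.
video_summary = mean(snippet_summaries)
```

The two-stage structure is the point here: each snippet is summarized on its own first, and only then are the snippet-level summaries related to one another across the video.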

Liu, Y., Wang, W., Feng, C., Zhang, H., Chen, Z., & Zhan, Y. (2023). Expression snippet transformer for robust video-based facial expression recognition. Pattern Recognition, 138, 109368.