[Article] Expression snippet transformer for robust video-based facial expression recognition
Summary: Although Transformers can be powerful for modeling visual relations and describing complicated patterns, they can still perform unsatisfactorily on video-based facial expression recognition, because the expression movements in a video can be too small to reflect meaningful spatial-temporal relations. The authors propose to decompose the modeling of a video's expression movements into the modeling of a series of expression snippets, each of which contains a few frames. Their proposed model, the Expression Snippet Transformer (EST), processes intra-snippet and inter-snippet information separately and then combines them. Code is available on GitHub.
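To make the snippet decomposition concrete, here is a minimal sketch in plain Python. It is not the EST architecture. It assumes consecutive, non-overlapping snippets of a fixed length and uses simple mean pooling as a stand-in for the intra-snippet and inter-snippet modeling, which in the paper is done with Transformer components. The names `split_into_snippets`, `mean_vector`, and `two_stage_summary` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def split_into_snippets(frames: Sequence, snippet_len: int) -> List[list]:
    """Split a frame sequence into consecutive, non-overlapping snippets.

    Assumption: fixed snippet length; the last snippet may be shorter.
    """
    if snippet_len <= 0:
        raise ValueError("snippet_len must be positive")
    return [list(frames[i:i + snippet_len])
            for i in range(0, len(frames), snippet_len)]


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def two_stage_summary(
    frame_features: Sequence[Sequence[float]], snippet_len: int
) -> Tuple[List[List[float]], List[float]]:
    """Summarize a video in two stages: within snippets, then across them.

    Mean pooling is only a placeholder for the attention-based modeling.
    """
    snippets = split_into_snippets(frame_features, snippet_len)
    # Stage 1 (intra-snippet): one summary vector per snippet.
    intra = [mean_vector(snippet) for snippet in snippets]
    # Stage 2 (inter-snippet): combine the snippet summaries.
    inter = mean_vector(intra)
    return intra, inter


if __name__ == "__main__":
    # Four frames with toy 2-dimensional features, grouped two per snippet.
    features = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
    intra, inter = two_stage_summary(features, snippet_len=2)
    print(intra)
    print(inter)
```

The point of the two-stage structure is that subtle movements are first summarized within a short window, where they are easier to isolate, before relations across the whole video are considered.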