[Article] Transformer-based multimodal information fusion for facial expression analysis.
summary : This work uses multimodal features from the Aff-Wild2 dataset: spoken words (text), speech prosody (audio), and facial expressions (video). The per-modality features are combined with a transformer-based fusion module that produces a joint embedding from the sequences of images, audio, and text. The fused feature is then passed through an MLP head for Action Unit (AU) detection and facial expression recognition.
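A minimal sketch of this kind of fusion pipeline is shown below. The PyTorch framing, all dimensions, layer counts, and the head sizes (12 AUs, 8 expression classes, typical of Aff-Wild2/ABAW tasks) are assumptions for illustration, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Illustrative transformer fusion over visual, audio, and text tokens.

    All hyperparameters here are hypothetical, not taken from the paper.
    """
    def __init__(self, dim=256, num_aus=12, num_expr=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Separate MLP heads: multi-label AU detection and
        # single-label expression classification share the fused feature.
        self.au_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, num_aus))
        self.expr_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, num_expr))

    def forward(self, vis, aud, txt):
        # Each input: (batch, modality_seq_len, dim). Concatenate the
        # per-modality token sequences and let self-attention fuse them.
        tokens = torch.cat([vis, aud, txt], dim=1)
        fused = self.fusion(tokens)
        pooled = fused.mean(dim=1)  # simple average pooling over tokens
        return self.au_head(pooled), self.expr_head(pooled)

# Toy usage: random tensors stand in for pre-extracted frame, audio,
# and word embeddings (e.g. from CNN, prosody, and language encoders).
vis = torch.randn(2, 16, 256)   # 16 video-frame embeddings
aud = torch.randn(2, 10, 256)   # 10 audio-frame embeddings
txt = torch.randn(2, 8, 256)    # 8 word embeddings
au_logits, expr_logits = MultimodalFusion()(vis, aud, txt)
print(au_logits.shape, expr_logits.shape)  # [2, 12] and [2, 8]
```

AU detection would then be trained with a per-unit binary loss and expression recognition with a categorical loss, since the two heads consume the same fused embedding.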