Facial Expression Recognition of Students in Classroom Using Hybrid MobileNetV3-Vision Transformer with Token Downsampling
DOI:
https://doi.org/10.47709/brilliance.v5i1.6323
Keywords:
Classroom, Facial Expression Recognition, Hybrid Vision Transformer, MobileNetV3, Student, Token Downsampling
Abstract
In large classroom environments, teachers often struggle to monitor each student’s facial expression throughout the learning process. Yet facial expressions are important indicators of students’ emotional states and engagement, and detecting them in real time can support a more adaptive learning experience. Most previous research on Facial Expression Recognition (FER) has relied on Convolutional Neural Networks (CNNs), which tend to be limited in capturing global relationships between facial features. Additionally, many studies focus on model accuracy without evaluating the models’ practical effectiveness in real classroom settings. This study aims to develop a facial expression recognition model that is both accurate and efficient for use in classroom contexts. It proposes a hybrid Vision Transformer (ViT) architecture that combines MobileNetV3 for local feature extraction with a Vision Transformer for global context modeling. To reduce the number of tokens and the computational cost, a Token Downsampling method is introduced within the transformer blocks. The model is trained on the FER2013 dataset and achieves a test accuracy of 71.24%, surpassing the baseline pretrained ViT model, which reached 70.10%. The Token Downsampling method also improves inference speed. Furthermore, the model is tested on a custom dataset collected from students in a real classroom setting to evaluate its performance in practical implementation. Although the performance on the classroom dataset is not yet optimal, the results on FER2013 demonstrate the potential of this approach for further development toward real-time and accurate facial expression recognition in educational environments.
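The abstract does not specify how Token Downsampling is implemented, so the following is only a minimal sketch of the general idea, assuming a simple 2 x 2 average pooling over a grid of tokens. The function names `downsample_tokens` and `attention_pairs`, the 4 x 4 grid, and the 2-dimensional tokens are all hypothetical choices for illustration; in the hybrid model the grid would come from the MobileNetV3 feature map rather than from synthetic values. The sketch shows why fewer tokens reduce cost: self-attention compares every token with every other token, so the number of query-key pairs grows quadratically with the token count.

```python
from typing import List

Token = List[float]


def downsample_tokens(grid: List[List[Token]], stride: int = 2) -> List[List[Token]]:
    """Average-pool an H x W grid of tokens over stride x stride windows.

    Assumption: average pooling stands in for the paper's Token Downsampling,
    whose exact form is not given in the abstract.
    """
    height, width = len(grid), len(grid[0])
    if height % stride or width % stride:
        raise ValueError("grid height and width must be divisible by stride")
    dim = len(grid[0][0])
    pooled = []
    for top in range(0, height, stride):
        row = []
        for left in range(0, width, stride):
            # Collect the tokens inside the current pooling window.
            window = [
                grid[top + dy][left + dx]
                for dy in range(stride)
                for dx in range(stride)
            ]
            # Average each feature dimension across the window.
            row.append([sum(tok[d] for tok in window) / len(window) for d in range(dim)])
        pooled.append(row)
    return pooled


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs in one self-attention layer (quadratic in token count)."""
    return num_tokens * num_tokens


# Hypothetical 4 x 4 token grid; each token stores its (row, column) position.
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
pooled = downsample_tokens(grid, stride=2)

tokens_before = len(grid) * len(grid[0])
tokens_after = len(pooled) * len(pooled[0])
cost_ratio = attention_pairs(tokens_before) / attention_pairs(tokens_after)
print(tokens_before, tokens_after, cost_ratio)
```

Under these assumptions, halving each spatial dimension leaves a quarter of the tokens and a sixteenth of the attention pairs. How much of that saving shows up as faster inference in the actual model depends on where in the transformer blocks the downsampling is applied, which the abstract does not state.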
License
Copyright (c) 2025 Mochamad Khaairi, Rasim, Yaya Wihardi

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.














