Facial Expression Recognition of Students in Classroom Using Hybrid MobileNetV3-Vision Transformer with Token Downsampling

Authors

  • Mochamad Khaairi, Universitas Pendidikan Indonesia, Indonesia
  • Rasim, Universitas Pendidikan Indonesia, Indonesia
  • Yaya Wihardi, Universitas Pendidikan Indonesia, Indonesia

DOI:

https://doi.org/10.47709/brilliance.v5i1.6323

Keywords:

Classroom, Facial Expression Recognition, Hybrid Vision Transformer, MobileNetV3, Student, Token Downsampling

Abstract

In large classroom environments, teachers often struggle to monitor each student’s facial expression throughout the learning process. Yet facial expressions are important indicators of students’ emotional states and engagement and, when detected in real time, can support a more adaptive learning experience. Most previous research on Facial Expression Recognition (FER) has relied on Convolutional Neural Networks (CNNs), which tend to be limited in capturing global relationships between facial features. In addition, many studies report only model accuracy without evaluating practical effectiveness in real classroom settings. This study aims to develop a facial expression recognition model that is both accurate and efficient for classroom use. It proposes a hybrid Vision Transformer (ViT) architecture that combines MobileNetV3 for local feature extraction with a Vision Transformer for global context modeling. To reduce the number of tokens and the computational cost, a Token Downsampling method is introduced within the transformer blocks. Trained on the FER2013 dataset, the model achieves a test accuracy of 71.24%, surpassing the baseline pretrained ViT model, which reached 70.10%. Token Downsampling also improves inference speed. The model is further tested on a custom dataset collected from students in a real classroom to evaluate its performance in practice. Although performance on the classroom dataset is not yet optimal, the results on FER2013 demonstrate the potential of this approach for further development toward real-time, accurate facial expression recognition in educational environments.
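To make the token-reduction idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify the Token Downsampling operator, so pairwise average pooling is used here purely as a stand-in. The function names, the class-token handling, the 50-token input (a hypothetical 7x7 feature map plus one class token), and the cost formula are all illustrative assumptions.

```python
from typing import List

Token = List[float]


def downsample_tokens(tokens: List[Token], keep_first: bool = True) -> List[Token]:
    """Roughly halve the patch-token count by averaging adjacent pairs.

    Assumption: pairwise average pooling stands in for the paper's Token
    Downsampling step, whose exact operator the abstract does not describe.
    With keep_first=True the first token (treated as a class token) is
    passed through unchanged.
    """
    head = tokens[:1] if keep_first else []
    body = tokens[1:] if keep_first else tokens
    pooled: List[Token] = []
    for i in range(0, len(body) - 1, 2):
        a, b = body[i], body[i + 1]
        pooled.append([(x + y) / 2.0 for x, y in zip(a, b)])
    if len(body) % 2 == 1:  # an odd leftover token is kept as-is
        pooled.append(list(body[-1]))
    return head + pooled


def attention_cost(num_tokens: int, dim: int) -> int:
    """Rough multiply-add count for one self-attention layer (QK^T plus AV)."""
    return 2 * num_tokens * num_tokens * dim


def token_schedule(num_tokens: int, depth: int, downsample_at: List[int]) -> List[int]:
    """Token count seen by each block when downsampling before chosen blocks."""
    counts = []
    n = num_tokens
    for block in range(depth):
        if block in downsample_at:
            patches = n - 1              # exclude the class token
            n = 1 + (patches + 1) // 2   # class token + ceil(patches / 2)
        counts.append(n)
    return counts


if __name__ == "__main__":
    # Hypothetical setup: 49 patch tokens + 1 class token, 4 blocks, dim 64.
    schedule = token_schedule(50, depth=4, downsample_at=[2])
    full = sum(attention_cost(50, 64) for _ in range(4))
    reduced = sum(attention_cost(n, 64) for n in schedule)
    print("tokens per block:", schedule)
    print("attention cost ratio (reduced / full):", round(reduced / full, 3))
```

Because self-attention cost grows with the square of the token count, halving the tokens partway through the network cuts the attention cost of every later block to roughly a quarter. That is the kind of saving that could plausibly explain the faster inference the abstract reports, although the actual speedup depends on details this sketch does not model.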

Published

2025-07-24

How to Cite

Khaairi, M., Rasim, R., & Wihardi, Y. (2025). Facial Expression Recognition of Students in Classroom Using Hybrid MobileNetV3-Vision Transformer with Token Downsampling. Brilliance: Research of Artificial Intelligence, 5(1), 510–520. https://doi.org/10.47709/brilliance.v5i1.6323
