Comparative Analysis of CNNs and Vision Transformers for Age Estimation

Authors

DOI:

https://doi.org/10.17979/ja-cea.2025.46.12251

Keywords:

Computer vision, CNNs, Vision transformers, VGG-16, ResNet-50, EfficientNet-B0, ViT, CaiT, Swin

Abstract

Vision Transformers have recently gained significant importance in computer vision tasks due to their self-attention mechanisms.
Previously, CNNs dominated the computer vision field, achieving remarkable results in applications such as image classification and object recognition. With the arrival of Vision Transformers, however, intense competition has emerged between the two architectures. This paper presents a comparative analysis of the performance of CNNs and Vision Transformers on the task of age estimation using the FG-NET and UTKFace datasets. We performed age estimation with six models: three CNNs (VGG-16, ResNet-50, EfficientNet-B0) and three Vision Transformers (ViT, CaiT, Swin). Our experimental results show that the Swin Transformer outperformed both the CNN models and the other Vision Transformers, achieving a mean absolute error (MAE) of 2.79 years on FG-NET and 4.37 years on UTKFace.

References

Agbo-Ajala, O., Viriri, S., 2021. Deep learning approach for facial age classification: a survey of the state-of-the-art. Artificial Intelligence Review 54 (1), 179–213.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations.

Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., Yang, Z., Zhang, Y., Tao, D., 2023. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (1), 87–110. DOI: 10.1109/TPAMI.2022.3152247

Hatamizadeh, A., Yin, H., Heinrich, G., Kautz, J., Molchanov, P., 2023. Global context vision transformers. In: Proceedings of the 40th International Conference on Machine Learning. ICML'23.

He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. DOI: 10.1109/CVPR.2016.90

Hiba, S., Keller, Y., 2023. Hierarchical attention-based age estimation and bias analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (12), 14682–14692. DOI: 10.1109/TPAMI.2023.3319472

King, D. E., 2009. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research 10, 1755–1758.

Kuang, H., Huang, X., Ma, X., Liu, X., 2023. Efficientrf: Facial age estimation based on efficientnet and random forest. In: 2023 IEEE 3rd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA). Vol. 3. pp. 196–200. DOI: 10.1109/ICIBA56860.2023.10165244

Kuprashevich, M., Tolstykh, I., 2023. Mivolo: Multi-input transformer for age and gender estimation. In: International Conference on Analysis of Images, Social Networks and Texts. pp. 212–226.

Lanitis, A., Taylor, C., Cootes, T., 2002. Toward automatic simulation of aging effects on face images. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (4), 442–455. DOI: 10.1109/34.993553

Li, X., Wang, L., Zhu, R., Ma, Z., Cao, J., Xue, J.-H., 2025. SRML: Structure-relation mutual learning network for few-shot image classification. Pattern Recognition 168, 111822. DOI: 10.1016/j.patcog.2025.111822

Liu, P., Qian, W., Huang, J., Tu, Y., Cheung, Y.-M., 2025. Transformer-driven feature fusion network and visual feature coding for multi-label image classification. Pattern Recognition 164, 111584. DOI: 10.1016/j.patcog.2025.111584

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9992–10002. DOI: 10.1109/ICCV48922.2021.00986

Maurício, J., Domingues, I., Bernardino, J., 2023. Comparing vision transformers and convolutional neural networks for image classification: A literature review. Applied Sciences 13 (9). DOI: 10.3390/app13095521

Moutik, O., Sekkat, H., Tigani, S., Chehri, A., Saadane, R., Tchakoucht, T. A., Paul, A., 2023. Convolutional neural networks or vision transformers: Who will win the race for action recognitions in visual data? Sensors 23 (2). DOI: 10.3390/s23020734

Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A., 2021. Do vision transformers see like convolutional neural networks? In: Advances in Neural Information Processing Systems.

Rothe, R., Timofte, R., Van Gool, L., 2018. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision 126 (2), 144–157.

Shi, C., Zhao, S., Zhang, K., Wang, Y., Liang, L., 2023. Face-based age estimation using improved swin transformer with attention-based convolution. Frontiers in Neuroscience 17. DOI: 10.3389/fnins.2023.1136934

Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Song, Y., Wang, F., 2024. Coreface: Sample-guided contrastive regularization for deep face recognition. Pattern Recognition 152, 110483. DOI: 10.1016/j.patcog.2024.110483

Takahashi, S., Sakaguchi, Y., Kouno, N., Takasawa, K., Ishizu, K., Akagi, Y., Aoyama, R., Teraya, N., Bolatkan, A., Shinkai, N., et al., 2024. Comparison of vision transformers and convolutional neural networks in medical image analysis: a systematic review. Journal of Medical Systems 48 (1), 84.

Tan, M., Le, Q., 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (Eds.), Proceedings of the 36th International Conference on Machine Learning. Vol. 97. pp. 6105–6114.

Tomasini, U. M., Petrini, L., Cagnetta, F., Wyart, M., 2023. How deep convolutional neural networks lose spatial information with training. Machine Learning: Science and Technology 4 (4), 045026.

Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H., 2021. Going deeper with image transformers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 32–42. DOI: 10.1109/ICCV48922.2021.00010

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems. Vol. 30.

Wang, X., Zhang, L. L., Wang, Y., Yang, M., 2022. Towards efficient vision transformer inference: a first study of transformers on mobile devices. In: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications. pp. 1–7. DOI: 10.1145/3508396.3512869

Xu, L., Hu, C., Shu, X., Yu, H., 2025. Cross spatial and cross-scale swin transformer for fine-grained age estimation. Computers and Electrical Engineering 123, 110264.

Yu, S., Zhao, Q., 2025. Improving age estimation in occluded facial images with knowledge distillation and layer-wise feature reconstruction. Applied Sciences 15 (11). DOI: 10.3390/app15115806

Zhang, Z., Song, Y., Qi, H., 2017. Age progression/regression by conditional adversarial autoencoder. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4352–4360. DOI: 10.1109/CVPR.2017.463

Zhao, Z., Qian, P., Hou, Y., Zeng, Z., 2022. Adaptive mean-residue loss for robust facial age estimation. In: 2022 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. DOI: 10.1109/ICME52920.2022.9859703

Published

2025-09-01

Section

Computer Vision (Visión por Computador)