Comparative Analysis of CNNs and Vision Transformers for Age Estimation

DOI:

https://doi.org/10.17979/ja-cea.2025.46.12251

Keywords:

Computer vision, CNNs, Vision transformers, VGG-16, ResNet-50, EfficientNet-B0, ViT, CaiT, Swin

Abstract

Vision transformers have recently gained significant importance in computer vision tasks owing to their self-attention mechanisms. Previously, CNNs dominated the field of computer vision, achieving remarkable results in applications such as image classification and object recognition, among others. With the advent of vision transformers, however, a strong rivalry has emerged between the two architectures. This article presents a comparative analysis of the performance of CNNs and vision transformers on the task of age estimation, using the FG-NET and UTKFace datasets. We perform age estimation with six models: three CNNs (VGG-16, ResNet-50, EfficientNet-B0) and three vision transformers (ViT, CaiT, Swin). Our experimental results show that the Swin transformer outperformed both the CNNs and the other vision transformers.
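Although the abstract does not specify the evaluation metric or publish code, comparisons of this kind are typically scored with the mean absolute error (MAE) between predicted and true ages. The sketch below illustrates that comparison; the model names come from the paper, but the prediction values are purely illustrative placeholders, not the paper's results:

```python
# Minimal sketch: ranking age-estimation models by mean absolute error (MAE).
# MAE is assumed here as the metric; the predictions are made-up examples.

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ages (years)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

true_ages = [25, 40, 8, 61, 33]

# Placeholder predictions for two of the six compared models.
predictions = {
    "ResNet-50": [28, 37, 10, 57, 35],
    "Swin":      [26, 39, 9, 60, 34],
}

scores = {name: mean_absolute_error(true_ages, preds)
          for name, preds in predictions.items()}
best = min(scores, key=scores.get)  # model with the lowest MAE
```

In this toy setup the Swin entry attains the lower MAE, mirroring the paper's finding that the Swin transformer performed best; with real data, each model's predictions would come from its trained network evaluated on FG-NET or UTKFace.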


Published

01-09-2025

Issue

Section

Computer Vision