Comparative Analysis of CNNs and Vision Transformers for Age Estimation
DOI: https://doi.org/10.17979/ja-cea.2025.46.12251

Keywords: Computer vision, CNNs, Vision transformers, VGG-16, ResNet-50, EfficientNet-B0, ViT, CaiT, Swin

Abstract
Vision transformers have recently gained significant importance in computer vision tasks owing to their self-attention mechanisms. Previously, CNNs dominated the field of computer vision, achieving remarkable results in applications such as image classification and object recognition, among others. With the advent of vision transformers, however, intense competition has emerged between the two. This paper presents a comparative analysis of the performance of CNNs and vision transformers on the task of age estimation, using the FG-NET and UTKFace datasets. We perform age estimation with six models: three CNNs (VGG-16, ResNet-50, EfficientNet-B0) and three vision transformers (ViT, CaiT, Swin). Our experimental results show that the Swin transformer outperformed both the CNNs and the other vision transformers.
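The self-attention mechanism that the abstract credits for the rise of vision transformers (Vaswani et al., 2017) can be illustrated with a minimal toy sketch. The token count, embedding size, and random weights below are illustrative only and do not reflect the configuration of any model in the study:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention (Vaswani et al., 2017): each
    # token attends to every other token, giving the global receptive
    # field that distinguishes vision transformers from the local
    # convolutional receptive fields of CNNs.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (tokens, tokens) affinities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V

# Toy input: 4 patch tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

In a ViT, the tokens would be linear projections of image patches; Swin restricts this attention to shifted local windows, which is part of why it scales well to vision tasks.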
References
Agbo-Ajala, O., Viriri, S., 2021. Deep learning approach for facial age classification: a survey of the state-of-the-art. Artificial Intelligence Review 54 (1), 179–213.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations.
Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., Yang, Z., Zhang, Y., Tao, D., 2023. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (1), 87–110. DOI: 10.1109/TPAMI.2022.3152247
Hatamizadeh, A., Yin, H., Heinrich, G., Kautz, J., Molchanov, P., 2023. Global context vision transformers. In: Proceedings of the 40th International Conference on Machine Learning. ICML'23.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. DOI: 10.1109/CVPR.2016.90
Hiba, S., Keller, Y., 2023. Hierarchical attention-based age estimation and bias analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (12), 14682–14692. DOI: 10.1109/TPAMI.2023.3319472
King, D. E., 2009. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research 10, 1755–1758.
Kuang, H., Huang, X., Ma, X., Liu, X., 2023. EfficientRF: Facial age estimation based on EfficientNet and random forest. In: 2023 IEEE 3rd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA). Vol. 3. pp. 196–200. DOI: 10.1109/ICIBA56860.2023.10165244
Kuprashevich, M., Tolstykh, I., 2023. MiVOLO: Multi-input transformer for age and gender estimation. In: International Conference on Analysis of Images, Social Networks and Texts. pp. 212–226.
Lanitis, A., Taylor, C., Cootes, T., 2002. Toward automatic simulation of aging effects on face images. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (4), 442–455. DOI: 10.1109/34.993553
Li, X., Wang, L., Zhu, R., Ma, Z., Cao, J., Xue, J.-H., 2025. SRML: Structure-relation mutual learning network for few-shot image classification. Pattern Recognition 168, 111822. DOI: 10.1016/j.patcog.2025.111822
Liu, P., Qian, W., Huang, J., Tu, Y., Cheung, Y.-M., 2025. Transformer-driven feature fusion network and visual feature coding for multi-label image classification. Pattern Recognition 164, 111584. DOI: 10.1016/j.patcog.2025.111584
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9992–10002. DOI: 10.1109/ICCV48922.2021.00986
Maurício, J., Domingues, I., Bernardino, J., 2023. Comparing vision transformers and convolutional neural networks for image classification: A literature review. Applied Sciences 13 (9). DOI: 10.3390/app13095521
Moutik, O., Sekkat, H., Tigani, S., Chehri, A., Saadane, R., Tchakoucht, T. A., Paul, A., 2023. Convolutional neural networks or vision transformers: Who will win the race for action recognitions in visual data? Sensors 23 (2). DOI: 10.3390/s23020734
Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A., 2021. Do vision transformers see like convolutional neural networks? In: Advances in Neural Information Processing Systems.
Rothe, R., Timofte, R., Van Gool, L., 2018. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision 126 (2), 144–157.
Shi, C., Zhao, S., Zhang, K., Wang, Y., Liang, L., 2023. Face-based age estimation using improved swin transformer with attention-based convolution. Frontiers in Neuroscience 17. DOI: 10.3389/fnins.2023.1136934
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Song, Y., Wang, F., 2024. CoReFace: Sample-guided contrastive regularization for deep face recognition. Pattern Recognition 152, 110483. DOI: 10.1016/j.patcog.2024.110483
Takahashi, S., Sakaguchi, Y., Kouno, N., Takasawa, K., Ishizu, K., Akagi, Y., Aoyama, R., Teraya, N., Bolatkan, A., Shinkai, N., et al., 2024. Comparison of vision transformers and convolutional neural networks in medical image analysis: a systematic review. Journal of Medical Systems 48 (1), 84.
Tan, M., Le, Q., 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (Eds.), Proceedings of the 36th International Conference on Machine Learning. Vol. 97. pp. 6105–6114.
Tomasini, U. M., Petrini, L., Cagnetta, F., Wyart, M., 2023. How deep convolutional neural networks lose spatial information with training. Machine Learning: Science and Technology 4 (4), 045026.
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H., 2021. Going deeper with image transformers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 32–42. DOI: 10.1109/ICCV48922.2021.00010
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems. Vol. 30.
Wang, X., Zhang, L. L., Wang, Y., Yang, M., 2022. Towards efficient vision transformer inference: a first study of transformers on mobile devices. In: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications. p. 1–7. DOI: 10.1145/3508396.3512869
Xu, L., Hu, C., Shu, X., Yu, H., 2025. Cross spatial and cross-scale swin transformer for fine-grained age estimation. Computers and Electrical Engineering 123, 110264.
Yu, S., Zhao, Q., 2025. Improving age estimation in occluded facial images with knowledge distillation and layer-wise feature reconstruction. Applied Sciences 15 (11). DOI: 10.3390/app15115806
Zhang, Z., Song, Y., Qi, H., 2017. Age progression/regression by conditional adversarial autoencoder. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4352–4360. DOI: 10.1109/CVPR.2017.463
Zhao, Z., Qian, P., Hou, Y., Zeng, Z., 2022. Adaptive mean-residue loss for robust facial age estimation. In: 2022 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. DOI: 10.1109/ICME52920.2022.9859703
License
Copyright 2025 Waqar Tanveer, Laura Fernández-Robles, Eduardo Fidalgo, Víctor González-Castro, Enrique Alegre, Milad Mirjalili

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.