Human-Centric Video Summarization via Identity-Aware Tracking
DOI:
https://doi.org/10.17979/ja-cea.2025.46.12249
Keywords:
Computer vision, Human-machine interaction (HMI), Human-centric design, Deep learning, Artificial intelligence (AI)
Abstract
In this paper, we present an approach to video summarization that focuses on the presence and identity of people across video frames. The proposed framework combines pose landmarks, rich facial embeddings, and visual appearance features of the body to build a robust representation for each detected person. These features are clustered offline to enable consistent tracking of individuals throughout the video. Our method does not require labeled data, making it suitable for processing large-scale video collections without the need for annotations. By selecting representative frames in which key individuals appear most frequently, the system generates concise and identity-aware summaries that reflect the dynamics of human presence over time. We conducted experiments on diverse video sequences and achieved an average F1 score of 99.4% for consistent identity tracking. This person-centric strategy offers a scalable and generalizable solution for summarizing videos in domains where understanding human activity is essential.
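The pipeline described in the abstract (per-person feature vectors, offline clustering into identities, then selection of frames showing the most frequent individuals) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, the similarity measure, or the frame-selection rule, so everything below is an assumption made for illustration: a greedy cosine-similarity clustering stands in for the authors' method, feature vectors are plain lists of floats, and the names `cluster_identities`, `select_summary_frames`, `threshold`, and `top_k` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_identities(detections, threshold=0.8):
    """Assign an identity label to each (frame_index, feature_vector) pair.

    Assumption: a simple greedy scheme, not the paper's algorithm. Each
    detection joins the most similar existing identity if the similarity
    reaches `threshold`; otherwise it starts a new identity.
    """
    centroids, counts, labels = [], [], []
    for _, feat in detections:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(feat, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(feat))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Update the running mean of the matched identity.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], feat)]
            counts[best] = n + 1
        labels.append(best)
    return labels


def select_summary_frames(detections, labels, top_k=2):
    """Pick, for each of the `top_k` most frequent identities, one frame.

    Assumption: the chosen frame is the one showing the most key
    identities at once, with ties broken by the earliest frame.
    """
    key_ids = {ident for ident, _ in Counter(labels).most_common(top_k)}
    ids_per_frame = defaultdict(set)
    for (frame, _), label in zip(detections, labels):
        ids_per_frame[frame].add(label)
    chosen = set()
    for ident in key_ids:
        candidates = [f for f, ids in ids_per_frame.items() if ident in ids]
        chosen.add(max(candidates,
                       key=lambda f: (len(ids_per_frame[f] & key_ids), -f)))
    return sorted(chosen)


# Toy input: three people observed over four frames (made-up vectors).
detections = [
    (0, [1.0, 0.0, 0.0]), (0, [0.0, 1.0, 0.0]),
    (1, [0.98, 0.05, 0.0]),
    (2, [0.02, 0.99, 0.0]), (2, [0.0, 0.0, 1.0]),
    (3, [0.97, 0.0, 0.1]),
]
labels = cluster_identities(detections)
summary = select_summary_frames(detections, labels, top_k=2)
```

In this toy input, the two most frequent identities both appear in the first frame, so a single frame is enough to cover them. A real system would replace the made-up vectors with the combined pose, face, and body-appearance features the abstract mentions, and would likely need a more robust clustering step than this greedy pass.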
License
Copyright (c) 2025 Milad Mirjalili, Enrique Alegre Gutiérrez, Eduardo Fidalgo Fernández, Víctor González Castro, Waqar Tanveer

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.