Human-Centric Video Summarization via Identity-Aware Tracking

DOI:

https://doi.org/10.17979/ja-cea.2025.46.12249

Keywords:

Computer vision, Human-machine interaction (HMI), Human-centric design, Deep learning, Artificial intelligence (AI)

Abstract

This paper presents a video summarization approach centered on the presence and identity of people across video frames. The proposed framework combines pose landmarks, facial embeddings, and visual appearance features of the body to build a robust representation of each detected person. These features are clustered offline so that individuals can be tracked consistently throughout the video. The method requires no labeled data, which makes it suitable for processing large-scale video collections. By selecting representative frames in which key individuals appear most frequently, the system generates concise, identity-aware summaries that reflect how human presence changes over time. In experiments on diverse video sequences, the method achieved an average F1 score of 99.4% for consistent identity tracking. This person-centric strategy offers a scalable and generalizable solution for summarizing videos in domains where understanding human activity is essential.
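The pipeline described in the abstract (fuse per-person features, cluster them offline into identities, then pick frames where the key identities appear) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not name the clustering algorithm, so a small DBSCAN-style density clustering over cosine distance is used, and the function names (`fuse`, `dbscan`, `summarize`), the parameters (`eps`, `min_samples`, `top_identities`, `num_frames`), and the frame-ranking rule are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(pose, face, body):
    """Assumed fusion rule: normalize each modality, concatenate, renormalize."""
    return l2_normalize(l2_normalize(pose) + l2_normalize(face) + l2_normalize(body))


def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def dbscan(points, eps, min_samples):
    """Density-based clustering; returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if cosine_distance(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
        cluster += 1
    return labels


def summarize(detections, eps=0.1, min_samples=2, top_identities=2, num_frames=2):
    """detections: list of (frame_index, fused_feature), one per detected person.

    Returns (identity label per detection, sorted indices of the selected frames).
    """
    feats = [l2_normalize(f) for _, f in detections]
    labels = dbscan(feats, eps, min_samples)

    # Key individuals: the identities with the most detections (noise excluded).
    counts = Counter(label for label in labels if label != -1)
    key_ids = {label for label, _ in counts.most_common(top_identities)}

    # Assumed ranking rule: prefer frames that show more key individuals.
    per_frame = defaultdict(set)
    for (frame, _), label in zip(detections, labels):
        if label in key_ids:
            per_frame[frame].add(label)
    ranked = sorted(per_frame, key=lambda f: (-len(per_frame[f]), f))
    return labels, sorted(ranked[:num_frames])
```

As a usage note, `summarize` expects one `(frame_index, feature)` pair per detected person, where the feature could come from `fuse(pose, face, body)`. Because clustering runs offline over all detections at once, identities are assigned globally rather than frame by frame, and no labels are needed. The `eps` threshold would have to be tuned for real embeddings.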

Published

2025-09-01

Section

Computer Vision