A Deep Human Action Representation For Retrieval Application

Document Type: Research Paper

Authors

1 Department of Computer Science, University of Kurdistan, Sanandaj, Iran

2 Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran

3 Department of Electrical and Computer Engineering, Semnan University, Semnan, Iran

Abstract

Human action retrieval is a challenging research area with wide-ranging applications in surveillance, search engines, and human-computer interaction. Current methods represent actions and build their models from global and local features. Because these methods do not consider the semantics of actions when building the model, their final retrieval results suffer. In particular, an action is not treated as a sequence of sub-actions; its model is built from scattered local or global features. Furthermore, current action retrieval methods do not incorporate Convolutional Neural Networks (CNNs) in the representation procedure because too little training data is available to train them, even though CNNs could improve the final representation. In the present paper, we propose a CNN-based human action representation method for retrieval applications. In this method, the video is first segmented into sub-actions, and keyframes are extracted from the segments so that each action is represented by the sequence of its sub-actions. Then, the sequence of keyframes is given to a pre-trained CNN to extract deep spatial features of the action. Next, a 1D average pooling is designed to combine the sequence of spatial features and represent the temporal changes by a lower-dimensional vector. Finally, the Dynamic Time Warping technique is used to find the best match between the representation vectors of two videos. Experiments on real video datasets for both retrieval and recognition applications indicate that the resulting action models can outperform other representation methods.
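The last two steps of the pipeline (pooling and matching) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the CNN features have already been extracted as one plain list of floats per keyframe, that the 1D average pooling uses non-overlapping windows along each keyframe's feature vector, and that Dynamic Time Warping uses Euclidean distance between keyframes. The function names (average_pool_1d, represent, dtw_distance) and the window parameter are hypothetical and do not come from the paper.

```python
from math import sqrt


def average_pool_1d(vector, window):
    """Non-overlapping 1D average pooling over one feature vector.

    Assumption: a trailing partial window is averaged over the values it has.
    """
    pooled = []
    for start in range(0, len(vector), window):
        chunk = vector[start:start + window]
        pooled.append(sum(chunk) / len(chunk))
    return pooled


def represent(keyframe_features, window):
    """Pool every keyframe's feature vector, keeping the keyframe order."""
    return [average_pool_1d(features, window) for features in keyframe_features]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Classic dynamic-programming DTW cost between two vector sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # seq_b[j-1] also matches earlier seq_a items
                cost[i][j - 1],      # seq_a[i-1] also matches earlier seq_b items
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Toy stand-ins for CNN features: 2 and 3 keyframes, 4 values each.
query = represent([[0.0, 2.0, 4.0, 6.0], [1.0, 1.0, 3.0, 3.0]], window=2)
candidate = represent(
    [[0.0, 2.0, 4.0, 6.0], [0.0, 2.0, 4.0, 6.0], [1.0, 1.0, 3.0, 3.0]],
    window=2,
)
score = dtw_distance(query, candidate)  # lower means a better match
```

In a retrieval setting, the query video would be compared against every database video in this way and the results ranked by ascending cost. Because DTW tolerates sequences of different lengths, videos with different numbers of keyframes can still be compared.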
