Depthwise spatio-temporal STFT convolutional neural networks for human action recognition


dc.contributor.author Kumawat, Sudhakar
dc.contributor.author Verma, Manisha
dc.contributor.author Nakashima, Yuta
dc.contributor.author Raman, Shanmuganathan
dc.coverage.spatial United States of America
dc.date.accessioned 2021-05-14T05:18:43Z
dc.date.available 2021-05-14T05:18:43Z
dc.date.issued 2022-09
dc.identifier.citation Kumawat, Sudhakar; Verma, Manisha; Nakashima, Yuta and Raman, Shanmuganathan, “Depthwise spatio-temporal STFT convolutional neural networks for human action recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2021.3076522, vol. 44, no. 9, pp. 4839-4851, Sep. 2022. en_US
dc.identifier.issn 0162-8828
dc.identifier.issn 1939-3539
dc.identifier.uri https://doi.org/10.1109/TPAMI.2021.3076522
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/6434
dc.description.abstract Conventional 3D convolutional neural networks (CNNs) are computationally expensive, memory-intensive, prone to overfitting, and, most importantly, in need of improved feature learning capabilities. To address these issues, we propose spatio-temporal short-term Fourier transform (STFT) blocks, a new class of convolutional blocks that can serve as an alternative to the 3D convolutional layer and its variants in 3D CNNs. An STFT block consists of non-trainable convolution layers that capture spatially and/or temporally local Fourier information using an STFT kernel at multiple low-frequency points, followed by a set of trainable linear weights that learn channel correlations. STFT blocks significantly reduce the space-time complexity of 3D CNNs: in general, they use 3.5 to 4.5 times fewer parameters and incur 1.5 to 1.8 times lower computational cost than the state-of-the-art methods. Furthermore, their feature learning capabilities are significantly better than those of the conventional 3D convolutional layer and its variants. Our extensive evaluation on seven action recognition datasets, namely Something-Something v1 and v2, Jester, Diving-48, Kinetics-400, UCF101, and HMDB51, demonstrates that 3D CNNs based on STFT blocks achieve performance on par with or better than the state-of-the-art methods. en_US
dc.description.statementofresponsibility by Sudhakar Kumawat, Manisha Verma, Yuta Nakashima and Shanmuganathan Raman
dc.format.extent vol. 44, no. 9, pp. 4839-4851
dc.language.iso en_US en_US
dc.publisher Institute of Electrical and Electronics Engineers en_US
dc.subject Short-term Fourier transform en_US
dc.subject 3D convolutional networks en_US
dc.subject Human action recognition en_US
dc.title Depthwise spatio-temporal STFT convolutional neural networks for human action recognition en_US
dc.type Article en_US
dc.relation.journal IEEE Transactions on Pattern Analysis and Machine Intelligence
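The abstract describes STFT blocks as non-trainable convolution filters given by the STFT basis at a few low-frequency points, applied depthwise, with only the channel-mixing weights learned. The following is a minimal illustrative sketch of that idea in one temporal dimension, not the authors' implementation: the function names, the magnitude readout, and the stride-1/no-padding choices are assumptions made for illustration.

```python
import cmath

def stft_filters(window, freqs):
    # Non-trainable STFT kernels: complex exponentials sampled at the
    # chosen low-frequency points. In an STFT block only the subsequent
    # channel-mixing (pointwise) weights are trainable, never these.
    return [[cmath.exp(-2j * cmath.pi * f * t / window) for t in range(window)]
            for f in freqs]

def depthwise_stft(signal, window, freqs):
    # Slide each fixed filter over the signal depthwise (no cross-channel
    # mixing), stride 1, no padding; emit the magnitude of each local
    # Fourier coefficient as one feature map per frequency.
    filters = stft_filters(window, freqs)
    out = []
    for filt in filters:
        row = []
        for start in range(len(signal) - window + 1):
            coeff = sum(signal[start + t] * filt[t] for t in range(window))
            row.append(abs(coeff))
        out.append(row)
    return out
```

For a constant signal the DC filter (frequency 0) responds with the window sum, while the frequency-1 filter responds with (numerically) zero, which illustrates why a handful of low-frequency points already captures smooth local structure.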

