Abstract:
Video object segmentation is the task of separating foreground object segments from the background throughout a video. We propose a frame-by-frame approach for video object segmentation that uses cluster information to select foreground segments. Unlike previous approaches that use optical flow to localize dynamic object segments throughout the video, we select a set of foreground segments from a pool of region proposals through clustering. This avoids optical flow and thus helps our algorithm scale to longer video sequences. Object localization is the task of estimating precise localized windows around all object instances in an image. We propose an algorithm for object localization under the assumption that a single object instance appears in the image. Unlike previous supervised and weakly supervised techniques that require heavy training to learn classifiers, our approach is completely unsupervised. It relies on iterative spectral clustering to select proposals that contain an object from a large set of proposals produced by an object proposal generation algorithm. From this set of filtered object proposals, we then estimate the final localized window by considering the inter- and intra-class variations among the object proposals, so the entire algorithm remains unsupervised. We also consider the design of a fully automated action recognition system for uncontrolled environments. Most existing algorithms construct handcrafted features from the input and then learn classifiers on those features. However, such handcrafted features are inefficient at modelling more complex scenes. Convolutional neural networks (CNNs) are a class of deep learning models that learn features automatically from the input during training. We design a 3D convolutional neural network for human action recognition.
This model extracts features in the spatio-temporal domain and thereby captures the motion information encoded in multiple contiguous frames, on which video processing applications rely.
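To illustrate why a 3D convolution can capture motion across contiguous frames, the following is a minimal NumPy sketch rather than the network described above. The function name `conv3d_valid`, the hand-built temporal-difference kernel, and the toy clips are all assumptions made for illustration; in a trained 3D CNN the kernels would be learned, not hand-designed.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation of a (T, H, W) clip with a (kt, kh, kw) kernel."""
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # The window spans kt contiguous frames, so the response
                # depends on how the patch changes over time.
                out[t, y, x] = np.sum(clip[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out

# Hypothetical kernel: mean of a 3x3 patch in the later frame
# minus the mean of the same patch in the earlier frame.
kernel = np.zeros((2, 3, 3))
kernel[0] = -1.0 / 9.0
kernel[1] = 1.0 / 9.0

# Toy static clip: a bright square that stays in place for 4 frames.
static = np.zeros((4, 8, 8))
static[:, 2:5, 2:5] = 1.0

# Toy moving clip: the same square shifts right by one pixel per frame.
moving = np.zeros((4, 8, 8))
for t in range(4):
    moving[t, 2:5, 1 + t:4 + t] = 1.0

static_response = conv3d_valid(static, kernel)
moving_response = conv3d_valid(moving, kernel)

print("max |response| on static clip:", np.abs(static_response).max())
print("max |response| on moving clip:", np.abs(moving_response).max())
```

A purely 2D kernel applied frame by frame would respond identically to every frame of the static clip and could not distinguish it from the moving one by motion alone. Extending the kernel along the time axis is what lets the response depend on frame-to-frame change.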