The availability of low-cost, high-quality personal wearable cameras, combined with the unlimited storage capacity of video-sharing websites, has evoked a growing interest in First-Person Videos (FPVs). Such videos are usually composed of long-running unedited streams captured by a device attached to the user's body, which makes them tedious and visually unpleasant to watch. Consequently, there is a rising need to provide quick access to the information therein. To address this need, techniques such as Hyperlapse and Semantic Hyperlapse have been developed, which aim to create visually pleasant shorter videos and to emphasize semantically relevant portions of the video, respectively.

Large-scale labeled data are generally required to train deep neural networks to obtain good performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then, the common deep neural network architectures used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used datasets for images, videos, audio, and 3D data, as well as the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning. The paper concludes with a set of promising future directions for self-supervised visual feature learning.

We wish to automatically predict the "speediness" of moving objects in videos: whether they move faster than, at, or slower than their "natural" speed. The core component of our approach is SpeedNet, a novel deep network trained to detect whether a video is playing at its normal rate or is sped up. SpeedNet is trained on a large corpus of natural videos in a self-supervised manner, without requiring any manual annotations. We show how this single binary classification network can be used to detect arbitrary rates of speediness of objects. We demonstrate prediction results by SpeedNet on a wide range of videos containing complex natural motions, and examine the visual cues it utilizes for making those predictions. Importantly, we show that through predicting the speed of videos, the model learns a powerful and meaningful space-time representation that goes beyond simple motion cues. We demonstrate how these learned features can boost the performance of self-supervised action recognition and can be used for video retrieval. Furthermore, we apply SpeedNet to generate time-varying, adaptive video speedups, allowing viewers to watch videos faster but with less of the jittery, unnatural motion typical of videos that are sped up uniformly.
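The key idea behind this kind of self-supervised training can be illustrated with a small sketch. SpeedNet itself is a spatio-temporal deep network trained on large video corpora; the hypothetical helper below (names and the toy NumPy "video" are illustrative, not from the paper) only shows how the binary speediness labels come for free from the frame-sampling procedure, so no human annotation is needed:

```python
import numpy as np

def sample_clip(video, num_frames=16, speed_up=False):
    """Sample a training clip and its self-supervised label.

    If speed_up is True, frames are taken at 2x stride (every other
    frame), simulating a sped-up video, and the label is 1. Otherwise
    consecutive frames are taken and the label is 0. The label is
    determined entirely by how the clip was sampled.
    """
    stride = 2 if speed_up else 1
    span = num_frames * stride          # frames consumed from the source
    start = np.random.randint(0, len(video) - span + 1)
    clip = video[start:start + span:stride]
    label = int(speed_up)
    return clip, label

# toy "video": 100 frames of 8x8 grayscale
video = np.random.rand(100, 8, 8)

clip_normal, y0 = sample_clip(video, speed_up=False)  # label 0
clip_fast, y1 = sample_clip(video, speed_up=True)     # label 1
```

A classifier trained on such (clip, label) pairs must pick up on motion and appearance cues to tell natural from accelerated footage, which is what makes the learned features transferable to downstream tasks such as action recognition and retrieval.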