
Deep-Learning-Team-Project

(This README is largely adapted from our presentation poster; you can also refer to the poster directly.)

Abstract

Gesture recognition is popular in smart TV applications. In this work, we train several deep learning models, including CNN+RNN, Conv3D, and Temporal Convolution architectures, on a video dataset containing five gestures for operating a TV. So far, Conv3D has achieved the best training results, which lays a solid foundation for exploring more advanced models and pursuing better results in the future.

Background & Motivation

Gesture recognition has become an important component of human-computer interaction, especially in smart home applications. For smart TVs, hand gesture recognition allows users to issue commands without physical contact, improving convenience and reducing reliance on traditional remote controls. Prior research has already explored gesture recognition with different models; for example, hybrid CNN-RNN models applied to gesture recognition with EMG signals showed robust performance and scalability. In our project, we want to explore how to improve gesture recognition accuracy using different model configurations. To address these challenges, our project compares a hybrid CNN+RNN model, a Conv3D model, and a Temporal Convolution model to find the most effective approach for smart TV gesture recognition.

Dataset

The dataset contains videos, each belonging to one of five classes: Stop, Right swipe, Left swipe, Thumbs down, and Thumbs up. Each video is a sequence of 30 frames, and the frames come in two resolutions: 360x360 and 120x160.
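Because the two resolutions must be unified before training, one straightforward preprocessing step is to resize every frame to a common shape and normalize pixel values. The sketch below illustrates this idea only; the folder layout (one directory of frame images per video), the `load_video` name, and the 120x120 target size are assumptions for illustration, not the project's actual preprocessing code.

```python
import os
import numpy as np
import cv2  # pip install opencv-python

NUM_FRAMES = 30            # each video in the dataset is a 30-frame sequence
TARGET_SIZE = (120, 120)   # assumed common size for both 360x360 and 120x160 frames
NUM_CLASSES = 5            # Stop, Right swipe, Left swipe, Thumbs down, Thumbs up

def load_video(video_dir):
    """Load one video (assumed to be a folder of 30 frame images) as a (30, 120, 120, 3) array."""
    frames = []
    for fname in sorted(os.listdir(video_dir))[:NUM_FRAMES]:
        img = cv2.imread(os.path.join(video_dir, fname))       # BGR frame, uint8
        img = cv2.resize(img, TARGET_SIZE)                      # unify the two resolutions
        frames.append(img.astype("float32") / 255.0)            # scale pixels to [0, 1]
    return np.stack(frames)                                     # shape: (30, 120, 120, 3)
```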

Methods

Different methods were tested by different team members.

You can check the detailed code in each branch.

Haotao: RNN model

Zhen Xu: CNN + RNN model

Siyi: CNN + RNN model

Zihang: Conv3D, 2D CNN + Conv1D model

Method 1

Method 1 uses a hybrid neural network combining a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN), implemented with the Keras framework.
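A minimal Keras sketch of this kind of hybrid model is shown below. The CNN is applied to each frame and the RNN aggregates the per-frame features over time; the specific layer counts, filter sizes, and GRU width here are illustrative assumptions, not the exact configuration in the branch code.

```python
from tensorflow.keras import layers, models

def build_cnn_rnn(num_frames=30, height=120, width=120, channels=3, num_classes=5):
    model = models.Sequential([
        layers.Input(shape=(num_frames, height, width, channels)),
        # CNN applied to every frame independently via TimeDistributed
        layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        layers.TimeDistributed(layers.Conv2D(32, (3, 3), activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        layers.TimeDistributed(layers.Flatten()),
        # RNN aggregates the per-frame feature vectors over time
        layers.GRU(64),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```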

Method 2

Method 2 leverages 3D convolution to extract spatial and temporal features. By extending traditional 2D convolutions into the temporal domain, 3D convolution is well suited to analyzing spatiotemporal patterns in image sequences.
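A hedged Keras sketch of such a Conv3D network follows. The number of Conv3D layers, filter counts, and the dropout rate are assumptions chosen to echo the better-performing configurations in the results table, not the exact architecture used in the branch code.

```python
from tensorflow.keras import layers, models

def build_conv3d(num_frames=30, height=120, width=120, channels=3, num_classes=5):
    model = models.Sequential([
        layers.Input(shape=(num_frames, height, width, channels)),
        # 3D convolutions slide over the time axis as well as the spatial axes
        layers.Conv3D(16, (3, 3, 3), activation="relu", padding="same"),
        layers.MaxPooling3D((1, 2, 2)),
        layers.Conv3D(32, (3, 3, 3), activation="relu", padding="same"),
        layers.MaxPooling3D((2, 2, 2)),
        layers.Conv3D(64, (3, 3, 3), activation="relu", padding="same"),
        layers.MaxPooling3D((2, 2, 2)),
        layers.Conv3D(128, (3, 3, 3), activation="relu", padding="same"),
        layers.GlobalAveragePooling3D(),
        layers.Dropout(0.5),  # dropout appears in the best-performing configs in the results table
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```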

Method 3

Method 3 integrates 2D and 1D convolution layers (see the sketch after this list):

  • 2D convolutions are used to capture spatial features from the input frames.
  • 1D convolutions, inspired by Temporal Convolutional Networks, are used to model temporal relationships across frames.
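The sketch below shows the idea in Keras: a per-frame 2D CNN produces one feature vector per frame, and a 1D convolution over the frame axis models the temporal structure. Note that the results table pairs a ResNet-18 backbone with a single 1D conv layer; for brevity this sketch substitutes a small TimeDistributed 2D CNN, so the backbone and layer sizes are simplifying assumptions.

```python
from tensorflow.keras import layers, models

def build_cnn2d_conv1d(num_frames=30, height=120, width=120, channels=3, num_classes=5):
    model = models.Sequential([
        layers.Input(shape=(num_frames, height, width, channels)),
        # 2D convolutions extract spatial features from each frame
        layers.TimeDistributed(layers.Conv2D(32, (3, 3), activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        layers.TimeDistributed(layers.Conv2D(64, (3, 3), activation="relu")),
        layers.TimeDistributed(layers.GlobalAveragePooling2D()),  # one feature vector per frame
        # 1D convolution over the frame axis models temporal relationships
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```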

Results

| Index | Model Config | Accuracy |
| --- | --- | --- |
| 1 | 1 CNN layer + RNN, epochs = 5 | 0.24 |
| 2 | 1 CNN layer + RNN, epochs = 40 | 0.30 |
| 3 | 2 CNN layers + RNN, epochs = 10 | 0.35 |
| 4 | 3 CNN layers + RNN, epochs = 10 | 0.42 |
| 5 | 4 CNN layers + RNN, epochs = 10 | 0.45 |
| 6 | 4 CNN layers + RNN, epochs = 40 | 0.47 |
| 7 | Conv3D, 3 layers, epochs = 10 | 0.63 |
| 8 | Conv3D, 4 layers, epochs = 10 | 0.75 |
| 9 | Conv3D, 4 layers, epochs = 20 | 0.83 |
| 10 | Conv3D, 4 layers, epochs = 20, dropout | 0.85 |
| 11 | Conv3D, 5 layers, epochs = 20, dropout | 0.92 |
| 12 | ResNet-18 + 1-layer 1D conv, epochs = 10 | 0.74 |
| 13 | ResNet-18 + 1-layer 1D conv, epochs = 20 | 0.87 |

Currently, Conv3D has the best performance at 92% accuracy. Method 3 (2D CNN + 1D Conv) also achieves relatively high accuracy, while Method 1's performance remains comparatively low.

Conclusion and Future Work

Overall, Method 2 achieves the best training results, Method 3 ranks second, and Method 1 ranks third. In future work, we hope to apply more advanced models to our dataset and to combine the models we have used into a single hybrid model, aiming for better results than Conv3D.

References

Gesture Recognition Dataset: https://www.kaggle.com/datasets/abhishek14398/gesture-recognition-dataset/data

"Gesture Recognition with Hybrid Models." PLOS ONE, 2024, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0264543.

Sapiński, Tomasz, et al. "Hybrid Deep Learning Models for Hand Gesture Recognition with EMG Signals." IEEE Xplore, 2024, https://ieeexplore.ieee.org/document/10582166.

Tran, Du, et al. "Learning spatiotemporal features with 3d convolutional networks." Proceedings of the IEEE international conference on computer vision. 2015.

Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling." arXiv preprint arXiv:1803.01271 (2018).
