Images might take some time to load
Models and weights were not uploaded because the files were too large
The project can be divided into four sections:
Player Detection; Court Detection and Tracking; Team Classification and Player Identification; and Player Tracking.
Player detection was achieved by using a YOLOv4 model and training it to detect basketball players.
Since gathering data is tedious and there were no appropriate public datasets, some automation tools were created to help with this process.
The model was trained starting from pre-trained weights, a technique known as transfer learning.
The final trained model takes a single frame of a video as input and outputs bounding-box coordinates which can be used to draw boxes around detections.
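As a sketch of the post-processing step, the snippet below converts YOLO-style normalized (cx, cy, w, h) detections into pixel-space corner boxes ready for drawing. The function name is illustrative, not from this repo, and the detector itself is assumed to run elsewhere.

```python
def to_corner_boxes(detections, frame_w, frame_h):
    """detections: list of (cx, cy, w, h) tuples in [0, 1], relative to
    the frame. Returns (x1, y1, x2, y2) integer pixel corner boxes."""
    boxes = []
    for cx, cy, w, h in detections:
        x1 = int((cx - w / 2) * frame_w)
        y1 = int((cy - h / 2) * frame_h)
        x2 = int((cx + w / 2) * frame_w)
        y2 = int((cy + h / 2) * frame_h)
        boxes.append((x1, y1, x2, y2))
    return boxes
```

Each corner box can then be drawn directly on the frame, e.g. with `cv2.rectangle`.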
The court detection system was built so that we could filter detections made outside the basketball court.
We project a 2D reference court onto the real court in the frame using an inverse homography.
This allows us to build a mask that zeroes out all pixel values outside the court, so that detections made there can be identified and filtered out.
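A minimal numpy-only sketch of the filtering idea, assuming a 3x3 homography `H` is already available from the court detection step (all names here are illustrative, not the project's actual code):

```python
import numpy as np

def project_points(H, pts):
    """Apply a 3x3 homography H to an Nx2 array of points
    (homogeneous multiply followed by perspective divide)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = (H @ pts_h.T).T
    return out[:, :2] / out[:, 2:3]

def inside_polygon(point, poly):
    """Ray-casting point-in-polygon test against a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

Projecting the reference court's four corners with `project_points` gives the court quadrilateral in the frame; a detection can then be kept or dropped by testing the midpoint of its box's bottom edge with `inside_polygon`.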
Team classification is performed because it greatly simplifies player identification and player tracking, as the number of targets is reduced by half. It also allows us to filter out referees.
Since players on the same team wear uniforms of the same colour, we use RGB colour histograms to characterize them. For every image, a 10-bin histogram is computed for each of the three RGB channels, and the three histograms are concatenated into a 30-bin colour histogram.
Every image is therefore represented by a 30-dimensional feature vector of non-negative values. These vectors are then used to train a logistic regression classifier.
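A minimal sketch of this histogram feature, assuming player crops arrive as HxWx3 uint8 arrays (the function name is illustrative):

```python
import numpy as np

def rgb_histogram_features(image, bins=10):
    """image: HxWx3 uint8 array. Computes a `bins`-bin histogram per
    channel, normalizes each to sum to 1, and concatenates them into
    a single 3 * bins (here 30-dimensional) feature vector."""
    feats = []
    for c in range(3):
        hist, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)
```

The resulting vectors could then be fed to any off-the-shelf logistic regression implementation, e.g. scikit-learn's `LogisticRegression`.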
For player identification, some experimentation was done to decide on which features to use to characterize the images.
Trained identification models performed best when images were represented using a combination of MSER, SIFT, and RGB features.
MSER stands for maximally stable extremal regions: regions of an image that remain stable across a wide range of thresholds. The MSER representation of an image is a 300-dimensional binary vector.
SIFT stands for scale-invariant feature transform; SIFT features are keypoints of an object that are invariant to image scale and rotation. They are also robust to changes in illumination, noise, and minor changes in viewpoint, making them excellent features to extract. The SIFT representation of an image is a 500-dimensional binary vector.
RGB features were represented in the same way as before, with a 30-dimensional vector.
All these vectors are combined to form an 830-dimensional feature vector, where the first 800 dimensions are binary, and the last 30 dimensions contain positive values.
These were used to train two logistic regression classifiers, one for each team.
Each classifier takes an image’s feature vector as input and outputs the predicted player identity, which is then used to annotate the bounding boxes accordingly.
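The feature assembly and per-team prediction described above could be sketched as follows. The weights and every name here are placeholders for illustration, not the trained models from this project:

```python
import numpy as np

def identity_features(mser_vec, sift_vec, rgb_vec):
    """Concatenate the 300-d MSER, 500-d SIFT, and 30-d RGB vectors
    into the 830-dimensional identity feature."""
    assert mser_vec.shape == (300,) and sift_vec.shape == (500,)
    assert rgb_vec.shape == (30,)
    return np.concatenate([mser_vec, sift_vec, rgb_vec])

def predict_identity(feature_vec, team, classifiers):
    """classifiers: dict mapping team -> (W, b), with W of shape
    (n_players, 830). Returns the most probable player index; for
    logistic regression the argmax of the logits suffices."""
    W, b = classifiers[team]
    logits = W @ feature_vec + b
    return int(np.argmax(logits))
```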
Player tracking was achieved using DeepSORT, an extension of SORT, an object-tracking algorithm that follows the tracking-by-detection paradigm: objects are detected first and then assigned to tracks.
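The core association step of tracking by detection can be sketched with a greedy IoU matcher. SORT proper also runs a Kalman filter for motion prediction and solves the assignment with the Hungarian algorithm, so this is a deliberate simplification with illustrative names:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, min_iou=0.3):
    """Greedily match each track box to its best-overlapping unused
    detection. Returns {track_index: detection_index}."""
    matches, used = {}, set()
    for ti, t in enumerate(tracks):
        best, best_iou = None, min_iou
        for di, d in enumerate(detections):
            if di in used:
                continue
            score = iou(t, d)
            if score > best_iou:
                best, best_iou = di, score
        if best is not None:
            matches[ti] = best
            used.add(best)
    return matches
```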
On top of this, DeepSORT introduces an appearance-based distance metric: each detection is described by an appearance feature vector (an embedding), and detections are matched to tracks partly by how similar these embeddings are.
To obtain these embeddings, a convolutional neural network classifier is trained and its final classification layer is removed.
This leaves a dense layer whose output is the feature vector for each image.
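A toy numpy sketch of the idea, using a single dense layer in place of the full CNN (all names are illustrative): the penultimate activation becomes the appearance embedding, and embeddings are compared with cosine distance.

```python
import numpy as np

def embed(x, W1, b1):
    """Penultimate dense layer only; the softmax classification head
    used during training is discarded at inference time."""
    h = np.maximum(0, W1 @ x + b1)          # ReLU dense layer
    return h / (np.linalg.norm(h) + 1e-8)   # L2-normalize the embedding

def appearance_distance(e1, e2):
    """Cosine distance between two L2-normalized embeddings:
    ~0 for very similar appearances, up to 2 for opposite ones."""
    return 1.0 - float(e1 @ e2)
```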
Our custom feature extractor used in our DeepSORT model was trained on a dataset of basketball players.
Our final DeepSORT model outputs bounding-box coordinates along with the track ID associated with each box, which can be used to draw boxes around detections and annotate them appropriately.