Commit

Initial commit

vecxoz committed Jun 6, 2022
1 parent 3079bdc commit b4d3c7b
Showing 14 changed files with 3,582 additions and 0 deletions.
5 changes: 5 additions & 0 deletions LICENSE
@@ -1,6 +1,11 @@
MIT License

5th place solution
"Turtle Recall: Conservation Challenge"
https://zindi.africa/competitions/turtle-recall-conservation-challenge

Copyright (c) 2022 Igor Ivanov
Email: [email protected]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
159 changes: 159 additions & 0 deletions README.md
@@ -0,0 +1,159 @@
Turtle Recall: Conservation Challenge. 5th place solution
=========================================================

Competition: [link](https://zindi.africa/competitions/turtle-recall-conservation-challenge)
Author: Igor Ivanov
License: MIT


Solution overview
=================

To ensure generalization ability I built my solution as an ensemble
of 6 models, each trained on a 5-fold stratified split.
For the same reason I chose large, deep architectures with
enough capacity to capture important features from the diverse dataset.
All models share the same multiclass classification formulation over 2265 classes,
with average pooling and a softmax on top. Optimization was performed with the
categorical cross-entropy loss and the Adam optimizer.
I used all available data for training, i.e. the joint set of training and extra images.
A raw model prediction contains 2265 probabilities. Any predicted `turtle_id`
which does not belong to the 100 original training individuals is considered a `new_turtle`.
The ensemble is computed as an arithmetic average of 30 predictions (6 models × 5 folds).

Architectures used:
- EfficientNet-v1-B7
- EfficientNet-v1-L2 (2 runs)
- EfficientNet-v2-L
- EfficientNet-v2-XL
- BEiT-L

Architectures are implemented in the following repositories:
- https://github.com/qubvel/efficientnet
- https://github.com/leondgarse/keras_cv_attention_models
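
To illustrate the shared formulation, below is a minimal sketch of a backbone wrapped with
global average pooling and a 2265-way softmax head. This is not the actual model-building code:
the backbone, input size and learning rate here are placeholders (the real hyperparameters live
at the top of `run.py`).

```
import tensorflow as tf

NUM_CLASSES = 2265  # 100 training ids + extra ids

def build_model(image_size=512):
    # Any of the backbones listed above can be plugged in here;
    # tf.keras.applications.EfficientNetB7 is used only as a stand-in.
    backbone = tf.keras.applications.EfficientNetB7(
        include_top=False, weights='imagenet',
        input_shape=(image_size, image_size, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    out = tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')(x)
    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```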

For augmentation I used flips and rotations by multiples of 45 degrees (with a central crop).
For validation I measured accuracy and MAP5 over the 2265 classes.
The software stack is based on TensorFlow and Keras.
All hyperparameters are listed in a dedicated section at the top
of the `run.py` file and can be passed as command line arguments.
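
For reference, MAP5 with a single true id per image can be computed with a small NumPy helper
like the one below. This is a sketch of the metric as described above, not code copied from the solution.

```
import numpy as np

def map5(probas, labels):
    # probas: (n_examples, n_classes) predicted probabilities
    # labels: (n_examples,) integer ground-truth class ids
    top5 = np.argsort(probas, axis=1)[:, ::-1][:, :5]
    scores = []
    for preds, label in zip(top5, labels):
        hits = np.where(preds == label)[0]
        # contribution is 1/(rank+1) if the true id is within the top 5, else 0
        scores.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return float(np.mean(scores))
```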


Results
=======

Each score in the table is an average over 5 folds.
Suffix `2265` means that the metric uses 2265 unique turtle ids (100 training + extra).
Suffix `101` means that the metric uses 101 unique turtle ids (100 training + 1 `new_turtle`).

| Model | CV-acc1-2265 | CV-map5-2265 | Public-LB-map5-101 | Private-LB-map5-101 |
|--------------------------|--------------|--------------|--------------------|----------------------|
| run-20220310-1926-ef1b7 | 0.8731 | 0.9067 | 0.9523 | 0.9567 |
| run-20220316-1310-beitl | 0.8896 | 0.9202 | 0.9611 | 0.9317 |
| run-20220317-1954-ef1l2 | 0.8782 | 0.9112 | 0.9543 | 0.9501 |
| run-20220318-1121-ef2xl | 0.8553 | 0.8928 | 0.9421 | 0.9332 |
| run-20220322-2024-ef1l2 | 0.8720 | 0.9056 | 0.9625 | 0.9514 |
| run-20220325-1527-ef2l | 0.8829 | 0.9151 | 0.9557 | 0.9545 |
| - | | | | |
| Ensemble | 0.9320 | 0.9503 | 0.9875 | 0.9648 |



Conclusions
===========

1) The solution generalizes well between the public and private test sets
despite the very small test size (147 and 343 examples respectively).
As a result I was able to retain a high position on both leaderboards:
2nd place public, 5th place private.

2) Ensembling gives a stable, significant improvement (about 0.01-0.03)
observed across all metrics on all subsets of the data (public/private).

3) The combination of GeM pooling and ArcFace loss is a popular approach in tasks dealing with image similarity,
but in this task I did not see an improvement from it in my experiments.
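
For reference, a compact Keras sketch of GeM (generalized mean) pooling is shown below.
It is a hypothetical layer for illustration, not taken from my code, since this direction
did not pay off in my experiments.

```
import tensorflow as tf

class GeMPooling(tf.keras.layers.Layer):
    # Generalized mean pooling with a learnable exponent p.
    # p = 1 is average pooling; large p approaches max pooling.
    def __init__(self, p=3.0, eps=1e-6, **kwargs):
        super().__init__(**kwargs)
        self.init_p = p
        self.eps = eps

    def build(self, input_shape):
        self.p = self.add_weight(name='p', shape=(1,),
                                 initializer=tf.keras.initializers.Constant(self.init_p),
                                 trainable=True)

    def call(self, x):
        x = tf.clip_by_value(x, self.eps, tf.reduce_max(x))
        x = tf.pow(x, self.p)
        x = tf.reduce_mean(x, axis=[1, 2])  # pool over spatial dimensions
        return tf.pow(x, 1.0 / self.p)
```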


Hardware
========

Training: TPUv3-8, 4 CPU, 16 GB RAM, 500 GB HDD
Training time: 100 hours total

Inference: V100-16GB GPU, 4 CPU, 16 GB RAM, 500 GB HDD
Inference time: 30 minutes total


Software
========

- Ubuntu 18.04
- Python: 3.9.7
- CUDA: 11.2
- cuDNN: 8.1.1
- Tensorflow: 2.8.0


Demo
====

The notebook `solution/notebook/notebook.ipynb` demonstrates
how to run inference on a single image using pretrained weights.
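
In outline, such single-image inference could look like the sketch below. Paths, image size and
preprocessing are placeholders; see the notebook for the exact steps and weight files.

```
import numpy as np
import tensorflow as tf

# Placeholder paths and image size.
model = tf.keras.models.load_model('model.h5', compile=False)

img = tf.io.decode_jpeg(tf.io.read_file('turtle.jpg'), channels=3)
img = tf.image.resize(img, (512, 512))
img = tf.expand_dims(img, 0)  # add batch dimension

probas = model.predict(img)[0]        # 2265 probabilities
top5 = np.argsort(probas)[::-1][:5]   # indices of the 5 most probable ids
# Indices are then decoded back to turtle ids with the fitted LabelEncoder,
# and any id outside the 100 training individuals is mapped to `new_turtle`.
```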


Steps to reproduce
==================

```
# Install
cd $HOME
unzip solution.zip
conda create -y --name py397 python=3.9.7
conda activate py397
pip install tensorflow==2.8.0 tensorflow-addons numpy pandas \
scikit-learn h5py efficientnet keras-cv-attention-models cloud-tpu-client
# Prepare data
cd $HOME/solution/data
curl -L -O https://storage.googleapis.com/dm-turtle-recall/train.csv
curl -L -O https://storage.googleapis.com/dm-turtle-recall/extra_images.csv
curl -L -O https://storage.googleapis.com/dm-turtle-recall/test.csv
curl -L -O https://storage.googleapis.com/dm-turtle-recall/sample_submission.csv
curl -L -O https://storage.googleapis.com/dm-turtle-recall/images.tar
mkdir images
tar xf images.tar -C images
rm images.tar
cd $HOME/solution
python3 create_tfrecords.py --data_dir=$HOME/solution/data --out_dir=$HOME/solution/data/tfrec
# Training
# Please remove all weights from previous runs if present.
# All hyperparameters are configured for training on TPUv3-8.
# To train on GPU (or several GPUs) set the following arguments in `run_training.sh`:
# --tpu_ip_or_name=None
# --data_tfrec_dir=$HOME/solution/data/tfrec
# and adjust batch size and learning rate accordingly.
# To use mixed precision set:
# --mixed_precision=mixed_float16
bash run_training.sh
# Inference
bash run_inference.sh
# Submission will appear as $HOME/solution/submission.csv
```


Acknowledgement
===============

Thanks to the [TRC program](https://sites.research.google/trc/about/)
I had the opportunity to run experiments on a TPUv3-8.

116 changes: 116 additions & 0 deletions create_tfrecords.py
@@ -0,0 +1,116 @@
#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

import os
import sys
sys.path.append('lib')
import glob
import warnings
warnings.simplefilter('ignore', UserWarning)
import collections
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GroupKFold
import tensorflow as tf
print('tf:', tf.__version__)
from vecxoz_utils import create_cv_split
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--data_dir', default='data', type=str, help='Data directory')
parser.add_argument('--out_dir', default='data/tfrec', type=str, help='Out directory')
args = parser.parse_args()

os.makedirs(args.out_dir, exist_ok=True)

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

class TFRecordProcessor(object):
    def __init__(self):
        self.n_examples = 0
    #
    def _bytes_feature(self, value):
        if isinstance(value, type(tf.constant(0))):
            value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
    #
    def _int_feature(self, value):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
    #
    def _float_feature(self, value):
        return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
    #
    def _process_example(self, ind, A, B, C, D):
        self.n_examples += 1
        feature = collections.OrderedDict()
        #
        feature['image_id'] = self._bytes_feature(A[ind].encode('utf-8'))
        feature['image'] = self._bytes_feature(tf.io.read_file(B[ind]))
        feature['label_id'] = self._bytes_feature(C[ind].encode('utf-8'))
        feature['label'] = self._int_feature(D[ind])
        #
        example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
        self._writer.write(example_proto.SerializeToString())
    #
    def write_tfrecords(self, A, B, C, D, n_shards=1, file_out='train.tfrecord'):
        n_examples_per_shard = A.shape[0] // n_shards
        n_examples_remainder = A.shape[0] % n_shards
        self.n_examples = 0
        #
        for shard in range(n_shards):
            self._writer = tf.io.TFRecordWriter('%s-%05d-of-%05d' % (file_out, shard, n_shards))
            #
            start = shard * n_examples_per_shard
            if shard == (n_shards - 1):
                end = (shard + 1) * n_examples_per_shard + n_examples_remainder
            else:
                end = (shard + 1) * n_examples_per_shard
            #
            print('Shard %d of %d: (%d examples)' % (shard, n_shards, (end - start)))
            for i in range(start, end):
                self._process_example(i, A, B, C, D)
                print(i, end='\r')
            #
            self._writer.close()
        #
        return self.n_examples

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

train_df, test_df = create_cv_split(args.data_dir, n_splits=5)

tfrp = TFRecordProcessor()

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

for fold_id in range(len(train_df['fold_id'].unique())):
    print('Fold:', fold_id)
    n_written = tfrp.write_tfrecords(
        train_df[train_df['fold_id'] == fold_id]['image_id'].values,
        train_df[train_df['fold_id'] == fold_id]['image'].values,
        train_df[train_df['fold_id'] == fold_id]['turtle_id'].values,
        train_df[train_df['fold_id'] == fold_id]['label'].values,
        #
        n_shards=1,
        file_out=os.path.join(args.out_dir, 'fold.%d.tfrecord' % fold_id))

n_written = tfrp.write_tfrecords(
    test_df['image_id'].values,
    test_df['image'].values,
    test_df['turtle_id'].values,
    test_df['label'].values,
    #
    n_shards=1,
    file_out=os.path.join(args.out_dir, 'test.tfrecord'))
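
# For reference, records written above can be read back with a feature spec that
# mirrors the dict built in _process_example. This helper is illustrative only
# (it is not called anywhere in this script); the parsing actually used for
# training lives elsewhere in the solution.
def read_tfrecord_example(path):
    feature_spec = {
        'image_id': tf.io.FixedLenFeature([], tf.string),
        'image':    tf.io.FixedLenFeature([], tf.string),
        'label_id': tf.io.FixedLenFeature([], tf.string),
        'label':    tf.io.FixedLenFeature([], tf.int64),
    }
    ds = tf.data.TFRecordDataset(path)
    for record in ds.take(1):
        example = tf.io.parse_single_example(record, feature_spec)
        image = tf.io.decode_jpeg(example['image'], channels=3)
        return example['image_id'], image, example['label_id'], example['label']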

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------


96 changes: 96 additions & 0 deletions ensemble.py
@@ -0,0 +1,96 @@
#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

import os
import sys
sys.path.append('lib')
import warnings
warnings.simplefilter('ignore', UserWarning)
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from vecxoz_utils import create_cv_split

# List of models to ensemble
dirs = [
'run-20220310-1926-ef1b7',
'run-20220316-1310-beitl',
'run-20220317-1954-ef1l2',
'run-20220318-1121-ef2xl',
'run-20220322-2024-ef1l2',
'run-20220325-1527-ef2l',
]

model_dir = 'models'
data_dir = 'data'
n_folds = 5
n_tta = 0

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

# Load predictions from all models
y_preds_test = []
for counter, d in enumerate(dirs):
    for tta_id in range(n_tta + 1):
        for fold_id in range(n_folds):
            y_preds_test.append(np.load(os.path.join(model_dir, d, 'preds', 'y_pred_test_fold_%d_tta_%d.npy' % (fold_id, tta_id))))
    print(counter, end='\r')
assert len(y_preds_test) == (n_tta + 1) * len(dirs) * n_folds

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

# Compute mean and argsort
probas = np.mean(y_preds_test, axis=0)
preds = np.argsort(probas, axis=1)[:, ::-1]

# train_df contains train + extra data of 2265 classes
# train_orig_df contains 100 original classes
train_df, _ = create_cv_split(data_dir, 5)
train_orig_df = pd.read_csv(os.path.join(data_dir, 'train.csv'))
turtle_ids_orig = sorted(train_orig_df['turtle_id'].unique()) # 100 unique

# Fit LabelEncoder on 2265 classes to decode our predictions
le = LabelEncoder()
le = le.fit(train_df['turtle_id'])

# Replace all predicted labels outside of 100 train ids with a "new_turtle"
label_str = []
for row in preds: # 490
    turtle_ids_predicted = le.inverse_transform(row) # transform a row of length 2265
    turtle_ids_replaced = []
    for turtle_id in turtle_ids_predicted:
        if turtle_id in turtle_ids_orig:
            turtle_ids_replaced.append(turtle_id)
        else:
            turtle_ids_replaced.append('new_turtle')
    label_str.append(turtle_ids_replaced)
label_str_npy = np.array(label_str) # (490, 2265)

# There may be more than one "new_turtle" prediction for any given example.
# We replace all repetitions except the first with the most probable predictions from the 100 train ids.
rows_by_5 = []
for row in label_str_npy:
    cand = [x for x in row[row != 'new_turtle'] if x not in row[:5]][:4]
    row_new = []
    for t_id in row[:5]:
        if t_id not in row_new:
            row_new.append(t_id)
    for _ in range(5 - len(row_new)):
        row_new.append(cand.pop(0))
    rows_by_5.append(np.array(row_new))
rows_by_5_npy = np.array(rows_by_5)

# Create submission file
subm_df = pd.read_csv(os.path.join(data_dir, 'sample_submission.csv'))
subm_df['prediction1'] = rows_by_5_npy[:, 0]
subm_df['prediction2'] = rows_by_5_npy[:, 1]
subm_df['prediction3'] = rows_by_5_npy[:, 2]
subm_df['prediction4'] = rows_by_5_npy[:, 3]
subm_df['prediction5'] = rows_by_5_npy[:, 4]

subm_df.to_csv('submission.csv', index=False)

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------