Commit

Initial commit

vecxoz committed Jun 6, 2022
1 parent 3079bdc commit b4d3c7b
Showing 14 changed files with 3,582 additions and 0 deletions.
5 changes: 5 additions & 0 deletions LICENSE
@@ -1,6 +1,11 @@
MIT License

5th place solution
"Turtle Recall: Conservation Challenge"
https://zindi.africa/competitions/turtle-recall-conservation-challenge

Copyright (c) 2022 Igor Ivanov
Email: [email protected]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
159 changes: 159 additions & 0 deletions README.md
@@ -0,0 +1,159 @@
Turtle Recall: Conservation Challenge. 5th place solution
=========================================================

Competition: [link](https://zindi.africa/competitions/turtle-recall-conservation-challenge)
Author: Igor Ivanov
License: MIT


Solution overview
=================

To ensure generalization ability I built my solution as an ensemble
of 6 models, each trained on a 5-fold stratified split.
For the same reason I chose large, deep architectures with
enough capacity to capture important features from the diverse dataset.
All models share the same multiclass classification formulation over 2265 classes,
with average pooling and a softmax on top. Optimization was performed with the
categorical cross-entropy loss and the Adam optimizer.
I used all available data for training, i.e. the joint set of training and extra images.
A raw model prediction contains 2265 probabilities. Any predicted `turtle_id`
which does not belong to the 100 original training individuals is considered a `new_turtle`.
The ensemble is computed as an arithmetic average of 30 predictions (6 models × 5 folds).

Architectures used:
- EfficientNet-v1-B7
- EfficientNet-v1-L2 (2 runs)
- EfficientNet-v2-L
- EfficientNet-v2-XL
- BEiT-L

Architectures are implemented in the following repositories:
- https://github.com/qubvel/efficientnet
- https://github.com/leondgarse/keras_cv_attention_models
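
To illustrate the shared formulation, below is a minimal sketch of a backbone wrapped with
global average pooling and a 2265-way softmax head. This is not the actual model-building code:
the backbone, input size and learning rate here are placeholders (the real hyperparameters live
at the top of `run.py`).

```
import tensorflow as tf

NUM_CLASSES = 2265  # 100 training ids + extra ids

def build_model(image_size=512):
    # Any of the backbones listed above can be plugged in here;
    # tf.keras.applications.EfficientNetB7 is used only as a stand-in.
    backbone = tf.keras.applications.EfficientNetB7(
        include_top=False, weights='imagenet',
        input_shape=(image_size, image_size, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    out = tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')(x)
    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```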

For augmentation I used flips and rotations by multiples of 45 degrees (with a central crop).
For validation I measured accuracy and MAP5 over the 2265 classes.
The software stack is based on TensorFlow and Keras.
All hyperparameters are listed in a dedicated section at the top
of the `run.py` file and can be passed as command line arguments.
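
For reference, MAP5 with a single true id per image can be computed with a small NumPy helper
like the one below. This is a sketch of the metric as described above, not code copied from the solution.

```
import numpy as np

def map5(probas, labels):
    # probas: (n_examples, n_classes) predicted probabilities
    # labels: (n_examples,) integer ground-truth class ids
    top5 = np.argsort(probas, axis=1)[:, ::-1][:, :5]
    scores = []
    for preds, label in zip(top5, labels):
        hits = np.where(preds == label)[0]
        # contribution is 1/(rank+1) if the true id is within the top 5, else 0
        scores.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return float(np.mean(scores))
```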


Results
=======

Each score in the table is an average over 5 folds.
Suffix `2265` means that the metric uses 2265 unique turtle ids (100 training + extra).
Suffix `101` means that the metric uses 101 unique turtle ids (100 training + 1 `new_turtle`).

| Model | CV-acc1-2265 | CV-map5-2265 | Public-LB-map5-101 | Private-LB-map5-101 |
|--------------------------|--------------|--------------|--------------------|----------------------|
| run-20220310-1926-ef1b7 | 0.8731 | 0.9067 | 0.9523 | 0.9567 |
| run-20220316-1310-beitl | 0.8896 | 0.9202 | 0.9611 | 0.9317 |
| run-20220317-1954-ef1l2 | 0.8782 | 0.9112 | 0.9543 | 0.9501 |
| run-20220318-1121-ef2xl | 0.8553 | 0.8928 | 0.9421 | 0.9332 |
| run-20220322-2024-ef1l2 | 0.8720 | 0.9056 | 0.9625 | 0.9514 |
| run-20220325-1527-ef2l | 0.8829 | 0.9151 | 0.9557 | 0.9545 |
| - | | | | |
| Ensemble | 0.9320 | 0.9503 | 0.9875 | 0.9648 |



Conclusions
===========

1) The solution generalizes well between the public and private test sets
despite the very small test size (147 and 343 examples respectively).
As a result I was able to retain a high position on both leaderboards:
2nd place public, 5th place private.

2) Ensembling gives a stable, significant improvement (about 0.01-0.03)
observed across all metrics on all subsets of the data (public/private).

3) The combination of GeM pooling and ArcFace loss is a popular approach in tasks dealing with image similarity,
but in this task I did not see an improvement from it in my experiments.
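
For reference, a compact Keras sketch of GeM (generalized mean) pooling is shown below.
It is a hypothetical layer for illustration, not taken from my code, since this direction
did not pay off in my experiments.

```
import tensorflow as tf

class GeMPooling(tf.keras.layers.Layer):
    # Generalized mean pooling with a learnable exponent p.
    # p = 1 is average pooling; large p approaches max pooling.
    def __init__(self, p=3.0, eps=1e-6, **kwargs):
        super().__init__(**kwargs)
        self.init_p = p
        self.eps = eps

    def build(self, input_shape):
        self.p = self.add_weight(name='p', shape=(1,),
                                 initializer=tf.keras.initializers.Constant(self.init_p),
                                 trainable=True)

    def call(self, x):
        x = tf.clip_by_value(x, self.eps, tf.reduce_max(x))
        x = tf.pow(x, self.p)
        x = tf.reduce_mean(x, axis=[1, 2])  # pool over spatial dimensions
        return tf.pow(x, 1.0 / self.p)
```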


Hardware
========

Training: TPUv3-8, 4 CPU, 16 GB RAM, 500 GB HDD
Training time: 100 hours total

Inference: V100-16GB GPU, 4 CPU, 16 GB RAM, 500 GB HDD
Inference time: 30 minutes total


Software
========

- Ubuntu 18.04
- Python: 3.9.7
- CUDA: 11.2
- cuDNN: 8.1.1
- Tensorflow: 2.8.0


Demo
====

The notebook `solution/notebook/notebook.ipynb` demonstrates
how to run inference on a single image using pretrained weights.
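
In outline, such single-image inference could look like the sketch below. Paths, image size and
preprocessing are placeholders; see the notebook for the exact steps and weight files.

```
import numpy as np
import tensorflow as tf

# Placeholder paths and image size.
model = tf.keras.models.load_model('model.h5', compile=False)

img = tf.io.decode_jpeg(tf.io.read_file('turtle.jpg'), channels=3)
img = tf.image.resize(img, (512, 512))
img = tf.expand_dims(img, 0)  # add batch dimension

probas = model.predict(img)[0]        # 2265 probabilities
top5 = np.argsort(probas)[::-1][:5]   # indices of the 5 most probable ids
# Indices are then decoded back to turtle ids with the fitted LabelEncoder,
# and any id outside the 100 training individuals is mapped to `new_turtle`.
```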


Steps to reproduce
==================

```
# Install
cd $HOME
unzip solution.zip
conda create -y --name py397 python=3.9.7
conda activate py397
pip install tensorflow==2.8.0 tensorflow-addons numpy pandas \
scikit-learn h5py efficientnet keras-cv-attention-models cloud-tpu-client
# Prepare data
cd $HOME/solution/data
curl -L -O https://storage.googleapis.com/dm-turtle-recall/train.csv
curl -L -O https://storage.googleapis.com/dm-turtle-recall/extra_images.csv
curl -L -O https://storage.googleapis.com/dm-turtle-recall/test.csv
curl -L -O https://storage.googleapis.com/dm-turtle-recall/sample_submission.csv
curl -L -O https://storage.googleapis.com/dm-turtle-recall/images.tar
mkdir images
tar xf images.tar -C images
rm images.tar
cd $HOME/solution
python3 create_tfrecords.py --data_dir=$HOME/solution/data --out_dir=$HOME/solution/data/tfrec
# Training
# Please remove all weights from previous runs if present.
# All hyperparameters are configured for training on TPUv3-8.
# To train on GPU (or several GPUs) set the following arguments in `run_training.sh`:
# --tpu_ip_or_name=None
# --data_tfrec_dir=$HOME/solution/data/tfrec
# and adjust batch size and learning rate accordingly.
# To use mixed precision set:
# --mixed_precision=mixed_float16
bash run_training.sh
# Inference
bash run_inference.sh
# Submission will appear as $HOME/solution/submission.csv
```


Acknowledgement
===============

Thanks to the [TRC program](https://sites.research.google/trc/about/)
I had the opportunity to run experiments on a TPUv3-8.

116 changes: 116 additions & 0 deletions create_tfrecords.py
@@ -0,0 +1,116 @@
#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

import os
import sys
sys.path.append('lib')
import glob
import warnings
warnings.simplefilter('ignore', UserWarning)
import collections
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GroupKFold
import tensorflow as tf
print('tf:', tf.__version__)
from vecxoz_utils import create_cv_split
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--data_dir', default='data', type=str, help='Data directory')
parser.add_argument('--out_dir', default='data/tfrec', type=str, help='Out directory')
args = parser.parse_args()

os.makedirs(args.out_dir, exist_ok=True)

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

class TFRecordProcessor(object):
    def __init__(self):
        self.n_examples = 0
    #
    def _bytes_feature(self, value):
        if isinstance(value, type(tf.constant(0))):
            value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
    #
    def _int_feature(self, value):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
    #
    def _float_feature(self, value):
        return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
    #
    def _process_example(self, ind, A, B, C, D):
        self.n_examples += 1
        feature = collections.OrderedDict()
        #
        feature['image_id'] = self._bytes_feature(A[ind].encode('utf-8'))
        feature['image'] = self._bytes_feature(tf.io.read_file(B[ind]))
        feature['label_id'] = self._bytes_feature(C[ind].encode('utf-8'))
        feature['label'] = self._int_feature(D[ind])
        #
        example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
        self._writer.write(example_proto.SerializeToString())
    #
    def write_tfrecords(self, A, B, C, D, n_shards=1, file_out='train.tfrecord'):
        n_examples_per_shard = A.shape[0] // n_shards
        n_examples_remainder = A.shape[0] % n_shards
        self.n_examples = 0
        #
        for shard in range(n_shards):
            self._writer = tf.io.TFRecordWriter('%s-%05d-of-%05d' % (file_out, shard, n_shards))
            #
            start = shard * n_examples_per_shard
            if shard == (n_shards - 1):
                end = (shard + 1) * n_examples_per_shard + n_examples_remainder
            else:
                end = (shard + 1) * n_examples_per_shard
            #
            print('Shard %d of %d: (%d examples)' % (shard, n_shards, (end - start)))
            for i in range(start, end):
                self._process_example(i, A, B, C, D)
                print(i, end='\r')
            #
            self._writer.close()
        #
        return self.n_examples

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

train_df, test_df = create_cv_split(args.data_dir, n_splits=5)

tfrp = TFRecordProcessor()

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

for fold_id in range(len(train_df['fold_id'].unique())):
    print('Fold:', fold_id)
    n_written = tfrp.write_tfrecords(
        train_df[train_df['fold_id'] == fold_id]['image_id'].values,
        train_df[train_df['fold_id'] == fold_id]['image'].values,
        train_df[train_df['fold_id'] == fold_id]['turtle_id'].values,
        train_df[train_df['fold_id'] == fold_id]['label'].values,
        #
        n_shards=1,
        file_out=os.path.join(args.out_dir, 'fold.%d.tfrecord' % fold_id))

n_written = tfrp.write_tfrecords(
    test_df['image_id'].values,
    test_df['image'].values,
    test_df['turtle_id'].values,
    test_df['label'].values,
    #
    n_shards=1,
    file_out=os.path.join(args.out_dir, 'test.tfrecord'))
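
# For reference, records written above can be read back with a feature spec that
# mirrors the dict built in _process_example. This helper is illustrative only
# (it is not called anywhere in this script); the parsing actually used for
# training lives elsewhere in the solution.
def read_tfrecord_example(path):
    feature_spec = {
        'image_id': tf.io.FixedLenFeature([], tf.string),
        'image':    tf.io.FixedLenFeature([], tf.string),
        'label_id': tf.io.FixedLenFeature([], tf.string),
        'label':    tf.io.FixedLenFeature([], tf.int64),
    }
    ds = tf.data.TFRecordDataset(path)
    for record in ds.take(1):
        example = tf.io.parse_single_example(record, feature_spec)
        image = tf.io.decode_jpeg(example['image'], channels=3)
        return example['image_id'], image, example['label_id'], example['label']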

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------


96 changes: 96 additions & 0 deletions ensemble.py
@@ -0,0 +1,96 @@
#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

import os
import sys
sys.path.append('lib')
import warnings
warnings.simplefilter('ignore', UserWarning)
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from vecxoz_utils import create_cv_split

# List of models to ensemble
dirs = [
'run-20220310-1926-ef1b7',
'run-20220316-1310-beitl',
'run-20220317-1954-ef1l2',
'run-20220318-1121-ef2xl',
'run-20220322-2024-ef1l2',
'run-20220325-1527-ef2l',
]

model_dir = 'models'
data_dir = 'data'
n_folds = 5
n_tta = 0

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

# Load predictions from all models
y_preds_test = []
for counter, d in enumerate(dirs):
    for tta_id in range(n_tta + 1):
        for fold_id in range(n_folds):
            y_preds_test.append(np.load(os.path.join(model_dir, d, 'preds', 'y_pred_test_fold_%d_tta_%d.npy' % (fold_id, tta_id))))
    print(counter, end='\r')
assert len(y_preds_test) == (n_tta + 1) * len(dirs) * n_folds

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

# Compute mean and argsort
probas = np.mean(y_preds_test, axis=0)
preds = np.argsort(probas, axis=1)[:, ::-1]

# train_df contains train + extra data of 2265 classes
# train_orig_df contains 100 original classes
train_df, _ = create_cv_split(data_dir, 5)
train_orig_df = pd.read_csv(os.path.join(data_dir, 'train.csv'))
turtle_ids_orig = sorted(train_orig_df['turtle_id'].unique()) # 100 unique

# Fit LabelEncoder on 2265 classes to decode our predictions
le = LabelEncoder()
le = le.fit(train_df['turtle_id'])

# Replace all predicted labels outside of 100 train ids with a "new_turtle"
label_str = []
for row in preds: # 490
    turtle_ids_predicted = le.inverse_transform(row) # transform a row of length 2265
    turtle_ids_replaced = []
    for turtle_id in turtle_ids_predicted:
        if turtle_id in turtle_ids_orig:
            turtle_ids_replaced.append(turtle_id)
        else:
            turtle_ids_replaced.append('new_turtle')
    label_str.append(turtle_ids_replaced)
label_str_npy = np.array(label_str) # (490, 2265)

# There may be more than one "new_turtle" prediction for any given example.
# We replace all repetitions except the first with the most probable predictions from the 100 train ids.
rows_by_5 = []
for row in label_str_npy:
    cand = [x for x in row[row != 'new_turtle'] if x not in row[:5]][:4]
    row_new = []
    for t_id in row[:5]:
        if t_id not in row_new:
            row_new.append(t_id)
    for _ in range(5 - len(row_new)):
        row_new.append(cand.pop(0))
    rows_by_5.append(np.array(row_new))
rows_by_5_npy = np.array(rows_by_5)

# Create submission file
subm_df = pd.read_csv(os.path.join(data_dir, 'sample_submission.csv'))
subm_df['prediction1'] = rows_by_5_npy[:, 0]
subm_df['prediction2'] = rows_by_5_npy[:, 1]
subm_df['prediction3'] = rows_by_5_npy[:, 2]
subm_df['prediction4'] = rows_by_5_npy[:, 3]
subm_df['prediction5'] = rows_by_5_npy[:, 4]

subm_df.to_csv('submission.csv', index=False)

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------