Google AI4Code competition: 31st/34th place solution
Point-wise ranking approach. The model is based on the CodeT5-base encoder with a sequence length of 1024.
Writeup and discussion on Kaggle.
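For orientation, here is a minimal sketch of the point-wise formulation, assuming each markdown cell is scored independently and a regression head on the CodeT5 encoder predicts the cell's fractional position in the notebook. The pooling, head, and hyperparameters are illustrative guesses, not the exact train.py code.

```python
# Sketch of a point-wise ranking model on top of the CodeT5-base encoder.
# Target (assumed): the cell's fractional position in the notebook, in [0, 1].
import tensorflow as tf
from transformers import TFT5EncoderModel

MAX_LEN = 1024  # sequence length used in this solution

def build_model(model_dir="Salesforce/codet5-base"):
    # CodeT5 is a T5 architecture, so the encoder loads via TFT5EncoderModel.
    # from_pt=True converts the PyTorch checkpoint if no TF weights are present.
    encoder = TFT5EncoderModel.from_pretrained(model_dir, from_pt=True)
    input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")
    hidden = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    # Mean-pool over non-padding tokens.
    mask = tf.cast(attention_mask, tf.float32)[..., None]
    pooled = tf.reduce_sum(hidden * mask, axis=1) / tf.reduce_sum(mask, axis=1)
    # Sigmoid regression head: one score per cell.
    rank = tf.keras.layers.Dense(1, activation="sigmoid", name="rank")(pooled)
    model = tf.keras.Model([input_ids, attention_mask], rank)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="mse")
    return model
```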
Hardware
Training: 4 cores, 16 GB RAM, TPUv3-8
Inference: 2 cores, 12 GB RAM, P100
Software
Ubuntu 18.04
Python: 3.9.7
CUDA: 11.2 (for GPU inference)
cuDNN: 8.1.1 (for GPU inference)
In fact, I used TF 2.8.0 for training, but newer versions should also work.
git clone https://github.com/vecxoz/ai4code
cd ai4code
pip3 install -r requirements.txt
Inference time is about 7 hours for Kaggle’s hidden test dataset.
kaggle competitions download -c AI4Code
kaggle datasets download vecxoz/model-codet5base
kaggle datasets download vecxoz/ai4code-weights
unzip -q AI4Code.zip -d AI4Code
unzip -q model-codet5base.zip -d model-codet5base
unzip -q ai4code-weights.zip -d ai4code-weights
python3 infer.py --data_dir=AI4Code --weight_dir=ai4code-weights --model_dir_or_name=model-codet5base
If you run inference with newly trained models, adjust the ensemble coefficients according to their performance (see the blending sketch below).
On Kaggle choose a P100 GPU notebook, attach the two datasets model-codet5base and ai4code-weights, and set the paths accordingly.
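To make "adjust ensemble coefficients" concrete, here is a hedged sketch of blending the predicted ranks of two folds and sorting cells into the submission order. The coefficients and column names are hypothetical; only the submission layout (id, cell_order) is fixed by the competition.

```python
# Hypothetical blending of two folds' point-wise predictions.
import pandas as pd

def blend_and_order(pred_fold0, pred_fold1, coefs=(0.6, 0.4)):
    # pred_fold0/pred_fold1: DataFrames with columns [id, cell_id, rank],
    # listing the same cells in the same row order.
    df = pred_fold0.copy()
    df["rank"] = coefs[0] * pred_fold0["rank"] + coefs[1] * pred_fold1["rank"]
    # Sort cells within each notebook by blended rank to get the final order.
    return (df.sort_values(["id", "rank"])
              .groupby("id")["cell_id"]
              .apply(" ".join)
              .reset_index()
              .rename(columns={"cell_id": "cell_order"}))
```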
It takes about 3 hours to create the data on a GCP VM.
For some reason it may take much longer in Kaggle's latest notebook environment.
python3 create_data.py --data_dir=AI4Code --out_dir=ai4code-tfrec
There is a prebuilt dataset on Kaggle. You can attach it to your notebook or download it:
kaggle datasets download vecxoz/ai4code-tfrec
unzip -q ai4code-tfrec.zip -d ai4code-tfrec
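For reference, the TFRecord files store serialized examples along these lines; the feature names and the single float target are my assumptions about what create_data.py writes, not a documented schema.

```python
# Assumed TFRecord schema: token ids, attention mask, and a float rank target.
import tensorflow as tf

def serialize_example(input_ids, attention_mask, target):
    feature = {
        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
        "attention_mask": tf.train.Feature(int64_list=tf.train.Int64List(value=attention_mask)),
        "target": tf.train.Feature(float_list=tf.train.FloatList(value=[target])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Toy usage: one short example written to a single shard.
with tf.io.TFRecordWriter("example.tfrec") as writer:
    writer.write(serialize_example([1, 2, 3], [1, 1, 1], 0.5))
```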
I trained the first two folds (0 and 1) for 20 and 7 full epochs, respectively.
Both were interrupted before full convergence.
Training time is about 3.5 hours per epoch.
python3 train.py --data_tfrec_dir=ai4code-tfrec --initial_fold=0 --final_fold=2
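To compare folds (and to choose ensemble coefficients), predictions can be scored locally with the competition metric, Kendall tau over cell orderings. A minimal reference implementation written for this README, not taken from the repo:

```python
# AI4Code metric: 1 - 4 * (total inversions) / sum over notebooks of n * (n - 1).
from bisect import bisect_right, insort

def count_inversions(seq):
    # Pairs (i, j) with i < j and seq[i] > seq[j].
    seen, inversions = [], 0
    for x in seq:
        inversions += len(seen) - bisect_right(seen, x)  # earlier items > x
        insort(seen, x)
    return inversions

def kendall_tau(ground_truth, predictions):
    # Both arguments: dict {notebook_id: list of cell_ids in order}.
    total_inv, total_max = 0, 0
    for nb_id, true_order in ground_truth.items():
        pos = {cell: i for i, cell in enumerate(true_order)}
        ranks = [pos[cell] for cell in predictions[nb_id]]
        total_inv += count_inversions(ranks)
        n = len(true_order)
        total_max += n * (n - 1)
    return 1 - 4 * total_inv / total_max
```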
On Kaggle choose a TPU notebook, attach the ai4code-tfrec dataset, and set the path accordingly.
Due to Kaggle's time limits, each fold has to be trained over several separate sessions.
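For completeness, the standard TF 2.x TPU bootstrap is shown below; this is generic TensorFlow code and not necessarily how train.py detects the TPU.

```python
import tensorflow as tf

try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # Kaggle/GCP TPU
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
except ValueError:
    strategy = tf.distribute.get_strategy()  # CPU/GPU fallback

with strategy.scope():
    model = build_model()  # e.g. the model sketch above
```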
Many thanks to the TRC program for TPU resources.