Task: Given a source code as the query, the task aims to retrieve codes with the same semantics from a collection of candidates in zero-shot setting.
Guo et al., 2022 collected 11,744/15,594/23,530 programs from CodeNet corpus for Ruby/Python/Java programming languages. Each program solves one of 4,053 problems. The task is to take each program as the query and retrieve programs that solves the same problem from each PL.
Get the data (jsonl files) from https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/zero-shot-search/dataset.
- Total examples - 23,530
- Total programming problems - 3,142
- Total examples in Python - 15,594
- Total programming problems - 2,072
- Total examples in Ruby - 11,744
- Total programming problems - 1,708
- Java, Python - 2,001
- Python, Ruby - 1,566
- Java, Ruby - 1,695
- Java, Python, Ruby - 1,560
We extend this dataset to 3 other programming languages - C, C++, C#, Go, Javascript, Typescript, and PHP.
wget https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet.tar.gz
wget https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet_metadata.tar.gz
tar -xvzf Project_CodeNet.tar.gz
tar -xvzf Project_CodeNet_metadata.tar.gz
python extend.py
- C - 11,260
- C++ - 30,197
- C# - 11,952
- Go - 9,720
- Javascript - 6,866
- Typescript - 3,385
- PHP - 6,782