NOTE: the .csv files that compose the TSSM dataset are available in the figshare repository.
Mining GitHub is time-consuming because it hosts more than 8 million projects. Therefore, we used the 147,991 Java projects with more than five stars listed by Loriot et al. (2020) and Durieux et al. (2021). The list of projects is available at project_list.txt.
We established the following criteria to select projects that match the requirements of data extraction tools:
- Open-source projects. We limit our search to projects with a declared license compliant with OSI (Open-Source Initiative) or FSF (Free Software Foundation) licensing;
- Non-forked projects. We removed forked projects because they contain code similar to the original projects, which yields similar values in the data extraction and biases the results;
- Projects that use Java as the primary programming language. Although the initial list contains projects that used Java as the primary programming language, the projects have evolved since the GitHub mining. Therefore, we checked whether each project still uses Java as its primary language.
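The three criteria above can be read as a single predicate over a project's metadata. The following is a minimal sketch, not the filter implemented in MiningGitHub: the `Repo` record, its field names, and the `APPROVED_LICENSES` set are hypothetical, and the license set is only a small illustrative subset of OSI-approved SPDX identifiers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately incomplete subset of OSI-approved licenses.
APPROVED_LICENSES = {"MIT", "Apache-2.0", "GPL-3.0", "BSD-3-Clause"}


@dataclass
class Repo:
    """Hypothetical metadata record; field names are assumptions."""
    name: str
    license_id: Optional[str]        # None when no license is declared
    is_fork: bool
    primary_language: Optional[str]


def is_selected(repo: Repo) -> bool:
    """Apply the three selection criteria to one project."""
    return (
        repo.license_id in APPROVED_LICENSES   # open-source license declared
        and not repo.is_fork                   # non-forked project
        and repo.primary_language == "Java"    # Java is still the primary language
    )


repos = [
    Repo("a/keep", "MIT", False, "Java"),
    Repo("b/forked", "MIT", True, "Java"),
    Repo("c/unlicensed", None, False, "Java"),
    Repo("d/moved-to-kotlin", "Apache-2.0", False, "Kotlin"),
]
selected = [r.name for r in repos if is_selected(r)]
print(selected)
```

Only the first project passes all three checks; each of the others fails exactly one criterion.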
We developed MiningGitHub to collect the following data from the projects:
- Test Smells. We used the JNose Test to collect data of test smells in the test code.
- Structural metrics. We used the CK Metrics to collect structural metrics from the test and production code.
- Metadata from GitHub. We used the GHRepository to collect the metadata of the projects successfully processed by JNose Test and CK Metrics.
As a result of the last three steps, the TSSM dataset contains data from 13,703 open-source Java projects. The list of selected projects is available at selected_project_list.txt, and the metadata of these projects is available at projects.csv. In addition, we made the files containing the data on test smells and metrics available in the folder TSSM in the figshare repository. It is structured as follows:
TSSM
│
├── Metrics of the test code (test_data):
│ ├── testClassSmells.csv: contains data of 18 test smells at class level
│ ├── testMethodSmells.csv: contains data of 18 test smells at method level
│ ├── testClassMetrics.csv: contains data of 44 structural metrics at test class level
│ ├── testMethodMetrics.csv: contains data of 28 structural metrics at test method level
│ ├── mergeTestClass.csv: contains data of 44 structural metrics and 18 test smells at test class level
│ ├── mergeTestMethod.csv: contains data of 28 structural metrics and 18 test smells at test method level
├── Metrics of the production code (production_data):
│ ├── productionClassSmells.csv: contains data of 18 test smells at class level
│ ├── productionMethodSmells.csv: contains data of 18 test smells at method level
│ ├── productionClassMetrics.csv: contains data of 44 structural metrics at production class level
│ ├── productionMethodMetrics.csv: contains data of 28 structural metrics at production method level
│ ├── mergeProductionClass.csv: contains data of 44 structural metrics and 18 test smells at production class level
│ ├── mergeProductionMethod.csv: contains data of 28 structural metrics and 18 test smells at production method level
├── Projects metadata (projects_metadata)
│ ├── projects.csv: contains the metadata of 13,703 projects
│ ├── selected_project_list.txt: contains the names of the selected projects
│
Prerequisites:
- JDK 1.8
- Maven 3
The JNose Test requires the jnose-core dependency. Install the dependency with the following steps:
git clone git@github.com:arieslab/jnose-core.git
cd jnose-core
mvn install
Clone the project to generate the dataset using the following command:
git clone git@github.com:arieslab/TSSM.git
Open the project MiningGitHub in the IDE as a Maven project (we used IntelliJ), then configure and run the class Main.java with the following information:
- ghKey receives a personal access token from GitHub. Generating a ghKey.
- startNumberList receives the initial lineID of a project from project_list.txt at which to start the data collection.
- endNumberList receives the final lineID of a project from project_list.txt at which to end the data collection.
Next, we developed script.py to merge the test smells and metrics. We analyzed the collected data at class and method levels to establish a traceability link between the JNose Test and CK Metrics tools. Note that not all production classes of a project have a matching test class, and the same holds at the method level. We followed the JUnit naming convention of either prepending or appending the word Test to the name of the production class, at the same level of the package hierarchy. For example, if a production class in the package /src/java/example/ is called ExampleName.java, its test class should be in the package /src/test/example and be named ExampleNameTest.java or TestExampleName.java.
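The naming convention described above can be illustrated with a short sketch. It is a simplification and not the code of the scripts in this repository: the functions `candidate_test_names` and `link_tests` are hypothetical, and the sketch matches on file names only, without checking that both files sit at the same level of the package hierarchy.

```python
from pathlib import PurePosixPath


def candidate_test_names(production_file: str) -> set:
    """Return the two test file names allowed by the naming convention."""
    stem = PurePosixPath(production_file).stem   # "ExampleName"
    return {f"{stem}Test.java", f"Test{stem}.java"}


def link_tests(production_files, test_files):
    """Map each production file to the test files whose names match.

    Production files without a matching test file are left out,
    mirroring the fact that not every production class has a test class.
    """
    # Index test files by bare file name for direct lookup.
    tests_by_name = {}
    for test_file in test_files:
        name = PurePosixPath(test_file).name
        tests_by_name.setdefault(name, []).append(test_file)

    links = {}
    for production_file in production_files:
        matches = []
        for name in sorted(candidate_test_names(production_file)):
            matches.extend(tests_by_name.get(name, []))
        if matches:
            links[production_file] = matches
    return links


production = [
    "src/java/example/ExampleName.java",
    "src/java/example/Orphan.java",
]
tests = [
    "src/test/example/ExampleNameTest.java",
    "src/test/example/TestOther.java",
]
links = link_tests(production, tests)
print(links)
```

In this example, ExampleName.java is linked to ExampleNameTest.java, while Orphan.java has no counterpart and is omitted from the result.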
In addition, we analyzed the production and test classes regarding the number of lines, number of classes, and number of methods using Scripts/calculate_size_metrics.py. The results are merged into the other metadata collected from GitHub in projects.csv.
To run the file, open the Scripts folder and execute the command:
python3 calculate_size_metrics.py
Prerequisites:
- Python 3
Open the Scripts folder and execute the commands:
python3 merge_production_data.py
python3 merge_test_data.py
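Conceptually, merging the test smells with the structural metrics is a join on the columns that identify a class (or method). The sketch below shows that idea with the standard csv module. It is not the content of merge_production_data.py or merge_test_data.py: the function `merge_rows` and the column names `project`, `class`, `assertionRoulette`, and `loc` are assumptions chosen for illustration and may not match the real files.

```python
import csv
import io


def merge_rows(smell_rows, metric_rows, key_fields):
    """Inner-join smell rows with metric rows on the given key columns."""
    metrics_by_key = {
        tuple(row[k] for k in key_fields): row for row in metric_rows
    }
    merged = []
    for smell_row in smell_rows:
        key = tuple(smell_row[k] for k in key_fields)
        metric_row = metrics_by_key.get(key)
        if metric_row is None:
            continue  # no metrics collected for this class: skip it
        merged.append({**metric_row, **smell_row})
    return merged


# Tiny in-memory stand-ins for a smells file and a metrics file.
smells_csv = (
    "project,class,assertionRoulette\n"
    "acme/app,FooTest,2\n"
    "acme/app,BarTest,0\n"
)
metrics_csv = (
    "project,class,loc\n"
    "acme/app,FooTest,40\n"
)

smell_rows = list(csv.DictReader(io.StringIO(smells_csv)))
metric_rows = list(csv.DictReader(io.StringIO(metrics_csv)))
merged = merge_rows(smell_rows, metric_rows, ["project", "class"])
print(merged)
```

An inner join is used here so that only classes present in both inputs appear in the merged output. Whether the actual scripts keep unmatched rows is not stated in this document.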