-
Notifications
You must be signed in to change notification settings - Fork 332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi Language Program Support #2087
Multi Language Program Support #2087
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just found two small things.
language-api/src/main/java/de/jplag/options/LanguageOptions.java
Outdated
Show resolved
Hide resolved
languages/multi-language/src/main/java/de/jplag/multilang/MultiLanguageOptions.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat!
@TwoOfTwelve could you adress the Sonar issues please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For conceptually larger PRs, PRs that add a lot of new functionality, or PRs that are research-related, the PR description should be more detailed. It should sufficiently describe the intent of the PR and how it works. Also, edge cases like the one mentioned below. Note that the PRs are linked in the release note, so we should write them as intended for someone external.
As an example, for this PR, write:
- What is the intent (parsing programs with code from multiple languages)
- How is it implemented (language module that passes the files to the language-specific ones and receives tokens)
- Design decisions (e.g., loading languages dynamically instead of dependencies)
- Edge cases (e.g. see below)
|
||
private Optional<Language> findLanguageForFile(File file) { | ||
return this.languages.stream().filter(language -> Arrays.stream(language.suffixes()).anyMatch(suffix -> file.getName().endsWith(suffix))) | ||
.findFirst(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will currently pick the first language for cases where multiple languages support the same file type. Have you discussed this behavior? This affects the C/C++ modules and also the EMF modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also in future Java vs. Java-CPG)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have discussed this with Robin, Nils and Sebastian. Currently it's not that big of an issue, since the user has to select the modules manually. If there are multiple selected modules for the same file, the module is chosen arbitrarily.
This should be addressed in the future, maybe by adding priorities to language modules or by distinguishing files in more detail than just the suffix. I think it should be done separately though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I understand; I overlooked that the users specify the languages when using the module.
In the long run, I considered making the language module the default, but that would only be user-friendly if users do not need to specify languages. This would mean the multi-language module automatically parses all code that JPlag supports. Then we need prioritization.
With the current solution, we add yet another cli argument, which is less likely to be used by many users. For now, let us leave it as is, but before the release, we need to think about which mode we truly want. If we want more people to try out the language module, we probably need to implement the unparameterized version. However, in all cases, I would not make it the default language straight away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good; we should discuss in the future if we prefer specifying languages or if all supported languages should be used by default.
@tsaglam Can you look at the sonar issue with the cyclic dependency? I think it's fine in this case. |
I briefly looked at the cycles. The other one was not immediately clear to me, I will have a look at it later. |
I didn't expect the other one either. I'll take a look at it too |
I think the other one is: EDIT: I have an idea how to fix this cycle in a future PR, but this is not time-sensitive at all. |
Quality Gate passed for 'JPlag Plagiarism Detector'Issues Measures |
This PR adds the capability to parse multiple languages with JPlag in on run.
A new language module is added, which identifies a module for each file and delegates the parsing to that module. The resulting tokens are concatenated and passed back to JPlag. This does not allow comparing different languages with each other, but it allows for projects that contain multiple languages (multi-language projects).
The languages are discovered on runtime using the same mechanism as the cli. This made it necessary to move the LanguageLoader class from the cli module to the language-api module.
The language modules are identified using the suffixes defined in the individual language modules. This matches the way JPlag already identifies code files in a submission.
The user has to select all language modules that should be used manually using a language specific option.
If multiple language modules are selected for the same suffix, a module is chosen arbitrarily.
Future considerations:
Usage example: