Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TIKA-4254 - Fix non-idempotent unit test TestMimeTypes#testJavaRegex #1754

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kaiyaok2
Copy link

@kaiyaok2 kaiyaok2 commented May 11, 2024

Fixes https://issues.apache.org/jira/projects/TIKA/issues/TIKA-4254

Brief Description of the Bug

The test TestMimeTypes#testJavaRegex is non-idempotent, as it passes in the first run but fails in the second run in the same environment. The source of the problem is that each test execution initializes a new media type (MimeType) instance testType (same problem for testType2), and all media types across different test executions attempt to use the same name pattern "rtg_sst_grb_0\\.5\\.\\d{8}". Therefore, in the second execution of the test, the line this.repo.addPattern(testType, pattern, true); will throw an error, since the name pattern is already used by the testType instance initiated from the first test execution. Specifically, in the second run, the addGlob() method of the Pattern class will assert conflict patterns and throw aMimeTypeException(line 123 in Patterns.java).

Failure Message in the 2nd Test Run:

org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: rtg_sst_grb_0\.5\.\d{8}
	at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
	at org.apache.tika.mime.Patterns.add(Patterns.java:71)
	at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
	at org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)

Reproduce

Use the NIOInspector plugin that supports rerunning individual tests in the same environment:

cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
mvn edu.illinois:NIOInspector:rerun -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex

Proposed Fix

Declare testType and testType2 as static variables and initialize them at class loading time. Therefore, repeated runs of testJavaRegex() will not conflict each other. All tests pass and are idempotent after the fix.

Necessity of Fix

A fix is recommended as unit tests shall be idempotent, and state pollution shall be mitigated so that newly introduced tests do not fail in the future due to polluted shared states.

@tballison
Copy link
Contributor

The repo is refreshed with each unit test in the @BeforeEach call, though. Is NIODetector respecting that?

@kaiyaok2
Copy link
Author

kaiyaok2 commented May 11, 2024

The repo is refreshed with each unit test in the @BeforeEach call, though. Is NIODetector respecting that?

@tballison Yes, NIOInspector uses the JUnit Jupiter engine and takes into account of all setup and teardown methods.

It seems that the patterns matcher of repo is not reset by the following code in @BeforeEach:

TikaConfig config = TikaConfig.getDefaultConfig();
repo = config.getMimeRepository();

@THausherr
Copy link
Contributor

Maybe I get it: repo = config.getMimeRepository(); isn't creating anything new, it's retrieving something that is changed later by the test? If my understanding is correct then it's a deeper problem.

@kaiyaok2
Copy link
Author

kaiyaok2 commented May 11, 2024

getMimeRepository

I think it might the case. I wrote this dummy test, and it fails under surefire:

@Test
public void testResetRepo() throws Exception {
    TikaConfig config0 = TikaConfig.getDefaultConfig();
    MimeTypes repo0 = config0.getMimeRepository();
    MimeType testType0 = new MimeType(MediaType.parse("baz/bar"));
    String pattern = "rtg_sst_grb_0\\.5\\.\\d{9}";
    repo0.addPattern(testType0, pattern, true);

    TikaConfig config1 = TikaConfig.getDefaultConfig();
    MimeTypes repo1 = config1.getMimeRepository();
    MimeType testType1 = new MimeType(MediaType.parse("baz/bar"));
    repo1.addPattern(testType1, pattern, true);
}

Error Message:

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.857 s <<< FAILURE! -- in org.apache.tika.mime.TestMimeTypes
[ERROR] org.apache.tika.mime.TestMimeTypes.testResetRepo -- Time elapsed: 0.846 s <<< ERROR!
org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: rtg_sst_grb_0\.5\.\d{9}
	at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
	at org.apache.tika.mime.Patterns.add(Patterns.java:71)
	at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
	at org.apache.tika.mime.TestMimeTypes.testResetRepo(TestMimeTypes.java:873)
        ...

@kaiyaok2
Copy link
Author

@THausherr @tballison I confirmed that the two lines in @BeforeEach does not create a new repo if one exists from a previous test run:

TikaConfig config = TikaConfig.getDefaultConfig();
repo = config.getMimeRepository();

TikaConfig.getDefaultConfig() simply calls the default TikaConfig() constructor (

).

When the system property 'tika.config' and the environment variable 'TIKA_CONFIG' are both not set, the mimeTypes field (accessible by getMimeRepository() - which is repo in our context) of the constructed config will be getDefaultMimeTypes(getContextClassLoader())(

this.mimeTypes = getDefaultMimeTypes(getContextClassLoader());
).

Now take a look at getDefaultMimeTypes() - when a classloader is given (getContextClassLoader() in our context), it first tries to retrieve from a HashMap via CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader); (

types = CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader);
). Notice that CLASSLOADER_SPECIFIC_DEFAULT_TYPES is not an instance variable, but a static HashMap.

So in the first test execution, the CLASSLOADER_SPECIFIC_DEFAULT_TYPES is empty, so types after the line types = CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader); will be null, and is later initialized by MimeTypesFactory.create() as desired. After this, the initialized types is put to the static CLASSLOADER_SPECIFIC_DEFAULT_TYPES map (

CLASSLOADER_SPECIFIC_DEFAULT_TYPES.put(classLoader, types);
).

Now in the second test execution, the CLASSLOADER_SPECIFIC_DEFAULT_TYPES already has the key of the context class loader, with corresponding types initialized from the previous run. So CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader) will return such initialized object directly. In other words, repo would be the same object across repeated test runs.

I think the essential idea of CLASSLOADER_SPECIFIC_DEFAULT_TYPES is 1-to-1 map between classloaders and default types, so this implementation does not seem buggy for me, but please confirm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants