More false negatives #103

Open
bernt-matthias opened this issue Oct 29, 2024 · 2 comments


@bernt-matthias

I was experimenting a bit with puremagic. Unfortunately, the first two tests already failed (while file did its job). grib might just be missing, but H5 should be detected, shouldn't it?

https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/test/test.mz5

python -m puremagic lib/galaxy/datatypes/test/test.mz5 
'lib/galaxy/datatypes/test/test.mz5' : could not be Identified
file lib/galaxy/datatypes/test/test.mz5
lib/galaxy/datatypes/test/test.mz5: Hierarchical Data Format (version 5) data

https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/test/test.grib

python -m puremagic lib/galaxy/datatypes/test/test.grib
'lib/galaxy/datatypes/test/test.grib' : could not be Identified
file lib/galaxy/datatypes/test/test.grib 
Gridded binary (GRIB) version 1
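
For reference, a minimal sketch of the kind of check that would identify this file, assuming the GRIB message starts at offset 0 with no leading bulletin header. In both GRIB edition 1 and edition 2, the first four bytes are the ASCII string GRIB and the eighth byte holds the edition number. The function name grib_edition is hypothetical and not part of puremagic.

```python
def grib_edition(path):
    """Return the GRIB edition number, or None if the file does not start with b"GRIB".

    Assumption: the GRIB message begins at offset 0. Files with a leading
    bulletin header would need a search for the marker instead.
    """
    with open(path, "rb") as fh:
        header = fh.read(8)
    if len(header) < 8 or header[:4] != b"GRIB":
        return None
    # Octet 8 (index 7) of Section 0 is the edition number in GRIB1 and GRIB2.
    return header[7]
```

For a file that file reports as "GRIB version 1", this sketch would be expected to return 1.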
@cdgriffith
Owner

Thanks for reporting, never heard of either of these types before!

For MZ5, I see that the standard from NASA returns a 404: https://earthdata.nasa.gov/esdis/eso/standards-and-references/hdf-eos5
There is also information on it at https://docs.ogc.org/is/18-043r3/18-043r3.html but it makes no mention of magic numbers.

The file itself starts with ‰HDF, so we can probably match on that with low confidence. Do you have any more examples of these file types that I could look through?
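
A minimal sketch of what matching on that prefix could look like, independent of puremagic's own matching code. The ‰ character is how byte 0x89 renders in a text editor; the full HDF5 signature is the eight bytes \x89HDF\r\n\x1a\n. The HDF5 format also allows the signature to sit at offset 512, 1024, 2048 and so on instead of 0. The function name looks_like_hdf5 and the max_offset cutoff are assumptions made for illustration.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path, max_offset=4096):
    """Return True if the HDF5 signature is found at offset 0, 512, 1024, ...

    max_offset is an arbitrary cutoff chosen for this sketch; the format
    itself keeps doubling the candidate offset.
    """
    with open(path, "rb") as fh:
        offset = 0
        while offset <= max_offset:
            fh.seek(offset)
            if fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Candidate offsets are 0, then 512, then each doubling of that.
            offset = 512 if offset == 0 else offset * 2
    return False
```

A match only says the container is HDF5. Telling .mz5 apart from .h5, .loom, .h5ad or .cool would need a look inside the file, which is why a low confidence seems reasonable here.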

@cdgriffith
Owner

I pulled down that repo and looked in the folder with the example files. Compared to file, there are 25 file types that puremagic has no match for, after excluding those that file reports only as ASCII, data, or very short file.

.h5
.model
.biom2
.cool
.grib
.mcool
.vcf
.sam
.loom
.h5ad
.h5mlm
.nii2
.gpr
.npy
.rma6
.cel
.bcf_uncompressed
.mztab2
.parquet
.ptkscmp
.iqtree
.mz5
.fcs
.hdt
.gal

I will start looking into each of those to see whether they have magic numbers associated with them that we can add to puremagic.
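
One way to start that survey is to dump the leading bytes of every sample file with one of the listed extensions and look for shared prefixes. The sketch below uses only the standard library. The function names and the 16-byte default are assumptions made for illustration, not part of puremagic.

```python
from pathlib import Path


def leading_bytes(path, count=16):
    """Read up to `count` bytes from the start of a file."""
    with open(path, "rb") as fh:
        return fh.read(count)


def summarize(folder, extensions, count=16):
    """Map each matching file name in `folder` to a hex dump of its first bytes."""
    wanted = {ext.lower() for ext in extensions}
    summary = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            # bytes.hex with a separator needs Python 3.8 or newer.
            summary[path.name] = leading_bytes(path, count).hex(" ")
    return summary
```

Calling summarize on lib/galaxy/datatypes/test with extensions such as .h5, .mz5 and .grib would return one hex string per file. Files that share a prefix are candidates for a common magic number entry.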

Thank you for raising this issue, and supplying the great source of example files!
