Version 2.0 Goals #70

Open
cdgriffith opened this issue May 12, 2024 · 6 comments

@cdgriffith
Owner

cdgriffith commented May 12, 2024

Now that puremagic is picking up some outside traction and is used in places like MongoDB, I want to lay out clear future plans.

Please keep comments on this page limited to overall goals. Any specific conversation about a goal should be its own issue, and progress will be updated here.

@NebularNerd
Contributor

Could #69 be a new feature for 2.0? Compatibility-wise, the new field should not break anything (that I'm aware of).

@CatKasha

Hi, I found your project via the "Explore repositories" feed on the github.com homepage.
I have a somewhat similar project: https://github.com/CatKasha/yet-another-filetype-checker
I don't know if it will be helpful (my project is very simple), but I hope it gives you some ideas for improvements.

@chapmanjacobd

chapmanjacobd commented May 26, 2024

I just found this: https://mark0.net/soft-trid-e.html

I'm not sure how well known it is, but it contains "over 17k file types". The file signatures do not have an explicit data license attached, but at the very least they might be useful to compare against.

maybe related:

@NebularNerd
Contributor

TrID is one of the oldest filetype sites/software out there. That site has looked near enough the same for decades.

Their database is pretty solid and very extensive, but they cannot generate a confidence value or handle more complicated searches. For example, .SBK (Creative Soundfont) is only handled as an extension, whereas we can look at the file in two places to generate a match.
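
To make the "two places" idea concrete, here is a rough sketch. It is not puremagic's actual code: the `matches_all` helper and the rule layout are hypothetical, and WAV is used as the stand-in example because its layout is well known (`RIFF` at offset 0, `WAVE` at offset 8), not because it is the .SBK signature.

```python
from typing import Iterable, Tuple

# Hypothetical rule format: every (offset, expected bytes) pair must match.
WAV_RULE = [(0, b"RIFF"), (8, b"WAVE")]


def matches_all(data: bytes, rule: Iterable[Tuple[int, bytes]]) -> bool:
    """Return True only if each expected byte string is found at its offset."""
    return all(data[offset:offset + len(magic)] == magic for offset, magic in rule)


wav_header = b"RIFF" + b"\x24\x00\x00\x00" + b"WAVEfmt "
avi_header = b"RIFF" + b"\x24\x00\x00\x00" + b"AVI LIST"

print(matches_all(wav_header, WAV_RULE))  # True
print(matches_all(avi_header, WAV_RULE))  # False
```

Checking a single offset would treat both headers as the same RIFF container; the second check is what separates them.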

@cdgriffith
Owner Author

cdgriffith commented Sep 28, 2024

Starting work on adding more advanced scanners. It's rough right now, but there is detection for unusual PDFs (#94) and better detection of ZIP-based formats (#102: MS Office, Open Office, JAR, APK, etc.).

https://github.com/cdgriffith/puremagic/tree/deep-scan/puremagic/scanners
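
For anyone wondering what ZIP-based detection can look like in general, below is a minimal sketch using only the standard library. It is an assumption about the general approach, not the code in the linked branch: `guess_zip_kind` and its return labels are made up for illustration. It relies on well-known container conventions (a `mimetype` member in OpenDocument files, `[Content_Types].xml` in Office Open XML, `AndroidManifest.xml` in APKs, `META-INF/MANIFEST.MF` in JARs).

```python
import io
import zipfile


def guess_zip_kind(data: bytes) -> str:
    """Guess a ZIP-based format from its member names (illustrative only)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        if "mimetype" in names:
            # OpenDocument stores its MIME type in a member called "mimetype".
            return archive.read("mimetype").decode("ascii", "replace").strip()
        if "[Content_Types].xml" in names:
            # Office Open XML: the top-level folder hints at the application.
            for prefix, kind in (("word/", "docx"), ("xl/", "xlsx"), ("ppt/", "pptx")):
                if any(name.startswith(prefix) for name in names):
                    return kind
            return "ooxml"
        # An APK usually also has a JAR-style manifest, so check it first.
        if "AndroidManifest.xml" in names:
            return "apk"
        if "META-INF/MANIFEST.MF" in names:
            return "jar"
    return "zip"


def build_zip(members: dict) -> bytes:
    """Helper for the demo: build an in-memory ZIP from {name: content}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


docx_like = build_zip({"[Content_Types].xml": "<Types/>", "word/document.xml": "<w/>"})
print(guess_zip_kind(docx_like))  # docx
```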

Before release want to add scanners for:

  • ASCII Text
  • Encoded text (at minimum UTF-8, UTF-16 and Windows standard encodings)
  • Generic Binary File
  • PDF
  • ZIP, Word, Open Office
  • Python files (and other languages, eventually)
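
As a rough illustration of what the text scanners in this list might need to decide, here is a small sketch that separates ASCII, UTF-8, UTF-16 (by byte order mark) and "probably binary". The function name and labels are hypothetical, Windows code pages are deliberately left out, and the BOM check does not distinguish UTF-32, so treat it as a starting point only.

```python
import codecs


def guess_text_encoding(data: bytes) -> str:
    """Return a rough encoding label, or "binary" if nothing fits."""
    # A byte order mark is the strongest hint, so check those first.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # NUL bytes without a BOM are treated as a sign of binary data.
    if b"\x00" in data:
        return "binary"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "binary"


print(guess_text_encoding(b"hello"))                  # ascii
print(guess_text_encoding("héllo".encode("utf-8")))   # utf-8
print(guess_text_encoding("hi".encode("utf-16")))     # utf-16
print(guess_text_encoding(b"\x89PNG\r\n\x1a\n\x00"))  # binary
```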

Still need to do:

  • Tests, Tests, Tests
  • Code simplification
  • Documentation updates
  • Support for streams, not just files
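
On the stream item, one possible shape is a helper that lets a scanner look at the start of a file-like object without consuming it. This is only a sketch under the assumption that the stream is seekable; `peek_header` is a hypothetical name, and a non-seekable stream (a socket, a pipe) would need buffering instead.

```python
import io
from typing import BinaryIO


def peek_header(stream: BinaryIO, size: int = 16) -> bytes:
    """Read up to `size` bytes from a seekable stream, then restore its position."""
    start = stream.tell()
    try:
        return stream.read(size)
    finally:
        # Put the cursor back so later readers see the stream unchanged.
        stream.seek(start)


stream = io.BytesIO(b"%PDF-1.7\n...")
print(peek_header(stream, 5))  # b'%PDF-'
print(stream.tell())           # 0
```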

I won't be able to work on more myself for at least two weeks, hence this in-progress documentation. The biggest help would be a testing framework for scanners, if anyone wants to contribute to a part of this!

@NebularNerd
Contributor

Just had a quick skim through the code and this is awesome stuff. The zip method is coded way better than I could manage, but I can see it works the way I sort of imagined it would. If I want to help fill in some of the .zip types, what's the best way? I'm guessing I need to fork the dev branch?

Looking at the two examples, I can see roughly how to improve some of the more complex formats I've mentioned in my PRs. For example, we could heavily reduce the size of the .json by moving all the .mp3-related stuff I added into a dedicated scanner. The scanner itself would likely be smaller than the equivalent .json entries, as we would not need to repeat everything so heavily.
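
To sketch what I mean by a dedicated .mp3 scanner (purely hypothetical, not based on the existing scanner code): one function can cover both an ID3v2 tag at the start of the file and a bare MPEG audio frame, where the frame sync is 11 set bits (`0xFF` followed by a byte whose top three bits are set).

```python
def looks_like_mp3(data: bytes) -> bool:
    """Very rough MP3 check: ID3v2 tag or an MPEG audio frame sync at offset 0."""
    if data[:3] == b"ID3":
        return True
    # Frame sync: all 8 bits of byte 0 and the top 3 bits of byte 1 are set.
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0


print(looks_like_mp3(b"ID3\x04\x00\x00"))   # True
print(looks_like_mp3(b"\xff\xfb\x90\x00"))  # True
print(looks_like_mp3(b"RIFF\x00\x00"))      # False
```

The frame sync alone is a weak signal (a UTF-16 little-endian BOM also passes it), so a real scanner would want to validate more of the frame header before claiming a match.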


4 participants