2024-05-06 Fix Confidence sorting #66

Merged

NebularNerd
Contributor

@NebularNerd NebularNerd commented May 6, 2024

While adding #65, one thing that came up was that the winner was never the longest match, no matter how long its byte_match was. This quick one-liner addresses that issue: matches are now sorted by confidence, then by byte_match.
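
To illustrate the idea, here is a minimal sketch of a two-level sort. It assumes that "then byte_match" means the length of the matched bytes; the actual one-liner is not shown in this conversation and may differ. `Match` and `sort_matches` are hypothetical names used only for this example, not PureMagic's own types or functions.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """Hypothetical stand-in for a match result (not PureMagic's own type)."""
    name: str
    extension: str
    confidence: float
    byte_match: bytes


def sort_matches(matches: List[Match]) -> List[Match]:
    # Sort on a tuple key: confidence first, then length of the matched bytes.
    # reverse=True puts the highest confidence and longest match first.
    # Assumption: ties are broken by length; the real fix may use another key.
    return sorted(matches, key=lambda m: (m.confidence, len(m.byte_match)), reverse=True)


# Values copied from the .docx output in this comment.
matches = [
    Match("MS Office Open XML Format Document", ".docx", 0.8,
          b"PK\x03\x04\x14\x00\x06\x00"),
    Match("Microsoft Office 2007+ Open XML Format Document file", ".xlsx", 0.8,
          b"PK\x03\x04\x14\x00\x06\x00"),
    Match("MS Office Open XML Format Word Document", ".docx", 0.8,
          b"PK\x03\x04\x14\x00\x06\x00word/document.xml"),
    Match("MS Office Open XML Format Document", ".pptx", 0.4,
          b"PK\x03\x04"),
]

ranked = sort_matches(matches)
print(ranked[0].name)
```

Because Python's sort is stable, entries with the same confidence and the same match length keep their original relative order.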

Before the fix (Alternate match #2 should be the winner):

China tax introduction(edited1).docx
Most likely match:
Format:        MS Office Open XML Format Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #1
Format:        Microsoft Office 2007+ Open XML Format Document file
Confidence:    80.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #2
Format:        MS Office Open XML Format Word Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00word/document.xml'
Hex:           504b 0304 1400 0600 776f 7264 2f64 6f63 756d 656e 742e 786d 6c
String:        PKword/document.xml

Alternate match #3
Format:        MS Office Open XML Format Document
Confidence:    40.0%
Extension:     .pptx
MIME:          application/vnd.openxmlformats-officedocument.presentationml.presentation
Offset:        0
Bytes Matched: b'PK\x03\x04'
Hex:           504b 0304
String:        PK

Alternate match #4
Format:        MS Office Open XML Format Document
Confidence:    40.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Offset:        0
Bytes Matched: b'PK\x03\x04'
Hex:           504b 0304
String:        PK

Omitting other 20+ matches

Now it is!

China tax introduction(edited1).docx
Most likely match:
Format:        MS Office Open XML Format Word Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        3000
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00word/document.xml'
Hex:           504b 0304 1400 0600 776f 7264 2f64 6f63 756d 656e 742e 786d 6c
String:        PKword/document.xml

Alternate match #1
Format:        MS Office Open XML Format Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #2
Format:        Microsoft Excel - Macro-Enabled Workbook
Confidence:    80.0%
Extension:     .xlsm
MIME:          application/vnd.ms-excel.sheet.macroEnabled.12
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #3
Format:        Microsoft Office 2007+ Open XML Format Document file
Confidence:    80.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #4
Format:        MS Office Open XML Format Document
Confidence:    40.0%
Extension:     .pptx
MIME:          application/vnd.openxmlformats-officedocument.presentationml.presentation
Offset:        0
Bytes Matched: b'PK\x03\x04'
Hex:           504b 0304
String:        PK

Omitting other 20+ matches

This will help in the future with really long matches if PureMagic adopts rules. For example, with this .xlsm file, Alternate match #1 would be the better choice because the file matches the .vba aspect; however, regular Excel still takes the top spot and the macro-flavored version comes second. With a rules-based system you could combine those together for a MEGA MATCH! of b'PK\x03\x04\x14\x00\x06\x00xl/workbook.xmlxl/vbaProject.bin', which should be unbeatable 😎:

ID-Generator-MACRO.xlsm
Most likely match:
Format:        Microsoft Office 2007+ Open XML Format Excel Document file
Confidence:    80.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        3000
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00xl/workbook.xml'
Hex:           504b 0304 1400 0600 786c 2f77 6f72 6b62 6f6f 6b2e 786d 6c
String:        PKxl/workbook.xml

Alternate match #1
Format:        Microsoft Excel - Macro-Enabled Workbook
Confidence:    80.0%
Extension:     .xlsm
MIME:          application/vnd.ms-excel.sheet.macroEnabled.12
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00xl/vbaProject.bin'
Hex:           504b 0304 1400 0600 786c 2f76 6261 5072 6f6a 6563 742e 6269 6e
String:        PKxl/vbaProject.bin

Alternate match #2
Format:        MS Office Open XML Format Document
Confidence:    80.0%
Extension:     .docx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #3
Format:        Microsoft Excel - Macro-Enabled Workbook
Confidence:    80.0%
Extension:     .xlsm
MIME:          application/vnd.ms-excel.sheet.macroEnabled.12
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Alternate match #4
Format:        Microsoft Office 2007+ Open XML Format Document file
Confidence:    80.0%
Extension:     .xlsx
MIME:          application/vnd.openxmlformats-officedocument.wordprocessingml.document
Offset:        0
Bytes Matched: b'PK\x03\x04\x14\x00\x06\x00'
Hex:           504b 0304 1400 0600
String:        PK

Omitting other 20+ matches
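
As a rough sketch of what such a combined rule could look like: the code below is purely hypothetical and is not an existing PureMagic feature. The function `combined_match` and its parameters are invented for illustration. It only builds the longer "mega match" when the header and every marker are found in the data.

```python
from typing import Optional, Sequence

# Header bytes taken from the matches shown above.
HEADER = b"PK\x03\x04\x14\x00\x06\x00"


def combined_match(data: bytes, header: bytes, markers: Sequence[bytes]) -> Optional[bytes]:
    """Return header plus all markers if every one is present, else None.

    Hypothetical rule: the file must start with the header and contain
    each marker somewhere in its contents.
    """
    if not data.startswith(header):
        return None
    if not all(marker in data for marker in markers):
        return None
    return header + b"".join(markers)


# Fake contents for illustration only; a real .xlsm is a ZIP archive.
sample = HEADER + b"...xl/workbook.xml...xl/vbaProject.bin..."
mega = combined_match(sample, HEADER, [b"xl/workbook.xml", b"xl/vbaProject.bin"])
print(mega)
```

A file missing either marker returns `None`, so a plain workbook without `xl/vbaProject.bin` would not get the combined match.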

@cdgriffith
Owner

Thank you for this fix!

@cdgriffith cdgriffith changed the base branch from master to develop May 8, 2024 17:58
@cdgriffith cdgriffith merged commit 351fdfd into cdgriffith:develop May 8, 2024
5 checks passed
@cdgriffith cdgriffith mentioned this pull request Jun 16, 2024
cdgriffith added a commit that referenced this pull request Jun 16, 2024
- Adding #72 #75 #76 #81 `.what()` to be a drop in replacement for `imghdr.what()` (thanks to Christian Clauss and Andy - NebularNerd)
- Adding #67 Test on Python 3.13 beta (thanks to Christian Clauss)
- Adding #77 `from __future__ import annotations` (thanks to Christian Clauss)
- Fixing #66 Confidence sorting (thanks to Andy - NebularNerd)

---------

Co-authored-by: Andy <[email protected]>
Co-authored-by: Christian Clauss <[email protected]>