Updates
This update is primarily for new formats; it's a bit of a mix this time around. Moving forward my update texts are changing as well: I'm using VSCode to create the blurb, aiming for a clearer layout 😎
Lots of new matches and some fixes for older entries. Some of the ASCII translations of the hex have been left out as it broke GitHub when I was pasting them in.
On the subject of matching, it's becoming clearer that, as part of the v2 rebuild plan, the .zip, .xml and Microsoft Compound File formats all need some form of unpacking/decoding to allow for better matching and fewer alternative confidences. Some of these are now giving 20-30 matches.
Formats
Canon Camera RAW 2
Extensions:
.cr2
Magic: Intel TIFF* then a second marker of 0x435202 / CR (plus byte 0x02) at byte 8
An update to Canon's original RAW format, this uses a beefed-up Intel* TIFF file. Like TIFF, there is a lot of info we could extract in later v2.0 expansion ideas.
*There may possibly be Motorola encoded .cr2 files out there as well going by one source, but my 350D files are Intel flavoured so I've only added that for now.
Panasonic RAW and RAW2 and LEICA RAW
Extensions:
.raw
.rw2
.rwl
Magic:
0x49495500
There are entries for these file extensions in the .json; however, I suspect they are either duff entries or will only match the file from which they were sourced. From my own Panasonic FZ1000 and various test files on the links below, they all start with the magic above. I have not removed the existing entries as I may be wrong about them not being valid. The LEICA cameras are basically posh Panasonics and use the same file format with a different extension; all other details are the same.
If anyone comes across Panasonic RAWs that don't match, please leave a comment so we can take a look.
Comic Book Archives
Extensions:
.cb7
.cba
.cbr
.cbt
.cbz
These are simply archives containing image files in numerical order; the extension gives away the parent format: 7-Zip, ACE, RAR, TAR and Zip respectively. Headers are identical to the parent formats they use.
PRC, Mobipocket and Amazon Kindle eBooks
Extensions:
.prc
.mobi
.azw
.azw1
.azw3
.azw4
.tpz
.kfx
.kcr
This is a weird hodgepodge of formats. Starting with the original AportisDoc document .pdc, U.S. Robotics .prc and Mobipocket SA .mobi formats, which hail from the PalmPilot era, they eventually morphed into Amazon Kindle files with only the extensions to tell them apart (there are deeper changes but on the surface they are essentially the same). To be annoying, a lot of eBooks have the .mobi or .azw extension when they should really have something else; this will affect FILE based scores as that uses the extension as part of the scoring.
Starting with KF8, .azw3 format files could be MOBI or dual format MOBI/EPUB but still have the same extension; KF10/KFX .azw4/.kfx files are a completely new format. There are even more subformats/subversions than I have added but I need to learn more about them or get samples, or PureMagic needs new features to dig deeper into the files.
PalmDoc
Extensions:
.prc
Magic:
0x5445587452454164 / TEXtREAd at byte 60
Pretty much the granddaddy of them all, the PalmDOC eBook format from the PalmPilot series of handhelds. Technically they are a subformat of Palm DOC .pdb (see below) but the header is what classes them as a PRC eBook. They will conflict-match with AportisDoc document .pdc as they are one and the same filetype; U.S. Robotics used the format as the basis for the Palm operating system. Just to be awkward, a .prc may be a .pdb and vice-versa.
I've also added .prc as an extension only, due to it being used to store all manner of data on Palm Pilots.
MOBI and early Kindle eBooks
Extensions:
.mobi, .azw, .azw3 (MOBI)
Magic:
0x424f4f4b4d4f4249 / BOOKMOBI at byte 60 and a footer of 0xe98e0d0a at -4
The most common of the formats in this batch; most eBooks you will find are MOBI (aka Mobi6 format) regardless of extension. Some old MOBI may have the extension .prc or .pdb from their PalmPilot roots.
Topaz DRM eBooks
Extensions:
.azw1 and .tpz
Magic:
0x54505a / TPZ at different offsets per file
These are DRM encrypted files delivered via Whispernet or downloaded to your PC. I have a single .azw file in my Kindle library which is DRM'd (others are newer .azw4). I need more samples, but based on DeDRM we should be looking for the TPZ magic; it's not at a fixed position so I'm adding these as extension only for now. v2 upgrades should let us test for this.
Kindle KF8 eBooks
Extensions:
.azw3
Magic:
0x424f4f4b4d4f4249 / BOOKMOBI at byte 60 and a footer of 0x434f4e54424f554e44415259e98e0d0a / CONTBOUNDARY (plus bytes 0xe98e0d0a) at -16
These are dual format MOBI/ePub eBooks that have the tag BOUNDARY at the end of the MOBI data; however, this is not at a fixed position so would require a v2 upgrade to search for it. Handily, they also have a longer footer than regular MOBI files, so we'll use that instead. 😊
Amazon Print Replica eBook (aka Kindle Format 10/KF10/KFX)
Extensions:
.azw4, .kfx
Magic:
0xea44524d494f4eeee00100eaee9e8183de9a86be97de95848d50726f74656374656444617461
This is the current Kindle format. All my files downloaded through Kindle for Windows still use .azw for the extension, so again FILE based scores will be affected. However, with such a ridiculously long match you can be more than certain it's this format. There is a version number, but we'd need a regex to ensure correct reporting of just the digits as they seem to follow the pattern v1.1blahblahblah or v1.85blahblahblah. Given how many versions there could be, that would mean a lot of extra data in the database if we went with fixed strings.
Kindle Cloud Reader and Kindle for Mac
Extensions:
.kcr
As the label suggests, these are another wrapper for Kindle files. From limited info they are an .azk wrapped in DRM. I have no samples for these, so adding the extension only for now.
Kindle Preview file
Extensions:
.azk
A PK zip based file format used by Kindle Previewer and older iOS Kindle apps. Again, no samples available so extension only for now.
Sundry files
These are files that you'll find with some eBooks, none are eBooks themselves but provide functionality to them.
.voucher appears to be the DRM key for KFX eBooks, all start 0xe00100eaee9e8183de9a86be97de95848d50726f74656374656444617461
.mbpV2 is a metadata file; it stores the last position and annotations. It's a basic JSON data file starting 0x7b226d6435223a22 / {"md5":"
.mbp is the original MOBI metadata file; like its newer brother above it does the same job. I have no sample files so adding as an extension only for now.
.azw.res are Resource Containers that hold external data files such as high-res images, part of the AZW6 specification originally aimed at Japanese manga and graphic novels; western comics adopted the same format to offer higher quality images. Header of 0x434f4e540200
.azw.md are Metadata Containers; they use the same header as .azw.res.
.phl are Amazon Kindle Popular Highlights files; these are XML files that show how many people highlighted certain passages etc. All start with 0x3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d3822207374616e64616c6f6e653d22796573223f3e / <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
.azw9.res are Resource containers for Kindle on the Mac, no samples so adding as an extension for now.
.azw9.md are Metadata containers for Kindle on the Mac, no samples so adding as an extension for now.
This was a proper rabbit hole job: it took way longer than I thought it would and there is still more to uncover. This pile of links covers most of what I dug out. As samples become available and new features are added to PureMagic we can do more with this bunch of formats. Some information is contradictory, so I expect there will be tweaks to this lot over time.
Palm OS Database
Extensions:
.pdb
The primary data format for the PalmPilot (also Handspring Visor and Sony CLIÉ) series of handheld devices. A bit like RIFF and IFF, it's a container format that wraps around many types of data. .prc, .mobi and AportisDoc document .pdc files are a form of PDB, but as they get lumped in with all the other eBook formats I left them above. All files share the same extension with just the byte 60 header changing. There are later PDB files that use zTXT (such as Weasel), but that is another kettle of fish entirely. All PDB use the same mimetype, with PalmOS deciding what to do once it looks at the subformat tag.
Much like the eBooks above, the extension does not mean a lot; a Palm file could easily be an application and still have a .prc extension, for example.
Subformats
All these start at byte 60
Palm Pilot Applications:
0x6170706c
/appl
Palm Pilot zTXT Compressed file:
0x7a545854
/zTXT
GrayPaint
0x444154414772503f
/DATAGrP?
Adobe Reader
0x2e70646641444245
/.pdfADBE
BDicty (Dictionary Reader)
0x42566f6b42444943
/BVokBDIC
DB (Database program)
0x4442393944424f53
/DB99DBOS
eReader (aka Palm Reader)
0x504e526450507273
/PNRdPPrs
eReader
0x4461746150507273
/DataPPrs
FireViewer (ImageViewer)
0x76494d4756696577
/vIMGView
HanDBase
0x506d4442506d4442
/PmDBPmDB
InfoView
0x496e666f494e4442
/InfoINDB
iSilo
0x546f476f546f476f
/ToGoToGo
iSilo 3
0x53446f6353696c58
/SDocSilX
JFile
0x4a6244624a426173
/JbDbJBas
JFile Pro
0x4a6644624a46696c
/JfDbJFil
LIST
0x444154414c536462
/DATALSdb
MobileDB
0x4d6462314d646231
/Mdb1Mdb1
Plucker
0x44617461506c6b72
/DataPlkr
PQA
0x70716120636c7072
/pqa clpr
QuickSheet
0x4461746153707264
/DataSprd
SuperMemo
0x534d3031534d656d
/SM01SMem
TealDoc
0x54455874546c4463
/TEXtTlDc
TealInfo
0x496e666f546c4966
/InfoTlIf
TealMeal
0x44617461546c4d6c
/DataTlMl
TealPaint
0x44617461546c5074
/DataTlPt
ThinkDB
0x6461746154444250
/dataTDBP
Tides
0x5464617454696465
/TdatTide
TomeRaider
0x546f526154525057
/ToRaTRPW
, these may also have a .tr extension
Weasel
0x7a54585447506c6d
/zTXTGPlm
WordSmith
0x42444f4357726453
/BDOCWrdS
Not an exhaustive list but like RIFF and IFF there are going to always be more.
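As a rough illustration of how the byte 60 tags above could be used, here is a minimal sketch. The function name `pdb_subformat` and the `PDB_SUBFORMATS` table are hypothetical (they are not PureMagic's API), and the table holds only a handful of the tags listed above.

```python
from typing import Optional

# Hypothetical lookup table: a few of the type/creator tags from the list above.
PDB_SUBFORMATS = {
    b"appl": "Palm Pilot Application",  # 4-byte tag
    b"TEXtREAd": "PalmDOC eBook",
    b"BOOKMOBI": "MOBI eBook",
    b"DataPlkr": "Plucker",
    b"ToRaTRPW": "TomeRaider",
    b"zTXTGPlm": "Weasel",
}


def pdb_subformat(data: bytes) -> Optional[str]:
    """Return a label for the tag found at byte 60, or None if it is unknown."""
    tag = data[60:68]
    # Try the longest tags first so 8-byte tags win over 4-byte ones.
    for magic in sorted(PDB_SUBFORMATS, key=len, reverse=True):
        if tag.startswith(magic):
            return PDB_SUBFORMATS[magic]
    return None


# Synthetic header: 60 filler bytes followed by the tag.
print(pdb_subformat(bytes(60) + b"BOOKMOBI"))
```

Because every subformat shares the .pdb extension and mimetype, the tag lookup is the only thing telling them apart in this sketch.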
Justsolve: MOBI
MobileRead Wiki: PDB
TomeRaider eBooks
Extensions:
.tr
.tr2
.tr3
Magic:
.pdb have 0x546f526154525057 / ToRaTRPW at byte 60 (as above)
.tr and .tr2 have 0x370000106d000010d2160010dcf4ddfcd1 at byte 0
.tr3 have 0x5452334454523343 / TR3DTR3C at byte 60
This came up while doing the Palm Doc entries. TomeRaider is another eBook format that started life on the PalmPilot series of devices. There are three formats: the .pdb version, then later on TR2 and TR3. TR2s and the old PDB version may both use .tr when not on a Palm device. Calibre cannot read any of these files (not that I can find a TR2 sample, but I imagine it also does not work), which is a shame; maybe a new project for me to look into...
FictionBook 2 and FictionBook 3
Extensions:
.fb2
.fb2.zip
.fbz
Magic:
.fb2 has 0x3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e0a3c46696374696f6e426f6f6b / <?xml version="1.0" encoding="UTF-8"?> then a newline and <FictionBook
.fbz and .fb2.zip are just normal PK Zip files with an FB2 inside
.fb3 are also just normal PK Zip files with a similar structure to an ePub
Another eBook format that is popular in Russia but nearly unused anywhere else. FictionBook 2s are an XML file with everything stored within it as a monolithic block; the compressed variants are simply a zip file with a single FB2 inside. FictionBook 3s are a similar idea to ePub in that they are just a zip file with a structured layout. Yay! More PK Zip matches....
Windows Help Files
Extensions:
.hlp
.gid
.cnt
Magic:
.hlp and .gid both have 0x3f5f0300 at byte 0, then 0x0000ffffffff at byte 6
.cnt has 0x3a42617365 / :Base at byte 0
This was already in the database but .hlp was split over two entries; I've condensed them into one superior match.
.gid are a metadata file that stores the last window position and size (but not read position). There is not much info, but looking at the samples created when I open .hlp files they all have the same starting layout. There are some other tags we can look for later to enhance confidence between both files.
.cnt files are a plain text file containing the chapters for a Help file; they add a graphical Table of Contents (TOC) tab to the Search/Find tabs under Win95. Assuming no blank lines or lines with no colon (:), the first line should always match the magic.
Handy if you get
Cannot display this help file. Try opening the help file again, and if you still get this message, copy the help file to a different drive, and try again
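A minimal sketch of the two-part .hlp/.gid match and the .cnt match described above. The function names are hypothetical and this is not how PureMagic itself is implemented; it only shows the offsets in use.

```python
HLP_MAGIC = bytes.fromhex("3f5f0300")       # expected at byte 0
HLP_SECOND = bytes.fromhex("0000ffffffff")  # expected at byte 6


def looks_like_winhelp(data: bytes) -> bool:
    """Two-part check: main magic at byte 0, secondary marker at byte 6."""
    return data[0:4] == HLP_MAGIC and data[6:12] == HLP_SECOND


def looks_like_cnt(data: bytes) -> bool:
    """.cnt files are plain text whose first line should start with :Base."""
    return data.startswith(b":Base")


# Synthetic header: magic, two filler bytes, then the secondary marker.
sample = HLP_MAGIC + b"\x10\x00" + HLP_SECOND
print(looks_like_winhelp(sample), looks_like_cnt(b":Base demo.hlp\r\n"))
```

Note that the check cannot tell .hlp from .gid, since both share the same starting layout.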
MS Reader eBook
Extensions:
.lit
Already in the .json, just added the mimetype
application/x-ms-reader
Sony Broad Band eBook (aka BBeB)
Extensions:
.lrf
.lrs
.lrx
Magic:
.lrf has 0x4c00520046000000 at byte 0
.lrx and .lrs no samples, extension only for now
A proprietary eBook format from Sony and Canon mainly aimed at the Sony Librié. .lrs are XML files that can be read as an eBook, but are aimed at being the source files for the other two extensions. .lrf and .lrx are compiled, and compiled with DRM, respectively.
Rocket eBook
Extensions:
.rb
Magic:
eBooks have 0xb00cb00c (bookbook)
System files use 0xb00cc0de (bookcode) or 0xb00cf00d (bookfood)
Another proprietary format, this is for the NuvoMedia Rocket eBook reading device, reportedly the first dedicated eBook reader released in 1997. There are possibly DRM versions of the file that may differ from these entries.
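A small sketch of telling the three Rocket magics apart. It assumes the magic sits at byte 0, which the entry above does not state explicitly; `rocket_kind` and `ROCKET_MAGICS` are hypothetical names used only for illustration.

```python
from typing import Optional

# Assumption: all three magics are read from byte 0 of the file.
ROCKET_MAGICS = {
    bytes.fromhex("b00cb00c"): "eBook (bookbook)",
    bytes.fromhex("b00cc0de"): "system file (bookcode)",
    bytes.fromhex("b00cf00d"): "system file (bookfood)",
}


def rocket_kind(data: bytes) -> Optional[str]:
    """Map the first four bytes to a Rocket file kind, or None if unknown."""
    return ROCKET_MAGICS.get(data[:4])


print(rocket_kind(bytes.fromhex("b00cb00c") + bytes(16)))
```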
Text Compression for Reader eBook (aka Psion Series 3 eBook)
Extensions:
.tcr
Magic:
0x2121382d4269742121 / !!8-Bit!! at byte 0
A text compression format I stumbled across while looking into Rocket eBooks. Quite possibly the oldest format in this PR, it harks back to the days of the Psion Series 3 and 5.
Shanda Bambook eBook (aka SuperNote Book)
Extensions:
.snb
Magic:
0x534e425030303042 / SNBP000B at byte 0
This is an eBook format for the amazingly named Shanda Bambook, a Chinese eBook reader. All the info and test files I've got fail to work in Calibre; it would be nice to have a working sample file.
Cheat Engine Trainer Data
Extensions:
.CETRAINER
Magic:
0x3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d227574662d38223f3e0d0a3c43686561745461626c65 / <?xml version="1.0" encoding="utf-8"?> then CRLF and <CheatTable at byte 0
These are another XML format document, used by CheatEngine for storing a trainer before it is compiled into an executable. Longer match to help prevent false positives against other XML based files.
Quake PAK files
Extensions:
.pak
.bsp
.mdl
.lmp
.dem
.map
.rc
.spr
Magic:
.pak has 0x5041434b / PACK at byte 0, then may have 0x4944504f / IDPO, 0x52494646 / RIFF or 0x49425350 / IBSP at byte 12
.bsp has 0x1d000000 or 0x1c000000 typically at byte 0, other versions may exist
.mdl has 0x4944504f / IDPO at byte 0
.map has 0x7b0a22 / { then a newline and " at byte 0, assuming no comment lines before
.spr has 0x49445350 / IDSP at byte 0
.lmp, .dem and .rc have no fixed headers, extensions only
The Quake PACK format shares a header with many other file types. We have entries in the JSON already but I've added these extra markers to help boost confidences. Not all files have them but it helps those that do.
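To illustrate the idea of an optional secondary marker, here is a minimal sketch of the .pak check. `match_pak` is a hypothetical helper rather than part of PureMagic, and how a hit on the marker would translate into a confidence boost is left out.

```python
from typing import Optional, Tuple

PAK_MAGIC = b"PACK"
# Optional secondary markers the entry lists at byte 12.
PAK_MARKERS = (b"IDPO", b"RIFF", b"IBSP")


def match_pak(data: bytes) -> Tuple[bool, Optional[bytes]]:
    """Return (is_pak, marker); marker is the byte 12 tag when one is present."""
    if data[:4] != PAK_MAGIC:
        return False, None
    tag = data[12:16]
    return True, (tag if tag in PAK_MARKERS else None)


# Synthetic header: PACK, eight filler bytes, then a model marker.
print(match_pak(b"PACK" + bytes(8) + b"IDPO"))
```

A file that matches PACK but has no marker still counts as a match here; the marker only adds supporting evidence.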
Other Quake files
.bsp are compiled level files
.mdl are 3D models used for characters, monsters, weapons etc.
.lmp are various image related files
.dem are recorded demos (or movies) of levels
.map are un-compiled map files that are used to make .bsp; they look a little like JSON at a glance
.rc are Resource files, basically a scripting language
.spr are sprite files
QuakeWiki: Quake File Formats
Description of .BSP Files
Quake map source
Quake Sprite format
Python Pickle
Extensions:
.pickle
Magic:
Protocol 0 has 0x28 / ( at byte 0
Protocol 1 has 0x7d71 / }q at byte 0
Protocol 2 has 0x8002 at byte 0
Protocol 3 has 0x8003 at byte 0
Protocol 4 has 0x8004 at byte 0
Protocol 5 has 0x8005 at byte 0
All end 0x2e / . at -1
Pickle is a data dump format for Python; there is an existing extension-only entry but we can remove that now. The headers are small, but thanks to the footer always being a . there should be no issues. Justsolve's protocol 1 file seems to not match my generated files when using protocol=1 (looks like it's a 0), so I'm going with my files for the magic. I've left the extension in for now to allow for fringe cases or later
Smacker video
Extensions:
.smk
Magic: Either 0x534d4b32 / SMK2 or 0x534d4b34 / SMK4 at byte 0
A popular video file format from the mid 90s; loads of early CD games used it due to its decent compression and, for the time, fairly decent quality. There are two versions; not sure what the later one added.
Bink video
Extensions:
.bik
.bk2
.bik2
Magic:
0x42494b / BIK at byte 0
Another popular video file format from early to mid CD era games, this replaced Smacker. There seems to be some confusion over the number of FourCCs this format has: BINK, BIKb, BIKf, BIKg, BIKh, BIKi and BIKd are all considered valid. The samples I found all used BIKi, but for now I have gone with just BIK and the extension .bik until more samples appear; this covers most potential files out there.
AmigaGuide
Extensions:
.guide
Magic:
0x40646174616261736520616d69676167756964652e6775696465 / @database amigaguide.guide at byte 0
The AmigaGuide document was made for creating navigable help files; they work much like Windows Help files.
CRI Movie 2
Extensions:
.usm
Magic:
0x43524944 / CRID at byte 0
Another proprietary video format used in various games, especially those coming from Japanese studios. It's an annoying format, as on later Windows versions the audio no longer plays back due to the weird, semi off-standard codecs they used.
Adobe flash video file
Extensions:
.flv
Magic:
0x464c5601 / FLV (plus byte 0x01) at byte 0, then 0x04, 0x01 or 0x05 for audio, video or both at byte 4
This is a tidy-up and improvement of existing entries. There were two .flv entries but one lacked the last byte pair, so I've removed that from the JSON. Added little secondary matches for extra confidence boosts.
Microsoft Works files
Extensions:
.wdb
.wks
.xlr
.wps
Magic:
Early .wdb versions have 0x20540200000005540200 at byte 0
Later .wdb, .wps and .xlr versions have 0xd0cf11e0a1b11ae1 at byte 0
Early .wps have 0x01fe at byte 0
Microsoft Works was a cut-down budget office suite that offered everything you needed in one package. It saved documents in a semi-proprietary format that MS did support in Office but later dropped. .wdb were the Works equivalent to Access, .wks/.xlr were spreadsheets, and .wps was a text file.
There are some differing versions: early formats were just for Works; later ones were still Works specific but used the Microsoft Compound File format, and identifying them may be trickier as we need to decode the CLSID identifier from the file. In fact the format is the basis for many, many formats, much like RIFF or IFF, so expect conflicting matches. Definitely a candidate for v2 identification upgrades.
JPEG XR, Windows Media Photo and Microsoft HD Photo File Format
Extensions:
.jxr
.wdp
.hdp
Magic:
All files should have 0x4949bc01 / II (plus bytes 0xbc01) at byte 0
.jxr also has 0x574d50484f544f00 / WMPHOTO at byte 90
.hdp I cannot find any samples, extension only for now.
Another member of the JPEG family, derived from the Windows Media Photo and Microsoft HD Photo formats; it's part MS, part JPEG, part butchered TIFF. The format is a mess.
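A short sketch of how the base magic and the extra .jxr marker could be layered. The scoring values and the name `jpeg_xr_score` are made up for illustration; they are not PureMagic confidences.

```python
JXR_MAGIC = bytes.fromhex("4949bc01")           # expected at byte 0
JXR_MARKER = bytes.fromhex("574d50484f544f00")  # WMPHOTO plus a NUL, at byte 90


def jpeg_xr_score(data: bytes) -> int:
    """0 = no match, 1 = base magic only, 2 = base magic plus WMPHOTO marker."""
    if data[:4] != JXR_MAGIC:
        return 0
    return 2 if data[90:98] == JXR_MARKER else 1


# Synthetic header: magic, filler up to byte 90, then the marker.
sample = JXR_MAGIC + bytes(86) + JXR_MARKER
print(jpeg_xr_score(sample))
```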
JPEG-LS
Extensions:
.jls
Magic:
0xffd8fff7 / ÿØÿ÷ at byte 0
Another JPEG format that is also not quite a format: it's a subset of regular JPEG and also has roots in HP's own lossless codec (which apparently is in one of the old Mars rovers). JustSolve magic suggestions match the output from the HP Reference encoder linked there, and from the CharLS WebAssembly demo linked below. XnView would not view them despite claiming support; the online demo could successfully read images converted with the HP encoder. I've gone with the longer magic based on the test files; this should allow it to win confidence over regular .jpg
Amiga Floppy Disk Images
Extensions:
.adf
Magic:
0x444f53 / DOS at byte 0, then 0x01 to 0x07 for various Amiga filesystems at byte 3
We have entries in the json but they can be improved. One was too specific, the other lacked mimetypes; let's fix that and make some enhancements. I've left a basic match for fringe cases and multi-parted the variations.
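As a sketch of the "basic match plus variations" idea, the check below returns a plain match for DOS alone and a more specific one when byte 3 falls in the 0x01 to 0x07 range given above. `adf_match` and its return strings are hypothetical.

```python
from typing import Optional


def adf_match(data: bytes) -> Optional[str]:
    """Return None for no match, 'basic' for DOS only, or the variant byte."""
    if data[:3] != b"DOS":
        return None
    # Byte 3 selects the filesystem variant; only 0x01-0x07 are treated
    # as known variants here, everything else falls back to the basic match.
    if len(data) > 3 and 0x01 <= data[3] <= 0x07:
        return f"variant 0x{data[3]:02x}"
    return "basic"


print(adf_match(b"DOS\x01" + bytes(8)))
```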
Amiga Harddisk Images
Extensions:
.hdf
Magic:
0x5244534b / RDSK at byte 0 for Amiga Filesystems
0x504653 / PFS for Professional Filesystem 3 (PFS3)
0x504453 / PDS for Professional Filesystem 3 (PDS3)
0x534653 / SFS for Smart File System
I thought I had added these before but evidently not; like their floppy brethren, these have many formats.
Fixes
There are also some small changes to various entries, fixing spelling errors, unifying names or adding mimetypes