Character encodings may not be respected #4

Open
gingerbeardman opened this issue Aug 24, 2022 · 7 comments

Comments

@gingerbeardman

gingerbeardman commented Aug 24, 2022

Following on from issue #2

Previously (hmm, I'm trying to think when exactly? a long time ago!) I could set my Mac to Japanese and reboot, mount an HFS disc that uses MacJapanese character encoding and see the filenames as intended. Reboot was essential, login was not enough.

Such foreign discs are tricky as they contain filenames in multiple character sets. Files may have been copied from other discs or downloaded from the internet, so could contain many different encodings. Encodings are not stored anywhere: they have to be set manually, calculated using heuristics or some other map, or simply assumed.
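
To make the "calculated using heuristics" option concrete, here is a minimal per-filename sketch in Python. It is only an illustration: Python ships no MacJapanese codec, so `shift_jis` is used as a stand-in (an assumption), and the function name `decode_hfs_name` is hypothetical. The idea is to try a strict Japanese decode first and fall back to MacRoman when the bytes are not valid.

```python
def decode_hfs_name(raw: bytes) -> tuple:
    """Guess the encoding of one raw HFS filename and decode it.

    Returns (decoded_name, codec_used).
    """
    # Pure ASCII decodes the same way under either candidate encoding.
    if all(b < 0x80 for b in raw):
        return raw.decode("ascii"), "ascii"
    # Assumption: shift_jis stands in for MacJapanese, which Python lacks.
    try:
        return raw.decode("shift_jis"), "shift_jis"
    except UnicodeDecodeError:
        # Not valid Shift-JIS, so treat the name as MacRoman.
        return raw.decode("mac_roman", errors="replace"), "mac_roman"
```

This is inherently ambiguous: some MacRoman names with accented letters also happen to be valid Shift-JIS byte sequences, so a real tool would probably want a manual override as well.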

I have found that Tcl has good support for Apple encodings, most of them written by Apple themselves back in the mid-1990s when this stuff was still very much current. Though macOS should respect and display the original encoding if the characters are correct and the system language is set correctly.

Another gotcha is that bugs/omissions in Japanese input methods (helper apps that assist with typing complex scripts that use multiple alphabets) allowed non-displayable characters to be typed in filenames! Example: when renaming a file in Finder, pressing the Delete key on an Extended Keyboard would insert an invisible character rather than deleting anything. This means that filenames can be quite dirty and contain invalid characters, which I guess should be resolved in some way or simply ignored?
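
One possible way to "resolve" such dirty names, sketched in Python: drop or replace anything in the Unicode control-character category after decoding. Which character the Delete key actually inserted is not stated here, so this sketch simply treats every control character the same way; `clean_name` is a hypothetical helper.

```python
import unicodedata


def clean_name(name: str, replacement: str = "") -> str:
    """Remove (or replace) control characters in a decoded filename."""
    return "".join(
        # Category "Cc" covers the C0/C1 control characters, including DEL.
        replacement if unicodedata.category(ch) == "Cc" else ch
        for ch in name
    )
```

Passing a visible `replacement` such as `"_"` keeps the name length stable, which may matter if two dirty names would otherwise collapse to the same clean name.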

If you need more details please ask as I have the info in my notes that I can dig out.

Anecdotes

  • My own tools for processing these discs use hfsutils to get the raw filenames and then I convert them using a Tcl one-liner (wrapped in some supporting logic).
  • DiskCatalogMaker stores file names using the native text encoding. It will use the current system text encoding when reading the old catalog, i.e. CFStringGetSystemEncoding(), so my old catalogs made on Mac System 7 Japanese will load into the current version of the app on modern macOS. Upon copy and paste the filenames will be converted to Unicode.

Sample HFS images

Disk images that have a mix of MacRoman and MacJapanese:

This one looks 99% MacRoman, with a minimal file or two with MacJapanese names:

I have hundreds more discs of this type.

@thejoelpatrol
Owner

thejoelpatrol commented Sep 2, 2022

I'm not sure macOS supports MacJapanese any more. The closest I can find with $ iconv -l is SHIFT_JISX0213. Forcing that encoding for the whole volume results in this:
image
AFAICT from Google Translate these appear to be roughly sensible filenames.

We don't want to force this manually globally, but even forcing one encoding that we decide via heuristic or settings/preferences or whatever would be better than assuming everything is MacRoman, as we are doing now in the absence of a command-line flag. We could do something like you mention, getting the system language and choosing the appropriate classic Mac encoding based on that, e.g. setting MacJapanese if you set your language to Japanese, MacHebrew if you are set to Hebrew, or Mac OS Thai, Mac OS Ukrainian, etc. My first worry was that not all of these contain ASCII in the lower 128 code points, which would mean you couldn't make any sense of MacRoman disks if you are set to Thai or Ukrainian, for example. Actually it seems these charsets do generally include ASCII in that range, so it might be workable, but a pain if you ever want more than one language plus ASCII.
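
A rough Python sketch of the language-based selection described above, plus a quick check of the ASCII question. Everything here is illustrative: the mapping table and function names are hypothetical, codec names are Python's rather than fusehfs's, `shift_jis` and `mac_cyrillic` are stand-ins for MacJapanese and Mac OS Ukrainian, and Hebrew and Thai are left out because Python has no codecs for their classic Mac encodings.

```python
# Hypothetical mapping from a primary language code to a Python codec name.
LANGUAGE_TO_CODEC = {
    "ja": "shift_jis",     # stand-in: Python has no MacJapanese codec
    "ru": "mac_cyrillic",
    "uk": "mac_cyrillic",  # approximation of Mac OS Ukrainian
    "el": "mac_greek",
    "tr": "mac_turkish",
    "is": "mac_iceland",
}
DEFAULT_CODEC = "mac_roman"


def codec_for_language(lang_tag: str) -> str:
    """Pick a codec from a tag such as 'ja_JP' or 'en-US'."""
    primary = lang_tag.replace("-", "_").split("_")[0].lower()
    return LANGUAGE_TO_CODEC.get(primary, DEFAULT_CODEC)


def is_ascii_compatible(codec: str) -> bool:
    """True if bytes 0x00-0x7F decode to the same characters as ASCII."""
    ascii_bytes = bytes(range(0x80))
    try:
        return ascii_bytes.decode(codec) == ascii_bytes.decode("ascii")
    except UnicodeDecodeError:
        return False
```

Running `is_ascii_compatible` over whatever codec table is actually used would be a cheap way to confirm that plain-ASCII names on MacRoman disks stay readable under each language setting.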

I wonder if a simple GUI application would be useful here, to drag and drop a disk image on it and you can set the language/encoding there for people who don't want to use the command line.

I don't know what kind of heuristic could detect files with multiple encodings per volume, though. That seems real tough. Would those have been handled correctly by classic Mac OS?

In the interim, if you would like to use a volume that you know has a particular encoding, you can mount it manually. Open it in Disk Utility, and unmount the volume but do not eject the image. Get the disk number, e.g. /dev/disk2s2. Then, run this:

$ /Library/Filesystems/fusefs_hfs.fs/Contents/Resources/mount_fusefs_hfs --encoding=${ENCODING_NAME} ${DISK_NUMBER} ${MOUNTPOINT}
e.g.:
$ /Library/Filesystems/fusefs_hfs.fs/Contents/Resources/mount_fusefs_hfs --encoding=SHIFT_JISX0213 /dev/disk2s2 /Users/joel/mnt
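
If you do this often, the command above could be wrapped in a small script. This Python sketch only builds the argument list using the helper path and `--encoding` flag shown above; the function name is hypothetical and nothing is executed until you pass the result to `subprocess.run` yourself.

```python
import shlex

# Path to the mount helper, as given in the command above.
MOUNT_HELPER = (
    "/Library/Filesystems/fusefs_hfs.fs/Contents/Resources/mount_fusefs_hfs"
)


def build_mount_command(encoding: str, device: str, mountpoint: str) -> list:
    """Return the argv list for mounting `device` with a forced encoding."""
    return [MOUNT_HELPER, f"--encoding={encoding}", device, mountpoint]


def as_shell_string(argv: list) -> str:
    """Render argv as a copy-pasteable, safely quoted shell command."""
    return shlex.join(argv)
```

For example, `build_mount_command("SHIFT_JISX0213", "/dev/disk2s2", "/Users/joel/mnt")` reproduces the second command above, and `subprocess.run(argv, check=True)` would run it.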

@gingerbeardman
Author

MacJapanese is closely related to SHIFT-JIS but they're not the same, see here. That said, it may be close enough to be a workable solution.

I think your suggestions are all good. If the encoding could be remembered between mounts that would be great.

Or maybe the user could add something to the filename that would clue fusehfs in to the required encoding? That way nothing would need to be stored on disk.
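
As a sketch of what a filename hint might look like: the `.enc-<ENCODING>` infix below is a purely hypothetical convention (fusehfs does not implement it as far as this thread says), parsed here in Python just to show that nothing needs to be stored on disk.

```python
import re

# Hypothetical convention: "<name>.enc-<ENCODING>.<ext>",
# e.g. "games.enc-MacJapanese.img".
HINT_PATTERN = re.compile(r"\.enc-([A-Za-z0-9_\-]+)(?=\.[^.]+$)")


def encoding_hint(image_name: str, default: str = "MacRoman") -> str:
    """Return the encoding named in the image filename, or the default."""
    match = HINT_PATTERN.search(image_name)
    return match.group(1) if match else default
```

The hint sits just before the final extension so the image still opens normally in other tools.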

I'll try manual mounting soon.

@d235j

d235j commented Sep 23, 2022

It looks like iconv doesn't support MacJapanese but CoreFoundation does — see https://developer.apple.com/documentation/coreservices/1399915-encoding_variants_for_macjapanes. The downside of rewriting the character encoding code using CF is that it would make fusehfs less portable to Linux.

I'm guessing something is stored on disk indicating encoding. Will need to investigate.

@gingerbeardman
Author

> I'm guessing something is stored on disk indicating encoding. Will need to investigate.

I don't believe there is, but I don't have any references to cite. The language of the host OS is responsible for interpreting the filenames according to its default encoding. Very old school.

That said, I would be interested to see what you find!

@joevt

joevt commented Dec 4, 2022

There's a text encoding hint in the Finder Info of the Master Directory Block. Maybe it's related? See dumpencoding at:
https://gist.github.com/joevt/a99e3af71343d8242e0078ab4af39b6c
See GET_HFS_TEXT_ENCODING at:
https://github.com/apple-oss-distributions/hfs/blob/hfs-627.40.1/mount_hfs/mount_hfs.c
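
A rough Python sketch of reading that hint from a raw HFS volume. The constants are assumptions to verify against the linked mount_hfs.c: that the hint lives in the fifth 32-bit word of drFndrInfo (byte offset 92 within the MDB), and that it is only valid when its upper three bytes spell "enc". `read_encoding_hint` is a hypothetical name.

```python
import struct

MDB_OFFSET = 1024            # the MDB starts 1024 bytes into an HFS volume
MDB_SIGNATURE = 0x4244       # "BD", the HFS volume signature
FNDR_INFO_OFFSET = 92        # drFndrInfo[0] within the MDB
HINT_INDEX = 4               # assumption: the hint is drFndrInfo[4]
ENCODING_MAGIC = 0x656E6300  # assumption: "enc\0" tag in the upper 3 bytes


def read_encoding_hint(volume: bytes):
    """Return the text encoding number from the MDB, or None if unset."""
    (signature,) = struct.unpack_from(">H", volume, MDB_OFFSET)
    if signature != MDB_SIGNATURE:
        raise ValueError("not an HFS volume (missing 'BD' signature)")
    offset = MDB_OFFSET + FNDR_INFO_OFFSET + 4 * HINT_INDEX
    (word,) = struct.unpack_from(">I", volume, offset)
    if (word & 0xFFFFFF00) != ENCODING_MAGIC:
        return None  # no hint stored
    return word & 0xFF  # e.g. 1 is kTextEncodingMacJapanese
```

This expects the bytes of the HFS volume itself; for a disk image with a partition map you would first need to seek to the start of the HFS partition.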

There's HFS Encoding kexts for various macOS versions from Sierra to Catalina at:
/System/Library/Filesystems/hfs.fs/Contents/Resources/Encodings/
HFS_MacArabic.kext
HFS_MacCentralEurRoman.kext
HFS_MacChineseSimp.kext
HFS_MacChineseTrad.kext
HFS_MacCroatian.kext
HFS_MacCyrillic.kext
HFS_MacGreek.kext
HFS_MacHebrew.kext
HFS_MacIcelandic.kext
HFS_MacJapanese.kext
HFS_MacKorean.kext
HFS_MacRomanian.kext
HFS_MacThai.kext
HFS_MacTurkish.kext

I think there might be source code at least for converting MacJapanese?
https://github.com/apple-oss-distributions/hfs/blob/main/hfs_japanese/hfs_japanese.kmodproj/JapaneseConverter.c
And also MacRoman:
https://github.com/apple-oss-distributions/hfs/blob/main/hfs_encodings/hfs_encodings.c

The documentation archive has a note about an Encoding popup in the Finder's Get Info window for an HFS Standard volume:
https://developer.apple.com/library/archive/qa/qa1173/_index.html#//apple_ref/doc/uid/DTS10001705
What does that popup look like? Is it changing the value in the Master Directory Block?

More about text encodings for HFS Plus:
https://developer.apple.com/library/archive/technotes/tn/tn1150.html#//apple_ref/doc/uid/DTS10002989
It implies that HFS Standard does not have the per-file or per-folder text encoding settings that HFS Plus has and that the text encoding "varies depending on how the system software was localized and what language kits are installed".

@d235j

d235j commented Dec 4, 2022

Looks like if it exists at all, it’s stored in the Finder info word in the MDB?
https://github.com/apple-oss-distributions/hfs/blob/4e3719273a0c670ef4aa7c77bb421c89f3473e14/mount_hfs/mount_hfs.c

@gingerbeardman
Author

Interesting!

Tcl has good conversion routines and encoding tables which were written by Apple themselves.
