
## Unicorn Support Would Be Easier

Unicode has two ways to represent composite characters like accented Roman letters. If you want to write a ü (that should render as a lowercase u with umlaut/diaeresis/two dots over it, if your browser is any good), you can represent it as the single Unicode codepoint LATIN SMALL LETTER U WITH DIAERESIS (U+00FC), or you can represent it as LATIN SMALL LETTER U (U+0075) followed by COMBINING DIAERESIS (U+0308). Since these are, in fact, the same letter, Unicode defines normalization forms for things like comparing and sorting strings. In the NFC form, all composite characters that have a single codepoint representation are represented as that single codepoint; in the NFD form, they are all decomposed into multiple codepoint representations. If you pick a normalization form and stick with it, then you can ensure that ü is always equal to ü no matter which representation it came in with.
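
To make the two representations concrete, here is a minimal sketch using Python's standard unicodedata module. The helper name `same_text` is made up for illustration; only `unicodedata.normalize()` comes from the standard library.

```python
import unicodedata

composed = "\u00fc"      # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # LATIN SMALL LETTER U followed by COMBINING DIAERESIS


def same_text(a, b, form="NFC"):
    """Compare two strings after putting both into one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)            # False: different code point sequences
print(len(composed), len(decomposed))    # 1 2
print(same_text(composed, decomposed))   # True: equal once both are NFC
```

Either form works for the comparison, as long as both sides use the same one.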

If you don’t do normalization, then annoying things happen. The setup is as follows: I’ve got a server machine running Debian 5 Linux, serving files from an XFS filesystem through Samba, talking to a client Mac running OS X 10.5 (Leopard) and mounting the share using mount_smbfs(8). Some files and folders have accented letters in their names. Now, they display fine in Finder directory listings. Problem is, when you click on them, they disappear!

It took some poking around in Wireshark to figure out what was going on. I noticed that the file names being received from the server in directory listings were not the same as the ones that were later being requested: specifically, characters like our friend ü were decomposed in the listings, but composed in the requests. Samba would then report the requested files as missing.

I’m honestly not sure who to blame here. It could be OS X’s fault for not keeping track of received filenames somewhere so it could send out the same ones it got. It could be Samba’s fault for not normalizing incoming requests before looking up files when it’s set to use Unicode. It could even be XFS’s fault: XFS filenames are byte strings and can include any bytes other than the ASCII codes for / and NUL. The Mac OS filesystem HFS+ uses a normalization form that is (almost) NFD, so filenames are stored on disk in one format only, and other representations are illegal. XFS does not appear to do any processing of filenames. It’s actually possible to create two files named “ü” in the same directory, or at least two filenames that display as “ü” when interpreted as UTF-8 byte strings, decoded to Unicode code points, and displayed by something that can handle Unicode characters.
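
The "two files named ü" claim is easy to try out. The sketch below is a rough experiment, not a statement about any particular filesystem: it shows the two UTF-8 byte strings involved, then creates both names in a scratch directory and counts what the directory listing reports. The function name `count_umlaut_files` is invented for this example. On a filesystem that treats names as opaque bytes I would expect a count of 2; on one that normalizes names, the two should collapse into 1.

```python
import os
import shutil
import tempfile

composed = "\u00fc"      # one code point
decomposed = "u\u0308"   # base letter plus combining mark

# What a byte-oriented filesystem actually stores for each name.
composed_bytes = composed.encode("utf-8")      # b'\xc3\xbc'
decomposed_bytes = decomposed.encode("utf-8")  # b'u\xcc\x88'


def count_umlaut_files():
    """Create both spellings in a scratch directory and count the entries."""
    scratch = tempfile.mkdtemp()
    try:
        for name in (composed, decomposed):
            with open(os.path.join(scratch, name), "w") as handle:
                handle.write("x")
        return len(os.listdir(scratch))
    finally:
        shutil.rmtree(scratch)


print(composed_bytes, decomposed_bytes)
print(count_umlaut_files())
```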

Fortunately, there is a workaround. If Unicode filenames are converted to NFC on disk on the server, they appear to survive the round trip to the Finder and back just fine. Python’s unicodedata module has a normalize() function that makes this pretty painless.

***

Update: here’s some working sample code to NFC-ize everything in a folder hierarchy.

    import os
    import unicodedata

    # Walk bottom-up so a directory's contents are renamed before the directory itself.
    for root, dirs, files in os.walk(u'/path/to/base', topdown=False):
        for entry in files:
            nfc = unicodedata.normalize('NFC', entry)
            if entry != nfc:
                os.rename(
                    os.path.join(root, entry),
                    os.path.join(root, nfc))
                print(os.path.join(root, nfc))
        # Then rename the directory being visited, if its own name is not NFC.
        rootparent, rootentry = os.path.split(root)
        nfc = unicodedata.normalize('NFC', rootentry)
        if rootentry != nfc:
            os.rename(root, os.path.join(rootparent, nfc))
            print(os.path.join(rootparent, nfc))
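
Since a byte-oriented filesystem can hold both spellings of a name side by side, renaming to NFC could land on a name that already exists. Before running the script above for real, a dry run along these lines might be worth doing. This is only a sketch: `plan_nfc_renames` is a name invented here, it changes nothing on disk, and unlike the script above it does not consider the base directory's own name.

```python
import os
import unicodedata


def plan_nfc_renames(base):
    """Return (renames, collisions) for entries under base whose names are not NFC.

    Each item is a (source_path, target_path) pair. Nothing is renamed.
    """
    renames = []
    collisions = []
    for root, dirs, files in os.walk(base, topdown=False):
        existing = set(dirs) | set(files)
        for entry in files + dirs:
            nfc = unicodedata.normalize("NFC", entry)
            if nfc == entry:
                continue  # already in NFC, nothing to do
            source = os.path.join(root, entry)
            target = os.path.join(root, nfc)
            if nfc in existing:
                # The NFC name is already taken in this directory.
                collisions.append((source, target))
            else:
                renames.append((source, target))
                existing.add(nfc)
    return renames, collisions
```

Calling `plan_nfc_renames('/path/to/base')` and looking at the second list first should show whether any rename would clobber an existing entry.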