Unicode strings to ASCII ...nicely

http://effbot.org/librarybook/unicodedata.htm

8th of August 2006

This has been a problem for me for a long time. Whenever someone enters a title in my CMS, the id of the document is derived from the title: spaces are replaced with '-', '&' is replaced with 'and', and so on. The final thing I wanted to do was to make sure the id is ASCII encoded when it's saved. My original attempt looked like this:

 >>> title = u"Klüft skräms inför på fédéral électoral große"
 >>> print title.encode('ascii','ignore')
 Klft skrms infr p fdral lectoral groe

But as you can see, a lot of the characters are gone. I'd much rather a word like "Klüft" were converted to "Kluft", which is more human readable and still correct. My second attempt was to write a big table of Unicode to ASCII replacements.

It looked something like this:

 u'\xe4': u'a',
 u'\xc4': u'A',
 etc...
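
For illustration, a table like this could be applied as follows. This is only a sketch in Python 3 syntax (the sessions in this post are Python 2); the three entries and the names REPLACEMENTS and replace_from_table are assumptions, not the original table.

```python
# Hypothetical excerpt of a replacement table; a real one needs far more entries.
REPLACEMENTS = {
    "\xe4": "a",  # a with diaeresis
    "\xc4": "A",  # A with diaeresis
    "\xfc": "u",  # u with diaeresis
}


def replace_from_table(text):
    # str.maketrans turns the single-character keys into the code point
    # mapping that str.translate expects.
    return text.translate(str.maketrans(REPLACEMENTS))


print(replace_from_table("Kl\xfcft skr\xe4ms"))  # Kluft skrams
```

Any character that is missing from the table passes through unchanged, so every omission ends up in the id.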

The table was long, awful and not Pythonic. The result was good, but it was far too easy to miss a character. Now for the final solution, which I'm very happy with. It uses a module called unicodedata, which is new to me. Here's how it works:

 >>> import unicodedata
 >>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
 'Kluft skrams infor pa federal electoral groe'
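
The reason this works is that NFKD normalization splits an accented character into its base letter followed by a separate combining mark, and encoding with 'ignore' then drops the mark because it is not ASCII. A small sketch in Python 3 syntax (an assumption; the session above is Python 2) makes the intermediate step visible:

```python
import unicodedata

word = "Kl\xfcft"  # "Kluft" with a diaeresis on the u, 5 characters
decomposed = unicodedata.normalize("NFKD", word)

# The single accented character is now two: 'u' plus U+0308.
print(len(word), len(decomposed))
print([unicodedata.name(ch) for ch in decomposed])

# Encoding with 'ignore' discards the non-ASCII combining mark.
ascii_word = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_word)  # Kluft
```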

It's not perfect ('große' should have become 'grosse') but it's only two lines of code.
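
One possible way around the 'ß' problem is to replace such characters by hand before normalizing, since 'ß' has no decomposition for NFKD to work with. The following is a sketch in Python 3 syntax, not the code my CMS uses; SPECIAL_CASES, to_ascii and title_to_id are hypothetical names, and the id rules cover only the two replacements mentioned at the top of this post.

```python
import unicodedata

# Hypothetical hand-written replacements for characters that NFKD
# cannot decompose into an ASCII base letter.
SPECIAL_CASES = {"\xdf": "ss"}  # German sharp s


def to_ascii(text):
    for char, replacement in SPECIAL_CASES.items():
        text = text.replace(char, replacement)
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop whatever is still not ASCII, such as combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def title_to_id(title):
    # '&' becomes 'and' and spaces become '-', as described earlier.
    return to_ascii(title).replace("&", "and").replace(" ", "-")


print(to_ascii("f\xe9d\xe9ral gro\xdfe"))  # federal grosse
print(title_to_id("Kl\xfcft & gro\xdfe"))  # Kluft-and-grosse
```

The special-case table stays tiny because normalization still handles every accented letter; only characters without a decomposition need an entry.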


