tkbe

December 14, 2008

Django :: International characters in urls

Filed under: django — tb @ 4:26 am

I remember looking at putting Norwegian characters in urls a while ago, and giving up in disgust. I'm not talking about hostnames here; those actually work relatively well (e.g. http://båtførerprøven.norsktest.no/). What I'm talking about is the path part, as in http://www.norsktest.no/båtførerprøven. The solution I've found isn't particularly pretty, but it seems to work on IE6, IE7, FF1.5, FF3, Opera9, and Safari3.1...

The solution is for a cms I'm writing for internal use, so almost all urls need to be looked up in a database. Here's the relevant entry from urls.py:

PYTHON:
  (r'^(?P<path>(/[^/]+)+)',  views.view_page),
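To see what ends up in the `path` argument, here is a quick sketch that runs the same regex over a few strings with plain `re`. The helper name `captured_path` and the sample paths are made up for illustration, and the strings are fed in with their leading slash; how Django actually hands the path to the pattern depends on how the URLconf is set up.

```python
import re

# Same pattern as the urls.py entry above; the sample paths are made up.
PATTERN = re.compile(r'^(?P<path>(/[^/]+)+)')

def captured_path(url_path):
    """Return the text captured by the named group `path`, or None if nothing matches."""
    match = PATTERN.match(url_path)
    return match.group('path') if match else None

print(captured_path('/båtførerprøven'))        # a single segment
print(captured_path('/kurs/båtførerprøven/'))  # the trailing slash is left out of the capture
print(captured_path('/'))                      # no segment at all, so no match
```

With this pattern every segment must contain at least one non-slash character, so the captured text never ends in a slash.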

Here's the view code:

PYTHON:
  from django import http
  from django import shortcuts as dj  # assumption: `dj` is an alias for django.shortcuts

  def normpath(path):
      # path data in the database is lowercase, with hyphens, and without a trailing /
      path = path.lower().replace(' ', '-')
      if path.endswith('/'):
          path = path[:-1]

      # encoding machinery needed to support FF1.5
      # We're taking advantage of the fact that non-ascii characters
      # in the Latin-1 encoding are (almost) never valid UTF-8
      try:
          # if we just have ascii, or we have extended urls already
          # encoded in utf-8...
          path.decode('u8')
          # ... then all is well in the universe (skip to end).
      except UnicodeError:
          # if we get here we could have garbage, or we could
          # be on FF1.5 with Norwegian characters...
          try: # check for possible latin-1 encoding
              path = path.decode('l1').encode('u8')
              # whoo!
          except UnicodeError:
              raise http.Http404 # Garbage not found on our site.

      return path

  def view_page(request, path):
      path = normpath(path)
      webpage = dj.get_object_or_404(Page, path=path)
      ...
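The "try UTF-8 first, fall back to Latin-1" trick can be tried out on its own. What follows is only a sketch in Python 3, working on raw bytes rather than on whatever Django passes to the view, and `to_utf8` is a hypothetical helper name, not something from the cms.

```python
def to_utf8(raw):
    """Return `raw` (a bytes path) as UTF-8 bytes, treating invalid UTF-8 as Latin-1."""
    try:
        raw.decode('utf-8')   # ASCII or already UTF-8: nothing to do
        return raw
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this step cannot fail.
        return raw.decode('latin-1').encode('utf-8')

name = 'båtførerprøven'
utf8_bytes = name.encode('utf-8')
latin1_bytes = name.encode('latin-1')

print(to_utf8(utf8_bytes) == utf8_bytes)    # UTF-8 input passes through unchanged
print(to_utf8(latin1_bytes) == utf8_bytes)  # Latin-1 input is re-encoded as UTF-8

# Caveat: a few Latin-1 byte pairs happen to be valid UTF-8 and slip through as-is.
ambiguous = 'Ã¥'.encode('latin-1')          # bytes C3 A5, which is also UTF-8 for 'å'
print(to_utf8(ambiguous).decode('utf-8'))
```

The fallback works for ordinary Norwegian words because a lone Latin-1 byte such as E5 followed by an ASCII letter is not a valid UTF-8 sequence. The last case shows the limit of the heuristic: text that is valid in both encodings is always read as UTF-8.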

Lots of fun urls are possible with this scheme. However, IE7 seems to require a trailing / in the url to keep it from mangling the entered text into %XX codes (the look-up works regardless).
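The %XX codes are ordinary percent-encoding of the bytes in the path. As a rough illustration, here is what the example path looks like when percent-encoded from UTF-8 and from Latin-1, using Python 3's `urllib.parse`. Which of the two a given browser actually sends is not something this sketch tries to establish.

```python
from urllib.parse import quote, unquote

path = '/båtførerprøven'

utf8_form = quote(path)                        # quote() encodes as UTF-8 by default
latin1_form = quote(path, encoding='latin-1')  # same path, Latin-1 bytes instead

print(utf8_form)    # /b%C3%A5tf%C3%B8rerpr%C3%B8ven
print(latin1_form)  # /b%E5tf%F8rerpr%F8ven

# Decoding with the matching charset gives the original path back.
assert unquote(utf8_form) == path
assert unquote(latin1_form, encoding='latin-1') == path
```

Each Norwegian letter becomes two escaped bytes in UTF-8 and one in Latin-1, which is why the same url can show up in two different escaped spellings.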
