Tableau is for everyone, and that includes people whose languages include characters that didn’t come from the extremely simplified version of the Latin alphabet that English uses.
If you’re reading this, you’re probably running into some issues and have an idea of what I’m talking about, but for clarity’s sake:
- ASCII characters are the basic, non-accented Latin characters used in American English
- Unicode is the standard that describes all of the characters in the world’s languages and beyond
- UTF-8 is an encoding of Unicode with the wonderful property of matching ASCII byte-for-byte for every character that ASCII can represent
- Tableau handles text in Unicode internally and tends to read and respond in UTF-8 encoding (more on this later)
- When I say ‘non-ASCII’ I mean characters whose UTF-8 encoding differs from ASCII. You need to be using these to see any errors, because a document containing only non-accented English looks identical whether it is encoded as ASCII or UTF-8
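That ASCII/UTF-8 overlap is easy to demonstrate in a couple of lines of Python (this is just an illustration, not Tableau code):

```python
# ASCII-only text produces identical bytes under both encodings,
# which is why plain English never reveals an encoding problem
s = 'Workbook'
assert s.encode('ascii') == s.encode('utf-8')

# A non-ASCII character is where the encodings diverge:
# é (U+00E9) becomes two bytes in UTF-8
print('café'.encode('utf-8'))  # b'caf\xc3\xa9'
```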
You can create a site name that includes non-ASCII characters, but not a Site Content URL. Otherwise, every other name or string in Tableau Server can contain Unicode characters; they will display correctly and be fully valid and searchable.
Numeric Character Reference
Although it handles Unicode internally just fine, Tableau 9.0 (and 9.1) in most situations actually outputs non-ASCII Unicode using Numeric Character References rather than encoding the actual characters as UTF-8. This means you’ll see something like the following (with the ampersands [&] at the beginning of each character reference removed, so that your browser doesn’t render them)
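To see what the Numeric Character Reference format looks like, Python can generate it directly with the ‘xmlcharrefreplace’ error handler (this just illustrates the format; it is not how Tableau produces its output):

```python
# Each non-ASCII character becomes &# followed by its decimal
# Unicode code point and a closing semicolon
print(u'日本語'.encode('ascii', 'xmlcharrefreplace'))
# b'&#26085;&#26412;&#35486;'
```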
If you want to actually treat this as Unicode text, you’ll need to do some conversion. For example, in the tableau_rest_api library, all incoming text (as of version 1.5) goes through a conversion process that takes the numeric character references, converts them to pure Unicode, and then encodes the Unicode text as UTF-8. The solution was inspired by here.
BUT WAIT THERE’S MORE
Not EVERY response coming back from the Tableau REST API uses Numeric Character References. Some responses are in pure UTF-8; my guess is that this will become more common in the future. So you need to detect BOTH cases: convert in the first case, or leave everything alone if it is UTF-8 from the beginning.
So the v.1.5.0 code looks like:
```python
# Use HTMLParser to get rid of the escaped Unicode sequences,
# then encode the whole thing as UTF-8
parser = HTMLParser()
unicode_raw_response = parser.unescape(initial_response)
try:
    self.__raw_response = unicode_raw_response.encode('utf-8')
# Sometimes it appears we actually send this stuff in UTF-8 already
except UnicodeDecodeError:
    self.__raw_response = unicode_raw_response
    unicode_raw_response = unicode_raw_response.decode('utf-8')
```
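That snippet targets Python 2, where HTMLParser’s unescape() did the conversion. As a hedged sketch of the same idea in Python 3 (the function name here is my own, not part of tableau_rest_api): html.unescape() leaves text with no character references untouched, so both cases fall out of a single call with no exception handling:

```python
import html

def normalize_response(response_bytes):
    # REST responses arrive as bytes; decode them as UTF-8 first.
    # Numeric character references are pure ASCII, so this is safe
    # in either case.
    text = response_bytes.decode('utf-8')
    # html.unescape() turns &#26085;-style references into real
    # characters and returns already-plain text unchanged
    return html.unescape(text)

# Works whether the server sent numeric character references...
print(normalize_response(b'&#26085;&#26412;&#35486;'))  # 日本語
# ...or real UTF-8 bytes
print(normalize_response('日本語'.encode('utf-8')))  # 日本語
```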
More info on v.1.5.0 will come in the next post.