Wednesday, February 6, 2013

Interesting Aspects on Locale Extensions

Locale Extensions

What can be done

Extensions let you attach additional information to a Locale, as described in the Javadoc:
The Locale class implements IETF BCP 47 which is composed of RFC 4647 "Matching of Language Tags" and RFC 5646 "Tags for Identifying Languages" with support for the LDML (UTS#35, "Unicode Locale Data Markup Language") BCP 47-compatible extensions for locale data exchange.
You can add extension tags through a Locale.Builder as follows:

Locale.Builder b = new Locale.Builder();
b.setExtension('x', "myExt-myCal-myCur");

But there are some things to be aware of:

  • extensions are not case sensitive (the JDK Locale implicitly converts them all to lower case!)
  • there is no defined order of extension tags, so don't rely on one!
  • tags can be separated by '-' or '_' (the standard requires '-'; Java accepts both as input, but translates every '_' to '-').
  • valid characters for tags are restricted to [a-zA-Z0-9], so special characters like '?' or '=' are not possible (Java checks this).
  • tags are a minimum of 2 characters long (this is also checked by the JDK); for the private-use key 'x', the Locale Javadoc additionally allows single-character tags
  • tags are a maximum of 8 characters long (this is also checked by the JDK)
  • each extension is identified by a singleton character (per the Locale Javadoc a letter or a digit, with 'x' reserved for private use)
So all the following inputs are accepted by the JDK:
b.setExtension('x', "mi");
b.setExtension('a', "maxmaxma");
b.setExtension('b', "de-US");
b.setExtension('d', "aa1-bb2_cc3_dd4");
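
The rules listed above can be checked with a small sketch. The class name ExtensionNormalizationDemo and the helper normalized are made up for illustration; the language "de" is set only so that the locale is not extension-only. The helper returns the extension value as the JDK stores it, or null if the builder rejects the input with an IllformedLocaleException.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

class ExtensionNormalizationDemo {

    // Builds a "de" locale with a single extension and returns the value
    // as stored by the JDK, or null if the builder rejects the input.
    static String normalized(char key, String value) {
        try {
            Locale locale = new Locale.Builder()
                    .setLanguage("de")
                    .setExtension(key, value)
                    .build();
            return locale.getExtension(key);
        } catch (IllformedLocaleException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Mixed separators and upper case are normalized.
        System.out.println(normalized('d', "aa1-bb2_cc3_dd4"));
        System.out.println(normalized('b', "de-US"));
        // A 9-character tag and a tag with a special character.
        System.out.println(normalized('a', "maxmaxmax"));
        System.out.println(normalized('a', "a?b"));
    }
}
```

On the JDKs I would expect the first two calls to print the lower-cased, '-'-separated values and the last two to print null, but it is worth running this against your own JDK version.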



Strange Behavior

Some days ago I played around with Locale extensions (JDK 7/8):
Locale.Builder b = new Locale.Builder();
// b.setRegion("DE");
// b.setLanguage("de");
b.setExtension('x', "gr2-spPrepen-nldeDE");
System.out.println("Locale: " + b.build());
System.out.println("Locale's extension: " + b.build().getExtension('x'));

The output is a bit surprising (the extension does NOT appear in the toString() output):
Locale: 
Locale's extension: gr2-spprepen-nldede

At first glance this seems to be a bug, but the spec at http://tools.ietf.org/html/rfc5646#page-16, especially section 2.2.6, says:

An extension MUST follow at least a primary language subtag.
That is, a language tag cannot begin with an extension.

This may hint at why the Locale behaves as shown above. Still, if a Locale is invalid, it should not be possible to create/build it in the first place...
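
To see where the extension ends up in an extension-only Locale, the following sketch prints the other accessors next to toString(). The class name ExtensionOnlyLocaleDemo and the helper extensionOnly are made up for illustration. As far as I can tell, the Javadoc of Locale.toString() states that the empty string is returned whenever both language and country are missing, even if extensions are present, so the extension is still stored; it is just not rendered.

```java
import java.util.Locale;

class ExtensionOnlyLocaleDemo {

    // A locale that carries nothing but the private-use extension.
    static Locale extensionOnly() {
        return new Locale.Builder()
                .setExtension('x', "gr2-spPrepen-nldeDE")
                .build();
    }

    public static void main(String[] args) {
        Locale locale = extensionOnly();
        System.out.println("toString():         '" + locale + "'");
        System.out.println("getExtensionKeys(): " + locale.getExtensionKeys());
        System.out.println("getExtension('x'):  " + locale.getExtension('x'));
        // The BCP 47 representation, which does include the extension.
        System.out.println("toLanguageTag():    " + locale.toLanguageTag());
    }
}
```

If the extension has to survive serialization, toLanguageTag() looks like the safer choice than toString() here.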

Now, when setting a language in our example with:
b.setLanguage("de");

the toString() result now seems to be correct:
Locale: de__#x-gr2-spprepen-nldede
Locale's extension: gr2-spprepen-nldede

The same applies when setting only a region...

> Locale: _DE_#x-gr2-spprepen-nldede
> Locale's extension: gr2-spprepen-nldede


...or when setting both a region and a language; the output is also as expected:

> Locale: de_DE_#x-gr2-spprepen-nldede
> Locale's extension: gr2-spprepen-nldede
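
The four combinations discussed above can be reproduced in one run with a sketch like the following. The class name ExtensionToStringDemo and the helper build are made up for illustration; passing an empty string to setLanguage or setRegion leaves that field unset.

```java
import java.util.Locale;

class ExtensionToStringDemo {

    // Builds a locale from an optional language and region plus the
    // private-use extension used in this post. Empty strings unset a field.
    static Locale build(String language, String region) {
        return new Locale.Builder()
                .setLanguage(language)
                .setRegion(region)
                .setExtension('x', "gr2-spPrepen-nldeDE")
                .build();
    }

    public static void main(String[] args) {
        String[][] cases = { {"", ""}, {"de", ""}, {"", "DE"}, {"de", "DE"} };
        for (String[] c : cases) {
            Locale locale = build(c[0], c[1]);
            System.out.printf("language='%s' region='%s' -> toString()='%s'%n",
                    c[0], c[1], locale);
        }
    }
}
```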

Finally, I also tried some special inputs based on the constraints defined by the specification, and I was able to create other invalid Locale instances relatively easily:

  • b.setExtension('c', "de-DE"); is converted to c-de-de, which is invalid because de is duplicated in the final representation (although it is required to be unique).
  • b.setExtension('c', "c-de"); is converted to c-c-de-de, which is invalid because the extension singleton c is duplicated in the final representation (although it is required to be unique).
  • b.setExtension('c', "x-de"); is converted to c-x-de-de, which is invalid because an extension singleton must be followed by at least one tag of its own, which is not the case for c in c-x-de.
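
To re-check these special inputs on a given JDK, a probe like the following may help. It is only a sketch: the class name ExtensionEdgeCaseProbe and the helper probe are made up for illustration, and it deliberately assumes nothing about the outcome, since the builder may accept or reject each value depending on the JDK version. It reports either the resulting language tag or the builder's error message.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

class ExtensionEdgeCaseProbe {

    // Tries to build a "de" locale with extension key 'c' and the given
    // value. Returns the language tag, or a note if the builder refuses.
    static String probe(String value) {
        try {
            Locale locale = new Locale.Builder()
                    .setLanguage("de")
                    .setExtension('c', value)
                    .build();
            return locale.toLanguageTag();
        } catch (IllformedLocaleException e) {
            return "rejected: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        for (String value : new String[] {"de-DE", "c-de", "x-de"}) {
            System.out.println("'" + value + "' -> " + probe(value));
        }
    }
}
```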
So be careful when using the Locale extension mechanism. I will also post this to the i18n colleagues at OpenJDK; I wonder what they think...