
Program - Session Descriptions


Monday, October 17, 2011



Richard Ishida
Internationalization Lead,

Track 1: An Introduction to Writing Systems & Unicode

The tutorial will provide you with a good understanding of the many unique characteristics of non-Latin writing systems, and illustrate the problems involved in implementing such scripts in products. It does not provide detailed coding advice, but does provide the essential background information you need to understand the fundamental issues related to Unicode deployment, across a wide range of scripts. It has also proved to be an excellent orientation for newcomers to the conference, providing the background needed to assist understanding of the other talks! The tutorial goes beyond encoding issues to discuss characteristics related to input of ideographs, combining characters, context-dependent shape variation, text direction, vowel signs, ligatures, punctuation, wrapping and editing, font issues, sorting and indexing, keyboards, and more. The concepts are introduced through the use of examples from Chinese, Japanese, Korean, Arabic, Hebrew, Thai, Hindi/Tamil, Russian and Greek. While the tutorial is perfectly accessible to beginners, it has also attracted very good reviews from people at an intermediate and advanced level, due to the breadth of scripts discussed. No prior knowledge is needed.


Addison Phillips
Globalization Architect
Lab126 (Amazon)

Track 2: Internationalization: An Introduction, Part I: Characters and Character Encodings

What is internationalization? What do developers, product managers, or quality engineers need to know about it? How does a software development organization incorporate internationalization into the design, implementation, and delivery of an application?

This tutorial track provides an introduction to the topics of internationalization, localization and globalization. Attendees will understand the overall concepts and approach necessary to analyze a product for internationalization issues, develop a design or approach, and deliver a global-ready solution. The focus is on architectural approaches and general concepts, but will include specific examples and exercises.

Part I focuses on characters, character encodings, and the basics of Unicode.


Thomas Milo
Partner, DecoType

Track 3: Comprehensive Arabic Script Tutorial

This is a completely revised and updated version of a comprehensive tutorial that has been presented many times at the Unicode Conference. It covers all aspects of Arabic-script computing, from calligraphy to typography history, script structure, orthography, encoding, dumb and smart computer typography, types of line-breaking, language-specific issues, and much more.
10:30-10:45 - Morning Refreshments

Richard Ishida
Internationalization Lead,

Track 1: An Introduction to Writing Systems & Unicode (Cont'd.)



Addison Phillips
Globalization Architect
Lab126 (Amazon)

Track 2: Internationalization: An Introduction (Part II, Writing Global Ready Code)

Part II focuses on preparing for the localization (translation) of user interfaces; making applications "locale-aware", including format and display differences; as well as approaches to delivering multi-lingual and multi-locale software or content.


Su Liu
AIX Globalization Architect,

Track 3: Smart Code Set Conversions for Unicode Support in Heterogeneous Environments

A modern storage network involves hundreds of code sets and thousands of conversion modules for information services. In such a heterogeneous environment, it is crucial to convert Unicode data to non-Unicode data, and vice versa, dynamically, efficiently, and accurately. Developing and maintaining Unicode conversion services is therefore becoming more challenging, owing to inconsistent code set names, multiple encoding schemes, variant code mapping tables, multiple versions of encoding standards, and diverse OS platforms. This tutorial introduces the mechanisms of code set conversion design and explains how Unicode technologies solve problems at the application, operating system, and network layers. It first compares the major code set conversion algorithms, then focuses on the challenges of converting between Unicode and non-Unicode data within UNIX and across networks. A further discussion of smart and advanced conversion introduces methods and solutions for conflicts and problems involving modifiers, endianness, composed and decomposed characters, multiple versions of code set standards, and code set alias names. Examples illustrate the options for complex text and CJK manipulation. Finally, the tutorial addresses implementation strategies for code set conversion and the choice between native UNIX-based and ICU-based conversion functions for future Unicode support in heterogeneous environments.

Keywords: Unicode, AIX, UNIX, Code Set, Conversion, Operating System
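
Three of the pitfalls named in this abstract (code set alias names, endianness, and composed versus decomposed characters) can be illustrated with a minimal sketch. It uses only the Python standard library codecs, not the AIX or ICU converters the tutorial covers, and the helper names canonical_codec_name and convert are hypothetical.

```python
import codecs
import unicodedata


def canonical_codec_name(alias):
    """Resolve a code set alias to Python's canonical codec name."""
    return codecs.lookup(alias).name


def convert(data, source, target):
    """Convert bytes from one code set to another by way of Unicode.

    The text is normalized to NFC first, so that a decomposed sequence
    (base letter + combining mark) can map to a single legacy code point.
    """
    text = data.decode(source)
    text = unicodedata.normalize("NFC", text)
    return text.encode(target)


# Alias names: different spellings resolve to the same converter.
same_codec = canonical_codec_name("latin1") == canonical_codec_name("ISO-8859-1")

# Endianness: U+00E9 has two possible byte orders in UTF-16.
big_endian = "\u00e9".encode("utf-16-be")
little_endian = "\u00e9".encode("utf-16-le")

# Decomposed "e" + COMBINING ACUTE ACCENT arrives as UTF-8 and is
# converted to ISO 8859-1, which only has the precomposed character.
decomposed_utf8 = "e\u0301".encode("utf-8")
latin1_bytes = convert(decomposed_utf8, "utf-8", "iso-8859-1")
```

Without the normalization step, the last conversion would fail, because ISO 8859-1 has no code point for the combining accent on its own.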

12:30-13:30 - LUNCH


Craig R. Cummings

Michael G McKenna

Senior International Engineering Manager, Zynga

Track 1 - Unicode - A Grand Tour

This tutorial will cover the next level of detail of what Unicode is, and how it is used in the real world. The modules of the tutorial include: The Unicode standard - what are the "Guiding Lights", or design principles behind Unicode? A tour of Unicode's structure, encoding forms, behavior, technical reports, database, and how to use the Unicode Standard. Implementation according to Unicode - a walk through the details of attributes, compatibility, non-spacing characters, directionality, normalization, graphemes, complex scripts, surrogates, collation, regular expressions and other aspects according to the Unicode Standard and associated Technical Reports. Unicode and the Real World - an overview of Unicode-based internationalization development libraries and implementations supporting Unicode in web servers, application servers, browsers, C/C++, Java, PHP, SQL, and various operating systems. On-going programs - how Unicode is evolving and how you can participate in its future. Pointers to other sessions at the conference that dive deeper on particular topics are highlighted throughout.


Tex Texin
Chief Globalization Architect, Rearden Commerce, Inc. 

Track 2 - Tutorial Web Internationalization - Standards and Best Practices

This tutorial is an introduction to internationalization on the World Wide Web. The audience will learn about the standards that provide for global interoperability and come away with an understanding of how to work with multilingual data on the Web. Character representation and the Unicode-based Reference Processing Model are described in detail. HTML, including HTML5, XHTML, XML (eXtensible Markup Language; for general markup), and CSS (Cascading Style Sheets; for styling information) are given particular emphasis. The tutorial addresses language identification and selection, character encoding models and negotiation, text presentation features, and more. The design and implementation of multilingual Web sites and localization considerations are also introduced.


Jim DeLaHunt
Jim DeLaHunt & Associates

Track 3 - Building multilingual websites in Drupal 7 and Joomla 1.6

A practical look at the language and locale capabilities of Joomla! 1.6 and Drupal 7, two leading free software content management systems (CMSs). They let you build more powerful, more international websites faster. We look at their core internationalisation and locale services, and at localisation of UI and content. Each platform has just had a major release, with advances in internationalisation. You will leave with specific tips for building your own site. We don't assume Joomla! or Drupal experience, but do include material for advanced practitioners. A good tutorial for web site product managers, web designers, developers, and managers of international web teams.
15:30-15:45 - Afternoon Refreshments


Michael G McKenna
Craig R Cummings
Senior International Engineering Manager, Zynga

Track 1 - Unicode - A Grand Tour (Cont'd.)



Loïc Dufresne de Virel
Localization Strategist


Track 2 - Internationalization Testing Best Practices

In this tutorial, attendees will learn how to develop a systematic I18N validation plan. Touching on code scans, pseudo-builds, pseudo-locales, testing on localized Operating Systems, and international test data, the authors will share their years of experience in the field of internationalization & localization, showing you how to identify I18N issues as early as possible to avoid taking on unexpected amounts of technical debt! To illustrate their presentation, they will use actual issues that were found, and sometimes missed, during recent localization projects - guaranteeing a fun and practical session.

Co-authors / presenters:
Michael Kuperstein, Intel, Localization Engineer
Octavio Ramos, Intel, Validation Lead


John Emmons
Senior Software Engineer

Track 3 - Using ICU Workshop

This tutorial gives attendees everything they need to know to get started with working with text in computer systems: character encoding systems, character sets, Unicode, and text processing, using the International Components for Unicode library (ICU).

ICU is a very popular internationalization software solution. However, while it vastly simplifies the internationalization of products, there is a learning curve.

The goal of this tutorial is to help new users of ICU install and use the library. Topics include: Installation (C++ libraries, Java .jar files, Java SPI for JDK integration), verification of installation, introduction and detailed usage analysis of ICU's frameworks (normalization, formatting, calendars, collation, transliteration). The tutorial will walk through code snippets and examples to illustrate the common usage models, followed by demonstration applications and discussion of core features and conventions, advanced techniques and how to obtain further information. It is helpful if participants are familiar with Java, C and C++ programming. Issues relating to ICU4C/C++ as well as ICU4J (Java) will be discussed. After the tutorial, participants should be able to install and use ICU for solving their internationalization problems.

18:00-19:00 - Welcome Reception  


Tuesday, October 18, 2011



Laura Welcher
Director of Operations, The Rosetta Project

KEYNOTE PRESENTATION - Building the Multilingual Web - The Long View

The Rosetta Project at The Long Now Foundation is working to build an open public digital collection of all human language as well as an analog backup that can last for thousands of years - The Rosetta Disk.

In the "long now," the goal is long-term storage and access to information on the scale that both supports and transcends individual human societies and civilizations. In the "here and now" the project serves to support and amplify the importance of the world's nearly 7,000 human languages, the vast majority of which are endangered and, if current trends continue, likely to go extinct in the next 100 years.

The Rosetta Project shares the Unicode vision of a world where people can use communication technology on their own terms - in their own language. According to World Internet Statistics, over 80% of all web communication is in about ten languages, with over half in either English or Chinese. The remaining 20% represent "everyone else" including about 400 languages with speaker populations above 1 million, which collectively comprise about 95% of everyone on earth. Because of essential technologies like Unicode, we are poised to see this breadth of human languages flourish online and on mobile devices, providing for these languages a critical new domain of language use in the modern world. I will present several efforts underway at The Rosetta Project including the "Language Commons" that rely on Unicode as an essential technology in building the multilingual Web.

10:00-20:00 -  EXHIBIT AREA OPEN
10:00-10:30 - Morning Refreshments in Exhibit Area
10:30-11:20  SESSION 1


Addison Phillips
Globalization Architect
Lab126 (Amazon)

Richard Ishida

Chair, Activity Lead
W3C Internationalization WG

Track 1 - To the Promised Land: I18N Developments in HTML5 and CSS3

A new era of competition between the major browsers has rekindled work on the HTML and CSS standards. One of the results has been a renewed focus on providing features to support international content in new and exciting ways, from typeset quality presentation to vertical text; from East Asian support such as ruby and emphasis to improved bidirectional language support.

In this presentation we'll explore the changes that are available today, the status of these standards, and the challenges that remain.


Aharon Lanin
Software Engineer, Google Inc.

Andy Staudacher
Software Engineer, Google Inc.

John Tamplin
Software Engineer, Google Inc.

Katsuhiko Momoi
Staff Test Engineer, Google Inc.

Mark Davis
Sr. Internationalization Architect, Google Inc.



Track 2 - Pseudolocalization at Google -- some innovations

In this presentation we describe some innovations in the use of pseudo locales within Google, focusing on a standardized pseudo locale naming scheme based on BCP 47, a small set of standard pseudo locales with exact definitions, and their use in automated checks that detect major internationalization and bidirectionality issues. To spread good practices in internationalization testing across the industry, we recently released an open source pseudolocalization Java library. Standardizing pseudo locales is essential for this purpose as well.

We begin with a brief introduction to pseudolocalization concepts, its current merits and limitations. We argue that standardizing the naming scheme (BCP 47 compliant) and defining a few standard pseudo locales offers many benefits for Google's development environments and similar ones elsewhere.

A project can host other projects and expect all components to use exactly the same pseudo locale methods with the standard pseudo locales: "en-psaccent" (the LTR pseudo locale) and "ar-psbidi" (the FakeBidi pseudo locale). We will demo a BidiChecker tool that can run on the FakeBidi locale.

These innovations in the use of pseudo locales allow us to detect localizability issues early via automated checks, encourage development of new tools and tests on them, and help improve internationalization quality of products.
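
The two kinds of pseudo locale named in this abstract can be sketched roughly as follows. This is an illustration only, not the Google library: the accent map, the bracket markers, the {name} placeholder syntax, and the use of RLO/PDF override characters to fake right-to-left text are all assumptions, and the function names are hypothetical.

```python
import re

# Hypothetical accent map: each ASCII vowel maps to an accented look-alike.
ACCENTED = str.maketrans(
    "aeiouAEIOU",
    "\u00e1\u00e9\u00ed\u00f3\u00fa\u00c1\u00c9\u00cd\u00d3\u00da",
)

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Assumed placeholder syntax: {name}. Placeholders must stay untouched.
PLACEHOLDER = re.compile(r"(\{[^{}]*\})")


def _map_outside_placeholders(message, transform):
    """Apply transform to every part of message that is not a placeholder."""
    parts = PLACEHOLDER.split(message)
    # re.split with a capturing group puts the placeholders at odd indexes.
    return "".join(
        part if index % 2 else transform(part)
        for index, part in enumerate(parts)
    )


def pseudo_accent(message):
    """Accent the vowels and add brackets so that truncation is visible."""
    body = _map_outside_placeholders(message, lambda s: s.translate(ACCENTED))
    return "[" + body + "]"


def pseudo_bidi(message):
    """Force each word to display right-to-left, simulating an RTL locale."""
    def override_words(text):
        return re.sub(r"\S+", lambda m: RLO + m.group(0) + PDF, text)
    return _map_outside_placeholders(message, override_words)
```

For example, pseudo_accent("Hello {name}") keeps the placeholder intact while accenting the visible text, so an unlocalized hard-coded string stands out in the UI because it is neither accented nor bracketed.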


Martin J. Dürst
Aoyama Gakuin University
Shunsuke Oshima
Master Course Student
Aoyama Gakuin University

Track 3 - Implementing Better Source Editing for Bidirectional HTML and XML in the Text Editor Emacs

Authors: Shunsuke Oshima and Martin J. Dürst

The Unicode Bidirectional Algorithm (UBA) is tailored for running text such as letters and newspaper articles. However, it is not suited directly for structured formats such as XML, HTML and programming languages. The source is often reordered in unpredictable ways that are unrelated to the logical structure of these formats, and therefore, source editing was essentially impossible. In this paper, we present a solution to this problem and its implementation in the text editor Emacs.

Emacs is a very flexible and extensible text editor providing an integrated environment for a wide variety of development tasks. Extensibility is based on Emacs Lisp, which we also have used for our research. While there have been some experimental implementations of bidirectional rendering in Emacs in the past, a full implementation of the UBA has only become available recently in Emacs version 24.

In earlier research, we implemented a Web-based simulation for XML and XHTML source rendering (IUC28) and a JavaScript-based experimental editor (IUC32). These implementations, however, were standalone and did not reach the level of practical usability.

The problem of bidirectional source editing for structured formats such as (X)HTML and XML can be divided into three areas. The first area is the treatment of syntactically significant characters, for example the ubiquitous angle brackets in HTML and XML. The UBA classifies them as neutral so that they follow the direction of their surrounding text, including potential mirroring, which is appropriate for running text. However, in HTML and XML they define the overall structure of the markup and therefore have to be treated as strong.

The second area is the treatment of bidirectional control characters such as LRM and RLM in source editing. Without intervention, such characters can be entered either literally, in which case they show their effect but are invisible and therefore difficult to edit, or they can be entered in escaped form (e.g. &lrm; or &rlm;) in which case they are visible but not effective. Ideally, they would be both visible and effective.

The third area is bidirectional markup such as the 'dir' attribute or the <bdo> element in HTML, which for ease of editing should be reflected in the layout of the element content during source editing.

All the above areas can be addressed by carefully placing additional bidirectional control characters into the source text. We had already worked out many of the details of this placement in our earlier research. The main difficulty with using additional bidirectional control characters is that they are not part of the actual source text and therefore have to be distinguished from the same characters when they are part of the source, and have to be carefully removed for operations such as copying and saving. We implemented two different ways of doing this. One way uses a special Emacs property to distinguish these characters so that they can be removed before the relevant operations. The other way uses Emacs overlays, which by definition are not part of the text proper. In many ways, this would be the ideal solution, but overlays currently are not taken into account for bidirectional rendering.

Another problem is that the inserted bidirectional control characters have to be recalculated for every single editing operation. Limiting insertion to the currently visible part of the text being edited makes sure that we achieve acceptable performance even for very long source files.

We are currently extending our implementation to work with TeX, and are looking into ways to fine-tune our implementation based on user feedback.

11:30-12:20  SESSION 2


Joel Sahleen
Software Developer, Adobe Systems Incorporated

Track 1 - Enterprise PHP Internationalization and Localization: A Case Study

PHP is one of the most popular server-side scripting languages on the internet. According to the latest data from, there are currently more than 25 million sites that run on PHP, including a third of the top million most visited. Although PHP may be best known for its use in high-profile, consumer-oriented sites such as Facebook and Yahoo!, in recent years the language has begun to carve out a new role for itself as part of the technology stack used to build business-critical, enterprise web applications like the Adobe Online Marketing Suite.

Enterprise web applications tend to have different globalization requirements and constraints than consumer-oriented sites, and so when it comes to internationalization and localization, they must be approached in a somewhat different manner. Strategies that work well for consumer-oriented sites may not work well for enterprise web applications, and vice versa.

This presentation examines the issues surrounding the internationalization and localization of large-scale, PHP-based web applications in an enterprise, SaaS context, using the Adobe Online Marketing Suite as a point of reference. The primary goal of the presentation is to show how enterprise internationalization and localization differs from non-enterprise internationalization and localization, and provide an overview of the different resources that are available to do enterprise internationalization and localization in PHP. By looking at some common internationalization and localization problems, and then describing how these problems were dealt with in the case of the Online Marketing Suite, I hope to demonstrate how PHP can be used in conjunction with other technologies like ICU to create fully internationalized, high performance web applications that are both easily localizable and highly scalable.


Tak Takahashi
Globalization Engineering Manager, Teradata



Track 2 - Localization Optimization Using Translation Repositories

Typically, localization requires the heavy involvement of translation companies to translate software resource strings as well as documentation. Translation of these strings and texts depends entirely on translation memories (TMX), which are mainly used by translation companies. At Teradata, in order to optimize the localization process and minimize external translation costs, we a) pre-process translatable strings in various formats, such as Java properties and .NET resx files, and store them in a common format called the translation repository, b) "migrate" translations from older versions to newer versions automatically, c) "pre-translate" strings using translation repositories, and then d) submit the strings to translation companies with a status flag on each resource string indicating whether it is new or modified. Translation companies process and translate only the strings that require new translation. This localization process and its supporting tools usually reduce the external translation costs for software by an average of 20-30%. The process and tools were developed internally and have been used for software localization at Teradata for years. We are now capable of handling DITA XML documentation in the same localization process as well.

Below is a list of challenges we face in software localization and translation:

  • For frequently updated/released applications, we need to "migrate" translation from older versions to newer versions, for example from Teradata CRM to This migration process may or may not require new translation by the translation companies, but the language pack may need to be specific for that particular release. We want to automate this translation migration process.
  • Cost reduction is always a challenge for everyone. We want to reduce the external translation costs we pay to the translation companies. When we translate software or documentation, every word sent to the translation companies will cost us, even if it was translated before.
  • Once we have localized, for example, a product-A, we would like to share or reuse translations for other products, without involving the translation companies.
  • We would also like to improve the localizability test and avoid any functional problems caused by inappropriate translation or localization.

To address these issues, we have designed and implemented a new localization process using the translation repository and its supporting tools. The translation repository was designed as a relational database that consists of tables, columns, and rows, so that we can scan tables and submit a query by SQL to search English or translated strings.
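
A translation repository of the kind described above could look roughly like the following sketch. It uses an in-memory SQLite database; the table layout, column names, status values, and function names are all hypothetical and are not taken from the Teradata tools.

```python
import sqlite3


def build_repository(rows):
    """Create an in-memory repository from
    (product, resource_key, locale, source_text, target_text) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE translations ("
        " product TEXT, resource_key TEXT, locale TEXT,"
        " source_text TEXT, target_text TEXT)"
    )
    conn.executemany("INSERT INTO translations VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def pre_translate(conn, new_strings, locale):
    """Flag each string of a new release before it goes to the vendor.

    Returns {resource_key: (status, translation)} where status is
    "translated" (English text already in the repository, reused as is),
    "modified" (key known, English text changed) or "new".
    """
    result = {}
    for key, source_text in new_strings.items():
        by_text = conn.execute(
            "SELECT target_text FROM translations"
            " WHERE source_text = ? AND locale = ? LIMIT 1",
            (source_text, locale),
        ).fetchone()
        by_key = conn.execute(
            "SELECT target_text FROM translations"
            " WHERE resource_key = ? AND locale = ? LIMIT 1",
            (key, locale),
        ).fetchone()
        if by_text is not None:
            result[key] = ("translated", by_text[0])
        elif by_key is not None:
            result[key] = ("modified", None)
        else:
            result[key] = ("new", None)
    return result


repository = build_repository([
    ("product-a", "save.button", "ja", "Save", "\u4fdd\u5b58"),
    ("product-a", "greeting", "ja", "Hello", "\u3053\u3093\u306b\u3061\u306f"),
])
flags = pre_translate(
    repository,
    {"save.button": "Save", "greeting": "Hello there", "exit.button": "Exit"},
    "ja",
)
```

Only the strings flagged "modified" or "new" would need to go to the translation company, and because the lookup is by English text rather than by product, a translation made for one product can be reused for another.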

At Teradata, we now generate and maintain translation repositories for any software localization. The translation repositories enable:

  • Localization migration
  • In-house translation sharing
    • Reuse and sharing of translation
    • Facilitate in-house translation
  • Translation using common resource file format for any software translation
  • Facilitate validation and corrections of translations
  • Localization problem investigation
  • Reduction of external translation costs

In 2010, we also enhanced our tools and process to handle DITA documentation. We are now capable of pre-processing DITA XML documentation, pre-translating elements using translation repositories generated from the software, and reducing the external translation costs.


Murray Sargent III
Partner Software Design Engineer, Microsoft

Track 3 - Bidi Parentheses Algorithm

Ayman Aldahleh, Gilead Almosnino, Peter Constable, Dylan Deverill, Andrew Glass, Michael Kaplan, Laurentiu Iancu, Dwayne Robinson, Murray Sargent, Robert Steen

Microsoft Corporation

The Unicode Bidi Algorithm (UBA) can display parenthesized text in strange ways, such as a)b) instead of (a)b, as described in a recent blog post*. That post describes an algorithm to fix the display of many such cases and that algorithm shipped with Microsoft Excel 2007/2010. The problematic cases have different UBA directionalities for the two parentheses of a matched pair. The algorithm essentially says that in such ambiguous cases, use the paragraph (or embedding) directionality for both parentheses. In all cases, increment the levels of text runs inside by 2 when necessary to keep the text inside the parentheses. In this presentation we describe an enhancement of the algorithm to deal with cases in which the text inside the parentheses has a single directionality. For such cases, the directionality of the parentheses is chosen to be the same as the directionality of the text within them. This refinement handles a set of anomalous cases where, for example, parenthesized English text appears in right-to-left paragraphs.

This improved algorithm displays the vast majority of parenthesized text the way one would want, but it is fair to say that no simple algorithm can handle all cases. The UBA has the LRE, RLE, PDF, LRO, RLO Bidi control characters to force particular choices. In the event that any of these Unicode Bidi control characters are used, the Bidi parentheses algorithm is not used, because the assumption is made that the user has specific choices in mind. The simpler Unicode LRM and RLM control characters can be used with the algorithm if desired.
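
The direction-choice rule described above can be sketched as follows. This is a simplified illustration, not the shipped algorithm: it only picks a direction for each matched pair of parentheses, falls back to the paragraph direction whenever the enclosed text is mixed or has no strong characters, and does not compute embedding levels or handle explicit Bidi control characters. The function names are hypothetical.

```python
import unicodedata


def strong_directions(text):
    """Return the set of strong directions ("L" or "R") found in text."""
    found = set()
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            found.add("L")
        elif bidi_class in ("R", "AL"):
            found.add("R")
    return found


def paren_directions(text, paragraph_direction):
    """Choose a direction for each matched pair of parentheses.

    Returns a list of (open_index, close_index, direction) tuples.
    A pair whose content has a single directionality takes that direction;
    every other pair takes the paragraph direction.
    """
    results = []
    open_positions = []
    for index, char in enumerate(text):
        if char == "(":
            open_positions.append(index)
        elif char == ")" and open_positions:
            start = open_positions.pop()
            inside = strong_directions(text[start + 1:index])
            if len(inside) == 1:
                direction = next(iter(inside))
            else:
                direction = paragraph_direction
            results.append((start, index, direction))
    return results
```

For parenthesized English text in a right-to-left paragraph, such as Hebrew letters followed by "(abc)", the pair is assigned "L", which corresponds to the anomalous case the refinement is meant to handle.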

12:30-13:30 - LUNCH
13:30-14:20  SESSION 3


Matt Sanford
Tech Manager of International Team, Twitter Inc.

Track 1 - Internationalizing Twitter

Social networks connect you to people you already know. Twitter's model of connecting people to what's most important to them at any given time has created a cross-language and cross-cultural network unlike any other. All of this creates changes to language and communication that present a unique localization challenge. With our continued expansion in Japan, our translation of, and our work on supporting Tweets in any language, we've learned some valuable lessons we're excited to talk about. Twitter should work in any language, even if we don't have the resources to support a full localization, and making that happen is a large undertaking. 


Shaopeng Jia
Senior Software Engineer
Google Switzerland GmbH

Track 2 - libphonenumber - The Swiss Army Knife of International Telephone Number Handling

The libphonenumber project is an open source project from Google that provides Java, C++ and JavaScript APIs for parsing, validating and formatting international phone numbers for over 200 countries. This presentation will walk you through some common challenges in handling international phone numbers and discuss, hands-on, how they can be addressed with the APIs provided by the library. It will also present some common misconceptions about international phone numbers and recommend best practices for handling them.

This is a 201 level talk compared to the talk I gave last year. It provides a more in-depth look at international phone number handling, and places focus on the new development of the library in the past year.


Roozbeh Pournader
Internationalization Engineer, Google

Track 3 - Bidirectionalization and Localization

In the process of bidirectionalizing a software application, the main challenge appears to be mirroring the interface properly and supporting all the nuances of the very sensitive markets. Getting the localized strings displayed properly usually gets overlooked. But it's a long and error-prone path, from translating a string by a linguist not familiar with the details of the Unicode Bidirectional Algorithm (UBA) to the application displaying the string to the final user. This talk will suggest best practices for localizing for the bidirectional markets, and cutting the cost caused by the expensive endless loop of finding bidi bugs in translated strings late in the QA process. A comparison of some existing tools and platforms will also be provided, together with suggested solutions to harder-to-handle issues caused by the UBA.
14:30-15:20  SESSION 4


Norbert Runge
Test Engineer,
Google, Inc. 


Track 1 - Encoding Health of the WWW

Web pages are written in many languages, using dozens of character encodings. In recent years the World Wide Web has steadily migrated towards Unicode (UTF-8). This presentation examines the "encoding health" of publicly accessible web pages: What types of encoding errors are typical, what are the symptoms and causes, how frequently do they occur, and how should they be fixed? I will show techniques for finding and evaluating such problems, show how a search engine can work around them, and how they are exposed to webmasters.

For example, some pages are double-converted to UTF-8, turning 'ü' into 'Ã¼'. Many pages have the wrong encoding declared, others contain a mix of encodings. I will show details for this and other examples.
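
The double conversion mentioned above can be reproduced, and in simple cases undone, with a few lines of standard-library Python. This is a minimal sketch that assumes the wrongly applied legacy encoding was ISO 8859-1 (in practice it is often windows-1252); the function names are hypothetical and the repair is a heuristic, not the technique used by any particular search engine.

```python
def double_encode(text):
    """Simulate the bug: UTF-8 bytes are misread as ISO 8859-1 and the
    resulting characters are encoded to UTF-8 a second time."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def repair_double_encoding(data):
    """Undo one round of double conversion where that is possible.

    If the decoded text cannot be mapped back to ISO 8859-1 bytes that form
    valid UTF-8, the data was probably not double-encoded and the plain
    UTF-8 decoding is returned unchanged.
    """
    text = data.decode("utf-8")
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = double_encode("\u00fc")          # the bytes a broken page serves
shown = broken.decode("utf-8")            # what a browser displays
fixed = repair_double_encoding(broken)    # the intended character
```

Because a correctly encoded page could legitimately contain the two-character sequence that the bug produces, a real detector would need more evidence, such as how often such sequences occur on the page, before rewriting anything.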


Mark Davis
Staff Test Engineer & other titles, Google, Inc

Luke Swartz

Product Manager, Google

Track 2 - Internationalization in Google+

Google+ launched from day one in over 40 languages. It adds a number of new internationalization capabilities, which we'll review in this presentation.


Thomas Milo


Track 3 - DecoType font concepts and design tools for Arabic typefaces

DecoType are the first to develop novel, highly automated, template-based design tools for Arabic fonts that cut development time to a minimum and eliminate the need for complex table building. By exploiting DT's unique smart font architecture, porting a conventional font to the DecoType format in particular can be done in a few hours. The result is a contextually fully programmed typeface that supports the complete Arabic block of the latest Unicode Standard. In addition, an advanced template guides the designer in adding to her or his typeface the essential dissimilation features that are often omitted because of their complexity. A dozen such typefaces have already been made for WinSoft Tasmeem.
15:20-16:00 - Afternoon Refreshments in Exhibit Area


Luke Swartz
Product Manager, Google

Track 1 - Web App i18n/l10n

Web Applications--including those written for various mobile platforms--continue to grow in number and sophistication, in many cases displacing traditional desktop software. However, traditional tools and methods for making software work internationally do not always work for web applications. At Google, we are trying to solve the following problems:

- Core i18n Libraries: Giving web applications access to internationalization libraries on par with those available to desktop applications.
- Resources: Allowing web applications to easily package and manage strings and other international resources.
- Localization: Making it easy to translate web applications' strings and other resources.

In this talk, we will explore each of these challenges, and show some potential solutions, which will help web applications become fully international.


Helena S Chapman
Program Director, IBM

Track 2 - Localization Data Standards: Apathy, Skepticism, and Cynicism

According to a 2009 European Union study, the language industry's compound annual growth rate was estimated at a minimum of 10% over the next few years, resulting in an approximate value of €16.5 billion to €20 billion in 2015. For an industry with such potential, it is difficult to comprehend the lack of technical leadership and investment in the interoperability of localization data. In this session, we will explore why open standards and consistent implementation of these standards are important to your organizations, the current status of industry support for localization data interchange standards, and what can be done to reinvigorate the focus in this area.

We will also take time to examine where the data interoperability gaps are in the end-to-end content/data life cycle of a localization request. What roles can the Unicode Consortium play in helping mature and drive the most appropriate adoption of its existing and future standards and assets? Most importantly, as a supporter of the Unicode Standard, what do you need to be aware of in helping to drive these standards so that they benefit the localization operations of your organization, directly or for your clients?


Adil Allawi
Technical Director,
Diwan Software Limited

Track 3 - Unicode and the Revolution

What is now known as "the Arab Spring" has pushed a new wave of Arabic language users into the world of social media. But support for bi-di languages (like Arabic) on the web is difficult and can sometimes feel like a second-class experience. What problems do these users face and what can the various web companies do to help them? The presentation will try to answer these questions and find a way forward for the future.

For the past five years I have been the Iraq correspondent to Global Voices and had the chance to work closely with those on the leading edge of the Arabic social media revolution. In this presentation I will seek to tie this experience with the technology that underlies it.

I will review the history of the Arabic language on the web, its current problems and successes, and make a comparison between the uses of social networking sites in Arabic and their equivalent uses in mono-directional languages. For the presentation, I also hope to include feedback from active users of social networks on how they use this medium and what their frustrations and successes are on their chosen devices.

Finally I will conclude with a review of how Global Voices reports on social networks in multiple languages and make recommendations on how social web sites can improve Arabic support for their users and those reporting the conversations.

17:00-17:50  SESSION 6


Andrew Swerdlow

Co-Author:

Nav Jagpal

Technical Program Manager, Google


Track 1 - Breaking The Language Barrier On The Social Web

The social web has arrived: Social Network Sites (SNSs) such as Facebook have over half a billion users, who spend 700 billion minutes per month on the site. SNSs are no longer just for English speakers; recent reports indicate that more than 50% of Tweets are in languages other than English. As social networks increase the diversity of languages published on their sites, it can be difficult and confusing for users to understand each other. Social Translate is an open source project, developed as a Chrome extension, that attempts to automatically translate event streams and friends' comments on SNSs. The extension allows users to select their primary language. When a user visits a social network site such as Facebook or Twitter, it uses Google Translate to detect the language of the event stream and translate the text into the user's primary language. Social Translate also allows users to select multiple secondary languages that will not be translated. This is useful for users who speak multiple languages and would like to have their event streams displayed in several languages. The Social Translate project serves as an interesting research case on how to combine machine translation (MT) with the real-time social web in an open, accessible way. This presentation will provide an analysis of the quality of machine translation on social networking sites such as Twitter and discuss the impact of MT on social media.


Michael S. Kaplan
Program Manager,
Microsoft Corporation

Track 2 - Korean Hangul: from Sejong the Great's Hunmin Jeongeum to Unicode 6.1

Hangul has had a long history from the 1446 document that first described the underlying Jamo to the latest Jamo additions to the Unicode Standard. This presentation will do a whirlwind and only mildly irreverent tour of that history in the form of a presentation to Sejong to explain what has happened, highlighting the use, encoding, and re-encoding of one of the more perfect alphabets, imperfectly handled, in this or any other age.


Behnam Esfahbod
Yahya Tabesh

Sharif University of Technology

Track 3 - An Abstract Model for the Typography of Perso-Arabic Script

Perso-Arabic script, the second-most used writing system in the world, has many unique properties which have made its computation harder than that of some other scripts. In this paper we introduce an abstract model for the typography of Perso-Arabic script which exhibits the hidden properties of the script and makes the typographical computation of Perso-Arabic text possible. These properties have been ignored in most of the recent works on Perso-Arabic script, especially in the international standards.

Each letter in Perso-Arabic script, in any of its cursive forms, is constructed from a Base Shape, and some of them are accompanied by Auxiliary Shapes (for example: Dots, super-script and sub-script Alefs, Madda, Hamza). These Auxiliary Shapes may appear on the top of the Base Shape, on the bottom, and sometimes on the head or the tail. Also, some other Auxiliary Shapes may accompany each letter as a separate Unicode character (for example: Harakats, Shadda, Sukun), which will also be positioned above or below the Base Shape.

Base Shapes and Auxiliary Shapes work as the building blocks of the Perso-Arabic script. These visual properties are consistent in all of the writing styles of Perso-Arabic script (like Naskh, Thulth, Nasta'liq, and Tahriri). In this Abstract Model, we have encoded these properties such that a series of Shapes can be computed for any Unicode string.

Also, we introduce a metric distance based on our Abstract Model, the Shape Distance, for strings of Unicode characters. This metric makes it possible to compare Perso-Arabic strings based on their actual appearance, regardless of what writing system or font is used. The Shape Distance works such that strings with very similar appearance have a distance close to zero, and a big difference in appearance results in large numbers. For example, two words with similar letters which differ in only one auxiliary part (like a dot) would have a smaller distance than strings with letters that have different Base Shapes.
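To make the idea concrete, here is a toy sketch of a distance of this kind. It is not the authors' definition of the Shape Distance: the shape table covers only eight letters, and the table contents, cost values, and names (`SHAPES`, `AUX_COST`, `shape_distance`) are assumptions chosen for illustration. Each letter is mapped to a (Base Shape, Auxiliary Shape) pair, and a weighted edit distance charges little when only the auxiliary part differs.

```python
# Hypothetical mini-table: letter -> (base shape, auxiliary shape).
SHAPES = {
    "\u0628": ("BEH", "1 dot below"),    # beh
    "\u062a": ("BEH", "2 dots above"),   # teh
    "\u062b": ("BEH", "3 dots above"),   # theh
    "\u062c": ("HAH", "1 dot below"),    # jeem
    "\u062d": ("HAH", ""),               # hah
    "\u062e": ("HAH", "1 dot above"),    # khah
    "\u0631": ("REH", ""),               # reh
    "\u0632": ("REH", "1 dot above"),    # zain
}

AUX_COST = 0.25   # assumed cost: same base shape, different auxiliary shape
BASE_COST = 1.0   # assumed cost: different base shapes
INDEL_COST = 1.0  # assumed cost: inserting or deleting a letter


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing letter a with letter b, judged by appearance."""
    shape_a = SHAPES.get(a, (a, ""))  # unknown letters act as their own base
    shape_b = SHAPES.get(b, (b, ""))
    if shape_a == shape_b:
        return 0.0
    return AUX_COST if shape_a[0] == shape_b[0] else BASE_COST


def shape_distance(s: str, t: str) -> float:
    """Weighted edit distance between two strings of letters."""
    prev = [j * INDEL_COST for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * INDEL_COST]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + INDEL_COST,                    # delete a
                           cur[j - 1] + INDEL_COST,                 # insert b
                           prev[j - 1] + substitution_cost(a, b)))  # replace
        prev = cur
    return prev[-1]
```

With this table, beh-reh and teh-reh differ only in their dots and so end up closer to each other than beh-reh and hah-reh, whose first letters have different Base Shapes.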

And finally, we have studied two Persian text corpora based on the properties of the words, the letters, and the Base Shapes and Auxiliary Shapes. We show that the distribution of Base Shapes and Auxiliary Shapes follows the same pattern as the distribution of the letters.

The first important application for our model is the security of Perso-Arabic domain names. With the introduction of Internationalized Domain Names (IDNs), and looking at the future of the internationalized Internet, security of domain names (at both the TLD level and the registry level) has become a serious concern of ICANN and various ccTLD and gTLD registries. The Abstract Model and the Shape Distance algorithms can be used to calculate the similarity of Perso-Arabic domain names with very high accuracy.

The second application for our model is the font industry: it can be used in font generation, alteration and verification programs. Using this model, some parts of the glyph generation, classification, and table generation in Perso-Arabic fonts can be automated.

Another application of this model is in font rendering engines, which at the moment depend on the ArabicShaping table in the UCD and on the tables provided in fonts. The character data provided in the ArabicShaping table is incomplete for some use cases, and the font tables cannot always be trusted to be complete or accurate. This model can help improve Perso-Arabic text rendering and increase the stability of font rendering engines for this script.

Notes

1. "Typography" here means "the general character or appearance of printed matter."

2. Perso-Arabic script is called "Arabic" in the Unicode Standard. Some other names have also been introduced, like "Arabetics". In this paper we use "Perso-Arabic" as the name of the script to distinguish it from the Arabic language.


Wednesday, October 19, 2011

09:00-09:50  SESSION 7


Andrew Glass
Program Manager, Microsoft Corporation  

Track 1 - International User Experiences in Windows

We have entered an era of explosive growth in software usage among speakers of languages other than English or other major European languages. In addition, a large segment of that growth includes users with dual- or multiple-language needs. In this context, it is increasingly important for an operating system to support great language-related experiences for worldwide and especially multi-lingual users. In this talk, we will outline some challenging areas of language support centering around text input and reading experiences, and introduce some possible solutions in these areas being explored for the Microsoft Windows platform.


Su Liu


Track 2 - Study for Processing Unicode Data with Multiple Versions of Unicode and Non-Unicode Standard

Unicode encoding and code set conversion are key features of storage, information retrieval and data mining systems. Processing and supporting Unicode data across multiple versions of Unicode and non-Unicode standards is a challenging task for storage systems (e.g. digital libraries and data centers) that contain data in the current version and at least one earlier version. Meanwhile, some Unicode side effects, such as variant characters, PUA, and the overhead of data normalization and conversion, add to the complexity of solving the multiple-version issue. This paper discusses the impact of multiple versions and strategies for processing Unicode data at the storage, network and OS levels.

Keywords: Unicode Data, Multiple Versions, Network, Conversion


Pim Rietbroek
John Hudson

Brill / Tiro Typeworks



Track 3 - Fonts for the Ages

A scholarly publisher needs to be able to publish any text in any European language from any period, and to do so within traditional canons of typographic quality and sophistication, as expected by their authors and readers. This presentation describes the demands this makes on the publisher, and on editors and authors, and looks at how these demands are met through careful specification of requirements, standardisation on Unicode text encoding, and development of extensive and typographically sophisticated OpenType fonts.

Pim Rietbroek (Brill) and John Hudson (Tiro Typeworks) present an overview and selection of short case studies from their five-year project, illustrating some of the challenges encountered in texts ranging from transliterated ancient Egyptian and Sumero-Akkadian to linguistic descriptions of present-day endangered languages, with stops among ancient Greek acrophonic numerals and the not-always-helpful Unicode encoding unifications and disunifications.

10:00-10:50  SESSION 8


Michael S. Kaplan
Program Manager, Microsoft 


Track 1 - Locales on Windows - the view from 18 years in

It was in 1993 that the basic model for locales was integrated into Windows in its current form, and that model has been largely unchanged for much of that time. In this unique view of those 18 years, you can find out about the lessons learned, unlearned, relearned, and mis-learned. You'll leave this all-up view feeling both more impressed and more embarrassed to know Microsoft than you ever have before, even if you were there while it was going on!


Leandro Reis
Senior Program Manager, Globalization, Adobe Systems

Mike McKenna
Senior International Engineering Manager, Zynga

Paul-Henri Arnaud
Senior Process Analyst, Autodesk

Track 2 - Internationalization assessments: Merging the best of three approaches

Software globalization seems to many to be either a black art practiced by an esoteric guild of polyglot bit-twiddlers or something you get for free because you happen to use Java and Unicode. The truth is neither, and it is something that can be implemented in a methodical way that can be measured. Three different corporations - Adobe, Autodesk and Zynga - are tackling the problem of how to measure globalization compliance and progress across a wide range of technologies and products. They decided to join together in an open-source fashion to agree on a standardized set of requirements, mappings to specific technology genres and a method of grading for software globalization. This presentation will present their efforts to date, with discussion of similarities and differences among the companies' approaches, as well as issues encountered, solutions implemented and solutions envisioned.



Peter Constable

Senior Program Manager, Microsoft Corp.

Track 3 - Creating World-Ready Apps For Windows 8 

The next version of Microsoft Windows is in development and promises to provide great new opportunities for the Windows developer ecosystem. In addition to introducing new app development paradigms, Windows 8 also adds a lot of new multilingual and globalization functionality. This talk will provide an overview of additional functionality that developers can leverage to develop world-ready desktop or Metro-style apps for Windows.

10:50-11:10 - Morning Refreshments
11:10-12:00  SESSION 9


Peter Constable
Senior Program Manager, Microsoft Corp.

Track 1 - Application Resources and Localization for Metro-Style Apps in Windows 8

Application developers still face big challenges in creating and deploying localized, multilingual apps. These challenges have been one of the focal points as we have worked on the next version of Microsoft Windows, Windows 8. This talk will take a detailed look at the new application resource model in Windows 8 for Metro-style apps created using HTML or XAML. You'll leave with a basic understanding of the new resource infrastructure and how it makes localizing your app a lot easier.


Steven R. Loomis
Software Engineer, IBM

Track 2 - What's new in CLDR 2.0

The Common Locale Data Repository is a project for the exchange of language and locale information used in application development, and to gather, store, and make such data publicly available. By pooling resources, the time and expense of collecting good data is minimized, and language groups have an avenue to get their data into implementations. This session will discuss implementation of CLDR and the latest project status, and how the process is being improved to produce higher-quality data. Panelists will then discuss how they are making use of CLDR data, the latest project status, and issues in the collection and production of data. The panel will consist of persons from multiple vendors involved in deploying CLDR in their own products and projects, as well as those involved in the data gathering and vetting process. Comments and questions will be welcomed from the audience.


Adil Allawi
Technical Director, Diwan Software Limited



Track 3 - Developing a Unicode font for the desktop, the mobile and the web

The only Arabic font I ever designed, Geeza, started life in 1985 as one of the standard fonts on the original Apple Arabic Macintosh and is still the standard Arabic font on Mac OS and iOS. This presentation covers my approach to developing Geeza over the years, automating the font creation with Unicode data. I will also cover my work on publishing Geeza as a multilingual web font. The talk will be aimed at giving information generally relevant to developing multilingual Unicode fonts on different media and platforms.

Over the following 25 years the font has been rebuilt several times over, extended to the full Arabic Unicode range, and ported to every kind of device. The latest incarnation is now a web font for every major browser. The design needed to be relevant for user interfaces, printing and small screens yet still appear compatible with Roman fonts. Support for the full Unicode Arabic range meant adding an extra 1500 glyphs together with the relevant tables for Arabic shaping, ligatures, justification and kerning.

Along the way I will explain the features needed for a modern typeface to be useful in a world where data may be exchanged across different standards, how to approach the development of user interface fonts, and the new standards for web fonts. I will discuss the importance of embedding semantic information into a font to allow unique identification of its glyphs and allow equivalents to be found in other fonts.

The presentation will conclude with a discussion about the future for international typefaces as they make their way into open standards and become part of the content of the worldwide web.

12:00-13:00 - LUNCH


Doug Felt

Steven Loomis
Software Engineer


Track 1 - Best Practices with the Java 7 Locale

In Java, the Locale class is fundamental for developing global software. Java 7 adds several important enhancements to resolve issues that were difficult to handle in previous versions of Java. For example, the new script field allows developers to package Chinese localized resources in a logical manner, and full support for BCP 47 language tag conversion allows software to exchange language and locale information through standard protocols without any data loss. This session provides a brief overview of the enhancements, followed by best practices and programming tips recommended for Java application developers.


Markus Scherer
Unicode Software Engineer, Google Inc.

Mark Davis
Sr. Internationalization Architect,
Google Inc.

Track 2 - Plural & Gender in Translated Messages

"There are 1 file(s)." / "Alice added 1 people to his mailing list." - User-facing messages with placeholders for numbers and strings are common technology. These require the placeholders and text to be reorderable to account for grammar of different languages. However, the common technology does not solve the problem of plural and personal gender in placeholders. That is, depending on the language and the placeholder values, the surrounding text often needs to change, as illustrated by the examples above.

ICU has been improving on the Java formatting framework, adding support for such message variants in both its C++ and Java versions. In addition, other aspects of message formatting have been simplified. This session explains the challenges, approaches, and new functions and capabilities.
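The problem can be illustrated with a small sketch that picks a message variant by plural category and gender before filling in the placeholders. This is not ICU's API; the function names, the message table, and the restriction to two languages are assumptions for illustration. The integer plural rules shown for English and Russian follow the commonly published CLDR rules.

```python
def plural_category(lang: str, n: int) -> str:
    """Return a plural category for a non-negative integer count."""
    if lang == "en":
        return "one" if n == 1 else "other"
    if lang == "ru":
        if n % 10 == 1 and n % 100 != 11:
            return "one"
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return "few"
        return "many"
    return "other"  # fallback for languages this sketch does not cover


# Hypothetical English message variants, keyed by gender, then plural category.
MESSAGES = {
    "female": {"one": "{name} added {count} person to her mailing list.",
               "other": "{name} added {count} people to her mailing list."},
    "male": {"one": "{name} added {count} person to his mailing list.",
             "other": "{name} added {count} people to his mailing list."},
    "other": {"one": "{name} added {count} person to their mailing list.",
              "other": "{name} added {count} people to their mailing list."},
}


def format_added(name: str, gender: str, count: int) -> str:
    """Select the English variant for this gender and count, then fill it in."""
    variants = MESSAGES.get(gender, MESSAGES["other"])
    template = variants[plural_category("en", count)]
    return template.format(name=name, count=count)


print(format_added("Alice", "female", 1))
```

A translator working in a language such as Russian would supply more variants per gender (one, few, many), which is why the selection logic has to live in the message format and not in application code.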

Co-authors: Markus Scherer & Mark Davis


Anshuman Pandey
Ph.D. Candidate,
University of Michigan



Track 3 - A Pre-script-ion for the Future: Unicode and the Development of Minority Languages in South Asia

South Asia is home to tremendous linguistic diversity. 'Ethnologue' records roughly 438 languages spoken in India alone. According to UNESCO's Endangered Languages Programme, 198 of those languages are considered to be endangered and there are an additional 140 languages with similar status in the region stretching from Afghanistan across to Nepal and Bhutan. The conditions of these languages and the ongoing decline of other minority languages, which are not yet endangered, are the partial result of inadequate institutional support and development. The growth of new digital technologies has the potential to positively alter the course of endangered and minority languages by offering speakers innovative ways in which to use and maintain their mother tongues. The basis of these technologies is Unicode. The aim of the Unicode Standard is to encode characters that are needed for representing text in all modern writing systems, as well as most historic scripts (The Unicode Standard, Version 6.0, p.10). Unicode, then, is a prescription for a brighter future for minority languages in South Asia.

This presentation will discuss the role of Unicode in the development of minority and endangered languages in South Asia, with particular focus on India. The talk will begin with an update on the support for South Asian scripts in Unicode 6.0 and an analysis of how well the Standard covers these scripts today. It will then present five current character-encoding projects for writing systems used by minority language communities in India and Nepal and discuss the conditions of these languages, the linguistic requirements of these communities, and the potential that Unicode offers for increasing education and literacy using the language. The talk will then describe the role of Unicode in the Government of India's National e-Governance Plan (NeGP) and how the plan has the potential to facilitate institutional and governmental support for endangered and minority languages. The talk will close with a presentation of several case studies of writing systems being created today in India by minority language communities and what such ongoing activity will mean for Unicode. The presentation will conclude by discussing the current projects of the University of California - Berkeley's Script Encoding Initiative, which is working with the user communities to propose these scripts for encoding into Unicode.

14:00-14:50  SESSION 11


Michael Kuperstein
Localization Engineer, Intel

Track 1 - Agile Internationalization and Localization

Agile development methodologies are swiftly being adopted throughout the software development industry. This presentation will illustrate key concepts, challenges, and solutions for performing Agile internationalization and localization. One important goal of Agile development is to have release-ready software at the end of each 'sprint', which typically has a duration of between one and four weeks. For example, if a feature involves building a contacts list, then entering, storing and sorting of contact names should work for all languages, even for a single-language product. If the product is also being localized, then we would expect each particular feature to be fully localized into all the target market languages at the end of the sprint. In other words, "Done" truly means "No work left to be done." Unfortunately, this puts even more pressure on internationalization and localization teams, since the scheduling and quality challenges escalate rapidly as sprints become shorter and language counts climb. We'll cut through the confusion to focus on a handful of proven internationalization and localization strategies that can ensure a great user experience for customers of every culture.


Martin Jansche
Staff Software Engineer, Google Inc.


Track 2 - Speech Internationalization at Google

Pedro J. Moreno, Linne Ha, and Martin Jansche*
Google, Inc.  (*corresponding author)

Internationalization (i18n) and localization of software that processes spoken input or output is faced with challenges that differ from those found in many other software i18n projects. In this talk we describe our experience with internationalizing Google Voice Search, a speech-to-search service available on many popular mobile devices. First launched for US English in November 2008, Google Voice Search is currently available in more than 20 languages.

Our lack of precision in the number of available languages illustrates one of the main challenges of this project: working definitions of "language" and "dialect" vary depending on context. Focusing purely on technical aspects, our notion of language is driven by the current limitations of speech recognition technology. For example, Google has developed speech recognition models for several variants of spoken English, including American, British, Indian, Australian, and South African English. But our support for English does not end here. Many other languages have been influenced by English: Our Cantonese recognizer was explicitly designed to deal with the many English loanwords that are in daily use in Hong Kong. The languages of Europe all borrow English words liberally, despite the efforts of local language academies.

This kind of linguistic diversity makes the challenge we face even harder, since it increases the number of language projects we have to deal with. This is compounded by the fact that internationalizing speech technology is hard because bringing up a recognizer in each new language is a separate development effort, requiring significant amounts of data, compute power, and engineering time. We are fortunate to work in an environment where enormous compute power is available and can be easily harnessed. Of the other two factors, we will assume that engineering time is always limited and at best, we can reduce the accidental complexity of the very complex task of building recognition models. Since our development approach is heavily data-driven, most of our recent progress has been in the area of data acquisition, where significant changes have been made to the tools and processes which we use to acquire linguistic data as needed for building recognizers.

We generally need three kinds of data: spoken utterances plus textual transcriptions; pronunciation dictionaries; and large amounts of text. We have crowd-sourced several aspects of our data collection efforts. For example, we generally need spoken examples for a given target language, from a variety of speakers, and under varying environmental conditions. We have built tools that make it easy to collect such data from volunteers. We have also made process changes to allow us to manage the data collections remotely while monitoring progress and assessing data quality. This has allowed us to collect acoustic data many times faster than before.

The collection of pronunciation information cannot be crowd-sourced as easily, since the current process requires a certain amount of linguistic expertise. Volunteers are asked to transcribe words into crude phonetic representations. For some languages, e.g. Spanish, the pronunciation of a word is readily apparent from its orthography and can be expressed algorithmically. In those cases we use ICU transforms to give us word pronunciations, perhaps combined with a brief pronunciation dictionary of exceptional or foreign words.

While text resources are often easily available, this does not hold for all languages. For example, Voice Search is currently available in South Africa in English, Afrikaans, and Zulu. For Zulu, the amount of text available in electronic form is considerably more limited than for the other languages. We'll share our thoughts on what could be done to encourage the creation and dissemination of data for under-resourced languages.


Martin Raymond
Script Information Engineer/Editor, SIL International



Track 3 - ScriptSource: Making information on the world's scripts accessible

Although there is plenty of script information on the web, there has been a need for a web site to present the information authoritatively and clearly, making it easier to understand the often complex relationships between scripts, characters and languages. ScriptSource has been designed to meet that need and to answer questions such as: 'Which scripts can be used to write that language?', or, 'Which writing systems use this Unicode character?'. The site allows registered users to add information to the site in the form of entries, which may include links to other sites, all entries being moderated. Users can also post 'needs' to enlist help in solving script-related problems. ScriptSource imports language data from the Ethnologue, character data from Unicode and locale data from the CLDR (Common Locale Data Repository). CLDR's locale data, such as exemplar sets, is linked to the scripts, characters and languages it relates to.

This session will cover some of the needs ScriptSource has been designed to meet, as well as the challenges encountered in bringing together information from different sources and creating the data associations to make it as meaningful as possible. There will be a demonstration to show how easy the ScriptSource User Interface is to navigate and to illustrate the main functions, including adding information about a script. The use ScriptSource makes of CLDR data will also be demonstrated, and the plans for a more extensive interface with the CLDR will be discussed.

14:50 – 15:10 - Afternoon Refreshments
15:10 - 16:00  SESSION 12


Mark Davis
Markus Scherer 




Track 1 - Bits of Unicode: ICU Data, Algorithms, and Performance

Supporting Unicode with good performance and with a reasonable memory footprint presents a challenge. No matter which encoding form is used to represent Unicode, 1,114,112 different code points and associated data are a lot to handle. Most classic data structures are byte oriented, which is often not optimal for dealing with Unicode, even with UTF-8.

ICU uses a number of innovative algorithms and data structures to handle internationalization, balancing tradeoffs between performance and data footprint. This presentation covers some of the more interesting of these structures, and their applicability beyond internationalization. The discussion includes: the new trie structure for string lookup in ICU, effective use of inversion lists and inversion maps, compact character mapping tables, transliteration mappings, and others. Code that supplies and uses these structures is part of ICU, the Java/C/C++ open-source Unicode enablement library.
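As a flavor of one of these structures, an inversion list stores a set of code points as a sorted list of the boundaries at which membership flips, so a set made of a few long ranges needs only a handful of integers. The sketch below is a minimal illustration and not ICU's implementation; the class name and constructor signature are assumptions.

```python
from bisect import bisect_right


class InversionList:
    """A set of code points stored as sorted range boundaries."""

    def __init__(self, ranges):
        # ranges: inclusive (start, end) pairs, assumed to be sorted,
        # non-overlapping and non-adjacent.
        self.boundaries = []
        for start, end in ranges:
            # Membership switches on at start and off again at end + 1.
            self.boundaries.extend((start, end + 1))

    def __contains__(self, code_point: int) -> bool:
        # An odd number of boundaries at or below the code point means
        # we are inside a range.
        return bisect_right(self.boundaries, code_point) % 2 == 1


ascii_letters = InversionList([(0x41, 0x5A), (0x61, 0x7A)])
print(ord("q") in ascii_letters)   # True
print(ord("[") in ascii_letters)   # False
```

Lookup is a binary search over the boundaries, so the cost depends on the number of ranges and not on the number of code points in the set.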

Co-authors: Markus Scherer & Mark Davis


Ken Lunde
Senior Computer Scientist, Adobe Systems Incorporated


Track 2 - Genuine Han Unification

There have been major shifts and reforms in East Asian writing systems in the past that seemed revolutionary at the time, but that are now considered to be standard and thus completely acceptable. One such reform was the hanzi simplification in China that took place during the early 1950s. Thanks to the Web and other advances in communication technology, the world has become a smaller place. Thus, more cross-cultural interaction is taking place than ever before. Perhaps serving somewhat as a catalyst, Unicode, with its tens of thousands of CJK Unified Ideographs that cover the needs of virtually all customers of the locales that use them, provides the foundation for another shift or reform, though it is not likely to take place for another decade or two.

Today, one can easily argue that for a single font to adequately serve multiple CJK locales, it must include more than one glyph per CJK Unified Ideograph code point. Such fonts are referred to as Pan-CJK fonts, because they serve the needs of more than one CJK locale. I have predicted that at some point, years or decades into the future, cross-cultural interaction will evolve into initiatives whose aim is to genuinely unify CJK Unified Ideographs across all CJK locales. This is likely to have the effect of making a single glyph per CJK Unified Ideograph acceptable in every CJK locale. The Chinese standard designated GB 18030 is actually a step in this direction, mainly because it specifies a single glyph for each CJK Unified Ideograph code point.

This presentation will explore the history and development of ideographs and Han Unification, and draw conclusions based on the presenter's own experience developing CJK fonts and working with CJK character set standards.


Roy Boney, Jr.
Language Technologist, Cherokee Nation

Joseph Erb
Educational digital media specialist, Language Technology Program at Cherokee Nation Education Services Group

Jeff Edwards
Language Technologist, Cherokee Nation


Track 3 - The Cherokee Syllabary in Digital Applications

With the inclusion of the Cherokee syllabary in Unicode, it is being used in some of the most popular and advanced devices in the world, such as the iPhone and iPad. The Cherokee syllabary was included as part of the Unicode Common Locale Data Repository version 1.8. This has helped in the perpetuation of an endangered language and is paving the way for a renaissance of the Cherokee language in modern digital media culture.

One problem of the digital globalization of communication is that it has the potential to erode the already endangered cultures and languages of indigenous peoples. These communications rarely occur in the indigenous languages of minority cultures, if at all. It does not need to be this way, and proper adoption of the technology by the community is paramount for success.

This presentation will discuss the efforts undertaken by the Cherokee community to address these problems. It will discuss the adoption of the Cherokee syllabary into Unicode, the usage of the Cherokee syllabary in modern computing systems, the mobility of the Cherokee language in various hardware platforms and social web presences, and the adoption of Cherokee language technology by the Cherokee community.

The presentation will be by Roy Boney, Jr. and Joseph Erb, Language Technologists of Cherokee Nation Education Services Group.

16:10 - 17:00  SESSION 13


Stuart Gill
Member of Technical Staff, Google Inc.

Peter Edberg
Senior Software Engineer, Apple Inc.

Markus Scherer
Unicode Software Engineer, Google Inc.


Track 1 - New in ICU

The International Components for Unicode library, or ICU, provides a full range of services for Unicode enablement, and is the globalization foundation used by many software packages and operating systems, from mobile phones like Android or iPhone all the way up to mainframes and cloud server farms. Freely available as open-source, it provides cross-platform C, C++, and Java APIs, with a thread-safe programming model.

This presentation will provide a brief overview of ICU, with emphasis on the recent updates in ICU 4.8, including the latest support for Unicode 6.0 and CLDR 2.0, collation reordering for better customization and reduced collation data, plural and gender in messages, and other changes. The presentation will also touch on ICU's planned direction for 5.0 and future releases.


Zhang Zhoucai
CEO, Beijing UniHan Digital Technology Co., Ltd.

Zhang Chiyi
COO, Beijing UniHan Digital Technology Co., Ltd.

Track 2 - Some Special Requirements for Cloud-Publishing of Chinese Ancient Classics

Cloud computing initially focused on software as a service (SaaS) applications, but Amazon, Netflix, Google, Apple, Microsoft and others are now tapping the cloud for content delivery (some of these companies focus on streaming entertainment, while others focus on content creation and management). Both e-publishers and their readers increasingly rely on the web and want to benefit more and more from so-called cloud publishing. As a typical form of cultural heritage, a variety of huge databases of Chinese ancient classics, such as ?????? and ???????, have been built and now provide web services worldwide. They are regarded as the earliest breakthroughs in this field, having been created almost immediately after the Unicode/CJK unification standard was released. Now, databases of this kind face the challenge and mission of migrating to the cloud for digital transition and broader consumption.

As a developer of the Unicode/CJK standard and an implementer of Unicode in the Chinese e-publishing industry, the author summarizes their experiences and lessons in Unicode-based e-publishing and points out that, besides the ordinary requirements for general content providers on the web cloud, Chinese ancient classics have some special requirements for cloud publishing that would improve the reading and usage experience. These include, but are not limited to:

  • a super-CJK Font on the Web-Cloud and its alternative solution,
  • a super-CJK Hanzi Handwriting Recognizer on the Web-Cloud,
  • a Dictionary of Simplified-Traditional-Variant Hanzi on the web, and especially,
  • an on-the-web OLD TERM CHECKER, which is very much desired to help readers judge and understand the attributes of an unfamiliar Chinese term: is it a dynasty year, the name of an official, the name of a place, or the name of a person?

In addition, the author will introduce their implementation of a web-cloud-based handwriting recognizer (UniHan Q-Pen) and the above-mentioned TERM CHECKER built into their UniHan Classics database on the web.

Zhang, Chiyi, COO,
Beijing UniHan Digital Technology Co., Ltd


Tex Texin
Chief Globalization Architect, Rearden Commerce, Inc.


Track 3 - Does it hurt when I do this? Data for I18n Testing

This presentation recommends specific data values that are likely to identify internationalization problems in software intended for global markets.

Based on years of global software experience, these data values are useful in functional or linguistic QA tests of internationalized software. The data value recommendations include character encoding, postal address, locale and other data types typically used in software and will assist in finding common internationalization problems. This presentation will offer specific test suggestions.

Program is subject to change.