An introduction to Internationalised Domain Names (IDNs): a technical overview of IDN structures (Punycode), how IDNs contribute to linguistic diversity online, and a history of IDN deployment.
What are Internationalised Domain Names?
Domain names, which are a core part of the Internet’s addressing system, work because they are interoperable and resolve uniquely. This means that any user connected to the Internet, anywhere in the world, can get to the same destination by typing in a domain name (as part of a web- or email address). The plan to internationalise the character sets supported within the Domain Name System is almost as old as the Internet itself. However, technical constraints and the overriding priority of interoperability resulted in a restricted character set within the Domain Name System: ASCII a to z, 0 to 9 and the hyphen. This restricted character set is known as LDH (Letters, Digits and Hyphen) within the technical community.
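As an illustration of the LDH restriction, the following is a minimal Python sketch that checks whether a name consists only of LDH labels. The function names are hypothetical. The 63-character label limit and the rule that a label may not begin or end with a hyphen are assumptions drawn from common hostname conventions rather than from the text above.

```python
import re

# One LDH label: ASCII letters, digits and hyphen only.
# Assumed extra rules (not stated in the text above): 1 to 63 characters,
# and no hyphen at the start or end of the label.
_LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_label(label: str) -> bool:
    """Return True if the label uses only the LDH character set."""
    return _LDH_LABEL.fullmatch(label) is not None


def is_ldh_name(name: str) -> bool:
    """Return True if every dot-separated label in the name is an LDH label."""
    labels = name.rstrip(".").split(".")
    return all(is_ldh_label(label) for label in labels)


if __name__ == "__main__":
    print(is_ldh_name("example.com"))        # True
    print(is_ldh_name("b\u00fccher.example"))  # False: contains a non-ASCII letter
```

A name that fails this check is not necessarily invalid. It may simply need converting to its ASCII-compatible form before the Domain Name System can resolve it.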
Technical standards to internationalise domain names were developed from the mid-1990s. The solution retains the Domain Name System’s restricted character set and encodes every other character into it. Each label containing non-ASCII characters is converted into a string of ASCII characters prefixed with xn--. The xn-- ASCII forms of the domain names are meaningless to humans, but meaningful to the machines (name servers) that resolve domain names. Thus, humans see the meaningful native-script characters when they navigate the Internet, whilst the underlying technical resolution of domain names remains unchanged.
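The two views of a domain name (the form a person reads and the form a name server resolves) can be sketched with Python's built-in "idna" codec. This is an illustrative sketch only. The helper names are hypothetical, and the built-in codec follows the older IDNA 2003 rules, so it may treat some characters differently from current registry practice.

```python
def to_ascii(domain: str) -> str:
    """Convert a domain name to the ASCII form used for DNS resolution.

    Uses Python's built-in "idna" codec (IDNA 2003 rules; an assumption
    made for this sketch). Labels that are already ASCII pass through as-is.
    """
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain name back to the form a person reads."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    ascii_form = to_ascii("b\u00fccher.example")
    print(ascii_form)              # xn--bcher-kva.example
    print(to_unicode(ascii_form))  # bücher.example
```

Only the label containing non-ASCII characters changes. The ASCII label "example" is left untouched, which is why existing name servers need no modification.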
A technical overview
Punycode is the algorithm used to transform a Unicode label (U-label) into an ASCII string. This ASCII string is prefixed with “xn--” (the ACE prefix) to create an “A-label”, or ACE (ASCII Compatible Encoding) label, that the Domain Name System understands. For more details, see section 2.3 of RFC 5890.
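The relationship between Punycode and the ACE prefix can be made concrete with Python's built-in "punycode" codec. The sketch below is a simplification with hypothetical function names. It applies the raw Punycode transformation to a single label and adds the prefix by hand, and it deliberately omits the normalisation and validity checks that the IDNA standards require before a label is registered or looked up.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Encode one Unicode label as an A-label (ACE prefix + Punycode).

    Simplified sketch: no normalisation, case folding or validity
    checks are performed here.
    """
    if u_label.isascii():
        # Pure ASCII labels need no encoding.
        return u_label
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an A-label back to its Unicode label."""
    if not a_label.lower().startswith(ACE_PREFIX):
        # No ACE prefix, so this is an ordinary ASCII label.
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    print("b\u00fccher".encode("punycode"))  # b'bcher-kva'
    print(to_a_label("b\u00fccher"))         # xn--bcher-kva
    print(to_u_label("xn--bcher-kva"))       # bücher
```

In the encoded form, the ASCII characters of the label ("bcher") appear first, followed by a hyphen and a suffix ("kva") that records which non-ASCII character was removed and where it belongs.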
Implementation of IDNs began at the second level in 2000 (under .com and .net) and in 2001 (under .jp). In the ten years that followed, several ccTLDs deployed IDNs, primarily supporting local language character sets. Some experimented with other strategies for internationalising domain names, but the IDN technology proved the most successful.
IDNs are technically complex to implement. Many challenges remain, including (at a technical level) how to handle variant characters, which are prevalent in Arabic and Chinese scripts. Another challenge is the user experience, e.g. consistent representation in browsers and full functionality in email; addressing this is called ‘universal acceptance’.
How IDNs contribute to linguistic diversity online
Despite the technical challenges, IDNs are viewed by many as a catalyst and a necessary first step to achieving a multilingual Internet. According to UNESCO, in 2008 only 12 languages accounted for 98% of Internet web pages; English, with 72% of web pages, was the dominant language online. Recent reports indicate that other languages are growing rapidly online. For example, by 2010, only 20% of Wikipedia articles were in English, and by December 2018 this had fallen to less than 12%. Supporters of IDNs believe that enabling users to navigate the Internet in their native language is bound to enhance the linguistic diversity of the online population, and the World Report has demonstrated that IDNs are strongly linked to local content.
While this study focuses on the web, it should be noted that other applications, e.g. email and file transfer protocol, also require internationalisation.
A short history of IDN deployment
For nearly two decades, hybrid Internationalised Domain Names have been available at the second level with ASCII Top Level Domains (for example, παράδειγμα.eu in the figure above). This situation was only satisfactory for Latin-based scripts used by most European languages, where the IDN element would commonly reflect accents or other diacritical marks on Latin characters. For speakers of languages not based on Latin scripts (for example, Chinese and Arabic), the hybrid IDN/ASCII domains were unsatisfactory. Right-to-left scripts, such as Arabic and Hebrew, created bi-directional domain names when combined with left-to-right TLD extensions, requiring users to be familiar with both their own script and Latin script in order to navigate the Internet. As explained in the report IDNs State of Play 2011, bi-directional domain names not only require Internet users to change script when typing in a single web address, but also potentially confuse the strict hierarchy of the Domain Name System.