One Alphabet, Many Languages: Latin Script IDNs and the Road to Better Policies
- Mark W. Datysgeld

08 December 2025 | By Mark W. Datysgeld
This article is dedicated to the memory of Rubens Kühl, legendary .br representative and ICANN contributor, who unfortunately passed away on 3 November 2025, at age 55. Rubens was a champion of good practices on the Internet and a mentor to many, including this article’s author.
Introduction
Latin sits in a paradoxical position. On the one hand, a stripped-down Latin subset defines the model that underpins the DNS. On the other, “real” Latin usage spans hundreds of languages, with large inventories of diacritics, extended letters, and orthographic conventions that ASCII cannot express. This article follows how that tension has been handled: from historical path-dependency, through the creation of Root Zone Label Generation Rules (RZ-LGR) for Latin, and toward the Latin Script Diacritics PDP.
From uncontested default to policy issue
In the previous article in this series we studied how movable type, telegraph codes, and ASCII all converged around a compact Latin repertoire because it was economically and technically efficient. By the time hostname rules were codified, it didn’t feel like a political choice to define labels in terms of ASCII “letters, digits and hyphen”. The Latin script became invisible at the standards level because its most reduced form was already embedded there.
Internationalization then arrived as a late correction. IDNA preserved the ASCII-only DNS, mapping Unicode “U-labels” into Punycode “A-labels” behind the scenes. That allowed non-Latin scripts into the system but left Latin in an odd position: inside the DNS core as ASCII, yet largely outside it in the form of the many diacritics and extensions that make Latin actually usable for most of the languages that rely on it.
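To make the U-label to A-label mapping concrete, here is a minimal Python sketch. It uses only the standard library's Punycode codec (RFC 3492) and adds the "xn--" prefix by hand. The function names are illustrative assumptions, and the sketch deliberately skips the validation rules that a real IDNA 2008 implementation applies, so it should be read as an illustration of the encoding step rather than as a registry-grade converter.

```python
import unicodedata

# Sketch only: raw RFC 3492 Punycode plus the ACE prefix.
# It does NOT perform the code point validation that IDNA 2008 requires.
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Convert a Unicode label to its ASCII-compatible (A-label) form."""
    # Normalize to NFC so precomposed and decomposed input encode identically.
    label = unicodedata.normalize("NFC", u_label).lower()
    if label.isascii():
        return label  # plain letter-digit-hyphen labels are stored as-is
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Convert an A-label back to Unicode; other labels pass through."""
    label = a_label.lower()
    if not label.startswith(ACE_PREFIX):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    for name in ("quebec", "québec"):
        encoded = to_a_label(name)
        print(name, "->", encoded, "->", to_u_label(encoded))
```

The ASCII label passes through untouched, while the accented one is rewritten into a different ASCII string. From the DNS's point of view the two are simply unrelated labels, which is the root of the policy questions discussed later in this article.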
A polyglot alphabet
The standard 26-letter Latin alphabet is only a narrow slice of the script. Current orthographies make use of diacritics such as the acute ◌́, grave ◌̀, tilde ◌̃, caron ◌̌, and ogonek ◌̨, as well as letters like ß, ð, þ, among many others. Quite often these marks are not stylistic; they are a mechanism to encode phonemic contrasts (minimal sound differences that alter word meaning).
At the same time, Latin stands next to other related or neighbouring scripts, especially Cyrillic and Greek, where letterforms are visually close even when code points differ. That makes Latin the centre of a confusability problem that is both intra-script (between accented and unaccented forms) and cross-script (between Latin and other alphabets derived from similar shapes).
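The cross-script side of the problem can be illustrated with a few lines of Python. This is a rough sketch under a stated assumption: the standard library does not expose the Unicode Script property, so the hypothetical helper below guesses the script from the first word of each character's Unicode name. That heuristic is good enough to show the issue but is not how a production confusability check would be built.

```python
import unicodedata


def script_hint(ch: str) -> str:
    """Rough script guess from the first word of the Unicode character name.

    Heuristic only: it relies on names such as 'LATIN SMALL LETTER A'
    or 'CYRILLIC SMALL LETTER A', not on the real Script property.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in(label: str) -> set:
    """Return the set of script hints for the letters in a label."""
    return {script_hint(ch) for ch in label if ch.isalpha()}


latin = "pay"        # three Latin letters
mixed = "p\u0430y"   # U+0430 CYRILLIC SMALL LETTER A in the middle

print(latin == mixed)              # False
print(sorted(scripts_in(latin)))   # ['LATIN']
print(sorted(scripts_in(mixed)))   # ['CYRILLIC', 'LATIN']
```

The two strings render almost identically in most fonts, yet they differ at the code point level and one of them mixes scripts.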
We already discussed the impacts of this dilemma on the Guaraní language and the Greek script, but in practice, this affects even languages that are not Latin-adjacent, seeing as a significant number of languages currently in use have some form of Latin-based fallback. The Japanese writing system, for example, makes use of rōmaji, a transcription system that allows for はい to be written as “hai” (“yes”).
Latin in the Root Zone LGR
The Root Zone Label Generation Rules (RZ-LGR) framework is ICANN’s main technical response to script-specific risks at the top level. Community-based Generation Panels for each script define conservative rules for which Unicode code points may appear in root-zone IDN labels and how variants should be handled; these are then integrated into a unified RZ-LGR (ICANN, 2015).
The Latin Script Generation Panel was announced in 2017, with a mandate to develop such rules for Latin. After years of work by the group and a Public Comment period, its proposals were integrated into RZ-LGR-5. Several design choices with policy implications were made:
The repertoire is a restricted subset of Latin characters. It is not meant to encode full orthographies for every Latin-using language.
Cross-script interaction is treated cautiously, with extensive accounting for the visual overlaps with Cyrillic and Greek.
Variants are narrowly defined. Most ASCII labels and their diacritic counterparts are not automatically tied together as variants at the root level, even if speakers perceive them as “the same”.
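The following Python sketch shows, in toy form, what a label generation ruleset does: it fixes a repertoire of permitted code points and a set of declared variant relationships, then evaluates labels against both. Everything in it is an assumption made for illustration. The repertoire and the single variant pair are invented sample data and do not reproduce the contents of the actual Latin RZ-LGR.

```python
from itertools import product

# Toy data for illustration only; NOT the contents of the actual Latin RZ-LGR.
REPERTOIRE = set("abcdefghijklmnopqrstuvwxyz") | {"\u00e9", "\u0259", "\u01dd"}

# Hypothetical variant relation, assumed here just to exercise the code.
VARIANTS = {
    "\u0259": {"\u01dd"},  # LATIN SMALL LETTER SCHWA <-> TURNED E
    "\u01dd": {"\u0259"},
}


def is_valid(label: str) -> bool:
    """A label is valid only if every code point is in the repertoire."""
    return bool(label) and all(ch in REPERTOIRE for ch in label)


def variant_labels(label: str) -> set:
    """All labels reachable by swapping code points for declared variants."""
    choices = [{ch} | VARIANTS.get(ch, set()) for ch in label]
    return {"".join(combo) for combo in product(*choices)} - {label}


def are_variants(a: str, b: str) -> bool:
    """True only if both labels are valid and linked by declared variants."""
    return is_valid(a) and is_valid(b) and b in variant_labels(a)


print(is_valid("quebec"), is_valid("québec"))  # True True
print(are_variants("quebec", "québec"))        # False
```

In this toy model both labels are individually valid, but because no variant relationship is declared between "e" and "é", the ruleset treats them as two unrelated labels.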
In 2024, ICANN published richer reference Label Generation Rules for the second level, including a Latin “full variant” LGR intended to guide registries in managing second-level IDNs. This table enumerates a far larger Latin repertoire and many more variant relationships than the root LGR (ICANN, 2024). Taken together, the second-level tables show how quickly the number of characters and confusable labels grows when Latin is modelled across multiple languages, which in turn illustrates why the root zone keeps the Latin repertoire relatively tight.
The Latin Script Diacritics PDP
The gap between what the root zone can technically support and what users expect from Latin-script gTLDs is the starting point for the Latin Script Diacritics PDP. Cases such as the existing “.quebec” versus a potential IDN “.québec” are emblematic of the problem: pairs that users perceive as closely related or interchangeable for identity and localization purposes, yet which are not variants under the Latin RZ-LGR and, under current rules, cannot simply be run in parallel by the same registry operator (ICANN, 2024).
Following community discussions, the GNSO Council requested an Issue Report on Latin Script Diacritics in May 2024, leading ICANN org to produce a Preliminary Issue Report, which was then published for Public Comment in July 2024. The summary of that proceeding recorded 90% explicit support from the community for initiating a PDP on Latin Script Diacritics. In November 2024, the GNSO Council adopted a resolution initiating the PDP. The Working Group held its first meetings in March 2025 and has been meeting weekly since.
According to its charter, the PDP is confined to one top-level issue: identifying the limited circumstances, if any, in which a base ASCII gTLD and the Latin-script diacritic version of that gTLD (which are not variants of each other under the Latin RZ-LGR but may be visually similar) should be allowed to be simultaneously delegated and operated by the same registry operator, and what mechanism would enable this while preserving security, stability and user trust (ICANN, 2025).
The group was not tasked with redefining the Latin RZ-LGR or changing its variant sets, meaning that it was required to treat the existing Latin RZ-LGR as a given, including its exclusions and the fact that most ASCII and diacritic pairs are not variants. Key questions include whether elements of the IDN ccTLD Fast Track exception procedure, the ccPDP4 recommendations on IDN ccTLDs, the EPDP on IDNs (Phases 1 and 2), and the SubPro recommendations on IDNs and string similarity could be adapted to Latin-diacritic gTLD pairs, and whether any eventual solution would require changes to existing gTLD consensus policies.
By late 2025, the Working Group had held more than two dozen meetings, combining regular weekly calls with working sessions at ICANN 82 (kick-off and project plan), ICANN 83 (EPDP-IDNs dependencies) and ICANN 84 (stress-testing of draft preliminary recommendations). In parallel, the WG anchored its scope in a Unicode-based definition of “Latin script diacritic” (a base ASCII code point plus one or more combining marks in the Combining Diacritical Marks block) and made available an internally generated ASCII/Unicode analysis tool to list all characters that meet this test.
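The Unicode-based definition above lends itself to a short test in code. The sketch below is not the Working Group's tool; it is one possible reading of the definition, with two assumptions of its own: it uses canonical decomposition (NFD) to split a character into base and marks, and it narrows "base ASCII code point" to ASCII letters so that symbols such as "≠" do not qualify. The function names are hypothetical.

```python
import unicodedata

# Combining Diacritical Marks block: U+0300..U+036F
COMBINING_DIACRITICAL_MARKS = range(0x0300, 0x0370)


def is_latin_diacritic(ch: str) -> bool:
    """True if ch decomposes (NFD) into an ASCII letter followed by one or
    more marks from the Combining Diacritical Marks block."""
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    return (
        base.isascii()
        and base.isalpha()  # assumption: letters only, see lead-in
        and len(marks) >= 1
        and all(ord(m) in COMBINING_DIACRITICAL_MARKS for m in marks)
    )


def ascii_base(label: str) -> str:
    """Replace each qualifying character with its ASCII base letter."""
    label = unicodedata.normalize("NFC", label)
    return "".join(
        unicodedata.normalize("NFD", ch)[0] if is_latin_diacritic(ch) else ch
        for ch in label
    )


# List qualifying characters from Latin-1 Supplement through Latin Extended-B.
matches = [chr(cp) for cp in range(0x0080, 0x0250) if is_latin_diacritic(chr(cp))]
print(len(matches), "".join(matches[:20]))
print(ascii_base("québec"))  # quebec
```

Under this reading, "é" qualifies because it decomposes into "e" plus a combining acute accent, whereas letters such as "ß" or "ø" do not, since they have no decomposition into an ASCII base plus marks.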
On TLD equivalence, the WG converged on a deliberately narrow definition. An ASCII gTLD and its Latin-script diacritic counterparts can form an “ASCII Latin-diacritic gTLD set” provided they are operated by the same registry operator under a single contractual framework (same-entity principle), with fee treatment and most operational obligations aligned as far as feasible with the EPDP on IDNs model for variant sets, while each label still remains a distinct TLD in the root.
After stress-testing case studies, the WG rejected more expansive models in which Latin-diacritic sets and IDN variant sets could overlap. It endorsed instead the rule that a TLD may belong either to a Latin-diacritic (LD) set or to an IDN variant set, but not both: no Latin-diacritic label can be added to an existing variant set, and no variant label can be activated for a TLD that is part of an LD set. In practice, equivalence is therefore implemented as tightly controlled same-entity bundling for well-defined ASCII/diacritic pairs, leaving the Latin RZ-LGR and its variant definitions unchanged and keeping the root zone behaviour predictable. Work is expected to wrap up in 2026.
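The constraints described in the last two paragraphs can be summarised as a small set of checks. The Python sketch below is only a toy reading of those draft positions, not charter or recommendation text. The `Tld` record, its field names, and the helper functions are all hypothetical, and the operator names in the example are placeholders.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tld:
    label: str
    operator: str
    variant_set: Optional[str] = None  # id of an IDN variant set, if any
    ld_set: Optional[str] = None       # id of a Latin-diacritic set, if any


def strip_marks(label: str) -> str:
    """Drop marks from the Combining Diacritical Marks block (U+0300..U+036F)."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)


def ld_set_problems(ascii_tld: Tld, diacritic_tld: Tld) -> list:
    """Return the reasons, if any, why the pair could not form an LD set."""
    problems = []
    if not ascii_tld.label.isascii():
        problems.append("base label is not ASCII")
    if (diacritic_tld.label.isascii()
            or strip_marks(diacritic_tld.label) != ascii_tld.label):
        problems.append("labels are not an ASCII/diacritic pair")
    if ascii_tld.operator != diacritic_tld.operator:
        problems.append("different registry operators")  # same-entity principle
    if ascii_tld.variant_set or diacritic_tld.variant_set:
        problems.append("a label already belongs to an IDN variant set")
    return problems


def can_activate_variant(tld: Tld) -> bool:
    """A TLD that is part of an LD set cannot also activate a variant label."""
    return tld.ld_set is None


base = Tld("quebec", operator="Registry A")
accented = Tld("québec", operator="Registry A")
print(ld_set_problems(base, accented))  # []
print(ld_set_problems(base, Tld("québec", operator="Registry B")))
```

Returning a list of reasons rather than a single boolean mirrors how the stress-testing is described: each case study either passes cleanly or fails on an identifiable rule.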
Conclusion
Latin’s trajectory in the DNS mirrors a broader shift in how the Internet treats language. For decades, a minimal Latin subset was baked into infrastructure as if it were neutral, and “Latin without diacritics” was the default. Once other scripts began to be systematically accommodated through IDNA and the RZ-LGR programme, that default position stopped being invisible. Latin re-emerged not as the neutral background of the DNS, but as one script among many; one that happens to be shared by hundreds of languages, layered with diacritics, confusables and strong expectations about “proper” spelling.
The Latin RZ-LGR and the Latin Script Diacritics PDP are two complementary responses to this reality. The RZ-LGR keeps the root zone conservative, limiting the Latin repertoire and declining to encode ASCII/diacritic relationships as technical variants, in order to contain combinatorial explosion. The PDP then took up the residual problem in a narrowly scoped way: defining when an ASCII label and its Latin-diacritic counterpart can be bundled as a same-entity “set” at the gTLD level, and under what constraints. The result is not a grand redesign of Latin in the DNS, but a pragmatic adjustment: recognising that the alphabet which once structured the system now also needs its own governance, so that Latin-script IDNs can better reflect how people actually write and recognise names while preserving a stable, predictable root.