Challenges of Right‑to‑Left (RTL) Script Internationalized Domain Names
- Mark W. Datysgeld
- May 17
19 May 2024 | By Mark W. Datysgeld
Right‑to‑left (RTL) writing systems account for hundreds of millions of Internet users, spanning the Arabic and Hebrew scripts and the languages written in them, such as Persian, Urdu, Pashto, and Sindhi. Enabling these communities to write in their own language on the Internet is critical for linguistic diversity. Yet despite substantial advances in backend and frontend support, the DNS’s design around a left‑to‑right (LTR) ASCII core complicates matters significantly. This article reviews the linguistic background, the technical challenges, and the policy and standards efforts that are shaping the adoption of RTL IDNs.
Historical context
Early scripts underwent a complex evolution in writing direction before settling into the patterns we observe today. An early Greek and Semitic practice known as “boustrophedon” (Greek for “to turn like an ox”) used alternating line directions, so that at the end of any given line the reader’s eyes would descend directly towards the character below the current one. However, by the 10th century BCE, the Phoenician script – which we touched upon in our article about the Greek script – became the first to adopt consistent unidirectional RTL orientation (Waal, 2018).
There are different hypotheses as to why RTL became the convention rather than LTR, seeing as boustrophedon made use of both. A plausible interpretation is that this is a consequence of Egyptian scribes having mostly chosen RTL for their writing system. Seeing as Egyptian hieroglyphs had momentous influence on most other writing systems, this feature propagated across the Levant region, where Semitic languages such as Hebrew and Arabic developed (Dobbs-Allsopp, 2023).
Additionally, both the writing medium and the fact that most people in the world are right-handed are seen as having influence over writing orientation (Coulmas, 2003). Semitic languages were initially inscribed in stone, and it can be inferred that starting from the right was a byproduct of the chiseling technique used at the time. While the specifics are not known, this remains a plausible hypothesis.
This gets reversed in later writing systems where paper-like materials and ink were used, starting from around 700 BCE. In this case we have greater evidence that the right-handed scribes would occasionally smear the fresh ink when writing RTL, but were able to achieve cleaner results when doing so LTR, with their hand moving away from the ink (Eckardt, 2017).
Despite the broader Mediterranean drift toward LTR writing, the RTL orientation of the Hebrew and Arabic scripts endured in significant part due to the strict copying of their canonical texts. Scribes would give the same importance to the form of Torah and Qurʾānic manuscripts as they did to their contents, consequently preserving not only the words, but the way in which they were structured. RTL writing is, therefore, a fundamental part of these communities and needs to be treated not as a linguistic quirk, but as an integral part of their cultural expression.
Technical considerations
At the network and server level, every non-ASCII label is transformed into an A-label: its Punycode encoding (RFC 3492) prefixed with “xn--”, as defined in RFC 5890. This converts an example domain name such as “דוגמה.אינטרנט” into “xn--6dbbec0c.xn--4dbqac9ab9c”. Because of that pure ASCII representation, the DNS always stores labels as LTR strings, and the reordering for display is performed only at the application level. This leaves a lot of the work for applications to perform correctly.
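The conversion described above can be sketched with Python's built-in "punycode" codec. This is a minimal illustration, not a complete IDNA2008 implementation: it skips the normalization and validation steps that the RFCs require, and the function names are assumptions made for this example.

```python
ACE_PREFIX = "xn--"


def to_a_label(label: str) -> str:
    """Encode one label; pure ASCII labels pass through in lowercase."""
    if label.isascii():
        return label.lower()
    # Python's "punycode" codec implements the RFC 3492 encoding only;
    # the ACE prefix has to be added by hand.
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_u_label(label: str) -> str:
    """Decode one label back to Unicode if it carries the ACE prefix."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def domain_to_ascii(domain: str) -> str:
    """Convert every dot-separated label of a domain name to ASCII form."""
    return ".".join(to_a_label(part) for part in domain.split("."))


def domain_to_unicode(domain: str) -> str:
    """Convert every dot-separated label back to its Unicode form."""
    return ".".join(to_u_label(part) for part in domain.split("."))


if __name__ == "__main__":
    name = "דוגמה.אינטרנט"
    wire = domain_to_ascii(name)
    print(wire)                     # ASCII form, as stored in the DNS
    print(domain_to_unicode(wire))  # back to the Unicode form
```

Note that both strings are kept in logical (typing) order throughout. Nothing in this code reorders characters; the visual RTL presentation is left entirely to whatever displays the decoded string.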
Choosing which characters may appear in the first place is delegated to ICANN’s Generation Panels, which produce Root-Zone Label Generation Rules (RZ-LGRs). The Arabic Script RZ-LGR enumerates 151 letters and defines 22 variant mappings shared by Arabic, Persian, Urdu and Pashto, ensuring that dotted extensions such as the Persian گ or the Urdu ڑ behave consistently across TLDs. Languages that further extend the Arabic script need supplementary second-level LGRs so that their extra letters don’t become confusable.
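To give an idea of how variant mappings are used, the sketch below folds each character to a representative code point and treats two labels as colliding when their folded forms match. The two-entry table is a hypothetical excerpt chosen for illustration, not data copied from the Arabic Script RZ-LGR, and a real LGR also assigns a disposition (such as blocked or allocatable) to each variant label, which is omitted here.

```python
# Hypothetical excerpt of a variant table: each key folds to the
# representative code point on the right. Not the real RZ-LGR data.
VARIANT_FOLD = {
    "\u06a9": "\u0643",  # ARABIC LETTER KEHEH     -> ARABIC LETTER KAF
    "\u06cc": "\u064a",  # ARABIC LETTER FARSI YEH -> ARABIC LETTER YEH
}


def variant_key(label: str) -> str:
    """Fold every character of a label to its representative code point."""
    return "".join(VARIANT_FOLD.get(ch, ch) for ch in label)


def collides(candidate: str, registered: list[str]) -> bool:
    """True if the candidate is a variant of any already registered label."""
    key = variant_key(candidate)
    return any(variant_key(existing) == key for existing in registered)


if __name__ == "__main__":
    registered = ["\u0643\u062a\u0627\u0628"]  # spelled with KAF
    candidate = "\u06a9\u062a\u0627\u0628"     # same word spelled with KEHEH
    print(collides(candidate, registered))
```

A registry applying such a table would refuse the second spelling, or reserve it for the holder of the first, instead of delegating it to an unrelated party.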
When a domain name written in an RTL script appears in a browser’s address bar, two separate but complementary rulesets determine what the user sees. The Unicode Bidirectional Algorithm (Unicode Standard Annex #9) determines the visual order in which the RTL and LTR runs of the whole string are displayed. Inside each dot-separated label, however, the IDNA Bidirectional Rule of RFC 5893 applies. In practice this means that an RTL label cannot start with a digit, whether an ASCII digit 0‒9 (European Number), an Arabic-Indic digit (٠, ١, ٢, …), or an Extended Arabic-Indic digit. A label that contains European Numbers may not also contain Arabic Numbers such as the Arabic-Indic digits, and vice versa.
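The per-label check can be sketched in Python using the bidirectional classes exposed by the standard `unicodedata` module. The function below follows the six conditions of the Bidi Rule in RFC 5893, section 2. It is a simplified sketch: it looks at one label in isolation, assumes the label has already passed the other IDNA checks, and the function name is an assumption made for this example.

```python
import unicodedata

# Bidirectional classes permitted in RTL and LTR labels (RFC 5893, rules 2 and 5).
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def passes_bidi_rule(label: str) -> bool:
    """Check a single Unicode label against the RFC 5893 Bidi Rule."""
    if not label:
        return False
    classes = [unicodedata.bidirectional(ch) for ch in label]

    # Rule 1: the first character must be strong (L, R or AL), never a digit.
    if classes[0] not in {"L", "R", "AL"}:
        return False

    # Rules 3 and 6 ignore trailing nonspacing marks when testing the ending.
    end = len(classes)
    while end > 1 and classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]

    if classes[0] in {"R", "AL"}:  # RTL label
        if not set(classes) <= RTL_ALLOWED:           # rule 2
            return False
        if last not in {"R", "AL", "EN", "AN"}:       # rule 3
            return False
        if "EN" in classes and "AN" in classes:       # rule 4
            return False
        return True

    # LTR label: rules 5 and 6.
    return set(classes) <= LTR_ALLOWED and last in {"L", "EN"}


if __name__ == "__main__":
    hebrew = "\u05d3\u05d5\u05d2\u05de\u05d4"            # דוגמה
    print(passes_bidi_rule(hebrew))                      # plain RTL label
    print(passes_bidi_rule("1" + hebrew))                # starts with a digit
    print(passes_bidi_rule(hebrew + "1" + "\u0660"))     # mixes EN and AN
```

The labels are written with `\u` escapes so that the logical order of the characters stays unambiguous in the source, regardless of how an editor renders mixed-direction text.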
Browser vendors normally add a defensive layer on top of these rules: if a label contains characters deemed visually confusable with characters from another script, the browser may fall back to displaying the raw “xn--” version of the domain name. ICANN’s Security and Stability Advisory Committee encouraged registries to mirror that practice by rejecting registrations that are single-script yet visually identical to an existing ASCII name (SSAC, 2018).
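A much-simplified version of that fallback decision is sketched below. It only catches labels whose letters come from more than one script, and it guesses the script from the first word of each character's Unicode name, which is a crude heuristic chosen to keep the example inside the Python standard library. Actual browsers rely on proper script properties and confusable-character data, so this should be read as an illustration of the idea and not as any vendor's policy.

```python
import unicodedata


def letter_scripts(label: str) -> set[str]:
    """Approximate the scripts used by the letters of a label.

    Heuristic (an assumption of this sketch): the first word of the
    Unicode character name, e.g. "HEBREW", "ARABIC", "LATIN", "CYRILLIC".
    Digits, hyphens and combining marks are ignored.
    """
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def display_form(label: str) -> str:
    """Return the Unicode label, or its "xn--" form if it mixes scripts."""
    if label.isascii():
        return label
    if len(letter_scripts(label)) > 1:
        return "xn--" + label.encode("punycode").decode("ascii")
    return label


if __name__ == "__main__":
    print(display_form("\u05d3\u05d5\u05d2\u05de\u05d4"))  # single script: shown as typed
    print(display_form("\u0430pple"))  # Cyrillic "а" + Latin "pple": shown as xn--
```

The single-script case that SSAC describes, where an entire label imitates an ASCII name, is not covered by this check and needs confusable data to detect.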
Email addresses are another important aspect to consider when it comes to domain names. While the SMTPUTF8 protocol (RFC 6531) allows full Unicode usage, a 2019 UASG study found that some Mail Transfer Agents (MTAs) mishandle addresses whose local and domain parts have opposite text directions, especially when bounce processing rewrites the return-path.
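A mail system that wants to log or test for this situation first has to recognize it. The sketch below flags addresses whose local part and domain part have opposite base directions, using the first strong character of each part as the direction, in the spirit of the paragraph-level rule of the Unicode Bidirectional Algorithm. The function names are assumptions made for this example, and the sketch is not taken from the UASG study.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Return "ltr", "rtl" or "neutral" based on the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "neutral"


def has_mixed_direction(address: str) -> bool:
    """True if the local part and the domain part run in opposite directions."""
    local, _, domain = address.rpartition("@")
    directions = {base_direction(local), base_direction(domain)}
    directions.discard("neutral")
    return len(directions) == 2


if __name__ == "__main__":
    domain = "\u05d3\u05d5\u05d2\u05de\u05d4.\u05d0\u05d9\u05e0\u05d8\u05e8\u05e0\u05d8"
    print(has_mixed_direction("user@" + domain))     # LTR local part, RTL domain
    print(has_mixed_direction("user@example.com"))   # both LTR
```

Addresses flagged this way are good candidates for test cases when checking how an MTA rewrites the return-path during bounce processing.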
In short, getting an RTL IDN safely from the user’s keyboard to the authoritative name server involves coordinated safeguards at every layer—from the Unicode algorithm that orders glyphs, through RFC-level syntax checks, to script-specific LGR tables and transport-layer packet sizing. Each element removes a distinct class of spoofing or operational failure.
Policy and standards efforts
Several technology stacks have been shifting from “best-effort” RTL handling to more explicit guarantees of functionality. Taken as a whole, the trend is toward greater attention to these issues, and real improvements are being delivered.
Unicode’s latest bidirectional algorithm (UAX #9 rev. 49, 2024) closes edge cases that once produced mirror-text in form fields. Meanwhile, the HTML Living Standard community has tackled several directionality challenges in recent years, moving ever closer to closing most open issues (WHATWG, 2024).
On the email side, RFC 8616 updated SPF, DKIM and DMARC in 2019 to specify when the ASCII and Unicode forms of a label are to be used, ending long-standing ambiguity for Semitic scripts. On the desktop, GNOME 42 closed many long-standing RTL issues across GTK widgets in its first coordinated sweep of the issue since 2014.
W3C’s living Arabic & Persian Layout Requirements has been consistently updated to, in their own words, “provide information for Web technologies such as CSS, HTML and digital publications, and for application developers, about how to support users of Arabic scripts”.
ICU releases 72–74 introduced a new paragraph API and aligned cursor movement with the new rules, letting application developers replace years of ad-hoc RTL code with a uniform solution.
Conclusion
Right-to-left IDNs sit at the intersection of millennia-old writing traditions and an Internet stack built for left-to-right ASCII. Progress is real: Unicode, ICANN, W3C, browser and OS vendors have been taking the issue more seriously. Yet, deployment gaps show that standards alone cannot guarantee usability. Finishing the job will require the same coordinated focus at the application edge that has already transformed the protocol core.