Unicode Identifiers and UAX 31

Programming languages need rules for what constitutes a valid identifier — a variable name, function name, type name. Historically each language invented its own rules. Unicode Standard Annex #31 (UAX 31) defines a common standard that modern languages now converge on.

Namecode encodes arbitrary strings into identifiers that conform to this standard.

XID_Start and XID_Continue

UAX 31 defines identifiers using two Unicode character properties from UAX #44:

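XID_Start: characters that can begin an identifier. Broadly, these are letters (general categories Lu, Ll, Lt, Lm, Lo) and letter numbers (Nl).

XID_Continue: characters that can appear after the first position. This is everything in XID_Start plus combining marks (Mn, Mc), decimal digits (Nd), and connector punctuation (Pc), which includes the underscore.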

A valid identifier is: one XID_Start character, followed by zero or more XID_Continue characters.
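A minimal Python sketch of this rule, assuming CPython's str.isidentifier() as the source of property data. Python accepts XID_Start or an underscore in first position and XID_Continue afterwards, so the two properties can be probed with one-character and two-character strings. The helper names (is_xid_start, is_xid_continue, is_default_identifier) are illustrative and not part of namecode, and results depend on the Unicode version bundled with the interpreter.

```python
def is_xid_start(ch: str) -> bool:
    """Approximate XID_Start using Python's own identifier rule."""
    # str.isidentifier() accepts XID_Start plus "_" in first position,
    # so the underscore is excluded explicitly.
    return len(ch) == 1 and ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    """Approximate XID_Continue by testing ch in second position."""
    return len(ch) == 1 and ("a" + ch).isidentifier()


def is_default_identifier(s: str) -> bool:
    """One XID_Start character followed by zero or more XID_Continue."""
    if not s:
        return False
    return is_xid_start(s[0]) and all(is_xid_continue(c) for c in s[1:])


for candidate in ["foo", "foo123", "123foo", "foo-bar", "_private"]:
    print(f"{candidate!r}: {is_default_identifier(candidate)}")
```

This sketch applies the default rule with no language-specific extensions, so it rejects a leading underscore even though most languages accept one.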

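The "XID" prefix stands for "eXtended IDentifier". XID_Start and XID_Continue are derived from the older ID_Start and ID_Continue properties, with a small number of characters adjusted so that a valid identifier is still a valid identifier after NFKC normalization. This closure under normalization makes them safe for compilers and tooling that normalize identifiers.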
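The normalization property can be spot-checked in Python. The sketch below assumes CPython's str.isidentifier() (which is built on XID_Start and XID_Continue, plus a leading underscore) and the standard unicodedata module. The helper name survives_nfkc is illustrative only, and a spot check on a few strings is not a proof of the general property.

```python
import unicodedata


def survives_nfkc(name: str) -> bool:
    """True if name is not an identifier, or is still one after NFKC."""
    if not name.isidentifier():
        return True  # the closure property only concerns valid identifiers
    return unicodedata.normalize("NFKC", name).isidentifier()


# U+FB01 (LATIN SMALL LIGATURE FI) folds to the two letters "fi" under NFKC.
ligature_name = "\ufb01le"
print(unicodedata.normalize("NFKC", ligature_name))  # file
print(survives_nfkc(ligature_name))
```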

What this means in practice

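| String | Valid? | Why |
|---|---|---|
| foo | Yes | A letter followed by letters |
| café | Yes | All letters; é is both XID_Start and XID_Continue |
| 名前 | Yes | CJK characters are XID_Start and XID_Continue |
| _private | Only by extension | Underscore is XID_Continue, not XID_Start; most languages add it as an allowed first character |
| foo123 | Yes | Digits are XID_Continue |
| 123foo | No | A digit is not XID_Start, so it can't begin an identifier |
| hello world | No | Space is not XID_Continue |
| foo-bar | No | Hyphen is not XID_Continue |
| foo@bar | No | @ is not XID_Continue |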

Language adoption

Most modern languages have converged on UAX 31 or a close subset:

| Language | Identifier rule | Notes |
|---|---|---|
| Rust | UAX 31 | UAX 31 since Rust 1.53 (2021), with a leading _ also allowed. NFC normalized. |
| Python 3 | UAX 31 | XID_Start/XID_Continue plus a leading _ since Python 3.0 (2008). NFKC normalized. |
| JavaScript | UAX 31 | Via the ECMAScript spec, which uses ID_Start/ID_Continue and also allows $ and _ as extensions. |
| Go | UAX 31 subset | A letter is a Unicode letter or _; digits are Unicode decimal digits. |
| Swift | UAX 31 | With some operator-character extensions. |
| C23 | UAX 31 | C23 adopts UAX 31. Older C standards used a different Unicode range list. |
| Java | Custom | Uses Character.isJavaIdentifierStart/Part, which is similar to UAX 31 but predates it and differs in details. |

Because namecode output conforms to UAX 31, it produces valid identifiers in all of these languages. Java's rules differ slightly from UAX 31, but they agree with it for the character set namecode uses.

Why namecode uses UAX 31

Namecode needs a single encoding that works across languages. UAX 31 is the natural choice:

  1. It's the actual standard that modern languages implement
  2. The XID_Start/XID_Continue properties are stable across Unicode versions
  3. The alphabet namecode uses for its encoded portion (a-z, 0-5) is a strict subset of ASCII XID_Continue, so encoded output is valid everywhere, even in languages with more restrictive rules
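
The third point can be spot-checked with a short Python sketch. It assumes the alphabet is exactly the lowercase letters a-z plus the digits 0-5 (32 symbols) and probes XID_Continue through CPython's str.isidentifier(), which accepts XID_Continue characters after the first position. The names ALPHABET, is_xid_continue, and alphabet_is_safe are illustrative and not part of namecode's API.

```python
import string

# Assumed alphabet: a-z plus 0-5, as described in the text (32 symbols).
ALPHABET = string.ascii_lowercase + "012345"


def is_xid_continue(ch: str) -> bool:
    """Approximate XID_Continue by testing ch in second position."""
    return len(ch) == 1 and ("a" + ch).isidentifier()


def alphabet_is_safe(alphabet: str) -> bool:
    """True if every symbol is ASCII and usable after the first position."""
    return all(ch.isascii() and is_xid_continue(ch) for ch in alphabet)


print(len(ALPHABET))               # 32
print(alphabet_is_safe(ALPHABET))  # True
```

The digits 0-5 are XID_Continue but not XID_Start, so this check says nothing about the first character, which must still be a valid start character.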

Further reading