Unicode Identifiers and UAX 31

Programming languages need rules for what constitutes a valid identifier — a variable name, function name, type name. Historically each language invented its own rules. Unicode Standard Annex #31 (UAX 31) defines a common standard that modern languages now converge on.

Namecode encodes arbitrary strings into identifiers that conform to this standard.

XID_Start and XID_Continue

UAX 31 defines identifiers using two Unicode character properties from UAX #44:

XID_Start — characters that can begin an identifier:

Letters (Latin, Greek, Cyrillic, CJK, etc.)
Letter-like numbers (e.g. Roman numerals)
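Not digits, punctuation, symbols, or whitespace
Not underscore _ either: it is connector punctuation and therefore only XID_Continue, although nearly every language adds it to the start set as an extension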

XID_Continue — characters that can appear after the first position:

Everything in XID_Start, plus:
Digits (0–9)
Combining marks (accents, diacritics)
Connector punctuation (e.g. underscore)

A valid identifier is: one XID_Start character, followed by zero or more XID_Continue characters.
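The start/continue rule can be sketched in Python, whose `str.isidentifier()` is built on the XID properties. This is an approximation rather than a reference implementation: the helper names are hypothetical, the result depends on the Unicode version bundled with the interpreter, and the leading underscore is handled as an explicit language extension because U+005F is XID_Continue but not XID_Start in the Unicode data.

```python
def is_xid_start(ch: str) -> bool:
    """Approximate XID_Start for a single character."""
    # str.isidentifier() also accepts "_" in first position, so exclude it
    # here to stay close to the Unicode property itself.
    return len(ch) == 1 and ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    """Approximate XID_Continue for a single character."""
    # Prefixing a known start character tests ch in a non-initial position.
    return len(ch) == 1 and ("a" + ch).isidentifier()


def is_uax31_identifier(s: str, allow_leading_underscore: bool = True) -> bool:
    """One start character followed by zero or more continue characters."""
    if not s:
        return False
    first, rest = s[0], s[1:]
    start_ok = is_xid_start(first) or (allow_leading_underscore and first == "_")
    return start_ok and all(is_xid_continue(c) for c in rest)
```

Passing `allow_leading_underscore=False` gives the strict default definition; leaving it on matches what most languages accept.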

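The "XID" prefix stands for "eXtended IDentifier". These derived properties are ID_Start and ID_Continue adjusted so that identifiers stay closed under normalization: a valid identifier is still a valid identifier after NFC or NFKC normalization, which makes them safe for compilers and tooling.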
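As a small illustration of that closure property, the sketch below normalizes two identifiers with Python's standard `unicodedata` module and checks that each result is still accepted by `str.isidentifier()`. The helper name `normalized_forms` and the sample strings are chosen here for illustration only.

```python
import unicodedata


def normalized_forms(name: str) -> dict:
    """Return the NFC and NFKC forms of a candidate identifier."""
    return {form: unicodedata.normalize(form, name) for form in ("NFC", "NFKC")}


samples = [
    "cafe\u0301",  # "e" followed by U+0301 COMBINING ACUTE ACCENT
    "\ufb01le",    # starts with U+FB01 LATIN SMALL LIGATURE FI
]

for raw in samples:
    forms = normalized_forms(raw)
    # Show whether the raw string and each normalized form pass the check.
    report = {form: (text, text.isidentifier()) for form, text in forms.items()}
    print(repr(raw), raw.isidentifier(), report)
```

NFC composes the accent into a single `é`, while NFKC additionally folds the ligature into the two letters `fi`; in both cases the identifier check still passes.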

What this means in practice

String	Valid?	Why
`foo`	Yes	Letter, letters
`café`	Yes	Letters (including `é`, which is `XID_Continue`)
`名前`	Yes	CJK characters are `XID_Start` and `XID_Continue`
`_private`	Yes	Underscore is not `XID_Start`, but nearly every language permits it in first position as an extension
`foo123`	Yes	Digits are `XID_Continue`
`123foo`	No	Digit is not `XID_Start` — can't begin an identifier
`hello world`	No	Space is not `XID_Continue`
`foo-bar`	No	Hyphen is not `XID_Continue`
`foo@bar`	No	`@` is not `XID_Continue`
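The "Why" column can be reproduced mechanically by reporting the first character that breaks the rule. The sketch below again leans on Python's `str.isidentifier()` as a stand-in for the XID properties (including its acceptance of a leading underscore); `first_violation` is a hypothetical helper name, not part of any library.

```python
from typing import Optional, Tuple


def first_violation(s: str) -> Optional[Tuple[int, str, str]]:
    """Return (index, char, reason) for the first offending character, or None if valid."""
    if not s:
        return (0, "", "empty string")
    for i, ch in enumerate(s):
        if i == 0:
            # First position: must be a start character (Python also allows "_").
            if not ch.isidentifier():
                return (i, ch, "not allowed in first position")
        elif not ("a" + ch).isidentifier():
            # Later positions: test ch after a known-good start character.
            return (i, ch, "not XID_Continue")
    return None


for candidate in ["foo", "café", "名前", "_private", "foo123",
                  "123foo", "hello world", "foo-bar", "foo@bar"]:
    print(repr(candidate), first_violation(candidate))
```

A result of `None` corresponds to a "Yes" row; any tuple points at the character named in the "Why" column.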

Language adoption

Most modern languages have converged on UAX 31 or a close subset:

Language	Identifier rule	Notes
Rust	UAX 31	Exact UAX 31 since Rust 1.53 (2021). NFC normalized.
Python 3	UAX 31	Exact UAX 31 since Python 3.0 (2008). NFKC normalized.
JavaScript	UAX 31	Via the ECMAScript spec, which uses `ID_Start`/`ID_Continue` rather than the XID variants. Allows `$` and `_` as extensions.
Go	UAX 31 subset	`letter` is Unicode letter or `_`; digits are Unicode digits.
Swift	UAX 31	With some operator-character extensions.
C23	UAX 31	C23 adopts UAX 31. Older C standards used a different Unicode range list.
Java	Custom	Uses `Character.isJavaIdentifierStart/Part`, which is similar but predates and differs from UAX 31.

Because namecode output conforms to UAX 31, it produces valid identifiers in all of these languages (with the caveat that Java's rules are slightly different, though compatible in practice for the character set namecode uses).
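To see how a "subset" rule differs in practice, the sketch below models Go's identifier rule as the table describes it (Unicode letters or `_`, then letters and Unicode decimal digits) and compares it with Python's check. `is_go_identifier` is a hypothetical helper written for this comparison; it is not Go's own implementation and it ignores keywords.

```python
import unicodedata


def is_go_identifier(s: str) -> bool:
    """Model of Go's rule: letter { letter | unicode_digit }."""

    def is_letter(ch: str) -> bool:
        # Unicode general categories Lu, Ll, Lt, Lm, Lo, plus underscore.
        return ch == "_" or unicodedata.category(ch).startswith("L")

    def is_digit(ch: str) -> bool:
        # Unicode decimal digits only (general category Nd).
        return unicodedata.category(ch) == "Nd"

    return bool(s) and is_letter(s[0]) and all(
        is_letter(c) or is_digit(c) for c in s[1:]
    )


composed = "caf\u00e9"     # precomposed é, a single letter
decomposed = "cafe\u0301"  # e + combining acute accent (a combining mark)

for name in (composed, decomposed, "foo123"):
    print(repr(name), "python:", name.isidentifier(), "go model:", is_go_identifier(name))
```

The decomposed spelling is the interesting case: the combining mark is XID_Continue, so a full UAX 31 implementation accepts it, while the letters-and-digits model rejects it. Plain ASCII letters and digits pass under both rules.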

Why namecode uses UAX 31

Namecode needs a single encoding that works across languages. UAX 31 is the natural choice:

It's the actual standard that modern languages implement
The XID_Start/XID_Continue properties are stable across Unicode versions
The alphabet namecode uses for its encoded portion (a–z, 0–5) is a strict subset of ASCII XID_Continue, so encoded output is valid everywhere — even in languages with more restrictive rules
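The subset claim in the last point is easy to check. The sketch below builds the alphabet exactly as described above (a–z plus 0–5) and compares it against the ASCII characters that Python accepts in a non-initial identifier position. It does not implement namecode's encoding, and the constant name `ENCODED_ALPHABET` is an assumption made for this example.

```python
import string

# Alphabet as described in the text: lowercase a-z plus the digits 0-5.
ENCODED_ALPHABET = string.ascii_lowercase + "012345"

# ASCII characters accepted after a valid start character, using
# str.isidentifier() as a stand-in for the XID_Continue property.
ASCII_XID_CONTINUE = {
    chr(code) for code in range(128) if ("a" + chr(code)).isidentifier()
}

# "<" on sets is the strict-subset test.
is_strict_subset = set(ENCODED_ALPHABET) < ASCII_XID_CONTINUE

print(len(ENCODED_ALPHABET), len(ASCII_XID_CONTINUE), is_strict_subset)
```

The 32 symbols fit a 5-bit-per-character encoding, and because they avoid uppercase letters, the digits 6–9, and the underscore, they remain valid even under rules narrower than UAX 31.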
