Namecode Specification

Version: 0.1
Status: Draft

Abstract

Namecode encodes arbitrary Unicode strings into valid programming language identifiers. Think "Punycode for variable names." The output is valid across Rust, Go, JavaScript, and Python.

Motivation

Programming languages restrict what characters can appear in identifiers. When arbitrary Unicode strings such as file paths, JSON keys, or database column names have to serve as identifiers, they must first be encoded into valid identifier form.

Namecode solves this by providing a reversible, deterministic encoding that:

Produces valid UAX 31 identifiers
Preserves readability for ASCII-only inputs
Handles all Unicode strings, including emoji, CJK, and RTL text

Properties

Guaranteed Properties

Property	Definition
Roundtrip	If `encode(s)` starts with `_N_` and `s` is not itself already a valid Namecode encoding, then `decode(encode(s)) == s`
Passthrough	If `encode(s)` does not start with `_N_`, then `encode(s) == s`
Identity	`encode(decode(s)) == s` for all valid encodings `s`
Idempotency	`encode(encode(s)) == encode(s)`
Valid Output	`encode(s)` is a valid UAX 31 identifier (for non-empty `s`)
Deterministic	Same input always produces same output
O(n) Complexity	Both encode and decode run in linear time

Note: Strings that are already valid XID identifiers (and don't conflict with the encoding format) pass through unchanged. The decode function only accepts strings with the _N_ prefix.

Non-Goals

Normalization: Namecode preserves exact codepoints. NFC/NFKC normalization is the caller's responsibility.
Minimal Output: The encoding prioritizes correctness and simplicity over minimal length.
Human Readability of Encoded Portion: The bootstring-encoded section is not meant to be human-readable.

Terminology

Term	Definition
XID Identifier	A string valid per Unicode Standard Annex #31 (UAX 31). Starts with `XID_Start` or underscore, then zero or more `XID_Continue`. Single underscore `_` is valid.
Basic Character	A character that passes through unchanged: any `XID_Continue` character except when it would create `__` or a trailing underscore before the delimiter.
Non-Basic Character	Any character that must be encoded: non-`XID_Continue` characters, consecutive underscores, or trailing underscores when non-basic characters exist.
Prefix	`_N_` - marks a string as Namecode-encoded.
Delimiter	`__` - separates basic characters from the encoded portion.

Encoding Format

Grammar

namecode    = passthrough | encoded
passthrough = xid_identifier   ; if no collisions
encoded     = "_N_" basic "__" insertions
            | "_N_" basic                      ; no non-basic chars

basic       = { xid_continue } ; no "__", no trailing "_" if insertions exist
insertions  = { position_delta codepoint }
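As a rough illustration of the `encoded` production, the sketch below splits an encoded string into its basic portion and its optional insertion digits. The function name `split_encoded` is hypothetical. Locating the delimiter with `rfind` is an assumption of this sketch; it relies on the digit alphabet containing no underscore and on `basic` never ending in one when insertions exist, so a well-formed string has at most one `__` after the prefix. The sketch does not validate either part.

```rust
/// Split an encoded string into (basic, insertions).
/// Returns None when the `_N_` prefix is missing.
/// Hypothetical helper for illustration; it performs no validation.
fn split_encoded(s: &str) -> Option<(&str, Option<&str>)> {
    let rest = s.strip_prefix("_N_")?;
    // A well-formed string has at most one "__" after the prefix,
    // so searching from the right or from the left gives the same split.
    Some(match rest.rfind("__") {
        Some(i) => (&rest[..i], Some(&rest[i + 2..])),
        None => (rest, None), // the `"_N_" basic` form: no insertions
    })
}
```

For example, `split_encoded("_N_helloworld__fa0b")` yields the basic portion `helloworld` and the digits `fa0b`, while `split_encoded("_N_123foo")` yields `123foo` with no insertions.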

Decision Tree

Is input empty?
  └─ Yes → return ""
  └─ No  ↓

Is input a valid XID identifier that does not start with "_N_"?
  └─ Yes → return input unchanged (passthrough)
  └─ No  ↓

Is input already a valid Namecode encoding?
  └─ Yes → return input unchanged (idempotency)
  └─ No  ↓

Encode the input:
  1. Extract basic characters (XID_Continue, avoiding __ and trailing _)
  2. Record positions and codepoints of non-basic characters
  3. Encode insertions using Bootstring
  4. Return "_N_" + basic + "__" + encoded_insertions,
     or just "_N_" + basic when there are no non-basic characters

Examples

Input	Output	Reason
`foo`	`foo`	Valid XID, passthrough
`cafe`	`cafe`	Valid XID (ASCII)
`café`	`café`	Valid XID (non-ASCII Latin letter)
`名前`	`名前`	Valid XID (CJK)
`foo__bar`	`foo__bar`	Valid XID (contains `__` but no `_N_` prefix)
`hello world`	`_N_helloworld__fa0b`	Space is non-XID
`foo-bar`	`_N_foobar__da1d`	Hyphen is non-XID
`123foo`	`_N_123foo`	Digit can't start identifier (all chars XID_Continue)
`_N_test`	`_N__N_test`	Prefix collision (all chars XID_Continue)
`_`	`_`	Single underscore is valid XID
`` (empty)	`` (empty)	Empty passthrough

Bootstring Encoding

The encoded portion uses a variant of the Bootstring algorithm (RFC 3492), adapted for identifier-safe output.

Alphabet

32 characters, all valid in identifiers:

a-z (0-25) + 0-5 (26-31) = 32 characters

The digit characters 6-9 are NOT used (reserved for future extensions).
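The alphabet mapping can be sketched as a pair of small functions. The names `encode_digit` and `decode_digit` follow the pseudocode used later in this document; the `Option` return types are a choice of this sketch. It also assumes that uppercase letters are rejected, which the text above does not state explicitly.

```rust
/// Map a digit value (0..=31) to its alphabet character.
fn encode_digit(d: u32) -> Option<char> {
    match d {
        0..=25 => char::from_u32('a' as u32 + d),         // a-z
        26..=31 => char::from_u32('0' as u32 + (d - 26)), // 0-5
        _ => None,                                        // outside the alphabet
    }
}

/// Map an alphabet character back to its digit value.
fn decode_digit(c: char) -> Option<u32> {
    match c {
        'a'..='z' => Some(c as u32 - 'a' as u32),
        '0'..='5' => Some(c as u32 - '0' as u32 + 26),
        _ => None, // '6'..='9', '_', and (by assumption) uppercase letters
    }
}
```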

Constants

Constant	Value	Purpose
`BASE`	32	Number of distinct digits
`T_MIN`	1	Minimum threshold
`T_MAX`	26	Maximum threshold
`SKEW`	38	Bias adaptation skew
`DAMP`	700	First-time damping factor
`INITIAL_BIAS`	72	Starting bias value

Variable-Length Integer Encoding

Each value is encoded as a sequence of digits. The threshold function determines when the sequence terminates:

threshold(k, bias) =
    T_MIN           if k <= bias + T_MIN
    T_MAX           if k >= bias + T_MAX
    k - bias        otherwise
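A direct transcription of the threshold function, given here as a minimal sketch with `u32` arithmetic (the integer type is an assumption). The subtraction in the last branch cannot underflow, because that branch is only reached when `k > bias + T_MIN`.

```rust
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;

/// Clamp `k - bias` into the range T_MIN..=T_MAX.
fn threshold(k: u32, bias: u32) -> u32 {
    if k <= bias + T_MIN {
        T_MIN
    } else if k >= bias + T_MAX {
        T_MAX
    } else {
        k - bias
    }
}
```

With the initial bias of 72, the first digit position (`k = 32`) has threshold 1; with a bias of 0 it has threshold 26; and `threshold(96, 72)` falls in the middle branch and gives 24.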

Encoding a value:

k = BASE
while true:
    t = threshold(k, bias)
    if value < t:
        output encode_digit(value)
        break
    digit = t + (value - t) % (BASE - t)
    output encode_digit(digit)
    value = (value - t) / (BASE - t)
    k += BASE
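The encoding loop above might look as follows in Rust. This is a sketch rather than the reference implementation: the `u32` value type and the choice to append to a `String` are assumptions, and `encode_digit` is simplified because every digit produced by the loop is below `BASE`.

```rust
const BASE: u32 = 32;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;

fn threshold(k: u32, bias: u32) -> u32 {
    if k <= bias + T_MIN {
        T_MIN
    } else if k >= bias + T_MAX {
        T_MAX
    } else {
        k - bias
    }
}

/// Digit value to alphabet character: a-z for 0-25, 0-5 for 26-31.
fn encode_digit(d: u32) -> char {
    debug_assert!(d < BASE);
    if d < 26 {
        (b'a' + d as u8) as char
    } else {
        (b'0' + (d - 26) as u8) as char
    }
}

/// Append the variable-length encoding of `value` to `out`.
fn encode_value(mut value: u32, bias: u32, out: &mut String) {
    let mut k = BASE;
    loop {
        let t = threshold(k, bias);
        if value < t {
            // A digit below the threshold terminates the sequence.
            out.push(encode_digit(value));
            break;
        }
        out.push(encode_digit(t + (value - t) % (BASE - t)));
        value = (value - t) / (BASE - t);
        k += BASE;
    }
}
```

Encoding the value 5 with bias 72 appends `fa`, and encoding 32 (the space codepoint) with bias 0 appends `0b`; together these are the digits of the `hello world` example.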

Decoding a value:

result = 0, w = 1, k = BASE
while true:
    digit = decode_digit(next_char())
    t = threshold(k, bias)
    result += digit * w
    if digit < t:
        break
    w *= (BASE - t)
    k += BASE
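The decoding loop can be sketched in the same way. The pseudocode leaves failure cases implicit; this version adds checked arithmetic and end-of-input handling, reporting them through a local `DecodeError` enum whose variant names are borrowed from the Error Handling table. The iterator-based signature is an assumption of the sketch.

```rust
const BASE: u32 = 32;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;

#[derive(Debug, PartialEq)]
enum DecodeError {
    InvalidDigit(char),
    UnexpectedEnd,
    Overflow,
}

fn threshold(k: u32, bias: u32) -> u32 {
    if k <= bias + T_MIN {
        T_MIN
    } else if k >= bias + T_MAX {
        T_MAX
    } else {
        k - bias
    }
}

fn decode_digit(c: char) -> Option<u32> {
    match c {
        'a'..='z' => Some(c as u32 - 'a' as u32),
        '0'..='5' => Some(c as u32 - '0' as u32 + 26),
        _ => None,
    }
}

/// Read one variable-length integer from `chars`, consuming only its digits.
fn decode_value(
    chars: &mut impl Iterator<Item = char>,
    bias: u32,
) -> Result<u32, DecodeError> {
    let (mut result, mut w, mut k) = (0u32, 1u32, BASE);
    loop {
        let c = chars.next().ok_or(DecodeError::UnexpectedEnd)?;
        let digit = decode_digit(c).ok_or(DecodeError::InvalidDigit(c))?;
        let t = threshold(k, bias);
        let term = digit.checked_mul(w).ok_or(DecodeError::Overflow)?;
        result = result.checked_add(term).ok_or(DecodeError::Overflow)?;
        if digit < t {
            return Ok(result); // a digit below the threshold ends the value
        }
        w = w.checked_mul(BASE - t).ok_or(DecodeError::Overflow)?;
        k += BASE;
    }
}
```

Because the function takes the iterator by mutable reference, a caller can decode consecutive values from one string by calling it repeatedly with the current bias.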

Bias Adaptation

After encoding/decoding each value, the bias is adapted:

adapt_bias(delta, num_points, first_time):
    delta = delta / DAMP   if first_time else delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - T_MIN) * T_MAX) / 2:
        delta /= (BASE - T_MIN)
        k += BASE
    return k + ((BASE - T_MIN + 1) * delta) / (delta + SKEW)
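For reference, the adaptation step transcribes to Rust almost line for line. This sketch uses `u32` integer division throughout and assumes `num_points >= 1` and inputs no larger than a Unicode codepoint, so the intermediate sums cannot overflow.

```rust
const BASE: u32 = 32;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;
const SKEW: u32 = 38;
const DAMP: u32 = 700;

/// Compute the next bias from the value that was just coded.
/// Assumes `num_points >= 1` (it is used as a divisor).
fn adapt_bias(mut delta: u32, num_points: u32, first_time: bool) -> u32 {
    delta = if first_time { delta / DAMP } else { delta / 2 };
    delta += delta / num_points;
    let mut k = 0;
    while delta > ((BASE - T_MIN) * T_MAX) / 2 {
        delta /= BASE - T_MIN;
        k += BASE;
    }
    k + ((BASE - T_MIN + 1) * delta) / (delta + SKEW)
}
```

As a worked case, `adapt_bias(32, 2, false)` halves 32 to 16, adds 16 / 2 = 8 to reach 24, skips the loop because 24 is not greater than 403, and returns (32 * 24) / (24 + 38) = 12. Small first-time inputs such as `adapt_bias(5, 1, true)` collapse to 0 because of the division by `DAMP`.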

Insertion Encoding

Non-basic characters are encoded as a sequence of (position_delta, codepoint) pairs:

Position Delta: Distance from previous insertion (or from start for first)
- First insertion: absolute position
- Subsequent insertions: current_position - previous_position - 1
Codepoint: The Unicode codepoint value

Both are encoded as variable-length integers with bias adaptation between each value.
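Putting the pieces together, the sketch below encodes a list of insertions into digits. The text above does not say which arguments are passed to `adapt_bias`, so this sketch makes assumptions: the bias starts at `INITIAL_BIAS`, `delta` is the value just coded, `num_points` is the number of values coded so far, and `first_time` is true only after the very first value. With these choices the output matches the examples in this document, but the examples do not pin down `num_points` uniquely. The name `encode_insertions` is hypothetical.

```rust
const BASE: u32 = 32;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;
const SKEW: u32 = 38;
const DAMP: u32 = 700;
const INITIAL_BIAS: u32 = 72;

fn threshold(k: u32, bias: u32) -> u32 {
    if k <= bias + T_MIN { T_MIN } else if k >= bias + T_MAX { T_MAX } else { k - bias }
}

fn encode_digit(d: u32) -> char {
    if d < 26 { (b'a' + d as u8) as char } else { (b'0' + (d - 26) as u8) as char }
}

fn encode_value(mut value: u32, bias: u32, out: &mut String) {
    let mut k = BASE;
    loop {
        let t = threshold(k, bias);
        if value < t {
            out.push(encode_digit(value));
            break;
        }
        out.push(encode_digit(t + (value - t) % (BASE - t)));
        value = (value - t) / (BASE - t);
        k += BASE;
    }
}

fn adapt_bias(mut delta: u32, num_points: u32, first_time: bool) -> u32 {
    delta = if first_time { delta / DAMP } else { delta / 2 };
    delta += delta / num_points;
    let mut k = 0;
    while delta > ((BASE - T_MIN) * T_MAX) / 2 {
        delta /= BASE - T_MIN;
        k += BASE;
    }
    k + ((BASE - T_MIN + 1) * delta) / (delta + SKEW)
}

/// Encode (absolute position, character) pairs, which must be sorted by
/// strictly increasing position.
fn encode_insertions(insertions: &[(u32, char)]) -> String {
    let mut out = String::new();
    let mut bias = INITIAL_BIAS;
    let mut coded = 0u32; // values coded so far (assumed meaning of num_points)
    let mut prev: Option<u32> = None;
    for &(pos, ch) in insertions {
        // First insertion: absolute position; later ones: gap minus one.
        let delta = match prev {
            None => pos,
            Some(p) => pos - p - 1,
        };
        for value in [delta, ch as u32] {
            encode_value(value, bias, &mut out);
            coded += 1;
            bias = adapt_bias(value, coded, coded == 1);
        }
        prev = Some(pos);
    }
    out
}
```

For `hello world` the only insertion is a space at position 5, and `encode_insertions(&[(5, ' ')])` gives `fa0b`. For `a b c`, with spaces at positions 1 and 3, the deltas are 1 and 3 - 1 - 1 = 1, and the result is `ba0bb0b`.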

Collision Handling

Prefix Collision (`_N_...`)

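Strings starting with _N_ are never passed through merely because they are valid XID. Unless the string is already a valid Namecode encoding (see the note below), it is encoded. This prevents ambiguity between a literal _N_test and an encoded string.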

encode("_N_test") → "_N__N_test" (not "_N_test")
decode("_N__N_test") → "_N_test"

Note: If a string starting with _N_ happens to be a valid Namecode encoding of some other string, the idempotency check will return it unchanged. This is intentional for the encode-encode idempotency property.

Double Underscore Passthrough

Strings containing __ but NOT starting with _N_ pass through unchanged:

encode("foo__bar") → "foo__bar" (valid XID, passes through)
encode("__") → "__" (valid XID: underscore followed by XID_Continue)

The __ delimiter only has meaning after the _N_ prefix, so these strings cannot be confused with encoded strings.

Basic Portion Constraints

When constructing the basic portion during encoding:

No trailing underscores: If non-basic characters exist, trailing underscores in basic are moved to non-basic to avoid ___ ambiguity with the delimiter.
No consecutive underscores: Underscores that would become consecutive in the basic portion (after removing non-basic characters) are moved to non-basic.

encode("test_ ") → "_N_test__..." (trailing _ encoded with space)
encode("__ _x") → "_N__x__..." (middle _ encoded to avoid __ in basic)
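One way to apply both constraints while separating basic from non-basic characters is sketched below. The function names are hypothetical, and `is_continue_approx` is only a stand-in: it uses `char::is_alphanumeric`, which is not the same set as `XID_Continue`. A real implementation would need proper Unicode tables (for instance from a crate such as `unicode-ident`). Positions are counted in characters of the original input.

```rust
/// Stand-in for XID_Continue. This is an approximation for illustration
/// only; it does NOT match the real Unicode property exactly.
fn is_continue_approx(c: char) -> bool {
    c == '_' || c.is_alphanumeric()
}

/// Split `input` into the basic portion and a position-sorted list of
/// (original position, character) insertions.
fn split_basic(input: &str) -> (String, Vec<(usize, char)>) {
    // Kept characters remember their position so they can still be moved
    // to the insertion list by the trailing-underscore rule.
    let mut basic: Vec<(usize, char)> = Vec::new();
    let mut insertions: Vec<(usize, char)> = Vec::new();
    for (pos, c) in input.chars().enumerate() {
        // An underscore that would sit next to another one in `basic`
        // becomes non-basic (no consecutive underscores).
        let doubles = c == '_' && matches!(basic.last(), Some(&(_, '_')));
        if is_continue_approx(c) && !doubles {
            basic.push((pos, c));
        } else {
            insertions.push((pos, c));
        }
    }
    if !insertions.is_empty() {
        // No trailing underscore before the delimiter. At most one can be
        // left here, since doubles were already moved above.
        while let Some(&(pos, '_')) = basic.last() {
            basic.pop();
            insertions.push((pos, '_'));
        }
        insertions.sort_by_key(|&(pos, _)| pos);
    }
    (basic.into_iter().map(|(_, c)| c).collect(), insertions)
}
```

For `"test_ "` this returns the basic portion `test` with insertions at positions 4 (`_`) and 5 (space). For `"__ _x"` it returns `_x` with insertions at positions 1, 2, and 3, which is consistent with the `_N__x__ba3la0ba3l` test vector.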

Error Handling

Decoding can fail with these errors:

Error	Cause
`NotEncoded`	Input doesn't start with `_N_`
`InvalidDigit(char)`	Character in encoded portion not in alphabet
`UnexpectedEnd`	Encoded data truncated mid-varint
`InvalidCodepoint(u32)`	Decoded codepoint not valid Unicode
`Overflow`	Arithmetic overflow during decoding

API

Rust

/// Encode a Unicode string into a valid UAX 31 identifier.
pub fn encode(input: &str) -> String;

/// Decode a Namecode string back to Unicode.
pub fn decode(input: &str) -> Result<String, DecodeError>;

/// Check if a string is a valid XID identifier (UAX 31).
pub fn is_xid_identifier(input: &str) -> bool;

Command Line

# Encode a string
namecode encode "hello world"
# Output: _N_helloworld__fa0b

# Decode a string
namecode decode "_N_helloworld__fa0b"
# Output: hello world

# Pipe mode
echo "foo-bar" | namecode encode
cat encoded.txt | namecode decode

Compatibility

Language Support

Namecode output is valid in:

Language	Identifier Rules	Namecode Compatible
Rust	UAX 31	Yes
Go	UAX 31 (subset)	Yes
Python 3	UAX 31	Yes
JavaScript	UAX 31	Yes
C/C++	ASCII + some Unicode	Mostly (ASCII subset always works)

Version Compatibility

The encoding format is stable. Any string encoded with Namecode 1.0 will decode correctly with future versions.

Future versions may add:

Alternative prefixes for different use cases
Extended digit alphabet (6-9 currently reserved)
Compression optimizations (backward compatible)

Test Vectors

Passthrough Cases

Input	Output
`foo`	`foo`
`_private`	`_private`
`café`	`café`
`名前`	`名前`
`CamelCase`	`CamelCase`

Encoding Cases

Input	Encoded	Notes
`hello world`	`_N_helloworld__fa0b`	Single space at position 5
`foo-bar`	`_N_foobar__da1d`	Single hyphen at position 3
`a b c`	`_N_abc__ba0bb0b`	Spaces at positions 1 and 3
`123`	`_N_123`	Digits are XID_Continue, no non-basic chars
`   `	`_N___a0ba0ba0b`	Three spaces (all non-basic)

Edge Cases

Input	Output	Notes
``	``	Empty string
` `	`_N___a0b`	Single space (non-basic)
`a`	`a`	Single letter
`_`	`_`	Single underscore (valid XID)
`_a`	`_a`	Underscore + letter (valid XID)
`__`	`__`	Two underscores (valid XID: `_` + XID_Continue)
`___`	`___`	Three underscores (valid XID)
`foo__bar`	`foo__bar`	Valid XID, passes through
`_N_test`	`_N__N_test`	Prefix collision, no non-basic chars
`__ _x`	`_N__x__ba3la0ba3l`	Mixed: underscores separated by space

Security Considerations

No Injection: Namecode output contains only identifier-safe characters
No Normalization Attacks: Exact codepoints are preserved (no NFKC confusion)
Bounded Output: Output length is O(n) where n is input length
Deterministic: No randomness means no oracle attacks
