Namecode Specification

Version: 0.1
Status: Draft

Abstract

Namecode encodes arbitrary Unicode strings into valid programming language identifiers. Think "Punycode for variable names." The output is valid across Rust, Go, JavaScript, and Python.

Motivation

Programming languages restrict what characters can appear in identifiers. When storing or transmitting data that uses identifiers as keys (file paths, JSON keys, database columns), arbitrary Unicode strings must be encoded into valid identifier form.

Namecode solves this by providing a reversible, deterministic encoding.

Properties

Guaranteed Properties

| Property | Definition |
|---|---|
| Roundtrip | If encode(s) starts with _N_, then decode(encode(s)) == s |
| Passthrough | If encode(s) does not start with _N_, then encode(s) == s |
| Identity | encode(decode(s)) == s for all valid encodings s |
| Idempotency | encode(encode(s)) == encode(s) |
| Valid Output | encode(s) is a valid UAX 31 identifier (for non-empty s) |
| Deterministic | Same input always produces same output |
| O(n) Complexity | Both encode and decode run in linear time |

Note: Strings that are already valid XID identifiers (and don't conflict with the encoding format) pass through unchanged. The decode function only accepts strings with the _N_ prefix.

Non-Goals

Terminology

| Term | Definition |
|---|---|
| XID Identifier | A string valid per Unicode Standard Annex #31 (UAX 31). Starts with XID_Start or underscore, then zero or more XID_Continue. Single underscore _ is valid. |
| Basic Character | A character that passes through unchanged: any XID_Continue character except when it would create __ or a trailing underscore before the delimiter. |
| Non-Basic Character | Any character that must be encoded: non-XID_Continue characters, consecutive underscores, or trailing underscores when non-basic characters exist. |
| Prefix | _N_ (marks a string as Namecode-encoded) |
| Delimiter | __ (separates basic characters from the encoded portion) |

Encoding Format

Grammar

namecode    = passthrough | encoded
passthrough = xid_identifier   ; if no collisions
encoded     = "_N_" basic "__" insertions
            | "_N_" basic                      ; no non-basic chars

basic       = { xid_continue } ; no "__", no trailing "_" if insertions exist
insertions  = { position_delta codepoint }
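
As an illustration of the grammar, the sketch below splits an encoded string into its basic part and its insertion digits. The function name `split_encoded` is hypothetical. It assumes the delimiter is the last `__` in the string, which should be safe because the insertion alphabet contains no underscore; the specification itself does not state which occurrence a decoder must use.

```rust
const PREFIX: &str = "_N_";
const DELIMITER: &str = "__";

/// Split an encoded string into (basic, insertion digits).
/// Returns None when the "_N_" prefix is missing.
fn split_encoded(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix(PREFIX)?;
    match rest.rfind(DELIMITER) {
        // Everything before the last "__" is basic, everything after it
        // is the Bootstring-encoded insertion data.
        Some(idx) => Some((&rest[..idx], &rest[idx + DELIMITER.len()..])),
        // No delimiter: the second grammar alternative, basic only.
        None => Some((rest, "")),
    }
}
```

For example, `_N_helloworld__fa0b` splits into `helloworld` and `fa0b`, while `_N__N_test` has no delimiter after the prefix and yields the basic part `_N_test` with no insertions.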

Decision Tree

Is input empty?
  └─ Yes → return ""
  └─ No → Is input a valid XID identifier AND does not start with "_N_"?
      └─ Yes → return input unchanged (passthrough)
      └─ No → Is input already a valid Namecode encoding?
          └─ Yes → return input unchanged (idempotency)
          └─ No → Encode the input:
              1. Extract basic characters (XID_Continue, avoiding __ and trailing _)
              2. Record positions and codepoints of non-basic characters
              3. Encode insertions using Bootstring
              4. Return "_N_" + basic + "__" + encoded_insertions
                 (omit "__" and the insertions when there are no non-basic characters)

Examples

| Input | Output | Reason |
|---|---|---|
| foo | foo | Valid XID, passthrough |
| cafe | cafe | Valid XID (ASCII) |
| café | café | Valid XID (Latin extended) |
| 名前 | 名前 | Valid XID (CJK) |
| foo__bar | foo__bar | Valid XID (contains __ but no _N_ prefix) |
| hello world | _N_helloworld__fa0b | Space is non-XID |
| foo-bar | _N_foobar__da1d | Hyphen is non-XID |
| 123foo | _N_123foo | Digit can't start identifier (all chars XID_Continue) |
| _N_test | _N__N_test | Prefix collision (all chars XID_Continue) |
| _ | _ | Single underscore is valid XID |
| "" (empty) | "" (empty) | Empty passthrough |

Bootstring Encoding

The encoded portion uses a variant of the Bootstring algorithm (RFC 3492), adapted for identifier-safe output.

Alphabet

32 characters, all valid in identifiers:

a-z (0-25) + 0-5 (26-31) = 32 characters

The digit characters 6-9 are not part of the alphabet (reserved for future extensions).
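
A minimal sketch of the digit mapping, using the `encode_digit` and `decode_digit` names from the pseudocode in this section. Returning `Option` for out-of-range input is a choice made for this example, and rejecting uppercase letters is an assumption (the alphabet above lists only a-z and 0-5).

```rust
/// Map a digit value (0..=31) to its alphabet character.
fn encode_digit(d: u32) -> Option<char> {
    match d {
        0..=25 => Some((b'a' + d as u8) as char),         // a-z
        26..=31 => Some((b'0' + (d - 26) as u8) as char), // 0-5
        _ => None,
    }
}

/// Map an alphabet character back to its digit value.
fn decode_digit(c: char) -> Option<u32> {
    match c {
        'a'..='z' => Some(c as u32 - 'a' as u32),
        '0'..='5' => Some(c as u32 - '0' as u32 + 26),
        // Includes 6-9, uppercase letters, and underscore.
        _ => None,
    }
}
```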

Constants

| Constant | Value | Purpose |
|---|---|---|
| BASE | 32 | Number of distinct digits |
| T_MIN | 1 | Minimum threshold |
| T_MAX | 26 | Maximum threshold |
| SKEW | 38 | Bias adaptation skew |
| DAMP | 700 | First-time damping factor |
| INITIAL_BIAS | 72 | Starting bias value |

Variable-Length Integer Encoding

Each value is encoded as a sequence of digits. The threshold function determines when the sequence terminates:

threshold(k, bias) =
    T_MIN           if k <= bias + T_MIN
    T_MAX           if k >= bias + T_MAX
    k - bias        otherwise

Encoding a value:

k = BASE
while true:
    t = threshold(k, bias)
    if value < t:
        output encode_digit(value)
        break
    digit = t + (value - t) % (BASE - t)
    output encode_digit(digit)
    value = (value - t) / (BASE - t)
    k += BASE
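
The encoding loop above translates fairly directly to Rust. The following is a sketch rather than a reference implementation: the function signatures are chosen for this example, and `encode_digit` assumes its argument is already below BASE, which the loop guarantees.

```rust
const BASE: u32 = 32;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;

/// Threshold for the digit at position k, clamped to T_MIN..=T_MAX.
fn threshold(k: u32, bias: u32) -> u32 {
    if k <= bias + T_MIN {
        T_MIN
    } else if k >= bias + T_MAX {
        T_MAX
    } else {
        k - bias
    }
}

/// Digit value to alphabet character; callers pass d < BASE.
fn encode_digit(d: u32) -> char {
    if d < 26 {
        (b'a' + d as u8) as char
    } else {
        (b'0' + (d - 26) as u8) as char
    }
}

/// Append the variable-length encoding of `value` to `out`.
fn encode_varint(mut value: u32, bias: u32, out: &mut String) {
    let mut k = BASE;
    loop {
        let t = threshold(k, bias);
        if value < t {
            // Final digit: below the threshold, so the decoder stops here.
            out.push(encode_digit(value));
            break;
        }
        // Continuation digit: always >= t, so the decoder keeps reading.
        out.push(encode_digit(t + (value - t) % (BASE - t)));
        value = (value - t) / (BASE - t);
        k += BASE;
    }
}
```

With the initial bias of 72 the threshold is T_MIN, so the value 5 is written as `fa`. With a bias of 0 the threshold is T_MAX, so 32 (the space codepoint) is written as `0b`. Together these give the `fa0b` suffix seen in the `hello world` example.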

Decoding a value:

result = 0, w = 1, k = BASE
while true:
    digit = decode_digit(next_char())
    t = threshold(k, bias)
    result += digit * w
    if digit < t:
        break
    w *= (BASE - t)
    k += BASE
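
A possible Rust rendering of the decoding loop is sketched below. The pseudocode above omits failure handling, so the checks for truncated input, unknown digits, and overflow are additions made for this example, with error names borrowed from the Error Handling table. The local `DecodeError` enum is a reduced stand-in, not the library's actual type.

```rust
const BASE: u32 = 32;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;

#[derive(Debug, PartialEq)]
enum DecodeError {
    InvalidDigit(char),
    UnexpectedEnd,
    Overflow,
}

fn threshold(k: u32, bias: u32) -> u32 {
    if k <= bias + T_MIN {
        T_MIN
    } else if k >= bias + T_MAX {
        T_MAX
    } else {
        k - bias
    }
}

fn decode_digit(c: char) -> Option<u32> {
    match c {
        'a'..='z' => Some(c as u32 - 'a' as u32),
        '0'..='5' => Some(c as u32 - '0' as u32 + 26),
        _ => None,
    }
}

/// Read one variable-length integer from `chars`.
fn decode_varint(
    chars: &mut impl Iterator<Item = char>,
    bias: u32,
) -> Result<u32, DecodeError> {
    let (mut result, mut w, mut k) = (0u32, 1u32, BASE);
    loop {
        let c = chars.next().ok_or(DecodeError::UnexpectedEnd)?;
        let digit = decode_digit(c).ok_or(DecodeError::InvalidDigit(c))?;
        let t = threshold(k, bias);
        // result += digit * w, with overflow reported instead of wrapping.
        result = digit
            .checked_mul(w)
            .and_then(|x| result.checked_add(x))
            .ok_or(DecodeError::Overflow)?;
        if digit < t {
            return Ok(result);
        }
        w = w.checked_mul(BASE - t).ok_or(DecodeError::Overflow)?;
        k += BASE;
    }
}
```

Reading `fa0b` with bias 72 for the first value and bias 0 for the second yields 5 and then 32, the position and codepoint of the space in `hello world`.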

Bias Adaptation

After encoding/decoding each value, the bias is adapted:

adapt_bias(delta, num_points, first_time):
    delta = delta / DAMP   if first_time else delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - T_MIN) * T_MAX) / 2:
        delta /= (BASE - T_MIN)
        k += BASE
    return k + ((BASE - T_MIN + 1) * delta) / (delta + SKEW)
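
The same function in Rust, as a sketch. The parameter types are chosen for this example, and `num_points` is assumed to be at least 1, since the pseudocode divides by it.

```rust
const BASE: u32 = 32;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;
const SKEW: u32 = 38;
const DAMP: u32 = 700;

/// Compute the next bias from the value that was just processed.
/// `num_points` must be >= 1.
fn adapt_bias(mut delta: u32, num_points: u32, first_time: bool) -> u32 {
    // Damp heavily the first time, halve afterwards.
    delta = if first_time { delta / DAMP } else { delta / 2 };
    delta += delta / num_points;
    let mut k = 0;
    while delta > ((BASE - T_MIN) * T_MAX) / 2 {
        delta /= BASE - T_MIN;
        k += BASE;
    }
    k + ((BASE - T_MIN + 1) * delta) / (delta + SKEW)
}
```

Worked by hand: `adapt_bias(5, 1, true)` divides 5 by 700, giving 0, so the bias becomes 0. `adapt_bias(32, 2, false)` halves 32 to 16, adds 16 / 2 = 8 to get 24, skips the loop (24 is not greater than 403), and returns (32 * 24) / (24 + 38) = 12.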

Insertion Encoding

Non-basic characters are encoded as a sequence of (position_delta, codepoint) pairs:

  1. Position Delta: Distance from the previous insertion (or from the start for the first insertion)
    • First insertion: absolute position
    • Subsequent insertions: current_position - previous_position - 1
  2. Codepoint: The Unicode codepoint value

Both are encoded as variable-length integers, with the bias adapted after each value.
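
Putting the pieces together, the sketch below encodes a list of (position, character) insertions. Several details are assumptions that this document does not spell out: positions are character indices in the original input, the value passed to `adapt_bias` is the value just encoded, `num_points` is the count of values encoded so far, and `first_time` is true only for the first adaptation. Under these assumptions the function (a hypothetical `encode_insertions`) reproduces the insertion suffixes of the test vectors in this document.

```rust
const BASE: u32 = 32;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;
const SKEW: u32 = 38;
const DAMP: u32 = 700;
const INITIAL_BIAS: u32 = 72;

fn threshold(k: u32, bias: u32) -> u32 {
    if k <= bias + T_MIN {
        T_MIN
    } else if k >= bias + T_MAX {
        T_MAX
    } else {
        k - bias
    }
}

fn encode_digit(d: u32) -> char {
    if d < 26 {
        (b'a' + d as u8) as char
    } else {
        (b'0' + (d - 26) as u8) as char
    }
}

fn encode_varint(mut value: u32, bias: u32, out: &mut String) {
    let mut k = BASE;
    loop {
        let t = threshold(k, bias);
        if value < t {
            out.push(encode_digit(value));
            break;
        }
        out.push(encode_digit(t + (value - t) % (BASE - t)));
        value = (value - t) / (BASE - t);
        k += BASE;
    }
}

fn adapt_bias(mut delta: u32, num_points: u32, first_time: bool) -> u32 {
    delta = if first_time { delta / DAMP } else { delta / 2 };
    delta += delta / num_points;
    let mut k = 0;
    while delta > ((BASE - T_MIN) * T_MAX) / 2 {
        delta /= BASE - T_MIN;
        k += BASE;
    }
    k + ((BASE - T_MIN + 1) * delta) / (delta + SKEW)
}

/// Encode insertions given as (position, char), sorted by strictly
/// increasing position.
fn encode_insertions(insertions: &[(usize, char)]) -> String {
    let mut out = String::new();
    let mut bias = INITIAL_BIAS;
    let mut num_points = 0; // assumption: values encoded so far
    let mut prev: Option<usize> = None;
    for &(pos, ch) in insertions {
        // First insertion: absolute position; later ones: gap minus one.
        let gap = match prev {
            None => pos,
            Some(p) => pos - p - 1,
        };
        for value in [gap as u32, ch as u32] {
            encode_varint(value, bias, &mut out);
            num_points += 1;
            bias = adapt_bias(value, num_points, num_points == 1);
        }
        prev = Some(pos);
    }
    out
}
```

For `hello world` the only insertion is a space at position 5, so `encode_insertions(&[(5, ' ')])` produces `fa0b`: the position 5 becomes `fa` at the initial bias, the bias adapts to 0, and the codepoint 32 becomes `0b`.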

Collision Handling

Prefix Collision (_N_...)

Strings starting with _N_ are always encoded, even if otherwise valid XID. This prevents ambiguity between literal _N_test and an encoded string.

encode("_N_test") → "_N__N_test" (not "_N_test")
decode("_N__N_test") → "_N_test"

Note: If a string starting with _N_ happens to be a valid Namecode encoding of some other string, the idempotency check will return it unchanged. This is intentional for the encode-encode idempotency property.

Double Underscore Passthrough

Strings containing __ but NOT starting with _N_ pass through unchanged:

encode("foo__bar") → "foo__bar" (valid XID, passes through)
encode("__") → "__" (valid XID: underscore followed by XID_Continue)

The __ delimiter only has meaning after the _N_ prefix, so these strings cannot be confused with encoded strings.

Basic Portion Constraints

When constructing the basic portion during encoding:

  1. No trailing underscores: If non-basic characters exist, trailing underscores in basic are moved to non-basic to avoid ___ ambiguity with the delimiter.

  2. No consecutive underscores: Underscores that would become consecutive in the basic portion (after removing non-basic characters) are moved to non-basic.

encode("test_ ") → "_N_test__..." (trailing _ encoded with space)
encode("__ _x") → "_N__x__..." (middle _ encoded to avoid __ in basic)
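
One way to apply both constraints is sketched below. The function `split_basic` is hypothetical, positions are assumed to be character indices, and `char::is_alphanumeric` stands in for the real XID_Continue property, which the standard library does not provide.

```rust
/// Rough stand-in for XID_Continue (assumption, see the note above).
fn is_xid_continue_approx(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Split input into the basic string and the (position, char) insertions.
fn split_basic(input: &str) -> (String, Vec<(usize, char)>) {
    let mut basic: Vec<(usize, char)> = Vec::new();
    let mut insertions: Vec<(usize, char)> = Vec::new();
    for (pos, c) in input.chars().enumerate() {
        // Constraint 2: an underscore that would directly follow another
        // underscore in the basic portion becomes non-basic.
        let doubles = c == '_' && matches!(basic.last(), Some(&(_, '_')));
        if is_xid_continue_approx(c) && !doubles {
            basic.push((pos, c));
        } else {
            insertions.push((pos, c));
        }
    }
    // Constraint 1: with insertions present, a trailing underscore in the
    // basic portion would touch the "__" delimiter, so move it out.
    if !insertions.is_empty() {
        while let Some(&(pos, '_')) = basic.last() {
            basic.pop();
            insertions.push((pos, '_'));
        }
        insertions.sort_by_key(|&(pos, _)| pos);
    }
    (basic.into_iter().map(|(_, c)| c).collect(), insertions)
}
```

For the input `__ _x` this keeps `_x` as the basic portion and records insertions at positions 1, 2, and 3, which is consistent with the `_N__x__ba3la0ba3l` test vector.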

Error Handling

Decoding can fail with these errors:

| Error | Cause |
|---|---|
| NotEncoded | Input doesn't start with _N_ |
| InvalidDigit(char) | Character in encoded portion not in alphabet |
| UnexpectedEnd | Encoded data truncated mid-varint |
| InvalidCodepoint(u32) | Decoded codepoint not valid Unicode |
| Overflow | Arithmetic overflow during decoding |
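
The table above suggests an error enum along the following lines. This is a sketch: the variant names and payloads come from the table, but the derives, the `Display` messages, and the two helper functions are assumptions made for illustration.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
enum DecodeError {
    NotEncoded,
    InvalidDigit(char),
    UnexpectedEnd,
    InvalidCodepoint(u32),
    Overflow,
}

impl fmt::Display for DecodeError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            DecodeError::NotEncoded => write!(f, "input does not start with _N_"),
            DecodeError::InvalidDigit(c) => write!(f, "invalid digit {:?}", c),
            DecodeError::UnexpectedEnd => write!(f, "encoded data ended mid-value"),
            DecodeError::InvalidCodepoint(cp) => write!(f, "invalid codepoint U+{:04X}", cp),
            DecodeError::Overflow => write!(f, "arithmetic overflow"),
        }
    }
}

/// Hypothetical helper: reject input that lacks the "_N_" prefix.
fn strip_namecode_prefix(input: &str) -> Result<&str, DecodeError> {
    input.strip_prefix("_N_").ok_or(DecodeError::NotEncoded)
}

/// Hypothetical helper: turn a decoded value into a char, rejecting
/// surrogates and values above U+10FFFF.
fn to_char(codepoint: u32) -> Result<char, DecodeError> {
    std::char::from_u32(codepoint).ok_or(DecodeError::InvalidCodepoint(codepoint))
}
```

Here `strip_namecode_prefix("foo")` fails with `NotEncoded`, and `to_char(0xD800)` fails with `InvalidCodepoint(0xD800)` because surrogate values are not valid Unicode scalar values.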

API

Rust

/// Encode a Unicode string into a valid UAX 31 identifier.
pub fn encode(input: &str) -> String;

/// Decode a Namecode string back to Unicode.
pub fn decode(input: &str) -> Result<String, DecodeError>;

/// Check if a string is a valid XID identifier (UAX 31).
pub fn is_xid_identifier(input: &str) -> bool;

Command Line

# Encode a string
namecode encode "hello world"
# Output: _N_helloworld__fa0b

# Decode a string
namecode decode "_N_helloworld__fa0b"
# Output: hello world

# Pipe mode
echo "foo-bar" | namecode encode
cat encoded.txt | namecode decode

Compatibility

Language Support

Namecode output is valid in:

| Language | Identifier Rules | Namecode Compatible |
|---|---|---|
| Rust | UAX 31 | Yes |
| Go | UAX 31 (subset) | Yes |
| Python 3 | UAX 31 | Yes |
| JavaScript | UAX 31 | Yes |
| C/C++ | ASCII + some Unicode | Mostly (ASCII subset always works) |

Version Compatibility

The encoding format is stable. Any string encoded with Namecode 1.0 will decode correctly with future versions.

Future versions may add:

Test Vectors

Passthrough Cases

| Input | Output |
|---|---|
| foo | foo |
| _private | _private |
| café | café |
| 名前 | 名前 |
| CamelCase | CamelCase |

Encoding Cases

| Input | Encoded | Notes |
|---|---|---|
| hello world | _N_helloworld__fa0b | Single space at position 5 |
| foo-bar | _N_foobar__da1d | Single hyphen at position 3 |
| a b c | _N_abc__ba0bb0b | Spaces at positions 1 and 3 |
| 123 | _N_123 | Digits are XID_Continue, no non-basic chars |
| "   " (quoted for visibility) | _N___a0ba0ba0b | Three spaces (all non-basic) |

Edge Cases

| Input | Output | Notes |
|---|---|---|
| "" | "" | Empty string |
| " " (quoted for visibility) | _N___a0b | Single space (non-basic) |
| a | a | Single letter |
| _ | _ | Single underscore (valid XID) |
| _a | _a | Underscore + letter (valid XID) |
| __ | __ | Two underscores (valid XID: _ + XID_Continue) |
| ___ | ___ | Three underscores (valid XID) |
| foo__bar | foo__bar | Valid XID, passes through |
| _N_test | _N__N_test | Prefix collision, no non-basic chars |
| __ _x | _N__x__ba3la0ba3l | Mixed: underscores separated by space |

Security Considerations

References

  1. Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax
  2. RFC 3492: Punycode
  3. RFC 3454: Stringprep