Introducing TJSON:

Tagged JSON with Rich Types

DRAFT: This format is still in a draft state and subject to change!

TJSON (Tagged JSON) is a tagging scheme/microformat for enriching the types that can be stored in JSON documents. It augments the existing types present in JSON, codifying ad hoc practices already commonly used for processing JSON into a schema-free, self-describing format.

TJSON documents are amenable to "content-aware hashing", where different encodings of the same data (including both TJSON and binary formats like Protocol Buffers, MessagePack, BSON, etc.) can share the same content hash and therefore the same cryptographic signature. This is possible with content hash algorithms that are aware of the underlying structure of data, such as Ben Laurie's objecthash.

TJSON supports the following data types:

  • Objects: Name/value dictionaries. The names of objects in TJSON carry a postfix "tag" which acts as a type annotation for the associated value. See the description of Strings below for more information.
  • Arrays: Lists of values: identical to JSON, but typed by their containing objects. Unlike JSON, arrays cannot be used as a top-level expression: only objects are allowed.
  • Sets: Lists of unique values: similar to an array, but repeated elements are disallowed.
  • Strings: TJSON strings are Unicode and always serialized as UTF-8. When used as the name of a member of an object, they carry a mandatory "tag": a self-describing type annotation that provides the type signature of the associated value.
  • Binary Data: First-class support for 8-bit clean binary data, encoded in a variety of formats including hexadecimal (a.k.a. base16), base32, and base64url.
  • Numbers:
    • Integers: TJSON supports the full ranges of both signed and unsigned 64-bit integers by serializing them as strings.
    • Floating points: Floating point numbers in TJSON are identical to JSON, but can always be disambiguated from integers.
  • Timestamps: TJSON has a first-class type for representing date/time timestamp values, serialized as a subset of RFC 3339 (an ISO 8601-alike).
  • Boolean Values: TJSON supports the true and false values from JSON (null is expressly disallowed).

Objects are the only type allowed at the top level of a TJSON document. Many ordinary JSON parsers accept arrays or other types as top-level expressions. This is NOT the case in TJSON, which allows only objects at the top level.
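
The top-level rule can be sketched with Python's standard json module. The function name parse_top_level is hypothetical and is not part of any TJSON library; this is only an illustration of the check, not a full TJSON parser:

```python
import json


def parse_top_level(text):
    """Parse JSON text and insist that the top-level value is an object."""
    value = json.loads(text)
    # json.loads returns a dict only when the document is a JSON object.
    if not isinstance(value, dict):
        raise ValueError("TJSON documents must have an object at the top level")
    return value
```

A document such as ["No toplevel arrays in TJSON!"] is valid JSON, so json.loads accepts it, and the explicit isinstance check is what rejects it.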

Objects in TJSON use the same syntax as JSON, but each member name contains a "tag" which annotates the type of the associated value of the member.

Below is an example of an object whose value is a Unicode String:

{"hello-world:s": "Hello, world!"}

This example consists of an object whose only member is named "hello-world" and whose corresponding value is a UTF-8 string (tag ":s") with the contents "Hello, world!".
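
As a rough illustration, a tagged member name can be split into its name and tag. The helper split_tagged_name below is hypothetical, and it assumes that the tag is everything after the last ":" in the member name, which this document implies but does not state as a rule:

```python
def split_tagged_name(tagged_name):
    """Split a member name such as 'hello-world:s' into (name, tag)."""
    # Assumption: the tag starts after the LAST colon in the member name.
    name, separator, tag = tagged_name.rpartition(":")
    if not separator or not tag:
        raise ValueError("untagged member name: %r" % tagged_name)
    return name, tag
```

For the example above, split_tagged_name("hello-world:s") yields the name "hello-world" and the tag "s", while a name with no tag raises an exception.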

Member names in TJSON must be distinct. The use of the same member name more than once in the same object is an error, regardless of whether the same name is used for the same value, the same type, or multiple different types. TJSON names are single-use only.
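
One way to detect repeated names in Python is the object_pairs_hook parameter of json.loads, which receives every member of an object before duplicates are silently collapsed. The sketch below is an assumption-laden illustration: unique_members and load_strict are hypothetical names, and names are compared with their tags removed, following the reading that "foo:s" and "foo:i" count as the same name:

```python
import json


def unique_members(pairs):
    """Build a dict from (name, value) pairs, rejecting repeated names."""
    seen = set()
    for tagged_name, _value in pairs:
        # Compare names without their tag (assumed to follow the last colon).
        name, separator, _tag = tagged_name.rpartition(":")
        if not separator:
            name = tagged_name
        if name in seen:
            raise ValueError("duplicate member name: %r" % name)
        seen.add(name)
    return dict(pairs)


def load_strict(text):
    """Parse JSON text, applying the duplicate check to every object."""
    return json.loads(text, object_pairs_hook=unique_members)
```

Because the hook runs once per object, the same name may still appear in different objects of one document.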

TJSON uses the case of the first letter of a type's name to distinguish between scalar (single value) and non-scalar (collection) types. A nested TJSON object is identified by a capital letter "O" (NOT the digit zero):

{"hello-object:O": {"hello-string:s": "Hello, world!"}}

Arrays are not allowed as a top-level expression in TJSON. The following is therefore NOT a valid TJSON document:

["No toplevel arrays in TJSON!"]

Arrays MUST first be wrapped in an object, from which they inherit their type information. Arrays are described by an "A" tag (non-scalar types in TJSON are capitalized). However, this tag alone is not sufficient:

{"not-quite-valid:A": ["Hello, world!"]}

To properly tag a TJSON array, you MUST also include the type of its contents in the tag. The following is valid array syntax:

{"valid-array:A<s>": ["Hello, world!"]}

The above syntax describes an array of strings. It might remind you of generic syntax from statically typed programming languages. TJSON contains a tiny type system it uses to verify type annotations.

The syntax may be nested to support multidimensional arrays:

{"nested-array:A<A<s>>": [["Nested"], ["Array!"]]}

Or objects nested within arrays:

{"nested-object:A<O>": [{"nested:s": "object"}]}

The inner type parameter may be omitted for empty arrays:

{"empty-array:A<>": []}
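
The nested tag syntax lends itself to a small recursive parser. The following is a minimal sketch, not a reference implementation: parse_tag and the nested-tuple result are hypothetical, the list of scalar tags is collected from the tags used in this document, and the "S" (set) tag is assumed to take a type parameter in the same way as "A":

```python
# Scalar tags that appear in this document.
SCALAR_TAGS = {"s", "d", "d16", "d32", "d64", "i", "u", "f", "b", "t"}


def parse_tag(tag):
    """Parse a tag such as 'A<A<s>>' into nested tuples: ('A', ('A', 's'))."""
    if tag == "O" or tag in SCALAR_TAGS:
        return tag
    # Collections carry their element type between angle brackets.
    if tag[:2] in ("A<", "S<") and tag.endswith(">"):
        inner = tag[2:-1]
        # An empty parameter (e.g. 'A<>') is recorded as None.
        return (tag[0], parse_tag(inner) if inner else None)
    raise ValueError("invalid tag: %r" % tag)
```

With this sketch, a bare "A" tag is rejected because it lacks a type parameter, matching the "not-quite-valid" example above.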

Sets use a syntax that's nearly identical to arrays, but require that the elements within the set be unique:

{"valid-set:S<s>": ["One", "Two", "Three"]}

Sets containing repeated items are invalid and are rejected by compliant parsers:

{"invalid-set:S<s>": ["One", "One", "One"]}
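
A uniqueness check for sets could look like the sketch below. The function check_set is hypothetical, and it assumes that two elements are "the same" when their JSON serializations (with sorted object keys) are identical; this document does not define equality for set elements, so treat that as an illustrative choice:

```python
import json


def check_set(elements):
    """Return the elements unchanged, or raise if any element repeats."""
    seen = set()
    for element in elements:
        # Serialize so that unhashable values (lists, dicts) can be compared.
        key = json.dumps(element, sort_keys=True)
        if key in seen:
            raise ValueError("repeated set element: %s" % key)
        seen.add(key)
    return elements
```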

As an element of an array, or a member of an object, strings have the same syntax as they do in JSON. But when used as the name of an object member, strings carry a special postfix tag which acts as a type annotation/signature for the value:

{"hello-string:s": "I'm a string!"}

Note that a postfix tag is mandatory for all object member names in TJSON and prevents any ambiguities between tagged and untagged strings. Parsers which encounter untagged names for object members should raise an exception.

Unlike JSON, TJSON strings MUST be encoded as UTF-8. Other Unicode encodings (e.g. UCS-2 as seen in JavaScript) are expressly disallowed. All TJSON documents should be valid UTF-8, and parsers should reject documents that fail to decode as UTF-8.
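
In Python, the UTF-8 requirement can be enforced by decoding the raw bytes strictly before handing them to a JSON parser. The helper decode_document is a hypothetical name used only for this sketch:

```python
def decode_document(raw):
    """Decode raw document bytes as UTF-8, rejecting anything else."""
    try:
        # bytes.decode is strict by default and raises on invalid sequences.
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("TJSON documents must be valid UTF-8") from exc
```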

TJSON supports multiple different formats for encoding 8-bit clean binary data. Conforming encoders/decoders are required to support them all. The default is base64url. However, encoders may be configured with alternative, potentially more visually appealing or well-recognized encodings for specific fields.

Hexadecimal Data (a.k.a. Base16)

Data tagged as "d16" is encoded in lower-case hexadecimal format:

{"hello-base-sixteen:d16": "48656c6c6f2c20776f726c6421"}

TJSON parsers should expressly reject the use of any upper case hexadecimal characters and fail with an exception.
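
A lower-case-only hexadecimal decoder might be sketched as follows. The name decode_d16 is hypothetical; the explicit pattern check is needed because Python's bytes.fromhex accepts upper-case digits on its own:

```python
import re

# Pairs of lower-case hexadecimal digits only.
_LOWER_HEX = re.compile(r"(?:[0-9a-f]{2})*")


def decode_d16(value):
    """Decode a 'd16' value, rejecting upper-case hexadecimal digits."""
    if not _LOWER_HEX.fullmatch(value):
        raise ValueError("invalid d16 value: %r" % value)
    return bytes.fromhex(value)
```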

Base32

Data tagged as "d32" is encoded in "base32" format as specified in RFC 4648:

{"hello-base-thirty-two:d32": "jbswy3dpfqqho33snrscc"}

The encoded data should NOT be padded with "=" characters: it is stored within a quote-delimited string, so its length is known in advance.

TJSON parsers should expressly reject the use of any upper case Base32 characters and fail with an exception.
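
The Base32 rules (lower case, no padding) can be layered on top of Python's base64.b32decode, which itself expects upper-case, padded input. The name decode_d32 is hypothetical; this is a sketch of the idea rather than a vetted decoder:

```python
import base64
import re

# Lower-case RFC 4648 base32 alphabet, with no "=" padding allowed.
_LOWER_B32 = re.compile(r"[a-z2-7]*")


def decode_d32(value):
    """Decode an unpadded, lower-case 'd32' value."""
    if not _LOWER_B32.fullmatch(value):
        raise ValueError("invalid d32 value: %r" % value)
    # Restore the case and padding that base64.b32decode requires.
    padded = value.upper() + "=" * (-len(value) % 8)
    return base64.b32decode(padded)
```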

Base64url

Data tagged as "d64" is encoded in "base64url" format as specified in RFC 4648:

{"hello-base-sixty-four-url:d64": "SGVsbG8sIHdvcmxkIQ"}

The encoded data should NOT be padded with "=" characters: it is stored within a quote-delimited string, so its length is known in advance.

The non-URL-safe variant of Base64 is not supported by TJSON: parsers should reject any value that contains the "+" or "/" characters.

Because "base64url" is the default encoding for TJSON, the shorthand "d" variant SHOULD be used by default unless another format is specified:

{"base-sixty-four-is-default:d": "SGVsbG8sIHdvcmxkIQ"}
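
Unpadded base64url can be handled with Python's base64 module by stripping the padding on output and restoring it on input. The names encode_d64 and decode_d64 are hypothetical, and the sketch below is illustrative only:

```python
import base64
import re

# URL-safe alphabet only: no "+", "/" or "=" characters.
_B64URL = re.compile(r"[A-Za-z0-9_-]*")


def encode_d64(data):
    """Encode bytes as unpadded base64url text."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_d64(value):
    """Decode unpadded base64url text, rejecting the non-URL-safe variant."""
    if not _B64URL.fullmatch(value):
        raise ValueError("invalid d64 value: %r" % value)
    # Restore the padding that urlsafe_b64decode requires.
    padded = value + "=" * (-len(value) % 4)
    return base64.urlsafe_b64decode(padded)
```

The explicit pattern check matters because the standard decoder would otherwise discard or tolerate some characters outside the URL-safe alphabet.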

TJSON supports both integers and floating point numbers in separate formats that can always be disambiguated.

Integers

In TJSON, integers are stored as strings, sidestepping integer precision issues with JSON parsers that do floating point conversions.

The following is an example of a signed integer, which may be any value in the range -(2**63) to (2**63)-1:

{"hello-signed-int:i": "42"}

The following is an example of an unsigned integer, which may be any value in the range 0 to (2**64)-1:

{"hello-unsigned-int:u": "18446744073709551615"}

Integers otherwise utilize the int syntax as described in the JSON specification.
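
The integer rules can be sketched as a syntax check followed by a range check. The names parse_signed and parse_unsigned are hypothetical. The pattern follows the JSON int grammar (optional minus sign, no leading zeros); rejecting a minus sign for unsigned values is an assumption of this sketch:

```python
import re

# JSON "int" syntax: optional minus, then 0 or a number without leading zeros.
_JSON_INT = re.compile(r"-?(?:0|[1-9][0-9]*)")
_JSON_UINT = re.compile(r"0|[1-9][0-9]*")


def parse_signed(value):
    """Parse an 'i' value: a string holding a signed 64-bit integer."""
    if not isinstance(value, str) or not _JSON_INT.fullmatch(value):
        raise ValueError("invalid signed integer: %r" % (value,))
    number = int(value)
    if not -(2**63) <= number <= 2**63 - 1:
        raise ValueError("signed integer out of range: %s" % value)
    return number


def parse_unsigned(value):
    """Parse a 'u' value: a string holding an unsigned 64-bit integer."""
    if not isinstance(value, str) or not _JSON_UINT.fullmatch(value):
        raise ValueError("invalid unsigned integer: %r" % (value,))
    number = int(value)
    if number > 2**64 - 1:
        raise ValueError("unsigned integer out of range: %s" % value)
    return number
```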

Floating Points

Floating points use the native number literal syntax provided by JSON. Unlike integers, TJSON floats must not be quoted:

{"hello-float:f": 0.42}

The full IEEE 754 64-bit floating point range is supported.
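
A check for float values, applied after ordinary JSON parsing, might look like this sketch. The name parse_float is hypothetical. It assumes that a number literal without a fraction (such as 1) is acceptable under the "f" tag, which this document does not state either way:

```python
def parse_float(value):
    """Validate an 'f' value taken from an already parsed JSON document."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("floats must be unquoted number literals: %r" % (value,))
    return float(value)
```

A quoted value such as "0.42" arrives from the JSON parser as a string and is therefore rejected.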

TJSON supports the true and false values from JSON:

{"hello-true:b": true, "hello-false:b": false}

The null value is expressly disallowed anywhere inside of a TJSON document.
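
Since ordinary JSON parsers accept null, a TJSON reader built on one needs an extra pass to reject it. The recursive walk below is a minimal sketch with a hypothetical name, reject_null, operating on the output of Python's json.loads:

```python
def reject_null(value):
    """Return value unchanged, or raise if null (None) appears anywhere in it."""
    if value is None:
        raise ValueError("null is not allowed in TJSON")
    if isinstance(value, dict):
        for member in value.values():
            reject_null(member)
    elif isinstance(value, list):
        for element in value:
            reject_null(element)
    return value
```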

TJSON adds a literal syntax for timestamp values. The format is based on RFC 3339, but the use of the UTC time zone identifier "Z" is mandatory (i.e. all timestamps are Z-normalized):

{"hello-timestamp:t": "2016-10-02T07:31:51Z"}

TJSON parsers should expressly reject the use of other time zone identifiers and fail with an exception.
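
A strict timestamp parser can be sketched with the standard datetime module. The name parse_timestamp is hypothetical. The pattern assumes whole seconds with no fractional part, as in the example above; this document only says the format is a subset of RFC 3339, so that restriction is an assumption:

```python
import re
from datetime import datetime, timezone

# Assumed shape: YYYY-MM-DDTHH:MM:SSZ with a mandatory upper-case "Z".
_TIMESTAMP = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z")


def parse_timestamp(value):
    """Parse a 't' value into a timezone-aware UTC datetime."""
    if not _TIMESTAMP.fullmatch(value):
        raise ValueError("invalid timestamp: %r" % value)
    # strptime raises ValueError for impossible dates such as month 13.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```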