
This is part ten in a series on the 0.3 version of the language spec for the Merg-E Domain Specific Language for the InnuenDo Web 3.0 stack. I'll add more parts to the below list as the spec progresses:
- part 1 : coding style, files, merging, scoping, name resolution and synchronisation
- part 2 : reverse markdown for documentation
- part 3 : Actors and pools.
- part 4 : Semantic locks, blockers, continuation points and hazardous blockers
- part 5 : Semantic lexing, DAGs, prune / ent and alias.
- part 6 : DAGs and DataFrames as only data structures, and inline lambdas for pure compute.
- part 7 : Freezing
- part 8 : Attenuation, decomposition, and membranes
- part 9 : Sensitive data in immutables and future vault support.
- part 10 : Scalars and High Fidelity JSON
- part 11 : Operators, expressions and precedence.
- part 12 : Robust integers and integer bitwidth generic programming
- part 13 : The Merg-E ownership model, capture rules, and the --trustmebro compiler flag.
In this post we are going to look at an important part of the type system: scalars. We will also look at serialization and the special measures we need to take to support the common web serialization format JSON. We are going to define High-Fidelity JSON, and explore how it fits in with the Merg-E language's limited type system.
Strings and binary data in Merg-E
Merg-E has only one string type, but it has two faces. We will get to the two faces later, after we get string literals out of the way, because in Merg-E string literals have a few properties that the programmer might not expect to find in a programming language, and lack some properties that a programmer will likely expect from a programming language.
We already touched a little on string literals in the last section about sensitive data, using lines like:
hazardous string apiKey = "secret_1975718278757971981018830981307557_ACF4";
The rvalue here is the simplest of the three string literal forms we have in Merg-E. The following is totally equivalent:
hazardous string apiKey = """secret_1975718278757971981018830981307557_ACF4""";
as is:
hazardous string apiKey = '''secret_1975718278757971981018830981307557_ACF4''';
Note that, for example:
hazardous string apiKey = 'secret_1975718278757971981018830981307557_ACF4';
is NOT a valid string literal! More on this later.
Now there is the possibility of the b modifier for these expressions, for example:
hazardous string apiKey = b"secret_1975718278757971981018830981307557_ACF4";
Here is where the two faces come in. While both the original without the b and this one are valid, things change when we add non-ASCII characters to the mix.
inert string firstname = "Борис";
is valid, but:
inert string firstname = b"Борис";
is NOT!
A standard string is interpreted as UTF-8, which in Merg-E needs no escape codes because the source file is defined as being UTF-8. A b string literal can contain only ASCII characters, or else it will lead to a compile error. So the latter example will give a compile error.
Now comes the first of multiple likely unexpected things: in Merg-E, there are no escapes in string literals! Rather than escapes we have the three options of delimiting the string, plus the asciichar, utf8char and rawbyte expressions. Let's look at each to explain.
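The rule can be mimicked in Python, which is not Merg-E but has a similar split between text and byte literals. This is only a sketch of the behaviour described above; `check_b_literal` is a hypothetical helper of our own and not part of Merg-E:

```python
def check_b_literal(text: str) -> bytes:
    """Accept text as the body of a b-string only if every character is ASCII."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as err:
        # Stands in for the compile error described for b"Борис".
        raise ValueError(f"non-ASCII character in b-string literal: {text!r}") from err


# An ASCII-only body is fine as a b-string.
ascii_ok = check_b_literal("secret_1975718278757971981018830981307557_ACF4")

# A regular string is stored as UTF-8; five Cyrillic letters take ten bytes.
utf8_bytes = "Борис".encode("utf-8")

try:
    check_b_literal("Борис")
    rejected = False
except ValueError:
    rejected = True
```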
The delimiters ''', b''', """ and b""" are optionally multi-line string literal delimiters. These at least give you the option to use single or double quotes inside of the string literal, and if you need a triple single quote or a triple double quote, you can simply use the other one as a delimiter.
inert string s1 = """I don't know what to do.""";
inert string s2 = """How do I put ''' inside of a string literal?""";
inert string s3 = '''And how about """?''';
inert string s4 = """What if I need "quotes" in the middle of the line?""";
inert string s5 = '''Or maybe at the very end like this: "''';
This gets us on our way. But what, for example, if we want to add a tab (tabs are not allowed in Merg-E source files), or some exotic Unicode character? Let's look at that:
inert string tabdelimited = "start" asciichar 9 "end";
inert string favchar = "My favourite unicode character is :" utf8char 20742 ".";
The first line adds a tab (ASCII character 9), while the second one adds the Chinese character 儆. You may prefer to use hexadecimal values instead, and that is possible, but it complicates the syntax a bit.
inert string tabdelimited = "start" asciichar asbase unsigned 16 "09" "end";
inert string favchar = "My favourite unicode character is :" utf8char asbase unsigned 16 "5106" ".";
To understand this expression we must jump ahead to expression precedence. Expression precedence in Merg-E is like operator precedence in other languages, but because operators in Merg-E are just aliases for expressions anyway, every expression has a precedence number assigned to it, as well as a capture spec. We will get deeper into this in a future post, so we skim over the details here. In short, the asbase expression has a very low precedence number, meaning it has precedence over most other string related expressions. Its capture spec is "zero left, three right", meaning the sub-expression asbase unsigned 16 "5106" that feeds utf8char is resolved before the different string parts are concatenated.
So what does asbase do? It is quite a powerful expression that is very pervasive within Merg-E code. It interprets a base-N encoded string as an integer, where N is the base (16 for hexadecimal notation), and converts it to a still typeless integer. In this case unsigned tells it that the encoded string is to be interpreted as an unsigned number, so the result does not get a minus sign if the first bit happens to be a one.
Supported numbers for the base are:
- 2: binary
- 8: octal
- 16: hexadecimal
- 32: base32
- 58 : base58
- 64 : base64
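To make the hexadecimal examples above concrete, here is a small Python sketch of what we understand `asbase unsigned` to compute. The function name `asbase_unsigned` is our own, and the sketch only covers the positional bases 2, 8 and 16, because the digit alphabets for base32, base58 and base64 are not spelled out in this post:

```python
HEX_DIGITS = "0123456789abcdef"


def asbase_unsigned(base: int, digits: str) -> int:
    """Interpret `digits` as an unsigned number in base 2, 8 or 16."""
    if base not in (2, 8, 16):
        # Assumption: the alphabets for 32, 58 and 64 are left out on purpose.
        raise NotImplementedError(f"no alphabet defined here for base {base}")
    if not digits:
        raise ValueError("empty digit string")
    value = 0
    for ch in digits.lower():
        digit = HEX_DIGITS.find(ch)
        if digit < 0 or digit >= base:
            raise ValueError(f"{ch!r} is not a base-{base} digit")
        value = value * base + digit
    return value


tab = chr(asbase_unsigned(16, "09"))        # the tab from the asciichar example
favchar = chr(asbase_unsigned(16, "5106"))  # the character from the utf8char example
```

Hexadecimal 5106 and decimal 20742 are the same code point, which is why both forms of the utf8char example yield the same string.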
Now for the multi-line strings. Here things get messy, as things start mixing with the strict coding conventions. While what follows is not strictly limited to multi-line strings, it is only idiomatic to use it as such.
Merg-E defines the triple pipe operator as the pre-lexing multiline operator. And while indenting rules are not enforced on continued strings, it is deemed idiomatic to visually align them with the start of the literal on the previous line.
inert string banner = "Welcome to the Merg-E demo server" newline |||
"---------------------------------" newline |||
"What a nice banner it is, isn't it" newline;
Again we see a new expression, newline. This is nothing fancy, just a convenient expression for asciichar 10 or asciichar asbase unsigned 16 "0a", but because we use it so much, a convenient shorthand is added to the language.
So how about the single single quote? Well, in Merg-E that is reserved for character and byte literals. It is important to realize that in Merg-E a single character string and a character are not the same thing, a byte and a character aren't necessarily the same thing, and you can, for example, not compare a string to a single byte or character. But you can combine them in a string expression, just by placing them as tokens behind each other.
inert uint8 mybyte1 = 'F';
inert uint8 tab = asbase unsigned 16 "09";
inert uint8 mynewline = 10;
inert string tabdelimited2 = b"start" tab b"end";
But now things get dangerous. We can make uint8 values that will mess up a normal string:
inert uint8 null = 0;
inert uint8 ff = asbase unsigned 16 "ff";
inert string messedup = b"ok part " ff null ff null ff ff b"other ok part";
But our two faces thing will come and help us, possibly a bit aggressively. A bit of patience, we'll get to that.
Let's look at the alternative, using the type character, which may seem similar to uint8 but isn't quite the same because it is a UTF-8 character.
inert character mychar = 'F';
inert character tab = asbase unsigned 16 "09";
inert character mynewline = 10;
inert string tabdelimited2 = "start" tab "end";
Things look almost the same, and in fact at a binary level they are, but a character is UTF-8 and could be multi-byte (here it happens to be a single byte), while a uint8 is always a single byte.
This expression is valid:
inert character mychar2 = '儆';
Whereas this expression will lead to a compile error:
inert uint8 mychar2 = '儆';
Now for the two faces. A string will hold info on whether it was fully constructed from binary string literals and uint8 bytes, or if it was fully constructed from normal string literals and characters. Mixing these is not possible, but checking if a binary constructed string is utf8 is.
uint1 success = regularstring messedup;
The regularstring expression will never change the actual string, the only thing it does is give back a boolean indicating if the string is utf8 compliant from start to end. If the string was already constructed as a regular string, this operation will be instant. If it was constructed from binary parts, then a full scan is done.
One important thing to note is that if this fails, you can often still use the string as if it was utf8 up until the point where it isn't.
inert uint32 strlen = charlen messedup;
inert uint32 binarylength = bytelen messedup;
inert character c0 = messedup charpos 0;
inert uint8 b0 = messedup bytepos 0;
Note the two faces of the string. The length according to charlen and bytelen can be completely different, especially if, as we did before, we try to interpret a binary constructed string as UTF-8. Likewise, getting a specific byte or character may yield different results at the same index, so it is important not to mix these up. Later, when we get to iterators, we will see more idiomatic ways to deal with this.
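The two faces are easy to see with Python's `bytes` and `str`. This is a sketch of the ideas only: the function names below borrow the Merg-E expression names for readability and are not the Merg-E implementation, and `valid_utf8_prefix` is a hypothetical helper showing what "usable up until the point where it isn't" could mean:

```python
def regularstring(data: bytes) -> bool:
    """Full scan: True only if the bytes are valid UTF-8 from start to end."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def valid_utf8_prefix(data: bytes) -> str:
    """Return the longest leading part of `data` that decodes as UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return data[: err.start].decode("utf-8")


name = "Борис".encode("utf-8")
messedup = b"ok part " + bytes([0xFF, 0x00, 0xFF, 0x00, 0xFF, 0xFF]) + b"other ok part"

char_len = len(name.decode("utf-8"))  # counts characters, like charlen
byte_len = len(name)                  # counts bytes, like bytelen
first_char = name.decode("utf-8")[0]  # character at index 0, like charpos 0
first_byte = name[0]                  # byte at index 0, like bytepos 0
```

For the Cyrillic name the character count is half the byte count, and index 0 gives a whole letter on one face but only the first byte of that letter on the other.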
Integers in Merg-E
Merg-E has a wide set of integer types, both unsigned and signed. Let's start with the signed integers:
- int2
- int4
- int8
- int16
- int32
- int64
- int128
- int256
- int512
- int1024
- int2048
- int4096
- int8192
- int16384
And the unsigned versions
- uint1
- uint2
- uint4
- uint8
- uint16
- uint32
- uint64
- uint128
- uint256
- uint512
- uint1024
- uint2048
- uint4096
- uint8192
- uint16384
So why so many integer types? As a Web 3.0 DSL, cryptography should be considered a first class citizen, and rather big integers are important for that.
On the low end of the spectrum we also see a few unusual ones. We do this because of two special aliases and because of four special values of these two types.
- boolean : uint1
- fourval : int2
Basically, boolean holds a subset of fourval, while many of the language primitives that can be used inline will return a fourval:
- 0: False
- 1: True
- -1: Error
- -2: Undef
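The four values line up exactly with the range of a two-bit signed integer. The Python sketch below assumes two's complement for the signed types, which is what the values -2 to 1 for int2 suggest; `int_range` and `FourVal` are illustrative names of our own:

```python
from enum import IntEnum
from typing import Tuple


def int_range(bits: int, signed: bool) -> Tuple[int, int]:
    """Smallest and largest value of an N-bit integer (two's complement if signed)."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


class FourVal(IntEnum):
    """The four special values listed above, as they fit in an int2."""
    UNDEF = -2
    ERROR = -1
    FALSE = 0
    TRUE = 1


int2_low, int2_high = int_range(2, signed=True)
uint1_low, uint1_high = int_range(1, signed=False)
```

Python integers are arbitrary precision, so the same helper also works for the big end of the list, such as uint16384.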
Integers in HF-JSON
On the web and in Web 3.0, JSON is often used as a serialization format. But JSON is based on JavaScript, whose numbers are 64-bit floats that represent integers exactly only up to 2^53, so numerals in JSON are only guaranteed to have sufficient fidelity to hold the values of Merg-E types up to uint32 and int32. Larger types are not safe in standard JSON when encoded as numbers, and for smaller integers the type information is lost as well.
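The fidelity problem is easy to reproduce. A Python `float` is the same 64-bit double that JavaScript uses for all its numbers, so a round trip through `float` shows what a JavaScript-based JSON consumer would see. The variable names are ours:

```python
# The largest values of some unsigned Merg-E integer types.
uint32_max = 2**32 - 1
uint64_max = 2**64 - 1

# A 64-bit double has a 53-bit significand, so integers are exact up to 2**53.
exact_limit = 2**53

uint32_survives = int(float(uint32_max)) == uint32_max
uint64_survives = int(float(uint64_max)) == uint64_max
first_casualty = int(float(exact_limit + 1))  # rounds back down to 2**53
```

Every uint32 and int32 value survives the round trip, while uint64 values near the top of the range do not, which is the reason for wrapping integers in strings.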
So the first thing we need to do for encoding integers in JSON in order to start defining High Fidelity JSON or HF-JSON, is to define a way to encode numbers that at least for smaller numbers is still human readable.
We do this by encoding all integers in a HF-expression in a JSON string.
- "HF:int{16}1966:f37b"
- "HF:uint{64}42424242:cba0"
This is the same as:
- "HF:int{16,10}1966:f37b"
- "HF:uint{64,10}42424242:cba0"
We start off with an "HF:" header, followed by a type plus encoding base string, then the value itself, and finally a CRC16 checksum to prevent user strings from accidentally getting interpreted as HF-JSON integer values.
This is the simplest, most human readable, base 10 format. But as the size of the numbers grows, we expand on coding options, for example hex, or base64:
- "HF:uint{128,16}0007b37a999972f1644fcd9fffce9278:064d"
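As a rough illustration of the layout, here is a Python sketch that takes such a string apart. Several things in it are our own assumptions rather than part of the spec: the name `parse_hf_int`, the support for bases 2, 8, 10 and 16 only, a leading minus sign for negative values, and above all the fact that the checksum is returned but not verified, since the exact CRC16 variant and the bytes it covers are not given in this post:

```python
import re
from typing import NamedTuple

HF_INT = re.compile(
    r"^HF:(?P<kind>u?int)\{(?P<bits>\d+)(?:,(?P<base>\d+))?\}"
    r"(?P<digits>-?[0-9A-Za-z]+):(?P<crc>[0-9a-f]{4})$"
)


class HFInt(NamedTuple):
    signed: bool
    bits: int
    base: int
    value: int
    crc: str  # kept as text; NOT verified in this sketch


def parse_hf_int(text: str) -> HFInt:
    match = HF_INT.match(text)
    if match is None:
        raise ValueError(f"not an HF integer expression: {text!r}")
    signed = match["kind"] == "int"
    bits = int(match["bits"])
    base = int(match["base"] or 10)  # the base defaults to 10 when omitted
    if base not in (2, 8, 10, 16):
        raise NotImplementedError(f"base {base} is not handled in this sketch")
    digits = match["digits"]
    negative = digits.startswith("-")
    body = digits[1:] if negative else digits
    if any(ch not in "0123456789abcdef"[:base] for ch in body.lower()):
        raise ValueError(f"invalid base-{base} digits: {body!r}")
    value = -int(body, base) if negative else int(body, base)
    # Reject values that do not fit the declared bit width.
    low = -(1 << (bits - 1)) if signed else 0
    high = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {match['kind']}{bits}")
    return HFInt(signed, bits, base, value, match["crc"])


small = parse_hf_int("HF:int{16}1966:f37b")
big = parse_hf_int("HF:uint{128,16}0007b37a999972f1644fcd9fffce9278:064d")
```

An ordinary user string such as "hello" fails the pattern match, and a real implementation would additionally reject strings whose checksum does not match.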
Floating points in Merg-E
Next to integers, Merg-E has a wide range of floating point types:
- float16
- float32
- float64
- float128
- float256
There are situations where Web 3 bots might need high fidelity floats or faster or smaller low fidelity floats, so Merg-E tries to support them all.
Floating point types in HF-JSON
Again, in HF-JSON we will need to encode most float types as a string. There is however one exception, which we only honour when reading HF-JSON, not when generating it, unless we are running explicitly in legacy mode. A standard number in JavaScript is the same as our float64, so any input numeric data in JSON or HF-JSON will be interpreted as float64.
So how about encoding? We do that very similar to integers:
- "HF:float{64}-2028m19998739:b002"
Which in this case is short for:
- "HF:float{64,10,10}-2028m19998739:b002"
The m separates the exponent and the mantissa. It is important to note that the exponent should be encoded either as decimal (base 10) or as hex (base 16), which should not be a problem because even for float256 only 19 bits are used by the exponent.
Dataframe serialization as HF-JSON
Merg-E doesn't handle just arbitrary data structures like JSON can express. It only knows scalars, dataframes, and its own subset of tree-graph-adjacent DAGs. Let's start by looking at dataframes.
In HF-JSON we define two ways to structure a dataframe. The verbose, REST-common, row oriented way is the one where type info is fully incorporated in each and every bit of data:
[
{ "name" : "Rob", "age" : "HF:int{8}55:b4b8"},
{ "name" : "Борис", "age": "HF:int{8}29:cb4f"}
]
And we have the data pipeline oriented column oriented way, where we can encode type info more efficiently and don't need checksums.
{
"HF:DOCTYPE": "COLS",
"name": ["Rob", "Борис"],
"age": {
"HF:TYPE:04c1": "int{8,10,10}",
"HF:VAL:b4a8": ["55", "29"]
}
}
Note that we now use the "HF:" prefix in another way.
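To show that both layouts carry the same data, here is a Python sketch that turns the column oriented document into plain rows. It is only an illustration under our own assumptions: `cols_to_rows` and `split_column` are hypothetical names, typed columns are recognised by the "HF:TYPE:" and "HF:VAL:" key prefixes without checking the suffix, and integer cells are decoded as base 10 only, which matches the type string in the example:

```python
def split_column(column):
    """Return (type_spec, cells) for a typed column, or (None, cells) for a plain list."""
    if isinstance(column, list):
        return None, column
    type_spec = next(v for k, v in column.items() if k.startswith("HF:TYPE:"))
    cells = next(v for k, v in column.items() if k.startswith("HF:VAL:"))
    return type_spec, cells


def cols_to_rows(doc):
    """Convert a column oriented HF-JSON dataframe into a list of row dicts."""
    if doc.get("HF:DOCTYPE") != "COLS":
        raise ValueError("not a column oriented dataframe")
    columns = {}
    for name, column in doc.items():
        if name == "HF:DOCTYPE":
            continue
        type_spec, cells = split_column(column)
        if type_spec is not None and type_spec.startswith(("int{", "uint{")):
            cells = [int(cell) for cell in cells]  # assumption: base 10 cells
        columns[name] = cells
    lengths = {len(cells) for cells in columns.values()}
    if len(lengths) > 1:
        raise ValueError("columns differ in length")
    row_count = lengths.pop() if lengths else 0
    return [{name: cells[i] for name, cells in columns.items()} for i in range(row_count)]


doc = {
    "HF:DOCTYPE": "COLS",
    "name": ["Rob", "Борис"],
    "age": {"HF:TYPE:04c1": "int{8,10,10}", "HF:VAL:b4a8": ["55", "29"]},
}
rows = cols_to_rows(doc)
```

The type information is stated once per column instead of once per cell, which is where the efficiency of the column oriented form comes from.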
To serialize a dataframe into HF-JSON:
inert string ashfjson = hfjsonize mydataframe;
or
inert string ashfjson = verbose hfjsonize mydataframe;
Or in the extreme case for maximum human readability:
inert string ashfjson = prettyprint verbose hfjsonize mydataframe;
Here verbose and the prettyprint are modifiers where verbose switches from the efficient column based dataframe serialization to the row based serialization.
The reverse would be:
inert dataframe newdf = dehfjsonize ashfjson;
DAG node serialization as HF-JSON
A DAG node serialization is a serialization to HF-JSON that follows the same DAG consistency rules that Merg-E as a language does. The most important aspect here is that there are no lists or arrays unless as part of one of the dataframe structures defined above. Deviation from this will stop Merg-E from being able to process the DAG-like JSON.
There is also one specific addition to HF-JSON to allow it to branch away from the standard tree structure that JSON provides: The alias expression. Best to show with an example:
{
"HF:DOCTYPE": "DAG",
"owner": "Борис",
"exports": {
"counter": "HF:int{16}1966:f37b"
},
"cnt": {
"HF:TYPE:04c1": "link{}",
"HF:VAL:b4a8": "exports.counter"
}
}
Note how cnt is an alias for exports.counter.
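A reader of such a document needs to do two things: follow links and refuse lists. The Python sketch below shows both, with the caveat that the helper names are ours and that treating the link value as a dot separated path from the document root is an assumption based on this single example:

```python
def is_link(node) -> bool:
    """True for an object whose HF:TYPE entry says link{}."""
    return isinstance(node, dict) and any(
        key.startswith("HF:TYPE:") and value == "link{}" for key, value in node.items()
    )


def link_target(node) -> str:
    """The path stored in the HF:VAL entry of a link object."""
    return next(value for key, value in node.items() if key.startswith("HF:VAL:"))


def resolve(doc, path: str):
    """Walk a dot separated path starting at the document root."""
    node = doc
    for part in path.split("."):
        node = node[part]
    return node


def contains_list(node) -> bool:
    """Lists are not allowed anywhere in a DAG document."""
    if isinstance(node, list):
        return True
    if isinstance(node, dict):
        return any(contains_list(value) for value in node.values())
    return False


dag = {
    "HF:DOCTYPE": "DAG",
    "owner": "Борис",
    "exports": {"counter": "HF:int{16}1966:f37b"},
    "cnt": {"HF:TYPE:04c1": "link{}", "HF:VAL:b4a8": "exports.counter"},
}
aliased = resolve(dag, link_target(dag["cnt"]))
```

Because the link points at an existing node instead of copying it, two names end up sharing one value, which is what lets the JSON tree express a DAG.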
Again we can serialize:
inert string ashfjson2 = verbose hfjsonize mydag;
and deserialize:
mutable dag mydag2 = dehfjsonize ashfjson2;
Teaser: A DSL inside a DSL, arbitrary HF-JSON and JSON as input
So what if we need to process JSON that isn't structured as HF-JSON, with or without valid HF-JSON data inside it? In the v0.3 version of Merg-E the answer is simple: WE CAN'T. But for future versions we are eyeing something like a much simpler version of JSONata. Basically a small DSL inside our DSL for transforming arbitrarily structured JSON into a Merg-E compliant DAG, without messy non-dataframe lists and other structures that aren't cleanly expressible in Merg-E's DAG and dataframe view of the universe.
We do define an extra HF-JSON member for these HF-JSON files, as a counterpart to the DAG and COLS HF:DOCTYPE field values that indicate a clean Merg-E structure:
{
"HF:DOCTYPE": "MISC",
"HF:SCHEMA" : "do.innuen.wildschema1",
..
}
What this basically states is: this JSON doesn't comply with the DAG requirements of Merg-E, but it follows a unique, documented and named schema that you could write a JSONata-like transformer for.
Teaser: A DSL inside a DSL, arbitrary output from structure.
Just like Merg-E can't parse HF-JSON that isn't strictly in Merg-E style DAG shapes, neither can it generate arbitrarily shaped HF-JSON or standard JSON, or for that matter any other type of output file, including XML and HTML. Just like a DSL inside a DSL for parsing, we will look into including an output transformation DSL too, likely a simpler version of what Jinja2 provides for Python. Again, this is not for the v0.3 of the language spec or runtime, but it's a subject that we need to look into before reaching a Minimal Viable Product version of our runtime.
Coming up
In this post we looked at the scalars part of the Merg-E type system. We looked at strings and bytes and how they are both the same and different. We looked at integer and floating point types that actually go beyond what important serialization formats support, and we looked at High-Fidelity JSON or HF-JSON as a way to serialize both dataframes and DAGs, with an intro on how we can process arbitrarily structured JSON and HF-JSON in a way that fits Merg-E's limited DSL type system.
I'll need at least one more post to talk about parallelism models, expression precedence and captures, iterators, and possibly a few more.