
This is part twelve in a series on the 0.3 version of the language spec for the Merg-E Domain Specific Language for the InnuenDo Web 3.0 stack. I'll add more parts to the below list as the spec progresses:
- part 1 : coding style, files, merging, scoping, name resolution and synchronisation
- part 2 : reverse markdown for documentation
- part 3 : Actors and pools.
- part 4 : Semantic locks, blockers, continuation points and hazardous blockers
- part 5 : Semantic lexing, DAGs, prune / ent and alias.
- part 6 : DAGs and DataFrames as only data structures, and inline lambdas for pure compute.
- part 7 : Freezing
- part 8 : Attenuation, decomposition, and membranes
- part 9 : Sensitive data in immutables and future vault support.
- part 10 : Scalars and High Fidelity JSON
- part 11 : Operators, expressions and precedence.
- part 12 : Robust integers and integer bitwidth generic programming
- part 13 : The Merg-E ownership model, capture rules, and the --trustmebro compiler flag.
In this post we will look deep into the guts of the Merg-E integer type system. Until yesterday when writing down the specs for operator set extension in Merg-E I had no intent to make this part of my design notes part of the language spec as I considered it to be part of the implementation, not of the language, but after a night of sleep , I decided I should extend the language definition just a little bit to expose it to the user, as otherwise operator extension would pretty much be cripled.
The subject this post will be about is integer bitsize generics.
Bitwidth expansion
When we want full type safety, overflows are a problem that in Merg-E we want to prevent, or rather we want to make it auditable.
To illustrate, let's look at the following Merg-E expression:
inert uint16 yeardays = 7 * 4 * 13 + 1;
In the previous post we already discussed how this gets processed by the parser, basicly as:
((7 * 4) * 13) + 1
But the tokens aren't just numbers, they have a type. So using runtime types, trying to be as stingy with bit withs as possible we get:
((<uint4 7> * <uint4 4>) * <uint4 13>) + <uint1 1>
But if we now do the math, purely looking at bit lengths, we know an uint4 times an uint4 will not fit into an uint4, in fact we need an uint8 to safely multiply all possible values.
(<uint8 (7 * 4)> * <uint4 13>) + <uint1 1>
The same for the next one, but now we multiply an uint8 with an uint4, and because uint24 doesn't exist, we need to opt for an uint16
<uint16 (7 * 4 *13)> + <uint1 1>
Now we need to add an uint1, and while that chance is minimal, adding the biggest possible uint1 to the biggest possible uint16 there is a chance the result won't fit into an uint16, We would need an uint17 that again does not exist, so we jump to uint32.
<uint32 (7 * 4 * 13 + 1)>
But let's look at the original line:
inert uint16 yeardays = 7 * 4 * 13 + 1;
We are trying to assign an uint32 value to a uint16, and the parser that isn't doing the actual math yet, this is a big NO, so the user will get a compile error.
Like before we can use the hazardous modifier, this time doing a typecast:
inert uint16 yeardays = hazardous typecast uint16 ( 7 * 4 * 13 + 1 );
but this is horribly ugly for something that so obviously should fit.
Runtime types vs line-parse-time types
We already discussed Merg-E's actual integer types. Merg-E has 27 types of integers, 14 unsigned integer types from uint1 up to uint16384 and 13 signed integer types from int2 up to int16384. These types are real types with runtime representations. But as we will now see, during line parsing we need additional virtual integer types.
Let's add a huge collection of virtual types with inbetween bit values. So rather than uint1,uint2,uint4,uiunt8,... we have a series of types that looks like uint1,uint1,Vuint3,uint4,Vuint5,Vuint6,Vuint7,uint8...
Let's do the exercise from before again:
((<uint4 7> * <uint4 4>) * <uint4 13>) + <uint1 1>
(<uint8 (7 * 4)> * <uint4 13>) + <uint1 1>
<Vuint12 (7 * 4 * 13)> + <uint1 1>
<Vuint13 (7 * 4 * 13 + 1)>
So now we have a virtual 13 bit unsigned integer that fits into a 16 bit unsigned integer very comfortably.
line-parse-time types versus generic integer types
So how does the parser know that a multiplication adds up the bit widths while addition merely adds 1 bit to the biggest of the two widths ? Well, that is where generic integer types and generic integer type math comes in.
To illustrate let's look at our first power function from our previous post:
function power (base uint8 exponent uint8 hashazardous boolean callable<uint2048> return)::{
}{
if hasharardous {
return typecast uint2048 lang.math.power typecast float16 base typecast float16 exponent;
}{
mutable uint8 run = 0;
mutable uint2048 rval = 1
while run < exponent {
rval = rval * base;
run = run + 1;
};
return(rval);
};
This callable uses runtime real integer types, in this case uint8 and uint2048. It's already a type safety nightmare as you may understand, and that is actually the reason why there is no power for integers in the language itself, but let's see what we can do.
function power (base gintN exponent uintN hashazardous boolean callable< base |*| |!| exponent > return)::{
}{
if hasharardous {
return typecast base |*| |!| exponent lang.math.power typecast float16 base typecast float16 exponent;
}{
mutable uint8 run = 0;
mutable base |*| |!| exponent rval = 1
while run < exponent {
rval = rval * base;
run = run + 1;
};
return(rval);
};
We see the exponent has the type uintN denoting this function will try to accept any line-parse-time unsigned type, while the base has the type gintN denoting this function will try to accept any line-parse-time type, either signed or unsigned.
That is all fun and games, but the real trick here isn't the generic integer types themselves, it is the type math they enable.
Merg-E generics type math.
In the previous example we saw the weird expression:
... callable< base |*| |!| exponent > return ...
and later on we see
... typecast base |*| |!| exponent ...
Remember base and exponent have an integer generics type, and by virtue of that the parser will be able to do some crazy weird type math.
The expression:
base |*| |!| exponent
resolves to a line-parse-time type, let's look at the components and see how it works.
The |!| operator is right-capturing only and basically gives back the maximum integer of the uintN type of the expression it refers to.
so imagine the rvalue expression:
inline power(7, 5);
or alternative, using the example from the previous post:
7 ⏻ 5;
this expression will be resolved with an uint4 as parser-time types for both the base and the exponent.
For uint4 the |!| operator will yield the number 15, the maxint of an unsigned 4 bit integer.
So this turns the expression into:
base |*| 15
The |*| operator takes captures both left and right and right token, takes the signedness of the left capture and multiplies the bit width by either the value (if it is a literal or the result of a non generic integer value), or the bitwidth of the right capture (if it's a generics integer type).
With base as 7, but defined as gintN , lacking a minus sign, this will map to uint4. So in the end this resolves to:
Vuint60
Now when this expression ends up being assigned, it should be assignable to an int64 or uint64 equally well.
Next to the |*| operator and the |!|, there is the |+| operator and the |%| operator that respectively represent addition and the size after a modulus operation for what the regular % would result in an off by one issue.
Now if we look at wider usage it becomes really clear why the language itself has no native integer based power facility.
42 ⏻ 5000;
now we have a Vuint6 as base and a Vuint13 as exponent. The
|!| exponent
now expands to 8191, leading to
base |*| 8191
resulting in
Vuint49146
As the biggest language defined integer is uint16384, this will lead to a possibly cryptic compile error, telling the user that the function fingerprint for power doesn't match.
The |%| operator
So far we have seen expansion doing it's work, looking scary because type sizes can grow fast, even with the help of virtual parse-time types. The |%| operator is there to shinking things again when we can.
When you do 17 % 3, the result is 2. The expression N % 3 can only have the values 0,1 and 2.When we use powers of two, which is the most common for our type math, we may want to fit the result of a calculation back into a 8 bit integer and sometimes the modulus is fine.
If we do a
return( rval % 256);
we may want our return type to match the semantics.
For that |%| is almost an unary operation that you can see as a type alias that uses only the signedness of the left capture.
So Vint64 |%| 256 will result in int9 while Vuint64 |%| 256 will result in uint8.
Coming up
In this added-in post we discussed the small but robust integer type system of Merg-E, the role of the parser, and the possibility of limited generic programming in Merg-E when implementing custom operators or when making inlineable functions. While usage of integer type-with generics is pretty niche, I've decided to expose it in the language spec, specifically to make the operator system extension feature more usable.
I'll need at least one more post to talk about parallelism models, iterators, and possibly a few more.