
This is part six in a series on the 0.3 version of the language spec for the Merg-E Domain Specific Language for the InnuenDo Web 3.0 stack. I'll add more parts to the list below as the spec progresses:
- part 1 : coding style, files, merging, scoping, name resolution and synchronisation
- part 2 : reverse markdown for documentation
- part 3 : Actors and pools
- part 4 : Semantic locks, blockers, continuation points and hazardous blockers
- part 5 : Semantic lexing, DAGs, prune / ent and alias
- part 6 : DAGs and DataFrames as only data structures, and inline lambdas for pure compute
- part 7 : Freezing
- part 8 : Attenuation, decomposition, and membranes
- part 9 : Sensitive data in immutables and future vault support
- part 10 : Scalars and High Fidelity JSON
In this post we are going to look deeper into DAGs, and expand on the subject of data structures by looking at the only other data structure construct that Merg-E provides: the DataFrame. Then, finally, we will look at the Merg-E version of the lambda: the inline invocation pattern.
Revisiting DAGs
In the last post we discussed DAGs briefly but not in much depth, nor did we disclose that DAGs (or rather Merg-E's subset of valid DAGs) are more than just the runtime-defined scope tree that we can prune and ent to our heart's delight. Let's revisit that and go a little deeper.
Before we do, we need to get a conflict of terminology out of the way. Merg-E uses botanical terminology when discussing DAGs. DAGs in Merg-E are so-called arborescent DAGs: not quite tree graphs, but similar enough to justify the botanical terminology in the language. As in a botanical context, we consider the root to be at the bottom of any graphical representation of the arborescent DAG, not, as is the convention in computer science and in genealogy, at the top. To avoid confusion, we opt not to speak of up and down, but instead of ancestor and descendant and of parent and child, where the root is the ultimate ancestor of all vertices (aka nodes) in the graph, a parent is the most direct ancestor, and a child is an immediate descendant.
So how exactly do Merg-E's arborescent DAGs differ from generic DAGs? Well, from the view of any single execution context, the core structure of a Merg-E DAG is always single-root and tree-shaped: it is a tree graph with some extra edges (aka arcs or links). But while for generic DAGs these extra edges can go anywhere, for Merg-E the edges are parent-descendant constrained.
Merg-E defines a special type of vertex called an alias that gets a type of forward edge, forward not strictly relative to itself but relative to its parent. Basically, an alias is a vertex with a single outgoing non-tree edge with a restricted target, and the first restriction is that the target vertex needs to be a descendant of the primary parent of the alias vertex.
A second constraint on the outgoing non-tree edge is that if the target is a child of the primary parent of the alias vertex, then that child cannot be an alias vertex itself.
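The two constraints above can be sketched in Python. This is an illustrative model only, not Merg-E itself: the tree-as-dict representation and all names (`alias_edge_ok`, `is_alias`, the example vertices) are assumptions.

```python
def descendants(tree, node):
    """Collect all descendants of `node` in a parent -> children tree."""
    out = set()
    stack = [node]
    while stack:
        for child in tree.get(stack.pop(), []):
            out.add(child)
            stack.append(child)
    return out

def alias_edge_ok(tree, is_alias, alias_node, target):
    """The alias's outgoing non-tree edge must target a descendant of the
    alias's primary parent; if the target is a direct child of that parent,
    the target may not itself be an alias."""
    parent = next(p for p, kids in tree.items() if alias_node in kids)
    if target not in descendants(tree, parent):
        return False
    if target in tree.get(parent, []) and is_alias.get(target, False):
        return False
    return True

# Hypothetical example: lang has a real child Amplify and two alias children.
tree = {"root": ["lang"],
        "lang": ["Amplify", "ent_alias", "graft_alias"],
        "Amplify": ["Amplify.ent"]}
is_alias = {"ent_alias": True, "graft_alias": True}
```

Here `alias_edge_ok(tree, is_alias, "ent_alias", "Amplify.ent")` passes (a deeper descendant of the parent), while targeting `graft_alias` fails the second constraint and targeting `root` fails the first.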
That is as far as static Merg-E DAGs go, but with prune and ent (aka graft), DAGs in Merg-E aren't really static. What happens to an alias when we prune a part of the DAG that has incoming edges from aliases? Well, that depends. A standard hard prune leaves aliases dangling in the parent scope, and will trim any dangling aliases in case of subgraph inheritance by deeper execution contexts. This means the edges are considered to be lexical in nature, giving the user the chance to ent an alternative target for the edge before handover to deeper execution scopes.
Alternatively the user can opt for a soft prune, defensively named a hazardous prune:
hazardous prune scope.imported.lang.Amplify;
We know that lang.Amplify contains the ent primitive, and we also know that both lang.ent and lang.graft are aliases that pointed to lang.Amplify.ent. With a hard prune, these two aliases would now be dangling, but with a soft prune they are not allowed to be. But we need to really know what we are doing, hence the re-use of the hazardous modifier in this context. Same modifier as for hazardous blockers, but with different implications.
When we mark a prune operation as hazardous we ask prune to aggressively restructure the DAG in such a way that, other than the removed edge, all graph invariants are preserved IF IT CAN, and to prune a minimum number of stale aliases if it can't. There are no guarantees that all runtimes, or even subsequent versions of the same Merg-E runtime, will make the same decisions in this or will have the same restructuring intelligence, hence the marking of soft prunes as hazardous.
In this example the semantic lexer will have two options. Either the lang.ent alias will get deleted and be replaced by lang.Amplify.ent and the lang.graft alias will get updated to point to lang.ent, OR the lang.graft alias will get deleted and be replaced by lang.Amplify.ent and the lang.ent alias will get updated to point to lang.graft. Again, no guarantees which one. This is a simple example and users should expect these to work, but as you can imagine, do this a number of times throughout an alias-heavy DAG, and things may become rather non-deterministic and thus hazardous.
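As a rough illustration of that restructuring choice, here is a Python sketch. The dict-of-aliases model and the `hazardous_prune` name are assumptions, not Merg-E internals; a real runtime may promote a different alias, which is exactly why the operation is marked hazardous.

```python
def hazardous_prune(aliases, pruned_target):
    """`aliases` maps alias name -> target. When `pruned_target` is removed,
    promote one alias pointing at it to stand in for the removed vertex, and
    rechain the remaining stale aliases to the promoted one. Which alias gets
    promoted is deliberately runtime-dependent; this sketch just picks the
    first one it finds."""
    stale = [a for a, t in aliases.items() if t == pruned_target]
    if not stale:
        return dict(aliases)
    promoted, *rest = stale
    out = dict(aliases)
    del out[promoted]          # the promoted alias becomes the real vertex
    for a in rest:
        out[a] = promoted      # remaining aliases rechain to it
    return out

aliases = {"lang.ent": "lang.Amplify.ent", "lang.graft": "lang.Amplify.ent"}
after = hazardous_prune(aliases, "lang.Amplify.ent")
# In this sketch lang.ent is promoted and lang.graft rechains to it.
```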
Making your own DAGs
Merg-E is to be a Domain Specific Language for Web 3.0 bots and Web 3.0 Layer-2 nodes; it's not supposed to become a full general purpose language. Because of that, and to keep future feature creep to a minimum, the Merg-E philosophy is to add the least complicated feature that covers most of the use cases, and that goes for the type system too. Just as Merg-E has no classes or objects, it has no structs or dictionaries, and, as you will see next, no lists or tuples either. Instead we choose to elevate DAGs to a more generic tool for structuring data. We already needed DAGs for our ambient authority tree, for keeping the language keyword count down to one (scope), and for the lang DAG that flowed from that, and we already exposed prune and ent to user space. We can just as well go the extra step and allow the user a slightly richer DAG experience that removes the need for structs or dicts in the language.
To use this, we need to remember that all the variables and constants we define initially live under scope.export in our scope DAG. Basically, we can convert any variable we define to a DAG that we can then ent onto another dag. Next to this, we can define a DAG of our own, but note that as we will be growing it, we need to initially make it mutable.
inert int max_prime = 1000;
mutable int counter = 0;
mutable int ok_count = 0;
dag max_prime_dag = prune export.max_prime;
dag counter_dag = prune export.counter;
dag ok_count_dag = prune export.ok_count;
mutable dag mydag;
ent mydag max_prime_dag as max_prime;
ent mydag counter_dag as counter;
ent mydag ok_count_dag as ok_count;
shallow freeze mydag;
Note that the leaf vertices don't use mutable; that is because we don't allow anything to be ented on top of unary typed leaf nodes.
This code is a little verbose, but it is basically equivalent to the following pseudo code, if Merg-E were to have structs (which it doesn't):
mutable dag mydag = struct {
inert int max_prime = 1000;
mutable int counter = 0;
mutable int ok_count = 0;
}
The extra verbosity is a price Merg-E pays to keep the language minimal.
Ignore the shallow freeze for now, we will explain freezing in a future post in this series. For now consider unfrozen hand-constructed DAGs to be the equivalent of a dictionary or map in other languages, while a shallow frozen hand-constructed DAG is the equivalent of a struct.
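To make that dict-versus-struct analogy concrete, here is a loose Python sketch. The class, its method names, and its behaviour are assumptions for illustration, not Merg-E's actual freezing rules (which a later post will cover).

```python
class ShallowFrozenDag:
    """Unfrozen: behaves like a dict (leaves can be ented on or removed).
    Shallow-frozen: behaves like a struct (field set fixed, mutable leaves
    can still be updated)."""

    def __init__(self, **leaves):
        self._leaves = dict(leaves)   # stand-in for ent-ing leaf dags
        self._frozen = False

    def shallow_freeze(self):
        self._frozen = True

    def ent(self, name, value):
        if self._frozen:
            raise RuntimeError("cannot ent onto a shallow-frozen dag")
        self._leaves[name] = value

    def set(self, name, value):
        # Leaf mutation is still allowed after a *shallow* freeze.
        if name not in self._leaves:
            raise KeyError(name)
        self._leaves[name] = value

    def get(self, name):
        return self._leaves[name]

mydag = ShallowFrozenDag(max_prime=1000, counter=0, ok_count=0)
mydag.shallow_freeze()
mydag.set("counter", 1)    # fine: mutating an existing leaf
```

After the shallow freeze, `mydag.ent("extra", 5)` would raise, mirroring how a struct's field set is fixed while its mutable fields are not.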
DataFrame
Where the dag keeps Merg-E from needing structs and dicts or maps, there are still situations where you may want to use collections of things: a list, a set, etc. If we also take into account that Merg-E, when not running in pure actor mode, aims to allow runtimes to implement high parallelism, we will need data structures that allow that and that map onto our callable primitives. For the 0.3 version of the language spec, we opt for what is admittedly an experiment: the assumption that a single collection primitive, the dataframe, can cover most collection use cases.
So first let's see how we construct an empty dataframe:
mutable dataframe myDf int max_prime int counter int ok_count;
Note that while the dataframe is mutable (for now), this only refers to the ability to append new rows; rows, once added, are immutable.
A dataframe is a little odd within the language because, once instantiated in this way, it will act like a lambda (we will introduce lambdas later in this post) with an int64 return value:
myDf(100 0 0);
myDf(100 16 42);
myDf(800 0 1);
freeze myDf;
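The append-as-invocation behaviour can be modelled in Python roughly as follows. This is an assumed sketch only; the class, the arity check, and the choice of row count as the int64 return value are illustrative, not spec.

```python
class DataFrame:
    """A dataframe as a callable that appends one immutable row per
    invocation until it is frozen."""

    def __init__(self, *columns):
        self.columns = columns        # e.g. ("max_prime", "counter", "ok_count")
        self.rows = []
        self.frozen = False

    def __call__(self, *values):
        if self.frozen:
            raise RuntimeError("dataframe is frozen")
        if len(values) != len(self.columns):
            raise ValueError("row arity mismatch")
        self.rows.append(tuple(values))   # rows are immutable once added
        return len(self.rows)             # stand-in for the int64 return value

    def freeze(self):
        self.frozen = True

my_df = DataFrame("max_prime", "counter", "ok_count")
my_df(100, 0, 0)
my_df(100, 16, 42)
my_df(800, 0, 1)
my_df.freeze()
```

Once frozen, any further invocation raises, matching the "extensible with new rows until frozen" rule above.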
For completeness, we consider the parquet data format to be a first-class citizen in the Merg-E language, thus we define this alternate way to create a new, already frozen dataframe:
bytes myparquet = someParquetFile.read();
dataframe myDf myparquet;
prune export.myparquet;
So far so good, but now things get a bit complicated on the parallel-processing side. We will try to keep that under the hood as best we can, hidden inside an internally complicated design pattern.
mutable dataframe inDf string id, int32 startval, int32 endval;
inDf( "foo" 17 99 );
inDf( "bar" 42 1999 );
inDf( "baz" 377 1483 );
freeze inDf;
borrowed mutable dataframe outDf string id, int32 bestval;
reentrant function myFunction (id string startval int32 endval int32 add_row callable<string int32>)::{
ambient.Random.low_entropy_random.int as random_int;
}{
int32 rval = random_int( startval endval );
add_row(id, rval);
};
mutable blocker dfReady;
dfReady += vectorize myFunction inDf outDf;
await_all dfReady;
string myparquet2 = serialize outDf;
Let's walk through this and see if it makes sense.
The first few lines build up a frozen input dataframe.
mutable dataframe inDf string id, int32 startval, int32 endval;
inDf( "foo" 17 99 );
inDf( "bar" 42 1999 );
inDf( "baz" 377 1483 );
freeze inDf;
The next step is to define but not yet fill an output dataframe:
borrowed mutable dataframe outDf string id, int32 bestval;
The output dataframe must be declared as borrowed; otherwise, ownership would be transferred to the vectorized computation, making the result inaccessible to the caller.
Now we need to define a function for processing one row:
reentrant function myFunction (id string startval int32 endval int32 add_row callable<string int32>)::{
ambient.Random.low_entropy_random.int as random_int;
}{
int32 rval = random_int( startval endval );
add_row(id, rval);
};
The important thing to note is that the types of the function arguments are the same as for the input dataframe, but there is one extra argument: a callable with the same function fingerprint as the output dataframe. We shouldn't need to worry about the details here; the important thing is that this callable is not the output dataframe itself, but part of the hidden guts of the design pattern that allows us to vectorize and await. It is provided by the vectorization machinery and represents a controlled way to append rows to the output dataframe.
It is important to note that the reentrant modifier in the function definition allows the function to be invoked in parallel, but this comes at the price that the order of the output rows is not guaranteed. In most cases this shouldn't be that relevant, but if it is, you may need to give up parallel processing in favour of preserved order by removing this modifier from your code.
Now comes the actual hidden magic.
mutable blocker dfReady;
dfReady += vectorize myFunction inDf outDf;
await_all dfReady;
We create a blocker, as we have seen many times before, and we add a vectorize expression to it. This expression does a lot of magic. It wires up our function so that inDf gets fed to it one row at a time, with each row scheduled as a separate task that, depending on the runtime, may run in parallel. Then, when all tasks are completed, the await_all makes our main code resume.
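As a hedged sketch of what the vectorize machinery might do under the hood, here is a Python approximation using a thread pool. All names (`vectorize`, `add_row`, `my_function`) and the list-of-tuples stand-ins for dataframes are assumptions; a real Merg-E runtime may schedule things very differently.

```python
from concurrent.futures import ThreadPoolExecutor
import random
import threading

def vectorize(fn, in_rows, out_rows):
    """Feed each input row to `fn` as its own task, injecting a controlled
    add_row callable for the output. Blocking on every future plays the
    role of await_all."""
    lock = threading.Lock()

    def add_row(*row):                 # the injected output callable
        with lock:
            out_rows.append(row)

    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, *row, add_row) for row in in_rows]
        for f in futures:              # await_all: block until every task is done
            f.result()

def my_function(id_, startval, endval, add_row):
    # One row of work: pick a random value in [startval, endval].
    add_row(id_, random.randint(startval, endval))

in_df = [("foo", 17, 99), ("bar", 42, 1999), ("baz", 377, 1483)]
out_df = []
vectorize(my_function, in_df, out_df)  # output row order is not guaranteed
```

Note that the reentrant parallelism shows up here as the unspecified ordering of `out_df`, exactly the trade-off described above.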
Now, to complete the round trip with respect to parquet as a first-class citizen, we serialize the result.
string myparquet2 = serialize outDf;
While this demonstrates how well the dataframe fits Merg-E's concurrency and scheduling model, whether it is sufficient to do away with lists, sets, and tuples is something that will need to be experienced. But for now, for the 0.3 version of the language spec, it's what we have.
There are some ideas for vertical vectorization, but these are not complete yet, and it is not yet certain whether they will make it into the 0.3 spec or will be added in a future version of the language spec.
Lambdas / inline
In this post we are also going to look at how Merg-E handles pure compute. By default, Merg-E only supports scheduled callables like functions and actors, which have no return values, but in some cases you will want to put some pure compute into a separate function and use the result: something that can return a value without any continuation points or asynchronous scheduling. Because we want to keep the language simple, we prefer patterns and runtime behaviour over extending the language more than is absolutely needed.
As we have seen with vectorize, where we could create a standard function definition with a callable as the last function argument, we will do the same for pure compute functions that we aim to use in a way that yields a return value synchronously. In other programming languages these simple pure-compute functions are usually referred to as lambdas. For semantic clarity, as we will show next, the canonical name for a lambda in Merg-E is inline.
Before we dive in deeper, let's see what an inline invocation looks like:
int x = inline power( 2 , 11);
As noted, this is the same as:
int x = lambda power( 2 , 11);
Though please consider inline to be the idiomatic, canonical variant. The lambda token, just like graft for the canonical ent, is simply a more conventional name that is available for people getting familiar with the language.
The inline notation is preferred because it makes what is happening way more explicit than lambda ever could. Let's dive in and show just why.
function power (base uint8 exponent uint8 return callable<uint2048>)::{
}{
mutable uint8 run = 0;
    mutable uint2048 rval = 1;
while run < exponent {
rval = rval * base;
run = run + 1;
};
return(rval);
};
See that, just as for vectorize, the function has an extra argument, which in this example we name return; note that while idiomatic, the name is technically free to choose. What we are doing is defining a function that can be used under inline conditions using an inline runtime pattern. This rather dumb example calculates base to the power of exponent and then invokes the callable return (again, return here is an idiomatic convention, not a language primitive that can be found anywhere in the lang DAG).
Now the appropriateness of the inline name for lambdas comes in. When we call:
int x = inline power( 2 , 11);
the power function will indeed get called inline, without anything getting scheduled. At the end, the callable return will again get invoked without any chance of anything getting scheduled. That invocation is then used by the runtime to assign the passed-in rval to x.
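The continuation-style mechanics can be mimicked in Python. This is an illustrative sketch under assumptions: the `inline` helper and the list-based result capture are stand-ins for what the runtime does, not its actual implementation.

```python
def power(base, exponent, return_):
    """Pure compute with an injected continuation instead of a return
    statement, mirroring the Merg-E function above."""
    rval = 1
    for _ in range(exponent):
        rval *= base
    return_(rval)

def inline(fn, *args):
    """Run fn synchronously, injecting a callable that captures whatever
    it is invoked with, and hand that value back to the caller."""
    box = []
    fn(*args, box.append)
    return box[0]

x = inline(power, 2, 11)   # x == 2048
```

Nothing here is scheduled or awaited: the function runs to completion on the caller's stack and the injected callable merely smuggles the result back, which is exactly why inline is the apt name.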
This would be it, but there is a tiny bit more to it. Not all function definitions and not all merge/dev combos can be turned into inlinable functions. An inlinable function:
- Cannot capture anything explicitly
- Cannot be defined mutable
- Cannot include any blocker
- Cannot invoke any non-inline callable except for the injected return
- Cannot use ent/graft/alias
- In short, it needs to be pure compute only: no side effects.
Another important detail: it will always be called with all of the original function arguments passed by value.
Coming up
In this post we discussed Merg-E's relatively minimal support for data structures, and how the two non-standard data structures in the 0.3 version of the language spec, dag and dataframe, are experimentally proposed to remove the need for more conventional programming language data structures like struct, map/dict, list, tuple and set. Whether this is sufficient for the DSL that Merg-E aims to become remains to be seen, but right now we are hopeful about the prospect. We also discussed the inline or lambda and how it allows pure compute functions to mimic classical return value patterns.
I'll need a few more posts to talk about things like freezing and attenuation in more detail, parallelism models, capability patterns, and a few other topics.