
This is part nine in a series on the 0.3 version of the language spec for the Merg-E Domain Specific Language for the InnuenDo Web 3.0 stack. I'll add more parts to the below list as the spec progresses:
- part 1 : coding style, files, merging, scoping, name resolution and synchronisation
- part 2 : reverse markdown for documentation
- part 3 : Actors and pools.
- part 4 : Semantic locks, blockers, continuation points and hazardous blockers
- part 5 : Semantic lexing, DAGs, prune / ent and alias.
- part 6 : DAGs and DataFrames as only data structures, and inline lambdas for pure compute.
- part 7 : Freezing
- part 8 : Attenuation, decomposition, and membranes
- part 9 : Sensitive data in immutables and future vault support.
- part 10 : Scalars and High Fidelity JSON
- part 11 : Operators, expressions and precedence.
- part 12 : Robust integers and integer bitwidth generic programming
- part 13 : The Merg-E ownership model, capture rules, and the --trustmebro compiler flag.
In this post we are going to look at explicit vs implicit capturing. The subject of this post only crystallised recently, after I ran into a big question: how to handle programmers putting sensitive data into an immutable. Before I asked myself that question, all immutables (constants) were considered to be inert, to exist without any authority, while all mutable data and by extension all mutable functions were considered to carry authority and thus would need to be captured explicitly rather than implicitly.
In this post we set that mistake straight in a consistent and idiomatic way. In order to do this though, we first need to branch out to a new project idea within the InnuenDo stack.
InnuenDo VaultFS
A long time ago I wrote a set of two simple FUSE user space file-systems named MinorFs. One of the file-systems was a sparse-capability filesystem that with the help of AppArmor offered decent capability based security. The second file-system basically looked at /proc/$PID of the caller, gathered some unique information about the executable and if needed its commandline arguments, and delegated a $HOME and $TMP directory to the process. No other process could access these directories unless the process in question was to delegate access. Not even processes running under the exact same user ID.
Based on the same techniques but simpler, we should be able to make a simple user space filesystem for an InnuenDo VaultFS.
So what should VaultFS do? Well, all it needs to do is keep track of a directory of JSON files, one per InnuenDo stack executable or script. The idea is to have one write-only tool, probably with a password verification task, that can add sensitive strings for an executable or script running as a particular user, including itself (so it can change its own password). The user of the script can then write sensitive strings to a key for a given executable or script, but cannot read them back. Only the executable or script itself can read its values from the VaultFS. We need to work out the details, but this shouldn't be too hard.
To keep things performant, the runtime should implement this as a single "get me my file" at runtime startup. Optionally with a SIGHUP handler.
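To make the "get me my file" idea a bit more concrete, here is a minimal Python sketch of what a runtime could do at startup. Everything in it is an assumption for illustration: the `INNUENDO_VAULT_FILE` variable, the `/vault/me.json` path and the function names are hypothetical, and the real VaultFS layout has not been specified yet.

```python
import json
import os
import signal

# Hypothetical location; the real VaultFS layout is not specified yet.
VAULT_FILE = os.environ.get("INNUENDO_VAULT_FILE", "/vault/me.json")

_secrets = {}


def load_vault(path=VAULT_FILE):
    """Read the whole per-executable JSON file once and cache it in memory."""
    global _secrets
    with open(path, "r", encoding="utf-8") as handle:
        _secrets = json.load(handle)
    return _secrets


def vault_get(key):
    """Look up a key in the in-memory copy; no file access per lookup."""
    return _secrets[key]


def install_reload_handler(path=VAULT_FILE):
    """Optionally re-read the file when the process receives SIGHUP."""
    if hasattr(signal, "SIGHUP"):  # SIGHUP does not exist on every platform
        signal.signal(signal.SIGHUP, lambda signum, frame: load_vault(path))
```

The point of the design is that the file system is touched once at startup (and again only on SIGHUP), so individual lookups are plain in-memory dictionary reads.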
Implicitly inert data types
It is clear that a single boolean isn't sensitive data. It is also very likely that a 40 character string might be. So the size of the data could be seen as a measure of how likely it is to be sensitive. But for your debit card, the PIN code is already sensitive, even if it's just 4 numeric characters with a total information content of about 13.3 bits, less than two random ASCII characters.
As we are trying to be a secure language, we aren't going to put up a high threshold for implicitly inert data. If a two character string already has the capacity to contain sensitive data for some definition of sensitive, then only bool, int8, uint8 and maybe a single character string or bytes value can be implicitly inert. But strings and bytes are variable length, and we don't want weirdness where changing a literal changes the inertness of the constant. As such, in Merg-E a string or bytes is never implicitly inert. We define the following types as implicitly inert:
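The PIN arithmetic can be checked quickly. This small Python snippet assumes 7-bit ASCII for the "two random characters" comparison, which is my reading of the claim rather than something the spec states.

```python
import math

# Four decimal digits: 10^4 equally likely values.
pin_bits = math.log2(10 ** 4)

# Two random 7-bit ASCII characters (assumption: full 128-value alphabet).
two_ascii_bits = 2 * math.log2(128)

print(f"4-digit PIN: {pin_bits:.2f} bits")
print(f"2 ASCII characters: {two_ascii_bits:.2f} bits")
```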
- boolean
- char
- byte
- int8
- uint8
All other types are defined to be implicitly sensitive, including an empty string, because it is the type, not the value, that determines implicit inertness:
bool verbose = True;
char prefix = 'I';
byte bprefix = b'A';
int8 step = 4;
uint8 step2 = 8;
Are all implicitly inert, yet:
string name = "Rob";
string apiKey = "secret_1975718278757971981018830981307557_ACF4";
string empty = "";
bytes data = b"Apple pie";
int64 offset = 2;
are all implicitly sensitive.
Our human eyes will look at this and say, hey, apiKey is probably sensitive, but the other ones surely not. But what if "Apple pie" is a password, or 2 is actually a representation of a 6 digit PIN that just happens to be "000002"? That might be sensitive. So we are not going to try to make our already complex semantic lexer any more complex, and we are going to say that type is enough to choose a sane default.
It is about the type having the structural capacity to carry authority: to hold a secret, sensitive data in the broadest sense, however small.
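The type-only rule is simple enough to sketch in a few lines of Python. This is not how the Merg-E semantic lexer is implemented; the function name and the idea of passing type names as strings are assumptions for illustration.

```python
# Types listed in this post as implicitly inert ("bool" is the spelling
# used in the declarations, "boolean" the one used in the list).
IMPLICITLY_INERT_TYPES = frozenset(
    {"boolean", "bool", "char", "byte", "int8", "uint8"}
)


def is_implicitly_inert(type_name: str) -> bool:
    """Only the declared type matters; the literal value is never inspected."""
    return type_name in IMPLICITLY_INERT_TYPES


for type_name in ("bool", "uint8", "string", "bytes", "int64"):
    label = "inert" if is_implicitly_inert(type_name) else "sensitive"
    print(f"{type_name}: implicitly {label}")
```

Note that an empty string and a 40 character API key get the same answer, because the value never enters the decision.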
The 'inert' modifier
The policy for implicitly sensitive has a lot of false positives. We thus introduce the modifier inert to explicitly define the inert constants as such:
inert string name = "Rob";
string apiKey = "secret_1975718278757971981018830981307557_ACF4";
inert string empty = "";
inert bytes data = b"Apple pie";
inert int64 offset = 2;
Problem solved? Yes and no.
vault expressions
With the Innuendo VaultFS as part of the projected InnuenDo stack and vault expressions as part of the projected Merg-E runtime, we can move the sensitive data into the vault and replace
string apiKey = "secret_1975718278757971981018830981307557_ACF4";
with
string apiKey = ambient.settings.vault.innuendofs "innuendo-api-key";
Or
string apiKey = ambient.vault "innuendo-api-key";
Where ambient.vault is a reassignable alias to ambient.settings.vault.innuendofs.
This should look up the API key in the vault and initialize the constant with it. This usage of a vault expression makes it possible to use sensitive data in constants, but without the problems inherent to using sensitive string literals in our code.
While using the InnuenDo VaultFS would of course be the preferred way to work, there will be setups where you need to use a different, more standard kind of vault. While not a first priority, (far) future versions of the runtime will likely need to support something like HashiCorp Vault, but I'm absolutely no fan of the Vault security model. I'll likely try to see if OpenSigner or Infisical can instead provide the needed level of protection for sensitive data in cloud settings, where InnuenDo VaultFS likely won't be an option unless the user has full management of the VMs the cloud infrastructure runs on, and maybe not even then. These are all long-term extensions that I might never find the time for, so unless you want to contribute to the project and implement it yourself, don't rely on it showing up in early versions of Merg-E. When it does, I hope it can be abstracted with something like:
string apiKey = ambient.settings.vault.infisical "innuendo-api-key";
And then of course there is the situation where you are still developing your code, possibly on a non-Linux system without any access to any vault whatsoever. In that case we need a hack. As we have seen, nodes in the DAG can hold annotations, and the standard ambient subtree will hold annotations for Merg-E's tiny type system, defining an expression like ambient.settings.vault.innuendofs "innuendo-api-key" as sensitive. Alternatively, for non-sensitive settings, we can write things like:
string chain = ambient.settings.env "INNUENDO_PRIMARY_CHAIN";
or
string chain = ambient.settings.dotconf "INNUENDO_PRIMARY_CHAIN";
Both expressions will be implicitly marked as inert, while the vault expressions are marked as sensitive. So if we want to store sensitive data in environment variables or a dotconf file, we will need to be explicit about that. More on that later.
Lexer warnings
So what if we leave the line like it was:
string apiKey = "secret_1975718278757971981018830981307557_ACF4";
Well, there is a better way. The semantic lexer doesn't know if "secret_1975718278757971981018830981307557_ACF4" is inert or if it holds authority, but it does know that it is ambiguous at this point. If it was inert, the user should have said so. If it isn't, then maybe it should be in the vault. So the semantic lexer will log a warning. Probably something like this:
WARNING: Line 328; String literal assigned to implicitly sensitive string. Consider using the Innuendo vault or marking string as inert or hazardous.
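As a rough Python sketch of the decision behind that warning: a literal assigned to a constant gets flagged only when the type is implicitly sensitive and the programmer has not been explicit. The function name and signature are hypothetical, and applying the same rule to non-string types such as int64 is my assumption, not something the spec has settled.

```python
from typing import Optional

IMPLICITLY_INERT_TYPES = frozenset(
    {"boolean", "bool", "char", "byte", "int8", "uint8"}
)


def literal_warning(line: int, modifier: Optional[str], type_name: str) -> Optional[str]:
    """Return a warning for an unmarked literal in an implicitly sensitive constant."""
    if modifier in ("inert", "hazardous"):
        return None  # the programmer was explicit, nothing to report
    if type_name in IMPLICITLY_INERT_TYPES:
        return None  # the type cannot structurally carry a secret
    return (
        f"WARNING: Line {line}; {type_name.capitalize()} literal assigned to "
        f"implicitly sensitive {type_name}. Consider using the Innuendo vault "
        f"or marking {type_name} as inert or hazardous."
    )


print(literal_warning(328, None, "string"))
```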
hazardous
So now the user has a choice: put the sensitive string in the vault, which should be the right option, leave things as they are and accept the lexer warning, or mark the string containing the sensitive data as hazardous. We used the modifier hazardous in two completely different settings before:
- hazardous blockers
- hazardous (soft) pruning
In both cases hazardous denoted the user telling the runtime that yes they knew it wasn't usually the cleanest way to do things, but they knew what they were doing and accepted the risk. And that is basically what is happening here too.
hazardous string apiKey = "secret_1975718278757971981018830981307557_ACF4";
And as we said before, if we knowingly choose to store sensitive data in environment vars or a dotconf file, we need to be explicit too:
hazardous string apiKey = ambient.settings.env "INNUENDO_API_KEY";
Operators
Now what happens if we use operators in an expression? Or for that matter, if we use any binding expression that results in a scalar value? Well, if one of the elements of the combination isn't inert, the resulting expression isn't inert.
The modifier hazardous would be too polluting for the codebase here, as it's not the usage of the sensitive data that is hazardous but the hard coding. So we introduce the modifier sensitive. It is basically the same as the hazardous modifier for immutables, but for when sensitive data is used idiomatically.
uint8 smallprime = 17;
hazardous uint256 bigprime = 2722258935367507707706996859454145691647;
sensitive uint256 pproduct = smallprime * bigprime;
But note that if the resulting type is implicitly inert, that takes precedence.
uint8 mymod = bigprime % smallprime;
Here mymod is inert because uint8 is implicitly inert and type-level inertness dominates expression-level sensitivity, because inert types cannot structurally carry authority, regardless of provenance.
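The two rules together (sensitivity propagates through operands, but an implicitly inert result type wins) can be sketched in Python like this. The function is hypothetical, and treating an expression with only inert operands as inert is my reading of the rule rather than a quoted part of the spec.

```python
from typing import Iterable

IMPLICITLY_INERT_TYPES = frozenset(
    {"boolean", "bool", "char", "byte", "int8", "uint8"}
)


def result_is_inert(result_type: str, operands_inert: Iterable[bool]) -> bool:
    """Decide inertness of an expression result.

    Type-level inertness dominates: an implicitly inert result type is
    always inert. Otherwise one non-inert operand taints the result.
    """
    if result_type in IMPLICITLY_INERT_TYPES:
        return True
    return all(operands_inert)


# smallprime (inert uint8) * bigprime (hazardous uint256) -> uint256
print(result_is_inert("uint256", [True, False]))  # False: must be marked sensitive

# bigprime % smallprime -> uint8
print(result_is_inert("uint8", [False, True]))    # True: the type wins
```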
Note about uint256
You’ll notice we used uint256 in the example above. In Merg-E, integers aren't limited to the standard CPU word sizes. Because cryptography is a first-class citizen in this DSL, we define signed and unsigned integers for all powers of two from uint1 (which boolean is an alias for) up to uint16384. This allows us to handle cryptographic keys and large-field arithmetic natively and safely. More about this when we discuss operator precedence and result typing or when we discuss MERSON, a proposed JSON-alike object serialization format for Merg-E with a JSON compatibility mode that tries to address the fidelity issues of numeric values in JSON.
Callables
So how about callables? Functions, actors, locks? Should these too be treated as sensitive or inert just because they did or did not explicitly capture a sensitive value? After all, when we capture a mutable, the callable transitively needs to be defined as mutable itself. Doesn't the same apply to sensitive? To answer that, we need to look at what the consequence is of being sensitive versus mutable.
| modifier | capture | freeze | lexer action when implicit |
|---|---|---|---|
| inert | implicit | already immutable | None |
| sensitive | explicit | already immutable | Warn if strict warnings enabled |
| hazardous | explicit | already immutable | Warn |
| mutable | explicit | throws | Error |
Given that freezing can be transitive while capturing is supposed to encapsulate, we currently take the standpoint that functions don't need to be transitively declared as sensitive, because there seems to be no reason to disallow the implicit capture of a function that encapsulates sensitive data. That is basically what encapsulation is meant for.
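The capture and lexer-action columns of the table above can be read as a small lookup. The following Python sketch encodes just that; the names `CAPTURE_RULES`, `implicit_capture_action` and the `strict_warnings` parameter are hypothetical stand-ins, not Merg-E compiler options.

```python
# kind -> (required capture style, lexer action when captured implicitly)
CAPTURE_RULES = {
    "inert": ("implicit", "none"),
    "sensitive": ("explicit", "warn-if-strict"),
    "hazardous": ("explicit", "warn"),
    "mutable": ("explicit", "error"),
}


def implicit_capture_action(kind: str, strict_warnings: bool = False) -> str:
    """What the lexer does when a value of this kind is captured implicitly."""
    _capture, action = CAPTURE_RULES[kind]
    if action == "warn-if-strict":
        return "warn" if strict_warnings else "none"
    return action


for kind in CAPTURE_RULES:
    print(kind, "->", implicit_capture_action(kind, strict_warnings=True))
```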
Dataframes and DAGs.
The dataframe is a bit of a weird one when it comes to being inert or sensitive. It should be possible to have at least column level granularity on the sensitivity of a dataframe, possibly even finer. This could allow inline functions to transform a sensitive dataframe, which won't get implicitly captured, into a transformed dataframe without sensitive columns that can be implicitly captured. But we are trying to keep the language small for now, and the price of fine grained tainting seems a bit too high for avoiding explicit capture needs in edge cases.
As such dataframes follow the same rule as scalars. A dataframe built from partially sensitive data will be considered sensitive unless all columns are of a type that is implicitly inert like uint8 or boolean.
For home-grown DAGs the same goes. If one leaf node is sensitive, the entire DAG is sensitive.
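Both rules are "any sensitive part taints the whole". A minimal Python sketch, with a deliberately simplified data model that is my own assumption (a dataframe as a mapping from column name to a sensitive flag, a DAG as nested lists with boolean leaves):

```python
from typing import Dict, List, Union

# A DAG node is either a leaf flag (True = sensitive) or a list of children.
DagNode = Union[bool, List["DagNode"]]


def dataframe_is_sensitive(column_sensitive: Dict[str, bool]) -> bool:
    """A dataframe is sensitive as soon as one column is sensitive."""
    return any(column_sensitive.values())


def dag_is_sensitive(node: DagNode) -> bool:
    """A DAG is sensitive as soon as one leaf node is sensitive."""
    if isinstance(node, bool):
        return node
    return any(dag_is_sensitive(child) for child in node)


print(dataframe_is_sensitive({"flag": False, "api_key": True}))  # True
print(dag_is_sensitive([False, [False, [True]]]))                # True
```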
Coming up
In this post we looked at potentially sensitive data in immutables, and at the how and why of the now planned InnuenDo VaultFS and how it should integrate with the Merg-E runtime. We discussed vault usage, the modifier inert and yet another use for the hazardous modifier.
I'll need at least one more post to talk about parallelism models, and possibly a few more.