Last week, we discussed how data are indexed in our datastore and how we plan to retrieve them using hash tables and expression trees. This week, we will discuss a breakthrough we made with our virtual machine, along with some pitfalls we discovered while implementing our indexing algorithm. First, we will explain the concept of the virtual machine, since we have not yet covered this part of our blockchain.
When we wish to execute a program, we must rely on an interpreter, or “virtual machine,” in simple terms. The job of this interpreter is to execute the program generated from our English-language statements. Recall that a couple of weeks ago, we discussed the differences between compiled and interpreted languages. Interpreted languages are like compiled languages in the sense that they are compiled down to something. In Java, Ethereum’s Solidity, and Python, this “something” is called bytecode; in C#/.NET, it is called Common Intermediate Language (CIL), which is executed by the Common Language Runtime (CLR). The reader might be familiar with the term “Ethereum Virtual Machine” (EVM). When Solidity code is compiled, it is turned into bytecode. The EVM understands this bytecode and executes it, instruction by instruction.
For example, a typical language statement looks like: b = b + 1. For our purposes, this means “add 1 to the value of b and make the result the new value of b.” However, when broken down into a set of low-level instructions, whether in bytecode or assembly, the instructions might look like: “load b from memory into a register; load the constant 1 into another register; add the two values; store the result back into the memory location b was fetched from.”
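Python makes this breakdown easy to see: its standard-library dis module prints the bytecode instructions the interpreter actually executes for a statement like b = b + 1. A minimal illustration (the exact opcode names vary between Python versions; on some versions the addition appears as BINARY_ADD, on newer ones as a generic BINARY_OP):

```python
import dis

def increment(b):
    b = b + 1  # the statement from the example above
    return b

# Print the bytecode the CPython interpreter executes: roughly
# "load b, load the constant 1, add them, store back into b".
dis.dis(increment)
```

The printed listing mirrors the fetch/load/add/store steps described above, one simple instruction per line.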
When we compile a program, these types of instructions are generated. This is done to keep the underlying hardware “simple,” and the most complex of programs are ultimately broken down into simple single-step or “atomic” instructions similar to what we have demonstrated here.
The advantage of generating atomic, simple instructions is that the hardware is optimized to perform them quickly and efficiently. So we rely on the compiler to take a complex, close-to-English, human-readable program and turn it into “machine code”: a set of many, many single-step instructions that are simple for the machine to execute but tedious for a human to follow, because we are used to thinking mostly in abstract terms.
When programs are built using our platform, we go through much the same process: we take the English-language statements, compile them down to “bytecode” for lack of a better term, and then execute them on our virtual machine. We call it a virtual machine because we don’t execute the code directly on the hardware; instead, we wrote software that understands our bytecode, and we use that software to execute programs. This is also why languages like Python are called “interpreted” languages and why the EVM is called a virtual machine. While Python and Solidity code do get compiled in the true sense of the word, they are not executed on the hardware directly; instead, software acts as the architecture and executes the machine code. Running programs this way makes them hardware- and platform-agnostic: as long as a Python interpreter exists for a given operating system, the same Python code can (within reason) run there regardless of the underlying hardware. We write “within reason” because there are exceptions to this, but we will not detail them here for the sake of brevity.
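To make the idea concrete, here is a minimal sketch in Python of a stack-based virtual machine of the kind described above. The opcode names and (opcode, argument) tuple encoding are our own invention for illustration; they are not Sparkster’s actual bytecode format:

```python
def run(bytecode, env):
    """Execute a list of (opcode, argument) pairs against a variable store."""
    stack = []
    for op, arg in bytecode:
        if op == "LOAD":      # push a variable's value onto the stack
            stack.append(env[arg])
        elif op == "CONST":   # push a literal constant
            stack.append(arg)
        elif op == "ADD":     # pop two values, push their sum
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "STORE":   # pop the top value into a variable
            env[arg] = stack.pop()
    return env

# The statement "b = b + 1" compiled to our toy bytecode:
program = [("LOAD", "b"), ("CONST", 1), ("ADD", None), ("STORE", "b")]
env = run(program, {"b": 41})
print(env["b"])  # 42
```

Because the “architecture” here is ordinary software, the same program list runs unmodified on any machine with a Python interpreter, which is exactly the portability argument made above.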
The problem we were facing with our virtual machine was that a program of medium complexity was taking longer to execute than was acceptable to us: we were consistently seeing twenty to twenty-five milliseconds per run. We have since revisited the virtual machine and brought the execution time for the same program down to 0.5 milliseconds. This means we’re able to execute a piece of software of medium complexity in five ten-thousandths (5/10,000) of a second! That’s lightning fast, and it means we’re on track to deliver an extremely fast and efficient alternative to traditional cloud computing. This was our major breakthrough for the week. We are also building other benchmark programs to test our virtual machine on our blockchain in other real-world use cases, and we will continue to optimize the generated instructions to make sure execution runs as quickly as possible.
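For readers who want to reproduce this kind of measurement, a per-run average over many runs with a high-resolution clock is the usual approach. The harness below is a generic sketch (the workload shown is a trivial stand-in, not our actual benchmark program):

```python
import time

def bench(fn, runs=1000):
    """Return the average wall-clock time of fn, in milliseconds per run."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / runs * 1000.0  # milliseconds

# Example: time a trivial placeholder workload.
avg_ms = bench(lambda: sum(range(1000)))
print(f"{avg_ms:.4f} ms per run")
```

Averaging over many runs smooths out scheduler noise, which matters when the quantity being measured is well under a millisecond.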
With respect to our datastore, we have finished storing both document templates and actual documents, along with building indexes for new documents. We are currently working on data retrieval, updates, and deletes.
While completing the task of creating new documents in our blockchain, we asked how indexes will initially be created. While it is true that all documents will have a unique ID field, unless we force-create new indexes, the ID field will be the only indexed field.
We first considered changing our language slightly to include an “indexed” keyword, as in “create ‘test’ as an indexed text field.” The problem with this approach is that the builder of the program would have to explicitly decide which fields should be indexed, and our goal at Sparkster is to never force the builder to think about architecture. Implementing the proposed solution would have violated that principle.
The second solution we proposed was to build indexes automatically based on a heuristic algorithm. We understand that, in this case, creating new indexes will be an expensive process with O(mn) complexity, where m is the number of new fields to index and n is the total number of documents that exist. However, we anticipate that automatic index creation will occur often at first and then become less frequent as more and more queries are executed against the datastore.
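As a sketch of what building one such index costs, the following Python builds a hash index for a single field by scanning every document once; doing this for m fields over n documents is the O(mn) work described above. The document shape and the field names used are illustrative assumptions, not our datastore’s actual schema:

```python
from collections import defaultdict

def build_index(documents, field):
    """Map each value of `field` to the IDs of documents containing it.
    One full scan of n documents: O(n) per field, so O(m*n) for m fields."""
    index = defaultdict(list)
    for doc in documents:
        if field in doc:
            index[doc[field]].append(doc["id"])
    return index

docs = [
    {"id": 1, "city": "Lisbon"},
    {"id": 2, "city": "Austin"},
    {"id": 3, "city": "Lisbon"},
]
print(dict(build_index(docs, "city")))  # {'Lisbon': [1, 3], 'Austin': [2]}
```

Once built, lookups against the index are constant-time hash probes, which is why paying the O(mn) cost up front for frequently queried fields can be worthwhile.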
We are also thinking carefully about the communication protocol we will use. As we mentioned last week, we will not be using HTTP. We considered Server-Sent Events (SSE), but the issue there is that SSE is unidirectional. We also considered raw TCP, but because we might need to communicate with a node from the web browser, this was not a viable option: browsers are restricted from opening raw TCP connections due to security restrictions and sandboxing. So we are evaluating the WebSocket protocol for communication. WebSockets are widely supported in browsers, making them a client-friendly solution; furthermore, WebSocket framing is efficient and lightweight, much like MQTT.
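To illustrate why WebSocket framing is lightweight, here is a sketch that builds a masked client-to-server text frame per RFC 6455. For a short message, the entire framing overhead is six bytes: a 2-byte header plus the 4-byte masking key. This is a hand-rolled illustration of the wire format, not a production client:

```python
import struct

def client_text_frame(payload: bytes, mask_key: bytes) -> bytes:
    """Build a masked WebSocket text frame (RFC 6455, client-to-server)."""
    assert len(mask_key) == 4
    frame = bytearray([0x81])          # FIN=1, opcode=0x1 (text)
    n = len(payload)
    if n < 126:
        frame.append(0x80 | n)         # mask bit + 7-bit payload length
    elif n < 1 << 16:
        frame.append(0x80 | 126)       # 16-bit extended length follows
        frame += struct.pack(">H", n)
    else:
        frame.append(0x80 | 127)       # 64-bit extended length follows
        frame += struct.pack(">Q", n)
    frame += mask_key                  # clients must mask their payloads
    frame += bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    return bytes(frame)

frame = client_text_frame(b"hello", b"\x01\x02\x03\x04")
print(len(frame))  # 11: 6 bytes of framing + 5 bytes of payload
```

Compare that six-byte overhead with the hundreds of bytes of headers a typical HTTP request carries on every round trip; this is the efficiency argument for WebSockets (and, similarly, MQTT) made above.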
This concludes our update to the reader for this week. We will continue next week with another update.