Evaluating non-realtime block streaming in aiohivebot (part 2)


In part 1 we got some unexpected results from a flag that was supposed to make things faster but that ended up benchmarking slower instead.

I took some time to look into the issue, and it turned into an interesting journey that I believe has improved aiohivebot quite a bit. I made a few API changes in the process, some of them potentially breaking changes, although my lib doesn't have any users yet right now, given that signing isn't implemented yet, and it will need to be before aiohivebot can be used in production code in the way it is intended to be used.

By all means, play around with aiohivebot. I'm bumping the version to 0.3, so if you are already using the lib and are worried about the API changes, pin the version to 0.2.9.
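For example, pinning with pip:

pip install aiohivebot==0.2.9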

API updates

Before we move on to the real subject of this blog post, let's take a moment to address the API changes.

  • The roll-back constructor arguments of the BaseBot class have been removed.
  • The BaseBot member vars start, block_counter and processing are now private member vars (as they should have always been).
  • The BaseBot method get_layer_api is renamed to internal_get_layer_api, and just like all other methods starting with internal_, it should not be considered part of the API.
  • From now on, all user-definable methods that are not block-event related have one of the following two forms:
    • monitor_xxx : methods for monitoring important overall system health attributes
    • xxx_hook : methods that adjust the inner workings of the bot, middleware, or L2 node.

Note that the aiohivebot async Python lib has three primary use cases and one secondary one:

  • HIVE bots
  • HIVE middleware
  • New HIVE L2 projects
  • Historic block data streaming (secondary)

In order not to keep repeating this list of four throughout this blog post, I shall (somewhat incorrectly) refer to user code for each of these four use cases as middleware; please keep this in mind when you read the term.

Roll-back has been removed because it moved application business logic into the framework layer in a way that never really fitted that well. If the functionality is needed in your use case, please drop me a response to this post and I'll prioritize fitting in a decent alternative sooner rather than later. In fact, if you are using aiohivebot in any way, please let me know, so I know to be less frivolous with future API updates that are not backwards compatible.

The internal methods might be made private soon, but even if they aren't, they are NOT meant to be used directly by middleware.

Naive sorting and filtering

So on to the source of our problem last week.

When aiohivebot needs to make a JSON-RPC API call, it will try to do this using the best available public API node that supports the given API endpoint at that point in time. Under the hood, there is a client for every public API node that keeps basic API health metrics for the node, most notably the decaying average error rate and the decaying average request latency.

The JSON-RPC API call is made by sorting the clients for the individual API nodes based on these metrics, and then applying a filter to the sorted list. Once we have a sorted, filtered list of clients, the aiohivebot lib will first try sending the request to the first node in the list; if this somehow fails for whatever reason, it moves on to number two, and so on, until a valid response is received for the query, or the list gets exhausted. In case of an exhausted list, the lib will stall for 15 seconds and retry.
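In rough pseudo-Python, the idea looks something like the sketch below. This is just an illustration of the flow described above; the client methods sort_key, usable and request are made-up names, not actual aiohivebot internals.

import asyncio

async def call_with_failover(clients, api, method, params):
    """Illustrative sketch of the sort, filter and failover loop."""
    while True:
        # Sort the per-node clients on their health metrics, then filter
        # out the ones we don't want to use at all right now.
        ranked = sorted(clients, key=lambda client: client.sort_key(api, method))
        candidates = [client for client in ranked if client.usable(api, method)]
        # Try the best node first; on any failure, fall through to the next.
        for client in candidates:
            try:
                return await client.request(api, method, params)
            except Exception:
                continue  # this node failed, move on to the next one
        # The whole list got exhausted: stall for 15 seconds, then retry.
        await asyncio.sleep(15)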

So what was the problem with this that caused the unexpected slowdown? Well, the fastest node at the time, the one maintained by @deathwing, sent back a few 429 rate-limit responses, which is absolutely fine, but to the health metrics an error is an error, so the error rate became non-zero. With the naive sorting, this demoted the fastest node of the bunch halfway down the sorted list, promoting slower nodes to a higher position, to the point that the fastest node wasn't used anymore, even with a parallel top N. @deathwing's node simply got pushed out of the top N, making everything slow down.

Blocks that are too new

While not relevant for the benchmark that used historic block streaming, when regular middleware becomes aware of a new block, there may be a problem processing a block that is, in a sense, too new. The problem is that the block may not yet be available or processed on all nodes. A completely safe way to handle this would be to always stick to the node that provided the block when making consecutive API calls, but that, as we will show, is a bit too rigid. There are situations, though, where this rigidity is needed.

A better default

So instead of the naive sorting and filtering that gave us these performance problems, let's see if we can make a flexible solution with a decent but overridable default.

For now we defined the following policy:

  • anything with an error rate of 66.7% or more gets discarded
  • anything with a latency of more than 30 seconds gets discarded
  • anything with a predicted rate-limit sleep of 30 seconds or more gets discarded.
  • if we are only zero or one block behind, we soft-lock to the node that provided the block if and only if the account_history_api is being queried.
  • for what remains, we sort by whichever value is bigger between the recent latency and the projected average rate-limit latency in the current window, PLUS the predicted rate-limit sleep for the next call.
  • if the error rate R exceeds 20%, we increase the sort-by value X to: X = X + X·R + R

Still not a one size fits all, but quite a decent default for now. This default may change over time, so don't rely on it, and overrule the default to best fit your use case if you feel you understand what you are doing. It's a delicate balance between throughput, latency, and possible race conditions near head.
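To make that concrete, here is roughly what this default could look like when written as the pick_nodes_hook override that we will meet in a moment. This is a sketch of the policy as described, not the actual library code, and it assumes latencies and sleep times are expressed in seconds:

def pick_nodes_hook(self, error_rate, recent_latency, predicted_ratelimit_sleep,
                    projected_ratelimit_window_latency, blocks_behind, api, method,
                    is_block_source):
    """Sketch of the default policy; float('inf') filters a node out."""
    # Discard nodes that are erroring, slow, or about to be rate limited.
    if error_rate >= 0.667:
        return float('inf')
    if recent_latency > 30.0:
        return float('inf')
    if predicted_ratelimit_sleep >= 30.0:
        return float('inf')
    # Near head, soft-lock account_history_api queries to the block source.
    if blocks_behind <= 1 and api == "account_history_api" and is_block_source:
        return 0.0
    # Sort by the bigger of recent latency and projected rate-limit window
    # latency, plus the predicted rate-limit sleep for the next call.
    sort_by = max(recent_latency,
                  projected_ratelimit_window_latency) + predicted_ratelimit_sleep
    # Penalize nodes with an error rate above 20%: X = X + X*R + R
    if error_rate > 0.2:
        sort_by += sort_by * error_rate + error_rate
    return sort_by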

The results

| virtual | order | hours |
| ------- | ----- | ----- |
| false   | false | 615   |
| false   | true  | 608   |
| true    | false | 156   |
| true    | true  | 105   |

In the above table we get a rough overview of the four reruns with the new default sorting and filtering. Each run took about one hour of wall-clock time, and the hours column shows how many hours of chain data got processed in that time. We see that now both flags come at a price, as expected. The price for supporting virtual ops is pretty steep. The price for guaranteed in-order processing of blocks is relatively low with virtual ops disabled, but quite high when combined with virtual op support. It is to be expected that it will be high as well when there are more secondary API calls in a real bot or backend.

Customizable sorting and filtering

So now let's customize. Because the default is just a default and we want the user to have sufficient control, the default is a method of the BaseBot that can be overridden by the user's middleware implementation. For this, we extended the API with the overridable method pick_nodes_hook. This method gets a wide range of arguments that can be used both to sort and to filter. Let us make a little bot to demonstrate, one with the simplest of sorting and filtering policies.

import datetime
import time

from aiohivebot import BaseBot

class MyBot(BaseBot):
    """Example of an aiohivebot python bot without real utility"""
    def __init__(self):
        jan2024 = 81520247  # block number from early January 2024
        super().__init__(jan2024, use_virtual=True, maintain_order=False)
        self.count = 0
        self.start_time = time.time()
        self.total_caught_up = datetime.timedelta(seconds=0)

    async def monitor_block_rate(self, rate, behind_head, behind_irreversable):
        """Called about once a minute with the current block processing rate."""
        self.count += 1
        # One block equals three seconds of chain time.
        behind = datetime.timedelta(seconds=behind_irreversable * 3)
        if rate * 3 - 1 < 0:
            # Less than one chain-second processed per wall-clock second:
            # we fell further behind this minute.
            negative_caught_up = datetime.timedelta(seconds=(1 - rate * 3) * 60)
            print("block rate =", rate, "blocks/second, behind =", behind,
                  "fell behind =", negative_caught_up)
            self.total_caught_up -= negative_caught_up
        else:
            caught_up = datetime.timedelta(seconds=(rate * 3 - 1) * 60)
            print("block rate =", rate, "blocks/second, behind =", behind,
                  "caught up =", caught_up)
            self.total_caught_up += caught_up
        if self.count == 60:
            # After an hour, report the total and stop the bot.
            print("Caught up", self.total_caught_up,
                  "in", datetime.timedelta(seconds=time.time() - self.start_time))
            self.abort()

    def pick_nodes_hook(self, error_rate, recent_latency, predicted_ratelimit_sleep,
                        projected_ratelimit_window_latency, blocks_behind, api, method,
                        is_block_source):
        """Simplest possible policy: only ever use the block source node."""
        # prefer the block source, filter out everything else
        if is_block_source:
            return 0.0
        return float('inf')

Basically this is the same code we used before for processing data for an hour with virtual operations enabled, but we added a method for tuning the sorting and filtering of the JSON-RPC nodes we request the data from. The method provides us with eight arguments, but in this case we use only one.

But lets look at each of the arguments we may use:

  • error_rate: The moving average error rate of the node.
  • recent_latency: The moving average of the node's request/response latency.
  • predicted_ratelimit_sleep: The predicted rate-limit sleep if we were to use this node right now.
  • projected_ratelimit_window_latency: The projected average rate-limit-constrained latency for the remainder of the current rate-limit window.
  • blocks_behind: The number of blocks that we are currently behind on head.
  • api: The name of the API that we are about to use.
  • method: The specific method that we are about to use.
  • is_block_source: Boolean indicating whether the node is the source of the current block.

This method returns a single number for the framework to use in sorting. If the number returned is positive infinity, then instead of being sorted, that node is simply filtered out.
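As another illustration of these semantics, a hypothetical policy that avoids any node with recent errors and simply prefers the lowest latency could look like this (again just a sketch):

    def pick_nodes_hook(self, error_rate, recent_latency, predicted_ratelimit_sleep,
                        projected_ratelimit_window_latency, blocks_behind, api, method,
                        is_block_source):
        """Drop nodes with any recent errors, sort the rest on latency."""
        if error_rate > 0.0:
            return float('inf')  # filter this node out completely
        return recent_latency    # lower latency sorts first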

| strategy | hours |
| -------- | ----- |
| default  | 156   |
| sticky   | 114   |

We see that the total processing is a bit slower, and again, with this particular strategy, that was to be expected.

Stall monitoring

Especially when writing our own pick_nodes_hook, there is a probability that we may end up with stalling tasks. To monitor this, I've added another monitor hook. Let's add it to our bot.

async def monitor_stall(self,
                        block_source_node,
                        tried_nodes,
                        api,
                        method,
                        behind_blocks,
                        stall_count,
                        stall_time,
                        exceptions,
                        noresponses):
    print("STALL:", block_source_node, stall_time,
          "stalled", stall_count, "consecutive times,",
          "nodes tried:", tried_nodes, api, method, exceptions, ";", noresponses)

Let's just see what happens when we run it for a bit. These messages are quite big, so let's look at the last one from our run.

STALL: rpc.mahdiyari.info 3589.0828263759613 stalled 239 consecutive times, nodes tried: ['rpc.mahdiyari.info'] account_history_api get_ops_in_block [JsonRpcError('JsonRPC error') : {'code': -32002, 'message': 'Assert Exception:api_itr != data._registered_apis.end(): Could not find API account_history_api', 'data': {'code': 10, 'name': 'assert_exception', 'message': 'Assert Exception', 'stack': [{'context': {'level': 'error', 'file': 'json_rpc_plugin.cpp', 'line': 333, 'method': 'find_api_method', 'hostname': '', 'thread_name': 'th_85', 'timestamp': '2024-06-24T18:38:42'}, 'format': 'Could not find API ${api}', 'data': {'api': 'account_history_api'}}], 'extension': {'assertion_expression': 'api_itr != data._registered_apis.end()'}}}] ; []

This is interesting and might be indicative of a bug. The framework has two ways to probe whether a node supports an API: by querying the jsonrpc endpoint if it exists, and by probing with a test query. Apparently something went wrong there, as aiohivebot seems convinced that @mahdiyari's node can do account_history_api while it can't.

Node status monitoring

This one isn't new, but for some additional info on how the different nodes are doing, we can add an extra callback that gets the node status once every 15 minutes or so.

async def monitor_node_status(self, node_uri, error_percentage, latency, ok_rate,
                              error_rate, block_rate, rate_limit_status,
                              detailed_error_status):
    print("NODE:", node_uri)
    print(" * errors:", int(100 * error_percentage) / 100, "%")
    print(" * latency:", latency, "ms")
    print(" * rates:")
    print("   - ok      :", ok_rate)
    print("   - error   :", error_rate)
    print("   - block   :", block_rate)
    print("  # rate limit status")
    print("   ", rate_limit_status)
    print("  # error details")
    print("   ", detailed_error_status)
    print("###############################################################")
    print()

Well, that is disappointing: the stall is keeping monitor_node_status from being called for the stalling node in question, so it is not going to help us this time. I'll need to find a different way to find out if this is an aiohivebot bug. But to give an idea, here is some info about another node.

NODE: api.openhive.network
 * errors: 16.0 %
 * latency: 5572.452416876865 ms
 * rates:
   - ok      : 115.85070139304625
   - error   : 9.600000011656011
   - block   : 0.0
  # rate limit status
    {'count': 1760, 'uses_ratelimit_header': False, 'used_429': True,
     'used_retry_header': False, 'total_ratelimit_sleeptime': 227.85640659332273}
  # error details
    {'condenser_api.get_dynamic_global_properties': 
                  {'http-error': 0, 'ssl-error': 0, 'code': {200: 156}}, 
     'block_api.get_block_range': 
                  {'http-error': 0, 'ssl-error': 0, 'code': {200: 48}}, 
     'account_history_api.get_ops_in_block': 
                  {'http-error': 0, 'ssl-error': 0, 'code': {200: 1542, 429: 14}}}

In this sample we see rate limiting at work for the openhive.network node.

Getting the latest version

I just pushed version 0.3 of aiohivebot to PyPI.
You can install it simply by typing

pip install aiohivebot -U

Support and merchandise

If you appreciate my work on aiohivebot, or on the coinzdense, aiohivebot and hive-archeology-bot projects as a whole, I would very much like to work more on these projects, and I could really use your support to allow myself to allocate more time to them. I've got (normal and legal) bills to pay, so I often need to choose activities that make me money over activities I'm passionate about, but with your support the second can become the first again.

If you appreciate my work on these projects, please:

If you have your own projects and are in need of some Python and/or C++ expertise, or have other skill sets to contribute to my projects, that's very interesting too. We can support each other's projects with an hour swap.

While I prefer to work on these projects, if instead of supporting my open source projects you are in need of my skills, I'm available for short contracts with a discount if the project I'll be working on will be open source.
