Benchmarking hivemind API calls after switching to postgREST server

in HiveDevs2 months ago

blocktrades update.png

I'll make a full report on our recent work soon, probably next week, but in the meantime I wanted to share the performance improvement we're seeing after switching hivemind from the python based server to postgREST.

We've had several devs working on this task for a few months now; it was a big job as there was a lot of python code in the "server" portion of hivemind that needed to be replaced by SQL code in order for us to switch to using postgREST as hivemind's web server.

But the performance improvements seem well worth it: of the API calls we've benchmarked, we're generally seeing speedups between 3-4x faster on average. We had a few regressions where we still need to tweak the SQL code to get optimal performance (e.g. get_discussions_by_created, get_following), but for our very first benchmark of the old and new code, I'm quite pleased.

Here's the old/new benchmark data below. The median column shows the median time for the API call to complete and is probably one of the most useful figures of merit for comparing the two servers.

Generally speaking, I think the most noticeable speedup will be for "quick" API calls, due to the reduced fixed-overhead latency of postgREST. But we can see cases where just the replacement of python code with equivalently-functioning SQL has resulted in good speedups as well.

Old Python-based Server:

EndpointMax [ms]Min [ms]Average [ms]Median [ms]
condenser_api.get_discussions_by_blog731111523992399
bridge.get_account_posts6237551011897
bridge.get_discussion1579331610206
bridge.get_ranked_posts2511226893862
condenser_api.get_discussions_by_comments8835384430
condenser_api.get_followers387218563
condenser_api.get_discussions_by_created259049562255
bridge.get_profile1102349574531
condenser_api.get_discussions_by_feed2206141118831915
condenser_api.get_blog4482107032993564
condenser_api.get_following362167221201

New PostgREST Server:

EndpointMax [ms]Min [ms]Average [ms]Median [ms]
condenser_api.get_discussions_by_blog285866638669
bridge.get_account_posts54029344252
bridge.get_discussion367361270550
bridge.get_ranked_posts5041106431233
condenser_api.get_discussions_by_comments2771102111
condenser_api.get_followers121114819
condenser_api.get_discussions_by_created281373023521266
bridge.get_profile1079308559455
condenser_api.get_discussions_by_feed722479599588
condenser_api.get_blog1376323885914
condenser_api.get_following135224610531211

Next up, production testing

Note these are results are from a synthetic benchmark we've created, not from real world traffic testing.

After we fix the API calls with performance regressions, we'll setup a server with the optimized code and direct a copy of production traffic to it as a final test.

Sort:  

Those are some nice improvements. I notice a couple got slower, but what tends to matter is how often each is called. We need Hive to scale well as it grows. I am impressed with how small the machines some witnesses use are. We need to publicise more that Hive does not require energy hungry server farms.

I'm not really involved in the publicizing side of Hive, but it does sound like a good idea.

You don't have to. Regular non-developer folks that love HIVE would also love to get the word out. We just need to know what to say :) I get lots of info from your interviews with Taskmaster4450, because you explain it in simple terms.

Good to see tech improvements all the time - the chain is excellent as far as i can judge froma tech point but honestly on this point here you should be involved as probably the main guy on the chain - even community driven we need cornerstones and key speakers even decentralized :-)

I'm not really involved in the publicizing side of Hive, but it does sound like a good idea.

Pretty good point @steevc - but we know since ages that we as a chain are faster, maybe safer as other chains and the way we can nae our wallet (not just numbers) is also a driver for marekting (but no knows or cares about it) - at least in DPoS area - the DPoS area is maybe an issue and that we are not on the list of big investors (not only as there is this other Hive chain).

At least i see constant tech improvements - thx @blocktrades and the marketing is in the best hand by @guiltyparties anyway - and now i shut my mouth.

Thank you very much for the outstanding work of your team. I believe that the deployment of hivemind based on postgREST will bring great improvements to the front-end site and various service programs.

In addition, I guess switching from Python code to SQL code should greatly improve maintainability, right?

Yes, it'll be easier to optimize the queries. Previously, performance issues could be in either the python or the SQL portion of the code. Now that it is all in the SQL code, we can just do explain on the queries and fully analyze the reasons for the performance.

switching from Python code to SQL code should greatly improve maintainability, right?

I think this works only if you know SQL at blocktrades level. For me, most of the SQL blackmagic in the current state is not easy-to-read :)

As far as I recall, a lot of the conversion of the mixed Python/SQL code was done by devs that certainly aren't SQL experts (although it was then reviewed by more experienced devs so that improvements could be suggested to the new SQL code), so either the code wasn't that difficult to understand, or at least the Python parts of it wasn't.

But in general, if you run into an SQL query that is difficult to understand (for sure some of them in hivemind are), I strongly recommend using an AI to comment the code for you, it does a good job.

Impressive. I am particularly interested in the bridge.get_account_posts speeds. Hopefully, this will allow for faster sorting of posts in different orders. It would be great to have the option to sort from First to latest.

Having all the code in pure SQL versus a mix of Python and SQL will definitely make it easier for us to optimize the API calls further in the future.

Nice, this speedup is more than I would have expected. I really enjoy working with postgREST and the many benefits it brings - performance, security, simplification, flexibility.

Do you plan to use the postgREST native filtering via URI params? It can save a tremendous amount of work since each endpoint can have arbitrary filtering without needing to manually write-up the validation and logic for each parameter. I'm sure you already know this. But if you decide to use it, you have to follow particular steps in order for your function to be inlined and to have good performance: check this Q&A with the postgREST devs.

Regardless of whether you go for the filtering or not, since you are using plpgsql functions for the endpoints (just like I was), you can get substantial performance improvements just from switching to pure SQL and making the functions STABLE.

I wasn't aware that such filters could be used on anything except exposed tables and views, so it was interesting to learn it could be applied to functions, thanks for the link.

We'll experiment more with such features moving forward as we move away from the json RPC API to the new REST-based API. We haven't even started creating a REST API for hivemind yet, because first we wanted to replace the python server to postgREST.

Well, then maybe this is a good opportunity to share some experience I've had creating a REST API with postgREST, in case any of it is useful. Here are some endpoints for Hive Open Polls:

And so on. None of this needed to be manually coded in the WHERE or ORDER BY.

If you look at the response, you will see there are nested fields. Objects and arrays can be nested however one wishes - you can look at the polls repo if you want to see how I've built them as json.

Polls are just regular posts/comments with some json metadata, so hopefully this is all pretty generic.

This approach should also really enhance security since you don't need to handle any arguments yourself, you can let postgREST do all of it for you. (Of course, if for any reason you go for dynamic SQL, then that's a different matter - I follow this post's recommendations). If for any reason what is offered by the postgREST filtering is not sufficient, you can also have manual arguments. The two can be combined for the same endpoint, no problem. So a single API call can pass your manual argument(s) and on top of it add any of the other filters provided natively by postgREST.

It was worth the effort! That's a significant speedup on most operations, and, you said you'll continue to tweak the SQL code in the case of the few operations that are slower now than with Python code.

Yes, I expect we will have those fixed in the new few days.

Didn't know about that technology. We use PostgreSQL, si knowing about this is pretty interesting. Curious about the connection pooling limits though...

PostgREST is a web server that sits in front of your Postgres database. It has built-in support for a connection pool with configurable number of maximum connections: https://docs.postgrest.org/en/v12/references/connection_pool.html

PIZZA!
The Hive.Pizza team manually curated this post.

$PIZZA slices delivered:
(3/10) @danzocal tipped @blocktrades

You can now send $PIZZA tips in Discord via tip.cc!

Good evening, my friend, I wanted to ask you for some support, we are submitting a proposal to the dhf and I would like to see if you could help me, thank you in advance and God bless you.

Python-based Server

29dcade3-9648-498b-b12c-a39b822f185e.jpg

PostgREST Server

1cc758e6-9b41-4f59-b212-39434648bc3e.jpg


Improvement In Average Response Times For Specific Endpoints

image.png

Loading...