Loading real-time streaming data into Druid (part 4 of Druid tutorial)

in #utopian-io, 7 years ago (edited)

This is part four in a series of tutorials about Druid, a high-performance, scalable, distributed time-series datastore. In this part we will learn how to load real-time streaming data into Druid with Tranquility.

This tutorial, like the previous parts, expects the reader to have basic knowledge of system administration and some experience working on the command line. If you don't yet have a local Druid instance running, or don't have the sample dataset loaded into the database, please refer to the previous parts of this tutorial.

Your questions and suggestions are welcome in the comment section, as always. Examples are tested on macOS, but should work without changes on Linux as well. Windows users are advised to use a virtual machine running Linux, since Druid does not support Windows.

About Tranquility

Tranquility is a streaming data ingestion tool for the Druid database. It is an official product, developed and maintained by the same team as Druid itself. This utility fits right in with the modern data engineering stack: it works out of the box with Kafka, Samza, Spark, Storm, Trident and others. It also provides a simple HTTP API for loading real-time data from other applications.

Tranquility helps in writing streaming applications by handling administrative tasks for you:

  • creation of indexing tasks
  • partitioning and replication
  • service discovery
  • schema rollover

For the purpose of this tutorial we will use Tranquility's HTTP API for stream upload.

Tranquility server config

The Druid distribution comes bundled with a sample data generator and a sample config file for the Tranquility server. The sample Tranquility config is located in druid-0.11.0/conf-quickstart/tranquility/server.json. The whole file is just 74 lines of JSON, and is somewhat similar in format to the batch task configuration from the Loading data into Druid database tutorial. It's important to understand a few key directives:

  • dataSources can be a single JSON object whose keys are datasource names and whose values are datasource configuration objects (as is the case in the example config), or an array with datasource configuration objects as items.
  • properties contains global Tranquility properties
  • dataSources.<datasource-name>.properties allows you to override global properties for each data source

There are many properties available for configuration in Tranquility, but most of them are specific to one ingestion method. The server.json config file uses seven properties:

  • zookeeper.connect - address of zookeeper instance (required)
  • druid.discovery.curator.path - service discovery path for Tranquility's internal Apache Curator
  • druid.selectors.indexing.serviceName - service name of Druid's overlord node
  • http.port - server port (global only)
  • http.threads - how many threads to use for http handling (global only)
  • task.partitions - how many Druid partitions to create for task
  • task.replicants - how many replicants to create for data in Druid.
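As a rough illustration of the layout and properties described above, the following Python sketch builds a dictionary with the same top-level shape as server.json and shows how per-datasource properties could take precedence over global ones. The property values, the empty spec placeholder and the effective_properties helper are assumptions made for this example; they are not copied from the bundled file, and this is not Tranquility's own code.

```python
import json

# Outline of a Tranquility server config as described in the text.
# All values are illustrative assumptions; check your own server.json.
config = {
    "dataSources": {
        "metrics": {
            # Placeholder: the real spec holds the schema, parser and metrics.
            "spec": {},
            # Per-datasource overrides.
            "properties": {
                "task.partitions": "1",
                "task.replicants": "1",
            },
        }
    },
    # Global properties shared by every datasource.
    "properties": {
        "zookeeper.connect": "localhost",
        "druid.discovery.curator.path": "/druid/discovery",
        "druid.selectors.indexing.serviceName": "druid/overlord",
        "http.port": "8200",
        "http.threads": "9",
    },
}


def effective_properties(cfg, datasource):
    """Hypothetical helper: merge global properties with per-datasource overrides."""
    merged = dict(cfg["properties"])
    merged.update(cfg["dataSources"][datasource].get("properties", {}))
    return merged


if __name__ == "__main__":
    print(json.dumps(effective_properties(config, "metrics"), indent=2, sort_keys=True))
```

Keeping task.* settings under the datasource and connection settings at the top level mirrors the split between global and per-datasource properties listed above.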

DataSource dimensions are configured to be schemaless (any dimension key will be accepted by Tranquility). Supported metrics are:

  • count
  • value_sum (sum aggregation on value input metric)
  • value_min (min aggregation on value input metric)
  • value_max (max aggregation on value input metric)
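To make the four metrics concrete, here is a small Python sketch of what these aggregations compute when several events share the same timestamp and dimension values. It is a simplified illustration, not Druid's rollup implementation; the rollup function and the sample events are made up for this example.

```python
from collections import OrderedDict


def rollup(events):
    """Group events by timestamp and dimensions, then aggregate `value`."""
    rows = OrderedDict()
    for event in events:
        # Every field except `value` is part of the grouping key,
        # loosely mirroring a schemaless dimension setup.
        key = tuple(sorted((k, v) for k, v in event.items() if k != "value"))
        value = event["value"]
        row = rows.get(key)
        if row is None:
            rows[key] = {
                "count": 1,
                "value_sum": value,
                "value_min": value,
                "value_max": value,
            }
        else:
            row["count"] += 1
            row["value_sum"] += value
            row["value_min"] = min(row["value_min"], value)
            row["value_max"] = max(row["value_max"], value)
    return list(rows.values())


# Hypothetical events shaped like simple web-server latency metrics.
events = [
    {"timestamp": "2018-02-28T21:12:00Z", "server": "www1.example.com", "page": "/list", "value": 70},
    {"timestamp": "2018-02-28T21:12:00Z", "server": "www2.example.com", "page": "/", "value": 86},
    {"timestamp": "2018-02-28T21:12:00Z", "server": "www1.example.com", "page": "/list", "value": 79},
]

if __name__ == "__main__":
    for row in rollup(events):
        print(row)
```

The two www1.example.com events collapse into one row with count 2, while the www2.example.com event keeps its own row.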

Running Tranquility server

First of all we need to download and unpack Tranquility:

curl -O http://static.druid.io/tranquility/releases/tranquility-distribution-0.8.0.tgz
tar -xzf tranquility-distribution-0.8.0.tgz
cd tranquility-distribution-0.8.0

To start the Tranquility server with this config file, run the following command (assuming the folders druid-0.11.0 and tranquility-distribution-0.8.0 share a common parent folder):

bin/tranquility server -configFile ../druid-0.11.0/conf-quickstart/tranquility/server.json

Keep in mind that Tranquility requires a running Zookeeper instance. If you receive Zookeeper connection errors on start, follow the steps described in the first part of the tutorial.
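A quick way to rule out the most common cause of such errors is to check whether anything is listening on the Zookeeper port before starting Tranquility. The Python sketch below assumes Zookeeper runs on localhost with its default client port 2181; the is_port_open helper is hypothetical and only tests TCP reachability, not Zookeeper health.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 2181 is Zookeeper's default client port (assumed; adjust if yours differs).
    if is_port_open("localhost", 2181):
        print("Zookeeper port is reachable")
    else:
        print("Nothing is listening on localhost:2181")
```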

Generating mock metrics

The sample data generator utility is located in bin/generate-example-metrics inside the Druid distribution folder (druid-0.11.0). Let's run it from there and examine the output:

$ ./bin/generate-example-metrics
{"unit": "milliseconds", "http_method": "GET", "value": 70, "timestamp": "2018-02-28T21:12:00Z", "http_code": "200", "page": "/list", "metricType": "request/latency", "server": "www1.example.com"}
{"unit": "milliseconds", "http_method": "GET", "value": 86, "timestamp": "2018-02-28T21:12:00Z", "http_code": "200", "page": "/", "metricType": "request/latency", "server": "www2.example.com"}
{"unit": "milliseconds", "http_method": "GET", "value": 79, "timestamp": "2018-02-28T21:12:00Z", "http_code": "200", "page": "/list", "metricType": "request/latency", "server": "www1.example.com"}
...

As you can see, generate-example-metrics simulates simple web-server logs in JSON. Metrics in this format can be forwarded directly to Tranquility without any additional transformations. We can do that simply by piping the output of generate-example-metrics to curl:

bin/generate-example-metrics | curl -XPOST -H'Content-Type: application/json' --data-binary @- http://localhost:8200/v1/post/metrics

This will start a real-time indexing task for our data in the Druid Indexing Service. Uploaded data becomes available for querying immediately.
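The same upload can be done from application code. The sketch below is one possible Python equivalent of the curl command: it serialises events as newline-separated JSON, the same shape the generator emits, and prepares a POST to the endpoint used above. The function names build_request, post_events and sample_event are assumptions for illustration, and the sketch stamps events with the current time on the assumption that events far from the current time may be dropped.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Endpoint taken from the curl example in this tutorial.
TRANQUILITY_URL = "http://localhost:8200/v1/post/metrics"


def build_request(events, url=TRANQUILITY_URL):
    """Prepare a POST request carrying newline-separated JSON events."""
    body = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_events(events, url=TRANQUILITY_URL):
    """Send events to a running Tranquility server and return the raw response body."""
    with urllib.request.urlopen(build_request(events, url), timeout=10) as response:
        return response.read().decode("utf-8")


def sample_event(value):
    """Build one hypothetical event, stamped with the current UTC time."""
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "metricType": "request/latency",
        "server": "www1.example.com",
        "page": "/list",
        "value": value,
    }

# Usage, with a Tranquility server listening on localhost:8200:
#   print(post_events([sample_event(70), sample_event(86)]))
```

Splitting request construction from sending keeps the payload format easy to inspect without a running server.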

Summary

In this part of the Druid tutorial we learned how to load real-time streaming data into Druid using Tranquility. In upcoming tutorials we will cover a number of interesting topics, such as:

  • advanced Druid queries
  • real-time ingestion of Avro-encoded events from Kafka into our Druid cluster using Tranquility
  • visualizing data in Druid with Swiv (formerly Pivot) and Superset



Posted on Utopian.io - Rewarding Open Source Contributors


Thank you for the contribution. It has been approved.

I think that for future tutorials it would be better if you said what the users will learn per section of your tutorial, instead of just saying what they will learn in general. Also make sure to really explain why they should do it like you show them to, but of course, this is all my opinion!

[utopian-moderator]

All fair remarks.

I think that for future tutorials it would be better if you said what the users will learn per section of your tutorial

That was my original plan. Unfortunately it's often the case that I change the learning curriculum mid-way for many reasons: topic might end up being simpler than I expected or the other way around - too long and complicated. So I tried to keep it intentionally vague. It sure would be better for users to know the curriculum ahead of time, so I'll do my best to improve in this regard.

Also make sure to really explain why they should do it like you show them

I've explained advantages of Tranquility in About Tranquility section. I will keep in mind the need to justify my choices on every step of the way in upcoming tutorials ;)

Thanks for detailed review, @amosbastian.
