forked from druid-io/druid-io.github.io
-
Notifications
You must be signed in to change notification settings - Fork 0
/
faq.html
108 lines (76 loc) · 8.75 KB
/
faq.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
---
published: true
title: FAQ
layout: html_page
---
<div class="druid-header">
<div class="container">
<h1 class='index' id='frequently_asked_questions'>Frequently Asked Questions</h1>
<h4>Don't see your question here? <a href='/community.html#converse'>Ask us</a>.</h4>
</div>
</div>
<div class="container">
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h2 id='why'>Why did Metamarkets Open Source Druid?</h2>
<p>Originally built by Metamarkets as a component of its analytics service as a platform to provide immediate insight against live data feeds for the <a href='http://metamarkets.com/product/'>on-line advertising space</a>, we realized Druid itself was broadly applicable for real-time and n-dimensional drill down use cases. Presenting Druid to the community helps evolve Druid as a horizontal platform, and over time, helps Metamarkets understand requirements from adjacent verticals.</p>
<h2 id='memory'>Is Druid in-memory?</h2>
<p>Given that it is impossible to process data with a modern processor without first loading the data into memory, yes, it is in-memory. The earliest iterations of Druid didn’t allow for data to be paged in from and out to disk, so we often called it an “in-memory” system. However, we very quickly realized that RAM hasn’t become cheap enough to actually store all data in RAM and sell a product at a price-point that customers are willing to pay. Since the Summer of 2011, we have leveraged memory-mapping to allow us to page data between disk and memory and extend the amount of data a single node can load up to the size of its disks.</p>
<p>That said, as we made the shift, we didn’t want to compromise on the ability to configure the system to run such that everything is essentially in memory. To this end, individual Historical nodes can be configured with the maximum amount of data they should be given. Couple that with the Coordinator’s ability to assign data blocks to different “tiers” based on differing query requirements and Druid essentially becomes a system that can be configured across the whole spectrum of performance requirements. Configuration can be such that all data can be in memory and processed, it can be such that data is heavily over-committed compared to the amount of memory available, and it can also be that the most recent month of data is in memory, while everything else is over-committed.</p>
<h2 id='no-sql'>Is Druid a NoSql database?</h2>
<p>No. Druid’s write support is very limited, so it is best not thought of as a database of any kind. Similar to other NoSQL stores, the query interface is not SQL and it is fault-tolerant with hands-off scaling operations, but NoSQL stores are generally extensions of a key-value store and Druid is a horrible key-value store.</p>
<p>In general, we prefer to not try to group Druid into the cluster of NoSQL stores because it is used to solve a different problem than NoSQL stores generally do. NoSQL stores are generally not good at doing what Druid does and vice versa.</p>
<h2 id='sql-client'>Does Druid speak SQL?</h2>
<p>Yes, to an extent. An ANTLR grammar was written to support a limited set of SQL on top of Druid. It is available in the repository but has not been exposed yet via any of the normal mechanisms, and so it really only accessible to individuals comfortable with building and running Java.</p>
<p><a href='https://github.com/metamx/druid/blob/master/server/src/main/java/io/druid/server/sql/SQLRunner.java'>SQLRunner</a></p>
<h2 id='redshift'>How does Druid compare to Redshift?</h2>
<p><a href='/docs/{{site.druid_version}}/Druid-vs-Redshift.html'>Druid vs Redshift</a></p>
<h2 id='vertica'>How does Druid compare to Vertica?</h2>
<p><a href='/docs/{{site.druid_version}}/Druid-vs-Vertica.html'>Druid vs Vertica</a></p>
<h2 id='cassandra'>How does Druid compare to Cassandra?</h2>
<p><a href='/docs/{{site.druid_version}}/Druid-vs-Cassandra.html'>Druid vs Cassandra</a></p>
<h2 id='hadoop'>How does Druid compare to Hadoop?</h2>
<p><a href='/docs/{{site.druid_version}}/Druid-vs-Hadoop.html'>Druid vs Hadoop</a></p>
<h2 id='impala'>How does Druid compare to Impala/Shark?</h2>
<p><a href='/docs/{{site.druid_version}}/Druid-vs-Impala-or-Shark.html'>Druid vs Impala or Shark</a></p>
<h2 id='external'>What external dependencies does Druid have?</h2>
<p>Druid has three external dependencies that must be running in order for the Druid cluster to operate</p>
<div class="text-indent">
<ol>
<li>Deep Storage</li>
<li>Metadata Storage (MySQL or Postgres)</li>
<li>ZooKeeper</li>
</ol>
</div>
<h2 id='what_is_deep_storage'>What is “deep storage”?</h2>
<p>Simply put, deep storage is some storage infrastructure that Druid depends on for data availability. If this infrastructure goes down, then Druid cannot load new data into the system, nor can it recover from failures in the cluster. As long as this infrastructure is operational, Druid Historical nodes cannot lose data.</p>
<p>A more technical description of Deep Storage and the options available exists in our docs.<br /><a href='/docs/{{site.druid_version}}/Deep-Storage.html'>Deep Storage</a></p>
<h2 id='deep-storage'>Can anything other than S3 be used for “deep storage”?</h2>
<p>Yes. Options include HFDS and any shared mountable file system. More details are available on our docs.<br /><a href='/docs/{{site.druid_version}}/Deep-Storage.html'>Deep Storage</a></p>
<h2 id='queries'>Where can I find a set of simple queries to run?</h2>
<p>We recommend started with our tutorials in our docs:</p>
<p><a href="/docs/{{site.druid_version}}/Tutorial:-A-First-Look-at-Druid.html">Try out Druid for the first time</a></p>
<h2 id='non-structured'>Can Druid accept non-structured data?</h2>
<p>Druid requires some structure to the data it ingests. In general data should consist of a timestamp, dimensions and metrics. These are discussed in a bit more detail in our original blog post: <a href='blog/2011/04/30/introducing-druid.html'>Introducing Druid: Real-Time Analytics at a Billion Rows Per Second</a></p>
<h2 id='time-series'>Does Druid only accept data with a timestamp?</h2>
<p>Yes. Data that is ingested into Druid <em>must</em> have a timestamp.</p>
<h2 id='zookeeper'>Isn’t Zookeeper a single point of failure?</h2>
<p>No, Zookeeper can be deployed to withstand a configurable number of individual node failures. Also, if ZooKeeper goes down, the cluster will continue to operate.</p>
<p>Losing ZooKeeper does, however, mean that the cluster cannot add new data segments, nor can it effectively react to the lose of one of the nodes. So, while it is safe to lose access to ZooKeeper, it is definitely a degraded state.</p>
<h2 id='master'>Isn’t the Master a single point of failure?</h2>
<p>No, the “Master” or “Coordinator” node is merely a coordinator. It is <strong><em>not</em></strong> involved in the query path. Losing all coordinators means that new segments will not be loaded by the Historical nodes, i.e. no new data will enter the cluster. Also, segment balancing decisions will not be made, so it stops the cluster from being able to effectively scale up and down. Just like ZooKeeper, losing the coordinators puts the cluster in a degraded state, but it will keep operating just fine.</p>
<p>Also, there can be multiple coordinator nodes deployed. Extra coordinators will act as “hot spares” in case the active coordinator is lost.</p>
<h2 id='mas-queries'>Do all queries go through the master?</h2>
<p>Queries <strong><em>never</em></strong> touch the master. Ever.</p>
<h2 id='mas-dies'>Can I query Druid even after the master dies?</h2>
<p>Yes, the Master has no impact on queries.</p>
<h2 id='update'>How do I update data?</h2>
<p>Updating data means that you have to re-generate the segments for the time period that you need to update. This can be done by reindexing the data for that time period. Once the indexer finishes, the Druid cluster will swap the new segment in and stop serving the old segment.</p>
<h2 id='ind-serv'>What is the IndexingService, do I need that?</h2>
<p>The IndexingService is a job scheduling service with tasks that operate on Segments. Long-term, we intend to move indexing activities to be fronted by this service.</p>
<p>It is not a requirement yet, but will become one as time goes on.</p>
<p>More information can be found in the docs:</p>
<p><a href="/docs/{{site.druid_version}}/Indexing-Service.html">Indexing Service</a></p>
</div>
</div>
</div>