These operators are used to compare two operands. The following table describes the relational operators available in Hive:
| Operator | Operand | Description |
| --- | --- | --- |
| A = B | all primitive types | TRUE if expression A is equivalent to expression B; otherwise FALSE. |
| A != B | all primitive types | TRUE if expression A is not equivalent to expression B; otherwise FALSE. |
| A < B | all primitive types | TRUE if expression A is less than expression B; otherwise FALSE. |
| A <= B | all primitive types | TRUE if expression A is less than or equal to expression B; otherwise FALSE. |
| A > B | all primitive types | TRUE if expression A is greater than expression B; otherwise FALSE. |
| A >= B | all primitive types | TRUE if expression A is greater than or equal to expression B; otherwise FALSE. |
| A IS NULL | all types | TRUE if expression A evaluates to NULL; otherwise FALSE. |
| A IS NOT NULL | all types | FALSE if expression A evaluates to NULL; otherwise TRUE. |
| A LIKE B | strings | TRUE if string A matches the SQL simple pattern B; otherwise FALSE. |
| A RLIKE B | strings | NULL if A or B is NULL; TRUE if any substring of A matches the Java regular expression B; otherwise FALSE. |
| A REGEXP B | strings | Same as RLIKE. |
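For example, relational operators typically appear in a WHERE clause. The following is a minimal sketch; the employee table and its salary column are hypothetical names used only for illustration:
hive> SELECT * FROM employee
      -- keep only the rows whose salary exceeds 40000
      WHERE salary > 40000;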
Arithmetic Operators
These operators support various common arithmetic operations on the operands. All of them return number types. The following table describes the arithmetic operators available in Hive:
| Operator | Operand | Description |
| --- | --- | --- |
| A + B | all number types | Gives the result of adding A and B. |
| A - B | all number types | Gives the result of subtracting B from A. |
| A * B | all number types | Gives the result of multiplying A and B. |
| A / B | all number types | Gives the result of dividing A by B. |
| A % B | all number types | Gives the remainder resulting from dividing A by B. |
| A & B | all number types | Gives the result of bitwise AND of A and B. |
| A | B | all number types | Gives the result of bitwise OR of A and B. |
| A ^ B | all number types | Gives the result of bitwise XOR of A and B. |
| ~A | all number types | Gives the result of bitwise NOT of A. |
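Arithmetic operators can be used directly in a select list. A minimal sketch, again assuming a hypothetical employee table with a salary column:
hive> SELECT salary + 500 AS raised,  -- addition
             salary % 1000 AS rem     -- modulus (remainder)
      FROM employee;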
Logical Operators
These operators form logical expressions. All of them return either TRUE or FALSE.
| Operator | Operands | Description |
| --- | --- | --- |
| A AND B | boolean | TRUE if both A and B are TRUE; otherwise FALSE. |
| A && B | boolean | Same as A AND B. |
| A OR B | boolean | TRUE if either A or B or both are TRUE; otherwise FALSE. |
| A || B | boolean | Same as A OR B. |
| NOT A | boolean | TRUE if A is FALSE; otherwise FALSE. |
| !A | boolean | Same as NOT A. |
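Logical operators combine boolean conditions, most often in a WHERE clause. A minimal sketch using the same hypothetical employee table:
hive> SELECT * FROM employee
      -- both conditions must hold for a row to be returned
      WHERE salary > 40000 AND dept = 'TP';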
Complex Operators
These operators provide an expression to access the elements of Complex Types.
| Operator | Operand | Description |
| --- | --- | --- |
| A[n] | A is an Array and n is an int | It returns the nth element in the array A. The first element has index 0. |
| M[key] | M is a Map and key has type K | It returns the value corresponding to the key in the map. |
| S.x | S is a struct | It returns the x field of S. |
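The sketch below shows all three access forms in one query. The employees table and its columns (an array, a map, and a struct) are hypothetical names used only for illustration:
hive> SELECT subordinates[0],      -- first element of an array column
             deductions['Taxes'],  -- map lookup by key
             address.city          -- struct field access
      FROM employees;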
Views
A view is a sort of “virtual table” that is defined by a SELECT statement. Views can be used to present data to users in a way that differs from the way it is actually stored on disk. Often, the data from existing tables is simplified or aggregated in a particular way that makes it convenient for further processing. Views may also be used to restrict users’ access to particular subsets of tables that they are authorized to see.
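For example, a view is defined with CREATE VIEW over an arbitrary SELECT statement. A minimal sketch, assuming a hypothetical employee table:
hive> CREATE VIEW emp_30000 AS
      -- the view exposes only the employees earning more than 30000
      SELECT * FROM employee
      WHERE salary > 30000;
The view can then be queried like any table, e.g. SELECT * FROM emp_30000;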
HiveQL
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore. This chapter explains how to use the SELECT statement with WHERE clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause acts as a condition: it filters the data and returns only the rows that satisfy it. The built-in operators and functions are combined into the expression that forms the condition.
Syntax
Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
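As a concrete instance of this syntax, the following sketch retrieves a few columns with a WHERE filter; the employee table and its columns are hypothetical:
hive> SELECT Id, Name, Salary
      FROM employee
      WHERE Salary >= 40000  -- filter condition
      LIMIT 10;              -- cap the number of returned rows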
Select-Order By Syntax
Given below is the syntax of the ORDER BY clause:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
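For example, ORDER BY produces a totally ordered result across all rows (assuming the same hypothetical employee table):
hive> SELECT Id, Name, Salary
      FROM employee
      ORDER BY Salary DESC;  -- highest salaries first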
Select-Group By
The GROUP BY clause is used to group all the records in a result set by a particular column or set of columns. It is used to query a group of records, typically together with aggregate functions.
Syntax
The syntax of GROUP BY clause is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
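For example, GROUP BY with an aggregate function produces one row per group. A sketch assuming a hypothetical employee table with a Dept column:
hive> SELECT Dept, count(*) AS emp_count
      FROM employee
      GROUP BY Dept          -- one output row per department
      HAVING count(*) > 5;   -- keep only the larger departments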
Sort by:
The SORT BY clause sorts the output by the given columns of a Hive table, but only within each reducer; unlike ORDER BY, it does not guarantee a total order across reducers. We can mention DESC to sort in descending order and ASC for ascending order.
Query:
SELECT * from employees_guru SORT BY Id DESC;
Cluster By:
CLUSTER BY is used as shorthand for both the DISTRIBUTE BY and SORT BY clauses in HiveQL.
The CLUSTER BY clause is used on tables present in Hive. Hive uses the columns in CLUSTER BY to distribute the rows among reducers, and each reducer then sorts its rows by those same columns.
Query:
SELECT Id, Name from employees_guru CLUSTER BY Id;
Distribute By:
The DISTRIBUTE BY clause is used on tables present in Hive. Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers: all rows with the same DISTRIBUTE BY column values go to the same reducer.
It ensures that each of the N reducers gets a non-overlapping set of values of the column.
It doesn't sort the output of each reducer.
Query:
SELECT Id, Name from employees_guru DISTRIBUTE BY Id;
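Because DISTRIBUTE BY alone does not sort, it is commonly paired with SORT BY so that each reducer both receives a distinct set of key values and sorts its own portion. A sketch against the same employees_guru table:
SELECT Id, Name from employees_guru
DISTRIBUTE BY Id  -- rows with the same Id go to the same reducer
SORT BY Id ASC;   -- each reducer sorts its own rows by Id
CLUSTER BY Id is equivalent shorthand for this DISTRIBUTE BY/SORT BY pair.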
Select-Joins
JOIN is a clause that is used for combining specific fields from two tables by using values common to each one. It is used to combine records from two or more tables in the database.
Syntax
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference
join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition]
There are different types of joins given as follows:
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. JOIN is the same as INNER JOIN in SQL. A JOIN condition is typically formulated using the primary and foreign keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
LEFT OUTER JOIN
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the right table. This means, if the ON clause matches 0 (zero) records in the right table, the JOIN still returns a row in the result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right table, or NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+
RIGHT OUTER JOIN
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still returns a row in the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
FULL OUTER JOIN
The HiveQL FULL OUTER JOIN combines the records of both the left and the right tables that fulfil the JOIN condition. The joined table contains all the records from both tables, with NULL values filled in for missing matches on either side.
The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+------+----------+--------+---------------------+
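The syntax shown earlier also permits LEFT SEMI JOIN, which returns each row from the left table that has at least one match in the right table, without duplicating it; only columns from the left table may appear in the select list. A sketch using the same tables:
hive> SELECT c.ID, c.NAME
      FROM CUSTOMERS c
      LEFT SEMI JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);  -- customers that placed at least one order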
HBase
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
HBase is a data model, similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row key. The table schema defines only column families, which are collections of key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in HBase:
Table is a collection of rows.
Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key value pairs.
Given below is an example schema of a table in HBase.
| Row id | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) |
| --- | --- | --- | --- | --- |
| 1 | | | | |
| 2 | | | | |
The following table contrasts row-oriented and column-oriented databases:
| Row-Oriented Database | Column-Oriented Database |
| --- | --- |
| It is suitable for Online Transaction Processing (OLTP). | It is suitable for Online Analytical Processing (OLAP). |
| Such databases are designed for a small number of rows and columns. | Column-oriented databases are designed for huge tables. |
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients.
It provides data replication across clusters.
Applications of HBase
It is used whenever there is a need for write-heavy applications.
HBase is used whenever we need to provide fast random access to available data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase Architecture
HMaster
The master server -
Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as creation of tables and column families.
Region server
The region servers have regions that -
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under them.
Decide the size of the region by following the region size thresholds.
Zookeeper
ZooKeeper is a distributed co-ordination service to manage large set of hosts. Co-ordinating and managing a service in a distributed environment is a complicated process. ZooKeeper solves this issue with its simple architecture and API. ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application.
The ZooKeeper framework was originally built at “Yahoo!” for accessing their applications in an easy and robust manner. Later, Apache ZooKeeper became a standard for organized service used by Hadoop, HBase, and other distributed frameworks. For example, Apache HBase uses ZooKeeper to track the status of distributed data.
Benefits of ZooKeeper
Here are the benefits of using ZooKeeper −
Simple distributed coordination process
Synchronization − Mutual exclusion and cooperation between server processes, which helps Apache HBase with configuration management.
Ordered messages
Serialization − Encodes data according to specific rules and ensures the application runs consistently. In MapReduce, for example, this can be used to coordinate queues and running threads.
Reliability
Atomicity − A data transfer either succeeds or fails completely; no transaction is ever partial.
Zookeeper – Fundamentals
Architecture
Hierarchical namespace
Session
Watches
Architecture of ZooKeeper
| Part | Description |
| --- | --- |
| Client | Clients, the nodes in our distributed application cluster, access information from the server. At a regular interval, every client sends a message to the server to let the server know that it is alive. Similarly, the server sends an acknowledgement when a client connects. If there is no response from the connected server, the client automatically redirects the message to another server. |
| Server | A server, one of the nodes in our ZooKeeper ensemble, provides all the services to clients and sends an acknowledgement to each client to confirm that it is alive. |
| Ensemble | A group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3. |
| Leader | The server node that performs automatic recovery if any of the connected nodes fails. Leaders are elected on service startup. |
| Follower | A server node that follows the leader's instructions. |
Hierarchical Namespace
The following diagram depicts the tree structure of the ZooKeeper file system used for in-memory representation. A ZooKeeper node is referred to as a znode. Every znode is identified by a name and addressed by a path whose components are separated by slashes (/).
In the diagram, first you have a root znode separated by “/”. Under root, you have two logical namespaces, config and workers.
The config namespace is used for centralized configuration management, and the workers namespace is used for naming.
Under the config namespace, each znode can store up to 1 MB of data. This is similar to the UNIX file system, except that a parent znode can store data as well. The main purpose of this structure is to store synchronized data and to describe the metadata of the znode. This structure is called the ZooKeeper Data Model.
Every znode in the ZooKeeper data model maintains a stat structure. A stat simply provides the metadata of the znode. It consists of a version number, an Access Control List (ACL), a timestamp, and the data length.
Version number − Every znode has a version number: every time the data associated with the znode changes, its corresponding version number also increases. The version number is important when multiple ZooKeeper clients are trying to perform operations on the same znode.
Access Control List (ACL) − The ACL is basically an authentication mechanism for accessing the znode. It governs all znode read and write operations.
Timestamp − The timestamp represents the time elapsed since znode creation or modification, usually in milliseconds. ZooKeeper identifies every change to a znode by its transaction ID (zxid). The zxid is unique and records the time of each transaction, so you can easily identify the time elapsed from one request to another.
Data length − The total amount of data stored in a znode is the data length. You can store a maximum of 1 MB of data.
Types of Znodes
Znodes are categorized as persistent, sequential, and ephemeral.
Persistent znode − A persistent znode stays alive even after the client which created it has disconnected. By default, all znodes are persistent unless otherwise specified.
Ephemeral znode − An ephemeral znode is active only as long as its client is alive. When a client gets disconnected from the ZooKeeper ensemble, its ephemeral znodes are deleted automatically. For this reason, ephemeral znodes are not allowed to have children. If an ephemeral znode is deleted, the next suitable node fills its position. Ephemeral znodes play an important role in leader election.
Sequential znode − A sequential znode can be either persistent or ephemeral. When a new znode is created as a sequential znode, ZooKeeper sets its path by appending a 10-digit sequence number to the original name. For example, if a znode with the path /myapp is created as a sequential znode, ZooKeeper changes the path to /myapp0000000001 and sets the next sequence number to 0000000002. If two sequential znodes are created concurrently, ZooKeeper never assigns the same number to both. Sequential znodes play an important role in locking and synchronization.
Sessions
Sessions are very important for the operation of ZooKeeper. Requests in a session are executed in FIFO order. Once a client connects to a server, the session will be established and a session id is assigned to the client.
The client sends heartbeats at a particular time interval to keep the session valid. If the ZooKeeper ensemble does not receive heartbeats from a client for more than the period (session timeout) specified at the starting of the service, it decides that the client died.
Session timeouts are usually represented in milliseconds. When a session ends for any reason, the ephemeral znodes created during that session also get deleted.
Watches
Watches are a simple mechanism for a client to get notified about changes in the ZooKeeper ensemble. Clients can set a watch while reading a particular znode; the watch sends a notification to the registered client whenever that znode changes.
Znode changes are modifications of the data associated with the znode or changes to the znode’s children. A watch is triggered only once; if a client wants a notification again, it must set the watch through another read operation. When a connection session expires, the client is disconnected from the server and the associated watches are removed as well.
Visualization
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
Big Data visualization involves the presentation of data of almost any type in a graphical format that makes it easy to understand and interpret. But it goes far beyond typical corporate graphs, histograms and pie charts to more complex representations like heat maps and fever charts, enabling decision makers to explore data sets to identify correlations or unexpected patterns.
Big Data visualization is not the only way for decision makers to analyze data, but Big Data visualization techniques offer a fast and effective way to:
Review large amounts of data – data presented in graphical form enables decision makers to take in large amounts of data and gain an understanding of what it means very quickly – far more quickly than poring over spreadsheets or analyzing numerical tables.
Spot trends – time-sequence data often captures trends, but spotting trends hidden in data is notoriously hard to do – especially when the sources are diverse and the quantity of data is large. But the use of appropriate Big Data visualization techniques can make it easy to spot these trends, and in business terms a trend that is spotted early is an opportunity that can be acted upon.
Identify correlations and unexpected relationships – One of the huge strengths of Big Data visualization is that it enables users to explore data sets – not to find answers to specific questions, but to discover what unexpected insights the data can reveal. This can be done by adding or removing data sets, changing scales, removing outliers, and changing visualization types. Identifying previously unsuspected patterns and relationships in data can provide businesses with a huge competitive advantage.
Present the data to others – An oft-overlooked feature of Big Data visualization is that it provides a highly effective way to communicate any insights that it surfaces to others. That's because it can convey meaning very quickly and in a way that is easy to understand: precisely what is needed in both internal and external business presentations.
Use of Data Visualization
To make data easier to understand and remember.
To discover unknown facts, outliers, and trends.
To visualize relationships and patterns quickly.
To ask better questions and make better decisions.
To analyze competitors.
To improve insights.
Data visualization techniques
Depending on the data and the goal of the analysis, you can choose different data visualization techniques and configure their features. Here are the common types of visualization techniques:
Charts
The easiest way to show the development of one or several data sets is a chart. Charts vary from bar and line charts that show the relationship between elements over time to pie charts that demonstrate the components or proportions between the elements of one whole.
Plots
Plots allow you to distribute two or more data sets over a 2D or even 3D space to show the relationship between these sets and the parameters on the plot. Plots also vary: scatter and bubble plots are some of the most widely used visualizations. When it comes to big data, analysts often use more complex box plots that help visualize the relationship between large volumes of data.
Maps
Maps are a popular way to visualize data in different industries. They allow you to locate elements on relevant objects and areas — geographical maps, building plans, website layouts, and so on. Among the most popular map visualizations are heat maps, dot distribution maps, and cartograms.
Diagrams and matrices
Diagrams are usually used to demonstrate complex data relationships and links, and can include various types of data in one visualization. They can be hierarchical, multidimensional, or tree-like.
A matrix is one of the advanced data visualization techniques; it helps determine the correlation between multiple constantly updating (streaming) data sets.