Big Data Technologies



Date: 05.08.2017

Key-Value Stores


The key-value database is a very simple structure modeled on Amazon’s Dynamo. Data is indexed and queried by its key. Dynamo-style key-value stores use consistent hashing to partition data across nodes, so they can scale incrementally as your data grows. Nodes share the cluster's membership through a gossip-based protocol, which keeps every node's view of the cluster synchronized. If you need to scale very large sets of low-complexity data, key-value stores are a strong option.

Examples: Riak, Voldemort, etc.
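
The consistent hashing mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not how Riak or Voldemort implement it: the `HashRing` class, the node names and the choice of 64 virtual nodes per machine are all assumptions made for the example. It shows the property that matters for incremental scaling: when a node joins, only the keys that the new node takes over change owner.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (hypothetical class, for illustration only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes      # virtual nodes per physical node (assumed value)
        self._positions = []      # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node is placed on the ring many times to even out the load.
        for i in range(self.vnodes):
            pos = ring_position(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first position after the key, wrapping around.
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys changed owner")
```

Every key in `moved` now belongs to `node-d`; keys owned by the other three nodes stay where they were, which is why a cluster can grow one machine at a time without reshuffling all of its data.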

Column Family Stores were created to store and process very large amounts of data distributed over many machines. There are still keys, but they point to multiple columns. In the case of BigTable (Google's column family NoSQL model), rows are identified by a row key, and the data is sorted and stored by this key. The columns are arranged by column family. Examples: Cassandra, HBase, etc.



  • These data stores are based on Google’s BigTable implementation. They may look similar to relational databases on the surface, but under the hood a lot has changed. A column family database can have different columns on each row, so it is not relational and has nothing that would qualify as a table in an RDBMS. The key concepts in a column family database are columns, column families and super columns. All you really need to start with is a column family. Column families define how the data is structured on disk. A column by itself is just a key-value pair that exists in a column family. A super column is like a catalogue: a collection of other columns, though it cannot contain other super columns.

  • Column family databases are still extremely scalable, but less so than key-value stores. However, they work better with more complex data sets.
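
To make the row-key and column-family ideas above concrete, here is a small in-memory sketch in Python. It is not BigTable's or HBase's API; the `ColumnFamilyTable` class, its methods and the sample keys are hypothetical. It illustrates two points from the text: rows are kept sorted by row key (so a range of keys can be scanned cheaply), and each row may carry a different set of columns.

```python
import bisect


class ColumnFamilyTable:
    """Toy table: row key -> column family -> {column name: value}."""

    def __init__(self, families):
        self.families = set(families)
        self._keys = []   # row keys, kept sorted
        self._rows = {}   # row key -> {family: {column: value}}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].setdefault(family, {})[column] = value

    def get(self, row_key, family):
        """Return the columns stored for one row in one family (may be empty)."""
        return self._rows.get(row_key, {}).get(family, {})

    def scan(self, start, stop):
        """Yield (row_key, row) for start <= row_key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


users = ColumnFamilyTable(families=["profile", "activity"])
users.put("user:bob", "profile", "email", "bob@example.com")
users.put("user:alice", "profile", "name", "Alice")
users.put("user:alice", "activity", "last_login", "2017-08-05")
users.put("video:42", "profile", "title", "Plankton")

# ";" is the character right after ":", so this range covers every "user:" key.
scanned = [key for key, _ in users.scan("user:", "user;")]
print(scanned)
```

Note that `user:alice` and `user:bob` hold different columns, and that the scan returns them in key order regardless of insertion order.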

Document Databases were inspired by Lotus Notes and are similar to key-value stores. The model is basically versioned documents that are collections of other key-value collections. The semi-structured documents are stored in formats like JSON. Examples: MongoDB, CouchDB.





  • A document database is not a new idea. It powered one of the more prominent communication platforms of the 1990s, which is still in service today: Lotus Notes, now called Lotus Domino. APIs for document databases often use RESTful web services with JSON as the message structure, which makes it easy to move data in and out.

  • A document database has a fairly simple data model based on collections of key-value pairs.

  • A typical record in a document database would look like this:

    {
      "Subject": "I like Plankton",
      "Author": "Rusty",
      "PostedDate": "5/23/2006",
      "Tags": ["plankton", "baseball", "decisions"],
      "Body": "I decided today that I don't like baseball. I like plankton."
    }



  • Document databases improve on handling more complex structures but are slightly less scalable than column family databases.
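
As a rough illustration of the document model described above, the following Python sketch stores documents as nested key-value collections and queries them by field. It is not the API of MongoDB or CouchDB; the `insert`, `matches` and `find` helpers, the document ids and the second document are hypothetical, introduced only for this example.

```python
import json

posts = {}  # document id -> document (nested key-value pairs)


def insert(doc_id, document):
    # Round-trip through JSON to mimic storing a serialized document.
    posts[doc_id] = json.loads(json.dumps(document))


def matches(document, field, wanted):
    """True if the field equals `wanted`, or is a list containing it."""
    value = document.get(field)
    if isinstance(value, list):
        return wanted in value
    return value == wanted


def find(**criteria):
    """Return every document that satisfies all field=value criteria."""
    return [
        doc for doc in posts.values()
        if all(matches(doc, field, wanted) for field, wanted in criteria.items())
    ]


insert("post-1", {
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "5/23/2006",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton.",
})
# A second, made-up document with a different set of fields.
insert("post-2", {"Subject": "Box scores", "Author": "Dana", "Tags": ["baseball"]})

print([d["Subject"] for d in find(Tags="baseball")])
print([d["Subject"] for d in find(Author="Rusty", Tags="plankton")])
```

The two documents do not share a schema, yet both can be stored and queried the same way, which is the flexibility the document model trades some scalability for.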

Graph Databases are built from nodes, relationships between nodes, and the properties of nodes. Instead of tables of rows and columns and the rigid structure of SQL, they use a flexible graph model that can scale across many machines.





  • Graph databases take document databases to the extreme by introducing the concept of typed relationships between documents or nodes. The most common example is the relationship between people on a social network such as Facebook. The idea is inspired by the graph theory work of Leonhard Euler, the 18th century mathematician. Key-value stores use key-value pairs as their modeling units. Column family databases use the tuple with attributes to model the data store. A graph database is a big, dense network structure.

  • While it could take an RDBMS hours to sift through a huge linked list of people, a graph database uses graph traversal and shortest path algorithms to make such queries more efficient. Although slower than its other NoSQL counterparts, a graph database can have the most complex structure of them all and still traverse billions of nodes and relationships very quickly.
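
The kind of relationship query described above can be sketched with a plain breadth-first search in Python. This is a deliberately simple stand-in, not the algorithm any particular graph database uses; the `add_relationship` and `shortest_path` helpers, the people and the `FRIEND` relationship type are assumptions made for the example.

```python
from collections import deque

graph = {}  # node -> {neighbour: relationship type}


def add_relationship(a, b, kind):
    """Record an undirected, typed relationship between two nodes."""
    graph.setdefault(a, {})[b] = kind
    graph.setdefault(b, {})[a] = kind


def shortest_path(start, goal):
    """Return the shortest list of nodes from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}   # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, {}):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk the back-pointers to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


add_relationship("alice", "bob", "FRIEND")
add_relationship("bob", "carol", "FRIEND")
add_relationship("carol", "dave", "FRIEND")
add_relationship("alice", "erin", "FRIEND")
add_relationship("erin", "dave", "FRIEND")
add_relationship("frank", "grace", "FRIEND")

print(shortest_path("alice", "dave"))
print(shortest_path("alice", "grace"))
```

Because each node stores its relationships directly, the search only touches nodes near the starting point instead of joining whole tables, which is the intuition behind the efficiency claim above.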


Cassandra


  • Cassandra is now deployed as the backend storage system for multiple services within Facebook.

  • To meet the reliability and scalability needs described above, Facebook developed Cassandra.

  • Cassandra was designed to fulfill the storage needs of the Inbox Search problem.

Data Model

  • Cassandra is a distributed key-value store.

  • A table in Cassandra is a distributed multi-dimensional map indexed by a key. The value is an object which is highly structured.

  • The row key in a table is a string with no size restrictions, although typically 16 to 36 bytes long.

  • Every operation under a single row key is atomic per replica no matter how many columns are being read or written into.

  • Columns are grouped together into sets called column families.

  • Cassandra exposes two kinds of column families: Simple and Super column families.

  • Super column families can be visualized as a column family within a column family.
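
One way to picture the multi-dimensional map described above is as nested dictionaries, sketched here in Python. This is an illustration of the shape of the data only, not Cassandra's API or storage format; the `insert` and `get` helpers and all row keys, family names and column names are hypothetical. A simple column family is two levels deep under the row key, while a super column family adds one more level.

```python
table = {}  # row key -> column family -> columns (or super columns)


def insert(row_key, family, column, value, super_column=None):
    """Store one column, optionally nested inside a super column."""
    columns = table.setdefault(row_key, {}).setdefault(family, {})
    if super_column is not None:
        columns = columns.setdefault(super_column, {})
    columns[column] = value


def get(row_key, family, super_column=None):
    """Return the columns under a family, or under one of its super columns."""
    columns = table.get(row_key, {}).get(family, {})
    if super_column is not None:
        columns = columns.get(super_column, {})
    return columns


# Simple column family: row key -> family -> column -> value.
insert("user:rusty", "Profile", "name", "Rusty")
insert("user:rusty", "Profile", "city", "Boston")

# Super column family: row key -> family -> super column -> column -> value.
insert("user:rusty", "Inbox", "msg-17", "I like plankton", super_column="term:plankton")
insert("user:rusty", "Inbox", "msg-21", "Plankton again", super_column="term:plankton")
insert("user:rusty", "Inbox", "msg-17", "I like plankton", super_column="term:like")

print(get("user:rusty", "Profile"))
print(sorted(get("user:rusty", "Inbox", super_column="term:plankton")))
```

Everything for `user:rusty` lives under a single row key, which is the unit the text describes as atomic per replica.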

Architecture

  • Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure.

  • Its architecture is based on the understanding that system and hardware failures can and do occur.

  • Cassandra addresses the problem of failures by employing a peer-to-peer distributed system where all nodes are the same and data is distributed among all nodes in the cluster.

  • Each node exchanges information across the cluster every second.

  • A commit log on each node captures write activity to ensure data durability.

  • Data is also written to an in-memory structure, called a memtable, and then written to a data file on disk, called an SSTable, once the memory structure is full.

  • All writes are automatically partitioned and replicated throughout the cluster.

  • Client read or write requests can go to any node in the cluster.

  • When a client connects to a node with a request, that node serves as the coordinator for that particular client operation.

  • The coordinator acts as a proxy between the client application and the nodes that own the data being requested.

  • The coordinator determines which nodes in the ring should get the request based on how the cluster is configured.
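
The write path described above (commit log first, then memtable, then a flush to an SSTable) can be sketched for a single node as follows. This is a simplified model under stated assumptions, not Cassandra's implementation: the `Node` class is hypothetical, Python lists stand in for files on disk, the memtable is considered full after a fixed number of keys, and the commit log is simply cleared after a flush.

```python
class Node:
    """Toy single-node write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # assumed "full" threshold
        self.commit_log = []   # append-only record of writes (stands in for a file)
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable sorted tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable table and start afresh.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # simplification: flushed writes no longer need the log

    def read(self, key):
        # Newest data wins: check the memtable, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None


node = Node(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # memtable reaches its limit and is flushed to an SSTable
node.write("a", 3)   # newer value for "a" lives in the memtable only

print(node.sstables)
print(node.read("a"), node.read("b"), node.read("z"))
```

In a real cluster the coordinator would forward each write to the replicas that own the key, and each replica would run a write path along these lines; that routing is left out here to keep the sketch focused on one node.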

