Storing Data
The local services are all about storing data. In the past, the emphasis was put on high-speed access, the data types were simple and the output devices were standard text devices. This has gradually changed, so that today we are storing complex data (sounds, images) and also the output devices have become more diversified.
The different types of local services can be described in two dimensions. First, there is the complexity of the data they store and second, there is the distinction between those systems that allow for (structured) querying and those who do not. Fig. 2.7 shows the different implementations of these categories:
Fig. 2.7 – Storage systems
File Systems
Market analysts estimate that on average around 85% of all information that is stored in an organization is done by means of files and the volume is growing somewhere between 40 and 120% per year. A file is a collection of data that has a name. There are many different types of files: texts (formatted and unformatted), program files, images, movies, etc. Different types of files store different types of data.
A directory is a special kind of file used to organize other files into a hierarchical structure. Directories contain bookkeeping information about other files. You can think of a directory as a folder that contains files and perhaps other folders. The topmost directory is called the root directory. A directory that is below another directory is called a subdirectory.
A File (Management) System is the software that an Operating System or other program uses to organize and keep track of files and directories. Although all Operating Systems provide their own file management system, separate file management systems are also available on the market. These systems interact smoothly with the operating system but provide more features, such as improved backup procedures and stricter file protection.
Relational Database Management Systems (RDBMSs)
A database is a collection of data that is organized in such a way that its contents can easily be reached, managed, and updated [13].
A Database Management System (DBMS), sometimes just called a database manager, is a program that lets one or more users create and access data in a database. The DBMS manages requests from users and programs so that they do not have to understand where the data is physically located. In handling requests, the DBMS ensures the integrity of the data (that is, making sure it continues to be accessible and is consistently organized as intended) and security (making sure only those with access privileges can access the data). A DBMS can be thought of as a file manager that manages data in databases rather than files in a file system.
The most widely spread DBMSs are Relational Database Management Systems or RDBMSs. In an RDBMS data are stored as tables. Below is a simple example of a table that is part of an order entry application.
Customer
-
CustID
|
Name
|
Address
|
State
|
City
|
100234
|
AB Corp.
|
1, Las Vegas Blvd.
|
Nevada
|
Las Vegas
|
222394
|
Technotools
|
323, Main Street
|
Minnesota
|
Mineapolis
|
124040
|
Acme
|
555, Star Blvd.
|
Texas
|
Dallas
|
449452
|
Zillion Inc.
|
123A, Park Ave.
|
Kentucky
|
Louisville
|
553935
|
Whatsup
|
8, Silicon Driveway
|
California
|
San Diego
|
Another table holds the orders that were taken down:
Order
-
OrdID
|
OrdCustID
|
Description
|
Quantity
|
12345
|
124040
|
Baseball Gloves
|
2
|
12346
|
222394
|
Basketball
|
3
|
12347
|
100234
|
T-shirt
|
10
|
12348
|
124040
|
Helmet
|
6
|
12349
|
553935
|
Bicycle
|
9
|
The columns of a database table are known as fields or attributes, while the rows are called records. A key is a field that identifies a record. For the Customer table, the company name could be used as a key, but when there are two companies with the same name, there is a problem. The best way is to use an arbitrary whole number or a code (CustID).
There are two types of key fields: primary keys and foreign keys. A primary key is a field that uniquely identifies a record in a table (CustID and OrdID). No two records can have the same value for a primary key. A foreign key represents the value of primary key for a related table. In the Order table, the OrdCustID field holds the value of the CustID field for the customer who placed the order. By doing this, we can link the Customers and Orders tables using the CustID field. By joining the two tables based on this relationship, we can add fields from both tables and see all orders along with any pertinent customer data.
Data not only has to be stored, it also has to be retrieved from an RDBMS. This is done by queries. A query is a view of data that represents the data from one or more tables. In Relational Databases, queries are built using Structured Query Language (SQL). SQL includes the capability of manipulating both the structure of a database and its data. In its most common form, SQL is used to create a simple SELECT query.
Let's take the example and build a query to look at customer orders. Here's the SQL for it:
-
SELECT Name, Description, Quantity
FROM Customer
WHERE Customer.CustID = Order.OrdCustID
ORDER BY Name, Description;
|
This query starts with the SELECT keyword. SELECT simply means that we wish to retrieve records from the tables. Following the SELECT keyword is the list of fields. Next comes the FROM keyword. This is used to show where the data is coming from. Here, it is coming from the Customer table and the Order table. The WHERE clause specifies which fields are used to join the tables. Here we are matching the CustID field from Customer to the OrdCustID field (the foreign key) in Order. The ORDER BY clause states the way the results have to be ordered (first by Name, then by Description).
The result that is returned by this query on the given tables looks like this:
-
Name
|
Description
|
Quantity
|
AB Corp.
|
T-shirt
|
10
|
Acme
|
Baseball Gloves
|
2
|
Acme
|
Helmet
|
6
|
Technotools
|
Basketball
|
3
|
Whatsup
|
Bicycle
|
9
|
One thing that should be noted is that, even though an ANSI standard exists, each RDBMS has its own particular dialect of SQL. Statements that were written for one RDBMS will not necessarily run or be interpreted on another engine.
So far, we have only discussed the SELECT statement, this is known as a Data Manipulation Language (DML) statement. Other DML statements allow for the creation or modification of data (INSERT and UPDATE). Data Definition Language (DDL) statements such as CREATE TABLE are used to manage the structure of the database and Data Control Language (DCL) statements allow the control over the access rights to the data, such as GRANT.
Normalization
Normalization is a theory for designing relational schemas that are efficient and work well. Well-normalized tables avoid unnecessary duplication and reduce inconsistencies. In well-normalized databases semantic dependencies are maintained by primary key uniqueness. Historically, several different types of normalization have been defined (first through fifth normal form); the most important however is the Boyce-Codd Normal Form (BCNF):
“A table is in Boyce-Codd Normal Form if every determinant is a candidate key.”
Let A and B be two attributes of a table. A is said to be a determinant of B if each value of A has precisely one (possibly null) associated value of B or - said differently - if and only if whenever two tuples agree on their A value they agree on their B value. A candidate key is a combination of attributes that can be uniquely used to identify a database record. Each table can have one or more candidate keys. One of these candidate keys is selected as the table's primary key.
Consider the following table:
-
ID
|
Product
|
City
|
State
|
Price
|
9991
|
T-shirt
|
San Diego
|
California
|
100
|
9992
|
Baseball Gloves
|
San Diego
|
California
|
100
|
9993
|
T-shirt
|
Mesa
|
Arizona
|
150
|
9994
|
Helmet
|
Los Angeles
|
California
|
120
|
9996
|
Baseball Gloves
|
Mesa
|
Arizona
|
150
|
9997
|
Helmet
|
Phoenix
|
Arizona
|
80
|
9998
|
Basketball
|
Green Valley
|
Arizona
|
90
|
9999
|
Bicycle
|
Tombstone
|
Arizona
|
110
|
Transformed to the BCNF this would result in three tables:
-
ID
|
Product
|
City
|
9991
|
T-shirt
|
San Diego
|
9992
|
Baseball Gloves
|
San Diego
|
9993
|
T-shirt
|
Mesa
|
9994
|
Helmet
|
Los Angeles
|
9996
|
Baseball Gloves
|
Mesa
|
9997
|
Helmet
|
Phoenix
|
9998
|
Basketball
|
Green Valley
|
9999
|
Bicycle
|
Tombstone
|
-
City
|
State
|
San Diego
|
California
|
Los Angeles
|
California
|
Tombstone
|
Arizona
|
Mesa
|
Arizona
|
Phoenix
|
Arizona
|
Green Valley
|
Arizona
|
-
City
|
Price
|
Tombstone
|
110
|
Mesa
|
150
|
Phoenix
|
80
|
Green Valley
|
90
|
San Diego
|
100
|
Los Angeles
|
120
|
Indexing
Conceptually, the relational model is simple: operations on a set of tables yield another table as the result. But simply scanning tables is inefficient, that is why indexes exist. Indexes can best be compared with the alphabetical index of a book; they are easy to scan and point to occurrences of the searched word in the actual text.
There are different types of indexes but the common point is that they can speed up the search for a particular row or the sorting of a set of rows. Most databases have some kind of query optimizer that chooses the best path to get the data that satisfy a given query. An index can speed up considerably joins, ORDER BY and GROUP BY clauses from a SELECT statement. It can also speed up queries for which the conditions match the columns and sort order of the index. When you have a primary key, you already have an implicitly defined unique index on the primary key columns. It is a good idea to define non-unique indexes on the foreign keys.
Parallel databases
Today, database research is focusing on technology that allows multiple, smaller machines to achieve the same throughput as single, larger machine, with greater scalability and reliability. A so-called parallel DBMS runs across multiple processors and disks and executes operations in parallel, whenever possible, to improve performance
.
Parallel database architectures can be broadly classified into three categories: shared memory, shared disk, and shared nothing. The first two architectures require a special hardware arrangement; the last one uses a standard communication network.
The advantages of a parallel DBMS are obvious:
-
Improved Performance;
-
Improved Availability;
-
Improved Reliability;
-
Lower Total Cost of Ownership (TCO);
-
Scalability.
The price, however, is paid in terms of:
-
Increased complexity;
-
More difficult security;
-
More difficult integrity control.
The Open Group has formed a Working Group to provide specifications for a common SQL Application Programming Interface (API) and a common database protocol that enables DBMS from one vendor to communicate directly with DBMS from another vendor without the need for a gateway.
Object-Oriented Database Management Systems (OODBMS)
Relational database technology fails to handle the needs of complex Information Systems; they require the application developer to force an information model into tables with relationships between them.
The study of databases is centered on the problem of data modeling. A data model is a collection of well-defined concepts that help people to consider and express the properties of data intensive information systems.
A data model consists of:
-
Static properties such as objects, attributes and relationships;
-
Integrity rules over objects and operations;
-
Dynamic properties such as operations or rules defining new database states based on applied state changes.
Relational databases are capable of supporting the first two points and rely on separate software programs for defining the dynamic properties of the model. The disadvantage of delegating the dynamic properties to separate programs is that they cannot be applied uniformly in all scenarios since they are defined outside the database.
Object-oriented databases can model all three of these points directly within the database thus supporting a complete problem/solution modelling capability. Object-oriented databases provide a unifying paradigm that allows one to integrate all three aspects of data modelling and to apply them uniformly to all customers of the database. OODBMSs have, however, some drawbacks as well. First, there is no common query language such as SQL for the relational databases. The performance of Object-oriented databases is much lower than that of RDBMSs. Moreover, there is a lack of standardized industry support, and finally, high-tech features such as multi-threading and parallel processing are not available.
Today, there are many companies marketing commercial object-oriented databases that are second-generation products. OODBMSs have established themselves in niches such as e-commerce, engineering, and special purpose databases in areas such as securities and medicine.
Object-Relational Database Management Systems (ORDBMS)
The relational database vendors considered the arrival of OODBMSs as a threat. As a result they started to add features and functionality to their products to stay competitive. Real-life implementations of the Object Relational Databases (ORDBMSs) as well as theoretical extensions of the relational model such as the “Third Manifesto” [14] started to appear mid 1990s.
Stored procedures are collections of SQL statements and flow-control language that are stored in a precompiled way into the database. Stored procedures are supported by most RDBMSs, but there is a fair amount of variation in their syntax and capabilities. Triggers are Stored Procedures that are launched on a given event, such as inserting or updating a row in a table.
The addition of complex data types such as BLOBs (Binary Large Objects) is another example of how the relational database vendors have reacted to OODBMSs. A BLOB is a large block of data stored in a database, such as an image or sound file. A special subtype of BLOB is CLOB; characters are used instead of pure binary information. A BLOB cannot be interpreted by the database management system but is known only by its size and location. Modern RDBMSs allow for some relatively simple operations on BLOBs such as storing, retrieving, and even some sort of indexing. There remains, however, a fundamental mismatch between BLOBS and RDBMSs.
The net result of these enhancements was that the purely relational databases evolved into Object-Relational Databases (ORDBMSs), also described as hybrid databases. These systems are much more like RDBMSs than OODBMSs. They make use of a proprietary object model and are designed for large-grained objects.
More recently, another challenge for the database vendors was set by the appearance of the eXtensible Markup Language (XML) [43]. XML is a powerful communication mechanism; it contains both the data and its description in a well-structured way. In fact, it is a text format with a particular way of representing information in a hierarchical structure (Fig. 2.8).
Fig. 2.8 – XML as a hierarchical structure
We observe that:
-
The document starts with the XML declaration ;
-
All elements have start and end tags: ;
-
All elements are contained within one root element: .
In XML, a clear distinction is made between data and metadata, syntax and semantics. Much like in a relational database the same structure stores different types of information but in XML the structure is not fixed; it can be altered on-the-fly, as long as the syntax is respected. XML is the basis for new architectures in various domains of IT: storage, exchange, and representation of information. The database vendors could of course not neglect the importance of XML, so different ways to store and retrieve XML documents were developed. Traditional database vendors have all extended their products and propose two possible strategies:
-
In the mapping approach all elements of the XML file are stored as separate rows. The advantage of this approach is that, once stored, the data are accessible with traditional tools (SQL). This is why it is recommended for data-centric applications;
-
In a RDBMS XML can also be stored in one piece: as a CLOB or a derived data type (e.g., XMLType in Oracle). This approach is best suited for text-centric applications.
Some OODBMS vendors have used their know-how and technology to build native XML databases. The idea is to store the elements of the XML document as objects using a special query language data can then be retrieved and updated.
Working with XML documents in databases sparked the need for a standard query language. Several attempts have been undertaken by W3C, one of them being Xquery. XQuery has an SQL-like syntax combined with Xpath, a query language designed to address parts of XML documents:
-
FOR $b IN //Book
WHERE $b/Author/First Name = "Ralph"
AND $b/Author/Last Name = "Kimball"
RETURN $b/title
|
Xquery statements are known as FLWR expressions. FLWR (pronounced flower) is an acronym that stands for the four possible sub expressions; this expression type can contain: FOR, LET, WHERE, and RETURN. The FOR statement creates a collection of variable bindings, one for each node in the "ordered forest" returned by the path expression. For each of these bindings, the subsequent statements are executed. The nodes in the collection are filtered by the WHERE clause. RETURN constructs the result. The LET subexpression (not shown) binds a variable to a value.
Benchmarking of DBMSs
The performance of DBMSs is compared using benchmarks. The Transaction Processing Performance Council or TPC is a non-profit corporation founded by database manufacturers to define transaction processing and database benchmarks and to disseminate objective and verifiable performance data. The term transaction is often applied to a wide variety of business and computer functions. Looked at as a computer function, a transaction could refer to a set of operations including disk read/writes, operating system calls, or some form of data transfer from one subsystem to another. The TPC defines a transaction, as it is commonly understood in the business world: a commercial exchange of goods, services, or money. A typical transaction would include the updating to a database system for such things as inventory control, airline reservations, or banking. In these environments, several customers or service representatives input and manage their transactions via a terminal or desktop computer connected to a database.
The benchmarks that are currently in use are TPC-C, TPC-H, TPC-R, and TPC-W.
-
TPC-C – This benchmark simulates a complete computing environment where a population of customers executes transactions against a database;
-
TPC-H – This is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications;
-
TPC-R – This is a decision support benchmark similar to TPC-H, but which allows additional optimizations based on advance knowledge of the queries;
-
TPC-W – This is a transactional web benchmark. The workload is performed in a controlled Internet environment that simulates the activities of a business oriented transactional web server.
Share with your friends: |