1. INTRODUCTION
Information plays a very important role in our lives today, and one of the major sources of information is databases. Databases have gained prime importance in a huge variety of application areas, in both private and public information systems. Databases are built with the objective of facilitating the activities of data storage, processing, and retrieval, all of which are associated with data management in information systems. In relational databases, to retrieve information one needs to formulate a query in such a way that the computer will understand it and produce the desired output [4]. Retrieval of information from databases requires knowledge of a database language such as the Structured Query Language (SQL), whose norms are based on a Boolean interpretation of queries. But not all users are aware of the query language, and not all the queries of users seeking information from the database can be answered explicitly by the querying system, because not all of a requirement's characteristics can be expressed in regular query languages. To reduce the complexity of SQL and let ordinary users (not SQL professionals) manipulate data in databases, much research has turned to using natural language instead of SQL. This idea has led to the development of a new type of processing method called Natural Language Interface to Database systems (NLIDB) [12, 16]. NLIDB is actually a branch of the more comprehensive field of Natural Language Processing (NLP). In general, the main objective of NLP research is to create an easy and friendly environment for interacting with computers, in the sense that computer usage requires no programming-language skills to access the data; only natural language (e.g. English) is required. One drawback of previous systems is that the grammar must be tailor-made for each given database. Another is that many NLP systems cover only a small domain of English-language questions. Our system is generic in nature, given the appropriate database and knowledge base; this feature makes it distinguishable. Natural language computing (NLC) is becoming one of the most active areas in Human-Computer Interaction; its goal is to enable communication between people and computers without resorting to memorization of complex commands and procedures [19].
Via computers around the globe, people accumulate and manipulate huge amounts of data every second of the day. These data reside on private personal computers or at remote locations (i.e. the Internet), usually in some kind of database or data warehouse managed by a DBMS, with SQL used to access the database. In recent times there has been a rising demand from non-expert users to query relational databases in a more natural language, encompassing linguistic variables and terms, instead of operating on the values of the attributes [16]. The idea of using natural language instead of SQL has therefore prompted the development of NLIDB, a step towards intelligent database systems (IDBS) that allow users to perform flexible querying of databases.
2. EARLY SYSTEMS
Early efforts in the NL interfaces area started back in the fifties, and prototype systems were introduced in the late sixties and early seventies. Many of these systems relied on pattern matching to directly map the user input to a database query. Formal LIst Processor (FLIP), an early language for pattern matching based on LISP structures, works on the basis that if the input matches one of the patterns, then the system is able to build a query for the database. In pattern-matching systems, the database details were inter-mixed into the code, limiting them to specific databases and to the number and complexity of the patterns. Natural language processing was one of the approaches, allowing the user to interactively interrogate the stored data.
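The pattern-matching idea can be sketched as follows. The patterns, table, and column names here are purely hypothetical, not those of any actual early system; the point is only that the question shape is mapped directly to an SQL template, with no deeper linguistic analysis:

```python
import re

# Hypothetical patterns: each maps a question shape to an SQL template.
# In the early systems such patterns and database details were hard-coded.
PATTERNS = [
    (re.compile(r"what is the (\w+) of (\w+)\??", re.I),
     "SELECT {0} FROM ships WHERE name = '{1}';"),
    (re.compile(r"list all (\w+)s\??", re.I),
     "SELECT * FROM {0}s;"),
]

def match_query(question):
    """Return an SQL string if the question matches a known pattern."""
    for pattern, template in PATTERNS:
        m = pattern.fullmatch(question.strip())
        if m:
            return template.format(*m.groups())
    return None  # no pattern matched: the system cannot answer

print(match_query("What is the speed of Enterprise?"))
# SELECT speed FROM ships WHERE name = 'Enterprise';
```

The sketch also exhibits the limitation noted above: any question outside the fixed pattern list simply fails, and the table names are baked into the code.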
2.1 LUNAR system
LUNAR is a system that answers questions about samples of rock brought back from the moon; its name reflects this lunar subject matter. The system was informally introduced in 1971. LUNAR uses two databases, one for chemical analyses and the other for literature references, to achieve its functions, and it is built on an Augmented Transition Network (ATN) parser and Woods' Procedural Semantics. According to [13], the LUNAR system's performance was quite impressive: it managed to handle 78% of requests without any errors, and this ratio rose to 90% when dictionary errors were corrected. The system was not put to intensive use, however, because of its linguistic limitations.
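A minimal sketch of the ATN idea, under the assumption of a toy lexicon and network (not LUNAR's actual grammar): the parser walks a network of states, traversing an arc for each word it consumes and filling registers with the constituents it finds along the way:

```python
# A toy Augmented Transition Network (ATN) in the spirit of LUNAR's parser.
# The states, arcs, and lexicon are illustrative, not LUNAR's actual grammar.
LEXICON = {"the": "DET", "rock": "N", "sample": "N", "contains": "V",
           "silicon": "N"}

# Each arc: (state, word category, next state, register to set)
ARCS = [
    ("S0", "DET", "S1", None),
    ("S1", "N",   "S2", "subject"),
    ("S2", "V",   "S3", "verb"),
    ("S3", "N",   "S4", "object"),
]
FINAL = {"S4"}

def atn_parse(words):
    """Walk the network, filling registers as arcs are traversed."""
    state, registers = "S0", {}
    for word in words:
        cat = LEXICON.get(word)
        for s, c, nxt, reg in ARCS:
            if s == state and c == cat:
                if reg:
                    registers[reg] = word
                state = nxt
                break
        else:
            return None  # no arc accepts this word
    return registers if state in FINAL else None

print(atn_parse("the sample contains silicon".split()))
# {'subject': 'sample', 'verb': 'contains', 'object': 'silicon'}
```

Real ATNs augment this picture with recursive sub-networks and arbitrary tests and actions on the arcs, which is what gave LUNAR its expressive power.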
2.2 LADDER
The LADDER system was designed as a natural language interface to a database of information about US Navy ships. LADDER uses a semantic grammar technique that interleaves syntactic and semantic processing to parse questions and query a distributed database; question answering is done by parsing the input and mapping the parse tree to a database query. LADDER is based on a three-layered architecture [14]. The first component, Informal Natural Language Access to Navy Data (INLAND), accepts questions in natural language and generates a query to the database. The queries from INLAND are directed to the second component, Intelligent Data Access (IDA). INLAND builds a fragment of an IDA query for each lower-level syntactic unit in the English input; these fragments are then combined as higher-level syntactic units are recognized, and the combined fragments are sent to IDA as a command at the sentence level. IDA composes an answer relevant to the user's original query, in addition to planning the correct sequence of file queries. The third component, the File Access Manager (FAM), finds the locations of the generic files and handles access to them in the distributed database. LISP was used to implement the LADDER system.
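The distinctive feature of a semantic grammar is that its categories are domain concepts rather than purely syntactic ones. A minimal sketch, assuming a hypothetical ship domain (the property and ship names below are invented, not LADDER's actual vocabulary):

```python
# A toy semantic grammar in the spirit of LADDER: the slot categories are
# domain concepts (PROPERTY, SHIP) rather than syntactic ones (NP, VP),
# interleaving syntactic and semantic processing in a single pattern.
PROPERTIES = {"length", "speed", "tonnage"}   # hypothetical domain lexicon
SHIPS = {"Kennedy", "Nimitz"}

def parse_ship_question(words):
    """Recognise 'what is the <PROPERTY> of the <SHIP>' and build a query."""
    if (len(words) == 7 and words[:3] == ["what", "is", "the"]
            and words[3] in PROPERTIES        # semantic check: a PROPERTY
            and words[4:6] == ["of", "the"]
            and words[6] in SHIPS):           # semantic check: a SHIP
        return f"SELECT {words[3]} FROM ships WHERE name = '{words[6]}'"
    return None

print(parse_ship_question("what is the length of the Kennedy".split()))
# SELECT length FROM ships WHERE name = 'Kennedy'
```

Because the grammar checks domain membership while parsing, semantically ill-formed questions ("what is the length of the ocean") are rejected at parse time rather than producing a meaningless query.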
2.3 CHAT-80
CHAT-80 is one of the most referenced NLP systems of the 1980s. It was implemented in Prolog and was an impressive, efficient and sophisticated system. The database of CHAT-80 consists of facts about roughly 150 countries of the world (e.g. oceans, major seas, major rivers and major cities) and a small English vocabulary that is sufficient for querying the database. An English question is processed in three stages [15]. The system translates the English question into a logical form through three serial and complementary functions, where:
Logical constants are used to represent words.
Predicates are used to represent verbs, nouns, and adjectives with their associated prepositions; a predicate can have one or more arguments.
Conjunctions of predicates are used to represent complex phrases or sentences.
Figure 1 Chat-80 Processing Scheme
These functions are parsing, interpretation and scoping. The grammatical structure of a sentence is determined by the parsing module, while interpretation and scoping consist of various translation rules expressed directly as Prolog clauses. The basic method followed by Chat-80 is to attach extra control information to the logical form of a query in order to make it an efficient piece of Prolog program that can be executed directly to produce the answer. The generated control information comes in two forms:
An ordering of the predications that determines the order in which Prolog will attempt to satisfy the query.
A partitioning that separates the overall program into a number of independent sub-problems, limiting the amount of backtracking performed by Prolog.
In this way, Prolog is led to answer the queries in an obviously sensible way and the Prolog compiler can compile the transformed query into code that is as efficient as iterative loops in a conventional language.
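The first form of control information, goal reordering, can be illustrated in a few lines. This is only a sketch of the idea: the query, predicate names, and relation sizes below are hypothetical, and Chat-80 itself performs this transformation on Prolog clauses rather than Python data:

```python
# A query's logical form as a conjunction of predicates, e.g. for
# "Which European countries border the Atlantic?" (names hypothetical):
query = [("borders", "C", "atlantic"),
         ("country", "C"),
         ("in_continent", "C", "europe")]

# Assumed estimates of how many facts each predicate can match:
EST_SIZE = {"country": 150, "borders": 50, "in_continent": 30}

def add_control(predicates):
    """Order conjuncts so the most restrictive goals are tried first."""
    return sorted(predicates, key=lambda p: EST_SIZE[p[0]])

for goal in add_control(query):
    print(goal)
```

With this ordering, the most restrictive goal (`in_continent`) binds the variable `C` to a small candidate set before the larger relations are consulted, which is exactly how the control information prunes Prolog's search.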
3. VARIOUS APPROACHES TO NATURAL LANGUAGE PROCESSING INTERFACE
Natural language is a topic of interest from the computational viewpoint due to the inherent ambiguity that language possesses [1]. Several researchers have applied different techniques to deal with language. The next few sub-sections describe diverse strategies that are used to process language for various purposes.
3.1 Symbolic Approach (Rule Based Approach)
Natural Language Processing appears to be a strongly symbolic activity. Words are symbols that stand for objects and concepts in the real world, and they are put together into sentences that follow well-specified grammar rules. Knowledge about language is explicitly encoded in rules or other forms of representation. Language is analyzed at various levels to obtain information, and certain rules are applied to this information to achieve linguistic functionality [1]. As human language capabilities include rule-based reasoning, they are well supported by symbolic processing, in which rules are formed for every level of linguistic analysis.
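A minimal sketch of the symbolic style, assuming a few invented rules for one level of analysis (part-of-speech guessing); a real rule-based system would have hand-written rule sets like this at every linguistic level:

```python
# Knowledge about language written down as explicit, ordered rules
# (illustrative only; real rule sets are far larger and more careful).
RULES = [
    (lambda w: w in {"the", "a", "an"},  "DET"),
    (lambda w: w.endswith("ing"),        "VERB"),
    (lambda w: w.endswith("ly"),         "ADV"),
    (lambda w: w.endswith("s"),          "NOUN"),  # crude plural rule
]

def tag(word):
    """Apply the rules in order; the first matching rule decides."""
    for test, label in RULES:
        if test(word):
            return label
    return "NOUN"  # default category when no rule fires

print([(w, tag(w)) for w in "the dogs are running quickly".split()])
```

The behaviour is fully transparent (each decision can be traced to a rule), which is the chief appeal of the symbolic approach, and equally its cost: every exception must be encoded by hand.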
3.2 Empirical Approach (Corpus Based Approach)
Empirical approaches are based on statistical and other data-driven analyses of raw data in the form of text corpora, i.e. collections of machine-readable text. The approach has been around since NLP began in the early 1950s, but empirical NLP emerged as an alternative to rule-based NLP only in the last decade. Corpora are primarily used as a source of information about language, and a number of techniques have emerged to enable the analysis of corpus data. Syntactic analysis can be achieved on the basis of statistical probabilities estimated from a training corpus, and lexical ambiguities can be resolved by considering the likelihood of one or another interpretation given the context. Given the positive results shown by the empirical approach, several different symbolic and statistical methods have been employed, though most of them are used to build one part of a larger information-extraction system.
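The corpus-based idea of resolving a lexical ambiguity by estimated likelihood can be sketched as follows; the tiny tagged corpus here is invented for illustration, standing in for a real annotated corpus of millions of words:

```python
from collections import Counter

# A tiny hypothetical tagged corpus: (word, tag) observations.
corpus = [("book", "NOUN"), ("book", "NOUN"), ("book", "VERB"),
          ("flies", "VERB"), ("flies", "NOUN"), ("flies", "VERB")]

counts = Counter(corpus)                     # joint (word, tag) counts
word_totals = Counter(w for w, _ in corpus)  # word counts

def p_tag_given_word(tag, word):
    """Maximum-likelihood estimate of P(tag | word) from the corpus."""
    return counts[(word, tag)] / word_totals[word]

def most_likely_tag(word):
    """Resolve the ambiguity in favour of the likelier interpretation."""
    return max({t for w, t in corpus if w == word},
               key=lambda t: p_tag_given_word(t, word))

print(most_likely_tag("book"))   # "book" is a noun in 2 of 3 observations
```

No rule states that "book" is usually a noun; the preference is read off the data, which is the defining trait of the empirical approach.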
3.3 Connectionist Approach (Using Neural Network)
Since human language capabilities are based on neural networks in the brain, Artificial Neural Networks (also called connectionist networks) provide an essential starting point for modeling language processing [2, 18]. Instead of symbols, the approach is based on distributed representations that correspond to statistical regularities in language. Though there has been significant research on applying neural networks to language processing, little has been done on using sub-symbolic learning in language processing. The SHRUTI system developed by [3] is a neurally inspired system for event modeling and temporal processing at a connectionist level.
4. NATURAL LANGUAGE DATABASE INTERFACE BASED ON PROBABILISTIC CONTEXT FREE GRAMMAR
All the earlier tools translate a natural language query into an intermediate language similar to first-order logic and then into SQL using a set of definitive rules. The intermediate language expresses the meaning of the query in terms of high-level concepts that are independent of the database structure. The translation process, however, lacks robustness, since the premises of the rules must match the phrases of the query exactly or the rule is rejected. The purpose of the PCFG approach is to overcome this problem by constructing the most probable grammar tree and analyzing the non-terminals (phrases) in that tree to collect the parameters used to build the SQL query.
4.1 Overview of Natural Language Database Interface
The overall system architecture of our natural language database interface is given in Figure 2. The original input to the system is a query expressed in natural language. To process a query, the first step is part-of-speech tagging; after this step each word of the query is tagged [6, 7]. The second step is parsing the tagged sentence with a PCFG. The PCFG parser analyzes the query sentence according to the tag of each word and calculates the probability of all possible grammar trees; the result of the analysis is the grammar tree with the maximum probability. Finally, the SQL translator processes the grammar tree to obtain the SQL query by means of a series of dependency rules.
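The three steps above can be sketched in miniature. Everything here is hypothetical: the tag set, the tiny grammar with its probabilities, the two candidate trees, and the dependency rule are invented to show the mechanism, not taken from the system's actual grammar:

```python
from math import prod

# Step 1: the query "count employees in sales", already POS-tagged.
tagged = [("count", "V"), ("employees", "N"), ("in", "P"), ("sales", "N")]

# Step 2: a tiny PCFG. A tree's probability is the product of the
# probabilities of the rules it uses; the parser keeps the best tree.
RULE_PROB = {"VP -> V NP": 0.7, "VP -> V NP PP": 0.3,
             "NP -> N PP": 0.4, "NP -> N": 0.6, "PP -> P N": 1.0}

# Two complete analyses of the same tag sequence (a PP-attachment
# ambiguity), listed by the rules each tree uses:
candidates = {
    "np-attach": ["VP -> V NP", "NP -> N PP", "PP -> P N"],     # 0.28
    "vp-attach": ["VP -> V NP PP", "NP -> N", "PP -> P N"],     # 0.18
}

def tree_prob(rules):
    return prod(RULE_PROB[r] for r in rules)

best = max(candidates, key=lambda name: tree_prob(candidates[name]))

# Step 3: a dependency rule reads SQL slots off the winning tree.
table, value = tagged[1][0], tagged[3][0]
if best == "np-attach":  # "in sales" restricts "employees"
    sql = f"SELECT COUNT(*) FROM {table} WHERE dept = '{value}'"
print(sql)
```

The point of the probabilistic step is visible even at this scale: both trees are grammatical, so an exact-match rule system would have no principled way to choose, whereas the PCFG picks the analysis with the higher probability and the SQL translator works from that single tree.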