Figure 2 NLDBI Architecture
4.2 Probabilistic Context Free Grammar
A context free grammar (CFG) can formally describe most sentence structures, and it is well formed enough that efficient sentence parsers can be built on top of it [19]. Probabilistic context free grammar (PCFG) inherits most advantages of CFG, and it is the simplest statistical model for analyzing natural language. Usually, natural language sentences are transformed into a tree structure through PCFG, and the grammar tree is analyzed according to the user’s requirements. A probabilistic context free grammar consists of the following:
A terminal set: {wk}, where wk is a word, corresponding to a leaf in the grammar tree;
A non-terminal set: {Ni}, where Ni is a symbol used to generate terminals, corresponding to a non-leaf node in the grammar tree;
The grammar is weakly equivalent to a Chomsky Normal Form grammar (CNFG), which consists of rules of only two forms and is the most concise grammar. With n non-terminals and v terminals, the number of parameters for each of the four forms is shown in table 1.
Table 1
Consider a sentence w1m that is a sequence of words w1 w2 w3 … wm (ignoring punctuation), where each wi stands for a word in the sentence. The grammar tree of w1m can be generated by a set of pre-defined grammar rules. Because a word wi may be generated by different non-terminals, and a non-terminal may likewise be generated by different sets of non-terminals, usually more than one grammar tree can be generated. Take the sentence “ate dinner on the table with a fork” for example: there are 2 grammar trees corresponding to the sentence, as shown in figure 3.
Figure 3 Two possible grammar trees that may be generated for the sentence “ate dinner on the table with a fork”
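The ambiguity described above can be made concrete with a minimal sketch of CYK parsing over a toy PCFG in Chomsky Normal Form. The rules, the probabilities, and the names `BINARY`, `LEXICAL`, `cyk`, and `parse_stats` are all assumptions made for illustration; they are not taken from the paper and need not reproduce the exact trees of figure 3.

```python
from collections import defaultdict

# Toy PCFG in Chomsky Normal Form. Every rule and probability here is an
# assumption for illustration; probabilities sum to 1 for each left-hand side.
BINARY = {  # (parent, left child, right child) -> probability
    ("VP", "VP", "PP"): 0.3,
    ("VP", "V", "NP"): 0.7,
    ("NP", "NP", "PP"): 0.2,
    ("NP", "Det", "N"): 0.5,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {  # (parent, word) -> probability
    ("NP", "dinner"): 0.3,
    ("V", "ate"): 1.0,
    ("Det", "the"): 0.5,
    ("Det", "a"): 0.5,
    ("N", "table"): 0.5,
    ("N", "fork"): 0.5,
    ("P", "on"): 0.5,
    ("P", "with"): 0.5,
}


def cyk(words):
    """Fill two CYK tables keyed by (start, end, non-terminal):
    the number of grammar trees and the probability of the best tree."""
    n = len(words)
    count = defaultdict(int)
    best = defaultdict(float)
    # Spans of length 1: lexical rules.
    for i, word in enumerate(words):
        for (parent, terminal), p in LEXICAL.items():
            if terminal == word:
                count[i, i + 1, parent] += 1
                best[i, i + 1, parent] = max(best[i, i + 1, parent], p)
    # Longer spans: combine two adjacent sub-spans with a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, left, right), p in BINARY.items():
                    n_left = count[i, k, left]
                    n_right = count[k, j, right]
                    if n_left and n_right:
                        count[i, j, parent] += n_left * n_right
                        candidate = p * best[i, k, left] * best[k, j, right]
                        best[i, j, parent] = max(best[i, j, parent], candidate)
    return count, best


def parse_stats(sentence, root="VP"):
    """Return (number of trees, probability of the most probable tree)."""
    words = sentence.split()
    count, best = cyk(words)
    key = (0, len(words), root)
    return count[key], best[key]


if __name__ == "__main__":
    print(parse_stats("ate dinner with a fork"))
    print(parse_stats("ate dinner on the table with a fork"))
```

Under these assumed rules, the shortened phrase "ate dinner with a fork" has two trees, one attaching the prepositional phrase to the verb phrase and one attaching it to the noun phrase. The full sentence has more, because the toy rules let each prepositional phrase attach in several places. The probabilities are what allow a parser to prefer one tree over the others.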
4.3 SQL Translator
The SQL translator translates the leaves of the tree into the corresponding SQL. In practice, this process collects information from the parsed tree. Two techniques may be used to collect the information: dependency structure and verb subcategorization [8, 9, 10]. These techniques are also used in disambiguation, because a PCFG is context-free and ancestor-free, so information in the context of a node is not taken into account when the parsed tree is constructed. A dependency structure captures the inherent relations that occur in the corpus texts and that may be critical in real-world applications; it is usually described by typed dependencies. A typed dependency represents a grammatical relation between individual words with a dependency label, such as subject or indirect object. Verb subcategorization is the other technique [11]. If we know the subcategorization frame of the verb, we can easily find the objects of the verb, and therefore the target of the query. Usually this process can be done after scanning the parsed tree.
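As a rough illustration of how typed dependencies and a subcategorization frame could point to the target of a query, consider the sketch below. The dependency triples, the `FRAMES` table, the `prep_of` convention for finding the table name, and the functions `collect_query_parts` and `to_sql` are all hypothetical; the paper does not specify this procedure.

```python
# Hypothetical typed dependencies for "show the salary of employees",
# written as (label, head word, dependent word).
DEPENDENCIES = [
    ("dobj", "show", "salary"),
    ("det", "salary", "the"),
    ("prep_of", "salary", "employees"),
]

# Hypothetical subcategorization frames: the dependency labels under which
# each verb takes its object.
FRAMES = {"show": ["dobj"], "list": ["dobj"]}


def collect_query_parts(verb, dependencies):
    """Find the query target (object of the verb) and the table it belongs to."""
    target = next(
        dep for label, head, dep in dependencies
        if head == verb and label in FRAMES[verb]
    )
    # Assumption: the noun attached to the target by "of" names the table.
    table = next(
        dep for label, head, dep in dependencies
        if head == target and label == "prep_of"
    )
    return target, table


def to_sql(verb, dependencies):
    """Build a simple SELECT statement from the collected parts."""
    column, table = collect_query_parts(verb, dependencies)
    return f"SELECT {column} FROM {table}"


if __name__ == "__main__":
    print(to_sql("show", DEPENDENCIES))
```

A real translator would also need to map words to actual column and table names and to handle conditions; this sketch only shows the lookup that the subcategorization frame makes possible.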
5. THE GENERIC INTERACTIVE NATURAL LANGUAGE INTERFACE TO DATABASE SYSTEM ARCHITECTURE:
The architecture of the GINLIDB system consists of two major components:
1. Linguistic handling component
2. SQL constructing component.
The first component controls the correctness of the natural language query, both in its grammatical structure and in whether it can be successfully transformed into an SQL statement.
The second component in GINLIDB generates the exact SQL statement, opens a connection to the database in use, executes the generated statement, and returns the query's result to the user [12].
Figure 4 The architecture of GINLIDB
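The execute-and-return step of the SQL constructing component can be sketched as follows. This is only an assumption about how such a step might look: it uses Python's built-in `sqlite3` module with an in-memory database, and the `employees` table, its rows, and the function names `run_generated_sql` and `demo` are invented for illustration rather than taken from GINLIDB.

```python
import sqlite3


def run_generated_sql(connection, sql):
    """Execute a generated SELECT statement and return (column names, rows)."""
    cursor = connection.execute(sql)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


def demo():
    """Open a connection, run one generated statement, and return its result."""
    connection = sqlite3.connect(":memory:")  # stand-in for the database in use
    try:
        # Hypothetical schema and data, created only for this example.
        connection.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
        connection.executemany(
            "INSERT INTO employees VALUES (?, ?)",
            [("Ann", 5000), ("Bob", 4200)],
        )
        generated = "SELECT name FROM employees WHERE salary > 4500"
        return run_generated_sql(connection, generated)
    finally:
        connection.close()


if __name__ == "__main__":
    print(demo())
```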
A. Graphical User Interface (GUI)
The user interacts with the GINLIDB system in a user-friendly environment. The user does not need knowledge of computers or database terms. Interaction with our system is through suitable visual forms, buttons, and menus.
B. Linguistic handling component
The linguistic handling component consists of three parts: lexical analysis, parser, and semantic representation.
1) Lexical analysis
Lexical analysis divides the sentence into simpler elements called tokens. This process is performed with the help of token analysis, a spelling checker, ambiguity reduction, and excessive token removal.
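A minimal sketch of such a lexical stage is given below, covering tokenization, excessive token removal, and a simple spelling check; ambiguity reduction is left out. The `VOCABULARY` and `EXCESSIVE` word lists and the `lexical_analysis` function are assumptions for illustration, and the spelling check uses `difflib.get_close_matches` from the Python standard library rather than whatever checker GINLIDB actually uses.

```python
import re
from difflib import get_close_matches

# Hypothetical word lists; a real system would derive them from its
# dictionary and from the database schema.
VOCABULARY = ["show", "list", "name", "salary", "employees", "where", "greater", "than"]
EXCESSIVE = {"please", "the", "me", "of", "all"}


def lexical_analysis(sentence):
    """Split a query into tokens, drop excessive tokens, and correct spelling."""
    tokens = re.findall(r"[a-z]+|\d+", sentence.lower())
    result = []
    for token in tokens:
        if token in EXCESSIVE:
            continue  # excessive token removal
        if token.isdigit() or token in VOCABULARY:
            result.append(token)
            continue
        # Spelling check: replace the token with the closest known word, if any.
        match = get_close_matches(token, VOCABULARY, n=1, cutoff=0.75)
        result.append(match[0] if match else token)
    return result


if __name__ == "__main__":
    print(lexical_analysis("Please show me the salery of all employees"))
```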
2) GINLIDB Parser
Natural language queries carry substantial structural ambiguity, so they are not easily parsed by programs. The GINLIDB parser is designed with two stages of grammars: lexical and syntactic. Token generation (lexical analysis) is the first stage, in which the input stream is split into meaningful symbols. Syntactic analysis based on an Augmented Transition Network (ATN) is the second stage, which checks whether the structure of the tokens is an allowable grammatical structure. The parser performs this check according to a Context-Free Grammar (CFG).
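The syntactic check can be sketched with a small recursive transition network that records what it matches. This is a simplification of an ATN, which would additionally attach tests and actions on registers to the arcs. The lexicon, the three networks, and the functions `traverse` and `accepts` are hypothetical and do not reproduce the GINLIDB grammar.

```python
# Hypothetical lexicon: token -> syntactic category.
LEXICON = {
    "show": "VERB", "list": "VERB",
    "salary": "NOUN", "name": "NOUN", "employees": "NOUN",
    "of": "PREP", "the": "DET",
}

# Hypothetical networks. Each arc is (kind, label, next state):
# "cat" consumes one token of the given category,
# "push" traverses the named sub-network.
NETWORKS = {
    "QUERY": {
        "start": "q0", "final": {"q2"},
        "arcs": {
            "q0": [("cat", "VERB", "q1")],
            "q1": [("push", "NP", "q2")],
            "q2": [("push", "PP", "q2")],
        },
    },
    "NP": {
        "start": "n0", "final": {"n2"},
        "arcs": {
            "n0": [("cat", "DET", "n1"), ("cat", "NOUN", "n2")],
            "n1": [("cat", "NOUN", "n2")],
        },
    },
    "PP": {
        "start": "p0", "final": {"p2"},
        "arcs": {
            "p0": [("cat", "PREP", "p1")],
            "p1": [("push", "NP", "p2")],
        },
    },
}


def traverse(name, tokens, pos, record):
    """Yield (end position, record) for every way to traverse a network from pos."""
    net = NETWORKS[name]
    stack = [(net["start"], pos, record)]
    while stack:
        state, i, rec = stack.pop()
        if state in net["final"]:
            yield i, rec
        for kind, label, nxt in net["arcs"].get(state, []):
            if kind == "cat":
                if i < len(tokens) and LEXICON.get(tokens[i]) == label:
                    stack.append((nxt, i + 1, rec + [(label, tokens[i])]))
            else:  # "push": recurse into the sub-network
                for j, sub in traverse(label, tokens, i, []):
                    stack.append((nxt, j, rec + [(label, sub)]))


def accepts(tokens):
    """True if the whole token sequence has an allowable structure."""
    return any(end == len(tokens) for end, _ in traverse("QUERY", tokens, 0, []))


if __name__ == "__main__":
    print(accepts(["show", "the", "salary", "of", "employees"]))
    print(accepts(["show", "of", "salary"]))
```

The three networks correspond to the context-free rules QUERY → VERB NP PP*, NP → DET NOUN | NOUN, and PP → PREP NP, which is one way to read the statement that the transition network checks the tokens against a CFG.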