A.A. Protskevich CALCULATION OF OPTIMAL PARAMETERS OF THE CODE DICTIONARY
The problem of effectively representing and processing textual information in network information systems remains urgent despite significant achievements in this direction. Within this problem, the tasks of compressing, coding and transferring textual information for use in a distributed information environment are of particular importance. The efficiency of textual information coding can be increased by enlarging the coded information unit from a symbol to a word. At the stage of sentence analysis we obtain a set of stems and morphological codes, which can be used to code the input text. The method is based on creating and applying a code dictionary of stems and morphemes for the morphological analysis and coding of a textual message.
Let us calculate the potential compression factor of a textual message as a function of the code dictionary size, assuming that the size of the message dictionary is unlimited. The compression as a function of the code dictionary size is then determined as:
where k is the compression factor; nD is the code dictionary size; p(i) is the probability of occurrence of the i-th word, provided that the words are sorted in decreasing order of their probability of occurrence as determined by the generalized Zipf-Mandelbrot law, taking into account deviations in the low-frequency zone; hD(nD,i) and h(i) are the costs in bits of a coded and an uncoded word, which depend on the particular language and on the index construction method in the code dictionary; n(N) is the code dictionary size on a sample of size N; C and L are the number of symbols in the alphabet and the average word length in the natural language. Because words in a natural language occur with different frequencies, prefix codes can be applied for index construction in the code dictionary to improve the compression factor. The calculations showed that for large nD the efficiency of textual information coding increases as the code dictionary grows. For Russian the compression reaches 37%, and 16% when prefix coding is used for index construction in the code dictionary. Therefore, there is no reason to limit the code dictionary size.
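The dependence of compression on the code dictionary size can be explored numerically. The following Python code is a minimal sketch under stated assumptions and is not the author's formula: the function names, the Zipf-Mandelbrot parameters s and b, the finite vocabulary size, the alphabet size and the average word length are all illustrative values. It assumes that an uncoded word costs L·log2(C) bits, that a coded word costs a fixed-width dictionary index, and that a word outside the dictionary costs an escape index plus its plain spelling. It does not model the deviations in the low-frequency zone or prefix coding of the indices.

```python
import math


def zipf_mandelbrot(vocab_size, s=1.0, b=2.7):
    """Probabilities p(1), ..., p(vocab_size) in decreasing order.

    p(i) is proportional to 1 / (i + b) ** s; s and b are assumed values.
    """
    weights = [1.0 / (i + b) ** s for i in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def compression_ratio(probs, dict_size, alphabet_size=32, avg_word_len=6.0):
    """Expected coded bits per word divided by expected uncoded bits per word.

    All cost models here are illustrative assumptions.
    """
    # Assumed cost h of an uncoded word: L characters of log2(C) bits each.
    h_plain = avg_word_len * math.log2(alphabet_size)
    # Assumed cost hD of a coded word: fixed-width index into the dictionary,
    # with one extra index value reserved as an escape marker.
    h_index = math.ceil(math.log2(dict_size + 1))
    # Probability that a word is among the dict_size most frequent ones.
    in_dict = sum(probs[:dict_size])
    # A word outside the dictionary costs the escape index plus its spelling.
    coded = in_dict * h_index + (1.0 - in_dict) * (h_index + h_plain)
    return coded / h_plain


if __name__ == "__main__":
    # A finite vocabulary stands in for the unlimited message dictionary.
    probs = zipf_mandelbrot(100_000)
    for n_d in (16, 256, 4096, 65536):
        print(n_d, round(compression_ratio(probs, n_d), 3))
```

In this sketch the index cost is the same for every dictionary word. Replacing the fixed-width index with a prefix code would make the cost of a coded word depend on its rank i, which is the refinement the text describes for improving the compression factor.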
I.A. Purtov
When considering the knowledge representation found in texts of requirements, specifications and QA protocols, we need to introduce relations of a higher level than predicate(subject, object) or property(thing). This is due to the more complicated semantic structure of the real events described by natural-language sentences.
Temporal and spatial relations such as “earlier”, “later”, “at the same time”, “closer”, “further”, “lower”, “higher” and so on are good examples of such constructions.
Prolog is a first-order language, so its syntax does not provide constructions corresponding to relations between relations, that is, second-order predicates.
However, such predicates can be built with the means of standard Prolog. While working on the problem, the researcher obtained typical constructions (sequences of predicates) that imitate second-order relations for different combinations of the names of the subject, object and predicate relations. Among them are the general case, in which the names of the subjects and objects of the connected predicates are different; the case in which the predicates have the same names of subjects and objects; and, lastly, the case of a relation between identical predicates.
A short Prolog program can serve as a good example (see p.1). Here the predicates pred1 and pred2 are relations between certain subjects (subj1 and subj2) and objects (obj1 and obj2). The relation rel, which acts as a second-order predicate, connects the first-order constructions. Of course, it is not really a second-order predicate, because it connects not the predicates themselves but their names. Moreover, as its declaration shows, rel is a relation between symbol strings (‘symbol’). Nevertheless, such a model is adequate, because it satisfies the truth requirements placed on second-order relations, namely:
- the relation is true if:
  - the names of the predicates serving as arguments of the relation exist in the program;
  - the order of the arguments matches the one given for the second-order predicate (rel), because without additional rules the Prolog system treats the relation as asymmetric;
  - both predicates in the second-order relation are true;
- in all other cases the relation is considered false.
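predicates
    pred1(symbol, symbol)    /* first-order relation: pred1(subject, object) */
    pred2(symbol, symbol)    /* first-order relation: pred2(subject, object) */
    rel(symbol, symbol)      /* relation between the names of two predicates */
clauses
    pred1(subj1, obj1).
    pred2(subj2, obj2).
    /* rel holds only if both named first-order facts are true */
    rel(pred1, pred2) :- pred1(subj1, obj1), pred2(subj2, obj2).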
p.1. Modeling of second-order relations in Prolog.
The example above is valid provided that the predicates do not have the same names and arguments, that is, the relation holds between two different actions performed by different subjects on different objects. Very often, however, a second-order relation connects events that have the same subject and object. In this case it is necessary to introduce pseudonyms: single-place predicates that take the same meaning as the original predicate.