...
where attributes i to n are providing the factual information (values w to y of an object), and attribute m is providing the new - inferred - information (with value z).
Attributes providing the factual information are called the "input attributes" to the rule. The attribute providing the new - inferred - information is called the "output attribute" from the rule. The input attributes are physically stored as columns or "input items" in the rule Info file. The output attribute is physically stored as a column or "output item" in the rule Info file.
Example:
IF
THEN
ELSE IF
and phase is "any">
THEN
ELSE IF
THEN
As with the dataset, "values" are physically stored at the intersection of each occurrence’s record with each input or output item in the rule Info file.
Input items in a rule must have the same definition (name, type, size...) and coding schema as their corresponding item in the dataset.
An "inference" is the action of deriving new information for an object according to a) the available information the object provides, and b) the rule which is activated. It proceeds in 5 steps:
1. The input attributes are identified in the rule.
2. The values for these attributes are retrieved from the object in the
dataset and constitute a fact.
3. The occurrence of the rule which matches the fact is searched for.
4. The output attribute definition and value are retrieved from the
matching occurrence,
5. and are added to the object in the dataset.
When a rule is activated on a dataset, one inference takes place for each object of the dataset, one after the other. The result is one new attribute for the whole dataset, holding one inferred value for each object.
An attribute of the dataset that has been previously inferred using a rule is further considered as storing available information. It can thus be used as an input attribute to other rules.
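The inference steps and the chaining of rules described above can be sketched as follows. This is only an illustration under stated assumptions: the rule and the dataset are modelled as plain Python lists of dictionaries rather than Info files, matching is exact only (no wild cards), and every item name and value (TEXT, DEPTH, CLASS, USE, apply_rule) is hypothetical.

```python
# Hypothetical in-memory stand-ins: a rule is a list of occurrences (dicts),
# a dataset is a list of objects (dicts). All item names are made up.

def apply_rule(dataset, rule, inputs, output):
    """Run one inference per object and add the output attribute to it."""
    # Index the occurrences by their combination of input values.
    table = {tuple(occ[name] for name in inputs): occ[output] for occ in rule}
    for obj in dataset:
        fact = tuple(obj[name] for name in inputs)  # steps 1 and 2
        obj[output] = table.get(fact, "")           # steps 3 to 5; blank if no match
    return dataset

rule_class = [{"TEXT": "c", "CLASS": "1"}, {"TEXT": "s", "CLASS": "2"}]
rule_use = [
    {"CLASS": "1", "DEPTH": "d", "USE": "x"},
    {"CLASS": "2", "DEPTH": "d", "USE": "y"},
]
dataset = [{"TEXT": "c", "DEPTH": "d"}, {"TEXT": "s", "DEPTH": "d"}]

apply_rule(dataset, rule_class, ["TEXT"], "CLASS")        # adds CLASS
apply_rule(dataset, rule_use, ["CLASS", "DEPTH"], "USE")  # CLASS is now an input
print(dataset)
```

The second call uses CLASS, which was itself inferred by the first call, as an ordinary input attribute.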
2.3 - Wild cards:
It is difficult, if not impossible, for an expert to foresee all the cases that can possibly occur in a set of available information. Furthermore, in some cases, several, nay, many different values of a fact will lead to the same conclusion (e.g. IF THEN ...).
Therefore a "wild card" mechanism allows the expert to define occurrences of rules which will match several different facts.
The "any" terms in the expressions of the last example above show such situations.
The "any" wild card will be, by convention, denoted as a star character (*) in a rule.
A fact for which an exact matching occurrence can be found will receive this occurrence’s output attribute value.
A fact for which no exact matching occurrence can be found will receive the output attribute value of the last occurrence of the rule that matches it under the wild card convention, if there is one. This assumes that an expert builds a rule by refinement, describing the most general cases before the most particular ones.
When no matching occurrence at all can be found for a fact, no value is provided to the output attribute, thus leaving it "blank" (or "0" (zero), depending on the output item's type). This can lead to confusion if blank (or 0) is a possible normal output value. Therefore, having a fully "wild carded" occurrence as a header of a rule will "pick up" all facts for which no information can be provided and force the output value to, say, the NODATA value.
Using these specifications, the above example would become:
Example:
IF
THEN
ELSE IF
THEN
ELSE IF
and phase is "any">
THEN
ELSE IF
THEN
The wild card convention simulates the logical OR operator.
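The matching precedence described in this section (exact match first, otherwise the last wild card match, otherwise nothing) can be sketched as below. This is a minimal illustration, not the program's actual code: the rule is modelled as a list of dictionaries, and the names find_occurrence, TEXT, PHASE and OUT, as well as all values, are assumptions made for the example.

```python
WILD = "*"  # the "any" wild card

def find_occurrence(rule, inputs, fact):
    """Return the occurrence of `rule` that matches `fact`, or None.

    An exact match wins wherever it sits in the rule; otherwise the last
    occurrence that matches through wild cards is kept.
    """
    last_wild = None
    for occ in rule:
        values = [occ[name] for name in inputs]
        if values == [fact[name] for name in inputs]:
            return occ  # exact match
        if all(v == WILD or v == fact[name] for v, name in zip(values, inputs)):
            last_wild = occ  # remember the latest wild card match
    return last_wild

# Hypothetical rule: a fully wild-carded header, then more particular cases.
rule = [
    {"TEXT": "*", "PHASE": "*", "OUT": "n"},
    {"TEXT": "c", "PHASE": "*", "OUT": "1"},
    {"TEXT": "c", "PHASE": "s", "OUT": "2"},
]
inputs = ["TEXT", "PHASE"]

print(find_occurrence(rule, inputs, {"TEXT": "c", "PHASE": "s"})["OUT"])  # exact
print(find_occurrence(rule, inputs, {"TEXT": "c", "PHASE": "g"})["OUT"])  # (c, *)
print(find_occurrence(rule, inputs, {"TEXT": "s", "PHASE": "g"})["OUT"])  # header
print(find_occurrence(rule[1:], inputs, {"TEXT": "s", "PHASE": "g"}))     # no match
```

Without the header occurrence, the last fact finds no match and the function returns None, which corresponds to the output being left blank.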
2.4 - Confidence levels:
Expert knowledge is fuzzy and subject to evolution. Furthermore, the available information on the one hand, and the inferences that can be made using that information and the expert knowledge on the other hand, both have a certain reliability. Therefore a mechanism is needed that allows each piece of available information (or factual value) held in the dataset, and each piece of inferred information (or output value) held in the rules database, to be complemented with its reliability.
The reliability of a piece of information is called its "confidence level".
Confidence levels are held by confidence level attributes, one for each attribute of the dataset, and one for the output attribute of each rule.
Therefore each object in the dataset has a confidence level value for each one of its attributes. And each occurrence of each rule has a confidence level value for its output attribute.
The coding schema for confidence levels is the following:
v Very low or no information
l Low
m Moderate
h High
When an inference takes place, the following 5 steps complement those listed above:
6. The output confidence level attribute definition is retrieved from the
matching occurrence,
7. and is added to the object in the dataset.
8. The confidence level of each input attribute of the object is determined
from:
. its own associated confidence level item, named after the attribute
with the suffix .CL in the dataset,
. or, if the former is not found, the global confidence level item of
the object, named CFL in the dataset,
. or, if neither is found, an assumed high (h) confidence level.
9. The minimum (worst) confidence level value is retrieved from the
confidence levels of all the attributes involved in the inference
process (input confidence levels of the object, and output confidence
level of the occurrence).
10. The resulting confidence level value is added to the output confidence
level attribute in the object.
We have seen that an attribute of the dataset that has been previously inferred using a rule can be used as an input attribute to other rules. Its confidence level will be used in the same way as for any other input attribute.
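The confidence level steps above can be sketched as follows. This is a hedged illustration: the object is modelled as a dictionary, the function names are hypothetical, and confidence codes are assumed to be stored in lower case as in the coding schema.

```python
ORDER = "vlmh"  # coding schema ordered from worst (v) to best (h)

def input_confidence(obj, name):
    """Confidence level of one input attribute of an object (step 8)."""
    if name + ".CL" in obj:   # its own associated confidence level item
        return obj[name + ".CL"]
    if "CFL" in obj:          # the global confidence level item of the object
        return obj["CFL"]
    return "h"                # neither found: assume high

def inferred_confidence(obj, inputs, occurrence_cl):
    """Worst level among the inputs and the matching occurrence (step 9)."""
    levels = [input_confidence(obj, name) for name in inputs] + [occurrence_cl]
    return min(levels, key=ORDER.index)

# Hypothetical object: DEPTH has its own confidence level, TYPE has none.
obj = {"DEPTH": "2", "DEPTH.CL": "m", "TYPE": "a"}
print(inferred_confidence(obj, ["DEPTH", "TYPE"], "h"))  # m

obj["CFL"] = "l"  # TYPE now falls back on the global item
print(inferred_confidence(obj, ["DEPTH", "TYPE"], "h"))  # l
```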
3 - Technical specifications:
-----------------------------
3.1 - Expert type rules:
When a rule is applied to a dataset, it is processed in the following manner:
1. Input items to the rule are located and checked in the dataset
2. and output and confidence level items are added (empty) to the dataset.
3. Then for each record in the dataset:
4. the combination of actual values for the input items is matched to its corresponding combination in the rule Info file,
5. the corresponding value for the output item is retrieved from the rule Info file,
6. the corresponding value for the output confidence level item is computed from all the available input confidence levels in the dataset and the output confidence level in the rule,
7. and finally these values are updated in the current record of the dataset.
Input and output items of a rule have a limited number of possible Info data types. These are character (C), clear integer (I) and clear numeric (N). Any other Info data type (date (D), binary integer (B) and binary floating point numeric (F)) is not to be used in rule data files.
3.2 - Class type rules:
The rules described above are called "expert type rules" as opposed to "class type rules". Class type rules are simple reclassification or recoding rules. They are used in any of the following cases:
1) convert the Info data type of an input item in the dataset from an unauthorized to an authorized type (e.g. B to I, or F to N),
2) reduce the number of different values for an input item (e.g. reclass detailed texture classes into less detailed texture classes),
3) recode the values of an input item (e.g. change codes to a more self-explanatory coding schema),
4) a combination of the above cases.
Class type rules accept only one input item and produce one output item. The input item has no limitation as to its Info data type. The output item follows the same limitations as those applicable to expert type rules.
Class type rules do not follow the wild card convention. Wild cards may not be used there.
Class type rules do not hold a confidence level associated with the output item of their occurrences. However, if the input item has an associated confidence level in the dataset, the class type rule copies it to a confidence level item associated with the output item in the dataset.
Thus class type rules may or may not produce a confidence level item together with the output item, whereas expert type rules always produce one.
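A class type rule behaves like a simple lookup table, which can be sketched as below. This is an illustration only: the dataset is a list of dictionaries, the codes in the mapping are made up, and the value written when an input value has no occurrence (the `missing` parameter) is an assumption, since this section does not specify it for class type rules.

```python
def apply_class_rule(dataset, mapping, input_item, output_item, missing=0):
    """Recode one input item into one output item, without wild cards."""
    for record in dataset:
        # `missing` is an assumed placeholder for values absent from the rule.
        record[output_item] = mapping.get(record[input_item], missing)
        # Copy the input confidence level, if there is one, to the output.
        cl = record.get(input_item + ".CL")
        if cl is not None:
            record[output_item + ".CL"] = cl
    return dataset

type_to_class = {"aa": 1, "ab": 1, "b": 2}  # hypothetical codes
dataset = [{"TYPE": "aa", "TYPE.CL": "m"}, {"TYPE": "b"}]
apply_class_rule(dataset, type_to_class, "TYPE", "CLASSTYPE")
print(dataset)
```

In this sketch the confidence level is copied record by record, which is a simplification: in an Info file an item exists either for all records or for none.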
3.3 - Other rule descriptors:
Each occurrence of a rule furthermore carries the following information:
- an author identification number,
- a last update date,
- and a pointer to a text file to hold free explanatory notes to give any more details about the occurrence (not implemented at this time).
The rules database also holds a rules information file (DICTIONARY) and an authors information file (AUTHORS).
3.4 - Dataset:
Each time a rule is activated or "fired", the input items to the rule are checked against the dataset. All input items of a rule must exist within the dataset. They must have the same definition (name, type, size) in both the dataset and the rule.
Each item in the dataset may or may not have an associated confidence level item. When an expert rule is fired, if an input item does not have an associated confidence level in the dataset, it is assumed to have the best confidence level.
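The checks made when a rule is fired can be sketched as follows. This is a minimal illustration under assumptions: item definitions are reduced to a (type, width) pair keyed by item name, and the function name and the wording of the messages are hypothetical.

```python
ALLOWED_TYPES = {"C", "I", "N"}  # Info data types accepted in rule files

def check_input_items(rule_items, dataset_items, inputs):
    """Return a list of problems; an empty list means the rule can be fired."""
    problems = []
    for name in inputs:
        definition = rule_items[name]  # (type, width) as defined in the rule
        if definition[0] not in ALLOWED_TYPES:
            problems.append(f"{name}: type {definition[0]} not allowed in a rule")
        if name not in dataset_items:
            problems.append(f"{name}: missing from the dataset")
        elif dataset_items[name] != definition:
            problems.append(f"{name}: definition differs between rule and dataset")
    return problems

# Hypothetical definitions: TYPE is wider in the dataset than in the rule.
rule_items = {"DEPTH": ("C", 1), "TYPE": ("C", 1)}
dataset_items = {"DEPTH": ("C", 1), "TYPE": ("C", 2)}
print(check_input_items(rule_items, dataset_items, ["DEPTH", "TYPE"]))
```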
3.5 - Naming conventions:
A rule is an Info file stored in the $PTRHOME/xxx_rules Arc/Info workspace, where xxx refers to the domain to which the rules apply (e.g. eur32_rules refers to rules applicable to the Soils Geographical Database of Europe at Scale 1:1,000,000 version 3.2).
It is named RULE followed by an identifier which uniquely identifies the rule in the rule database.
Each record of the rule file is called an occurrence of the rule.
Input and output items in a rule follow the Arc/Info naming conventions with one restriction: an item name must not exceed 13 characters. (The reason for this is that Info's 16-character limit is reached once the 3-character suffix of the next naming convention, for associated confidence level items, is appended.)
An associated confidence level item has the same name as the item to which it is associated but is suffixed by ".CL" (e.g. if item name is ITEM then associated confidence level item name must be ITEM.CL).
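The arithmetic behind the 13-character restriction can be sketched as below. The helper name cl_item_name is hypothetical; the limits and the suffix are those stated above.

```python
CL_SUFFIX = ".CL"
MAX_ITEM_NAME = 13    # restriction on rule item names
INFO_NAME_LIMIT = 16  # Info's limit on item names

def cl_item_name(item_name):
    """Name of the confidence level item associated with `item_name`."""
    if len(item_name) > MAX_ITEM_NAME:
        raise ValueError(f"{item_name!r} exceeds {MAX_ITEM_NAME} characters")
    return item_name + CL_SUFFIX

print(cl_item_name("ITEM"))  # ITEM.CL
# The longest allowed name plus the suffix exactly reaches Info's limit.
print(len(cl_item_name("A" * MAX_ITEM_NAME)) == INFO_NAME_LIMIT)  # True
```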
3.6 - Class rule structure:
A class rule is one that classifies or recodes one and only one input attribute into an output attribute.
COLUMN ITEM NAME WIDTH OUTPUT TYPE N.DEC ALTERNATE NAME INDEXED?
1 NUM_AUTHOR 2 2 I - -
Identification number of author of the occurrence of the rule.
3 LAST_UPD 8 8 D - -
Last update date of the occurrence of the rule.
11 NOTE 4 5 B - -
Pathname to an ASCII explanatory note file of the occurrence of the rule.
(Not used at this time.)
15 ? ? ? - -
Output attribute from the rule.
? ? ? ? - -
Input attribute to the rule. In the case of a class rule, there is only
one input attribute.
** REDEFINED ITEMS **
1 CLASS_RULE 2 2 I - -
The name of this redefined attribute is only used to differentiate
class type from expert type rules.
Example of a class rule to recode an attribute named TYPE to a new attribute
named CLASSTYPE:
COLUMN  ITEM NAME    WIDTH  OUTPUT  TYPE  N.DEC  ALTERNATE NAME  INDEXED?
     1  NUM_AUTHOR       2       2  I     -                      -
     3  LAST_UPD         8       8  D     -                      -
    11  NOTE             4       5  B     -                      -
    15  CLASSTYPE        1       1  I     -                      -
    16  TYPE             2       2  C     -                      -
** REDEFINED ITEMS **
     1  CLASS_RULE       2       2  I     -                      -
3.7 - Expert rule structure:
An expert rule is one that uses one or several input attributes to infer the
values of an output attribute.
COLUMN ITEM NAME WIDTH OUTPUT TYPE N.DEC ALTERNATE NAME INDEXED?
1 NUM_AUTHOR 2 2 I - -
Identification number of author of the occurrence of the rule.
3 LAST_UPD 8 8 D - -
Last update date of the occurrence of the rule.
11 NOTE 4 5 B - -
Pathname to an ASCII explanatory note file of the occurrence of the rule.
(Not used at this time.)
15 ? ? ? - -
Output attribute from the rule.
? .CL 1 1 C - -
Output confidence level attribute from the rule.
? ? ? ? - -
1st input attribute to the rule.
{ ? ? ? ? - -
2nd input attribute to the rule.
...
? ? ? ? - -
Nth input attribute to the rule. }
** REDEFINED ITEMS **
1 EXPERT_RULE 2 2 I - -
The name of this redefined attribute is only used to differentiate
expert type from class type rules.
Example of an expert rule to infer an output attribute named GEOL from a set
of 2 attributes named DEPTH and TYPE:
COLUMN  ITEM NAME    WIDTH  OUTPUT  TYPE  N.DEC  ALTERNATE NAME  INDEXED?
     1  NUM_AUTHOR       2       2  I     -                      -
     3  LAST_UPD         8       8  D     -                      -
    11  NOTE             4       5  B     -                      -
    15  GEOL             1       1  C     -                      -
    16  GEOL.CL          1       1  C     -                      -
    17  DEPTH            1       1  C     -                      -
    18  TYPE             1       1  C     -                      -
** REDEFINED ITEMS **
     1  EXPERT_RULE      2       2  I     -                      -
3.8 - Item coding and constraints:
.CL (output attribute confidence level):
h High level of confidence.
m Medium level of confidence.
l Low level of confidence.
v Very low level of confidence or no information.
' ' A blank is output in the dataset for this item when no inference
can be made for some input value(s) because the occurrence is
missing from the rule. The blank must not figure in the rule
(all occurrences must have an explicit confidence level).
NUM_AUTHOR
-1 Unknown.
1
...
99
LAST_UPD
01/01/00 Unknown.
NOTE
0 No note.
1 Note number 1 for current rule.
...
(output attribute)
0 or ' ' For expert rules, a 0 or a blank (for a numerical or a character
output item respectively) is output in the dataset when no
inference can be made for some input value(s) because an
appropriate occurrence is missing from the rule. If this happens
together with a blank output in the associated .CL item when
running the rule, a warning is issued. It is thus a good idea not
to use 0 or blank as output values in a rule, so that they cannot
be confused with the case of a missing occurrence.
3.9 - Rules data:
A rule holds a number of occurrences. Each occurrence holds one combination of input values (a fact) together with the output value and the confidence level that correspond to that fact.
A "wild card" character (* = the star character) may be used that stands for "any value".
Example:
IN1  IN2  OUT  OUT.CL
*    *    n    v
a    *    1    v
a    x    2    h
a    y    3    h
*    x    4    l
*    y    5    m
b    *    6    v
b    x    7    h
b    y    8    h
If a fact has the values (IN1=a,IN2=x) then (OUT=2,CONF=h).
If fact is (IN1=a,IN2=z) then (OUT=1,CONF=v).
If fact is (IN1=c,IN2=x) then (OUT=4,CONF=l).
If fact is (IN1=c,IN2=z) then (OUT=n,CONF=v).
Notice that a fact to which corresponds an exact matching occurrence will be put in correspondence with this occurrence wherever the occurrence is positioned in the rule. Therefore the order in which occurrences with no wild card(s) appear has no significance to the program.
Conversely, any fact which does not have an exactly matching occurrence in the rule is matched (if possible) with the last "wild card" matching occurrence encountered in the rule. For example (IN1=a,IN2=z) could have been matched to (OUT=n,CONF=v), but another match (OUT=1,CONF=v) was found later in the rule and was retained. Therefore the order in which occurrences with wild card(s) appear is significant to the program. The user should therefore describe the occurrences of a rule starting from the most general case (IN1=*,IN2=*) and ending with the most particular case (IN1=a,IN2=x).
When no matching occurrence can be found in a rule for a fact, the output is left blank (OUT=' ',CONF=' '). Having a fully "wild carded" occurrence (IN1=*,IN2=*) as a header of a rule will "pick up" all facts for which no information can be provided, for example (IN1=c,IN2=z). In this example the user controls the output by providing "no data" information (OUT=n,CONF=v), instead of letting the program leave it blank (OUT=' ',CONF=' ').
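The example rule of this section can be replayed with a short sketch that also shows why the order of wild card occurrences matters while the order of exact occurrences does not. The function name infer is hypothetical, the rule is held as a list of tuples rather than an Info file, and the confidence returned is that of the occurrence only (input confidence levels are not combined here).

```python
# The example rule: (IN1, IN2, OUT, OUT.CL), in the order given above.
RULE = [
    ("*", "*", "n", "v"),
    ("a", "*", "1", "v"),
    ("a", "x", "2", "h"),
    ("a", "y", "3", "h"),
    ("*", "x", "4", "l"),
    ("*", "y", "5", "m"),
    ("b", "*", "6", "v"),
    ("b", "x", "7", "h"),
    ("b", "y", "8", "h"),
]

def infer(rule, in1, in2):
    """Return (OUT, CONF) for a fact, or blanks when nothing matches."""
    found = (" ", " ")
    for r1, r2, out, conf in rule:
        if (r1, r2) == (in1, in2):
            return out, conf                       # exact match: position irrelevant
        if r1 in ("*", in1) and r2 in ("*", in2):
            found = (out, conf)                    # keep the last wild card match
    return found

for fact in [("a", "x"), ("a", "z"), ("c", "x"), ("c", "z")]:
    print(fact, infer(RULE, *fact))

reversed_rule = RULE[::-1]
print(infer(reversed_rule, "a", "x"))  # exact match, unchanged: ('2', 'h')
print(infer(reversed_rule, "a", "z"))  # wild card match changes: ('n', 'v')
print(infer(RULE[1:], "c", "z"))       # no header occurrence: (' ', ' ')
```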
4 - Project organization:
-------------------------
This project is independent from any other. This means that any data necessary to this project is copied into the project's directory (e.g. the Soils Geographical Database of the European Union version 2.1).
The main objects that can be found under the project's directory are:
PTRDBE_Readme    A "first things first" short read-me file.
PTRDBE_Metadata  Overview of the subject.
PTRDBE_Specif    Project specifications.
PTRDBE_dictiona  Project's dictionary.
Rules_xxx        The Pedotransfer Rule number xxx.