Light: Laboratory for Information Globalization and Harmonization Technologies February 15, 2004 (v13)

Table 1. Illustrating Information Needs in Three Contexts

Due to space limitations, this proposal document will focus on the NHS national priority. There are very similar needs and opportunities in the other national priority areas, as summarized below.
1.2.2 Economic Prosperity and Vibrant Civil Society (ECS)

The need for intelligent harmonization of heterogeneous information is important to all information-intensive endeavors, which encompass many aspects of our economy and society, including business, government, and education. The fundamental technology research proposed has broad relevance for all complex inter-organizational applications, such as Manufacturing (e.g., Integrated Supply Chain Management), Transportation/Logistics (e.g., In-Transit Visibility), Government (e.g., Electronic Voting), Military (e.g., Total Asset Visibility), and Financial Services (e.g., Global Risk Management). Our LIGHT team is involved in research in all of these areas. People from different organizations and different parts of our societies have different perspectives (i.e., “contexts”). Rather than requiring them all to change to some imposed “standard”, it is much more viable to have the information systems adapt to people’s needs (i.e., “context mediate”). Furthermore, laws or policies that unnecessarily limit or impair the effective use and re-use of information will also be studied.

1.2.3 Advances in Science and Engineering (ASE)

Similarly, the advancement of science and engineering usually involves the accumulation and use of information and knowledge, often gathered by multiple organizations and often for differing purposes. We are working with colleagues at MIT and other institutions in several areas, such as biology, healthcare, engineering product design, and manufacturing.

The field of biology, for example, has become increasingly information-intensive. The volume of information generated in life sciences research is so large that no single person or group owns or controls all the needed data sources. A pharmaceutical company, for example, combines information from 40 sources on average to conduct research in drug development. Although much of this information is publicly available, heterogeneity in data structure and semantics limits the ability of life science researchers to easily integrate and exploit research data. Biologists often think in terms of pathways, be it in sequence analysis, functional genomics, proteomics, or literature search. Pathways discovered by different groups do not have a uniform representation. Pathway integration will be critical to a systemic understanding of how the cell works and will significantly speed up advances in the field. LIGHT will enable semantic interoperability between life science information sources, which have diverse data representations and semantics. Unlike other, more constrained approaches, LIGHT will simultaneously support multiple views. For example, rather than adopting a single gene-centric view as the standard way of viewing data, the system will adjust data automatically if the researcher wants to view the data in terms of function, disease, phenotype, or organ. Similarly, data semantics will be adjusted automatically to reflect the assumptions of a particular researcher, be it a biologist, a geneticist, or a medical researcher.
1.3 Addressing Information Needs

1.3.1 Operational Example

For illustrative purposes only, let us consider the types of information illustrated by Example 2 in Table 1. A specific question is: to what extent have economic performance and environmental conditions in Yugoslavia been affected by the conflicts in the region? The answer to this question could shape policy priorities for different national and international institutions, as well as reconstruction strategies, and may even determine which agencies will be the leading players. Moreover, there is potential for resumed violence, and the region’s relevance to overall European stability remains central to the US national interest. This is not an isolated case but one that illustrates concurrent challenges for information compilation, analysis, and interpretation under changing conditions.

For example, if we are interested in determining the change of carbon dioxide (CO2) emissions in the region, normalized against the change in GDP - before and after the outbreak of the hostilities – we need to take into account territorial and jurisdictional boundaries, changes in accounting and recording norms, and varying degrees of autonomy. User requirements add another layer of complexity. For example, what units of CO2 emissions and GDP should be displayed, and what unit conversions need to be made from the information sources? Which Yugoslavia is of concern to the user: the country defined by its year 2000 borders, or the entire geographic area formerly known as Yugoslavia in 1990? One of the effects of the war is that the region, which used to be one country consisting of six republics and two provinces, has been reconstituted into five legal entities (countries), each having its own reporting formats, currency, units of measure, and new socio-economic parameters. In other words, the meaning of the request for information will differ, depending on the actors, actions, stakes and strategies involved.

In this simple case, we suppose that the request comes from a reconstruction agency interested in the following values: CO2 emission amounts (in tons/yr), CO2 per capita, annual GDP (in million USD/yr), GDP per capita, and the ratio CO2/GDP (in tons CO2/million USD) for the entire region of the former Yugoslavia (see the alternative User 2 scenario in Table 2). A restatement of the question would then become: what is the change in CO2 emissions and GDP in the region formerly known as Yugoslavia before and after the war?

1.3.2 Diverse Sources and Contexts

By necessity, to answer this question, one needs to draw data from diverse types of sources (we call these differing domains of information) - such as, economic data (e.g., the World Bank, UN Statistics Division), environmental data (e.g., Oak Ridge National Laboratory, World Resources Institute), and country history data (e.g., the CIA Factbook), as illustrated in Table 2. Merely combining the numbers from the various sources is likely to produce serious errors due to different sets of assumptions driving the representation of the information in the sources. These assumptions are often not explicit but are an important representation of ‘reality’ (we call these the meaning or context of the information, which will be explained in more detail in Section 2.)

The purpose of Table 2 is to illustrate some of the complexities in a seemingly simple question. In addition to variations in data sources and domains, there are significant differences in contexts and formats, critical temporality issues, and data conversions that all factor into the user’s information needs. As specified in the table, time T0 refers to a date before the war (e.g., 1990), when the entire region was a single country (referred to as “YUG”). Time T1 refers to a date after the war (e.g., 2000), when the country “YUG” retains its name but has lost four of its republics, which are now independent countries. The first column of Table 2 lists some of the sources and domains covered by this question. The second column shows sample data that could be extracted from the sources. The bottom row of the table lists auxiliary mapping information that is needed to understand the meanings of symbols used in the other data sources. For example, when the GDP for Yugoslavia is written in YUN units, a currency code source is needed to understand that this symbol represents the Yugoslavian Dinar. The third column lists the outputs and units requested by the user. Accordingly, for User 1, a simple calculation based only on data for country “YUG” will invariably give a wrong answer. For example, deriving the CO2/GDP ratio by simply taking the CO2 emissions reported for “YUG” in source B and dividing by the GDP reported for “YUG” in source A, without the conversions and aggregation described below, will not provide a correct answer.
1.3.3 Manual Approach

Given the types of data shown in Table 2, along with the appropriate context knowledge (some of which is shown in italics), an analyst could determine the answer to our question. The proper calculation involves numerous steps, including selecting the necessary sources, making the appropriate conversions, and using the correct calculations. For example:

For time T0:

  1. Get CO2 emissions data for “YUG” from source B;

  2. Convert it to tons/year using scale factor 1000; call the result X;

  3. Get GDP data from source A;

  4. Convert to USD by looking up currency conversion table, an auxiliary source; call the result Y;

  5. No need to convert the scale for GDP because the receiver uses the same scale, namely, 1,000,000;

  6. Compute X/Y (equal to 535 tons/million USD in Table 2).

For time T1:

  1. Consult source for country history and find all countries in the area of former YUG;

  2. Get CO2 emissions data for “YUG” from source B (or a new source);

  3. Convert it to tons/year using scale factor 1000; call the result X1;

  4. Get CO2 emissions data for “BIH” from source B (or a new source);

  5. Convert it to tons/year using scale factor 1000; call the result X2;

  6. Continue this process for the rest of the sources to get the emissions data for the rest of the countries;

  7. Sum X1, X2, X3, etc. and call it X;

  8. Get GDP for “YUG” from source A (or alternative); Convert it to USD using the auxiliary sources;

  9. No need to convert the scale factor; call the result Y1;

  10. Get GDP for “BIH” from source E; Convert it to USD using the auxiliary sources; call the result Y2;

  11. Continue this process for the rest of the sources to get the GDP data for the rest of the countries;

  12. Sum Y1, Y2, Y3, etc. and call it Y;

  13. Compute X/Y (equal to 282 tons/million USD in Table 2).
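The manual steps above can be sketched in code to show how many context-dependent decisions they contain. The following Python sketch is illustrative only: the `Observation` type, the function names, and every figure and exchange rate are hypothetical placeholders, not the values in Table 2, and the source GDP scale of 1,000,000 follows the assumption made in step 5 above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    value: float  # number as reported by the source
    scale: float  # multiplier implied by the source's context
    unit: str     # "tons" or a currency code


def co2_tons(obs: Observation) -> float:
    """Apply the source scale factor to obtain tons/year."""
    return obs.value * obs.scale


def gdp_million_usd(obs: Observation, usd_per_unit: dict) -> float:
    """Convert GDP to USD, then express it in the receiver's scale (millions)."""
    return obs.value * obs.scale * usd_per_unit[obs.unit] / 1_000_000


def co2_per_gdp(countries, co2, gdp, usd_per_unit) -> float:
    """Sum X (CO2) and Y (GDP) over the countries; return X/Y in tons per million USD."""
    x = sum(co2_tons(co2[c]) for c in countries)
    y = sum(gdp_million_usd(gdp[c], usd_per_unit) for c in countries)
    return x / y


# --- Hypothetical placeholder data (NOT the values in Table 2) ---
# Time T0: one country, one currency.
CO2_T0 = {"YUG": Observation(80.0, 1000, "tons")}
GDP_T0 = {"YUG": Observation(400.0, 1_000_000, "YUN")}
RATES_T0 = {"YUN": 0.5}

# Time T1: the country-history source lists every country in the former area.
REGION_T1 = ["YUG", "BIH", "HRV", "MKD", "SVN"]
CO2_T1 = {
    "YUG": Observation(40.0, 1000, "tons"),
    "BIH": Observation(10.0, 1000, "tons"),
    "HRV": Observation(20.0, 1000, "tons"),
    "MKD": Observation(10.0, 1000, "tons"),
    "SVN": Observation(20.0, 1000, "tons"),
}
GDP_T1 = {
    "YUG": Observation(400.0, 1_000_000, "YUN"),
    "BIH": Observation(100.0, 1_000_000, "BAM"),
    "HRV": Observation(400.0, 1_000_000, "HRK"),
    "MKD": Observation(800.0, 1_000_000, "MKD"),
    "SVN": Observation(1000.0, 1_000_000, "SIT"),
}
# The T1 rate table is separate because the T1 dinar is not the T0 dinar.
RATES_T1 = {"YUN": 0.25, "BAM": 0.5, "HRK": 0.125, "MKD": 0.0625, "SIT": 0.25}

RATIO_T0 = co2_per_gdp(["YUG"], CO2_T0, GDP_T0, RATES_T0)
RATIO_T1 = co2_per_gdp(REGION_T1, CO2_T1, GDP_T1, RATES_T1)
print(RATIO_T0, RATIO_T1)
```

Even in this reduced form, the analyst must supply three kinds of knowledge by hand: the scale of each reported number, the currency and period of each rate table, and the list of countries that make up the region at each time.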

[Table 2 has three columns: (1) Domain and Sources Consulted; (2) Sample Data Available; (3) Basic Question, Information User Type & Usage. The sample data values and the requested output values are not reproduced here; the remaining content is listed row by row.]

Economic Performance

  • World Bank’s World Development Indicators database

  • UN Statistics Division’s database

  • Statistics Bureaus of individual countries

Sample data: A. Annual GDP and Population Data

  - GDP in billions local currency per year

  - Population in millions

Environmental Impacts

  • Oak Ridge National Laboratory’s CDIAC database

  • WRI database

  • GSSD

  • EPA of individual countries

Sample data: B. Emissions Data

  - Emissions in 1000s tons per year

Country History

  • CIA

  • GSSD

Sample data: (i.e., geographically, YUG at T0 is equivalent to YUG+BIH+HRV+MKD+SVN at T1)

Mappings Defined

  • Country code

  • Currency code

  • Historical exchange rates*

Sample data (entries truncated in the source): “New Yugoslavian”, “Bosnia and”

* Note: Hyperinflation in YUG resulted in establishment of a new currency unit in June 1993. Therefore, T1.YUN is completely different from T0.YUN.

Basic question: How did economic output and environmental conditions change in YUG over time?

  User 1: YUG as a geographic region bounded at T0

  User 2: YUG as a legal, autonomous state

Note (receivers’ contexts):

  T0: 1990 (prior to breakup)

  T1: 2000 (after breakup)

  CO2: 1000’s tons per year

  CO2/capita: tons per person

  GDP: billions USD per year

  GDP/capita: 1000’s USD per person

  CO2/GDP: tons per million USD

[As an interesting aside, the country last known as “Yugoslavia” officially disappeared in 2003 and was replaced by the “Republics of Serbia and Montenegro.” For simplicity, we will ignore this extra complexity.]

Table 2. Operational Example: Information Needs in Cases of Conflict
The complexity of this task would be easily magnified if, for example, the CO2 emissions data from the various sources were all in different metrics or, alternatively, if demographic variables were drawn from different institutional contexts (e.g., with or without counting refugees). This example shows some of the operational challenges if a user were to manually attempt to answer this question. This case highlights just some of the common data difficulties where information reconciliation continues to be made ‘by hand’. It is easy to see why such analysis can be very labor intensive and error-prone. This makes it difficult under “normal” circumstances and possibly impossible under time-critical circumstances. This example may appear to be simple, but it includes major complexities such as reconciling spatial territoriality, currency, and atmospheric measures. Barriers to effective information access and utilization usually involve such complexities.
1.3.4 LIGHT: A Better Way

A key goal of this research effort is to create the Laboratory for Information Globalization and Harmonization Technologies (LIGHT), which can automatically determine and reliably perform the steps shown above in response to a user’s request. Every user is distinctive. LIGHT will be capable of storing the necessary context information about the sources and users, and it will have a reasoning engine capable of determining the necessary sources, conversions, and calculations. The COIN and GSSD systems, described briefly below, have proven the feasibility of this approach in more limited situations. LIGHT will be the next generation: it will combine context and content.

1.4 Existing Foundations – COIN and GSSD

Research already completed in two areas provides important foundations for addressing the emergent challenges discussed above: the COntext INterchange Project (COIN) and the Global System for Sustainable Development (GSSD).
1.4.1 COIN

The COntext INterchange (COIN) Project has developed a basic theory, architecture, and software prototype for supporting intelligent information integration employing context mediation technology [MAD99, GBM*99, GoBM96, Goh96, SM91a]. We propose to utilize the foundation of COIN to develop theories and methodologies for our proposed System for Harmonized Information Processing (SHIP). The fundamental concept underlying such a system is the representation of knowledge as Collaborative Domain Spaces (CDSs). A CDS is a grouping of knowledge, including source schemas, data context, conversion functions, and source capabilities, as related to a single domain ontology. The software components needed to provide harmonized information processing (i.e., through the use of a CDS or collections of linked CDSs) include a context mediation engine [BGL*00, Goh96], one or more ontology library systems, a context domain and conversion function management system, and a query planner and executor [Fynn97]. In addition, support tools are required so that the context definitions of applications (i.e., receivers) and the source definitions (i.e., schemas, contexts, capabilities) can be added and removed easily. Developing a flexible, scalable software platform will require significant additional research in a number of key research areas, as described in Section 2.4.

1.4.2 GSSD

The Global System for Sustainable Development serves as an Internet-based platform for exploring the contents transmitted through different forms of information access, provision, and integration across multiple information sources, languages, cultural contexts, and ontologies. GSSD has an extensive, quality-controlled set of ontologies related to system sustainability (specifically, to sources of instability and alternative responses and actions), with reference to the field of international relations broadly defined. In addition, GSSD has made considerable gains in understanding the organization and management of large-scale, distributed, and diverse research teams, including cross-national partners (China and Japan, and countries in the Middle East and Europe) and institutional partners (private, public, and international agencies). Designed and implemented by social scientists, GSSD is seen as demonstrating ‘opportunities for collaboration and new technologies,’ according to the National Academy of Engineering [RAC01, p. viii]. GSSD databases cover issues related to the dynamics of conflict, as well as other domains relevant to our proposed research, such as migration, refugees, unmet human needs, and evolving efforts at coordinated international actions. GSSD provides a rich testing ground for the new information technologies we propose to develop, including automated methods for information aggregation from various sources, context mediation capabilities, customized information retrieval capabilities, and ontology representations.

1.5 Research Team

Due to the multi-disciplinary nature of this project, we have assembled a research team that is uniquely qualified to conduct this work. The PIs of this project come from MIT’s School of Humanities, Arts, and Social Sciences (Choucri), School of Engineering (Madnick and Wang), and School of Management (Siegel and Madnick), and the students who will contribute significantly to the research come from all these diverse Schools. Furthermore, the PIs have extensive research experience in critical areas characterized by rapid change, system instabilities, and demands for rapid response to information needs. These are all necessary to accomplish the goals of this project.

1.6 Proposal Organization

The remainder of this proposal will elaborate on the intended research tasks. Section 2 will describe research needs in IT theory and technologies, and how these capabilities can address the national priorities as discussed in Section 3. Section 4 provides a brief description of the new laboratory, its intellectual and research strategy, and how it will ensure coherence among the components of the project and also handle outreach activities. Finally, Sections 5 and 6 will present the anticipated contributions of the project, with a focus on educational impacts.

Section 2. IT Theory and Technology Research
2.1 Needs for Harmonized Information Processing and Collaborative Domain Spaces

Advances in computing and networking technologies now allow extensive volumes of data to be gathered, organized, and shared on an unprecedented scale and scope. Unfortunately, these newfound capabilities by themselves are only marginally useful if the information cannot be easily extracted and gathered from disparate sources, if the information is represented with different interpretations, and if it must satisfy differing user needs [MHR00, MAD99, CFM*01]. The data requirements (e.g., scope, timing) and the sources of the data (e.g., government, industry, global organizations) are extremely diverse. It is proposed that the application focus for this research effort be in the domains of the national priority areas, with specific emphasis on national and homeland security, which, by definition, takes into account internal as well as external dimensions of relations among actors in both the public and the private domains.

This research effort will:

  1. Analyze the data and technology requirements for the categories of problems described in Section 1;

  2. Research, design, develop and test extensions and improvements to the underlying COIN and GSSD theory and components;

  3. Provide a scalable, flexible platform for servicing the range of applications described in Section 1; and

  4. Demonstrate the effectiveness of the theories, tools, and methodologies through technology transfer to other collaborating organizations.

2.2 Illustrative Example of Information Extraction, Dissemination, and Interpretation Challenges

As an illustration of the problems created by information disparities, let us refer back to the example from the conflict realm introduced in Section 1.3. The question was: to what extent have economic performance and environmental conditions in Yugoslavia been affected by the conflicts in the region? It is necessary to draw data from diverse sources such as the CIA Factbook (for current boundaries), the World Resources Institute (for CO2 emissions), and the World Bank (for economic data).

There are many additional information challenges that had not been explicitly noted earlier, such as:

Information Extraction: Some of the sources may be full relational databases, in which case there is the issue of remote access. In many other cases, the sources may be traditional HTML web sites, which are fine for viewing from a browser but not effective for combining data or performing calculations (other than manually “cut & paste”). Other sources might be tables in a text file, Word document, or even a spreadsheet. Although the increasing use of eXtensible Markup Language (XML) will reduce some of these interchange problems [MAD01], we will continue to live in a very heterogeneous world for quite a while to come. So we must be able to extract information from all types of sources.

Information Dissemination: The users want to use the resulting “answers” in many ways. Some will want to see the desired information displayed in their web browser but others might want the answers to be deposited into a database, spreadsheet, or application program for further processing.

Information Interpretation: Although the problems of information extraction and dissemination will be addressed in this research, the most difficult challenges involve information interpretation. Specifically, an example question is: “What is the change of CO2 emissions per GDP in Yugoslavia before and after the Balkans war?”

Before the war (time T0), the entire region was one country. Data for CO2 emissions were in thousands of tons/year, and GDP was in billions of Yugoslavian Dinars. After the war (time T1), Yugoslavia retains only two of its original six republics; the other four are now independent countries, each with its own currency. The size and population of the country now known as Yugoslavia have changed. Yugoslavia itself has also introduced a new currency to combat hyperinflation.

From the perspective of any one agency, UNEP for example, the question “How have CO2 emissions per GDP changed in Yugoslavia after the war?” may have multiple interpretations. Not only does each source have a context, but so does each user (also referred to as a receiver). For example, does the user mean Yugoslavia as the original geographic area (depicted as User 1 in Table 2) or as the legal entity, which has changed size (User 2)? To answer the question correctly, we have to use the changing context information. A simple calculation based on the “raw” data will not give the right answer. As seen earlier, the calculation will involve many steps, including selecting necessary sources, making appropriate conversions, and using correct calculations. Furthermore, each user might have a different preferred context for the answer, such as tons/million USD or kilograms/billion EUR. More of these information harmonization challenges will be highlighted in Section 2.4.
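To make the receiver-context point concrete, the following Python sketch converts a CO2/GDP ratio from tons per million USD into kilograms per billion EUR. It is a worked arithmetic illustration only: the function name, the assumption that “tons” means metric tons (1,000 kg), and the exchange rate passed in are all hypothetical.

```python
KG_PER_TON = 1_000            # assumption: metric tons
MILLIONS_PER_BILLION = 1_000


def tons_per_musd_to_kg_per_beur(ratio: float, usd_per_eur: float) -> float:
    """Convert tons CO2 per million USD into kg CO2 per billion EUR.

    usd_per_eur is a hypothetical exchange rate (USD received for one EUR).
    """
    kg_per_musd = ratio * KG_PER_TON                  # change the mass unit
    kg_per_busd = kg_per_musd * MILLIONS_PER_BILLION  # change the GDP scale
    # One billion EUR of GDP equals usd_per_eur billion USD of GDP, so the
    # emissions attributed to it scale by the same factor.
    return kg_per_busd * usd_per_eur


example = tons_per_musd_to_kg_per_beur(200.0, 1.25)
print(example)
```

Three independent modifiers change here (mass unit, GDP scale, currency), which is why a receiver's context has to be stated explicitly rather than inferred from the number alone.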

Although seemingly simple, this example addresses some of the most complex issues in NHS: namely, the impact of changing legal jurisdictions and sovereignties on (a) state performance, (b) salience of socio-political stress, (c) demographic shifts, and (d) estimates of economic activity, as critical variables of note. Extending this example to the case of the former Soviet Republics, before and after independence, is conceptually the same type of challenge, with greater complexity. For example, the US Department of Defense is interested in demographic distributions around oil fields (by ethnic group) before and after independence. Alternatively, UNEP is interested in CO2 emissions per capita, given that these are oil-producing regions. On the other hand, foreign investors will be interested in insurance rates before and after independence. The fact that the demise of the Soviet Union led to the creation of a large number of independent states is a reminder that the Yugoslavia example is far from unique; it highlights a class of increasingly complex information reconciliation problems. Many of the new states in Central Asia rank high as potential sources of terrorism. The same type of challenge would arise if Iraq were rendered into a federal entity.

The information shown in italics in Table 2 (e.g., “Population in millions”) illustrates context knowledge. Sometimes this context knowledge is explicitly provided with the source data (but still must be accessed and processed), but many times it must be found in other sources. The good news is that such context knowledge almost always exists, but it is often widely distributed within and across organizations. Thus, a central focus of this part of the effort is to support the acquisition, organization, and effective intelligent usage of distributed context knowledge to support information harmonization and collaborative domains.

2.3 Research Platform

The MIT COntext INterchange (COIN) project has developed a platform including a theory, architecture, and basic prototype for such intelligent harmonized information processing. COIN is based on database theory and mediators [Wied92, Wied99]. Context Interchange is a mediation approach for semantic integration of disparate (heterogeneous and distributed) information sources as described in [BGL*00 and GBM*99]. The Context Interchange approach includes not only the mediation infrastructure and services, but also wrapping technology and middleware services for accessing the source information and facilitating the integration of the mediated results into end-users applications (see Figure 1).

Figure 1. The Architecture of the Context Interchange System

The wrappers are physical and logical gateways providing uniform access to the disparate sources over the network [Chen99, FMS00a, FMS00b]. The set of Context Mediation Services comprises a Context Mediator, a Query Optimizer, and a Query Executioner. The Context Mediator is in charge of the identification and resolution of potential semantic conflicts induced by a query. This automatic detection and reconciliation of conflicts present in different information sources is made possible by ontological knowledge of the underlying application domain, as well as the informational content and implicit assumptions associated with the receivers and sources.

The result of the mediation is a mediated query. To retrieve the data from the disparate information sources, the mediated query is then transformed into a query execution plan, which is optimized, taking into account the topology of the network of sources and their capabilities. The plan is then executed to retrieve the data from the various sources, and the results are composed and sent to the receiver.

The knowledge needed for harmonization is formally modeled in a COIN framework [Goh96]. The COIN framework is a mathematical structure offering a robust foundation for the realization of the Context Interchange strategy. The COIN framework comprises a data model and a language, called COINL, of the Frame-Logic (F-Logic) family [KLW95, DT95]. The framework is used to define the different elements needed to implement the strategy in a given application:

  • The Domain Model is a collection of rich types (semantic types) defining the domain of discourse for the integration strategy;

  • Elevation Axioms for each source identify the semantic objects (instances of semantic types) corresponding to source data elements and define integrity constraints specifying general properties of the sources;

  • Context Definitions define the different interpretations of the semantic objects in the different sources and/or from a receiver's point of view.
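As a rough analogy for how these three elements fit together, the following Python sketch models a domain model, one elevation axiom, and two context definitions, and then converts a value from a source context to a receiver context. COIN itself expresses this knowledge declaratively in COINL; the Python names (`Context`, `DOMAIN_MODEL`, `ELEVATION`, `CONTEXTS`, `convert`), the source and context labels, and the exchange rate are all hypothetical and stand in for the real framework only loosely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Modifier values that fix how a semantic type is interpreted."""
    scale_factor: float
    currency: str


# Domain model: semantic types and the modifiers that qualify them.
DOMAIN_MODEL = {"monetaryValue": ("scale_factor", "currency")}

# Elevation axioms: map a source column to a semantic type and name the
# context in which the source reports it.
ELEVATION = {("source_a", "gdp"): ("monetaryValue", "c_source_a")}

# Context definitions for one source and one receiver.
CONTEXTS = {
    "c_source_a": Context(scale_factor=1e9, currency="YUN"),
    "c_receiver": Context(scale_factor=1e6, currency="USD"),
}


def convert(value, source, column, receiver_ctx, rates):
    """Re-express a source value in the receiver's context.

    rates maps (from_currency, to_currency) to a hypothetical exchange rate.
    """
    semantic_type, ctx_name = ELEVATION[(source, column)]
    assert semantic_type in DOMAIN_MODEL  # the column must elevate to a known type
    src, dst = CONTEXTS[ctx_name], CONTEXTS[receiver_ctx]
    base = value * src.scale_factor  # undo the source scale
    if src.currency != dst.currency:
        base *= rates[(src.currency, dst.currency)]  # resolve the currency conflict
    return base / dst.scale_factor  # apply the receiver scale


converted = convert(2.0, "source_a", "gdp", "c_receiver", {("YUN", "USD"): 0.5})
print(converted)
```

The point of the separation is that adding a new receiver requires only a new context definition; the elevation axioms and the conversion logic stay unchanged.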

The comparison and conversion procedure itself is inspired by the Abductive Logic Programming framework [KKT93] and can be qualified as an abduction procedure, which lets it take advantage of that framework’s formal logical foundation. One of the main advantages of the abductive logic programming framework is the simplicity with which it can be used to formally combine and implement features of query processing, semantic query optimization, and constraint programming.
2.4 Research Tasks and Expected Contributions in Integrating Systems (int) and Data (dmc) Involving Complex and Interdependent Social Systems (soc)

Possible criticisms of research are either that it is “impossible” or “trivial.” We believe that the thirteen research goals listed below are ideally matched to the goals of the NSF ITR. First, they build on our proven COIN and GSSD efforts and, in many cases, we have working papers describing approaches toward solutions (due to space limitations, it is difficult to present many details) – so we believe that our goals are definitely “possible.” On the other hand, each of these research goals separately is challenging and we believe that no one has attempted to accomplish them all in unison, so it is definitely “not trivial.” Even if we succeeded in accomplishing only a subset of these goals, it would be a major contribution – but our goal is to accomplish and integrate them all.

1. Extended Domain of Knowledge – Equational Context. In addition to the types of domain and context knowledge currently handled by the COIN framework, we need to perform research to add representation and reasoning capabilities that support equational context [FMG02]. Equational context refers to knowledge such as: “average GDP per person (AGDP)” means “total GDP” divided by “population.” In some data sources, AGDP explicitly exists (possibly with differing names and in differing units), but in other cases it may not explicitly exist yet could be calculated by using “total GDP” and “population” from one or more sources, if that knowledge existed and was used effectively. {See [FMG02] for more details on proposed solution approach.}
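A minimal sketch of the idea, assuming a simple dictionary of equations rather than the representation proposed in the cited work: when an attribute is missing from the available data, it is derived from the attributes named in its defining equation. The names `EQUATIONS` and `resolve` and the sample numbers are hypothetical.

```python
# Equational context: each derived attribute names its inputs and a formula.
EQUATIONS = {
    "agdp": (("total_gdp", "population"), lambda gdp, pop: gdp / pop),
}


def resolve(attribute, data, equations):
    """Return an attribute from the data, deriving it when it is not explicit."""
    if attribute in data:
        return data[attribute]  # the source already reports it
    if attribute in equations:
        inputs, formula = equations[attribute]
        # Recursively resolve the inputs, which may themselves be derived.
        return formula(*(resolve(name, data, equations) for name in inputs))
    raise KeyError(f"no data and no equation for {attribute!r}")


derived = resolve("agdp", {"total_gdp": 1000.0, "population": 4.0}, EQUATIONS)
explicit = resolve("agdp", {"agdp": 7.0}, EQUATIONS)
print(derived, explicit)
```

In practice the inputs may come from different sources with different contexts, so each input would first have to be converted into a common context before the formula is applied.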

2. Extended Domain of Knowledge – Temporal Context. Temporal context refers to the fact that context varies not only across sources but also across time. Thus, the implied currency context for France’s GDP prior to 2002 might be French Francs, but after 2002 it is Euros. If one were performing a longitudinal study over multiple years from multiple sources, it is important that this variation in context over time be understood and processed appropriately. A seemingly straightforward variable like the size of ‘military expenditures’ across countries is defined differently depending on the rules of inclusion or exclusion (as, for example, of military pensions) used in different jurisdictions. Changes in territorial boundaries signal changes in jurisdiction, and often changes in modes of information provision and formatting. This is a common problem facing a new government after a revolution. {See [ZMS04] for more details on proposed solution approach.}
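One way to picture temporal context is as a list of validity intervals attached to a modifier. The sketch below uses the France example; the `TemporalContext` type, the lookup function, and the choice to treat 2002 itself as the first Euro year are assumptions made for illustration, not the approach of the cited work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemporalContext:
    currency: str
    start_year: Optional[int]  # inclusive; None means open-ended
    end_year: Optional[int]    # inclusive; None means open-ended


# Assumption for illustration: 2002 is treated as the first Euro year.
FRANCE_GDP_CURRENCY = [
    TemporalContext("FRF", None, 2001),
    TemporalContext("EUR", 2002, None),
]


def currency_in(year: int, history) -> str:
    """Return the currency context that applies in the given year."""
    for ctx in history:
        after_start = ctx.start_year is None or year >= ctx.start_year
        before_end = ctx.end_year is None or year <= ctx.end_year
        if after_start and before_end:
            return ctx.currency
    raise LookupError(f"no currency context covers {year}")


print(currency_in(1995, FRANCE_GDP_CURRENCY), currency_in(2005, FRANCE_GDP_CURRENCY))
```

A longitudinal query would call such a lookup once per year of data, so that each observation is converted using the context that was in force when it was recorded.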

3. Extended Domain of Knowledge – Entity Aggregation Context. Entity Aggregation addresses the reality that we often have multiple interpretations of what constitutes an entity. We have already seen an example in the multiple interpretations of what is meant by “Yugoslavia.” This situation occurs in many other cases, such as whether “IBM” includes “Lotus Development Corp” (a wholly-owned subsidiary); the frequent answer is “it depends on the context.” We have defined this problem as “corporate householding.” This is a common occurrence and challenge in many aspects of national and homeland security. {See [MWX03] for more details on proposed solution approach.}
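The Yugoslavia case can be sketched as a lookup that expands an entity name into its members according to the receiver's view. The mapping below follows the equivalence stated in Table 2 (YUG at T0 corresponds geographically to YUG+BIH+HRV+MKD+SVN at T1); the function names, the view labels, and the emissions figures are hypothetical.

```python
# From Table 2: geographically, YUG at T0 equals these five countries at T1.
SUCCESSORS_AT_T1 = {"YUG": ["YUG", "BIH", "HRV", "MKD", "SVN"]}


def members(entity: str, period: str, view: str) -> list:
    """Expand an entity into the country codes it denotes.

    view is "geographic" (User 1) or "legal" (User 2).
    """
    if view == "geographic" and period == "T1":
        return SUCCESSORS_AT_T1.get(entity, [entity])
    return [entity]  # the legal entity, or any entity at T0, stands for itself


def aggregate(entity, period, view, values):
    """Sum a per-country measure over the members of the entity."""
    return sum(values[code] for code in members(entity, period, view))


# Hypothetical T1 emissions in 1000s of tons/year.
CO2_T1 = {"YUG": 40.0, "BIH": 10.0, "HRV": 20.0, "MKD": 10.0, "SVN": 20.0}

user1_total = aggregate("YUG", "T1", "geographic", CO2_T1)
user2_total = aggregate("YUG", "T1", "legal", CO2_T1)
print(user1_total, user2_total)
```

The same query text therefore yields two different totals, and neither is wrong; the receiver's context determines which one answers the question that was actually asked.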

4. Linked Collaborative Domain Spaces. The existing COIN framework provides representation and reasoning capabilities for a single domain. Although there are a number of ontology library systems that allow for management of multiple ontologies [DSW*99, DFen01, Fensel01, HelfH00], they have limitations in scalability and in dynamically incorporating new ontological knowledge. In particular, they lack the capability of representing the rich context knowledge needed for reconciling differences among sources. The primary focus of this overall research effort is the ability to operate in a multi-disciplinary environment across multiple linked collaborative domain spaces. The representational capability to relate concepts across domains, and to efficiently maintain the effectiveness of these collaborative domain spaces, is critically important, especially in an environment where we believe the underlying domains themselves will continually undergo evolution. For some users, the reality of domain shifts is itself the defining feature of interest [Nuna01]. {See [Kal03] for more details on proposed solution approach.}

5. Advanced Mediation Reasoning and Services. The COIN abductive framework can also be extended to problem areas such as integrity management, view updates and intentional updates for databases [Chu00]. Because the declarative definition of the mediation logic in the COINL program is clearly separated from the generic abductive procedure for query mediation, we are able to adapt our mediation procedure to new situations such as mediated consistency management across disparate sources, mediated update management of one or more databases using heterogeneous external auxiliary information, or mediated monitoring of changes. An update asserts that certain data objects must be made to have certain values in the updater’s context. By combining the update assertions with the COIN logical formulation of context semantics, we can determine whether the update is unambiguous and feasible, and if so, what source data updates must be made to achieve the intended results. If the update is ambiguous or otherwise infeasible, the logical representation may be able to indicate what additional constraints would clarify the updater’s intention sufficiently for the update to proceed. We will build upon the formal system underlying our current framework, F-Logic and abductive reasoning, and extend its expressiveness and reasoning capabilities by leveraging ideas developed in different yet similar frameworks such as Description Logic and classification. By selecting applications marked by fundamental shifts in relationships, systems, and pressures, we are opting for the ‘tough test’ in which the underlying domain is highly dynamic, even volatile.

6. Automatic Source Selection. A natural extension is to leverage context knowledge to achieve context-based automatic source selection. One particular kind of context knowledge useful for automatic source selection is the content scope of data sources. Data sources differ, either significantly or subtly, in their coverage scopes. In a highly diverse environment with hundreds or thousands of data sources, differences in content scope can be used to make data source selection effective and efficient. Integrity constraints in COINL and the consistency checking component of the abductive procedure provide the basic ingredients to characterize the scope of information available from each source, to efficiently rule out irrelevant data sources, and thereby to speed up the selection process. For example, a query requesting information about companies with assets lower than $2 million can avoid accessing a particular source based on knowledge of integrity constraints stating that the source only reports information about companies listed on the New York Stock Exchange (NYSE), and that companies must have assets larger than $10 million to be listed on the NYSE. In general, integrity constraints express necessary conditions imposed on data. More generally, however, a notion of the degree of completeness of the source's domain with respect to the constraint captures richer semantic information and allows more powerful source selection. For instance, a source could contain exactly or at least all the data satisfying the constraint (e.g., all the companies listed on the NYSE are reported in the source). The source may be influenced by institutional objectives, resulting in major differences in metrics (for concepts like ‘terrorism’) due to differences in definitions of the concept itself. In cases of violent conflict, casualty reports vary significantly, largely because of differences in definitions of the variable (i.e., who is being counted).
{See [TM98] for more details on proposed solution approach.}
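
The NYSE example can be sketched as a simple scope check: each source advertises the range of asset values it can contain, and a source is skipped when that range cannot intersect the range the query asks for. This is only an interval-overlap illustration under assumed names (`Range`, `SOURCE_SCOPES`, `candidate_sources`) and hypothetical sources; it does not reproduce the COINL integrity constraints or the abductive consistency check.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Range:
    """Open interval (low, high) of company asset values in US dollars."""
    low: float = -math.inf
    high: float = math.inf

    def overlaps(self, other: "Range") -> bool:
        # Two open intervals intersect only if the larger lower bound
        # lies strictly below the smaller upper bound.
        return max(self.low, other.low) < min(self.high, other.high)


# Hypothetical scope declarations derived from integrity constraints.
SOURCE_SCOPES: Dict[str, Range] = {
    "nyse_listings": Range(low=10_000_000),           # assets > $10 million
    "small_business_registry": Range(high=5_000_000),  # assets < $5 million
}


def candidate_sources(query: Range) -> List[str]:
    """Return the sources whose declared scope can satisfy the query."""
    return sorted(
        name for name, scope in SOURCE_SCOPES.items() if scope.overlaps(query)
    )


# Query: companies with assets lower than $2 million.
selected = candidate_sources(Range(high=2_000_000))
```

Here the NYSE source is ruled out without being accessed, because its declared scope and the query range are disjoint.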

7. Source Quality. Not only do the sources vary in semantic meaning, they also vary in quality, and they do so in various ways. We must be able to represent and reason about the quality attributes of the sources. Although there has been some basic research on modeling the semantics of data quality [WKM93], significant additional research must be done to advance and formalize these notions and then incorporate them into the SHIP system. {See [Mad03] for more recent details on proposed solution approach.}

8. Attribution Knowledge Processing. For quality assessment and other reasons, it is important to know the attribution of the sources [LCN*99, LMB98]. For example, it can be important to know that although three different sources agree on a controversial piece of information (e.g., casualties in the Afghanistan war), all three sources acquired that information from the same, possibly questionable, origin source. Thus, attribution metadata must be represented and processed in our system. {See [Lee02] for more recent details on proposed solution approach.}
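
As a minimal sketch of attribution processing, the fragment below follows "derived from" links back to an origin and counts how many independent origins support a claim. The source names and the `DERIVED_FROM` table are hypothetical, and a single-parent chain is a simplifying assumption; the model in [Lee02] may be richer.

```python
from typing import Dict, Iterable, Set

# Hypothetical attribution metadata: source -> the source it copied from.
DERIVED_FROM: Dict[str, str] = {
    "source_a": "wire_report",
    "source_b": "wire_report",
    "source_c": "source_a",
}


def origin(source: str) -> str:
    """Follow attribution links until reaching a source with no parent."""
    seen: Set[str] = set()
    while source in DERIVED_FROM:
        if source in seen:
            raise ValueError(f"attribution cycle detected at {source}")
        seen.add(source)
        source = DERIVED_FROM[source]
    return source


def independent_origins(sources: Iterable[str]) -> Set[str]:
    """Return the distinct origins behind a set of agreeing sources."""
    return {origin(s) for s in sources}


origins = independent_origins(["source_a", "source_b", "source_c"])
```

Three agreeing sources that collapse to one origin provide one piece of evidence, not three.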

9. Domain Knowledge Processing – Improving Computer Performance. While domain and context knowledge processing has been shown to have considerable conceptual value [CZ98, MBM*98, LMS96b, SW92], its application in real situations requires both efficiency and scalability across large numbers of sources, quantities and kinds of data, and demand for services. The scalability and optimization of this mediation processing for large numbers of sources across multiple collaborative domains and contexts will invariably be important. In a heterogeneous and distributed environment, the mediator transforms a query written in terms known in the user or application program context (i.e., according to the user's or programmer's assumptions and knowledge) into one or more queries in the terms of the component sources. The individual subqueries may still involve several sources. Therefore, subsequent planning, optimization and execution phases are needed [AKS96, Fynn97]. The planning and execution phases must consider the limitations of the sources and the topology and costs of the network (especially when dealing with non-database sources, such as web pages or web services). The execution phase is in charge of scheduling the query execution plan and carrying out the complementary operations that could not be handled by the sources individually (e.g., a join across sources). {See [Tar02] for more details on proposed solution approach.}
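
One complementary operation mentioned above, a join across sources, can be sketched as a hash join performed by the mediator over rows already fetched from two sources. The function name `hash_join`, the row layout and the sample data are assumptions for illustration; the planner and executor described in [Tar02] are not shown.

```python
from typing import Dict, Hashable, Iterable, List

Row = Dict[str, object]


def hash_join(left: Iterable[Row], right: Iterable[Row], key: str) -> List[Row]:
    """Join two row streams on a shared key; neither source does the join."""
    # Build an index over the right-hand rows, grouped by key value.
    index: Dict[Hashable, List[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Probe the index with each left-hand row and merge matching pairs.
    joined: List[Row] = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**right_row, **left_row})
    return joined


# Hypothetical results returned by two independent sources.
gdp_rows = [{"country": "FR", "gdp": 1.0}, {"country": "DE", "gdp": 2.0}]
population_rows = [{"country": "FR", "population": 60}]
result = hash_join(gdp_rows, population_rows, "country")
```

In practice the choice of which side to index would depend on source sizes and network costs, which is the kind of decision the planning phase has to make.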

10. Domain Knowledge Acquisition – Improving Human Performance. Domain and context knowledge acquisition are also very important. One essential property to be emphasized is the independence of the domains and sources. Our approach is non-intrusive and respects their independence (i.e., autonomy). To effectively use the expressive power of the constructs and mechanisms in COINL, it is important that the human knowledge sources be able to easily provide the needed domain and context knowledge. It is therefore essential to develop an appropriately flexible methodology and the tools supporting it. Where a large number of independent information sources are accessed (as is now possible with the global Internet), flexibility, scalability, and non-intrusiveness will be of primary importance. Traditional tight-coupling approaches to semantic interoperability rely on the a priori creation of federated views on the heterogeneous information sources. These approaches do not scale up efficiently given the complexity involved in constructing and maintaining a shared schema for a large number of possibly independently managed and evolving sources. Loose-coupling approaches rely on the user's intimate knowledge of the semantic conflicts between the sources and of the conflict resolution procedures. This reliance becomes a drawback for scalability because the required knowledge grows and changes as more sources join the system and as existing sources change. The Context Interchange (COIN) approach is a middle ground between these two approaches. It allows queries to the sources to be mediated, i.e., semantic conflicts are identified and resolved by a context mediator through comparison of the contexts associated with the sources and receivers concerned by the queries. It requires only the minimal adoption of a common Domain Model, such as that developed for GSSD, that defines the domain of discourse of the application. {See [Lee03] for more details on proposed solution approach.}
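
The comparison of source and receiver contexts can be illustrated with a minimal sketch in which each context declares a currency and a scale factor, and the mediator derives the conversion from the difference between the two declarations. The `Context` class, the `mediate` function, the scale-factor attribute and the fixed exchange rates are all assumptions made for this illustration; they are not the COINL constructs or the abductive procedure.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Context:
    """Assumptions a source or receiver makes about a monetary value."""
    currency: str
    scale_factor: int  # e.g. 1000 means values are reported in thousands


# Hypothetical fixed rates, expressed as US dollars per unit of currency.
RATES_TO_USD: Dict[str, float] = {"USD": 1.0, "EUR": 1.25}


def mediate(value: float, source: Context, receiver: Context) -> float:
    """Convert a value from the source's context into the receiver's."""
    # Undo the source's scale and currency to reach plain US dollars.
    in_usd = value * source.scale_factor * RATES_TO_USD[source.currency]
    # Re-express the amount in the receiver's currency and scale.
    return in_usd / RATES_TO_USD[receiver.currency] / receiver.scale_factor


source_ctx = Context(currency="EUR", scale_factor=1_000_000)
receiver_ctx = Context(currency="USD", scale_factor=1_000)
converted = mediate(4.0, source_ctx, receiver_ctx)
```

Neither the source nor the receiver changes its own conventions; only the declared contexts are compared, which is what keeps the approach non-intrusive.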

11. Relationship with Evolving Semantic Web. Although the initial COIN and GSSD research and theories preceded the emerging activities now described as the “Semantic Web,” there are many areas of overlap, especially involving the development of the OWL ontology standards and the use of rules and reasoning. The LIGHT research will contribute to the maturing of the Semantic Web and, at the same time, LIGHT will exploit relevant standards and tools that emerge from the Semantic Web activities.

12. Operational System for Harmonized Information Processing. A critical goal of this project is to develop a fully operational System for Harmonized Information Processing that will be used to support the types of challenges listed in Section 1, incorporating all the components above. It is essential that this system be developed with maximum flexibility and extensibility, so that new and existing applications can seamlessly extract data from an array of changing heterogeneous sources. The utility of many data bases in the national priority areas is seriously constrained by the difficulties of reconciling known disparities and conflicts within and across sources. (Data reconciliation itself has become an important focus of scholarly inquiry in various parts of political science, as recognized by the NSF.)

13. Policy Implications Regarding Data Use and Re-use. There are widely differing views regarding the use and re-use of even publicly available information. In particular, the USA has taken a largely “laissez faire” approach whereas the European Union is pursuing a much more restrictive policy (as embodied in its “Data Base Directive”). We have started to apply principles from the domain of economics to develop a more scientific approach to studying and evaluating the current and proposed policies and legislation in this area. {See [ZMS02] for more details on proposed solution approach.}
Section 3. National Priority Area Research – Focus on National and Homeland Security (NHS)
3.1 Brief Domain Overview

The study of International Relations (IR) in Political Science generally converges around two seemingly distinct, but interrelated ‘poles’, namely matters of (a) conflict and war and (b) cooperation and collaboration. Both ‘poles’ address matters of sovereignty and security, national action and international consequence, local disruptions and global impacts, national integration and regional contestation, among others. Differences in theories, methods, and data practices create different perspectives on issues, shaping different “questions”, and potentially leading to different “answers”. The proliferation of new actors (e.g., states, non-governmental organizations, cross-border political groups, non-state actors, international institutions, and global firms) reflects diverse perspectives and creates new sources of data as well as new difficulties for access, interpretation and management. Therefore, it comes as no surprise that fundamental changes in the international system have created new priorities and challenges for the conduct of research and the making of policy.
