The basic elements that must be considered in designing a DWH architecture are the data sources, the management instruments, the effective data warehouse data structure (in terms of micro, macro and metadata), and the different types and number of possible users.
A DWH is generally composed of two main functional environments: the first is where all available information is collected and built up, usually defined as the Extraction, Transformation and Loading (ETL) environment, while the second is the actual data warehouse, i.e. where data analysis or mining, reports for executives and statistical deliverables are produced.
If data enter the S-DWH only after statistical checks and data imputation (the Process phase in the GSBPM), the S-DWH uses as sources cleaned micro data together with their descriptive and quality metadata.
We define this approach as a “passive S-DWH system”, in which the ETL phase excludes any statistical action on data to correct or modify values, i.e. it excludes all sub-processes of the GSBPM Process phase. Of course, this does not exclude further data transformation steps for a coherent harmonization of definitions.
Otherwise, if we include in the ETL phase all statistical checks and data imputation on the data sources, we are considering a whole statistical production system with its typical statistical elaborations. We may define this approach as a “fully active S-DWH system”, in which we include statistical actions on data to correct or modify values and to transform them in order to harmonize definitions from the sources to the output.
With this last approach we must identify a single entry point, where possible, for the metadata definitions and for the management of all statistical processes handled. This approach is the most complex one in terms of design, and it depends on the ability of each organization to overcome the methodological, organizational and IT barriers of a fully active S-DWH.
Any other intermediate solution, between a fully active and a passive S-DWH system, can be accommodated by managing as external sources the data and metadata produced outside the S-DWH control system. The boundary of an S-DWH can then be defined as the operational limit of its internal users, which depends on the typology and availability of the data sources.
In a generic fully active S-DWH system we identify four functional layers. Listed from the top of the architectural pile down to the bottom, they are:
IV access layer, for the final presentation, dissemination and delivery of the information sought, specialized for users external to the NSI or EStat;
III interpretation and data analysis layer, which enables any data analysis or data mining needed to support statistical design or new strategies, as well as data re-use; functionality and data are therefore optimized for internal users, specifically statistician methodologists or statistician experts;
II integration layer, where all operational activities needed for any statistical production process are carried out; in this layer data are mainly transformed from raw to cleaned data, and these activities are carried out by internal operators;
I source layer, the level in which we locate all the activities related to storing and managing internal or external data sources.
The ground level corresponds to the area where the process starts, while the top of the pile is where the data warehousing process finishes. This reflects a conceptual organization in which we consider the first two levels as operational IT infrastructures and the last two layers as the effective data warehouse.
This layered S-DWH vision can be described in terms of three reference architecture domains:
- Business Architecture,
- Information Systems Architecture,
- Technology Architecture.
The Business Architecture (BA) is a part of an enterprise architecture related to corporate business, and the documents and diagrams that describe the architectural structure of that business.
The Information Systems Architecture is, in our context, the conceptual organization of the effective S-DWH which is able to support tactical demands.
The Technology Architecture is the combined set of software, hardware and networks able to develop and support IT services.
6 Business Architecture
The BA is the bridge between the enterprise business model and enterprise strategy on one side and the business functionality of the enterprise on the other, and it is used to align strategic objectives and tactical demands. It provides a common understanding of an NSI by articulating the organization in terms of:
- management processes, the processes that govern the operation of a system,
- business processes, that constitute the core business and create the primary value stream.
6.1 Business processes, that constitute the core business and create the primary value stream
In the layered S-DWH vision we identify the business processes in each layer: the ground level corresponds to the area where the external sources arrive and are interfaced, while the top of the pile is where aggregated, or deliverable, data are available to external users. In the intermediate layers we manage the ETL functions that load the DWH, in which strategic analysis, data mining and design are carried out for possible new strategies or data re-use.
This reflects a conceptual organization in which we consider the first two levels as pure statistical operational infrastructures, where the necessary information is produced through acquiring, storing, coding, checking, imputing, editing and validating data, and the last two layers as the effective data warehouse, i.e. the levels in which data are accessible for analysis, re-use and reporting.
The core of the S-DWH system is the interpretation and analysis layer. This is the effective data warehouse, and it must support all kinds of statistical analysis or data mining, on micro and macro data, in order to support statistical design, data re-use or real-time quality checks during production.
Layers II and III are reciprocally functional to each other. Layer II always prepares the elaborated information for layer III: from raw data, just uploaded into the S-DWH and not yet included in a production process, to micro/macro statistical data at any elaboration step of any production process. Conversely, in layer III it must be possible to easily access and analyze these micro/macro elaborated data at any state of elaboration, from raw data to cleaned and validated micro data. This is because in layer III methodologists should be able to correct possible operational elaboration mistakes before, during and after any statistical production line, or to design new elaboration processes for new surveys. In this way a new concept or strategy can generate feedback toward layer II, which can then correct the regular production lines or increase their quality.
A key factor of this S-DWH architecture is that layers II and III must include components for bidirectional cooperation. This means that layer II supplies elaborated data for analytical activities, while layer III supplies concepts usable for the engineering of ETL functions or of new production processes.
Finally, the access layer should support functionalities related to the operation of output systems, from the dissemination web application to interoperability. From this point of view, the access layer operates inversely to the source layer. In layer IV we should carry out all transformations, of both data and metadata, from the S-DWH data structure toward any interface tools used for dissemination.
In the following sections we will indicate explicitly the atomic activities that should be supported by each layer using the GSBPM taxonomy.
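As a compact preview, the allocation of GSBPM sub-processes to layers detailed in the following sections can be sketched as a simple lookup table. The dictionary name and the helper function are hypothetical and only illustrate the idea; the sub-process numbers are those listed for each layer in this chapter.

```python
# Hypothetical lookup table: S-DWH layer -> GSBPM sub-processes,
# as listed for each layer in this chapter.
LAYER_SUBPROCESSES = {
    "I source": ["4.2", "4.3", "4.4"],
    "II integration": [
        "5.1", "5.2", "5.3", "5.4", "5.5", "5.6", "5.7", "5.8", "6.1",
    ],
    "III interpretation and data analysis": [
        "1.5", "2.1", "2.2", "2.4", "2.5", "2.6", "4.1",
        "5.1", "5.5", "5.6", "5.7",
        "6.1", "6.2", "6.3", "6.4", "6.5", "7.1", "9.1", "9.2",
    ],
    "IV access": ["7.1", "7.2", "7.3", "7.4", "7.5"],
}


def layers_supporting(sub_process: str) -> list[str]:
    """Return the layers (bottom to top) in which a sub-process appears."""
    return [
        layer
        for layer, sub_processes in LAYER_SUBPROCESSES.items()
        if sub_process in sub_processes
    ]


# Integrate data (5.1) appears both as a regular operation and as a design activity.
shared = layers_supporting("5.1")
```

A sub-process that appears in two layers, such as 5.1, is operational in one layer and exploratory in the other, which is the bidirectional cooperation between layers II and III described above.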
Source Layer functionalities
The Source Layer is the level in which we locate all the activities related to storing and managing internal or external data sources. Internal data come from direct data capture carried out by CAWI, CAPI or CATI, while external data come from administrative archives, for example from Customs Agencies, Revenue Agencies, Chambers of Commerce and National Social Security Institutes.
Generally, data from direct surveys are well structured, so they can flow directly into the integration layer. This is because NSIs have full control of their own applications. By contrast, data from other institutions’ archives must come into the S-DWH together with their metadata in order to be read correctly.
In the source layer we support data loading operations for the integration layer but do not include any data transformation operations, which will be realized in the next layer.
Analyzing the GSBPM shows that the only activities that can be included in this layer are:
Phase | Sub-process
4 - Collect | 4.2 - set up collection; 4.3 - run collection; 4.4 - finalize collection
Set up collection (4.2) ensures that the people, processes, instruments and technology are ready to collect data. This sub-process includes:
- preparing web collection instruments,
- training collection staff,
- ensuring collection resources are available, e.g. laptops,
- configuring collection systems to request and receive the data,
- ensuring the security of data to be collected.
Where the process is repeated regularly, some of these activities may not be explicitly required for each iteration.
Run collection (4.3) is where the collection is implemented, with different collection instruments being used to collect the data.
It is important to consider that, in a web survey, the run collection sub-process may take place at the same time as the review, validate and edit sub-process.
Finalize collection (4.4) includes loading the collected data into a suitable electronic environment for further processing in the next layers. This sub-process also aims to check the metadata descriptions of all external archives entering the S-DWH system. In a generic data interchange, as far as metadata transmission is concerned, a mapping between the metadata concepts used by different international organizations could support the open exchange and sharing of metadata based on common terminology.
Integration Layer functionalities
The integration layer is where all operational activities needed for all statistical elaboration processes are carried out. These are operations carried out automatically or manually by operators to produce statistical information in an IT infrastructure. To this end, different sub-processes are pre-defined and pre-configured by statisticians, as a consequence of the statistical survey design, in order to support the operational activities.
This means that whoever is responsible for a statistical production subject defines the operational work flow and each elaboration step, in terms of input and output parameters that must be defined in the integration layer, to realize the statistical elaboration.
For this reason, production tools in this layer must support an adequate level of generalization for a wide range of processes and iterative productions. They should be organized in operational work flows for checking, cleaning, linking and harmonizing data in a common persistent area where information is grouped by subject. These work flows cover the recurring (cyclic) activities involved in running the whole or any part of a statistical production process, and they should be able to integrate activities requiring different statistical skills and belonging to different information domains.
To sustain these operational activities, it would be advisable to have micro data organized in generalized data structures able to archive any kind of statistical production. Otherwise, data may be organized in completely free form, provided that the accompanying metadata are rich enough to generate an automatic structured interface to the data themselves.
Therefore, a wide family of software applications is possible in the integration layer, ranging from data integration tools, in which a user-friendly graphical interface helps to build work flows, to a generic statistical elaboration line or part of one.
In this layer, we should include all the sub-processes of phase 5 and some sub-processes from phases 4, 6 and 7 of the GSBPM:
Phase | Sub-process
5 - Process | 5.1 - integrate data; 5.2 - classify & code; 5.3 - review, validate & edit; 5.4 - impute; 5.5 - derive new variables and statistical units; 5.6 - calculate weights; 5.7 - calculate aggregates; 5.8 - finalize data files
6 - Analyze | 6.1 - prepare draft outputs
Integrate data (5.1), this sub-process integrates data from one or more sources. Input data can be from external or internal data sources and the result is a harmonized data set. Data integration typically includes record linkage routines and prioritising, when two or more sources contain data for the same variable (with potentially different values).
The integration sub-process includes micro data record linkage, which can be carried out before or after any reviewing or editing, depending on the statistical process. At the end of each production process, data organized by subject area should be clean and linkable.
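A minimal sketch of this idea follows, assuming exact matching on a shared unit identifier and a fixed ranking of sources. The function name, the field names and the two toy sources are hypothetical; real record linkage often requires probabilistic matching, which is not shown here.

```python
def integrate(sources, priority, key="unit_id"):
    """Link records on a shared key and harmonize them into one data set.

    When two sources report the same variable, the value from the source
    listed first in `priority` is kept (simple prioritising rule).
    """
    merged = {}
    # Process sources from lowest to highest priority so later writes win.
    for name in reversed(priority):
        for record in sources[name]:
            target = merged.setdefault(record[key], {key: record[key]})
            for variable, value in record.items():
                if value is not None:  # missing values never overwrite
                    target[variable] = value
    return [merged[k] for k in sorted(merged)]


# Hypothetical inputs: a direct survey and an administrative archive.
survey = [{"unit_id": 1, "turnover": 120, "employees": None}]
admin = [
    {"unit_id": 1, "turnover": 100, "employees": 8},
    {"unit_id": 2, "turnover": 50, "employees": 3},
]
linked = integrate({"survey": survey, "admin": admin}, priority=["survey", "admin"])
```

Here the survey value of turnover prevails for unit 1, while the number of employees, missing in the survey, is taken from the administrative source.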
Classify and code (5.2), this sub-process classifies and codes data. For example, automatic coding routines may assign numeric codes to text responses according to a pre-determined classification scheme, leaving a residual set of cases to be coded interactively by human operators.
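The split between automatic coding and the residual interactive activity can be sketched as follows. The code list, its numeric codes and the function name are invented for illustration and do not correspond to any official classification.

```python
# Hypothetical extract of a classification scheme: text label -> numeric code.
CODE_LIST = {"bakery": "101", "retail sale of bread": "202"}


def auto_code(responses, code_list=CODE_LIST):
    """Assign codes to text responses; route unmatched ones to manual coding."""
    coded, manual_queue = {}, []
    for response_id, text in responses.items():
        label = " ".join(text.lower().split())  # normalise case and spacing
        if label in code_list:
            coded[response_id] = code_list[label]
        else:
            manual_queue.append(response_id)  # residual interactive activity
    return coded, manual_queue


coded, manual_queue = auto_code(
    {"r1": "Bakery", "r2": "  retail sale  of bread", "r3": "bread shop"}
)
```

Responses left in `manual_queue` are those that an operator would code interactively.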
Review, validate and edit (5.3), this sub-process applies to collected micro-data, and looks at each record to try to identify (and where necessary correct) potential problems, errors and discrepancies such as outliers, item non-response and miscoding. It can also be referred to as input data validation. It may be run iteratively, validating data against predefined edit rules, usually in a set order. It may apply automatic edits, or raise alerts for manual inspection and correction of the data. Reviewing, validating and editing can apply to unit records both from surveys and administrative sources, before and after integration.
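As an illustration, predefined edit rules can be represented as named checks applied to each record in a set order, raising alerts rather than correcting values. The rule names, the variables and the outlier threshold below are assumptions made for this sketch only.

```python
# Hypothetical edit rules, applied in order; each check returns True when
# the record passes.
EDIT_RULES = [
    ("turnover_missing", lambda r: r.get("turnover") is not None),
    ("turnover_negative", lambda r: r.get("turnover") is None or r["turnover"] >= 0),
    ("employees_outlier", lambda r: r.get("employees") is None or r["employees"] <= 10000),
]


def validate(records, rules=EDIT_RULES):
    """Return {unit_id: [names of failed rules]} for records needing attention."""
    alerts = {}
    for record in records:
        failed = [name for name, check in rules if not check(record)]
        if failed:
            alerts[record["unit_id"]] = failed
    return alerts


records = [
    {"unit_id": 1, "turnover": 120, "employees": 8},
    {"unit_id": 2, "turnover": None, "employees": 3},   # item non-response
    {"unit_id": 3, "turnover": -5, "employees": 50000}, # error and outlier
]
alerts = validate(records)
```

The resulting alerts can feed either automatic edits or a manual inspection queue, and the same function can be re-run after each correction round.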
Impute (5.4), this sub-process is applied when data are missing or unreliable. Estimates may be imputed, often using a rule-based approach.
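One possible rule, shown here only as an example, replaces a missing value with the mean of the reported values in the same stratum and flags the record as imputed. The choice of rule, the field names and the function name are assumptions; the source does not prescribe a specific imputation method.

```python
from statistics import mean


def impute_stratum_mean(records, variable, stratum="stratum"):
    """Fill missing values of `variable` with the stratum mean of reported values.

    Returns new records carrying an `imputed` flag; the input is not modified.
    """
    reported = {}
    for record in records:
        if record[variable] is not None:
            reported.setdefault(record[stratum], []).append(record[variable])

    result = []
    for record in records:
        new_record = dict(record)  # work on a copy
        new_record["imputed"] = (
            record[variable] is None and record[stratum] in reported
        )
        if new_record["imputed"]:
            new_record[variable] = mean(reported[record[stratum]])
        result.append(new_record)
    return result


data = [
    {"unit_id": 1, "stratum": "A", "turnover": 100},
    {"unit_id": 2, "stratum": "A", "turnover": 200},
    {"unit_id": 3, "stratum": "A", "turnover": None},
]
imputed = impute_stratum_mean(data, "turnover")
```

Keeping the `imputed` flag alongside the value preserves the quality metadata needed later in the process.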
Derive new variables and statistical units (5.5), this sub-process in this layer describes the simple function of the derivation of new variables and statistical units from existing data using logical rules defined by statistical methodologists.
Calculate weights (5.6), this sub-process creates weights for unit data records according to the defined methodology and is automatically applied for each iteration.
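As a simple example of such a methodology, the sketch below computes design weights as the ratio between population size and sample size within each stratum. This particular formula, the function name and the field names are assumptions for illustration; the defined methodology may include further adjustments (for example for non-response) that are not shown.

```python
from collections import Counter


def design_weights(sample, population_counts, stratum="stratum"):
    """Return {unit_id: N_h / n_h}, where h is the stratum of the unit."""
    sample_counts = Counter(record[stratum] for record in sample)
    return {
        record["unit_id"]: population_counts[record[stratum]] / sample_counts[record[stratum]]
        for record in sample
    }


sample = [
    {"unit_id": 1, "stratum": "A"},
    {"unit_id": 2, "stratum": "A"},
    {"unit_id": 3, "stratum": "B"},
]
# Hypothetical population counts per stratum, taken from the frame.
weights = design_weights(sample, {"A": 10, "B": 4})
```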
Calculate aggregates (5.7), this sub-process creates already defined aggregate data from micro-data for each iteration. Sometimes this may be an intermediate rather than a final activity, particularly for business processes where there are strong time pressures and a requirement to produce both preliminary and final estimates.
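A minimal sketch of an already defined aggregate is a weighted total by domain, computed from micro-data that carry a weight. The function, variable and field names are hypothetical.

```python
from collections import defaultdict


def weighted_totals(records, variable, by, weight="weight"):
    """Aggregate micro-data into weighted totals of `variable` per `by` group."""
    totals = defaultdict(float)
    for record in records:
        totals[record[by]] += record[weight] * record[variable]
    return dict(totals)


micro = [
    {"region": "North", "turnover": 100, "weight": 5.0},
    {"region": "North", "turnover": 200, "weight": 5.0},
    {"region": "South", "turnover": 50, "weight": 4.0},
]
totals = weighted_totals(micro, "turnover", by="region")
```

The same function can be run on preliminary and on final micro-data to produce both sets of estimates.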
Finalize data files (5.8), this sub-process brings together the results of the production process, usually macro-data, which will be used as input for dissemination.
Prepare draft outputs (6.1), this sub-process is where the information produced is transformed into statistical outputs for each iteration. Generally, it includes the production of additional measurements such as indices, trends or seasonally adjusted series, as well as the recording of quality characteristics. The presence of this sub-process in this layer is strictly related to regular production processes, in which the estimated measures are regularly produced, as is the case in the STS.
Interpretation and data analysis layer functionalities
The interpretation and data analysis layer is specifically for internal users, i.e. statisticians. It enables any data analysis and data mining at the most detailed granularity, micro data, in order to design production processes or to identify opportunities for data re-use. Data mining is the process of applying statistical methods to data with the intention of uncovering hidden patterns. This layer must be suitable to support experts in free data analysis, in order to design or test any possible new statistical methodology or strategy.
The results expected from the human activities in this layer should then be statistical “services” useful for other phases of the elaboration process, from sampling, to the set-up of instruments used in the Process phase, up to the generation of possible new statistical outputs. These services can also be oriented to re-use, by creating new hypotheses to test against larger data populations. In this layer experts can design the complete process of information delivery, which includes cases where the demand for new statistical information does not necessarily involve the construction of new surveys, as well as the complete work-flow set-up for any new survey needed.
From this point of view, the activities on the Interpretation layer should be functional not only to statistical experts for analysis but also to self-improve the S-DWH, by a continuous update, or new definition, of the production processes managed by the S-DWH itself.
We should point out that an S-DWH approach can also increase efficiency in the Specify Needs and Design phases, since statistical experts working on these phases in layer III share the same information elaborated in the Process phase in layer II.
The use of a data warehouse approach for statistical production has the advantage of forcing different typologies of users to share the same information data. That is, the same stored-data are usable for different statistical phases.
In general, only a reduced number of users are allowed to operate in the Interpretation layer, in order to prevent a degradation of server performance, given that deep data analysis may involve very complex activities whose processing costs are not always evaluated in advance. Moreover, queries on the operational structures of the integration layer cannot be left to free user access; they must always be optimized and mediated by specific tools, so as not to reduce the server performance of the integration layer.
Therefore, this layer supports any possible activities for new statistical production strategies aimed at recovering facts from large administrative archives. This would increase production efficiency and reduce both the statistical burden and production costs.
From the GSBPM then we consider:
Phase | Sub-process
1 - Specify Needs | 1.5 - check data availability
2 - Design | 2.1 - design outputs; 2.2 - design variable descriptions; 2.4 - design frame and sample methodology; 2.5 - design statistical processing methodology; 2.6 - design production systems and workflow
4 - Collect | 4.1 - select sample
5 - Process | 5.1 - integrate data; 5.5 - derive new variables and statistical units; 5.6 - calculate weights; 5.7 - calculate aggregates
6 - Analyze | 6.1 - prepare draft outputs; 6.2 - validate outputs; 6.3 - scrutinize and explain; 6.4 - apply disclosure control; 6.5 - finalize outputs
7 - Disseminate | 7.1 - update output systems
9 - Evaluate | 9.1 - gather evaluation inputs; 9.2 - conduct evaluation
Check data availability (1.5), this sub-process checks whether current data sources could meet user requirements, and the conditions under which they would be available, including any restrictions on their use. An assessment of possible alternatives would normally include research into potential administrative data sources and their methodologies, to determine whether they would be suitable for use for statistical purposes. When existing sources have been assessed, a strategy for filling any remaining gaps in the data requirement is prepared. This sub-process also includes a more general assessment of the legal framework in which data would be collected and used, and may therefore identify proposals for changes to existing legislation or the introduction of a new legal framework.
Design outputs (2.1), this sub-process contains the detailed design of the statistical outputs to be produced, including the related development work and preparation of the systems and tools used in phase 7 (Disseminate). Outputs should be designed, wherever possible, to follow existing standards, so inputs to this process may include metadata from similar or previous collections and international standards.
Design variable descriptions (2.2), this sub-process defines the statistical variables to be collected via the data collection instrument, as well as any other variables that will be derived from them in sub-process 5.5 (Derive new variables and statistical units), and any classifications that will be used. This sub-process may need to run in parallel with sub-process 2.3 (Design data collection methodology), as the definition of the variables to be collected and the choice of data collection instrument may be inter-dependent to some degree. Layer III can be seen as a simulation environment able to identify the variables effectively needed.
Design frame and sample methodology (2.4), this sub-process identifies and specifies the population of interest, defines a sampling frame (and, where necessary, the register from which it is derived), and determines the most appropriate sampling criteria and methodology (which could include complete enumeration). Common sources are administrative and statistical registers, censuses and sample surveys. This sub-process describes how these sources can be combined if needed. Analysis of whether the frame covers the target population should be performed. A sampling plan should be made: the actual sample is created in sub-process 4.1 (Select sample), using the methodology specified in this sub-process.
Design statistical processing methodology (2.5), this sub-process designs the statistical processing methodology to be applied during phase 5 (Process), and Phase 6 (Analyse). This can include specification of routines for coding, editing, imputing, estimating, integrating, validating and finalising data sets.
Design production systems and workflow (2.6), this sub-process determines the workflow from data collection to archiving, taking an overview of all the processes required within the whole statistical production process, and ensuring that they fit together efficiently with no gaps or redundancies. Various systems and databases are needed throughout the process. A general principle is to reuse processes and technology across many statistical business processes, so existing systems and databases should be examined first, to determine whether they are fit for purpose for this specific process, then, if any gaps are identified, new solutions should be designed. This sub-process also considers how staff will interact with systems, and who will be responsible for what and when.
Select sample (4.1), this sub-process establishes the frame and selects the sample for each iteration of the collection, in line with the design frame and sample methodology. This is an interactive activity on statistical business registers, typically carried out by statisticians using advanced methodological tools.
It includes the coordination of samples between instances of the same statistical business process (for example to manage overlap or rotation), and between different processes using a common frame or register (for example to manage overlap or to spread response burden).
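The selection step alone can be sketched as a stratified simple random sample drawn from a frame with a fixed seed, so that a given iteration is reproducible. This is an assumption-laden illustration: the function name, the frame layout and the allocation are hypothetical, and the coordination of samples (overlap, rotation, spreading response burden) is not covered.

```python
import random


def select_sample(frame, sizes, seed, stratum="stratum"):
    """Draw a simple random sample of `sizes[h]` units within each stratum h."""
    rng = random.Random(seed)  # dedicated generator: same seed, same sample
    selected = []
    for name in sorted(sizes):
        units = sorted(u["unit_id"] for u in frame if u[stratum] == name)
        selected.extend(rng.sample(units, sizes[name]))
    return sorted(selected)


# Hypothetical frame: units 1-6 in stratum A, units 7-10 in stratum B.
frame = [{"unit_id": i, "stratum": "A" if i <= 6 else "B"} for i in range(1, 11)]
sample_ids = select_sample(frame, {"A": 3, "B": 2}, seed=2024)
```

Recording the seed together with the frame version is one way to make the selection of each iteration auditable.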
Integrate data (5.1), in this layer this sub-process makes it possible for experts to freely carry out micro data record linkage across different data sources when these refer to the same statistical analysis unit.
In this layer this sub-process must be understood as an evaluation supporting the design of data linking, wherever needed.
Derive new variables and statistical units (5.5), this sub-process derives variables and statistical units that are not explicitly provided in the collection but are needed to deliver the required outputs. In this layer this function would be used to set up procedures or to define the derivation rules applicable in each production iteration. In this layer this sub-process must be understood as an evaluation supporting the design of new variables.
Prepare draft outputs (6.1), in this layer this sub-process means the free construction of non-regular outputs.
Validate outputs (6.2), this sub-process is where statisticians validate the quality of the outputs produced. This sub-process is also intended as a regular operational activity, and the validations are carried out at the end of each iteration against an already defined quality framework.
Scrutinize and explain (6.3), this sub-process is where the in-depth understanding of the outputs is gained by statisticians. They use that understanding to scrutinize and explain the statistics produced for this cycle by assessing how well the statistics reflect their initial expectations, viewing the statistics from all perspectives using different tools and media, and carrying out in-depth statistical analyses.
Apply disclosure control (6.4), this sub-process ensures that the data (and metadata) to be disseminated do not breach the appropriate rules on confidentiality. This means the use of specific methodological tools to check for primary and secondary disclosure.
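For primary disclosure only, a common type of check is a minimum-frequency rule: a cell based on too few contributing units is suppressed. The sketch below assumes a threshold of three units, which is an illustrative value and not a rule stated in this document; secondary disclosure (protecting suppressed cells from being recomputed from totals) requires dedicated tools and is not shown.

```python
def primary_suppression(cells, min_units=3):
    """Replace the value of cells with fewer than `min_units` contributors by None."""
    return {
        name: (cell["value"] if cell["n_units"] >= min_units else None)
        for name, cell in cells.items()
    }


# Hypothetical aggregated cells with their number of contributing units.
cells = {
    "North": {"value": 1500.0, "n_units": 12},
    "South": {"value": 200.0, "n_units": 2},
}
published = primary_suppression(cells)
```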
Finalize outputs (6.5), this sub-process ensures the statistics and associated information are fit for purpose and reach the required quality level, and are thus ready for use.
Update output systems (7.1), this sub-process manages updates to systems where data and metadata are stored for dissemination purposes.
Gather evaluation inputs (9.1), evaluation material can be produced in any other phase or sub-process. It may take many forms, including feedback from users, process metadata, system metrics and staff suggestions. Reports of progress against an action plan agreed during a previous iteration may also form an input to evaluations of subsequent iterations. This sub-process gathers all of these inputs, and makes them available for the person or team producing the evaluation.
Conduct evaluation (9.2), this sub-process analyzes the evaluation inputs and synthesizes them into an evaluation report. The resulting report should note any quality issues specific to this iteration of the statistical business process, and should make recommendations for changes if appropriate. These recommendations can cover changes to any phase or sub-process for future iterations of the process, or can suggest that the process is not repeated.
Access Layer functionalities
The Access Layer is the layer for the final presentation, dissemination and delivery of the information sought. This layer is addressed to a wide typology of external users and computer instruments. It must support both automatic dissemination systems and free analysis tools. In both cases the statistical information consists mainly of non-confidential macro data; micro data may be available only in special, limited cases.
This typology of users can be supported by three broad categories of instruments:
- a specialized web server for software interfaces towards other external integrated output systems. A typical example is the interchange of macro data via SDMX, as well as with other XML standards of international organizations.
- specialized Business Intelligence tools. In this category, which is extensive in terms of solutions on the market, we find tools to build queries, navigational tools (OLAP viewers) and, in a broad sense, web browsers, which are becoming the common interface for different applications. Among these we should also consider graphics and publishing tools able to generate graphs and tables for users.
- office automation tools. This is a reassuring solution for users who come to the data warehouse context for the first time, as they are not forced to learn new complex instruments. The problem is that this solution, while adequate with regard to productivity and efficiency, is very restrictive in the use of the data warehouse, since these instruments have significant architectural and functional limitations.
In order to support these different types of instruments, this layer must allow the automatic transformation, by software, of data already estimated and validated in the previous layers.
From the GSBPM we may consider only phase 7 for the operational processes, specifically:
Phase | Sub-process
7 - Disseminate | 7.1 - update output systems; 7.2 - produce dissemination products; 7.3 - manage release of dissemination products; 7.4 - promote dissemination products; 7.5 - manage user support
Update output systems (7.1), this sub-process in this layer manages the output update by adapting the already defined macro data to specific output systems, including re-formatting data and metadata into specific output databases and ensuring that data are linked to the relevant metadata. This process is related to the interoperability between the access layer and other external systems, e.g. toward the SDMX standard or another Open Data infrastructure.
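The re-formatting step can be sketched as attaching the relevant metadata to every macro-data cell and serializing the result for an output system. The record layout below is a generic JSON structure invented for this example; it is not the SDMX format, and the metadata fields and function name are hypothetical.

```python
import json


def to_output_records(totals, metadata):
    """Link each macro-data cell to the shared metadata describing it."""
    return [
        {"region": region, "value": value, **metadata}
        for region, value in sorted(totals.items())
    ]


# Hypothetical validated aggregates and their descriptive metadata.
records = to_output_records(
    {"South": 200.0, "North": 1500.0},
    {"indicator": "turnover", "unit": "EUR thousand", "period": "2023"},
)
payload = json.dumps(records, indent=2)  # text ready to load into an output database
```

Keeping the metadata inside each record is one simple way to ensure that data remain linked to their description once they leave the S-DWH data structure.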
Produce dissemination products (7.2), this sub-process produces final, previously designed statistical products, which can take many forms including printed publications, press releases and web sites. Typical steps include:
-preparing the product components (explanatory text, tables, charts etc.);
-assembling the components into products;
-editing the products and checking that they meet publication standards.
The production of dissemination products is a sort of integration process between tables, text and graphs. In general this is a production chain in which standard tables and comments from the scrutiny of the produced information are included.
Manage release of dissemination products (7.3), this sub-process ensures that all elements for the release are in place including managing the timing of the release. It includes briefings for specific groups such as the press or ministers, as well as the arrangements for any pre-release embargoes. It also includes the provision of products to subscribers.
Promote dissemination products (7.4), this sub-process concerns the active promotion of the statistical products produced in a specific statistical business process, to help them reach the widest possible audience. It includes the use of customer relationship management tools, to better target potential users of the products, as well as the use of tools including web sites, wikis and blogs to facilitate the process of communicating statistical information to users.
Manage user support (7.5), this sub-process ensures that customer queries are recorded, and that responses are provided within agreed deadlines. These queries should be regularly reviewed to provide an input to the over-arching quality management process, as they can indicate new or changing user needs.
6.2 Management processes, the processes that govern the operation of a system
In an S-DWH we recognize fourteen over-arching statistical processes needed to support the statistical production processes: nine of them are the same as in the GSBPM, while the remaining five are a consequence of a fully active S-DWH approach.
In line with the GSBPM, the first nine over-arching processes are:
- statistical program management – This includes systematic monitoring and reviewing of emerging information requirements and emerging and changing data sources across all statistical domains. It may result in the definition of new statistical business processes or the redesign of existing ones.
- quality management – This process includes quality assessment and control mechanisms. It recognizes the importance of evaluation and feedback throughout the statistical business process.
- metadata management – Metadata are generated and processed within each phase; there is, therefore, a strong requirement for a metadata management system to ensure that the appropriate metadata retain their links with data throughout the different phases.
- statistical framework management – This includes developing standards, for example methodologies, concepts and classifications that apply across multiple processes.
- knowledge management – This ensures that statistical business processes are repeatable, mainly through the maintenance of process documentation.
- data management – This includes process-independent considerations such as general data security, custodianship and ownership.
- process data management – This includes the management of data and metadata generated by, and providing information on, all parts of the statistical business process. Process management is the ensemble of activities of planning and monitoring the performance of a process, while operations management is an area of management concerned with overseeing, designing and controlling the process of production and redesigning business operations in the production of goods or services.
- provider management – This includes cross-process burden management, as well as topics such as profiling and management of contact information (and thus has particularly close links with statistical business processes that maintain registers).
- customer management – This includes general marketing activities, promoting statistical literacy, and dealing with non-specific customer feedback.
In addition, we should include five more over-arching management processes in order to coordinate the actions of a fully active S-DWH infrastructure; they are:
- S-DWH management – This includes all activities that support the coordination between statistical framework management, provider management, process data management and data management
- data capturing management – This includes all activities related to direct statistical or IT support (help desk) for respondents, i.e. the provision of specialized customer care for web-questionnaire compilation, or support towards external institutions for acquiring archives
- output management – This covers general marketing activities, promoting statistical literacy, and dealing with non-specific customer feedback
- web communication management – This includes data capturing management, customer management and output management; an example is the effective management of a statistical web portal able to support all front-office activities
- business register management (or the management of registers of institutions or civil registers) – The business register is a trade register kept by the registration authorities; its management is related to provider management and operational activities.
By definition, an S-DWH system includes all effective sub-processes needed to carry out any production process. Web communication management handles the contact between respondents and NSIs; this includes providing a contact point for the collection and dissemination of data over the internet. It supports several phases of the statistical business process, from collecting to disseminating, and at the same time provides the necessary support for respondents.
The BR Management is an overall process since the statistical, or legal, state of any enterprise is archived and updated at the beginning and end of any production process.
6.3 Functional diagram for strategic over-arching processes
The strategic management processes, among the over-arching processes stated in the GSBPM and in its extension for the S-DWH management functionalities, fall outside the S-DWH system but are still vital for the functioning of an S-DWH. These strategic functions are Statistical Program Management, Business Register Management and Web Communication Management. The functional diagram below illustrates the relation between the strategic over-arching processes and the operational management.
Figure 2: High level functional diagram representation
In the functional diagram, functions are represented by modules whose interactions are represented by flows. The diagram is a collection of coherent processes, which are continuously performed. Each module is drawn as a box and contains everything necessary to execute the represented functionality. As far as possible the GSBPM and GSIM are used to describe the functional architecture of an S-DWH; thus the colours of the arrows in the functional diagrams refer to the four conceptual categories already used in the GSIM conceptual reference model.
The functional diagram in figure 2 shows that the identification of new statistical needs (Specify Needs phase) will trigger the initiation of a Statistical Program. This, in turn, will then trigger a design phase (in GSIM, the Statistical Program Design, which will lead to the development of a set of Process Step Designs - i.e. all the sub-processes, business functions, inputs, outputs etc. that are to be used to undertake the statistical activity).
The basic input for new statistical information derives from the natural evolution of civil society or the economic system. During this phase, needs are investigated and high-level objectives are established for the output. The S-DWH supports this process by giving analysts access to all available information, so that they can check whether the new concepts and new variables are already managed in the S-DWH.
The design phase can be triggered by the demand for a new statistical product, or as a result of a change associated with process improvement, or perhaps as a result of new data sources becoming available. In each case a new Statistical Program will be created, and a new associated Statistical Design.
Web communication management is an external component with a strong interdependency with the S-DWH, since it is the interface for external users, respondents and the scientific or social community. From an operational point of view, the provision of a contact point accessible over the internet, e.g. a web portal, is a key factor for the relationship with respondents, for services related to direct or indirect data capturing, and for the delivery of information products.
Functional diagram for operational over-arching processes
In order to analyze the functions that support a generic statistical business process, we describe the functional diagram of Figure 2 in more detail. Expanding the module representing the S-DWH Management, we can identify four more management functions within it: Statistical Framework Management, Provider Management, Process Metadata Management and Data Management. Furthermore, by expanding the Web Communication Management module we can identify three more functions: Data Capturing Management, Customer Management and Output Management. This is shown in the diagram in Figure 3.
Figure 3: Functional Diagram, expanded representation.
The details in Figure 3 enable us to contextualize the nine phases of the GSBPM in an S-DWH functional diagram. We represent the nine phases using connecting arrows between modules. For the arrows we use the same four colours used in the GSIM to contextualize the objects.
The four layers in the S-DWH are placed in the Data management function labeled I° (Source layer), II° (Interpretation layer), III° (Integration layer) and IV° (Access layer).
Specify Needs phase - This phase is the request for new statistics or an update on current statistics. The flow is blue since this phase represents the building of Business Objects from the GSIM, i.e. activities for planning statistical programs. This phase is a strategic activity in an S-DWH approach because a first overall analysis of all available data and meta data is realized.
In the diagram we identify a sequence of functions starting from the Statistical Program, passing through the Statistical Framework and ending with the Interpretation layer of Data Management. This module relationship supports executives in order to “consult needs”, “identify concepts”, “estimate output objectives” and “determine needs for information”.
The connection between the Statistical framework and the Interpretation layer data indicates the flow of activities to “check data availability”, i.e. if the available data could meet the information needs or the conditions under which data would be available. This action is then supported by the “interpretation and analysis layer” functionalities in which data is available and easy to use for any expert in order to determine whether they would be suitable for the new statistical purposes. At the end of this action, statisticians should prepare a business case to get approval from executives or from the Statistical Program manager.
Design phase - This phase describes the development and design activities, and any associated practical research work needed to define the statistical outputs, concepts, methodologies, collection instruments and operational processes. All these sub-processes can create active and/or passive meta data, functional to the implementation process. Using the GSIM reference colours we colour this flow in blue to describe activities for planning the statistical program, realized by the interaction between the statistical framework, process metadata and provider management modules. Meanwhile the phase of conceptual definition is represented by the interaction between the statistical framework and the interpretation layer.
The information related to the “design data collection methodology” impacts on the provider management in order to “design the frame” and “sample methodology”. These designs specify the population of interest, defining a sample frame based on the business register, and determine the most appropriate sampling criteria and methodology in order to cover all output needs. It also uses information from the provider management in order to coordinate samples between instances of the same statistical business process (for example to manage overlap or rotation), and between different processes using a common frame or register (for example to manage overlap or to spread response burden).
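The coordination of samples drawn from a common frame can be sketched in a few lines of Python. The text above does not name a technique; this sketch assumes permanent random numbers (PRN), one common approach in which every unit in the register keeps a fixed random number and each survey selects an interval of it. The frame, the PRN values and the `select_sample` helper are hypothetical.

```python
from typing import Dict, Set


def select_sample(frame: Dict[str, float], start: float, fraction: float) -> Set[str]:
    """Select units whose permanent random number lies in
    [start, start + fraction), wrapping around at 1.0."""
    end = start + fraction
    if end <= 1.0:
        return {unit for unit, prn in frame.items() if start <= prn < end}
    return {unit for unit, prn in frame.items() if prn >= start or prn < end - 1.0}


# Hypothetical frame taken from the business register: unit id -> PRN.
frame = {"A": 0.05, "B": 0.20, "C": 0.35, "D": 0.55, "E": 0.70, "F": 0.90}

# Two instances of the same survey: shifting the start rotates part of the sample.
wave_1 = select_sample(frame, start=0.0, fraction=0.4)
wave_2 = select_sample(frame, start=0.2, fraction=0.4)

# A different survey on the same frame: a distant interval spreads response burden.
other_survey = select_sample(frame, start=0.6, fraction=0.35)

# An interval that crosses 1.0 wraps around to the low end of the PRN range.
wrapped = select_sample(frame, start=0.8, fraction=0.3)
```

Moving the start point controls how much two waves overlap, while non-overlapping intervals keep different surveys from selecting the same respondents.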
The operational activity definitions are based on a specific design of a statistical process methodology which includes specification of routines for coding, editing, imputing, estimating, integrating, validating and finalizing data sets. All methodological decisions are taken using concepts and instruments defined in the Statistical Framework. The workflow definition is managed inside the Process Metadata and supports the production system. If a new process requires a new concept, variable or instrument, those are defined in the Statistical Framework.
Build phase - In this phase all sub-processes are built and tested for the production of the system components. For statistical outputs produced on a regular basis, this phase usually occurs for the first iteration or following a review or a change in methodology, rather than for each iteration.
In a S-DWH which represents a generalized production infrastructure this phase is based on code reuse and each new output production line is basically a work flow configuration. This has a direct impact on active metadata managed by process meta data in order to execute the operational production flows properly. In analogy with the GSIM, we color this flow in orange. Therefore, in a S-DWH the build phase can be seen as a metadata configuration able to interconnect the Statistical Framework with the DWH data structures.
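The idea of the build phase as a metadata configuration can be sketched as follows. This is a minimal illustration, not the design of any particular S-DWH: the step registry, the step names and the `run_workflow` helper are hypothetical. The reusable code lives in the registry, and a new output production line is only an ordered list of step names held as active metadata.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


# Reusable, already tested processing steps (hypothetical examples).
def drop_missing_turnover(rows: List[Record]) -> List[Record]:
    return [r for r in rows if r.get("turnover") is not None]


def code_size_class(rows: List[Record]) -> List[Record]:
    return [
        {**r, "size_class": "small" if r["employees"] < 50 else "large"}
        for r in rows
    ]


STEP_REGISTRY: Dict[str, Step] = {
    "drop_missing_turnover": drop_missing_turnover,
    "code_size_class": code_size_class,
}


def run_workflow(config: List[str], rows: List[Record]) -> List[Record]:
    """Execute the steps named in the workflow configuration, in order."""
    for step_name in config:
        rows = STEP_REGISTRY[step_name](rows)
    return rows


# Active metadata: a new production line is configured, not programmed.
new_output_line = ["drop_missing_turnover", "code_size_class"]

rows = [
    {"unit_id": "A1", "turnover": 100, "employees": 10},
    {"unit_id": "A2", "turnover": None, "employees": 80},
    {"unit_id": "A3", "turnover": 900, "employees": 120},
]
result = run_workflow(new_output_line, rows)
```

Under this assumption, adding an output only requires a new configuration list, provided the steps it needs are already in the registry.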
Collect phase - This phase is related to all collection activities for necessary data, and loading of data into the source layer of the S-DWH. This represents the first step of the operational production process and therefore, in analogy with the GSIM the flow is colored red.
The two main modules involved with the collection phase in the functional diagram are Provider Management and Data Capturing Management.
Provider Management includes cross-process burden management, profiling and contact information management. This is done by optimizing register information using three inputs: the first from the external official Business Register, the second from respondents' feedback and the third from the identification of the sample for each survey.
Data capturing management collects external data into the source layer. Typically this phase does not include any data transformations. We identify two main types of data capture: from controlled systems and from non-controlled systems. The first is data collection directly from respondents using instruments which should include shared variable definitions and preliminary quality checks; a typical example is a web questionnaire. The second is, for example, data collected from an external archive. In this case a conceptual mapping between internal and external statistical concepts is necessary before any data can be loaded. Data mapping involves combining data residing in different sources and providing users with a unified view of these data. Such systems are formally defined as a triple (T, S, M), where T is the target schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the target schemas.
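The triple of target schema T, source schemas S and mapping M can be illustrated with a minimal Python sketch. The schema names, variable names and records below are hypothetical, and the mapping is reduced to renaming variables; a real conceptual mapping would also have to reconcile classifications, units and reference periods.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class IntegrationSystem:
    target: Tuple[str, ...]               # T: variables of the target schema
    sources: Dict[str, Tuple[str, ...]]   # S: source name -> its variables
    mapping: Dict[str, Dict[str, str]]    # M: source name -> {source var: target var}

    def to_target(self, source: str, record: Dict[str, object]) -> Dict[str, Optional[object]]:
        """Translate one source record into the target schema.
        Target variables that the source does not provide stay None."""
        rules = self.mapping[source]
        unified: Dict[str, Optional[object]] = {var: None for var in self.target}
        for source_var, value in record.items():
            if source_var in rules:
                unified[rules[source_var]] = value
        return unified


system = IntegrationSystem(
    target=("unit_id", "turnover", "employees"),
    sources={
        "web_questionnaire": ("unit_id", "turnover", "employees"),
        "tax_archive": ("vat_no", "sales"),
    },
    mapping={
        "web_questionnaire": {
            "unit_id": "unit_id",
            "turnover": "turnover",
            "employees": "employees",
        },
        "tax_archive": {"vat_no": "unit_id", "sales": "turnover"},
    },
)

# A record from the external archive, expressed in the unified view.
unified = system.to_target("tax_archive", {"vat_no": "A17", "sales": 1200})
```

For the controlled source the mapping is the identity, which reflects the shared variable definitions built into the collection instrument; the external archive needs an explicit mapping before loading.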
Process phase - This phase covers the operational activities carried out by reviewers. It is based on specific elaboration steps and corresponds to the typical ETL phase of a DWH. In an S-DWH it describes the cleaning of data records and their preparation for output or analysis. The operational sequence of activities follows the design of the survey configured in the metadata management. This phase corresponds to the operational use of modules, and for this reason we colour this flow red, in analogy with the management of production objects in the GSIM.
All the sub-processes “classify & code”, “review”, “validate & edit”, “impute”, “derive new variables and statistical units”, “calculate weights”, “calculate aggregate” and “finalize data files” are carried out in the “integration layer”, following ad hoc sequences that depend on the type of survey.
The “integrate data” sub-process connects different sources and uses Provider Management in order to update the business register status asynchronously.
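Three of the sub-processes named above, “impute”, “calculate weights” and “calculate aggregate”, can be sketched in Python as follows. The methods are assumptions chosen for brevity, not those prescribed by the text: mean imputation and equal expansion weights for a simple random sample. The data and function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

Record = Dict[str, object]


def impute_mean(rows: List[Record], var: str) -> List[Record]:
    """Impute: replace missing values of `var` with the mean of observed values
    and flag the imputed records."""
    fill = mean(r[var] for r in rows if r[var] is not None)
    return [
        {**r, var: fill if r[var] is None else r[var], "imputed": r[var] is None}
        for r in rows
    ]


def add_design_weights(rows: List[Record], population_size: int) -> List[Record]:
    """Calculate weights: each sampled unit represents N / n population units."""
    weight = population_size / len(rows)
    return [{**r, "weight": weight} for r in rows]


def weighted_total(rows: List[Record], var: str) -> float:
    """Calculate aggregate: weighted estimate of the population total."""
    return sum(r[var] * r["weight"] for r in rows)


sample = [
    {"unit_id": "A1", "turnover": 100},
    {"unit_id": "A2", "turnover": None},
    {"unit_id": "A3", "turnover": 300},
    {"unit_id": "A4", "turnover": 200},
]

processed = add_design_weights(impute_mean(sample, "turnover"), population_size=40)
estimate = weighted_total(processed, "turnover")
```

Keeping the imputation flag on each record is one way to make the quality information available to later phases.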
Analyze phase - This phase is central for any S-DWH, since during this phase statistical concepts are produced, validated, examined in detail and made ready for dissemination. We therefore colour the activity flow of this phase green in accordance with the GSIM.
In the diagram the flow is bidirectional connecting the statistical framework and the interpretation layer of the data management. This is to indicate that all non consolidated concepts must be first created and tested directly in the interpretation and analysis layer. It includes the use or the definition of measurements such as indices, trends or seasonally adjusted series. All the consolidated draft output can be then automated for the next iteration and included directly in the ETL elaborations for a direct output.
The Analyze phase includes scrutiny and interpretation of primary data to support the data output. Through scrutiny, statisticians gain an in-depth understanding of the data. They use that understanding to explain the statistics produced in each cycle by evaluating how well they fit their initial expectations.
Disseminate phase - This manages the release of the statistical products to customers. For statistical outputs produced regularly, this phase occurs for each iteration. From the GSBPM we have five sub-processes: “updating output systems”, “produce dissemination products”, “manage release of dissemination products”, “promote dissemination products” and “manage user support”. All of these sub-processes can be considered directly related to operational data warehousing.
The “updating output systems” sub-process is the arrow connecting Data Management with Output Management. This flow is coloured red to indicate the operational uploading of data. Output Management produces dissemination products, manages their release and promotes them using the information stored in the “access layer”.
Finally, the “finalize output” sub-process ensures that the statistics and associated information are fit for purpose, reach the required quality level, and are thus ready for use. This sub-process is mainly realized in the “interpretation and analysis” layer, and its evaluations are available at the access layer.
Archive phase - This phase manages the archiving and disposal of statistical data and metadata. Considering that an S-DWH is substantially an integrated data system, this phase must be considered to be an over-arching activity; i.e. in a S-DWH it is a central structured generalized activity for all S-DWH levels. In this phase we include all operational structured steps needed for the Data Management and the flow is marked red.
The GSBPM considers four sub-processes: “definition archive rules”, “management of archive repository”, “preserve data and associated metadata” and “dispose of data and associated metadata”. Among them, “definition archive rules” is a typical metadata activity and the others are operational functions.
The archive rules sub-process defines structural metadata, i.e. the structure of data (data mart and primary), metadata, variables, data dimensions, constraints, etc., and it defines process metadata for a specific statistical business process with regard to the general archiving policy of the NSI or to standards applied across the government sector.
The other sub-processes concern the management of one or more databases, the preservation of data and metadata, and their disposal. These functions are operational in an S-DWH and depend on its design.
Evaluate phase - This phase provides the basic information for the overall quality evaluation management. The evaluation is applied to all the S-DWH layers through the statistical framework management. It takes place at the end of each sub-process, and the gathered quality information is stored in the corresponding metadata structures of each layer. Evaluation material may take many forms, such as data from monitoring systems, log files, feedback from users or staff suggestions.
For statistical outputs produced regularly evaluation should, at least in theory, occur once for each iteration. The evaluation is one key factor to determine whether future iterations should take place and whether any improvements should be implemented. In a S-DWH context the evaluation phase always involves evaluation of business processes for an integrated production.