Annual report




Work carried out

Change of protocol (from DiGIR to Tapir)


In 2001/2002 CRIA’s team, with the support of Fapesp, participated actively in an international initiative led by the Universities of Kansas and California at Berkeley to develop a data model and communication protocol to integrate data systems for biological collections. The data model adopted was DarwinCore2, which comprises 48 data fields common to all taxonomic groups. The communication protocol, based on a client-server architecture, is called DiGIR3 (Distributed Generic Information Retrieval).

Figure shows the basic architecture adopted by most collections from the Americas. The portal is a software package responsible for receiving queries from the search interface, redirecting them to providers, receiving the answers, and returning them to the search interface. The portal can send a request to several providers, communicating through the protocol, and can also determine which providers should be consulted based on existing metadata.

The provider is a software package responsible for serving data and metadata to the portal from multiple data sources in a structured way. It is implemented as a web application that answers queries from the portal: it receives an XML document, validates it, executes the search, and prepares the answer in the required format, also as an XML document. The communication protocol between the portal and the provider is DiGIR.

Figure . Basic architecture for a DiGIR network
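The receive-validate-search-respond cycle of a provider can be sketched as follows. This is a much-simplified illustration, not the real DiGIR schema: the request format, field names, and in-memory record store are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, much-simplified search request (the real DiGIR schema is far richer).
REQUEST = """<request>
  <search>
    <filter><equals field="darwin:Genus">Ocotea</equals></filter>
  </search>
</request>"""

# Stand-in for a collection's data source.
RECORDS = [
    {"darwin:Genus": "Ocotea", "darwin:Species": "pulchella"},
    {"darwin:Genus": "Miconia", "darwin:Species": "albicans"},
]

def handle_request(xml_text, records):
    """Parse and validate the request, run the search, build an XML response."""
    root = ET.fromstring(xml_text)                # parsing doubles as minimal validation
    eq = root.find("./search/filter/equals")
    field, value = eq.get("field"), eq.text
    hits = [r for r in records if r.get(field) == value]
    resp = ET.Element("response")
    for hit in hits:
        rec = ET.SubElement(resp, "record")
        for k, v in hit.items():
            ET.SubElement(rec, k.replace(":", "_")).text = v
    return ET.tostring(resp, encoding="unicode")

print(handle_request(REQUEST, RECORDS))
```

A real provider would, in addition, validate the request against the protocol schema and map the search onto the collection's own database.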

Three elements are required in order to serve data through a distributed network: fast and stable connectivity; an online data server accessible 24 hours a day; and qualified personnel who can guarantee a permanent service. Most Brazilian collections do not have even one of these. The speciesLink network has therefore introduced a new element to the architecture, the cache node, responsible for mirroring data from the collections and serving it to the network (Figure ).

Figure . Architecture of the speciesLink network

European collections have developed a data model and data transfer protocol that are different from DarwinCore and DiGIR. Their data model is called ABCD (Access to Biological Collection Data) and includes more than 500 data fields. The protocol used to serve data is called BioCase.

The two systems are incompatible, and GBIF (Global Biodiversity Information Facility)4 must run a system using both protocols in order to integrate data from all its members. A new protocol, TAPIR (TDWG Access Protocol for Information Retrieval)5, has been developed to deal with both data models, so that eventually only one global protocol, common to all networks of biological collections, will be needed. DiGIR networks are migrating to the new protocol, and in time all programs written for DiGIR will become obsolete. That is why changing the protocol was considered a priority and was carried out at the beginning of the project, before increasing the number of herbaria on the network.

Both DiGIR and TAPIR are based on a client-server architecture. A TAPIR provider was installed on all cache nodes of the speciesLink network, and a new TAPIR portal was developed by CRIA. This work also required the development of a new search engine (indexer) to collect, parse, and store data in a centralized database, enabling fast information retrieval, the production of indicators, and the use of other applications such as data cleaning (Figure ).
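The harvest-and-index step can be sketched as below. Provider access is stubbed out, the endpoint URL is hypothetical, and the DarwinCore-like field names are illustrative; the point is the pattern of pulling records from a provider and storing them in a centralized, queryable database.

```python
import sqlite3

def fetch_from_provider(url):
    """Stand-in for a TAPIR search request sent to a cache node."""
    return [
        {"catalognumber": "SP-001", "family": "Lauraceae", "genus": "Ocotea"},
        {"catalognumber": "SP-002", "family": "Fabaceae", "genus": "Inga"},
    ]

def harvest(db, provider_url):
    """Pull records from a provider and upsert them into the central index."""
    db.execute("""CREATE TABLE IF NOT EXISTS records
                  (catalognumber TEXT PRIMARY KEY, family TEXT, genus TEXT)""")
    for rec in fetch_from_provider(provider_url):
        db.execute("INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
                   (rec["catalognumber"], rec["family"], rec["genus"]))
    db.commit()

db = sqlite3.connect(":memory:")
harvest(db, "http://example.org/tapir")   # hypothetical endpoint
n = db.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(n)  # 2
```

Keeping a centralized copy like this is what makes fast searches, indicators, and data-cleaning reports possible without querying every provider on demand.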

The migration to TAPIR began with the servers in São Paulo State, due to their proximity to CRIA and because they were the first servers of the network and already required updates. The physical distribution of the servers was also optimized. The distribution of cache nodes and the location of the collections can be seen in Figure . Besides three servers in the state of São Paulo, the network has cache nodes in Manaus, Vitória, Rio de Janeiro, and Curitiba. All collections (in green) send their public data to one of these cache nodes, which then serves the network.


Figure . Geographic distribution of the cache nodes (in blue) and of the collections (in green) of the speciesLink network

The work carried out in each cache node included:


  • Installation and update of all provider software: Subversion 1.4.x, PHP 5.x, and Perl modules

  • Verification of the compatibility of the PostgreSQL database with TapirLink

  • Debugging and correcting TapirLink code that was producing errors on some servers

Once all necessary software had been installed, the next step was to install the TAPIR provider, in accordance with the following steps:

  • Download of the TapirLink6 software

  • Configuration of the PHP connection module for PostgreSQL

  • Configuration of the Apache web server

  • Configuration of the user name and password for access to the provider's administration page

After installation of the provider had been completed, the resources that were on the DiGIR provider were transferred to TAPIR. The new indexer/harvester was installed and became responsible for updating the centralized database. A web service was also made available so that the openModeller7 framework could access the data (Figure ).

Figure . Diagram with the complete architecture of the speciesLink network


Inclusion of new collections to the speciesLink network


The following procedure was carried out in order to integrate new collections into the network:

  • A standard mandatory questionnaire8 was sent to each collection.

  • The questionnaire was analyzed as to the data structure, software used, number of digitized records, and total holdings.

  • The cache node was prepared to receive data from the collection.

  • The collection was visited by technicians, or its computer was accessed remotely, in order to install spLinker, software developed by CRIA that is responsible for mapping data fields to the DarwinCore standard and sending the data to the selected cache node.

  • If necessary, filters to mark sensitive data were created.

  • Data was transferred to the cache nodes and possible transmission errors were identified. This part of the work required one technician at the collection (or logged on to the collection's computer) and one at CRIA monitoring the transmission.

  • When the technician was physically at the collection, local staff were trained in the use of spLinker, and other tools9 available on the speciesLink network were demonstrated. Special attention was given to the data cleaning and collection profile reports produced by the system based on online records.
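The two responsibilities of spLinker mentioned above, mapping local field names to DarwinCore and filtering sensitive data, can be sketched as follows. The field map, the Portuguese column names, and the sensitive-species list are all hypothetical; the real spLinker configuration is defined per collection.

```python
# Hypothetical mapping from a collection's local column names to DarwinCore.
FIELD_MAP = {"familia": "family", "especie": "species", "municipio": "municipality"}

def to_darwincore(local_record):
    """Rename local fields to their DarwinCore equivalents."""
    return {FIELD_MAP.get(k, k): v for k, v in local_record.items()}

def mark_sensitive(record, sensitive_species):
    """Blank out coordinates for species on a sensitive list before publication."""
    rec = dict(record)
    if rec.get("species") in sensitive_species:
        rec["latitude"] = rec["longitude"] = None
    return rec

rec = to_darwincore({"familia": "Orchidaceae", "especie": "rara",
                     "latitude": -22.9, "longitude": -47.1})
print(mark_sensitive(rec, {"rara"}))
```

Only the filtered, standardized records would then be transmitted to the cache node.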

In 2009 the speciesLink network integrated 29 new collections, six of which were herbaria, one of them from abroad.

  • FURB, Herbário Dr. Roberto Miguel Klein

  • HTSA, Herbário do Trópico Semi-Árido

  • HVSF, Herbário Vale do São Francisco

  • UFACPZ, Herbário da Universidade Federal do Acre

  • HUEFS, Herbário da Universidade Estadual de Feira de Santana

  • NMNH-Botany_BR, Smithsonian Department of Botany - Brazilian records

Besides the inclusion of these new collections, 68% of the herbaria that are part of the speciesLink network updated their data in 2009. The network is serving approximately 2.27 million records, of which 930,000 are geo-referenced (December 10, 2009).

Website


The website of the INCT Herbário Virtual da Flora e dos Fungos, whose development began in 2009, can be accessed at http://inct.florabrasil.net. It is a dynamic system with regular updates.

A Linux server (Debian GNU/Linux “Lenny”) was prepared to host the site. The MySQL database was installed, together with the Apache web server, the PHP programming language, and Ws-ftp for file transfer. Wordpress was also installed, and a number of adjustments were made, such as the configuration of the database, users, and names. All software was installed in its latest version, and all of it is free. Those responsible for implementing the site have full access to Wordpress and ftp.


New outputs of the information system


Although this activity was scheduled to begin during the project's second year, the project started with a very large amount of data already available. The first product was therefore developed to help define the specialists' visiting program, by identifying herbarium material at partner institutions that required identification.

For all records classified as "Plantae", those that had an entry in the field "family" but no entry in the field "genus" were selected. The exercise was designed to verify whether this information indicated gaps (local, regional, national) in taxonomic knowledge. A list of 68,051 records from 558 families matching these criteria was sent to the Steering Committee.

A second spreadsheet was prepared by selecting records classified as “Plantae” with both the “family” and “genus” fields filled in and the “species” field blank or containing “sp.” (or variations such as “spp.”). In this case, data was presented for each herbarium. This second spreadsheet contained about 200 thousand records. Knowing which records were unidentified and where they were located made it possible to develop a strategy to improve data quality and to program training courses.
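The two selection criteria described above can be sketched as simple filters. The records and field names here are illustrative stand-ins for the real DarwinCore fields in the centralized database.

```python
# Toy records standing in for the centralized database (field names assumed).
records = [
    {"herbarium": "SP",  "family": "Lauraceae", "genus": "",       "species": ""},
    {"herbarium": "SP",  "family": "Lauraceae", "genus": "Ocotea", "species": "sp."},
    {"herbarium": "UEC", "family": "Fabaceae",  "genus": "Inga",   "species": "edulis"},
]

# Variations treated as "no species identification".
SP_VARIANTS = {"", "sp", "sp.", "spp", "spp."}

# Report 1: family present, genus missing.
no_genus = [r for r in records if r["family"] and not r["genus"]]

# Report 2: family and genus present, species blank or an "sp." variant.
no_species = [r for r in records
              if r["family"] and r["genus"]
              and r["species"].strip().lower() in SP_VARIANTS]

print(len(no_genus), len(no_species))  # 1 1
```

In production the same filters would be expressed as queries against the centralized database rather than list comprehensions.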

At the time, New York also had some records with incomplete identification, and we believe this could lead to an opportunity for collaboration and training.

These reports were generated through direct access to the database and will now serve as an example for the development of an advanced search interface to be launched in 2010. Through this new interface, it is expected that anyone interested will be able to carry out this type of analysis independently.

Another important application developed in 2009 involved the automatic geo-referencing of records that include the municipality where the sample was collected. All collections with records of samples from Brazil that are not geo-coded but contain municipality data now have a spreadsheet, produced by an automatic geo-referencing application, with values for latitude, longitude, datum, and maximum error. For users searching the database, the system presents these values in new columns, making it clear that they are not original data but data produced by an application.
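The lookup performed by such an application can be sketched as below. The gazetteer, its coordinates, and the error radius are illustrative assumptions; a real implementation would draw on an authoritative gazetteer of Brazilian municipalities.

```python
# Toy gazetteer: (state, municipality) -> (lat, lon, radius_km).
# Coordinates and radius are illustrative, not authoritative.
GAZETTEER = {
    ("SP", "Campinas"): (-22.91, -47.06, 20.0),
}

def georeference(state, municipality):
    """Return (lat, lon, datum, max_error_km) for a municipality, or None."""
    entry = GAZETTEER.get((state, municipality))
    if entry is None:
        return None
    lat, lon, radius = entry
    # The municipality centroid is used, with the radius as maximum error.
    return (lat, lon, "WGS84", radius)

print(georeference("SP", "Campinas"))
```

Recording the datum and maximum error alongside the coordinates is what lets users judge whether the derived point is precise enough for their question.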

In December 2009 there were 2.27 million records, of which 842 thousand were geo-referenced at the origin and an additional 800 thousand had been geo-referenced automatically using the geo-referencing tool. For questions that only require precision at the municipality level, these new data are adequate, and such users now have roughly double the amount of geo-referenced data they had before.

Data repatriation


At the beginning of the project, Brazilian data from the botanical gardens of New York and Missouri had already been integrated into the speciesLink network.

Visits were made to the National Museum of Natural History in Paris, France, to Kew Gardens in the UK, and to the Smithsonian Institution in the USA, with the aim of understanding the type of data system used at each institution and of studying whether a repatriation program could be initiated. All three institutions have two features in common: they serve data to GBIF, and they have received support from the Mellon Foundation to produce high-resolution images of their type specimens from Latin America.

Another important initiative is the project being developed in cooperation between the Paris Museum, the Botanical Institute of São Paulo, and CRIA, with the support of Fapesp (São Paulo Research Foundation), to develop a prototype of the Virtual Herbarium of A. de Saint-Hilaire10. All samples collected by Saint-Hilaire are being scanned and the images sent to CRIA for online dissemination.

The diagram of the strategy adopted for the prototype is presented in Figure .



Figure . Diagram of the data repatriation system for the holdings of Saint-Hilaire at the Paris Museum

An important concept when defining a data integration strategy is that only one database should be used as the reference. Every update must be made in this reference database, and data repatriation or transfer must be carried out from this same database.

In the case of Paris, any transcription or data entry must be fed into the Sonnerat database in Paris. For this project, an application was developed to transfer data from Sonnerat, through the DiGIR protocol, to a database at CRIA that serves as the informational base of the system in Brazil. The system checks every hour whether there has been an update in Paris. Special software tools were developed to make high-resolution images available through the Internet, and a prototype was launched with excellent results.
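The hourly update check can be sketched as a change-detection step. Remote access to Sonnerat is stubbed out here, and the timestamp format is an assumption; only the compare-and-harvest pattern is the point.

```python
def remote_last_update():
    """Stand-in for querying the reference database for its latest change."""
    return "2009-12-01T10:00:00"

def sync_if_changed(state):
    """Compare the remote timestamp with the last one seen; harvest on change."""
    latest = remote_last_update()
    if latest != state.get("last_seen"):
        state["last_seen"] = latest
        return True          # would trigger a harvest of the updated records
    return False

state = {}
print(sync_if_changed(state))  # True  (first run always harvests)
print(sync_if_changed(state))  # False (no change since)
```

A scheduler (e.g. cron) would call this check once an hour, matching the behaviour described above.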

A visit was also made to the Royal Botanic Gardens, Kew, in order to learn more about its digitization process and the quality control of the images that are part of the Mellon Foundation project. The first step is to assign a barcode to each specimen of each sample. This barcode is entered into the information system, and the exsiccate is cleaned so that dust does not affect the quality of the scanning.


Figure . First step: cleaning, barcoding, and digitizing metadata

The second step is scanning the material and carrying out the first quality control check.


Figure . Scanning

In the case of samples with volume, a high-resolution camera is used at Kew (Figure ).


Figure . Digital camera used in the case of material with more volume

The images then go through a thorough quality control check; if not approved, the process is carried out again.


Defining a strategy for data repatriation


Many herbaria that have type specimens from Latin America are receiving grants to digitize their material and are therefore acquiring expertise and equipment in this area. It is important for the Virtual Herbarium to develop an image server to receive these high-resolution images and associated metadata, in order to make all this material publicly accessible and integrated with the other data. This can then be linked to a cyber-identification service, with specialists from Brazil and abroad, aimed at improving the identification of material collected in Brazil.

Since different institutions have adopted different information systems and different procedures for treating and computerizing their holdings, it is important to qualify these different situations in order to define specific strategic actions that are compatible with long term programs.

The following situations can be found and require different strategies:


  • Holdings are digitized and data is available online;

  • Samples are identified but not digitized;

  • Samples are neither identified nor digitized; and

  • Holdings that combine all of the above.

In order to repatriate data that is already available online, it is necessary to evaluate the compatibility of the data fields with DarwinCore and to study mechanisms for data export and indexing. If the collection already serves data to GBIF, repatriating the data and making it available through the speciesLink network is, technically speaking, practically immediate, and will most probably be carried out by harvesting an existing DiGIR provider.

Many collections abroad do not have teams to process the data associated with samples collected in Brazil. If this is the case, possible strategies to be agreed upon with each herbarium are:



  • Digitization services are hired in Brazil or abroad (funding from the Brazilian government and/or partners)

  • Brazilian specialists and technicians are sent to work on the digitization of data and identification of the material at the partner organization.

A third possibility is to take low-resolution pictures of each sample and send them to Brazil so that the label data can be digitized. All digitized data must be sent back and imported into the local system at the partner institution. Data repatriation will then be carried out by accessing the local system.

In some international herbaria the characterization and taxonomy of the samples must be revised. Where specialists for specific groups do not exist or are insufficient, it is important to establish a research program in which Brazilian specialists and graduate students may be involved in taxonomic revisions, cataloging, and identification of material, with local or remote supervision. Again, we suggest that all data digitization be carried out in the system used by the partner institution, in order to guarantee the continuity of the dynamic repatriation of data from all samples collected in Brazil.


Maintenance


Routine maintenance includes hardware maintenance, software updates, backups, problem solving, and two help desks: one to support curators and others responsible for the herbaria's data, and one to attend to user needs.

