LANL Research Library
 

aDORe Federation - Overview

Scalability in digital libraries

Scalability in digital libraries is a problem with multiple dimensions. There are issues related to the number of digital objects to be handled and to their size. There are also issues related to the performance of processes such as ingesting objects into a repository, disseminating stored objects, and introspecting stored objects, the latter driven by preservation requirements among others. Optimizing, tuning, and tweaking the existing repository infrastructure can initially alleviate performance problems, but eventually limits are reached. At that point, a major redesign of the repository solution is an obvious option. An alternative is to move towards an environment that consists of parallel instances of the existing repository solution, glued together into a repository federation that behaves as if it were a single repository.

The desire to federate repositories in this manner also follows from the understanding that no single digital library hosts all artifacts relevant to a specific subject domain, community, or application. The proposition of a “single repository behavior” exposed by a federation of any number of distributed repositories is appealing, and has been the subject of digital library interoperability efforts such as Dienst [2], NCSTRL [3], CORDRA [4, 5, 6], DRIVER [7], and the Chinese DSpace federation [8]. Both federation paths, the federation of multiple instances of a specific repository installation on the one hand and the federation of distributed repositories on the other, reveal another dimension of the scalability problem in contemporary digital library efforts. Indeed, as a result of a combination of low-level system scalability issues and higher-level community needs, there comes a point at which the reality of a multiple-repository environment must be embraced.
The challenge is then to devise an approach to federate repositories in a manner that is functional, practically achievable, and … scalable to a vast number of federated repositories. The aDORe Federation is a federated repository framework and reference implementation that aims to address many of the described issues.

What is the aDORe Federation Architecture?

The goal of the aDORe federation architecture is to facilitate a uniform manner for client applications to discover and access content objects available in a group of distributed repositories. This is achieved by means of a 3-Tier architecture illustrated in Figure 1. Tier-3 provides client applications with a single point of access to all content available in the federation, irrespective of the actual location of that content in federated repositories. In order to realize this, the architecture requires all federated repositories to implement the same, minimal set of machine interfaces to make their content accessible. These repository interfaces constitute Tier-1 of the architecture. Moreover, the architecture requires the introduction of a middle Tier, Tier-2, consisting of two shared infrastructure components that keep the books on content objects, repositories, and repository interfaces in the federation. These shared infrastructure components minimally expose one machine interface each. In order to respond to client requests, the federation’s single point of access interacts with these interfaces as well as with the interfaces exposed by the content repositories. As a matter of fact, the single point of access to the federation supports exactly the same minimal set of machine interfaces as each federated repository does, effectively making the entire federation behave in the same manner as each individual constituent repository.
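
The request flow described above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not the aDORe implementation: the class names (`Repository`, `IdentifierLocator`, `ServiceRegistry`, `FederationFrontEnd`), the `get` method, and the `urn:example:` identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Tier-1 (hypothetical): hosts content objects and exposes an access interface."""
    uri: str
    objects: dict = field(default_factory=dict)

    def get(self, object_uri: str) -> str:
        return self.objects[object_uri]


@dataclass
class IdentifierLocator:
    """Tier-2 (hypothetical): records which repository holds which content object."""
    locations: dict = field(default_factory=dict)

    def locate(self, object_uri: str) -> str:
        return self.locations[object_uri]


@dataclass
class ServiceRegistry:
    """Tier-2 (hypothetical): records the repositories known to the federation."""
    repositories: dict = field(default_factory=dict)

    def lookup(self, repository_uri: str) -> Repository:
        return self.repositories[repository_uri]


@dataclass
class FederationFrontEnd:
    """Tier-3 (hypothetical): single point of access with the same get() as a repository."""
    locator: IdentifierLocator
    registry: ServiceRegistry

    def get(self, object_uri: str) -> str:
        # 1. Ask the locator which repository holds the object.
        repository_uri = self.locator.locate(object_uri)
        # 2. Ask the registry how to reach that repository.
        repository = self.registry.lookup(repository_uri)
        # 3. Forward the request to the repository's own interface.
        return repository.get(object_uri)


def build_demo_federation() -> FederationFrontEnd:
    """Wire two made-up repositories into one federation."""
    repo_a = Repository("urn:example:repo:a", {"urn:example:obj:1": "content of object 1"})
    repo_b = Repository("urn:example:repo:b", {"urn:example:obj:2": "content of object 2"})
    locator = IdentifierLocator({
        "urn:example:obj:1": repo_a.uri,
        "urn:example:obj:2": repo_b.uri,
    })
    registry = ServiceRegistry({repo_a.uri: repo_a, repo_b.uri: repo_b})
    return FederationFrontEnd(locator, registry)
```

A client calls `build_demo_federation().get("urn:example:obj:2")` exactly as it would call `get` on a single repository, without knowing which repository holds the object.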

All entities in the aDORe federation architecture (content objects, repositories, and machine interfaces) are identified by means of URIs. Choosing URIs turns each entity into a uniquely identified resource on the Web. An appropriate choice of the authority component of a URI scheme helps to avoid unwanted identifier collisions, for example between different content objects from various federated repositories. The architecture distinguishes between protocol-based URIs, which can be de-referenced via a common protocol to provide access to a representation, and non-protocol-based URIs, for which no common de-referencing mechanism exists. The choice between these two types of URIs in a deployment of an aDORe federation depends on the use case at hand.
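
As a rough illustration of the two URI types, the following sketch classifies a URI by its scheme and extracts its authority component. The set of schemes treated as protocol-based, the function names, and the example URIs are assumptions made for this sketch; the architecture itself does not prescribe them.

```python
from urllib.parse import urlsplit

# Assumption: schemes this sketch treats as de-referenceable via a common protocol.
PROTOCOL_SCHEMES = {"http", "https", "ftp"}


def is_protocol_based(uri: str) -> bool:
    """Return True if the URI's scheme is in the assumed protocol-based set."""
    return urlsplit(uri).scheme.lower() in PROTOCOL_SCHEMES


def authority(uri: str) -> str:
    """Return the authority component (empty for URIs such as URNs)."""
    return urlsplit(uri).netloc
```

With these helpers, `http://repo-a.example.org/obj/1` and `http://repo-b.example.org/obj/1` share a local name but differ in authority, so the two identifiers do not collide, while `urn:example:obj:1` is classified as non-protocol-based.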

All machine interfaces in the aDORe federation architecture are protocol-based. This choice simultaneously accommodates a multiple-custodian use case, with constituent repositories that are effectively distributed across the Internet, and a single-custodian use case, in which considerations of scale eventually require the distribution of components across an Intranet. Although the functionality provided by the proposed machine interfaces can be implemented in a variety of ways, the desire to leverage existing standards in the aDORe work has led to the use of community standards that fit the job. In fact, a combination of the OAI-PMH and OpenURL can address all core requirements, and is used in the provided aDORe Archive implementation of the aDORe federation Tier-1 repository.
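
To give a feel for what protocol-based requests of this kind look like, the sketch below builds an OAI-PMH `GetRecord` request and an OpenURL key/encoded-value request. The `verb`, `identifier`, and `metadataPrefix` arguments come from the OAI-PMH specification, and `url_ver`, `rft_id`, and `svc_id` from the OpenURL (Z39.88-2004) KEV format. The base URLs, identifiers, and service identifier are hypothetical, and the sketch does not claim to reproduce the exact requests that the aDORe Archive accepts.

```python
from urllib.parse import urlencode


def oai_pmh_get_record(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


def openurl_request(base_url: str, referent_id: str, service_id: str) -> str:
    """Build an OpenURL KEV request naming a referent and a requested service."""
    query = urlencode({
        "url_ver": "Z39.88-2004",
        "rft_id": referent_id,   # identifier of the referenced object
        "svc_id": service_id,    # identifier of the requested service (hypothetical value)
    })
    return f"{base_url}?{query}"


# Hypothetical endpoints and identifiers, for illustration only.
record_url = oai_pmh_get_record("http://archive.example.org/oai", "urn:example:obj:1", "oai_dc")
service_url = openurl_request(
    "http://archive.example.org/openurl", "urn:example:obj:1", "urn:example:svc:get-content"
)
```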

For additional information on the aDORe Federation Abstract Model, see "The aDORe Federation Architecture" [1].

Figure 1
  • Tier-1: the aDORe repositories
    • Networked systems that host digital object content and that make that content accessible by exposing core service interfaces.
    • Currently XMLtapes and ARCfiles (aDORe Archive)
    • Other Content Management Systems can be turned into an aDORe repository by implementing the core service interfaces.
  • Tier-2: the aDORe Federation Management components
    • Networked systems that facilitate presenting the aDORe repositories as a single logical repository; these federation components expose core service interfaces to allow access to their content.
    • Federation components are: Identifier Locator, Service Registry, Format Registry, Semantic Registry
  • Tier-3: the aDORe front-ends
    • Networked systems that make digital object content hosted in the multitude of physical aDORe repositories accessible by exposing core service interfaces that present those aDORe repositories as a single logical repository.
    • aDORe front-ends are: OAI-PMH Federator, OpenURL Resolver

What does the aDORe Federation Reference Implementation provide?

Use Case: The Research Library of the Los Alamos National Laboratory (LANL) hosts a significant digital scholarly collection and makes services based on that collection available to its customer base. The collection currently consists of licensed content from both secondary and primary publishers (e.g. APS, BIOSIS, EI, Elsevier, Thomson Scientific) and unclassified LANL Technical Reports, and is expected to grow to include a wide variety of unclassified digital assets that result from the Laboratory’s research endeavors. Previous incarnations of the Library’s repository had fallen victim to scalability issues. A uniform approach for ingesting, storing, and disseminating content was necessary to ensure the collection’s manageability, accessibility, and preservation. In addition, the sheer volume of the collection required parallelization for ingestion and dissemination, and distribution for storage.

This aDORe Federation Reference Implementation provides the following:

  • a Tier-1 aDORe Archive Installation, consisting of:
    • XMLtape Toolkit
    • ARCfile Toolkit
    • XMLtape Registry
    • ARCfile Registry
    • XMLtape OpenURL Resolver
    • XMLtape OpenURL XQuery Resolver
  • IESR-based Service Registry
    • Uses the Ockham Service Registry IESR-based database schema and provides OAI-PMH and OpenURL Services.
    • SRU/W Support available with Ockham Registry installation.
  • Identifier Locator
    • A fast, in-memory MySQL-based solution used for efficient resolution of Datastream, Content, and Surrogate Identifiers to Repository Identifiers.
    • Currently stores more than 500M identifiers with sub-10ms retrieval times.
  • OAI-PMH Federator
    • Provides access to multiple aDORe Archive installations through a common OAI-PMH interface.
  • OpenURL Disseminator
    • OpenURL service interface that provides federated access to all repository content and performs transformation and dissemination services using a rule-engine-based plug-in framework.
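
The idea behind the OAI-PMH Federator, one common interface in front of several archive installations, can be sketched as follows. This is a simplified illustration and not the reference implementation's code: the `Archive` protocol, the `InMemoryArchive` and `Federator` classes, and the `list_identifiers` method are hypothetical stand-ins for real OAI-PMH endpoints.

```python
from itertools import chain
from typing import Iterable, Iterator, Protocol


class Archive(Protocol):
    """Anything that can list the identifiers it holds (hypothetical interface)."""

    def list_identifiers(self) -> Iterable[str]: ...


class InMemoryArchive:
    """Stand-in for a single archive installation."""

    def __init__(self, identifiers: Iterable[str]) -> None:
        self._identifiers = list(identifiers)

    def list_identifiers(self) -> Iterable[str]:
        return iter(self._identifiers)


class Federator:
    """Exposes the same list_identifiers() as an archive, across many archives."""

    def __init__(self, archives: Iterable[Archive]) -> None:
        self._archives = list(archives)

    def list_identifiers(self) -> Iterator[str]:
        # Concatenate the per-archive listings in archive order.
        return chain.from_iterable(a.list_identifiers() for a in self._archives)


# Two made-up archives presented through one interface.
federator = Federator([
    InMemoryArchive(["urn:example:obj:1"]),
    InMemoryArchive(["urn:example:obj:2", "urn:example:obj:3"]),
])
all_identifiers = list(federator.list_identifiers())
```

Because `Federator` satisfies the same `Archive` interface as its members, a client written against one archive works unchanged against the federation.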

Figure 2

Additional Information

[1] Herbert Van de Sompel, Ryan Chute, Patrick Hochstenbach
The aDORe Federation Architecture

[2] Lagoze C, Davis JR (1995) Dienst - An Architecture for Distributed Document Libraries.
Communications of the ACM 38 (4), p 47.

[3] Davis JR, Lagoze C (2000) NCSTRL: Design and Deployment of a Globally Distributed Digital Library.
Journal of the American Society for Information Science 51(3), pp 273-280
DOI 10.1002/(SICI)1097-4571(2000)51:3<273::AID-ASI6>3.0.CO;2-6

[4] Rehak D, Daniel R, Lannom R (2005) A Model and Infrastructure for Federated Learning Content Repositories.
Interoperability of Web-Based Educational Systems Workshop, Volume 143 of CEUR Workshop Proceedings.
Retrieved from http://cordra.net/cordra/information/publications/2005/www2005/cordrawww2005.pdf

[5] McDonough JP (2006) METS: standardized encoding for digital library objects.
International Journal on Digital Libraries 6(2), pp 148-158 DOI 10.1007/s00799-005-0132-1

[6] Joint Information Systems Committee (2006) Information Environment Service Registry Metadata.
Retrieved from http://iesr.ac.uk/metadata/

[7] DRIVER (2006) Digital Repository Infrastructure Vision for European Research.
Retrieved from http://www.driver-repository.eu/

[8] Tansley R (2006) Building a Distributed, Standards-based Repository Federation.
D-Lib Magazine 12(7/8) DOI 10.1045/july2006-tansley