LANL Research Library
 

ARCFile - Overview

What is the aDORe ARCFile Toolkit?

Developed by the Internet Archive, ARC files are a file-based approach to storing concatenated datastreams, with each datastream separated from the next by administrative metadata. The aDORe ARCFile Toolkit builds upon the ARC file API provided in the Heritrix Web Crawler library. It supplements the Heritrix ARC file API with the following functions:

  • Integrated UUID generation to ensure unique ARC file and resource identifiers
  • Integrated indexing implementation
  • Flexible methods to access datastreams by identifier or by byte offset
  • Simple methods to extract datastreams and write them to file

How does the aDORe ARCFile Toolkit work?

  • Write: Write datastreams to aggregate files for ease of storage in a conventional file system.
  • Index: Index metadata records associated with archived datastreams.
  • Retrieve: Extract datastreams from the archive.
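
The three steps above can be illustrated with a minimal Python sketch. It does not use the aDORe ARCFile Toolkit or the Heritrix API, and it does not follow the real ARC record layout: the one-line header, the in-memory index dictionary, and the function names write_record, read_at, and read_by_id are all assumptions made for illustration. It only shows the general idea of appending datastreams to one aggregate file, recording each record's byte offset under a UUID-based identifier, and reading a datastream back by identifier or by offset.

```python
import io
import uuid


def write_record(archive, index, data, mime_type="application/octet-stream"):
    """Write: append one datastream to the aggregate file. Index: record its offset."""
    record_id = f"urn:uuid:{uuid.uuid4()}"   # unique resource identifier
    offset = archive.seek(0, io.SEEK_END)    # byte offset where this record starts
    # Hypothetical one-line header carrying the administrative metadata
    # (identifier, MIME type, payload length); NOT the real ARC header layout.
    header = f"{record_id} {mime_type} {len(data)}\n".encode("utf-8")
    archive.write(header)
    archive.write(data)
    archive.write(b"\n")                     # separator before the next record
    index[record_id] = offset
    return record_id


def read_at(archive, offset):
    """Retrieve: extract the datastream that starts at a known byte offset."""
    archive.seek(offset)
    record_id, mime_type, length = archive.readline().decode("utf-8").split()
    return record_id, mime_type, archive.read(int(length))


def read_by_id(archive, index, record_id):
    """Retrieve: look the identifier up in the index, then read at that offset."""
    return read_at(archive, index[record_id])


# Usage: an in-memory buffer stands in for the aggregate file on disk.
archive = io.BytesIO()
index = {}
first = write_record(archive, index, b"<record>one</record>", "text/xml")
second = write_record(archive, index, b"\x00\x01binary")
```

Because the header stores the payload length, a datastream can be read back with a single seek and a bounded read, without scanning the records that precede it. That is the reason an index of byte offsets makes retrieval from a large concatenated file practical.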

The resulting output of the aDORe ARCFile Toolkit:

  • ArcFile: Concatenation of collected datastreams
  • ArcFile Index (CDX): Record of each resource successfully harvested to an ARC file
Figure 1

Additional Information

Liu, X., Balakireva, L., Hochstenbach, P., Van de Sompel, H. (2005, June).
File-based storage of Digital Objects and constituent datastreams: XMLTapes and Internet Archive ARC files

Burner, M., Kahle, B. (1996, September).
Arc File Format