This repository has been archived by the owner on Sep 25, 2022. It is now read-only.

Preparing the Source Content

Peter Monks edited this page Jul 20, 2015 · 11 revisions

Before initiating an import, you will need to prepare the content that will be used as the source for the import. The details vary depending on the type of import source you're using. The built-in "Default" source is documented here; if you're using a custom source provided by a third party, please refer to that source's documentation before going any further.

Overview of the Default Source

The Bulk Import Tool ships with a "Default" import source that reads source content from a directory in the Alfresco server's filesystem. This source is backwards compatible with v1.x of the tool, so if you've used older versions of the tool, most of the following information should be familiar.

The files and folders within the source directory are imported into the Alfresco repository exactly as they appear on disk. For example, if you import the following file/folder tree, the target space in the repository will end up containing exactly the same set of folder and file nodes:

sourceDirectory/
├── directoryA/
│   ├── subDirectoryA/
│   │   ├── subSubDirectoryA/
│   │   └── subSubDirectoryB/
│   └── subDirectoryB/
├── test_files/
│   ├── logo.png
│   ├── favicon.ico
│   ├── main.css
│   └── functions.js
├── jpeg_example_JPG_RIP_100.jpg
├── newtons_cradle_animation_book_2.gif
├── pdf32000_2008.pdf
├── png_transparency_demonstration_1.png
├── sunflower_as_gif_small.gif
├── test.html
├── testdoc.doc
├── testdocx.docx
├── testppt.ppt
├── testpptx.pptx
├── testtxt.txt
├── testxls.xls
└── testxlsx.xlsx

The source directory can be physically local to the server (i.e. stored on a directly attached hard drive, SSD drive, RAID array, etc.), or on a remote device that is mounted into the server's filesystem (e.g. NAS, SAN, iSCSI, etc.).

Note that the details of how a remote device is mounted into the server's filesystem are highly operating system and environment specific. The Bulk Import Tool is not functionally sensitive to specific protocols (provided the mounted directory can be read using standard Java file I/O), but different protocols can have vastly different performance characteristics. Generally speaking, a local device will outperform a remote device (although this too depends on the environment).

Finally, this mounting is performed outside Alfresco, using your operating system's tools & techniques, and has nothing to do with Alfresco's own file server capabilities.

Metadata

The "Default" source also has the ability to load metadata (types, aspects & their properties) into the repository, for both files and folders. This is accomplished using "shadow metadata files" (which are entirely optional - if you don't need to import any custom metadata you won't need them).

Naming of Shadow Metadata Files

These shadow metadata files are located in the same folder as the file they refer to, and must have exactly the same name and extension as the file for which they define the metadata with the addition of the suffix .metadata.properties.xml. So for example, if there's a content file called IMG_1967.jpg that has some custom metadata, you would create a shadow metadata file called IMG_1967.jpg.metadata.properties.xml (note the ".jpg" in the middle!).

Shadow metadata files can also be used for directories. If you have a directory with custom metadata called My Documents, for example, the shadow metadata file would be called My Documents.metadata.properties.xml. This shadow metadata file needs to be a peer of the directory it describes - it must not be located within that directory.
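
The naming rule can be sketched in a few lines of Python. This is only an illustration of the convention described above, not code from the Bulk Import Tool; the helper name shadow_metadata_path is hypothetical.

```python
from pathlib import Path

METADATA_SUFFIX = ".metadata.properties.xml"


def shadow_metadata_path(target: Path) -> Path:
    """Return the shadow metadata path for a content file or a directory.

    The suffix is appended to the full name (extension included), and the
    result is a sibling (peer) of the target, never a child of it.
    """
    return target.with_name(target.name + METADATA_SUFFIX)


# A content file: note that ".jpg" stays in the middle of the name.
print(shadow_metadata_path(Path("sourceDirectory/IMG_1967.jpg")).name)
# A directory: the shadow file sits next to the directory, not inside it.
print(shadow_metadata_path(Path("sourceDirectory/My Documents")).name)
```

Because the helper only rewrites the final path component, it behaves the same for files and directories.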

Content of Shadow Metadata Files

As the suffix suggests, the shadow metadata files are in XML format, specifically the Java XML property file format. These files have the general syntax:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
  <properties>
    <entry key="key1">value1</entry>
    <entry key="key2">value2</entry>
    ...
  </properties>

For the Bulk Import Tool, the value of the key attribute either refers to a special entry (see below), or the name of the metadata property you wish to populate. The content of the <entry> element is the value of that special entry or metadata property.

This example shows how to set the cm:description property:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
  <properties>
    <entry key="cm:description">This will become the description of my node.</entry>
  </properties>

Special Keys

The Bulk Import Tool looks for a number of special keys, all of which are optional (they all have sensible default values):

  • separator - the separator value (a string) to use for delimiting multi-valued properties and the aspects special key. Defaults to a single comma character (,).
  • type - contains the qualified name of the content type to use for the file or folder. Defaults to either cm:folder (for a folder) or cm:content (for a file).
  • aspects - contains a delimited list (see separator) of the qualified names of the aspect(s) to attach to the file or folder. Defaults to the empty list (no aspects added to the node, beyond those that are mandatory for cm:folder or cm:content).
  • namespace - the namespace URI (not prefix!) to use for the node. Defaults to http://www.alfresco.org/model/content/1.0.
  • parentAssociation - the parent association type to use for the node. Defaults to cm:contains.

Multi-Valued Properties

Multi-valued properties are delimited (see the separator special key above), but not trimmed - any whitespace before or after the delimiter value is retained verbatim and written into the property.
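
To see why the choice of separator matters, here is a small Python sketch. It uses Python's own str.split as a stand-in for the tool's splitting logic (an assumption - the tool's actual implementation is not reproduced here), and split_multi_valued is a hypothetical helper name.

```python
def split_multi_valued(raw: str, separator: str = ",") -> list[str]:
    """Split a raw entry value on the separator, keeping whitespace verbatim."""
    return raw.split(separator)


# Default separator: the space after the first comma becomes part of the value.
print(split_multi_valued("red, green,blue"))   # ['red', ' green', 'blue']

# A separator that includes the surrounding spaces yields clean values.
print(split_multi_valued("red # green", " # "))  # ['red', 'green']

# The same text with a bare "#" separator keeps the spaces in both values.
print(split_multi_valued("red # green", "#"))    # ['red ', ' green']
```

If your values are written with spaces around the delimiter, the simplest option is to make those spaces part of the separator special key, as the worked example below does.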

Metadata Example

Here's a fully worked example for IMG_1967.jpg.metadata.properties.xml:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
  <properties>
    <entry key="separator"> # </entry>  <!-- 3 character delimiter: space, hash, space -->
    <entry key="namespace">http://www.alfresco.org/model/application/1.0</entry>
    <entry key="parentAssociation">cm:contains</entry>
    <entry key="type">cm:content</entry>
    <entry key="aspects">cm:versionable # cm:dublincore # cm:taggable</entry>
    <entry key="cm:title">A photo of a flower.</entry>
    <entry key="cm:description">A photo I took of a flower while walking around Bantry Bay.</entry>
    <entry key="cm:created">1901-01-01T12:34:56.789+10:00</entry>
    <!-- cm:dublincore properties -->
    <entry key="cm:author">Peter Monks</entry>
    <entry key="cm:publisher">Peter Monks</entry>
    <entry key="cm:contributor">Peter Monks</entry>
    <entry key="cm:type">Photograph</entry>
    <entry key="cm:identifier">IMG_1967.jpg</entry>
    <entry key="cm:dcsource">Canon Powershot G2</entry>
    <entry key="cm:coverage">Worldwide</entry>
    <entry key="cm:rights">Copyright © Peter Monks 2002, All Rights Reserved</entry>
    <entry key="cm:subject">A photo of a flower.</entry>
    <!-- cm:taggable properties -->
    <!-- Note: The following tag NodeRefs will be invalid in your Alfresco installation -->
    <entry key="cm:taggable">workspace://SpacesStore/a7063e59-ef78-46f2-bc00-560b1b9222ab # workspace://SpacesStore/0672094d-3566-4412-a79f-c3f787bfc629</entry>
  </properties>
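
If you have many files to decorate, you will probably want to generate shadow metadata files rather than write them by hand. The following Python sketch shows one possible way to do that; the function names are hypothetical, and it assumes the keys and values you pass in are already valid for your Alfresco content model. (Java's own java.util.Properties.storeToXML method writes the same file format, if you'd rather generate the files from Java.)

```python
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr

# The XML declaration and DOCTYPE, exactly as shown in the examples above.
HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">\n'
)


def render_shadow_metadata(entries: dict[str, str]) -> str:
    """Render special keys and metadata properties as a Java XML properties file."""
    lines = [HEADER + "<properties>"]
    for key, value in entries.items():
        # quoteattr/escape take care of characters such as & and < in the data.
        lines.append(f"  <entry key={quoteattr(key)}>{escape(value)}</entry>")
    lines.append("</properties>")
    return "\n".join(lines) + "\n"


def write_shadow_metadata(target: Path, entries: dict[str, str]) -> Path:
    """Write the shadow metadata file next to the given file or directory."""
    shadow = target.with_name(target.name + ".metadata.properties.xml")
    # "utf-8" (not "utf-8-sig") so that no byte order mark is written.
    shadow.write_text(render_shadow_metadata(entries), encoding="utf-8")
    return shadow


print(render_shadow_metadata({
    "type": "cm:content",
    "cm:title": "A photo of a flower.",
}))
```

Calling write_shadow_metadata(Path("sourceDirectory/IMG_1967.jpg"), {...}) would create sourceDirectory/IMG_1967.jpg.metadata.properties.xml.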

Additional Notes on Metadata Files

  • You must include the <?xml?> and <!DOCTYPE> declarations exactly as shown above - if either is missing, the tool skips the metadata file (with a warning in the Alfresco log).
  • Be especially careful of UTF-8 "Byte Order Mark" (BOM) characters at the start of XML files - they are unnecessary (XML has its own method for identifying the encoding) and can cause parsing problems, and because most editors treat them as invisible characters they can be difficult to track down.
  • The metadata must conform to the type and aspect definitions configured in Alfresco (including mandatory fields, constraints and data types). Any violations will terminate the import.
  • Peer associations between content items loaded by the tool are not yet supported - see issue #16.
    • Associations to objects that are already in the repository can be created, however, by supplying the NodeRef of the target object as the value of the property (see the cm:taggable property in the example above).
  • Date and/or time values must be specified using ISO 8601 format (e.g. 1901-01-01T12:34:56.789+10:00, as in the cm:created entry in the example above).
    • You can also use the special value NOW (case-sensitive) as the value of a date / time property, to specify that it should be populated with the server's date/time at the instant the item is imported.
  • Updating the aspects or metadata on existing content will not remove any existing aspects not listed in the new metadata file - this tool is not intended to support full synchronisation.
  • The metadata loading facility can be used to decorate content that's already in the Alfresco repository, without having to upload that content again. To use this mechanism, create a shadow metadata file as described above and import it with replace set to true. The tool will match the shadow file up with the existing file in the repository and decorate it with the new metadata.
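
Two of the notes above (BOMs and ISO 8601 dates) lend themselves to small pre-flight helpers. The following Python sketch is one way to write them; the function names are hypothetical and nothing here is part of the Bulk Import Tool itself.

```python
import codecs
from datetime import datetime, timedelta, timezone


def has_utf8_bom(data: bytes) -> bool:
    """True if the raw file bytes start with a UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 byte order mark, if any."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


def iso8601(moment: datetime) -> str:
    """Format a timezone-aware datetime with millisecond precision."""
    return moment.isoformat(timespec="milliseconds")


# The timestamp used for cm:created in the worked example above.
stamp = datetime(1901, 1, 1, 12, 34, 56, 789000,
                 tzinfo=timezone(timedelta(hours=10)))
print(iso8601(stamp))  # 1901-01-01T12:34:56.789+10:00
```

To check a file, read it in binary mode (for example with pathlib's Path.read_bytes()) and pass the result to has_utf8_bom.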

Version History Files

The import tool also optionally supports loading a version history for each file (Alfresco doesn't support version histories for folders). To use this mechanism, create a file with the same name as the main file, but append a v# or v#.# extension. For example:

  IMG_1967.jpg.v1       <- version 1 content
  IMG_1967.jpg.v2       <- version 2 content
  IMG_1967.jpg.v2.1     <- version 2.1 content
  IMG_1967.jpg          <- "head" (latest) revision of the content

This also applies to metadata files, if you wish to capture metadata history as well. For example:

  IMG_1967.jpg.metadata.properties.xml.v1     <- version 1 metadata
  IMG_1967.jpg.metadata.properties.xml.v2     <- version 2 metadata
  IMG_1967.jpg.metadata.properties.xml.v2.1   <- version 2.1 metadata
  IMG_1967.jpg.metadata.properties.xml        <- "head" (latest) revision of the metadata

Additional Notes on Version Files

  • The tool always imports versions in numeric order. If you have two files with the same numeric version number (e.g. v1 and v1.0), only one of them will be imported; which one actually gets imported is non-deterministic.
  • Version numbers don't have to be contiguous - you can number your version files however you wish, provided you use valid numbers (integers or decimals).
  • The version number values in your version files will not be used in Alfresco - the version numbers in Alfresco will be contiguous, starting at 1.0 and increasing by 1.0 for every major version (e.g. 1.0, 2.0, 3.0, etc.) and 0.1 for every minor version (e.g. 1.1, 1.2, 1.3 etc.). Alfresco doesn't allow version labels to be set to arbitrary values (see issue #13).
  • Each version may contain a content update, a metadata update, or both - you are not limited to updating everything in every version. If not included in a version, the prior version's content or metadata will remain in place in the next version.

Here's a fully fleshed out example, showing all possible combinations of content, metadata and version files:

  IMG_1967.jpg.v1                             <- version 1 content
  IMG_1967.jpg.metadata.properties.xml.v1     <- version 1 metadata
  IMG_1967.jpg.v1.1                           <- version 1.1 content
  IMG_1967.jpg.metadata.properties.xml.v1.1   <- version 1.1 metadata
  IMG_1967.jpg.v2                             <- version 2 content
  IMG_1967.jpg.metadata.properties.xml.v2     <- version 2 metadata
  IMG_1967.jpg.v2.1                           <- version 2.1 content (content only version)
  IMG_1967.jpg.metadata.properties.xml.v3     <- version 3 metadata (metadata only version)
  IMG_1967.jpg.metadata.properties.xml        <- "head" (latest) revision of the metadata
  IMG_1967.jpg                                <- "head" (latest) revision of the content
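
Before importing, it can be worth checking your version files for numbering clashes such as v1 and v1.0. The Python sketch below parses the v# / v#.# suffix, orders version files numerically, and reports clashes. It is an illustration under assumptions: the regular expression and function names are hypothetical, and the ordering only mirrors the behaviour described above rather than reproducing the tool's own code.

```python
import re
from collections import defaultdict
from decimal import Decimal

# "<name>.v<number>", where <number> is an integer or a decimal.
VERSION_RE = re.compile(r"^(?P<base>.+)\.v(?P<version>\d+(?:\.\d+)?)$")


def parse_version(name):
    """Return (base name, numeric version), or None for a head revision."""
    match = VERSION_RE.match(name)
    if match is None:
        return None
    return match.group("base"), Decimal(match.group("version"))


def in_numeric_order(filenames):
    """Return only the version files, sorted by numeric version."""
    versioned = [(parse_version(n), n) for n in filenames
                 if parse_version(n) is not None]
    return [name for _, name in sorted(versioned, key=lambda item: item[0][1])]


def find_clashes(filenames):
    """Return groups of files that share a base name and a numeric version."""
    seen = defaultdict(list)
    for name in filenames:
        parsed = parse_version(name)
        if parsed is not None:
            seen[parsed].append(name)
    return [names for names in seen.values() if len(names) > 1]


files = ["IMG_1967.jpg.v2.1", "IMG_1967.jpg.v1", "IMG_1967.jpg.v1.0", "IMG_1967.jpg"]
print(in_numeric_order(files))
print(find_clashes(files))  # [['IMG_1967.jpg.v1', 'IMG_1967.jpg.v1.0']]
```

Content and metadata version files have different base names, so a content file and a metadata file with the same version number are not reported as a clash.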

In-Place Imports

The Bulk Import Tool provides a performance optimisation technique referred to as an in-place import. To use this feature, stage the source content directory (including any metadata and/or version files) to a directory inside the Alfresco content store, then run the tool normally. The tool automatically detects that the source directory is already located "inside" the Alfresco content store, and will perform an in-place import rather than a default (streaming) import.

Additional Notes on In-Place Imports

  • Once content has been imported in-place, you must not touch the source directory in any way (including reading it)! Following such an import, Alfresco assumes it has exclusive access to those files, and any concurrent access from other processes can cause system outages and/or data corruption.
  • You do not need to modify the structure of the source directory to match Alfresco's default "timestamp hashbucket" structure. In fact doing so is counterproductive as that directory structure would then be created in Alfresco (which is unlikely to be the desired directory structure for end users).
  • For ease of management later on, it's recommended that you place the source directory in a clearly named subdirectory of the contentstore. For example ${ALFRESCO_HOME}/alf_data/contentstore/bulk-import.
  • Because the shadow metadata files used by the import tool are never registered with the repository, Alfresco will not automatically clean them up after an import is complete. If you wish to clean these files up you will need to manually do so after you're sure the import has succeeded. While doing this you must be extremely careful not to remove any files that were, in fact, imported (see the first point).
  • The contentstore-relative paths ("content URL") of in-place imported content must not exceed 255 characters (this is an Alfresco restriction). If the Bulk Import Tool detects that an otherwise in-place import eligible file is going to exceed that restriction, it will automatically switch to a streaming import for that particular file (reverting to an in-place import for the next file that doesn't exceed this limit).
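
If you'd like to know in advance which files are likely to fall back to a streaming import, you can scan the staged directory before running the tool (that is, before the import - see the first point above). The Python sketch below is one way to do this. It assumes the limit applies to the character length of the contentstore-relative path written with forward slashes; the exact form of the content URL the tool measures (for example, whether a protocol prefix is counted) is not covered here, and the function name is hypothetical.

```python
from pathlib import Path

MAX_CONTENT_URL_LENGTH = 255  # the Alfresco restriction described above


def overlong_paths(contentstore: Path, source: Path,
                   limit: int = MAX_CONTENT_URL_LENGTH):
    """Yield (relative path, length) for files whose path exceeds the limit.

    `source` must be located inside `contentstore`; paths are measured
    relative to `contentstore`, using forward slashes.
    """
    for path in sorted(source.rglob("*")):
        if path.is_file():
            relative = path.relative_to(contentstore).as_posix()
            if len(relative) > limit:
                yield relative, len(relative)
```

For example, list(overlong_paths(store, store / "bulk-import")), where store is a Path pointing at your contentstore directory, returns the staged files that would exceed the limit under these assumptions.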

Back to usage.
