My recent article, Consolidating, Accessing and Analyzing Unstructured Data, highlighted the growing importance of unstructured data in data integration projects. It also discussed the industry direction toward the use of enterprise content management (ECM) systems for managing the increasing volume of unstructured and semi-structured data in organizations. In this month’s newsletter, I want to continue the discussion of ECM by looking at how industry standards in this area are likely to ease the integration of unstructured and semi-structured data into the business environment.
Most ECM systems consist of a set of content services, which I will call the ECM application, and a content repository. The ECM application provides the tools for interacting with business content, whereas the repository handles the management of the content. Although the repository typically employs an underlying database or file system for storing the content, the data interface to the repository and the data model it employs are usually proprietary. In many cases, the repository is tightly coupled with the ECM application, and its data interface is not always available to external applications. Third-party products can then access the repository content only at the application level, assuming of course that the ECM product provides a documented application programming interface (API). Even if a repository-level data interface is available, third-party products and IT applications must create a separate interface adapter for each ECM product.
The proprietary nature of ECM applications and their content repositories is similar to the situation with packaged business transaction applications. These applications employ standard relational database systems, but direct access to the data is not usually possible. As with ECM systems, often the only way to access the data is at the application level.
Both ECM and packaged application vendors are moving toward providing Web services interfaces to their products. Although this approach simplifies and standardizes access to ECM and packaged business transaction applications, access is still at the application level rather than at the data level.
Web services access to an ECM system at the application level works well if the ECM application is to be incorporated into a workflow or composite application, for example, as part of a service-oriented architecture (SOA). If, however, an external application needs to access a content repository at the data interface level, another approach is required. The solution is to separate ECM applications from their content repositories and to provide a standardized interface for accessing repository content. Such an architecture is provided by the Content Repository API for Java Technology (known as JCR).
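To make the idea of a standardized repository interface concrete, here is a minimal Java sketch. Everything in it is an assumption made for illustration: the `ContentStore` interface, the two stand-in vendor stores and the `IntegrationTool` class are hypothetical and are not part of the JCR API. The point is only that an external tool written once against a shared interface needs no vendor-specific adapter code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified repository interface for illustration only.
// It is NOT the javax.jcr API defined by JSR 170.
interface ContentStore {
    Optional<String> read(String path);
}

// Stand-in for one vendor's repository, which happens to key content by path.
final class PathKeyedStore implements ContentStore {
    private final Map<String, String> documents = new HashMap<>();

    PathKeyedStore() {
        documents.put("/contracts/acme", "Acme master agreement");
    }

    @Override public Optional<String> read(String path) {
        return Optional.ofNullable(documents.get(path));
    }
}

// Stand-in for a second vendor that uses numeric ids internally.
// The path-to-id translation stays hidden behind the shared interface.
final class IdKeyedStore implements ContentStore {
    private final Map<String, Integer> idsByPath = new HashMap<>();
    private final Map<Integer, String> documentsById = new HashMap<>();

    IdKeyedStore() {
        idsByPath.put("/invoices/1001", 42);
        documentsById.put(42, "Invoice 1001 for Acme");
    }

    @Override public Optional<String> read(String path) {
        return Optional.ofNullable(idsByPath.get(path)).map(documentsById::get);
    }
}

// An external tool written once against the interface, with no vendor-specific code.
final class IntegrationTool {
    static String describe(ContentStore store, String path) {
        return store.read(path)
                .map(text -> path + " -> " + text)
                .orElse(path + " -> (not found)");
    }
}

final class SharedInterfaceDemo {
    public static void main(String[] args) {
        System.out.println(IntegrationTool.describe(new PathKeyedStore(), "/contracts/acme"));
        System.out.println(IntegrationTool.describe(new IdKeyedStore(), "/invoices/1001"));
        System.out.println(IntegrationTool.describe(new IdKeyedStore(), "/missing"));
    }
}
```

Supporting another repository in this sketch means adding one more `ContentStore` implementation. `IntegrationTool` itself does not change, which is the saving a standard interface is meant to deliver.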
The Java Content Repository
JCR is “an ongoing effort to define a standard repository interface for the J2SE/J2EE platforms. The goal of a content repository API is to abstract the details of application data storage and retrieval such that many different applications can use the same interface, for multiple purposes, without significant performance degradation. Content services can then be layered on top of that abstraction to enable software reuse and reduce application development time.” (Source: Roy Fielding. JSR 170 Overview: Standardizing the Content Repository Interface. Day Software White Paper, March 2005.)
Industry standards for the Java platform are developed using the Java Community Process (JCP). Members of the JCP participate in the development of specifications, referred to as Java Specification Requests (JSRs). The specification for the Java Content Repository is known as JSR 170, which recently became a formal standard. The standard was proposed by Day Software and involved experts from leading organizations such as BEA, EMC (Documentum), FileNet, Fujitsu, Hummingbird, IBM, Vignette and the Apache Software Foundation. A second version of the JCR is now in development and is known as JSR 283.
JSR 170 supports three implementation levels. Level 1 provides a read-only repository and supports content export via XML. The export capability allows the content to be moved to other platforms, including a Level 2 repository. Level 2 adds a write capability and content import via XML. Level 3 adds versioning. Note that JSR 170 is an API, not a protocol, so it does not dictate how content travels over the network; protocols such as HTTP and WebDAV can be used in conjunction with it.
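The layering of these levels can be pictured as a small interface hierarchy. The sketch below follows the three levels as described above, and all of it is an assumption made for illustration: the `Level1Repository`, `Level2Repository` and `Level3Repository` interfaces, the `InMemoryRepository` class and the XML layout are hypothetical and are not the `javax.jcr` types or the export format defined by the specification.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical interfaces mirroring the levels described in the text.
// They are NOT the javax.jcr types defined by JSR 170.
interface Level1Repository {
    String read(String path);   // returns null when the path is absent
    String exportXml();         // current content as one XML document
}

interface Level2Repository extends Level1Repository {
    void write(String path, String value);
    void importXml(String xml);
}

interface Level3Repository extends Level2Repository {
    List<String> history(String path);   // every value written, oldest first
}

final class InMemoryRepository implements Level3Repository {
    private final Map<String, List<String>> versions = new TreeMap<>();

    @Override public String read(String path) {
        List<String> values = versions.get(path);
        return values == null ? null : values.get(values.size() - 1);
    }

    @Override public void write(String path, String value) {
        versions.computeIfAbsent(path, p -> new ArrayList<>()).add(value);
    }

    @Override public List<String> history(String path) {
        return List.copyOf(versions.getOrDefault(path, List.of()));
    }

    @Override public String exportXml() {
        StringBuilder xml = new StringBuilder("<repository>");
        for (String path : versions.keySet()) {
            xml.append("<item path=\"").append(escape(path)).append("\">")
               .append(escape(read(path))).append("</item>");
        }
        return xml.append("</repository>").toString();
    }

    @Override public void importXml(String xml) {
        try {
            NodeList items = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                write(item.getAttribute("path"), item.getTextContent());
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not import XML", e);
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}

final class LevelsDemo {
    public static void main(String[] args) {
        InMemoryRepository source = new InMemoryRepository();
        source.write("/policies/travel", "Draft <v1>");
        source.write("/policies/travel", "Approved & final");

        Level1Repository readOnlyView = source;   // a Level 1 client can only read and export
        Level2Repository target = new InMemoryRepository();
        target.importXml(readOnlyView.exportXml());

        System.out.println(target.read("/policies/travel"));
        System.out.println(source.history("/policies/travel"));
    }
}
```

In this sketch a client holding only a `Level1Repository` reference can read and export but cannot write, and the export carries only the current value of each item. The version history stays in the source repository.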
Most ECM vendors are expected to focus on providing Level 1 support, which will allow the content in their proprietary repositories to be accessed in read-only mode. Day Software, however, is marketing a Level 3 repository known as Content Repository Extreme (CRX). The company is also developing repository connectors for other ECM products. The first of these to be delivered is for EMC Documentum. The connector will make content stored in the Documentum repository accessible using the JSR 170 interface. Other Day connectors in development include interfaces for FileNet, IBM Domino.doc, Interwoven, Microsoft SharePoint, Open Text LiveLink, and Software AG Tamino. The connectors support reading and writing content, and querying and searching with both SQL and XPath.
An open source reference platform (known as Jackrabbit) developed by Day Software for JSR 170 can be obtained from the Apache Software Foundation. This reference platform is currently an incubator project, but is expected to move to a full Apache project in the near future.
Moving to an Enterprise Approach
Many organizations are moving toward an enterprise approach to data integration. Today, this integration is focused on structured data, but there is increasing interest in bringing unstructured and semi-structured data into enterprise integration projects. Many integration technologies (EII, ETL and business portals, for example) are beginning to support this type of business content, but it is expensive to develop and support custom adapters for each type of content repository. Web services provide a certain level of integration, but full data integration will require separating ECM applications from their underlying repositories and supporting a universal repository interface. The JCR standard is the first step in supporting this process.