In this article, we will cover our notion of the parallel Web. This term refers to the techniques that help content publishers represent data on the Web with two or more addresses. For example, one address might hold a human-consumable format and another a machine-consumable format. Additionally, we include within the notion of the parallel Web those cases where alternate representations of the same data or resource are made available at the same location, but are selected through the HTTP protocol (see Resources for links to this and other material).
HTTP and HTML are two core technologies that enable the World Wide Web, and the specification of each contains a popular technique that enables the discovery and negotiation of alternate content representations. Content negotiation is available through HTTP, the protocol that allows user agents and proxies/gateways on the Internet to exchange hypermedia. This technique maps mostly to the scenario where alternate representations are found at the same Web address. In HTML pages, the link element indicates a separate location containing an alternate representation of the page. In the remainder of this article, we’ll look at some of the history behind these two techniques, examine their deployment and usage today, explain how they might be applied to help solve the MissMASH example scenario, and evaluate the strengths, weaknesses, and appropriate use cases for both.
Content negotiation as a phrase did not appear until the completion of the HTTP/1.1 specification, but the core of its functionality was defined in the HTTP/1.0 specification (see Resources for links to both) — albeit in a rather underspecified format. The HTTP/1.0 specification provided implementors with a small set of miscellaneous headers: Accept, Accept-Charset, Accept-Encoding, Accept-Language, and Content-Language. (We will go into more technical detail of these headers and their usage in the next section of this article.)
Historically, it is important to note that in its original form, the content negotiation mechanism left it entirely up to the server to choose the best representation from those available, given the preferences sent by the user agent. In the next (and current) version of content negotiation, which arrived with HTTP/1.1, the specification introduced choices with respect to who makes the final decision on the alternate representation’s format, language, and encoding. The specification mentions server-driven, agent-driven, and transparent negotiation.
Server-driven negotiation is very similar to the original content negotiation specification, with some improvements. Agent-driven negotiation is new and allows the user agent (possibly with help from the user) to choose the best representation out of a list supplied by the server. This option suffers from underspecification and the need to issue multiple HTTP requests to the server to obtain a resource; as such, it really hasn’t taken off in current deployments. Lastly, transparent negotiation is a hybrid of the first two types, but is completely undefined and hence is also not used on the Web today.
Amidst this sea of underspecification, however, HTTP/1.1 introduced quality values: floating point numbers between 0 and 1 that indicate the user agent’s preference for a given parameter. For example, the user agent might use quality values to indicate to the server its user’s fluency level in several languages. The server could then use that information along with the faithfulness of its alternate language representations to choose the best translation for that request and deliver it as per server-driven content negotiation.
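To make the role of quality values concrete, here is a minimal Python sketch of server-driven language selection. The function names and the idea of multiplying the user's quality value by a per-translation "faithfulness" score are our assumptions for illustration; real servers apply more elaborate rules.

```python
from typing import Dict, Optional


def parse_quality_list(header: str) -> Dict[str, float]:
    """Parse a header value such as 'es;q=0.9, en;q=0.6' into {value: q}."""
    prefs: Dict[str, float] = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        value, _, params = item.partition(";")
        q = 1.0  # HTTP/1.1 default when no q parameter is given
        for param in params.split(";"):
            name, _, raw = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(raw)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs[value.strip().lower()] = q
    return prefs


def choose_language(header: str, available: Dict[str, float]) -> Optional[str]:
    """Pick the available language with the highest (user q * faithfulness).

    `available` maps language tags to a hypothetical faithfulness score
    between 0 and 1 assigned by the publisher (an assumption of this sketch).
    """
    prefs = parse_quality_list(header)
    best, best_score = None, 0.0
    for lang, faithfulness in available.items():
        score = prefs.get(lang, prefs.get("*", 0.0)) * faithfulness
        if score > best_score:
            best, best_score = lang, score
    return best
```

With an Accept-Language value of `es;q=0.9, en;q=0.6` and translations rated `{"en": 1.0, "es": 0.5}`, this sketch selects English, because a weak Spanish translation scores lower overall than the original English text.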
The examples you’ve seen of content negotiation so far have highlighted its uses to serve the heterogeneous mix of people accessing the World Wide Web. But just as diverse humans can use content negotiation to retrieve Web content in a useful fashion, so also can diverse software programs request content in machine-readable formats to further their utility on the Web. You will see how to use content negotiation to further a meaningful Web for humans and machines later in this article.
Most notably, the HTML 4.01 specification introduced new attributes that can indicate the media type and character set of the resource referenced by the link element. It is thanks to these additions that the link element has received wider adoption, especially by Weblogs. Later in this article, you will see that while many of the traditional use cases for the link element revolve around the needs of humans, the element is increasingly used to reference machine-consumable content in an attempt to further software’s capabilities on the Web.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
…the rest of the article…
In Listing 2, you see the most popular use of the link element today. Most modern browsers have a decent amount of support for the Cascading Style Sheets specifications, allowing publishers to separate structure from presentation in their HTML pages and making it easier to change their visual presentation without having to recode the entire page. This example in Listing 2 makes use of a new attribute: media, which specifies the intended destination medium for the associated style information. The default value is screen, but you can also use print, or, to target mobile devices, handheld. (See Resources for more on these values.) Beware: while you can use multiple values in the media attribute, they must be separated by commas, as opposed to the spaces you’d use for values of the rel attribute.
Listing 2. Providing links to stylesheet information
…the rest of the article…
Our next example uses the internationalization capabilities of the HTML specification to give browsers or user agents language-specific navigational information in order to best match the reader with her preferred tongue. In the case of this article series, the authors combined can speak two languages fluently. If we had a large Spanish readership, we could provide those readers with a link to the article in Spanish by using the link element.
Notice a few of the attributes and their values that we use in Listing 3. First, we use the alternate link type to express the relationship between this article and our Spanish version. Because we are using it in combination with the hreflang attribute, we imply that the referenced Web page is a translated, alternate version of the current document. We also add a lang attribute to specify the language of the title attribute. Finally, we add a type attribute value, which is optional and is only there as an advisory hint to the user agent. (A Web browser might use this hint, for example, to launch the registered application that can deal with the specified media type.)
Listing 3. Language-specific navigational link elements
title="La Red Paralela"
…the rest of the article…
The previous examples mostly served to show the most common uses of the link element, but, with the exception of the language-specific use of link, they didn’t really show you how to provide alternate representations of the data found on the original page. And even in the language example given in Listing 3, the alternate representation is intended for consumption by humans and not by machines. The final link example in Listing 4 shows how blogs use the link element to refer to feeds. A feed is an XML representation of its HTML counterpart that carries more structured semantics than the HTML contents of a blog as rendered for a person. Blog-reading software can easily digest an XML feed in order to group blog entries by categories, dates, or authors, and to facilitate powerful search and filter functionality. Without the link element, software would not auto-discover a machine-readable version of a blog’s contents and could not provide this advanced functionality. This is a good example of how content generated by humans (even through blog publishing tools) is not easily understood by machines without the use of a parallel Web technique. The feed auto-discovery mechanism enabled by the link element helps humans co-exist with machines on the Web.
In Listing 4, you begin to see the power of the link element emerge to provide machines with valuable information in decoding content on the Web today. We again use the “alternate” relationship to denote a new URL where you can find a substitute representation of the content currently viewed. Through the use of media types, we give the user agent the ability to choose the format it understands best, whether that is an Atom feed, an RSS feed, or any other media format.
The link elements shown in Listing 4 create a parallel Web with three nodes. First, at the current URL is an HTML representation of this article intended for human consumption. Second, the relative URL ../feed/atom contains a machine-readable version of the same content represented in the application/atom+xml format. And finally, the relative URL ../feed/rss contains a different machine-readable version of the same content, represented this time in the application/rdf+xml format.
Listing 4. Providing auto-discovery links to feed URLs in blogs
title="Atom 1.0 Feed"
…the rest of the article…
We must note that the values of the rel attributes are not limited to a closed set; the Web community is constantly experimenting with new values that eventually become the de facto standard in expressing a new relationship between the HTML page and the location specified in the href attribute. For instance, the pingback value used by the WordPress blog publishing system (see Resources) is now very widely used; it provides machines with a pointer to a service endpoint that connects Weblog entries with one another through their comment listings.
As simple REST-based Web services permeate the Web, we foresee a large number of rel values being defined to point to many different service interfaces. Such an explosion of uses for the link element greatly enhances the functionality of traditional Web browsers but also comes with other problems involving governance and agreement of relationship types that are beyond the scope of this article. Find more information on the HTML link element in Resources.
Typically on the Web, URLs contain a file extension that indicates either the method used to produce the HTML (for example, .html for static pages, .php for PHP scripts, or .cgi for various scripting languages using the Common Gateway Interface) or the media format of the content at the URL (such as .png, .pdf, or .zip). While this approach is adequate in many cases, it requires that you create and maintain one URL for each different representation of the same content. In turn, this requires that you use some mechanism to link the multiple URLs together (such as the HTML link element described above, or the human-traversable a anchor element) and that software know how to choose between the different URLs according to its needs.
Content negotiation allows different representations of data to be sent in response to a request for a single URL, depending on the media types, character sets, and languages a user agent indicates that it can (and wants to) handle. Whereas the link element constructs a parallel Web by assigning multiple URLs to multiple representations of the same content, content negotiation enables a parallel Web wherein multiple formats of the same content all are available at a single URL. The Resources in this article contain links to more detail on behind-the-scenes HTTP content negotiation.
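From the client's side, negotiating for a format amounts to sending an Accept header with the request. The following Python sketch builds such a request with the standard library; the URL and the helper name `build_negotiated_request` are hypothetical, and nothing is sent over the network until the request is opened.

```python
import urllib.request
from typing import List, Tuple


def build_negotiated_request(
    url: str, preferences: List[Tuple[str, float]]
) -> urllib.request.Request:
    """Build a request whose Accept header lists media types with q-values."""
    accept = ", ".join(
        # A quality value of 1 is the default, so it can be left implicit.
        media_type if q >= 1.0 else f"{media_type};q={q:g}"
        for media_type, q in preferences
    )
    return urllib.request.Request(url, headers={"Accept": accept})


# Hypothetical URL: prefer an Atom feed, but accept HTML as a fallback.
request = build_negotiated_request(
    "http://example.org/parallelweb",
    [("application/atom+xml", 1.0), ("text/html", 0.5)],
)
print(request.get_header("Accept"))
```

Passing the resulting object to `urllib.request.urlopen` would issue the request; a server that supports content negotiation could then answer with whichever representation best matches the header.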
Apache provides a content negotiation module that has two methods to select resource variants depending on the HTTP request: type maps and MultiViews. A type map is a file that explicitly maps a URL to file names based on any combination of language, content type, and character encoding. MultiViews are very similar, except that the administrator does not have to create a type map file; instead, the server builds the map on the fly from the files it finds in the directory and from the extension mappings specified elsewhere in the server configuration. For brevity, in this article we will only show an example of a type map, but please see the Resources for links to very detailed Apache content negotiation guides.
In Listing 5, we list the contents of a type map used to provide the mapping necessary to serve different versions of our article based on language and media type. In line 1, we specify the single URI used to serve multiple format and language representations of the same article. Lines 3 and 4 establish that by default we will serve the file parallelweb.en.html in response to all requests for the text/html media type. In lines 6 through 9, we have a special case in which we can serve the Spanish version of our article if desired by the user agent. Next, in lines 10 and 11, we provide a printed version of our article. Finally, you see in lines 13 and 14 the ability for software to retrieve an Atom feed version of our article. Content negotiation allows blog-reading software requesting our article in the application/atom+xml content type to receive a representation of the article that the software can understand more easily than the standard HTML version.
Listing 5. Using type maps in Apache to serve multiple language and format representations
URI: parallelweb

URI: parallelweb.en.html
Content-type: text/html

URI: parallelweb.es.html
Content-type: text/html
Content-language: es

URI: parallelweb.ps
Content-type: application/postscript; qs=0.8

URI: parallelweb.xml
Content-type: application/atom+xml
In summary, this extra configuration on our server allows a single URL to be distributed for our article, while still permitting both human-friendly and machine-readable data to be available as long as the user agent requests it.
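To illustrate how a server might choose among the variants in Listing 5, here is a simplified Python sketch. It is not Apache's actual algorithm: it considers only media types (the Spanish variant and language handling are left out), and it assumes that a variant's score is the client's q value multiplied by the variant's qs source quality, with 1.0 assumed where the type map gives no qs.

```python
from typing import Dict, List, Optional, Tuple

# (file name, media type, qs) modeled on Listing 5; qs defaults to 1.0.
VARIANTS: List[Tuple[str, str, float]] = [
    ("parallelweb.en.html", "text/html", 1.0),
    ("parallelweb.ps", "application/postscript", 0.8),
    ("parallelweb.xml", "application/atom+xml", 1.0),
]


def parse_accept(header: str) -> Dict[str, float]:
    """Turn 'text/html, application/postscript;q=0.9' into {type: q}."""
    prefs: Dict[str, float] = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # default quality when the client gives none
        for param in parts[1:]:
            name, _, raw = param.partition("=")
            if name.strip().lower() == "q":
                q = float(raw)
        prefs[parts[0].lower()] = q
    return prefs


def client_quality(prefs: Dict[str, float], media_type: str) -> float:
    """Look up the most specific matching range: type/subtype, type/*, */*."""
    major = media_type.split("/")[0]
    for pattern in (media_type, f"{major}/*", "*/*"):
        if pattern in prefs:
            return prefs[pattern]
    return 0.0


def select_variant(accept_header: str) -> Optional[str]:
    """Return the file name of the best-scoring variant, or None if none fit."""
    prefs = parse_accept(accept_header)
    best, best_score = None, 0.0
    for filename, media_type, qs in VARIANTS:
        score = client_quality(prefs, media_type) * qs
        if score > best_score:
            best, best_score = filename, score
    return best
```

Under these assumptions, a feed reader sending `Accept: application/atom+xml` receives parallelweb.xml, while a browser sending `Accept: text/html, application/postscript;q=0.9` receives parallelweb.en.html.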
Now, let’s move on to our example application and hypothesize how we might build it by using parallel Web techniques.
One of the very first pieces of information needed by MissMASH is some sort of online calendar account information. As you might imagine, this might become unwieldy in no time if we required MissMASH to understand every possible public and/or private calendar service on the Internet. But if these services make use of parallel Web techniques, it is as simple as entering the URL you see in your browser when accessing that service into MissMASH. Consider, for example, Google Calendar. (For the purposes of this example, we’ll ignore user authentication details.) Imagine that a user configuring MissMASH navigates in a Web browser to his Google Calendar. Once there, he copies the URL in his browser’s address bar and pastes it into the MissMASH configuration. MissMASH would request that URL, parse the returned HTML, and find inside a gem of the parallel Web: a link element.
Google offers multiple versions of its online calendar: iCalendar (see Resources), Atom feeds, and HTML. Unfortunately, it only provides a link element to the Atom feed version of the calendar, but it’s just as described in Listing 4. The same would occur for Flickr, which happens to also provide a link to an Atom feed of your most recent photos. In either case, MissMASH would get a second URL from the href attribute of the link element and would resolve that URL to receive a machine-readable representation of the user’s event and photo information. Because these representations contain (among other data) dates and locations that software can easily recognize, MissMASH can use these alternate representations to render calendar and photo data together.
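The discovery step described above can be sketched in a few lines of Python using only the standard library. The class and function names are ours, the sample markup and example.org URL are hypothetical, and a real MissMASH would also need to fetch the page and handle authentication.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class AlternateLinkParser(HTMLParser):
    """Collect link elements whose rel attribute includes 'alternate'."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.alternates: List[Tuple[str, str]] = []  # (media type, absolute URL)

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        # rel holds space-separated link types, compared case-insensitively.
        rel_values = attributes.get("rel", "").lower().split()
        if "alternate" in rel_values and attributes.get("href"):
            self.alternates.append(
                (
                    attributes.get("type", "").lower(),
                    # Resolve relative hrefs against the page's own URL.
                    urljoin(self.base_url, attributes["href"]),
                )
            )


def discover_alternates(html: str, base_url: str) -> List[Tuple[str, str]]:
    """Return (media type, URL) pairs for every alternate link in the page."""
    parser = AlternateLinkParser(base_url)
    parser.feed(html)
    return parser.alternates


# Hypothetical page containing one feed link and one stylesheet link.
page = """<html><head>
<link rel="alternate" type="application/atom+xml" title="Atom 1.0 Feed" href="../feed/atom">
<link rel="stylesheet" type="text/css" href="style.css">
</head><body></body></html>"""
print(discover_alternates(page, "http://example.org/blog/article/"))
```

The stylesheet link is ignored because its rel value is not alternate; the application can then pick, from the returned pairs, the first media type it knows how to parse.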
For illustrative purposes, let’s also imagine that Citibank used content negotiation on the user’s account URL. MissMASH can then negotiate with the Citibank server for any of the different supported formats, such as Atom feeds or the Quicken Interchange Format (see Resources).
We now have almost all of the information necessary to render our external personal data sources in MissMASH. Of course, for MissMASH to use the parallel Web in this way, it must understand both the structure and the semantics of the machine-readable representations that it retrieves. If MissMASH receives iCal data from Google Calendar, an Atom feed from Flickr, and Quicken Interchange Format data from Citibank, then it must understand how to parse and interpret all three of these machine-readable formats in order to merge the information. The parallel Web techniques do not themselves help resolve this matter; instead, the parallel Web’s key benefit is that the user does not have to visit obscure configuration Web pages of the source-data services to find the URLs to one of the data formats supported by MissMASH. While MissMASH must focus on supporting a small number of different formats, this is far superior to requiring MissMASH to understand both a variety of data formats and a variety of application-specific, human-intended services for retrieving that data.
Future articles in this series will examine techniques to create a meaningful Web that impose a smaller burden on the data consumer to understand multiple machine-readable formats.
- This technique fulfills most or all of the requirements of this evaluation criterion.
- This technique fulfills some of the requirements of this evaluation criterion, but has some significant drawbacks.
- This technique does not satisfy the requirements of this evaluation criterion.
Because the parallel Web techniques do not constrain the specific formats of alternative data representations, in theory, these techniques might provide machine-readable data that is not only equivalent but even richer in content and semantics than the original human-friendly Web page. However, we must point out that most online services use this technique to link to Atom feeds, which often only capture a small amount of the original data in machine-readable formats. For example, an Atom feed of banking transactions might capture the date, location, and title of the transaction in a machine-readable fashion, whereas the (not widely used on the parallel Web) Quicken Interchange Format contains much richer semantic data, such as account numbers, transaction types, and transaction amounts.
Our evaluation: The parallel Web provides authoritatively sound data, but in common usage does not provide machine-readable data that completely represents the authoritative human-oriented content.
All is not lost, however. Because the parallel Web techniques do not mandate one specific alternate representation, implementors are free to choose a representation that itself supports great extensibility and expressivity. In practice, however, the de facto formats used today on the Internet with parallel Web techniques do not offer much flexibility for extension to new data formats and richer semantics.
Our evaluation: The parallel Web is flexible enough to accommodate new data and new representations, but existing implementations do not scale easily and existing data formats are not particularly extensible.
Our evaluation: While the algorithms to generate the various data formats must be maintained, the actual data need only be represented a single time.
Our evaluation: The parallel Web provides no data locality whatsoever.
Our evaluation: The parallel Web is a technique already in use, and adopting it does not threaten the meaning of any current Web sites.
Our evaluation: Developers and implementors can rest assured that the parallel Web is grounded firmly in accepted standards.
Our evaluation: The parallel Web has been around for a while and a great deal of mature tooling is available to produce and consume it.
As a designer, you are freed from the need to maintain both data and presentation.
You can optimize content to suit the needs of the target consumer.
At the same time, we have already touched on a few areas in which the parallel Web is more complex than other approaches. Using the link element requires that you maintain multiple URLs for multiple data representations. In a world in which one “404 Not Found” error can be the difference between a sale for you or a sale for your competitor, the maintenance of numerous active URLs can be a burdensome task. Content negotiation itself requires the maintenance of moderately complex server-side configurations. And because both techniques support multiple representations of the same data, you must write and maintain code to generate these representations from the core data.
Overall evaluation: The parallel Web is widely understood and used today, but it does require a high amount of maintenance as time passes.
Nevertheless, you’ve seen in this article that both of these approaches have shortcomings that, to date, have limited them to semantically poor use cases. Use of the link element requires maintenance of multiple URLs and requires multiple HTTP requests to get at machine-readable data. Content negotiation hides different representations at a single URL and makes it difficult to unite the human view of the content with the machine view. Further, the lack of uniformity across machine-readable representations and the lack of extensibility of Atom, one of the only common machine-readable representations in use today, hamper the adoption of the parallel Web in cases such as MissMASH in which the data being produced and consumed is not consistently structured.
In the rest of this series, we’ll examine techniques that strive to achieve a meaningful Web for humans and machines without maintaining two (or more) strands of a parallel Web. These techniques begin with an HTML Web page and derive semantics from the content of that page. Thus, they share the benefits of not requiring multiple URLs for multiple representations and of containing both the human- and machine-targeted content within the same payload. You will see, however, that these techniques still differ substantially in their approaches, and have their own advantages and disadvantages, which we will evaluate.
In the next article, we’ll examine the algorithmic approach, in which third parties use structural and heuristic techniques to derive machine-readable semantics from HTML Web pages targeted at people.