In this first article, you meet the human-computer conflict, learn the criteria used to evaluate different technologies, and find a brief description of the major techniques used today to enable machine-human coexistence on the Web.
Yet this content-rich, human-friendly world has a shadowy underworld. It’s a world in which machines attempt to benefit from this wealth of data that’s so easily accessible to humans. It’s the world of aggregators and agents, reasoners and visualizations, all striving to improve the productivity of their human masters. But the machines often struggle to interpret the mounds of information intended for human consumption.
The story is not all bleak, however. Even if you were unaware of this human-computer conflict, there is no need to worry. By the end of this series, you'll have enough knowledge to choose intelligently among the myriad possible paths for bridging the data-presentation needs of machine and human data consumers.
Semantic Web: A mesh of information linked up in such a way as to be easily processed by machines, on a global scale. The Semantic Web extends the Web by using standards, markup languages, and related processing tools.
However, by the very nature of its inhabitants, the Web grew exponentially and gave priority to content consumable mostly by humans rather than machines. Steadily, users' lives became more reliant on the Web, and we transitioned from a Web of personal and academic homepages to a Web of e-commerce and business-to-business transactions. Even as more and more of the world's most vital information flowed through its links, most Web-enabled interactions still required human interpretation. And as Internet-connected devices have spread through people's lives, we have come to depend increasingly on those devices' software being able to understand the data on the Web.
The top three Web sites (as of this article’s writing) according to Alexa traffic rankings are Yahoo!, MSN, and Google — all search engines. Each of these sites is powered by an army of software-driven Web crawlers that apply various techniques to index human-generated Web content and make it amenable to text searches. Without these companies’ vast arrays of algorithmic techniques for consuming the Web, your Web-navigation experiences would be limited to following explicitly declared hypertext links.
Next, consider the fifth most-trafficked site on Alexa's list: eBay. People commonly think of eBay as one of the best examples of humans interacting on the Web. However, machines play a significant role in eBay's popularity. Approximately 47% of eBay's listings are created by software agents rather than through the human-oriented Web forms. During the last quarter of 2005, the machine-oriented eBay platform handled eight billion service requests, and over the course of 2005 the number of eBay transactions conducted through its Web Services APIs grew by 84%. Clearly, without the services eBay provides that let software agents participate on equal terms with humans, the online auction business would not be nearly as manageable for people dealing with significant numbers of sales or purchases.
For a third example, we turn to Web feeds. Content-syndication formats such as Atom and RSS have empowered a new generation of news-reading software that frees you from the tedious, repetitive, and inefficient reliance on bookmarked Web sites and Web browsers to stay in touch with news of interest. Without the machine-understandable content representation embodied by RSS and Atom, these news readers could not exist.
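To make this concrete, here is a sketch of a minimal single-entry Atom 1.0 feed; all titles, URLs, and dates are placeholders:

    <?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>Example News</title>
      <id>http://example.org/feed</id>
      <updated>2006-05-01T12:00:00Z</updated>
      <entry>
        <!-- Each entry carries its title, link, and date in
             predictable elements that any feed reader can parse -->
        <title>A machine-readable headline</title>
        <link rel="alternate" href="http://example.org/2006/05/headline.html"/>
        <id>http://example.org/2006/05/headline</id>
        <updated>2006-05-01T12:00:00Z</updated>
        <summary>A short, machine-consumable synopsis of the story.</summary>
      </entry>
    </feed>

Because the title, link, and date live in dedicated elements rather than in presentational markup, a news reader needs no guesswork to aggregate hundreds of such feeds.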
In short, imagine a World Wide Web where a Web site could only contain content authored by humans exclusively for that site. Content could not be shared, remixed, and reused between Web sites. To intelligently aggregate, combine, and act on Web-based content, agents, crawlers, readers, and other devices must be able to read and understand that content. This is why it’s necessary to take an in-depth look at the different mechanisms available today to improve the interactions between machines and human-generated content in Web applications.
Creating new data integrations of this sort requires that the software driving the integration be able to understand and interpret the data on particular Web pages. This software must be able to retrieve the Web pages that display your photos from Flickr and discover the dates, times, and descriptions of your photos. It also needs to understand how to interpret the transactions from your online bank statement. The same software must be able to understand various views of your online calendar (daily, weekly, and monthly), and figure out which parts of the Web page represent which dates and times.
The example in Figure 1 shows how embedded metadata might benefit your end-user applications. You begin with your data stored in several places. Flickr hosts your photographs, Citibank provides access to your banking transactions, and Google Calendar manages your daily schedule. You wish to experience all of this data in a single calendar-based interface (missMASH), such that the photos from your Sunday at the State Park appear in the same weekly view as your credit card transaction from Wednesday’s grocery shopping. To do this, the software that powers missMASH must have some way to understand the data from your Flickr, Citibank, and Google Calendar accounts in order to remix the data in an integrated environment.
In this series, we'll examine how you might implement the scenario discussed above using the different mechanisms available for human-computer coexistence on the Web. We will introduce and explain each technology, then show how it might be used to integrate bank statements, photos, and calendars. We will also evaluate the strengths and weaknesses of each technology and, we hope, make it easier for you to choose among the options.
The authority of the data is one axis along which to measure this trust. We consider one representation of data to be authoritative if the representation is published by the owner of the data. A data representation might be non-authoritative if it is derived from a different representation of the data by a third party.
Along these lines, we value techniques that accommodate new data in an elegant manner and that are expressive enough to instill confidence that we can represent previously unforeseen data in the future.
This does not necessarily apply to repetition within multiple data representations, if those representations are generated from a single data store.
In addition to being easy to author and to read, data locality allows visitors to a Web page to copy and paste the human and machine representations of the data together, rather than requiring that multiple discontinuous segments of the page be copied to fully capture the data of interest. This, in turn, promotes wider adoption and reuse of both the techniques and the data. (Just ask anyone who has ever learned HTML through generous use of the View Source command.)
For example, a new technique that prescribed that text enclosed in the HTML u tag represent the name of a publication might lead to correct semantics in some cases, but might also license many incorrect interpretations. While the markup in this example might be authoritative (because it originates from the owner of the data), it is still incorrect because the Web page author did not intend to use the u HTML tag in this manner.
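For instance, under such a convention, a machine would treat both of the following constructed fragments as naming publications, even though only the first author intended that meaning:

    <p>I just finished reading <u>Moby-Dick</u>.</p>
    <p>Do <u>not</u> press the red button.</p>

The second author used the u tag purely for visual emphasis, so any data extracted from it would be wrong despite coming straight from the page's owner.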
You should be able to use techniques without losing the ability to adhere to accepted Web standards such as HTML (HTML 4 or XHTML 1), CSS, XML, XSLT, and more.
Creating a Web of data that humans can read and machines can process is of little value if no tools understand the techniques used. We prefer techniques that have a large base of tools already available. Failing that, we prefer techniques for which one can easily implement new tools. Tools should be available both to help author Web pages that make use of a technique, and also to consume the machine-readable data as specified by the technique.
The Web has been around for a while now, but only recently has the need to share data between humans and machines begun receiving serious attention. The vast landscape of content available on the Web is authored and maintained by a wide variety of people, and it is important that whatever techniques we promote be easily understood and adopted by as many Web authors as possible.
This section provides a brief introduction to the major techniques used today to enable machine-human coexistence on the Web. Subsequent articles in this series will explore these techniques in detail.
For example, Web feeds and feed readers have empowered humans to keep up with the vast amount of information being published today. When you use a feed reader, you initialize it with the address (URL) of an XML file — usually an RSS or Atom file. In most cases, the machine-consumable data within such a feed has a parallel URL on the Web, where you can find a human-readable representation of the same content. There are a variety of techniques for achieving this parallel Web in a useful and maintainable fashion. Part 2 of this series will discuss the parallel Web in detail, including the benefits and drawbacks of having the same data available at more than one Web address. Future installments of this series will cover techniques that allow multiple data representations to be contained within a single Web address.
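A common way to tie the two parallel addresses together is feed autodiscovery: the human-readable page advertises its machine-readable twin with a link element in its head. A sketch, with placeholder URLs:

    <head>
      <title>Example News</title>
      <!-- points to the machine-readable representation of this same content -->
      <link rel="alternate" type="application/atom+xml"
            title="Example News (Atom)" href="http://example.org/feed.atom"/>
    </head>

Feed readers and browsers follow this convention to find the machine-oriented address when a human supplies only the human-oriented one.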
Scrapers, which extract data by examining the structure and layout of a Web page
Natural-language processors, which attempt to read and understand a Web page’s content in order to generate data
These techniques are designed for situations where the structure or content of a Web page is highly predictable and unlikely to change. The algorithms are usually developed by the person seeking to consume the data, and as such they are not governed by any standards organization. Often, these algorithms are an integrator’s only option when faced with accessing data whose owner does not publish a machine-readable representation of the data. Stay tuned for details on the algorithmic approach in Part 3 of this series.
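As an illustration of the structural (scraper) approach, suppose a hypothetical auction page lists each item as a row of a table with id="listings". A consumer could then extract the data with a short XSLT transformation over the well-formed XHTML; the table structure and field positions here are assumptions about that imagined page:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:h="http://www.w3.org/1999/xhtml">
      <xsl:output method="xml" indent="yes"/>
      <!-- Assumes each listing is a data row of the table with
           id="listings"; a site redesign silently breaks this. -->
      <xsl:template match="/">
        <listings>
          <xsl:for-each select="//h:table[@id='listings']//h:tr[h:td]">
            <item>
              <title><xsl:value-of select="h:td[1]"/></title>
              <price><xsl:value-of select="h:td[2]"/></price>
            </item>
          </xsl:for-each>
        </listings>
      </xsl:template>
    </xsl:stylesheet>

The brittleness is apparent: the transformation encodes assumptions about presentation rather than meaning, which is exactly why scrapers are reserved for pages that are highly predictable and unlikely to change.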
As with the algorithmic approach, microformats differ from many of the other techniques in this series because they are not part of a standards process in organizations such as the W3C or the IETF. Instead, the microformats community focuses on specific problems and leverages current behaviors and usage patterns on the Web. This focus has given microformats a strong start toward their goal of improving the publishing of Web microcontent (blog entries, for example). The main microformat success stories to date are the hCard and hCalendar specifications, which let microcontent publishers easily embed attributes in their HTML that enable machines to pick out small nuggets of information, such as business cards or event details, from microcontent Web sites.
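For example, here is a minimal hCard; it reuses ordinary HTML class attributes (vcard, fn, url, org, tel) to mark up contact information, and the person and company shown are invented:

    <div class="vcard">
      <!-- "fn" marks the formatted name; combining it with "url"
           on the anchor ties the name to the person's homepage -->
      <a class="url fn" href="http://example.org/~jane">Jane Doe</a>
      works at <span class="org">Example Widgets, Inc.</span>
      and can be reached at <span class="tel">+1-555-0100</span>.
    </div>

A human sees an ordinary sentence; a microformat-aware tool sees a business card it can import into an address book.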
GRDDL has great potential for bridging the gap between humans and machines by enabling authoritative, on-the-fly transformations of content. While this resembles the parallel Web, there are significant differences: GRDDL provides a general mechanism by which machines transform content on demand, and it does not create permanent versions of alternative data representations. The W3C has recently chartered a GRDDL working group to produce a recommended specification.
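Concretely, a GRDDL source document names its own transformation, typically an XSLT stylesheet, which a GRDDL-aware agent fetches and applies to the page to produce RDF on demand. A minimal sketch, where the stylesheet URL is a placeholder:

    <html xmlns="http://www.w3.org/1999/xhtml">
      <head profile="http://www.w3.org/2003/g/data-view">
        <title>My Events</title>
        <!-- a GRDDL-aware agent applies this stylesheet to the
             page itself to extract RDF; the URL is hypothetical -->
        <link rel="transformation"
              href="http://example.org/xsl/events2rdf.xsl"/>
      </head>
      <body><p>Human-readable event listings go here.</p></body>
    </html>

Because the page's owner names the transformation, the extracted data is authoritative, yet no second copy of it ever needs to be published or kept in sync.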
Embedded RDF (eRDF) is not currently developed by a standards body. Like microformats, eRDF can encode data within Web pages that helps machines extract contact, event, and location information (among other types of data), enabling powerful software agents.
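A rough sketch of the eRDF idiom follows, assuming the FOAF vocabulary: vocabulary prefixes are declared with schema links in the document head, and prefixed class and rel values carry the RDF properties. All names and URLs here are invented, and the exact subject-resolution rules are defined by the eRDF specification:

    <head profile="http://purl.org/NET/erdf/profile">
      <!-- declares the "foaf" prefix used in class and rel values below -->
      <link rel="schema.foaf" href="http://xmlns.com/foaf/0.1/"/>
    </head>
    <body>
      <p id="jane">
        My name is <span class="foaf-name">Jane Doe</span> and my homepage is
        <a rel="foaf-homepage" href="http://example.org/~jane">here</a>.
      </p>
    </body>

Note that eRDF restricts itself to attributes that already exist in HTML (class, rel, id, and so on), so eRDF documents remain valid under current HTML standards.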
As with eRDF, RDFa takes advantage of namespaces and the RDF graph data model to enable the representation of many data structures and vocabularies within a single Web page. RDFa seeks to be a general-purpose solution to the inclusion of arbitrary machine-readable data within a Web page.
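As a sketch of how that might look for the photo data in our integration scenario, the following fragment uses the 2006-era RDFa draft syntax with the Dublin Core vocabulary; the subject URL and values are placeholders:

    <div xmlns:dc="http://purl.org/dc/elements/1.1/"
         about="http://example.org/photos/42">
      <!-- "about" names the subject; "property" attaches literal values;
           "content" supplies a machine-readable form of the visible text -->
      <span property="dc:title">Sunday at the State Park</span>, taken on
      <span property="dc:date" content="2006-04-23">April 23, 2006</span>.
    </div>

A human reader sees a caption, while software such as our hypothetical missMASH application can lift out a precise date and title and place the photo on the right day of an integrated calendar.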
Stay tuned for Part 2, which will explore in detail the widely used parallel Web technique.