Alvis Task T3.2 - Metadata Format for Enriched Documents

Milestone M3.2 - Month 12 (December 2004)

Mike Taylor


Table of Contents
1. Abstract
2. Introduction
2.1. The Alvis Architecture: A Brief Overview
2.2. The Need for a Format for Enriched Documents
2.3. Principles for Designing the Format
3. The Alvis Pipeline and The Family of Formats
3.1. Acquisition Format
3.2. Linguistic Format
3.3. Relevance Format
4. Syntactic Issues
5. Discussion of the Canonical Format
5.1. Example Canonical Document
5.2. Intent of the Canonical Format
5.3. Elements Included in the Canonical Format
6. Discussion of the Enriched Document
6.1. Example Enriched Document
6.2. Representing Multiple Documents in a Single Package
6.3. Document Identifiers and Identity
6.4. The <acquisition> Section
6.4.1. Acquisition Data
6.4.2. Original Document
6.4.3. Canonical Document
6.4.4. Metadata
6.4.5. Links
6.4.6. Analysis
6.5. The <linguisticAnalysis> Section
6.6. The <relevance> Section
7. DTD for Enriched Documents

Chapter 1. Abstract

Alvis networks provide semantic searching by indexing not only documents but also complex annotations of those documents, which provide linguistic and relevance information about the content. The production of this information is a complex process, involving several different components: a Document Source, a Linguistic Analyser, a Document Probability package and an Indexer. We specify and describe the family of formats in which documents are expressed as they pass through this pipeline: Acquisition Format, which is what the various Document Source modules produce; Linguistic Format, which is an extension including information produced by the Linguistic Analyser; and Relevance Format, a further extension including information produced by the Document Probability package. It is this last and most complex form of the document that is eventually fed to the indexer to facilitate subsequent semantically rich queries. The three related formats are together known as Alvis's Enriched Document formats.


Chapter 2. Introduction

2.1. The Alvis Architecture: A Brief Overview

The EU-funded Alvis project runs for three years, from 2004 to 2006, and is tasked with building a semantic peer-to-peer search engine.

The peer-to-peer (P2P) aspect of the Alvis system is one of the key areas in which it innovates. The metadata format used to describe the capabilities of individual peers is described in the document Alvis Task T3.1 - Peer-Description Metadata Format, Milestone M3.1 - Month 6 (June 2004).

However, within each individual peer that makes up an Alvis network, a great deal of semantic work is done. Documents may be added to a peer's local database from a variety of sources, to become available for subsequent semantic searches. As documents enter the system, they pass through a pipeline of components that enhance them with various kinds of information. This document describes the format - or, rather, the family of formats - that such a document takes as it passes through the system.

It is important to understand that the communication described in this document is all within the context of a single Alvis peer. Communication between peers expresses an entirely different set of concepts, and consequently uses an entirely different format, to be described elsewhere.

An Alvis peer that includes all the components described in this document is informally referred to as a ``fat peer'' because so much work is happening within the peer itself. This is in contrast with other candidate P2P architectures such as ``peer soup'', which the Alvis Architecture Group considered and rejected, in which each peer provides just one of the components described below, and the functionality of the components is exposed in the P2P protocol. The ``fat peer'' architecture achieves a simpler P2P protocol at the expense of each individual peer being more complex; whereas the ``peer soup'' architecture would have achieved smaller individual peers, but at the cost of a more complex protocol between them.


2.2. The Need for a Format for Enriched Documents

As documents are added to an Alvis peer, they pass between a series of components that perform various analyses and transformations, each enriching the document in different ways. Since the components are to be created by different partners, potentially using different programming languages running on different platforms, it is necessary to define a rigorous shared format in which the documents, together with their associated enrichments, can be expressed for interchange between components.

Task T3.2 (Semantic Document Metadata Framework) in Workpackage WP3 (Data Model and Standards) is to provide such a format. This document describes the format as it stands at month twelve (December 2004) of the Alvis project, and constitutes Milestone M3.2.

Since the Alvis proposal document was written, the consortium partners have adopted more explicitly descriptive terminology, so that what the WP3 part of the proposal referred to as ``semantic documents'' are now called ``enriched documents''. Accordingly, Task 3.2 is now more properly defined as the provision of a metadata format for enriched documents.


2.3. Principles for Designing the Format

In designing the Alvis metadata format for enriched documents, we have been guided by four principles:

Clarity

We make all tag-names explicit, even at the expense of making them longer, so that the documents are so far as possible self-describing. We can revisit the names of the elements and attributes later in the project, making them shorter if necessary to improve efficiency; but for now, clarity is the overriding concern.

Structure

The record should consist of clearly delineated sub-records, each generated by a specific piece of software, and each further subdivided where appropriate.

Generality

We prefer to name elements and attributes according to what they are for rather than according to some implementation-specific detail such as the name of the particular program that generates them.

Simplicity

We introduce no more complexity into the record structure than is necessary to fulfil the first three principles.


Chapter 3. The Alvis Pipeline and The Family of Formats

Documents in the Alvis system go through several processes before entering the indexing engine, and this series of stages is known as the ``pipeline''.

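The Alvis document-processing pipeline looks like this: a Document Source produces a document in acquisition format; the Linguistic Analyser extends it to linguistic format; the relevance (Document Probability) package extends that to relevance format; and the result is fed to the Indexer.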

The metadata format for enriched documents described in this document, then, is really a family of three closely related formats, each building on the last: acquisition format, linguistic format and relevance format. These are each described in more detail in the following sections.


3.1. Acquisition Format

The first format - known as ``acquisition format'' - is a simple representation of a document broken into several semantically distinct parts (title, author, body of text, etc.), including information about the acquisition process. It does not seek to represent all the details of the original form of the document. (It may, however, carry with it a copy of the original document, as described below.)

The goal of the canonical document carried in acquisition format is to capture all and only the semantically significant content of the original document, and to put it in a single well-defined form that subsequent components of the pipeline can easily handle without needing to know anything about the vagaries of, for example, HTML. The task of comprehending the various input formats, then, is encapsulated neatly by the various Document Sources, and the other Alvis components need know nothing of them.


3.2. Linguistic Format

The second format - known as ``linguistic format'' - is an augmented version of acquisition format. That is, a linguistic document contains all the same XML elements and content as the acquisition document from which it was derived, but has additional elements describing the linguistic information extracted from the canonical document itself. This means that every acquisition document is also a linguistic document, but with zero linguistic information.

Linguistic documents are created by the Linguistic Analyser built in WP5 (Document Analysis and Normalization), and the linguistic document format is designed to capture the kind of information that WP5 generates.

Such information could be added to the document in one of two ways: either by embedding it within the core document, inserting additional XML elements to elucidate the structure; or by placing the additional information alongside the core document, with pointers indicating what parts of the core document each annotation pertains to. This second approach is known as ``stand-off annotation'', and this is the approach taken by the linguistic format.
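
The idea of stand-off annotation can be illustrated with a small Python sketch. This is an illustration of the concept only: the (start, end, label) triples, the character offsets and the label names are all hypothetical, and the representation actually used by the linguistic format is the one defined in Deliverable D5.1.

```python
# Core text, left exactly as it appears in the canonical document.
text = "T. rex could eat a tiger, easy."

# Hypothetical stand-off annotations: (start, end, label), where start and
# end are character offsets into `text`. They live alongside the text
# rather than being embedded in it.
annotations = [
    (0, 6, "species"),
    (19, 24, "animal"),
]

# Resolve each pointer back to the span of core text that it describes.
resolved = [(label, text[start:end]) for start, end, label in annotations]

for label, span in resolved:
    print(f"{label}: {span!r}")
```

Because the annotations only point into the text, they can be added, replaced or dropped without touching the core document.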


3.3. Relevance Format

The third format - known as ``relevance format'' - is an augmented version of linguistic format. That is, a relevance document contains all the same XML elements and content as the linguistic document from which it was derived, but has additional elements describing the relevance information extracted from the linguistic document. This means that every linguistic document is also a relevance document, but with zero relevance information (and every acquisition document is a relevance document, but with zero linguistic or relevance information).

Relevance documents are created by the relevance package built in WP2 (Document Probability Model), and the relevance document format is designed to capture the kind of information that WP2 generates.

Document Sources such as the harvester should create enriched documents consisting only of an acquisition section. The Linguistic Analyser should not alter the acquisition section of documents that pass through it in any way, only adding a linguistic section (or altering an existing linguistic section). The relevance package should not alter the acquisition or linguistic sections of documents that pass through it in any way, only adding a relevance section (or altering an existing relevance section).
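
The contract between pipeline stages described above can be sketched in Python as follows. This is a minimal illustration and not part of the Alvis software: the function name add_linguistic_section is hypothetical, the sample record is abbreviated, and the linguistic section is left empty because its content is specified in Deliverable D5.1.

```python
import copy
import xml.etree.ElementTree as ET

ACQUISITION_ONLY = """<documentRecord id="12345678">
  <acquisition>
    <canonicalDocument><section>Some text.</section></canonicalDocument>
  </acquisition>
</documentRecord>"""


def add_linguistic_section(record):
    """Return a copy of `record` with a <linguisticAnalysis> section.

    Hypothetical helper: the <acquisition> section is copied untouched,
    and an existing <linguisticAnalysis> section is replaced rather than
    duplicated.
    """
    enriched = copy.deepcopy(record)
    existing = enriched.find("linguisticAnalysis")
    if existing is not None:
        enriched.remove(existing)
    # Insert directly after <acquisition>, before any <relevance> section.
    position = list(enriched).index(enriched.find("acquisition")) + 1
    enriched.insert(position, ET.Element("linguisticAnalysis"))
    return enriched


record = ET.fromstring(ACQUISITION_ONLY)
before = ET.tostring(record.find("acquisition"))
enriched = add_linguistic_section(record)
after = ET.tostring(enriched.find("acquisition"))

# The acquisition section must come through unchanged.
assert before == after
print([child.tag for child in enriched])
```

A relevance stage could follow the same pattern, comparing both the acquisition and linguistic sections before and after its own work.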


Chapter 4. Syntactic Issues

Creation of a format such as this consists of two essentially independent decisions - one semantic and one syntactic. The former involves deciding what information needs to be in the documents; the latter, how that information is encoded.

The syntactic issues involved in representing enriched documents are the same as those for representing peer descriptions: choice of meta-format (XML, YAML, etc.) and choice of constraint language (DTD, XML Schema, etc.). Accordingly, the reader is referred to the discussion of these issues in the document describing that format: Alvis Task T3.1 - Peer-Description Metadata Format.

In accordance with that document's conclusions, enriched documents are represented in Alvis by XML documents, and the format of those documents is constrained by a DTD. The format is described and discussed in the remainder of this document, and the DTD itself is listed below.


Chapter 5. Discussion of the Canonical Format

In this chapter, we are concerned only with the canonical record itself: that is, the canonical representation into which source documents are transformed. We do not discuss here the remaining elements of the enriched document's <acquisition/> section: these are covered below.


5.1. Example Canonical Document

The following example canonical document serves as a motivating example for much of the discussion to follow.

<?xml version="1.0"?>
<!-- $Id: m3-2.html,v 1.1 2005-05-19 13:57:29 mike Exp $ -->
<canonicalDocument>
  <section title="Why Dinosaurs are Cool">
    Dinosaurs are much cooler than mammals because they are so much
    bigger.  T. rex could eat a tiger, easy.
    <section title="Saurischians">
      Saurischians include the sauropods (the biggest of all
      dinosaurs) and the theropods (the carnivorous dinosaurs).
    </section>
    <section title="Ornithischians">
      Ornithischians include Triceratops, Pachycephalosaurus,
      Stegosaurus, Ankylosaurus and Iguanodon.
    </section>
  </section>
  <section><!-- no title could be found for this section -->
    The coolest
    <ulink url="http://www.dinodata.net/">dinosaurs</ulink>
    of all were the sauropods.  They were way
    huge.  I mean, you may think it's a long way from the top of a
    giraffe to the bottom, but that's peanuts to a Brachiosaurus.
    The coolest sauropods of all were:
    <list>
      <item>Bruhathkayosaurus</item>
      <item>Amphicoelias fragillimus</item>
      <item>Brachiosaurus, which has the species:
        <list>
          <item>altithorax</item>
          <item>brancai</item>
          <item>?nougaredi</item>
        </list>
      </item>
      <item>
        <ulink url="http://www.snomnh.ou.edu/pdf/2000/00-27.pdf">
          Sauroposeidon
        </ulink>
      </item>
      <item>Migeod's mysterious M23 sauropod</item>
    </list>
  </section>
</canonicalDocument>

5.2. Intent of the Canonical Format

The intention of the <canonicalDocument> element is that it contains markup transformed from the original document where and only where that markup indicates document structure with some semantic significance. For example, if a Web harvester is preparing an HTML document that includes the following fragment:


Here is what <font face="Ariel">Holtz</font> says:
<blockquote>
  All those great coelurosaur fossils from Liaoning are a
  couple of lucky rolls of the taphonomic crap shoot.
</blockquote>
	
then the <font> tag, which serves only a cosmetic purpose in the HTML and conveys no information at all, must be discarded and may not be represented in the <canonicalDocument>. The <blockquote> tag, however, conveys real information, so we define our format such that it does not preclude preserving such information in the text that gets passed into the semantic workpackages. It is unlikely that such information will in fact be used in the short term, but we prefer to keep the door open for doing so later in the project.

Future versions of this format may preserve questionable markup such as italics, which can have semantic significance in some fields (e.g. indicating the use of a formal scientific genus or species name in biology).
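
The filtering decision described in this section can be sketched in Python using the standard html.parser module. This is a sketch of the decision only, under stated assumptions: the class name CosmeticStripper and the COSMETIC_TAGS set are hypothetical, only <font> is named as cosmetic in the text above, and the output is still HTML rather than canonical format.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: which tags count as purely cosmetic is a policy choice;
# only <font> is named in the text above.
COSMETIC_TAGS = {"font"}


class CosmeticStripper(HTMLParser):
    """Re-emit HTML, dropping cosmetic tags but keeping their text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in COSMETIC_TAGS:
            # Keep the structural tag exactly as it appeared in the source.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in COSMETIC_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.parts.append(escape(data, quote=False))


source = ('Here is what <font face="Ariel">Holtz</font> says:\n'
          '<blockquote>All those great coelurosaur fossils from Liaoning are '
          'a couple of lucky rolls of the taphonomic crap shoot.</blockquote>')
stripper = CosmeticStripper()
stripper.feed(source)
stripper.close()
result = "".join(stripper.parts)
print(result)
```

If a later version of the format chose to preserve italics, that policy change would amount to keeping the relevant tag out of the cosmetic set.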


5.3. Elements Included in the Canonical Format

In accordance with the principles expounded above, and with the need for simplicity at this early stage in the Alvis project, we define canonical format initially to be capable of representing the following structural elements as well as plain text:

Sections with Titles

A canonical document consists of a series of <section> elements, which may represent any logical division of a source document: chapters of a book, pages of an article, HTML <div> elements, etc.

Each <section> element may have a title attribute. This may be extracted from any suitable part of the source document: the <h1>...<h6> elements in HTML, lines with a large point-size in MS-Word documents, etc.

Each <section> contains a mixture of plain text, <list>s and <ulink> elements.

Example:

<section>
This is a simple section with
no contained lists or links.
</section>

Lists with Items

A list consists of a series of <item> elements. No distinction is made between bullet lists, numbered lists, etc.

Each <item> contains a mixture of plain text, <ulink>s and sublists, represented by <list> elements.

Example:

<list>
  <item>First entry</item>
  <item>Second entry</item>
  <item>Third entry</item>
</list>

Links with URLs

Links to other documents are represented by <ulink> elements. (The name is to avoid a clash with the unrelated <link> element in the <inlinks> and <outlinks> parts of the <links> section. The name is chosen because it's what the DocBook DTD uses for its analogous element.)

The URL of the linked document is specified by the <ulink> element's url attribute. The anchor text of the link is the content of the element.

Example:

<ulink url="http://google.com/">
  The Google search engine
</ulink>
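
Because the canonical format uses so few elements, a consumer can walk it with very little code. The following Python sketch, which uses an abbreviated version of the example canonical document, collects section titles, links and list items. The variable names are ours, and the whitespace handling is an assumption rather than something the format prescribes.

```python
import xml.etree.ElementTree as ET

CANONICAL = """<canonicalDocument>
  <section title="Why Dinosaurs are Cool">
    Dinosaurs are cool.
    <section title="Saurischians">Sauropods and theropods.</section>
  </section>
  <section>
    The coolest <ulink url="http://www.dinodata.net/">dinosaurs</ulink>:
    <list>
      <item>Brachiosaurus</item>
      <item>Sauroposeidon</item>
    </list>
  </section>
</canonicalDocument>"""

root = ET.fromstring(CANONICAL)

# Section titles in document order; untitled sections yield None.
titles = [section.get("title") for section in root.iter("section")]

# Every link as a (url, anchor text) pair.
links = [(link.get("url"), "".join(link.itertext()).strip())
         for link in root.iter("ulink")]

# Text of each item (including any sublists it contains), with runs of
# whitespace collapsed to single spaces.
items = [" ".join("".join(item.itertext()).split())
         for item in root.iter("item")]

print(titles)
print(links)
print(items)
```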


Chapter 6. Discussion of the Enriched Document

6.1. Example Enriched Document

The following example enriched document (actually a collection consisting of a single document) serves as a motivating example for much of the discussion to follow.

<?xml version="1.0" encoding="UTF-8"?>
<!-- $Id: m3-2.html,v 1.1 2005-05-19 13:57:29 mike Exp $ -->
<documentCollection>
  <documentRecord id="12345678">

    <!-- Information generated during the initial acquisition of the
    document, whether by a web crawler, MS-Word converter, etc. -->
    <acquisition>

      <!-- Information, in the current WP7 format, to do with the
      acquisition process: acquisition date, URLs where the document was
      found, expiry date, size, etc. -->
      <acquisitionData>
        <modifiedDate>2001-04-19</modifiedDate>
        <expiryDate>2004-11-06</expiryDate>
        <checkedDate>2004-10-06</checkedDate>
        <httpServer>WebSTAR/4.4(SSL) ID/72915</httpServer>
        <urls>
          <url>http://www.snomnh.ou.edu/pdf/2000/00-27.pdf</url>
        </urls>
      </acquisitionData>

      <!-- Original document represented as cleaned HTML, or text
      extracted from MSWord, PS, PDF, etc.  May be a binary format, with
      an attribute specifying base64 or quoted-printable encoding -->
      <originalDocument mimeType="text/plain" charSet="us-ascii">
        ...
      </originalDocument>

      <!-- Visible text from document, together with what internal
      structure we can express through canonical markup -->
      <canonicalDocument>
        <!-- As in previous example -->
      </canonicalDocument>

      <!-- Information, in the current WP7 format, that is _about_ the
      document rather than part of it: e.g., author, title, subject, DOI -->
      <metaData>
        <meta name="dc.author">Wedel, Mathew J.</meta>
        <meta name="dc.date">2000</meta>
        <meta name="dc.title">Sauroposeidon proteles, a new sauropod from
                the Early Cretaceous of Oklahoma</meta>
      </metaData>

      <!-- Link information from WP7 format. All URLs will contain an
      internal ID (not guaranteed to be unique across multiple crawler
      instances) -->
      <links>
        <outlinks> <!-- links to external pages -->
          <link type="a"> <!-- repeatable -->
            <anchorText>Text from this document</anchorText>
            <location documentId="...">URL</location>
          </link>
        </outlinks>
        <inlinks> <!-- links from external pages -->
          <link type="a"> <!-- repeatable -->
            <anchorText>PDF ( 1 MB)</anchorText>
            <location documentId="...">http://www.snomnh.ou.edu/publications/Articles/index.shtml</location>
          </link>
        </inlinks>
        <!-- Number of unique other hosts with links pointing to this page -->
        <inlinkHosts> ... </inlinkHosts>
      </links>

      <!-- Results of analysis done as part of the acquisition process,
      e.g. genre intuited from top-level domain name of the site from
      which a Web document was crawled -->
      <analysis>
        <!-- analysis also contains other analysed properties (mainly from
             the URL) with property name as tag and content as value -->
        <property name="topLevelDomain">edu</property>
        <property name="language">en</property>
        <property name="genre">article</property>
        <ranking scheme="..."> ... </ranking> <!-- repeatable -->
        <topic absoluteScore="150" relativeScore="570">
          <class>ALL</class>
        </topic>
        <topic absoluteScore="100" relativeScore="380">
          <class>CP</class>
          <terms>carnivorous plant[^\s]*, carnivor[^\s]*, </terms>
        </topic>
        <topic absoluteScore="50" relativeScore="190">
          <class>CP.Dionaea</class>
          <terms>flytrap[^\s]*, venus flytrap[^\s]*, </terms>
        </topic>
      </analysis>
    </acquisition>

    <!-- Annotations from WP5 -->
    <linguisticAnalysis>
      <!-- Details omitted: see Deliverable D5.1 -->
    </linguisticAnalysis>

    <!-- Relevance information added from WP2 -->
    <relevance>
      <scoreset type="ranking">
        <score topicId="1">8.36536</score>
        <score topicId="4">4.25395</score>
        <score topicId="19">0.44538</score>
        <score topicId="36">2.35349</score>
      </scoreset>
      <scoreset type="content">
        <score topicId="1">40.25395</score>
        <score topicId="4">2.947</score>
        <score topicId="17">0.44538</score>
        <score topicId="23">1.4629</score>
        <score topicId="36">2.35349</score>
      </scoreset>
    </relevance>
  </documentRecord>
</documentCollection>

6.2. Representing Multiple Documents in a Single Package

A <documentCollection> is merely a wrapper for several <documentRecord> objects, and has no information at all of its own: it's not a collection about some specific subject, or a collection of documents harvested from a particular place, or anything similar: it's just a simple, unstructured aggregate like a TAR archive or ZIP file.
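
Since a collection is a flat wrapper, processing it amounts to iterating over its records. The Python sketch below does so and reports, for each record, the most enriched section present. The function name richest_format is hypothetical, and the records are reduced to empty placeholder sections, so they would not validate against the DTD.

```python
import xml.etree.ElementTree as ET

COLLECTION = """<documentCollection>
  <documentRecord id="a1">
    <acquisition/>
  </documentRecord>
  <documentRecord id="b2">
    <acquisition/>
    <linguisticAnalysis/>
  </documentRecord>
  <documentRecord id="c3">
    <acquisition/>
    <linguisticAnalysis/>
    <relevance/>
  </documentRecord>
</documentCollection>"""


def richest_format(record):
    """Name the most enriched section present in a <documentRecord>."""
    if record.find("relevance") is not None:
        return "relevance"
    if record.find("linguisticAnalysis") is not None:
        return "linguistic"
    return "acquisition"


collection = ET.fromstring(COLLECTION)
formats = {record.get("id"): richest_format(record)
           for record in collection.findall("documentRecord")}
print(formats)
```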


6.3. Document Identifiers and Identity

Each document in a complete Alvis network is identified by an opaque, unique identifier, represented by the id attribute on the <documentRecord> element. This identifier must remain constant as the record takes on its various forms (acquisition, linguistic, relevance) during its journey through the Alvis pipeline. This identifier may be subsequently used to specify records for deletion or update.

Because Document Sources are the first components to handle the documents that are passed through the Alvis pipeline, it falls to them to allocate the identifiers. However, the identifier identifies the entire enriched document, in each of the forms that it takes, not just the <acquisition> section that is a Document Source's main responsibility. Accordingly, the id attribute is on the <documentRecord> element rather than <acquisition>.

Document Sources must choose identifiers that are, in the immortal words of RFC 1341, ``as unique as possible''. One possible mechanism for generating such globally unique IDs is for the peer maintainer to use an Internet domain-name that they own, and append a locally unique token. Another is simply to generate a long string of random bits. A third is to use an MD5 checksum of the document.
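
The three mechanisms can be sketched in Python as follows. The function names, the "/" separator between domain and token, the 16-byte default and the hexadecimal rendering are all assumptions made for illustration; the format itself only requires that the identifier be opaque and unique.

```python
import hashlib
import secrets


def id_from_domain(domain, local_token):
    """Domain owned by the peer maintainer plus a locally unique token."""
    return f"{domain}/{local_token}"


def id_from_random_bits(num_bytes=16):
    """A long string of random bits, rendered as hexadecimal."""
    return secrets.token_hex(num_bytes)


def id_from_md5(document_bytes):
    """MD5 checksum of the document's bytes, rendered as hexadecimal."""
    return hashlib.md5(document_bytes).hexdigest()


print(id_from_domain("example.org", "000042"))
print(id_from_random_bits())
print(id_from_md5(b"Dinosaurs are cool."))
```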

Warning

Using the MD5 checksum (or any checksum) has the property that two identical documents acquired at different times and by different Sources will have the same identifier, and will thus be, for Alvis purposes, ``the same record''. Is this property desirable or broken? It depends on what we mean by our notion of a record's identity.

When are two Alvis documents considered to be the same document? For identity purposes, an Alvis document is a specific sequence of bytes. This means that, given one document to compare with:

  • If a document consisting of the same sequence of bytes is found at another URL (or is acquired by another process altogether) then that is deemed to be The Same Document.

  • If the document is re-fetched from the same URL, and it includes a counter or a time indication, then it's a Different Document the second time (and each subsequent time).

  • If another document is identical except that a typo has been fixed, or a tab character replaced by eight spaces, then it's A Different Document.

  • If another document consists of exactly the same characters in the same order, but encoded in a different character set (e.g. UTF-8 vs. ISO-Latin-1) then it's A Different Document. (This is because Alvis considers documents to be identifiable as sequences of bytes rather than characters.)
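
The byte-level notion of identity in the cases above can be demonstrated with a short Python sketch. It assumes a checksum-based identifier (the hypothetical checksum_id function), and the sample text is invented for the demonstration.

```python
import hashlib


def checksum_id(document: bytes) -> str:
    """Identifier derived purely from the document's bytes."""
    return hashlib.md5(document).hexdigest()


sample = "Caf\u00e9\tnoir"                 # contains one non-ASCII character
original = sample.encode("utf-8")
mirror_copy = bytes(original)              # same bytes found elsewhere
tab_expanded = original.replace(b"\t", b" " * 8)
latin1 = sample.encode("iso-8859-1")       # same characters, other charset

# Same bytes: the same document, wherever it was found.
assert checksum_id(original) == checksum_id(mirror_copy)

# A tab replaced by eight spaces: a different document.
assert checksum_id(original) != checksum_id(tab_expanded)

# Same characters in a different character set: a different document.
assert checksum_id(original) != checksum_id(latin1)

print("identity follows bytes, not characters")
```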


6.4. The <acquisition> Section

6.4.1. Acquisition Data

The <acquisitionData> subsection describes the process by which the document was acquired, rather than the acquired document itself. It consists of the following elements, in the specified order.

<modifiedDate> (mandatory)

The date and time at which the document was last acquired, the most recent version superseding any previous versions. Like all other datestamps in the enriched record format, it must be provided in accordance with the ISO 8601 specification, which allows (among others) the following formats:

  • 1998 - year only.

  • 1998-03 - year and month.

  • 1998-03-18 - year, month and day.

  • 1998-03-18 03 - year, month, day and hours.

  • 1998-03-18 03:28 - year, month, day, hours and minutes.

  • 1998-03-18 03:28:12 - year, month, day, hours, minutes and seconds.

<expiryDate> (optional)

The date at which the record's currency expires, so that the Document Source must revisit the document at its original location to verify that it has not changed.

<checkedDate> (optional)

The date at which the document was last checked at its original location, to determine that it had not changed since being acquired.

<httpServer> (optional)

For Document Sources that work by crawling the Web, this provides a way to indicate the server software, e.g. Apache/2.0.40 (Red Hat Linux).

<urls> (mandatory)

For Document Sources that work by crawling the Web, each of the potentially many URLs where the document was found is recorded inside a <url> element, of which there may be any number within <urls>.
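
The datestamp layouts listed under <modifiedDate> above can be read with a small helper. The following Python sketch tries each of the six listed layouts in turn. The function name parse_datestamp is hypothetical, and the choice to fill missing fields with the start of the period is an assumption, not something the format specifies.

```python
from datetime import datetime

# The six layouts listed above, most specific first.
FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d %H",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
]


def parse_datestamp(text):
    """Parse a datestamp in any of the listed layouts.

    Missing fields default to the start of the period, so "1998-03"
    becomes 1 March 1998 at 00:00:00. Raises ValueError if no layout
    matches.
    """
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised datestamp: {text!r}")


for stamp in ["1998", "1998-03", "1998-03-18", "1998-03-18 03:28:12"]:
    print(stamp, "->", parse_datestamp(stamp).isoformat())
```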


6.4.2. Original Document

The <originalDocument> element contains a copy of the original document, perhaps compressed and encoded. This original document is not used in subsequent analysis, but only for delivery to users. Accordingly, it may be in any format, including binary formats such as PDF (so long as it is suitably encoded).

In general, this element should contain a byte-for-byte copy of the document as it was originally acquired. However, for some formats, it may be beneficial to use a transformed version of the document: for example, the non-conformant HTML found on many web pages can profitably be cleaned up using a tool such as HTML Tidy.

The <originalDocument> element has the following attributes:

mimeType (mandatory)

The type of the original document, chosen from the controlled list maintained by IANA at http://www.iana.org/assignments/media-types/index.html. Example values include text/plain, text/xml, text/html, application/pdf and application/msword.

charSet (mandatory)

The character set used by the original document, chosen from the controlled list maintained by IANA at http://www.iana.org/assignments/character-sets. Example values include UTF-8, ISO-8859-1, and US-ASCII.

Note well that this attribute indicates the character set used by the included (possibly compressed and/or encoded) document - not that of the document that it's included in, which is specified by the XML declaration.

compression (optional)

Indicates that the document has been compressed from its original form to yield the sequence of bytes that form the content of the <originalDocument> element. This attribute may take the following values:

deflate

The document is compressed using the DEFLATE Compressed Data Format Specification version 1.3, as specified in RFC 1951 (May 1996).

gzip

The document is compressed using the GZIP file format specification version 4.3, as specified in RFC 1952 (May 1996).

encoding (optional)

Indicates that the (perhaps compressed) document has been encoded to protect its content from being interpreted as XML markup. This is necessary if the content contains either of the characters < or &. This attribute may take the following values:

quoted-printable

The document is encoded using the Quoted-Printable transfer encoding, as specified in section 6.7 of RFC 2045 (November 1996).

Note that use of the Quoted-Printable encoding does not in itself guarantee that the encoded document is safe to embed as text in an XML document. The particular Quoted-Printable implementation needs to ensure that, among any other translations that it does, it translates < to =3C and & to =26.

base64

The document is encoded using the Base64 transfer encoding, as specified in section 6.8 of RFC 2045 (November 1996).

xml

The document is encoded using the XML escapes &amp; for &, &lt; for <, and &gt; for >.

When this encoding is used, some provision must also be made for characters such as ESCAPE (ASCII 0x1b) which simply cannot be represented in XML - not even as numeric entities such as &#x1b;. See the XML 1.0 specification, section 2.2 (Characters) at http://www.w3.org/TR/2004/REC-xml-20040204/#charsets for details of which characters are acceptable. Since XML does not allow characters such as ESCAPE to be represented, there may be no realistic alternative other than to discard them, which for some formats may be disastrous. For this reason, we recommend that one of the other encodings be used in place of xml.

Example:

<originalDocument mimeType="application/msword" charSet="utf-8"
	compression="gzip" encoding="base64">
H4sICDwAXEECA3RleHQuZG9jAO19C5xcRZlvdRKS4ZEQAoSAEdoYYRI6Qx6TyQOuy2QmIQkJGTLh
/ZAzMz2ZJjPdQ3fPhGFZFjG+EBBZBJZFRS4quMCPdV1EdL1cFlG5XNd1uVzkclcX0SuK3BhZVlwg
+6+vvjpVdU6d090TUMIv3flP+nSfU6fqq6+++l5V5wf/eMhPPv83R/2riLxOFBPFG7v3F5Ot7zLA
...
</originalDocument>
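
The combination used in the example above, gzip compression followed by Base64 encoding, can be sketched in Python with the standard library. The function names wrap_original and unwrap_original are hypothetical, and the sketch handles only this one combination of compression and encoding values.

```python
import base64
import gzip
import xml.etree.ElementTree as ET


def wrap_original(document, mime_type, char_set):
    """Build an <originalDocument> holding `document` (bytes),
    gzip-compressed and then Base64-encoded."""
    element = ET.Element("originalDocument", {
        "mimeType": mime_type,
        "charSet": char_set,
        "compression": "gzip",
        "encoding": "base64",
    })
    element.text = base64.b64encode(gzip.compress(document)).decode("ascii")
    return element


def unwrap_original(element):
    """Recover the original bytes: undo the encoding, then the compression."""
    data = (element.text or "").encode("ascii")
    if element.get("encoding") == "base64":
        data = base64.b64decode(data)  # line breaks in the text are ignored
    if element.get("compression") == "gzip":
        data = gzip.decompress(data)
    return data


# Content that would be unsafe to embed directly: markup characters and ESCAPE.
document = b"<html>Tom & Jerry</html>\x1b"
element = wrap_original(document, "text/html", "us-ascii")

# Serialise and re-parse to show that the payload survives a trip through XML.
reparsed = ET.fromstring(ET.tostring(element))
assert unwrap_original(reparsed) == document
print(reparsed.attrib)
```

Compressing before encoding matters: the Base64 step is what makes the compressed bytes safe to carry as XML text.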


6.4.3. Canonical Document

This section contains the canonical document itself, as described above.


6.4.4. Metadata

This section contains information about the document, as opposed to the content of the document. Document Sources are at liberty to acquire this information in any way that works. Examples include:

  • If the document is acquired from a database, then a metadata record associated with the document may be harvested from the database along with it.

  • If the document is acquired from a simple filesystem, some metadata may be harvested from the attributes of the file, e.g. date of last modification.

  • If the document is acquired from a complex filesystem, additional metadata may be available. For example, in Apple filesystems, there is a ``resource fork'' corresponding to the ``data fork'' that contains the actual document.

  • Documents in some formats may carry their own metadata with them: for example, HTML documents harvested from the Web usually provide a <title> element, and additional metadata is often available in <meta> tags indicating the author, keywords, description, etc.

Whatever the source of the metadata, Document Sources should express it in simple name=value pairs using <meta> tags within the <metaData> section, like this:

<meta name="dc.author">Wedel, Mathew J.</meta>
<meta name="dc.date">2000</meta>

The valid values of the name attribute are the names of the fifteen Dublin Core Simple elements, as listed and described at http://dublincore.org/documents/dces/
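
A Document Source might assemble the <metaData> section from whatever pairs it has gathered. The Python sketch below shows one way to do so. The function name build_metadata is hypothetical, and the names and values are those used in the example above.

```python
import xml.etree.ElementTree as ET


def build_metadata(pairs):
    """Build a <metaData> element from (name, value) pairs, in order."""
    element = ET.Element("metaData")
    for name, value in pairs:
        meta = ET.SubElement(element, "meta", {"name": name})
        meta.text = value
    return element


metadata = build_metadata([
    ("dc.author", "Wedel, Mathew J."),
    ("dc.date", "2000"),
])
serialised = ET.tostring(metadata, encoding="unicode")
print(serialised)
```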


6.4.5. Links

The acquisition record may contain a <links> section which provides an indication of both the inbound and outbound links for this core document. The former are held within an <inlinks> container, the latter in an <outlinks> container.

Outbound links are easy to discover by static analysis of the document. By contrast, a comprehensive list of inbound links can be obtained only by analysing an entire corpus; and such a list can only be exhaustive with respect to a specific corpus, since it is always possible that there is another document somewhere in the world that links to it.

Both inbound and outbound links are represented by the same structure: a <link> element with a type attribute, containing <anchorText> and <location> subelements. The type attribute takes a value indicating the kind of link. The possible values for this attribute are taken from the corresponding HTML tags as follows:

a

A conventional hypertext anchor.

img

An embedded image that is part of a page.

frame

An embedded frame that is part of a frameset.
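
Because inbound and outbound links share one structure, a single routine can read either container. The Python sketch below illustrates this. The function name read_links is hypothetical, and the documentId values and example.org URLs are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

LINKS = """<links>
  <outlinks>
    <link type="a">
      <anchorText>Text from this document</anchorText>
      <location documentId="42">http://example.org/target.html</location>
    </link>
    <link type="img">
      <anchorText></anchorText>
      <location documentId="43">http://example.org/figure.png</location>
    </link>
  </outlinks>
  <inlinks>
    <link type="a">
      <anchorText>PDF ( 1 MB)</anchorText>
      <location documentId="44">http://www.snomnh.ou.edu/publications/Articles/index.shtml</location>
    </link>
  </inlinks>
</links>"""


def read_links(links, direction):
    """Return (type, anchor text, URL) for each <link> in the <inlinks>
    or <outlinks> container named by `direction`."""
    container = links.find(direction)
    if container is None:
        return []
    return [(link.get("type"),
             link.findtext("anchorText", default=""),
             link.findtext("location", default=""))
            for link in container.findall("link")]


links = ET.fromstring(LINKS)
outbound = read_links(links, "outlinks")
inbound = read_links(links, "inlinks")
print(outbound)
print(inbound)
```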


6.4.6. Analysis

The acquisition record may contain an <analysis> section which contains the results of simple pre-processing done on the data by the harvester. This information may serve as a guide for the more sophisticated analysers later in the Alvis pipeline.

The analysis section may contain the following subelements, all of them optional and repeatable:

<property>

Specifies the value of any named property of the document: for example, if the language of the document is known to be English, this can be specified using <property name="language">en</property>

<ranking>

Specifies the ranking of the document under a specific scheme named by the scheme attribute. For example, <ranking scheme="abc">42</ranking>

<topic>

Indicates whether or not the document belongs to a specific topic.

The <class> subelement is a topic or sub-topic specifier (subject class notation) indicating the specific topic for this Web-page. It often comes from a hierarchical classification system such as Engineering Index.

The two attributes, absoluteScore and relativeScore, indicate the document's score in the specified topic, as assigned by the Document Source, the latter being normalized by the size of the document. The scores are calculated as described in the milestone document MS7.1 and are based on the terms indicated by regular expressions in the <terms> subelement.
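To make the shape of the <analysis> section concrete, the sketch below parses a small sample in Python. The <property> and <ranking> lines are the ones given above; the <topic> element, its class name, its term expression and both score values are invented for illustration, and read_analysis is a hypothetical helper. Element and attribute names follow the DTD in Chapter 7.

```python
import xml.etree.ElementTree as ET

# Sample <analysis> section. The <topic> content and scores are invented.
ANALYSIS_XML = """
<analysis>
  <property name="language">en</property>
  <ranking scheme="abc">42</ranking>
  <topic absoluteScore="12" relativeScore="0.3">
    <class>example-class</class>
    <terms>sauropod.*</terms>
  </topic>
</analysis>
"""

def read_analysis(xml_text):
    """Return (properties, rankings, topics) from an <analysis> section."""
    root = ET.fromstring(xml_text)
    properties = [(p.get("name"), p.text) for p in root.findall("property")]
    # Ranking values are kept as text: their type depends on the scheme.
    rankings = [(r.get("scheme"), r.text) for r in root.findall("ranking")]
    topics = [
        {
            "class": t.findtext("class"),
            "terms": t.findtext("terms"),  # None if <terms> is omitted
            "absoluteScore": float(t.get("absoluteScore")),
            "relativeScore": float(t.get("relativeScore")),
        }
        for t in root.findall("topic")
    ]
    return properties, rankings, topics

properties, rankings, topics = read_analysis(ANALYSIS_XML)
print(properties, rankings, topics)
```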


6.5. The <linguisticAnalysis> Section

The <linguisticAnalysis> section follows the form described in Deliverable D5.1, Report on method and language for the production of the augmented document representations. That document both describes the format of this element and includes a DTD formally specifying it.


6.6. The <relevance> Section

The <relevance> section describes the results of WP2's document probability calculations as a sequence of <scoreset> subsections, each of which provides scores for a set of topics. Each <scoreset> has a type attribute specifying its applicability. Types likely to appear in relevance score-sets include:

ranking

Topic-sensitive authority score that combines the topical relevance of the document content with the topical relevance of documents that link to it. "Authority" in this sense implies that other documents on the topic link to it, and thus it is considered important for the topic.

content

Topic scores based on the relevance of the document content.

Each <scoreset> element contains zero or more <score> elements, each of which carries a topicId attribute and contains a floating-point number measuring the score of the identified topic according to the document probability model. The form of the topicId attribute remains controversial: either an opaque numeric or symbolic identifier may be used, or a short phrase identifying the topic in a human-readable way.
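A consumer of the <relevance> section might collect the scores as in the following Python sketch. The topic identifiers and score values are invented, the opaque numeric form of topicId is only one of the two forms under discussion, and read_relevance is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical <relevance> section: topic ids and scores are invented.
RELEVANCE_XML = """
<relevance>
  <scoreset type="ranking">
    <score topicId="17">0.25</score>
    <score topicId="42">0.5</score>
  </scoreset>
  <scoreset type="content">
    <score topicId="17">0.75</score>
  </scoreset>
</relevance>
"""

def read_relevance(xml_text):
    """Return {scoreset type: {topicId: score}} for a <relevance> section."""
    root = ET.fromstring(xml_text)
    # Assumption: each scoreset type occurs at most once. If a type were
    # repeated, the later scoreset would replace the earlier one here.
    return {
        scoreset.get("type"): {
            score.get("topicId"): float(score.text)
            for score in scoreset.findall("score")
        }
        for scoreset in root.findall("scoreset")
    }

scores = read_relevance(RELEVANCE_XML)
print(scores)
```

Topic identifiers are kept as strings so that the same code would work whether they turn out to be numeric, symbolic or short phrases.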

In principle, this section of a relevance-format enriched document carries information analogous to that generated at harvesting time and encoded in <topic> elements within the <acquisition> section's <analysis> subsection. The Document Source and Document Probability software components use different approaches to trying to determine the topic of a document. For example, the subject-specific web crawler developed in WP7 is based on an explicit topic definition (ontology), while the WP2 software's relevance figures are based on statistical modeling of documents in a collection.


Chapter 7. DTD for Enriched Documents

<!-- $Id: m3-2.html,v 1.1 2005-05-19 13:57:29 mike Exp $ -->

<!-- This DTD prescribes the format of Alvis enriched document records -->


<!ELEMENT documentCollection (documentRecord*)>


<!ELEMENT documentRecord (acquisition, linguisticAnalysis?, relevance?)>
<!ATTLIST documentRecord id CDATA #REQUIRED>


<!ELEMENT acquisition (acquisitionData, originalDocument?, canonicalDocument,
                       metaData?, links?, analysis?)>

<!ELEMENT acquisitionData (modifiedDate, expiryDate?, checkedDate?,
                           httpServer?, urls)>
<!ELEMENT modifiedDate (#PCDATA)>
<!ELEMENT expiryDate (#PCDATA)>
<!ELEMENT checkedDate (#PCDATA)>
<!ELEMENT httpServer (#PCDATA)>
<!ELEMENT urls (url*)>
<!ELEMENT url (#PCDATA)>

<!ELEMENT originalDocument (#PCDATA)>
<!-- The "encoding" attribute may be "base64" or "quoted-printable" -->
<!ATTLIST originalDocument mimeType CDATA #REQUIRED
                           charSet CDATA #REQUIRED
                           compression CDATA #IMPLIED
                           encoding CDATA #IMPLIED>
<!-- originalDocument.mimeType chosen from IANA's list -->
<!-- originalDocument.charSet chosen from IANA's list -->
<!-- originalDocument.compression may take the following values:
        "deflate", "gzip" -->
<!-- originalDocument.encoding may take the following values:
        "quoted-printable", "base64", "xml" -->

<!ELEMENT canonicalDocument (section*)>
<!ELEMENT section (#PCDATA|list|ulink|section)*>
<!ATTLIST section title CDATA #IMPLIED>
<!ELEMENT list (item*)>
<!ELEMENT item (#PCDATA|list|ulink)*>
<!ELEMENT ulink (#PCDATA)>
<!ATTLIST ulink url CDATA #IMPLIED>

<!ELEMENT metaData (meta*)>
<!ELEMENT meta (#PCDATA)>
<!ATTLIST meta name CDATA #REQUIRED>
<!-- meta.name may take values chosen from the Dublin Core element set -->

<!ELEMENT links (outlinks?, inlinks?, inlinkHosts?)>
<!ELEMENT outlinks (link*)>
<!ELEMENT inlinks (link*)>
<!ELEMENT inlinkHosts (#PCDATA)>
<!ELEMENT link (anchorText?, location)>
<!ATTLIST link type CDATA #REQUIRED>
<!-- link.type may take the following values: "a", "img", "frame" -->
<!ELEMENT anchorText (#PCDATA)>
<!ELEMENT location (#PCDATA)>
<!ATTLIST location documentId CDATA #IMPLIED>

<!ELEMENT analysis (property*, ranking*, topic*)>
<!ELEMENT property (#PCDATA)>
<!ATTLIST property name CDATA #REQUIRED>
<!ELEMENT ranking (#PCDATA)>
<!ATTLIST ranking scheme CDATA #REQUIRED>
<!ELEMENT topic (class, terms?)>
<!ATTLIST topic absoluteScore CDATA #REQUIRED
                relativeScore CDATA #REQUIRED>
<!ELEMENT class (#PCDATA)>
<!ELEMENT terms (#PCDATA)>


<!ELEMENT linguisticAnalysis (#PCDATA)>
<!-- Details omitted: see Deliverable D5.1 -->


<!ELEMENT relevance (scoreset*)>
<!ELEMENT scoreset (score*)>
<!ATTLIST scoreset type CDATA #REQUIRED>
<!ELEMENT score (#PCDATA)>
<!ATTLIST score topicId CDATA #REQUIRED>
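
As a rough illustration of the smallest record this DTD admits, the Python sketch below builds a <documentCollection> holding one <documentRecord> with only the required elements: <acquisitionData> (with <modifiedDate> and <urls>) followed by <canonicalDocument>. The identifier, date string, URL and text are invented placeholders whose formats are assumptions of this example, and minimal_record is a hypothetical helper. The sketch does not validate against the DTD; it only emits the elements in the order the DTD requires.

```python
import xml.etree.ElementTree as ET

def minimal_record(doc_id, modified, url, text):
    """Build a <documentRecord> containing only the required elements."""
    record = ET.Element("documentRecord", {"id": doc_id})
    acquisition = ET.SubElement(record, "acquisition")

    # acquisitionData: modifiedDate and urls are the only mandatory children.
    data = ET.SubElement(acquisition, "acquisitionData")
    ET.SubElement(data, "modifiedDate").text = modified
    urls = ET.SubElement(data, "urls")
    ET.SubElement(urls, "url").text = url

    # canonicalDocument must follow acquisitionData (originalDocument omitted).
    canonical = ET.SubElement(acquisition, "canonicalDocument")
    ET.SubElement(canonical, "section").text = text
    return record

collection = ET.Element("documentCollection")
# All four argument values are placeholders invented for this sketch.
collection.append(
    minimal_record("doc-1", "2004-12-01", "http://example.org/", "Hello.")
)
xml_text = ET.tostring(collection, encoding="unicode")
print(xml_text)
```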