Alvis Annual Meeting, 27th January 2004

0. Contents

0. Contents
1. Call to Order
2. Participants
3. Approval of the Agenda
4. Year-End Documents
5. Preparation for Review
6. WP4 Proposals
        6.1. What's to come
        6.2. Two scenarios
        6.3. How it all fits together
        6.4. Discussion
        6.5. My modest proposal
7. Demos
8. Architecture Clarification
9. Financials (Mikko)

4. Year-End Documents

This is a periodic reporting requirement, similar to the bi-monthly reports but bigger.

Martin says that this report should be a synthesis of all the bi-monthly reports already submitted. He makes the point that this is a management activity rather than a scientific one, so scientific partners should not be heavily loaded by this work.

A long discussion follows about exactly what figures Mikko is going to need in order to prepare the financials for this report. I assume that all this will be summarised in email, so I am not taking notes.

5. Preparation for Review

In the review of deliverables and milestones, two of our jobs are relevant, both milestones.

I said that M3.2 is nearly complete, and that it is waiting on two things. The simpler is contributions on the representation of WP2's relevance analysis: Wray and I will thrash this out in ``quarter of an hour'' (his estimate) over a beer. More significantly, we need to warm up the discussions on representing the linguistic analysis results again. This must be done in consultation with WP5 and WP6; it stalled when Thierry stopped working on Alvis.
I pointed out that M3.3 is a software release, not a document. I claimed that it is essentially complete; Wray says he needs to apply a ``five-line'' patch to Zebra to make it so (increasing the relevance-scoring range from 1-100 to 1-1000). Marc has written most of a stub document to accompany this release.

Mikko on financial reporting: nothing new here, I think.

[Lots of discussion of reviewers' comments. Nothing new except what I discuss below.]

One of the reviewers, who is clearly a Semantic Web believer, seems to think that Alvis should use RDF. What for? This is not clear: in fact, it appears that he hasn't made any actual proposal, and just thinks that RDF is a good thing that should be used, well, everywhere. Unsurprisingly, I think this is stupid. Karl agrees, but otherwise the general sentiment of the meeting seems to be that we may as well give the reviewer what he wants ... if he can figure out what the heck that is.

6. WP4 Proposals

6.1. What's to come

Model for P2P full-text retrieval scenario
Introduction: recap
Architecture
Distributed inverted file
Highly Discriminative Keys (HDK) indexing

6.2. Two scenarios

Bootstrap case: some institutions create an Alvis network from scratch. Several peers with document collections join at the same time to efficiently construct a distributed index.
Maintenance case: a single new peer joins an existing network.

6.3. How it all fits together

[Recap of four layer architecture: transport layer, structured overlay network, document and content management, retrieval models.]

[Summary of how ``inverted files'' are used in conventional IR].

In a P2P distributed system, each peer is responsible for certain terms and their postings lists. Query terms are forwarded to the responsible peer, which returns the corresponding postings list. The association of a given query term with a particular peer is done using some kind of magic hash function with load balancing and suchlike.
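To make the term-to-peer routing concrete, here is a minimal sketch in Python. It is only an illustration, not EPFL's design: the names `responsible_peer` and `Network` are hypothetical, and a plain hash-modulo assignment stands in for the ``magic hash function'', with no load balancing and no real overlay network.

```python
import hashlib
from collections import defaultdict


def responsible_peer(term, peers):
    """Pick the peer responsible for a term by hashing the term.

    Assumption: simple hash-modulo placement; a real overlay would use a
    DHT with load balancing instead.
    """
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return peers[int.from_bytes(digest[:8], "big") % len(peers)]


class Network:
    """Toy in-memory stand-in for a set of peers sharing one inverted file."""

    def __init__(self, peers):
        self.peers = list(peers)
        # Each peer stores the postings lists of the terms it is responsible for.
        self.store = {peer: defaultdict(list) for peer in self.peers}

    def index(self, doc_id, text):
        """Send each distinct term of a document to its responsible peer."""
        for term in set(text.lower().split()):
            peer = responsible_peer(term, self.peers)
            self.store[peer][term].append(doc_id)

    def query(self, terms):
        """Fetch one postings list per query term and intersect them."""
        result = None
        for term in terms:
            peer = responsible_peer(term, self.peers)
            postings = set(self.store[peer].get(term, []))
            result = postings if result is None else result & postings
        return sorted(result or [])
```

Note that every query term costs one full postings list shipped from its responsible peer, which is exactly where long lists hurt.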

One problem with this approach is that postings lists can be very long (e.g. for ``the''). An EPFL study has shown that this is not feasible. The proposed solution is to limit the lengths of postings lists by combining terms. A threshold size is set, defining the longest acceptable postings list. If the list for a given term becomes larger than this, then its postings list is discarded and instead a list is created for the combination of the term with one or more others. A set of n such terms indexed in conjunction is called a key. This is HDK indexing.
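A rough sketch of the thresholding idea, as I understood it from the presentation. The function name `build_hdk_index` and its parameters are hypothetical, and two things are my own simplifying assumptions: ``context'' is taken to mean the whole document, and any key that is still too long at the maximum key size is simply dropped.

```python
from collections import defaultdict


def build_hdk_index(docs, max_postings, max_key_size=2):
    """Build a toy HDK-style index.

    docs maps doc_id -> text. Returns {key: sorted doc ids}, where a key is
    a frozenset of terms. Assumption: terms are combined when they co-occur
    anywhere in the same document.
    """
    doc_terms = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

    # Start with single-term candidate keys.
    candidates = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            candidates[frozenset([term])].add(doc_id)

    index = {}
    for size in range(1, max_key_size + 1):
        too_long = {}
        for key, postings in candidates.items():
            if len(postings) <= max_postings:
                index[key] = sorted(postings)  # short enough: keep it
            else:
                too_long[key] = postings       # over the threshold: discard
        if size == max_key_size:
            break  # keys still too long at this point are dropped entirely
        # Replace each over-long key by its combinations with co-occurring terms.
        candidates = defaultdict(set)
        for key, postings in too_long.items():
            for doc_id in postings:
                for term in doc_terms[doc_id] - key:
                    candidates[key | {term}].add(doc_id)
    return index
```

With three documents that all contain ``the'' and a threshold of 2, the single-term list for ``the'' disappears and is replaced by pair keys such as {the, zebra}, each with a list no longer than the threshold.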

How do you decide which terms to combine into keys? One clue is to use terms that appear in the same context in a document. [Note: this requires knowledge of an entire corpus.] This limits the number of keys to be linearly related to the number of terms.

Warning: dark arts ahead. What happens if you search for a combination of terms that is too common to have a postings list, such as ``britney spears''? The proposed solution is ``distributional semantics'': an association matrix makes probabilistic connections between terms. This is used to find terms that are likely to be related to term1 and term2. [My comment: huh?]

Important action point. Martin agrees that we should be able to play with this if we're going to sign up to the approach. Accordingly, Wray will send him a selection of about 7000 randomly chosen Wikipedia articles which he will roll into an HDK index that we can all search over the web. That way, we can reassure ourselves that this really does work, using real queries.

EPFL's presentation includes graphs showing impressive characteristics of the HDK approach, indicating query recall very similar to that of simple single-term indexing and postings bandwidth between 1/10 and 1/5 as great.

Future work includes:

First version of the prototype by 31st March 2005
Integration of other use cases into the architecture
Integration of other modules and activities: probabilistic document modeling, crawler, language processing.

6.4. Discussion

Claire or Adeline asked how this approach integrates the probability values calculated by Wray's module. The reply is complex (I don't really understand it) and doesn't seem to be convincing people.

I asked how they see the WP3 indexing engine contributing to the EPFL framework. The answer was even less convincing, and amounts to ``it doesn't matter, we'll find a way to wedge it in somewhere around the edges''.

[Later:] Unbelievable. Just unbelievable.

Tentative conclusion for the demo in March: we should just demonstrate Zebra's ability to quickly index and search large collections of documents. Most usefully, these will be documents generated by the NLP, including linguistic markup from WP5 and document probability measures from WP2. We should let EPFL do their own demonstration.

6.5. My modest proposal

The WP7->WP5/6->WP2 pipeline that produces enriched XML is considered to be atomic. Then what happens to the XML that's produced? It can be passed into either one of two competing P2P networks. One of them uses a single distributed index, implemented by WP4's P2P techniques. The other uses a P2P network of autonomous nodes implemented by WP3, trickling queries between them. Both of these networks can also be queried using the same protocol - probably a simple HTTP GET query. Then:

The two P2P approaches can compete against each other to determine experimentally which works best in which circumstances, e.g. which is best for complex queries, which scales best.
The WP3 indexer can feed its input into the WP4 distributed indexer to facilitate its work.
The WP4 index can be used as a selection phase to determine which are the most promising WP3 peers to ask about each given query.
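The last of these points could look something like the following. This is only a sketch of the idea, not anything agreed at the meeting: `select_peers`, `routed_search`, the `term_to_peers` mapping (standing in for a lookup in the WP4 distributed index) and the per-peer search callables (standing in for queries sent to WP3 nodes) are all hypothetical.

```python
from collections import Counter


def select_peers(query_terms, term_to_peers, k=2):
    """Rank peers by how many query terms the shared index says they hold.

    term_to_peers is a hypothetical stand-in for a WP4 index lookup:
    it maps a term to the set of peer names holding documents with it.
    """
    votes = Counter()
    for term in query_terms:
        for peer in term_to_peers.get(term, ()):
            votes[peer] += 1
    # Most matching terms first; ties broken by peer name for determinism.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [peer for peer, _ in ranked[:k]]


def routed_search(query_terms, term_to_peers, peer_search, k=2):
    """Ask only the k most promising peers and concatenate their hits.

    peer_search maps a peer name to a callable taking the query terms and
    returning a list of hits (a stand-in for a real query to that node).
    """
    hits = []
    for peer in select_peers(query_terms, term_to_peers, k):
        hits.extend(peer_search[peer](query_terms))
    return hits
```

The point of the design is that the distributed index never has to return documents itself; it only narrows down which autonomous nodes are worth asking.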

This means that the project is doing valuable competitive research, and that it can investigate the interactions between different P2P approaches. It also gives Alvis a backup plan, so that if either P2P approach turns out to perform unsatisfactorily, the success of the other would constitute the success of the project.

Here is a photo of the illustration that I drew on the blackboard.

7. Demos

Hang Guo - TU
Claire Nedellec - INRA
Wray Buntine - HUT

Not really relevant to us.

8. Architecture Clarification

Martin presents an elaboration of the architecture diagram that I presented yesterday. It's ... complicated. But I think it's all presentational complexity deriving from an academic perspective, rather than intrinsic complexity that we have to actually implement.

9. Financials (Mikko)

We still owe Mikko a Form C and an Audit Certificate.