Talis/Index Data meeting

24th November 2004

1. Who's Who?
        1.1. Steve Pile
        1.2. Justin Leavesley
        1.3. Richard Wallis
        1.4. Rob Styles
        1.5. Mike Taylor
        1.6. Adam Dickmeiss
        1.7. Dave Errington
2. Zebra - Adam
        2.1. Overview
        2.2. Record Storage
        2.3. Performance
        2.4. Searching
        2.5. Integration
3. ZiNG - Mike
4. TalisBase Past and Present - Rob
        4.1. The Current System: TalisBase
        4.2. TalisBase 1.5
        4.3. TalisBase 2
5. Index Data the Company
6. Talis the Company
        6.1. Overview
        6.2. Internal Development Projects
                6.2.1. Project Bigfoot
                6.2.2. Project Silkworm
                6.2.3. Project Nimbus
                6.2.4. Project Bluebird
7. Some ideas

1. Who's Who?

	Steve		Justin		Richard
	Pile		Leavesley	Wallis
	(morning)

	+---------------------------------------+
	|					|
	|					|
	|					|
	|					|
	+---------------------------------------+

	Adam		Mike		Rob		Dave
	Dickmeiss	Taylor		Styles		Errington
							(afternoon)

1.1. Steve Pile

Has worked for Talis for 25 years. Has done some Z39.50 work, went to a ZIG in Boston Spa. Is looking at Microsoft technologies such as SOAP, .NET, C#.

1.2. Justin Leavesley

Has been working for Talis for four years. Heavily involved in building Prism, the federated search tool.

1.3. Richard Wallis

Twelve or thirteen years at Talis. Involved in designing and building the library management system; architect of Prism. Now in consultancy division, spreading the word on integration and web services to the customers. Wanting to set the scene for Justin's team to actually do the integration. Richard has also worked in the metadata team as technical lead on early TalisBase.

1.4. Rob Styles

Technical lead for metadata team, in charge of TB1.5 and TB2. Has worked for Talis for six weeks. Background is in Internet banking. The joys of MARC and Z39.50 are newish to him, but he's read some of the YAZ source. Enjoying it! Libraries are more interesting than banks.

1.5. Mike Taylor

[Who I am]

1.6. Adam Dickmeiss

Started Index Data in 1994 with Sebastian Hammer. Studied electrical engineering at the University of Denmark, but was mostly interested in algorithms and optimising compilers. Very glad that he can still code in a way that aims for performance, which is an unusual goal these days! Index Data has specialised in Z39.50, and created the well-established YAZ toolkit which is given away under a BSD-like licence.

Before Seb and Adam started Index Data, they created a union catalogue search engine for the Danish National Library. When they started Index Data, they developed a new search engine with better ideas. This became Zebra, which functions as a Z39.50 server.

Adam is the technical chief of Index Data.

1.7. Dave Errington

CEO of Talis.

2. Zebra - Adam

2.1. Overview

[See the book Managing Gigabytes.]

Zebra is an indexing engine that indexes documents according to configurable rules so they can be efficiently retrieved later. It is not a relational database, but indexes full text.

Zebra is tied to user requirements. It has tended to be used mostly for libraries, but it is configurable by filters to handle any kind of record.

[diagram of Zebra components]

Searching is very fast even for large indexes (frequent words) so we don't implement stopwords; we just go ahead and search for the frequent words (so you can search for ``the who'').

2.2. Record Storage

Records are internally stored as DOM-like trees (the details of the format predate the DOM by several years), and are converted on the fly into the format that the application wants: for example, ISO2709 (MARC), XML, GRS-1, SUTRS ...

Each type of record that's supported is implemented by a ``filter''. This provides two methods, Index and Retrieve: the former is invoked when a record is added, to tell the indexing guts what terms to index and how; the latter is called to format the abstract DOM-like record in the appropriate way for the client.
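The two-method filter idea can be sketched roughly as follows. This is an illustration only, not Zebra's actual API: the class names, method signatures and the dictionary standing in for the DOM-like record are all hypothetical.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape


class RecordFilter(ABC):
    """Hypothetical filter interface with one Index and one Retrieve method."""

    @abstractmethod
    def index(self, record):
        """Return (register, term) pairs to be indexed for this record."""

    @abstractmethod
    def retrieve(self, record, fmt):
        """Format the stored record in the format the client asked for."""


class SimpleFilter(RecordFilter):
    """Toy filter over a dict-based record (a stand-in for the DOM-like tree)."""

    def index(self, record):
        pairs = []
        for field, value in record.items():
            for word in value.lower().split():
                pairs.append((field, word))
        return pairs

    def retrieve(self, record, fmt):
        if fmt == "sutrs":  # plain text, one field per line
            return "\n".join(f"{k}: {v}" for k, v in record.items())
        if fmt == "xml":
            body = "".join(f"<{k}>{escape(v)}</{k}>" for k, v in record.items())
            return f"<record>{body}</record>"
        raise ValueError(f"unsupported format: {fmt}")


record = {"title": "Managing Gigabytes", "author": "Witten"}
flt = SimpleFilter()
print(flt.index(record))
print(flt.retrieve(record, "xml"))
```

The point of the split is that indexing rules and output formatting live in one place per record type, so a new record type means a new filter rather than changes to the indexing core.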

Zebra can either make its own copy of records (which allows for the most efficient subsequent retrieval) or just build an index for records held elsewhere, so long as it has a way to get those records when it needs them (which obviates some synchronisation problems).

If we use the former approach, we could periodically update the Zebra database by having it use OAI-PMH to poll the master database for updates. An alternative approach, used by DBC to populate Zebra mirrors from its Oracle master DB, is to have a trigger in the master database actively push updates to Zebra as they happen. Zebra happily supports any and all of these models.
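For the polling model, the request that a harvester would send to the master database's OAI-PMH interface could be built as in this sketch. The base URL is a made-up placeholder, and the function name is hypothetical; `verb`, `metadataPrefix` and `from` are standard OAI-PMH request arguments, and `oai_dc` is used only because every OAI-PMH repository must support it.

```python
from urllib.parse import urlencode


def list_records_url(base_url, since, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request for records changed since `since`."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since,  # UTC datestamp of the previous successful harvest
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; no request is actually sent here.
print(list_records_url("http://example.org/oai", "2004-11-24"))
```

A periodic job would remember the datestamp of its last run, fetch everything changed since then, and feed the results to the Zebra update process.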

[Much discussion of alternative architectures relating Zebra to the existing Sybase bibliographic database and, optionally, a live holdings database that may be separate.]

2.3. Performance

Zebra's performance is attributable to several factors including compression in the indexing B-trees, lazy searching that allows efficient combination of subsearches for common and rare terms, and hierarchical dictionaries.
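To give a feel for why combining a rare term with a common one need not be expensive, here is one generic technique: walk the short posting list and probe the long one by binary search, so most of the long list is never touched. This is a textbook approach shown under that assumption, not a description of Zebra's internals; the function name is hypothetical.

```python
from bisect import bisect_left


def intersect_rare_first(rare, common):
    """Intersect two sorted posting lists, probing the long one by binary search."""
    result = []
    lo = 0
    for doc_id in rare:
        # Skip ahead in the long list instead of scanning it entry by entry.
        lo = bisect_left(common, doc_id, lo)
        if lo == len(common):
            break  # nothing left in the long list
        if common[lo] == doc_id:
            result.append(doc_id)
    return result


common = list(range(0, 1000, 2))   # a frequent term: 500 postings
rare = [4, 37, 998, 5000]          # an infrequent term: 4 postings
print(intersect_rare_first(rare, common))
```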

Updates can be done using shadow files, which is safe for the indexing files if something catastrophic happens during the indexing process, and which also means that the update appears atomic to searching clients, however many records are involved. Update efficiency is better for large updates than for small updates; Zebra automatically switches update modes depending on what kind of update it is asked to do.
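The general principle behind shadow-style updates (build the new state off to one side, then switch over in a single step) can be illustrated with an ordinary file. This is an analogy only, not Zebra's shadow-file mechanism; the function name is hypothetical.

```python
import os
import tempfile


def atomic_update(path, new_contents):
    """Write new_contents to a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_contents)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a mixture.
        os.replace(tmp_path, path)
    except BaseException:
        # A failed update leaves the original file untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies while writing, only the temporary file is damaged, which mirrors the safety property described above.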

A single Zebra database cannot be spread across multiple computers, as there is no way to partition the indexes. However, index files can be split across multiple physical disks, thereby reducing contention for disk, which is always the most scarce resource. In practice this is fine for scaling, since Zebra is so damned fast to start with. For example, it takes about six hours to index the whole of the DBC database of 27 million MARC-like records, for an average of about 4.5 million records per hour = 75,000 per minute = 1250 per second. DBC do not bother backing up their indexes, as restoring from backups would take longer than rebuilding the indexes from scratch!

2.4. Searching

Each register has specifications that indicate how the fields it works with are broken up into indexable tokens. This is done by classifying characters into one of seven classes (e.g. whitespace, literal, word-constituent).

Adjacency searches do not use separately indexed multi-word phrases (which would bloat the indexes unacceptably); instead they work by considering the positions in which search-term words appear in their fields.
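Position-based adjacency searching can be sketched as below. This is a minimal toy, assuming whitespace tokenisation and an in-memory index; the function names are hypothetical and nothing here reflects Zebra's actual index format.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions at which it occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of docs where the phrase's words are consecutive."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Only documents containing every word can match.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[words[0]][doc_id]:
            # Word i of the phrase must sit at position start + i.
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "the who live at leeds",
    2: "who is the author",
    3: "ask the doctor who",
}
index = build_positional_index(docs)
print(phrase_search(index, "the who"))
```

All three documents contain both words, but only the first has them next to each other, and no phrase-level index entry was needed to find that out.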

2.5. Integration

Talis has an emerging strategy to make more use of Microsoft .NET internally, and will need to integrate Zebra into this somehow. Adam suggests three candidate approaches:

Make a COM wrapper around the Zebra API. We've done this before, so we know it can be done, but it was not a particularly happy experience and we would not be keen to repeat the exercise.
Use .NET's web services facilities to access Zebra using SRW and/or a WS interface to the internal API (including update, etc.)
Build a C# wrapper around the Zebra API, analogous to the existing Perl API. This could probably be done using SWIG.

Mike argues that web services is so obviously the way to go that it seems perverse even to have suggested the alternatives. .NET is extremely WS-friendly, and Zebra supports SOAP access out of the box, as it understands the SOAP-based SRW protocol.
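As a taste of how little client code the web-services route needs, the sketch below builds a search request for SRU, the URL-based sibling of SRW. SRU is used here instead of SOAP-based SRW purely because it can be shown without a SOAP toolkit. The host name and function name are hypothetical; `operation`, `version`, `query` and `maximumRecords` are standard SRU searchRetrieve parameters, and the query is CQL.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server address; no request is actually sent here.
print(sru_search_url("http://example.org/zebra", "dc.title = zebra"))
```

Any HTTP client, in .NET or otherwise, can fetch such a URL and parse the XML response.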

Z39.50 access to the Talis catalogue would most naturally be provided directly by Zebra, in which case it would need to interface to Talis's existing authentication and authorisation systems. The obvious way to do this is to build a plug-in module interface for authentication, whereby Zebra passes the Z39.50 Init request's authentication tokens into a plug-in module which returns information telling Zebra whether the user should be allowed to continue.
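The shape of such a plug-in interface might look something like the following. Everything here is hypothetical (no such interface was specified at the meeting): the class names, the `allow` method and the in-memory user table are invented for illustration, and a real plug-in would call out to Talis's own systems.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class InitCredentials:
    """Authentication tokens taken from an Init request (hypothetical shape)."""
    user: Optional[str]
    password: Optional[str]


class AuthPlugin(Protocol):
    """What the server expects of any authentication plug-in."""

    def allow(self, creds: InitCredentials) -> bool: ...


class TableAuth:
    """Toy plug-in backed by an in-memory table of user -> password."""

    def __init__(self, users):
        self._users = users

    def allow(self, creds):
        # Unknown users are rejected outright, even with no password supplied.
        return creds.user in self._users and self._users[creds.user] == creds.password


def handle_init(plugin, creds):
    """Server side: accept or reject the session on the plug-in's say-so."""
    return "accepted" if plugin.allow(creds) else "rejected"


plugin = TableAuth({"alto": "s3cret"})
print(handle_init(plugin, InitCredentials("alto", "s3cret")))
print(handle_init(plugin, InitCredentials("guest", None)))
```

The server only ever sees a yes or no, so the authentication back end can change without touching the server itself.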

3. ZiNG - Mike

[Brief exposition of the history of ZiNG, the four major components that constitute it and how Zebra supports it.]

The ZiNG web-site is at www.loc.gov/zing. This site includes information on SRW/SRU, CQL, ZOOM and ZeeRex, and also on a dead sub-project called ez3950 which can be safely ignored.

4. TalisBase Past and Present - Rob

4.1. The Current System: TalisBase

[diagram]

TalisBase uses BRS, a rather old and almost unsupported full-text DBMS. This contains several databases such as BNB, Talis's own union catalogue, and the LC's database. This collection of databases within BRS is fronted by the ``BRS session'' code, written mostly in C with some Perl.

In front of this are two interfaces: one is a Z39.50 server, built on the Crossnet toolkit, providing read-only access; the other is an interface called CASS, built on Sybase Openserver (which is a library distributed by Sybase to make services look like Sybase databases). CASS has both search and update interfaces to BRS, and is in turn accessed using SQL, largely in order to trigger stored procedures. The Alto ``thick client'' uses Z39.50 against TB's server for searching, and uses SQL against CASS for updates. There is also another client that runs against CASS, but this was not discussed.

Within BRS are records in a form consisting of paragraphs annotated with a title such as ``author'', ``title'', and a special paragraph containing the MARC record itself. These faceted records are built from simple MARC records by a layer above the BRS session, so that what is fed into that from CASS is just the MARC record.

4.2. TalisBase 1.5

TB1.5 is an initiative to buy some breathing space, in the form of MARC21 support, with minimum effort. The Z39.50 server and CASS will not be changed at all (although, unknown to them, they will be passing a different flavour of MARC record). Alto will be changed to send and receive MARC21 records. The BRS configuration will be updated so that the MARC blob is MARC21 rather than UKMARC. Although the stored MARC21 records will use Unicode, the extracted data (for indexing) will be translated into ASCII.

This will be done entirely in-house.

Converting the UKMARC records to MARC21 is an as-yet unsolved problem. The plan seems to involve USEMARCON, which interprets files that specify the field mapping. However, this is slow. An alternative would be either to write a program that hardcodes the specific translation, or to reimplement USEMARCON much better. (It is staggeringly slow. Talis are getting a throughput of six records per second, which will take two months to convert their 30 million records. In other words, USEMARCON converts one MARC record in the time it takes Zebra to index 200 of them.)
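The figures quoted above can be checked with a few lines of arithmetic. The inputs are the numbers reported at the meeting (six records per second, 30 million records, and the 1250 records per second Zebra indexing rate from the DBC example); the variable names are just for this sketch.

```python
records = 30_000_000
usemarcon_rate = 6   # records converted per second, as reported by Talis
zebra_rate = 1250    # records indexed per second, from the DBC example

seconds = records / usemarcon_rate
days = seconds / 86_400            # 86,400 seconds in a day
ratio = zebra_rate / usemarcon_rate

print(f"{days:.1f} days of conversion, Zebra/USEMARCON ratio about {ratio:.0f}:1")
```

So ``two months'' and ``200 of them'' are both fair roundings.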

There are also some ``TalisMARC'' records in the data-set, which Talis know how to deal with.

4.3. TalisBase 2

Everything will need to be delivered using not only Z39.50 but also SRW/U. The formats to be delivered will include not only MARC21 but also MarcXML and Dublin Core. The system will also continue to support UKMARC, including the supply of records translated on the fly from the MARC21 master records - a service that will be valuable to libraries that have not upgraded by the time the BL stops supplying UKMARC records.

The bibliographic database server will also need to augment the records that are delivered using data from third-party sources - for example, images of book covers from BRS. [This is Project Demeter.]

The requirements for TB2 are rapidly becoming more concrete, thanks to the ongoing work of an excellent analyst. The requirements should be nailed down and delivered to us around the end of January. Thereafter, Talis would like to replace the 1.0 and 1.5 systems with an identically functional one ``as soon as possible'' (clarified to ``within a small number of months''), and to add other facilities incrementally.

Unicode indexing should come in at the very first TB2 installation, even though in other respects TB2 should be ``bug-compatible'' with TB1.5. (This is so that TB1.5 can be decommissioned while Talis Text clients continue to be supported.)

Another TB2 requirement is to integrate the holdings information currently maintained elsewhere in UnityWeb. But this is for rather further down the line. UnityWeb includes holdings records for 90% of the UK's public libraries, so there are a lot of records.

5. Index Data the Company

[See the PowerPoint presentation at www.indexdata.dk/tmp/indexdata.ppt]

6. Talis the Company

6.1. Overview

[History from the Web-site www.talis.com/about_talis/corporate.shtml]

Talis currently employs about 90 people, all of them in the Birmingham office. Turnover over the last few years has been about £7M per year, with small profits.

Lots has changed in the last year, with the office move, staff being reorganised, technical infrastructure refreshed, VoIP used to enable home-working, etc. The goal now is to build a new business culture on top of the new technology culture.

Founded in 1969; started out using punched cards, then went to computers. Went through three generations running on a mainframe, VLS and then Unix. ``Talis Text'' was written in C, but is now superseded by Alto, a thick client written in Delphi. Also have Prism, a metasearching OPAC. There are other products.

[Products listed at www.talis.com/products/product_select.shtml]

Talis serves 52 academic libraries, 25% of the UK market. Their focus is on ``post-92s'', the universities that used to be polytechnics. This is largely an accident of history. They don't want to be competing with the likes of Ex Libris for libraries such as Oxford and Cambridge.

Talis serves about 114 public libraries, 29% of the UK market. This has happened because of the historic strategy of allowing only LMS customers to use other Talis products. This strategy no longer works because libraries can integrate components from different vendors.

Although there is overlap, these two markets are rather different and have different requirements. Talis does not serve legal libraries, health libraries, etc.

Although Talis is a for-profit company, it is 75% owned by BLCMP, which is a non-profit. (The other 25% is owned by employees.) Talis and BLCMP are closely joined - e.g. they have the same board of directors. This structure is perceived as idiosyncratic, and is a barrier to growth in several ways, e.g. they can't raise capital.

About 20% of customers can be legitimately seen as development partners.

About 90% of revenue is from current customers - a customer-base that has a very slow replacement cycle, so there are no problems such as customers leaving with bad debt. Customer retention is very high. This means it's a very stable, predictable company. Revenue has not grown notably over the last decade.

6.2. Internal Development Projects

The current goal is to create ``a culture of innovation in a relatively boring market'' [Dave Errington].

6.2.1. Project Bigfoot

This is the project to upgrade TalisBase to TB2. The strategic goal of this is to create a new platform from which new services can be delivered.

6.2.2. Project Silkworm

Talis is the only company that has a significant market share in both the academic and public UK library markets. Silkworm is an ambitious long-term project to build a unified, directory-driven conglomeration of these, and of other Talis services, so that all the content is discoverable and can be integrated into whatever context is useful.

For example: UnityWeb is the UK's largest public holdings database, used for ILL request management. It has 60% of the UK's holdings, so 60% of all UK ILLs go through it. There is some basic request management in the UnityWeb software, but it does not integrate with TalisBase. Silkworm aims to fix this.

Use ZeeRex's documented but unimplemented Friends and Neighbours functionality to enable one-click configuration of Z-broadcasters.
Per-user OpenURL resolvers can be used to deliver peer-group reviews and other content that is specialised for particular groups or individuals.

-- end --