Objective
HarpWeek LLC is an aggregator of 19th-century newspapers. In 2002, HarpWeek engaged Aptigent to upgrade its commercial product, which integrates page images, article text, manually created index entries, and literary synopses into a single application. The product gives millions of students, scholars, and researchers worldwide a rich resource for retrieving and discovering information published in Harper’s Weekly.
Approach
Content Management
One of the first challenges that HarpWeek faced was integrating hundreds of thousands of page images in multiple resolutions with XML files comprising the complete text of the issues, along with a large collection of manually created index information and hierarchical descriptor lists. These resources, together with the literary synopses written to accompany the serialized fiction that ran in the newspaper, all had to be brought together to realize HarpWeek’s vision: an information retrieval system that opens this array of resources to a wide variety of advanced searching and browsing techniques. Once the user interface and workflow models were understood, a web-based management console was created to streamline loading the content, validating it, and correcting errors in it. The console is also used to manage both IP and user ID authentication models for customer access to the product.
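As an illustration only, the two access models mentioned above (IP-based and user ID-based) can be sketched in a few lines of Python. This is not the product's actual implementation: the `SUBSCRIBERS` table, the subscriber name, and the `find_subscriber` function are hypothetical, and a real system would also verify a password or other credential for the user ID.

```python
import ipaddress

# Hypothetical subscriber records: licensed IP ranges and individual user IDs.
# The address range below is a reserved documentation range, not real data.
SUBSCRIBERS = {
    "example-university": {
        "networks": [ipaddress.ip_network("192.0.2.0/24")],
        "user_ids": {"reader01"},
    },
}


def find_subscriber(client_ip, user_id=None):
    """Return the name of the subscriber that grants access, or None."""
    addr = ipaddress.ip_address(client_ip)
    for name, record in SUBSCRIBERS.items():
        # IP model: the request comes from a licensed network range.
        if any(addr in net for net in record["networks"]):
            return name
        # User ID model: the request carries a known account identifier.
        if user_id is not None and user_id in record["user_ids"]:
            return name
    return None
```

A request from inside a licensed range is admitted without a login, while an off-site request is admitted only if it supplies a known user ID.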
High-Speed Searching and Retrieval
A key requirement for the system was to optimize search performance without requiring expensive servers and database management systems. This was achieved: the system delivers around 200 GB of data, including 220,000 images, 350,000 article text files, and 520,000 index entries, with fast response times.
A wide variety of searching techniques is supported, including Boolean, proximity, inflected forms, and stemming. Search filters may also be applied for date ranges, index categories, and feature types. When an article is viewed, the search terms are highlighted in the text. The descriptors let users drill down to pages through successively more precise topics, and users may also navigate and browse the pages of an issue using a light table interface.
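To make the Boolean, proximity, and stemming techniques concrete, here is a minimal Python sketch built on a positional inverted index. It is a toy model under stated assumptions, not a description of how Microsoft Index Server or the HarpWeek product implements these features: the function names are hypothetical, and `stem` is a deliberately crude suffix stripper standing in for a real stemming algorithm.

```python
import re
from collections import defaultdict


def stem(word):
    """Crude suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z']+", text.lower())]


def build_index(docs):
    """Map each stemmed term to {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def search_and(index, *words):
    """Boolean AND: IDs of documents that contain every query term."""
    postings = [set(index.get(stem(w.lower()), {})) for w in words]
    return set.intersection(*postings) if postings else set()


def search_near(index, a, b, distance):
    """Proximity: documents where a and b occur within `distance` words."""
    pa = index.get(stem(a.lower()), {})
    pb = index.get(stem(b.lower()), {})
    return {
        doc
        for doc in set(pa) & set(pb)
        if any(abs(i - j) <= distance for i in pa[doc] for j in pb[doc])
    }
```

Because both the documents and the query terms pass through `stem`, a query for "marching" matches an article that contains "marched". Storing word positions in the index is what makes the proximity check possible without rereading the article text.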
Technical Environment
The HarpWeek products are built as web applications on a Microsoft Windows Server platform. Microsoft SQL Server™ is used to maintain customer account information and some basic structural information for the collections. The bulk of the content used for searching and retrieval is handled by Microsoft Index Server. Although HarpWeek has wisely chosen to run the software on professional server-grade computers to support the user volume, the entire system is capable of running on a standard, albeit current, single-processor desktop system.
Award-winning Solution
This product was recently awarded the prestigious E-Lincoln prize. Please see the announcement for more information.
See HarpWeek in Action
Although electronic access to the Harper's Weekly archive is available only on a subscription basis, you can see a subset of the Harper's Weekly project at the publicly accessible Coffin Nails site.