Searching and sifting through large amounts of information is a challenging task and a daily norm for web users today. There is a need for resource discovery services capable of dealing with large search result sets effectively. Previous research indicates that users tend to prefer interfaces that incorporate some form of categorisation and grouping of results; these UIs tend to enhance the efficiency of information seeking and provide greater user satisfaction. This post describes the development of faceted search, a resource discovery approach based on a type of category system which has become prevalent in digital libraries. A remit of the UX2.0 project involves enhancing an existing digital library featuring faceted search through user-centred design (UCD). We are also evaluating AquaBrowser, a leading library product facilitating faceted search.
This post relates to my recent work on two systems. It describes the development and setup of a faceted search infrastructure - Apache Solr for Blacklight, an open source, Ruby-on-Rails resource discovery UI. To provide a 'concentration' user experience, the infrastructure serves data aggregated from multiple, heterogeneous sources.
Part 1 of this account (this post) describes the general setup of Solr for multi-sourced data and the experience of importing the CERN book dataset using the Data Import Handler (DIH) of Solr. UX2 is incorporating the book data in combination with digital library objects, for testing low-fi, content-rich UI prototypes.
A forthcoming post (Part 2) describes programmatic development of Solr to enable indexing of Dublin Core metadata and binary documents in multiple formats (PDF, PowerPoint, etc.) held in an existing Fedora Commons digital repository.
Faceted Search
Faceted search is becoming the de facto search tool for content-rich websites, in particular online stores and digital libraries. In addition to a results listing, a faceted search UI provides a result digest containing various "orthogonal sets of categories", each providing an independent aspect, or facet, of the search result set. For example, the figure below shows the faceted search interface of the World Digital Library. Users can browse or navigate among these facets, which are typically listings of metadata links such as author, topic and year (of publication), each accompanied by the number of matching items within the search results. The facet links are essentially advanced search shortcuts. Selecting a link is equivalent to performing a boolean (AND) operation involving the keyword and a field with a suggested term (one that yields results), e.g. 'landscape' (keyword) AND 'type:book' (facet). Faceted search therefore allows users to perform "post-coordinate boolean operations via a navigational metaphor". In addition, the facets provide an information scaffold appropriate for domain research. Due to the integration of search and navigation (browse) interactions, faceted search is often referred to as faceted navigation or faceted browsing.
Faceted search UI of the World Digital Library (http://www.wdl.org)
From the perspective of interaction design, the navigational approach and the way the facets are presented (highly visible listings) have the potential to enhance the affordance of advanced search. Less effort is required from users to perform the equivalent and traditionally more convoluted search task: opting for advanced search in the UI, selecting search fields and typing in terms. This in turn encourages repetitive and quick use: successive clicking of several facet links. Modern faceted search UIs also provide breadcrumb trails displaying the search context (keyword entered, selected facets); some enable the search context to be repurposed and manipulated, e.g. removing a facet, or changing the search term while maintaining the facet selection. In combination, these factors facilitate fast drill-down of search results and enhance user experience.
Apache Solr Search Engine
Apache Solr is a Java search platform based on the Lucene indexing and search library. It is a stable open source technology, gaining popularity and used by many public and commercial websites. The Guardian uses it for the Open Platform and has good things to say about it - "robust performance, high availability, rapid scalability". The platform provides numerous powerful features, with indexing and search facilities at its core. It is service-oriented, with services accessible via HTTP/XML web service APIs. Hence it can underpin resource discovery UIs (clients) built with other programming languages, such as Blacklight (Ruby-on-Rails).
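To illustrate the HTTP API, a keyword search combined with a facet filter (the boolean AND described earlier) boils down to a single URL; the host, port and field names here are illustrative:

http://localhost:8983/solr/select?q=landscape&fq=format:Book&facet=true&facet.field=format&facet.field=pub_date

The fq (filter query) parameter carries the facet selection, while the facet.field parameters ask Solr to return facet counts alongside the results.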
I implemented Solr in 2008 for an earlier version of Blacklight, for a digital library which is now being enhanced under the UX2 project. The core services were efficient and speedy. It was feasible to use the same Solr instance to provide auto-suggest interaction: AJAX queries invoked by keystrokes were sent to Solr's dedicated (autosuggest) search indexes, retrieving title, related author and subject suggestions on the fly (see below). I also hacked the engine so that it would index rich documents held in the digital library.
Autosuggest feature of the UX2 digital library returning matching titles, related authors and subjects
Almost two years on, the search engine has improved a great deal, both in terms of performance and new features. It now has built-in support for rich document indexing and auto-suggest. In UX2, the UI aspects of the standard features, i.e. querying and faceting, are being enhanced. For example, prototypes of various ways of faceting will feature in user testing. In addition, some of Solr's new features are being explored:
- Multicore indexes setup for multiple data sources
- Various data import handlers for indexing external data sources (URL/XML, database)
- Fulltext indexing for rich documents (built-in)
- Multiple facets selection per category
- Auto-suggest (built-in)
- 'Did you mean' spell checking
- 'More like this' function
The development described in this post (and Part 2) relates to the first three features, i.e. the setup of multiple indexes, and the importing and indexing of multiple data sources: the CERN book dataset and Fedora repository objects.
Installation and Configuration
The Solr platform can run on any machine with Java (version 1.5 or greater). You can download and install a nightly build of Solr here. An alternative approach is to check out and compile the source code from the Solr subversion system - I opted for this and am tracking the Solr source code using a Git repository via git-svn (the first import took 30 minutes):
git svn clone http://svn.apache.org/repos/asf/lucene/dev/trunk/ apache-solr-svn
Solr provides an admin web UI for inspecting and querying indexes (REST-like URL queries/XML responses). You may want to build or try out some Solr-based web applications such as Blacklight, VuFind or the Drupal Solr module for better visualisation and demonstration of Solr's capabilities. This post focuses on the setup of Solr for Blacklight.
Solr is configured via two essential files: solrconfig.xml and schema.xml. The config files can be found in the example directory - a self-contained Solr instance which can be used as a development stub or starting point. The example directory contains various types of Solr home directory for different purposes, e.g. single core (solr.home/example/solr/) or multicore (solr.home/example/multicore). If you checked the system out via the Subversion code base, you will need to run 'ant example' to compile the source and generate the example instance. The config files are located in the 'conf' folders of the Solr home directories (each has a README.txt explaining the directory structure).
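For orientation, the relevant parts of the example directory look roughly like this (abridged; the stock multicore sample ships with two stub cores, core0 and core1):

example/
  solr/            (single core Solr home)
    conf/
      solrconfig.xml
      schema.xml
  multicore/       (multicore Solr home)
    solr.xml       (declares the cores)
    core0/conf/
    core1/conf/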
Read the getting started tutorial.
solrconfig.xml, schema.xml
The solrconfig.xml file contains the typical configuration details such as request and search handlers, specifying what fields to search and which facets to return. For example, the following code snippet defines a 'DisMax' search handler, applying a single search term across several fields with different weightings (boosts). Note: the search fields below (e.g. 'title_t', 'title_display') are defined in the other configuration file, schema.xml.
<!-- Specifies a dismax handler, making it the default handler;
it's possible to configure several handlers, each for
a particular type of search, e.g. author, subject -->
<requestHandler name="search" class="solr.SearchHandler" default="true" >
<lst name="defaults">
<str name="defType">dismax</str>
..
<!-- This handler should apply the search term across a set of query fields (qf)
of different weightings. -->
<str name="qf">
title_unstem_search^100000
subtitle_unstem_search^50000
title_t^25000
..
isbn_t
text
</str>
..
<!-- Specify a list of results fields to return for display -->
<str name="fl">
id,
score,
author_display,
format,
isbn_t,
language_facet,
subject_topic_facet,
..
</str>
..
<!-- Specify a list of facets to return with results -->
<str name="facet">true</str>
<str name="facet.mincount">1</str>
<str name="facet.limit">10</str>
<str name="facet.field">format</str>
<str name="facet.field">pub_date</str>
<str name="facet.field">subject_topic_facet</str>
..
</requestHandler>
The above example is part of the Blacklight Solr configuration file. We are adapting it for the UX2 project. Hence it will be modified during the course of UX2 development. Further details on configuring Solr: SolrConfigXml wiki, SolrRequestHandler wiki.
- solrconfig.xml of UX2, based on Blacklight's, with modifications as described in this blog
The schema.xml file describes the building blocks of Solr: the data and field types for querying and indexing. Typically, different field types are defined (string, text, boolean). The field types underpin the field definitions. The defined search fields are then applicable in solrconfig.xml, in higher-level contexts (in request handlers), analogous to the 'plumbing of parts'. One of the key features of schema.xml is the definition of various text field types and the ways in which text is indexed through a series of text-processing filters. For example, text for sort and facet fields is indexed as a single term or token (e.g. "albert einstein"), whereas text for searching is split (tokenised) into individual searchable pieces ("albert", "einstein"). Additional treatments (filtering) are also necessary. For example, Solr provides a StopFilterFactory to filter out uninteresting words (stop words) such as "the", "a", "and". Stemming is also important: it reduces words ("running", "runs", "ran") to their root form ("run"). Other examples of text filtering include case folding, synonym processing and accented character normalisation ('protégé' <-> 'protege'). Read more on text tokenizers and filters.
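As a minimal sketch, a text field type combining the filters mentioned above might be defined as follows; the type name and the stopwords/mapping files are illustrative (Solr's example schema ships similar ones), not the exact UX2 definitions:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- normalise accented characters, e.g. 'protégé' -> 'protege' -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- drop stop words, fold case, then stem to root forms -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>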
UX2 is using the Blacklight schema as a starting point and has made a few basic changes so that we can start experimenting with the current version of Solr (4.0). The changes include defining text field types using charFilter and removing UTF normalisation (which does not seem to work in Solr 4 just now). Apart from a new 'fulltext' text field exclusively for rich document content (cf. the catch-all text field for both metadata and full-text content), all fields are currently left unchanged.
- schema.xml of UX2, based on Blacklight schema
Multicore Indexes - Data Concentration
Starting from Solr 1.3, the concept of Solr cores was introduced, enabling a single Solr instance to run multiple indexes ('cores'). The cores are independent of each other and can be managed individually. Each is a fully functional live index with its own configuration and data. Multiple cores address some key requirements: rebuilding and reloading an index, testing a new configuration, creating new indexes, merging or swapping indexes. All of these can be accomplished 'hot', or on the fly, without the need to shut down the server.
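These administrative actions are exposed through Solr's CoreAdmin HTTP API (served at the adminPath configured in solr.xml, shown below). A few illustrative calls; the core names and paths are examples:

- Reload a core after a config change: http://localhost:8983/solr/admin/cores?action=RELOAD&core=cernbookdata
- Swap two cores, e.g. to promote a freshly rebuilt index: http://localhost:8983/solr/admin/cores?action=SWAP&core=main&other=staging
- Merge another index into a core (Solr 1.4+): http://localhost:8983/solr/admin/cores?action=mergeindexes&core=main&indexDir=/path/to/cernbookdata/data/index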
Multicore is a good match for UX2's remit of developing a diffusion-concentration user experience, necessitating both the aggregation and diffusion of data and services. While the service-oriented nature of Solr enables the diffusion of resource discovery services to different front-ends, multiple cores enable data from different sources to be managed more efficiently. First, data sources can be managed (imported/indexed) and tested independently before being merged into a main staging index. The merged index provides the concentration user experience, as resource discovery across multiple data sources can be performed simultaneously through a unified user interface. Second, new data sources can be introduced into the main index progressively without disrupting existing data services. In UX2, the following multicore setup is being piloted:
- Fedora Core, index of digital library objects from Fedora Commons repository
- CERN Core, index of the CERN book dataset
- Main Core, the staging index merged from the above and any new core from additional data sources
The multiple cores can be configured via the solr.xml file in the Solr home directory (e.g. in solr.home/example/multicore/solr.xml) as below:
<solr persistent="true">
<cores adminPath="/admin/cores">
<core name="fedora" instanceDir="fedora/"/>
<core name="cernbookdata" instanceDir="cernbookdata/"/>
<core name="main" instanceDir="main/"/>
</cores>
</solr>
solr.xml basically specifies the cores and their respective home directories, which need to be created (e.g. mkdir fedora). Each core directory must contain a conf subdirectory (see below) with the appropriate config files, including the solrconfig.xml and schema.xml described above.
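The resulting layout of the UX2 multicore home directory looks like this (Solr creates each core's data directory on first start):

multicore/
  solr.xml
  fedora/
    conf/
      solrconfig.xml
      schema.xml
  cernbookdata/
    conf/
      solrconfig.xml
      schema.xml
  main/
    conf/
      solrconfig.xml
      schema.xml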
To start Solr with the multicore setup, you can use the start.jar in the example directory of the Solr distribution, with the multicore home directory as a starting point:
java -Dsolr.solr.home=multicore -jar start.jar
After the server has started, you should be able to access the Solr admin UI via a web browser at the default server URL (http://localhost:8983).
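With multicore, each core also has its own admin UI and search endpoint; for the setup above:

- http://localhost:8983/solr/fedora/admin/
- http://localhost:8983/solr/cernbookdata/admin/
- http://localhost:8983/solr/main/admin/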
Now for importing some data into the cores!
Indexing CERN book dataset with Solr Data Import Handler
CERN has released the bibliographic book data held in its Library as part of an initiative to open up data for open access. A remit of UX2 explores use scenarios (concentration user experience) enabled by the aggregation of multiple data sources and the existing digital library objects. The main objective is to develop a unified and enhanced resource discovery UI via usable and useful contemporary interaction design patterns such as faceted search and word cloud.
Solr provides a data import handler (DIH) capable of dealing with data from different sources: relational databases, URL data (e.g. XML) sources and files. The CERN dataset (6 Dec 2009) contains 36244 records and can be downloaded as a single 105MB MARC XML file. It should be feasible to import the data as a file via the DIH. To enable Solr DIH, insert the following in solrconfig.xml of the cernbookdata core:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">dataimport-cern-books-config.xml</str>
</lst>
</requestHandler>
The above creates a data import handler named '/dataimport' and refers to a DIH config file, dataimport-cern-books-config.xml, which resides in the same directory as solrconfig.xml. The DIH config file informs the handler about the data source (CERN data file) location, which fields within the dataset to read and how they map to the index fields specified in schema.xml. For example, the following maps two MARC XML fields (001, 245a) to the id and title_display index fields of the cernbookdata core respectively.
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="page" processor="XPathEntityProcessor"
stream="true" forEach="/collection/record/"
url="/file-location-of/cern_books_data.xml"
transformer="RegexTransformer, DateFormatTransformer">
<field column="id"
xpath="/collection/record/controlfield[@tag='001']" />
<field column="title_display"
xpath="/collection/record//datafield[@tag='245']/subfield[@code='a']" />
</entity>
</document>
</dataConfig>
To initiate the data import, start Solr (with multicore) and use the DIH interface of the web admin (e.g. http://localhost:8983/solr/cernbookdata/admin/dataimport.jsp): select the 'dataimport' handler and opt for 'Full Import'.
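The import can also be initiated and monitored directly over HTTP using the standard DIH commands:

- Start a full import: http://localhost:8983/solr/cernbookdata/dataimport?command=full-import
- Check progress and document counts: http://localhost:8983/solr/cernbookdata/dataimport?command=status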
DIH Usage Experience and Issues
The performance of the DIH is great for one-to-one data mapping, e.g. MARC 001 -> id, MARC 245a -> title_display, as indicated by the following experiments indexing 36244 CERN records, all from a single file:
- 2 fields mapping, on a 64bit MacBook: ~5 seconds
- 8 fields mapping, on a 64bit MacBook: ~28 seconds
For UX2/Blacklight, some of the fields need to be mapped to multiple index fields (one-to-many mapping). For example, MARC 245a should be mapped to both the title_display and title_t (text) fields. I attempted this with the following field mapping, without success:
<dataConfig>
..
<field column="title_display" xpath="/collection/record//datafield[@tag='245']/subfield[@code='a']" />
<field column="title_t" xpath="/collection/record//datafield[@tag='245']/subfield[@code='a']" />
..
</dataConfig>
Solr only performed the title_t mapping. In addition, another UX2 requirement wasn't met: attempts to map multiple MARC fields to a single Solr field (many-to-one mapping), e.g. mapping the MARC subject fields (650a, 653) to subject_t, also failed. Therefore, the following can be observed:
- Solr DIH can only undertake one-to-one mapping of XML fields between data source and a Solr index
The one-to-one data mapping configuration on its own is not viable for UX2/Blacklight. However, there are two plausible solutions to address the many-to-one and one-to-many mapping requirements:
- Use copyField in the core's schema.xml (sketched below)
- Use the XML Stylesheet Transformation (xslt) functionality of DIH to pre-process the CERN data file, replicating and aggregating MARC fields prior to DIH (one-to-one) mapping.
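For reference, the copyField route (the first option above) would look something like this in the core's schema.xml; the source and destination fields follow the mappings discussed in this post:

<!-- copy the raw title into the searchable (tokenised) title field -->
<copyField source="title_display" dest="title_t"/>
<!-- aggregate a subject facet field into the searchable subject field -->
<copyField source="subject_topic_facet" dest="subject_t"/>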
I opted for the XSLT option to accomplish the CERN data import. First, specify an XML stylesheet in the DIH config file via the xsl attribute:
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity .. xsl="file-location-of/xslt/preprocessCernBooksData.xsl">
.. field mappings..
</entity>
</document>
</dataConfig>
Second, create the XML stylesheet to process the MARC XML fields. For example, the following snippet maps MARC 245a to both title_display and title_t (one-to-many), as well as the MARC 245a, 245b, 240a, 240b and 700t fields to title_t (many-to-one):
<xsl:template match="marc:datafield">
<xsl:choose>
<xsl:when test="@tag='245'">
<xsl:variable name="field245b"><xsl:value-of select="marc:subfield[@code='b']"/></xsl:variable>
<xsl:variable name="field240a"><xsl:value-of select="../marc:datafield[@tag='240']/marc:subfield[@code='a']"/></xsl:variable>
<xsl:variable name="field240b"><xsl:value-of select="../marc:datafield[@tag='240']/marc:subfield[@code='b']"/></xsl:variable>
<xsl:variable name="field700t"><xsl:value-of select="../marc:datafield[@tag='700']/marc:subfield[@code='t']"/></xsl:variable>
<marc:title_display>
<xsl:value-of select="marc:subfield[@code='a']"/>
</marc:title_display>
<marc:title_t>
<xsl:value-of select="marc:subfield[@code='a']"/>
<xsl:text> </xsl:text><xsl:value-of select="$field245b"/>
<xsl:text> </xsl:text><xsl:value-of select="$field240a"/>
<xsl:text> </xsl:text><xsl:value-of select="$field240b"/>
<xsl:text> </xsl:text><xsl:value-of select="$field700t"/>
</marc:title_t>
</xsl:when>
...
<xsl:otherwise>
<xsl:copy-of select="."/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
Third, map the transformed XML fields to their corresponding index fields in the DIH config file:
<dataConfig>
..
<field column="title_display" xpath="/collection/record/title_display" />
<field column="title_t" xpath="/collection/record/title_t" />
..
</dataConfig>
The following are the complete config and XSL files:
The following table describes the current experimental MARC-Solr mapping for the CERN dataset:
| Solr - cernbookdata core | MARC fields |
|---|---|
| id | 001 |
| title_display, title_sort | 245a |
| subtitle_display | 245b |
| title_t | 245a, 245b, 240a, 240b, 700t |
| author_display | 100a, 110a, 700a |
| author_facet, author_first_letter | 100a, 700a |
| author_sort | 100a |
| author_t | 100a, 110a, 111a, 130a |
| published_display, published_t | 260a, 260b |
| pub_date, year_facet, pub_date_sort | 260c |
| isbn_display, isbn_t | 020a |
| format | 980a |
| language_facet | 041a |
| subject_topic_facet, subject_topic_first_letter | 650a, 653a |
| subject_era_facet | 650d, 650y, 651y, 655y |
| subject_t | 653, 650a |
| identifier_display, web_display | 856u, 940u |
The XSLT option is not without its issues. First, this approach incurs significant system overheads due to the XSLT processing. It is necessary to allocate more memory (e.g. a 2GB heap) to Solr when indexing, by starting the server as follows:
java -Xms2g -Xmx2g -Dsolr.solr.home=multicore -jar start.jar
With a 2GB heap, indexing the above 18 Solr fields on a 64bit MacBook:
- 382 records: ~5s
- 8773 records: ~27s
- all 36244 records: ~7m55s
- all 36244 records: ~5m25s (64bit MacPro with 4G heap)
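As a quick sanity check after each run, a match-all query against the core reports the total number of indexed documents (numFound) without fetching any rows:

http://localhost:8983/solr/cernbookdata/select?q=*:*&rows=0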
Note: The latest Xalan jar files need to be in the Java virtual machine's lib/endorsed directory to prevent a 'no more DTM IDs are available' exception during indexing.
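An alternative to copying the jars into the JRE is to point the JVM at an endorsed directory at startup via the standard java.endorsed.dirs property (the path here is illustrative):

java -Djava.endorsed.dirs=/path/to/xalan/lib -Xms2g -Xmx2g -Dsolr.solr.home=multicore -jar start.jar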
So, here it is, the CERN dataset imported into a Solr core and Blacklight. Also check out the UX2 Blacklight prototype.
In the next post I shall describe how Fedora digital repository objects are indexed for the fedora core.
Thanks!
You should take a look at Sphinx for the search functionality. It's better than Solr. I've written an article on this here: http://blog.sphereinc.com/2010/07/full-text-search-in-rails-applications-with-sphinx/
Posted by: Pavel | 07/29/2010 at 11:03 AM
Thanks, this is very informative. The buzz is around faceted search as it provides users with a seamless web experience. Yonik Seeley, creator of Solr, walks you through the concept of faceted search; check out: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
Posted by: Julia | 09/08/2010 at 01:14 PM
Sphinx is crude compared to Solr...Pavel needs to read up on what Solr can do ;-)
Thanks for the article and writing up the XML mapping workaround...
Posted by: ken | 10/13/2010 at 09:27 PM