

Evaluation of XML retrieval


Table 10.2: INEX 2002 collection statistics.
12,107 number of documents
494 MB size
1995-2002 time of publication of articles
1,532 average number of XML nodes per document
6.9 average depth of a node
30 number of CAS topics
30 number of CO topics


Figure 10.11: Simplified schema of the documents in the INEX collection.

The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) program, a collaborative effort that has produced reference collections, sets of queries, and relevance judgments. A yearly INEX meeting is held to present and discuss research results. The INEX 2002 collection consisted of about 12,000 articles from IEEE journals. We give collection statistics in Table 10.2 and show part of the schema of the collection in Figure 10.11. The IEEE journal collection was expanded in 2005. Since 2006, INEX has used the much larger English Wikipedia as a test collection. The relevance of documents is judged by human assessors using the methodology introduced in Section 8.1, appropriately modified for structured documents as we will discuss shortly.

Two types of information needs or topics in INEX are content-only or CO topics and content-and-structure (CAS) topics. CO topics are regular keyword queries as in unstructured information retrieval. CAS topics have structural constraints in addition to keywords. We already encountered an example of a CAS topic in Figure 10.3. The keywords in this case are summer and holidays, and the structural constraints specify that the keywords occur in a section that in turn is part of an article, and that this article has an embedded year attribute with value 2001 or 2002.

Since CAS queries have both structural and content criteria, relevance assessments are more complicated than in unstructured retrieval. INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance. The component coverage dimension evaluates whether the element retrieved is ``structurally'' correct, i.e., neither too low nor too high in the tree. We distinguish four cases:

Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information.
Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information.
Too large (L): The information sought is present in the component, but is not the main topic.
No coverage (N): The information sought is not a topic of the component.

The topical relevance dimension also has four levels: highly relevant (3), fairly relevant (2), marginally relevant (1) and nonrelevant (0). Components are judged on both dimensions and the judgments are then combined into a digit-letter code. 2S is a fairly relevant component that is too small and 3E is a highly relevant component that has exact coverage. In theory, there are 16 combinations of coverage and relevance, but many cannot occur. For example, a highly relevant component cannot have no coverage, so the combination 3N is not possible.

The relevance-coverage combinations are quantized as follows:

\begin{displaymath}
\mbox{\bf Q}(rel,cov) =
\left\{ \begin{array}{ll}
1.00 & \mbox{if} \quad (rel,cov) = \mbox{3E} \\
0.75 & \mbox{if} \quad (rel,cov) \in \{\mbox{2E}, \mbox{3L}\} \\
0.50 & \mbox{if} \quad (rel,cov) \in \{\mbox{1E}, \mbox{2L}, \mbox{2S}\} \\
0.25 & \mbox{if} \quad (rel,cov) \in \{\mbox{1S}, \mbox{1L}\} \\
0.00 & \mbox{if} \quad (rel,cov) = \mbox{0N}
\end{array}\right.
\end{displaymath} (54)

This evaluation scheme takes account of the fact that binary relevance judgments, which are standard in unstructured information retrieval (Section 8.5.1), are not appropriate for XML retrieval. A 2S component provides incomplete information and may be difficult to interpret without more context, but it does answer the query partially. The quantization function Q does not impose a binary choice between relevant and nonrelevant and instead allows us to grade the component as partially relevant.
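
To make the mapping concrete, the quantization can be written as a simple table lookup. The following Python sketch is ours, not INEX's evaluation code; the names Q_TABLE and quantize are illustrative, and the values follow Equation 54.

# Sketch of the quantization function Q as a table lookup.
# Keys are digit-letter codes: topical relevance (0-3) followed by
# coverage (E = exact, S = too small, L = too large, N = no coverage).
Q_TABLE = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}

def quantize(rel: int, cov: str) -> float:
    """Map a relevance-coverage judgment, e.g. (2, "S"), to a graded value."""
    # Codes not listed in Equation 54 default to 0.0 in this sketch.
    return Q_TABLE.get(f"{rel}{cov}", 0.0)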

The number of relevant components in a retrieved set $A$ of components can then be computed as:

\begin{displaymath}
\char93 (\mbox{relevant items retrieved}) = \sum_{c \in A} \mbox{\bf Q}(rel(c),cov(c))
\end{displaymath} (55)

As an approximation, the standard definitions of precision, recall and F from Chapter 8 can be applied to this modified definition of relevant items retrieved, with some subtleties because we sum graded as opposed to binary relevance assessments. See the references on focused retrieval in Section 10.6 for further discussion.
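
As an illustration, the following Python sketch (ours; it assumes judgments are supplied as a dictionary from component identifiers to (rel, cov) pairs, with unjudged components treated as 0N) plugs the graded count of Equation 55 into the standard precision, recall and F definitions. It reuses the quantize function sketched above.

def graded_relevant_retrieved(retrieved, judgments):
    """Equation 55: sum of Q values over the retrieved set A of components."""
    return sum(quantize(*judgments.get(c, (0, "N"))) for c in retrieved)

def graded_precision_recall_f(retrieved, judgments, beta=1.0):
    """Precision, recall and F with graded rather than binary relevance counts."""
    rel_retrieved = graded_relevant_retrieved(retrieved, judgments)
    total_relevant = sum(quantize(r, c) for r, c in judgments.values())
    precision = rel_retrieved / len(retrieved) if retrieved else 0.0
    recall = rel_retrieved / total_relevant if total_relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f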

One flaw of measuring relevance this way is that overlap is not accounted for. We discussed the concept of marginal relevance in the context of unstructured retrieval in Section 8.5.1. This problem is worse in XML retrieval because of the problem of multiple nested elements occurring in a search result, as we discussed in Section 10.2. Much of the recent focus at INEX has been on developing algorithms and evaluation measures that return non-redundant result lists and evaluate them properly. See the references in Section 10.6.


Table 10.3: INEX 2002 results of the vector space model in Section 10.3 for content-and-structure (CAS) queries and the quantization function Q.
algorithm average precision
SIMNOMERGE 0.242
SIMMERGE 0.271


Table 10.3 shows two INEX 2002 runs of the vector space system we described in Section 10.3. The better run is SIMMERGE, which incorporates few structural constraints and mostly relies on keyword matching. SIMMERGE's median average precision (where the median is taken over the average precision values of the individual topics) is only 0.147. Effectiveness in XML retrieval is often lower than in unstructured retrieval because XML retrieval is harder: instead of just finding a document, we have to find the subpart of a document that is most relevant to the query. Also, XML retrieval effectiveness - when evaluated as described here - can be lower than unstructured retrieval effectiveness on a standard evaluation because graded judgments lower measured performance. Consider a system that returns a document with graded relevance 0.6 and binary relevance 1 at the top of the retrieved list. Then, interpolated precision at 0.00 recall (cf. Section 8.4) is 1.0 on a binary evaluation, but can be as low as 0.6 on a graded evaluation.
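
The following toy computation (ours; the four-document ranking is made up) illustrates the point: the top document alone yields interpolated precision 1.0 under the binary judgment, but only 0.6 under the graded judgment if nothing else relevant is retrieved.

def interp_precision_at_zero_recall(gains):
    """Interpolated precision at recall 0.00: the maximum precision
    (cumulative relevance gain divided by rank) reached at any rank."""
    best = cumulative = 0.0
    for rank, gain in enumerate(gains, start=1):
        cumulative += gain
        best = max(best, cumulative / rank)
    return best

# Only the top-ranked document is relevant; the rest are nonrelevant.
print(interp_precision_at_zero_recall([1.0, 0.0, 0.0, 0.0]))  # binary evaluation: 1.0
print(interp_precision_at_zero_recall([0.6, 0.0, 0.0, 0.0]))  # graded evaluation: 0.6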


Table 10.4: A comparison of content-only and full-structure search in INEX 2003/2004.
  content only full structure improvement
precision at 5 0.2000 0.3265 63.3%
precision at 10 0.1820 0.2531 39.1%
precision at 20 0.1700 0.1796 5.6%
precision at 30 0.1527 0.1531 0.3%


Table 10.3 gives us a sense of the typical performance of XML retrieval, but it does not compare structured with unstructured retrieval. Table 10.4 directly shows the effect of using structure in retrieval. The results are for a language-model-based system (cf. Chapter 12) that is evaluated on a subset of CAS topics from INEX 2003 and 2004. The evaluation metric is precision at $k$ as defined in Section 8.4. The discretization function used for the evaluation maps highly relevant elements (roughly corresponding to the 3E elements defined for Q) to 1 and all other elements to 0. The content-only system treats queries and documents as unstructured bags of words. The full-structure model ranks elements that satisfy structural constraints higher than elements that do not. For instance, for the query in Figure 10.3, an element that contains the phrase summer holidays in a section will be rated higher than one that contains it in an abstract.

The table shows that structure helps increase precision at the top of the results list. There is a large increase of precision at $k=5$ and at $k=10$, but almost no improvement at $k=30$. These results demonstrate the benefits of structured retrieval: it imposes additional constraints on what to return, and documents that pass the structural filter are more likely to be relevant. Recall may suffer because some relevant documents will be filtered out, but for precision-oriented tasks structured retrieval is superior.
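
For concreteness, here is a short Python sketch (ours, not the evaluation code behind Table 10.4) of precision at $k$ under the strict discretization just described, where only elements judged 3E count as relevant.

def precision_at_k(ranked_elements, judgments, k):
    """Precision at rank k: fraction of the top k elements judged 3E.
    judgments maps an element to its (rel, cov) pair; unjudged elements
    are treated as nonrelevant."""
    top_k = ranked_elements[:k]
    hits = sum(1 for e in top_k if judgments.get(e) == (3, "E"))
    return hits / k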

