Information Retrieval:  A Health & Biomedical Perspective

Information Retrieval:  A Health & Biomedical Perspective (Second Edition)

William Hersh, M.D.

Springer-Verlag, 2003


Update to Chapter 3 - System Evaluation

3.1 Overview of research

(4/10/05) Here is a quote from Albert Einstein that puts quantitative research into a good perspective: "Not everything that can be counted counts, and not everything that counts can be counted." (http://www.quotationspage.com/quote/26950.html)

3.1.1 Comparative research

(4/10/07) A simple overview of statistical tests for medical education research is quite pertinent to information retrieval (IR) research (Windish and Diener-West, 2006).

Windish, D. and Diener-West, M. (2006). A clinician-educator's roadmap to choosing and interpreting statistical tests. Journal of General Internal Medicine, 21: 656-660.

3.1.2 Other types of research

3.2 Classifications of evaluation

(4/12/04) Two other classifications of evaluation pertinent to information retrieval (IR) are worth mentioning. Patton's (1990) well-known volume on qualitative research has a frequently cited table that describes different types of research and their characteristics, from the most general to the most practical.

| Type | Purpose | Focus | Desired Results | Desired Generalization |
| --- | --- | --- | --- | --- |
| Basic research | Knowledge as end in itself | Questions deemed important by intellectual interest | Contribution to theory | Global |
| Applied research | Understand nature of human problems | Questions deemed important by society | Formulate problem-solving and interventions | Limited to application context |
| Summative evaluation | Determine effectiveness of human interventions | Goals of the intervention | Judgments and generalizations about interventions | All interventions with similar goals |
| Formative evaluation | Improve an intervention | Strengths and weaknesses of intervention | Recommendations for improvement | Specific setting studied |
| Action research | Solve problems | Problems of organization or community | Solve problems quickly and effectively | Here and now |

Another classification emanates from the medical informatics literature and is focused on system development (Stead et al., 1994). It also creates a matrix, although the dimensions are system development and site/level of evaluation. Only certain sites/levels of evaluation are appropriate for specific levels of system development.


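| Level of system development | Definition | Laboratory - bench | Laboratory - field | Remote field - validity | Remote field - efficacy |
| --- | --- | --- | --- | --- | --- |
| Specification | * | * | | | |
| Component development | | * | | | |
| Combination of components into a system | | * | * | * | |
| Integration of system into environment | | * | * | * | * |
| Routine use | | | * | * | * |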

Patton, M. (1990). Qualitative Evaluation and Research Methods (2nd edition). Newbury Park, CA. Sage Publications.
Stead, W., Haynes, R., Fuller, S., Friedman, C., Travis, L., Beck, J., Fenichel, C., Chandrasekeran, B., Buchanan, B., Abola, E., Sievert, M., Gardner, R., Messerle, J., Jaffe, C., Pearson, W. and Abarbanel, R. (1994). Designing medical informatics research and library-resource projects to increase what is learned. Journal of the American Medical Informatics Association, 1: 28-33.

(4/9/05) Despont-Gros et al. (2005) have developed a classification of user interactions with clinical information systems based on a review of the human-computer interaction literature. They found that variables assessed in studies included:
The studies used a variety of experimental approaches:
Despont-Gros, C., Müller, H., et al. (2005). Evaluating user interactions with clinical information systems: a model based on human-computer interaction models. Journal of Biomedical Informatics, 38: 244-255.

(4/15/06) Friedman (2005) compares two approaches to evaluation of community-based information interventions, which can be applied to IR systems. He distinguishes the smallball approach, consisting of focused evaluations over the life of a project, from powerball evaluations, which are done by randomized experiments at a project's conclusion. He asserts that the former has several advantages during various stages of system deployment:
He argues that powerball studies are often seen as the only legitimate type of evaluation, yet their ability to answer all the research questions we may have can be limited.

Friedman, C. (2005). "Smallball" evaluation: a prescription for studying community-based information interventions. Journal of the Medical Library Association, 93: S43-S48.

(4/16/06) Another classification of sorts that is used in Chapter 7 but not explicitly introduced in this chapter is the notion of system-oriented versus user-oriented evaluation research. System-oriented research evaluates the system, either by part or as a whole, focusing on how well it performs a set of standardized tasks. The usual approach to system-oriented evaluation in IR is through the use of a test collection, which consists of:
User-oriented evaluation, on the other hand, focuses on assessing the system in the hands of real users, who themselves may be in a simulated laboratory setting or real-world environment.

3.2.1 Lancaster and Warner

3.2.2 Fidel and Soergel

3.2.2.1 Setting

3.2.2.2 User

3.2.2.3 Request

3.2.2.4 Database

3.2.2.5 Search system

3.2.2.6 Searcher

3.2.2.7 Search process

3.2.2.8 Search outcome

3.2.3 Hersh and Hickam

3.2.3.1 Was the system used?

3.2.3.2 For what was the system used?

3.2.3.3 Were the users satisfied?

3.2.3.4 How well did they use the system?

3.2.3.5 What factors were associated with successful or unsuccessful use of the system?

3.2.3.6 Did the system have an impact?

3.2.4 Simulation in evaluation experiments

(4/9/05) Although not an IR evaluation study per se, Dresselhaus et al. (2004) compared different approaches to assessing variation among clinicians in the quality of preventive care provided. They used standardized patients (trained actors) as well as computerized clinical vignettes. The measures from the standardized patients included abstraction of the medical record and reports from the standardized patients. Measures from the standardized patients and the clinical vignettes were equally effective in predicting the quality of preventive care provided. However, the clinical vignettes were also noted to be less expensive as well as more easily controlled for case mix at a given site.

Dresselhaus, T., Peabody, J., et al. (2004). An evaluation of vignettes for predicting variation in the quality of preventive care. Journal of General Internal Medicine, 19: 1013-1018.

3.3 Relevance-based evaluation

3.3.1 Recall and precision

(4/12/04) Another relevance-based measure introduced several decades ago attempted to account for the cost of having to assess nonrelevant documents. Cooper (1968) defined the expected search length (ESL) as a measurement of retrieval performance that calculated how many nonrelevant documents had to be seen by the user to obtain a specified number of relevant documents. More recently, Losee (1996) introduced the average search length (ASL), which is the "expected number of documents obtained in retrieving a relevant document, the mean position of a relevant document."
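The two search-length ideas can be illustrated with a minimal Python sketch. It assumes a strictly ranked output represented as a list of booleans (True for a relevant document); Cooper's ESL additionally takes an expectation over tied documents in a weak ordering, which is not modeled here, and the ASL function is simply the empirical mean position rather than Losee's analytical derivation. The function names and the example ranking are hypothetical.

```python
from typing import Sequence


def search_length(ranking: Sequence[bool], wanted: int) -> int:
    """Count nonrelevant documents seen before `wanted` relevant ones are found.

    Assumes a strict ranking (no ties), so no expectation is needed.
    """
    found = 0
    nonrelevant_seen = 0
    for is_relevant in ranking:
        if is_relevant:
            found += 1
            if found == wanted:
                return nonrelevant_seen
        else:
            nonrelevant_seen += 1
    raise ValueError("ranking contains fewer relevant documents than requested")


def average_search_length(ranking: Sequence[bool]) -> float:
    """Mean 1-based rank position of the relevant documents in the ranking."""
    positions = [rank for rank, rel in enumerate(ranking, start=1) if rel]
    if not positions:
        raise ValueError("no relevant documents in ranking")
    return sum(positions) / len(positions)


# Hypothetical ranked output: relevant documents at ranks 1, 4 and 6.
example_ranking = [True, False, False, True, False, True]
length_for_two = search_length(example_ranking, wanted=2)
mean_position = average_search_length(example_ranking)
```

In this example the user passes two nonrelevant documents before reaching the second relevant one, and the relevant documents sit at a mean position of 11/3.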

Cooper, W. (1968). Expected search length: a single measure of retrieval effectiveness based on the weak ordering action of retrieval systems. American Documentation, 19: 30-41.
Losee, R. (1996). Evaluating retrieval performance given database and query characteristics: analytical determination of performance surfaces. Journal of the American Society for Information Science, 47: 95-105.

(4/12/04) Soboroff et al. (2001) proposed the measurement of recall and precision without human relevance judgments. Noting past work by Voorhees (2000, described in the text) demonstrating that differences in judgments did not affect the relative performance of systems, they selected random documents from the retrieval pool of multiple searches on each topic. Their results were most effective when they did not eliminate duplicates from selection (in essence giving more frequently retrieved documents a greater chance of being selected as relevant). They found that their results were most effective in separating high-performing and low-performing systems from those in the middle, but that they were less successful at identifying the truly best (or worst) systems from among the top (or bottom) performing systems. Aslam et al. (2006) developed methods for sampling very small numbers of documents (4% of usual pool size) that led to estimates of relevance for the remaining retrieved documents comparable to what would have been obtained had they been assessed by relevance judges.
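The effect of leaving duplicates in the pool can be sketched as follows. This is only an illustration of the sampling idea, not the procedure of Soboroff et al.: the function name, the `fraction` parameter (how much of the pool to draw), and the fixed random seed are all assumptions made for the example.

```python
import random
from typing import Sequence, Set


def sample_pseudo_relevant(
    runs: Sequence[Sequence[str]], fraction: float, seed: int = 0
) -> Set[str]:
    """Randomly pick pseudo-relevant documents from a pool that keeps duplicates.

    Because a document retrieved by several runs appears several times in the
    pool, it is more likely to be drawn than one retrieved by a single run.
    """
    pool = [doc_id for run in runs for doc_id in run]  # duplicates retained
    sample_size = round(fraction * len(pool))
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return set(rng.sample(pool, sample_size))


# Hypothetical runs from three systems on one topic; "d1" is retrieved by all.
example_runs = [
    ["d1", "d2", "d3"],
    ["d1", "d4", "d5"],
    ["d1", "d2", "d6"],
]
pseudo_qrels = sample_pseudo_relevant(example_runs, fraction=1 / 3)
```

The resulting set would then stand in for human relevance judgments when recall and precision are computed for each system.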

Aslam, J., Pavlu, V., et al. (2006). A statistical method for system evaluation using incomplete judgments. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA. ACM Press. 541-548.
Soboroff, I., Nicholas, C. and Cahan, P. (2001). Ranking retrieval systems without relevance judgments. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA. ACM Press. 66-73.

3.3.1.1 Similarities to medical diagnostic test evaluation

(4/13/03) Another medical measurement analogy from recall and precision has been defined by Bachmann et al. (2002): the number needed to read (NNR), which is the inverse of precision, i.e., 1/precision. The NNR defines the total number of articles that must be read to find each relevant one. This analogy can actually be carried back to the medical measurement realm, with the inverse of the positive predictive value (equivalent of precision) representing the number needed to test.
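As a small worked example of the NNR, the following Python sketch computes it from retrieval counts. The function name and the counts are hypothetical and chosen only for illustration.

```python
def number_needed_to_read(relevant_retrieved: int, total_retrieved: int) -> float:
    """Return NNR = 1 / precision, computed directly from the two counts."""
    if relevant_retrieved <= 0:
        raise ValueError("NNR is undefined when no relevant articles are retrieved")
    if total_retrieved < relevant_retrieved:
        raise ValueError("total_retrieved cannot be smaller than relevant_retrieved")
    # precision = relevant_retrieved / total_retrieved, so its inverse is:
    return total_retrieved / relevant_retrieved


# Hypothetical search: 100 articles retrieved, 20 of them relevant.
nnr = number_needed_to_read(relevant_retrieved=20, total_retrieved=100)
```

With a precision of 0.20, five articles must be read on average to find each relevant one.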

Bachmann, L., Coray, R., et al. (2002). Identifying diagnostic studies in MEDLINE: reducing the number needed to read. Journal of the American Medical Informatics Association, 9: 653-658.

3.3.1.2 Practical issues in measuring recall and precision

(4/10/07) The book does not do as good a job as it could in noting that retrieval systems are compared across different topics by taking the mean or average of measures, be they recall, precision, bpref, etc.

(4/9/05) Another measure commonly used to combine recall and precision is the F measure (van Rijsbergen, 1979). This measure is a weighted harmonic mean of recall and precision, using a parameter α that gives added weight to recall as it increases.
[Equation: F measure]
When α=1, the measure is called F1, and it represents the unweighted harmonic mean of recall and precision. For a search situation where precision is more important, one would set α to a lower value, i.e., less than one.
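The behavior described above can be illustrated with one commonly used formulation, F = (1 + α²)·P·R / (α²·P + R). This is an assumption on our part: it treats the α of the text as the parameter usually written β, and it is only one of several parameterizations that reduce to the harmonic mean at α=1, so it may differ from the formula originally shown here. The function name and example values are hypothetical.

```python
def f_measure(recall: float, precision: float, alpha: float = 1.0) -> float:
    """Weighted harmonic mean of recall and precision (assumed F-beta form).

    Larger alpha gives more weight to recall; alpha = 1 gives F1.
    """
    if recall == 0.0 and precision == 0.0:
        return 0.0  # convention: avoid division by zero
    a2 = alpha ** 2
    return (1 + a2) * precision * recall / (a2 * precision + recall)


# Hypothetical search result with high recall but low precision.
f_recall_weighted = f_measure(recall=0.8, precision=0.4, alpha=2.0)
f1 = f_measure(recall=0.8, precision=0.4, alpha=1.0)
f_precision_weighted = f_measure(recall=0.8, precision=0.4, alpha=0.5)
```

Because recall is the stronger of the two values in this example, the score rises as α increases and falls as α drops below one.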

van Rijsbergen, C. (1979). Information Retrieval. London. Butterworth.

3.3.1.3 The special case of ranked output

(4/10/07) The mean average precision (MAP) measure, just barely mentioned in the book, has achieved widespread, almost default use as a single measure that can be applied to ranked output (Buckley and Voorhees, 2005). MAP is calculated as the mean of average precision (AP) across topics. AP represents the average of precision at each point a relevant document is retrieved or, for relevant documents not retrieved, a value of 0. As such, it is a recall-oriented measure (despite having precision in its name), since it measures retrieval across the entire set of relevant documents for a topic.

Here is how AP would be calculated for the ranked output of Table 3.3 in the book:
[Figure: AP calculation for Table 3.3]

If the retrieved documents in positions 14 and 20 in the output were not relevant, and those other relevant documents had not been retrieved at all, then AP would be calculated as follows:
[Figure: AP calculation for Table 3.3, modified]
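The AP and MAP definitions can also be sketched in Python. The ranking below is hypothetical and is not the data of Table 3.3; the function names are likewise illustrative. Relevant documents that are never retrieved contribute 0 because the sum of precision values is divided by the total number of relevant documents for the topic, not by the number retrieved.

```python
from typing import Sequence


def average_precision(ranking: Sequence[bool], total_relevant: int) -> float:
    """Average of precision at each rank where a relevant document appears.

    `total_relevant` counts all known relevant documents for the topic,
    including any that were not retrieved (each contributes 0).
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_relevant


def mean_average_precision(ap_values: Sequence[float]) -> float:
    """Mean of the per-topic AP values."""
    if not ap_values:
        raise ValueError("need at least one topic")
    return sum(ap_values) / len(ap_values)


# Hypothetical topic: 4 relevant documents exist, 2 are retrieved (ranks 1 and 3).
ap_topic_1 = average_precision([True, False, True, False, False], total_relevant=4)
# Hypothetical second topic: both relevant documents retrieved at ranks 1 and 2.
ap_topic_2 = average_precision([True, True, False], total_relevant=2)
map_score = mean_average_precision([ap_topic_1, ap_topic_2])
```

For the first topic, precision is 1/1 at rank 1 and 2/3 at rank 3, and the two unretrieved relevant documents add nothing, giving AP = (1 + 2/3 + 0 + 0) / 4.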

One concern about recall and precision is the completeness of relevance judgments. When using relative recall, we cannot be certain that enough relevant documents have been identified to give a close approximation to absolute recall. Buckley and Voorhees (2004) introduced a new measure, binary preference (bpref), which is based on the number of times judged nonrelevant documents are retrieved before known relevant ones (that, of course, have been judged). Experiments showed that the measure was highly correlated with existing measures, such as MAP, when judgments were complete, and was more robust to incomplete judgments. Stated simplistically, bpref essentially is a measure that uses only the retrieved documents that have been judged for relevance.
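A minimal Python sketch of the idea behind bpref follows. It uses one widely cited formulation, in which each retrieved relevant document is penalized by the number of judged nonrelevant documents ranked above it, normalized by min(R, N), where R and N are the numbers of judged relevant and judged nonrelevant documents for the topic. The exact normalization is an assumption here and may differ from the original paper; the function name and example data are hypothetical.

```python
from typing import Optional, Sequence


def bpref(
    ranking: Sequence[Optional[bool]], num_relevant: int, num_nonrelevant: int
) -> float:
    """Binary preference over a ranked list of judgments.

    Each entry is True (judged relevant), False (judged nonrelevant) or
    None (unjudged). Unjudged documents are ignored entirely.
    """
    if num_relevant <= 0:
        raise ValueError("num_relevant must be positive")
    denominator = min(num_relevant, num_nonrelevant)
    nonrelevant_seen = 0
    score = 0.0
    for judgment in ranking:
        if judgment is None:
            continue  # unjudged documents do not affect the score
        if judgment:
            if nonrelevant_seen == 0:
                score += 1.0
            else:
                score += 1.0 - min(nonrelevant_seen, num_relevant) / denominator
        else:
            nonrelevant_seen += 1
    return score / num_relevant


# Hypothetical topic with 2 judged relevant and 2 judged nonrelevant documents.
with_unjudged = bpref([False, None, True, None, False, True], 2, 2)
without_unjudged = bpref([False, True, False, True], 2, 2)
```

The two calls return the same value, which illustrates the point that the measure depends only on the judged documents in the output.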

In another variant of getting "partial credit" for being relevant, Harper (2006) notes that, particularly in user studies, subjects may have a different interpretation of a topic than its creator or relevance judge. A subject may therefore carry out perfectly acceptable retrieval, but from a different frame of mind than the one that motivated the original topic or its relevance judgment. He proposes that user studies include the users' own relevance judgments and not just rely on those created by others, with the "credit" a user gets for relevance lying between 0 and 1, based on some probability measure that itself is based on other interpretations from other relevance judges and/or users.

Buckley, C. and Voorhees, E. (2005). Retrieval System Evaluation, 53-75, in Voorhees, E. and Harman, D., eds. TREC: Experiment and Evaluation in Information Retrieval. Cambridge, MA. MIT Press.
Buckley, C. and Voorhees, E. (2004). Retrieval evaluation with incomplete information. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, England. ACM Press. 25-32.
Harper, D. (2006). Employing user relevance assessments for measuring retrieval effectiveness. First International Workshop on Adaptive Information Retrieval, Glasgow, UK. http://www.dcs.gla.ac.uk/workshops/air/slides/DavidHarper-EmployingUserRelevanceAssessments.pdf.

3.3.1.5 Enhancements to recall and precision

3.3.2 What is relevance?

3.3.2.1 Topical relevance

3.3.2.2 Situational relevance

3.3.3 Research about relevance judgments

3.3.4 Limitations of relevance-based measures

3.3.5 Alternatives to relevance-based measures

(4/10/07) Although not an IR study per se, an interesting evaluation of bioinformatics visualization tools for interpreting output of gene microarray data was recently reported (Saraiya et al., 2005). A variety of tasks were set up for subjects using different visualization tools for the same data. Subjects were measured on the different "insights" they were expected to obtain from the data.

Saraiya, P., North, C., et al. (2005). An insight-based methodology for evaluating bioinformatics visualizations. IEEE Transactions on Visualization and Computer Graphics, 11: 443-456.

(4/16/06) Joachims (2002) introduced a new approach to evaluation for the Web based on "clickthrough data." It was based on the premise that the links a user clicks on in the results listing from a Web search engine are a measure of relevance. One search engine or system is therefore "better" than another if more links are clicked from its output. He proposed two types of experiments:
  1. Regular clickthrough data - The user's query is sent to two search engines, with the complete rankings from one system or the other randomly presented to the user.
  2. Unbiased clickthrough data - The user's query is sent to two search engines, but in this approach the results are mixed together (although order within each set is maintained).
In both types of experiments, one system is deemed superior to the other when more Web pages from its output are clicked through by users.
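The comparison rule stated above can be sketched in Python. This is a deliberate simplification rather than Joachims' procedure: it simply credits each click to every system whose ranking contained the clicked page, and it ignores how the mixed list was constructed and any rank cutoff. The function name and URLs are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def compare_by_clicks(
    ranking_a: Sequence[str], ranking_b: Sequence[str], clicked: Iterable[str]
) -> Tuple[int, int, str]:
    """Tally clicks on pages from each system's output and name the winner."""
    pages_a, pages_b = set(ranking_a), set(ranking_b)
    clicks = list(clicked)  # materialize so the clicks can be scanned twice
    clicks_a = sum(1 for url in clicks if url in pages_a)
    clicks_b = sum(1 for url in clicks if url in pages_b)
    if clicks_a > clicks_b:
        verdict = "A"
    elif clicks_b > clicks_a:
        verdict = "B"
    else:
        verdict = "tie"
    return clicks_a, clicks_b, verdict


# Hypothetical output of two search engines and the pages one user clicked.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u3", "u4", "u5"]
result = compare_by_clicks(engine_a, engine_b, clicked=["u1", "u3", "u2"])
```

In practice such tallies would be accumulated over many queries and users before one system is judged superior.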

In follow-up work, Joachims et al. (2005) looked at the eye movements and click-through behavior of real users, comparing them with the relevance judgments of other real users. They found that the user click-through was relatively highly associated with relevance, but was subject to two modest biases:
They conclude that while clicks cannot be thought of as absolute relevance judgments, they are a highly effective relative approximation.

Borlund (2003) has proposed a user-based, interactive model for evaluation that designs the evaluation to recreate as realistically as possible the real-world searching environment and allows a more dynamic approach to assigning relevance judgments.

Borlund, P. (2003). The IIR evaluation model: a framework for evaluation of interactive retrieval systems. Information Research, 8: 3. http://informationr.net/ir/8-3/paper152.html.
Joachims, T. (2002). Evaluating retrieval performance using clickthrough data. Proceedings of the SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, Tampere, Finland. ACM Press. http://www.cs.cornell.edu/People/tj/publications/joachims_02b.pdf.
Joachims, T., Granka, L., et al. (2005). Accurately interpreting clickthrough data as implicit feedback. Proceedings of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil. ACM Press. 154-161.

3.4 The Text Retrieval Conference

(4/16/06) Initiatives like TREC are sometimes called "challenge evaluations," and given all such initiatives TREC has spawned, this section might be better titled, "The Text Retrieval Conference and Related Challenge Evaluations." In 2005, a book was published reflecting the entire TREC experience back to its inception (Voorhees and Harman, 2005). It provides a number of "big picture" views of various aspects of TREC.

In the meantime, a number of new tracks have been introduced to TREC since the publication of this book. Among them are:
Some TREC tracks have been so successful that they have spawned their own separate organizational structures:
Another IR evaluation forum focuses on retrieval from XML documents, the INitiative for the Evaluation of XML Retrieval (INEX).

Allan, J. (2004). HARD Track overview in TREC 2004 - high accuracy retrieval from documents. The Thirteenth Text Retrieval Conference - TREC 2004, Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec13/papers/HARD.OVERVIEW.pdf.
Clarke, C., Scholer, F., et al. (2005). The TREC 2005 Terabyte Track. The Fourteenth Text REtrieval Conference - TREC 2005, Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec14/papers/TERABYTE.OVERVIEW.pdf.
Clough, P., Müller, H., et al. (2005). The CLEF 2005 Cross-Language Image Retrieval Track. 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Springer Lecture Notes in Computer Science, Vienna, Austria. Springer-Verlag. 553-557.
Cohen, A. and Hersh, W. (2006). The TREC 2004 Genomics Track categorization task: classifying full-text biomedical documents. Journal of Biomedical Discovery and Collaboration, 1: 4. http://www.j-biomed-discovery.com/content/1/1/4.
Cormack, G. and Lynam, T. (2005). TREC 2005 Spam Track overview. The Fourteenth Text REtrieval Conference - TREC 2005, Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec14/papers/SPAM.OVERVIEW.pdf.
Hersh, W., Bhupatiraju, R., et al. (2006). Enhancing access to the bibliome: the TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1: 3. http://www.j-biomed-discovery.com/content/1/1/3.
Hersh, W., Cohen, A., et al. (2005). TREC 2005 Genomics Track overview. The Fourteenth Text Retrieval Conference - TREC 2005, Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec14/papers/GEO.OVERVIEW.pdf.
Hersh, W., Cohen, A., et al. (2006). TREC 2006 Genomics Track overview. The Fifteenth Text Retrieval Conference (TREC 2006), Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec15/papers/GEO.OVERVIEW.pdf.
Hersh, W., Müller, H., et al. (2006). Advancing biomedical image retrieval: development and analysis of a test collection. Journal of the American Medical Informatics Association, 13: 488-496.
Voorhees, E. and Harman, D., eds. (2005). TREC: Experiment and Evaluation in Information Retrieval. Cambridge, MA. MIT Press.

3.5 Measures of agreement

(4/10/06) Hripcsak and Rothschild (2005) investigated the relationship of kappa to the F measure. They showed that when the number of negative cases is large, and the probability of chance agreement on positive cases is very small, then the two measures will approach each other mathematically. This is therefore useful in situations (more common in assessment of natural language understanding systems) where the true number of negative cases is unknown but large.
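The convergence can be checked numerically with a short Python sketch. It assumes a 2x2 agreement table for two raters with cells a (both positive), b and c (the two kinds of disagreement) and d (both negative), computes Cohen's kappa in the usual way, and computes the F measure between the raters as 2a / (2a + b + c). The function names and counts are hypothetical.

```python
def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's kappa for a 2x2 table: a = both +, b and c = disagree, d = both -."""
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from the marginal totals of the two raters.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)


def pairwise_f_measure(a: int, b: int, c: int) -> float:
    """F measure between two raters; it does not use the negative-negative cell."""
    return 2 * a / (2 * a + b + c)


# Hypothetical counts of positive agreement and the two kinds of disagreement.
a, b, c = 20, 5, 10
f_value = pairwise_f_measure(a, b, c)
gap_few_negatives = abs(cohen_kappa(a, b, c, 100) - f_value)
gap_many_negatives = abs(cohen_kappa(a, b, c, 1_000_000) - f_value)
```

As the number of negative cases d grows while the other cells stay fixed, the gap between kappa and F shrinks toward zero.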

In addition, there are actually variants of the kappa measure whose assumptions lead to different results in some cases (Di Eugenio and Glass, 2004).

Di Eugenio, B. and Glass, M. (2004). The kappa statistic: a second look. Computational Linguistics, 30: 95-101.
Hripcsak, G. and Rothschild, A. (2005). Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12: 296-298.

Last updated - April 21, 2007