Information Retrieval:  A Health & Biomedical Perspective

Information Retrieval:  A Health & Biomedical Perspective (Second Edition)

William Hersh, M.D.

Springer-Verlag , 2003

Back to Updates Table of Contents

Update to Chapter 8 - Lexical-Statistical Systems

(5/18/03) Note in this chapter of the book that there are several instances where TF*IDF is incorrectly replaced with TF*IOF .

(5/17/04) A book that provides a succinct overview of the language-processing aspects of not only information retrieval, but also information extraction, categorization, summarization, and mining has been published (Jackson and Moulinier, 2002)

Jackson, P. and Moulinier, I. (2002). Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. Amsterdam. Benjamin Johns Publishing.

8.1 Evaluation of lexical-statistical systems

(5/26/07) As noted to the update of Chapter 3, a new track has been introduced in TREC, the Genomics Track. This is the first track within TREC to focus on retrieval in a domain, and is chaired by this author.

The Genomics Track is motivated by the information needs being generated by the data-intensive nature of the increasing number of "high throughput" technologies. These technologies, exemplified by gene microarrays, uncover new genes, proteins, and other biological prodcuts and activities that require the attention of researchers to understand in the context of their research. In a microarray experiment, for example, a research may uncover dozens of genes that are differentially (over or under) expressed relative to controls. As such, the researcher must learn about many new genes, proteins, etc. in a hurry. Clearly better IR systems will help in this process.

The first year of the Genomics Track occured in TREC 2003 (Hersh and Bhupatiraju, 2003). This first year was a "preliminary" year, mainly due to the fact that resources were not available for relevance judgements. As such, the track had to make use of existing data sources, which of course resulted in their not having been developed with the purpose of IR evaluation in mind.  Nonetheless, the track was very successful, with a total of 29 groups participating. This made the track the second largest at TREC in terms of numbers of participating groups. Also in 2003, however, the track was awarded a National Science Foundation Information Technology Research grant, which begin to provide five years of funding from the 2004 track forward. Data from the TREC Genomics Track are available on a password-protected Web site. Information on how to access the data are available at the track Web site (http://ir.ohsu.edu/genomics/).
TREC 2004 Ad Hoc Task
The first meaningful running of the Genomics Track occured in TREC 2004 (Hersh et al., 2006). The steering committee of the track decided to organize two tasks that would assess a variety of information management issues in the genomics domain. The first task was a standard ad hoc retrieval task using topics obtained from real biomedical research scientists and documents from a large subset of the MEDLINE bibliographic database. The second task focused on categorization of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consisting of three subtasks. One subtask focused on the triage of articles likely to have experimental evidence warranting the assignment of GO terms, while the other two subtasks focused on the assignment of the three top-level GO categories. The ad hoc retrieval task is described in this section while the categorization task is described below in section 8.3.1.

The Genomics Track turned out to have the highest participation of any track in TREC in both 2004 and 2005, with 33 and 41 research groups respectively.

The document collection for the ad hoc retrieval task in both 2004 and 2005 was a 10-year subset of MEDLINE. The use of full-text documents for this task was contemplated, but an adequate amount to represent real-world searching could not be procured. As such, the decision was made to use MEDLINE. In reality, despite the widespread availability of on-line, full-text scientific journals at present, most searchers of the biomedical literature still use MEDLINE (through PubMed or Ovid) as an entry point. Consequently, there is great value in being able to search MEDLINE effectively. The MEDLINE subset consisted of completed citations from the database inclusive from 1994 to 2003. Records were extracted using the Date Completed (DCOM) field for all references in the range of 19940101 - 20031231. This provided a total of 4,591,008 records, which is about one third of the full MEDLINE database.

The topics for the 2004 ad hoc retrieval task were developed from the information needs of real biologists and modified as little as possible to create needs statements with a reasonable estimated amount of relevant articles (i.e., more than zero but less than one thousand). The information needs capture began with interviews by 12 volunteers who sought biologists in their local environments. A total of 43 interviews yielded 74 information needs. Some of these volunteers, as well as an additional four individuals, created topics in the proposed format from the original interview data. The same individuals then were assigned different draft topics for searching on PubMed so they could be modified to generate final topics with a reasonable number of relevant articles. The track chair made one last pass to make the formatting consistent and extract the 50 that seemed most suitable as topics for the track.

Relevance judgments were done using the conventional “pooling method” whereby a fixed number of top-ranking documents from each official run were pooled and provided to an individual (blinded to the number of groups who retrieved the document and what their search statements were). The relevance assessor then judged each document for the specific topic query as definitely relevant (DR), possibly relevant (PR), or not relevant (NR). For the official results, which required binary relevance judgments, documents that were rated DR or PR were considered relevant.

The primary evaluation measure for the task was mean average precision (MAP). Results were calculated using the trec_eval program, a standard scoring system for TREC. A statistical analysis was performed using a repeated measures analysis of variance, with posthoc Tukey tests for pairwise comparisons. In addition to analyzing MAP, we also assessed precision at 10 and 100 documents. The results and analysis of the ad hoc retrieval task are described in more detail in track overview paper (Hersh et al., 2004). The best performing groups used a variety of domain-independent and domain-specific techniques. Examples of the former that worked well included the use of Okapi weighting and query expansion. Examples of the latter included expansion of symbols by LocusLink and MeSH records, and other gene and protein synonyms. Groups attempting to map topic terms into controlled vocabulary terms, in particular gene and protein names, fared less well.
TREC 2005 Ad Hoc Task
In the 2005 ad hoc retrieval task, topics were employed that were more structured than the mostly free-form topics from 2004 (Hersh et al., 2005). The purpose of this approach was to provide systems with better defined (yet still realistic) queries for finding genomics information. As such, topics were developed from generic structured templates so systems could make better use of other resources, such as ontologies or databases. It was also hoped this could serve as the basis to begin investigation toward an interactive task.

As with 2004, we collected topics from real biologists. However, instead of soliciting free-form topics, we provided the biologists we interviewed with generic templates and asked them to express information needs they had recently that fit within those templates.

While it would have been ideal to interview users and develop the templates themselves from such interviews, the time frame of the track did not allow this. Instead, a set of generic topic templates (GTTs) were developed from an analysis of the topics from the 2004 track and other known biologist information needs. After we developed the GTTs, 11 people went out interviewed 25 biologists to obtain instances of them. Other people then searched on the topics to make sure there were at least some, although not too many, relevant documents. The topics did not have to fit precisely into the GTTs, but had to come close (i.e., have all the required semantic types).

As with 2004, there were 50 topics. Closure was reached on five GTTs, each of which had 10 instances, for a total of 50 topics. The five GTTs are listed below.  The semantic types in each GTT are underlined.  For some semantic types, more than one instance is allowed. The five GTTs were:
  1. Find articles describing standard methods or protocols for doing some sort of experiment or procedure.
  2. Find articles describing the role of a gene involved in a given disease.
  3. Find articles describing the role of a gene in a specific biological process.
  4. Find articles describing interactions (e.g., promote, suppress, inhibit, etc.) between two or more genes in the function of an organ or in a disease.
  5. Find articles describing one or more mutations of a given gene and its biological impact.

Relevance judgments were done using the conventional pooling method of TREC. Based on estimation of relevance judgment resources, the top 60 documents for each topic from all official runs were used. This gave an average pool size of 821 documents with a range of 290 to 1356. These pools were then provided to the relevance judges, who consisted of five individuals with varying expertise in biology. The relevance judges were instructed in the following manner for each GTT:

A total of 32 groups submitted 58 runs.  Six topics had no definitely relevant documents.  One topic had no definitely or possibly relevant documents and was dropped from the calculation of official results.

The overall results showed that manual synonym expansion helped but automated synonym expansion did not. Relevance feedback without term expansion helped.  The basic Okapi system with well-tuned parameters gave good baseline performance. As with the 2004 track, and TREC in general, the preliminary analysis performed in the context of the yearly cycle precluded better characterization of baseline and other experiments that would improve their understanding.

TREC 2006 Track

The TREC Genomics Track implemented a new task in 2006 that focused on passage retrieval for question answering using full-text documents from the biomedical literature (Hersh et al., 2006). The track was based on the premise that what many information seekers, especially users of the biomedical literature, desire something in between answers to questions and full documents, i.e., a system that attempts to provide short, specific answers to questions and put them in context by providing supporting information and linking to original sources.

A test collection of 162,259 full-text documents and 28 topics expressed as questions was assembled. Systems were required to return passages that contained answers to the questions. Expert judges determined the relevance of passages and grouped them into aspects identified by one or more Medical Subject Headings (MeSH) terms. Document relevance was defined by the presence of one or more relevant aspects. The performance of submitted runs was scored using mean average precision (MAP) at the passage, aspect, and document level.

For the TREC 2006 Genomics Track, the task consisted of retrieval of short passages (from phrase to sentence to paragraph in length) that specifically addressed an information need, along with linkage to the location in the original source document. Topics were expressed as questions and systems were measured on how well they retrieved relevant information at the passage, aspect, and document levels. Systems were required to return passages linked to source documents, while relevance judges not only rated the passages, but also grouped them by aspect. For this task, aspect was defined similar to its definition in the TREC Interactive Track aspectual recall task (Hersh et al., 2001), representing answers that covered a similar portion of a full answer to the topic question. It also drew upon experience in passage retrieval from the previous TREC High Accuracy Retrieval from Documents (HARD) Track (Allan, 2003; Allan, 2004).

The documents for the 2006 task came from a new full-text biomedical corpus obtained from a number of publishers who use Highwire Press (www.highwire.org) for electronic distribution of their journals. They agreed to allow use of their full text in HTML format, which preserved formatting, structure, table and figure legends, etc. The document collection was derived from 49 journals and were obtained by a Web crawl of the Highwire site, with post-processing to eliminate as much non-article material as we could. The full collection contained 162,259 documents. The collection was about 12.3 GB when uncompressed. The paper describing the track lists the journals and number of documents from each.

For the 2006 track, there were three levels of retrieval performance that we measured: passage retrieval, aspect retrieval, and document retrieval. Each of these provided insight into the overall performance for a user trying to answer the given topic questions. Each was measured by some variant of mean average precision (MAP).

Passage-level MAP - This measure used a variation of MAP, computing individual precision scores for passages based on character-level precision, using a variant of a similar approach used for the TREC 2004 HARD Track (Allan, 2004). For each nominated passage, the number of characters that overlapped with those deemed relevant by the judges in the gold standard was determined. For each relevant retrieved passage, precision was computed as the fraction of characters overlapping with the gold standard passages divided by the total number of characters included in all nominated passages from this system for the topic up until that point. Similar to regular MAP, remaining relevant passages that were not retrieved (no overlap with any nominated passages) were added into the calculation as well, with precision set to 0 for these relevant non-retrieved gold standard passages. Then the mean of these average precisions over all topics was calculated to compute the MAP for passages. Note that this measure is essentially the fraction of retrieved characters that are part of an answer to the topic question.

Aspect-level MAP - Aspect retrieval was measured using the average precision for the aspects of a topic, averaged across all topics. To compute this, for each submitted run, the ranked passages were transformed to two types of values, either the aspect(s) of the gold standard passage that the submitted passage overlapped with or the value “not relevant”. This resulted was a ranked list, for each run and each topic, of lists of aspects per passage, Non-relevant passages had empty lists of aspects. Because the organizers were uncertain of the utility for a user of a repeated aspect (e.g., same aspect occurring again further down the list), they discarded these from the output to be analyzed. For the remaining aspects of a topic, they calculated MAP similar to how it is calculated for documents, with the additional wrinkle that a single passage may have associated with it multiple aspects. Therefore the precision for the retrieval of each aspect was computed as the fraction of relevant passages for the retrieved passages up to the current passage under consideration. These fractions at each point of first aspect retrieval were then averaged together to compute the average aspect precision. Relevant passages that did not contribute any new aspects to the aspects retrieved by higher ranked passages were removed from the ranking. Taking the mean over all topics produced the final aspect-based MAP.

Document-level MAP - For the purposes of this measure, any PMID that had a passage associated with a topic ID in the set of gold standard passages was considered a relevant document for that topic. All other documents were considered not relevant for that topic. System run outputs were collapsed by PMID document identifier, with the documents appearing in the same order as the first time the corresponding PMID appeared in the nominated passages for that topic. For a given system run, average precision was measured at each point of correct (relevant) recall for a topic. The MAP was the mean of the average precisions across topics.

In general, document MAP scores were highest, followed by aspect, and then passage, although these scores were not directly comparable since they measure precision at recall of different things. There was a general, though far from perfect, correlation between passage, aspect, and document MAP. As seen in many TREC-style evaluations, statistical significance, based on pair-wise comparison with the top-ranking score in an ANOVA model, was not achieved for any measure until well down the ranked list of runs.

Overall there was a wide variation in system performance across submissions for each of the three measures. In general, scores grouped into three sets. A few groups dominated the high scores of each measure, followed by a large group with scores around the mean, and then another large group of relatively low scores. Submissions that scored well on document retrieval tended to score well on both passage and aspect. While a correlation between passage and document retrieval might have been expected, the correlation between document and aspect retrieval is more surprising since aspect retrieval places an emphasis on novelty and document retrieval does not.

For all three measures, the best automatic approaches were as good or better than manual or interactive systems. Manual and interactive approaches did not appear to provide an advantage over automatic methods. However, because the definitions of automatic, manual, and interactive were not as solid as in previous years because systems had the topic questions available during development, inference should be limited from these observations.

It was clear from the results and techniques of the top-performing groups in passage retrieval that certain approaches were quite effective. In particular, “trimming” passages to shorten them was done in all the runs with the highest passage MAP. Indeed, some groups noted that non-content manipulations of passages had substantial effects on passage MAP, with one group claiming that breaking passages in half with no other changes doubled their (otherwise low) score. To this end, the organizers defined an alternative passage MAP (PASSAGE2) that calculated MAP  as if each character in each passage were a ranked document. In essence, the output of passages was concatenated, with each character being from a relevant passage or not. The results show that some re-ranking of runs occurred.

This novel approach to the TREC 2006 Genomics Track was carried out successfully, leading to the development of a new test collection with new documents and tasks, as well as a new evaluation method and the software to administer and score it. While further analysis of results is required for more definitive conclusions, it can be noted that passage retrieval in this context is quite difficult, with results quite low. Fortunately, retrieval at the aspect and document levels is much better, indicating users still might be able to find answers to their questions in the biomedical literature. Duplicate relevance assessments showed relatively good levels of reproducibility, with one exceptional outlier.

The TREC 2007 Genomics Track will continue in the same direction, using the existing document collection and task structure but adding completely new topics and potentially new topic types. The 2007 track will be the last running of the Genomics Track within TREC, although future options to continue biomedical IR challenge evaluations are being explored.

Allan, J. (2003). HARD Track overview in TREC 2003 - high accuracy retrieval from documents. The Twelfth Text REtrieval Conference - TREC 2003, Gaithersburg, MD. Naitonal Institute of Standards and Technology. 24-37. http://trec.nist.gov/pubs/trec12/papers/HARD.OVERVIEW.pdf.
Allan, J. (2004). HARD Track overview in TREC 2004 - high accuracy retrieval from documents. The Thirteenth Text Retrieval Conference (TREC 2004), Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec13/papers/HARD.OVERVIEW.pdf.
Hersh, W. (2001). Interactivity at the Text Retrieval Conference (TREC). Information Processing and Management, 37: 365-366.
Hersh, W. and Bhupatiraju, R. (2003). TREC genomics track overview. The Twelfth Text Retrieval Conference (TREC 2003), Gaithersburg, MD. NIST. 14-23. http://trec.nist.gov/pubs/trec12/papers/GENOMICS.OVERVIEW3.pdf.
Hersh, W., Bhupatiraju, R., et al. (2006). Enhancing access to the bibliome: the TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1: 3. http://www.j-biomed-discovery.com/content/1/1/3.
Hersh, W., Cohen, A., et al. (2005). TREC 2005 Genomics Track overview. The Fourteenth Text Retrieval Conference - TREC 2005, Gaithersburg, MD. National Institute for Standards & Technology. http://trec.nist.gov/pubs/trec14/papers/GEO.OVERVIEW.pdf.
Hersh, W., Cohen, A., et al. (2006). TREC 2006 Genomics Track overview. The Fifteenth Text Retrieval Conference (TREC 2006), Gaithersburg, MD. National Institute for Standards & Technology. http://trec.nist.gov/pubs/trec15/papers/GEO.OVERVIEW.pdf.

(5/24/06) Another goal of retrieval evaluation is to determine how to predict the difficulty of queries. Several research initiatives have addressed this topic in recent years.

In 2003, a workshop entitled Reliable Information Access (RIA) was held to address variability in retrieval results (Buckley and Harman, 2004). Two "tracks" addressed different facets and approaches to the problem. A "bottom-up" track carried out a large failure analysis, with six IR systems contributing one run each of 45 topics, with a detailed manual analysis of the results. A "top-down" track performed a number of runs using different variations of query expansion.

The bottom-up analysis found that systems obtained comparable performance (e.g., MAP) scores but performed differently across topics, retrieving different documents and emphasizing different aspects of the topics. However, it was concluded that all systems failed for similar reasons, usually missing some aspect of a topic that would lead to retrieval of more relevant documents. Another conclusion was that if systems could recognize the problem causing the failure, then substantially better retreival could be obtained. In other words, emphasis on future system development should focus on what current techniques can be applied to which situations, and not on developing new retrieval techniques.

The top-down analysis found that query expansion (also known as blind feedback, i.e., adding terms from highly ranked documents into a query to expand the number of terms and augment retrieval) was highly sensitive to variations in approaches. For example, the selection of the initial document set for expansion, the number of terms used, and the terms chosen greatly influenced performance.

Another workshop held in 2005 focused on predicting query difficulty (Carmel et al., 2005). The TREC Robust Track has also focused on performance on topics that have historically been difficult, assessing methods to recognize such queries and employ methods to improve results.

Buckley, C. and Harman, D. (2004). Reliable Information Access Final Workshop Report. Bedford, MA, MITRE Corp. https://rrc.mitre.org/pubs/ria_final.pdf.
Carmel, D., Yom-Tov, E., et al. (2005). Predicting query difficulty - methods and applications. SIGIR Forum, 39(2): 25-28. http://www.acm.org/sigir/forum/2005D/2005d_sigirforum_carmel.pdf.
Voorhees, E. (2006). The TREC 2005 Robust Track. SIGIR Forum, 40(1): 41-48. http://www.acm.org/sigir/forum/2006J/2006j_sigirforum_voorhees.pdf.

8.2 Basic lexical-statistical approaches

8.2.1 The vector-space model

(5/17/04) Note from the errata file that the formulae for cosine normalization from Equations 2 and 3 of Chapter 6 (page 189) are incorrect.  Here are the correct versions:
Cosine
Cosine
The calculations in this section of the chapter are correct, i.e., only the formula is incorrect.

8.2.2 Variations in term weighting

8.2.2.1 Variants of TF

8.2.2.2 Okapi weighting

(5/18/03) Equation 7 on page 273 of the book is slightly misformatted. Although not incorrect, it could be potentially misleading. In the second addend, the 1.5 is multiplied by the quantity (Length of document) / (Average document length).

8.2.2.3 Pivoted normalization

8.2.3 Automatic thesaurus generation

8.2.4 Latent semantic indexing

8.2.5 Phrase construction

8.2.6 Passage retrieval

(5/26/07) As noted above, the TREC 2006 Genomics Track focused on passage retrieval.

Hersh, W., Cohen, A., et al. (2006). TREC 2006 Genomics Track overview. The Fifteenth Text Retrieval Conference (TREC 2006), Gaithersburg, MD. National Institute for Standards & Technology. http://trec.nist.gov/pubs/trec15/papers/GEO.OVERVIEW.pdf.

8.2.7 Relevance feedback and query expansion

(5/20/03) On page 281, formula (11), the formula for pair weight in relevance feedback, is incorrect. The quantity after the minus sign should be, Correlation in nonrelevant documents, i.e.,
Pair weight = Correlation of pair to relevance * (Correlation in relevant documents - Correlation in nonrelevant documents)

8.2.8 Probabilistic IR

8.2.8.1 Basic probabilistic IR

8.2.8.2 Language modeling

(5/17/04) The language modeling continutes to attract researcher interest and retrieval success. Part of the success of these models results from the "smoothing" methods that give weight to terms in the query that do not occur in retrieved documents. Zhai and Lafferty investigated this feature in more detail and derived a number of new conclusions about this approach to IR. These language models also allow the measurement of query "clarity," which is defined as a measure of the deviation between in the query and document language models from the general collection model (Cronen-Townsend et al., 2002). Cronen-Townsend et al. found that query clarity was a good predictor of retrieval results using TREC ad hoc test collections. However, research by Turpin and Hersh (2004) assessing real user queries from their TREC Interactive Track experiments (see Section 8.4.2) failed to find this association.

Cronen-Townsend, S., Zhou, Y., et al. (2002). Predicting query performance. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland. ACM Press. 299-306.
Turpin, A. and Hersh, W. (2004). Do Clarity Scores for Queries Correlate with User Performance? Proceedings of the  Fifteenth Australasian Database Conference (ADC2004), Dunedin, New Zealand. Australian Computer Society. 85-91. http://crpit.com/confpapers/CRPITV27Turpin.pdf.
Zhai, C. and Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22: 179-214.

8.2.9 Stemming

8.2.10 Implementations of lexical-statistical systems

(5/16/05) The IR systems of the NLM have generally not used natural language query techniques, perhaps with the exception of the "Related Articles" feature of PubMed. However, as noted in Chapter 6, the askMEDLINE interface (http://askmedline.nlm.nih.gov/) is an exception (Fontelo, 2005). Designed mainly for handheld devices, the system allows natural language queries to be entered. The system performs a variety of interactions with the underlying PubMed engine.

Fontelo, P., Liu, F., et al. (2005). askMEDLINE: a free-text, natural language query tool for MEDLINE/PubMed. BMC Medical Informatics and Decision Making, 5: 5. http://www.biomedcentral.com/1472-6947/5/5.

(5/20/06) As noted in the Chapter 1 update, an up-to-date list of open-source search engines is at http://www.searchtools.com/tools/tools-opensource.html. One system of note is Lucene, which is written in Java and is now part of the open-source Web server Apache (Gospodnetic, 2005). Another new open-source search engine is Zettair. The name comes from zettabyte and IR, with the former representing the quantity of 2 to the 70th power bytes, which is equal to 1,024 exabytes or approximately 1021 bytes. Zettair indexes HTML or TREC collections. It was developed for the TREC Terabyte Track, so is very fast and efficient.

Gospodnetic, O. and Hatcher, E. (2005). Lucene in Action. Greenwich, CT. Manning Publications.

(5/18/04) Two tutorials for using the SMART system, although not updated in several years, are available:

8.3 Specific lexical-statistical applications

8.3.1 Routing and filtering

(5/28/07) Besides the TREC Genomics Track categorization task, little text categorization work has focused on biomedical topics. One exception is recent work by Cohen et al. (2006), who attempted to determine whether automated classification of document citations could be useful in reducing the time spent by experts reviewing journal articles for inclusion in updating systematic reviews of drug class efficacy for treatment of disease.
They developed a voting perceptron-based automated citation classification system that classified articles as containing high-quality, drug class-specific evidence or not. Performance was assessed using cross-validation experiments. They found that at the level of 95% recall, there was a reduction in the number of articles needing manual review in 11 of 15 drug review topics by as much as 50%. They concluded that automated document citation classification could be a useful tool in maintaining systematic reviews of the efficacy of drug therapy. They are planning to carr out further work refining the classification system and determining the best manner to integrate the system into the production of systematic reviews.

Another approach, not categorization or filtering per se, involved the use of reordering documents in retrieval output based on citation data (Bernstam, 2006). They assessed a variety of citation-based algorithms that attempted to rank documents in search output not by term frequencies or related approaches, but rather by measures of citation frequency. The best-known of these approaches is the PageRank algorithm used by Google.

They compared eight different algorithms: simple PubMed queries, PubMed clinical queries, vector cosine, citation count, journal impact factor, PageRank, and a machine learning approach based on polynomial support vector machines. The aim was to prioritize important articles, defined as being included in a pre-existing bibliography of important literature in surgical oncology. The citation-based algorithms were found to be more effective than noncitation-based algorithms at identifying important articles. The most effective strategies were simple citation count and PageRank, which on average identified over six important articles in the first 100 results compared to <1 for the best noncitation-based algorithm. Similar differences were observed between citation-based and noncitation-based algorithms at 10, 20, 50, 200, 500, and 1,000 results. They also assessed citation lag, i.e., how it takes a period of time before citations to appear to an important article. This was found to affect performance of PageRank more than simple citation count. In spite of citation lag, however, the citation-based algorithms were still more effective than noncitation-based algorithms. They concluded that algorithms that were successful on the World Wide Web could be applied to biomedical information retrieval, helping to identify important articles within large sets of relevant results.

Bernstam, E., Herskovic, J., et al. (2006). Using citation data to improve retrieval from MEDLINE. Journal of the American Medical Informatics Association, 13: 96-105.
Cohen, A., Hersh, W., et al. (2006). Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association, 13: 206-219.

(5/17/04) A number of new interesting, if not controversial, techniques have emerged from text categorization research. In addition, a comprehensive overview has recently been written (Sebastiani, 2004) and a new test collection has become available (Lewis et al., 2004). The latter will insure research in filtering will continue, especially since the TREC Filtering Track was discontinued after 2002 (Robertson and Soboroff, 2002).

One of the controversial techniques is plagiarism detection (a topic no doubt of interest to students of mine reading this!). Schleimer et al. (2003) describe a system for "document fingerprinting" that enables copies of documents and sub-documents within them. A growing cat-and-mouse environment has emerged as Web-based services make the purchase of term papers easier than ever while the tools for detecting them are also more powerful due to their availability to crawl Web sites and compare student papers already entered into its database from the local purchaser and, sometimes, beyond. An overview of the leading companies offering plagiarism detection tools is at http://www.wou.edu/provost/library/staff/kincanon/plagiarism/chart.htm. The controversies behind these tools have been enumerated by Foster (2002).

Aonther controversial use of text categorization is the prediction of author gender based on writing style (Ball, 2003). A group of researchers developed algorithms to predict author gender for a group of 566 English-language texts with about 80% accuracy (Koppel et al., 2002; Argamon et al., 2003). The controversy in their work is that it "confirms" a number of gender stereotypes, e.g., men talk more about objects versus women talking more about relationships and female writers using more pronouns. Their algorithms are also able to discern fiction vs. non-fiction with about 98% accuracy, which is probably less surprising.

There has been little work with text categorization in the biomedical domain beyond the use of MEDLINE category assignment in TREC-8 described in the book. Recent work by Aphinyanaphongs et al. (2005), however, has shown that these techniques can be used to identify high-quality better than the MEDLINE Clinical Queries algorithms of Haynes et al. (1994 version). The best approach uses polynomial support vector machines. One limitation of this work is that it does not compare results with the newer Clinical Queries filters developed in recent years. These authors have also found ways to express these queries using Boolean operators so they can be used in exact-match retrieval systems (Aphinyanaphongs and Aliferis, 2004).

Additional work in biomedical text categorization took place in the TREC 2004 Genomics Track. The categorization task consisted of simulation of two of the classification activities carried out by human annotators for the MGI system: a triage task and two simplified variations of MGI’s annotation task. Systems were required to classify full-text documents from a two-year span (2002-2003) of three journals, with the first year’s (2002) documents comprising the training data and the second year’s (2003) documents making up the test data.

The documents for the categorization task consisted of articles from three journals over two years obtained from Highwire Press (http://www.highwire.org). Highwire is a “value added” electronic publisher of scientific journals. Most journals in their collection are published by professional associations, with the copyright remaining with the associations. Highwire originally began with biomedical journals, but in recent years has expanded into other disciplines. They have also supported IR and related research by acting as an intermediary between consenting publishers and information systems research groups who want to use their journals, such as the Genomics Track.

The journals available and used by the track were Journal of Biological Chemistry (JBC), Journal of Cell Biology (JCB), and Proceedings of the National Academy of Science (PNAS). These journals have a good proportion of mouse genome articles.  Each of the papers from these journals was provided in SGML format based on Highwire’s Document Type Definition (DTD). Articles from the year 2002 were used as training data and articles from 2003 as test data. The documents for the categorization tasks came from a subset of articles having the words mouse, mice or murine as described above. A crosswalk file (look-up table) that matched an identifier for each Highwire article (its file name) and its corresponding PubMed ID (PMID) was made available to participants. The SGML training document collection was 150 megabytes in size compressed and 449 megabytes uncompressed. The SGML test document collection was 140 megabytes compressed and 397 megabytes uncompressed.

The framework for evaluation in the categorization task was based on the following table of possibilities:


Relevant (classified)
Not relevant (not classified)
Total
Retrieved
True positive (TP)
False positive (FP)
All retrieved (AR)
Not retrieved
False negative (FN)
True negative (TN)
All not retrieved (ANR)

All positive (AP)
All negative (AN)


The consensus of the track was to use a utility measure for the triage task and F measure for the annotation tasks. A Python program that calculated both statistics for each task was developed.

The measure for the triage task was the utility measure often applied in text categorization research and used by the former TREC Filtering Track. This measure contains coefficients for the utility of retrieving a relevant and retrieving a nonrelevant document. A version was used that was normalized by the best possible score:
Unorm = Uraw / Umax

For a test collection of documents to categorize, Uraw was calculated as follows:
Uraw = (ur * TP) + (unr * FP)
where:
Values for ur and unr were used that were driven by boundary cases for different results. In particular, it was desired for the measure to have the following characteristics:
In order to achieve the above boundary cases, they set ur > 1. The ideal approach would have been to interview MGI curators and use decision-theoretic approaches to determine their utilty. However, time did not permit us to do that. Deciding that the triage-everything approach should have a higher score than the triage-nothing approach, they estimated that a Unorm in the range of 0.25-0.3 for the triage-everything condition would be appropriate. Solving for the above boundary cases with Unorm ~ 0.25-0.3 for that case, they obtained a value for ur ~ 20. To keep calculations simple, they chose a value of ur = 20. The table below shows the value of Unorm for the boundary cases.

Situation
 Unorm - Training Unorm - Test
Completely perfect prediction 1.0
1.0
Triage everything 0.27
0.33
Triage nothing 0
0
Completely imperfect prediction -0.73
-0.67

The measure Umax was calculated by assuming all relevant documents are retrieved and no nonrelevant documents are retrieved:
Umax = ur * AP
(This happens to equal AN.)

Thus, for the training data,
Uraw = (20 * TP) - FP
Umax =  20 * 375 = 7500
Unorm = [(20 * TP) - FP] / 7500
(If you plug in the bounary conditions to the these equations, you should obtain the results specified above.)

Likewise, for the test data,
Uraw = (20 * TP) - FP
Umax =  20 * 420 = 8400
Unorm = [(20 * TP) - FP] / 8400

The measures for the annotation subtasks were based on the notion of identifying tuples of data. Given the article and gene, systems designated one or both of the folloing tuples:
They employed a global recall, precision, and F measure evaluation measure for each subtask:
Recall = number of tuples correctly identified / number of correct tuples = TP / AP
Precision = number of tuples correctly identified / number of tuples identified = TP / AR
F = (2 * recall * precision) / (recall + precision)
For the training data, the number of correct <article, gene, GO hierarchy code> tuples was 593, while the number of correct <article, gene, GO hierarchy code, evidence code> tuples for the test data was 644.

More details about the triage and annotation hierarchy subtasks are available on the track Web site (Hersh et al., 2004). For the triage subtask, a variety of approaches were used by the different groups. However, none performed any better than simply using the MeSH term Mouse.

Aphinyanaphongs, Y. and Aliferis, C. (2004). Learning Boolean queries for article quality filtering. MEDINFO 2004 - Proceedings of the Eleventh World Congress on Medical Informatics, San Francisco, CA. IOS Press. 263-267. http://discover1.mc.vanderbilt.edu/discover/public/Publications/Boolean_Queries.pdf.
Aphinyanaphongs, Y., Tsamardinos, I., et al. (2005). Text categorization models for high-quality article retrieval in internal medicine. Journal of the American Medical Informatics Association, 12: 207-216.
Argamon, S., Koppel, M., et al. (2003). Gender, genre, and writing style in formal written texts. Text, 23: 321-346.
Ball, P. (2003). Computer program detects author gender. Nature. http://www.nature.com/nsu/030714/030714-13.html.
Foster, A. (2002). Plagiarism-detection tool creates legal quandary. Chronicle of Higher Education. May 17, 2002. http://chronicle.com/free/v48/i36/36a03701.htm.
Hersh, W., Bhuptiraju, R., et al. (2004). TREC 2004 genomics track overview. The Thirteenth Text Retrieval Conference (TREC 2004), Gaithersburg, MD. NIST. http://trec.nist.gov/pubs/trec13/papers/GEO.OVERVIEW.pdf.
Koppel, M., Argamon, S., et al. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17: 401-412.
Lewis, D., Yang, Y., et al. (2004). RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research , 5: 361-397.
Robertson, S. and Soboroff, I. (2002). The TREC 2002 Filtering Track report. The Eleventh Text REtrieval Conference (TREC 2002), Gaithersburg, MD. NIST. 27-39. http://trec.nist.gov/pubs/trec11/papers/OVER.FILTERING.pdf.
Schleimer, S., Wilkerson, D., et al. (2003). Winnowing: local algorithms for document fingerprinting. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA. ACM Press. 76-85.
Sebastiani, F. (2005). Text Categorization, 109-129, in Zanasi, A., ed. Text Mining and its Applications. Southampton, UK. WIT Press. http://nmis.isti.cnr.it/sebastiani/Publications/TM05.pdf.

8.3.2 Web searching

(5/17/04) A new book describes the gamut of models that define, retrieve, and process Web content (Baldi et al., 2003).

Baldi, P., Frasconi, P., et al. (2003). Modeling the Internet and the Web - Probabilistic Methods and Algorithms. West Sussex, England. John Wiley & Sons.

8.3.2.1 TREC Web Track

(5/18/05)  The TREC 2002 Web Track employed a new test collection, .GOV (Craswell and Hawking, 2002). The collection consisted of an 18-gigabyte crawl of 1.25 million documents (not only HTML pages, but also PDF, Word, and Postscript files) from the .gov (U.S. government) domain, replacing the five-year-old data in WT10g. Documents longer than 100KB were truncated at that length.  The TREC 2002 Interactive Track employed the .GOV collection as well (Hersh, 2002).

The TREC 2002 Web track had two experimental tasks:  a "topic-distillation" task that sought to find high-quality, authoritative pages on given topics and a named-page-finding task that sought specific pages. The former task proved difficult for groups due to uncertainty over the task's meaning. For the named-age-finding task, groups found added performance by exploiting document structure (e.g., title and keywords) or anchor text.

The TREC 2003 Web track continued the use of the .GOV collection as well as the topic-distillation and named-page-finding tasks (Craswell et al., 2003). The primary metrics used to measure performance were R-precision (precision at the point of n documents retrieved, where n was the number of relevant documents for that query) for the former and mean reciprocal rank (MRR) for the latter. Both tasks represented types of searches that retrieved home pages and other high-level pages that facilitate navigation into a site. Techniques found to work well included searching over link text (ie, that between <A> and </A> on an HTML page), since that text is likely to describe the page its links to, and analyzing URLs to find those likely to point to home pages (eg, are short). The TREC 2003 Web Track also incorporated the former standalone Interactive Track.

The TREC 2004 Web track was the final running of this track (Craswell and Hawking, 2004). (It was replaced in 2005 with the "Enterprise Track" that focused on retrieval of documents from the intranet of a large enterprise.) The 2004 track combined three tasks that had been used in previous years: topic distillation, home-page finding, and named-page finding. There were 75 topics for each task. Not surprisingly, there was many fewer "relevant" documents for the latter two tasks (1600 vs. 80-83).

The track used a variety of established and new evaluation measures. In addition to MAP and MRR, they added Success@N, where N = 1, 5, or 10 and was defined as the number of topics where the "answer" appeared at or above rank N. For example, Success@10 meant the number of topics where the answer appeared in the top 10 documents of the output. The best results came from a group from Microsoft Research (Zaragoza et al., 2004). Their system obtained the best results using the following features:
An addition exercise in the TREC 2004 Web track was to automatically classify the topics into the three categories (topic distillation, home page, and named page). Since there were three categories, a purely random approach would achieve a 25% rate of correctness. The best performing group obtained a 61.3% rate of correctness. Virtually all groups were best able to distinguish topic distillation topics from the other two. None were really able to exploit this information to improve retrieval performance.

A related development at TREC 2004 was the running of the Terabyte Track (Clarke et al., 2004). We include it here because it also made use of a Web crawl. In fact, the track developed a new document collection, called GOV2, which consisted of a 426-gigabyte, 25-million document new crawl of the .GOV domain. The main purpose of the track was to assess retrieval with very large collections now commonly seen in Web search engines. The track developed 50 topics and performed the usual pooled relevance judgments. As much attention was paid to indexing and retrieval efficiency as retrieval results. Indexing time varied from 2-14 hours and average query time was often within 1-2 seconds, indicating today's common hardware can handle quite large document collections. The best MAP results were in the range of 0.25-0.3 and were obtained by several groups.

Clarke, C., Craswell, N., et al. (2004). Overview of the TREC 2004 Terabyte Track. The Thirteenth Text REtrieval Conference Proceedings (TREC 2004), Gaithersburg, MD. NIST. http://trec.nist.gov/pubs/trec13/papers/TERA.OVERVIEW.pdf.
Craswell, N. and Hawking, D. (2002). Overview of the TREC-2002 Web Track. The Eleventh Text REtrieval Conference (TREC 2002), Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec11/papers/WEB.OVER.pdf.
Craswell, N., Hawking, D., et al. (2003). Overview of the TREC 2003 Web Track. The Twelfth Text REtrieval Conference (TREC 2003), Gaithersburg, MD. NIST. http://trec.nist.gov/pubs/trec12/papers/WEB.OVERVIEW.pdf.
Craswell, N. and Hawking, D. (2004). Overview of the TREC 2004 Web Track. The Thiteenth Text REtrieval Conference (TREC 2004), Gaithersburg, MD. NIST. http://trec.nist.gov/pubs/trec13/papers/WEB.OVERVIEW.pdf.
Hersh, W. (2002). TREC 2002 Interactive Track report. The Eleventh Text Retrieval Conference (TREC 2002), Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec11/papers/INTERACTIVE.OHSU.pdf.
Zaragoza, H., Craswell, N., et al. (2004). Microsoft Cambridge at TREC 13:  Web and Hard Tracks. The Thiteenth Text REtrieval Conference (TREC 2004), Gaithersburg, MD. NIST. http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf.

8.3.2.2 Other Web searching system evaluation

(5/19/04) A variety of other research has continued to assess search techniques in general for the Web. Craswell et al. (2001) performed additional research on the effectiveness of link-based methods for page-finding tasks. Additional benefit was found from a machine learning method. Another common searching task on the Web is finding pages that contain search interfaces to the invisible Web. Most of these pages use HTML forms (an HTML construct allowing entry of data and buttons to submit the form to a server for processing), but there are both false-negatives (a query interface that is not an HTML form) and false-positives (an HTML form that is not a query interface). Scholer et al. (2004) have pursued a related approach for page-finding, using past queries that retrieved a page as document "surrogates" for that page. Experimental results have found a 7% improvement in retrieval performance. Another approach to improving retrieval has been to take advantage of "click-through" data, ie, assuming a page is relevant if a user clicks on the link to it from a page displaying Web search engine results. Joachims (2002) has developed a system that uses machine learning to associate words in queries with pages users are likely to click on to view. Finally, Cope et al. (2003) have investigated a variety of approaches for finding searching interfaces with a variety of machine-learning methods. They found the best approach was a system that developed a decision tree (C4.5) to assess the features of pages to predict which represented search intefaces. This approach achieved an accuracy of 85%.

An additional line of research is looking at eXtensible Mark-up Language (XML) document retrieval. Early workers in this area have been Fuhr and Großjohann (2001), who note that structured XML documents may result in different retrieval tasks than simple content-based searching. As the XML tags represent semantic aspects of a document, users may wish to retrieve documents or parts therein for specific needs. Some of these queries may almost border on types of queries made to relational databases, e.g., find the cities of residence from an XML document of people. A standard for querying XML documents, XQuery, is emerging, but some researchers, such as Fuhr and Großjohann, argue it is (for now) incomplete. As such, they have developed a more comprehensive query language, XML Information Retrieval Query Language (XIRQL). Fuhr is also the organizer of an evaluation forum for XML data, Initiative for the Evaluation of XML Retrieval (INEX, http://inex.is.informatik.uni-duisburg.de:2003/), which is modeled similar to TREC.

Cope, J., Craswell, N., et al. (2003). Automated discovery of search interfaces on the Web. Proceedings of the Fourteenth Australasian Database Conference on Database Technologies, Adelaide, Australia. Australian Computer Society. 181-189.
Craswell, N., Hawking, D., et al. (2001). Effective site finding using link anchor information. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval , New Orleans, LA. ACM Press. 250-257.
Fuhr, N. and Großjohann, K. (2001). XIRQL: A Query Language for Information Retrieval in XML Documents. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval , New Orleans, LA. ACM Press. 172-180.
Joachims, T. (2002). Optimizing search engines using clickthrough data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining , Edmonton, Alberta, Canada. ACM Press. http://www.cs.cornell.edu/People/tj/publications/joachims_02c.pdf.
Scholer, F., Williams, H., et al. (2004). Query association surrogates for Web search. Journal of the American Society for Information Science , 55: 637-650.

8.3.2.3 Analysis of the Web

(5/19/03)  Additional research continues to examine the structure of the Web. Pennock et al. (2002) tested the power-law distribution of Web links, which they call the "rich get richer" nature of links on the Web, i.e., a small number of pages have a disproportionate share of links to and from them. They found that while a randomly selected portion of Web pages did indeed follow the power law, certain subsets of Web pages did not. In particular, home pages of companies, universities, newspapers, and scientists tended to follow more of a log-normal distribution. Soboroff (2002) addressed the question of whether the Web collections used in the TREC Web track, WT10g and .GOV, were similar to the Web in general and could thus have their experimental results viewed as generalizable. He indeed found that both collections obeyed the power law, with the slope of 2-3 seen in random collections of Web pages. Other similarities were noted as well.

Pennock, D., Flake, G., et al. (2002). Winners don't take all: characterizing the competition for links on the Web. Proceedings of the National Academy of Sciences, 99: 5207-5211. http://www.pnas.org/cgi/content/abstract/99/8/5207.
Soboroff, I. (2002). Do TREC Web collections look like the Web? SIGIR Forum, 36(2): 23-31. http://www.sigir.org/forum/F2002/soboroff.pdf.

(5/19/04) Fetterly et al. (2003) provided a new analysis of the evolution of Web pages, monitoring 150 million pages on a weekly basis for 11 weeks. Change of pages in the Web is certainly a concern for IR, since search engines usually lag weeks if not months behind in identifying changes in existing Web pages. This may also be important to users who cite a page for some specific content feature and may not be aware that the page has changed. Their analysis found variable rates of change, but in general, most changes to pages were small. One surprising finding was that large pages changed more frequently and extensively than smaller pages. They also found that pages in the .com and .net top-level domain were more likely to change than those in the .edu and .gov domains, and that past changes to a page was a good predictor of future change.

Thelwall has continued his work assessing the relationship between Web "impact" and the quality of scholarly work. A most recent analysis found that whle higher rated scholars produce more Web content, the impact as measured by in-links to those pages is not increased for those scholars ( Thelwall and Harries, 2004). In other words, more productive researchers produce more Web-based content but the rate of links to their sites is not increased over those who are less productive. Thelwall (2003) also attempted to categorize links between academic Web sites and found four general categories of links:
Finally, another concern about Web content is whether the output of search engines present a skewed or biased view of a topic. Gerhart (2004) has looked at five controversial topics and found the output from Web search engines does not present a diversity of views. Her analysis found that simple queries tended to present the positive views of a controversy, which she attributed to the nature of search engines, current linking practices by authors of Web pages, and the simple queries used by Web searchers.

Fetterly, D., Manasse, M., et al. (2003). A large-scale study of the evolution of web pages. Proceedings of the Twelfth International Conference on the World Wide Web, Budapest, Hungary. ACM Press. 669-678.
Gerhart, S. (2004). Do Web search engines suppress controversy? First Monday, 9: 1. http://www.firstmonday.dk/issues/issue9_1/gerhart/.
Thelwall, M. (2003). What is this link doing here? Beginning a fine-grained process of identifying reasons for academic hyperlink creation. Information Research, 8: 3. http://informationr.net/ir/8-3/paper151.html.
Thelwall, M. and Harries, G. (2004). Do the web sites of higher rated scholars have significantly more online impact? Journal of the American Society for Information Science & Technology, 55: 149-159.

8.3.2.4 Automated generation of hypertext

8.3.3 Question answering

(5/28/07) Very little question-answering work has focused on medical topics, with one recent exception. Demner-Fushman and Lin (2007) developed a system that aimed to answer evidence-based medicine (EBM) questions. They implemented a system based on a suite of "knowledge extractors" aiming to process the text of MEDLINE abstracts into the PICO (problem, intervention, control, outcome) framework, determine the strength of evidence of the article, and determine the EBM task, i.e., treatment, diagnosis, prognosis, or harm. These attributes are then matched up against EBM questions that come from users. An evaluation found the system improved substantially over the Pubmed baseline in terms of ranking documents overall as well as putting answers to the questions into the top five documents retrieved.

Demner-Fushman, D. and Lin, J. (2007). Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics, 33: 63-103.

8.3.4 Text Summarization

(5/19/05) The growing amount of textual information on the Web and in the scientific literature has rekindled interest in text summarization. A prototypical use case comes from the biological world. Many scientists increasingly use so-called "high throughput" biotechnologies that generate massive amounts of data. A gene microarray, for example, may measure 10,000 genes or gene polymorphisms. Several dozen or more genes may be up- or down-regulated in the biological process being studied by the investigator. As such, he or she may need to know a lot about the biology of one or more genes in a hurry. Although the state of the art of text summarization is not to this point yet, a system that could properly summarize information from documents about these genes could be extraordinarily useful to such scientists. There are probably many other comparable use cases outside of biomedicine.

Manually generated summaries of documents have been around for centuries. Probably the best known summary of a document is the abstract that accompanies a scientific paper. In essence, however, all documents are summaries in the sense that they are abstractions of reality, as noted by Mani (2001). Although he does note that summarization in the context of text usually means a condensed version of a longer document.

A number of overviews of text summarization have recently appeared. One of the most coherent is the book by Mani (2001) which reflects the state of the art. A special issue of the journal Computational Linguistics highlights some of the recent research developments (Radev et al., 2002). A recent survey paper gives an overview of summarization from a medical perspective (Afantenos et al., 2005). The latter draw on the work of the former to develop a list of factors influencing the output of a summarization system:
An ideal summarization would allow these parameters to be set and then create an appropriate summary.

The approaches taken to summarization are extensive. However, in general, abstraction approaches typically involve converting text to some underlying conceptual representation, usually using natural language processing techniques (see Chapter 9). This representation is then converted to narrative form based on the other factors desired, including length. Extraction, on the other hand, involves selecting the best extracts from the document(s), so the usual approach is to determine, sometimes using techniques like machine learning, what are the best sentences. Every owner of Microsoft Word can try out an extraction summarizer by choosing the Autosummarize option under the Tools menu.

Evaluation of summarization systems has been challenging. There have been two general approaches:
An overview of the evaluation challenges in summarization have been highlighted by Radev et al. (2003). This paper reports the results of a large-scale evaluation but also has a classification of evaluation measures, which include:
Interest in information summarization evaluation has led to the development of the Document Understanding Confernece (DUC, http://duc.nist.gov). Organized by NIST, it is run in a spirit similar to TREC. DUC has been guided by a roadmap developed by leading researchers in the field (Carbonell et al., 2000). In the most recent iteration, DUC 2004, there were five tasks:
  1. Very short (<75 character) summaries of single English documents
  2. Short (<665 character) summaries of mutiple English documents
  3. Very short (<75 character) English summaries of single Arabic documents
  4. Short (<665 character) English summaries of mutiple Arabic documents
  5. Short (<665 character) summaries of mutiple English documents answering the question "Who is X?," where X is a specified person
Evaluation of systems was based on the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) system, which was motivated by the work of Lin and Hovy (2003) and described by Lin (2004). ROUGE essentially works by comparing n-grams between gold standard summaries and the output of systems. In general, the results showed that human summaries in general had higher concordance with the gold standard than the automated systems. However, their overlap was far from perfect. Nenkova and Passoneau (2004) highlighted the well-known problems with using overlap of single human summaries and proposed an alternative approach called the "pyramid method."

Most work on summarization in the biomedical domain has come from a single research group and emanates from the Personalized Search and Summarization over Multimedia Information (PERSIVAL, http://persival.cs.columbia.edu/) digital library project. Their approach uses discharge summaries with mapping of terms in the cardiology domain to UMLS Metathesaurus concepts to retrieve and summarize articles (Elhadad and McKeown, 2002; McKeown et al., 2003). The user customizability of this system was described by Elhadad et al. (2005). A less patient-specific approach has been taken in the Centrifuser system, which attempts to derive features of documents (e.g., topicality, content type, readability, etc.) and can be used to process the output of results from Web search engines (Kan et al., 2001). This system processing Google output was compared with three Web search engines (Google, Yahoo, and About.com) in a usability test employing 13 subjects searching on three medical topics: diabetes, hypertension, and asthma (Kushniruk, 2002). All four systems performed comparably, with stregnths and weaknesses identified of each.

Additional work on summarization has come from a research group at the NLM, looking at both summaries of biomedical literature (Fiszman et al., 2004a) and consumer health information (Fiszman, 2004b). Both systems use a relatively common approach that proceeds through a number of steps. The first step is to map document text into semantic propositions, consisting of two concepts and a relationship between them, e.g. proton-pump inhibitors treat Zollinger-Ellison Syndome. The propositions are then transformed into a small set of specified predicates consisting of two semantic types linked by one of six relationships, such as CAUSES, TREATS, and LOCATION_OF. These are then connected into a "conceptual condenstate" that links the concepts together in a single graph. This is followed by a pruning operation that eliminates general concepts (i.e., those near the top of the terminology hierarchy). From this point, a narrative summary can be generated. An analysis of the biomedical literature found that the text could be reduced 98% in size with an 87% precision (rate of correctness). The consumer health literature was able to be reduced in size comparably, with a somewhat lower level of percision at 66%. Ultimately, of course, the output of these systems will need to be assessed with real users.

A research group that has explored the summarization of scientific articles outside of medicine is Teufel and Moens (2002).

Afantenos, S., Karkaletsis, V., et al. (2005). Summarization from medical documents: a survey. Artificial Intelligence in Medicine, 33: 157-177.
Carbonell, J., Harman, D., et al. (2000). Vision Statement to Guide Research in Question & Answering (Q&A) and Text Summarization. http://www.ai.mit.edu/people/jimmylin/papers/Carbonell00.pdf.
Elhadad, N. and McKeown, K. (2002). Towards generating patient specific summaries of medical articles. NAACL 2001 Workshop on Automatic Summarization , Pittsburgh, PA. NAACL. http://www.isi.edu/~cyl/was-naacl2001/papers/elhadad-p1.pdf.
Elhadad, N., Kan, M., et al. (2005). Customization in a unified framework for summarizing medical literature. Artificial Intelligence in Medicine, 33: 179-198.
Fiszman, M., Rindflesch, T., et al. (2004a). Abstraction summarization for managing the biomedical research literature. Proceedings of the HLT-NAACL Workshop on Computational Lexical Semantics, Boston, MA. North American Association for Computational Linguistics. 76-83. http://lhncbc.nlm.nih.gov/lhc/docs/published/2004/pub2004015.pdf.
Fiszman, M., Rindflesch, T., et al. (2004b). Summarization of an online medical encyclopedia. MEDINFO 2004 - Proceedings of the Eleventh World Congress on Medical Informatics, San Francisco, CA. IOS Press. 506-510.
Harman, D. and Over, P. (2002). The DUC summarization evaluations. Proceedings of HLT 2002 - Human Language Technology Conference, San Diego, CA.
Kan, M., McKeown, K., et al. (2001). Domain-specific informative and indicative summarization for information retrieval. DUC 2001 - Workshop on Text Summarization , New Orleans, LA. NIST. http://www-nlpir.nist.gov/projects/duc/pubs/2001papers/columbia_redo2.ps.
Kushniruk, A., Kan, M., et al. (2002). Usability evaluation of an experimental text summarization system and three search engines:  implications for the reengineering of health care interfaces. Proceedings of the AMIA 2002 Annual Symposium, San Antonio, TX. Hanley & Belfus. 420-424.
Lin, C. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. Proceedings of the 2003 Human Language Technology Conference (HLT-NAACL 2003), Edmonton, Alberta. North American Association for Computational Linguistics. http://www.isi.edu/~cyl/papers/NAACL2003.pdf.
Lin, C. (2004). ROUGE:  a package for automatic evaluation of summaries. Proceedings of Text Summarization Branches Out Workshop, Barcelona, Spain. Association for Computational Linguistics. http://www.isi.edu/~cyl/papers/WAS2004.pdf.
Mani, I. (2001). Automatic Summarization. Amsterdam. John Benjamins.
McKeown, K., Elhadad, N., et al. (2003). Leveraging a common representation for personalized search and summarization in a medical digital library. ACM/IEEE 2003 Joint Conference on Digital Libraries (JCDL 2003) , Houston, TX. IEEE Computer Society. 147-158.
Nenkova, A. and Passonneau, R. (2004). Evaluating content selection in summarization:  the pyramid method. Proceedings of the 2004 Human Language Technology Conference, Boston, MA. North American Association for Computational Linguistics. 145-152. http://www.cs.columbia.edu/~ani/papers/pyramid.pdf.
Radev, D., McKeown, K., et al. (2002). Introduction to the special issue on summarization. Computational Linguistics, 28: 399-408.
Radev, D., Teufel, S., et al. (2003). Evaluation challenges in large-scale document summarization. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan. Association for Computational Linguistics. 375-382.
Sparck Jones, K. and Sakai, T. (2001). Generic summaries for indexing in IR. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA. ACM Press. 190-198.
Teufel, S. and Moens, M. (2002). Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28: 409-445.

8.4 User evaluation of system-oriented approaches

8.4.1 Early studies

8.4.2 The TREC Interactive Track

8.4.2.1 Methods

8.4.2.2 Comparing Boolean and natural language searching

8.4.2.3 Do batch and interactive experiments give the same results?

8.4.2.4 Relevance feedback

8.4.2.5 Presentation of search results

8.5 Summary

Last updated - May 28, 2007