Update to Chapter 8 - Lexical-Statistical Systems
(5/18/03) Note in this chapter of the book that there are several
instances where TF*IDF is incorrectly replaced with TF*IOF
.
(5/17/04) A book that provides a succinct overview of the
language-processing aspects of not only information retrieval, but also
information extraction, categorization, summarization, and mining has
been published (Jackson and Moulinier, 2002)
Jackson, P. and Moulinier, I. (2002). Natural Language Processing
for Online Applications: Text Retrieval, Extraction, and
Categorization. Amsterdam. Benjamin Johns Publishing.
8.1 Evaluation of lexical-statistical systems
(5/26/07) As noted to the update of Chapter 3, a new track has
been introduced in TREC, the Genomics Track.
This is the first track within TREC to focus on retrieval in a
domain, and is chaired by this author.
The Genomics Track is motivated by the information needs being
generated by the data-intensive nature of the increasing number of
"high throughput" technologies. These technologies, exemplified
by gene microarrays, uncover new genes, proteins, and other biological
prodcuts and activities that require the attention of researchers to
understand in the context of their research. In a microarray
experiment, for example, a research may uncover dozens of genes that
are differentially (over or under) expressed relative to
controls. As such, the researcher must learn about many new
genes, proteins, etc. in a hurry. Clearly better IR systems will
help in this process.
The first year of the Genomics Track occured in TREC 2003 (Hersh and
Bhupatiraju, 2003). This first year was a "preliminary" year,
mainly due to the fact that resources were not available for relevance
judgements. As such, the track had to make use of existing data
sources, which of course resulted in their not having been developed
with the purpose of IR evaluation in mind. Nonetheless, the track
was very successful, with a total of 29 groups participating. This made
the track the second largest at TREC in terms of numbers of
participating groups. Also in 2003, however, the track was
awarded a National Science Foundation Information Technology Research
grant, which begin to provide five years of funding from the 2004 track
forward. Data from the TREC Genomics Track are
available on a password-protected Web site. Information
on how to access the data are available at the track
Web
site (http://ir.ohsu.edu/genomics/).
TREC 2004 Ad Hoc Task
The first meaningful running of the Genomics Track occured in TREC 2004
(Hersh et al., 2006). The steering committee of the track decided
to organize two tasks that would assess a variety of information
management issues in the genomics domain. The
first task was a standard ad hoc
retrieval task using topics obtained from real biomedical research
scientists
and documents from a large subset of the MEDLINE bibliographic database. The second task focused on categorization of
full-text documents, simulating the task of curators of the Mouse
Genome
Informatics (MGI) system and consisting of three subtasks.
One subtask focused on the triage of articles
likely to have experimental evidence warranting the assignment of GO
terms,
while the other two subtasks focused on the assignment of the three
top-level
GO categories. The ad hoc retrieval task is described in this
section while the categorization task is described below in section
8.3.1.
The Genomics Track turned out to have the highest participation
of any track in TREC in both 2004 and 2005, with 33 and 41 research
groups respectively.
The document collection for the ad hoc retrieval task in both 2004 and
2005 was a 10-year
subset of MEDLINE. The use of full-text documents
for this task was contemplated, but an adequate amount to represent
real-world searching could not be procured. As such, the decision
was made to use MEDLINE. In
reality, despite the widespread availability of on-line, full-text
scientific journals at present, most searchers of the biomedical
literature still use MEDLINE (through PubMed or Ovid) as an entry
point. Consequently,
there is great value in being able to search MEDLINE effectively. The
MEDLINE subset consisted of completed citations from the database
inclusive from 1994 to 2003. Records were extracted using the
Date Completed (DCOM) field for all references in the range of 19940101
- 20031231. This provided a total of 4,591,008 records, which is
about one third of the full MEDLINE database.
The topics for the 2004 ad hoc retrieval task were developed from the
information needs of real biologists and modified as little as possible
to create needs statements with a reasonable estimated amount of
relevant articles (i.e., more than zero but less than one
thousand). The information needs capture began with interviews by
12 volunteers who sought biologists in their local environments. A
total of 43 interviews yielded 74 information needs. Some of
these volunteers, as well as an additional four individuals, created
topics in the proposed format from the original interview data. The
same individuals then were assigned different draft
topics for searching on PubMed so they could be modified to generate
final topics with a reasonable number of relevant articles. The
track chair made one last pass to make the formatting consistent and
extract the 50 that seemed most suitable as topics for the track.
Relevance judgments were done using the conventional “pooling method”
whereby a fixed number of top-ranking documents from each official run
were pooled and provided to an individual (blinded to the number of
groups who retrieved the document and what their search statements
were). The relevance assessor then judged each document for the
specific topic query as definitely relevant (DR), possibly relevant
(PR), or not relevant (NR). For the official results, which
required binary relevance judgments, documents that were rated DR or PR
were considered relevant.
The primary evaluation measure for the task was mean average precision
(MAP). Results were calculated using the trec_eval program, a
standard scoring system for TREC. A statistical analysis was
performed using a repeated measures analysis of variance, with posthoc
Tukey tests for pairwise comparisons. In addition to analyzing
MAP, we also assessed precision at 10 and 100 documents. The
results and analysis of the ad hoc retrieval task are described in more
detail in track overview paper (Hersh et al., 2004). The best
performing groups used a variety of domain-independent and
domain-specific techniques. Examples of the former that worked
well included the use of Okapi weighting and query expansion. Examples
of the latter included expansion of symbols by LocusLink and
MeSH records, and other gene and protein synonyms. Groups
attempting to map topic terms into controlled vocabulary terms, in
particular gene and protein names, fared less well.
TREC 2005 Ad Hoc Task
In the 2005 ad hoc
retrieval task, topics were employed that were more structured than the
mostly free-form topics from 2004 (Hersh et al., 2005). The
purpose of this approach was to provide systems with better defined
(yet still realistic) queries for finding genomics information. As
such, topics were developed from generic structured templates so
systems could make better use of other resources, such as ontologies or
databases. It was also hoped this could serve as the basis to
begin investigation toward an interactive task.
As with 2004, we collected topics from real biologists. However,
instead of soliciting free-form topics, we provided the biologists we
interviewed with generic templates and asked them to express
information needs they had recently that fit within those templates.
While it would have been ideal to interview users and develop the
templates themselves from such interviews, the time frame of the track
did not allow this. Instead, a set of generic topic templates
(GTTs) were developed from an analysis of the topics from the 2004
track and other known biologist information needs. After we
developed the GTTs, 11 people went out interviewed 25 biologists to
obtain instances of them. Other people then searched on the
topics to make sure there were at least some, although not too many,
relevant documents. The topics did not have to fit precisely into
the GTTs, but had to come close (i.e., have all the required semantic
types).
As with 2004, there were 50 topics. Closure was reached on five
GTTs, each of which had 10 instances, for a total of 50 topics. The
five GTTs are listed below. The semantic types in each GTT
are underlined. For some semantic types, more than one instance
is allowed. The five GTTs were:
- Find articles describing standard methods or protocols for
doing some sort of experiment or procedure.
- Find articles describing the role of a gene involved in a given disease.
- Find articles describing the role of a gene in a specific biological process.
- Find articles describing interactions (e.g., promote, suppress,
inhibit, etc.) between two or more genes in the function of an organ or in
a disease.
- Find articles describing one or more mutations of a given gene and its biological
impact.
Relevance judgments were done using the
conventional pooling
method of TREC. Based on estimation of
relevance judgment resources, the top 60 documents for each topic from
all
official runs were used. This gave an
average
pool size of 821 documents with a range of 290 to 1356. These
pools were then provided to the
relevance judges, who consisted of five individuals with varying
expertise in
biology. The relevance judges were
instructed in the following manner for each GTT:
- Relevant article must describe how to conduct, adjust, or improve
a standard, a, new method, or a protocol for doing some sort of
experiment or procedure.
- Relevant article must describe some specific role of the gene in
the stated disease or biological process.
- Relevant article must describe a specific interaction (e.g.,
promote, suppress, inhibit, etc.) between two or more genes in the
stated function of the organ or the disease.
- Relevant article must describe a mutation of the stated gene and
the particular biological impact(s) that the mutation has been found to
have.
- The articles had to describe a specific gene, disease,
impact, mutation, etc. and not just the concept in general.
A total of 32 groups submitted 58 runs. Six
topics had no definitely relevant documents.
One topic had no definitely or possibly relevant documents and
was
dropped from the calculation of official results.
The overall results showed that manual synonym
expansion helped but automated synonym expansion did not. Relevance
feedback without term expansion helped. The basic Okapi
system with well-tuned parameters gave good baseline performance. As
with the 2004 track, and TREC in general, the preliminary analysis
performed in the context of the yearly cycle precluded better
characterization of baseline and other experiments that would improve
their understanding.
TREC 2006 Track
The TREC Genomics Track implemented a new task in
2006 that focused on passage retrieval for question answering using
full-text documents from the biomedical literature (Hersh et al.,
2006). The track was based on the premise that what many information
seekers, especially users of the biomedical literature, desire
something in between answers to questions and full documents, i.e., a
system that attempts to provide short, specific answers to questions
and put them in context by providing supporting information and linking
to original sources.
A test collection of 162,259 full-text documents and 28 topics
expressed as questions was assembled. Systems were required to return
passages that contained answers to the questions. Expert judges
determined the relevance of passages and grouped them into aspects
identified by one or more Medical Subject Headings (MeSH) terms.
Document relevance was defined by the presence of one or more relevant
aspects. The performance of submitted runs was scored using mean
average precision (MAP) at the passage, aspect, and document level.
For the TREC 2006 Genomics Track, the task consisted of retrieval of
short passages (from phrase to sentence to paragraph in length) that
specifically addressed an information need, along with linkage to the
location in the original source document. Topics were expressed as
questions and systems were measured on how well they retrieved relevant
information at the passage, aspect, and document levels. Systems were
required to return passages linked to source documents, while relevance
judges not only rated the passages, but also grouped them by aspect.
For this task, aspect was defined similar to its definition in the TREC
Interactive Track aspectual recall task (Hersh et al., 2001),
representing answers that covered a similar portion of a full answer to
the topic question. It also drew upon experience in passage retrieval
from the previous TREC High Accuracy Retrieval from Documents (HARD)
Track (Allan, 2003; Allan, 2004).
The documents for the 2006 task came from a new full-text biomedical
corpus obtained from a number of publishers who use Highwire Press
(www.highwire.org) for electronic distribution of their journals. They
agreed to allow use of their full text in HTML format, which preserved
formatting, structure, table and figure legends, etc. The document
collection was derived from 49 journals and were obtained by a Web
crawl of the Highwire site, with post-processing to eliminate as much
non-article material as we could. The full collection contained 162,259
documents. The collection was about 12.3 GB when uncompressed. The
paper describing the track lists the journals and number of documents
from each.
For the 2006 track, there were three levels of retrieval performance
that we measured: passage retrieval, aspect retrieval, and document
retrieval. Each of these provided insight into the overall performance
for a user trying to answer the given topic questions. Each was
measured by some variant of mean average precision (MAP).
Passage-level MAP - This measure used a variation of MAP, computing
individual precision scores for passages based on character-level
precision, using a variant of a similar approach used for the TREC 2004
HARD Track (Allan, 2004). For each nominated passage, the number of
characters that overlapped with those deemed relevant by the judges in
the gold standard was determined. For each relevant retrieved passage,
precision was computed as the fraction of characters overlapping with
the gold standard passages divided by the total number of characters
included in all nominated passages from this system for the topic up
until that point. Similar to regular MAP, remaining relevant passages
that were not retrieved (no overlap with any nominated passages) were
added into the calculation as well, with precision set to 0 for these
relevant non-retrieved gold standard passages. Then the mean of these
average precisions over all topics was calculated to compute the MAP
for passages. Note that this measure is essentially the fraction of
retrieved characters that are part of an answer to the topic question.
Aspect-level MAP - Aspect retrieval was measured using the average
precision for the aspects of a topic, averaged across all topics. To
compute this, for each submitted run, the ranked passages were
transformed to two types of values, either the aspect(s) of the gold
standard passage that the submitted passage overlapped with or the
value “not relevant”. This resulted was a ranked list, for each run and
each topic, of lists of aspects per passage, Non-relevant passages had
empty lists of aspects. Because the organizers were uncertain of the
utility for a user of a repeated aspect (e.g., same aspect occurring
again further down the list), they discarded these from the output to
be analyzed. For the remaining aspects of a topic, they calculated MAP
similar to how it is calculated for documents, with the additional
wrinkle that a single passage may have associated with it multiple
aspects. Therefore the precision for the retrieval of each aspect was
computed as the fraction of relevant passages for the retrieved
passages up to the current passage under consideration. These fractions
at each point of first aspect retrieval were then averaged together to
compute the average aspect precision. Relevant passages that did not
contribute any new aspects to the aspects retrieved by higher ranked
passages were removed from the ranking. Taking the mean over all topics
produced the final aspect-based MAP.
Document-level MAP - For the purposes of this measure, any PMID that
had a passage associated with a topic ID in the set of gold standard
passages was considered a relevant document for that topic. All other
documents were considered not relevant for that topic. System run
outputs were collapsed by PMID document identifier, with the documents
appearing in the same order as the first time the corresponding PMID
appeared in the nominated passages for that topic. For a given system
run, average precision was measured at each point of correct (relevant)
recall for a topic. The MAP was the mean of the average precisions
across topics.
In general, document MAP scores were highest, followed by aspect, and
then passage, although these scores were not directly comparable since
they measure precision at recall of different things. There was a
general, though far from perfect, correlation between passage, aspect,
and document MAP. As seen in many TREC-style evaluations, statistical
significance, based on pair-wise comparison with the top-ranking score
in an ANOVA model, was not achieved for any measure until well down the
ranked list of runs.
Overall there was a wide variation in system performance across
submissions for each of the three measures. In general, scores grouped
into three sets. A few groups dominated the high scores of each
measure, followed by a large group with scores around the mean, and
then another large group of relatively low scores. Submissions that
scored well on document retrieval tended to score well on both passage
and aspect. While a correlation between passage and document retrieval
might have been expected, the correlation between document and aspect
retrieval is more surprising since aspect retrieval places an emphasis
on novelty and document retrieval does not.
For all three measures, the best automatic approaches were as good or
better than manual or interactive systems. Manual and interactive
approaches did not appear to provide an advantage over automatic
methods. However, because the definitions of automatic, manual, and
interactive were not as solid as in previous years because systems had
the topic questions available during development, inference should be
limited from these observations.
It was clear from the results and techniques of the top-performing
groups in passage retrieval that certain approaches were quite
effective. In particular, “trimming” passages to shorten them was done
in all the runs with the highest passage MAP. Indeed, some groups noted
that non-content manipulations of passages had substantial effects on
passage MAP, with one group claiming that breaking passages in half
with no other changes doubled their (otherwise low) score. To this end,
the organizers defined an alternative passage MAP (PASSAGE2) that
calculated MAP as if each character in each passage were a ranked
document. In essence, the output of passages was concatenated, with
each character being from a relevant passage or not. The results show
that some re-ranking of runs occurred.
This novel approach to the TREC 2006 Genomics Track was carried out
successfully, leading to the development of a new test collection with
new documents and tasks, as well as a new evaluation method and the
software to administer and score it. While further analysis of results
is required for more definitive conclusions, it can be noted that
passage retrieval in this context is quite difficult, with results
quite low. Fortunately, retrieval at the aspect and document levels is
much better, indicating users still might be able to find answers to
their questions in the biomedical literature. Duplicate relevance
assessments showed relatively good levels of reproducibility, with one
exceptional outlier.
The TREC 2007 Genomics Track will continue in the same direction, using
the existing document collection and task structure but adding
completely new topics and potentially new topic types. The 2007 track
will be the last running of the Genomics Track within TREC, although
future options to continue biomedical IR challenge evaluations are
being explored.
Allan, J. (2003). HARD Track overview in TREC 2003 - high accuracy
retrieval from documents. The
Twelfth Text REtrieval Conference - TREC 2003, Gaithersburg, MD.
Naitonal Institute of Standards and Technology. 24-37. http://trec.nist.gov/pubs/trec12/papers/HARD.OVERVIEW.pdf.
Allan, J. (2004). HARD Track overview in TREC 2004 - high accuracy
retrieval from documents. The
Thirteenth Text Retrieval Conference (TREC 2004), Gaithersburg,
MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec13/papers/HARD.OVERVIEW.pdf.
Hersh, W. (2001). Interactivity at the Text Retrieval Conference
(TREC). Information Processing and
Management, 37: 365-366.
Hersh, W. and Bhupatiraju, R. (2003). TREC genomics track overview. The Twelfth Text Retrieval Conference
(TREC 2003), Gaithersburg, MD. NIST. 14-23. http://trec.nist.gov/pubs/trec12/papers/GENOMICS.OVERVIEW3.pdf.
Hersh, W., Bhupatiraju, R., et al. (2006). Enhancing access to the
bibliome: the TREC 2004 Genomics Track. Journal of Biomedical Discovery and
Collaboration, 1: 3. http://www.j-biomed-discovery.com/content/1/1/3.
Hersh, W., Cohen, A., et al. (2005). TREC 2005 Genomics Track overview.
The Fourteenth Text Retrieval
Conference - TREC 2005, Gaithersburg, MD. National Institute for
Standards & Technology. http://trec.nist.gov/pubs/trec14/papers/GEO.OVERVIEW.pdf.
Hersh, W., Cohen, A., et al. (2006). TREC 2006 Genomics Track overview.
The Fifteenth Text Retrieval
Conference (TREC 2006), Gaithersburg, MD. National Institute for
Standards & Technology. http://trec.nist.gov/pubs/trec15/papers/GEO.OVERVIEW.pdf.
(5/24/06) Another goal of retrieval evaluation is to determine
how to predict the difficulty of queries. Several research
initiatives have addressed this topic in recent years.
In 2003, a workshop entitled Reliable
Information Access (RIA) was held
to address variability in retrieval results (Buckley and Harman,
2004). Two "tracks" addressed different facets and approaches to
the problem. A "bottom-up" track carried out a large failure
analysis, with six IR systems contributing one run each of 45 topics,
with a detailed manual analysis of the results. A "top-down"
track performed a number of runs using different variations of query
expansion.
The bottom-up analysis found that systems obtained comparable
performance (e.g., MAP) scores but performed differently across topics,
retrieving different documents and emphasizing different aspects of the
topics. However, it was concluded that all systems failed for
similar reasons, usually missing some aspect of a topic that would lead
to retrieval of more relevant documents. Another conclusion was
that if systems could recognize the problem causing the failure, then
substantially better retreival could be obtained. In other words,
emphasis on future system development should focus on what current
techniques can be applied to which situations, and not on developing
new retrieval techniques.
The top-down analysis found that query expansion (also known as blind
feedback, i.e., adding terms from highly ranked
documents into a query to expand the number of terms and augment
retrieval) was highly sensitive to variations in approaches. For
example, the selection of the initial document set for expansion, the
number of terms used, and the terms chosen greatly influenced
performance.
Another workshop held in 2005 focused on predicting query difficulty
(Carmel et al., 2005). The TREC Robust Track has also focused on
performance on topics that have historically been difficult, assessing
methods to recognize such queries and employ methods to improve results.
Buckley, C. and Harman, D. (2004). Reliable
Information Access Final Workshop Report. Bedford, MA, MITRE
Corp. https://rrc.mitre.org/pubs/ria_final.pdf.
Carmel, D., Yom-Tov, E., et al. (2005). Predicting query difficulty -
methods and applications. SIGIR Forum,
39(2): 25-28. http://www.acm.org/sigir/forum/2005D/2005d_sigirforum_carmel.pdf.
Voorhees, E. (2006). The TREC 2005 Robust Track. SIGIR Forum, 40(1): 41-48. http://www.acm.org/sigir/forum/2006J/2006j_sigirforum_voorhees.pdf.
8.2 Basic lexical-statistical approaches
8.2.1 The vector-space model
(5/17/04) Note from the errata file
that the formulae for cosine normalization from Equations
2 and 3 of Chapter 6 (page 189) are incorrect. Here are the
correct versions:
The calculations in this section of the chapter are correct, i.e., only
the formula is incorrect.
8.2.2 Variations in term weighting
8.2.2.1 Variants of TF
8.2.2.2 Okapi weighting
(5/18/03) Equation 7 on page 273 of the book is slightly
misformatted. Although not incorrect, it could be potentially
misleading. In the second addend, the 1.5 is multiplied by the
quantity (Length of document) / (Average document length).
8.2.2.3 Pivoted normalization
8.2.3 Automatic thesaurus generation
8.2.4 Latent semantic indexing
8.2.5 Phrase construction
8.2.6 Passage retrieval
(5/26/07) As noted above, the TREC 2006 Genomics Track focused on
passage retrieval.
Hersh, W., Cohen, A., et al. (2006). TREC 2006 Genomics Track overview.
The Fifteenth Text Retrieval
Conference (TREC 2006), Gaithersburg, MD. National Institute for
Standards & Technology. http://trec.nist.gov/pubs/trec15/papers/GEO.OVERVIEW.pdf.
8.2.7 Relevance feedback and query expansion
(5/20/03) On page 281, formula (11), the formula for pair weight
in relevance feedback, is incorrect. The quantity after the minus
sign should be, Correlation in nonrelevant documents, i.e.,
Pair weight = Correlation of pair to relevance *
(Correlation in relevant documents - Correlation in nonrelevant
documents)
8.2.8 Probabilistic IR
8.2.8.1 Basic probabilistic IR
8.2.8.2 Language modeling
(5/17/04) The language modeling continutes to attract researcher
interest and retrieval success. Part of the success of these
models results from the "smoothing" methods that give weight to terms
in the query that do not occur in retrieved documents. Zhai and
Lafferty investigated this feature in more detail and derived a number
of new conclusions about this approach to IR. These language
models also allow the measurement of query "clarity," which is defined
as a measure of the deviation between in the query and document
language models from the general collection model (Cronen-Townsend et
al., 2002). Cronen-Townsend et al. found that query clarity was a
good predictor of retrieval results using TREC ad hoc test
collections. However, research by Turpin and Hersh (2004)
assessing real user queries from their TREC Interactive Track
experiments (see Section 8.4.2) failed to find this association.
Cronen-Townsend, S., Zhou, Y., et al. (2002). Predicting query
performance. Proceedings of the 25th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval,
Tampere, Finland. ACM Press. 299-306.
Turpin, A. and Hersh, W. (2004). Do Clarity Scores for Queries
Correlate with User Performance? Proceedings of the Fifteenth
Australasian Database Conference (ADC2004), Dunedin, New Zealand.
Australian Computer Society. 85-91.
http://crpit.com/confpapers/CRPITV27Turpin.pdf.
Zhai, C. and Lafferty, J. (2004). A study of smoothing methods for
language models applied to information retrieval. ACM Transactions
on Information Systems, 22: 179-214.
8.2.9 Stemming
8.2.10 Implementations of lexical-statistical systems
(5/16/05) The IR systems of the NLM have generally not used
natural language query techniques, perhaps with the exception of the
"Related Articles" feature of PubMed. However, as noted in
Chapter 6, the askMEDLINE interface (http://askmedline.nlm.nih.gov/)
is an exception (Fontelo, 2005). Designed mainly for handheld
devices, the system allows natural language queries to be
entered. The system performs a variety of interactions with the
underlying PubMed engine.
Fontelo, P., Liu, F., et al. (2005). askMEDLINE: a free-text, natural
language query tool for MEDLINE/PubMed. BMC Medical Informatics and Decision Making,
5: 5. http://www.biomedcentral.com/1472-6947/5/5.
(5/20/06) As noted in the Chapter 1 update, an up-to-date list of
open-source search engines is at
http://www.searchtools.com/tools/tools-opensource.html. One
system of note is
Lucene,
which is written in Java and is now part of the open-source Web server Apache (Gospodnetic,
2005). Another new
open-source search engine is Zettair. The name
comes from zettabyte and IR, with the former representing the quantity
of 2 to the 70th power bytes, which is equal to 1,024 exabytes or approximately 1021 bytes. Zettair
indexes HTML or TREC collections. It was developed for the TREC
Terabyte Track, so is very fast and efficient.
Gospodnetic, O. and Hatcher, E. (2005). Lucene in Action.
Greenwich, CT. Manning Publications.
(5/18/04) Two tutorials for using the SMART system, although not
updated in several years, are available:
8.3 Specific lexical-statistical applications
8.3.1 Routing and filtering
(5/28/07) Besides the TREC Genomics Track categorization task, little
text categorization work has focused on biomedical topics. One
exception is recent work by Cohen et al. (2006), who attempted to
determine whether automated classification of document citations could
be useful in
reducing the time spent by experts reviewing journal articles for
inclusion in updating systematic reviews of drug class efficacy for
treatment of disease.
They developed a voting perceptron-based automated citation
classification system that classified articles as containing
high-quality, drug
class-specific evidence or not. Performance was assessed using
cross-validation experiments. They found that at the level of 95%
recall, there was a reduction in the number
of articles needing manual review in 11 of 15 drug review topics by as
much as 50%. They concluded that automated document
citation classification could be a useful tool in maintaining
systematic reviews of the efficacy of drug therapy. They are planning
to carr out further work refining the classification system and
determining the best
manner to integrate the system into the production of systematic
reviews.
Another approach, not categorization or filtering per se, involved the
use of reordering documents in retrieval output based on citation data
(Bernstam, 2006). They assessed a variety of citation-based algorithms
that attempted to rank documents in search output not by term
frequencies or related approaches, but rather by measures of citation
frequency. The best-known of these approaches is the PageRank algorithm
used by Google.
They compared eight different algorithms: simple PubMed queries, PubMed
clinical queries, vector cosine, citation count,
journal impact factor, PageRank, and a machine learning approach based
on
polynomial support vector machines. The aim was to prioritize
important articles, defined as being included in a pre-existing
bibliography of important literature in surgical oncology. The
citation-based algorithms were found to be more effective than
noncitation-based
algorithms at identifying important articles. The most effective
strategies were simple citation count and PageRank, which on average
identified over six important articles in the first 100 results
compared to <1 for the best noncitation-based algorithm. Similar
differences were observed between citation-based and
noncitation-based algorithms at 10, 20, 50, 200, 500, and 1,000
results. They also assessed citation lag, i.e., how it takes a period
of time before citations to appear to an important article. This was
found to affect performance of PageRank more than
simple citation count. In spite of citation lag,
however, the citation-based algorithms were still more effective than
noncitation-based
algorithms. They concluded that algorithms that were successful on the
World Wide Web could be applied to biomedical information retrieval,
helping to identify important articles within
large sets of relevant results.
Bernstam, E., Herskovic, J., et al. (2006). Using citation data to
improve retrieval from MEDLINE. Journal
of the American Medical Informatics Association, 13: 96-105.
Cohen, A., Hersh, W., et al. (2006). Reducing workload in systematic
review preparation using automated citation classification. Journal of the American Medical
Informatics Association, 13: 206-219.
(5/17/04) A number of new interesting, if not controversial,
techniques have emerged from text categorization research. In
addition, a comprehensive overview has recently been written
(Sebastiani, 2004) and a new test collection has become available
(Lewis et al., 2004). The latter will insure research in
filtering will continue, especially since the TREC Filtering Track was
discontinued after 2002 (Robertson and Soboroff, 2002).
One of the controversial techniques is plagiarism detection (a topic no
doubt of interest to students of mine reading this!). Schleimer
et al. (2003) describe a system for "document fingerprinting" that
enables copies of documents and sub-documents within them. A
growing cat-and-mouse environment has emerged as Web-based services
make the purchase of term
papers easier than ever while the tools for detecting them are also
more
powerful due to their availability to crawl Web sites and compare
student
papers already entered into its database from the local purchaser and,
sometimes, beyond. An overview of the leading companies offering
plagiarism
detection tools is at
http://www.wou.edu/provost/library/staff/kincanon/plagiarism/chart.htm.
The controversies behind these tools have been enumerated by
Foster (2002).
Aonther controversial use of text categorization is the prediction of
author gender based on writing style (Ball, 2003). A group of
researchers developed algorithms to predict author gender for a group
of 566 English-language texts with about 80% accuracy (Koppel et al.,
2002; Argamon et al., 2003). The controversy in their work is
that it "confirms" a number of gender stereotypes, e.g., men talk more
about objects versus women talking more about relationships and female
writers using more pronouns. Their algorithms are also able to
discern fiction vs. non-fiction with about 98% accuracy, which is
probably less surprising.
There has been little work with text categorization in the biomedical
domain beyond the use of MEDLINE category assignment in TREC-8
described in the book. Recent work by Aphinyanaphongs et al.
(2005), however, has shown that these techniques can be
used to identify high-quality better than the MEDLINE Clinical Queries
algorithms of Haynes et al. (1994 version). The best approach
uses polynomial support vector machines. One limitation of this
work is that it does not compare results with the newer Clinical
Queries filters developed in recent years. These authors have
also found ways to express these queries using Boolean operators so
they can be used in exact-match
retrieval systems (Aphinyanaphongs and Aliferis, 2004).
Additional work in biomedical text categorization took place in the
TREC 2004 Genomics Track. The categorization task consisted of
simulation of two of the
classification
activities carried out by human annotators for the MGI system: a
triage task and two simplified variations of MGI’s annotation
task. Systems were required to classify full-text documents from
a two-year span (2002-2003) of three journals, with the first year’s
(2002) documents comprising the training data and the second year’s
(2003) documents making up the test data.
The documents for the categorization task consisted of articles from
three journals over two years obtained from Highwire Press (http://www.highwire.org). Highwire
is a “value added” electronic publisher of scientific
journals. Most journals in their collection are published by
professional associations, with the copyright remaining with the
associations. Highwire originally began with biomedical journals,
but in recent years has expanded into other disciplines. They
have also supported IR and related research by acting as an
intermediary between consenting publishers and information systems
research groups who want to use their journals, such as the Genomics
Track.
The journals available and used by the track were Journal of Biological Chemistry
(JBC), Journal of Cell Biology
(JCB), and Proceedings of the
National Academy of Science (PNAS). These journals have a
good proportion of mouse genome articles. Each of the papers from
these journals was provided in SGML format based on Highwire’s Document
Type Definition (DTD). Articles from the year 2002 were used as
training data and articles from 2003 as test data. The documents
for the
categorization tasks came from a subset of articles having the words
mouse, mice or murine as described above. A crosswalk
file (look-up table) that matched an identifier for each Highwire
article (its file name) and its corresponding PubMed ID (PMID) was made
available to participants. The SGML training document collection was
150 megabytes in size
compressed and 449 megabytes uncompressed. The SGML test document
collection was 140 megabytes compressed and 397 megabytes uncompressed.
The framework for evaluation in the categorization task was based on
the
following table of possibilities:
|
Relevant (classified)
|
Not relevant (not classified)
|
Total
|
Retrieved
|
True positive (TP)
|
False positive (FP)
|
All retrieved (AR)
|
Not retrieved
|
False negative (FN)
|
True negative (TN)
|
All not retrieved (ANR)
|
|
All positive (AP)
|
All negative (AN)
|
|
The consensus of the track was to use a utility measure for the triage
task and F measure for the annotation tasks. A
Python program that calculated both statistics for each
task was developed.
The measure for the triage task was the utility measure often applied
in
text categorization research and used by the former TREC Filtering
Track. This measure contains coefficients for the
utility of retrieving a relevant and retrieving a nonrelevant document.
A version was used that was normalized by the best possible
score:
Unorm = Uraw / Umax
For a test collection of documents to categorize, Uraw was
calculated as follows:
Uraw = (ur * TP)
+ (unr * FP)
where:
- ur = relative utility of relevant document
- unr = relative utility of nonrelevant document
Values for ur and unr were used that were driven
by
boundary cases for different results. In
particular, it was desired for the measure to have the following
characteristics:
- Completely perfect prediction - Unorm = 1
- All documents designated positive (triage everything) - 1 > Unorm
> 0
- All documents designated negative (triage nothing) - Unorm
= 0
- Completely imperfect prediction - Unorm < 0
In order to achieve the above boundary cases, they set ur
> 1. The ideal approach would have been to interview MGI
curators and
use decision-theoretic approaches to determine their utilty. However,
time did not permit us to do that. Deciding that the
triage-everything approach should have a higher score than the
triage-nothing approach, they estimated that a Unorm in the
range of 0.25-0.3 for the triage-everything condition would be
appropriate. Solving for the above boundary cases with Unorm
~ 0.25-0.3 for that case, they obtained a value for ur ~
20. To keep calculations simple, they chose a value of ur
= 20. The table below shows the value of Unorm for the
boundary cases.
Situation
|
Unorm - Training |
Unorm - Test |
| Completely perfect prediction |
1.0
|
1.0
|
| Triage everything |
0.27
|
0.33
|
| Triage nothing |
0
|
0
|
| Completely imperfect prediction |
-0.73
|
-0.67
|
The measure Umax was calculated by assuming all relevant
documents are retrieved and no nonrelevant documents are retrieved:
Umax = ur * AP
(This happens to equal AN.)
Thus, for the training data,
Uraw = (20 * TP) - FP
Umax = 20 * 375 = 7500
Unorm = [(20 * TP) - FP] /
7500
(If you plug in the bounary conditions to the these equations, you
should obtain the results specified above.)
Likewise, for the test data,
Uraw = (20 * TP) - FP
Umax = 20 * 420 = 8400
Unorm = [(20 * TP) - FP] /
8400
The measures for the annotation subtasks were based on the notion of
identifying tuples of data. Given the article and gene, systems
designated one or both of the folloing tuples:
- <article, gene, GO hierarchy code>
- <article, gene, GO hierarchy code, evidence code>
They employed a global recall, precision, and F measure evaluation
measure
for each subtask:
Recall = number of tuples correctly
identified / number of correct tuples = TP / AP
Precision = number of tuples correctly
identified / number of tuples identified = TP / AR
F = (2 * recall * precision) / (recall + precision)
For the training data, the number of correct <article, gene, GO
hierarchy code> tuples was 593, while the number of correct
<article, gene, GO hierarchy code, evidence code> tuples for the
test data was 644.
More details about the triage and annotation hierarchy subtasks are
available on the track Web site (Hersh et al., 2004). For the
triage subtask, a variety of approaches were used by the different
groups. However, none performed any better than simply using the
MeSH term Mouse.
Aphinyanaphongs, Y. and Aliferis, C. (2004). Learning Boolean queries
for article quality filtering. MEDINFO
2004 - Proceedings of the Eleventh World Congress on Medical Informatics,
San Francisco, CA. IOS Press. 263-267. http://discover1.mc.vanderbilt.edu/discover/public/Publications/Boolean_Queries.pdf.
Aphinyanaphongs, Y., Tsamardinos, I., et al. (2005). Text
categorization models for high-quality article retrieval in internal
medicine. Journal of the American
Medical Informatics Association, 12: 207-216.
Argamon, S., Koppel, M., et al. (2003). Gender, genre, and writing
style in formal written texts. Text, 23: 321-346.
Ball, P. (2003). Computer program detects author gender. Nature.
http://www.nature.com/nsu/030714/030714-13.html.
Foster, A. (2002). Plagiarism-detection tool creates legal quandary.
Chronicle of Higher Education. May 17, 2002.
http://chronicle.com/free/v48/i36/36a03701.htm.
Hersh, W., Bhuptiraju, R., et al. (2004). TREC 2004 genomics track
overview. The Thirteenth Text
Retrieval Conference (TREC 2004), Gaithersburg, MD. NIST. http://trec.nist.gov/pubs/trec13/papers/GEO.OVERVIEW.pdf.
Koppel, M., Argamon, S., et al. (2002). Automatically categorizing
written texts by author gender. Literary and Linguistic Computing,
17: 401-412.
Lewis, D., Yang, Y., et al. (2004). RCV1: a new benchmark
collection for text categorization research. Journal of Machine
Learning Research , 5: 361-397.
Robertson, S. and Soboroff, I. (2002). The TREC 2002 Filtering Track
report. The Eleventh Text REtrieval Conference (TREC 2002),
Gaithersburg, MD. NIST. 27-39.
http://trec.nist.gov/pubs/trec11/papers/OVER.FILTERING.pdf.
Schleimer, S., Wilkerson, D., et al. (2003). Winnowing: local
algorithms for document fingerprinting. Proceedings of the 2003 ACM
SIGMOD International Conference on Management of Data, San Diego,
CA.
ACM Press. 76-85.
Sebastiani, F. (2005). Text Categorization, 109-129, in Zanasi, A., ed.
Text Mining and its
Applications. Southampton, UK. WIT Press. http://nmis.isti.cnr.it/sebastiani/Publications/TM05.pdf.
8.3.2 Web searching
(5/17/04) A new book describes the gamut of models that define,
retrieve, and process Web content (Baldi et al., 2003).
Baldi, P., Frasconi, P., et al. (2003). Modeling the Internet and
the Web - Probabilistic Methods and Algorithms. West Sussex,
England. John Wiley & Sons.
8.3.2.1 TREC Web Track
(5/18/05) The TREC 2002 Web Track employed a new test collection,
.GOV (Craswell and Hawking, 2002). The collection consisted of an
18-gigabyte crawl of 1.25 million documents (not only HTML pages, but
also PDF, Word, and Postscript files) from the .gov (U.S. government)
domain, replacing the five-year-old data in WT10g. Documents
longer than
100KB were truncated at that length. The TREC 2002 Interactive
Track
employed the .GOV collection as well (Hersh, 2002).
The TREC 2002 Web track had two experimental tasks: a
"topic-distillation" task that sought to find high-quality,
authoritative pages on given topics and a named-page-finding task that
sought specific pages. The former task proved difficult for
groups due to uncertainty over the task's meaning. For the
named-age-finding task, groups found added performance
by exploiting document structure (e.g., title and keywords) or anchor
text.
The TREC 2003 Web track continued the use of the .GOV
collection as well as the topic-distillation and named-page-finding
tasks (Craswell
et al., 2003). The primary metrics used to measure performance
were
R-precision (precision at the point of n documents retrieved, where n
was
the number of relevant documents for that query) for the former and
mean
reciprocal rank (MRR) for the latter. Both tasks represented
types
of searches that retrieved home pages and other high-level pages that
facilitate
navigation into a site. Techniques found to work well included
searching
over link text (ie, that between <A> and </A> on an HTML
page),
since that text is likely to describe the page its links to, and
analyzing
URLs to find those likely to point to home pages (eg, are short).
The
TREC 2003 Web Track also incorporated the former standalone Interactive
Track.
The TREC 2004 Web track was the final running of this track (Craswell
and Hawking, 2004). (It was replaced in 2005 with the
"Enterprise Track" that focused on retrieval of documents from the
intranet of a large enterprise.) The 2004 track combined three
tasks that had been used in previous years: topic distillation,
home-page finding, and named-page finding. There were 75 topics
for each task. Not surprisingly, there was many fewer "relevant"
documents for the latter two tasks (1600 vs. 80-83).
The track used a variety of established and new evaluation
measures. In addition to MAP and MRR, they added Success@N, where
N = 1, 5, or 10 and was defined as the number of topics where the
"answer" appeared at or above rank N. For example, Success@10
meant the number of topics where the answer appeared in the top 10
documents of the output. The best results came from a group from
Microsoft Research (Zaragoza et al., 2004). Their system obtained
the best results using the following features:
- Okapi weighting of text from title, body, and anchor text
pointing to the page from other pages
- PageRank function
- Inverse length of URL (i.e., the number of characters in the URL,
with shorter being better)
An addition exercise in the TREC 2004 Web track was to automatically
classify the topics into the three categories (topic distillation, home
page, and named page). Since there were three categories, a
purely random approach would achieve a 25% rate of correctness. The
best performing group obtained a 61.3% rate of correctness. Virtually
all groups were best able to distinguish topic distillation
topics from the other two. None were really able to exploit this
information to improve retrieval performance.
A related development at TREC 2004 was the running of the Terabyte
Track (Clarke et al., 2004). We include it here because it also
made use of a Web crawl. In fact, the track developed a new
document collection, called GOV2, which consisted of a 426-gigabyte,
25-million document new crawl of the .GOV domain. The main
purpose of the track was to assess retrieval with very large
collections now commonly seen in Web search engines. The track
developed 50 topics and performed the usual pooled relevance
judgments. As much attention was paid to indexing and retrieval
efficiency as retrieval results. Indexing time varied from 2-14
hours and average query time was often within 1-2 seconds, indicating
today's common hardware can handle quite large document
collections. The best MAP results were in the range of 0.25-0.3
and were obtained by several groups.
Clarke, C., Craswell, N., et al. (2004). Overview of the TREC 2004
Terabyte Track. The Thirteenth Text REtrieval Conference Proceedings
(TREC 2004), Gaithersburg, MD. NIST. http://trec.nist.gov/pubs/trec13/papers/TERA.OVERVIEW.pdf.
Craswell, N. and Hawking, D. (2002). Overview of the TREC-2002 Web
Track. The Eleventh Text REtrieval
Conference (TREC 2002),
Gaithersburg, MD. National Institute of Standards and Technology.
http://trec.nist.gov/pubs/trec11/papers/WEB.OVER.pdf.
Craswell, N., Hawking, D., et al. (2003). Overview of the TREC 2003 Web
Track. The Twelfth Text REtrieval Conference (TREC 2003),
Gaithersburg, MD. NIST.
http://trec.nist.gov/pubs/trec12/papers/WEB.OVERVIEW.pdf.
Craswell, N. and Hawking, D. (2004). Overview of the TREC 2004 Web
Track. The Thiteenth Text REtrieval
Conference (TREC 2004), Gaithersburg, MD. NIST. http://trec.nist.gov/pubs/trec13/papers/WEB.OVERVIEW.pdf.
Hersh, W. (2002). TREC 2002 Interactive Track report. The Eleventh Text
Retrieval Conference (TREC 2002), Gaithersburg, MD. National
Institute
of Standards and Technology.
http://trec.nist.gov/pubs/trec11/papers/INTERACTIVE.OHSU.pdf.
Zaragoza, H., Craswell, N., et al. (2004). Microsoft Cambridge at TREC
13: Web and Hard Tracks. The
Thiteenth Text REtrieval Conference (TREC 2004), Gaithersburg,
MD. NIST. http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf.
8.3.2.2 Other Web searching system evaluation
(5/19/04) A variety of other research has continued to assess
search techniques in general for the Web. Craswell et al. (2001)
performed additional research on the effectiveness of link-based
methods for page-finding tasks. Additional benefit was found from
a machine learning method. Another common searching task on the
Web is finding pages that contain search interfaces to the invisible
Web. Most of these pages use HTML forms (an HTML construct
allowing entry of data and buttons to submit the form to a server for
processing), but there are both false-negatives (a
query interface that is not an HTML form) and false-positives (an HTML
form
that is not a query interface). Scholer et al. (2004) have
pursued
a related approach for page-finding, using past queries that retrieved
a
page as document "surrogates" for that page. Experimental results
have found a 7% improvement in retrieval performance. Another
approach
to improving retrieval has been to take advantage of "click-through"
data,
ie, assuming a page is relevant if a user clicks on the link to it from
a page displaying Web search engine results. Joachims (2002) has
developed a system that uses machine learning to associate words in
queries with pages users are likely to click on to view. Finally,
Cope et al. (2003)
have investigated a variety of approaches for finding searching
interfaces
with a variety of machine-learning methods. They found the best
approach
was a system that developed a decision tree (C4.5) to assess the
features
of pages to predict which represented search intefaces. This
approach
achieved an accuracy of 85%.
An additional line of research is looking at eXtensible Mark-up
Language (XML) document retrieval. Early workers in this area
have been Fuhr and Großjohann (2001), who note that structured
XML documents may
result in different retrieval tasks than simple content-based
searching.
As the XML tags represent semantic aspects of a document, users
may
wish to retrieve documents or parts therein for specific needs.
Some
of these queries may almost border on types of queries made to
relational
databases, e.g., find the cities of residence from an XML document of
people.
A standard for querying XML documents, XQuery, is emerging, but
some
researchers, such as Fuhr and Großjohann, argue it is (for now)
incomplete.
As such, they have developed a more comprehensive query language,
XML
Information Retrieval Query Language (XIRQL). Fuhr is also the
organizer
of an evaluation forum for XML data, Initiative for the Evaluation of
XML
Retrieval (INEX,
http://inex.is.informatik.uni-duisburg.de:2003/), which is modeled
similar to TREC.
Cope, J., Craswell, N., et al. (2003). Automated discovery of search
interfaces on the Web. Proceedings of the Fourteenth Australasian
Database
Conference on Database Technologies, Adelaide, Australia.
Australian
Computer Society. 181-189.
Craswell, N., Hawking, D., et al. (2001). Effective site finding using
link anchor information. Proceedings of the 24th Annual
International ACM SIGIR Conference on Research and Development in
Information Retrieval , New Orleans, LA. ACM Press. 250-257.
Fuhr, N. and Großjohann, K. (2001). XIRQL: A Query Language for
Information Retrieval in XML Documents. Proceedings of the 24th
Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval , New Orleans, LA. ACM Press. 172-180.
Joachims, T. (2002). Optimizing search engines using clickthrough data.
Proceedings of the ACM Conference on Knowledge Discovery and Data
Mining , Edmonton, Alberta, Canada. ACM Press.
http://www.cs.cornell.edu/People/tj/publications/joachims_02c.pdf.
Scholer, F., Williams, H., et al. (2004). Query association surrogates
for Web search. Journal of the American Society for Information
Science , 55: 637-650.
8.3.2.3 Analysis of the Web
(5/19/03) Additional research continues to examine the structure
of the Web. Pennock et al. (2002) tested the power-law
distribution of Web links, which they call the "rich get richer" nature
of links on the Web, i.e., a small number of pages have a
disproportionate share of links to and from them. They found that
while a randomly selected portion of Web pages did indeed follow the
power law, certain subsets
of Web pages did not. In particular, home pages of companies,
universities, newspapers, and scientists tended to follow more of a
log-normal distribution. Soboroff (2002) addressed the question
of whether the Web collections used in the TREC Web track, WT10g and
.GOV, were similar to the Web in
general and could thus have their experimental results viewed as
generalizable.
He indeed found that both collections obeyed the power law, with
the
slope of 2-3 seen in random collections of Web pages. Other
similarities were noted as well.
Pennock, D., Flake, G., et al. (2002). Winners don't take all:
characterizing the competition for links on the Web. Proceedings of the
National Academy of Sciences, 99: 5207-5211.
http://www.pnas.org/cgi/content/abstract/99/8/5207.
Soboroff, I. (2002). Do TREC Web collections look like the Web? SIGIR
Forum, 36(2): 23-31.
http://www.sigir.org/forum/F2002/soboroff.pdf.
(5/19/04) Fetterly et al. (2003) provided a new analysis of the
evolution of Web pages, monitoring 150 million pages on a weekly basis
for 11 weeks. Change of pages in the Web is certainly a concern
for IR, since search engines usually lag weeks if not months behind in
identifying changes in
existing Web pages. This may also be important to users who cite
a
page for some specific content feature and may not be aware that the
page
has changed. Their analysis found variable rates of change, but
in
general, most changes to pages were small. One surprising finding
was
that large pages changed more frequently and extensively than smaller
pages. They also found that pages in the .com and .net top-level domain
were
more likely to change than those in the .edu and .gov domains, and that
past
changes to a page was a good predictor of future change.
Thelwall has continued his work assessing the relationship between Web
"impact" and the quality of scholarly work. A most recent
analysis
found that whle higher rated scholars produce more Web content, the
impact
as measured by in-links to those pages is not increased for those
scholars
( Thelwall and Harries, 2004). In other words, more productive
researchers
produce more Web-based content but the rate of links to their sites is
not
increased over those who are less productive. Thelwall (2003)
also
attempted to categorize links between academic Web sites and found four
general
categories of links:
- General navigation - allowing browsing to non-specific information
- Ownership - allow navigation to co-owners or co-authors of a
project
- Social - recognition that the site being linked to is important
in the social context of a field
- Gratutitous - links acknolwedging institutions or other entities
Finally, another concern about Web content is whether the output of
search engines present a skewed or biased view of a topic.
Gerhart (2004)
has looked at five controversial topics and found the output from Web
search
engines does not present a diversity of views. Her analysis found
that
simple queries tended to present the positive views of a controversy,
which
she attributed to the nature of search engines, current linking
practices
by authors of Web pages, and the simple queries used by Web searchers.
Fetterly, D., Manasse, M., et al. (2003). A large-scale study of the
evolution of web pages. Proceedings of the Twelfth International
Conference on the World Wide Web, Budapest, Hungary. ACM Press.
669-678.
Gerhart, S. (2004). Do Web search engines suppress controversy? First
Monday, 9: 1.
http://www.firstmonday.dk/issues/issue9_1/gerhart/.
Thelwall, M. (2003). What is this link doing here? Beginning a
fine-grained process of identifying reasons for academic hyperlink
creation. Information Research, 8: 3.
http://informationr.net/ir/8-3/paper151.html.
Thelwall, M. and Harries, G. (2004). Do the web sites of higher rated
scholars have significantly more online impact? Journal of the
American
Society for Information Science & Technology, 55: 149-159.
8.3.2.4 Automated generation of hypertext
8.3.3 Question answering
(5/28/07) Very little question-answering work has focused on medical
topics, with one recent exception. Demner-Fushman and Lin (2007)
developed a system that aimed to answer evidence-based medicine (EBM)
questions. They implemented a system based on a suite of "knowledge
extractors" aiming to process the text of MEDLINE
abstracts into the PICO (problem, intervention, control, outcome)
framework, determine the strength of evidence of the article, and
determine the EBM task, i.e., treatment, diagnosis, prognosis, or harm.
These attributes are then matched up against EBM questions that come
from users. An evaluation found the system improved substantially over
the Pubmed baseline in terms of ranking documents overall as well as
putting answers to the questions into the top five documents retrieved.
Demner-Fushman, D. and Lin, J. (2007). Answering clinical questions
with knowledge-based and statistical techniques. Computational Linguistics, 33:
63-103.
8.3.4 Text Summarization
(5/19/05) The growing amount of textual information on the Web
and in the scientific literature has rekindled interest in text
summarization. A prototypical use case comes from the biological
world. Many scientists increasingly use so-called "high
throughput" biotechnologies that generate massive amounts of
data. A gene microarray, for example, may measure 10,000 genes or
gene polymorphisms. Several dozen or more genes may be up- or
down-regulated in the biological process being studied by the
investigator. As such, he or she may need to know a lot about the
biology of one or more genes in a hurry. Although the state of
the art of text summarization is not to this point yet, a system that
could properly summarize information from documents about these genes
could be extraordinarily useful to such scientists. There are
probably many other comparable use cases outside of biomedicine.
Manually generated summaries of documents have been around for
centuries. Probably the best known summary of a document is the
abstract that accompanies a scientific paper. In essence,
however, all documents are summaries in the sense that they are
abstractions of reality, as noted by Mani (2001). Although he
does note that summarization in the context of text usually means a
condensed version of a longer document.
A number of overviews of text summarization have recently
appeared. One of the most coherent is the book by Mani (2001)
which reflects the state of the art. A special
issue of the journal Computational Linguistics highlights some
of the recent research developments (Radev et al., 2002). A
recent survey paper gives an overview of summarization from a medical
perspective (Afantenos et al., 2005). The latter draw on the work
of the former to develop a list of factors influencing the output of a
summarization system:
- Input factors
- Single vs. multiple document - summarizing from one or more
than document?
- Language - language of documents being summarized
- Text vs. multimedia - just documents or the growing amount of
other multimedia content?
- Purpose factors
- Informative vs. indicative - former aims to substitute for
longer document whereas latter just describes what the document covers
- Generic vs. user-oriented - former takes into account all
information in document whereas latter is focused toward needs of a
user as expressed by a query
- General purpose vs. domain-specific - former employs generic
techniques whereas latter utilizes resources from a given domain
- Output factors
- Abstract vs. extract - former creates de novo text whereas
latter extracts sentences and other parts of document(s) being
summarized
- Quality - how well does summary meet intended purpose?
- Length - how long is summary?
An ideal summarization would allow these parameters to be set and then
create an appropriate summary.
The approaches taken to summarization are extensive. However, in
general, abstraction approaches typically involve converting text to
some underlying conceptual representation, usually using natural
language processing techniques (see Chapter 9). This
representation is then converted to narrative form based on the other
factors desired, including length. Extraction, on the other hand,
involves selecting the best extracts from the document(s), so the usual
approach is to determine, sometimes using techniques like machine
learning, what are the best sentences. Every owner of Microsoft
Word can try out an extraction summarizer by choosing the Autosummarize
option under the Tools menu.
Evaluation of summarization systems has been challenging. There
have been two general approaches:
- Intrinsic - evaluation independent of purpose summary aims to
serve. Factors that can be assessed include the grammatical
quality, such as the integrity of sentences and readbility, or word or
phrase overlap with some gold standard summary created by a
human. The latter is very challenging, since human summarizers
tend to create very different summaries and the overlap of words or
phrases is not a guarantee that there is true conceptual overlap.
- Extrinsic - evaluation of summary for a specific task, usually by
human judges.
An overview of the evaluation
challenges in summarization have been highlighted by Radev et al.
(2003).
This paper reports
the results of a large-scale evaluation but also has a classification
of
evaluation measures, which include:
- Co-selection measures - measures of selection of extracts judged
by humans to represent important sentences. These include recall,
precision, kappa, and relative utility. The first three have been
defined in Chapter 3, while the latter is used by having judges assign
a relative score (usually on a 1-10 scale) for sentences and include
them in a cluster of similar sentences. These measures cannot be
used with abstracts.
- Similarity measures - measures of similarity with manual
summaries. These include the cosine measure, introduced in this
chapter, and
the longest common subsequence, which measures the minimum number of
steps
needed to transform one string into another.
- Relevance correlation - measures the relative decrease in
retrieval performance of summaries when they replace full documents in
IR experiments. This approach was motivated by work of Sparck
Jones and Sakai (2001) who found that document summaries achieve good
precision when substituted for full documents but results are less good
for recall.
Interest in information summarization evaluation has led to the
development of the Document Understanding Confernece (DUC, http://duc.nist.gov). Organized
by NIST, it is run in a spirit similar to TREC. DUC has been
guided by a roadmap developed by leading researchers in the field
(Carbonell et al., 2000). In the most recent iteration, DUC 2004,
there were five tasks:
- Very short (<75 character) summaries of single English
documents
- Short (<665 character) summaries of mutiple English documents
- Very short (<75 character) English summaries of single Arabic
documents
- Short (<665 character) English summaries of mutiple Arabic
documents
- Short (<665 character) summaries of mutiple English documents
answering the question "Who is X?," where X is a specified person
Evaluation of systems was based on the Recall-Oriented Understudy for
Gisting Evaluation (ROUGE) system, which was motivated by the work
of Lin and Hovy (2003) and described by Lin (2004). ROUGE
essentially works by comparing n-grams between gold standard summaries
and the output of systems. In general, the results showed that
human summaries in general had higher concordance with the gold
standard than the automated systems. However, their overlap was
far from perfect. Nenkova and Passoneau (2004) highlighted the
well-known problems with using overlap of single human summaries and
proposed an alternative approach called the "pyramid method."
Most work on summarization in the biomedical domain has come from a
single research group and emanates from the Personalized Search and
Summarization over Multimedia Information (PERSIVAL,
http://persival.cs.columbia.edu/) digital library project.
Their approach uses discharge summaries with mapping of terms in the
cardiology domain to UMLS Metathesaurus concepts to retrieve and
summarize articles (Elhadad and McKeown, 2002; McKeown
et al., 2003). The user customizability of this system was
described by Elhadad et al. (2005). A less patient-specific
approach has been taken in
the Centrifuser system, which attempts to derive features of documents
(e.g., topicality, content type, readability, etc.) and can be used to
process
the output of results from Web search engines (Kan et al., 2001).
This
system processing Google output was compared with three Web search
engines
(Google, Yahoo, and About.com) in a usability test employing 13
subjects
searching on three medical topics: diabetes, hypertension, and
asthma
(Kushniruk, 2002). All four systems performed comparably, with
stregnths
and weaknesses identified of each.
Additional work on summarization has come from a research group at the
NLM, looking at both summaries of biomedical literature (Fiszman et
al., 2004a) and consumer health information (Fiszman, 2004b). Both
systems use a relatively common approach that proceeds through a
number of steps. The first step is to map document text into
semantic propositions, consisting of two concepts and a relationship
between them, e.g. proton-pump inhibitors treat Zollinger-Ellison
Syndome. The propositions are then transformed into a small set
of specified predicates consisting of two semantic types linked by one
of six relationships, such as CAUSES, TREATS, and LOCATION_OF. These
are then connected into a "conceptual condenstate" that links the
concepts together in a single graph. This is followed by a
pruning operation that eliminates general concepts (i.e., those near
the top of the terminology hierarchy). From this point, a
narrative summary can be generated. An analysis of the biomedical
literature found that the text could be reduced 98% in size with an 87%
precision (rate of correctness). The consumer health literature
was able to be reduced in size comparably, with a somewhat lower level
of percision at 66%. Ultimately, of course, the output of these
systems will need to be assessed with real users.
A research group that has
explored
the summarization of scientific articles outside of medicine is Teufel
and
Moens (2002).
Afantenos, S., Karkaletsis, V., et al. (2005). Summarization from
medical documents: a survey. Artificial
Intelligence in Medicine, 33: 157-177.
Carbonell, J., Harman, D., et al. (2000). Vision Statement to Guide
Research in Question & Answering (Q&A) and Text Summarization.
http://www.ai.mit.edu/people/jimmylin/papers/Carbonell00.pdf.
Elhadad, N. and McKeown, K. (2002). Towards generating patient specific
summaries of medical articles. NAACL 2001 Workshop on Automatic
Summarization , Pittsburgh, PA. NAACL.
http://www.isi.edu/~cyl/was-naacl2001/papers/elhadad-p1.pdf.
Elhadad, N., Kan, M., et al. (2005). Customization in a unified
framework for summarizing medical literature. Artificial Intelligence in Medicine,
33: 179-198.
Fiszman, M., Rindflesch, T., et al. (2004a). Abstraction summarization
for managing the biomedical research literature. Proceedings of the HLT-NAACL Workshop on
Computational Lexical Semantics, Boston, MA. North American
Association for Computational Linguistics. 76-83. http://lhncbc.nlm.nih.gov/lhc/docs/published/2004/pub2004015.pdf.
Fiszman, M., Rindflesch, T., et al. (2004b). Summarization of an online
medical encyclopedia. MEDINFO 2004 -
Proceedings of the Eleventh World Congress on Medical Informatics,
San Francisco, CA. IOS Press. 506-510.
Harman, D. and Over, P. (2002). The DUC summarization evaluations.
Proceedings of HLT 2002 - Human
Language Technology Conference, San
Diego, CA.
Kan, M., McKeown, K., et al. (2001). Domain-specific informative and
indicative summarization for information retrieval. DUC 2001 -
Workshop on Text Summarization , New Orleans, LA. NIST.
http://www-nlpir.nist.gov/projects/duc/pubs/2001papers/columbia_redo2.ps.
Kushniruk, A., Kan, M., et al. (2002). Usability evaluation of an
experimental text summarization system and three search engines:
implications for the reengineering of health care interfaces. Proceedings
of the AMIA
2002 Annual Symposium, San Antonio, TX. Hanley & Belfus.
420-424.
Lin, C. and Hovy, E. (2003). Automatic evaluation of summaries using
n-gram co-occurrence statistics. Proceedings
of the 2003 Human Language Technology Conference (HLT-NAACL 2003),
Edmonton, Alberta. North American Association for Computational
Linguistics. http://www.isi.edu/~cyl/papers/NAACL2003.pdf.
Lin, C. (2004). ROUGE: a package for automatic evaluation of
summaries. Proceedings of Text
Summarization Branches Out Workshop, Barcelona, Spain.
Association for Computational Linguistics. http://www.isi.edu/~cyl/papers/WAS2004.pdf.
Mani, I. (2001). Automatic Summarization. Amsterdam. John
Benjamins.
McKeown, K., Elhadad, N., et al. (2003). Leveraging a common
representation for personalized search and summarization in a medical
digital library. ACM/IEEE 2003 Joint Conference on Digital
Libraries (JCDL 2003) , Houston, TX. IEEE Computer Society. 147-158.
Nenkova, A. and Passonneau, R. (2004). Evaluating content selection in
summarization: the pyramid method. Proceedings of the 2004 Human Language
Technology Conference, Boston, MA. North American Association
for Computational Linguistics. 145-152. http://www.cs.columbia.edu/~ani/papers/pyramid.pdf.
Radev, D., McKeown, K., et al. (2002). Introduction to the special
issue on summarization. Computational Linguistics, 28: 399-408.
Radev, D., Teufel, S., et al. (2003). Evaluation challenges in
large-scale document summarization. Proceedings of the 41st Annual
Meeting of the Association for Computational Linguistics, Sapporo,
Japan. Association for Computational Linguistics. 375-382.
Sparck Jones, K. and Sakai, T. (2001). Generic summaries for indexing
in IR. Proceedings of the 24th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval,
New Orleans, LA. ACM Press. 190-198.
Teufel, S. and Moens, M. (2002). Summarizing scientific articles:
experiments with relevance and rhetorical status. Computational
Linguistics, 28: 409-445.
8.4 User evaluation of system-oriented approaches
8.4.1 Early studies
8.4.2 The TREC Interactive Track
8.4.2.1 Methods
8.4.2.2 Comparing Boolean and natural language searching
8.4.2.3 Do batch and interactive experiments give the same
results?
8.4.2.4 Relevance feedback
8.4.2.5 Presentation of search results
8.5 Summary
Last updated - May 28, 2007