Information Retrieval:  A Health & Biomedical Perspective

Information Retrieval:  A Health & Biomedical Perspective (Second Edition)

William Hersh, M.D.

Springer-Verlag, 2003

Update to Chapter 6 - Retrieval

6.1 Search process

6.2 General principles of searching

6.2.1 Exact-match searching

6.2.2 Partial-match searching

(5/6/03) Note errors in relevance feedback data described in errata file.

(5/19/03) Also note from the errata file that the formula for cosine normalization in Equation 2 (page 189) is incorrect. Here is the correct version:
[Image: cosine formula]
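
The corrected formula appeared as an image that is not reproduced in this text. As a reference point only, the standard textbook form of cosine-normalized TF*IDF weighting is sketched below. The symbols are assumptions and may not match the notation of Equation 2 in the book: TF_ij is the frequency of term i in document j, IDF_i is the inverse document frequency of term i, and t is the number of terms in document j.

```latex
\mathrm{WEIGHT}_{ij} = \frac{TF_{ij} \cdot IDF_{i}}{\sqrt{\sum_{k=1}^{t} \left( TF_{kj} \cdot IDF_{k} \right)^{2}}}
```

The denominator is the length of the document's weight vector, so every document vector ends up with unit length and long documents are not favored simply for containing more words.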

6.2.3 Term selection

6.2.3.1 Term look-up

6.2.3.2 Term expansion

6.2.3.4 Other word-related operators

6.2.3.5 Subheadings

6.2.3.6 Explosions

6.2.4 Other attribute selection

6.3 Searching interfaces

(5/8/07) Not only do a growing number of search system providers offer toolbars that can be added to browsers (e.g., the Google toolbar - http://toolbar.google.com/), but newer versions of browsers (e.g., Firefox, Internet Explorer 7) offer built-in searching of a variety of resources and the ability to add more.

6.3.1 Bibliographic

6.3.1.1 Literature reference databases

(5/8/07) PubMed continues to grow with enhancements and new features. A great source for technical information on NLM searching products is the NLM Technical Bulletin (http://www.nlm.nih.gov/pubs/techbull/tb.html).

One change has been some minor modifications to the automatic term mapping algorithm. In the first step, which formerly just mapped into MeSH terms, the system now attempts to map into MeSH subheadings, publication types, phrases (from UMLS and elsewhere), pharmaceutical action terms, and supplementary concepts. If this step is not successful, the system proceeds to attempting to map into journal names and then author names. Another change is a small list of journal name exceptions, which aims to prevent some common searches (e.g., heart failure, treatment review) from mapping by default into journals with those names (McGhee, 2005). This change was motivated by the work of Smith (2004), who noted that some journals have the same name as MeSH headings (and therefore are preferentially mapped to the latter) while others map to the journal. An additional major change has been the revamping of the "Limits" page, which adds a larger variety of options for limiting searches (Canese, 2006).
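
The mapping cascade described above can be sketched as an ordered series of table look-ups. This is only an illustration: the tables, their contents, and the function name are hypothetical, and the "unmapped" fallback is an assumption, since the text does not say what happens when all three steps fail.

```python
from typing import Dict, Tuple

# Toy look-up tables invented for illustration; the real PubMed
# translation tables are far larger and are not reproduced here.
SUBJECT_TERMS: Dict[str, str] = {
    "heart attack": "myocardial infarction",
    "cancer": "neoplasms",
}
JOURNAL_NAMES: Dict[str, str] = {
    "treatment review": "Treatment Review",
    "science": "Science",
}
AUTHOR_NAMES: Dict[str, str] = {"smith a": "Smith A"}

# Queries that should not default to a journal of the same name.
JOURNAL_EXCEPTIONS = {"treatment review"}


def map_term(query: str) -> Tuple[str, str]:
    """Return (step that matched, mapped value) for a query string."""
    key = " ".join(query.lower().split())  # normalize case and spacing
    # Step 1: subject vocabulary (MeSH terms, subheadings, phrases, ...).
    if key in SUBJECT_TERMS:
        return ("subject", SUBJECT_TERMS[key])
    # Step 2: journal names, unless the query is on the exception list.
    if key in JOURNAL_NAMES and key not in JOURNAL_EXCEPTIONS:
        return ("journal", JOURNAL_NAMES[key])
    # Step 3: author names.
    if key in AUTHOR_NAMES:
        return ("author", AUTHOR_NAMES[key])
    # Assumed fallback: leave the query as entered.
    return ("unmapped", key)
```

With these toy tables, "science" maps to the journal, while "treatment review" is held back by the exception list and is left as an ordinary subject search.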

Other changes to PubMed include modifications to its interface. One significant change is the addition of tabs. One set of tabs replaces the main features (e.g., Limits, Preview/Index, etc.) while another set allows the user to set up stored searches (Nahin, 2005). An additional change allows the user to change the sort order of the search output from the traditional reverse chronological order. Output may now also be ordered by first author, last author, or journal name.

Another enhancement aims to provide more seamless access to full-text literature. Direct linkage to journal articles is one type of resource utilized by the LinkOut facility of PubMed (http://www.ncbi.nlm.nih.gov/entrez/linkout/). Other types of resources are available, including library holdings, consumer health information, commentaries on articles, supplementary data, and practice guidelines. The library holdings feature allows local libraries to restrict full-text linkages to the journals they subscribe to as well as link to other local resources, such as print-based holdings in special collections. Over 1,300 organizations take advantage of this feature, including the Oregon Health & Science University Library (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=ohsulib). The PubMed interface also allows searching to be limited to various subsets of full-text literature, as listed below. Note that the number of full text journals available almost equals the total number of journals indexed in MEDLINE, again indicating that electronic publishing of major biomedical journals is nearly ubiquitous.

Subset | Quantity | Search string
PubMed Central | 325 journals (as of May 8, 2007) | pubmed pmc[sb]
Free full text | 973 journals (as of May 8, 2007) | free full text[sb]
Full text | 5,098 journals (as of May 8, 2007) | full text[sb]

(List of PubMed Central journals - http://www.pubmedcentral.nih.gov/fprender.fcgi?cmd=full_view#getcsvfile)
(List of full-text journals - http://www.ncbi.nlm.nih.gov/entrez/linkout/journals/jourlists.cgi?typeid=1&type=journals&operation=Show)
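
As a minimal sketch of how these subset strings are used, the function below appends one of them to a topic search with a Boolean AND. The search strings come from the table above (written with the tag directly after the term); the dictionary keys and the function name are hypothetical labels chosen for this example.

```python
# Subset search strings from the table; the keys are illustrative labels.
SUBSET_FILTERS = {
    "pubmed_central": "pubmed pmc[sb]",
    "free_full_text": "free full text[sb]",
    "full_text": "full text[sb]",
}


def limit_to_subset(query: str, subset: str) -> str:
    """Restrict a PubMed query to one of the full-text subsets."""
    # Parenthesize the topic so its own operators bind before the AND.
    return f"({query}) AND {SUBSET_FILTERS[subset]}"


print(limit_to_subset("asthma OR wheezing", "free_full_text"))
```

The resulting string can be pasted into the PubMed search box as is.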

Another option is to search PubMed Central directly (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PMC). Two features allowed from this interface are restriction of search to free full-text only and "Smart Search," which omits high-frequency (stop) words from the search of the full-text documents.

An additional feature added to PubMed is a spelling checker (Canese, 2004; Wilbur, 2006). Similar to the spelling checker of Google, when a word entered in the search has few or no matches, the user is presented with an alternative spelling and the number of citations that will be retrieved if that option is selected. (An example of how this works can be seen by entering breast cancerr.) The spelling option is deactivated when the user enters a search tag, e.g., [mh]. Also appearing in this portion of the screen, when a PubMed reference is linked to from a commercial search engine such as Google or MSN Search, is a note of what the search string entered into that search engine would retrieve if entered into PubMed (Nahin, 2005). The number of citations that would be retrieved is usually small, since PubMed does not use a word-based retrieval mechanism similar to what the commercial search engines use.
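
The user-visible behavior just described (suggest an alternative only when a query has few matches, report the citation count for the alternative, and do nothing when a search tag is present) can be sketched as follows. This is not the algorithm PubMed uses: the string similarity from Python's difflib module is a simple stand-in, and the citation counts, the threshold, and the function name are made up for the example.

```python
import difflib
from typing import Dict, Optional, Tuple


def suggest(term: str, citation_counts: Dict[str, int],
            few: int = 5) -> Optional[Tuple[str, int]]:
    """Return (alternative spelling, citation count), or None."""
    term = term.lower()
    # A search tag such as [mh] deactivates the spelling option.
    if "[" in term:
        return None
    # Only offer an alternative when the query has few or no matches.
    if citation_counts.get(term, 0) > few:
        return None
    candidates = [t for t in citation_counts if t != term]
    close = difflib.get_close_matches(term, candidates, n=3, cutoff=0.8)
    if not close:
        return None
    # Prefer the close match that retrieves the most citations.
    best = max(close, key=lambda t: citation_counts[t])
    return best, citation_counts[best]


# Made-up counts, for illustration only.
counts = {"breast cancer": 150000, "asthma": 90000}
print(suggest("breast cancerr", counts))
```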

Another major enhancement to PubMed is the MyNCBI system, which replaces the Cubby that allowed stored searches and other features (Nahin, 2005). To access MyNCBI, individuals must create a free user account with a password. This allows them to save searches and have them updated and sent via email on a periodic basis. It also allows the designation of filters for limiting the output of all searches, such as to review articles, human studies, and/or various journal subsets. MyNCBI also lets the user specify properties associated with the LinkOut feature, such as automatic links to the holdings of specific libraries or to other databases, such as the NCBI Bookshelf or various genomics databases.

Also continuing to be enhanced in PubMed is the Clinical Queries interface (http://www.ncbi.nlm.nih.gov/entrez/query/static/clinical.html). Clinical Queries are actually part of a broader category of Special Queries (http://www.nlm.nih.gov/bsd/special_queries.html), which include:

The Clinical Queries now feature three categories (http://www.ncbi.nlm.nih.gov/entrez/query/static/clinicaltable.html):

Ongoing research by the "Hedges Team" at McMaster University has led to refinement of the search strategies for optimal retrieval of articles likely to contain scientifically strong studies (i.e., those likely to contain evidence for use in the practice of evidence-based medicine) (Wilczynski, 2005). A list of publications describing updates of their strategies is available (http://www.nlm.nih.gov/pubs/techbull/jf04/cq_info.html). The search strategy for systematic reviews was originally based on work done by Shojania and Bero (2001) but has been updated with an improved strategy (http://www.nlm.nih.gov/bsd/pubmed_subsets/sysreviews_strategy.html) based on the research of Montori et al. (2005). The table below from the NLM Web site shows the latest strategies used by the Clinical Queries interface and updates Table 6.4 in the book. Recent research has found that adding the word "randomised" to the specific/narrow strategy for therapy improves recall, mainly through the addition of recently published studies not yet indexed (Corrao, 2006).

Category | Optimized for | Sensitivity/Specificity | PubMed equivalent
therapy | sensitive/broad | 99%/70% | ((clinical[Title/Abstract] AND trial[Title/Abstract]) OR clinical trials[MeSH Terms] OR clinical trial[Publication Type] OR random*[Title/Abstract] OR random allocation[MeSH Terms] OR therapeutic use[MeSH Subheading])
therapy | specific/narrow | 93%/97% | (randomized controlled trial[Publication Type] OR (randomized[Title/Abstract] AND controlled[Title/Abstract] AND trial[Title/Abstract]))
diagnosis | sensitive/broad | 98%/74% | (sensitiv*[Title/Abstract] OR sensitivity and specificity[MeSH Terms] OR diagnos*[Title/Abstract] OR diagnosis[MeSH:noexp] OR diagnostic * [MeSH:noexp] OR diagnosis,differential[MeSH:noexp] OR diagnosis[Subheading:noexp])
diagnosis | specific/narrow | 64%/98% | (specificity[Title/Abstract])
etiology | sensitive/broad | 93%/63% | (risk*[Title/Abstract] OR risk*[MeSH:noexp] OR risk *[MeSH:noexp] OR cohort studies[MeSH Terms] OR group*[Text Word])
etiology | specific/narrow | 51%/95% | ((relative[Title/Abstract] AND risk*[Title/Abstract]) OR (relative risk[Text Word]) OR risks[Text Word] OR cohort studies[MeSH:noexp] OR (cohort[Title/Abstract] AND stud*[Title/Abstract]))
prognosis | sensitive/broad | 90%/80% | (incidence[MeSH:noexp] OR mortality[MeSH Terms] OR follow up studies[MeSH:noexp] OR prognos*[Text Word] OR predict*[Text Word] OR course*[Text Word])
prognosis | specific/narrow | 52%/94% | (prognos*[Title/Abstract] OR (first[Title/Abstract] AND episode[Title/Abstract]) OR cohort[Title/Abstract])
clinical prediction guides | sensitive/broad | 96%/79% | (predict*[tiab] OR predictive value of tests[mh] OR scor*[tiab] OR observ*[tiab] OR observer variation[mh])
clinical prediction guides | specific/narrow | 54%/99% | (validation[tiab] OR validate[tiab])
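
The percentage pairs in the table are read here as sensitivity/specificity, which is how the Hedges Team reports the performance of its strategies. The sketch below shows how such figures are computed from a hand-classified test collection. All counts are made up (they are chosen to reproduce the 93%/97% of the narrow therapy strategy), and the function name is hypothetical.

```python
from typing import Dict


def evaluate(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute filter performance from a 2x2 table of counts.

    tp: sound studies retrieved      fn: sound studies missed
    fp: other articles retrieved     tn: other articles excluded
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of sound studies found
        "specificity": tn / (tn + fp),  # share of other articles excluded
        "precision": tp / (tp + fp),    # share of output that is sound
    }


# Made-up collection: 100 sound studies among 10,100 articles.
metrics = evaluate(tp=93, fn=7, fp=300, tn=9700)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

With these counts the precision is only about 24%, which illustrates why even a highly specific filter can return mostly non-qualifying articles when sound studies are rare in the database.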

One area where MEDLINE is less effective at finding specific types of articles concerns those describing adverse effects of medications. Derry et al. (2001) found that 23% of papers describing adverse effects did not have text words or indexing terms that would enable their retrieval. Likewise, Golder et al. (2006) found that none of a variety of searching strategies achieved better than 50% recall when searching for similar articles.

Another improvement to PubMed and other NLM databases has been the reformatting of Web pages to make them more friendly for display on the small screens of handheld devices. A growing number of these devices feature wireless connectivity to the Internet. Fontelo et al. (2003) described initial work in this area, which has developed into the interface available at http://pubmedhh.nlm.nih.gov/nlm/. Recent enhancements to this site include:

Additional effort has been devoted to making PubMed directly accessible to other Web-based applications that do not require its user interface. For example, an application may wish to send a query directly and receive results for its own formatting. One common application is the direct linkage to PubMed references, e.g., as done by Highwire Press and many medical textbooks (http://www.ncbi.nlm.nih.gov/entrez/query/static/linking.html). The Entrez Programming Utilities page provides access to much of this functionality (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html), including the following:
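
As a rough sketch of how an application might send a query directly, the functions below build request URLs for the ESearch and EFetch utilities. The base URL is the current E-utilities address rather than the older help page cited above, the function names are hypothetical, and the code only constructs the URLs; it does not contact NCBI.

```python
from typing import List
from urllib.parse import urlencode

# Current E-utilities base address (an assumption relative to this text,
# which cites the older help page).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, retmax: int = 20) -> str:
    """URL that searches PubMed and returns matching identifiers."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids: List[str]) -> str:
    """URL that fetches the records for a list of PubMed identifiers."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("heart attack", retmax=5))
```

An application would request the first URL, read the identifiers from the response, and pass them to the second to obtain records for its own formatting.
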
Canese, K. (2004). New PubMed® Spell Checking Feature. NLM Technical Bulletin. November-December, 2004. e12. http://www.nlm.nih.gov/pubs/techbull/nd04/nd04_spell.html.
Canese, K. (2006). PubMed Limits Page Updated. NLM Technical Bulletin. e2. http://165.112.6.70/pubs/techbull/ma06/ma06_limits.html.
Corrao, S., Colomba, D., et al. (2006). Improving efficacy of PubMed Clinical Queries for retrieving scientifically strong studies on treatment. Journal of the American Medical Informatics Association, 13: 485-487.
Derry, S., Loke, Y., et al. (2001). Incomplete evidence: the inadequacy of databases in tracing published adverse drug reactions in clinical trials. BMC Medical Research Methodology, 1: 7. http://www.biomedcentral.com/1471-2288/1/7.
Fontelo, P., Ackerman, M., Kim, G. and Locatis, C. (2003). The PDA as a portal to knowledge sources in a wireless setting. Telemedicine Journal and e-Health, 9: 141-147.
Fontelo, P., Liu, F., et al. (2005). askMEDLINE: a free-text, natural language query tool for MEDLINE/PubMed. BMC Medical Informatics and Decision Making, 5: 5. http://www.biomedcentral.com/1472-6947/5/5.
Fontelo, P., Nahin, A., et al. (2005). Accessing MEDLINE/PubMed with handheld devices: developments and new search portals. Proceedings of the 38th Annual Hawaii International Conference on System Sciences, Big Island, Hawaii. IEEE Computer Society. 158b. http://csdl.computer.org/comp/proceedings/hicss/2005/2268/06/22680158b.pdf.
Golder, S., McIntosh, H., et al. (2006). Developing efficient search strategies to identify reports of adverse effects in MEDLINE and EMBASE. Health Information and Libraries Journal, 23: 3-12.
McGhee, M. (2005). PubMed Subject Searching Avoids Conflicts with Journal Titles. NLM Technical Bulletin. e3. http://165.112.6.70/pubs/techbull/so05/so05_pm_exceptions.html.
Montori, V., Wilczynski, N., Morgan, D. and Haynes, R. (2005). Optimal search strategies for retrieving systematic reviews from Medline: analytical survey. British Medical Journal, 330: 68.
Nahin, A. (2005). My NCBI Replaces the Cubby: Includes Automatic E-mailing of Search Updates and Filters. NLM Technical Bulletin. January-February, 2005. e3. http://www.nlm.nih.gov/pubs/techbull/jf05/jf05_myncbi.html.
Nahin, A. (2005). Links from Commercial Search Engines to PubMed® Citations. NLM Technical Bulletin. March-April, 2005. e3. http://www.nlm.nih.gov/pubs/techbull/ma05/ma05_google.html.
Nahin, A. (2005). New Look for PubMed Screen. NLM Technical Bulletin. e4. http://165.112.6.70/pubs/techbull/jf05/jf05_tabs.html.
Shojania, K. and Bero, L. (2001). Taking advantage of the explosion of systematic reviews: an efficient MEDLINE search strategy. Effective Clinical Practice, 4: 157-162.
Smith, A. (2004). An examination of PubMed's ability to disambiguate subject queries and journal title queries. Journal of the Medical Library Association, 92: 97-100.
Wilbur, W., Kim, W., et al. (2006). Spelling correction in the PubMed search engine. Information Retrieval, in press.
Wilczynski, N., Morgan, D., et al. (2005). An overview of the design and methods for retrieving high-quality studies for clinical care. BMC Medical Informatics and Decision Making, 5: 20. http://www.biomedcentral.com/1472-6947/5/20.

6.3.1.2 Web catalogs

6.3.1.3 Specialized registries

6.3.2 Full-text

6.3.2.1 Periodicals

6.3.2.2 Textbooks

(5/3/05) Commercial medical textbook products continue to enhance their interfaces. Each also provides some unique features.

The most feature-packed system (from an IR standpoint) continues to be Stat!-Ref (http://www.statref.com). In addition to allowing synonyms and stemming, it offers several levels of "precision" (for all the words entered, from most to least restrictive):

Up to Date (http://www.uptodate.com) provides unique features for words in the query not found in the database. If a term is not found, it automatically removes one letter at a time from the end of the word until a match is found. It also attempts to map non-found words to common synonyms, abbreviations, and misspellings.
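
The letter-by-letter truncation described above can be sketched in a few lines. This is an illustration only, not the vendor's implementation: the vocabulary is a toy set, the function name is hypothetical, and treating "found" as an exact match against a vocabulary entry is an assumption.

```python
from typing import Optional, Set


def truncate_until_found(word: str, vocabulary: Set[str],
                         min_length: int = 1) -> Optional[str]:
    """Drop trailing letters until the word matches a vocabulary entry."""
    candidate = word.lower()
    while len(candidate) >= min_length:
        if candidate in vocabulary:
            return candidate
        candidate = candidate[:-1]  # remove one letter from the end
    return None  # nothing matched, even at the shortest allowed length


# Toy vocabulary, invented for the example.
vocabulary = {"asthma", "diabetes", "pneumonia"}
print(truncate_until_found("asthmatic", vocabulary))
```

Raising min_length guards against very short, misleading matches, a design choice the text does not discuss.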

6.3.2.3 Web search engines

(5/8/07) The phrase "Web crawler" is not used much any more, and "Web search engine" seems to have replaced it in the vernacular. As mentioned in Chapter 1, SearchEngineShowdown is a frequently updated Web site that compares commercial search engines. Two pages cover their retrieval features:

Google continues to add new and interesting features beyond those for simple Web page retrieval. The Google help page (http://www.google.com/help/) is an excellent starting point. The features page (http://www.google.com/help/features.html) describes many of the search engine's advanced features, some of which include:

A book devoted to "Google hacks" has recently been updated in a second edition. The book describes many of the above features in greater detail (Calishain and Dornfest, 2004).

Google also provides an application programming interface (API) to its search engine, allowing others to build applications that access it using emerging Web Services standards (http://www.google.com/apis/).

Another major concern for Web search engines is performance. With over 200 million queries per day, Google requires hardware and software that can handle the load. Hochmuth (2003) presents a high-level description of Google's architecture (circa 2003). Other papers have described Google's file system (Ghemawat et al., 2003) and cluster architecture (Barroso et al., 2003). Some researchers have developed more efficient implementations of Google's PageRank algorithm (e.g., Chen et al., 2002).

Of course, the other leading search sites are not complacent (Anonymous, 2004). Yahoo! (http://www.yahoo.com/) has made a number of acquisitions in recent years and abandoned using Google as its word-based search engine. (Recall that Yahoo! used to be mainly a Web catalog, with searches done on category names in the catalog and then passed to Google if adequate results were not obtained.) Likewise, Bill Gates and Microsoft have aimed MSN Search (http://search.msn.com/) at Google and begun to invest in IR research in its various laboratories. MSN Search offers many of the features available from Google, including a programming API.

A big challenge in the past for search engines has been how to make money for a service users do not pay for. (Not only do they not pay, but the results they find take them right off the search engine site.) Google has developed means to generate revenue streams without compromising its basic search results. While some other search engines feature sponsored links very prominently, Google continues to post them in a separate location clearly demarcated from the main results. Google's real cash cow has been its AdSense program (https://www.google.com/adsense/), which pays sites to put context-specific advertisements on their pages. Google also obtains revenue from its AdWords program (https://adwords.google.com/select/). In this program, advertisers bid for their results to be placed in the sponsored links portion of the results, with those paying more ranking higher. For an example of how this works and who uses it, enter the query "medical informatics" or "health informatics" into Google.

With the growing importance of Web search engines to commercial entities, obtaining high rankings in search engines has become big business. One report (Anonymous, 2003) details do's and don'ts for obtaining high rankings in Google, which include obtaining links to pages, putting important search words in the title, and avoiding use of the HTML META tag, which commercial search engines tend to view as a place for Web page authors to manipulate their systems. (This is somewhat ironic, since the META tag was designed to facilitate retrieval!)

Some search engines described in the book have changed substantially since the book was written. Ask Jeeves (http://www.ask.com/) has abandoned its frequently asked question approach to search in favor of a combination of paid placement at the top of the results page followed by output from the Teoma search engine (http://www.teoma.com/) below. AltaVista (http://www.altavista.com) has also changed its results page to make sponsored links more prominent. Northern Light (http://www.northernlight.com) has exited free search completely and has instead adopted a subscription model for business-oriented information. Metacrawler (http://www.metacrawler.com), the original "meta" search engine, has also begun to intermix sponsored results with its regular results. Copernic has changed its flagship product name to Copernic Agent and offers a free basic version as well as commercial versions with more advanced features (http://www.copernic.com/).

One question that remains unanswered is whether there should continue to be development of health-specific search engines. Coiera et al. (2005) describe the architecture of a federated search system that brings together multiple evidence-based resources. The Healthline system (http://www.healthline.com/) claims to offer consumers access to filtered, high-quality content.

Anonymous (2003). The Google Ranking Report. Sedona, AZ, Cyberdifference Corp. http://www.mseo.com/google_ranking_report.html.
Barroso, L., Dean, J., et al. (2003). Web search for a planet:  the Google Cluster Architecture. IEEE Micro, 23(2): 22-28. http://labs.google.com/papers/googlecluster-ieee.pdf.
Calishain, T. and Dornfest, R. (2004). Google Hacks - Tips & Tools for Smarter Searching (Second Edition). Sebastopol, CA. O'Reilly & Associates.
Chen, Y., Gan, Q. and Suel, T. (2002). I/O-efficient techniques for computing pagerank. Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA. ACM Press. 549-557.
Coiera, E., Walther, M., et al. (2005). Architecture for knowledge-based and federated search of online clinical evidence. Journal of Medical Internet Research, 7: e52. http://www.jmir.org/2005/5/e52/.
Ghemawat, S., Gobioff, H., et al. (2003). The Google file system. Proceedings of the 19th ACM Symposium on Operating Systems Principles, Lake George, NY. ACM Press. http://labs.google.com/papers/gfs-sosp2003.pdf.
Hochmuth, P. (2003). Speedy returns are Google's goal. Network World. 17-18. http://www.nwfusion.com/news/2003/0901google.html.

6.3.3 Databases and Collections

6.3.3.1 Images

(5/8/07) A comprehensive overview of content-based image retrieval (CBIR) was recently published (Müller et al., 2004). This paper notes that CBIR still awaits major breakthroughs to achieve the same kind of widespread use currently seen with text-based retrieval systems. Indeed, most state-of-the-art systems employ searching of text-based descriptors of images. The paper notes that visual features are usually classified into three categories:

Most systems are still limited to primitive features, usually combined with textual annotations for logical and abstract features. The loss of information from images to a non-image representation has been called the "semantic gap." Müller et al. (2004) make the case for image retrieval in biomedicine. They note that routine clinical care produces thousands of images daily in most hospitals. The growth of standards for image communications (e.g., Digital Imaging and Communications in Medicine or DICOM) and the widespread use of Picture Archiving and Communications Systems (PACS) ensure that large numbers of digital images are available for use. While image interpretation in medicine is not on the horizon any time soon, retrieving images based on their features could have value in assisting diagnosis, treatment, and education.

Müller, H., Michoux, N., Bandon, D. and Geissbuhler, A. (2004). A review of content-based image retrieval systems in medical applications-clinical benefits and future directions. International Journal of Medical Informatics, 73: 1-23.

(5/2/02) Most of the major Web search engines now feature image retrieval. They usually display images associated with the text of the Web pages on which they are located. Some of the features of Google are particularly useful in image retrieval, such as filetype (e.g., can limit to JPEG) and those related to current news. When an image is retrieved, Google displays both a thumbnail of the image and the page from which it came. As with most of the major search engines, Google allows image searching from its main search page and also has a separate image retrieval page (http://images.google.com/).

6.3.3.2 Genomics

6.3.3.3 Citations

6.3.3.4 Other databases

(5/6/03) The ClinicalTrials.gov database of clinical trials offers some innovative retrieval functionality. At the home page, the user can enter a natural language search or use a "Focused Search" that provides an interface to search by disease, location, treatment, sponsor, and so forth. The natural language search results page provides an option of "Query Details" that:

As an example, if the user enters the query "heart attack beta blockers portland," the "Query Details" will map the phrase "heart attack" to "myocardial infarction," recognize the phrase "beta blocker," and find trials in cities named "Portland."
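
The kind of interpretation shown in this example can be imitated with a greedy longest-match scan over the query words. The sketch below is a toy, not a description of how ClinicalTrials.gov works: the three look-up tables contain only the entries needed for the example, and the function and label names are hypothetical.

```python
from typing import List, Tuple

# Toy tables holding only what the example query needs.
SYNONYMS = {"heart attack": "myocardial infarction"}
KNOWN_PHRASES = {"beta blockers": "beta blocker"}
CITIES = {"portland"}


def query_details(query: str, max_words: int = 3) -> List[Tuple[str, str, str]]:
    """Return (kind, text as entered, interpretation) for each query part."""
    tokens = query.lower().split()
    details = []
    i = 0
    while i < len(tokens):
        # Try the longest chunk starting at position i first.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            chunk = " ".join(tokens[i:i + n])
            if chunk in SYNONYMS:
                details.append(("synonym", chunk, SYNONYMS[chunk]))
            elif chunk in KNOWN_PHRASES:
                details.append(("phrase", chunk, KNOWN_PHRASES[chunk]))
            elif chunk in CITIES:
                details.append(("city", chunk, chunk.title()))
            else:
                continue  # no match at this length, try a shorter chunk
            i += n
            break
        else:
            # No table matched: keep the single word as entered.
            details.append(("word", tokens[i], tokens[i]))
            i += 1
    return details


for row in query_details("heart attack beta blockers portland"):
    print(row)
```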

6.3.4 Aggregations

(5/9/06) A number of systems that aggregate genomic-related data and information are now available:
Dennis, G., Sherman, B., Hosack, D., Yang, J., Gao, W., Lane, H. and Lempicki, R. (2003). DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology, 4: P3. http://genomebiology.com/2003/4/5/P3.
Diehn, M., Sherlock, G., Binkley, G., Jin, H., Matese, J., Hernandez-Boussard, T., Rees, C., Cherry, J., Botstein, D., Brown, P. and Alizadeh, A. (2003). SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Research, 31: 219-223.
Maglott, D., Ostell, J., et al. (2005). Entrez Gene:  gene-centered information at NCBI. Nucleic Acids Research, 33: D54-D58.
Wheeler, D., Barrett, T., et al. (2006). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 34: D173-D180.
Zeeberg, B., Feng, W., Wang, G., Wang, M., Fojo, A., Sunshine, M., Narasimhan, S., Kane, D., Reinhold, W., Lababidi, S., Bussey, K., Riss, J., Barrett, J. and Weinstein, J. (2003). GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biology, 4: R28. http://genomebiology.com/2003/4/4/R28.

(5/9/06) A search system that aggregates a variety of bibliographic and full-text resources in computer science is BibFinder (http://kilimanjaro.eas.asu.edu/). Based on the Havasu framework (Kambhampati et al., 2002), BibFinder searches over a number of other resources, including the ACM Digital Library, CiteSeer, Science Direct, and Google. Recent work has enhanced BibFinder to use statistics more effectively in determining a source's coverage of a topic (Nie et al., 2003).

Kambhampati, S., Nambiar, U., Nie, Z. and Vaddi, S. (2002). Havasu: A Multi-Objective, Adaptive Query Processing Framework for Web Data Integration. Tempe, AZ, Arizona State University. http://rakaposhi.eas.asu.edu/havasu/Havasu_files/havasu-techrpt.pdf.
Nie, Z., Kambhampati, S. and Hernandez, T. (2003). BibFinder/StatMiner: effectively mining and using coverage and overlap statistics in data integration. Proceedings of the 29th Very Large Database Conference, Berlin, Germany. http://www.public.asu.edu/~zaiqingn/vldb2003.pdf.

(5/3/04) Most textbooks now are aggregated into large collections, an approach pioneered by Stat!-Ref. Another trend is the movement of textbooks to handheld device formats, such as the Palm and Pocket PC. The largest aggregator of medical textbooks is Skyscape (http://www.skyscape.com), which reformats content into a consistent format and provides standard (if simple) browsing and searching capabilities.

6.4 Document Delivery

6.5 Notification or Information Filtering

(5/10/06) One format for notification, mentioned in updates to earlier chapters and receiving increasing usage on the Web, is RSS. A growing number of medical publishers are using it, such as:
Both of these systems also send notifications via simple email.
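
For readers unfamiliar with how an RSS notification is consumed, the sketch below parses a small RSS 2.0 document and reports only the items not seen before, which is the essence of a feed reader. The feed content and addresses are invented for the example, and the function name is hypothetical; a real reader would download the XML from the publisher's feed address.

```python
import xml.etree.ElementTree as ET
from typing import List, Set, Tuple

# Invented sample feed in RSS 2.0 format.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example saved search</title>
    <item>
      <title>First new citation</title>
      <link>http://example.org/1</link>
    </item>
    <item>
      <title>Second new citation</title>
      <link>http://example.org/2</link>
    </item>
  </channel>
</rss>"""


def new_items(feed_xml: str, seen_links: Set[str]) -> List[Tuple[str, str]]:
    """Return (title, link) for feed items whose link is not yet seen."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if link not in seen_links:
            items.append((title, link))
    return items


print(new_items(SAMPLE_FEED, seen_links={"http://example.org/1"}))
```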

Canese, K. (2005). RSS Feeds Available from PubMed. NLM Technical Bulletin. http://165.112.6.70/pubs/techbull/mj05/mj05_rss.html.

6.6 Searching with the sample database

Last updated - May 8, 2007