Information Retrieval:  A Health & Biomedical Perspective

Information Retrieval:  A Health & Biomedical Perspective (Second Edition)

William Hersh, M.D.

Springer-Verlag, 2003

Update to Chapter 5 - Indexing

(4/30/06) A recent book on metadata provides an overview of the structure and use of metadata for traditional databases as well as Web content (Caplan, 2003). The beginning chapters cover introductory topics such as the syntax, storage, and use of metadata. These chapters also distinguish terminology used for metadata and describe major approaches being used for the Web. Later chapters cover the use of metadata for a number of specific domains. Interestingly, metadata for biomedicine is not included, which, I suppose, makes this volume quite complementary to my book!

Another article about metadata describes its value for types of information systems (Udell, 2005). The author notes three distinct categories of metadata, each of which has specific uses in organizing data:
Caplan, P. (2003). Metadata Fundamentals for All Librarians. Chicago. American Library Association.
Udell, J. (2005). Managing metadata. Infoworld. October 24, 2005. 33-38. http://www.infoworld.com/article/05/10/20/43FEmetadata_1.html.

5.1 Types of indexing

5.2 Factors influencing indexing

5.3 Controlled vocabularies

(4/29/07) Another term that commonly comes up when discussing controlled vocabularies is ontology. There are many definitions of ontologies, and the word is often used to describe any type of controlled vocabulary or terminology. A commonly cited definition and general overview of ontologies comes from Noy and McGuinness (2001). These authors describe an ontology as a "formal explicit description of concepts in a domain of discourse." Some commonly agreed upon components of an ontology are classes of general concepts, with specific instances or instantiations that represent concepts within them. Concepts have various attributes, usually connected via relationships. Concepts also have restrictions, sometimes called facets. In a pure sense, ontologies differ from terminologies in that the former richly represent a domain whereas the latter catalog its formal terms.

Ontologies have been viewed as an essential component of the "semantic Web," another term that has its own ambiguities (Berners-Lee et al., 2001; Robu et al., 2006). The term ontology is frequently used to describe medical terminology systems, though Cimino and Zhu (2006) note that most major terminologies, while used successfully for many applications, adhere to true ontological principles to varying degrees. This will likely change, however, as the new National Center for Biomedical Ontology advances its work in this area (Rubin et al., 2006).

Berners-Lee, T., Lassila, O., et al. (2001). The Semantic Web. Scientific American, 284(5): 34-43. http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2.
Cimino, J. and Zhu, X. (2006). The practical impact of ontologies on biomedical informatics. Methods of Information in Medicine, 45(Supp 1): 124-135.
Noy, N. and McGuinness, D. (2001). Ontology Development 101: A Guide to Creating Your First Ontology, Stanford University Knowledge Systems Laboratory. http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology101-noy-mcguinness.html.
Robu, I., Robu, V., et al. (2006). An introduction to the Semantic Web for health sciences librarians. Journal of the Medical Library Association, 94: 198-205.
Rubin, D., Lewis, S., et al. (2006). National Center for Biomedical Ontology: advancing biomedicine through structured organization of scientific knowledge. OMICS, 10: 185-198.

(4/29/07) Another area where terminology incompleteness and ambiguity have hampered information retrieval is that of genes and proteins. The book already notes that the various names of biomedical entities, such as diseases, symptoms, medications, and so forth, suffer from both synonymy and polysemy (also called ambiguity in some literatures, such as the genomics literature). There has historically been no control over gene names; they tend to be assigned by the researchers who discover them. Leading geneticist Dr. David Botstein has noted this problem and called for a better approach to naming of genes (O'Neill, 2003). Genes have characteristics that may present even more challenges than names of other entities:
Several studies have investigated gene-naming problems. Tuason et al. (2004) systematically assessed ambiguity in gene names across four species: mouse, worm, fly, and yeast. This analysis was extended to 17 more organisms by Chen et al. (2005). Fundel and Zimmer (2006) focused on the four organisms of Tuason et al. as well as the rat, looking more deeply at the problem and assessing how it might be improved with human curation.

All of these studies looked at ambiguities within the organism, with general English words, with terms in the UMLS Metathesaurus, and across the organisms. For each gene, they also assessed the ambiguity of official symbols, all symbols, and all symbols and names. The following table from Chen et al. (2005) shows the results obtained across the 21 organisms. It should be noted there was substantial variation across different organisms.

% Ambiguity               Official symbols only  All symbols (official and aliases)  All names (including all symbols)
Within species            0.02%                  5.0%                                5.6%
Across species            14.2%                  13.4%                               16.0%
English words             0.6%                   1.1%                                1.8%
UMLS Metathesaurus terms  1.0%                   3.0%                                13.1%

For official symbols, there was virtually no ambiguity within species or with English words or the Metathesaurus. The substantial (14.2%) ambiguity across species was believed to be due to homologous genes, which are genes that code for a protein of similar function across species. The ambiguity with English words and Metathesaurus terms was generally low, with one exception: the 13.1% ambiguity between gene names and Metathesaurus terms. Analysis of the latter showed that 80% of the ambiguity was due to genes being given the same name as the phenotype resulting from expression of the gene, e.g., the gene "limb deformity" results in deformed limbs.
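The kinds of ambiguity tallied in the table can be made concrete with a short sketch. It is not the procedure used by Chen et al. (2005): the function names are hypothetical, the gene symbols and word list are toy data invented for illustration, and folding symbols to lower case is a simplifying assumption.

```python
from collections import defaultdict


def within_species_ambiguity(symbol_to_genes):
    """Fraction of symbols in one species that name more than one gene."""
    if not symbol_to_genes:
        return 0.0
    ambiguous = sum(1 for genes in symbol_to_genes.values() if len(genes) > 1)
    return ambiguous / len(symbol_to_genes)


def across_species_ambiguity(species_to_symbols):
    """Fraction of distinct symbols that occur in more than one species."""
    seen_in = defaultdict(set)
    for species, symbols in species_to_symbols.items():
        for symbol in symbols:
            # Case-folding is a simplifying assumption of this sketch.
            seen_in[symbol.lower()].add(species)
    if not seen_in:
        return 0.0
    shared = sum(1 for found in seen_in.values() if len(found) > 1)
    return shared / len(seen_in)


def english_word_ambiguity(symbols, english_words):
    """Fraction of symbols that are also ordinary English words."""
    folded = {s.lower() for s in symbols}
    if not folded:
        return 0.0
    return len(folded & {w.lower() for w in english_words}) / len(folded)


# Toy data, invented for illustration only.
mouse_symbols = {"brca1": {"gene_1"}, "ld": {"gene_2", "gene_3"}}
by_species = {"mouse": {"brca1", "ld", "for"}, "fly": {"for", "w"}, "yeast": {"cdc28"}}
all_symbols = set().union(*by_species.values())

within = within_species_ambiguity(mouse_symbols)               # 1 of 2 symbols
across = across_species_ambiguity(by_species)                  # 1 of 5 symbols
english = english_word_ambiguity(all_symbols, {"for", "was"})  # 1 of 5 symbols
```

The same counting could be repeated for official symbols only, for all symbols, and for all names, which is how the three columns of the table differ.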

In the Tuason et al. (2004) study, the authors assessed a group of abstracts that failed to be retrieved using the official gene symbol or an alias. They found that the most common reason for lack of retrieval was gene name variation, either full or partial. When other names for the gene used in the literature were assessed for ambiguity, the amount was much higher. In 45,000 article abstracts, 10% of gene names within the mouse genome database were ambiguous and 31.3% were ambiguous with gene names from the other three organisms. In the Chen et al. (2005) study, an additional analysis looked at the terms authors prefer to use in the papers they write, finding that they overwhelmingly preferred gene synonyms (74.7%) over official symbols (17.7%) and official names (7.6%).

The organization responsible for the naming of human genes is the Human Genome Organisation (HUGO, http://www.gene.ucl.ac.uk/hugo/ ), which authorizes the HUGO Gene Nomenclature Committee (HGNC, http://www.gene.ucl.ac.uk/nomenclature/ ) as the authority for assigning human gene names and symbols. The NCBI databases on human genes, e.g., Online Mendelian Inheritance in Man (OMIM, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM ), adhere to HGNC nomenclature, although as noted above, authors in the literature do not always do so.

Chen, L., Liu, H., et al. (2005). Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics, 21: 248-256.
Fundel, K. and Zimmer, R. (2006). Gene and protein nomenclature in public databases. BMC Bioinformatics, 7: 372. http://www.biomedcentral.com/1471-2105/7/372.
Ivics, Z., Hackett, P., Plasterk, R. and Izsvak, Z. (1997). Molecular reconstruction of Sleeping Beauty, a Tc1-like transposon from fish, and its transposition in human cells. Cell, 91: 501-510.
O'Neill, G. (2003). Gruber winner Botstein calls for better gene-naming system. Melbourne, Australia, XIX International Congress of Genetics. http://www.geneticsmedia.org/gruber_winner_botstein_calls_for_better_gene_name_system.htm.
Tuason, O., Chen, L., Liu, H., Blake, J. and Friedman, C. (2004). Biological nomenclatures: a source of lexical knowledge and ambiguity. Pacific Symposium on Biocomputing, Kona, Hawaii. World Scientific. 238-249.

5.3.1 General principles of controlled vocabularies

(4/29/07) How do we evaluate medical terminology systems? Arts et al. (2005) reviewed the literature and found no comprehensive approaches, so developed their own hierarchical approach. They noted two broad concerns for terminology systems and sub-concerns within them that could be measured and rated:
Arts, D., Cornet, R., et al. (2005). A framework for characterizing terminological systems. Methods of Information in Medicine, 44: 616-625.

5.3.2 The Medical Subject Headings (MeSH) vocabulary

(4/29/07) An updated fact sheet about MeSH is available at http://www.nlm.nih.gov/pubs/factsheets/mesh.html. The most recently documented version of MeSH (2005) has the following numbers of terms:
In addition to the MeSH Browser described in the book, users can search for MeSH terms within PubMed by selecting "MeSH" from the uppermost "Search" drop-down menu. MeSH can be downloaded from the National Library of Medicine in a variety of formats, including text and XML (http://www.nlm.nih.gov/mesh/filelist.html). Somewhat similar to Index Medicus, less of MeSH is available on paper now. The three volumes described in the book were discontinued in 2003, although a single volume with the terms and tree structures is still available (http://www.nlm.nih.gov/mesh/pubs.html).

(5/6/03) Another bibliographic database that uses MeSH (unchanged) is the Manual Alternative and Natural Therapy Index System (MANTIS, http://www.healthindex.com/MANTISAbout.html), which was mentioned in the update to Chapter 4.

5.3.3 Other indexing vocabularies

(4/24/05) The Gene Ontology (GO, http://www.geneontology.org/) continues to be an important resource in the molecular biology world. Its primary use is actually not in indexing content but rather structuring the knowledge of genes and their functions. Many of the model organism databases are devoting great resources to annotating the genes in their databases with GO codes. This work is usually done by curators who have advanced training in various fields of biology. In addition, the GO Consortium continues to expand and update the three vocabularies (Anonymous, 2006). As of April, 2007, there were a total of 23,053 terms in GO, distributed among the Biological Process (13,506), Cellular Component (1,957), and Molecular Function (7,590) categories. GO terms are now included in the UMLS Metathesaurus. An ongoing summary of the databases that use GO and the number of annotations within them is provided on the GO Web site (http://www.geneontology.org/GO.current.annotations.shtml).

GO also has codes that indicate the level of evidence supporting the association of a term with a gene. These evidence codes are described on the GO Web site (http://www.geneontology.org/GO.evidence.html). The current evidence codes in use are in the following table.

Code  Meaning                                          Use
IC    inferred by curator                              Used where an annotation is not supported by any evidence but can be reasonably inferred from other GO annotations for which evidence is available
IDA   inferred from direct assay                       Inferred from enzyme assays, in vitro reconstitution (e.g., transcription), immunofluorescence (for cellular component), cell fractionation (for cellular component), or physical interaction/binding assay (sometimes appropriate for cellular component or molecular function)
IEA   inferred from electronic annotation              Annotations based on "hits" in sequence similarity searches (if not reviewed by curators) or annotations transferred from other database records
IEP   inferred from expression pattern                 Inferred from level of transcription (e.g., microarrays) or protein (e.g., Western blots)
IGI   inferred from genetic interaction                Inferred from traditional genetic interactions, functional complementation, rescue experiments, or inference about one gene drawn from the phenotype of a mutation in a different gene
IMP   inferred from mutant phenotype                   Inferred from any gene mutation/knockout, overexpression/ectopic expression of wild-type or mutant genes, anti-sense experiments, RNAi experiments, or specific protein inhibitors
IPI   inferred from physical interaction               Covers physical interactions between the gene product of interest and another molecule, such as 2-hybrid interactions, co-purification, co-immunoprecipitation, and ion/protein binding experiments
ISS   inferred from sequence or structural similarity  Inferred from sequence similarity (homologue of/most closely related to), recognized domains, structural similarity, or Southern blotting
ND    no biological data available                     Used for annotations to unknown molecular function, biological process, or cellular component
TAS   traceable author statement                       Used for anything in a review article where the original experiments are traceable through that article or for anything found in a textbook or dictionary
NAS   nontraceable author statement                    Used for database entries that do not cite a paper (e.g., UniProt Knowledgebase records, YPD protein reports) or when there are statements in papers (abstract, introduction, or discussion) that a curator cannot trace to another publication

Some of the evidence codes represent stronger levels of evidence than others. For example, the weakest forms of evidence are IEA, where codes have been assigned based on genes identified in a sequence similarity search and have not been manually reviewed, and TAS, where the author of a paper has made a statement about the function of a gene with a citation to a paper describing an experiment that has not been curated.

The table below shows an example of the GO codes assigned for the BRCA1 gene in the mouse, a gene known to be associated with breast cancer in mice and humans. (Source: Mouse Genome Informatics, http://www.informatics.jax.org/)

Category Classification Term Evidence
Biological Process carbohydrate metabolism ISS
Biological Process centrosome cycle IGI
Biological Process DNA damage response, signal transduction resulting in induction of apoptosis ISS
Biological Process DNA repair IEA
Biological Process DNA replication and chromosome cycle IMP
Biological Process dosage compensation, by inactivation of X chromosome IDA
Biological Process negative regulation of cell cycle IEA
Biological Process protein ubiquitination IEA
Biological Process response to DNA damage stimulus IEA
Cellular Component condensed chromosome IDA
Cellular Component cytoplasm IDA
Cellular Component intracellular IEA
Cellular Component nucleus IDA
Cellular Component nucleus ISS
Cellular Component ubiquitin ligase complex IEA
Molecular Function damaged DNA binding IDA
Molecular Function DNA binding IEA
Molecular Function hydrolase activity, hydrolyzing O-glycosyl compounds ISS
Molecular Function protein binding ISS
Molecular Function protein binding TAS
Molecular Function RNA binding ISS
Molecular Function ubiquitin-protein ligase activity IEA
Molecular Function zinc ion binding IEA
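Because each annotation carries an evidence code, a user of these annotations can filter on it. The sketch below is one hedged illustration: it takes the Molecular Function rows from the table above and drops those whose codes the text describes as weakest (IEA and TAS). The function name and the choice of which codes to exclude are assumptions for this example, not a GO recommendation.

```python
# Codes the text above describes as the weakest forms of evidence.
WEAK_EVIDENCE = {"IEA", "TAS"}

# (category, classification term, evidence) rows copied from the
# Molecular Function part of the BRCA1 table.
annotations = [
    ("Molecular Function", "damaged DNA binding", "IDA"),
    ("Molecular Function", "DNA binding", "IEA"),
    ("Molecular Function", "hydrolase activity, hydrolyzing O-glycosyl compounds", "ISS"),
    ("Molecular Function", "protein binding", "ISS"),
    ("Molecular Function", "protein binding", "TAS"),
    ("Molecular Function", "RNA binding", "ISS"),
    ("Molecular Function", "ubiquitin-protein ligase activity", "IEA"),
    ("Molecular Function", "zinc ion binding", "IEA"),
]


def drop_weak(rows, weak=WEAK_EVIDENCE):
    """Keep only annotations whose evidence code is not in the weak set."""
    return [row for row in rows if row[2] not in weak]


stronger = drop_weak(annotations)
stronger_terms = {term for _category, term, _evidence in stronger}
```

Note that "protein binding" survives the filter because it is also supported by an ISS annotation, whereas "zinc ion binding" is dropped because its only support is IEA.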

There is some concern that GO is not truly an "ontology" in the sense that purists would describe one. Nonetheless, the bioinformatics community is also increasingly using GO to annotate the function of genes and their products, particularly in model organism databases. A guide describing the preferred approach to annotating with GO is available on the GO Web site (http://www.geneontology.org/GO.annotation.html).

Anonymous (2006). The Gene Ontology (GO) project in 2006. Nucleic Acids Research, 34: D322-D326.

(4/26/04) The Center for Bioinformatics of the National Cancer Institute (http://ncicb.nci.nih.gov) has undertaken two vocabulary efforts, the NCI Thesaurus and the NCI Metathesaurus. The NCI Thesaurus (http://nciterms.nci.nih.gov) is focused on cancer science and covers basic, preclinical, and clinical research as well as administrative terminology associated with research management (Sioutos et al., 2007). Its goal is to provide a knowledge model that enables cross-disciplinary workers to correctly interpret the meaning and relationships among entities from disciplines other than their own. The NCI Metathesaurus is described in the next section.

The top-level categories of the NCI Thesaurus include:
Sioutos, N., deCoronado, S., et al. (2007). NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. Journal of Biomedical Informatics, 40: 30-43.

5.3.4 The Unified Medical Language System

(4/29/07) Documentation for current and past releases of the Unified Medical Language System (UMLS) is maintained at http://www.nlm.nih.gov/research/umls/documentation.html. An updated fact sheet about the UMLS is available at http://www.nlm.nih.gov/pubs/factsheets/umls.html. Scripts for loading UMLS data into various relational database systems are available at http://umlsinfo.nlm.nih.gov/load.html. A description of the UMLS from a bioinformatics perspective has been published by Bodenreider (2004).

One way to conceptualize the UMLS Metathesaurus is to think of it as a "repository" of vocabularies. The source vocabularies are maintained unchanged and can be extracted from the Metathesaurus.

The 2004 edition of the UMLS Metathesaurus brought substantial changes. The major new features were the new Rich Release Format (RRF) and the inclusion of SNOMED CT, the most recent revision of this vocabulary that had recently been licensed for free use in the United States by its original creator, the College of American Pathologists (CAP). SNOMED has since been acquired by the newly formed International Health Terminology Standards Development Organisation (IHTSDO, also known as SNOMED SDO; http://www.ihtsdo.org/).

The major goals of the RRF were to allow easier extraction of source vocabularies and more robust mapping across different vocabularies (Anonymous, 2003). The former was accomplished by the addition of a new fourth identifier (in addition to the previous CUI, LUI, and SUI) for each string, the atomic unique identifier (AUI). This identifier maintains and identifies the string in its exact form from its source vocabulary. Previously, if two terms were identical or lexical variants, they would be represented by the same SUI. The creation of the AUI allowed the term to be traced back to its source vocabulary in its original form. Another change was the merging of the MRCON and MRSO files into a new MRCONSO file, which included the source vocabulary of each string and also facilitated easier extraction of a specific vocabulary. In addition, new fields in the Metathesaurus allowed encoding of rules for mapping as well as typed attributes.
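A brief sketch may help show why one row per atom makes source extraction simple. It assumes the pipe-delimited MRCONSO.RRF layout with the column order given in current UMLS documentation, which should be checked against the release in use. The helper names are hypothetical. The identifiers and strings come from the atrial fibrillation example in the Metathesaurus documentation, the language code "ENG" is an assumption, and every other field is deliberately left blank rather than invented.

```python
import csv
import io

# Assumed column order of MRCONSO.RRF (verify against the release in use).
MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def make_line(**known):
    """Build one pipe-delimited line, leaving unspecified fields empty."""
    values = dict.fromkeys(MRCONSO_FIELDS, "")
    values.update(known)
    return "|".join(values[name] for name in MRCONSO_FIELDS) + "|"


def read_mrconso(handle):
    """Yield one dict per atom from MRCONSO-style lines."""
    for row in csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE):
        if row:
            # The trailing "|" yields an extra empty field that zip() drops.
            yield dict(zip(MRCONSO_FIELDS, row))


def atoms_from_source(atoms, sab):
    """Keep only the atoms contributed by one source vocabulary (SAB)."""
    return [atom for atom in atoms if atom["SAB"] == sab]


# (LUI, SUI, AUI, source, string) for concept C0004238.
rows = [
    ("L0004238", "S0016668", "A0027665", "MSH", "Atrial Fibrillation"),
    ("L0004238", "S0016668", "A0027667", "PSY", "Atrial Fibrillation"),
    ("L0004238", "S0016669", "A0027668", "MSH", "Atrial Fibrillations"),
    ("L0004327", "S0016899", "A0027930", "PSY", "Auricular Fibrillation"),
    ("L0004327", "S0016900", "A0027932", "MSH", "Auricular Fibrillations"),
]
sample = "\n".join(
    make_line(CUI="C0004238", LAT="ENG", LUI=lui, SUI=sui, AUI=aui, SAB=sab, STR=text)
    for lui, sui, aui, sab, text in rows
)

atoms = list(read_mrconso(io.StringIO(sample)))
msh_atoms = atoms_from_source(atoms, "MSH")
```

The first two atoms share one SUI but have different AUIs, which is the distinction that lets each string be traced back to its own source.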

The tables below from the Metathesaurus documentation demonstrate the new AUIs. The second table also demonstrates an ambiguous term, "cold," which can represent the temperature, the upper respiratory disease, or an acronym for a lung disease.

Concept (CUI) / Terms (LUIs) / Strings (SUIs) / Atoms (AUIs) (RRF only), each level indented under its parent:

C0004238  Atrial Fibrillation (preferred); Atrial Fibrillations; Auricular Fibrillation; Auricular Fibrillations
    L0004238  Atrial Fibrillation (preferred); Atrial Fibrillations
        S0016668  Atrial Fibrillation (preferred)
            A0027665  Atrial Fibrillation (from MSH)
            A0027667  Atrial Fibrillation (from PSY)
        S0016669  Atrial Fibrillations
            A0027668  Atrial Fibrillations (from MSH)
    L0004327  (synonym) Auricular Fibrillation; Auricular Fibrillations
        S0016899  Auricular Fibrillation (preferred)
            A0027930  Auricular Fibrillation (from PSY)
        S0016900  (plural variant) Auricular Fibrillations
            A0027932  Auricular Fibrillations (from MSH)
Concepts (CUIs) / Terms (LUIs) / Strings (SUIs) / Atoms (AUIs) (RRF only), each level indented under its parent:

C0009264  cold temperature
    L0215040  cold temperature
        S0288775  cold temperature
            A0318651  cold temperature (from CSP)
    L0009264  Cold <1>; Cold
        S0007170  Cold <1>
            A0016032  Cold <1> (from MTH)
        S0026353  Cold
            A0040712  Cold (from MSH)
C0009443  Common Cold
    L0009443  Common Cold
        S0026747  Common Cold
            A0041261  Common Cold (from MSH)
    L0009264  Cold <2>; Cold
        S0007171  Cold <2>
            A0016033  Cold <2> (from MTH)
        S0026353  Cold
            A0040708  Cold (from COSTAR)
C0024117  Chronic Obstructive Airway Disease
    L0498186  Chronic Obstructive Airway Disease
        S0837575  Chronic Obstructive Airway Disease
            A0896021  Chronic Obstructive Airway Disease (from MSH)
    L0008703  Chronic Obstructive Lung Disease
        S0837576  Chronic Obstructive Lung Disease
            A0896023  Chronic Obstructive Lung Disease (from MSH)
    L0009264  COLD <3>; COLD
        S0829315  COLD <3>
            A0887858  COLD <3> (from MTH)
        S0474508  COLD
            A0539536  COLD (from SNMI)
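The ambiguity of "cold" can also be detected mechanically. The sketch below, with a hypothetical helper name, takes a subset of the atoms from the second example and reports any string identifier, or any case-folded string, that is attached to more than one concept. It illustrates the idea only and is not a description of how the NLM identifies ambiguous strings.

```python
from collections import defaultdict

# (CUI, SUI, string, source) rows taken from the "cold" example.
atoms = [
    ("C0009264", "S0288775", "cold temperature", "CSP"),
    ("C0009264", "S0007170", "Cold <1>", "MTH"),
    ("C0009264", "S0026353", "Cold", "MSH"),
    ("C0009443", "S0026747", "Common Cold", "MSH"),
    ("C0009443", "S0007171", "Cold <2>", "MTH"),
    ("C0009443", "S0026353", "Cold", "COSTAR"),
    ("C0024117", "S0829315", "COLD <3>", "MTH"),
    ("C0024117", "S0474508", "COLD", "SNMI"),
]


def ambiguous(rows, key):
    """Map each key to its concepts, keeping keys linked to more than one."""
    concepts = defaultdict(set)
    for cui, sui, string, _source in rows:
        concepts[key(sui, string)].add(cui)
    return {k: cuis for k, cuis in concepts.items() if len(cuis) > 1}


# The same exact string (SUI) appearing under different concepts.
same_string = ambiguous(atoms, lambda sui, string: sui)
# The same word ignoring case, which also catches the acronym COLD.
same_word = ambiguous(atoms, lambda sui, string: string.lower())
```

The disambiguated strings supplied by the Metathesaurus itself ("Cold <1>", "Cold <2>", "COLD <3>") are unique to one concept each, so they do not show up in either result.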

The inclusion of SNOMED CT in the Metathesaurus was a challenging undertaking, due to the complexity of the former.  A Web page was created that described how some of the structural differences between the two were reconciled (http://www.nlm.nih.gov/research/umls/Snomed/snomed_represented.html).

Additional work with the Metathesaurus in the RxNorm project has aimed to improve its ability to represent clinical drugs, which may consist of more than one ingredient and have other attributes, such as brand names, strengths, and routes of administration (Nelson et al., 2002). This can be relevant to retrieving content that is linked to from applications such as electronic health records, e.g., an electronic prescribing application for which information about all the components of a multi-drug formulation is to be displayed. The RxNav application has been developed to allow browsing of RxNorm (Bodenreider and Nelson, 2004).

Chen et al. (2007) carried out an evaluation of the UMLS by surveying users on its email list, with 70 individuals responding. They found that the two major intended uses of the UMLS were access to source terminologies (75%) and mapping among source terminologies (44%), although most access was to just a small number of sources. The most commonly reported uses of the UMLS were for terminology research (31%), information retrieval (16%), and terminology translation (12%). Other observations were that the UMLS Metathesaurus was used as a terminology itself (77%) and that users wanted the NLM to develop a unified hierarchy and derive a terminology from it (73%).

Anonymous (2003). White Paper: UMLS Metathesaurus Rich Release (MR+) Format. Bethesda, MD, National Library of Medicine. http://www.nlm.nih.gov/research/umls/white_paper.html.
Bodenreider, O. (2004). The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32: D267-D270.
Bodenreider, O. and Nelson, S. (2004). RxNav: a semantic navigation tool for clinical drugs. MEDINFO 2004 - Proceedings of the Eleventh World Congress on Medical Informatics, San Francisco, CA. IOS Press. 1530.
Chen, Y., Perl, Y., et al. (2007). Analysis of a study of the users, uses, and future agenda of the UMLS. Journal of the American Medical Informatics Association, 14: 221-231.
Nelson, S., Brown, S., Erlbaum, M., Olson, N., Powell, T., Carlsen, B., Carter, J., Tuttle, M. and Hole, W. (2002). A semantic normal form for clinical drugs in the UMLS: early experiences with the VANDF. Proceedings of the 2002 Annual AMIA Symposium, San Antonio, TX. Hanley & Belfus. 557-561.

(4/23/05) The NCI Metathesaurus (http://ncimeta.nci.nih.gov) is based on the UMLS Metathesaurus. UMLS Metathesaurus sources deemed not relevant to cancer are omitted, while those believed to be valuable to cancer science have been added, such as Mitelman's terminology of chromosome aberrations in cancer and GO (although the latter is now in the UMLS Metathesaurus as well). The NCI Metathesaurus contains about 850,000 concepts mapped to 1.5 million terms. A public API is available for the NCI Metathesaurus server in the caCORE system (Komatsoulis et al., 2007) (http://ncicb.nci.nih.gov/NCICB/core). The caCORE also contains a distribution of the NCI Thesaurus data.

Komatsoulis, G., Warzel, D., et al. (2007). caCORE version 3: implementation of a model driven, service-oriented architecture for semantic interoperability. Journal of Biomedical Informatics: Epub ahead of print.

5.4 Manual indexing

(4/24/04) A list of frequently asked questions about MEDLINE indexing has been developed by the NLM (http://www.nlm.nih.gov/bsd/indexfaq.html).

(4/30/06) NLM added full author names to MEDLINE starting in 2002 in the FAU field. The previous abbreviated author name with last name and first and middle initials was maintained in the old AU field (Nahin, 2005).

Nahin, A. (2005). Full Author Searching Comes to PubMed. NLM Technical Bulletin. e4. http://165.112.6.70/pubs/techbull/mj05/mj05_full_author.html.

(4/30/06) The bioinformatics community has devoted great effort to developing mark-up languages and other schemas to annotate and model various biological processes.  Some of these are listed in the table below.

Purpose                                                                      Name                                                                   Source
Annotate genes                                                               Distributed Annotation System (DAS)                                    http://www.biodas.org
Model structure, mathematics, and metadata about cells                       CellML                                                                 http://www.cellml.org
Model biochemical networks                                                   Systems Biology Markup Language (SBML)                                 http://sbml.org
Minimum information about a microarray experiment                            MIAME                                                                  http://www.mged.org
Annotate bio-sequence information                                            Bioinformatic Sequence Markup Language (BSML)                          http://www.bsml.org/
Provide a computational framework for human and other eukaryotic physiology  International Union of Physiological Sciences (IUPS) Physiome Project  http://www.physiome.org.nz/

(4/23/05) Another effort from the bioinformatics community is the Life Sciences Identifier (LSID, http://lsid.sourceforge.net/), which assigns a unique identifier to all biomedical research data (Salamone, 2004). Although not widely adopted yet, the LSID uses the Uniform Resource Name (URN) format and contains the following elements:
Salamone gives examples of how the LSID would be used.  A PubMed article, for example, would have the following LSID:
urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434
Likewise, a second version of the protein 1AFT in the Protein Data Bank would have the following LSID:
urn:lsid:pdb.org:1AFT:2
The usage of the LSID has been modest.  See the following discussion for details:  http://www.nodalpoint.org/node/1571.
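Since an LSID is a colon-delimited URN, splitting one into its parts takes only a few lines. The sketch below assumes the layout given in the LSID specification (authority, namespace, object identifier, and an optional revision), and the class and function names are hypothetical. It is exercised only on the PubMed example; how the shorter Protein Data Bank example maps onto these fields is not settled here.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID of the assumed form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of fields in {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty field in {text!r}")
    return LSID(*parts[2:])


pubmed = parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434")
```

Here the authority is ncbi.nlm.nih.gov, the namespace is pubmed, and the object identifier is the PubMed ID, with no revision given.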

Salamone, S. (2004). LSID: an informatics lifesaver. BIO-IT World. January, 2004. 38-42. http://www.bio-itworld.com/archive/011204/lsid.html.

5.4.1 Bibliographic manual indexing

(4/28/03) McGregor (2003) addresses the issues of indexing with MeSH outside the National Library of Medicine (NLM), i.e., by those who use it to index other resources. He notes that MeSH is well-tuned to indexing the biomedical literature and that the NLM devotes the resources to updating it with the terms it needs, a process that is likely to consume too many resources for most other organizations. Adding "enhanced" or "local" terminology to MeSH can be difficult. One problem is mapping terms into the proper location in the MeSH hierarchy. Another is maintaining those new terms when MeSH is revised or reorganized by the NLM. McGregor also notes that the addition of terms is often political, e.g., the developer of a new surgical procedure wants to be sure his or her new procedure is in the index. A final problem he notes is the lack of use of the MeSH hierarchy. Non-NLM indexers usually do not follow the adage of indexing to the most specific level so searchers can take advantage of the explosion feature of retrieval (see Chapter 6), which leads to poorer search results.

McGregor, B. (2003). Medical indexing outside the National Library of Medicine. Journal of the Medical Library Association, 90: 339-341.

(4/23/05) A new addition to manual indexing for gene information is the Gene Reference into Function (GeneRIF), which is part of Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene), a "switchboard" of information about gene loci (Mitchell et al., 2003). Assignment of GeneRIFs is now part of the MEDLINE indexing process, although others can nominate them to the NCBI. GeneRIFs describe MEDLINE references that focus on the basic biology of the gene or its protein products from the designated organism. This includes isolation, structure, genetics, and function of genes/proteins in normal and disease states. More information can be found at http://www.ncbi.nlm.nih.gov/projects/GeneRIF/GeneRIFhelp.html.

Mitchell, J., Aronson, A., Mork, J., Folk, L., Humphrey, S. and Ward, J. (2003). Gene indexing: characterization and analysis of NLM's GeneRIFs. Proceedings of the AMIA 2003 Annual Symposium, Washington, DC. Hanley & Belfus. 460-464.

(4/23/05) The characterization of the background of indexers in the book is incorrect. According to Betsy Humphreys, Deputy Director for Library Operations at the NLM (personal communication), "Most everyone who indexes has one or more degrees in one of the life or health sciences (e.g., biochemistry, molecular biology). They are all professional in that they get fairly rigorous training and are revised by a senior indexer for a variable length of time (depending on how long it takes them to become high quality indexers). Anyone who makes it through this process (including clinicians/scientists) can become a high quality indexer. Indexing is not for everyone - no matter what their educational background. There is some anecdotal evidence that clinicians may tend to focus more on nuances in specific articles and therefore index less consistently - and consistency is an important aspect of high quality indexing. Some indexers do have library or information science degrees - but they also have a background in science. There are more indexers with a PhD in a scientific field than there are with a Master of Library Science, however."

Here is some additional information about NLM indexers (Marcetich, 2004):
Marcetich, J., Rappaport, M., et al. (2004). Indexing consistency in MEDLINE. MLA 04 Abstracts, Washington, DC. Medical Library Association. 10-11. http://www.mlanet.org/am/am2004/pdf/abstracts.pdf.

5.4.2 Full-text manual indexing

5.4.3 Web manual indexing

(4/30/07) Although we do not think of it as "indexing" in the traditional sense, the growing application of "paid search" is a form of indexing. Paid search is the assignment of indexing terms to content based on how much someone is willing to pay for them (Jansen, 2005). While some search engines do not distinguish between search results based on paid search, Google has developed a tremendously successful business model by clearly demarcating "Sponsored Links" separate from its regular search results. Google's Adwords approach works by advertisers bidding on given words and phrases (e.g., the OHSU biomedical informatics program bidding on medical informatics, health informatics, biomedical informatics, etc.) for how much they are willing to pay when a user sees the ad and clicks through to the advertiser's site. Whoever is willing to bid more for a word or phrase will rank higher in the output. Advertisers are charged only when users click through, and can set a daily maximum to not exceed a specific budget. Once the daily maximum is reached, the advertiser's ad will no longer appear in the Sponsored Links until the following day. Google's approach is not the only one, but it is the most common (Fain and Pedersen, 2005). One challenge with approaches like Adwords is "click fraud," where competitors or others with malicious intent can set up robots that click through ads just to run up the advertiser's cost to their daily maximum (Kitts et al., 2005). Most search engines have processes that aim to detect click fraud.
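The bidding rules just described can be sketched in a few lines. This is a toy model and not Google's actual algorithm: the class and function names and the advertiser names are hypothetical, amounts are kept in whole cents, and charging the full bid on each click is an assumption, since the description above does not say how the price per click is set.

```python
from dataclasses import dataclass


@dataclass
class Advertiser:
    name: str
    bid_cents: int        # amount offered per click-through
    daily_max_cents: int  # daily budget cap
    spent_cents: int = 0

    def active(self):
        # The ad is withdrawn once the daily maximum has been reached.
        return self.spent_cents < self.daily_max_cents


def sponsored_links(advertisers):
    """Active ads for one keyword, highest bid first."""
    live = [a for a in advertisers if a.active()]
    return sorted(live, key=lambda a: a.bid_cents, reverse=True)


def click_through(advertiser):
    """Charge only on a click; charging the full bid is an assumption."""
    advertiser.spent_cents += advertiser.bid_cents


ads = [Advertiser("program_a", 150, 1000), Advertiser("program_b", 200, 400)]
before = [a.name for a in sponsored_links(ads)]

# Two clicks exhaust program_b's daily maximum (2 x 200 = 400 cents).
click_through(ads[1])
click_through(ads[1])
after = [a.name for a in sponsored_links(ads)]
```

The same model makes the click fraud problem easy to see: a robot that calls the click step repeatedly removes a competitor's ad for the rest of the day.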

Fain, D. and Pedersen, J. (2005). Sponsored search: a brief history. Bulletin of the American Society for Information Science and Technology, 32(2): 12-13. http://www.asis.org/Bulletin/Dec-05/pedersen.html.
Kitts, B., LeBlanc, B., et al. (2005). Click fraud. Bulletin of the American Society for Information Science and Technology, 32(2): 20-23. http://www.asis.org/Bulletin/Dec-05/clickfraud.html.
Jansen, B. (2005). Paid search as an information seeking paradigm. Bulletin of the American Society for Information Science and Technology, 32(2): 7-8. http://www.asis.org/Bulletin/Dec-05/jansen.html.

(4/30/06) A number of approaches have emerged in recent years to index various types of Web content. One effort has focused on making high-quality consumer-oriented health information available. Healthline (http://www.healthline.com) searches over 60,000 Web sites that it believes represent high-quality information. It also licenses a health encyclopedia, aiming to guarantee that some high-quality information is available on almost every topic. It has also developed a consumer-oriented vocabulary whose terms are assigned to articles in its database. Finally, it employs a form of collaborative filtering that allows users to rate articles they have read.

Collaborative filtering, the rating of content by similar users, is widely used by many Web sites. Amazon.com and Netflix employ a form of collaborative filtering to rate books and movies, respectively. One approach to collaborative filtering gaining prominence on the Web also goes by the name of "social bookmarking." The leading site for this is Del.icio.us (http://del.icio.us/). Another approach is Flickr, an approach to tagging images that is described below.

There is some collaborative filtering used in health and biomedicine. Haynes and colleagues (2005a, 2005b) have described the McMaster Online Rating of Evidence (MORE) system, where clinicians rate journal articles already filtered for scientific (i.e., evidence-based) merit using the criteria of ACP Journal Club, Evidence-Based Medicine, Evidence-Based Nursing, etc. These clinicians rate the articles on seven-point scales for relevance and newsworthiness. The ratings are averaged for specific medical disciplines so that users of MORE will see ratings that have been made by physicians and nurses in their own specialties. A study of physician users from around the world found they rated systematic reviews higher for relevance to clinical practice and original studies higher for newsworthiness (Haynes et al., 2006).
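
The per-discipline averaging described above might look something like the following sketch. The function name and the tuple layout of a rating are assumptions made for illustration, not the actual MORE data model.

```python
from collections import defaultdict
from statistics import mean


def average_by_discipline(ratings):
    """Average seven-point ratings per (article, discipline) pair.

    `ratings` is an iterable of hypothetical tuples:
    (article_id, discipline, relevance, newsworthiness).
    Returns {(article_id, discipline): (mean_relevance, mean_newsworthiness)}.
    """
    grouped = defaultdict(list)
    for article, discipline, relevance, newsworthiness in ratings:
        grouped[(article, discipline)].append((relevance, newsworthiness))
    return {
        key: (
            mean(relevance for relevance, _ in scores),
            mean(news for _, news in scores),
        )
        for key, scores in grouped.items()
    }
```

Because the averages are keyed by discipline, a cardiologist and a nurse looking at the same article would each see the ratings assigned by raters from their own specialty.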

Another effort focuses on indexing Web-based content for medical education, the Health Education Assets Library (HEAL, http://www.healcentral.org). The primary goal of HEAL is to develop a metadata standard for content such as images, cases, quizzes, lecture slides, and so forth so that they can be readily shared by other medical educators (Candler et al., 2003). HEAL is part of the National SMETE Digital Library (NSDL) initiative to develop digital libraries (see Chapter 10). The leaders of this project have also assessed the barriers to greater sharing of this type of content, finding in focus groups of medical school faculty that wider dissemination could be enhanced by (Uijtdehaage et al., 2003):
Candler, C., Uijtdehaage, S., et al. (2003). Introducing HEAL: the Health Education Assets Library. Academic Medicine , 78: 249-253. http://www.healcentral.org/publications/Academic_Medicine_C_Mar_2003.pdf.
Haynes, R. (2005). bmjupdates+, a new free service for evidence-based clinical practice. Evidence-Based Nursing, 8: 39.
Haynes, R. and Walker-Dilks, C. (2005). Having trouble deciding what's most important to read? Look to the stars. ACP Journal Club, 143(1): A10. http://www.acpjc.org/shared/about_stars.htm.
Haynes, R., Cotoi, C., et al. (2006). Second-order peer review of the medical literature for clinical practitioners. Journal of the American Medical Association, 295: 1801-1808.
Uijtdehaage, S., Contini, J., et al. (2003). Sharing digital teaching resources: breaking down barriers by addressing the concerns of faculty members. Academic Medicine, 78: 286-294. http://www.healcentral.org/publications/Academic_Medicine_U_Mar_2003.pdf.

(4/23/05) Another major Web metadata initiative is the eXtensible Metadata Platform (XMP). Although being led by a private company, Adobe, the effort is being carried out through the public standards process of the World Wide Web Consortium. XMP is built on top of the RDF Schema framework described below. An overview document (Anonymous, 2005) and more detail are available on the Adobe Web site. The overview contains an informative basic description of RDF based on a can of beans and what it contains, who produced it, etc.

Anonymous (2005). A Manager’s Introduction to Adobe eXtensible Metadata Platform, The Adobe XML Metadata Framework. San Jose, CA, Adobe Systems Incorporated. http://www.adobe.com/products/xmp/pdfs/whitepaper.pdf.

5.4.3.1 Dublin Core Metadata Initiative

(4/27/03) An update of the Dublin Core Metadata Initiative (DCMI) was published in D-Lib Magazine by Dekkers and Weibel (2003), who reported on further standardization of the DCMI, publication of a new XML Schema for it, and its growing use in commercial applications.

Dekkers, M. and Weibel, S. (2003). State of the Dublin Core Metadata Initiative, April 2003. D-Lib Magazine, 9: 4. http://www.dlib.org/dlib/april03/weibel/04weibel.html.

(4/27/03) Another large-scale Web content indexing initiative comes from Kaiser Permanente, where a national effort aims to index all knowledge-based resources on the health system's nationwide clinical intranet (Dolin et al., 2001). The Permanente Knowledge Connection uses a superset of the DCMI. Analysis of the indexing process found that the time required to assign metadata to these mostly secondary literature resources was comparable to the time required for human indexing of the primary literature in MEDLINE records by National Library of Medicine indexers: about 15-30 minutes to initially catalog a resource and 5-10 minutes to update it when the content is revised.

Dolin, R., Boles, M., et al. (2001). Kaiser Permanente's "metadata-driven" national clinical intranet. MEDINFO 2001 - Proceedings of the Tenth World Congress on Medical Informatics, London, England. IOS Press. 319-323.

(4/24/04) The National Institute of Environmental Health Sciences (NIEHS, http://www.niehs.nih.gov/), an institute of the National Institutes of Health, has assessed the use of DCMI for resources on its Web site (Robertson et al., 2001). An analysis of its use found that content authors were readily able to generate metadata and were able to do so with quality comparable to professional indexers (Greenberg et al., 2002).

Robertson, W., Leadem, E., Dube, J. and Greenberg, J. (2001). Design and implementation of the National Institute of Environmental Health Sciences Dublin Core Metadata schema. Proceedings of the International Conference on Dublin Core and Metadata Applications 2001, Tokyo, Japan. National Institute of Informatics (NII). http://www.nii.ac.jp/dc2001/proceedings/product/paper-29.pdf.
Greenberg, J., Pattuelli, M., Parsia, B. and Robertson, W. (2002). Author-generated Dublin Core Metadata for Web resources: a baseline study in an organization. Journal of Digital Information, 2: 2. http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Greenberg/.

(4/24/04) As described in the book, the PICS standard has been adapted for use as a metadata system for rating the quality of information contained on health-related Web sites. Initially called medPICS, this metadata vocabulary is now called the Health Information Disclosure, Description, and Evaluation Language (HIDDEL) (Eysenbach et al., 2001). It is implemented in the MedCIRCLE project, which allows harvesting of HIDDEL metadata by other Web sites to ascertain trustworthiness of information (Mayer et al., 2003). A major principle of MedCIRCLE is a chain of trust, whereby trustworthiness can be linked across multiple organizations. For example, if Organization A trusts a certain Web page and Organization B trusts Organization A, then through these linkages, Organization C, which trusts Organization B, will trust the page that Organization A trusts. One participant in MedCIRCLE is CISMeF (described in the book), which uses a combination of DCMI and HIDDEL (Thirion et al., 2003).
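
The chain-of-trust example above amounts to reachability in a directed graph of trust statements. The sketch below illustrates that idea only; the function name and the dictionary representation are assumptions, and it does not reflect HIDDEL syntax or the actual MedCIRCLE implementation.

```python
def trusts(trust_links, truster, target):
    """Return True if `truster` reaches `target` by following trust links.

    `trust_links` maps each party to the parties or pages it trusts directly.
    """
    seen = set()
    stack = [truster]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue  # already explored; also guards against trust cycles
        seen.add(node)
        stack.extend(trust_links.get(node, ()))
    return False


# The example from the text: C trusts B, B trusts A, A trusts a page.
trust_links = {
    "Organization C": ["Organization B"],
    "Organization B": ["Organization A"],
    "Organization A": ["page"],
}
```

Here `trusts(trust_links, "Organization C", "page")` is true, while `trusts(trust_links, "Organization A", "Organization C")` is false, since trust is followed only in the direction it was asserted.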

Eysenbach, G., Kohler, C., Yihune, G., Lampe, K., Cross, P. and Brickley, D. (2001). A metadata vocabulary for self- and third-party labeling of health web-sites: Health Information Disclosure, Description and Evaluation Language (HIDDEL). Proceedings of the 2001 AMIA Annual Symposium , Washington, DC. Hanley & Belfus. 169-173.
Mayer, M., Darmoni, S., Fiene, M., Kohler, C., Roth-Berghofer, T. and Eysenbach, G. (2003). MedCIRCLE: collaboration for Internet rating, certification, labelling and evaluation of health information on the World Wide Web. Studies in Health Technology and Informatics, 95: 667-672.
Thirion, B., Loosli, G., Douyere, M. and Darmoni, S. (2003). Metadata element set in a quality-controlled subject gateway: a step to a health semantic Web. Studies in Health Technology and Informatics , 95: 707-712.

(4/26/04) An additional enhancement to Resource Description Framework (RDF) has been made. RDF does not provide mechanisms for describing properties or the relationships between them. For this reason, RDF Schema has been developed (Brickley and Guha, 2004), which provides additional semantics for capturing this type of information. RDF Schema is very useful for the Semantic Web, described in Chapter 10.
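
As a rough illustration of the kind of information RDF Schema adds, the sketch below stores statements as plain Python triples and uses the RDF Schema term rdfs:subClassOf to infer the classes a resource belongs to. The ex: names are made up for the example, and the triples are ordinary tuples rather than the output of any RDF library.

```python
# Each statement is a (subject, predicate, object) triple.
TRIPLES = {
    # Schema-level statements: relationships between classes and properties.
    ("ex:ReviewArticle", "rdfs:subClassOf", "ex:Article"),
    ("ex:Article", "rdfs:subClassOf", "ex:Document"),
    ("ex:author", "rdfs:domain", "ex:Document"),
    # Instance-level statement: plain RDF can already say this much.
    ("ex:doc1", "rdf:type", "ex:ReviewArticle"),
}


def superclasses(cls, triples):
    """Collect every class reachable from `cls` via rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "rdfs:subClassOf" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def types_of(resource, triples):
    """Return the stated types of `resource` plus those implied by the schema."""
    direct = {o for s, p, o in triples if s == resource and p == "rdf:type"}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred
```

With the schema statements present, `types_of("ex:doc1", TRIPLES)` includes ex:Article and ex:Document as well as the stated ex:ReviewArticle; without them, only the stated type could be recovered.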

Brickley, D. and Guha, R. (2004). RDF Vocabulary Description Language 1.0: RDF Schema. World Wide Web Consortium. http://www.w3.org/TR/rdf-schema/.

5.4.3.2 Open Directory

5.4.4 Limitations of human indexing

(4/23/05) The oft-cited Funk and Reid study (1983) has been replicated in the modern indexing environment of the NLM. The results have not been formally published but were presented in a public forum (Marcetich et al., 2004). They show that human indexing consistency has not changed substantially.

Category of indexing | Consistency (Funk and Reid, 1983) | Consistency (Marcetich et al., 2004)
Major topic MeSH terms | 61.1% | 48.6%
All MeSH terms | 55.4% | 44.9%
Major topic MeSH/subheadings | 43.1% | 28.3%
All MeSH/subheadings | 33.8% | 24.3%
Check tags | 74.7% | 74.5%
Major topic floating subheadings | 54.9% | 46.1%
All floating subheadings | 48.7% | 43.4%

Marcetich, J., Rappaport, M., et al. (2004). Indexing consistency in MEDLINE. MLA 04 Abstracts, Washington, DC. Medical Library Association. 10-11. http://www.mlanet.org/am/am2004/pdf/abstracts.pdf.

5.5 Automated indexing

5.5.1 Word indexing

5.5.2 Limitations of word indexing

5.5.3 Word weighting

5.5.4 Link-based indexing

(4/30/06) Although the actual operations of Google are now highly guarded trade secrets, a number of researchers have developed and published enhancements to the original PageRank algorithm to make it more efficient. One approach focuses on improving input/output efficiency and also has an understandable description of the basic algorithm (Chen et al., 2002).

The book's description of PageRank could be improved. One point that was left out was how the PageRank algorithm begins, since the PageRank value for any given page is calculated from the PageRank values of the pages that point to it. The usual approach, as described by Chen et al. (2002), is to assign every page a baseline value (such as the damping factor) and then iterate on a periodic basis.
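
The start-then-iterate process can be sketched as follows. This is a minimal illustration using the commonly cited form of the PageRank formula, in which a page's value is (1 - d) plus d times the sum, over pages linking to it, of each linking page's value divided by its number of outgoing links. The function name, the default damping factor of 0.85, and the fixed number of iterations are assumptions for the example; it is not the I/O-efficient method of Chen et al.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a small link graph.

    `links` maps each page to the list of pages it links to.
    Pages with no outgoing links simply pass on no rank in this sketch.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Starting point: every page gets the same baseline value. The text
    # suggests the damping factor as one choice of baseline.
    rank = {page: damping for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each linking page contributes its rank split evenly
            # across its outgoing links.
            incoming = sum(
                rank[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank
```

For a hypothetical three-page graph `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page c ends up with the highest value because two pages point to it, and page a outranks page b because its single incoming link comes from the highly ranked page c.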

It is often stated that PageRank is, simplistically, a form of measuring the in-degree, or the number of links that point to a page. In reality, PageRank is more complex, giving added weight to pages that are pointed to by those that themselves have higher PageRank. Fortunato et al. (2005) carried out an analysis that assessed how closely PageRank is approximated by simple in-degree. They found that the approximation was relatively accurate, allowing Web content creators to estimate the PageRank of their content by knowing the in-degree to their pages.
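
In-degree itself is simple to compute from a link graph, as the following sketch shows. The function name and graph representation are assumptions for illustration, and the sketch only counts links; it does not reproduce the estimation method of Fortunato et al.

```python
def in_degree(links):
    """Count, for each page, how many distinct other pages link to it.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    counts = {page: 0 for page in pages}
    for source, targets in links.items():
        for target in set(targets):      # count each linking page once
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts
```

Unlike PageRank, this count treats every incoming link equally, regardless of how prominent the linking page is.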

PageRank and other forms of link-based indexing have been applied to biomedicine using data from the Science Citation Index (Bernstam et al., 2006). As will be noted in later chapters, this approach has been shown to improve the effectiveness of searching for articles deemed "important" in specialized bibliographies.

Bernstam, E., Herskovic, J., et al. (2006). Using citation data to improve retrieval from MEDLINE. Journal of the American Medical Informatics Association, 13: 96-105.
Chen, Y., Gan, Q., et al. (2002). I/O-efficient techniques for computing pagerank. Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA. ACM Press. 549-557. http://doi.acm.org/10.1145/584792.584882.
Fortunato, S., Boguna, M., et al. (2005). How to make the top ten:  approximating PageRank from in-degree. Bloomington, IN, Indiana University. http://arxiv.org/pdf/cs.IR/0511016.

5.5.5 Web crawling

(5/7/03) Henzinger et al. (2002) describe the major challenges for Web search engines:
Henzinger, M., Motwani, R., et al. (2002). Challenges to Web search engines. SIGIR Forum, 36: 11-22. http://www.acm.org/sigir/forum/F2002/henzinger.pdf.

(4/25/04) Of course, the Web may fundamentally alter everything we know and do with regard to indexing. Brooks (2003) notes that Web content is very different from traditional document databases, the latter of which he describes as the "closed Web" (or invisible Web, as it was called in Chapter 1 of this book). The "open Web," he notes, consists of pages that are constantly changing, in marked distinction to more traditional bibliographic and full-text collections. Furthermore, many of the creators of open Web pages actively seek to do things that make their pages more retrievable by Web search engines, such as creating text visible only to search engines but not readers and attempting to increase linkages to them. He also notes that Web search engines deliberately keep their parsing algorithms proprietary so page creators will not "game" their systems. They also ignore the text in the HTML <META> tag because of its untrustworthiness. Brooks argues that traditional topical metadata approaches (e.g., Bates, 2002) are destined to fail for the Web and that open Web pages are best viewed as "cultural artifacts" rather than enduring documents.

Bates, M. (2002). After the dot-bomb: getting Web information right this time. First Monday, 7: 7. http://firstmonday.org/issues/issue7_7/bates.
Brooks, T. (2003). Web search: how the Web has changed information retrieval. Information Research, 8: 3. http://informationr.net/ir/8-3/paper154.html.

(4/23/05) Google now performs stemming of words when it indexes them, according to its "basics" search page (http://www.google.com/help/basics.html).

5.6 Index imaging

(4/30/06) In the future, this section should be entitled something like, "Indexing of Other (or Non-Text) Content," reflecting the diversity of content that people are interested in indexing and retrieving (and the direction my own research has been taking!). Let's start, however, with image indexing. The indexing of images has been described by many authors, in a general textbook (Del Bimbo, 1999), a general scientific paper (Rui, 1999), and more medically focused journal articles (Müller, 2004; Lehmann, 2004).

A better way to describe image indexing than in the book is to note that there are two broad approaches. One is what is described in the book, which is appropriately called semantic indexing, or the use of textual annotations. The approach to semantic indexing can be quite varied, from simple free-text descriptions to the use of more structured metadata, such as the approaches being taken by the HEAL project and XMP described above. One approach gaining increasing visibility is the Google Image search tool (http://images.google.com/), which "indexes" images by the text of the Web pages in which they appear.

A key aspect of metadata for imaging systems is standardization of the image type, orientation, modality, and so forth. An early approach to develop a schema for this was the SNOMED DICOM microglossary (Bidgood, 1998). A more recent effort is the Image Retrieval in Medical Applications (IRMA) classification (Lehmann, 2003), which classifies images along four axes:
A developing standard for metadata about images is the Z39.87 Data Dictionary for Technical Metadata for Digital Still Images, which aims to define a standard set of metadata elements for digital images. Supporting the draft standard is an XML schema, called Metadata for Images in XML (MIX, http://www.loc.gov/standards/mix/). An emerging standard for video metadata comes from the Video Development Initiative (ViDE, http://www.vide.net/), which is adapting the DCMI (Agnew, 2001).

The other broad approach is called content-based indexing or visual indexing (Müller, 2004). In this approach, computer algorithms identify "features" in the image. These features include aspects of an image that a computer algorithm can recognize, such as:
It can be seen, however, that these features do not necessarily identify what is in the image, e.g., a chest x-ray with pneumonia or a microscopic slide of a cell.

In somewhat of an analogy to text indexing, the semantic approach can be considered to be "manual" indexing, while the content-based approach could be called "automated" indexing. Unlike text indexing, however, the state of the art for visual indexing is still fairly primitive. While we can process text documents automatically and get a good sense of what they are about, we still cannot, for example, process a chest x-ray and determine that an infiltrate from pneumonia or an enlarged heart is present. The best that visual retrieval can do is find very similar images, a process that can be derailed by things as simple as a picture border or slightly changed orientation.

Although image retrieval will be discussed at greater length in the next chapter on retrieval, it should be noted here that content-based image retrieval tools typically build vectors of these features and aim to retrieve images with similar features. An open source system, the GNU Image Finding Tool (GIFT, http://www.gnu.org/software/gift/), has been adapted for medical image retrieval (Müller, 2003).
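
The idea of building feature vectors and retrieving images with similar features can be illustrated with a deliberately simple sketch. It assumes, for the sake of example, that an image is a flat list of grayscale pixel values (0-255) and that the only feature is a coarse intensity histogram compared by cosine similarity. Real systems such as GIFT use far richer features, and the function names here are hypothetical.

```python
import math


def gray_histogram(pixels, bins=4, max_value=255):
    """Build a normalized intensity histogram as a simple feature vector.

    Assumes `pixels` is a non-empty list of integers in 0..max_value.
    """
    counts = [0] * bins
    for value in pixels:
        index = min(value * bins // (max_value + 1), bins - 1)
        counts[index] += 1
    total = len(pixels)
    return [count / total for count in counts]


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_pixels, collection):
    """Return the name of the image in `collection` closest to the query."""
    query_vector = gray_histogram(query_pixels)
    return max(
        collection,
        key=lambda name: cosine_similarity(query_vector, gray_histogram(collection[name])),
    )
```

A mostly dark query image will match a dark image in the collection rather than a bright one, but nothing in the feature vector says what the image depicts, which is the limitation noted above.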

Another type of content attracting a great deal of interest from an indexing standpoint is e-learning content. An emerging view in this area is that educational content should be developed as learning objects. Ideally, learning objects should be sharable, reusable, and able to be discovered by their metadata. An emerging standard for e-learning content is the IEEE 1484 Learning Object Metadata (IEEE LOM) standard (Ogbuji, 2003), the most recent version of which is available on the IEEE Web site (Anonymous, 2005). The IEEE LOM consists of nine general categories:
  1. General
  2. LifeCycle
  3. Meta-Metadata
  4. Technical
  5. Educational
  6. Rights
  7. Relation
  8. Annotation
  9. Classification
A number of these elements map to the DCMI (see http://www.ischool.washington.edu/sasutton/IEEE1484.html). A comparison between IEEE LOM and DCMI was carried out for the iLumina Digital Library Project, which covers science, technology, engineering, and mathematics content for college undergraduates (Heath, 2005). The authors found that IEEE LOM was more comprehensive than DCMI (it has many more elements), but most of the elements deemed most important in IEEE LOM had correlates in DCMI.

A medical-specific version, Healthcare LOM, has been developed by the MedBiquitous Consortium (http://www.medbiq.org/). This version is being used in the MedEdPORTAL project (http://www.aamc.org/mededportal), a database of medical educational content.

Agnew, G. and Kniesner, D. (2001). ViDe User’s Guide: Dublin Core Application Profile for Digital Video, Video Development Initiative (ViDE). http://www.vide.net/workgroups/videoaccess/resources/vide_dc_userguide_20010909.pdf.
Anonymous (2005). IEEE P1484.12.3, Draft 8 Draft Standard for Learning Technology - Extensible Markup Language (XML) Schema Definition Language Binding for Learning Object Metadata. Washington, DC, IEEE Computer Society. http://ltsc.ieee.org/wg12/files/IEEE_1484_12_03_d8_submitted.pdf.
Bidgood, W. (1998). The SNOMED DICOM microglossary: controlled terminology resource for data interchange in biomedical imaging. Methods of Information in Medicine, 37: 404-414.
Del Bimbo, A. (1999). Visual Information Retrieval. San Francisco. Morgan Kaufmann Publishers.
Heath, B., McArthur, D., et al. (2005). Metadata lessons from the iLumina digital library. Communications of the ACM, 48(7): 68-74.
Lehmann, T., Schubert, H., et al. (2003). The IRMA code for unique classification of medical images. Medical Imaging 2003: PACS and Integrated Medical Information Systems: Design and Evaluation, San Diego, CA. International Society for Optical Engineering. 440-451. http://libra.imib.rwth-aachen.de/irma/ps-pdf/SPIE_2003-5033(x)-56.pdf.
Lehmann, T., Güld, M., et al. (2004). Content-based image retrieval in medical applications. Methods of Information in Medicine, 43: 354-361.
Müller, H., Michoux, N., et al. (2004). A review of content-based image retrieval systems in medical applications-clinical benefits and future directions. International Journal of Medical Informatics, 73: 1-23.
Müller, H., Rosset, A., et al. (2003). Integrating content-based access methods into a medical case database. Proceedings of Medical Informatics Europe (MIE 2003), St. Malo, France. http://www.sim.hcuge.ch/medgift/publications/MRV2003.pdf.
Ogbuji, U. (2003). Thinking XML: Learning Objects Metadata. IBM developerWorks. December 2, 2003. http://www-128.ibm.com/developerworks/xml/library/x-think21.html.
Rui, Y., Huang, T., et al. (1999). Image retrieval:  past, present and future. Journal of Visual Communication and Image Representation, 10: 1-23.

(4/27/03) The Digital Anatomist Project (http://sig.biostr.washington.edu/projects/da/) has devoted great effort to modeling anatomical structures and the knowledge associated with them. An overview of the project described its motivation and challenges (Brinkley and Rosse, 1997). A recent review described the issues in modeling structural information such as anatomy (Brinkley and Rosse, 2002). The conceptual framework of the Digital Anatomist organizes structural information into the following categories:
Brinkley, J. and Rosse, C. (1997). The Digital Anatomist distributed framework and its applications to knowledge-based medical imaging. Journal of the American Medical Informatics Association , 4: 165-183.
Brinkley, J. and Rosse, C. (2002). Imaging Informatics and the Human Brain Project:  The Role of Structure, 111-128, in Haux, R. and Kulikowski, C., eds. Yearbook of Medical Informatics 2002 . Stuttgart, Germany. Schattauer. http://sig.biostr.washington.edu/publications/online/Yearbook_MI_2002.pdf.

5.7  Data structures for efficient retrieval

Last updated - April 30, 2007