The main goal of this book is to provide an understanding of the theory,
implementation, and evaluation of information retrieval (IR) systems in health
and biomedicine. There are already a number of excellent “how-to” volumes
on searching healthcare databases (Hutchinson, 1998, Katcher, 1999).
Similarly, there are also a number of high-quality basic IR textbooks (Grossman
and Frieder, 1998, Baeza-Yates and Ribeiro-Neto, 1999, Belew, 2000).
This volume is different from all of the above in that it covers basic IR
as do the latter books, but with a distinct focus on the health and biomedicine
domain.
The first edition of this book was written from 1994 to 1995 and published
in 1996. Although subsequent editions of books in many fields represent
incremental updates, this edition is profoundly rewritten, and is essentially
a new book. As everyone who is reading this knows, the IR world has
changed substantially in the 8 to 9 years since the first edition.
In that edition, the Internet was a “special topic” in the very last chapter
of the book. Here, on the other hand, it is introduced at the beginning
of the first chapter. At the time of the first edition, using a search
engine was something that only a minority of healthcare practitioners had
ever done. Now, not only have virtually all physicians used the World
Wide Web, but a majority of their patients have used it for seeking personal
health information as well.
The Web also profoundly altered the way this edition was researched and written.
When preparing the first edition, finding an article not in my own collection
or the Oregon Health & Science University (OHSU) Library was a chore
that entailed either driving to a nearby library at another university or
ordering it through interlibrary loan. For this edition, I was usually
able to find articles on the Web, either because OHSU or I had a subscription
or because the article was freely available. In writing this edition
I really learned firsthand the value of scientific publishing on the Web.
The only downside to the feast of articles I found was the ease with which
I found additional articles to consider for inclusion!
Another way the Web has impacted and will continue to impact this edition
is through the maintenance of a Web site for errata and updates. The
Web site www.irbook.org will identify all errors in the book text as well
as provide updates on important new findings in the field as they become
available.
The work on this edition also drove home the quality of the IR systems I
was using. I must give particular mention to the following resources
that provided fast and accurate access to a great deal of information:
the National Library of Medicine (NLM) PubMed MEDLINE system, the Google
search engine, ResearchIndex, and the multitude of journals that have opted
to electronically publish their full texts.
As in the first edition, the approach is still to introduce all the necessary
theory to allow coverage of the implementation and evaluation of IR systems
in health and biomedicine. Any book on theoretical aspects must necessarily
use technical jargon, and this book is no exception. Although jargon
is minimized, it cannot be eliminated without retreating to a more superficial
level of coverage. Readers’ understanding of the jargon will vary
with their backgrounds, but anyone with some background in computers,
libraries, health, and/or biomedicine should be able to understand most of
the terms used. In any case, an attempt to define all jargon terms
is made.
Another approach is to attempt wherever possible to classify topics, whether
discussing types of information or models of evaluation. I have always
found classification useful in providing an overview of complex topics.
One problem, of course, is that not everything fits into the neat
and simple categories of the classification. This occurs repeatedly
with information science, and the reader is forewarned.
This book had its origins in a tutorial taught at the former Symposium on
Computer Applications in Medical Care (SCAMC) meeting. The content continues
to grow each year through my annual course taught to medical informatics
students in the on-campus and distance-learning programs at OHSU. (They
often do not realize that next year’s course content is based in part on
the new and interesting things they teach me!) The book can be used
in either a basic information science course or a health and biomedical information
science course. It should also provide a strong background for others
interested in this topic, including those who design, implement, use, and
evaluate IR systems.
Interest continues to grow in health and biomedical IR systems. I entered
a fellowship in medical informatics at Harvard University in the late 1980s,
when the influence of medical artificial intelligence was still strong.
I had assumed I would take up the banner of some aspect of that area, such
as knowledge representation. But along the way I came across a reference
from the field of “information retrieval.” It looked interesting, so
I looked at the references of that reference. It did not take long
to figure out that this was where my real interests lay, and I spent many
an afternoon in my fellowship tracing references in the Harvard University
and Massachusetts Institute of Technology libraries. Even though I
had not yet heard of the field of bibliometrics, I was personally validating
all its principles. Like many in the field, I have been awed to see
IR become “mainstream” with the advent of the Web in recent years.
The book is divided into three sections. The first section covers the
basic concepts of information science. The first chapter provides basic
definitions and models that will be used throughout the book. The next
chapter gives an overview of health and biomedical information, covering
issues related to its generation and use. The third chapter discusses
the evaluation of IR systems, highlighting the methods and their limitations.
The evaluation chapter is deliberately placed at the beginning of the book
to emphasize the fundamental importance of this topic.
The second section covers the current state of the art in commercial and
other widely used retrieval systems. The first chapter in this section
gives an overview of the great deal of content that is currently available.
Next come chapters on the two fundamental intellectual tasks of IR, indexing
and retrieval. The predominant paradigms of each are discussed in detail.
The final chapter covers evaluation of these systems, providing a justification
for the work described in the following section on research efforts.
The third section covers the major threads of research and development in
efforts to build better IR systems. The focus is initially on details
of indexing and retrieval, with a chapter each on the two major thrusts,
which are lexical-statistical and linguistic systems. The next chapter
surveys various efforts to augment other systems.
This is followed by a chapter on information extraction, a topic of growing
importance. Throughout this section, a theme of implementational feasibility
and evaluation is maintained.
Within each chapter, the goal is to provide a comprehensive overview of the
topic, with thorough citations of pertinent references. There is a
preference to discuss health and biomedical implementations of principles,
but where this is not possible, the original domain of implementation is
discussed. Several chapters use a small sample database, given in
Appendix 1 and further described at the end of Chapter 1, to illustrate
the principles being discussed.
This book would not have been possible without the influence of various mentors,
dating back to high school, who nurtured my interests in science generally
and/or medical informatics specifically, and/or helped me achieve my academic
and career goals. The most prominent include: Mr. Robert Koonz
(then of New Trier West High School, Northfield, IL), Dr. Darryl Sweeney
(University of Illinois at Urbana-Champaign), Dr. Robert Greenes (Harvard
Medical School), Dr. David Evans (Clairvoyance Corp.), Dr. Mark Frisse (Express
Scripts Corp.), Dr. J. Robert Beck (then of OHSU), Dr. David Hickam (OHSU),
Dr. Brian Haynes (McMaster University), and Dr. Lesley Hallick (OHSU).
I must also acknowledge the contributions of the late Dr. Gerard Salton (Cornell
University), whose writings initiated and sustained my interest in this field.
I would also like to note the contributions of institutions and people in
the federal government that aided the development of my career and this book.
While many Americans increasingly question the abilities of their government
to do anything successfully, the National Library of Medicine (NLM), under
the directorship of Dr. Donald A. B. Lindberg, has led the growth and advancement
of the field of medical informatics. The NLM’s fellowship and research
funding have given me the skills and experience to succeed in this field.
Likewise, the Agency for Healthcare Research and Quality (AHRQ) deserves
mention for its contributions to my own growth as well as that of others in the field
of medical informatics. I would also like to acknowledge retired Oregon
Senator Mark O. Hatfield for his dedication to biomedical research funding,
which aided me and many others.
Finally, this book also would not have been possible without the love and
support of my family. All of my “extended” parents, Mom and Jon, Dad
and Gloria, as well as my grandmother Baubee, brother Jeff, sister-in-law
Myra, mother-in-law Marjorie, and father-in-law Coop supported, sometimes
grudgingly, the various interests I developed in life and the somewhat different
career path I chose. (I think they still cannot understand why I decided
not to be a “regular doctor.”) And last, but most important, is
the contribution of my wife, Sally, and two children, Becca and Alyssa,
whose unlimited love and support made this undertaking so enjoyable and rewarding.