Information Retrieval:  A Health & Biomedical Perspective (Second Edition)

William Hersh, M.D.

Springer-Verlag, 2003

Preface

The main goal of this book is to provide an understanding of the theory, implementation, and evaluation of information retrieval (IR) systems in health and biomedicine.  There are already a number of excellent “how-to” volumes on searching healthcare databases (Hutchinson, 1998; Katcher, 1999).  Similarly, there are also a number of high-quality basic IR textbooks (Grossman and Frieder, 1998; Baeza-Yates and Ribeiro-Neto, 1999; Belew, 2000).  This volume differs from all of the above in that it covers basic IR, as the latter books do, but with a distinct focus on the health and biomedicine domain.

The first edition of this book was written from 1994 to 1995 and published in 1996.  Although subsequent editions of books in many fields represent incremental updates, this edition is profoundly rewritten, and is essentially a new book.  As everyone who is reading this knows, the IR world has changed substantially in the 8 to 9 years since the first edition.  In that edition, the Internet was a “special topic” in the very last chapter of the book.  Here, on the other hand, it is introduced at the beginning of the first chapter.  At the time of the first edition, using a search engine was something that only a minority of healthcare practitioners had ever done.  Now, not only have virtually all physicians used the World Wide Web, but a majority of their patients have used it for seeking personal health information as well.

The Web also profoundly altered the way this edition was researched and written.  When preparing the first edition, finding an article not in my own collection or the Oregon Health & Science University (OHSU) Library was a chore that entailed either driving to a nearby library at another university or ordering it through interlibrary loan.  For this edition, I was usually able to find articles on the Web, either because OHSU or I had a subscription or because the article was freely available.  In writing this edition I really learned firsthand the value of scientific publishing on the Web.  The only downside to the feast of articles I found was the ease with which I found additional articles to consider for inclusion!

Another way the Web has impacted and will continue to impact this edition is through the maintenance of a Web site for errata and updates.  The Web site www.irbook.org will identify all errors in the book text as well as provide updates on important new findings in the field as they become available.

The work on this edition also drove home the quality of the IR systems I was using.  I must give particular mention to the following resources that provided fast and accurate access to a great deal of information:  the National Library of Medicine (NLM) PubMed MEDLINE system, the Google search engine, ResearchIndex, and the multitude of journals that have opted to electronically publish their full texts.

As in the first edition, the approach is still to introduce all the necessary theory to allow coverage of the implementation and evaluation of IR systems in health and biomedicine.  Any book on theoretical aspects must necessarily use technical jargon, and this book is no exception.  Although jargon is minimized, it cannot be eliminated without retreating to a more superficial level of coverage.  Readers’ understanding of the jargon will vary with their backgrounds, but anyone with some background in computers, libraries, health, and/or biomedicine should be able to understand most of the terms used.  In any case, an attempt to define all jargon terms is made.

Another approach is to attempt wherever possible to classify topics, whether discussing types of information or models of evaluation.  I have always found classification useful in providing an overview of complex topics.  One problem, of course, is that not everything fits into the neat and simple categories of the classification.  This occurs repeatedly with information science, and the reader is forewarned.

This book had its origins in a tutorial taught at the former Symposium on Computer Applications in Medical Care (SCAMC) meeting.  The content continues to grow each year through my annual course taught to medical informatics students in the on-campus and distance-learning programs at OHSU.  (They often do not realize that next year’s course content is based in part on the new and interesting things they teach me!)  The book can be used in either a basic information science course or a health and biomedical information science course.  It should also provide a strong background for others interested in this topic, including those who design, implement, use, and evaluate IR systems.

Interest continues to grow in health and biomedical IR systems.  I entered a fellowship in medical informatics at Harvard University in the late 1980s, when the influence of medical artificial intelligence was still strong.  I had assumed I would take up the banner of some aspect of that area, such as knowledge representation.  But along the way I came across a reference from the field of “information retrieval.”  It looked interesting, so I looked at the references of that reference.  It did not take long to figure out that this was where my real interests lay, and I spent many an afternoon in my fellowship tracing references in the Harvard University and Massachusetts Institute of Technology libraries.  Even though I had not yet heard of the field of bibliometrics, I was personally validating all its principles.  Like many in the field, I have been awed to see IR become “mainstream” with the advent of the Web in recent years.

The book is divided into three sections.  The first section covers the basic concepts of information science.  The first chapter provides basic definitions and models that will be used throughout the book.  The next chapter gives an overview of health and biomedical information, covering issues related to its generation and use.  The third chapter discusses the evaluation of IR systems, highlighting the methods and their limitations.  The evaluation chapter is deliberately placed at the beginning of the book to emphasize the fundamental importance of this topic.

The second section covers the current state of the art in commercial and other widely used retrieval systems.  The first chapter in this section gives an overview of the great deal of content that is currently available.  Next come chapters on the two fundamental intellectual tasks of IR, indexing and retrieval.  The predominant paradigms of each are discussed in detail.  The final chapter covers evaluation of these systems, providing a justification for the work described in the following section on research efforts.

The third section covers the major threads of research and development in efforts to build better IR systems.  The focus is initially on details of indexing and retrieval, with a chapter each on the two major thrusts, which are lexical-statistical and linguistic systems.  The next chapter surveys various efforts to augment other systems.  This is followed by a chapter on information extraction, a topic of growing importance.  Throughout this section, a theme of implementational feasibility and evaluation is maintained.

Within each chapter, the goal is to provide a comprehensive overview of the topic, with thorough citations of pertinent references.  There is a preference to discuss health and biomedical implementations of principles, but where this is not possible, the original domain of implementation is discussed.  Several chapters use a small sample database, found in Appendix 1 and further described at the end of Chapter 1, to illustrate the principles being discussed.

This book would not have been possible without the influence of various mentors, dating back to high school, who nurtured my interests in science generally and/or medical informatics specifically, and/or helped me achieve my academic and career goals.  The most prominent include:  Mr. Robert Koonz (then of New Trier West High School, Northfield, IL), Dr. Darryl Sweeney (University of Illinois at Urbana-Champaign), Dr. Robert Greenes (Harvard Medical School), Dr. David Evans (Clairvoyance Corp.), Dr. Mark Frisse (Express Scripts Corp.), Dr. J. Robert Beck (then of OHSU), Dr. David Hickam (OHSU), Dr. Brian Haynes (McMaster University), and Dr. Lesley Hallick (OHSU).  I must also acknowledge the contributions of the late Dr. Gerard Salton (Cornell University), whose writings initiated and sustained my interest in this field.

I would also like to note the contributions of institutions and people in the federal government that aided the development of my career and this book.  While many Americans increasingly question the abilities of their government to do anything successfully, the National Library of Medicine (NLM), under the directorship of Dr. Donald A. B. Lindberg, has led the growth and advancement of the field of medical informatics.  The NLM’s fellowship and research funding have given me the skills and experience to succeed in this field.  Likewise, the Agency for Healthcare Research and Quality (AHRQ) deserves mention for its contributions to my own growth as well as that of others in the field of medical informatics.  I would also like to acknowledge retired Oregon Senator Mark O. Hatfield for his dedication to biomedical research funding, which aided me and many others.

Finally, this book also would not have been possible without the love and support of my family.  All of my “extended” parents, Mom and Jon, Dad and Gloria, as well as my grandmother Baubee, brother Jeff, sister-in-law Myra, mother-in-law Marjorie, and father-in-law Coop supported, sometimes grudgingly, the various interests I developed in life and the somewhat different career path I chose.  (I think they still cannot understand why I decided not to be a “regular doctor.”)  And last, but most importantly, has been the contribution of my wife, Sally, and two children, Becca and Alyssa, whose unlimited love and support made this undertaking so enjoyable and rewarding.

William Hersh, M.D.

Last updated - January 11, 2003