Information Retrieval:  A Health & Biomedical Perspective (Second Edition)

William Hersh, M.D.

Springer-Verlag, 2003

Update to Chapter 1 - Terminology and Resources

1.1 Basic definitions

(4/1/07) Zins recently performed an on-line Delphi experiment that enlisted 57 scholars in information science (including me!) to define and create a knowledge map of information science. The results are summarized on a Web site and in four publications:
Zins, C. (2007). Conceptions of information science. Journal of the American Society for Information Science & Technology, 58: 335-350.
Zins, C. (2007). Conceptual approaches for defining "data", "information", and "knowledge". Journal of the American Society for Information Science & Technology, 58: 479-493.
Zins, C. (2007). Knowledge map of information science. Journal of the American Society for Information Science & Technology, 58: 526-535.
Zins, C. (2007). Classification schemes of information science: 28 scholars map the field. Journal of the American Society for Information Science & Technology, 58: 645-672.

(4/1/07) For those of us who have worked in this field for some time, i.e., almost two decades for myself, the "mainstreaming" of IR continues to amaze. By 2004, over 100 million Americans had used a search engine (Fallows, 2004). Search engines themselves receive about 500 million searches per day from around the world, and on any given day, about 41% of all Internet users submit nearly 60 million queries to them (Rainie, 2005). "Search" has been described as an "integral" computer application (Barrows, 2006). Important variants include enterprise and desktop search. Maintenance of metadata in organizations is considered a key challenge. The "mass digitization" of information raises a host of issues, e.g., copyright, optical character recognition (OCR) quality, libraries, long-term ownership, business models for publishers and content sellers, information literacy, standards and interoperability (Anonymous, 2006a).

Additional evidence that IR is "mainstream" can be gleaned from the Google search engine. The word "Google" itself has become a verb (e.g., "Did you Google him?"), while the Google Zeitgeist (http://www.google.com/press/zeitgeist.html) gives a glimpse of what the world wants to know about. In the meantime, teenagers and others pass the time engaging in Googlewhacking, a game where one tries to have Google retrieve one and only one page. There are over 570,000 Googlewhacks and counting (http://www.googlewhack.com/tally.pl). Political columnists ask whether Google is a deity (Friedman, 2003), while the software giant Microsoft has declared that search will be the most important computer application in the near future (Ferguson, 2005) and is on a "search and destroy" mission against Google (Vogelstein, 2005). Google, of course, is not standing back: it is developing on-line versions of Microsoft Office application tools (Barret, 2006), and other competitors are fighting both (Anonymous, 2006b). The best story of Google, in my opinion, is the book by Battelle (2005).

Biomedicine is being impacted by the growth of IR as well. The leaders of the National Library of Medicine have laid out a vision for the future of medical libraries ten years hence, noting that the "place" will be preserved but that most of the information will be interactive and electronic (Lindberg, 2005). A leading neuroscientist, citing the advances in the Human Genome Project and related areas, has noted that biology is now an "information science," with many advances likely to come from using data to form and test hypotheses (Insel, 2003). Major medical journals note that search engines, most notably Google, are the major means by which visitors reach their on-line articles (Giustini, 2005; Steinbrook, 2006). Meanwhile, pharmaceutical companies fight for information and library talent (Davies, 2006). One such talented individual quotes Harvard University Chemistry Professor Frank Westheimer, who once famously said, "A month in the laboratory can save an hour in the library."

Anonymous (2006a). Mass Digitization: Implications for Information Policy. Washington, DC, U.S. National Commission on Libraries and Information Science. http://www.nclis.gov/digitization/MassDigitizationSymposium-Report.pdf.
Anonymous (2006b). The un-Google. The Economist. June 17, 2006. 65-66. http://economist.com/displaystory.cfm?story_id=7064434.
Barrows, R. and Traverso, J. (2006). Search Considered Integral. ACM Queue. May, 2006. http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=389.
Battelle, J. (2005). The Search - How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. New York, NY. Penguin Group.
Davies, K. (2006). Search and Deploy. Bio-IT World. October 16, 2006. http://www.bio-itworld.com/issues/2006/oct/biogen-idec/.
Fallows, D., Rainie, L., et al. (2004). Data Memo on Search Engines. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Data_Memo_Searchengines.pdf.
Friedman, T. (2003). Is Google God? New York Times. June 29, 2003. 13. http://www.nytimes.com/2003/06/29/opinion/29FRIE.html.
Ferguson, C. (2005). What's Next for Google? Technology Review. January, 2005. 38-46.
Insel, T., Volkow, N., et al. (2003). Neuroscience networks:  data-sharing in an information age. PLoS Biology, 1: E17. http://biology.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pbio.0000017.
Giustini, D. (2005). How Google is changing medicine. British Medical Journal, 331: 1487-1488.
Lindberg, D. and Humphreys, B. (2005). 2015 - the future of medical libraries. New England Journal of Medicine, 352: 1067-1070.
Rainie, L. and Shermak, J. (2005). Search Engine Use November 2005. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_SearchData_1105.pdf.
Steinbrook, R. (2006). Searching for the right search - reaching the medical literature. New England Journal of Medicine, 354: 4-7.
Vogelstein, F. (2005). Search and Destroy. Fortune. May 2, 2005. http://money.cnn.com/magazines/fortune/fortune_archive/2005/05/02/8258478/.

(3/15/05) Another important term worth defining upfront is health information literacy.  The Medical Library Association (MLA, www.mlanet.org) has spoken most eloquently about this term, noting that it differs from health literacy and information/computer literacy.  They define health information literacy as the set of abilities needed to:
They have developed a Web site devoted to this topic, which includes a variety of resources and plans for action (http://www.mlanet.org/resources/healthlit/).

(4/7/05) Saranto and Hovenga (2004) reviewed the literature for papers attempting to define the concept of information literacy, finding that the concept really does not exist in the literature and that the term is most often used as a synonym for "computer literacy" or related concepts.  They advocate organizational efforts to define the term and its related skills more precisely.  McCray (2005) recently reviewed the health literacy literature and noted that most of it focused on low literacy and its impact on understanding health information.  She noticed a number of categories of articles on the topic, including:
McCray, A. (2005). Promoting health literacy. Journal of the American Medical Informatics Association, 12: 152-163.
Saranto, K. and Hovenga, E. (2004). Information literacy - what is it about?  Literature review of the concept and context. International Journal of Medical Informatics, 73: 503-513.

1.2  Comparisons with other types of computer applications

1.3 Models of IR

(4/4/06) Another model by which to view IR and related areas is to think of how people usually find and process scientific information. I call this the "Hersh Funnel" (see the figure below) and have published it in several places (Hersh, 2005a; Hersh, 2005b; Hersh, 2006). Scientific information users usually begin by searching against all literature to find a set of documents that contain some documents likely to be relevant. These documents are usually reviewed manually to determine which ones are definitely relevant. However, there is currently much research trying to develop means to find that definitely relevant literature automatically, in processes that are called information extraction or text mining. Typically people structure knowledge out of the documents that are definitely relevant.
[Figure: Funnel]
Hersh W, Evaluation of biomedical text mining systems: lessons learned from information retrieval. Briefings in Bioinformatics, 2005, 6: 344-356.
Hersh WR, Information Retrieval and Digital Libraries, in Medical Informatics: Knowledge Management and Data Mining in Biomedicine, Chen H, et al., Editors. 2005, Springer-Verlag: New York. 237-275.
Hersh, W., Bhupatiraju, R., et al. (2006). Enhancing access to the bibliome: the TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1: 3. http://www.j-biomed-discovery.com/content/1/1/3.

(4/1/07) Just how much information is out there? Lyman and Varian (2003) attempted to quantify the amount of information on electronic media and its flow in 2003. A report by Gantz (2007) updated this, estimating information quantity in 2006 and its growth to 2010.

Lyman and Varian found that the sum of information on physical electronic media was about five exabytes (or about 5 billion gigabytes) in 2003. This was noted to be equivalent to about one-half million new libraries the size of the US Library of Congress. The majority of this information (72%) was stored on magnetic media, primarily hard disks, with most of the remainder on film and a small proportion on paper (about 1.5 petabytes or 0.0015 exabytes). This amounted to about 800 megabytes for each man, woman, and child on Earth.
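
The per-person figure can be checked with simple arithmetic. The sketch below assumes decimal units (1 exabyte = 10^18 bytes) and a 2003 world population of roughly 6.3 billion; the population value is my assumption for illustration, not a number taken from the report.

```python
# Rough check of the per-person estimate quoted from Lyman and Varian (2003).
# Decimal (SI) units are assumed throughout.
EXABYTE = 10**18
GIGABYTE = 10**9
MEGABYTE = 10**6

total_bytes = 5 * EXABYTE   # "about five exabytes" of new information
world_population = 6.3e9    # ASSUMPTION: approximate 2003 world population


def per_person_megabytes(n_bytes, population):
    """Return the average number of megabytes per person."""
    return n_bytes / population / MEGABYTE


total_gigabytes = total_bytes / GIGABYTE
mb_each = per_person_megabytes(total_bytes, world_population)

print(f"Total: {total_gigabytes:.1e} gigabytes")
print(f"Per person: roughly {mb_each:.0f} megabytes")
```

With these inputs the result lands just under 800 megabytes per person, in line with the figure quoted above.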

In a given year, the distribution of paper content around the world was estimated as follows:
The amount of information on the Internet was estimated to include:
Gantz et al. estimated the amount of information in 2006 to be about 161 exabytes, projecting it to grow about six-fold by 2010 to 988 exabytes (nearly 1 zettabyte). About 70% of the information is generated by individuals but 85% is maintained in some way by various organizations. Most of the growth is fueled by analog-to-digital conversions. For images, about 1 billion devices generate about 250 billion images annually (150 billion on cameras, 100 billion on cell phones), which is projected to double by 2010. The amount of video is also expected to double by 2010. The report also notes that the world has about 1.5 billion email accounts, which consume about six exabytes. It also notes there are about 1.1 billion Internet users now, 60% of whom have broadband access. This is projected to increase to 1.6 billion by 2010. And, perhaps most pertinent to IR, about 95% of the information is "unstructured," i.e., amenable to IR indexing and retrieval techniques.
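
The two totals quoted above (161 exabytes in 2006 and 988 exabytes in 2010) imply both an overall growth factor and an equivalent yearly rate. The sketch below works these out, assuming steady compound growth over the four years; the compounding model is my simplification, not necessarily how the report derived its projection.

```python
# Growth implied by the two totals quoted from Gantz et al. (2007).
# ASSUMPTION: steady compound growth between 2006 and 2010.

def growth_factor(start, end):
    """Overall multiple between the starting and ending amounts."""
    return end / start


def annual_growth_rate(start, end, years):
    """Compound annual growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / years) - 1


start_eb, end_eb = 161, 988   # exabytes in 2006 and 2010
years = 2010 - 2006

overall = growth_factor(start_eb, end_eb)
annual = annual_growth_rate(start_eb, end_eb, years)

print(f"Overall growth: about {overall:.1f}-fold over {years} years")
print(f"Equivalent compound annual growth: about {annual:.0%}")
```

The overall multiple is a little over six, which corresponds to growth of somewhat under 60% per year rather than a six-fold increase every year.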

Card (2003) has depicted this growth graphically, showing that on-line information has exceeded all human documents generated in the first 40,000 years of human history and is vastly more than all the information on Earth that all humans can learn.

Card, S. (2003). Information Foraging Theory. Palo Alto, CA, Palo Alto Research Center. http://hci.ucsd.edu/220/UP-2003-0101-Card-UM-NextGen.pdf.
Gantz, J., Reinsel, D., et al. (2007). The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010. Hopkinton, MA, EMC Corp. http://www.emc.com/about/destination/digital_universe/pdf/Expanding_Digital_Universe_IDC_WhitePaper_022507.pdf.
Lyman, P. and Varian, H. (2003). How Much Information. Berkeley, CA, University of California Berkeley. http://www.sims.berkeley.edu/research/projects/how-much-info-2003/ .

(3/31/04) Another model was recently put forth to guide information seeking and retrieval research (Jarvelin and Wilson, 2003).  These authors note that scientific theories are useful for a variety of functions, as put forth by Bunge (1967):
They state that different types of information are needed in these information tasks:
An example of this in health care might be the information task of choosing an appropriate diagnostic test. The problem information recognizes that the task originates from the goal of making a medical diagnosis. The domain information brings forth the knowledge of diagnostic tests for patients who have similar symptoms to the one at hand. The problem-solving information leads the clinician to apply the domain knowledge to this specific patient, resulting in a decision of which test (if any) to order.

Bunge, M. (1967). Scientific Research. Heidelberg. Springer-Verlag.
Jarvelin, K. and Wilson, T. (2003). On conceptual models for information seeking and retrieval research. Information Research, 9(1). http://informationr.net/ir/9-1/paper163.html .

1.3.1 The information world

1.3.2 Users

1.3.3 Health decision making

1.4 IR resources

1.4.1 People

(3/15/05) A major pioneer in the IR field was Gerard Salton, a professor of computer science from Cornell University. Dr. Salton invented many of the techniques commonly called "automated retrieval" and is cited throughout the book. Unfortunately, he passed away in 1995 just as the Web searching world was taking off and adopting many of the techniques he developed. Shortly before his death, a conference was held in his honor, celebrating his contributions. Many of the talks from this meeting were captured and are available in the Open Video Project (http://www.open-video.org/) archive. To find the Salton videos, go to this site and search on "Salton."  Salton's own talk is available at http://www.open-video.org/details.php?videoid=7057. Although I did not know him well personally, I was certainly drawn into this field by his writings. I was also impressed at his continued ability to be engaged in the field right up until his death. (Most of us old-timers recall him and Karen Sparck Jones sitting in the front row at conferences, critiquing presentations and each other's thoughts.)

(3/20/04) The book notes that individuals from a variety of disciplines comprise the field of IR. A well-known computer scientist who is among the leaders from that discipline recently gave a keynote lecture discussing the relationship between IR and computer science (Croft, 2003). Croft noted that IR has always been a small part of the overall computer science field but has a common heritage with the database systems area. He also noted that the field grew and was validated by the success of Web search engines in the 1990s. He also laid out some known successes by the field:
Gray, J. (2003). What next? A dozen information technology research goals. Journal of the ACM, 50: 41-57.
Croft, W. (2003). Salton Award Lecture - Information retrieval and computer science:  an evolving relationship. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada. ACM Press. 2-3.

(4/4/06) A recent paper by Moffat (2005) provides a list of the most important IR research papers that are "recommended reading" for research students.

Moffat, A., Zobel, J., et al. (2005). Recommended reading for IR research students. SIGIR Forum, 39(2): 3-14. http://www.acm.org/sigir/forum/2005D/2005d_sigirforum_moffat.pdf.

1.4.2 Organizations

(4/1/07) The NLM recently released a new version of its long-range plan, which includes four overall goals:
  1. Seamless, uninterrupted access to expanding collections of biomedical data, medical knowledge, and health information.
  2. Trusted information services that promote health literacy and the reduction of health disparities worldwide.
  3. Integrated biomedical, clinical, and public health information systems that promote scientific discovery and speed the translation of research into practice.
  4. A strong and diverse workforce for biomedical informatics research, systems development, and innovative service delivery.
Anonymous (2006). Charting a Course for the 21st Century: NLM's Long Range Plan 2006-2016. Bethesda, MD, National Library of Medicine. http://www.nlm.nih.gov/pubs/plan/lrp06/NLM_LRP2006_WEB.pdf.

1.4.3 Journals

(5/6/03) There are three on-line journals devoted to information retrieval and digital library research:
(3/20/04) There are also some new on-line biomedical journals with a strong focus on IR-related issues:

1.4.4 Texts

(4/1/07) I think I can create a whole new category of books: those about Google! There are many; here is a sampling:
Battelle, J. (2005). The Search - How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. New York, NY. Penguin Group.
Calishain, T. and Dornfest, R. (2004). Google Hacks - Tips & Tools for Smarter Searching (Second Edition). Sebastopol, CA. O'Reilly & Associates.
Davis, H. (2006). Google Advertising Tools. Sebastopol, CA. O'Reilly & Associates.
Langville, A. and Meyer, C. (2006). Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton, NJ. Princeton University Press.
Miller, M. (2007). Googlepedia - The Ultimate Google Resource. Indianapolis, IN. Que.
Vise, D. and Malseed, M. (2005). The Google Story. New York, NY. Delacorte Press.

There are also some new and updated books about searching MEDLINE:
Katcher, B. (2006). Medline: A Guide to Effective Searching in PubMed and Other Interfaces, Second Edition. San Francisco. Ashbury Press.
Edhlund, B. (2005). Basic Principles of Pubmed. Morrisville, NC. Lulu Press.
Edhlund, B. (2006). PubMed and EndNote. Morrisville, NC. Lulu Press.

(4/1/07) For those interested in image retrieval and its associated issues, an overview textbook is Visual Information Retrieval (Del Bimbo, 1999). Another overview book is Natural Language Processing for Online Applications:  Text Retrieval, Extraction, and Categorization (Jackson, 2002). The book provides a succinct but comprehensive overview of natural language processing, document retrieval, information extraction, text categorization, and text mining. A comprehensive three-volume reference on health and medical information on the Web, available both in print and on CD-ROM, is The MLA Encyclopedic Guide to Searching and Finding Health Information on the Web (Anderson, 2004). A couple of books have been published recently on Web searching (Hock, 2004; Poremsky, 2004) and another describes using the Web for research (Schlein, 2004). Another recent book addresses the overlap between IR and information seeking in context (Ingwersen, 2005).

In addition to books on searching MEDLINE and other health resources, additional help can be found in the tutorials and help files on the PubMed site:
Anderson, P. and Allee, N., eds. (2004). The MLA Encyclopedic Guide to Searching and Finding Health Information on the Web. New York, NY. Neal-Schuman Publishers.
Del Bimbo, A. (1999). Visual Information Retrieval. San Francisco, CA. Morgan Kaufmann Publishers.
Hock, R. (2004). The Extreme Searcher's Internet Handbook:  A Guide for the Serious Searcher. New York, NY. Information Today.
Ingwersen, P. and Jarvelin, K. (2005). The Turn - Integration of Information Seeking and Retrieval in Context. Dordrecht, The Netherlands. Springer.
Jackson, P. and Moulinier, I. (2002). Natural Language Processing for Online Applications:  Text Retrieval, Extraction, and Categorization. Amsterdam, The Netherlands. John Benjamins Publishing.
Poremsky, D. (2004). Google and Other Search Engines. Berkeley, CA. Peachpit Press.
Schlein, A. (2004). Find It Online, Fourth Edition:  The Complete Guide to Online Research. Tempe, AZ. Facts on Demand Press.

1.4.5 Tools

(4/4/06)  An up-to-date list of open-source search engines is at http://www.searchtools.com/tools/tools-opensource.html .  One system of note is Lucene (Gospodnetic, 2005), which is written in Java and is now a project of the open-source Apache Software Foundation.  Another new IR system for research use is Zettair.  Written by a group known for their accomplishments in index compression and search speed, this system is fast and flexible.

Gospodnetic, O. and Hatcher, E. (2005). Lucene in Action. Greenwich, CT. Manning Publications.

1.5 The Internet and World Wide Web

(4/1/07) There are several Web sites that track Internet use in different countries and languages: comScore (www.comscore.com), Internet World Stats (www.internetworldstats.com), and Global Reach (www.glreach.com). They all paint a relatively consistent picture: Worldwide use of the Internet continues to grow, particularly in emerging economies like India, China, and Russia (Anonymous, 2007). While the largest number of users still comes from the US (154 M), China is rapidly closing in (87 M) and only 20% of all Internet users come from the US. Despite its growth, Doyle et al. (2005) note that the Internet is "robust yet fragile." In other words, its distributed nature makes it fault-tolerant, but faults do occur frequently.

It seems almost like ancient history now, but the original Web (sometimes called Web 1.0) featured a boom and then a bust, i.e., the dot-com era. Some (e.g., O'Reilly, 2005) talk of a new Web now, a Web 2.0 that is built on sustainable business models and widespread collaboration. A sounder business model gives users what they want and makes the enterprise more sustainable, e.g., Google Ads, eBay, and Amazon. But Web 2.0 is also more collaborative, so that it "gets better the more people use it" (O'Reilly, 2005), e.g., blogging, wikis and Wikipedia, Flickr, and Craigslist. Could Web 2.0 impact medicine? One view was put forth by Giustini (2006) in the British Medical Journal.

Search engine use is very high among Internet users. In a survey done in May-June, 2004, Fallows et al. (2004) found that 84% of Internet users have used a search engine (extrapolating from the usage statistics cited above, that translates into 107 million people) and that 87% of people say they find what they want most of the time. This memo also presented some facts gleaned from tracking the top 25 search engines:
A recent analysis of search engine users shows that while they are enthusiastic and trusting of search engines, they are also unaware and naive about certain aspects of them (Fallows, 2005). A large majority of users report confidence in their searching abilities (92%) and say that they have successful searches most of the time (87%). However, 62% are unaware of the difference between paid and unpaid results.

One Google scientist notes that Google receives about 200 million searches per day (Singhal, 2004). Extrapolating from Google's market share, this means about 500 million searches are done per day. Of Google's 200 million searches each day, 100 million are unique. Searches average 2.4 words and are entered in 90 different languages. About 10-20% of the pages in Google's database change each month.
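
The extrapolation above can be made explicit. The sketch below derives the market share implied by the two figures in the text (200 million Google searches out of about 500 million overall) and the fraction of Google queries that are unique; the function name and the idea of dividing by market share are my illustration of the reasoning, not a method described by Singhal.

```python
# Figures quoted in the text (searches per day).
google_searches = 200e6
all_searches = 500e6
google_unique = 100e6

# Share of all searches implied by the two totals above.
implied_share = google_searches / all_searches

# Fraction of Google's daily queries that are unique.
unique_fraction = google_unique / google_searches


def extrapolate_total(engine_searches, market_share):
    """Estimate searches across all engines from one engine's volume.

    ASSUMPTION: market_share is that engine's fraction of all searches.
    """
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return engine_searches / market_share


print(f"Implied Google share: {implied_share:.0%}")
print(f"Unique queries: {unique_fraction:.0%} of Google's daily total")
print(f"Extrapolated total: {extrapolate_total(google_searches, implied_share):.0f}")
```

In other words, the 500 million estimate corresponds to Google handling about two of every five searches.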

The Web has truly become "world-wide." According to Internet World Stats (http://www.internetworldstats.com), some 1.0 billion of the world's 6.5 billion people use the Internet (15.9%). It is of course higher in developed regions/countries such as the United States (68.1%), Oceania/Australia (52.1%), and Europe (35.9%). But it is growing even more rapidly in places like Latin America (14.8%, as high as 35.7% in Chile and 26.4% in Argentina) and China (8.5%).

One irony that few IR "old timers" could ever have fathomed is the need, in the Web era, for the study of "adversarial" IR, i.e., the development of techniques to prevent retrieval of certain content. One group of adversarial IR applications is the prevention of "spam" (i.e., unwanted) pages or emails (Metaxas, 2005). On the Web, this is called "link spam" (Noruzi, 2006). Singhal (2004) notes there is a continual tit-for-tat battle between those who develop search engines and those who try to "game" them.  Indeed, there is a large market for attempting to drive traffic to one's Web site via search engines and other means, sometimes called "search engine marketing" (e.g., Moran and Hunt, 2005).  Another form of adversarial IR is in "filtering," with the usual goal of preventing linkage to pornography sites.  Of course, most approaches to such filtering are imperfect and can lead to blocking of legitimate medical Web sites (Richardson et al., 2002).  Indeed, one filter even blocks access to the Web site of the town Toppenish, WA, due to the presence of the letters from a blocked word in the middle of the town name (Anonymous, 2003).
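
The Toppenish problem is easy to reproduce with a naive filter. The sketch below is only an illustration: I do not know how the filter in question is implemented, and the one-word blocklist is an assumption chosen because the term happens to occur inside the town name. It contrasts plain substring matching with matching on whole words.

```python
import re

# ASSUMPTION: a hypothetical one-term blocklist, for illustration only.
BLOCKED_WORDS = ["penis"]


def blocked_by_substring(text, blocked=BLOCKED_WORDS):
    """Naive filter: block if any listed term appears anywhere in the text."""
    lowered = text.lower()
    return any(word in lowered for word in blocked)


def blocked_by_whole_word(text, blocked=BLOCKED_WORDS):
    """Stricter filter: block only if a listed term appears as a whole word."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(word) + r"\b", lowered) for word in blocked
    )


page_title = "City of Toppenish, WA"
print(blocked_by_substring(page_title))   # the naive filter blocks the town
print(blocked_by_whole_word(page_title))  # the whole-word filter does not
```

Whole-word matching avoids this particular false positive, but it would still block a legitimate medical page that uses the anatomical term, which is the kind of overblocking Richardson et al. studied.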

Another concern about search engines is the growing desire of governments to monitor their usage (Hansell, 2006).  Although such monitoring is ostensibly meant to thwart the very real threats of terrorism, many are concerned about governments knowing our searching interests.  There are also some governments, most notably China, that have required search engines to filter pages containing certain words (such as democracy).  At the current time, privacy laws that protect things like email and library check-outs do not protect queries to search engines.

Anonymous (2003). The Insider:  'Bess' Internet porn filter a little too easily offended. Seattle Post-Intelligencer. July 7, 2003. http://seattlepi.nwsource.com/business/129611_insider07.html.
Anonymous (2007). Worldwide Internet Audience has Grown 10 Percent in Last Year, According to comScore Networks. Reston, VA, comScore Networks. http://www.comscore.com/press/release.asp?press=1242.
Doyle, J., Alderson, D., et al. (2005). The "robust yet fragile" nature of the Internet. Proceedings of the National Academy of Sciences, 102: 14497-14502.
Fallows, D., Rainie, L., et al. (2004). Data Memo on Search Engines. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Data_Memo_Searchengines.pdf.
Fallows, D. (2005). Search Engine Users. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Searchengine_users.pdf.
Giustini, D. (2006). How Web 2.0 is changing medicine. British Medical Journal, 333: 1283-1284.
Hansell, S. (2006). Online Trail Can Lead To Court. New York Times. February 4, 2006. C1. http://www.nytimes.com/2006/02/04/technology/04privacy.html.
Singhal, A. (2004). Challenges in Running a Commercial Web Search Engine. http://www.research.ibm.com/haifa/Workshops/searchandcollaboration2004/papers/haifa.pdf.
Metaxas, P. and DeStefano, J. (2005). Web spam, propaganda and trust. First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan. http://airweb.cse.lehigh.edu/2005/metaxas.pdf.
Moran, M. and Hunt, B. (2005). Search Engine Marketing, Inc.:  Driving Search Traffic to Your Company's Web Site. Englewood Cliffs, NJ. Prentice Hall.
Noruzi, A. (2006). Link spam and search engines. Webology, 3(1). http://www.webology.ir/2006/v3n1/editorial7.html.
O'Reilly, T. (2005). What Is Web 2.0? Design Patterns and Business Models for the Next Generation of Software. Sebastopol, CA, O'Reilly & Associates. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html.
Richardson, C., Resnick, P., et al. (2002). Does pornography-blocking software block access to health information on the Internet? Journal of the American Medical Association, 288: 2887-2894.

(3/20/04) The popularization of the Web arguably began with the release of the Mosaic Web browser. The tenth anniversary of the release of this software was recently celebrated at the US National Science Foundation (Anonymous, 2003).

Anonymous (2003). Mosaic Web Browser Celebrates 10th Birthday. National Science Foundation. http://www.nsf.gov/od/lpa/news/03/pr0343.htm .

(1/10/03) According to Gartner, the one-billionth PC was sold in 2002:
http://www.intel.com/pressroom/archive/releases/20020701corp.htm

(3/1/03) In Feb. 2003, the National Science Foundation released a report on Cyberinfrastructure, recommending that the organization spend an additional $1 billion per year developing the nation’s "cyberinfrastructure" to support scientific research. The report advocates that investment in a comprehensive cyberinfrastructure will profoundly change what scientists and engineers do, how they do it, and who participates.

(4/5/03) The presentation about Web searching by Travis and Broder (2001) cited in the book was later published as an article:
Broder A, A taxonomy of Web search, SIGIR Forum, 2002, 36(2):  3-10.   http://www.acm.org/sigir/forum/F2002/broder.pdf

In this paper, Broder notes that classic IR is driven by the user's information need, but that Web searching is often not informational. Instead, the user's intent might be navigational (e.g., finding a specific page) or transactional (e.g., purchase something, download a file, check the status of an account). Broder notes that navigational searches are similar to what classic IR calls a "known-item search," since they usually have only one correct answer. He also states that "hub" pages (see section 1.5.2) with lists of links that get to the target in one click may be acceptable. In transactional queries, the user needs not only to reach a site, but also to interact with it once he or she gets there.

Broder analyzed the frequency of these types of Web search by users of the AltaVista search engine via two means: a pop-up survey window and a search log analysis. He noted the limitations of each: pop-up survey takers were self-selected and may not represent all users or their needs. In addition, it is usually difficult to know a user's exact intent from the query statement entered into a search engine. Based on his data, he concludes the following approximate distribution of types of Web search:
In other words, less than half of searches on the Web (at least those entered into AltaVista) are classical IR informational seeking.

Broder also describes what he calls three generations of search engines on the Web. The first generation uses mostly static HTML pages and is very close to classic IR. The second generation uses off-page, Web-specific data such as link analysis, anchor text, and click-through data. He cites the Google PageRank algorithm as an example of this and notes that it supports informational as well as navigational queries. The third generation attempts to discern the "need behind the query" based on semantic analysis of the user's input and determination of their context. He gives the example of the user entering the name of a city and the system returning a hotel reservation page, map server, weather server, etc. The aim of this generation would be to support all transactional searches in addition to those which are informational and navigational. He does not state so explicitly, but an implication is that the Semantic Web (see section 10.2.4) will be helpful for this type of search.
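
To make the three query types described above concrete, here is a toy sketch that sorts queries into navigational, transactional, and informational classes. The cue words and rules are entirely hypothetical and are not taken from the paper; as noted above, a user's real intent is usually hard to determine from the query string alone, so this should be read as an illustration of the taxonomy rather than a working classifier.

```python
# ASSUMPTION: these cue lists are invented for illustration only.
TRANSACTIONAL_CUES = {"buy", "purchase", "download", "order", "login"}
NAVIGATIONAL_CUES = {"homepage", "home", "www", "site"}
DOMAIN_SUFFIXES = (".com", ".org", ".gov", ".edu")


def classify_intent(query):
    """Guess the type of a Web query using simple keyword rules."""
    tokens = query.lower().split()
    # Transactional: the user wants to do something on a site.
    if any(token in TRANSACTIONAL_CUES for token in tokens):
        return "transactional"
    # Navigational: the user appears to want one specific site or page.
    if any(
        token in NAVIGATIONAL_CUES or token.endswith(DOMAIN_SUFFIXES)
        for token in tokens
    ):
        return "navigational"
    # Everything else is treated as a classic informational query.
    return "informational"


for example in ["download acrobat reader", "nlm.nih.gov", "symptoms of asthma"]:
    print(example, "->", classify_intent(example))
```

A real system would need far richer evidence (click-through data, user context) to do this reliably, which is the point of the "third generation" of search engines described above.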

(3/20/04) A collection of leaders in the field recently held a workshop for defining the research agenda for the IR field (Allan et al., 2003). This workshop was motivated in part by the notion that current Web search engines are so effective that further research and development in the field are not warranted. However, this workshop noted that Web searching has become mainstream and successful, but is not the entire IR picture:
This workshop came to the conclusion that there are two general long-term challenges for the field:
In other words, research in IR must aim to create systems that seamlessly search across the appropriate content at the appropriate time. The paper gives a well-cited example of a user entering a query for "Taj Mahal." If the user's system knew that he or she was going to attend an academic conference in India, it would provide him or her with information about the famous landmark. However, if the user was planning a trip to Atlantic City or enjoyed jazz music, the system would preferentially present information about the casino or jazz musician, respectively.

The report then outlined what workshop attendees considered to be the major challenge areas for IR research:
The report also lists the following areas as those of major challenge, but they really represent specific instances of the above general challenges:
Allan, J., Aslam, J., et al. (2003). Challenges in information retrieval and language modeling. SIGIR Forum, 37(1): 31-47. http://www.sigir.org/forum/S2003/ir-challenges2.pdf.

1.5.1 Hypertext and Hypermedia

1.5.2 The Web and health care

(2/12/03) According to the American Medical Association, physician use of the Web continues to grow beyond the figures cited in the second edition. Some other facts from this report include:
A press release describing the report is at:
http://www.ama-assn.org/ama/pub/article/1616-6473.html

(4/4/06) A more recent summary of physician Internet usage data found that 98% of US physicians use the Internet while half own personal digital assistants (PDAs) (Anonymous, 2005). A proposed new model of continuing medical education gives credit for documented information seeking during clinical care (Davis, 2004).

Anonymous (2005). Physician Internet Use Statistics. http://www.max.md/pdf/PhysicianInternetUseStatistics.pdf.
Davis, N. and Willis, C. (2004). A new metric for continuing medical education credit. Journal of Continuing Education in the Health Professions, 24: 139-144.

(3/20/04) Another study of physician use showed that those who were more active clinically (i.e., saw more patients per week) spent more time on-line (Taylor and Leitman, 2001).

Taylor, H. and Leitman, R. (2001). The Increasing Impact of eHealth on Physician Behavior. Harris Interactive. http://www.harrisinteractive.com/news/newsletters/healthnews/HI_HealthCareNews2001Vol1_iss31.pdf.

(4/4/06) Another growing category of IR system users is biomedical researchers. This is due in large part to new "high-throughput" biotechnologies, such as gene microarrays. These technologies not only generate large amounts of data, but also identify new information that must be explored, e.g., the microarray experiment that uncovers increased expression of genes previously unknown to be related to a physiological or disease process (Buetow, 2005). There is growing awareness that IR and other techniques, such as text mining, are important tools for researchers (Jensen, 2006; Hunter, 2006).

But literature retrieval and analysis are difficult for scientists. Barnes and Gary (2003) say, "Few areas of biological research call for a broader background in biology than the modern approach to genetics. This background is tested to the extreme in the selection of candidate genes for involvement with a disease process… Literature is the most powerful resource to support this process, but it is also the most complex and confounding data source to search."

Barnes, M. and Gary, R. (2003). Bioinformatics for Geneticists. West Sussex, England. John Wiley & Sons.
Buetow, K. (2005). Cyberinfrastructure: empowering a "third way" in biomedical research. Science, 308: 821-824.
Hunter, L. and Cohen, K. (2006). Biomedical language processing: what's beyond PubMed? Molecular Cell, 21: 589-594.
Jensen, L., Saric, J., et al. (2006). Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews - Genetics, 7: 119-129.

(4/4/06) Another group of heavy users of the Web for health and biomedical information are consumers and patients. One area of debate concerns how often they use the Web to seek health information. Some early reports put the figure at as high as 80% of all Internet users, such as a Harris Interactive Poll (2002). This poll found about 80% of all adults who are online sometimes used the Web to look for health care information. About 18% said they did so "often", while most did so "sometimes" (35%), or "hardly ever" (27%). This 80% of all those online amounted to 110 million users nationwide. This compared with 54 million in 1998, 69 million in 1999 and 97 million in 2001. On average those who ever looked for health care information online did so three times every month. About half (53%) of those who looked for health care information used a portal or search engine that allowed them to search for the health information they wanted across many different Web sites. About a quarter (26%) went directly to a site that focused only on health-related topics and one in eight (12%) went first to a general site that focused on many topics that may have had a section on health issues.

Taylor, H. (2002). Cyberchondriacs Update. Harris Interactive. http://www.harrisinteractive.com/harris_poll/index.asp?PID=299.

The Pew Internet & American Life Project has published a number of studies on information seeking, related not only to health care but also to search engines in general:
The Fox, 2003 report found that 73 million US users have searched for specific health information (58% of all users) and 93 million have carried out a search related to health information generally (74% of all users).  Both the Fox, 2003 and Rainie, 2005 reports found that 80% of users had searched specifically for health information.  A recent analysis of all the Pew data found several factors associated with a higher likelihood of Internet health searching, including female gender, part-time employed, other Internet use, specific health problems, and helping others deal with health problems (Rice, 2006).
Other studies, however, have taken exception to these high rates of use. Most notably, a study in JAMA claimed that only 40% of Internet users actually used the Web to seek health information (Baker et al., 2003). This study also found that only a third of those who sought health information reported that the information affected a decision about their health or health care. A number of letters to the editor pointed out limitations of this study, the most notable one being that study participants came from a pool of users offered free access to WebTV, a form of Internet access used by a very small fraction of all users. Another study found that 31% of all Americans (not just those on-line) have used the Internet to search for health information over a 12-month period (Murray et al., 2003). About 8% of these individuals took information to their physician, although two-thirds wanted their physician's opinion as opposed to specific treatment. Additional research has found that only a minority of Americans (38.2%) seek health information generally, with the most common sources being books or magazines (23.0%), friends or relatives (19.7%), the Internet (16.1%), and television and radio (11.3%) (Tu and Hargraves, 2003).

Baker, L., Wagner, T., et al. (2003). Use of the Internet and e-mail for health care information:  results from a national survey. Journal of the American Medical Association, 289: 2400-2406.
Murray, E., Lo, B., et al. (2003). The impact of health information on the internet on the physician-patient relationship:  patient perceptions. Archives of Internal Medicine, 163: 1727-1734.
Rice, R. (2006). Influences, usage, and outcomes of Internet health information searching: multivariate results from the Pew surveys. International Journal of Medical Informatics, 75: 8-28.
Tu, H. and Hargraves, J. (2003). Seeking Health Care Information:  Most Consumers Still on the Sidelines. Washington, DC, Center for Studying Health System Change. http://www.hschange.org/CONTENT/537/.

Surveys of users of on-line health information show that they believe there is room for improvement (Anonymous, 2003).  The following was found in this survey of about 3,000 users of on-line health information in 2002 by Manhattan Research:
Anonymous (2003). Americans Expect More of Their Online Health Information Resources. New York, NY, Manhattan Research. http://www.manhattanresearch.com/Credibility,%20Accuracy,%20and%20Readability%20(052703).pdf.

(4/1/07) Many studies on use of the Internet and Web for health searching continue to be published. A new version of the "Cyberchondriacs" report was released in 2006 (Anonymous, 2006), finding that the proportion of Americans online was now up to 77% and that the proportion of those online using the Web to seek personal health information continued to hover at around 80%. This meant that the number of Americans who have ever looked online for health information was now 136 million. Other research (Fox, 2006) shows that 66% start their searching for health information with a search engine, with 27% beginning at a health-related site. About 72% visited more than one site when seeking health information. About half (48%) of all health information seeking was done for someone else and slightly more (53%) reported that their seeking resulted in some kind of impact on how they cared for themselves or someone else. The most common topic searched was a specific disease or medical problem (64% of all searchers), followed by a certain medical treatment or procedure (51%); diet, nutrition, vitamins, or supplements (49%); and exercise or fitness (44%). Madden (2006, 2006) found that 20% of Internet users reported that the Internet "greatly" improved the way they get information about health care.

A newer ongoing analysis of the Internet's impact on health care has been the Health Information National Trends Survey (HINTS), funded by the National Cancer Institute (NCI) and focused on cancer information (Hesse, 2005). The first report of this survey found that 63.0% of Americans reported going online, with 63.7% of those who did so reporting that they looked for health information for themselves or someone they know in the last 12 months. Despite the availability of these new sources of information, those surveyed still reported that they had "a lot" of trust in cancer information from their physicians (62.4%) as opposed to the Internet (23.9%), television (20.0%), family or friends (18.9%), magazines (15.9%), newspapers (13.1%), or radio (9.9%). Those going online were somewhat more likely to be under 65, female, white, college-educated, and have higher income. Despite the fact that those seeking information about cancer would prefer to go to their health care provider, most ended up going to the Internet, probably because of its convenience.

In addition, the impact of Google on medicine is not insubstantial, as noted from publications in major medical journals:
Anonymous (2006). Number of "Cyberchondriacs" - Adults Who Have Ever Gone Online for Health Information - Increases to an Estimated 136 Million Nationwide. Rochester, NY, Harris Interactive. http://www.harrisinteractive.com/harris_poll/index.asp?PID=686.
Fox, S. (2006). Online Health Search 2006. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Online_Health_2006.pdf.
Greenwald, R. (2005). And a diagnostic test was performed. New England Journal of Medicine, 353: 2089-2090.
Hesse, B., Nelson, D., et al. (2005). Trust and sources of health information: the impact of the Internet and its implications for health care providers:  findings from the first Health Information National Trends Survey. Archives of Internal Medicine, 165: 2618-2624.
Madden, M. and Fox, S. (2006). Finding Answers Online in Sickness and in Health. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Health_Decisions_2006.pdf.
Madden, M. (2006). Internet Penetration and Impact. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Internet_Impact.pdf.
Tang, H. and Ng, J. (2006). Googling for a diagnosis--use of Google as a diagnostic aid: Internet based study. British Medical Journal, 333: 1143-1145.

1.5.3  Models of the World Wide Web

(4/1/07) Berners-Lee et al. (2006) have called for a new "science of the Web" that explores all its intricate phenomena.

Berners-Lee, T., Hall, W., et al. (2006). Creating a science of the Web. Science, 313: 769-771.

(3/20/04) Work on modeling the Web continues to grow.  An entire book recently appeared on the topic, covering such areas as text analysis, link analysis, and human behavior (Baldi et al., 2003). A related area of work (and a book) involves "mining" the Web for information and knowledge (Chakrabarti, 2003).

Baldi, P., Frasconi, P., et al. (2003). Modeling the Internet and the Web - Probabilistic Methods and Algorithms. West Sussex, England. John Wiley & Sons.
Chakrabarti, S. (2003). Mining the Web - Discovering Knowledge from Hypertext Data. San Francisco, CA. Morgan Kaufmann.

(3/15/05) Further analysis of the Web has provided insights into general phenomena of networked systems. Barabási (2002) has studied this extensively, noting similarities in different types of networks. He has also noted this phenomenon in the organization of the living cell (Barabási, 2004).

Barabási, A. (2002). Linked: The New Science of Networks. Cambridge, MA. MIT Press.
Barabási, A. and Oltvai, Z. (2004). Network biology: understanding the cell's functional organization. Nature Reviews - Genetics, 5: 101-113.

1.6  A sample document database for examples

Last updated - April 1, 2007