(4/1/07) Zins recently performed an on-line Delphi experiment that
enlisted 57 scholars in information science (including me!) to define
and create a knowledge map of information science. The results are
summarized on a Web
site and in four publications:
Zins, C. (2007). Conceptions of information science. Journal of the American Society for
Information Science & Technology, 58: 335-350.
Zins, C. (2007). Conceptual approaches for defining "data",
"information", and "knowledge". Journal
of the American Society for Information Science & Technology,
58: 479-493.
Zins, C. (2007). Knowledge map of information science. Journal of the American Society for
Information Science & Technology, 58: 526-535.
Zins, C. (2007). Classification schemes of information science: 28
scholars map the field. Journal of
the American Society for Information Science & Technology,
58: 645-672.
(4/1/07) For those of us who have worked in this field for some time,
i.e., almost two
decades for myself, the "mainstreaming" of IR continues to amaze. By
2004, over 100
million Americans had used a search engine (Fallows, 2004). Search
engines themselves receive about 500 million searches per day
from around the world, and on any given day, about 41% of all Internet
users submit nearly 60 million queries to them (Rainie,
2005). "Search" has been described as an "integral" computer
application (Barrows, 2006). Important variants include enterprise and
desktop search. Maintenance of metadata in organizations is considered
a key challenge. The "mass digitization" of information raises a host
of issues, e.g., copyright, optical character recognition (OCR)
quality, libraries, long-term ownership, business models for publishers
and content sellers, information literacy, standards and
interoperability (Anonymous, 2006a).
Additional evidence that
IR is "mainstream" can be gleaned from the Google search
engine. The word "Google" itself has become a verb (e.g., "Did
you Google him?"), while the Google Zeitgeist (http://www.google.com/press/zeitgeist.html)
gives a glimpse of what the world wants to know about. In the meantime,
teenagers and others pass the time engaging in
Googlewhacking, a game where
one tries to have Google retrieve one and only one page. There are over
570,000 Googlewhacks and counting (http://www.googlewhack.com/tally.pl).
Political columnists ask whether Google is a deity (Friedman, 2003),
while the software giant Microsoft has declared that search is the most
important computer application in the near future (Ferguson, 2005) and
is on a "search and destroy" mission against Google (Vogelstein,
2005). Google, of course, is not standing back; it is developing on-line
counterparts to Microsoft Office application tools (Barret, 2006), and other
competitors are fighting both (Anonymous, 2006b). The best story of
Google, in my opinion, is the book by Battelle (2005).
Biomedicine is being impacted by the growth of IR as well. The
leaders of the National Library of Medicine have laid out a vision for
the future of medical libraries ten years hence, noting that the
"place" will be preserved but that most of the information will be
interactive and electronic (Lindberg, 2005). A leading
neuroscientist, noting the advances in the Human Genome Project and
related areas, has noted that biology is now an "information science,"
with many advances likely to come from using data to form and test
hypotheses (Insel, 2003). Major medical journals note that search
engines, most notably Google, are the major means by which visitors
reach their on-line articles (Giustini, 2005; Steinbrook,
2006). Meanwhile, pharmaceutical companies fight for information and
library talent (Davies, 2006). One such talented individual quotes
Harvard University Chemistry Professor Frank Westheimer, who once
famously said, "A month in the laboratory can save an hour in the
library."
(3/15/05) Another important term worth defining upfront is health information literacy.
The Medical Library Association (MLA, www.mlanet.org)
has spoken most eloquently about this term, noting that it differs from
health literacy and information/computer literacy. They define
health information literacy as the set of abilities needed to:
Recognize a health information need
Identify likely information sources and use them to retrieve
relevant information
Assess the quality of the information and its applicability to a
specific situation
Analyze, understand, and use the information to make good health
decisions
(4/7/05) Saranto and Hovenga (2004) did a literature review to search
for papers
attempting to define the concept of information literacy, finding that
the concept really does not exist in the literature and that it is most
often used as a synonym for "computer literacy" or related
concepts. They advocate organizational efforts to define the term
and its related skills more precisely. McCray (2005) recently
reviewed the health literacy literature and noted that most of it focused on low literacy
and its impact on understanding health information. She noticed a
number of categories of articles on the topic, including:
Methods to assess literacy and the related topic of readability of
texts
The mismatch between the readability of health information and
the literacy of those for whom it is intended
The difficulty patients with low literacy have in the health care
system, from accessing care to understanding their treatment plans to
their worse clinical outcomes
The impact of new information technologies
McCray, A. (2005). Promoting health literacy. Journal of the American Medical
Informatics Association, 12: 152-163.
Saranto, K. and Hovenga, E. (2004). Information literacy - what is it
about? Literature review of the concept and context. International Journal of Medical
Informatics, 73: 503-513.
1.2 Comparisons with other types of computer
applications
1.3 Models of IR
(4/4/06) Another model by which to view IR and related areas is
to think of how people usually find and process scientific
information. I call this the "Hersh Funnel" (see the figure
below) and have published it in several places (Hersh, 2005a; Hersh,
2005b; Hersh, 2006). Scientific information users usually begin
by searching
against all literature to find a set of documents
that contain some documents likely to be relevant. These
documents are usually reviewed manually to determine which ones are
definitely relevant. However, there is currently much research
trying to develop means to find that definitely relevant literature
automatically, in processes that are called information extraction or
text mining. Typically people structure knowledge out of the
documents that are definitely relevant.
Hersh W, Evaluation of biomedical text mining systems: lessons
learned from information retrieval. Briefings
in Bioinformatics, 2005, 6: 344-356.
Hersh WR, Information Retrieval and
Digital Libraries, in Medical
Informatics: Knowledge Management and Data Mining in Biomedicine,
Chen H, et al., Editors. 2005, Springer-Verlag: New York. 237-275.
Hersh, W., Bhupatiraju, R., et al. (2006). Enhancing access to the
bibliome: the TREC 2004 Genomics Track. Journal of Biomedical Discovery and
Collaboration, 1: 3. http://www.j-biomed-discovery.com/content/1/1/3.
(4/1/07) Just how much information is out there? Lyman and
Varian (2003) attempted to quantify the amount of information on
electronic media and its flow in 2003. A report by Gantz (2007) updated
this, estimating information quantity in 2006 and its growth to 2010.
Lyman and Varian found that the sum of
information on physical electronic media was about five exabytes (or
about 5 billion gigabytes) in 2003. This was noted to be
equivalent to
about one-half million new libraries the size of the US Library of
Congress. The majority
of this information (72%) was stored on magnetic media, primarily hard
disks,
with most of the remainder on film and a small proportion on paper
(about
1.5 petabytes or 0.0015 exabytes). This amounted to about 800 megabytes
for each man, woman, and child on Earth.
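The unit conversions behind these figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, a world population of roughly 6.3 billion in 2003, and about 10 terabytes for a digitized Library of Congress print collection. The last two values are assumptions chosen for illustration, not numbers stated above.

```python
# Back-of-the-envelope check of the 2003 storage figures (decimal SI units).
MEGABYTE = 10**6
GIGABYTE = 10**9
TERABYTE = 10**12
EXABYTE = 10**18

total_bytes = 5 * EXABYTE

# Assumed values for illustration; they are not taken from the text above.
WORLD_POPULATION_2003 = 6.3e9
LIBRARY_OF_CONGRESS_BYTES = 10 * TERABYTE

gigabytes = total_bytes / GIGABYTE
libraries = total_bytes / LIBRARY_OF_CONGRESS_BYTES
megabytes_per_person = total_bytes / WORLD_POPULATION_2003 / MEGABYTE

print(f"{gigabytes:.1e} gigabytes")
print(f"{libraries:,.0f} Library of Congress equivalents")
print(f"{megabytes_per_person:.0f} megabytes per person")
```

Under these assumptions the results line up with the quoted values of about 5 billion gigabytes, one-half million libraries, and roughly 800 megabytes per person.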
In a given year, the distribution of paper content around the world was
estimated as follows:
Office documents - 279-1,379 terabytes
Newspapers - 27-138 terabytes
Mass market periodicals - 10-52 terabytes
Books - 8-39 terabytes
Journals - 1.3-6 terabytes
The amount of information on the Internet was estimated to include:
"Surface" Web (fixed Web pages) - 167 terabytes
"Deep" Web (database-driven Web pages) - 91,850 terabytes
Email - 440,606 terabytes
Instant messaging - 274 terabytes
Gantz et al. estimated the amount of information in 2006 to be about
161 exabytes, projecting it to grow about six-fold by 2010, to 988
exabytes (nearly 1 zettabyte). About 70% of the information is
generated by individuals but 85% is maintained in some way by various
organizations. Most of the growth is fueled by analog-to-digital
conversions. For images, about 1 billion devices generate about 250
billion images annually (150 billion on cameras, 100 billion on cell
phones), which is projected to double by 2010. The amount of video is
also expected to double by 2010. The report also notes that the world
has about 1.5 billion email accounts, which consume about six exabytes.
It also notes there are about 1.1 billion Internet users now, 60% of
whom have broadband access. This is projected to increase to 1.6
billion by 2010. And, perhaps most pertinent to IR, about 95% of the
information is "unstructured," i.e., amenable to IR indexing and
retrieval techniques.
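As a rough check on the growth estimate, the 2006 and 2010 totals quoted above (161 and 988 exabytes) imply an overall growth factor and, if growth were constant from year to year, an equivalent annual rate. The constant-rate assumption is mine, made only to give a single comparable number.

```python
# Growth implied by the 2006 and 2010 estimates quoted above.
exabytes_2006 = 161
exabytes_2010 = 988
years = 2010 - 2006

overall_factor = exabytes_2010 / exabytes_2006

# Constant year-over-year factor that would yield the same overall growth.
# This is an assumption; the report is not quoted as claiming constant growth.
annual_factor = overall_factor ** (1 / years)

print(f"overall growth: {overall_factor:.1f}x over {years} years")
print(f"equivalent annual growth: {(annual_factor - 1) * 100:.0f}% per year")
```

The overall factor comes out slightly above six, which corresponds to somewhat less than a 60% increase per year under the constant-rate assumption.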
Card (2003) has depicted this growth graphically, showing that on-line
information has exceeded all
human documents generated in the first 40,000 years of human history and
is vastly more than all
the information on Earth that all humans can learn.
(3/31/04) Another model was recently put forth to guide
information seeking and retrieval research (Jarvelin and Wilson, 2003).
These authors note that scientific theories are useful for a
variety of functions, as put forth by Bunge (1967):
Systemization of knowledge - integrating, generalizing,
explaining, and expanding
Guiding research - defining relevant problems, identifying data to
collect, and proposing new research
Mapping part of reality - representing or modelling objects and
their relationships
They state that different types of information are needed in these
information tasks:
Problem information - the "structure, properties, and
requirements of the problem"
Domain information - the "known facts, concepts, laws, and
theories in the domain of the problem"
Problem-solving information - lays out how problems should be
formulated and how problem and domain information should be used
An example of this in health care might be the information task of
choosing an appropriate diagnostic test. The problem information
recognizes that the task originates from the goal of making a medical
diagnosis. The domain information brings forth the knowledge of
diagnostic tests for patients who have similar symptoms to the one at
hand. The problem-solving information leads the clinician to
apply the domain knowledge to this specific patient, resulting in a
decision of which test (if any) to order.
Bunge, M. (1967). Scientific Research. Heidelberg.
Springer-Verlag.
Jarvelin, K. and Wilson, T. (2003). On conceptual models for
information seeking and retrieval research. Information Research,
9(1).
http://informationr.net/ir/9-1/paper163.html .
1.3.1 The information world
1.3.2 Users
1.3.3 Health decision making
1.4 IR resources
1.4.1 People
(3/15/05) A major pioneer in the IR field was Gerard Salton, a
professor of computer science from Cornell University. Dr. Salton
invented many of the techniques commonly called "automated retrieval"
and is cited throughout the book. Unfortunately, he passed away
in 1995 just as the Web searching world was taking off and adopting
many of the techniques he developed. Shortly before his death, a
conference was held in his honor, celebrating his contributions. Many
of the talks from this meeting were captured and are available in
the Open Video Project (http://www.open-video.org/)
archive. To find the Salton videos, go to this site and search on
"Salton." Salton's own talk is available at http://www.open-video.org/details.php?videoid=7057.
Although I did not know him well personally, I was certainly drawn into
this field by his writings. I was also impressed at his continued
ability to be engaged in the field right up until his death. (Most of
us old-timers recall him and Karen Sparck Jones sitting in the
front row at conferences, critiquing presentations and each other's
thoughts.)
(3/20/04) The book notes that individuals from a variety of
disciplines comprise the field of IR. A well-known computer
scientist who is among the leaders from that discipline recently gave a
keynote lecture discussing the relationship between IR and computer
science (Croft, 2003). Croft noted that IR has always been a
small part of the overall
computer science field but has a common heritage with the database
systems
area. He also noted that the field grew and was validated by the
success of Web search engines in the 1990s. He also laid out some
known successes of the field:
Search engines have become a significant means by which society
accesses information.
IR has long championed the "statistical" approach to using
language, which has now been adopted by other areas of computer
science,
such as natural language processing.
IR has focused on large-scale evaluation more extensively than
other areas of computer science, which have come to adopt many of these
techniques.
IR has also focused on the importance of the user and interaction
as part of its process.
The global goals of information access and contextual retrieval
(see below) are part of the vision of other grand research goals for
computer science (e.g., Gray, 2003).
Gray, J. (2003). What next? A dozen information technology
research goals. Journal of the ACM, 50: 41-57.
Croft, W. (2003). Salton Award Lecture - Information retrieval and
computer science: an evolving relationship. Proceedings of
the 26th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, Toronto, Canada. ACM Press.
2-3.
(4/4/06) A recent paper by Moffat (2005) provides a list of the
most important IR research papers that are "recommended reading" for
research students.
(4/1/07) The NLM recently released a new version of its long-range
plan, which includes four overall goals:
Seamless, uninterrupted access to expanding collections of
biomedical data, medical knowledge, and health information.
Trusted information services that promote health literacy and the
reduction of health disparities worldwide.
Integrated biomedical, clinical, and public health information
systems that promote scientific discovery and speed the translation of
research into practice.
A strong and diverse workforce for biomedical informatics
research, systems development, and innovative service delivery.
(4/1/07) I think I can create a whole new category of books: those
about Google! There are many; here is a sampling:
Battelle, J. (2005). The Search -
How Google and Its Rivals Rewrote the Rules of Business and Transformed
Our Culture. New York, NY. Penguin Group.
Calishain, T. and Dornfest, R. (2004). Google Hacks - Tips & Tools for
Smarter Searching (Second Edition). Sebastopol, CA. O'Reilly
& Associates.
Davis, H. (2006). Google Advertising
Tools. Sebastopol, CA. O'Reilly & Associates.
Langville, A. and Meyer, C. (2006). Google's
PageRank and Beyond: The Science of Search Engine Rankings.
Princeton, NJ. Princeton University Press.
Miller, M. (2007). Googlepedia - The
Ultimate Google Resource. Indianapolis, IN. Que.
Vise, D. and Malseed, M. (2005). The
Google Story. New York, NY. Delacorte Press.
There are also some new and updated books about searching MEDLINE:
Katcher, B. (2006). Medline: A Guide
to Effective Searching in PubMed and Other Interfaces, Second Edition.
San Francisco. Ashbury Press.
Edhlund, B. (2005). Basic Principles
of Pubmed. Morrisville, NC. Lulu Press.
Edhlund, B. (2006). PubMed and
EndNote. Morrisville, NC. Lulu Press.
(4/1/07) For those interested in image retrieval and its
associated issues, an overview textbook is Visual Information
Retrieval (Del Bimbo, 1999). Another overview book is Natural
Language
Processing for Online Applications: Text Retrieval, Extraction,
and Categorization (Jackson, 2002). The book provides a
succinct but comprehensive
overview of natural language processing, document retrieval,
information extraction, text categorization, and text mining. A
comprehensive three-volume reference on health and
medical information on the Web, available in both print and on CD-ROM,
is The MLA
Encyclopedic Guide to Searching and Finding Health Information on the
Web (Anderson, 2004). A couple of books have been published
recently on Web searching (Hock, 2004; Poremsky, 2004) and another
describes using the Web for research (Schlein, 2004). Another
recent book addresses the overlap between IR and information seeking in
context (Ingwersen, 2005).
In addition to books on searching MEDLINE and other health resources,
additional help can be found in the tutorials and help
files on the PubMed site.
Anderson, P. and Allee, N., eds. (2004). The MLA Encyclopedic Guide to Searching
and Finding Health Information on the Web. New York, NY.
Neal-Schuman Publishers.
Del Bimbo, A. (1999). Visual
Information Retrieval. San Francisco, CA. Morgan Kaufmann
Publishers.
Hock, R. (2004). The Extreme
Searcher's Internet Handbook: A Guide for the Serious Searcher.
New York, NY. Information Today.
Ingwersen, P. and Jarvelin, K. (2005). The Turn - Integration of Information
Seeking and Retrieval in Context. Dordrecht, The Netherlands.
Springer.
Jackson, P. and Moulinier, I. (2002). Natural
Language Processing for Online Applications: Text Retrieval,
Extraction, and Categorization. Amsterdam, Holland. John
Benjamins Publishing.
Poremsky, D. (2004). Google and
Other Search Engines. Berkeley, CA. Peachpit Press.
Schlein, A. (2004). Find It Online,
Fourth Edition: The Complete Guide to Online Research.
Tempe, AZ. Facts on Demand Press.
1.4.5 Tools
(4/4/06) An up-to-date list of open-source search engines is at
http://www.searchtools.com/tools/tools-opensource.html . One
system of note is Lucene
(Gospodnetic, 2005),
which is written in Java and is now an open-source project of the Apache Software Foundation. Another new IR system
for research use is Zettair.
Written by a group known for their accomplishments in index compression
and search speed, this system is fast and flexible.
Gospodnetic, O. and Hatcher, E. (2005). Lucene in Action. Greenwich, CT.
Manning Publications.
1.5 The Internet and World Wide Web
(4/1/07) There are several Web sites that track Internet use in
different countries and languages: comScore (www.comscore.com), Internet World
Statistics (www.internetworldstats.com),
and Global Reach (www.glreach.com).
They all paint a relatively consistent picture: Worldwide use of the
Internet continues to grow, particularly
in emerging economies like India, China, and Russia (Anonymous, 2007).
While the largest number of users still comes from the US (154 M),
China is rapidly closing in (87 M) and only 20% of all Internet users
come from the US. Despite its growth, Doyle et al. (2005) note that the
Internet is "robust yet fragile." In other words, its distributed
nature makes it fault-tolerant, but faults do occur frequently.
It seems almost like ancient history now, but the original Web
(sometimes called Web 1.0) featured a boom and then a bust, i.e., the
dot-com era. Some (e.g., O'Reilly, 2005) talk of a new Web now, a Web
2.0 that is built on sustainable business models and widespread
collaboration. A sounder business model gives users what they want,
which makes it more sustainable, e.g., Google Ads, eBay, and Amazon. But
Web 2.0 is also more collaborative, so that it "gets better the more
people use it" (O'Reilly, 2005), e.g., blogging, wikis and Wikipedia,
Flickr, and Craigslist. Could Web 2.0 impact medicine? One view was
put forth by Giustini (2006) in the British
Medical Journal.
Search engine use is very high among Internet
users. In a survey done in May-June, 2004, Fallows et al. (2004)
found that 84% of Internet users have used a search engine
(extrapolating from the usage statistics cited above, that translates
into 107 million people) and that 87% of people say they find what they
want most of the time. This memo also presented some facts
gleaned from tracking the top 25 search engines:
Americans conduct 3.9 billion searches per month
The average user performs 33 searches per month, spending about
41 minutes at search engine sites
The average visit to a search engine results in 4.4 searches
A recent analysis of search engine users shows that while they are
enthusiastic and trusting of search engines, they are also unaware and
naive about certain aspects of them (Fallows, 2005). A large
majority of users report confidence in their searching abilities (92%)
and that they have successful searches most of the time (87%). However,
62% are unaware of the differences between paid and unpaid
results.
One Google scientist notes that Google receives about 200 million
searches per day (Singhal, 2004). Extrapolating from Google's
market share, this means about 500 million searches are done per
day. Of Google's 200 million searches each day, 100 million are
unique. Searches average 2.4 words and are entered in 90
different languages. About 10-20% of the pages in Google's
database change each month.
The Web has truly become "world-wide." According to Internet
World Stats (http://www.internetworldstats.com),
some 1.0 billion of the world's 6.5 billion people use the Internet
(15.9%). It is of course higher in developed regions/countries
such as the United States (68.1%), Oceania/Australia (52.1%), and
Europe (35.9%). But it is growing even more rapidly in places
like Latin America (14.8%, as high as 35.7% in Chile and 26.4% in
Argentina) and China (8.5%).
One irony that few IR "old timers" could ever have fathomed is the
need, in the Web era, for the study of "adversarial" IR, in other
words, the development of techniques to prevent retrieval of certain
content. One group of adversarial IR applications is the
prevention of "spam" (i.e., unwanted) pages or emails (Metaxas,
2005). On the Web, this is called "link spam" (Noruzi, 2006). Singhal
(2004) notes there is a continual tit-for-tat battle between those who
develop search engines and those who try to "game" them. Indeed,
there is a large market for attempting to drive traffic to one's Web
site via search engines and other means, sometimes called "search
engine marketing" (e.g., Moran and Hunt,
2005). Another form of adversarial IR is in "filtering," with the
usual goal of preventing linkage to pornography sites. Of course,
most approaches to such filtering are imperfect and can lead to
blocking of legitimate medical Web sites (Richardson et al.,
2002). Indeed, one filter even blocks access to the Web site of
the town Toppenish, WA, due to the presence of the letters from a
blocked word in the middle of the town name (Anonymous, 2003).
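The Toppenish case is what happens when a filter matches blocked terms as raw substrings. A minimal sketch of the difference between substring and whole-word matching follows. The blocklist, the function names, and the example page title are hypothetical and are not drawn from any actual filtering product.

```python
import re

# Hypothetical blocklist; real filtering products use their own lists.
BLOCKED_TERMS = {"sex"}

def blocked_by_substring(text: str) -> bool:
    """Naive filter: block if any blocked term appears anywhere in the text."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def blocked_by_whole_word(text: str) -> bool:
    """Stricter filter: block only if a blocked term appears as a whole word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not BLOCKED_TERMS.isdisjoint(words)

title = "Middlesex County Public Health Department"
print(blocked_by_substring(title))   # True: a false positive
print(blocked_by_whole_word(title))  # False
```

Whole-word matching removes this kind of false positive, but it would still block a legitimate health page that uses the blocked term itself, which is the harder problem for medical content.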
Another concern about search engines is the growing desire of
governments to monitor their usage (Hansell, 2006). Although such monitoring is ostensibly meant to
thwart the very real threats of terrorism, many are concerned about
governments knowing our searching interests. There are also some
governments, most notably China, that have required search engines to
filter pages containing certain words (such as democracy). At the
current time, privacy laws that protect things like email and library
check-outs do not protect queries to search engines.
Anonymous (2003). The Insider: 'Bess' Internet porn filter a
little too easily offended. Seattle
Post-Intelligencer. July 7, 2003. http://seattlepi.nwsource.com/business/129611_insider07.html.
Anonymous (2007). Worldwide Internet Audience has Grown 10 Percent in
Last Year, According to comScore Networks. Reston, VA, comScore
Networks. http://www.comscore.com/press/release.asp?press=1242.
Doyle, J., Alderson, D., et al. (2005). The "robust yet fragile" nature
of the Internet. Proceedings of the
National Academy of Sciences, 102: 14497-14502.
Fallows, D., Rainie, L., et al. (2004). Data Memo on Search Engines.
Washington, DC, Pew Internet &
American Life Project. http://www.pewinternet.org/pdfs/PIP_Data_Memo_Searchengines.pdf.
Fallows, D. (2005). Search Engine Users. Washington, DC, Pew Internet & American Life Project.
http://www.pewinternet.org/pdfs/PIP_Searchengine_users.pdf.
Giustini, D. (2006). How Web 2.0 is changing medicine. British Medical Journal, 333:
1283-1284.
Hansell, S. (2006). Online Trail Can Lead To Court. New York Times. February 4, 2006.
C1. http://www.nytimes.com/2006/02/04/technology/04privacy.html.
Singhal, A. (2004). Challenges in Running a Commercial Web Search
Engine. http://www.research.ibm.com/haifa/Workshops/searchandcollaboration2004/papers/haifa.pdf.
Metaxas, P. and DeStefano, J. (2005). Web spam, propaganda and trust. First International Workshop on
Adversarial Information Retrieval on the Web, Chiba, Japan. http://airweb.cse.lehigh.edu/2005/metaxas.pdf.
Moran, M. and Hunt, B. (2005). Search
Engine Marketing, Inc.: Driving Search Traffic to Your Company's
Web Site. Englewood Cliffs, NJ. Prentice Hall.
Noruzi, A. (2006). Link spam and search engines. Webology, 3(1).
http://www.webology.ir/2006/v3n1/editorial7.html.
O'Reilly, T. (2005). What Is Web 2.0? Design Patterns and Business
Models for the Next Generation of Software. Sebastopol, CA, O'Reilly
& Associates. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html.
Richardson, C., Resnick, P., et al. (2002). Does pornography-blocking
software block access to health information on the Internet? Journal of the American Medical Association,
288: 2887-2894.
(3/20/04) The popularization of the Web arguably began with the
release of the Mosaic Web browser. The tenth anniversary of the
release of this software was recently celebrated at the US National
Science
Foundation (Anonymous, 2003).
(3/1/03) In Feb. 2003, the National Science Foundation released a
report on
Cyberinfrastructure, recommending that the organization spend an
additional $1 billion per year developing the nation’s
"cyberinfrastructure" to support scientific research. The report
advocates that investment in a comprehensive cyberinfrastructure will
profoundly change what scientists and engineers do, how they
do it, and who participates.
(4/5/03) The presentation about Web searching by Travis
and Broder (2001) cited in the book was later published as an article:
Broder A, A taxonomy of Web search, SIGIR Forum, 2002,
36(2): 3-10.
http://www.acm.org/sigir/forum/F2002/broder.pdf
In this paper, Broder notes that classic IR is driven by the user's
information need, but that Web searching is often not informational.
Instead, the user's intent might be navigational (e.g., finding a
specific page) or transactional (e.g., purchasing something, downloading
a file, or checking the status of an account). Broder notes that
navigational searches are similar to what classic IR calls a
"known-item search,"
since they usually have only one correct answer. He also states
that
"hub" pages (see section 1.5.2) with lists of links that get to the
target
in one click may be acceptable. In transactional queries, the
user
needs not only to reach a site, but also interact with it once he or
she
gets there.
Broder analyzed the frequency of these types of Web search by users of
the AltaVista search engine via two means: a pop-up survey window
and a search log analysis. He noted the limitations in
each: Pop-up survey takers were self-selected and may not
represent all users or their needs. In addition, it is usually
difficult to know a user's exact intent from the query statement they
enter into a search engine. Based on his data, he concludes the
following
approximate distribution of types of Web search:
Informational - 39-48%
Navigational - 20-24%
Transactional - 30-36%
In other words, less than half of searches on the Web (at least those
entered into AltaVista) are classic IR information seeking.
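To make the three categories concrete, here is a toy sketch that guesses a query's intent from surface cues. It is not the method used in the study, which relied on a user survey and a log analysis. The cue lists and the function name are hypothetical, and, as noted above, real intent is often impossible to read from the query text alone.

```python
from enum import Enum

class Intent(Enum):
    NAVIGATIONAL = "navigational"
    TRANSACTIONAL = "transactional"
    INFORMATIONAL = "informational"

# Hypothetical cue lists, chosen only for illustration.
TRANSACTIONAL_CUES = {"buy", "download", "order", "purchase", "login"}
NAVIGATIONAL_CUES = {"homepage", "website", "site"}

def guess_intent(query: str) -> Intent:
    """Guess the intent class of a query from surface cues alone."""
    tokens = query.lower().split()
    if any(t in TRANSACTIONAL_CUES for t in tokens):
        return Intent.TRANSACTIONAL
    looks_like_address = any(
        t.startswith("www.") or t.endswith(".com") for t in tokens
    )
    if looks_like_address or any(t in NAVIGATIONAL_CUES for t in tokens):
        return Intent.NAVIGATIONAL
    # Default: treat everything else as a classic information need.
    return Intent.INFORMATIONAL

for q in ["download adobe reader", "united airlines website", "symptoms of lyme disease"]:
    print(q, "->", guess_intent(q).value)
```

A heuristic like this misclassifies many queries, which is one reason the survey and log figures above are given as ranges.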
Broder also describes what he calls three generations of search engines
on the Web. The first generation uses mostly
static HTML pages and is very close to classic IR. The second
generation uses off-page, Web-specific data such as link analysis,
anchor text, and click-through data. He cites the Google PageRank
algorithm as an example of this and notes that it supports
informational
as well as navigational queries. The third generation attempts
to discern the "need behind the query" based on semantic analysis of
the user's input and determination of their context. He gives the
example of the user entering the name of a city and the system
returning
a hotel reservation page, map server, weather server, etc. The
aim
of this generation would be to support all transactional searches in
addition to those which are informational and navigational. He
does not state so explicitly, but an implication is that the Semantic
Web (see section 10.2.4) will be helpful for this type of search.
(3/20/04) A collection of leaders in the field recently held a
workshop for defining the research agenda for the IR field (Allan et
al., 2003). This workshop was motivated in part by the notion
that current Web search engines are so effective that further research
and
development in the field are not warranted. However, this
workshop
noted that Web searching has become mainstream and successful, but is
not
the entire IR picture:
Web searching and IR are not equivalent - Web searching is
at best a part of overall information access.
Web queries do not represent all information needs - Users do much more
than search for the Web pages and other content indexed by Web search
engines
Web search engines are effective for some types of queries
in some contexts - There are many times when users are looking for more
specific and/or different information that resides on the Web.
This workshop came to the conclusion that there are two general
long-term challenges for the field:
Global information access - Information needs should be satisfied
through "natural, efficient interaction with an automated system that
leverages world-wide structured and unstructured data in any language."
Contextual retrieval - Search technologies and knowledge should
be used together to find the most appropriate content for a user's
information need.
In other words, research in IR must aim to create systems that
seamlessly search across the appropriate content at the appropriate
time. The paper gives a well-cited example of a user entering a
query for "Taj
Mahal." If the user's system knew that he or she was going to
attend
an academic conference in India, it would provide him or her with
information
about the famous landmark.
However, if the user was planning a trip to Atlantic City or
enjoyed jazz music, the system would preferentially present information
about the casino or jazz musician,
respectively.
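The "Taj Mahal" example can be sketched as re-ranking the possible senses of a query by their overlap with what the system knows about the user. Everything in the sketch is hypothetical: the sense labels, the context vocabularies, and the simple overlap score are illustrations, not a technique proposed in the workshop report.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sense:
    label: str
    context_terms: frozenset

# Hypothetical senses and context vocabularies for the query "Taj Mahal".
SENSES = [
    Sense("landmark in India", frozenset({"india", "agra", "conference", "travel"})),
    Sense("casino in Atlantic City", frozenset({"atlantic city", "casino", "gambling"})),
    Sense("musician", frozenset({"jazz", "music", "concert"})),
]

def rank_senses(user_context: set) -> list:
    """Order senses by how many context terms they share with the user's profile."""
    scored = [(len(s.context_terms & user_context), s.label) for s in SENSES]
    # sorted() is stable, so ties keep the order in which senses are listed.
    return [label for score, label in sorted(scored, key=lambda pair: -pair[0])]

print(rank_senses({"india", "conference"}))
print(rank_senses({"jazz", "concert"}))
```

The difficult part in practice is obtaining a trustworthy user context in the first place, which the sketch simply takes as given.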
The report then outlined what workshop attendees considered to be the
major challenge areas for IR research:
Retrieval models - Web search engines tend to have a "one size
fits all" model that does not take into account other tasks that the
user wishes to perform, such as answer questions, browse specific
collections of information, find certain types of content, etc.
Cross-language IR - While English was initially the predominant
language of the Web, less than half of all pages are in English and at
some point in the future, other languages might surpass it.
Systems need to find content in other languages when appropriate
and provide the user a summary so he or she can determine whether to
expend the resources to translate it.
Web search - While Web search is not the only type of IR
application, it is certainly very popular, and further research must
continue to improve it.
User modeling - Different users have diverse needs, even when
searching for the same "topic." This is certainly true in health
care, where a patient, primary care physician, and subspecialist all
might want information on the same topic but bring different levels of
reading ability, prior knowledge, and so forth to the information
seeking process.
The report also lists the following areas as those of major challenge,
but they really represent specific instances of the above general
challenges:
Filtering, topic detection and tracking, and classification -
Users need intelligent tools to sift through large quantities of
information to follow topics and threads.
Summarization - Users can also be aided by systems that more
effectively summarize information instead of just presenting lists of
specific content.
Question-Answering - A common use of IR systems is to answer a
specific question, yet most systems just present content that the user
must read to find that answer.
Metasearch - Users often want to search over multiple resources
that may not all be represented in a single index.
Multimedia - While early IR was focused on text, real users are
also interested in finding other types of content, such as images,
videos, and sounds.
(2/12/03) According to the American Medical Association,
physician use of the Web continues to grow beyond the figures cited in
the second edition. Some other facts from this report include:
Two-thirds of physicians who use the Web do so daily.
The average physician user of the Web spends 7.1 hours
per week using it.
About 65% of physicians over 60 years of age use the Web, showing
use is not limited to younger physicians.
About 30% of physicians have a Web site for their practice.
(4/4/06) A more recent summary of physician Internet usage data
found that 98% of US physicians use the Internet while half own
personal digital assistants (PDAs) (Anonymous, 2005). A proposed
new model of continuing medical education gives credit for documented
information seeking during clinical care (Davis, 2004).
Anonymous (2005). Physician Internet Use Statistics. http://www.max.md/pdf/PhysicianInternetUseStatistics.pdf.
Davis, N. and Willis, C. (2004). A new metric for continuing medical
education credit. Journal of
Continuing Education in the Health Professions, 24: 139-144.
(3/20/04) Another study of physician use showed that those who
were more active clinically (i.e., saw more patients per week) spent
more time on-line (Taylor and Leitman, 2001).
(4/4/06) Another growing category of IR system users is
biomedical researchers. This is due in large part to new
"high-throughput" biotechnologies, such as gene microarrays.
These technologies not only generate large amounts of data, but also identify
new information that must be explored, e.g., the microarray experiment
that uncovers increased expression of genes previously unknown to be
related to a physiological or disease
process (Buetow, 2005). There is growing awareness that IR and other
techniques, such as text mining, are important tools for researchers (Jensen, 2006;
Hunter, 2006).
But literature retrieval and analysis are difficult for scientists.
Barnes and Gary (2003) say, "Few areas of biological research
call for a broader background in biology than the modern approach to
genetics. This background is tested to the extreme in the
selection of candidate
genes for involvement with a disease process… Literature is the most
powerful resource to support this process, but it is also the most
complex and confounding data source to search."
Barnes, M. and Gary, R. (2003). Bioinformatics for Geneticists.
West Sussex, England. John Wiley & Sons.
Buetow, K. (2005). Cyberinfrastructure: empowering a "third way" in
biomedical research. Science,
308: 821-824.
Hunter, L. and Cohen, K. (2006). Biomedical language processing: what's
beyond PubMed? Molecular Cell,
21: 589-594.
Jensen, L., Saric, J., et al. (2006). Literature mining for the
biologist: from information retrieval to biological discovery. Nature Reviews - Genetics, 7:
119-129.
(4/4/06) Consumers and patients are another group of heavy users of
the Web for health and biomedical information. One area
of debate concerns how often they use the Web to seek health
information.
Some early reports, such as a Harris Interactive Poll (Taylor, 2002),
put the figure as high as 80% of all
Internet users. This poll found
about 80% of all adults who are online sometimes used the Web to look
for health care information. About 18% said they did so "often",
while most did so "sometimes" (35%), or "hardly ever" (27%). This
80% of all those online amounted to 110 million users nationwide. This
compared with 54 million in 1998, 69 million in 1999 and 97 million in
2001. On average those who ever looked for health care
information
online did so three times every month. About half (53%) of those who
looked
for health care information used a portal or search engine that allowed
them to search for the health information they wanted across many
different
Web sites. About a quarter (26%) went directly to a site that focused
only on health-related topics and one in eight (12%) went first to a
general site that focused on many topics that may have had a section
on health issues.
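The percentages quoted from the Harris poll can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the implied number of online adults and the unaccounted share of starting points are derived values, not numbers reported by the poll, and the variable names are chosen here for illustration.

```python
# Frequency of looking for health information among online adults (percent, from the text).
frequency_shares = {"often": 18, "sometimes": 35, "hardly ever": 27}
ever_searched_pct = sum(frequency_shares.values())

# The text equates this share with 110 million users nationwide.
health_seekers_millions = 110
# Derived, not reported: the number of online adults this would imply.
implied_online_adults_millions = health_seekers_millions * 100 / ever_searched_pct

# Where health information seekers started their search (percent, from the text).
starting_point_shares = {
    "portal or search engine": 53,
    "health-only site": 26,
    "general site with a health section": 12,
}
# Derived: share of seekers not covered by the three listed starting points.
unaccounted_pct = 100 - sum(starting_point_shares.values())

if __name__ == "__main__":
    print(ever_searched_pct, implied_online_adults_millions, unaccounted_pct)
```

The three frequency categories add up to the 80% headline figure, and the three starting points cover 91% of seekers, leaving 9% unaccounted for in the summary above.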
Taylor, H. (2002). Cyberchondriacs Update. Harris Interactive.
http://www.harrisinteractive.com/harris_poll/index.asp?PID=299.
The Pew Internet & American Life Project has published
a number of studies on information seeking, related not only to health
care but also to search engines in general:
Fox, S. and Rainie, L. (2000). The Online Health Care
Revolution: How the Web Helps Americans Take Better Care of
Themselves. Pew Internet &
American Life Project.
http://www.pewinternet.org/reports/toc.asp?Report=26
Fox, S. and Rainie, L. (2000). Vital Decisions: How Internet
users decide what information to trust when they or their
loved ones are sick. Pew Internet
& American Life Project.
http://www.pewinternet.org/reports/toc.asp?Report=59
Fox, S. and Fallows, D. (2003). Internet Health Resources: Health
searches and email have become more commonplace, but there is
room for improvement in searches and overall Internet access.
Washington,
DC, Pew Internet & American Life
Project.
http://www.pewinternet.org/reports/toc.asp?Report=95
The Fox, 2003 report found that 73 million US users have searched for
specific health information
(58% of all users) and 93 million have carried out a search related to
health
information generally (74% of all users). Both the Fox, 2003 and
Rainie, 2005 reports found that 80% of users had searched specifically
for health information. A recent analysis of all the Pew data
found several factors associated with a higher likelihood of Internet
health searching, including female gender, part-time employment, other
Internet use, specific health problems, and helping others deal with
health problems (Rice, 2006).
Other studies, however, have taken exception to these high rates
of use. Most notably, a study in JAMA claimed that only 40% of
Internet users actually used the Web to seek health information (Baker
et al., 2003). This study also found that only a third of those
who sought health information reported that the information affected a
decision about their health or health care. A number of letters
to the editor pointed out limitations of this study, the most notable
one being that study participants came from a pool of users offered
free access to WebTV, a form of Internet access used by a very small
fraction of all users. Another study found
that 31% of all Americans (not just those on-line) have used the
Internet
to search for health information over a 12-month period (Murray et al.,
2003). About 8% of these individuals took information to their
physician, although two-thirds wanted their physician's opinion as
opposed to specific treatment. Additional research has found that
only a minority of
Americans (38.2%) seek health information generally, with the most
common
sources being books or magazines (23.0%), friends or relatives (19.7%),
the Internet (16.1%), and television and radio (11.3%) (Tu and
Hargraves,
2003).
Baker, L., Wagner, T., et al. (2003). Use of the Internet and e-mail
for health care information: results from a national survey.
Journal of the American Medical Association, 289: 2400-2406.
Murray, E., Lo, B., et al. (2003). The impact of health information on
the internet on the physician-patient relationship: patient
perceptions. Archives of Internal Medicine, 163: 1727-1734.
Rice, R. (2006). Influences, usage, and outcomes of Internet health
information searching: multivariate results from the Pew surveys.
International Journal of Medical
Informatics, 75: 8-28.
Tu, H. and Hargraves, J. (2003). Seeking Health Care Information:
Most Consumers Still on the Sidelines. Washington, DC, Center for
Studying Health System Change. http://www.hschange.org/CONTENT/537/.
Surveys of users of on-line health information show that they believe
there is room for improvement (Anonymous, 2003). The following
was found in this survey of about 3,000 users of on-line health
information in 2002 by Manhattan Research:
65% believe accuracy of on-line health information needs to
increase.
64% believe the quality of such information must improve.
22% have difficulty reading and understanding on-line health
information.
51% have a hard time determining credibility of this information.
81% state that content reviewed by a health care professional
increases their likelihood of trusting the information they find.
80% say that separation of content from advertising drives
their trust in the information.
(4/1/07) Many studies on use of the Internet and Web for health
searching continue to be published. A new version of the
"Cyberchondriacs" report was released in 2006 (Anonymous, 2006),
finding that the proportion of Americans online was now up to 77% and that
the proportion of them using the Web to seek personal health information continued
to hover at around 80%. This meant that the number of Americans who
have ever looked online for health information was now 136 million.
Other research (Fox, 2006) shows that 66% start their searching for
health information with a search engine, with 27% beginning at a
health-related site. About 72% visited more than one site when seeking
health information. About half (48%) of all health information seeking
was done for someone else and slightly more (53%) reported that their
seeking resulted in some kind of impact on how they cared for
themselves or someone else. The most common topic searched was a
specific disease or medical problem (64% of all searchers), followed by
a certain medical treatment or procedure (51%); diet, nutrition,
vitamins, or supplements (49%); and exercise or fitness (44%). Two further
Pew reports (Madden, 2006; Madden and Fox, 2006) found that 20% of Internet users reported that the
Internet "greatly" improved the way they get information about health
care.
A newer ongoing analysis of the Internet's impact on health care has
been the Health Information National Trends Survey (HINTS), funded by
the National Cancer Institute (NCI) and focused on cancer information
(Hesse, 2005). The first report of this survey found that 63.0% of
Americans reported going online, with 63.7% of those who did so
reporting that they looked for health information for themselves or
someone they know in the last 12 months. Despite the availability of
these new sources of information, those surveyed still reported that
they had "a lot" of trust in cancer information from their physicians
(62.4%) as opposed to the Internet (23.9%), television (20.0%), family
or friends (18.9%), magazines (15.9%), newspapers (13.1%), or radio
(9.9%). Those going online were somewhat more likely to be under 65,
female, white, college-educated, and have higher income. Despite the
fact that those seeking information about cancer would prefer to go to
their health care provider, most ended up going to the Internet,
probably because of its convenience.
In addition, the impact of Google on medicine is not insubstantial, as
noted from publications in major medical journals:
"And a diagnostic test was performed" - a distinguished visiting
professor entered all information about a complex patient into Google
and obtained the correct diagnosis (Greenwald, 2005).
As a diagnostic aid, entering pertinent data into Google
identified the correct diagnosis in 58% of New England Journal of Medicine
diagnostic cases (Tang, 2006).
Anonymous (2006). Number of "Cyberchondriacs" - Adults Who Have Ever
Gone Online for Health Information - Increases to an Estimated 136
Million Nationwide. Rochester, NY, Harris Interactive. http://www.harrisinteractive.com/harris_poll/index.asp?PID=686.
Fox, S. (2006). Online Health Search 2006. Washington, DC, Pew Internet
& American Life Project. http://www.pewinternet.org/pdfs/PIP_Online_Health_2006.pdf.
Greenwald, R. (2005). And a diagnostic test was performed. New England Journal of Medicine,
353: 2089-2090.
Hesse, B., Nelson, D., et al. (2005). Trust and sources of health
information: the impact of the Internet and its implications for health
care providers: findings from the first Health Information
National Trends Survey. Archives of
Internal Medicine, 165: 2618-2624.
Madden, M. and Fox, S. (2006). Finding Answers Online in Sickness and
in Health. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Health_Decisions_2006.pdf.
Madden, M. (2006). Internet Penetration and Impact. Washington, DC, Pew
Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Internet_Impact.pdf.
Tang, H. and Ng, J. (2006). Googling for a diagnosis--use of Google as
a diagnostic aid: Internet based study. British Medical Journal, 333:
1143-1145.
1.5.3 Models of the World Wide Web
(4/1/07) Berners-Lee et al. (2006) have called for a new "science of
the Web" that explores all its intricate phenomena.
Berners-Lee, T., Hall, W., et al. (2006). Creating a science of the
Web. Science, 313: 769-771.
(3/20/04) Work on modeling the Web continues to grow. An
entire book recently appeared on the topic, covering such areas as text
analysis, link analysis, and human behavior (Baldi et al., 2003).
A related area of work (and a book) involves "mining" the Web for
information and knowledge (Chakrabarti, 2003).
Baldi, P., Frasconi, P., et al. (2003). Modeling the Internet
and the Web - Probabilistic Methods and Algorithms. West Sussex,
England. John Wiley & Sons.
Chakrabarti, S. (2003). Mining the Web - Discovering Knowledge from
Hypertext Data. San Francisco, CA. Morgan Kaufmann.
(3/15/05) Further analysis of the Web has provided insights into
general phenomena of networked systems. Barabási (2002)
has studied this extensively, noting similarities in different types of
networks. He has also noted this phenomenon in the organization
of the living cell (Barabási, 2004).
Barabási, A. (2002). Linked:
The New Science of Networks. Cambridge, MA. MIT Press.
Barabási, A. and Oltvai, Z. (2004). Network biology:
understanding the cell's functional organization. Nature Reviews - Genetics, 5:
101-113.