Publication date: 2009-10 · Publisher: China Machine Press (机械工业出版社) · Author: W. Bruce Croft (US) · Pages: 520
Tag標(biāo)簽:無
Preface
This book provides an overview of the important issues in information retrieval, and how those issues affect the design and implementation of search engines. Not every topic is covered at the same level of detail. We focus instead on what we consider to be the most important alternatives to implementing search engine components and the information retrieval models underlying them. Web search engines are obviously a major topic, and we base our coverage primarily on the technology we all use on the Web, but search engines are also used in many other applications. That is the reason for the strong emphasis on the information retrieval theories and concepts that underlie all search engines.

The target audience for the book is primarily undergraduates in computer science or computer engineering, but graduate students should also find this useful. We also consider the book to be suitable for most students in information science programs. Finally, practicing search engineers should benefit from the book, whatever their background. There is mathematics in the book, but nothing too esoteric. There are also code and programming exercises in the book, but nothing beyond the capabilities of someone who has taken some basic computer science and programming classes.
Overview
This book introduces the key issues in information retrieval (IR), shows how those issues affect the design and implementation of search engines, and reinforces the important concepts with mathematical models. On the major topic of web search engines, the book mainly covers the search technologies widely used on the Web. It is suitable for undergraduates and graduate students in computer science or computer engineering, and it also serves as an ideal introductory text for practitioners.
About the Author
W. Bruce Croft is a Distinguished Professor of Computer Science at the University of Massachusetts Amherst and an ACM Fellow. He founded the Center for Intelligent Information Retrieval, has published more than 200 papers, and has received numerous awards, including the Gerard Salton Award presented by ACM SIGIR in 2003.
Table of Contents
1 Search Engines and Information Retrieval
  1.1 What Is Information Retrieval?
  1.2 The Big Issues
  1.3 Search Engines
  1.4 Search Engineers
2 Architecture of a Search Engine
  2.1 What Is an Architecture?
  2.2 Basic Building Blocks
  2.3 Breaking It Down
    2.3.1 Text Acquisition
    2.3.2 Text Transformation
    2.3.3 Index Creation
    2.3.4 User Interaction
    2.3.5 Ranking
    2.3.6 Evaluation
  2.4 How Does It Really Work?
3 Crawls and Feeds
  3.1 Deciding What to Search
  3.2 Crawling the Web
    3.2.1 Retrieving Web Pages
    3.2.2 The Web Crawler
    3.2.3 Freshness
    3.2.4 Focused Crawling
    3.2.5 Deep Web
    3.2.6 Sitemaps
    3.2.7 Distributed Crawling
  3.3 Crawling Documents and Email
  3.4 Document Feeds
  3.5 The Conversion Problem
    3.5.1 Character Encodings
  3.6 Storing the Documents
    3.6.1 Using a Database System
    3.6.2 Random Access
    3.6.3 Compression and Large Files
    3.6.4 Update
    3.6.5 BigTable
  3.7 Detecting Duplicates
  3.8 Removing Noise
4 Processing Text
  4.1 From Words to Terms
  4.2 Text Statistics
    4.2.1 Vocabulary Growth
    4.2.2 Estimating Collection and Result Set Sizes
  4.3 Document Parsing
    4.3.1 Overview
    4.3.2 Tokenizing
    4.3.3 Stopping
    4.3.4 Stemming
    4.3.5 Phrases and N-grams
  4.4 Document Structure and Markup
  4.5 Link Analysis
    4.5.1 Anchor Text
    4.5.2 PageRank
    4.5.3 Link Quality
  4.6 Information Extraction
    4.6.1 Hidden Markov Models for Extraction
  4.7 Internationalization
5 Ranking with Indexes
6 Queries and Interfaces
7 Retrieval Models
8 Evaluating Search Engines
9 Classification and Clustering
10 Social Search
11 Beyond Bag of Words
References
Index
Excerpt
After documents have been converted to some common format, they need to be stored in preparation for indexing. The simplest document storage is no document storage, and for some applications this is preferable. In desktop search, for example, the documents are already stored in the file system and do not need to be copied elsewhere. As the crawling process runs, it can send converted documents immediately to an indexing process. By not storing the intermediate converted documents, desktop search systems can save disk space and improve indexing latency.

Most other kinds of search engines need to store documents somewhere. Fast access to the document text is required in order to build document snippets for each search result. These snippets of text give the user an idea of what is inside the retrieved document without actually needing to click on a link.

Even if snippets are not necessary, there are other reasons to keep a copy of each document. Crawling for documents can be expensive in terms of both CPU and network load. It makes sense to keep copies of the documents around instead of trying to fetch them again the next time you want to build an index. Keeping old documents allows you to use HEAD requests in your crawler to save on bandwidth, or to crawl only a subset of the pages in your index.

Finally, document storage systems can be a starting point for information extraction (described in Chapter 4). The most pervasive kind of information extraction happens in web search engines, which extract anchor text from links to store with target web documents. Other kinds of extraction are possible, such as identifying names of people or places in documents. Notice that if information extraction is used in the search application, the document storage system should support modification of the document data.
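The HEAD-request idea above can be sketched in a few lines. This is only an illustrative sketch, not code from the book: the `CrawlRecord` structure and the decision logic are assumptions, but the ETag and Last-Modified headers used as freshness validators are standard HTTP. A crawler that stores these validators with each cached copy can issue a cheap HEAD request and skip the full download when the page has not changed:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrawlRecord:
    """Hypothetical metadata stored alongside a cached document copy."""
    url: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None

def needs_refetch(record: CrawlRecord, head_headers: dict) -> bool:
    """Decide whether to re-download a page, given the response headers
    from an HTTP HEAD request to record.url.

    If a stored validator (ETag, or failing that Last-Modified) matches
    the server's current value, the cached copy is still fresh and a
    full GET can be skipped.
    """
    etag = head_headers.get("ETag")
    if record.etag is not None and etag is not None:
        return etag != record.etag
    modified = head_headers.get("Last-Modified")
    if record.last_modified is not None and modified is not None:
        return modified != record.last_modified
    # No usable validators on either side: fall back to re-fetching.
    return True

# A page whose ETag matches the stored copy does not need a new GET.
record = CrawlRecord("http://example.com/a.html", etag='"abc123"')
print(needs_refetch(record, {"ETag": '"abc123"'}))  # False
print(needs_refetch(record, {"ETag": '"def456"'}))  # True
```

The same comparison could instead be pushed to the server with a conditional GET (`If-None-Match`/`If-Modified-Since`), which saves a round trip when the page has changed; the HEAD variant shown here keeps the decision in the crawler.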
Editor's Recommendation
Search Engines: Information Retrieval in Practice (English edition) is part of the publisher's classic original-edition series.