resources are arranged by release time", that is, new resources are placed on the first page (or the last page) of the page series.
2. The main idea of
The spider sits upstream in the search engine's data flow.
Figure 2: the fourth page 18 hours later
This is the fourth page of the page series 18 hours later. More than three pages of new resources were added during this period, and the resources circled in red in Figure 1 have been pushed back, in order, to the red box on the fourth page.
Most current websites organize their resources as an index page followed by a series of paginated pages. When new resources are added, older resources are pushed back into the page series.
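The push-back effect can be illustrated with a small sketch. It assumes a newest-first listing and a fixed page size; the names `paginate` and `page_of` and the resource IDs are hypothetical and only serve the illustration.

```python
def paginate(resources, page_size):
    """Split a newest-first list of resources into consecutive pages."""
    return [resources[i:i + page_size]
            for i in range(0, len(resources), page_size)]


def page_of(resource, pages):
    """Return the 1-based number of the page holding `resource`, or None."""
    for number, page in enumerate(pages, start=1):
        if resource in page:
            return number
    return None


# Hypothetical site state: three resources, newest first, two per page.
old_resources = ["r3", "r2", "r1"]
before = paginate(old_resources, page_size=2)

# Three new resources are released and placed in front of the old ones.
new_resources = ["r6", "r5", "r4"]
after = paginate(new_resources + old_resources, page_size=2)

print(page_of("r3", before))  # page holding r3 before the new releases
print(page_of("r3", after))   # page holding r3 after the new releases
```

In this toy setup, a resource that was on the first page moves to a later page as soon as enough newer resources are published in front of it.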
In the search engine data flow, the spider is responsible for collecting resources from the Internet into local storage for subsequent retrieval, which makes it one of the search engine's most important data sources. The goal of the spider system is to discover and crawl every valuable web page on the Internet. To achieve this, the first step is to find the links to valuable pages, and the spider has a variety of link discovery mechanisms for finding resource links as quickly and completely as possible. This article mainly describes one of them, a link completion mechanism for a specific type of index page, and gives recommendations on how such index pages should be structured in order to improve how well they are included.
For the spider, this specific type of index page is an effective channel for discovering resource links. However, the spider checks these pages periodically to obtain new resource links, and its check cycle inevitably differs from the cycle on which resource links are released (the spider tries to detect a page's release cycle and check it at a reasonable frequency). When the cycles differ, resource links are likely to be pushed into the page series before the spider sees them, so the spider needs to complete this special type of page series in order to include the resources fully.
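The completion idea can be sketched roughly as follows. This is a minimal illustration, not the spider's actual implementation: it assumes the page series is ordered newest first, and the names `complete_links`, `fetch_page`, `seen` and `max_pages` are hypothetical.

```python
def complete_links(fetch_page, seen, max_pages=50):
    """Walk the page series from page 1 and collect links not seen before.

    Assumes links are ordered newest first, so the walk can stop at the
    first page that contains an already-seen link: everything after it
    is older and was picked up by an earlier check.
    """
    new_links = []
    for number in range(1, max_pages + 1):
        links = fetch_page(number)
        if not links:                     # ran past the end of the series
            break
        unseen = [link for link in links if link not in seen]
        new_links.extend(unseen)
        seen.update(unseen)
        if len(unseen) < len(links):      # reached the previous check's links
            break
    return new_links


# Hypothetical page series at check time: two links per page, newest first.
pages = {
    1: ["r9", "r8"],
    2: ["r7", "r6"],
    3: ["r5", "r4"],
    4: ["r3", "r2"],
}
seen_links = {"r5", "r4", "r3", "r2"}     # collected during the last check

found = complete_links(lambda n: pages.get(n, []), seen_links)
print(found)
```

A spider that only re-read the first page here would pick up `r9` and `r8` and miss `r7` and `r6`, which had already been pushed to the second page. Walking the series until a known link appears recovers them.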
as shown below:
index page link completion mechanism
This paper mainly discusses the