Post by Aaron Rivera on Jul 13, 2007 1:38:55 GMT -5
If you've ever wondered how to get a little better control over what parts of your web site get crawled by the search engines, how they crawl your pages, and how to encourage them to visit, keep reading. This article will explain the various protocols that the search engine robots (particularly Google's) follow. It will also touch upon ways to help you guard against scraper bots.
Polite Bots
There have been quite a number of robots.txt primers published, all explaining the basics of the robots exclusion protocol. Recently, while working on removing some pages from Google's archives, I browsed through Google's Webmaster Central Blog (hosted at Blogspot) and found posts by Dan Crow and Vanessa Fox that explain how the Googlebot works in detail.
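In case you have not read one of those primers, here is a minimal robots.txt sketch. The file sits at the root of the site, and the directory names below are purely hypothetical; they simply illustrate the protocol's two basic directives:

    # Applies to every crawler that honors the robots exclusion protocol
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/
    # A blank Disallow value would mean "crawl everything"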
Besides explaining the robots exclusion protocol in detail, Google has introduced new tools that allow the removal of cached pages through the Webmaster Dashboard; we will only cover that briefly in this piece, since I go into detail about it in a different article. Instead, this article looks at how robots.txt applies to the Googlebot in particular, quoting Dan Crow, a Google product manager. Google's bot is remarkably polite when it indexes pages; we will compare its behavior to that of some malicious scraper bots.
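Because the rest of this piece focuses on the Googlebot, it is worth noting that robots.txt can address it by name; Googlebot reads the group written for it and ignores the generic one. The paths here are, again, hypothetical:

    # Rules read only by Googlebot; other crawlers fall back to the User-agent: * group
    User-agent: Googlebot
    Disallow: /drafts/
    Disallow: /client-area/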
Googlebot has several quirks, as all bots do, and we will look at a few of them before we discuss the basics of search engine bots. For example, if your web site is down temporarily and you want Googlebot to come back later, you can return an HTTP 503 status code to tell the bot (and your users) that the site is temporarily unavailable. Without it, Googlebot will probably index your "this website is down for maintenance" page. You can find more information on the HTTP 503 status code at askapache.com.
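To make the idea concrete, here is a minimal sketch in Python of a stand-in maintenance responder. It is only an illustration (a real setup would normally do this in the web server's own configuration), but it shows the two pieces that matter: the 503 status and a Retry-After header:

    # Sketch of a maintenance responder: every request gets an HTTP 503 plus a
    # Retry-After header, so crawlers such as Googlebot know the outage is
    # temporary and roughly when to try again.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class MaintenanceHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(503)                    # Service Unavailable
            self.send_header("Retry-After", "3600")    # suggest retrying in one hour
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body>Down for maintenance.</body></html>")

    if __name__ == "__main__":
        HTTPServer(("", 8080), MaintenanceHandler).serve_forever()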
Also note that if the Googlebot is crawling your site too frequently (and hence eating up your bandwidth), you can contact Google Support; they should work with you to ensure that the bots don't overload your servers. According to Vanessa Fox, there will probably also be a tool that allows you to adjust the crawl rate of the Googlebot on your site.
Googlebot is Google's primary agent for crawling and indexing pages on the web, which is incredibly large, truly living up to the name World Wide Web; as Dan Crow puts it, it's "really, really big." And not everyone on the public web wants particular pages crawled. Some pages contain client information or inflammatory material, and some site owners don't mind the crawling but don't want their pages cached in Google's database for whatever reason.
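For site owners in that last group, a page can ask Google not to keep a cached copy of it by carrying the well-known noarchive robots meta tag in its head section:

    <meta name="robots" content="noarchive">
    <!-- or, to address only Google's crawler: -->
    <meta name="googlebot" content="noarchive">

The page still gets crawled and can still rank; Google just won't offer a "Cached" link for it in the results.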