Googlebot is Google’s web crawling bot (sometimes also called a “spider”). Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index. We use a huge set of computers to fetch (or “crawl”) billions of pages on the web.
Google’s search system has three main parts:
- Googlebot, a web crawler that finds and fetches web pages.
- The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
- The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
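To make the split between the indexer and the query processor concrete, here is a minimal sketch of a word index and a lookup over it. It is an illustration only, not Google's implementation: the function names `build_index` and `search`, the sample pages, and the "every query word must appear" rule are all assumptions, and the sketch does no relevance ranking.

```python
from collections import defaultdict

def build_index(pages):
    """Indexer sketch: map each lowercase word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Query processor sketch: return URLs that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)  # sorted only to make the output stable

# Hypothetical pages standing in for what a crawler has fetched.
pages = {
    "http://example.com/a": "web crawler fetches pages",
    "http://example.com/b": "the indexer sorts words",
    "http://example.com/c": "crawler hands pages to the indexer",
}
index = build_index(pages)
print(search(index, "crawler pages"))
```

A real query processor would also score the matching documents; here the result is simply the set of pages that contain all the words.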
Let’s take a closer look at each part.
Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.
Googlebot, Google’s Web Crawler
Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.
Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.
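The idea of deliberately slowing down requests to each server can be sketched as a small per-host scheduler. This is an assumption-laden illustration, not how Googlebot is actually built: the class name `PoliteScheduler` and the two-second delay are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from urllib.parse import urlparse

class PoliteScheduler:
    """Tracks the earliest time each host may be requested again."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed request time

    def wait_time(self, url, now):
        """Seconds to wait before fetching url; 0.0 if it can be fetched now."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that url's host was just requested at time `now`."""
        host = urlparse(url).netloc
        self.next_allowed[host] = now + self.min_delay

scheduler = PoliteScheduler(min_delay_seconds=2.0)
scheduler.record_fetch("http://example.com/a", now=100.0)
print(scheduler.wait_time("http://example.com/b", now=100.5))   # same host: must wait
print(scheduler.wait_time("http://example.org/", now=100.5))    # other host: no wait
```

Because the delay is tracked per host, a crawler can still fetch from many different servers at once while keeping the load on any single server low.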
Googlebot is the crawling software Google uses to scan the Internet and to find, add, and index new web pages. In other words, “Googlebot is the name of the search engine spider for Google. Googlebot will visit sites which have been submitted to the index every once in a while to update its index.”
Note: Googlebot only follows HREF (“Hypertext Reference”) links, which indicate the URL being linked to, and SRC (“Source”) links. Starting from a list of webpage URLs, Googlebot collects the information used to build a searchable index for Google’s indexer.
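As a rough illustration of what following HREF and SRC links involves, the sketch below pulls both kinds of attribute out of a page with Python's standard `html.parser` module and resolves them against the page's own URL. The class name `LinkExtractor`, the helper `extract_links`, and the sample HTML are assumptions made for this example, not part of Googlebot.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the targets of href and src attributes as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Attribute names arrive lowercased; skip empty or missing values.
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links

sample_html = '<a href="/about.html">About</a><img src="logo.png"><a name="top">Top</a>'
print(extract_links(sample_html, "http://example.com/dir/page.html"))
```

The anchor with only a `name` attribute contributes nothing, since it carries neither an HREF nor an SRC.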
The function of Googlebot
Googlebot functions as a search bot that crawls the content of a site and interprets the robots.txt file created by the site’s owner.
How to use Googlebot
Current version: Googlebot 2.1
Tag: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
Switching User-Agent to Googlebot: Firefox extension (User Agent Switcher)
IP address range:
- from 18.104.22.168 to 22.214.171.124 (googlebot.com)
(as of May 2008)
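Besides a browser extension, the User-Agent can be switched in a script, which is handy for checking how a page responds to a client that identifies itself as Googlebot. The sketch below uses the tag listed above and only builds the request; the helper name `build_request` and the example URL are assumptions, and the request comes from your own IP address, not from the Googlebot range.

```python
import urllib.request

# User-agent tag as listed in this article.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

def build_request(url, user_agent=GOOGLEBOT_UA):
    """Build (but do not send) a request carrying the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

request = build_request("http://example.com/")
print(request.get_header("User-agent"))
# urllib.request.urlopen(request) would send it; omitted here to avoid network access.
```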
Tips: For Googlebot to function fully, allow the bots (spiders) all the access they need.
Reminders: Ensure the Prevent Spiders option is set to true in your admin sessions settings.
Updates/changes to Googlebot access: check the content of your robots.txt file.
How to Allow/Disallow Googlebot (manually):
- To allow Googlebot:
  - User-agent: Googlebot
  - Allow: / (or list a directory or page that you want to allow)
- To block Googlebot:
  - User-agent: Googlebot
  - Disallow: / (or list a directory or page that you want to disallow)
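The effect of these rules can be checked locally with Python's standard `urllib.robotparser` module. The sketch below is one possible way to do it, not an official Google tool: the helper name `can_fetch`, the `/private/` directory, and the example URLs are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

def can_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

ALLOW_ALL = """\
User-agent: Googlebot
Allow: /
"""

BLOCK_ALL = """\
User-agent: Googlebot
Disallow: /
"""

# Hypothetical directory, blocking only part of the site.
BLOCK_DIR = """\
User-agent: Googlebot
Disallow: /private/
"""

print(can_fetch(ALLOW_ALL, "Googlebot", "http://example.com/page.html"))
print(can_fetch(BLOCK_ALL, "Googlebot", "http://example.com/page.html"))
print(can_fetch(BLOCK_DIR, "Googlebot", "http://example.com/private/x.html"))
```

Rules written for `User-agent: Googlebot` apply only to that crawler; other bots are unaffected unless the file also contains a section for them.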
How to create a robots.txt file using the Generate robots.txt tool (in 5 steps):
To ensure the robots.txt tool is working properly, test it! Here’s how:
(1) Go to the Webmaster Tools Home page and click the site you want.
(2) Under Site configuration, click Crawler access. If it’s not already selected, click the Test robots.txt tab.
(3) Copy the content of your robots.txt file and paste it into the first box. In the URLs box, list the URLs of the site to test it against.
Pros and Cons of Googlebot
Pros:
– It can quickly build a list of links that come from the Web.
– It recrawls popular, frequently changing web pages to keep the index current.
Cons:
– It only follows HREF links and SRC links.
– It takes up an enormous amount of bandwidth.
– Some pages may take longer to find, so crawling may occur once a month instead of daily.
– It must be set up/programmed to function properly.
Other Googlebot Options
– Googlebot-Mobile: crawls pages for Google’s mobile index
– Googlebot-Image: crawls pages for Google’s image index
– Mediapartners-Google: crawls pages for AdSense content/ads
– AdsBot-Google: crawls pages to check for Google AdWords