Strange Googlebot behavior
Recently, in Google Webmaster Tools I started noticing a growing number of "not followed" errors on my site, caused in my case by so-called redirect loops. Redirects of this kind are used quite actively by the site's engine. But the error would not reproduce in any browser, and when querying "by hand", i.e. via telnet, no anomalies showed up either. Nevertheless, the errors kept appearing in GWT again and again, pointing at the same URLs on my site and irritating me by their very existence. It took a fair bit of digging, but I did manage to get to the bottom of the problem.
Article based on information from habrahabr.ru
So: the site has a tag cloud that generates links like /blog/tag/navaneeta/. Because tags are set by users, they can contain almost any UTF-8 character. In addition, links to these pages (with posts on a particular tag) are periodically posted on other sites, social networks and forums. Often, because of differences in URL encoding, these links lead to non-canonical pages.
Here is an example. Suppose there is a tag "Harry Potter", which can be encoded both as "Harry+Potter" and as "Harry%20Potter". The site can be linked to with either variant, but for the engine parsing REQUEST_URI these links are absolutely identical, which leads to duplicate pages, and search engines do not like duplicates. To combat these duplicates, when the page loads I decode the requested URL, encode it back with PHP's urlencode(), and compare the result with the originally requested string. If they do not match, I return a 301 and send the browser to the correct URL. In this way the duplicate pages are "glued together" in the eyes of the search engine.
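A minimal sketch of this canonicalization check, written in Python purely for illustration (the author's actual code is PHP built around urldecode()/urlencode(); the function names here are my own):

```python
from urllib.parse import quote, unquote

def canonical_path(path: str) -> str:
    # Decode any existing percent-encoding, then re-encode uniformly.
    # safe="/" keeps slashes as path separators; everything else,
    # including spaces and apostrophes, gets percent-encoded.
    return quote(unquote(path), safe="/")

def needs_redirect(requested: str) -> bool:
    # A 301 to canonical_path(requested) is issued when this is True.
    return canonical_path(requested) != requested
```

With a check like this, the differently-encoded variants of the same tag URL all collapse into a single canonical form via one 301 redirect.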
Everything seems pretty simple, so why did Googlebot start choking on some of these links? Fortunately, GWT has a special feature, "Fetch as Googlebot", which lets you look at the site through the eyes of the search bot. Let's try it. For this example I'll take another tag: "guns'n'roses". We tell the bot to load the page /blog/tag/guns'n'roses/. The bot reports that all is well, and that the response received was:
HTTP/1.1 301 Moved Permanently
...
Location: /blog/tag/guns%27n%27roses/
All right: according to RFC 3986, a single quote is encoded as %27. OK, now let's send the bot to the URL /blog/tag/guns%27n%27roses/ (as if we were an ordinary browser). In response we get:
HTTP/1.1 301 Moved Permanently
...
Location: /blog/tag/guns%27n%27roses/
along with a note that seems perfectly fair at first glance: "The page contains a redirect to itself. This can lead to an infinite redirect loop."
But the server logs show that in actual fact the request was once again "GET /blog/tag/guns'n'roses/ HTTP/1.1" instead of the explicitly specified "GET /blog/tag/guns%27n%27roses/ HTTP/1.1". It turns out that Googlebot decided to decode the value of the Location header, spat on the RFC, and tormented itself and my site with meaningless requests.
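What happens next can be modeled with a few lines of Python (assuming, as above, a server that 301-redirects to the re-encoded URL): a client that decodes the Location header before re-requesting will trigger the same redirect forever.

```python
from urllib.parse import quote, unquote

def canonical_path(path: str) -> str:
    # Same canonicalization as the server performs.
    return quote(unquote(path), safe="/")

requested = "/blog/tag/guns'n'roses/"
hops = 0
while canonical_path(requested) != requested and hops < 5:
    location = canonical_path(requested)  # server: 301 with encoded Location
    requested = unquote(location)         # buggy client: decodes it again
    hops += 1
# hops hits the cap: the redirect chain never converges
```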
Some further googling helped establish that Google's search bot has a special fondness for the following characters:
, @ ~ * ( ) ! $ '
and does not convert them to the corresponding percent codes:
%2C %40 %7E %2A %28 %29 %21 %24 %27
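As a sanity check, each code in the second list is simply the percent-encoded ASCII value of the corresponding character in the first:

```python
# Pair up the two lists given above and verify each code is the
# percent-encoded ASCII value of its character.
chars = ", @ ~ * ( ) ! $ '".split()
codes = "%2C %40 %7E %2A %28 %29 %21 %24 %27".split()
for ch, code in zip(chars, codes):
    # e.g. ord(',') == 0x2C, so the expected code is "%2C"
    assert code == "%%%02X" % ord(ch)
```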
which can easily fray the nerves and eat up the time of already long-suffering webmasters :)