bob_buzzard

Intermittent 401 errors when google crawls a public site

Hi all,

We have a public site where all pages are available to the guest user profile. When accessing the site through a browser, we've never seen an authorization required page. However, when Google's bots attempt to crawl the site, we get intermittent 401 Authorization Required responses.

 

I've tried hitting the site with the seoconsultants check server headers tool, and I get intermittent 401 and 302 responses. The 302 I've just received says the page has moved to: cust_maint/site_down/maintenance.html. Web analyzer gives the same results, yet Fiddler shows that the browsers are receiving nothing but 200 responses.

 

Has anyone else seen behaviour like this? It's proving difficult to track down, as every time I raise a case with support they close it, saying they don't support the Google client. However, that's not the issue here - I need to understand why the Salesforce server is returning the responses that it is.

Best Answer chosen by Admin (Salesforce Developers) 
bob_buzzard

So I finally got to the bottom of this and I'm posting to hopefully save others some pain.

 

It turned out to be a bug in my code that determines which browser the user is accessing the site with, by processing the User-Agent header.

 

Google doesn't provide a User-Agent header when crawling, so my code threw a null pointer exception. This appears to cause Salesforce to carry out a server-side redirect to a standard platform error page, which requires user login. Unfortunately, from the client's perspective it just looks like the page you tried to access required authentication.
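For what it's worth, the fix is just a guard before touching the header (in Apex the headers typically come back from ApexPages.currentPage().getHeaders() as a map, so a missing header is a null value). A minimal Python sketch of the pattern - not the actual Apex, and the browser names are just illustrative:

```python
# Sketch of browser detection that survives a request with no User-Agent
# header, such as the crawler requests described above.
def detect_browser(headers):
    ua = headers.get("User-Agent")  # None when the header is absent
    if ua is None:
        return "unknown"  # the guard whose absence caused the null pointer exception
    if "Firefox" in ua:
        return "firefox"
    if "Chrome" in ua:
        return "chrome"
    return "other"

print(detect_browser({}))  # a crawler request with no User-Agent: "unknown"
print(detect_browser({"User-Agent": "Mozilla/5.0 Firefox/3.6"}))  # "firefox"
```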

 

I managed to track this down by accessing the site through telnet and issuing HTTP requests by hand from the keyboard - a fine way to lose a morning!
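If anyone wants to repeat the telnet experiment, the trick is simply to omit the User-Agent line from the request. A small Python sketch that builds (but, to stay self-contained, doesn't send) such a request, using this thread's site as the host:

```python
# Build the raw HTTP/1.1 request you would type into telnet, deliberately
# omitting the User-Agent header to mimic the crawler requests discussed above.
host = "ecohomesquad.force.com"
request = (
    "GET / HTTP/1.1\r\n"
    "Host: " + host + "\r\n"
    "Connection: close\r\n"
    "\r\n"
)
print(request)
# To send it for real:  telnet ecohomesquad.force.com 80  and paste the lines,
# or in Python: socket.create_connection((host, 80)).sendall(request.encode())
```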

All Answers

Ryan-Guest

The 302 redirects you are getting to cust_maint/site_down/maintenance.html mean that the Salesforce instance was down for maintenance when the page was requested.

bob_buzzard

Hi Ryan,

 

Thanks for the reply.  Unfortunately this doesn't stack up with what I'm seeing.

 

For example, this morning I've run a number of requests to check server response headers from seoconsultants (http://www.seoconsultants.com/tools/headers), and each and every time I get the following response:

 

 

#1 Server Response: http://ecohomesquad.force.com
HTTP/1.1 302 Moved Temporarily
Server: AkamaiGHost
Content-Length: 0
Location: http://ecohomesquad.force.com/cust_maint//site_down/maintenance.html
Date: Tue, 01 Mar 2011 08:41:32 GMT
Connection: keep-alive

 

 

However, in between trying this I am carrying out hard refreshes from my browser with Fiddler enabled, and I see nothing but 200 responses, which indicates to me that the site isn't down.

 

If I try to access the site as a Google bot, using http://www.avivadirectory.com/bethebot/, I get a 401 response and am taken to the login page. Again, I've never seen this when accessing the site from a regular browser. The only difference that I can see is the User-Agent header. Do Salesforce sites return different responses based on the browser header?
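Rather than relying on the bethebot tool, the same test can be made by sending Googlebot's published User-Agent string yourself. A sketch using Python's urllib; the actual fetch is left commented out so the snippet stands alone:

```python
import urllib.request

# Prepare a request that identifies itself as Googlebot; comparing its
# response with a normal browser request isolates the User-Agent header
# as the only difference between the two.
req = urllib.request.Request(
    "http://ecohomesquad.force.com/",
    headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
)
print(req.get_header("User-agent"))  # urllib stores header names capitalised
# with urllib.request.urlopen(req) as resp:   # uncomment to actually fetch
#     print(resp.status, resp.reason)
```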

 

I'm at a loss as to how to proceed on this one - the platform appears to be returning incorrect responses, yet there's nothing I can do to influence those responses.

bblay

I'm not an expert at all in this, but would it have anything to do with the robots.txt file? I just know Google doesn't pick up a Salesforce site until you actually create a robots.txt file, since Salesforce blocks crawlers by default.

Patrick Dixon

Are you sure the bot is using exactly the same address as your browser?

 

www.ecohomesquad.com doesn't seem to take you to the same page as http://ecohomesquad.force.com (the nav is broken on the former), and I've found that Force.com Sites don't seem to follow CNAME aliases completely - you can end up dumped at a login page if it's not a direct CNAME.

Nathan @Ho

The robots.txt file on your site (ecohomesquad.force.com/robots.txt) is not allowing Googlebot to crawl the site. You need to go into Sites in Setup and load a new robots.txt file.

 

It would look something like this if you want all of the pages to be crawlable:

 

<apex:page contentType="text/plain">
User-agent: *
Allow: /

</apex:page>

bob_buzzardbob_buzzard

My robots.txt is set up correctly.  

 

This wouldn't account for 401 errors, as those are returned by the web server. Google inspects the robots.txt file and then decides whether it should crawl - if it shouldn't, it logs that as the reason it didn't carry out the crawl.
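That decision sequence can be seen with a stock robots.txt parser. A sketch using Python's urllib.robotparser, fed the permissive rules from the Visualforce page earlier in the thread rather than fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# The rules served by the permissive robots.txt page shown earlier in the thread.
allow_all = RobotFileParser()
allow_all.parse("User-agent: *\nAllow: /\n".splitlines())
print(allow_all.can_fetch("Googlebot", "http://ecohomesquad.force.com/"))  # True

# A blocking file, by contrast, stops the crawl before any HTTP status is seen,
# which is why robots.txt cannot account for a 401 from the web server itself.
deny_all = RobotFileParser()
deny_all.parse("User-agent: *\nDisallow: /\n".splitlines())
print(deny_all.can_fetch("Googlebot", "http://ecohomesquad.force.com/"))  # False
```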

Ramesh Somalagari
I have the same problem - it asks for "Authorization Required". Please see this URL: https://developer.salesforce.com/forums/#!/feedtype=SINGLE_QUESTION_DETAIL&dc=Mobile&criteria=ALLQUESTIONS&id=906F0000000AZYGIA4