We had been trying for a month to figure out why our pages had disappeared from the Google index. The best way to get an answer turns out to be asking a question rather than guessing. I finally got fed up with the issue, got contact information for someone at Google we had been talking to, and asked them directly why our pages had disappeared from their index. This is what our problem was. It seems that Google runs quality checks on the pages it indexes to help ensure that the results they provide are what their users want, and our site failed for the reasons below.
- There was spam cloaking. By that they meant that the pages we displayed to search engines were not the same pages we displayed to users. Most of our articles require a subscription, but we allow search engines access to the full content so that they can index our site. When a normal user tried to access content that required a subscription, we gave them a 401. The suggestion was that we give them an abstract page for the same content instead. I tend to agree with them here and think this is a much better way of handling it. We just need to add a message on the abstract page to let the user know why they were redirected.
- Our site quality has not been the best of late. Since we moved our site to a new location, we have had a myriad of problems that caused some downtime. One of the problems mentioned was that requests for some URLs were timing out. These URLs should have returned 401 errors to users without subscriptions, but they were taking forever to do so. The cause is described here, if you are interested.
- Redirecting all 404s to a sitemap seems to be a no-no, and we were doing this. Having the sitemap as the 404 page itself, instead of redirecting to the sitemap, would probably be a better approach. That way we could send 404 as the response code and the sitemap as the content of the page.
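
To make the first and last points concrete, here is a rough Python sketch of the response logic we have in mind. It is not our actual code: the `ARTICLES` store, the `SITEMAP_HTML` and `NOTICE` strings, the `respond` function, and the `X-Subscriber` header check are all made up for illustration, and a real site would look up content and subscriptions properly. It only shows two ideas: non-subscribers get the abstract with an explanation instead of a 401, and unknown URLs get a real 404 status with the sitemap as the page body instead of a redirect.

```python
from http import HTTPStatus

# Hypothetical in-memory content store; a real site would query its CMS.
ARTICLES = {
    "/articles/widgets": {
        "abstract": "A short summary of the widgets article.",
        "full": "The full, subscriber-only text of the widgets article.",
    },
}

# Hypothetical sitemap markup, used as the body of the 404 page.
SITEMAP_HTML = (
    "<h1>Page not found</h1>"
    "<ul><li><a href='/articles/widgets'>Widgets</a></li></ul>"
)

# Hypothetical message shown above the abstract.
NOTICE = "This article requires a subscription, so you are seeing the abstract."


def respond(path, is_subscriber):
    """Return (status, body) for a request to `path`."""
    article = ARTICLES.get(path)
    if article is None:
        # Unknown URL: keep the 404 status, but serve the sitemap as the
        # body rather than redirecting to it.
        return HTTPStatus.NOT_FOUND, SITEMAP_HTML
    if is_subscriber:
        return HTTPStatus.OK, article["full"]
    # Non-subscriber: serve the abstract plus a notice instead of a 401.
    return HTTPStatus.OK, NOTICE + "\n\n" + article["abstract"]


def app(environ, start_response):
    """Minimal WSGI wrapper around respond()."""
    # Assumption: a real site would validate a session, not trust a header.
    is_subscriber = environ.get("HTTP_X_SUBSCRIBER") == "yes"
    status, body = respond(environ.get("PATH_INFO", "/"), is_subscriber)
    start_response(
        f"{status.value} {status.phrase}",
        [("Content-Type", "text/html; charset=utf-8")],
    )
    return [body.encode("utf-8")]
```

The point of keeping `respond` separate from the WSGI wrapper is that the status code and the body are decided in one place, so it is easy to check that a missing page never turns into a redirect.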