Saturday, March 16, 2013

Google Indexing Non-Existent URLs. WordPress Doesn't Show 404


I was inspecting the Google search results for "site:mywordpress.org" and found lots of indexed pages that shouldn't exist.
There are two problems here:
  1. I don't know how Google located, crawled, or found these URLs.
  2. WordPress doesn't return a 404 error for them, so they look like duplicate content.
I tried the WordPress support forums, but no one responded. I also haven't been able to find anyone else reporting this problem. Here's an example of what I am seeing:
mywordpress.org/blog-post/
mywordpress.org/blog-post/1363035032000/
I've added a canonical link reference to the head and I've been submitting lots of removal requests in Google Webmaster Tools (WMT), but I'm still seeing some results like this.
I've tested this on a few WordPress installs. It seems that if you append any string of numbers to the end of a permalink, WordPress still displays the content rather than returning a 404 error.
I also noticed that the number being added to the permalinks looks like a UNIX timestamp with three zeros on the end (1363035032000 in the example above). As of this post, the current UNIX timestamp is 1363035971.
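One way to test the timestamp theory is to treat the suffix as a UNIX timestamp in milliseconds (the unit JavaScript's Date.now() returns, for example) and see whether it decodes to a plausible date. The sketch below is only an illustration of that check; the function name decode_suffix is made up for this example, and a plausible date does not prove where the number came from.

```python
from datetime import datetime, timezone


def decode_suffix(suffix: str) -> str:
    """Interpret a numeric URL suffix as milliseconds since the UNIX epoch.

    Assumption: the three trailing zeros mean the value is in milliseconds,
    so dropping them gives a timestamp in seconds.
    """
    millis = int(suffix)
    seconds = millis // 1000  # drop the millisecond part
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.isoformat()


# The suffix from the example URL decodes to a moment in March 2013.
print(decode_suffix("1363035032000"))  # 2013-03-11T20:50:32+00:00

# Gap, in seconds, between that moment and the timestamp quoted in the post.
print(1363035971 - 1363035032000 // 1000)  # 939
```

If the theory holds, the indexed URL was generated about 16 minutes before the timestamp quoted above, which would point to something stamping the current time onto the permalink rather than a random number.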
I'm looking for some advice on what I should do. I'm particularly interested in a PHP function that would check the URL for a string of numbers at the end and, if it finds one, 301 redirect to the correct permalink. I'd also value any input on why Google is finding these wrong URLs and whether the UNIX timestamp is the clue.
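The check itself is small enough to sketch. The version below is written in Python rather than PHP, purely to show the logic, and everything in it (the function name, the 10-digit threshold) is an assumption, not an existing WordPress API. The threshold is there because WordPress uses a short trailing number for the pages of a multi-page post (for example /blog-post/2/), so stripping every numeric suffix would likely break those.

```python
import re
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumption: only runs of 10 or more digits are treated as stray timestamps,
# so short page numbers such as /blog-post/2/ are left alone.
TRAILING_TIMESTAMP = re.compile(r"/[0-9]{10,}/?$")


def redirect_target(url: str) -> Optional[str]:
    """Return the permalink to 301 redirect to, or None if the URL is clean."""
    parts = urlsplit(url)
    cleaned_path, replaced = TRAILING_TIMESTAMP.subn("/", parts.path)
    if replaced == 0:
        return None  # nothing to strip, serve the page as usual
    # Rebuild the URL with the cleaned path; any query string is preserved.
    return urlunsplit(parts._replace(path=cleaned_path))


print(redirect_target("http://mywordpress.org/blog-post/1363035032000/"))
# http://mywordpress.org/blog-post/
print(redirect_target("http://mywordpress.org/blog-post/"))
# None
```

In a WordPress install the same test could presumably run in a function attached to the template_redirect hook, calling wp_redirect() with a 301 status when a target is returned. I have not tested that port.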
