The relatively new Sitemaps (sitemap.xml) protocol has been readily adopted by webmasters around the world as one of the best ways to help crawlers navigate a website. However, as useful as it is to search engines, the protocol has also helped content thieves and data miners by handing them exactly the data they need to scrape a site.
Though I am not sure how many scrapers currently use the sitemap as a form of site navigation, I can only assume the number is growing. With copyrighted material being stolen daily, having a sitemap.xml readily accessible to any web crawler may not be the best idea. Worst of all, the agreed convention of adding the sitemap link to robots.txt means scrapers now have a universal way of finding the sitemap.xml file.
Before the convention of placing the sitemap.xml link within robots.txt, the sitemap file could be hidden anywhere on your website, away from the view of unwanted scrapers and web bots. Since the introduction of a universal location for the sitemap file, however, a lot of concern has been raised about the potential abuse of sitemap.xml by scrapers and content thieves.
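To see why the universal location is such a gift to scrapers, here is a minimal sketch of how a bot might pull sitemap URLs out of a robots.txt it has fetched. The function name and the sample robots.txt are my own illustration, not any particular scraper's code:

```python
# Sketch: extracting Sitemap: lines from a robots.txt body.
# Any bot that fetches /robots.txt can do this in a few lines.

def find_sitemaps(robots_txt: str) -> list[str]:
    """Return the URLs listed on Sitemap: lines in a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # The Sitemap directive is matched case-insensitively.
        if line.lower().startswith("sitemap:"):
            # Split only on the first colon so the URL's "https:" survives.
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps

sample = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)
print(find_sitemaps(sample))  # ['https://example.com/sitemap.xml']
```

Nothing here is clever; that is exactly the point. Once the sitemap's location is advertised in a well-known file, discovering it costs a scraper essentially nothing.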
Luckily, I have thought of several ways to stop scrapers from taking your content (which you worked so damn hard for; I would know, as my posts take me a very long time to write). Below are the ones worth mentioning:
- Don’t put your sitemap.xml link within your robots.txt
- Keep your sitemap in a compressed format; chances are some scrapers won't bother to decompress it (I think?)
- Do not keep your sitemap in the most obvious directory; most people place their sitemaps in the root directory. Yes, I did as well.
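The second and third points above can be sketched in a few lines: write the sitemap gzipped (the Sitemaps protocol accepts `sitemap.xml.gz`) and put it somewhere other than the web root, under a non-guessable name. The directory and filename below are made-up examples, not a recommendation of any particular path:

```python
# Sketch: a gzip-compressed sitemap stored outside the web root.
import gzip
import os

sitemap_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
</urlset>
"""

# Hypothetical obscure directory and non-guessable filename.
os.makedirs("static/meta", exist_ok=True)
path = "static/meta/s-4f2a.xml.gz"

with gzip.open(path, "wb") as f:
    f.write(sitemap_xml)
```

You would then submit that URL directly through each search engine's webmaster console instead of advertising it in robots.txt, so the legitimate crawlers still find it but nothing in a public file points at it.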
Though I can’t guarantee the above methods will work, I am certain that taking these steps would lower your chances of sitemap abuse. If you have a more definite method of combating sitemap abuse, send me an email and let me know.