Error when crawling a website

Unfortunately, this sometimes happens. There are pages or even entire websites that our crawler cannot access for various technical reasons, most often due to website rules or limitations.

If the crawler gets stuck on a certain page and you encounter an error, try excluding that path in the crawler settings.

In rare cases, when the error prevents crawling a website entirely, we suggest importing the sitemap.xml file instead, which is usually a valid workaround for building a sitemap. Either enter the domain URL or provide a direct URL to the sitemap.xml file, which is typically located at www.yourdomainname.com/sitemap.xml, though the exact location may vary.
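
If you are unsure whether your site publishes a sitemap.xml at all, you can fetch and parse it yourself before importing it. The sketch below is a minimal check using only the Python standard library; the domain is a placeholder and should be replaced with your own.

```python
# Minimal sketch: verify that a sitemap.xml exists and is readable.
# "www.yourdomainname.com" is a placeholder; use your own domain.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.yourdomainname.com/sitemap.xml"  # adjust if your sitemap lives elsewhere

with urllib.request.urlopen(SITEMAP_URL, timeout=30) as response:
    xml_data = response.read()

# Sitemap files use the sitemaps.org namespace; <loc> elements hold the page URLs.
root = ET.fromstring(xml_data)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]

print(f"Found {len(urls)} URLs in the sitemap")
for url in urls[:10]:  # print a small sample
    print(url)
```

If the request returns a 404 or the parse step fails, your site most likely does not expose a sitemap at that address; its path is sometimes listed under a Sitemap: line in robots.txt instead.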

Possible reasons for errors or missing pages:

  1. We respect the robots.txt policy. The file at http://www.yourdomainname.com/robots.txt may disallow crawling the entire website or certain directories (you can check this with the sketch after this list);
  2. We respect the robots nofollow meta tag. If a webpage has this tag set, our crawler won't follow the restricted links;
  3. Some websites have IP range restrictions that prevent our crawler from accessing the website;
  4. Broken link. If a link returns a 404 error, it is omitted when generating a project;
  5. Redirect. If a link redirects, it is omitted when generating a project;
  6. Timeout. To avoid infinite loops and overloading, we set a 40-minute timeout for each crawling process; once it is reached, the visual sitemap project is created and any pages not yet crawled are left out.
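
Several of these reasons (the robots.txt policy, the nofollow meta tag, broken links and redirects) can be verified for a single problem page before re-running the crawl. The sketch below is a minimal diagnostic using only the Python standard library; the page URL and user agents are placeholders, and it is not our crawler's actual logic.

```python
# Minimal pre-crawl diagnostic for one page, covering reasons 1, 2, 4 and 5
# from the list above. PAGE_URL, ROBOTS_URL and the user agents are placeholders.
import urllib.error
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser

PAGE_URL = "https://www.yourdomainname.com/some-page"    # page that failed to crawl
ROBOTS_URL = "https://www.yourdomainname.com/robots.txt"
USER_AGENT = "*"  # generic agent; real crawlers identify themselves by name

# Reason 1: is this path disallowed by robots.txt?
robots = urllib.robotparser.RobotFileParser()
robots.set_url(ROBOTS_URL)
robots.read()
print("Allowed by robots.txt:", robots.can_fetch(USER_AGENT, PAGE_URL))

# Reasons 4 and 5: a 404 or a redirect means the page is omitted from the project.
request = urllib.request.Request(PAGE_URL, headers={"User-Agent": "Mozilla/5.0"})
html = ""
try:
    with urllib.request.urlopen(request, timeout=30) as response:
        print("Status code:", response.status)
        if response.geturl() != PAGE_URL:
            print("Redirected to:", response.geturl())
        html = response.read().decode("utf-8", errors="replace")
except urllib.error.HTTPError as err:
    print("HTTP error:", err.code)  # e.g. 404 for a broken link

# Reason 2: look for <meta name="robots" content="... nofollow ...">.
class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.nofollow = self.nofollow or "nofollow" in attrs.get("content", "").lower()

parser = RobotsMetaParser()
parser.feed(html)
print("Page sets robots nofollow:", parser.nofollow)
```

Running this against the page where the crawl stopped usually narrows the cause down to one of the reasons above; IP range restrictions and timeouts are harder to confirm from outside and may require checking with your hosting provider.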