Error when crawling a website
Unfortunately, this sometimes happens. There are pages or even entire websites that our crawler cannot access for various technical reasons, most often due to website rules or limitations.
Recommendations:
- If the crawler gets stuck on a certain page and then returns an error, try excluding that page's path in the crawler settings.
- If you are getting an error while crawling a very large website, select 'Run in background'. This unlinks the crawling process from your browser tab, which may be the source of the error.
In rare cases, when the error prevents the website from being crawled entirely, we suggest importing the sitemap.xml file instead, which is usually a valid workaround for building a sitemap. Either enter the domain URL or provide a direct URL to the sitemap.xml file, which is typically located at www.yourdomainname.com/sitemap.xml, though its location may vary in some cases.
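If you are not sure whether your website publishes a sitemap.xml at the default location, a quick check like the minimal sketch below can confirm it before you try the import. The domain is a placeholder, the sketch uses only Python's standard library, and it is an illustration rather than part of our crawler.

```python
# Minimal sketch: check whether a site publishes a sitemap.xml at the
# conventional location. "www.yourdomainname.com" is a placeholder domain;
# this is only an illustration, not part of our crawler.
import urllib.request

SITEMAP_URL = "https://www.yourdomainname.com/sitemap.xml"


def sitemap_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError and timeouts are all OSError subclasses
        return False


if __name__ == "__main__":
    if sitemap_is_reachable(SITEMAP_URL):
        print("sitemap.xml found: you can import it directly.")
    else:
        print("No sitemap.xml at the default location; it may live under another path.")
```

If nothing is found at the default location, the sitemap may simply be published under a different path, as noted above.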
Possible reasons for errors or missing pages:
- We respect the robots.txt policy. The file at http://www.yourdomainname.com/robots.txt may disallow crawling of the entire website or of certain directories (see the sketch after this list);
- We respect the robots nofollow meta tag. If a webpage has this tag enabled, our crawler won't follow the restricted links;
- Some websites may have IP range restrictions that prevent our crawler from accessing the website;
- Broken link. If a link returns a 404 error, it is omitted when generating a project;
- Redirect. If a link has a redirect, it is omitted when generating a project;
- Timeout. To avoid infinite loops and overload, we set a 40-minute timeout for each crawling process. When the timeout is reached, the visual sitemap project is created from the pages crawled up to that point, so some pages may be missing.
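If a page is unexpectedly missing from the generated project, you can roughly reproduce the robots.txt, broken-link, and redirect checks above yourself. The sketch below is only an illustration using Python's standard library, not our crawler's actual code; the domain, page path, and user-agent value are placeholders.

```python
# Rough local check, not our crawler's actual code: it tests whether a URL
# is allowed by robots.txt and whether it answers with a 404 or a redirect.
# The domain, page path, and user-agent value below are placeholders.
import urllib.error
import urllib.request
import urllib.robotparser

SITE = "https://www.yourdomainname.com"
PAGE = SITE + "/some/missing/page/"   # a page that did not appear in the project
USER_AGENT = "*"                      # generic agent; real crawlers send their own name


def allowed_by_robots(url: str) -> bool:
    """Check the site's robots.txt the way a well-behaved crawler would."""
    parser = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
    parser.read()
    return parser.can_fetch(USER_AGENT, url)


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following redirects so a 3xx stays visible."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the redirect code


def fetch_status(url: str) -> int:
    """Return the HTTP status of the URL itself: 200, 404, 301/302, etc."""
    request = urllib.request.Request(url, method="HEAD")
    opener = urllib.request.build_opener(NoRedirect())
    try:
        with opener.open(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 404 = broken link, 301/302 = redirect


if __name__ == "__main__":
    print("Allowed by robots.txt:", allowed_by_robots(PAGE))
    print("HTTP status:", fetch_status(PAGE))
```

If the script reports that the page is disallowed or returns a status other than 200, that is the reason the page was skipped when the project was generated.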