Website Crawling
- What is the website root URL?
- What HTTP protocols are supported?
- Will the limits specified in robots.txt be taken into account when crawling?
- What are hidden pages (Deep Web) and how to include them in a Sitemap?
- What will happen to the nofollow links?
- How will the web pages that use robots meta tag or X-Robots-Tag HTTP header be processed?
- How does the crawler process intrahost server redirects?
- Does the crawler handle canonical links (rel=canonical)?
- Does the crawler handle AJAX links (hashbang)?
- Processing and removing PHPSESSID and SessionID (session identifiers in PHP and ASP applications)
- What data is contained in the error report generated after crawling the website?
- Can I stop the website crawling before it is finished?
- Choosing the optimal crawl speed and load on your web server
- How to simulate crawls by search engine robots?
What is the website root URL?
The root URL is the base address for accessing your domain on the web server. It consists of two required components - a protocol scheme (usually https://) and a domain name (e.g. website.tld). Root URL examples: https://website.tld, http://subdomain.website.tld.
We support all existing domain types: top-level domains (TLDs), country-specific domains (ccTLDs), and subdomains of any level. We also support Internationalized Domain Names (IDNs) for most languages, including Arabic, Indian and Cyrillic domains. Note that you don't need to convert your hostname to Punycode; just enter the original URL in your language.
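For example (an illustrative domain name), a Cyrillic root URL can be entered directly in its native script:
https://пример.рф
rather than in its Punycode form (https://xn--e1afmkfd.xn--p1ai).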
What HTTP protocols are supported?
Mysitemapgenerator supports HTTP and HTTPS.
Please note that, according to the XML Sitemaps protocol specification, site crawling and Sitemap generation are carried out only for the protocol specified in the root URL.
Will the limits specified in robots.txt be taken into account when crawling?
This is optional, but enabled by default. If this option is ticked, our bot will follow the Allow and Disallow rules in the general User-agent section (User-agent: *).
Bot-specific User-agent sections (for example, for Googlebot or Yandex) are taken into account when the corresponding search-bot crawler mode is selected.
In addition, you may create a separate section specifically for Mysitemapgenerator:
User-agent: Mysitemapgenerator
Below is an example of a robots.txt file:
#No robots should visit any URL starting with /noindex-directory/
User-agent: *
Disallow: /noindex-directory/

#Google does not need to visit a specific URL
User-agent: Googlebot
Disallow: /noindex-directory/disallow-google.html

#Yandex does not need to visit URLs starting with /noindex-directory/
#but is allowed to index a specific page
User-agent: Yandex
Disallow: /noindex-directory/
Allow: /noindex-directory/allow-yandex.html

#Mysitemapgenerator does not need to visit URLs starting with /noindex-directory/
#but is allowed to index pages with a specific extension
User-agent: Mysitemapgenerator
Disallow: /noindex-directory/
Allow: /noindex-directory/*.html
What are hidden pages (Deep Web) and how to include them in a Sitemap?
The Deep Web (also called the Deepnet, the Invisible Web, the Undernet or the Hidden Web) consists of web pages that are not indexed by search engines because no hyperlinks lead to them from accessible pages. Examples include pages reachable only through HTML form interfaces or frame content.
If you wish to discover and include such pages into Sitemap, please tick the appropriate options:
- "Crawl html forms" (submit occurs without filling the form);
- "Crawl framed content" (contents of <frameset> θ <iframe>).
What will happen to the nofollow links?
If this option is enabled (it is on by default), such links will not be followed.
Additionally, if needed, you may ignore only noindex pages (pages marked as noindex) or only nofollow links, independently of each other.
Nofollow link types (see the examples below):
- HTML links containing the rel="nofollow" attribute
- URLs that are disallowed in the robots.txt file
- Links located on a web page tagged with the nofollow robots meta tag
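For reference, illustrative examples of the first and third types (the URL is hypothetical):
<a href="http://website.tld/page.html" rel="nofollow">Link text</a>
<meta name="robots" content="nofollow">
The meta tag is placed in the page's <head> section and marks all links on that page as nofollow.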
How does the crawler process intrahost server redirects?
The crawler recognizes the following standard HTTP redirect status codes (an example response is shown below):
- 301 Moved Permanently
- 302 Found
- 303 See Other
- 307 Temporary Redirect
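For illustration (the URL is hypothetical), a permanent redirect returned by the server looks like this; the crawler follows the Location header to the target page:
HTTP/1.1 301 Moved Permanently
Location: https://website.tld/new-page.html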
Does the crawler handle canonical links (rel=canonical)?
Yes, for this it is enough to mark the corresponding option "Follow & consolidate canonical URLs". When the appropriate option is activated, the crawler will take into account the instructions of canonical links, and noncanonical links will be excluded from crawl results.
Our crawler handles the instructions in the HTML code, as well as the HTTP headers in the same way. An example of specifying a canonical link in HTML (placed in the <head> section of the noncanonical version of the page):
Canonical meta tag example:
<link rel="canonical" href="http://www.website.tld/canonical_page.html"/>
Link: <http://www.website.tld/canonical_page.html>; rel="canonical"
Does the crawler handle AJAX links (hashbang)?
Yes, for this it is enough to mark the corresponding option "index AJAX links".
Each indexed AJAX link should have an HTML version, which is available at the address, wgere the combination of "#!" is replaced by the "?_escaped_fragment_=" parameter.
In AJAX links crawler replaces the #! combintion to the ?_escaped_fragment_= parameter and accesses the page by the modified URL.
Links containing hashbang (#!) are used in original form when creating a Sitemap.
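For example (a hypothetical URL), an AJAX link such as:
http://website.tld/#!/catalog/page
is requested by the crawler as:
http://website.tld/?_escaped_fragment_=/catalog/page
while the original hashbang URL is the one written to the Sitemap.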
Processing and removing PHPSESSID and SessionID (session identifiers in PHP and ASP applications)
While your website is being crawled, it may generate session IDs. Our crawler detects and removes session identifiers: all links are written to the Sitemap file "clean", without identifiers passed in the URL - PHPSESSID (for PHP) or SessionID objects (for ASP). This helps avoid duplicate links in the Sitemap when the bot receives the same page under different URLs.
An example of a session identifier in a PHP URL:
http://website.tld/page.html?PHPSESSID=123456session6789
An example of a session identifier in an ASP URL:
http://website.tld/(S(123456session6789))/page.html
The same URL as it will appear in the Sitemap, with the session identifier removed:
http://website.tld/page.html
What data is contained in the error report generated after crawling the website?
If our crawler encounters difficulties or obstacles while crawling your website, a detailed report will be created. In the report you will see grouped lists of pages describing the errors found, such as "Page not found", internal server errors, etc.
In addition to errors, the report contains information about all detected server redirects.
Error reports are available in paid versions.
Can I stop the website crawling before it is finished?
This option is available to registered users. Your personal account displays information about all the files you have created, as well as about the websites that are currently being indexed. To interrupt indexing without waiting for the crawler to scan the entire website, click the "Stop" button. In this case, you will receive a file generated only from the pages that had been indexed at the time of the stop.
Choosing the optimal crawl speed and load on your web server
The crawler options offer three crawl speed levels, which create correspondingly different loads on the server being indexed:
- Maximum - this level is used by default. If you have quality paid hosting, you most likely do not need to worry about the load created while crawling your site. We recommend this level, as it allows the crawler to index your website at top speed.
- Average - choose this level if your server requires a gentler indexing mode.
- Low - allows the crawler to scan your site while creating a minimum load on the server. This level is recommended for websites located on free hosting or for sites that require a limited flow of traffic. Note, however, that it slows down the crawling of your site.
How to simulate crawls by search engine robots?
You may choose one of the following identification options for our web crawler (search engine bot) that crawls your website:
- Standard browser - the crawler uses this option by default, and it is the recommended one. Your website will load the same way your regular visitors see it.
- YandexBot - crawls your website the way the Yandex search bot sees it. Our crawler will identify itself as the main Yandex bot (YandexBot/3.0).
- Googlebot - crawls your website the way the Google search bot sees it. The crawler will identify itself as Google's web search bot (Googlebot/2.1).
- Baiduspider - Baidu's web search bot.
- Mysitemapgenerator - use direct identification of our crawler if you need separate control settings and the ability to manage website access.
- When the YandexBot, Googlebot, Baiduspider or Mysitemapgenerator option is chosen, only the instructions for that particular robot are considered (User-agent: Yandex, User-agent: Googlebot, User-agent: Baiduspider or User-agent: Mysitemapgenerator, respectively). The general User-agent: * section is used only when a bot-specific section is missing.
- If you are using Standard browser or Mysitemapgenerator, the crawler will consider only the instructions in the Mysitemapgenerator section or in the general User-agent: * section. Bot-specific sections such as User-agent: Yandex or User-agent: Googlebot are not considered.