Sitemaps FAQs

Watch our video overview

What are the requirements for the domain homepage so that the crawler can index my website?
The page must be accessible, return HTML, and contain links to internal pages. The crawler scans the website by following the links it finds, starting from the homepage. The homepage may also return an intra-host server redirect, which will be processed.
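For illustration, a minimal homepage that meets these requirements could look like this (the URLs are placeholders):
    <html>
      <head><title>My Website</title></head>
      <body>
        <a href="http://mydomain.tld/about.html">About</a>
        <a href="http://mydomain.tld/products.html">Products</a>
      </body>
    </html>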
What types of website root URLs are accepted?
The service only accepts for processing URLs that represent actual domain names:
  • Generic top-level domains (gTLD)
  • Internationalized domain names (IDN)
  • Country-code top-level domains (ccTLD)

Our service can also crawl Arabic, Indic, and Cyrillic domains.
URL Examples:
  • http://mydomain.tld
  • http://punycode.cctld
What HTTP protocols are supported?
Mysitemapgenerator supports HTTP and HTTPS.
Please note that, according to the XML Sitemaps protocol specification, site crawling and Sitemap generation are carried out only for the specified data transfer protocol.
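For example, if you submit an HTTPS root URL, only URLs served over the same protocol are crawled and included (the URLs below are placeholders):
    Submitted root: https://website.tld
    Included:       https://website.tld/page.html
    Not included:   http://website.tld/page.html (different protocol)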
What is the limit on the number of indexed pages in the Free Generator?
The free version of the generator can crawl up to 500 URLs.
Will the indexing limits specified in robots.txt be taken into account during indexing?
This is optional and enabled by default. If the option is ticked, the generator follows the Allow and Disallow rules in the general User-agent: * section.
"Personal" sections such as User-agent: Googlebot or User-agent: Yandex are taken into account when the corresponding crawler identification (search-bot simulation) is selected.
In addition, you may create a separate section specifically for Mysitemapgenerator:
    User-agent: Mysitemapgenerator

Below is an example of a robots.txt file:
    #No robots should visit any URL starting with /noindex-directory/
    User-agent: *
    Disallow: /noindex-directory/
    #Google does not need to visit a specific URL
    User-agent: Googlebot
    Disallow: /noindex-directory/disallow-google.html
    #Yandex does not need to visit URLs starting with /noindex-directory/
    #but is allowed to index one specific page
    User-agent: Yandex
    Disallow: /noindex-directory/
    Allow: /noindex-directory/allow-yandex.html
    #Mysitemapgenerator does not need to visit URLs starting with /noindex-directory/
    #but is allowed to index pages with a specific extension
    User-agent: Mysitemapgenerator
    Disallow: /noindex-directory/
    Allow: /noindex-directory/*.html
What are hidden pages (Deep Web) and how to include them in a Sitemap?
The Deep Web (also called the Deepnet, the Invisible Web, the Undernet, or the Hidden Web) consists of web pages that are not indexed by search engines because no hyperlinks lead to them from accessible pages, for example pages generated through HTML form interfaces or frame content.
If you wish to discover such pages and include them in the Sitemap, tick the appropriate options:
  • "Crawl html forms" (forms are submitted without being filled in);
  • "Crawl framed content" (contents of <frameset> and <iframe>).
What will happen to nofollow links?
If the option is on (enabled by default), such links will not be taken into account.
Additionally, if needed, you may apply only the noindex handling (pages marked as noindex) or only the nofollow handling, each independently of the other.
Nofollow link types:
  • HTML links containing the rel="nofollow" attribute
  • URLs that are disallowed in the robots.txt file
  • Links located on a webpage tagged with the nofollow robots meta tag
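For example, a nofollow link of the first type looks like this (the URL is a placeholder):
    <a href="http://website.tld/private.html" rel="nofollow">Private page</a>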
How will web pages that use the robots meta tag or X-Robots-Tag headers be processed?
If the corresponding options are active, pages are processed according to the robots meta values set (noindex, nofollow).
If necessary, you can always apply only the noindex or only the nofollow processing, independently of each other.
Tags intended for specific search crawlers (for example, Googlebot) are taken into account when the corresponding crawler identification (search-bot simulation) is selected.
You can also use meta tags addressed to our robot on your webpages; these are taken into account when direct identification of our crawler is selected (see the sketch after the examples below).
Examples of using robots meta tag:
    <meta name="robots" content="noindex" />
    <meta name="robots" content="nofollow" />
    <meta name="robots" content="noindex,nofollow" />

Examples of using X-Robots-Tag HTTP headers:
    X-Robots-Tag: noindex
    X-Robots-Tag: nofollow
    X-Robots-Tag: noindex, nofollow
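For reference, the X-Robots-Tag header is emitted by your web server configuration; for example, on Apache with mod_headers enabled:
    Header set X-Robots-Tag "noindex, nofollow"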
How does the crawler process intra-host server redirects?
The crawler identifies the following standard HTTP status codes:
  • 301 Moved Permanently
  • 302 Found
  • 303 See Other
  • 307 Temporary Redirect
If a page of your website contains a redirect to an address on the same domain name, the crawler will index the page specified in the redirect address.
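For example, a response of this form (the Location URL is a placeholder) causes the crawler to index the redirect target instead of the original address:
    HTTP/1.1 301 Moved Permanently
    Location: http://website.tld/new-page.html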
Does the crawler handle canonical links (rel=canonical)?
Yes, just tick the corresponding option "Follow & consolidate canonical URLs". When it is activated, the crawler takes canonical link instructions into account, and noncanonical URLs are excluded from the indexing results.
Our crawler handles these instructions in HTML code and in HTTP headers in the same way. An example of specifying a canonical link in HTML (placed in the <head> section of the noncanonical version of the page; the URL is a placeholder):
    <link rel="canonical" href="http://website.tld/canonical-page.html" />
An example of specifying a canonical link using an HTTP header (placeholder URL):
    Link: <http://website.tld/canonical-page.html>; rel="canonical"
Note one technical aspect of how our crawler processes canonical links: a reference to the canonical page is treated as equivalent to a server-side redirect (HTTP 303 See Other) and is processed in accordance with the general redirect-handling rules.
Does the crawler handle AJAX links (hashbang)?
Yes, just tick the corresponding option "index AJAX links".
Each indexed AJAX link should have an HTML version available at the address where the "#!" combination is replaced by the "?_escaped_fragment_=" parameter.
The crawler replaces the #! combination in AJAX links with the ?_escaped_fragment_= parameter and requests the page at the modified URL.
Links containing the hashbang (#!) are used in their original form when creating the Sitemap.
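For example (the URLs are placeholders):
    AJAX link:       http://website.tld/#!/products
    Requested URL:   http://website.tld/?_escaped_fragment_=/products
    In the Sitemap:  http://website.tld/#!/products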
Processing and deleting phpsessid and sessionID (session identifiers in PHP and ASP applications)
During indexing, your website may generate session IDs. Our crawler detects and removes session identifiers: all links are written to the Sitemap file "clean", without the identifiers passed in the URL - phpsessid (for PHP) or sessionID objects (for ASP). This helps avoid putting duplicate links into the Sitemap when the bot receives the same page under different URLs.

Example of a session identifier in a PHP URL (a placeholder):
    http://website.tld/index.php?PHPSESSID=a1b2c3d4e5f6a7b8
Finally, the URL will be transformed back to its basic form:
    http://website.tld/index.php
How are images added to the Sitemap?
The generator can gather information about images located on your website's pages and add it to the Sitemap file. For the URL entry of each page on which images are found, the corresponding information is added according to the Google Image Sitemap protocol.
The following example shows part of a Sitemap record for the URL http://website.tld/sample.html, which contains two images (the image URLs are placeholders):
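    <url>
      <loc>http://website.tld/sample.html</loc>
      <image:image>
        <image:loc>http://website.tld/photo.jpg</image:loc>
      </image:image>
      <image:image>
        <image:loc>http://website.tld/logo.png</image:loc>
      </image:image>
    </url>
The image: namespace is declared on the enclosing <urlset> element as xmlns:image="http://www.google.com/schemas/sitemap-image/1.1".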
How are multilingual pages indicated in the Sitemap?
Mysitemapgenerator can find localized page versions targeting different languages and/or countries.
Our crawler handles both HTML tags and HTTP headers.
An example of specifying an alternate page URL via the HTML link element (placed in the <head> section of a web page; the URL is a placeholder):
    <link rel="alternate" href="http://website.tld/en-gb/" hreflang="en-GB" />
An example of specifying an alternate page URL via an HTTP response header:
    Link: <http://website.tld/en-gb/>; rel="alternate"; hreflang="en-GB"
Supported values
The value of the hreflang attribute must contain the language code of the alternate URL in ISO 639-1 format and, optionally, a country code in ISO 3166-1 Alpha-2 format.
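For reference, localized alternates discovered this way are typically recorded in the Sitemap with the xhtml:link extension; a sketch (placeholder URLs, with xmlns:xhtml="http://www.w3.org/1999/xhtml" declared on the enclosing <urlset> element):
    <url>
      <loc>http://website.tld/</loc>
      <xhtml:link rel="alternate" hreflang="en-GB" href="http://website.tld/en-gb/" />
    </url>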
How does filtering of unsupported content work?
Unlike the free version, where the link availability check ends together with the indexing process (once 500 URLs have been found), in the paid version of the generator the check continues to the last link, even if indexing is already complete. This guarantees that redirects and dead links will not be included in the Sitemap.
Although including such links complies with the Sitemaps protocol and is not an error, the presence of, for example, redirecting links can trigger corresponding warnings in Google Webmaster Tools about non-direct links in the website map.
What data is contained in the error report generated after crawling the website?
If our crawler faces difficulties or obstacles while indexing your website, a detailed report is created. In the report you will see grouped lists of pages describing the errors, among them "Page not found", internal server errors, etc. Besides the errors, the report contains information about all detected server redirects.
Error reports are available in paid versions.
I have a very large website; what happens when the number of scanned pages exceeds the maximum allowed 50,000 URLs?
By default, a large Sitemap is split in accordance with the Sitemap protocol and search engine recommendations: you will get several Sitemap files, each containing no more than 50,000 URLs.
You may also choose the number of URLs per file yourself.
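Under the Sitemaps protocol, the split files are normally referenced from a Sitemap index file of this form (the file names are placeholders):
    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://website.tld/mysitemapfile1.xml</loc>
      </sitemap>
      <sitemap>
        <loc>http://website.tld/mysitemapfile2.xml</loc>
      </sitemap>
    </sitemapindex>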
How to use data filters?
A data filter is a convenient tool used during Sitemap creation that allows you to specify, along with the page URL, the following data important for search engines: the priority of particular pages relative to other pages of the website, and the update frequency.
Additionally, the filter allows you to exclude from indexing particular pages that are not needed in the Sitemap file.
Data filters can be applied either to individual pages (enter the full URI of the page) or to groups of pages (enter the part of the URL common to all such pages, for example ".jpg" or "/directory/files").
How does the "Get on email" function work?
We recommend using this function if you have a large website whose indexing may take a long time. With this option you don't have to wait for the crawler to finish its work; you can receive the results by email. The feature is available both in the paid version (you will get the ready* file at the specified email address) and in the free version of the generator (you will get a link to download the ready file from our server).

* If the total size of the created Sitemap files exceeds the allowed size, you will get a link to download them from our server.
For how many days are the created files available for download at the sent links?
The guaranteed storage time on our server is:
  • for files created with the free version - 7 days;
  • for files created with the paid version - 14 days.
How to check the website indexing status?
All registered users can view information about every indexing run, including websites currently being indexed, in their personal account.
What determines the speed of my website indexing?
Indexing speed depends on many dynamic factors, such as the responsiveness of your server and the size of the loaded pages, which is why it cannot be calculated in advance.
The structure of internal page linking also has a large impact on website crawl time.
Can I stop the website indexing process before it is finished?
This option is available to registered users. Your personal account displays information about all files you have created, as well as about websites currently being indexed. To interrupt the indexing process without waiting for the crawler to scan the entire website, click the "Stop" button. You will then receive a file generated only from the pages that had been indexed by the time of the stop.
How do I let search engines know about my Sitemap?
To do this, register your website in the webmaster services provided by search engines (for example, for Google or for Yandex). After registration, you will be able to submit Sitemaps directly in your account.
Another common way is to include the following line in robots.txt:
Sitemap: http://website.tld/mysitemapfile.xml
If you need to provide several Sitemaps, add the same line for each file:
Sitemap: http://website.tld/mysitemapfile1.xml
Sitemap: http://website.tld/mysitemapfile2.xml
Sitemap: http://website.tld/mysitemapfile3.xml
Choosing the optimal indexing speed and load on your web server
The crawler options offer three indexing speed levels, which create corresponding loads on the server being indexed:
  • Maximum - used by default. If you have quality paid hosting, you most likely do not need to worry about the load created while indexing your site. We recommend this level, which allows the crawler to index your website at top speed.
  • Average - choose this level if your server requires a gentler indexing mode.
  • Low - a level that indexes your site while creating minimal load on the server. It is recommended for websites on free hosting or for sites that require a limited flow of traffic; note, however, that it slows down the indexing of your site.
How to simulate crawls by search engine robots?
You may choose one of the identification options for our web crawler (search engine bot) that indexes your website:
  • Standard browser - the default and recommended option. Your website will load the same way your regular visitors see it.
  • YandexBot - use this option to crawl your website as the Yandex search bot sees it. Our crawler will identify itself as the main Yandex bot (YandexBot/3.0).
  • Googlebot - use this option to crawl your website as the Google search bot sees it. The crawler will identify itself as Google's web search bot (Googlebot/2.1).
  • Baiduspider - Baidu's web search bot.
  • Mysitemapgenerator - use direct identification of our crawler if you need separate control settings and the ability to manage website access.
Pay attention to how the robots.txt file is processed under the different identification options:
  • When the YandexBot, Googlebot, Baiduspider, or Mysitemapgenerator option is selected, only the instructions for that particular robot are considered (User-agent: Yandex, User-agent: Googlebot, User-agent: Baiduspider, and User-agent: Mysitemapgenerator, respectively). The general instructions of the User-agent: * section are used only when the "personal" ones are missing.
  • When using the Standard browser or Mysitemapgenerator option, the crawler considers only the instructions in the Mysitemapgenerator section or the general User-agent: * section; "personal" sections such as User-agent: Yandex or User-agent: Googlebot are not considered.