Sitemaps FAQs

What are the basic requirements to crawl a website?

The website must be publicly accessible, respond with recognizable HTML source code, and link to its internal pages from the home page. Crawling starts from the links found on the home page. The home page may redirect to another address, as long as the redirect stays within the same hostname.

What is the website root URL?

The root URL is the base address for accessing your domain on the web server. It consists of two required components: a protocol scheme (usually https://) and a domain name (e.g. website.tld). Root URL examples: https://website.tld, http://subdomain.website.tld.
We support all existing domain types, including top-level domains (TLDs), country-code top-level domains (ccTLDs), and subdomains of any level. We also support internationalized domain names (IDNs) for most languages, such as Arabic, Indian and Cyrillic domains. Note that you don't need to convert your hostname to Punycode; just enter the original URL in your language.
Optionally, the root URL may contain an indication of the language version of the website.

What forms of root URLs are acceptable to indicate website language version?

Only one language version format is supported: a single folder directly under the website root, named with one or two values separated by a dash. The first value is a two-character language code in ISO 639-1 format; it may be followed by an optional region code in ISO 3166-1 Alpha 2 format. Examples of valid URLs for language versions:

http://mydomain.com/en
http://mydomain.com/en-US
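
As a rough illustration of this format (not our crawler's actual code), the following Python sketch checks whether a URL path has the shape of a language version. The function name `has_language_root` and the strict letter-case rule are assumptions made for the example; only the shape of the codes is checked, not whether they are actually assigned.

```python
import re
from urllib.parse import urlsplit

# Two lowercase letters (ISO 639-1 shape), optionally followed by a dash
# and two uppercase letters (ISO 3166-1 Alpha 2 shape).
# Assumption: case is matched strictly, as in the examples above.
LANG_ROOT = re.compile(r"/[a-z]{2}(-[A-Z]{2})?/?")


def has_language_root(url: str) -> bool:
    """Return True if the URL path looks like /en or /en-US."""
    return LANG_ROOT.fullmatch(urlsplit(url).path) is not None
```

With this sketch, `has_language_root("http://mydomain.com/en-US")` is true, while a path such as `/english` is rejected.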

What HTTP protocols are supported?

Mysitemapgenerator supports HTTP and HTTPS.
Please note that, in line with the XML Sitemaps protocol specification, the website is crawled and the data is generated only for the data transfer protocol you specify.

What is the limit on the free plan?

The free plan allows you to run up to 3 generation requests per day.

Will the limits specified in robots.txt be taken into account when crawling?

This is optional, but enabled by default. If this option is ticked, our bot follows the Allow and Disallow rules in the general section (User-agent: *).
"Personal" User-agent sections (for example, for Google or Yandex) are taken into account when you choose the corresponding search bot identification mode for the crawler.
In addition, you may create a separate section specifically for Mysitemapgenerator:

    User-agent: Mysitemapgenerator

Below is an example of a robots.txt file:

    #No robots should visit any URL starting with /noindex-directory/
    User-agent: *
    Disallow: /noindex-directory/
    
    #Google does not need to visit a specific URL
    User-agent: Googlebot
    Disallow: /noindex-directory/disallow-google.html
    
    #Yandex does not need to visit URLs starting with /noindex-directory/
    #But is allowed to index a specific page
    User-agent: Yandex
    Disallow: /noindex-directory/
    Allow: /noindex-directory/allow-yandex.html
    
    #Mysitemapgenerator does not need to visit URLs starting with /noindex-directory/
    #But is allowed to index pages with a specific extension
    User-agent: Mysitemapgenerator
    Disallow: /noindex-directory/
    Allow: /noindex-directory/*.html

What are hidden pages (Deep Web) and how to include them in a Sitemap?

The Deep Web (also called the Deepnet, the Invisible Web, the Undernet or the hidden Web) consists of web pages that search engines do not index because no accessible page links to them. Examples include pages generated through HTML forms and content loaded in frames.
If you wish to discover such pages and include them in the Sitemap, please tick the appropriate options:

"Crawl html forms" (the form is submitted without being filled in);
"Crawl framed content" (contents of <frameset> and <iframe>).

What will happen to the nofollow links?

If the option is on (enabled by default), they will not be considered.
If needed, you can also choose to ignore only noindex pages (pages marked as noindex) or only nofollow links, independently of each other.
Nofollow link types:

HTML links containing the nofollow attribute
URLs that are disallowed in the robots.txt file
Links located on a web page tagged with the nofollow robots tag

How will the web pages that use robots meta tag or X-Robots-Tag HTTP header be processed?

If the corresponding options are active, pages are processed according to the values set in the robots meta tag or header (noindex, nofollow).
If necessary, you can apply only noindex processing or only nofollow processing, independently of each other.
Tags intended for specific search crawlers (for example, Googlebot) are taken into account when you choose the corresponding search bot identification option for the crawler.
You can also add meta tags addressed to our robot to your web pages; they are taken into account when you choose direct identification of our crawler.
Examples of using the robots meta tag:

    <meta name="robots" content="noindex" />
    
    <meta name="robots" content="nofollow" />
    
    <meta name="robots" content="noindex,nofollow" />

Examples of using X-Robots-Tag HTTP headers:

    X-Robots-Tag: noindex
    
    X-Robots-Tag: nofollow
    
    X-Robots-Tag: noindex, nofollow
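
To show how these values can be interpreted, here is a minimal Python sketch, assuming that the meta tag and the HTTP header are simply combined. The function names are hypothetical, and this is an illustration rather than our crawler's actual logic.

```python
def parse_robots_directives(value: str) -> set[str]:
    """Split a robots meta content or X-Robots-Tag value into lowercase directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def page_policy(meta_content: str = "", header_value: str = "") -> dict[str, bool]:
    """Decide whether a page may be indexed and whether its links may be followed.

    Assumption: a directive applies if it appears in either source.
    """
    directives = parse_robots_directives(meta_content) | parse_robots_directives(header_value)
    return {
        "index": "noindex" not in directives,
        "follow": "nofollow" not in directives,
    }
```

For example, `page_policy("noindex")` keeps link following enabled but excludes the page itself, which corresponds to applying only noindex processing.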

How does the crawler process intrahost server redirects?

The crawler recognizes the following standard HTTP status codes:

301 Moved Permanently
302 Found
303 See Other
307 Temporary Redirect

If a page of your website redirects to an address on the same domain name, the crawler will index the page specified in the redirect address.
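
The rule above can be sketched in Python as follows. This is a simplified illustration under the assumption that "same domain name" means an identical hostname; the function name `resolve_intrahost_redirect` is hypothetical.

```python
from urllib.parse import urljoin, urlsplit

# Status codes listed above.
REDIRECT_CODES = {301, 302, 303, 307}


def resolve_intrahost_redirect(page_url, status, location):
    """Return the redirect target if it stays on the same hostname, else None."""
    if status not in REDIRECT_CODES or not location:
        return None
    # The Location header may be relative, so resolve it against the page URL.
    target = urljoin(page_url, location)
    if urlsplit(target).hostname != urlsplit(page_url).hostname:
        return None
    return target
```

A redirect from `http://website.tld/old` to `/new` resolves to `http://website.tld/new`, while a redirect to another hostname yields `None`.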

Does the crawler handle canonical links (rel=canonical)?

Yes. To enable this, tick the option "Follow & consolidate canonical URLs". When this option is active, the crawler takes the instructions in canonical links into account, and noncanonical links are excluded from the crawl results.
Our crawler handles instructions in the HTML code and in HTTP headers in the same way. An example of specifying a canonical link in HTML (placed in the <head> section of the noncanonical version of the page):

    <link rel="canonical" href="http://www.website.tld/canonical_page.html"/>

An example of specifying a canonical reference using an HTTP header:

    Link: <http://www.website.tld/canonical_page.html>; rel="canonical"

Note one technical aspect of how our crawler processes canonical links: a reference to the canonical page is treated as equivalent to a server-side redirect (HTTP 303 See Other) and is processed according to the general redirect rules.
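
As a simplified illustration of consolidation (a sketch under stated assumptions, not our crawler's actual code), suppose each crawled URL is mapped to the canonical URL it declares, or to `None` if it declares none. The function name and input shape are hypothetical.

```python
from typing import Optional


def consolidate_canonicals(pages: dict[str, Optional[str]]) -> list[str]:
    """Replace each noncanonical URL with its canonical target and drop duplicates.

    `pages` maps a crawled URL to its declared canonical URL (or None).
    """
    kept: list[str] = []
    for url, canonical in pages.items():
        target = canonical or url  # behaves like following a redirect
        if target not in kept:
            kept.append(target)
    return kept
```

A noncanonical page and its canonical version therefore end up as a single entry in the result.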

Does the crawler handle AJAX links (hashbang)?

Yes. To enable this, tick the option "index AJAX links".
Each indexed AJAX link should have an HTML version, available at the address where the "#!" combination is replaced by the "?_escaped_fragment_=" parameter.
In AJAX links, the crawler replaces the #! combination with the ?_escaped_fragment_= parameter and accesses the page by the modified URL.
Links containing a hashbang (#!) are used in their original form when creating a Sitemap.
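
The replacement described above can be sketched in Python like this. The function name is hypothetical, and using "&" when the URL already has a query string is an assumption of this example; no additional escaping of the fragment is performed here.

```python
def to_escaped_fragment(url: str) -> str:
    """Rewrite a hashbang URL into its ?_escaped_fragment_= form."""
    if "#!" not in url:
        return url  # not an AJAX link, leave unchanged
    base, fragment = url.split("#!", 1)
    # Assumption: append with "&" if the URL already contains a query string.
    separator = "&" if "?" in base else "?"
    return f"{base}{separator}_escaped_fragment_={fragment}"
```

For example, `http://website.tld/#!page=1` is fetched as `http://website.tld/?_escaped_fragment_=page=1`, while the Sitemap keeps the original hashbang URL.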

Processing and deleting phpsessid and sessionID (session identifiers in PHP and ASP applications)

While it is being crawled, your website may generate session IDs. Our crawler processes and deletes session identifiers. All links are written to the Sitemap file "clean", without the identifiers passed in the URL: phpsessid (for PHP) or sessionID objects (for ASP). This helps avoid duplicate links in the Sitemap when the bot receives the same page under different URLs.

Example of session identifier in PHP:

    http://website.tld/page.html?PHPSESSID=123456session6789

Example of session identifier in ASP:

    http://website.tld/(S(123456session6789))/page.html

In the end, the URL is reduced to its basic form:

    http://website.tld/page.html
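
A minimal Python sketch of this cleanup, assuming the two URL shapes shown in the examples above. The function name, the parameter list and the regular expression are illustrative assumptions, not our crawler's actual rules.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters treated as session identifiers (compared case-insensitively).
SESSION_PARAMS = {"phpsessid", "sessionid"}
# ASP-style cookieless session segment, e.g. /(S(123456session6789))
ASP_SESSION_SEGMENT = re.compile(r"/\(S\([^)]*\)\)")


def strip_session_id(url: str) -> str:
    """Remove session identifiers from the path and query string of a URL."""
    parts = urlsplit(url)
    path = ASP_SESSION_SEGMENT.sub("", parts.path)
    query = urlencode(
        [(key, value)
         for key, value in parse_qsl(parts.query, keep_blank_values=True)
         if key.lower() not in SESSION_PARAMS]
    )
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
```

Both example URLs above are reduced to `http://website.tld/page.html`; other query parameters are kept.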

How are images added to a Sitemap?

MySitemapGenerator allows you to gather and add information about images located on the pages of your website to the Sitemap file. For each website page containing images, relevant information will be included according to the Google Sitemap Image protocol.
The following example illustrates a portion of the record in the Sitemap file for the URL http://website.tld/sample.html, which contains two images:

    <url>
      <loc>http://website.tld/sample.html</loc>
      <image:image>
        <image:loc>http://website.tld/logo.jpg</image:loc>
      </image:image>
      <image:image>
        <image:loc>http://website.tld/photo.jpg</image:loc>
      </image:image>
    </url>
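
For readers who want to see how such a record can be produced, here is a short Python sketch using the standard library. The function name `build_image_sitemap` and its input shape are assumptions for illustration; the namespace URIs are those published for the Sitemap protocol and the Google image extension.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages: dict[str, list[str]]) -> str:
    """Build a Sitemap string from a mapping of page URL -> list of image URLs."""
    # Register prefixes so the output uses <url> and <image:image> tag names.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")
```

Calling it with the page `http://website.tld/sample.html` and its two images yields one `<url>` element containing two `<image:image>` entries, as in the fragment above.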

How do multilingual pages indicate their presence in the Sitemap?

MySitemapGenerator can find localized page versions targeting different languages and/or countries.
Our crawler handles HTML tags and HTTP headers.
An example of specifying an alternate page URL via the HTML Link Element (placed in the <head> section of a web page):

    <link rel="alternate" href="http://www.website.tld/alternate_page.html" hreflang="en-GB" />

An example of specifying alternate page URL via HTTP response headers:

    Link: <http://www.website.tld/alternate_page.html>; rel="alternate"; hreflang="en-GB"

Supported values
The hreflang attribute's value must indicate the language code of the alternate URL in ISO 639-1 format and, optionally, a country code in ISO 3166-1 Alpha 2 format.

How does filtering unsupported content work?

Including any URL in a Sitemap does not violate the Sitemap protocol and is not technically an error. However, links that return an error or a redirect may trigger warnings in Google Webmaster Tools about indirect or unavailable links in the Sitemap.
In the paid plans, the check for link availability continues until the last link is verified, even after the crawl process has completed (which occurs after all URLs are found). This ensures that redirects or dead links are not included in the Sitemap.

What data does the error report generated after crawling the website contain?

If our crawler encounters difficulties or obstacles while crawling your website, a detailed report is created. The report contains grouped lists of pages with descriptions of the errors, such as "Page not found", internal server errors, etc. Besides the errors, the report contains information about all detected server redirects.
Error reports are available in paid versions.

If I have a very large website, what happens when the number of imported pages exceeds 50,000 URLs?

By default, a large Sitemap is divided according to the Sitemap protocol and the recommendations of search engines. This means you will receive multiple sitemap files, each containing no more than 50,000 URLs.
We also select the number of URLs per file based on best practices.
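
The split itself is straightforward. The following Python sketch shows the idea with the 50,000 URL limit mentioned above; the function name is hypothetical, and the actual number of URLs we place in each file may be lower.

```python
def split_urls(urls: list[str], max_per_file: int = 50_000) -> list[list[str]]:
    """Split a list of URLs into consecutive chunks of at most max_per_file items."""
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]
```

For a website with 120,000 URLs, this produces three chunks of 50,000, 50,000 and 20,000 URLs, each of which would be written to its own Sitemap file.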

How to use URL filtering?

URL filtering enables you to exclude certain pages from the crawl process if they are not needed in the sitemap file. Filters can be applied either to individual pages (where you need to input the full URI path) or to groups of pages (by inputting a part of the URI path that corresponds to similar pages, such as "*.jpg" or "/directory/*").
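
To illustrate how such filters behave, here is a Python sketch that uses shell-style wildcard matching from the standard library. It is an approximation under the assumption that "*" matches any sequence of characters; the function name `is_excluded` is hypothetical, and characters such as "?" or "[" have a special meaning in this matcher that our filters may not share.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_excluded(url: str, filters: list[str]) -> bool:
    """Return True if the URL path matches any of the filter patterns."""
    path = urlsplit(url).path
    return any(fnmatchcase(path, pattern) for pattern in filters)
```

With the filters `["*.jpg", "/directory/*"]`, both `http://website.tld/img/photo.jpg` and `http://website.tld/directory/page.html` are excluded, while `http://website.tld/about.html` is kept.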

How does the "Get on email" function work?

We recommend using this function if you have a large website that may take a long time to crawl. With this option, you don't have to wait until the crawler finishes its work; the results are sent to your email. This feature is available both in the paid version (the ready file* is sent to the specified email address) and in the free version of the generator (you get a link to download the ready file from our server).

* If the total size of the created Sitemap files exceeds the allowed size, you will get a link to download them from our server.

For how many days are created files available for download via the sent links?

The guaranteed storage time on our server is:

For files created with the free version - 7 days,
For files created with the paid version - 14 days.

How to check the website crawl status?

All registered users can view information about every crawl, including websites that are currently being crawled, in their personal account.

What determines the speed of my website crawling?

Crawling speed depends on many dynamic factors, such as the responsiveness of your web server and the size of the loaded pages. That is why it cannot be calculated in advance.
The internal linking structure of your website also has a large impact on crawling time.

Can I stop the website crawling before it is finished?

This option is available to registered users. Your personal account displays information about all the files you have created, as well as about websites that are being processed at the moment. To interrupt crawling without waiting for the crawler to process the entire website, click the "Stop" button. In this case, you will receive a file generated only from the pages that had been fetched by the time of the stop.

How do I inform search engines about my Sitemap?

To begin, register your website with the webmaster tools provided by search engines (for example, www.google.com/webmasters for Google). Once registered, you can submit your Sitemap directly through your account.
Another common method is to add the following line to your robots.txt file:

    Sitemap: http://website.tld/mysitemapfile.xml

If you have multiple Sitemaps to provide, include a line for each file like this:

    Sitemap: http://website.tld/mysitemapfile1.xml
    Sitemap: http://website.tld/mysitemapfile2.xml
    Sitemap: http://website.tld/mysitemapfile3.xml

Choosing the optimal crawl speed depending on your web server

The crawler settings provide three crawling speed levels, which create a corresponding load on the web server:

Maximum – this load level is used by default. If you have quality paid hosting, you most likely do not need to worry about the load created while crawling your website. We recommend this level, which allows the crawler to process your website at top speed.
Average – choose this load level if your web server requires a gentler crawling mode.
Low – this load level allows your website to be crawled with minimal load on the web server. It is recommended for websites on free hosting or with significantly limited bandwidth.
However, please note that this level slows down the crawling of your website.

How to simulate crawling by search engine bots?

You may choose one of the following identification options for our crawler:

Standard browser – the default and recommended option. Your website is loaded the same way your regular visitors see it.
Googlebot – this option is used to crawl your website as Google's crawler sees it. Our crawler identifies itself in the same way as Google's web search bot (Googlebot/2.1).
YandexBot – this option is used to crawl your website as the Yandex search bot sees it. Our crawler identifies itself in the same way as the main Yandex search bot (YandexBot/3.0).
Baiduspider – our crawler identifies itself in the same way as the Baidu web search bot.
Mysitemapgenerator – use direct identification of our crawler if you need separate control settings and the ability to manage website access.

Please note the ways in which the robots.txt file is processed when choosing different identification methods:

When you choose the Googlebot, YandexBot, Baiduspider or Mysitemapgenerator option, only the instructions for that particular bot are considered (User-agent: Googlebot, User-agent: Yandex, User-agent: Baiduspider, User-agent: Mysitemapgenerator, respectively). General instructions (in the User-agent: * section) are used only if there are no "personal" ones.
If you use the Standard browser or Mysitemapgenerator option, our crawler considers only the instructions in the Mysitemapgenerator section (User-agent: Mysitemapgenerator) and, if that section is missing, those in the general section (User-agent: *) of the robots.txt file. Any other "personal" sections (such as User-agent: Googlebot or User-agent: Yandex) are not considered.
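
The section selection described above can be sketched in Python as follows. This is a deliberately simplified parser written for illustration; the names `MODE_TO_AGENT`, `parse_sections` and `select_rules` are hypothetical, and the sketch only selects a section without evaluating the Allow and Disallow patterns.

```python
# User-agent token consulted for each identification option (per the text above).
MODE_TO_AGENT = {
    "Standard browser": "Mysitemapgenerator",
    "Googlebot": "Googlebot",
    "YandexBot": "Yandex",
    "Baiduspider": "Baiduspider",
    "Mysitemapgenerator": "Mysitemapgenerator",
}


def parse_sections(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Group Allow/Disallow rules by lowercase User-agent token."""
    sections: dict[str, list[tuple[str, str]]] = {}
    current_agents: list[str] = []
    reading_agents = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new section starts here
            reading_agents = True
            current_agents.append(value.lower())
            sections.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                sections[agent].append((field, value))
    return sections


def select_rules(sections, mode: str) -> list[tuple[str, str]]:
    """Return the "personal" section for the mode, or the general one if it is missing."""
    agent = MODE_TO_AGENT[mode].lower()
    return sections.get(agent, sections.get("*", []))
```

For a robots.txt file that has only a general section and a Googlebot section, the Googlebot option receives the Googlebot rules, while the Standard browser option falls back to the rules in the User-agent: * section.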

Handling content dynamically generated via JavaScript

If your website uses JavaScript to generate the bulk of its content (also known as client-side rendering), our crawler will attempt to process the dynamically generated content on the web page. This happens if the need is detected automatically or if the JavaScript processing option is enabled in the settings. However, many JavaScript algorithms may not be processed.
JavaScript Processing Limitations:

Please note that our crawler does not load or process JavaScript code from any external sources whose hostname (domain) is different from the website's, such as scripts loaded from CDNs or API servers, including website subdomains.
Our crawler does not process content that is dynamically generated by any specific user action, such as scrolling a page or clicking on an element.
Keep in mind that the MySitemapGenerator crawler only crawls links that are an HTML <a> tag with an "href" attribute, which also applies to content that is dynamically generated using JavaScript. Our algorithms do not recognize or process any other elements or events that function as links but are not a corresponding HTML <a> tag. This means that any other website navigation formats will not be processed, and therefore the content they point to will not be processed either.