In this post I describe two topics in the creation of xml sitemaps:
1. producing the sitemap.xml file and 2. editing this file to correct errors.
1. Problems with creating xml sitemaps
If you want a sitemap.xml on your website there are places on the Net where you can make one for free, but most of them have restrictions of one kind or another. They all, as far as I know, limit the number of pages they will crawl, often to something like 500 or fewer. That’s OK if you don’t have too many pages, but I have more than 3500.
Perhaps you don’t need to index the whole site, or may prefer to split it and index different parts of it separately. This may get round restrictions on the number of pages you can index and will certainly be quicker. But it requires you to start indexing at a subdirectory below your root directory and some sitemap generators don’t allow this.
Currently I’m using Online XML Site Generator. It’s free (though you are invited to make a small donation – £2, equating to 2.46 USD or 2.34 EUR, which seems very fair). The number of pages crawled is limited to 2000. Here is a full list of the conditions.
- Maximum pages 2000 (includes pages which errored)
- Spider will run for maximum of 60 minutes.
- Individual page timeout 20 seconds
- Download limit of 60K (for larger files the first 60K is downloaded).
- Maximum of 20% of urls hitting the 20-second timeout.
- Average request time across all urls must be 75% of timeout (which is 15 seconds) sampled every 25 urls.
As you are allowed to start at a subdirectory I can live with these conditions, including the 2000-page limit; I make separate sitemaps for different areas on my site.
However, I find that the sitemaps so produced need editing. They are not accurate; they contain links to all the pages in the relevant subdirectory but include others as well.
To see how this works, suppose you have a website called myfruit.com, with subdirectories myfruit.com/apples, myfruit.com/oranges, and myfruit.com/pears. You make a sitemap for myfruit.com/apples. This should contain URLs for apples, but in fact it will contain many URLs from the other subdirectories.
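If you want to see how many stray URLs a generated sitemap contains before you start editing, a short script can count the entries per top-level subdirectory. This is only a minimal sketch in Python: the sample data and the page names in it are made up for the myfruit.com example, and it assumes the file uses the standard sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

# Namespace of the sitemaps.org 0.9 protocol (assumed for this sketch).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample standing in for a generated sitemap-apples.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://myfruit.com/apples/bramley.html</loc>
</url>
<url>
<loc>http://myfruit.com/apples/cox.html</loc>
</url>
<url>
<loc>http://myfruit.com/pears/conference.html</loc>
</url>
<url>
<loc>http://myfruit.com/oranges/seville.html</loc>
</url>
</urlset>
"""


def count_by_subdirectory(xml_text):
    """Count <loc> URLs by their first path segment (top-level subdirectory)."""
    counts = Counter()
    for loc in ET.fromstring(xml_text).findall("sm:url/sm:loc", NS):
        # "/apples/cox.html" splits into ["", "apples", "cox.html"]
        segments = urlparse(loc.text.strip()).path.split("/")
        top = segments[1] if len(segments) > 2 else "(root)"
        counts[top] += 1
    return counts


counts = count_by_subdirectory(SAMPLE)
for name, total in sorted(counts.items()):
    print(name, total)
```

For a real file you would read its contents and pass them to count_by_subdirectory in place of SAMPLE.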
2. Editing the sitemap.xml file
You can easily edit the sitemap.xml you’ve just created, but if you have to remove a lot of unwanted entries manually it will take you a long time. Fortunately Vim (the One Editor to Rule Them All) allows you to automate the process almost completely. (Yes, I know other editors are available but I don’t use them, so if that’s what you use you’ll have to work out your own method.)
Taking the example website above, let’s suppose you have made a sitemap called sitemap-apples.xml. When you load this in Vim you see that as well as entries for apples there are also lots of entries for oranges and pears. The macro below assumes that each entry occupies three lines, with the URL on the middle line.
To remove all the entries containing “pears”, use Vim’s macro (recording) facility. Here are the keystrokes, with comments (after #).
Start in Normal mode and issue these commands.
- qa # start recording and store results in register a
- /pears<Return> # find first line containing “pears” (press Return to run the search)
- k # move cursor up one line
- 3dd # delete three lines
- n # find next instance of “pears”
- q # end recording
You have now set up your macro. Test it by doing @a. This should delete one instance of “pears” and position your cursor on the next instance. You can repeat it with @@.
If this works, you can continue to delete all the remaining instances of “pears” automatically by doing something like 300@a. Don’t worry if this gives more repetitions than necessary; once no “pears” is left the search fails and Vim abandons the remaining repetitions.
Make a similar macro for “oranges” and you should be left with a correct sitemap.xml file that you can upload to your site.
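For comparison, here is roughly what the macro does, written as a minimal Python sketch rather than Vim keystrokes. It is not a replacement for the method above, just a way to see the logic in one place. Like the macro, it assumes each entry occupies exactly three lines with the URL on the middle line; the function name drop_entries and the sample data are made up for illustration.

```python
# Made-up sample in the three-line-per-entry layout the macro relies on.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://myfruit.com/apples/bramley.html</loc>
</url>
<url>
<loc>http://myfruit.com/pears/conference.html</loc>
</url>
<url>
<loc>http://myfruit.com/oranges/seville.html</loc>
</url>
</urlset>"""


def drop_entries(lines, word):
    """Drop every line containing `word`, plus the line above and the line below.

    This mirrors the macro's "search, k, 3dd" steps for three-line entries.
    """
    unwanted = set()
    for i, line in enumerate(lines):
        if word in line:
            unwanted.update((i - 1, i, i + 1))
    return [line for i, line in enumerate(lines) if i not in unwanted]


cleaned = SAMPLE.splitlines()
for word in ("pears", "oranges"):
    cleaned = drop_entries(cleaned, word)
print("\n".join(cleaned))
```

One caveat applies to the sketch and the macro alike: a bare search for “pears” also matches an apples page that happens to have “pears” in its name, so searching for the subdirectory with its slashes is safer if your page names overlap.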
3. Saving the macros
If you only have one or two sets of unwanted URLs to delete it won’t take long to type the macros. But what if there are more than this, or if you update sitemap.xml frequently? Can you save your macros for future use? Yes, you can, and it’s easy.
There is a site which provides most of the information you need, but I found it didn’t work exactly as specified, possibly because of later changes to Vim. Here is what I do, which works as of 12 May 2019.
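1. Add lines like these to your .vimrc.
let @a='/pears^Mk3ddn'
let @b='/oranges^Mk3ddn'
NB. Use plain straight quotes, not curly ones. ^M is a single literal carriage-return character, not a caret followed by M; it is produced in Vim by typing Ctrl-v <Return>.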
2. Load sitemap.xml into Vim. Do 200@a. This should delete up to 200 instances of pears. Now do 200@b; this should do the same for up to 200 instances of oranges. (Don’t forget to source .vimrc, or to quit and reload sitemap.xml in Vim, after you’ve modified .vimrc.)