Over the years I have read many topics on this board about whether or not sitemaps are necessary or desirable.
Many have expressed the view that sitemaps are not possible because a phpBB board is dynamic. We know that to be false: extensions can be built to update sitemaps at configurable intervals, populating them with newer topics, refreshing page numbers and so on.
Others have commented that a sitemap is not needed or helpful because "Google knows how to crawl a phpBB site" and "I never had a sitemap and I can find my board on Google." I made similar assumptions because I could find pages from my board on Google. In the past, I have also commented myself that a sitemap is not necessary. See here for example: viewtopic.php?f=64&t=2528131. There are many other similar topics.
But I began to wonder whether advising others not to bother with sitemaps had any actual supporting evidence, and whether the conclusion that sitemaps are unhelpful or unnecessary was valid. To figure this out for myself, I decided to start using a sitemap extension and measure the outcome by tracking the Google indexation performance of my board via Google Search Console. I also did the same with Bing Webmaster Tools.
My board has almost 2 million posts and over 160,000 topics. Google has been crawling my board since 2002, so there is also plenty of history for Google to have established ranking signals for the site. My board therefore makes a very good candidate for measuring the value of sitemaps. If sitemaps make a difference, then there should be a measurable improvement in indexation.
Prior to testing, the number of pages in my Google Search Console index was quite low compared to the total number of pages available for Google to crawl. In fact, only around 15% of pages were actually indexed by Google. I had always assumed this was because Google judged many pages to be of limited value and excluded them from the index. Yet many topics which I knew were an excellent source of information on quite specialised subjects would not come up in searches and couldn't be found in the Google index at all (as verified by checking within Google Search Console).
I submitted my sitemap to Google on 19 November 2019, approximately 7 weeks ago.
To be honest, I was surprised by the results and am happy to say I was wrong.
In that time, I am pleased to report that the number of pages indexed by Google has increased by 244%. See below (scale removed for confidentiality).
The (2) notation refers to the date when the sitemap was submitted to Google.
I made absolutely no changes to the configuration of bots on my site. They were always able to crawl the sections of the forum I wanted. I verified this was the case by using the inspection tool within Google Search Console itself. All I did was generate a sitemap covering those same sections to which the bots had access and uploaded it.
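To give a rough picture of what such a sitemap contains, here is a minimal sketch in Python. The topic data and base URL are made up for illustration; a real phpBB extension would pull topics from the database, but the output follows the standard sitemap urlset format:

```python
from xml.sax.saxutils import escape

# Hypothetical topic data; a real extension would query the phpBB topics table.
topics = [
    {"id": 12345, "last_post_time": "2019-11-18"},
    {"id": 67890, "last_post_time": "2019-11-19"},
]

def build_sitemap(base_url, topics):
    """Emit a minimal <urlset> sitemap for the given topics."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for t in topics:
        # Note: no f=# parameter -- just the plain viewtopic URL.
        loc = f"{base_url}/viewtopic.php?t={t['id']}"
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(loc)}</loc>")
        lines.append(f"    <lastmod>{t['last_post_time']}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)

print(build_sitemap("https://example.com/forum", topics))
```

The real extension does all of this for you; the point is simply that the file lists plain topic URLs with last-modified dates, which is exactly what Googlebot fetches.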
The numbers in the index continue to climb.
The other benefit of the sitemap is that Google Search Console allows me to track which URLs in the index were found from the sitemap. As Google re-crawled pages over time, URLs that it had previously located moved from "Indexed, not submitted in sitemap" (the default when Google just finds your site on its own) to "Submitted and indexed", showing that Google had correlated its own index with the submitted sitemap.
Another benefit is that the sitemap assists Google in determining the canonical URL for pages on the site, thereby eliminating duplicates. There are far fewer pages now showing as ignored duplicates. There will soon be a change to URL parameters so that the f=# element is removed from viewtopic URLs. This is how Google wants to index phpBB topic URLs, and once this change is implemented in the sitemap I expect Google will be happy about it.
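For anyone wondering what that parameter change amounts to, here is a small sketch (Python; the helper name is my own invention, not part of any phpBB extension) of stripping the f=# element from a viewtopic URL so it matches the canonical form:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_forum_param(url):
    """Remove the f=# query parameter from a viewtopic URL, keeping the rest."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "f"]
    return urlunparse(parts._replace(query=urlencode(query)))

print(strip_forum_param("https://example.com/forum/viewtopic.php?f=64&t=2528131"))
# -> https://example.com/forum/viewtopic.php?t=2528131
```

The topic is still uniquely identified by t=#, so nothing is lost; Google just sees one URL per topic instead of several variants.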
On a board this large, sitemaps take a while to generate and serve. However, I have configured the sitemap to cache itself for 3 days and only regenerate if accessed after that time has expired.
I have been tracking how regularly Google actually rechecks the sitemap, and it turns out the bots only access it once every 3-7 days. This means a 3-day cache is, if anything, overkill. You can probably afford to keep a sitemap cached for a week and Google won't care. Newer content is likely to be found organically anyway, since it will be on the first page of each viewforum index.
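The caching behaviour described above boils down to a simple freshness check. A minimal sketch (Python; the file path and the regeneration callback are placeholders, the real extension handles this internally):

```python
import os
import time

CACHE_TTL = 3 * 24 * 60 * 60  # cache lifetime: 3 days, in seconds

def get_sitemap(cache_path, generate):
    """Serve the sitemap from cache, regenerating only once the TTL expires."""
    if os.path.exists(cache_path):
        age = time.time() - os.path.getmtime(cache_path)
        if age < CACHE_TTL:
            with open(cache_path) as fh:
                return fh.read()  # still fresh: skip the expensive rebuild
    xml = generate()  # cache missing or expired: regenerate the sitemap
    with open(cache_path, "w") as fh:
        fh.write(xml)
    return xml
```

With a TTL of 3 days and Googlebot visiting every 3-7 days, most fetches hit the cache and the expensive generation step runs at most once per visit.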
There is no other downside.
I saw no measurable benefit to the number of pages indexed by Bing. According to Bing Webmaster Tools, it accessed my sitemap only once, on the day I uploaded it, and it has done nothing with it since.
Bing also has awful difficulty with its bot behaviour. Even with the sitemap and a properly configured robots.txt, its bots are still trying to crawl non-existent pages or pages to which they have no access.
I don't particularly care too much about this as the clickthrough traffic from Bing is trivial anyway.
So, in summary, sitemaps CAN help. I suspect this is especially the case in larger boards that have been around for a long time, where Google hasn't crawled topics appearing on page 1000 of viewforum, or page 6745 of a 10000 page topic.
Will they help a board that has just gone online? Maybe, but you're better off just registering for Google Search Console to tell it about your site first.
Will they help an established board that is smaller in size? Probably not. I suspect Google is happy to completely crawl a site with fewer pages and keep the whole site within its index.
Let this also be a lesson in expressing opinions without a solid evidentiary foundation.