A word on sitemaps after empirical testing

KYPREO
Jr. Extension Validator
Posts: 392
Joined: Fri Feb 02, 2018 9:56 am

Post by KYPREO »

Over the years I have read many topics on this board about whether or not sitemaps are necessary or desirable.

Many have expressed the view that sitemaps are not possible because a phpBB board is dynamic. We know that to be false, because extensions can be built to regenerate sitemaps at configurable time intervals, adding newer topics, refreshing page numbers and so on.

Others have commented that a sitemap is not needed or helpful because "Google knows how to crawl a phpBB site" and "I never had a sitemap and I can find my board on Google." I made similar assumptions because I could find pages from my board on Google. In the past, I have also commented myself that a sitemap is not necessary. See here for example: viewtopic.php?f=64&t=2528131. There are many other similar topics.

But I began to wonder whether advising others not to bother with sitemaps had any actual supporting evidence, and whether the conclusion that sitemaps are unhelpful or unnecessary was valid. To figure this out for myself, I decided to start using a sitemap extension and track the Google indexation performance of my board via Google Search Console. I also did the same with Bing Webmaster Tools.

My board has almost 2 million posts and over 160,000 topics. Google has been crawling my board since 2002, so there is also plenty of history for Google to have established ranking signals for the site. My board therefore makes a very good candidate for measuring the value of sitemaps. If sitemaps make a difference, then there should be a measurable improvement in indexation.

Prior to testing, the number of pages in my Google Search Console index was quite low compared to the total number of pages available for Google to crawl. In fact, only around 15% of pages were actually indexed by Google. I had always assumed this was because Google judged many pages to be of limited value and excluded them from the index. Yet many topics which I knew were an excellent source of information on quite specialised subjects would not only fail to come up in searches but couldn't be found in the Google index at all (as verified by checking within Google Search Console).

The Good

I submitted my sitemap to Google on 19 November 2019, approximately 7 weeks ago.

To be honest, I was surprised by the results and am happy to say I was wrong.

In that time, I am pleased to report that the number of pages indexed by Google has increased by 244%. See below (scale removed for confidentiality).

[Attached image: sitemap.PNG]

The (2) notation refers to the date when the sitemap was submitted to Google.
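
As an aside for readers checking the numbers: an increase of 244% means the index is now about 3.44 times its starting size, not 2.44 times. The short sketch below does that conversion. The projected share of indexed pages is only a rough estimate under two assumptions that are not stated in the post: that the "around 15%" figure was the baseline on the submission date, and that the total number of crawlable pages stayed the same.

```python
def growth_factor(percent_increase: float) -> float:
    """Convert a percentage increase into a multiplier on the starting value."""
    return 1 + percent_increase / 100


factor = growth_factor(244)
print(round(factor, 2))  # 3.44

# Assumption: roughly 15% of pages indexed at the start, total page count unchanged.
baseline_share_percent = 15
print(round(baseline_share_percent * factor, 1))  # 51.6
```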

I made absolutely no changes to the configuration of bots on my site. They were always able to crawl the sections of the forum I wanted. I verified this was the case by using the inspection tool within Google Search Console itself. All I did was generate a sitemap covering those same sections to which the bots had access and upload it.

The numbers in the index continue to climb.

The other benefit of the sitemap is that Google Search Console allows me to track which URLs in the index were found from the sitemap. As Google re-crawled pages over time, URLs that it had previously located moved from "Indexed, not submitted in sitemap" (the default if Google just finds your site) to "Submitted and indexed", showing that Google had correlated its own index with the submitted sitemap.

Another benefit is that the sitemap assists Google in determining the canonical URL for pages on the site and thereby helps eliminate duplicates. There are far fewer pages now showing as ignored duplicates. There will soon be a change to URL parameters so that the f=# element is removed from viewtopic URLs. This is how Google wants to index phpBB topic URLs, and when this change is implemented in the sitemap I expect Google will be happy about it.
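
To illustrate what dropping the f=# element looks like in practice, here is a minimal sketch in Python. It is not the sitemap extension's code; the function name canonical_topic_url and the example.com host are made up for the example. It simply removes the f parameter from a viewtopic URL and leaves every other parameter untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_topic_url(url: str) -> str:
    """Drop the forum id (f=...) from a viewtopic URL, keeping other parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "f"
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


print(canonical_topic_url("https://example.com/viewtopic.php?f=64&t=2528131"))
# https://example.com/viewtopic.php?t=2528131
```

With this form, the same topic always maps to one URL no matter which forum it was reached through, which is the kind of de-duplication described above.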

The Bad

On a board this large, sitemaps take a while to load and generate. However, I have configured the sitemap to cache itself for 3 days and only re-generate if accessed after that time has expired.

I have been tracking how regularly Google actually rechecks the sitemap and it turns out the bots only access it once every 3-7 days. This means a 3 day cache is overkill if anything. You can probably afford to keep a sitemap cached for a week and Google won't care. Newer content is likely to be found organically since it will be on the first page of each viewforum index.
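
For anyone wondering how the caching described above might work, here is a rough sketch in Python. It is an illustration only, not the extension's implementation: the names cache_is_fresh and get_sitemap, and the build callback that produces the sitemap text, are assumptions made for the example. The idea is simply to compare the cached file's modification time against a time-to-live and regenerate only when it has expired.

```python
import os
import time

CACHE_TTL_SECONDS = 3 * 24 * 60 * 60  # 3 days, matching the setting described above


def cache_is_fresh(path, ttl=CACHE_TTL_SECONDS, now=None):
    """Return True if the cached file exists and is younger than ttl seconds."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < ttl


def get_sitemap(path, build, ttl=CACHE_TTL_SECONDS):
    """Serve the cached sitemap, calling build() only when the cache has expired."""
    if not cache_is_fresh(path, ttl):
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(build())
    with open(path, encoding="utf-8") as fh:
        return fh.read()
```

Because regeneration only happens when the file is requested after expiry, a crawler that checks in every 3-7 days triggers at most one rebuild per visit, and raising the TTL to a week would just mean changing the constant.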

There is no other downside.

The Indifferent

I saw no measurable benefit to the number of pages indexed by Bing. According to Bing Webmaster Tools, it only accessed my sitemap once, on the day I uploaded it, and it has done nothing with it since.

Bing also has awful difficulty with its bot behaviour. Even with the sitemap and a properly configured robots.txt, its bots are trying to crawl non-existent pages or pages to which they have no access.

I don't particularly care too much about this as the clickthrough traffic from Bing is trivial anyway.

So, in summary, sitemaps CAN help. I suspect this is especially the case in larger boards that have been around for a long time, where Google hasn't crawled topics appearing on page 1000 of viewforum, or page 6745 of a 10000 page topic.

Will they help a board that has just gone online? Maybe, but you're better off just registering for Google Search Console to tell it about your site first.

Will they help an established board that is smaller in size? Probably not. I suspect Google is happy to completely crawl a site with fewer pages and keep the whole site within its index.

Let this also be a lesson in expressing opinions without a solid evidentiary foundation.
phpBB user since 2002
www.AusRotary.com

AmigoJack
Registered User
Posts: 5715
Joined: Tue Jun 15, 2010 11:33 am
Location: グリーン ヒル ゾーン

Re: A word on sitemaps after empirical testing

Post by AmigoJack »

What approach did you use to generate your sitemap? And which format (XML, TXT, ...; see https://support.google.com/webmasters/answer/183668) did you use? I always considered it useless to just have a file with the URIs https://site/viewtopic.php?t=1 to https://site/viewtopic.php?t=9435521, but if it turns out that this helps Google then it's no big deal to just serve that.
The worst thing about censorship is ███████████
Affin wrote:
Tue Nov 20, 2018 9:51 am
The problem is probably not my English but you do not want to understand correctly.
...
We will not come anybody anyway, nevertheless, it's best to shit this.

KYPREO
Jr. Extension Validator
Posts: 392
Joined: Fri Feb 02, 2018 9:56 am

Re: A word on sitemaps after empirical testing

Post by KYPREO »

XML with plain URLs: no last-modified dates, no change frequency, no priority. All three are optional under the XML sitemap standard. My research indicated that Google ignores change frequency and priority, and I figured dates could only possibly act as a negative crawl signal, so I went with plain URLs.

It has a main index and max 50,000 URLs per file, in line with the sitemap standard.
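
As a rough illustration of that layout (a main index pointing at files of at most 50,000 plain URLs each), here is a minimal Python sketch. It is not the code behind the live sitemap: the function name build_sitemaps, the sitemap-N.xml naming and the sitemap.xml index name are assumptions chosen for the example. Only the <loc> element is emitted, since the other fields are optional.

```python
from xml.sax.saxutils import escape

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemap standard


def build_sitemaps(urls, base_url, max_urls=MAX_URLS_PER_FILE):
    """Return {filename: xml} for the chunked sitemap files plus a sitemap index."""
    files = {}
    for start in range(0, len(urls), max_urls):
        chunk = urls[start:start + max_urls]
        # escape() turns "&" in query strings into "&amp;" so the XML stays valid
        body = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in chunk)
        name = f"sitemap-{start // max_urls + 1}.xml"
        files[name] = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{NS}">\n{body}</urlset>\n'
        )
    # Build the index from the chunk files before adding the index itself
    index_body = "".join(
        f"  <sitemap><loc>{escape(base_url + name)}</loc></sitemap>\n"
        for name in files
    )
    files["sitemap.xml"] = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{NS}">\n{index_body}</sitemapindex>\n'
    )
    return files
```

Calling build_sitemaps(list_of_topic_urls, "https://example.com/") returns the file contents keyed by name; writing them to disk and submitting only the index file is left to the caller.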

Here is part of my live map: http://www.ausrotary.com/sitemap-4.xml

This has exceeded expectations so far. But to be honest, I wouldn't be surprised if a plain txt file was just as effective. It would certainly be much faster to generate and load.

david63
Registered User
Posts: 17755
Joined: Thu Dec 19, 2002 8:08 am
Location: Lancashire, UK

Re: A word on sitemaps after empirical testing

Post by david63 »

Whilst from the data you have posted it is indisputable that Google has indexed more pages from your board, what effect, if any, has it had on your ranking within Google?
David
Remember: You only know what you know and - you don't know what you don't know!
My CDB Contributions | How to install an extension
I will not be accepting translations for any of my extensions in Github - please post any translations in the appropriate topic.
No support requests via PM or email as they will be ignored

KYPREO
Jr. Extension Validator
Posts: 392
Joined: Fri Feb 02, 2018 9:56 am

Re: A word on sitemaps after empirical testing

Post by KYPREO »

david63 wrote:
Thu Jan 09, 2020 11:07 am
Whilst from the data you have posted it is indisputable that Google has indexed more pages from your board, what effect, if any, has it had on your ranking within Google?
Good question.

It's a complex one to answer.

The short version is that it is too early to tell.

The long answer is:

1. Ranking is not simply a global matter. Yes, your site will generally have a ranking based primarily on 3 factors: Authority, Relevance and Trust. Everything I read suggests that these 3 things matter more than any SEO technique, although I don't have enough experience or data to evaluate whether or not that is true.

These rankings must filter down to the URL level. Some pages will be highly relevant and highly authoritative for certain keywords, while others may not be, even if the authority and trust levels of the site are high. My site must generally have high authority and trust, though, because it generally ranks among the top 1 or 2 discussion boards for searches on the subject matter of my board.

By sheer weight of numbers, tripling or quadrupling the number of pages in the index must surely have some benefit. For those keywords where the site already ranks highly, I doubt it could improve, because users don't generally click on anything past the top few ranking hits. But more topics in the index widens the net of potential keywords that could be triggered.

2. Google updates its ranking algorithm every day, but it deployed a major update beginning around 10 December 2019. Whenever Google does this, you get crazy results for a couple of weeks and sometimes get major sugar hits or major losses. After this update first hit, my ranking actually got worse and the numbers were all over the place. Since the New Year, things have stabilised and I'm getting very good traffic from Google. I can't say it is measurably better over a 2-year trend. But it's certainly not 240% better!

Overlaying Google search page impressions over the index coverage graph you can see this weird crazy behaviour. The number 1 in a circle is when the Google algorithm update hit:

[Attached image: coverage vs impressions.PNG]

Despite the recent erratic behaviour, there does seem to be a slightly upward trendline in search impressions. This does not necessarily translate to search clicks mind you, which is the more relevant measure of whether users find your site. Ultimately if you get more impressions and your results page position and click through rate remain the same, you will get more clicks.

3. It's not just about search traffic. It's certainly not my primary goal. Indexing more pages is an end in itself. I have a Google Search plugin on my board to complement the native search tool. I have recently implemented Sphinx for the native search and it is great, but Google is more intuitive and has the major benefit of weighting search results by RELEVANCE, not just post time. On a massive board with thousands of topics returned from a search query, Google is a great way to find the right answer. But that search add-on is only any good if the whole board, or at least most of it, is indexed.

However things work behind the curtain, it seems to take time for rankings of newly indexed pages to be determined by Google.

I think it will take months of data before I can report on any trend. I'll definitely try to remember to post back here when I have enough data.

I hope this helps people sort their way through the bollocks (there's certainly a lot of that on the topic of SEO).

thecoalman
Community Team Member
Posts: 3622
Joined: Wed Dec 22, 2004 3:52 am
Location: Pennsylvania, U.S.A.

Re: A word on sitemaps after empirical testing

Post by thecoalman »

david63 wrote:
Thu Jan 09, 2020 11:07 am
Whilst from the data you have posted it is indisputable that Google has indexed more pages from your board, what effect, if any, has it had on your ranking within Google?
A page cannot rank if it's not indexed. When you have lots of pages indexed you end up getting a lot of long tail searches where the user is searching for multiple keywords.
“Results! Why, man, I have gotten a lot of results! I have found several thousand things that won’t work.”

Attributed - Thomas Edison

david63
Registered User
Posts: 17755
Joined: Thu Dec 19, 2002 8:08 am
Location: Lancashire, UK

Re: A word on sitemaps after empirical testing

Post by david63 »

thecoalman wrote:
Fri Jan 10, 2020 6:24 pm
david63 wrote:
Thu Jan 09, 2020 11:07 am
Whilst from the data you have posted it is indisputable that Google has indexed more pages from your board, what effect, if any, has it had on your ranking within Google?
A page cannot rank if it's not indexed. When you have lots of pages indexed you end up getting a lot of long tail searches where the user is searching for multiple keywords.
I don't disagree, but if before this test you searched for, say, "green widgets" and that was at position x in the search results, my point is: where is it now?

Obviously the more pages/keywords that are indexed will have an effect, but only if you use those keywords.

thecoalman
Community Team Member
Posts: 3622
Joined: Wed Dec 22, 2004 3:52 am
Location: Pennsylvania, U.S.A.

Re: A word on sitemaps after empirical testing

Post by thecoalman »

david63 wrote:
Fri Jan 10, 2020 6:49 pm
Obviously the more pages/keywords that are indexed will have an effect, but only if you use those keywords.
That's the point, especially if you are already indexed for "green widgets", because "green widgets with special parts" may only appear for other pages.

I do well for a lot of keywords or two-word phrases, but the cumulative long tail searches are what really bring the traffic.

david63
Registered User
Posts: 17755
Joined: Thu Dec 19, 2002 8:08 am
Location: Lancashire, UK

Re: A word on sitemaps after empirical testing

Post by david63 »

Being some five months further down the line, can you say whether there has been any improvement in search position?
