Robots.txt

END OF SUPPORT: 1 January 2017 (announcement)
gavpedz
Registered User
Posts: 338
Joined: Thu Nov 16, 2006 12:00 pm

Robots.txt

Post by gavpedz »

Hi guys, I just found this robots.txt file for phpBB3 on a Turkish site and wasn't sure whether all of it is necessary, or whether it is OK as is:
User-agent: *
#Crawl-delay: 100
Disallow: /adm
Disallow: /cache
Disallow: /CVS
Disallow: /develop
Disallow: /docs
Disallow: /files
Disallow: /includes
Disallow: /install
Disallow: /language
Disallow: /store
Disallow: /styles
Disallow: /common.php
Disallow: /config.php
Disallow: /cron.php
Disallow: /download.php
Disallow: /faq.php
Disallow: /mcp.php
Disallow: /posting.php
Disallow: /report.php
Disallow: /style.php
Disallow: /ucp.php
#Disallow: /viewonline.php
#Disallow: /memberlist.php
#Disallow: /search.php

#--- Allowed bots
# for archive.org
User-agent: ia_archiver
Disallow:

# for google adsense bot
User-agent: Mediapartners-Google*
Disallow:

#User-agent: Slurp
Crawl-delay: 30
Disallow:

user-agent: googlebot
crawl-delay: 30

user-agent: msnbot
crawl-delay: 30


# Block malicious bots
User-agent: BotRightHere
Disallow: /

User-agent: WebZip
Disallow: /

User-agent: larbin
Disallow: /

User-agent: b2w/0.1
Disallow: /

User-agent: Copernic
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Python-urllib
Disallow: /

User-agent: NetMechanic
Disallow: /

User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: CopyRightCheck
Disallow: /

User-agent: Crescent
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: ProWebWalker
Disallow: /

User-agent: CheeseBot
Disallow: /

User-agent: LNSpiderguy
Disallow: /

User-agent: Alexibot
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: MIIxpc
Disallow: /

User-agent: Telesoft
Disallow: /

User-agent: Website Quester
Disallow: /

User-agent: moget/2.1
Disallow: /

User-agent: WebZip/4.0
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebSauger
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: NetAnts
Disallow: /

User-agent: Mister PiX
Disallow: /

User-agent: WebAuto
Disallow: /

User-agent: TheNomad
Disallow: /

User-agent: WWW-Collector-E
Disallow: /

User-agent: RMA
Disallow: /

User-agent: libWeb/clsHTTP
Disallow: /

User-agent: asterias
Disallow: /

User-agent: httplib
Disallow: /

User-agent: turingos
Disallow: /

User-agent: spanner
Disallow: /

User-agent: InfoNaviRobot
Disallow: /

User-agent: Harvest/1.5
Disallow: /

User-agent: Bullseye/1.0
Disallow: /

User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /

User-agent: CherryPickerSE/1.0
Disallow: /

User-agent: CherryPickerElite/1.0
Disallow: /

User-agent: WebBandit/3.50
Disallow: /

User-agent: NICErsPRO
Disallow: /

User-agent: Microsoft URL Control - 5.01.4511
Disallow: /

User-agent: DittoSpyder
Disallow: /

User-agent: Foobot
Disallow: /

User-agent: SpankBot
Disallow: /

User-agent: BotALot
Disallow: /

User-agent: lwp-trivial/1.34
Disallow: /

User-agent: lwp-trivial
Disallow: /

User-agent: BunnySlippers
Disallow: /

User-agent: Microsoft URL Control - 6.00.8169
Disallow: /

User-agent: URLy Warning
Disallow: /

User-agent: Wget/1.6
Disallow: /

User-agent: Wget/1.5.3
Disallow: /

User-agent: Wget
Disallow: /

User-agent: LinkWalker
Disallow: /

User-agent: cosmos
Disallow: /

User-agent: moget
Disallow: /

User-agent: hloader
Disallow: /

User-agent: humanlinks
Disallow: /

User-agent: LinkextractorPro
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Mata Hari
Disallow: /

User-agent: LexiBot
Disallow: /

User-agent: Web Image Collector
Disallow: /

User-agent: The Intraformant
Disallow: /

User-agent: True_Robot/1.0
Disallow: /

User-agent: True_Robot
Disallow: /

User-agent: BlowFish/1.0
Disallow: /

User-agent: JennyBot
Disallow: /

User-agent: MIIxpc/4.2
Disallow: /

User-agent: BuiltBotTough
Disallow: /

User-agent: ProPowerBot/2.14
Disallow: /

User-agent: BackDoorBot/1.0
Disallow: /

User-agent: toCrawl/UrlDispatcher
Disallow: /

User-agent: suzuran
Disallow: /

User-agent: TightTwatBot
Disallow: /

User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /

User-agent: VCI
Disallow: /

User-agent: Szukacz/1.4
Disallow: /

User-agent: Openfind data gatherer
Disallow: /

User-agent: Openfind
Disallow: /

User-agent: Xenu's Link Sleuth 1.1c
Disallow: /

User-agent: Xenu's
Disallow: /

User-agent: Zeus
Disallow: /

User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /

User-agent: RepoMonkey
Disallow: /

User-agent: Microsoft URL Control
Disallow: /

User-agent: Openbot
Disallow: /

User-agent: URL Control
Disallow: /

User-agent: Zeus Link Scout
Disallow: /

User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /

User-agent: Webster Pro
Disallow: /

User-agent: EroCrawler
Disallow: /

User-agent: LinkScan/8.1a Unix
Disallow: /

User-agent: Keyword Density/0.9
Disallow: /

User-agent: Kenjin Spider
Disallow: /

User-agent: Iron33/1.0.2
Disallow: /

User-agent: Bookmark search tool
Disallow: /

User-agent: GetRight/4.2
Disallow: /

User-agent: FairAd Client
Disallow: /

User-agent: Gaisbot
Disallow: /

User-agent: Aqua_Products
Disallow: /

User-agent: Radiation Retriever 1.1
Disallow: /

User-agent: Flaming AttackBot
Disallow: /

Re: Robots.txt

Post by gavpedz »

anyone?
ameeck
Former Team Member
Posts: 6559
Joined: Mon Mar 21, 2005 6:57 pm

Re: Robots.txt

Post by ameeck »

Actually the first part should be enough:

Code: Select all

User-agent: *
Disallow: /adm
Disallow: /cache
Disallow: /CVS
Disallow: /develop
Disallow: /docs
Disallow: /files
Disallow: /includes
Disallow: /install
Disallow: /language
Disallow: /store
Disallow: /styles
Disallow: /common.php
Disallow: /config.php
Disallow: /cron.php
Disallow: /download.php
Disallow: /faq.php
Disallow: /mcp.php
Disallow: /posting.php
Disallow: /report.php
Disallow: /style.php
Disallow: /ucp.php
pentapenguin
Former Team Member
Posts: 11030
Joined: Thu Jul 01, 2004 4:15 am
Location: GA, USA

Re: Robots.txt

Post by pentapenguin »

Unless you don't have much bandwidth to spare, there's no real need for all of those disallows.
Support Resources: Support Request Template
If you need professional assistance with your board, please contact me for my reasonable rates.
thecoalman
Community Team Member
Posts: 4248
Joined: Wed Dec 22, 2004 3:52 am
Location: Pennsylvania, U.S.A.

Re: Robots.txt

Post by thecoalman »

I disagree with pentapenguin, because everything you are disallowing is really of no use to people searching for information and/or shouldn't be accessed anyway. One thing I will note is that the correct syntax for disallowing a directory is:

Code: Select all

Disallow: /adm/
It probably won't make a difference, though. I'd also check the delays; they are quite long. In my experience Google is by far the most aggressive bot; if you want to slow it down, get a Webmaster Tools account and you can set the crawl rate in there.

I'd just remove the delays and only add them as you need them.
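
For what it's worth, a crawler that honours Crawl-delay reads it per user-agent group. Here is a minimal sketch using Python's standard urllib.robotparser, with rules modelled on the file quoted at the top of the thread. The bot name OtherBot is made up for illustration, and not every crawler honours this directive:

```python
from urllib.robotparser import RobotFileParser

# Rules modelled on the file quoted at the top of the thread.
RULES = """\
User-agent: msnbot
Crawl-delay: 30

User-agent: *
Disallow: /adm/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Delay (in seconds) requested from each bot; None means no delay was set.
msnbot_delay = parser.crawl_delay("msnbot")
other_delay = parser.crawl_delay("OtherBot")  # hypothetical bot name

print("msnbot:", msnbot_delay)
print("OtherBot:", other_delay)
```

A bot that has no group of its own falls back to the `*` group, which in this sketch sets no delay at all.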
“Results! Why, man, I have gotten a lot of results! I have found several thousand things that won’t work.”

Attributed - Thomas Edison
Phil
Former Team Member
Posts: 10403
Joined: Sat Nov 25, 2006 4:11 am
Name: Phil Crumm

Re: Robots.txt

Post by Phil »

thecoalman wrote: Disagree with pentapenguin because everything you are disallowing is really of no use to people searching for information and/or shouldn't be accessed anyway.
I think he was referring to the bot blocking.
Moving on, with the wind. | My Corner of the Web

Re: Robots.txt

Post by ameeck »

Yes, but the disallowed pages don't contain any content that a user would be directly interested in, and they are of no use in the search results for your site, e.g. the posting page :-)

Re: Robots.txt

Post by thecoalman »

iWisdom wrote: I think he was referring to the bot blocking.
Well, I can agree with that too. Generally I look over my stats, and if I see a large amount of bandwidth going to a weird user agent I'll look it up. If it's a bot and isn't providing any benefit to me I'll block it, but that doesn't necessarily mean the bot is going to stop. In that case you need to set a bot trap.

Re: Robots.txt

Post by pentapenguin »

iWisdom wrote:
thecoalman wrote: Disagree with pentapenguin because everything you are disallowing is really of no use to people searching for information and/or shouldn't be accessed anyway.
I think he was referring to the bot blocking.
Yes, I was indeed referring to all those bots. :) It's trivial to change the user agent string, so blocking by name provides no added security and could affect some legitimate users.
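
To illustrate why the per-bot blocks are only a request: matching is done on the name the client reports, so a client that reports a different name falls through to the general rules. A rough sketch with Python's urllib.robotparser follows. The rules are a two-group excerpt modelled on the file above, and a client that never reads robots.txt isn't affected at all:

```python
from urllib.robotparser import RobotFileParser

# Two groups modelled on the file above: one named bot, then everyone else.
RULES = """\
User-agent: WebZip
Disallow: /

User-agent: *
Disallow: /adm/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The same client, first reporting its real name, then a browser-like one.
honest = parser.can_fetch("WebZip/4.0", "/viewtopic.php")
disguised = parser.can_fetch("Mozilla/5.0", "/viewtopic.php")

print("WebZip/4.0 may fetch:", honest)
print("Mozilla/5.0 may fetch:", disguised)
```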
Avaya
Registered User
Posts: 73
Joined: Thu Mar 08, 2007 7:08 pm
Location: Vancouver, Canada

Re: Robots.txt

Post by Avaya »

OK, here is a complete newbie question for the group. What exactly is a robots.txt file and what does it do? Why would I want one, or not want one? In a short word - help!
My website: SixSeventyOne.net

Re: Robots.txt

Post by ameeck »

I'd suggest Googling; you will find plenty of information. But basically, it's a file that search engine crawlers download to see which parts of the website they may crawl and which they should leave untouched.
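
Roughly, a well-behaved crawler does something like the following before requesting a page. This is only a sketch using Python's urllib.robotparser with a few of the rules posted earlier in the thread. ExampleBot is a made-up crawler name, and a real crawler would download the file from the site's /robots.txt rather than use an inline string:

```python
from urllib.robotparser import RobotFileParser

# A few of the rules posted earlier in the thread.
RULES = """\
User-agent: *
Disallow: /adm/
Disallow: /posting.php
Disallow: /ucp.php
"""

def may_crawl(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the rules above let `agent` request `path`."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)

for path in ("/viewtopic.php?t=1", "/posting.php?mode=reply", "/adm/index.php"):
    print(path, "->", "crawl" if may_crawl(path) else "skip")
```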

Re: Robots.txt

Post by gavpedz »

thecoalman wrote: One thing I will note is the correct syntax for disallowing a directory is:

Code: Select all

Disallow: /adm/
Like this then, is that right?
Do you think there are any others I need to add that the search engines don't really need to crawl?

Code: Select all

User-agent: *
Disallow: /adm/
Disallow: /cache/
Disallow: /CVS/
Disallow: /develop/
Disallow: /docs/
Disallow: /files/
Disallow: /includes/
Disallow: /install/
Disallow: /language/
Disallow: /store/
Disallow: /styles/
Disallow: /common.php/
Disallow: /config.php/
Disallow: /cron.php/
Disallow: /download.php/
Disallow: /faq.php/
Disallow: /mcp.php/
Disallow: /posting.php/
Disallow: /report.php/
Disallow: /style.php/
Disallow: /ucp.php/
Disallow: /viewonline.php/
Disallow: /memberlist.php/
Disallow: /search.php/

Re: Robots.txt

Post by thecoalman »

These are files; you don't add a slash to them.

Code: Select all

Disallow: /common.php/
Disallow: /config.php/
Disallow: /cron.php/
Disallow: /download.php/
Disallow: /faq.php/
Disallow: /mcp.php/
Disallow: /posting.php/
Disallow: /report.php/
Disallow: /style.php/
Disallow: /ucp.php/
Disallow: /viewonline.php/
Disallow: /memberlist.php/
Disallow: /search.php/
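
As a quick check of why the trailing slash matters for files: rules are compared against the start of the requested path, so /faq.php/ never matches a request for /faq.php itself. A small sketch using Python's urllib.robotparser (ExampleBot is a made-up name, and other parsers may differ in details):

```python
from urllib.robotparser import RobotFileParser

def blocked(rule: str, path: str) -> bool:
    """True if a single Disallow `rule` for all bots blocks `path`."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    return not parser.can_fetch("ExampleBot", path)  # ExampleBot: made-up name

for rule in ("/faq.php/", "/faq.php"):
    for path in ("/faq.php", "/faq.php?mode=bbcode"):
        print(f"Disallow: {rule:<10} {path:<22} blocked={blocked(rule, path)}")
```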

Re: Robots.txt

Post by ameeck »

Actually, both versions work just fine and are acceptable for directories, with or without the slash.

The only difference is that /adm will also match something like /admin/ or /admiral.php (just making up a word),
while /adm/ will match only that single directory.
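
The prefix behaviour described above can be tried out with Python's urllib.robotparser. This is just a sketch: ExampleBot is a made-up name, and the paths are the made-up ones from the post:

```python
from urllib.robotparser import RobotFileParser

def allowed(rule: str, path: str) -> bool:
    """True if a lone `Disallow: rule` for all bots still permits `path`."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    return parser.can_fetch("ExampleBot", path)  # ExampleBot: made-up name

# Compare the two spellings of the rule against each path.
for path in ("/adm/index.php", "/admin/", "/admiral.php"):
    print(f"{path:<16} /adm: {allowed('/adm', path)!s:<6} /adm/: {allowed('/adm/', path)}")
```

With /adm all three paths are refused; with /adm/ only the path inside the adm directory is.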