Robots.txt

Posted: Wed Nov 21, 2007 1:54 am
by gavpedz
Hi guys, I just found this robots.txt file for phpBB3 on a Turkish site and wasn't sure whether all of it is necessary or whether it is OK as is:
User-agent: *
#Crawl-delay: 100
Disallow: /adm
Disallow: /cache
Disallow: /CVS
Disallow: /develop
Disallow: /docs
Disallow: /files
Disallow: /includes
Disallow: /install
Disallow: /language
Disallow: /store
Disallow: /styles
Disallow: /common.php
Disallow: /config.php
Disallow: /cron.php
Disallow: /download.php
Disallow: /faq.php
Disallow: /mcp.php
Disallow: /posting.php
Disallow: /report.php
Disallow: /style.php
Disallow: /ucp.php
#Disallow: /viewonline.php
#Disallow: /memberlist.php
#Disallow: /search.php

#--- Allowed bots
# for archive.org
User-agent: ia_archiver
Disallow:

# for google adsense bot
User-agent: Mediapartners-Google*
Disallow:

#User-agent: Slurp
Crawl-delay: 30
Disallow:

user-agent: googlebot
crawl-delay: 30

user-agent: msnbot
crawl-delay: 30


# Block malicious bots
User-agent: BotRightHere
Disallow: /

User-agent: WebZip
Disallow: /

User-agent: larbin
Disallow: /

User-agent: b2w/0.1
Disallow: /

User-agent: Copernic
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Python-urllib
Disallow: /

User-agent: NetMechanic
Disallow: /

User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: CopyRightCheck
Disallow: /

User-agent: Crescent
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: ProWebWalker
Disallow: /

User-agent: CheeseBot
Disallow: /

User-agent: LNSpiderguy
Disallow: /

User-agent: Alexibot
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: MIIxpc
Disallow: /

User-agent: Telesoft
Disallow: /

User-agent: Website Quester
Disallow: /

User-agent: moget/2.1
Disallow: /

User-agent: WebZip/4.0
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebSauger
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: NetAnts
Disallow: /

User-agent: Mister PiX
Disallow: /

User-agent: WebAuto
Disallow: /

User-agent: TheNomad
Disallow: /

User-agent: WWW-Collector-E
Disallow: /

User-agent: RMA
Disallow: /

User-agent: libWeb/clsHTTP
Disallow: /

User-agent: asterias
Disallow: /

User-agent: httplib
Disallow: /

User-agent: turingos
Disallow: /

User-agent: spanner
Disallow: /

User-agent: InfoNaviRobot
Disallow: /

User-agent: Harvest/1.5
Disallow: /

User-agent: Bullseye/1.0
Disallow: /

User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /

User-agent: CherryPickerSE/1.0
Disallow: /

User-agent: CherryPickerElite/1.0
Disallow: /

User-agent: WebBandit/3.50
Disallow: /

User-agent: NICErsPRO
Disallow: /

User-agent: Microsoft URL Control - 5.01.4511
Disallow: /

User-agent: DittoSpyder
Disallow: /

User-agent: Foobot
Disallow: /

User-agent: SpankBot
Disallow: /

User-agent: BotALot
Disallow: /

User-agent: lwp-trivial/1.34
Disallow: /

User-agent: lwp-trivial
Disallow: /

User-agent: BunnySlippers
Disallow: /

User-agent: Microsoft URL Control - 6.00.8169
Disallow: /

User-agent: URLy Warning
Disallow: /

User-agent: Wget/1.6
Disallow: /

User-agent: Wget/1.5.3
Disallow: /

User-agent: Wget
Disallow: /

User-agent: LinkWalker
Disallow: /

User-agent: cosmos
Disallow: /

User-agent: moget
Disallow: /

User-agent: hloader
Disallow: /

User-agent: humanlinks
Disallow: /

User-agent: LinkextractorPro
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Mata Hari
Disallow: /

User-agent: LexiBot
Disallow: /

User-agent: Web Image Collector
Disallow: /

User-agent: The Intraformant
Disallow: /

User-agent: True_Robot/1.0
Disallow: /

User-agent: True_Robot
Disallow: /

User-agent: BlowFish/1.0
Disallow: /

User-agent: JennyBot
Disallow: /

User-agent: MIIxpc/4.2
Disallow: /

User-agent: BuiltBotTough
Disallow: /

User-agent: ProPowerBot/2.14
Disallow: /

User-agent: BackDoorBot/1.0
Disallow: /

User-agent: toCrawl/UrlDispatcher
Disallow: /

User-agent: suzuran
Disallow: /

User-agent: TightTwatBot
Disallow: /

User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /

User-agent: VCI
Disallow: /

User-agent: Szukacz/1.4
Disallow: /

User-agent: Openfind data gatherer
Disallow: /

User-agent: Openfind
Disallow: /

User-agent: Xenu's Link Sleuth 1.1c
Disallow: /

User-agent: Xenu's
Disallow: /

User-agent: Zeus
Disallow: /

User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /

User-agent: RepoMonkey
Disallow: /

User-agent: Microsoft URL Control
Disallow: /

User-agent: Openbot
Disallow: /

User-agent: URL Control
Disallow: /

User-agent: Zeus Link Scout
Disallow: /

User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /

User-agent: Webster Pro
Disallow: /

User-agent: EroCrawler
Disallow: /

User-agent: LinkScan/8.1a Unix
Disallow: /

User-agent: Keyword Density/0.9
Disallow: /

User-agent: Kenjin Spider
Disallow: /

User-agent: Iron33/1.0.2
Disallow: /

User-agent: Bookmark search tool
Disallow: /

User-agent: GetRight/4.2
Disallow: /

User-agent: FairAd Client
Disallow: /

User-agent: Gaisbot
Disallow: /

User-agent: Aqua_Products
Disallow: /

User-agent: Radiation Retriever 1.1
Disallow: /

User-agent: Flaming AttackBot
Disallow: /

Re: Robots.txt

Posted: Thu Nov 22, 2007 4:12 pm
by gavpedz
anyone?

Re: Robots.txt

Posted: Thu Nov 22, 2007 5:39 pm
by ameeck
Actually the first part should be enough:

Code:

User-agent: *
Disallow: /adm
Disallow: /cache
Disallow: /CVS
Disallow: /develop
Disallow: /docs
Disallow: /files
Disallow: /includes
Disallow: /install
Disallow: /language
Disallow: /store
Disallow: /styles
Disallow: /common.php
Disallow: /config.php
Disallow: /cron.php
Disallow: /download.php
Disallow: /faq.php
Disallow: /mcp.php
Disallow: /posting.php
Disallow: /report.php
Disallow: /style.php
Disallow: /ucp.php

Re: Robots.txt

Posted: Thu Nov 22, 2007 5:42 pm
by pentapenguin
Unless you don't have much bandwidth to spare, there's no real need for all of those disallows.

Re: Robots.txt

Posted: Thu Nov 22, 2007 6:26 pm
by thecoalman
I disagree with pentapenguin, because everything you are disallowing is really of no use to people searching for information and/or shouldn't be accessed anyway. One thing I will note is that the correct syntax for disallowing a directory is:

Code:

Disallow: /adm/
It probably won't make a difference, though. I'd also check the delays; they are quite long. In my experience Google is by far the most aggressive bot. If you want to slow it down, get a Webmaster Tools account and you can specify the crawl rate in there.

I'd just remove the delays and only add them as you need them.
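
To see how those delay lines are read, here is a minimal Python sketch using the standard library's urllib.robotparser. The rules are trimmed from the file in the first post, and the bot name "SomeOtherBot" is made up for the example. Keep in mind that Crawl-delay is only a request: not every crawler honors it, and Googlebot does not support the directive at all, which is why the Webmaster Tools setting is the way to slow Google down.

```python
from urllib.robotparser import RobotFileParser

# Rules trimmed from the robots.txt quoted in the first post.
RULES = """\
User-agent: *
Disallow: /adm

User-agent: msnbot
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns the delay in seconds for the matching group,
# or None when no Crawl-delay applies to that user agent.
msnbot_delay = parser.crawl_delay("msnbot")
other_delay = parser.crawl_delay("SomeOtherBot")  # hypothetical bot name

print(msnbot_delay, other_delay)
```

Only the msnbot group carries a delay here; any other bot falls back to the `*` group, which has none.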

Re: Robots.txt

Posted: Thu Nov 22, 2007 6:45 pm
by Phil
thecoalman wrote: I disagree with pentapenguin, because everything you are disallowing is really of no use to people searching for information and/or shouldn't be accessed anyway.
I think he was referring to the bot blocking.

Re: Robots.txt

Posted: Thu Nov 22, 2007 7:04 pm
by ameeck
Yes, but the disallowed pages do not contain any content that a user could be directly interested in, and they are of no use in the search results for your site, e.g. the posting page :-)

Re: Robots.txt

Posted: Thu Nov 22, 2007 7:38 pm
by thecoalman
iWisdom wrote:I think he was referring to the bot blocking.
Well, I can agree with that too. Generally I look over my stats, and if I see a large amount of bandwidth going to a weird user agent I'll look it up. If it's a bot and isn't providing any benefit to me I'll block it, but that doesn't necessarily mean it's going to stop it either. In that case you need to set a bot trap.
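
If you don't have a stats package, the same check can be roughed out by hand. This is only a sketch: it assumes your server writes Apache "combined" format access logs, and the three log lines, IP addresses and byte counts below are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical access-log lines in Apache "combined" format.
LOG_LINES = [
    '192.0.2.1 - - [22/Nov/2007:19:38:00 +0000] "GET /viewtopic.php HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '198.51.100.7 - - [22/Nov/2007:19:38:05 +0000] "GET /index.php HTTP/1.1" 200 2048 "-" "WebZip/4.0"',
    '198.51.100.7 - - [22/Nov/2007:19:38:06 +0000] "GET /viewforum.php HTTP/1.1" 200 4096 "-" "WebZip/4.0"',
]

# Captures the response size and the last quoted field (the user agent).
LINE_RE = re.compile(r'" \d{3} (\d+|-) "[^"]*" "([^"]*)"$')


def bytes_per_agent(lines):
    """Sum the bytes sent to each user agent string."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        size, agent = match.groups()
        totals[agent] += 0 if size == "-" else int(size)
    return totals


totals = bytes_per_agent(LOG_LINES)
for agent, total in totals.most_common():
    print(agent, total)
```

Agents are listed with the biggest bandwidth consumers first, so anything unfamiliar near the top is worth looking up.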

Re: Robots.txt

Posted: Thu Nov 22, 2007 8:59 pm
by pentapenguin
iWisdom wrote:
thecoalman wrote: I disagree with pentapenguin, because everything you are disallowing is really of no use to people searching for information and/or shouldn't be accessed anyway.
I think he was referring to the bot blocking.
Yes, I was indeed referring to all those bots. :) It's trivial to change the user agent string, so blocking by user agent provides no added security and could affect some legitimate users.
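
To show how little effort that takes, here is a small Python sketch using the standard library's urllib.request. The URL and the user agent string are placeholders, and no request is actually sent; the point is only that the client picks whatever name it likes.

```python
from urllib.request import Request

# Any client can claim to be any user agent; no request is sent here.
request = Request(
    "http://example.com/viewtopic.php",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (just a normal browser)"},
)

# urllib stores header names capitalized as "User-agent".
claimed_agent = request.get_header("User-agent")
print(claimed_agent)
```

A scraper that sets its header like this would never match any of the "User-agent:" names in that long list.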

Re: Robots.txt

Posted: Fri Nov 23, 2007 5:27 am
by Avaya
OK, here is a complete newbie question for the group. What exactly is a robots.txt file and what does it do? Why would I want one, or not want one? In a short word - help!

Re: Robots.txt

Posted: Fri Nov 23, 2007 5:53 am
by ameeck
I'd suggest Googling; you will find plenty of information. But basically, it's a file that search indexers can download to see which parts of the website they can crawl and which they should leave untouched.
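
Roughly, a well-behaved crawler does something like the following before it fetches a page. This is a sketch using Python's standard urllib.robotparser; the two rules are taken from the file earlier in the thread, and "ExampleBot" is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Two rules taken from the robots.txt earlier in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /posting.php
Disallow: /ucp.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(path, agent="ExampleBot"):  # "ExampleBot" is a made-up name
    """Return True if the rules allow this agent to fetch the path."""
    return parser.can_fetch(agent, path)


print(may_crawl("/viewtopic.php"))  # a topic page, not listed
print(may_crawl("/posting.php"))    # the posting form, disallowed
```

Nothing enforces this: the file is a polite request, and a bot that ignores it can still fetch everything.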

Re: Robots.txt

Posted: Fri Nov 23, 2007 9:28 am
by gavpedz
thecoalman wrote:One thing I will note is the correct syntax for disallowing a directory is:

Code:

Disallow: /adm/
Like this, then? Is that right?
Do you think there are any others I need to add that the search engines don't really need to crawl?

Code:

User-agent: *
Disallow: /adm/
Disallow: /cache/
Disallow: /CVS/
Disallow: /develop/
Disallow: /docs/
Disallow: /files/
Disallow: /includes/
Disallow: /install/
Disallow: /language/
Disallow: /store/
Disallow: /styles/
Disallow: /common.php/
Disallow: /config.php/
Disallow: /cron.php/
Disallow: /download.php/
Disallow: /faq.php/
Disallow: /mcp.php/
Disallow: /posting.php/
Disallow: /report.php/
Disallow: /style.php/
Disallow: /ucp.php/
Disallow: /viewonline.php/
Disallow: /memberlist.php/
Disallow: /search.php/

Re: Robots.txt

Posted: Fri Nov 23, 2007 10:24 am
by thecoalman
These are files; you don't add a trailing slash to them:

Code:

Disallow: /common.php/
Disallow: /config.php/
Disallow: /cron.php/
Disallow: /download.php/
Disallow: /faq.php/
Disallow: /mcp.php/
Disallow: /posting.php/
Disallow: /report.php/
Disallow: /style.php/
Disallow: /ucp.php/
Disallow: /viewonline.php/
Disallow: /memberlist.php/
Disallow: /search.php/

Re: Robots.txt

Posted: Fri Nov 23, 2007 2:06 pm
by ameeck
Actually, both versions work just fine and are acceptable, without the slash and with it.

The only difference is that /adm will also match something like /admin/ or /admiral.php (just making up a word),
while /adm/ will match only this single directory.
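
The prefix matching is easy to check with Python's standard urllib.robotparser. This is a sketch of how that one parser behaves, and real crawlers may differ in details; the `blocked` helper is just a convenience written for this example. It also covers the file case from the earlier post: a rule with a trailing slash on a file name does not match the file itself.

```python
from urllib.robotparser import RobotFileParser


def blocked(rule, path):
    """Return True if a single Disallow rule blocks the given path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    return not parser.can_fetch("*", path)


# (rule, path) pairs: directory rules first, then file rules.
CASES = [
    ("/adm", "/adm/index.php"),
    ("/adm", "/admiral.php"),
    ("/adm/", "/adm/index.php"),
    ("/adm/", "/admiral.php"),
    ("/common.php", "/common.php"),
    ("/common.php/", "/common.php"),
]

results = [(rule, path, blocked(rule, path)) for rule, path in CASES]
for rule, path, is_blocked in results:
    print(f"Disallow: {rule:<13} {path:<15} blocked={is_blocked}")
```

So for directories the trailing slash just makes the rule stricter about what it matches, while for files it is safer to leave it off.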