Advanced configuration

Robot

A file named robots.txt can be placed in the web server's root directory to tell search engines and download programs how they should behave on the server. With this file we can state which crawlers may access the server and which directories they may work on. All download programs are supposed to honour it, although this is not always the case.

An example robots.txt file could be:


User-agent: *
Disallow: /admin/
Disallow: /imagenes/
Disallow: /includes/
Disallow: /privado/

              

This forbids search engines and download programs from accessing these directories.
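
One way to check what such a file permits is to feed it to a parser. The following is a minimal sketch using the urllib.robotparser module from Python's standard library. The client name AnyBot and the tested paths are made up for illustration, and other parsers may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The example file shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /imagenes/
Disallow: /includes/
Disallow: /privado/
"""


def load_rules(text):
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "AnyBot" is a hypothetical client name; the "*" record applies to it.
for path in ("/admin/login.html", "/privado/", "/index.html"):
    verdict = "allowed" if rules.can_fetch("AnyBot", path) else "disallowed"
    print(path, verdict)
```

Only the paths under the four listed directories are reported as disallowed; everything else on the server stays open.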

The file's configuration uses the following syntax:

<field> ":" <value>

where each line should end Unix-style, that is, with \n alone and not with the \r\n used in DOS/Win32 files. The robots exclusion standard itself also accepts \r\n, so treat this as the safest choice rather than a strict requirement.
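
If the file was edited on DOS/Win32, the line endings can be converted before uploading. This is a small sketch of one way to do it in Python; the function name to_unix_newlines and the sample content are assumptions made for the example.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert DOS (CR LF) and lone CR line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical content as saved by a DOS/Win32 editor.
dos_content = b"User-agent: *\r\nDisallow: /privado/\r\n"
unix_content = to_unix_newlines(dos_content)
print(unix_content)
```

When applying this to a real file, opening it in binary mode ("rb" for reading, "wb" for writing) keeps Python from translating the line endings on its own.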

The fields can be:

User-agent

First we specify the browser or client program. For example:

User-agent: googlebot
                

The wildcard "*" can be used to mean any client program, that is:

User-agent: *
                

By analysing the requests for the robots.txt file in the Apache logs we can see the identifying names of the various programs, for example:


ia_archiver
Ask Jeeves/Teoma
Yahoo! Slurp
msnbot
Googlebot
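
The log analysis described above could be sketched as follows. It assumes the server writes Apache's "combined" log format, in which the User-Agent header is the last quoted field; the sample log lines are invented for the example and the function name robots_txt_agents is hypothetical.

```python
import re
from collections import Counter

# Matches the request, status, size, referer and user-agent fields
# of a line in Apache's combined log format.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_agents(lines):
    """Count the User-Agent strings that requested /robots.txt."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


# Invented sample lines; real ones would be read from the access log.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] "GET /robots.txt HTTP/1.0" '
    '200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [10/Oct/2005:13:56:01 +0200] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.3 - - [10/Oct/2005:14:02:17 +0200] "GET /robots.txt HTTP/1.0" '
    '200 68 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
]

agent_counts = robots_txt_agents(SAMPLE_LOG)
for agent, hits in agent_counts.most_common():
    print(hits, agent)
```

The name to put after User-agent: in robots.txt is normally the short token at the start of the logged string (for instance Googlebot), not the whole string.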

                

Disallow:

Next come the Disallow: directives, which specify files or directories. Each value is a path starting with "/". For example, the following directives ask robots not to fetch the page "/privada.html", the directory "/includes/" and the directory "/cgi-bin/":


Disallow: /privada.html
Disallow: /includes/
Disallow: /cgi-bin/

                

If the name we specify does not end in "/", it matches everything that begins with the given string. For example:

Disallow: /datos
                

would forbid crawlers from loading, for example, both /datos/ and /datos.html.
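
The prefix behaviour can be tried out with Python's urllib.robotparser. This is only a sketch: the helper rules_for, the client name AnyBot and the path /datos/informe.html are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def rules_for(disallow_value):
    """Build a parser for a one-rule file that applies to every robot."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + disallow_value])
    return parser


without_slash = rules_for("/datos")   # prefix match on "/datos"
with_slash = rules_for("/datos/")     # only the directory itself

for path in ("/datos/", "/datos.html", "/datos/informe.html"):
    print(
        path,
        "| /datos:", without_slash.can_fetch("AnyBot", path),
        "| /datos/:", with_slash.can_fetch("AnyBot", path),
    )
```

With the trailing "/" only the directory and its contents are covered, so /datos.html remains fetchable.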

If nothing follows Disallow:, everything is allowed.

At least one Disallow: line is mandatory.

Comments

The # symbol introduces comments in robots.txt files, just as it does in the shell.
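
As a quick illustration, the sketch below parses a commented file with Python's urllib.robotparser and shows that the comments do not affect the rules. The file content and the client name AnyBot are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with a full-line comment and trailing comments.
COMMENTED = """\
# Keep robots out of the private area
User-agent: *        # applies to every robot
Disallow: /privado/  # trailing comments are ignored too
"""

commented_rules = RobotFileParser()
commented_rules.parse(COMMENTED.splitlines())

print(commented_rules.can_fetch("AnyBot", "/privado/notas.html"))
print(commented_rules.can_fetch("AnyBot", "/index.html"))
```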

Examples

Allow any download:


User-agent: *
Disallow:

                

To turn away all spiders:


User-agent: *
Disallow: /
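
The difference between an empty Disallow: and Disallow: / is easy to get wrong, so here is a short sketch that compares the two files with Python's urllib.robotparser. The helper parse_rules, the client name AnyBot and the tested path are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


allow_all = parse_rules("User-agent: *\nDisallow:\n")
deny_all = parse_rules("User-agent: *\nDisallow: /\n")

print("empty Disallow:", allow_all.can_fetch("AnyBot", "/cualquier/pagina.html"))
print("Disallow: /   :", deny_all.can_fetch("AnyBot", "/cualquier/pagina.html"))
```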

                

Forbid spiders from loading the /cgi-bin/ and /imagenes/ directories:


User-agent: *
Disallow: /cgi-bin/
Disallow: /imagenes/

                

Forbid the email spider from reading any page:


User-agent: emailspider
Disallow: /

                

Forbid Google from loading the contents of the visitas directory:


User-agent: Googlebot
Disallow: /visitas/
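
Rules addressed to a named robot apply only to that robot. The sketch below combines the last two examples in one file and checks them with Python's urllib.robotparser. The client name OtherBot, the version suffix on Googlebot and the tested paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser

PER_AGENT = """\
User-agent: emailspider
Disallow: /

User-agent: Googlebot
Disallow: /visitas/
"""

agent_rules = RobotFileParser()
agent_rules.parse(PER_AGENT.splitlines())

checks = [
    ("emailspider", "/index.html"),
    ("Googlebot/2.1", "/visitas/hoy.html"),
    ("Googlebot/2.1", "/index.html"),
    ("OtherBot", "/visitas/hoy.html"),  # no record names this robot
]
for agent, path in checks:
    print(agent, path, agent_rules.can_fetch(agent, path))
```

Because the file has no "*" record, a robot that is not named in it is not restricted at all.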

                

Example


# Robots.txt file from http://www.searchengineworld.com
#  
# Built from text file http://info.webcrawler.com/mak/projects/robots/active/all.txt
# 
# This restricts access to only known and registered robots. 
#
User-agent: Mozilla/3.0 (compatible;miner;mailto:miner(EN)miner.com.br)
Disallow:
User-agent: WebFerret
Disallow:
User-agent: Due to a deficiency in Java it's not currently possible to set the User-agent. 
Disallow: 
User-agent: no 
Disallow: 
User-agent: 'Ahoy! The Homepage Finder' 
Disallow: 
User-agent: Arachnophilia 
Disallow: 
User-agent: ArchitextSpider 
Disallow: 
User-agent: ASpider/0.09 
Disallow: 
User-agent: AURESYS/1.0 
Disallow: 
User-agent: BackRub/*.* 
Disallow: 
User-agent: Big Brother 
Disallow: 
User-agent: BlackWidow 
Disallow: 
User-agent: BSpider/1.0 libwww-perl/0.40 
Disallow: 
User-agent: CACTVS Chemistry Spider 
Disallow: 
User-agent: Digimarc CGIReader/1.0 
Disallow: 
User-agent: Checkbot/x.xx LWP/5.x 
Disallow: 
User-agent: CMC/0.01 
Disallow: 
User-agent: combine/0.0 
Disallow: 
User-agent: conceptbot/0.3 
Disallow: 
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: 
User-agent: root/0.1 
Disallow: 
User-agent: CS-HKUST-IndexServer/1.0 
Disallow: 
User-agent: CyberSpyder/2.1 
Disallow: 
User-agent: Deweb/1.01 
Disallow: 
User-agent: DragonBot/1.0 libwww/5.0 
Disallow: 
User-agent: EIT-Link-Verifier-Robot/0.2 
Disallow: 
User-agent: Emacs-w3/v[0-9\.]+ 
Disallow: 
User-agent: EmailSiphon
Disallow:
User-agent: EMC Spider 
Disallow: 
User-agent: explorersearch 
Disallow: 
User-agent: Explorer
Disallow: 
User-agent: ExtractorPro
Disallow: 
User-agent: FelixIDE/1.0 
Disallow: 
User-agent: Hazel's Ferret Web hopper, 
Disallow: 
User-agent: ESIRover v1.0 
Disallow: 
User-agent: fido/0.9 Harvest/1.4.pl2 
Disallow: 
User-agent: Hämähäkki/0.2 
Disallow: 
User-agent: KIT-Fireball/2.0 libwww/5.0a 
Disallow: 
User-agent: Fish-Search-Robot 
Disallow: 
User-agent: Mozilla/2.0 (compatible fouineur v2.0; fouineur.9bit.qc.ca)
Disallow: 
User-agent: Robot du CRIM 1.0a 
Disallow: 
User-agent: Freecrawl 
Disallow: 
User-agent: FunnelWeb-1.0 
Disallow: 
User-agent: gcreep/1.0 
Disallow: 
User-agent: ??? 
Disallow: 
User-agent: GetURL.rexx v1.05 
Disallow: 
User-agent: Golem/1.1 
Disallow: 
User-agent: Gromit/1.0 
Disallow: 
User-agent: Gulliver/1.1 
Disallow: 
User-agent: yes 
Disallow: 
User-agent: AITCSRobot/1.1 
Disallow: 
User-agent: wired-digital-newsbot/1.5 
Disallow: 
User-agent: htdig/3.0b3 
Disallow: 
User-agent: HTMLgobble v2.2 
Disallow: 
User-agent: no 
Disallow: 
User-agent: IBM_Planetwide, 
Disallow: 
User-agent: gestaltIconoclast/1.0 libwww-FM/2.17 
Disallow: 
User-agent: INGRID/0.1 
Disallow: 
User-agent: IncyWincy/1.0b1 
Disallow: 
User-agent: Informant 
Disallow: 
User-agent: InfoSeek Robot 1.0 
Disallow: 
User-agent: Infoseek Sidewinder 
Disallow: 
User-agent: InfoSpiders/0.1 
Disallow: 
User-agent: inspectorwww/1.0 http://www.greenpac.com/inspectorwww.html
Disallow: 
User-agent: 'IAGENT/1.0' 
Disallow: 
User-agent: IsraeliSearch/1.0 
Disallow: 
User-agent: JCrawler/0.2 
Disallow: 
User-agent: Jeeves v0.05alpha (PERL, LWP, lglb(EN)doc.ic.ac.uk) 
Disallow: 
User-agent: Jobot/0.1alpha libwww-perl/4.0 
Disallow: 
User-agent: JoeBot, 
Disallow: 
User-agent: JubiiRobot
Disallow: 
User-agent: jumpstation 
Disallow: 
User-agent: Katipo/1.0 
Disallow: 
User-agent: KDD-Explorer/0.1 
Disallow: 
User-agent: KO_Yappo_Robot/1.0.4(http://yappo.com/info/robot.html) 
Disallow: 
User-agent: LabelGrab/1.1 
Disallow: 
User-agent: LinkWalker 
Disallow: 
User-agent: logo.gif crawler 
Disallow: 
User-agent: Lycos/x.x 
Disallow: 
User-agent: Lycos_Spider_(T-Rex)
Disallow: 
User-agent: Magpie/1.0 
Disallow: 
User-agent: MediaFox/x.y 
Disallow: 
User-agent: MerzScope 
Disallow: 
User-agent: NEC-MeshExplorer 
Disallow: 
User-agent: MOMspider/1.00 libwww-perl/0.40 
Disallow: 
User-agent: Monster/vX.X.X -$TYPE ($OSTYPE) 
Disallow: 
User-agent: Motor/0.2 
Disallow: 
User-agent: MuscatFerret 
Disallow: 
User-agent: MwdSearch/0.1 
Disallow: 
User-agent: NetCarta CyberPilot Pro 
Disallow: 
User-agent: NetMechanic 
Disallow: 
User-agent: NetScoop/1.0 libwww/5.0a 
Disallow: 
User-agent: NHSEWalker/3.0 
Disallow: 
User-agent: Nomad-V2.x 
Disallow: 
User-agent: NorthStar 
Disallow: 
User-agent: Occam/1.0 
Disallow: 
User-agent: HKU WWW Robot, 
Disallow: 
User-agent: Orbsearch/1.0 
Disallow: 
User-agent: PackRat/1.0 
Disallow: 
User-agent: Patric/0.01a 
Disallow: 
User-agent: Peregrinator-Mathematics/0.7 
Disallow: 
User-agent: Duppies 
Disallow: 
User-agent: Pioneer 
Disallow: 
User-agent: PGP-KA/1.2 
Disallow: 
User-agent: Resume Robot 
Disallow: 
User-agent: Road Runner: ImageScape Robot (lim(EN)cs.leidenuniv.nl) 
Disallow: 
User-agent: Robbie/0.1 
Disallow: 
User-agent: ComputingSite Robi/1.0 (robi(EN)computingsite.com) 
Disallow: 
User-agent: Roverbot 
Disallow: 
User-agent: SafetyNet Robot 0.1, 
Disallow: 
User-agent: Scooter/1.0 
Disallow: 
User-agent: not available 
Disallow: 
User-agent: Senrigan/xxxxxx 
Disallow: 
User-agent: SG-Scout 
Disallow: 
User-agent: Shai'Hulud 
Disallow: 
User-agent: SimBot/1.0 
Disallow: 
User-agent: Open Text Site Crawler V1.0 
Disallow: 
User-agent: SiteTech-Rover 
Disallow: 
User-agent: Slurp/2.0 
Disallow: 
User-agent: ESISmartSpider/2.0 
Disallow: 
User-agent: Snooper/b97_01 
Disallow: 
User-agent: Solbot/1.0 LWP/5.07 
Disallow: 
User-agent: Spanner/1.0 (Linux 2.0.27 i586) 
Disallow: 
User-agent: no 
Disallow: 
User-agent: Mozilla/3.0 (Black Widow v1.1.0; Linux 2.0.27; Dec 31 1997 12:25:00
Disallow: 
User-agent: Tarantula/1.0 
Disallow: 
User-agent: tarspider 
Disallow: 
User-agent: dlw3robot/x.y (in TclX by http://hplyot.obspm.fr/~dl/) 
Disallow: 
User-agent: Templeton/ 
Disallow: 
User-agent: TitIn/0.2 
Disallow: 
User-agent: TITAN/0.1 
Disallow: 
User-agent: UCSD-Crawler 
Disallow: 
User-agent: urlck/1.2.3 
Disallow: 
User-agent: Valkyrie/1.0 libwww-perl/0.40 
Disallow: 
User-agent: Victoria/1.0 
Disallow: 
User-agent: vision-search/3.0' 
Disallow: 
User-agent: VWbot_K/4.2 
Disallow: 
User-agent: w3index 
Disallow: 
User-agent: W3M2/x.xxx 
Disallow: 
User-agent: WWWWanderer v3.0 
Disallow: 
User-agent: WebCopy/
Disallow: 
User-agent: WebCrawler/3.0 Robot libwww/5.0a 
Disallow: 
User-agent: WebFetcher/0.8, 
Disallow: 
User-agent: weblayers/0.0 
Disallow: 
User-agent: WebLinker/0.0 libwww-perl/0.1 
Disallow: 
User-agent: no 
Disallow: 
User-agent: WebMoose/0.0.0000 
Disallow: 
User-agent: Digimarc WebReader/1.2 
Disallow: 
User-agent: webs(EN)recruit.co.jp 
Disallow: 
User-agent: webvac/1.0 
Disallow: 
User-agent: webwalk 
Disallow: 
User-agent: WebWalker/1.10 
Disallow: 
User-agent: WebWatch 
Disallow: 
User-agent: Wget/1.4.0 
Disallow: 
User-agent: w3mir 
Disallow: 
User-agent: no 
Disallow: 
User-agent: WWWC/0.25 (Win95) 
Disallow: 
User-agent: none 
Disallow: 
User-agent: XGET/0.7 
Disallow: 
User-agent: Nederland.zoek 
Disallow: 
User-agent: BizBot04 kirk.overleaf.com 
Disallow: 
User-agent: HappyBot (gserver.kw.net) 
Disallow: 
User-agent: CaliforniaBrownSpider 
Disallow: 
User-agent: EI*Net/0.1 libwww/0.1 
Disallow: 
User-agent: Ibot/1.0 libwww-perl/0.40 
Disallow: 
User-agent: Merritt/1.0 
Disallow: 
User-agent: StatFetcher/1.0 
Disallow: 
User-agent: TeacherSoft/1.0 libwww/2.17 
Disallow: 
User-agent: WWW Collector 
Disallow: 
User-agent: processor/0.0ALPHA libwww-perl/0.20 
Disallow: 
User-agent: wobot/1.0 from 206.214.202.45 
Disallow: 
User-agent: Libertech-Rover www.libertech.com? 
Disallow: 
User-agent: WhoWhere Robot 
Disallow: 
User-agent: ITI Spider 
Disallow: 
User-agent: w3index 
Disallow: 
User-agent: MyCNNSpider 
Disallow: 
User-agent: SummyCrawler 
Disallow: 
User-agent: OGspider 
Disallow: 
User-agent: linklooker 
Disallow: 
User-agent: CyberSpyder (amant(EN)www.cyberspyder.com) 
Disallow: 
User-agent: SlowBot 
Disallow: 
User-agent: heraSpider 
Disallow: 
User-agent: Surfbot 
Disallow: 
User-agent: Bizbot003 
Disallow: 
User-agent: WebWalker 
Disallow: 
User-agent: SandBot 
Disallow: 
User-agent: EnigmaBot 
Disallow: 
User-agent: spyder3.microsys.com 
Disallow: 
User-agent: www.freeloader.com. 
Disallow: 
User-agent: Googlebot
Disallow: 
User-agent: METAGOPHER
Disallow: 
User-agent: *
Disallow: /
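
The file above is a whitelist: each listed robot gets an empty Disallow:, and the closing "*" record shuts out everyone else. The sketch below checks that pattern on a three-record excerpt with Python's urllib.robotparser, keeping the same layout as the file. The client name UnknownBot, the version suffix on Googlebot and the tested path are invented, and other parsers may treat records that are not separated by blank lines differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the whitelist above, in the same layout.
WHITELIST = """\
User-agent: WebFerret
Disallow:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
"""

whitelist_rules = RobotFileParser()
whitelist_rules.parse(WHITELIST.splitlines())

for agent in ("WebFerret", "Googlebot/2.1", "UnknownBot"):
    print(agent, whitelist_rules.can_fetch(agent, "/index.html"))
```

A robot that does not honour robots.txt is unaffected by any of this, so a whitelist of this kind only filters well-behaved clients.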