LinuxQuestions.org
05-10-2012, 11:22 AM   #1
hemite
LQ Newbie
 
Registered: Jan 2012
Posts: 4

Rep: Reputation: Disabled
robots.txt ignored on vsftpd


I have set up a public FTP server using vsftpd that anyone can access, but I do not want it to be indexed by the Googlebot spider. I have put a robots.txt file in the server's root directory, which is supposed to stop the crawler from downloading files, yet the crawling continues. My robots.txt file is:


Code:
User-Agent: *
Disallow: /
Why is Googlebot ignoring my robots.txt file? Or is my file misconfigured?
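
One way to rule out a misconfigured file is to run the two rules above through a parser. Below is a minimal sketch using Python's standard `urllib.robotparser`. The host name `ftp.example.com` and the path `/pub/file.iso` are made up for illustration, and the standard-library parser is only an approximation of how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in the post above.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs; only the path part is matched against the rules.
    for path in ("/", "/pub/file.iso"):
        url = "ftp://ftp.example.com" + path
        print(url, "allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

If every path is reported as not allowed, the rules themselves are doing what they are meant to, and the cause is more likely somewhere else (file location, or the crawler not having re-read the file yet).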

Last edited by hemite; 05-10-2012 at 12:04 PM.
 
05-10-2012, 12:37 PM   #2
MensaWater
LQ Guru
 
Registered: May 2005
Location: Atlanta Georgia USA
Distribution: Redhat (RHEL), CentOS, Fedora, CoreOS, Debian, FreeBSD, HP-UX, Solaris, SCO
Posts: 7,432
Blog Entries: 15

Rep: Reputation: 1436
The only time I've run into one of these was when I was trying to crawl a site. The documentation I found says the file has to be in the root of the web site. After I later got access via another method, I saved the robots.txt files (there were multiple files, presumably one for each site).

For what it's worth, the text of the main one is below. Based on it, I'm not sure "/" alone would be sufficient. Also, as noted, you may have to put one in each site rather than a single one on the server covering all of them. I can't be sure, as I've never set one up myself.

Code:
# $Id: robots.txt,v 1.9.2.1 2008/12/10 20:12:19 goba Exp $
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# For syntax checking, see:
# http://www.sxw.org.uk/computing/robots/check.html

User-agent: *
Crawl-delay: 10
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /sites/
Disallow: /themes/
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
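
To illustrate the "root of your host" rule from the comments in that file, here is a minimal Python sketch that derives the one robots.txt location a crawler would consult for a given URL. It assumes the same rule carries over to an ftp:// host; the function name and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(url: str) -> str:
    """Return the host-root robots.txt URL for the host that serves url."""
    parts = urlsplit(url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical URLs: a file in a subdirectory still maps to the root file.
    print(robots_txt_url("http://example.com/site/page.html"))
    print(robots_txt_url("ftp://ftp.example.com/A/file.txt"))
```

Under this rule a copy placed in a subdirectory, such as /site/robots.txt, is never looked up.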
 
05-10-2012, 06:09 PM   #3
hemite
LQ Newbie
 
Registered: Jan 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
So I've changed my robots.txt.

I have two folders in my root directory that I want blocked, folder A and folder B. I changed my robots.txt to explicitly mention folders A and B, and I even put another robots.txt inside folder A with Disallow: / just to be sure. Neither worked; the bot is still crawling through my server.
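
For comparison, here is a rough Python sketch of what folder-specific rules cover. The exact lines of the changed file were not posted, so the rules below (`Disallow: /A/` and `Disallow: /B/`) and the sample paths are assumptions, checked with the standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Assumed form of the changed file; the real one was not posted.
FOLDER_RULES = """\
User-agent: *
Disallow: /A/
Disallow: /B/
"""


def blocked_paths(robots_txt, paths, user_agent="Googlebot"):
    """Return the subset of paths that the rules forbid for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


if __name__ == "__main__":
    # Hypothetical file names inside and outside the two folders.
    sample = ["/A/x.txt", "/B/y.txt", "/C/z.txt"]
    print(blocked_paths(FOLDER_RULES, sample))
```

Rules like these are narrower than a plain Disallow: /, since anything outside the two folders stays open to the crawler.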
 
05-11-2012, 07:17 PM   #4
hemite
LQ Newbie
 
Registered: Jan 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
I figured this out, for anyone reading this in the future. My robots.txt file was set up correctly. However, Googlebot was already inside a subdirectory, and it remembered its place when I stopped the FTP server to get rid of it. Once it finished downloading the file it was on, it went back to the root folder of the FTP server, read robots.txt, and left my server alone. All is well.
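
For anyone who wants to confirm that a crawler can actually read the file from the FTP root, here is a minimal Python sketch using the standard `ftplib`. It assumes the server accepts anonymous logins and that robots.txt sits in the anonymous root; the host name in the usage note is a placeholder, and the `ftp_factory` parameter exists only so the function can be exercised without a live server.

```python
import ftplib


def fetch_robots_txt(host: str, ftp_factory=ftplib.FTP, timeout: float = 10.0) -> str:
    """Log in anonymously to host and return the text of /robots.txt."""
    lines = []
    ftp = ftp_factory(host, timeout=timeout)  # connects when a host is given
    try:
        ftp.login()  # no arguments means an anonymous login
        ftp.retrlines("RETR /robots.txt", lines.append)
    finally:
        ftp.close()
    return "\n".join(lines)
```

Calling `print(fetch_robots_txt("ftp.example.com"))` against your own host should print the same rules you wrote. An `ftplib.error_perm` exception would suggest the file is missing or not readable by anonymous users.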
 
All times are GMT -5.