Black Viper
Dec 29, 2003
 

I have information about AdShield and why the software behaves as it does (outlined in several news posts below). The reader affected by this issue forwarded me an explanation:

Robots.txt files are used by web sites to control which of their pages are indexed by search engine spiders.  AdShield isn’t a search engine so it doesn’t conform to this standard even in version 3.  The caching option has always been disabled by default.  Version 3 does have an exclude list which could be used to prevent it from processing any web site which objects for whatever reason.

That explains why AdShield does not conform to the robots.txt standard, and I understand the thought process behind it. However, the robots.txt standard was implemented to tell "automated" programs "not to go to a particular spot" on a web site. Alternatively, it may have been implemented to tell automated programs to "not index a particular file or directory." To help me determine what the function should be, I quote some information from the first question and answer on http://www.robotstxt.org/wc/faq.html:

What is a WWW robot?

A robot is a program that automatically traverses the Web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that "recursive" here doesn’t limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.

Now, by this definition, any program that "automatically traverses" a site is considered a robot. I quote one more line from the first question and answer on http://www.robotstxt.org/wc/faq.html:

Normal Web browsers are not robots, because they are operated by a human, and don’t automatically retrieve referenced documents (other than inline images).

Web browsers cannot be considered robots because someone must point them to a particular site and, by default, they do not spider a domain. However, this brings to mind several more questions:

  • If a program "does what a robot does," should it conform, ignoring what "robots" should ignore, in order to reduce the network load it may cause?
  • Should it conform to the standard even if the program was "pointed" to a page?
  • Even if a program is not "indexing" a web site, should it conform?
  • Should I be telling robots, by using the robots.txt method, not to perform particular actions?

One could also take the point of view that, since "…the caching option has always been disabled by default…", this particular function is not "automatic" and need not conform.

Where should the line be drawn? I cannot say, but I do know that Internet Explorer, when used in an "automated" fashion (for example, offline browsing), does check the robots.txt file located on a web server.
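
To make "checking the robots.txt file" concrete, here is a minimal Python sketch of what an automated program could do before retrieving a page. It is only an illustration of the general idea, not how Internet Explorer or AdShield actually work. The robots.txt content, the "ExampleBot" name, and the example.com URLs are all made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and the URLs below are assumptions, not real programs or pages.
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/private/page.html",
    ):
        verdict = "allowed" if may_fetch(EXAMPLE_ROBOTS_TXT, "ExampleBot", url) else "disallowed"
        print(url, "is", verdict)
```

A real program would first download the site's own robots.txt rather than use a hard-coded string. Whether a program that is not a search engine ought to run such a check at all is exactly the open question here.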

A Rant outlining this issue and more research will probably appear in due time. Meanwhile, I thank my dedicated reader not only for pointing this problem out to me, but also for taking extreme measures to help me troubleshoot it. I just need to sit on this problem a little longer and figure out what I am going to do. Too many questions… not enough answers.