
 

I took the web server down for about 20 minutes while I scrambled to transfer the domain to a different computer. Everything should be functioning again. The issue started at about 11:46 AM when one of my image directories became corrupted. The Windows XP Pro Install Guide thumbnails became inaccessible at that time. Unfortunately, I did not discover the issue until many hours later.

I apologize for any inconvenience this may have caused.

 

I am beginning to sound like a broken record. Once again, I am way behind on E-Mails (several days' worth). I will get back to as many people as I can.

 

I have information about AdShield and the reason the particular software performs as it does (outlined in several news posts below). An explanation was forwarded to me by the particular reader this issue affected:

Robots.txt files are used by web sites to control which of their pages are indexed by search engine spiders.  AdShield isn’t a search engine so it doesn’t conform to this standard even in version 3.  The caching option has always been disabled by default.  Version 3 does have an exclude list which could be used to prevent it from processing any web site which objects for whatever reason.

I guess that explains why it does not conform to the robots.txt standard for the reason quoted above. I also understand the thought process behind it. However, the robots.txt standard was implemented to tell "automated" programs "not to go to a particular spot" on a web site. Actually, it could also be that it was implemented to tell automated programs to "not index a particular file or directory." To help me in determining what the function should be, I quote some information from the first question and answer on http://www.robotstxt.org/wc/faq.html:

What is a WWW robot?

A robot is a program that automatically traverses the Web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that "recursive" here doesn’t limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.

Now, by this definition, any program that "automatically traverses" a site is considered a robot. I quote one more line from the first question and answer on http://www.robotstxt.org/wc/faq.html:

Normal Web browsers are not robots, because they are operated by a human, and don’t automatically retrieve referenced documents (other than inline images).

Web browsers cannot be considered robots because someone must point them to a particular site and, by default, they do not spider a domain. However, this brings to mind several more questions:

  • If a program "does what a robot does," should it conform, to reduce the network load it may cause, by ignoring what "robots" should ignore?
  • Should it conform to the standard even if the program was "pointed" to a page?
  • Even if a program is not "indexing" a web site, should it conform?
  • Should I be telling robots, by using the robots.txt method, not to perform particular actions?

One could also take the point of view that since "…the caching option has always been disabled by default…" this particular function is not "automatic" and should not conform.

Where should the line be drawn? I cannot say, but I do know that Internet Explorer, when set in an "automated" fashion, for example: offline browsing, does check the robots.txt file located on a web server.
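To make the question concrete, here is a minimal sketch of how an automated pre-fetcher could consult robots.txt before downloading linked pages. It uses Python's standard `urllib.robotparser` module. The robots.txt content, the `/trap/` directory, the `ExamplePrefetcher` user agent and the example.com URLs are all assumptions made up for illustration; this is not how AdShield or Internet Explorer is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file for any given site will differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def allowed_links(robots_txt, user_agent, links):
    """Return only the links that robots.txt permits this user agent to fetch."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in links if parser.can_fetch(user_agent, url)]


# Hypothetical links found on a page the reader is viewing.
links = [
    "http://www.example.com/WinXP/servicecfg.htm",
    "http://www.example.com/trap/hidden.htm",
]

# Only the first link survives; the second sits under the disallowed directory.
print(allowed_links(ROBOTS_TXT, "ExamplePrefetcher", links))
```

A program that filtered its "just in case" downloads this way would never touch a disallowed link, whether or not it considers itself a search engine.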

A Rant outlining this issue and more research will probably appear in due time. Meanwhile, I thank my dedicated reader for not only pointing this problem out to me, but also taking extreme measures to help me in troubleshooting this issue. I just need to sit on this problem a little longer and figure out what I am going to do. Too many questions… not enough answers.

 

I have been selected SETI@Home user of the day [link removed]. I think it could have something to do with my profile’s [link removed] picture involving Santa. Coincidence? The world may never know.

:)

I have also been told that the person mentioned in the last couple of news updates may be using an "older" version of AdShield. Unfortunately, due to the Holiday season, they have not been able to get any answer from technical support as to the issues outlined below. If it does turn out that the older version is flawed and the latest does conform to the robots.txt standard, I will update the news.

 

Well, the person outlined two updates below "is" using a plain version of IE6, that is, if you do not include using AdShield. AdShield blocks pop-ups and banner ads. Since I do not have any, it is rather pointless to have it running on my site. However, this is the intriguing thing. Cut and pasted from the home page is this "feature:"

Improves performance using optional background downloading and caching of pages/images linked to the ones you’re viewing.

Improves performance for "whom?" That completely explains the reason for the log file entries. AdShield does not conform to the robots.txt standard and, therefore, indiscriminately "sucks everything," while at the exact same time:

Suppresses the download and display of ad images and frames.

Wow. Even though I am highly "anti-banner ads," this gives great cause for those sites that depend on ad revenue to be annoyed at this type of program. Not only does this program block their income generator, it creates "more traffic" by pre-fetching links "just in case." I am not exactly sure whether the program kills the request for the ad content or downloads it and just does not "display" it to the viewer. That will require more research.

I know there is a Rant in this news update, somewhere…

 

In a totally unrelated matter, I usually post the “time” of my update. In the previous post, I put the time as “10:03 PM.” The reality is that this is the time I “started” posting the news update. The clock on my system now shows “11:02 PM.” Where am I going with this? All updates, no matter how small or what the content, take time.

I have tons of stuff I would love more than anything to post about. However, the information that I post, I try to make it as accurate and complete as I possibly can. The problem with this? I just plain do not have the time to do everything. What my readers get is “small” content updates but, rest assured, when I do post an article or guide, I did not just toss the information up at random. I took my time to get it right.

 

In an attempt to give my readers a little insight as to "what goes on behind the scenes," I have posted the following news update.

Time to start the latest Quick Rant.

This is the longest news update I have had in a while. The reason? I am banging my head up against the monitor.

I, once again, fired up the automatic banning of IP addresses last night. This is due to my desire to stop "bad" robots from sucking too much bandwidth. More information on this practice is located in my Abuse Rant. Basically, I implemented a "hidden link" which all "good" robots, including all major search engines, would ignore.
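The "hidden link" idea can be sketched in a few lines of Python. Everything named here is an assumption for illustration only: the trap path `/trap/hidden.htm`, the `flag_trap_hits` helper and the sample IP addresses (taken from the reserved documentation range) are hypothetical, and my actual server configuration is not shown.

```python
# Hypothetical trap path. It would be listed under "Disallow" in robots.txt,
# so well-behaved robots never request it, and it is invisible to human readers.
TRAP_PATH = "/trap/hidden.htm"


def flag_trap_hits(requests, trap_path=TRAP_PATH):
    """Return the set of client IPs that requested the trap path.

    `requests` is an iterable of (ip, path) pairs taken from the server log.
    """
    return {ip for ip, path in requests if path == trap_path}


# Hypothetical sample traffic using documentation-only IP addresses.
sample = [
    ("192.0.2.10", "/WinXP/servicecfg.htm"),
    ("192.0.2.10", "/trap/hidden.htm"),
    ("192.0.2.20", "/WinXP/servicecfg.htm"),
]

flagged = flag_trap_hits(sample)
print(sorted(flagged))
```

Whether a flagged address is banned automatically or only reported for manual review is a separate policy decision layered on top of this check.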

I had a reader contact me in distress saying they did "nothing wrong" and that they are using a plain version of IE6. Some proxy servers, pre-fetchers and firewalls choose to ignore the robots.txt standard.

After reviewing the log files, I am attempting to figure out, with this person, the exact "reason" this particular version of IE6 is attempting to "pre-fetch" links that are not valid and, as a result, causing my server to flag the IP for abuse. Whether or not this person is using any of the previously mentioned products is unknown at this time.

Once again, I have temporarily removed that particular function from the server. However, even though IP addresses are not automatically banned, I am notified immediately of the spidering attempt and the logs will remain until I can narrow down the cause of this problem.

A cut and paste from the web server log file, and my explanation of the issue will follow: (The actual IP address is removed for obvious reasons).

x.x.x.x - - [22/Dec/2003:12:45:26 -0800] "GET /WinXP/servicecfg.htm HTTP/1.1" 200 9027 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"

The above log file line (even though it may display in your browser as "several lines," for the sake of argument, it actually is only one line) denotes the page where this person entered the domain. Having no "referer" (sic) information logged (the "-" after the "200 9027") is how I came to that conclusion.
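For anyone who wants to pick a line like this apart programmatically, here is a minimal sketch in Python. The regular expression assumes the common "combined" log layout seen in the line above; the names `LOG_PATTERN`, `parse_line` and `has_referer` are made up for this example and are not part of any server software.

```python
import re

# Fields, in order: client IP, two unused identity fields, timestamp,
# request line, status code, response size, referer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Split one log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


def has_referer(entry):
    """A logged "-" means the browser sent no referer header."""
    return entry["referer"] not in ("", "-")


line = (
    'x.x.x.x - - [22/Dec/2003:12:45:26 -0800] '
    '"GET /WinXP/servicecfg.htm HTTP/1.1" 200 9027 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"'
)
entry = parse_line(line)
print(entry["path"], entry["status"], has_referer(entry))
```

A request with no referer is what I treat as an "entry" into the domain; requests that carry one were triggered from the named page.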

The next several lines are the "normal" traffic. This includes the "referer" (sic) header, which is valid and tells me that "the browser requested the information because of accessing the above page." One such entry is shown below:

x.x.x.x - - [22/Dec/2003:12:45:26 -0800] "GET /css/20031222basic.css HTTP/1.1" 200 266 "http://www.blackviper.com/WinXP/servicecfg.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"

This shows the requesting page as "servicecfg.htm" and it is also requiring a download of "20031222basic.css." This is normal traffic. However, the following request should not be there and is directly "after" the normal logging of traffic patterns:

x.x.x.x - - [22/Dec/2003:12:45:26 -0800] "GET / HTTP/1.1" 200 4581 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"

The above line tells me that, within one second of the first request, the root "index" page ("GET /") was requested by the browser, but it has no "referer" (sic) header attached like the "normal" requests do. Three seconds later, the invalid link was spidered and the IP address was automatically banned. This particular "hidden" link is also the "first link" appearing in my XHTML code. However, it gets better.

The next two lines are what frighten me the most:

x.x.x.x - - [22/Dec/2003:12:45:40 -0800] "GET /AskBV/XP25.htm HTTP/1.1" 200 211 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
x.x.x.x - - [22/Dec/2003:12:45:49 -0800] "GET /AskBV/XP25.htm HTTP/1.1" 200 211 "http://www.blackviper.com/WinXP/servicecfg.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"

On the original page, the reference to XP25.htm is the "next" link in the code. However, the first request had no referer (sic) header information (as noted in the log file by "-" after the "200 211"). The second line "does" have the referer (sic) information logged only 9 seconds later, just as if the person actually "clicked" the link and attempted to go to that page.
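The pattern in those two lines can be searched for mechanically. Below is a rough Python sketch, assuming the same "combined" log layout as above: it reports any page that an IP first requested with no referer and then requested again with one, along with the gap in seconds. The function name `suspected_prefetches` is made up for this example, and the heuristic is only a hint; a reader who types an address by hand and later clicks a link to the same page would match it too.

```python
import re
from datetime import datetime

# Only the fields needed here: client IP, timestamp, requested path, referer.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"\S+ (?P<path>\S+) [^"]+" \d{3} \S+ "(?P<referer>[^"]*)"'
)


def suspected_prefetches(lines):
    """Return (ip, path, gap_seconds) for each path that was first requested
    with no referer and later requested by the same IP with a referer."""
    first_blind = {}  # (ip, path) -> time of the first referer-less request
    results = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None:
            continue
        when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        key = (m["ip"], m["path"])
        if m["referer"] in ("", "-"):
            first_blind.setdefault(key, when)
        elif key in first_blind:
            gap = (when - first_blind[key]).total_seconds()
            results.append((m["ip"], m["path"], gap))
    return results


AGENT = '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"'
log_lines = [
    'x.x.x.x - - [22/Dec/2003:12:45:40 -0800] "GET /AskBV/XP25.htm HTTP/1.1" '
    '200 211 "-" ' + AGENT,
    'x.x.x.x - - [22/Dec/2003:12:45:49 -0800] "GET /AskBV/XP25.htm HTTP/1.1" '
    '200 211 "http://www.blackviper.com/WinXP/servicecfg.htm" ' + AGENT,
]

for ip, path, gap in suspected_prefetches(log_lines):
    print(ip, path, gap)
```

Run against the two lines above, it reports /AskBV/XP25.htm with a gap of 9 seconds, matching what I found by reading the log by hand.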

The burning question I have is "what on Earth is causing IE to pre-fetch links?"

When that question is answered, I will rest better at night.

I am sure this issue has blocked other legitimate readers and I apologize. My intentions are only good in attempting to protect my server "from the bad folk."

Other people have written to me and said, in a manner of speaking, "If you do not want people to visit your site, take it down!" That is not the issue. I am not blocking legitimate traffic (well, except for the unknown cause outlined above). What I am attempting to do is stop the complete download of my domain for no reason other than "because it is there."

 

I have had several complaints from readers about my web server automatically banning their IP address because of "abuse." This is due to my recently implemented configuration to stop "bad" robots from sucking too much bandwidth. More information on this practice is located in my Abuse Rant. I implemented a "hidden link" which all "good" robots, including all major search engines, would ignore. Some proxy servers, pre-fetchers and firewalls choose to ignore the robots.txt standard. This is an issue to take up with the creators of those programs, not my web site.

I have temporarily removed that particular function from the server. All access is currently available with the following exceptions:

  • I still block all "offline browser" access. This is due to many people synchronizing the entire domain (740+ pages) every day, which is entirely not needed.
  • Most "download managers" remain blocked. Please use the "normal" means of downloading my files. No user name and password is required to do so. If such information is requested, it could be due to the use of a download manager.
  • Access by "page editors" is not authorized. I hope the reasons are obvious.
  • When a "bad" robot hits, I will still immediately get a report via E-Mail and will selectively disable IP addresses instead of automatically banning them.

I appreciate everyone’s feedback while I fine-tune the domain to provide everyone rapid content while not alienating others.

 

Again, I wish to thank everyone at TechTV for making my appearance on The Screen Savers an enjoyable one. The whole staff made me feel right at home. Thanks to Erica K. and Joshua B. for their support. A special thanks goes out to Leo Laporte for calming my nerves enough to talk on camera. He makes it look really easy.

Since the show from last night just repeated, access to the web site is slow, at best.


 

I wish to thank everyone at TechTV for making my appearance on The Screen Savers an enjoyable one. Everyone made me feel right at home. A special thanks goes out to Leo Laporte for calming my nerves enough to talk on camera. He makes it look really easy.

Thanks again TechTV!


Copyright © 1999-2012 by Charles "Black Viper" Sparks. All Rights Reserved.