blog dds

2005.11.09

US Military Removes Word Documents from the Web?

On August 25th 2004 the comp.risks forum ran an article I submitted regarding the large number of Microsoft Word documents available on US military sites (sites in the .mil domain) through Google searches (23.50 "U.S. military sites offer a quarter million Microsoft Word documents"). The article documented how such documents could lead to the leakage of confidential data. A week later I set up a script to watch the number of Word documents available through Google searches, to see if and when the military would recognise the threat those documents posed and remove them.
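The monitoring script itself is not shown here, so the following is only a rough sketch of how such a watcher might be put together. It builds a search restricted to one domain and file type using Google's `site:` and `filetype:` operators, and pulls an estimated hit count out of a result line. The function names, the search URL, and the assumed "about N" wording of the result line are illustrative assumptions; fetching the page is deliberately left out.

```python
import re
from urllib.parse import urlencode


def build_query_url(domain, filetype="doc"):
    """Return a search URL restricted to one domain and one file type.

    The URL layout is an assumption made for illustration.
    """
    query = f"site:{domain} filetype:{filetype}"
    return "https://www.google.com/search?" + urlencode({"q": query})


def parse_result_count(text):
    """Extract an estimated hit count from text such as 'of about 1,180,000'.

    The 'about N' wording is an assumed format, not a documented one.
    Returns None when no count is found.
    """
    match = re.search(r"about\s+([\d,]+)", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Hypothetical result line, used only to exercise the parser.
sample = "Results 1 - 10 of about 1,180,000"
print(build_query_url("mil"))
print(parse_result_count(sample))
```

Running something like this once a day for each monitored domain and appending the date and count to a file would give the kind of time series the charts are drawn from.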

According to the data I gathered, the number of .mil Word documents returned by Google peaked at 1,180,000 on September 20th 2005 and then started declining gradually. Currently there are 941,000 documents online. No such decline was visible on the other domains I monitored, so the change is probably not an artefact of Google's collection or query mechanisms, but an organized move by the US military. The following charts illustrate the changes in the number of Word documents available over a number of different domains (red) compared to the total number of documents available through all monitored domains (green).
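As a quick check on the size of the drop, the two .mil figures quoted above can be compared directly. This is a minimal sketch; the helper name `relative_decline` is made up for illustration, and only the two counts come from the text.

```python
def relative_decline(peak, current):
    """Return the drop from peak to current as a fraction of the peak."""
    return (peak - current) / peak


peak = 1_180_000     # .mil Word documents at the September 20th 2005 peak
current = 941_000    # .mil Word documents currently online

drop = peak - current
print(f"{drop:,} fewer documents, {relative_decline(peak, current):.1%} below the peak")
# prints: 239,000 fewer documents, 20.3% below the peak
```

In other words, roughly a fifth of the documents visible at the peak are no longer returned.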

.mil Domain

[Domain chart]

Other non-country TLDs

[5 domain charts]

Country TLDs

[12 domain charts]

Updates

2005.11.12
Jim Horning correctly noted that .mil might now be excluding robots on more sites.

2005.11.13
George Gousios notes that the large September spike may be Google's answer to Yahoo overtaking them in the number of indexed pages.

Last modified: Sunday, November 13, 2005 4:16 pm
Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-Share Alike 3.0 Greece License.