The Decay and Failures of URL References

Increasingly, scholarly publications contain references to material on the Web. This page contains supporting material relevant to the article examining the repercussions of this phenomenon:

Diomidis Spinellis. The Decay and Failures of URL References. Communications of the ACM, 46(1):71-77, January 2003.

In the article we examine the accessibility and decay rate of web references by extracting and inspecting 4224 URL references from 2471 computer science articles that appeared over the last five years. Of those URLs, 27% were not accessible, while close to 50% became inaccessible four years after the date they were published. In addition, we found that deep URL path hierarchies are linked to a larger number of failures; that educational and research material on the Web is referenced three times more often than its share of the Web's population would suggest; and that pages hosted by educational and commercial sites are equally likely to deteriorate. Two important article findings can be concisely stated as follows:

On this page we make available the data used for developing the article, hoping that other researchers might find it of use. The research was performed by crawling through the ACM and the IEEE Computer Society digital library sites and downloading all articles from the IEEE Computer and the Communications of the ACM (CACM) periodicals in the period 1995-1999. This phase started on February 21st, 2000 and was completed on May 5th, 2000. Over 9GB of raw material was downloaded during the process. CACM articles are available on the ACM digital library in PDF format; in order to extract URLs we first converted those articles into text form.

We then extracted the URLs appearing in each article. In total we processed 2471 articles: 1411 articles from Computer (38.2MB of HTML) and 1060 articles from CACM (18.9MB of text). After extracting the URLs we removed duplicate appearances of URLs in the same article (21 cases for CACM, 362 for Computer). We ended up with 4224 URLs: 1391 (33%) obtained from CACM and 2833 (67%) obtained from Computer.
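The extraction and per-article de-duplication step described above can be illustrated with a minimal Python sketch. It is not the tooling used for the article: the regular expression, the trailing-punctuation rule, and the function names `extract_unique_urls` and `count_urls` are all assumptions made for illustration.

```python
import re

# Assumed pattern (not from the article): an http, https or ftp URL running
# up to the next whitespace character or closing delimiter.
URL_PATTERN = re.compile(r"""(?:https?|ftp)://[^\s<>"')\]]+""", re.IGNORECASE)


def extract_unique_urls(article_text):
    """Return the URLs of one article in order of first appearance,
    with repeated appearances inside that article removed."""
    seen = set()
    unique = []
    for match in URL_PATTERN.finditer(article_text):
        # Assumption: punctuation at the very end belongs to the sentence.
        url = match.group(0).rstrip(".,;:")
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def count_urls(articles):
    """Sum the per-article unique URL counts. A URL cited by two different
    articles is counted twice; only same-article repeats are dropped."""
    return sum(len(extract_unique_urls(text)) for text in articles)
```

Duplicates are removed per article rather than across the whole collection, which matches the description above: the same URL cited in two different articles remains two references.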

From this page you can download:

See also:

