Webometry

4. Measuring the WWW

Our strategy for viewing the morphogenetic process of a massive neural net may be applied to the WWW. That is indeed the main point of this paper. But how to represent the Web as a Net? There are clearly two necessary steps: to define the nodes, and to measure the connection strengths. For each of these steps there are many possibilities. Here we describe only one approach to each.

Nodes. The WWW is a tree consisting of domains, servers, and pages. There are now tens of thousands of domains, several servers in each domain, and many pages in each server. Each domain has a unique name (for example, vismath.org), each server has a unique name (e.g., www.vismath.org) and IP address (e.g., 162.227.70.1), and each page has a unique URL (e.g., http://www.vismath.org/index.html). These are the main choices for nodes of the WWW. For reasons of size, mainly, let us regard domain names as the nodes of the Web. We may further reduce the size of the network to be visualized by considering only the suffixes edu or org. Besides reducing to a smaller number of nodes, we might anticipate that the domains in the com class are relatively sparsely connected, and thus less interesting from the mathematical point of view.
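
As an illustration of this choice of nodes, the following minimal sketch maps a page URL to its domain name and keeps only the edu and org suffixes. The function name domain_of is hypothetical, and taking the last two labels of the server name as the domain is a simplifying assumption that fits names such as www.vismath.org but not every naming scheme.

```python
from urllib.parse import urlsplit

def domain_of(url, suffixes=("edu", "org")):
    """Return the domain (node) for a page URL, or None if it is out of scope.

    Assumption: the domain is the last two labels of the server name,
    e.g. www.vismath.org -> vismath.org.
    """
    host = urlsplit(url).hostname
    if host is None:
        return None
    labels = host.lower().split(".")
    # Discard bare names, IP addresses, and suffixes outside the chosen set.
    if len(labels) < 2 or labels[-1] not in suffixes:
        return None
    return ".".join(labels[-2:])

print(domain_of("http://www.vismath.org/index.html"))
```

Pages whose server is given only by IP address, or whose suffix is outside the chosen set, map to None and would simply be dropped from the network.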

Connections. The interconnections of the WWW, as a hypertext and hypermedia system, are links. Links connect pages, but pages are secondary to domains according to our choice above. Thus, given two domains, that is, nodes, we must determine all links from any page of the first domain, to any page of the second domain. Then this simple count should be normalized. That is, regarding the number of all pages of all servers of the first domain as a width, and all pages of all servers of the second domain as a height, we obtain a rectangle, the area of which (the product of the two page counts) may be regarded as contributing to the probability of a link. Thus, the connection strength we are proposing here is the ratio of the number of links to the product of the width and the height. A more precise measure might take into account the byte size of pages, or equivalently, the total storage served by each domain. However, this data is much more expensive to obtain.
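
The normalized measure just described can be sketched as follows. This is only an illustration under stated assumptions: the function name connection_strengths, the dictionary of page counts per domain, and the list of (source domain, target domain) pairs, one per link, are hypothetical inputs that a crawler would have to supply. Links within a single domain are skipped here, which is also an assumption.

```python
from collections import Counter

def connection_strengths(page_counts, links):
    """Compute link count / (width * height) for each ordered pair of domains.

    page_counts: dict mapping a domain to its total number of pages.
    links: iterable of (source_domain, target_domain), one entry per link.
    """
    link_counts = Counter(links)
    strengths = {}
    for (src, dst), n in link_counts.items():
        if src == dst:
            continue  # assumption: intra-domain links are not connections
        area = page_counts[src] * page_counts[dst]  # width times height
        if area > 0:
            strengths[(src, dst)] = n / area
    return strengths

# Hypothetical data: 3 links from a.edu (10 pages) to b.org (5 pages).
pages = {"a.edu": 10, "b.org": 5}
links = [("a.edu", "b.org")] * 3 + [("b.org", "a.edu")]
print(connection_strengths(pages, links))
```

The result is a sparse form of the connection matrix: pairs of domains with no links between them are absent rather than stored as zero.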

In any case, the data needed to construct the massive connection matrix for the entire WWW is to be collected by a Web crawler or robot, not just once, but repeatedly, according to our larger plan. Fortunately for this program, a number of Web crawlers are already at work collecting links for indices of the WWW. The link data they collect is to be the basis for further work in this project.