FETA (Framework for Evolving Topology Analysis) software

Defining the inner and outer models

Let G be some graph which evolves in time and let G_t be the state of this graph at step t of the evolution. Consider a model for network evolution as consisting of two separate (but interconnected) models. The outer model selects the operation which transforms the graph between two steps. The inner model chooses the entity for that operation. The operation and the entity together define the transition from G_{i-1} to G_i. Both the outer and inner models may depend on the state of the graph at step i of the evolution and possibly on exogenous parameters. Outer model operations might be the following.

  1. Add a new node and connect it to an existing node.

  2. Connect the newest node to an existing node.

  3. Connect two existing nodes.

  4. Delete an existing connection.

  5. Delete an existing node and its connections.

These outer models work with inner models which select either nodes or edges for the operation. The inner model assigns probabilities to each node (operations 1, 2 and 5) or to each edge (operations 3 and 4). There may be a different inner model for each outer model operation. The outer model might be adapted further if the known graph data can include isolated (degree zero) nodes, if the graph can be disconnected, and so on. The focus here is the inner model and the outer model is always kept relatively simple.
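
As a concrete illustration, here is a minimal Python sketch of one evolution step under this split, assuming the simplest outer model (always operation 1) and a degree-proportional inner model; the function names are illustrative and not part of the FETA distribution.

    import random

    def degree_weighted_choice(graph, candidates):
        """Hypothetical inner model: choose a node with probability
        proportional to its degree (preferential attachment).
        Assumes the seed graph already contains at least one edge."""
        weights = [len(graph[n]) for n in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

    def evolve_one_step(graph, new_node):
        """One step of evolution: the outer model fixes the operation
        (add a new node and connect it to an existing node); the inner
        model picks which existing node to connect to."""
        target = degree_weighted_choice(graph, list(graph))
        graph[new_node] = {target}      # graph is a dict: node -> set of neighbours
        graph[target].add(new_node)
        return target                   # the choice N_i

Starting from a small seed graph, repeated calls grow the network while recording the sequence of choices N_i.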

Estimating likelihood

Let G_0 be the graph at the first step of evolution observed (this need not be right at the start of the evolution of the graph). Assume that the state of the graph is observed up to some step t, giving G_0, \ldots, G_t. The graph evolves between G_{i-1} and G_i according to an outer and an inner model. Each step involves the addition of one edge. For simplicity of explanation consider the outer model to consist only of the two operations:

  1. add a new node and connect it to an existing node N_i; or

  2. connect the newest node to an existing node N_i.

The inner model \theta assigns probabilities to the existing nodes at a given step. Given the above outer model, the node N_i chosen by the inner model can be inferred from G_{i-1} and G_i. Call the set of all observed choices C = (N_1, \ldots, N_t).
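
For instance, under these two operations the choice can be read off from the single added edge; a minimal sketch (illustrative names, with edges stored as frozensets):

    def infer_choice(edges_prev, edges_next, newest_node):
        """Recover the observed choice N_i from two consecutive graph states.

        edges_prev, edges_next -- edge sets; each edge a frozenset of two nodes
        newest_node            -- the most recently added node after the step

        Under the two-operation outer model exactly one edge is added per
        step, and N_i is its endpoint that is not the newest node.
        """
        (added,) = edges_next - edges_prev       # the single new edge
        (n_i,) = added - {newest_node}           # drop the newest node
        return n_i

    # e.g. infer_choice({frozenset({1, 2})},
    #                   {frozenset({1, 2}), frozenset({2, 3})}, 3) == 2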

Definition 1

An inner model \theta is a map which, at every choice stage j, assigns to each node i a probability p_j(i|\theta). A model \theta is a valid model if the probabilities sum to one over all nodes: \sum_i p_j(i|\theta) = 1.
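
For example, a preferential attachment inner model assigns p_j(i|\theta) = d_j(i) / \sum_k d_j(k), where d_j(i) is the degree of node i at choice stage j.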

Theorem 1

Let C = (N_1, \ldots, N_t) be the observed node choices at steps 1, \ldots, t of the evolution of the graph G. Let \theta be some hypothesised inner model which assigns a probability p_j(i|\theta) to node i at step j. The likelihood of the observed C given \theta is L(C|\theta) = \prod_{j=1}^t p_j(N_j|\theta).

Proof of Theorem 1

If L(C_j|\theta) is the likelihood of the jth choice given model \theta then, since each choice is made independently given the state of the graph at that step, L(C|\theta) = \prod_{j=1}^t L(C_j|\theta). The probability p_j(N_j|\theta) which model \theta assigns to node N_j at step j is exactly the likelihood of the choice N_j at step j given model \theta. The theorem follows.

If two inner models \theta and \theta' are hypothesised to explain the node choices C arising from observations of a graph G_0, \ldots, G_t and a given outer model, then the one with the higher likelihood is to be preferred. In practice, for even moderately sized graphs, this likelihood will underflow the floating point precision of most programming languages and the log likelihood l(C|\theta) = \log L(C|\theta) is more useful.
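
Working in log space makes the computation trivial to accumulate one choice at a time; a minimal Python sketch (the function name and interface are illustrative, not the FETA API):

    import math

    def log_likelihood(choices, prob_of_choice):
        """Compute l(C|theta) for the observed choices N_1, ..., N_t.

        choices        -- the observed nodes, in order
        prob_of_choice -- callable returning p_j(N_j|theta) for step j
        """
        l = 0.0
        for j, node in enumerate(choices, start=1):
            p = prob_of_choice(j, node)
            if p == 0.0:
                return float("-inf")   # theta cannot produce this choice
            l += math.log(p)
        return l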

A common statistical measure is the deviance D = -2 l(C|\theta). (The deviance is usually defined with respect to a “saturated model” – in this case the saturated model \theta_s is the model which has p_j(N_j|\theta_s) = 1 for all j in 1, \ldots, t and hence has l(C|\theta_s) = 0. The saturated model \theta_s has likelihood one but is useless for anything except exactly reproducing G_0, \ldots, G_t.)

Definition 2

Let \theta_0 be the null model. Here, an appropriate null model is the one which assigns equal probability to all nodes in the choice set (the random model). The choice set is either the set of all nodes or, if a simple graph is desired, the set of all nodes to which the new node does not already connect.

The null model allows the assessment of the null deviance D_0 = -2(l(C|\theta) - l(C|\theta_0)). However, both D and D_0 depend heavily on t (the number of choices made). A more useful measure created for this situation is now given.

Definition 3

Let \theta be some inner model hypothesis for the set of node choices C = (N_1, \ldots, N_t). Let \theta_A be some rival model to compare \theta with. The per choice likelihood ratio with \theta_A, written c_A, is the likelihood ratio normalised by t, the number of choices. It is given by

    c_A = \left[\frac{L(C|\theta)}{L(C|\theta_A)}\right]^{1/t} = \exp\left[\frac{l(C|\theta) - l(C|\theta_A)}{t}\right].

A value c_A > 1 indicates that \theta is a better explanatory model for the choice set C than \theta_A and c_A < 1 indicates it is worse. Particularly useful is c_0, the per choice likelihood ratio relative to the null model. Note that for a fixed C, if two models \theta and \theta_A have per choice ratios c_0 and c_0' relative to the null model, then c_A = c_0 / c_0'.
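
In code this is a one-liner on top of the log likelihoods; a sketch under the same illustrative conventions as above:

    import math

    def per_choice_ratio(l_theta, l_rival, t):
        """c_A = exp[(l(C|theta) - l(C|theta_A)) / t]."""
        return math.exp((l_theta - l_rival) / t)

    # c_0 is per_choice_ratio(l_theta, l_null, t); the ratio property above is
    # per_choice_ratio(l_theta, l_rival, t)
    #     == per_choice_ratio(l_theta, l_null, t) / per_choice_ratio(l_rival, l_null, t)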

In summary, the likelihood L(C|\theta) gives the absolute likelihood of a given model \theta producing the choice set C arising from a set of graphs G_0, \ldots, G_t. However, the per choice likelihood ratio produces a result on a more comprehensible scale.

Fitting linear combinations of models

An inner model \theta can be constructed from a linear combination of other inner models. Let \theta_1, \theta_2, \ldots be probability models. A combined model can now be constructed from component models as follows: \theta = \beta_1 \theta_1 + \beta_2 \theta_2 + \cdots + \beta_N \theta_N. The \beta_i are known as the component weights. The model \theta is a valid model if all \beta_i \in (0,1) and \sum_i \beta_i = 1. The weights \beta_i that best explain C can be obtained using a fitting procedure from statistics known as Generalised Linear Models (GLM).

Let P_j(i) = 1 if i = N_j, and P_j(i) = 0 otherwise. The problem of finding the best model weights becomes the problem of fitting the GLM P_j(i) = \beta_1 p_j(i|\theta_1) + \beta_2 p_j(i|\theta_2) + \cdots + \beta_N p_j(i|\theta_N) + \varepsilon. A GLM procedure can fit the \beta parameters to find the combined model \theta which best fits the P_j(i). This fit is obtained by creating a data point for each choice j and for each node i, giving information about that node at that choice time together with the value of P_j(i).

GLM fitting in a statistical language such as R can be used to find the choice of \beta_i which maximises the likelihood of this model. This is equivalent to finding the \beta_i which give the maximum likelihood for \theta since, for model \theta, the expectation is E[P_j(i)] = p_j(i|\theta). The fitting procedure estimates, for each \beta_i, its value, its standard error and its statistical significance.

Because this procedure requires one line of data for each node at each choice, it produces a large amount of data and sampling is necessary. It has been shown that in practice this procedure can recover parameters from networks generated by known models; see the publications page for more details.
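
As an illustration of the fitting step, the following sketch uses Python's statsmodels in place of R (the data values, layout and two-component setup are purely illustrative, and the simple renormalisation of the fitted weights is an assumption, not the FETA procedure):

    import numpy as np
    import statsmodels.api as sm

    # One row per (choice j, candidate node i): the response P_j(i) is 1 for
    # the node actually chosen and 0 otherwise; the regressors are the
    # probabilities each component model assigned to that node at that step.
    # Illustrative data: two choices, three candidate nodes each, two models.
    P = np.array([1, 0, 0, 0, 1, 0])              # P_j(i)
    X = np.array([[0.50, 0.25],                   # p_j(i|theta_1), p_j(i|theta_2)
                  [0.30, 0.25],
                  [0.20, 0.50],
                  [0.10, 0.40],
                  [0.60, 0.30],
                  [0.30, 0.30]])

    # No intercept term: the combined model is a pure weighted sum of components.
    fit = sm.GLM(P, X, family=sm.families.Gaussian()).fit()
    beta = fit.params / fit.params.sum()          # renormalise so weights sum to 1
    print(beta, fit.bse, fit.pvalues)             # estimates, errors, significance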

Executive summary

Given some data about how a network evolves and also some hypothesised model, then:

1. The likelihood-based estimate can say how well the hypothesised model explains the observed evolution of the network.

  2. The GLM procedure can find which weighted combination of a given set of model components best explains the evolution.

The advantages of FETA over some traditional methods of testing network models are the following:

  1. No need to construct a network and then measure statistics on it. The likelihood is measured directly against the evolution data.

2. The likelihood is a single, statistically rigorous measure of which model best explains the evolution of a given set of network data.

The disadvantage is the need for data about how the network evolved.

Contact: Richard G. Clegg (richard@richardclegg.org)