## FETA (Framework for Evolving Topology Analysis) software

## Defining the Inner and Outer models

Let $G$ be some graph which evolves in time. Let $G_t$ be the state of this graph at some step of evolution, $t \in \{0, 1, 2, \ldots\}$.
Consider a model for network evolution as consisting of two separate (but interconnected) models. The outer model governs the evolution of the graph topology, with operations such as:

1. Add a new node and connect it to an existing node.
2. Connect the newest node to an existing node.
3. Connect two existing nodes.
4. Delete an existing connection.
5. Delete an existing node and its connections.
These outer models work with inner models which select either nodes or edges for the operation. The inner model assigns probabilities to each node (operations 1, 2 and 5) or edge (operations 3 and 4). There may be a different inner model for each outer model operation. The outer model might be adapted further if the known graph data can include unconnected (degree zero) nodes, if graphs can be unconnected and so on. The focus of this work is the inner model, and the outer model is always kept relatively simple.

## Estimating likelihood

Let $G_0$ be the graph at the first step of evolution observed (this need not be right at the start of the evolution of the graph). Assume that the state of the graph is observed until some step $t$. The graph evolves between step $0$ and step $t$ according to an outer and inner model. Each step involves the addition of one edge. For simplicity of explanation consider the outer model to consist only of the two operations: add a new node and connect it to an existing node; or connect the newest node to an existing node.
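As an illustration, this simplified two-operation outer model can be sketched with a placeholder inner model that picks nodes uniformly at random. This is not FETA code; the 50/50 operation split, and the choice to exclude the newest node as its own target, are illustrative assumptions:

```python
import random

random.seed(7)

# Sketch of the simplified two-operation outer model; the inner model here
# is a placeholder that chooses uniformly at random (FETA would supply a
# fitted probability model instead). The 0.5 operation split is illustrative.
nodes = [0, 1]
edges = [(0, 1)]

def inner_choice(candidates):
    """Placeholder inner model: equal probability for every candidate."""
    return random.choice(candidates)

for step in range(10):           # each step adds exactly one edge
    newest = nodes[-1]
    if random.random() < 0.5:    # operation 1: new node joins an existing node
        target = inner_choice(nodes)
        new_node = len(nodes)
        nodes.append(new_node)
        edges.append((new_node, target))
    else:                        # operation 2: newest node gains another edge
        target = inner_choice([n for n in nodes if n != newest])
        edges.append((newest, target))

print(f"{len(nodes)} nodes, {len(edges)} edges")
```

Because each step adds exactly one edge, after ten steps the edge count grows by ten while the node count depends on which operations were drawn.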
The inner model assigns probabilities to the existing nodes at a given step. Given the above outer model, from $G_i$ and $G_{i+1}$ the node chosen by the inner model can be inferred. Call the set of all observed choices $C = (c_1, c_2, \ldots, c_n)$.

Definition 1

An inner model is a map which at every choice stage $i$ maps a node $j$ to a probability $p_i(j)$. A model is valid if, at every stage $i$, the probabilities over the available nodes sum to one.
Theorem 1
Let $C = (c_1, \ldots, c_n)$ be the observed node choices at steps $1, \ldots, n$ of the evolution of the graph $G$. Let $M$ be some hypothesised inner model which assigns a probability $p_i(c_i)$ to node $c_i$ at step $i$. The likelihood of the observed $C$ given $M$ is
$$L(C|M) = \prod_{i=1}^{n} p_i(c_i).$$

Proof of theorem 1
If $l_i$ is the likelihood of the $i$th choice given model $M$ then $L(C|M) = \prod_{i=1}^{n} l_i$. Given that $p_i(c_i)$ is the probability model $M$ assigns to node $c_i$ at step $i$, it is therefore also the likelihood of choice $c_i$ at step $i$ given model $M$. The theorem follows.

If two inner models $M_1$ and $M_2$ are hypothesised to explain the node choices $C$ arising from observations of a graph and a given outer model, then the one with the higher likelihood is to be preferred. In practice, for even moderate sized graphs, this likelihood will be beyond the computational accuracy of most programming languages and the log likelihood is more useful. A common statistical measure is the deviance $D = -2 \log L(C|M)$. (The deviance is usually defined with respect to a "saturated model" – in this case the saturated model is the model which has $p_i(c_i) = 1$ for all $i$ and hence has $D = 0$. The saturated model has likelihood one but is useless for anything except exactly reproducing $C$.)

Definition 2
Let $M_0$ be the null model: the inner model which assigns equal probability to every node available at each choice stage. The null model allows the assessment of the null deviance $D_0 = -2 \log L(C|M_0)$. However, both $D$ and $D_0$ depend heavily on the size of $C$ (the number of choices made). A more useful measure created for this situation is now given.

Definition 3
Let $M$ be some inner model hypothesis for the set of node choices $C$. Let $M'$ be some rival model to compare $M$ with. The per choice likelihood ratio of $M$ relative to $M'$ is
$$c_r = \left( \frac{L(C|M)}{L(C|M')} \right)^{1/|C|}.$$
A value $c_r > 1$ indicates that $M$ is a better explanatory model for the choice set $C$ than $M'$ and $c_r < 1$ indicates it is worse. Particularly useful is the per choice likelihood ratio taken against the null model $M_0$.
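To make the likelihood, deviance and per choice ratio concrete, here is a small sketch (not FETA code): it grows a toy network by pure preferential attachment, then scores the recorded choices under a preferential-attachment model and under the null model. The name `c0` stands for the per choice likelihood ratio against the null, an assumption of this sketch:

```python
import math

import numpy as np

rng = np.random.default_rng(1)

# Grow a toy network by pure preferential attachment, recording at each
# step the degree sequence (the state) and the node that was chosen.
deg = [1, 1]
states, choices = [], []
for _ in range(500):
    d = np.asarray(deg, dtype=float)
    j = rng.choice(len(deg), p=d / d.sum())
    states.append(d)
    choices.append(j)
    deg[j] += 1      # the new node attaches to node j ...
    deg.append(1)    # ... and joins the network with degree one

def log_likelihood(model):
    """Theorem 1 in log form: sum of log p_i(c_i) over observed choices."""
    return sum(math.log(model(d)[c]) for d, c in zip(states, choices))

def pa_model(d):
    return d / d.sum()                     # probability proportional to degree

def null_model(d):
    return np.full(len(d), 1.0 / len(d))   # equal probability for every node

ll_pa = log_likelihood(pa_model)
ll_null = log_likelihood(null_model)
deviance = -2.0 * ll_pa                            # D = -2 log L
c0 = math.exp((ll_pa - ll_null) / len(choices))    # per choice ratio vs null
print(f"c0 = {c0:.3f}")
```

Since the data really were generated by preferential attachment, the preferential-attachment model scores a higher log likelihood than the null and the per choice ratio comes out above one.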
In summary, the likelihood gives the absolute likelihood of a given model producing the choice set $C$ arising from a set of observed graphs. However, the per choice likelihood ratio produces a result on a more comprehensible scale.

## Fitting linear combinations of models

An inner model can be constructed from a linear combination of other inner models. Let $M_1, M_2, \ldots, M_n$ be probability models. A combined model $M$ can now be constructed from the $n$ component models as follows:
$$M = \beta_1 M_1 + \beta_2 M_2 + \cdots + \beta_n M_n,$$
that is, $M$ assigns node $j$ the probability $p_i(j) = \beta_1 p_i^{(1)}(j) + \cdots + \beta_n p_i^{(n)}(j)$ at stage $i$, where $p_i^{(k)}(j)$ is the probability assigned by component $M_k$. The $\beta_k$ are known as the component weights. The model $M$ is a valid model if all $\beta_k \geq 0$ and $\sum_k \beta_k = 1$.

The weights that best explain $C$ can be obtained using a fitting procedure from statistics known as Generalised Linear Models (GLM). Let $y_{i,j} = 1$ if node $j$ is chosen at stage $i$ and $y_{i,j} = 0$ otherwise. The problem of finding the best model weights becomes the problem of fitting the GLM
$$E[y_{i,j}] = \beta_1 p_i^{(1)}(j) + \beta_2 p_i^{(2)}(j) + \cdots + \beta_n p_i^{(n)}(j).$$
A GLM procedure can fit the parameters $\beta_1, \ldots, \beta_n$ to find the combined model which best fits the observations. This fit is obtained by creating a data point for each choice $i$ and for each node $j$, giving information about that node at that choice time and also the value of $y_{i,j}$. GLM fitting in a statistical language such as R can be used to find the choice of $\beta_k$ which maximises the likelihood of this model. This is equivalent to finding the $\beta_k$ which give the maximum likelihood for $C$ since, for model $M$, the expectation $E[y_{i,j}] = p_i(j)$. The fitting procedure estimates, for each $\beta_k$, the value, the error and the statistical significance. Because this procedure requires one line of data for each node at each choice, it produces a large amount of data and sampling is necessary. It has been shown that in practice this procedure can recover parameters from networks generated by known models. See the publications page for more details.

## Executive summary

Given some data about how a network evolves and also some hypothesised model then:

- The likelihood based estimate can say how well the hypothesised model explains the true evolution of the network.
- The GLM procedure can find which weighted combination of a given set of model components best explains the evolution.
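The weight-recovery idea can be sketched end to end. The toy below is not FETA code; a plain identity-link least-squares fit stands in for a full GLM, and all weights and sizes are illustrative. It generates choices from a known mixture of preferential attachment and uniform random choice, builds one data row per candidate node per choice, and recovers the component weights:

```python
import numpy as np

rng = np.random.default_rng(42)

# True mixture weights to recover: 0.7 preferential attachment + 0.3 uniform.
BETA_PA, BETA_RAND = 0.7, 0.3
STEPS = 2000

deg = [1, 1]                     # start from a single edge between two nodes
rows_pa, rows_rand, rows_y = [], [], []

for _ in range(STEPS):
    d = np.asarray(deg, dtype=float)
    p_pa = d / d.sum()                           # component model 1
    p_rand = np.full(len(deg), 1.0 / len(deg))   # component model 2
    j = rng.choice(len(deg), p=BETA_PA * p_pa + BETA_RAND * p_rand)

    y = np.zeros(len(deg))
    y[j] = 1.0                   # one data row per candidate node per choice
    rows_pa.append(p_pa)
    rows_rand.append(p_rand)
    rows_y.append(y)

    deg[j] += 1                  # the new node attaches to the chosen node
    deg.append(1)

# Identity-link fit of E[y] = b_pa * p_pa + b_rand * p_rand (no intercept).
X = np.column_stack([np.concatenate(rows_pa), np.concatenate(rows_rand)])
yv = np.concatenate(rows_y)
(b_pa, b_rand), *_ = np.linalg.lstsq(X, yv, rcond=None)
print(f"estimated weights: PA={b_pa:.2f}, random={b_rand:.2f}")
```

Note how the data grow: one row per candidate node per choice (millions of rows here for only 2000 choices), which is why sampling is needed in practice.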
The advantages of FETA over some traditional methods of testing network models are the following:

- No need to construct a network and then measure statistics on it; the likelihood is measured directly against the evolution data.
- The likelihood is a single, statistically reliable measure of which model best explains the evolution of a given set of network data.
The disadvantage is the need for data about how the network evolved.

Contact: Richard G. Clegg (richard@richardclegg.org)