FETA (Framework for Evolving Topology Analysis) software
Defining the Inner and Outer models
Let be some graph which evolves in time. Let be the state
of this graph at some step of evolution, .
Consider a model for network evolution as consisting of two separate
(but interconnected) models. The outer model
selects the operation which transforms the graph between two steps.
The inner model chooses the entity for
that operation. The operation and the entity together define
the transition from to .
Both outer and inner model may depend on the state of
the graph on the step of the evolution and possibly on
exogenous parameters.
Outer model operations might be the following.
Add a new node and connect it to an existing node.
Connect the newest node to an existing node.
Connect two existing nodes.
Delete an existing connection.
Delete an existing node and its connections.
These outer models work with inner models which select either nodes or
edges for the operation. The inner model assigns probabilities to each node
(operations 1, 2 and 5)
or edge (operations 3 and 4).
There may be a different inner model for each
outer model operation. The outer model might be adapted further
if the known graph
data can include unconnected (degree zero) nodes, if graphs can
be unconnected and so on. The focus of this paper is the inner model
and the outer model is always kept relatively simple.
Estimating likelihood
Let be the graph at the first step of evolution observed
(this need not be right at the start of the evolution of the graph).
Assume that the state of the graph is observed until some step .
The graph evolves between step and according to an outer
and inner model. Each step involves the addition of one edge.
For simplicity of explanation consider the outer model
to consist only of the two operations:
add a new node and connect it to an existing node ; or
connect the newest node to an existing node .
The inner model assigns probabilities to the
existing nodes at a given step. Given the above outer model,
from and the node chosen by the inner
model can be inferred. Call the set of all observed choices
.
If two inner models and are hypothesised to
explain the node choices arising from observations of a graph
and a given outer model, then the one with the
higher likelihood is to be
preferred.
In practice, for even moderate
sized graphs, this likelihood will be beyond the computational
accuracy of most programming languages and the log likelihood
is more useful.
A common statistical measure is the deviance
. (The deviance is usually defined with respect
to a “saturated model” – in this case the saturated model
is the model which has for
all and hence has . The saturated model
has likelihood one but is useless for anything except
exactly reproducing ).
Definition 2
Let be the null model. Here, an appropriate
null model is the one which assigns equal probability
to all nodes in the choice set (the random model).
The choice set is either the set
of all nodes or, if a simple graph is desired, the set of all
nodes to which the new node does not already connect.
The null model allows the assessment of the null deviance
. However, both
and depend heavily on the size of (the number
of choices made). A more useful measure created for this
situation is now given.
A value indicates that is a better explanatory
model for the choice set than and
indicates it is worse.
Particularly useful is the
per choice likelihood ratio relative to the null model.
Note that for a fixed , given the statistic
for two models and
then can be shown to be the
ratio of the former over the latter.
In summary, the likelihood gives the absolute
likelihood of a given model producing the choice set
arising from a set of graphs . However,
the per choice likelihood ratio produces a result on a more
comprehensible scale.
Fitting linear combinations of models
An inner model can be constructed from
a linear combination of other inner models.
Let , ,
be probability models. A combined model can
now be constructed from component models as follows, .
The are known as the component weights. The model
is a valid model if all and
. The weights
that best explain can be obtained
using a fitting
procedure from statistics known as Generalised Linear Models
(GLM).
Let
if and otherwise.
The problem of finding the best model weights
becomes the problem of fitting the GLM,
A GLM procedure can fit the parameters to
find the combined model which best fits the .
This fit is obtained by creating a data point
for each choice and
for each node
giving information about that node at that
choice time and also the value of .
GLM fitting in a statistical language such as
R
can be used to find the choice of
which maximises the likelihood of this model.
This
is equivalent to finding the which gives the
maximum likelihood for since for model ,
the expectation .
The fitting procedure estimates for each , the
value, the error and the statistical significance.
Because this procedure requires one line of data for each node
at each choice then it produces a large amount of data and sampling
is necessary. It has been shown that in practice this procedure
can recover parameters from networks generated by known models.
See the publications page for more details.
Executive summary
Given some data about how a network evolves and also some hypothesised
model then
The likelihood based estimate can say how well the hypothesised model
explains the true evolution of the network.
The GLM procedure can find which weighted
combination of a given set of model
components best explains the evolution.
The advantages of FETA over some traditional methods of testing network
models are the following:
No need to construct a network and then measure statistics on it. The
likelihood is measured directly against the evolution data.
The likelihood is a single statistically reliable measure of which
is the best model to explain the evolution of a given set of network data.
The disadvantage is the need for data about how the network evolved.
Contact:
Richard G. Clegg (richard@richardclegg.org)
|