FETA (Framework for Evolving Topology Analysis) software

FETA in use examples

This transcript shows how to interact with FETA. For this example you will need Python, R and the FETA software. An example of FETA in use on real data will follow shortly.

Artificial test network

First, let us create an artificial network to test with. This network is specified using the netcreator software.

./netcreator.py -i 30000 model5 > net5

This says, do 30,000 iterations of statistical model 5 and put the output into a file called net5. This creates a test network to play with. This command may take some time to run so feel free to create a smaller one if you are just testing.

The file model5 is as follows

S
n 0 0.3 0.3 0.3 0.1
e 0.3 0.3 0.3 0.1
N 3 0.7 0.048
N 4 0.3
E 1 0.5
E 5 0.2
E 6 0.3

This is the FETA model format. In brief, S specifies a “simple graph” (no repeated edges between a node pair). The n and e lines specify the “outer model”: every new node is connected to 1, 2, 3 or 4 existing nodes with the given probabilities, and each new node is followed by 0, 1, 2 or 3 edges between existing nodes (also with the given probabilities).
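As a rough illustration of the outer model, the Python sketch below draws, for each new node, how many existing nodes it links to and how many extra edges follow it. It assumes that the first number on the n and e lines is the probability of a count of zero, the second of a count of one, and so on; that reading matches the description above but is not taken from the FETA source. The function name is made up for this example.

```python
import random

# Probabilities copied from the n and e lines of model5.
# Assumption: position k in each list is the probability of the count k.
N_PROBS = [0, 0.3, 0.3, 0.3, 0.1]   # new node links to 0..4 existing nodes
E_PROBS = [0.3, 0.3, 0.3, 0.1]      # then 0..3 edges between existing nodes

def sample_outer_step(rng):
    """Draw (links from the new node, extra internal edges) for one step."""
    links = rng.choices(range(len(N_PROBS)), weights=N_PROBS)[0]
    extra = rng.choices(range(len(E_PROBS)), weights=E_PROBS)[0]
    return links, extra

rng = random.Random(1)
steps = [sample_outer_step(rng) for _ in range(10000)]
print(steps[:5])
```

Because the first n probability is 0, a new node always gets at least one link in this sketch.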

The lines beginning with N specify an “inner model” for nodes connecting to new nodes: 70% of the model is PFP with delta = 0.048 and 30% is connection to singleton nodes. The lines beginning with E specify an “inner model” for edges between existing nodes: 50% of the model is totally random, 20% connects to doubleton nodes and 30% is proportional to a node’s “triangle count”.
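To make the inner model lines concrete, here is a small Python sketch that reads the N and E lines and checks that the weights of each inner model add up to 1. It assumes each such line is laid out as tag, component code, weight, then any extra parameter (such as the PFP delta); this layout is inferred from the description above, and the parser is only an illustration, not FETA's own reader.

```python
import math

# Model text with S on its own line, as in the testmodel file shown later.
MODEL5 = """S
n 0 0.3 0.3 0.3 0.1
e 0.3 0.3 0.3 0.1
N 3 0.7 0.048
N 4 0.3
E 1 0.5
E 5 0.2
E 6 0.3
"""

def parse_inner(text):
    """Collect (component code, weight, extra parameters) for each N and E line."""
    inner = {"N": [], "E": []}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] in inner:
            code = int(fields[1])
            weight = float(fields[2])
            params = [float(x) for x in fields[3:]]
            inner[fields[0]].append((code, weight, params))
    return inner

inner = parse_inner(MODEL5)
for tag, parts in inner.items():
    total = sum(weight for _, weight, _ in parts)
    # The weights of each inner model should form a complete mixture.
    print(tag, parts, "weights sum to 1:", math.isclose(total, 1.0))
```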

OK – now run the analyser to produce the files node5 and edge5. In this case the precise details of the model do not matter. The file “simplegraphmodel” is ideal when you know only that you are dealing with a “simple” graph.

./netanalyser.py -w 1000 -r 0.01 net5 simplegraphmodel node5 edge5

The -w flag skips the first 1000 edges as “warm up” – just in case too small a model biases the sample. You can also use -t to specify a start time if your file uses times. The -r 0.01 “thins” the data by only looking at 1 in every 100 choices. Let’s check how much data we have.
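The effect of the warm-up and thinning options can be sketched in a few lines of Python. This is only an illustration of the idea: it assumes thinning keeps each item independently with probability r, which may not be how netanalyser does it internally, and the function name is hypothetical.

```python
import random

def thin_choices(choices, warmup=1000, rate=0.01, seed=0):
    """Skip the first `warmup` items, then keep each later item with probability `rate`."""
    rng = random.Random(seed)
    kept = []
    for index, choice in enumerate(choices):
        if index < warmup:
            continue             # warm-up period: never sampled
        if rng.random() < rate:
            kept.append(choice)  # roughly 1 in 100 when rate=0.01
    return kept

# 100,000 items after the warm-up, so about 1,000 should survive thinning.
sample = thin_choices(range(101000))
print(len(sample))
```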

wc node5 edge5
553186 7191418 23838888 node5
684667 8900671 30512504 edge5
1237853 16092089 54351392 total

With this much data R may or may not cope. As a rough guide to the R output: look for significant parameters and prefer the models with the lowest “deviance”.

Now start R and type:

source("FETA.R")

This loads the FETA software into R.

Now make a first attempt to fit:

l<-linearFETA("node5",single=TRUE, double=TRUE)

The output should be

Data read
Now fitting prob ~ 0 + randFact + degrees + singlecol + doublecol with 4 variables

This tries to fit the FETA model to the data in “node5” (our data for connections to new nodes). It includes a degree component and a random component unless you tell it not to.

Get a summary as follows:

summary(l)

Call:
glm(formula = fmla, family = family, start = mystart)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.75616  -0.02479  -0.01864  -0.01485   4.47522

Coefficients:
          Estimate Std. Error z value Pr(>|z|)
randFact  -0.09747    0.09408  -1.036     0.30
degrees    0.82873    0.10882   7.615 2.63e-14 ***
singlecol  0.24863    0.05868   4.237 2.27e-05 ***
doublecol  0.01754    0.01248   1.405     0.16
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: Inf on 553186 degrees of freedom
Residual deviance: 2953 on 553182 degrees of freedom
AIC: 2961

Number of Fisher Scoring iterations: 4

The p-values (and the helpful stars) tell us that the degrees and the single parts were a good guess, but the double part and the random factor not so much. Ignore the “Null deviance”; it isn’t useful. The “Residual deviance” and AIC should be as low as possible. However, later you will see a better way to compare models.

Let’s try a different model without the double part and the random part.

l<-linearFETA("node5", single=TRUE, rand=FALSE)
summary(l)

Call:
glm(formula = fmla, family = family, start = mystart)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.72206  -0.02518  -0.01919  -0.01548   4.43201

Coefficients:
          Estimate Std. Error z value Pr(>|z|)
degrees    0.76970    0.07835   9.824  < 2e-16 ***
singlecol  0.22922    0.05392   4.251 2.13e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: Inf on 553186 degrees of freedom
Residual deviance: 2968.5 on 553184 degrees of freedom
AIC: 2972.5

Number of Fisher Scoring iterations: 4
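In both summaries so far, the printed AIC equals the residual deviance plus twice the number of fitted coefficients, which is why a model with more terms has to earn its place. The Python lines below simply redo that arithmetic with the numbers copied from the two outputs; the dictionary labels are my own shorthand for the two fits.

```python
# (residual deviance, number of fitted coefficients, AIC printed by summary(l))
fits = {
    "rand + degrees + single + double": (2953.0, 4, 2961.0),
    "degrees + single": (2968.5, 2, 2972.5),
}

for name, (deviance, n_coef, printed_aic) in fits.items():
    aic = deviance + 2 * n_coef   # penalty of 2 per fitted coefficient
    print(f"{name}: AIC = {aic} (printed {printed_aic})")
```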

We might also suspect a PFP model. Let’s imagine we do (since it is right). It’s usually a bad idea to mix a degree-based model and a PFP model, so take the degrees out and drop the PFP model in. This model is then a mix of PFP and singleton (which is correct, but pretend we don’t know that). Also pretend we don’t know delta, so let’s put in a bad value.

l<-linearFETA("node5",single=TRUE,rand=FALSE,deg=FALSE,pfp=TRUE, delta=0.1)
Data read
Now fitting prob ~ 0 + pfpcol + singlecol with 2 variables

summary(l)

Call:
glm(formula = fmla, family = family, start = mystart)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.85330  -0.02325  -0.01769  -0.01416   4.55237

Coefficients:
          Estimate Std. Error z value Pr(>|z|)
pfpcol     0.68565    0.07300   9.393  < 2e-16 ***
singlecol  0.30499    0.05373   5.677 1.37e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: Inf on 553186 degrees of freedom
Residual deviance: 2969.5 on 553184 degrees of freedom
AIC: 2973.5

Number of Fisher Scoring iterations: 4

OK, this has not worked out so badly. It’s a worse model than the degree-based model (because the AIC and deviance are higher), but that might be due to the wrong delta. There’s an automatic procedure for finding good deltas but it is SLOOOOOW.

finddelta("node5",single=TRUE,rand=FALSE,deg=FALSE,pfp=TRUE, range=seq(0.02,0.06,0.01))

This will refit the model while changing the PFP delta parameter from 0.02 to 0.06 in steps of 0.01. It prints the deviance for each value, which we want to be low. (It is a little bit of a cheat that I already know the answer to be in this range.)

0.02 2966.276
0.03 2965.594
0.04 2965.202
0.05 2965.109
0.06 2965.327

Our best value is 0.05 which is pretty good really (0.048 is correct).
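Picking the winner from the finddelta output is just a matter of taking the delta with the smallest deviance. A tiny Python sketch with the five printed values (the variable names are mine):

```python
# Deviances printed by finddelta, keyed by the delta that was tried.
deviance_by_delta = {
    0.02: 2966.276,
    0.03: 2965.594,
    0.04: 2965.202,
    0.05: 2965.109,
    0.06: 2965.327,
}

best_delta = min(deviance_by_delta, key=deviance_by_delta.get)
print(best_delta, deviance_by_delta[best_delta])
```

The deviances fall and then rise again across the range, so the minimum lies inside the searched interval rather than at its edge. A finer second pass around the best value would narrow it further, at the cost of more run time.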

l<-linearFETA("node5",single=TRUE,rand=FALSE,deg=FALSE,pfp=TRUE,delta=0.05)
summary(l)

Call:
glm(formula = fmla, family = family, start = mystart)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.77949  -0.02429  -0.01847  -0.01482   4.48375

Coefficients:
          Estimate Std. Error z value Pr(>|z|)
pfpcol     0.73430    0.07622   9.634  < 2e-16 ***
singlecol  0.26179    0.05396   4.852 1.22e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: Inf on 553186 degrees of freedom
Residual deviance: 2965.1 on 553184 degrees of freedom
AIC: 2969.1

Number of Fisher Scoring iterations: 3

Not so bad, but the deviances for the PFP and the degree models are similar: 2968.5 for the degree model and 2965.1 for PFP.

This is the important part: now use the netanalyser to test the likelihood.

Create a model file for each of the two models. In testmodel1, ignore the outer model and the inner edge model; we are testing the N part here. The model is .73 PFP and .27 single.

S
n 0 1.0
e 0 1.0
N 2 0.77
N 4 0.23
E 1 1.0

We want to race this against testmodel2, which is similar but uses the results from the degree modelling, not the PFP modelling.

./netanalyser.py -w 1000 -r 0.0001 -S net5 testmodel1 /dev/null /dev/null

Back to the netanalyser program: the new -S flag asks for likelihood statistics. We are no longer interested in the node and edge files, so these are thrown away (to /dev/null). This is more accurate than the testing in R.

Note that the exact results depend on the exact network created, which came from a random process. They should be close to this, however.

NODE MODEL
Log likelihood -136045.417587
Null likelihood -150897.106699
Deviance 272090.835173
Null deviance -29703.3782251
Mean prob ratio rel random 2.16382436434
EDGE MODEL
(Not of interest here)

The Deviance and null deviance should be as low as possible. (The null likelihood should be more or less the same for every model: it is the likelihood of a random model.) The mean prob ratio rel random should be high.
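The printed figures are consistent with two simple relationships: the deviance is minus twice the log likelihood, and the null deviance is twice the null likelihood minus the log likelihood. That is inferred from the numbers themselves, not from the netanalyser source, so treat it as a working assumption. A short Python check using the values above:

```python
import math

# Figures copied from the NODE MODEL output for testmodel1.
log_likelihood = -136045.417587
null_likelihood = -150897.106699
printed_deviance = 272090.835173
printed_null_deviance = -29703.3782251

# Assumed relationships, inferred from the printed output.
deviance = -2 * log_likelihood
null_deviance = 2 * (null_likelihood - log_likelihood)

print(deviance, null_deviance)
print(math.isclose(deviance, printed_deviance, abs_tol=1e-5))
print(math.isclose(null_deviance, printed_null_deviance, abs_tol=1e-5))
```

On this reading, a more negative null deviance means the model beats the random model by a wider margin.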

./netanalyser.py -w 1000 -r 0.0001 -S net5 testmodel2 /dev/null /dev/null

The results are

NODE MODEL
Log likelihood -136209.593883
Null likelihood -150897.106699
Deviance 272419.187767
Null deviance -29375.0256315
Mean prob ratio rel random 2.14543980167
EDGE MODEL
(Not of interest here)

Model 1 is therefore better than model 2 in this test. The final winning model is .73 PFP with delta = 0.05 and .27 singleton.
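The comparison boils down to choosing the model with the higher (less negative) log likelihood on the same data. A minimal Python sketch with the two node-model figures from the runs above:

```python
# Node-model log likelihoods from the two -S runs.
log_likelihoods = {
    "testmodel1": -136045.417587,
    "testmodel2": -136209.593883,
}

best = max(log_likelihoods, key=log_likelihoods.get)
gap = log_likelihoods["testmodel1"] - log_likelihoods["testmodel2"]
print(best, round(gap, 3))
```

Both runs share the same null likelihood, which is a quick sanity check that they were scored on the same sampled choices.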

The actual answer was .7 PFP with delta 0.048 and .3 singleton.

Contact: Richard G. Clegg (richard@richardclegg.org)