Supervised FP-growth

This is the sFP-growth program used in “An ‘almost exhaustive’ search-based sequential permutation method for detecting epistasis in disease association studies”.

Our code for the supervised FP-growth software was developed based on an implementation of the original FP-growth by Christian Borgelt, obtained at http://www.borgelt.net/fpgrowth.html.

The modified version applies to data sets with a binary case/control label. It finds all the “frequent” patterns among the cases and counts them among both the cases and the controls. This is achieved by building the frequent-pattern tree using only the case samples while keeping track, at each tree node, of the corresponding counts for both the cases and the combined case/control sample.
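
To make the idea concrete, here is a brute-force R sketch on a toy data set. It is not the FP-tree algorithm that sfpgrowth uses; it simply enumerates item combinations, keeps those that reach a minimum count among cases, and reports their counts in both groups. The transactions, the labels, and the helper name count_patterns are all made up for illustration.

```r
# Toy data: each transaction is a character vector of items (hypothetical values).
transactions <- list(
  c("rs1@AA", "rs2@CG"),
  c("rs1@AA", "rs2@CG"),
  c("rs1@AG", "rs2@CG"),
  c("rs1@AA", "rs2@GG"),
  c("rs1@AG", "rs2@GG")
)
labels <- c(1, 1, 1, 0, 0)  # 1 = case, 0 = control

# Brute-force stand-in for the FP-tree (an assumption of this sketch):
# enumerate item sets of size 1..max_len built from items seen among cases,
# keep those occurring at least min_count times among cases, and count each
# kept set among cases and among controls.
count_patterns <- function(transactions, labels, min_count = 2, max_len = 2) {
  case_items <- sort(unique(unlist(transactions[labels == 1])))
  out <- list()
  for (k in seq_len(min(max_len, length(case_items)))) {
    for (set in combn(case_items, k, simplify = FALSE)) {
      has <- vapply(transactions, function(tr) all(set %in% tr), logical(1))
      n_case <- sum(has & labels == 1)
      if (n_case >= min_count) {
        out[[length(out) + 1]] <- data.frame(
          pattern = paste(set, collapse = " "),
          cases = n_case,
          controls = sum(has & labels == 0),
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, out)
}

res <- count_patterns(transactions, labels)
print(res)
```

Because candidate items are drawn from the cases only, a pattern that appears only among controls is never reported, which mirrors the case-driven search described above.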

Download

The source code is here.

Documentation

More detailed documentation is needed, and I will update this page incrementally over time. In the meantime, if you have questions or need help running the program, please feel free to email me. The syntax for running sfpgrowth is as follows.

./sfpgrowth [options] infile outfile responsefile pvalfile

Options

The commonly used options for sfpgrowth are listed below.

-l#      threshold for significance level (default: 0)
-v       print count detail
-w       print itemset
-x       print total relative support
-m#      minimal number of items per item set (default: 1)
-n#      maximal number of items per item set (default: 5)
-s#      minimal support of an item set among cases (default: 10%)
         (positive: percentage, negative: absolute number)

For example, to find and print all the patterns of length 1 to 3 whose significance levels, as given in pvalfile (see below), are larger than 2 and whose frequencies among cases are at least 2%, the syntax is as follows.

./sfpgrowth -l2 -w -m1 -n3 -s2 infile outfile responsefile pvalfile

Input files

infile: File to read transactions from. Each row corresponds to one observation in the following form.

Item1 Item2 Item3 ... ItemK

One way to name the items is to use the form suggested in our paper: MarkerName@Genotype. For example, if an individual has genotype CG at a SNP marker rs1234567, then the corresponding item for that individual can be coded as rs1234567@CG.
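
As an illustration, the following R sketch turns a small genotype table into transaction lines in this form. The table, the second marker name, and the genotypes are hypothetical, and the layout (one row per individual, one column per marker) is an assumption about how your data are stored.

```r
# Hypothetical genotype table: one row per individual, one column per SNP marker.
geno <- data.frame(
  rs1234567 = c("CG", "CC", "GG"),
  rs7654321 = c("AA", "AT", "AA"),
  stringsAsFactors = FALSE
)

# Build "MarkerName@Genotype" items column by column, then paste the items of
# each individual into one space-separated transaction line.
items <- lapply(names(geno), function(marker) paste0(marker, "@", geno[[marker]]))
lines <- do.call(paste, c(items, sep = " "))

cat(lines, sep = "\n")
# writeLines(lines, "infile") would save the lines as the transaction file.
```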

outfile: File to write frequent item sets to. This file does not have to exist already.

responsefile: File to read case/control labels for the observations given in infile. It has a single column of 0s and 1s, with 0 for controls and 1 for cases. The first few rows of the file may look like

0
1
0
0
1
1
1
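
Before running the program, it can help to confirm that the two files line up. The R sketch below assumes that row i of responsefile labels row i of infile, and it writes small made-up files to temporary locations so that it runs as is; replace the paths with your own files.

```r
# Hypothetical file contents, written to temporary files for this sketch only.
infile <- tempfile("infile")
responsefile <- tempfile("responsefile")
writeLines(c("rs1@AA rs2@CG", "rs1@AG rs2@CG", "rs1@AA rs2@GG"), infile)
writeLines(c("1", "0", "1"), responsefile)

labels <- readLines(responsefile)
n_obs <- length(readLines(infile))

# One label per transaction, and only 0 (control) or 1 (case) is allowed.
stopifnot(length(labels) == n_obs, all(labels %in% c("0", "1")))

# Group totals; these are the N1 and N0 needed when building the pvalfile.
n_cases <- sum(labels == "1")
n_controls <- sum(labels == "0")
```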

pvalfile: File to read the precomputed significance levels, e.g. -log10 P-values, for each possible combination of counts for a pattern. It has three columns.

[counts among cases] [counts among controls] [significance level]

For example, if a pattern that occurs 35 times among controls and 70 times among cases has a P-value of 10^{-5.78}, so that its significance level is 5.78, then the corresponding row in the file will be

70	35	5.78

This file should cover every possible combination of the two counts. So if you have 1000 cases and 1000 controls, the file should have about 1000*1000 rows; the pgrid function below writes (N1+1)*(N0+1)-1 rows, since each count runs from 0 to its group total and the (0,0) pair is dropped. The order of the rows does not matter. I provide two R functions for producing such a file when the significance level is the -log10 p-value under the one-sided Fisher's exact test, which was used in the paper.

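pval.fun=function(n1,n2,N1,N0,log=FALSE) {
  ## n1, n2: counts of the pattern among cases and controls
  ## N1, N0: total numbers of cases and controls

  ## One sided Fisher's test: P(X >= n1) under the hypergeometric distribution
  pval=phyper(n1-1,m=n1+n2, n=N1+N0-n1-n2,k=N1,lower.tail=FALSE)
  ## If log=TRUE, return -log10(p-value) instead of the p-value
  if(log) {pval=-log10(pval)}
  return(pval)
}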

pgrid=function(N1,N0,outfile=NA,digits=6) {
  ## Write -log10 p-values for all (cases, controls) count pairs to outfile
  if (is.na(outfile)) stop("P-val grid file name must be specified.")

  n1=0:N1; n2=0:N0
  grid=expand.grid(n1,n2)
  n1.v=grid[-1,1]; n2.v=grid[-1,2] # drop the (0,0) point
  L=length(n1.v)
  N1.v=rep(N1,L);N0.v=rep(N0,L)

  pval=pval.fun(n1.v,n2.v,N1.v,N0.v,log=TRUE)

  ## p-values that underflow to 0 become Inf on the -log10 scale; cap at 999
  if (any(is.infinite(pval))) {
    warning(paste(sum(is.infinite(pval)),"-log p-values are infinite. Replaced with 999."))
    pval[is.infinite(pval)]=999
  }

  ## Columns: count among cases, count among controls, significance level
  pval.grid=cbind(n1=n1.v,n2=n2.v,p=round(pval,digits))
  write.table(pval.grid,file=outfile,col.names=FALSE,row.names=FALSE,sep="\t")
  print("P-val grid constructed successfully.")

}

After loading these two functions into R, one can produce the pvalfile by running

pgrid(N1,N0,outfile="pvalfile")

where N1 is the total number of cases and N0 the total number of controls, and "pvalfile" can be replaced by another file name of your choice.
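
As a quick sanity check, the R sketch below rebuilds the count grid for a tiny hypothetical study (3 cases, 2 controls) and compares the hypergeometric tail probability used in pval.fun with R's built-in fisher.test for one pair of counts. The numbers are chosen only for illustration.

```r
N1 <- 3; N0 <- 2  # hypothetical totals: cases, controls

# All (cases, controls) count pairs except (0, 0), built the same way as in pgrid.
grid <- expand.grid(n1 = 0:N1, n2 = 0:N0)[-1, ]
n_rows <- nrow(grid)  # (N1 + 1) * (N0 + 1) - 1 = 11

# One-sided Fisher p-value via the hypergeometric upper tail, as in pval.fun.
n1 <- 3; n2 <- 0  # pattern seen in all 3 cases and in no control
p_hyper <- phyper(n1 - 1, m = n1 + n2, n = N1 + N0 - n1 - n2, k = N1,
                  lower.tail = FALSE)

# The same test through fisher.test on the 2x2 table
# (rows: carrier / non-carrier, columns: cases / controls).
tab <- matrix(c(n1, N1 - n1, n2, N0 - n2), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(rows = n_rows, p_hyper = p_hyper, p_fisher = p_fisher))
```

Here both p-values equal 1/10, so the significance level written to the file for the row with counts 3 and 0 would be 1.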