Predicting DNA Recognition by C2H2 Zinc Finger Proteins by Support Vector Machine

How-To Guide


Using the Online Scoring Form:

Given an input zinc finger protein and an input DNA sequence, the online program locates the zinc fingers in your protein sequence and outputs the ten top-scoring DNA regions (i.e., those predicted to be the best binding regions for the fingers found). You can choose either the linear or the polynomial pre-trained SVM model. Note that the program assumes that all fingers bind consecutive bases. If it is known that only a subset of the fingers in a protein bind, you may want to input just those fingers.
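As a rough illustration of what "output the ten top scoring DNA regions" means, the sketch below slides a fixed-length window along a DNA string and keeps the best-scoring windows. It is not the server's code: the function names, the window length, and the toy scoring function (a simple count of G bases standing in for the SVM score) are all assumptions made for this example, and only the given strand is scanned.

```python
import heapq

def toy_score(window):
    """Hypothetical stand-in for the SVM score: counts G bases in the window."""
    return window.count("G")

def top_windows(dna, site_length, score_fn, n_top=10):
    """Score every window of `site_length` bases and return the n_top best.

    Each result is a (score, start_index, window) tuple, best score first.
    `site_length` is an assumption supplied by the caller (for example,
    a fixed number of bases per finger times the number of fingers).
    """
    candidates = []
    for start in range(len(dna) - site_length + 1):
        window = dna[start:start + site_length]
        candidates.append((score_fn(window), start, window))
    # nlargest keeps only the n_top highest-scoring windows
    return heapq.nlargest(n_top, candidates, key=lambda c: c[0])

best = top_windows("AAGGGAAA", site_length=3, score_fn=toy_score, n_top=1)
print(best)
```

A real run would replace `toy_score` with the score of the chosen linear or polynomial model.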

Sequences are input in the standard one-letter amino acid and nucleotide codes. Any non-standard symbols, spaces, or special characters are ignored and will not be used for scoring. Please check your original sequence in the output page. Please note that the protein may bind to either the primary or the complementary DNA strand: this will be highlighted in the output window.
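If you want to preview how an input might look once non-standard characters are dropped, or to inspect the complementary strand yourself, a minimal sketch follows. The function names are hypothetical, and converting lowercase letters to uppercase is an assumption of this example rather than documented server behavior.

```python
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def clean_sequence(raw, alphabet):
    """Upper-case the input and drop every character not in `alphabet`."""
    return "".join(ch for ch in raw.upper() if ch in alphabet)

def reverse_complement(dna):
    """Return the complementary strand, read in the 5'-to-3' direction."""
    return dna.translate(_COMPLEMENT)[::-1]

dna = clean_sequence("ac g-t\n12N", DNA_ALPHABET)
print(dna)                      # cleaned primary strand
print(reverse_complement(dna))  # complementary strand
```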

Advanced options:

You may choose the "Calculate p-values" option to compute a p-value for each score (i.e., the probability of obtaining the score by chance alone). The p-values are computed by generating 1000 random sequences of the same length as the binding region and evaluating how many of these score at least as high as the original score. To take into account the length of the input DNA region, an E-value is approximated as the p-value * (number of windows scored in the DNA sequence).
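The p-value and E-value estimates described above can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: `score_fn` is a hypothetical placeholder for the SVM score, the function names are invented for this example, and the window count assumes one window per start position on a single strand.

```python
import random

def empirical_p_value(score_fn, observed_score, length, n_random=1000,
                      probs=(0.25, 0.25, 0.25, 0.25), rng=None):
    """Fraction of random sequences scoring at least as high as observed.

    `probs` gives the background probabilities of A, C, G, T in that order.
    """
    rng = rng or random.Random()
    hits = 0
    for _ in range(n_random):
        seq = "".join(rng.choices("ACGT", weights=probs, k=length))
        if score_fn(seq) >= observed_score:
            hits += 1
    return hits / n_random

def count_windows(dna_length, site_length):
    """Number of start positions for a window on one strand (assumption)."""
    return max(0, dna_length - site_length + 1)

def approximate_e_value(p_value, n_windows):
    """E-value approximated as p-value times the number of windows scored."""
    return p_value * n_windows

# Toy example: the "score" is the number of G bases in a 9-base site.
p = empirical_p_value(lambda s: s.count("G"), observed_score=6, length=9)
print(p, approximate_e_value(p, count_windows(100, 9)))
```

With only 1000 random sequences, the smallest nonzero p-value this estimate can report is 0.001.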

Choosing the p-value option can dramatically increase calculation time (up to several minutes), especially when the polynomial kernel is used. Please be patient. It is always a good idea to start without the p-value calculation and check whether the binding regions and scores are worth evaluating before moving on to the advanced options.

You may choose different background nucleotide probabilities for generating randomized DNA sequences. By default, a uniform distribution of 25% for each of the four nucleotides is used. Alternatively, you can specify any distribution (e.g., the nucleotide distribution in the corresponding genome) or choose the option that computes and uses the distribution in your input DNA sequence.
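To make the background options concrete, the sketch below computes the nucleotide distribution of an input DNA sequence and draws a random sequence from it. The function names are hypothetical, and the server may compute or apply the distribution differently.

```python
import random
from collections import Counter

def background_from_sequence(dna):
    """Return the observed frequency of A, C, G and T in `dna`."""
    counts = Counter(ch for ch in dna.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard nucleotides")
    return {base: counts[base] / total for base in "ACGT"}

def random_dna(length, background, rng=None):
    """Draw `length` bases independently from the `background` distribution."""
    rng = rng or random.Random()
    bases = "ACGT"
    weights = [background[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))

bg = background_from_sequence("AACG")
print(bg)
print(random_dna(12, bg))
```

A genome-wide distribution can be passed to `random_dna` in the same dictionary form, for example `{"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}` (illustrative values only).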

Pre-trained model files:

If you would like to test our pre-trained SVM models using external programs, such as SVM_light, you can download pre-trained model files for Linear and Polynomial SVMs.
Please consult the conversion table for amino acid - base interacting pairs.

Experimental Database download:

We have also made available for download the database of experimental data collected from 25 individual manuscripts published in 1990 - 2005 and from the Protein Data Bank. This archive is password-protected. You can request the password by contacting us: persikov@princeton.edu. Each line in the database represents one experiment and includes the following fields: source - data origin; dna - DNA sequence; zf - number of zinc fingers in the protein; f1-fN - sequences of the corresponding zinc finger regions; ex - type of example: + for binding, - for non-binding, Kd for an experimentally measured dissociation constant, and > for comparative examples in which the binding of sequence A is compared to that of the subsequently listed sequence B. Please consult the list of sources for all individual references.

If you use this program, please cite:

Anton Persikov, Robert Osada and Mona Singh (2008) "Predicting DNA recognition by Cys2His2 zinc finger proteins". Bioinformatics, 2009 Jan 1; 25(1): 22-29.


Contact us:

To give feedback or to send comments or suggestions, please email us: persikov@princeton.edu
