
Encoding Procedures

Di-nucleotide association score

The adjacent di-nucleotide association scores computed under the three encoding procedures are as follows.

  1. EP-1: The association between nucleotides at two adjacent positions was expressed as the ratio of the observed frequency to the frequency expected from random occurrence. The denominator of the ratio is N × 0.0625, where N is the number of aligned sequences and 0.0625 is the probability of a given di-nucleotide occurring at random (0.25 × 0.25). The observed frequency of a pair of nucleotides at adjacent positions was obtained by counting the occurrences of that pair across the full set of splice site motifs. The score was taken on a logarithmic scale with base 2, so that the information is stored in bits and the scores are normalized.

 

  2. EP-2: The association between nucleotides at two adjacent positions was expressed as the ratio of the observed frequency to the expected frequency, where the expected frequency is computed from the position weight matrix (PWM) under the assumption of independence between positions.
  3. EP-3: The adjacent di-nucleotide association was expressed as the relative deviation of the observed frequency from the expected frequency computed from the PWM.
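
As a quick check of how the three scores differ, consider a purely illustrative case (all numbers are made up, not taken from the MaLDoSS data): N = 1000 aligned motifs, the pair GT observed 250 times at one pair of adjacent positions, and PWM probabilities of 0.5 for G at the first position and 0.25 for T at the second. Assuming the base-2 logarithm described for EP-1 is also applied to the EP-2 ratio:

$$
\begin{aligned}
\text{EP-1:}\quad & \log_2\frac{250}{1000 \times 0.0625} = \log_2 4 = 2 \\
\text{EP-2:}\quad & \log_2\frac{250}{1000 \times 0.5 \times 0.25} = \log_2 2 = 1 \\
\text{EP-3:}\quad & \frac{250 - 125}{125} = 1 \quad \text{(relative deviation, before any log transform)}
\end{aligned}
$$

In this example the pair is four times more frequent than chance under EP-1, but only twice as frequent as the PWM-based expectation under EP-2, because G is itself over-represented at the first position.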

 

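The mathematical expressions for the proposed di-nucleotide association scores between two adjacent nucleotides are as follows.
Let $p^{i}_{n_1}$ be the probability of occurrence of nucleotide $n_1$ at the $i$-th position, $p^{j}_{n_2}$ the probability of occurrence of nucleotide $n_2$ at the $j$-th position, and $f^{i,j}_{n_1,n_2}$ the frequency of occurrence of nucleotides $n_1$ and $n_2$ together at the $i$-th and $j$-th positions respectively. Then the three di-nucleotide association scores between nucleotides $n_1$ and $n_2$ occurring at the $i$-th and $j$-th positions under EP-1, EP-2 and EP-3 were calculated using the following formulas
$$
s^{i,j}_{n_1,n_2} = \log_2\!\left(\frac{f^{i,j}_{n_1,n_2}}{N \times 0.0625}\right), \qquad
s^{i,j}_{n_1,n_2} = \log_2\!\left(\frac{f^{i,j}_{n_1,n_2}}{N \, p^{i}_{n_1} \, p^{j}_{n_2}}\right), \qquad
s^{i,j}_{n_1,n_2} = \log_2\left|\frac{f^{i,j}_{n_1,n_2} - N \, p^{i}_{n_1} \, p^{j}_{n_2}}{N \, p^{i}_{n_1} \, p^{j}_{n_2}}\right|
$$
respectively, where $s^{i,j}_{n_1,n_2}$ is the association score, N is the total number of sequence motifs in the data set, $n_1, n_2 \in \{A, T, G, C\}$, $i = 1, 2, \dots, 101$ and $j = i + 1$. A pseudo count of 0.001 was added to avoid logarithms of zero in the frequencies.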
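
The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the MaLDoSS implementation: the function name `scoring_matrix` and its parameters are hypothetical, the pseudo count of 0.001 is assumed to be added to the quantity inside the logarithm, and EP-3 is assumed to use the absolute value of the relative deviation so that the logarithm is defined.

```python
import math
from collections import Counter

NUCLEOTIDES = "ATGC"


def scoring_matrix(seqs, procedure="EP-1", pseudo=0.001):
    """Return {pair: [score for positions (i, i+1), for each i]}.

    seqs      -- aligned motifs of equal length (hypothetical input format)
    procedure -- "EP-1", "EP-2" or "EP-3"
    pseudo    -- pseudo count; adding it inside the log is an assumption
    """
    n = len(seqs)
    length = len(seqs[0])
    matrix = {a + b: [] for a in NUCLEOTIDES for b in NUCLEOTIDES}
    for i in range(length - 1):
        first = Counter(s[i] for s in seqs)        # nucleotide counts at position i
        second = Counter(s[i + 1] for s in seqs)   # nucleotide counts at position i+1
        pairs = Counter(s[i:i + 2] for s in seqs)  # observed di-nucleotide counts
        for pair in matrix:
            observed = pairs[pair]
            if procedure == "EP-1":
                # Expected count if all 16 pairs were equally likely.
                expected = n * 0.0625
            else:
                # Expected count from per-position probabilities (PWM),
                # assuming independence between the two positions.
                expected = n * (first[pair[0]] / n) * (second[pair[1]] / n)
            if expected == 0:
                value = 0.0  # pair cannot occur; only the pseudo count remains
            elif procedure == "EP-3":
                # Assumption: absolute relative deviation, so log2 is defined.
                value = abs(observed - expected) / expected
            else:
                value = observed / expected
            matrix[pair].append(math.log2(value + pseudo))
    return matrix
```

For motifs of 102 nucleotides, each call returns 16 rows of 101 scores, matching the 16×101 scoring matrix described in the next section.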

Encoding of splice site motifs

Since each splice site motif is 102 nucleotides long, 101 (=102-1) combinations of two adjacent positions are possible. In each combination, 16 pairs of nucleotides (AA, AT,…,CG, CC) are possible. Thus, a scoring matrix of order 16×101 was constructed from the di-nucleotide association scores under each of the three procedures. Each splice site motif was then encoded into a vector of 101 scores taken from the scoring matrix. So for N sequences one obtains a numeric data set of N observations and 101 variables (features). The diagrammatic representation of encoding a splice site motif is given below.
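
The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: `encode_motif` and `encode_dataset` are hypothetical names, the scoring matrix is assumed to be stored as a dictionary mapping each di-nucleotide to its list of per-position scores, and the toy matrix below uses made-up values for 3-nucleotide motifs rather than real 102-nucleotide splice site motifs.

```python
def encode_motif(seq, matrix):
    """Encode a motif of length L as L-1 scores, one per adjacent pair.

    matrix[pair][i] is the score of di-nucleotide `pair` at positions
    (i, i+1); this storage layout is an assumption for the sketch.
    """
    return [matrix[seq[i:i + 2]][i] for i in range(len(seq) - 1)]


def encode_dataset(seqs, matrix):
    """Encode N motifs into N rows of L-1 features each."""
    return [encode_motif(s, matrix) for s in seqs]


# Toy scoring matrix for motifs of length 3 (2 adjacent position pairs).
# All scores are made up for illustration.
toy = {a + b: [0.0, 0.0] for a in "ATGC" for b in "ATGC"}
toy["AT"][0] = 1.5   # score of AT at positions (1, 2)
toy["TG"][1] = -0.5  # score of TG at positions (2, 3)

features = encode_dataset(["ATG", "CCC"], toy)
print(features)
```

With a real 16×101 scoring matrix and 102-nucleotide motifs, each row of `features` would hold 101 values, giving the N × 101 data set described above.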

[Figure: encoding process for a splice site motif]

Team: Prabina Kumar Meher, Tanmaya Kumar Sahu, A. R. Rao and S. D. Wahi
Contact: meherprabin@yahoo.com, tanmayabioinfo@gmail.com, arrao@iasri.res.in


Indian Agricultural Statistics Research Institute, (ICAR), Library Avenue, New Delhi - 110012