Input
The input sequence data should be provided as one or more sequences in FASTA format. An example input sequence file is:
>gene1
GAGCTCACATTAACTATTTACAGGGTAACTGCTTAGGACCAGTATTATGAGGAGAATTTA
CCTTTCCCGCCTCTCTTTCCAAGAAACAAGGAGGGGGTGAAGGTACGGAGAACAGTATTT
CTTCTGTTGAAAGCAACTTAGCTACAAAGATAAATTACAGCTATGTACACTGAAGGTAGC
TATTTCATTCCACAAAATAAGAGTTTTTTAAAAAGCTATGTATGTATGTGCTGCATATAG
AGCAGATATACAGCCTAT
>gene2
ACCTTACTCGCCCCAGTCTGTCCCGACGTGACTTCCTCGACCCTCTAAAGACGTACAGAC
CAGACACGGCGGCGGCGGCGGGAGAGGGGATTCCCTGCGCCCCCGGACCTCAGGGCCGCT
CAGATTCCTGGAGAGGAAGCCAAGTGTCCTTCTGCCCTCCCCCGGTATCCCATCCAAGGC
GATCAGTCCAGAACTGGCTCTCGGAAGCGCTCGGGCAAAGACTGCG
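To check that a file follows this layout before submitting it, a small parser can be useful. The sketch below is only an illustration and is not part of the server: the function name parse_fasta and the shortened example sequences are assumptions made here. It reads FASTA text and joins the wrapped sequence lines of each record.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}.

    Hypothetical helper for illustration; not the server's own parser.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence lines may wrap; collect them under the current header
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


# Shortened sequences, used only to keep the example small
example = """>gene1
GAGCTCACAT
TAACTATT
>gene2
ACCTTACTCG
"""

records = parse_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))
```

Each header (the text after ">") becomes a key, so a file with multiple sequences yields one entry per gene.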
Output
The result provided in the table below is obtained after running all three machine learning techniques. The generated output file contains the probabilities with which the putative splice sites are predicted as true by the different machine learning approaches. The higher the probability, the stronger the prediction. The last column contains the probabilities averaged over the prediction methods used. Here, only the putative splice sites with an average probability >0.65 are displayed. However, the recommended threshold value and methods are explained below under "Prediction using threshold value".
RF: Random Forest; SVM: Support Vector Machine; ANN: Artificial Neural Network
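The averaging and the >0.65 display cutoff described above can be sketched as follows. This is a minimal illustration, not the server's code: the function names, the dictionary layout, the site labels, and all probability values are made up for this example.

```python
def average_probability(rf, svm, ann):
    """Mean of the three classifier probabilities (the last output column)."""
    return (rf + svm + ann) / 3.0


def sites_above_cutoff(sites, cutoff=0.65):
    """Keep only the sites whose average probability exceeds the cutoff."""
    kept = {}
    for site, probs in sites.items():
        avg = average_probability(probs["RF"], probs["SVM"], probs["ANN"])
        if avg > cutoff:
            kept[site] = avg
    return kept


# Hypothetical probabilities, NOT real server output
sites = {
    "site_A": {"RF": 0.90, "SVM": 0.80, "ANN": 0.70},
    "site_B": {"RF": 0.40, "SVM": 0.50, "ANN": 0.30},
}

displayed = sites_above_cutoff(sites)
print(sorted(displayed))
```

With these made-up values only site_A passes the cutoff, since its average is about 0.8 while site_B averages about 0.4.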
Prediction using threshold value
Although ANN achieved the highest accuracy among the three machine learning techniques, it is recommended to use all three classifiers for the prediction of donor splice sites and to decide by majority vote. In other words, if a putative splice site is predicted as true by at least two of the three methods, it can be declared a true splice site.
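The majority rule can be sketched as below. This is a hedged illustration only: the function name is_true_splice_site is hypothetical, and the per-classifier threshold of 0.5 used in the example calls is a placeholder assumption, since the recommended value is not stated here.

```python
def is_true_splice_site(probs, threshold):
    """Return True if at least two of the three classifiers predict 'true'.

    probs     -- dict with the RF, SVM and ANN probabilities for one site
    threshold -- probability at or above which a classifier counts as a
                 'true' vote (assumed value; supply the recommended one)
    """
    votes = sum(1 for p in probs.values() if p >= threshold)
    return votes >= 2


# Hypothetical probabilities, used only to illustrate the rule
two_agree = {"RF": 0.70, "SVM": 0.40, "ANN": 0.80}
one_agrees = {"RF": 0.70, "SVM": 0.40, "ANN": 0.30}

print(is_true_splice_site(two_agree, threshold=0.5))
print(is_true_splice_site(one_agrees, threshold=0.5))
```

Counting votes rather than averaging means a single very confident classifier cannot outweigh the other two.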