Why do we need to know?
With the explosion of sequence data in public and private databases, and a similar explosion of gene expression
data on the way, it is becoming increasingly important to understand how to apply well-established data
analysis and classification methods developed in other fields to this one. Such methods help to make
sense of the data, to glean biological insights from it, to categorize it, and to put the results to good
use in industrial applications.
What is the Algorithm?
The dataset consists of a non-redundant set of 639 antibiotic proteins and a non-redundant set of 602 non-antibiotic proteins, both obtained from NCBI.
To reduce redundancy among the sequences, the ExPASy sequence alignment tool Decrease Redundancy was used, with the
criterion that no two sequences in the dataset share more than 90% sequence identity.
The Support Vector Machine (SVM) package SVM light, developed by Thorsten Joachims, was used to train a classifier on this dataset.
SVM light is freely available and can be downloaded from http://svmlight.joachims.org
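SVM light reads its training examples from a plain-text file with one example per line, in the form "<target> <index>:<value> ...", where feature indices start at 1 and appear in increasing order. The sketch below shows one way a feature vector could be written in that form. The function name to_svmlight_line and the labelling convention (+1 for antibiotic, -1 for non-antibiotic) are assumptions made for illustration, not part of the original pipeline.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVM light input line.

    label    : +1 or -1 (assumed here: +1 = antibiotic, -1 = non-antibiotic)
    features : sequence of numeric feature values
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    parts = [f"{label:+d}"]
    # Feature indices are 1-based; zero-valued features may be omitted.
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)


if __name__ == "__main__":
    print(to_svmlight_line(1, [0.5, 0.0, 0.25]))
```

A file of such lines would typically be passed to the svm_learn program to build a model, and new examples scored with svm_classify. The kernel and parameters used for this server are not stated here.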
Amino acid composition represents a protein as a vector of 20 dimensions. Each element of the vector
is the fraction of one of the 20 amino acids in the protein.
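As a minimal sketch of this calculation, the function below counts each of the 20 standard residues and divides by the number of residues counted. The function name, the residue ordering, and the decision to ignore non-standard characters (such as X) are assumptions for illustration.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard amino acid."""
    counts = Counter(sequence.upper())
    # Assumption: only the 20 standard residues contribute to the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    composition = amino_acid_composition("AACD")
    # A makes up half of this toy sequence, C and D a quarter each.
    print(composition[:3])
```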
Dipeptide composition represents a protein as a vector of 400 dimensions (20 x 20 possible pairs of adjacent amino acids).
It captures information about the fraction of amino acids as well as their local order.
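A minimal sketch of the dipeptide calculation follows: it slides a window of two residues along the sequence, counts each of the 400 possible pairs, and divides by the number of pairs counted. The function name, the pair ordering, and the choice to skip pairs containing non-standard characters are assumptions for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids: AA, AC, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return a 400-element list with the fraction of each adjacent residue pair."""
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    total = 0
    for i in range(len(seq) - 1):
        index = DIPEPTIDE_INDEX.get(seq[i:i + 2])
        # Assumption: pairs containing non-standard characters are skipped.
        if index is not None:
            counts[index] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [count / total for count in counts]


if __name__ == "__main__":
    vector = dipeptide_composition("ACAC")
    # "ACAC" contains the pairs AC, CA, AC.
    print(vector[DIPEPTIDE_INDEX["AC"]], vector[DIPEPTIDE_INDEX["CA"]])
```

Because the pairs are ordered, AC and CA are counted separately, which is how the vector retains local order information that the 20-dimensional composition loses.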