Abstract
Given the expense of more direct determination methods, using machine-learning schemes to predict protein secondary structure from the sequence alone remains an important methodology. To achieve significant improvements in prediction accuracy, we have developed an automated tool that prepares very large biological datasets for use by the learning network. By focusing on improvements in data quality and validation, we obtained a highest prediction accuracy for protein secondary structure of 90.97% in our experiments. An important additional aspect of this result is that the predictions are based on a template-free statistical modeling mechanism. The performance of each classifier is also evaluated and discussed. In this paper, a set of 232 protein chains is proposed for use in the prediction. Our goal is to make the tools discussed available as services within a digital ecosystem that supports knowledge sharing among the protein structure prediction community.