Gene tree construction using Kullback-Leibler divergence on milk governing genes in dairy cattle

Document Type: Research Paper

Authors

1 Ph.D Student, Department of Animal Science, Faculty of Agricultural Sciences, University of Guilan, Rasht, Iran

2 Professor, Department of Animal Science, Faculty of Agricultural Sciences, University of Guilan, Rasht, Iran

3 Assistant Professor, Department of Animal Science, Faculty of Agricultural Sciences, University of Yasouj, Yasouj, Iran

4 Assistant Professor, Department of Electrical Engineering, Faculty of Electrical Engineering, University of Guilan, Rasht, Iran

5 Associate Professor, Department of Biotechnology, Animal Science Research Institute, Agricultural Research, Education and Extension Organization (AREEO), Karaj, Iran

Abstract

Information theory is a branch of mathematics that overlaps with communications and biology. The aim of the current study was to provide a method for clustering a set of milk-governing genes in dairy cattle using an algorithm based on Kullback-Leibler divergence. After the gene and exon DNA sequences of genes affecting milk yield in dairy cattle were retrieved, the entropy of orders one to four was calculated. To extract distances between genes, Kullback-Leibler divergence was calculated by three different methods. The first and second methods were based on alignment of the genes, whereas the third was alignment-free and based on the relative entropy of the genes. The results of each Kullback-Leibler method on the gene and exon sequences were used as input to seven general clustering algorithms: Single, Complete, Average, Weighted, Centroid, Median and K-Means. The results of the clustering algorithms were integrated using the AdaBoost algorithm, and the integrated result was taken as the gene tree. This tree indicated that the third method, based on the relative entropy of the genes, grouped the genes in a biologically meaningful way, as confirmed by their gene annotation in GeneMANIA. We believe that the proposed method might be used alongside other DNA-based clustering methods and that it can therefore be used to group sets of genes in other species.
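As a rough illustration of the alignment-free idea, the following minimal Python sketch computes k-mer entropies and a symmetrised Kullback-Leibler divergence between DNA sequences. It is not the authors' implementation. It assumes that "entropy of order k" means the Shannon entropy of the overlapping k-mer distribution, that the divergence is symmetrised by summing both directions, and that a pseudocount is used to avoid zero probabilities; none of these choices is stated in the abstract. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_distribution(seq, k, pseudocount=0.0):
    """Relative frequencies of overlapping k-mers over the ACGT alphabet.

    k-mers containing other symbols (e.g. N) are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    if total == 0:
        raise ValueError("sequence has no valid k-mers of length %d" % k)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def entropy(seq, k):
    """Shannon entropy in bits of the k-mer distribution (assumed 'order k')."""
    dist = kmer_distribution(seq, k)
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def kl_divergence(seq_p, seq_q, k, pseudocount=1.0):
    """D(P || Q) in bits between the smoothed k-mer distributions."""
    p = kmer_distribution(seq_p, k, pseudocount)
    q = kmer_distribution(seq_q, k, pseudocount)
    return sum(p[m] * math.log2(p[m] / q[m]) for m in p)


def symmetric_kl(seq_a, seq_b, k, pseudocount=1.0):
    """Symmetrised divergence D(A || B) + D(B || A), usable as a distance."""
    return (kl_divergence(seq_a, seq_b, k, pseudocount)
            + kl_divergence(seq_b, seq_a, k, pseudocount))


def distance_matrix(seqs, k, pseudocount=1.0):
    """Pairwise symmetrised divergences for a list of sequences."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = symmetric_kl(seqs[i], seqs[j], k, pseudocount)
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical toy sequences, not real bovine genes.
toy_genes = ["ACGTACGTACGT", "ACGTTCGTACGA", "GGGGCCCCGGGG"]
entropies = [[entropy(s, k) for k in range(1, 5)] for s in toy_genes]
distances = distance_matrix(toy_genes, k=2)
```

A matrix such as `distances` could then be passed to a hierarchical clustering routine (single, complete, average and similar linkages) to obtain a tree. The integration of several clusterings with AdaBoost described in the abstract is not reproduced here.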

