Application of a random forest algorithm to estimate marker effects and identify candidate genes for reproductive traits in Iranian Holstein dairy cattle

Document Type: Research Paper

Authors

1 Ph.D. Student in Animal Breeding and Genetics, Department of Animal Science, Faculty of Agricultural Sciences, University of Tabriz, Tabriz, Iran

2 Professor, Department of Animal Sciences, Faculty of Agriculture, University of Tabriz, Tabriz, Iran

3 Associate Professor, the National Animal Breeding Center and Promotion of Animal Products, Karaj, Iran

Abstract

Introduction: Genome-wide association studies (GWAS) are a powerful approach for identifying genomic regions that explain a significant portion of the genetic variance in fertility traits and for locating the relevant causal mutations. Evaluating the association between each genotyped marker and the trait is the core strategy of GWAS, and it is most informative when the effects of all markers are examined together with their possible interactions, environmental factors, and mutual effects between markers. Recently, machine learning methods have been introduced into genomics; they rest on a different basis from the conventional methods of genomic evaluation. Machine learning estimates the genomic breeding values of candidate animals from training data (the genotypic and phenotypic information of the reference population). A key advantage of this approach is its ability to analyze large data sets. Machine learning is a branch of artificial intelligence that aims to build machines able to extract knowledge (learn) from their environment. Various machine learning methods (random forest, boosting, and deep learning) are used to model genetic variance and environmental factors, study gene networks, perform GWAS, study epistatic effects, and carry out genomic evaluation. Random forest is one such method and has been applied successfully in many fields of science. This research was conducted to identify markers and genes related to the reproductive traits calving interval (CI), days open (DO), daughter pregnancy rate (DPR), and age at first calving (AFC) in Iranian Holstein dairy cattle. These traits have previously been investigated with the single-step genomic BLUP (ssGBLUP) method in a smaller sample. In the present research, a random forest algorithm was applied to a larger number of genotyped animals to identify markers and genes related to these traits.
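To make the idea of ranking markers with a random forest more concrete, the following is a minimal sketch, not the software or settings used in this study. It assumes genotypes coded 0/1/2, uses a forest of one-split regression trees (stumps) grown on bootstrap samples with a random subset of markers per tree, and scores each marker by its accumulated reduction in the sum of squared errors. The function names, the simulated data, and the parameters `n_trees` and `mtry` are all hypothetical.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split_gain(X, y, rows, snp):
    """Largest SSE reduction from splitting `rows` on genotype <= threshold."""
    parent = sse([y[r] for r in rows])
    best = 0.0
    for threshold in (0, 1):
        left = [y[r] for r in rows if X[r][snp] <= threshold]
        right = [y[r] for r in rows if X[r][snp] > threshold]
        if left and right:
            best = max(best, parent - sse(left) - sse(right))
    return best

def stump_forest_importance(X, y, n_trees=200, mtry=2, seed=1):
    """Relative marker importance from bagged regression stumps."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    importance = [0.0] * m
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of animals
        candidates = rng.sample(range(m), mtry)       # random subset of markers
        gains = {s: best_split_gain(X, y, rows, s) for s in candidates}
        winner = max(gains, key=gains.get)            # marker used by this stump
        importance[winner] += gains[winner]
    total = sum(importance) or 1.0
    return [v / total for v in importance]

# Simulated example (assumption): 200 animals, 5 markers, marker 2 is causal.
sim = random.Random(42)
X = [[sim.choice((0, 1, 2)) for _ in range(5)] for _ in range(200)]
y = [2.0 * row[2] + sim.gauss(0.0, 1.0) for row in X]
importance = stump_forest_importance(X, y)
top_marker = max(range(5), key=importance.__getitem__)
```

A full random forest grows deep trees rather than stumps, which is what allows it to capture interactions between markers; the stump version is used here only to keep the importance calculation easy to follow.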
Materials and methods: The records used in this research were provided by the National Animal Breeding Center and Promotion of Animal Products of Iran and comprised AFC, DO, CI, and DPR records of the daughters of genotyped bulls. Pedigree information on 2,774,183 animals and marker genotypes of 2,419 Holstein bulls were used. Genomic data quality control was based on the proportion of genotyped SNPs per animal (animal call rate, ACR), the proportion of genotyped animals per SNP (SNP call rate, CR), Hardy-Weinberg equilibrium (HWE), and minor allele frequency (MAF). First, markers with a MAF below 5% were removed. Next, samples with a genotyping rate below 90% were removed, followed by markers genotyped in fewer than 95% of the samples. Finally, SNPs that deviated from HWE (P < 10⁻⁶) were excluded as a guard against genotyping error. Quality control was performed with PLINK 1.9, and the random forest analysis was then run with Ranfog software in a Linux environment.
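The filtering order described above can be sketched in plain Python as follows. This is an illustration only, not the PLINK 1.9 procedure itself: it assumes genotypes coded 0/1/2 with `None` for missing calls, and it substitutes a one-degree-of-freedom chi-square test for PLINK's HWE test. The function names and the toy genotype matrix are hypothetical.

```python
import math

def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic marker: no test possible
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, 1 df

def quality_control(genotypes, maf_min=0.05, animal_cr_min=0.90,
                    snp_cr_min=0.95, hwe_p_min=1e-6):
    """Return (kept animal indices, kept SNP indices) after the four filters."""
    animals = list(range(len(genotypes)))
    snps = list(range(len(genotypes[0])))

    def calls(snp, animal_ids):
        return [genotypes[a][snp] for a in animal_ids if genotypes[a][snp] is not None]

    def maf(snp):
        g = calls(snp, animals)
        if not g:
            return 0.0
        freq = sum(g) / (2 * len(g))
        return min(freq, 1.0 - freq)

    # Step 1: remove markers with MAF below the threshold.
    snps = [s for s in snps if maf(s) >= maf_min]
    if not snps:
        return animals, snps
    # Step 2: remove animals with a low call rate over the remaining markers.
    animals = [a for a in animals
               if sum(genotypes[a][s] is not None for s in snps) / len(snps) >= animal_cr_min]
    if not animals:
        return animals, snps
    # Step 3: remove markers with a low call rate over the remaining animals.
    snps = [s for s in snps if len(calls(s, animals)) / len(animals) >= snp_cr_min]
    # Step 4: remove markers that deviate strongly from HWE.
    kept = []
    for s in snps:
        g = calls(s, animals)
        if hwe_p_value(g.count(0), g.count(1), g.count(2)) >= hwe_p_min:
            kept.append(s)
    return animals, kept

# Toy data (assumption): 40 animals x 22 markers with a few planted defects.
pattern = [0] * 10 + [1] * 20 + [2] * 10
genotypes = [[pattern[a]] * 22 for a in range(40)]
for a in range(40):
    genotypes[a][1] = 0          # marker 1 is monomorphic (fails MAF)
    genotypes[a][2] = 1          # marker 2 is all heterozygotes (fails HWE)
for a in range(4):
    genotypes[a][3] = None       # marker 3 has a low call rate
for s in range(10, 20):
    genotypes[39][s] = None      # animal 39 has a low call rate
kept_animals, kept_snps = quality_control(genotypes)
```

Because each filter is applied to the output of the previous one, changing the order of the steps can change which animals and markers survive.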
Results and discussion: The random forest algorithm identified a total of 21 important SNPs. Candidate genes for fertility traits were then identified by gene ontology analysis, and 62 genes were located within 250 kb of these SNPs. The most significant SNP was observed for AFC. The top SNP for AFC was ARS-BFGL-NGS-22647 (BTA3), for CI it was ARS-BFGL-NGS-114194 (BTA11), for DO it was BTA-74076-no-rs (BTA5), and for DPR it was ARS-BFGL-NGS-32553 (BTA26). Researchers who studied fertility traits in Nellore cattle using machine learning methods identified the MPZL1 and CD247 genes on chromosome 3 and reported an association with age at first calving. Many cell biology pathways affect the performance of reproductive traits. The CD247 gene has been linked to biological pathways including cell development and function. The IFFO2 gene has been shown to play an important role in the molecular structure of cells, as well as in the mechanisms of blastocyst formation, embryo development, and gestation length in cattle. A study in mice on flagellum structure and sperm maturation reported a role for the ALDH4A1 gene in the sperm maturation process. The RPS6KC1 gene has been reported to be associated with pregnancy rate and antral follicle number in Nellore heifers. The KAT2B gene encodes a transcriptional activator that plays an essential role in regulating histone acetylation and an important role in improving carcass quality, muscle and fat development, and metabolism in native Chinese cattle. In addition, these genes play a key role in regulating biological processes and are related to cell growth, metabolism, and immune system function.
Conclusions: In line with its objectives, this research reported new information on markers and candidate genes related to reproductive traits in Iranian Holstein dairy cattle. The markers and candidate genes identified here can be used in genomic selection to improve the reproductive traits of Holstein dairy cattle.
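The step of collecting genes within 250 kb of each important SNP can be sketched as below. This is an illustration under stated assumptions, not the annotation pipeline used in the study: the `Snp` and `Gene` records, the function name, and every name and coordinate in the example are hypothetical placeholders, and a gene is counted as nearby when any part of it overlaps the window around the SNP.

```python
from collections import namedtuple

Snp = namedtuple("Snp", "name chrom pos")
Gene = namedtuple("Gene", "name chrom start end")

def genes_near_snps(snps, genes, window=250_000):
    """Map each SNP name to the genes overlapping [pos - window, pos + window]."""
    hits = {}
    for snp in snps:
        lo, hi = snp.pos - window, snp.pos + window
        hits[snp.name] = [
            g.name for g in genes
            if g.chrom == snp.chrom and g.start <= hi and g.end >= lo
        ]
    return hits

# Placeholder coordinates (assumption): not real bovine genome positions.
snps = [Snp("snp_a", "3", 1_000_000)]
genes = [
    Gene("gene_near", "3", 1_200_000, 1_220_000),         # starts inside the window
    Gene("gene_far", "3", 1_300_000, 1_350_000),          # beyond 250 kb
    Gene("gene_other_chrom", "11", 1_000_000, 1_010_000), # different chromosome
    Gene("gene_overlap", "3", 700_000, 760_000),          # end reaches into the window
]
hits = genes_near_snps(snps, genes)
```

Whether the window is measured to the gene boundary, as here, or to the gene midpoint is a design choice that changes the number of candidate genes returned.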
