[ABE-L] Fwd: Academic Seminar of Data Science with Gabriel Goldstein of Instituto de Biociências - USP - Online

Hedibert Lopes hedibert at gmail.com
Thu Mar 9 10:57:09 -03 2023



Updated: The Seminar will be online

Title: Identifying Drosophila new genes using machine learning

Speaker:  Gabriel Goldstein

University:  Instituto de Biociências - USP

Abstract: There is a class of genes that emerged recently in the history of
a taxon: new genes. These genes are so classified because, despite their
presence in a taxon, they are absent in a sister taxon and outgroups. To
identify new genes in a genome it is necessary to date all genes of a focal
species to the point in the phylogeny of the taxon in which each gene
originated. The main gene dating method for identifying new genes uses
synteny and parsimony when comparing genomes of related species to date all
genes of a focal species. Although the method is precise, it depends
heavily on the assembly and annotation of the genome of interest, which
limits its application to model species that have a manually curated
annotation. There are a number of biological characteristics
that are known to differ between new and old genes in a wide range of
analyzed taxa, such as humans, mice and plants. One example is the
expression profile of the two groups: new genes are mostly expressed during
male gametogenesis, whereas old genes are expressed broadly. With these
facts in mind, we propose in this work a method for identifying new genes
that uses biological information and machine learning to separate new genes
from old ones. For this, we collected information from databases and
generated expression, orthology and dN/dS data for D. melanogaster, the
species of the genus whose new genes have already been dated, which makes
it possible to train a supervised machine learning model. In addition to
this information, we used orthology data to eliminate
old genes while losing few new genes. This is possible because old genes
have, on average, more species with orthologs than new genes, since they
appeared earlier in the evolutionary history of the taxon. First, we tested
whether information from databases would be able to inform a machine
learning model that would separate new genes from old ones. For this, we
generated several models with different levels of complexity and different
combinations of variables, reaching a model that had 0.702 precision
(fraction of relevant instances among retrieved instances) and 0.733 recall
(fraction of relevant instances that were retrieved). After this step, we
needed to generate a model that approximated the situation expected in
species that lack the kind of database information available for D.
melanogaster. So, we ran similar tests with different sets of variables,
but this time using data that we generated ourselves in this work. After
performing these tests, we generated a model with 0.508 precision and 0.718
recall, demonstrating that it is possible, even with data generated in our
own experiments, to identify and classify new genes in D. melanogaster. To
verify whether the method we are proposing works in other species of the
Drosophila genus, we dated the genes of another species to identify its new
genes. We applied the method based on synteny and parsimony to the species D.
pseudoobscura and identified 1523 new genes and 12648 old genes.
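
To make the two quantities reported in the abstract concrete, here is a
minimal Python sketch of how precision and recall are computed for the
"new gene" class, together with an ortholog-count filter of the kind the
abstract describes. Everything in it is an assumption made for
illustration: the gene names, the ortholog counts, the threshold and the
function names are hypothetical, and the filter alone stands in for a
classifier. It is not the speaker's model or data.

```python
from typing import Dict, Set, Tuple


def precision_recall(predicted_new: Set[str], true_new: Set[str]) -> Tuple[float, float]:
    """Precision and recall for the 'new gene' class.

    precision = relevant instances among those retrieved
    recall    = relevant instances that were retrieved
    """
    true_positives = len(predicted_new & true_new)
    precision = true_positives / len(predicted_new) if predicted_new else 0.0
    recall = true_positives / len(true_new) if true_new else 0.0
    return precision, recall


def drop_widely_conserved(ortholog_species_count: Dict[str, int], max_species: int) -> Set[str]:
    """Keep only genes with orthologs in at most `max_species` species.

    Old genes tend to have orthologs in more species, so this removes
    many old genes while losing few new ones (threshold is hypothetical).
    """
    return {gene for gene, n in ortholog_species_count.items() if n <= max_species}


# Hypothetical toy data: number of species in which each gene has an ortholog.
counts = {"geneA": 1, "geneB": 11, "geneC": 2, "geneD": 12, "geneE": 3}
true_new = {"geneA", "geneC"}  # hypothetical reference labels

candidates = drop_widely_conserved(counts, max_species=3)
precision, recall = precision_recall(candidates, true_new)
print(sorted(candidates), round(precision, 3), recall)
# prints: ['geneA', 'geneC', 'geneE'] 0.667 1.0
```

In this toy case the filter keeps every new gene (recall 1.0) but also
lets one old gene through (precision 2/3), which is the trade-off a
downstream classifier would then have to improve.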


March 9, 2023
12 pm, São Paulo, Brazil (UTC/GMT -03:00)
The seminar will be streamed at <https://zoom.us/j/95781336030>

