[ABE-L] Fwd: Fw: Academic Seminar of Data Science with Gabriel Goldstein of Instituto de Biociências - USP - In Person

Hedibert Lopes hedibert at gmail.com
Thu Mar 2 20:50:36 -03 2023


*Subject:* Academic Seminar of Data Science with Gabriel Goldstein of
Instituto de Biociências - USP - In Person




Insper



*Title:* Identifying Drosophila new genes using machine learning



*Speaker:*  Gabriel Goldstein
<http://buscatextual.cnpq.br/buscatextual/visualizacv.do;jsessionid=B7AB908F0ABDE2DBFB98AB7EFE30A5BC.buscatextual_3>

*University:*  Instituto de Biociências - USP <https://www.ib.usp.br/>





*Abstract:* New genes are a class of genes that emerged recently in the
history of a taxon. They are so classified because, despite their presence
in one taxon, they are absent from its sister taxon and from outgroups. To
identify new genes in a genome, it is necessary to date every gene of a
focal species to the point in the phylogeny at which it originated. The
main gene dating method compares the genomes of related species using
synteny and parsimony. Although precise, this method depends heavily on
the assembly and annotation of the genome of interest, which limits its
application to model species with manually curated annotations. Several
biological characteristics are known to differ between new and old genes
across a wide range of taxa, such as humans, mice and plants. One example
is the expression profile: new genes are mostly expressed in male
gametogenesis, whereas old genes are expressed broadly. With these
facts in mind, we propose a method for identifying new genes that uses
biological information and machine learning to separate them from old
genes. To this end, we collected information from databases and generated
expression, orthology and *dN/dS* data for *D. melanogaster*, the species
of the genus whose new genes have already been dated, which makes it
possible to train a supervised machine learning model. In addition, we use
orthology data to eliminate old genes while losing few new genes. This is
possible because old genes have, on average, orthologs in more species
than new genes do, since they appeared earlier in the evolutionary history
of the taxon.
First, we tested whether information from databases could inform a
machine learning model that separates new genes from old ones. We
generated several models with different levels of complexity and different
combinations of variables, arriving at a model with 0.702 precision
(fraction of relevant instances among retrieved instances) and 0.733
recall (fraction of relevant instances that were retrieved). Next, we
needed a model that approximated the situation expected for species that,
unlike *D. melanogaster*, have no information available in databases. We
therefore ran similar tests with different sets of variables, this time
using data that we generated ourselves in this work. These tests yielded a
model with 0.508 precision and 0.718 recall, demonstrating that it is
possible, even with data generated in our own experiments, to identify and
classify new genes in *D. melanogaster*. To verify whether the proposed
method works in other species of the genus *Drosophila*, we dated the
genes of another species to identify its new genes. Applying the method
based on synteny and parsimony to *D. pseudoobscura*, we identified 1,523
new genes and 12,648 old genes.
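
As a rough illustration of two quantitative ideas in the abstract, the
ortholog-count filter and the precision and recall figures, the following
Python sketch applies both to toy data. The `Gene` record, the gene
identifiers, the counts and the `max_species` threshold are all invented
for illustration and are not taken from the work; the announcement does
not describe the actual features, model or cutoff used by the speaker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str             # hypothetical identifier
    n_ortholog_species: int  # number of species with an ortholog
    is_new: bool             # label from a reference dating (e.g. synteny/parsimony)


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class 'new gene'."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def ortholog_prefilter(genes, max_species):
    """Keep genes that have orthologs in at most `max_species` species."""
    return [g for g in genes if g.n_ortholog_species <= max_species]


# Toy data, invented for illustration only.
genes = [
    Gene("g1", 1, True),
    Gene("g2", 2, True),
    Gene("g3", 6, True),
    Gene("g4", 3, False),
    Gene("g5", 10, False),
    Gene("g6", 11, False),
    Gene("g7", 12, False),
    Gene("g8", 9, False),
]

kept = ortholog_prefilter(genes, max_species=3)
kept_ids = {g.gene_id for g in kept}

# Treat "survived the filter" as a crude prediction of "new gene".
y_true = [g.is_new for g in genes]
y_pred = [g.gene_id in kept_ids for g in genes]
precision, recall = precision_recall(y_true, y_pred)
print(f"kept={sorted(kept_ids)} precision={precision:.3f} recall={recall:.3f}")
```

In this toy setting the filter removes four of the five old genes at the
cost of one new gene, which is the kind of trade-off the abstract
describes. A real pipeline would presumably apply the filter before a
trained classifier, not use it as the prediction itself.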





*Date:* March 9, 2023

*Time:* 12 pm, São Paulo, Brazil (UTC/GMT -03:00)

*Location:* Paulo Renato de Souza room, 2nd floor - Building 1



The seminar will be streamed at <https://zoom.us/j/95781336030>