Bacteria Gene Interactions

WARNING: The specification of the task changed on December 10th.

This task consists of fully extracting the genetic processes mentioned in scientific texts about the bacterium Bacillus subtilis. This organism is a long-standing model species, and an active systems biology community around B. subtilis makes extensive use of such information. Unfortunately, this information is available almost exclusively in the literature and can rarely be found in public databases.

Knowledge of genetic regulation in B. subtilis is advanced and detailed. This corpus reflects this by providing a wider range of annotation and event types.

Corpus

The INRA-GI corpus is a set of sentences from abstracts of selected PubMed references concerning the genetic regulation of B. subtilis. Most sentences are the same as in the LLL challenge corpus.

The annotation was revised and enriched by a joint effort of the Bibliome team of MIG Laboratory at the Institut National de Recherche Agronomique (INRA) and the Laboratoire d'Informatique de Paris Nord at the Université Paris 13.

Entities

The corpus was annotated with a rich set of entity types divided into two main groups. The first group, genic entities, denotes biological objects representing a gene, a group of genes, or a gene product. This entity type has the following sub-types:

  • GeneProduct : the result of the transcription and possibly the translation of a gene; this entity type thus includes RNAs and proteins.

    • Protein : a protein.

      • PolymeraseComplex : RNA polymerase, possibly containing a sigma factor.

  • Gene : a gene.

  • ProteinFamily : a family of proteins mentioned by their common function or by their common ancestor.

  • GeneFamily : a family of genes mentioned by their common function or their common ancestor.

  • GeneComplex : a group of adjacent genes; this entity type thus includes operons and gene fusions.

  • Regulon : a set of genes that are regulated by a common protein or mechanism.

  • Site : a (short) genomic location that corresponds to a binding site for the transcription machinery or a transcription factor.

  • Promoter : upstream region of a gene or operon that binds the polymerase for gene transcription.

The second group of entities consists of phrases expressing either molecular processes or the molecular state of the bacteria. They represent some kind of action that can be performed on a genic entity. This entity type has the following sub-types:

  • Action: molecular process in a broad meaning, or the (molecular) state of the bacteria (level or concentration of a protein/transcript).

  • Transcription: particular case of Action, corresponding to the transcription of a gene.

  • Expression: particular case of Action, different from Transcription in that it relates to the gene product and not the transcript (gene product being a protein or RNA molecule).
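The two groups of entity types above form a small hierarchy, which can be sketched as a parent map. This is only an illustration, not part of the task data: the root label "GenicEntity", the placement of Action as a second root, and the helper name is_a are assumptions made for the sketch; the nesting itself is read off the bullet lists above.

```python
# Parent of each entity type, following the nesting of the lists above.
# "GenicEntity" is a hypothetical root label introduced for this sketch.
PARENT = {
    "GenicEntity": None,
    "GeneProduct": "GenicEntity",
    "Protein": "GeneProduct",
    "PolymeraseComplex": "Protein",
    "Gene": "GenicEntity",
    "ProteinFamily": "GenicEntity",
    "GeneFamily": "GenicEntity",
    "GeneComplex": "GenicEntity",
    "Regulon": "GenicEntity",
    "Site": "GenicEntity",
    "Promoter": "GenicEntity",
    "Action": None,
    "Transcription": "Action",
    "Expression": "Action",
}


def is_a(entity_type, ancestor):
    """Return True if entity_type equals ancestor or is one of its sub-types."""
    current = entity_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT[current]
    return False
```

With this map, is_a("PolymeraseComplex", "GeneProduct") holds because a polymerase complex is listed under Protein, which is itself listed under GeneProduct.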

Events

The events to predict are binary and directed relations. They can only be found between so-called relevant entities (see previous section). Interaction events were broken into several distinct types:

  • RegulonDependence: a protein is said to be part of the molecular mechanism underlying a regulon.

  • BindTo: a protein is explicitly said to bind to another protein, or to DNA.

  • TranscriptionFrom: the transcription action is said to start from a given genomic location.

  • RegulonMember: part-of relation between genes and regulons.

  • SiteOf: a site is near or inside a promoter or a gene it is functionally related with.

  • TranscriptionBy: a protein is or is a part of the protein complex that actually performs the transcription of a gene.

  • PromoterOf: a promoter is said to be related to, or located near, a gene or an operon.

  • PromoterDependence: a promoter is said to be controlled by a protein.

  • ActionTarget: relation between an action entity and its target genic entity.

  • Interaction: an interaction between two molecules, in a very broad meaning (could be regulation, binding, regulon membership).

The arguments of these events are labeled and typed, as specified in the following table:

In this specification, [ X | Y ] stands for the union of type X and Y, meaning that the argument can be either of type X or of type Y. The type GeneEntity is an abbreviation for [ Gene | GeneFamily | GeneComplex ], and ProteinEntity an abbreviation for [ Protein | ProteinFamily | PolymeraseComplex | GeneProduct ]. The notation [ * ] means any type.
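The type notation can be made concrete with plain sets. The sketch below encodes only the two abbreviations spelled out in the text; the names GENE_ENTITY, PROTEIN_ENTITY, ANY, union and accepts are hypothetical, and matching is done on exact type names, which is an assumption about how the specification is meant to be read.

```python
# The two abbreviations defined in the text, as sets of type names.
GENE_ENTITY = frozenset({"Gene", "GeneFamily", "GeneComplex"})
PROTEIN_ENTITY = frozenset(
    {"Protein", "ProteinFamily", "PolymeraseComplex", "GeneProduct"}
)
ANY = None  # stands for the [ * ] notation: any type is accepted


def union(*parts):
    """Build [ X | Y | ... ] from type names and/or existing type sets."""
    result = set()
    for part in parts:
        if isinstance(part, str):
            result.add(part)
        else:
            result |= part
    return frozenset(result)


def accepts(spec, entity_type):
    """Check whether an argument of the given type satisfies a type spec."""
    return spec is ANY or entity_type in spec
```

For example, accepts(union(GENE_ENTITY, PROTEIN_ENTITY), "Protein") is true, while accepts(GENE_ENTITY, "Protein") is not. The union used here is purely illustrative and is not taken from the argument table.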

The task consists of predicting interaction events in texts for which the entities and syntactic dependencies are given.

Special rules

    1. An Interaction may only occur between molecules that are explicitly named (that is, using an identifier). For instance, in the sentence “Thus, this gene is not controlled by sigmaK”, there is no interaction.

    2. Several entities in the same sentence often denote the same object (the same molecule, for instance). These coreferences are not given in the challenge data but are important for the predictions. In a coreference relation, one of the denotations may be more precise than the others, which defines a partial order on a set of coreferences to the same object. Within such a set, only maximally precise denotations can be the argument of a relation. Consider the example: “Not only abrB is repressed by sigmaK, but this gene is also a member of sigmaA regulon”. Although “this gene” is annotated as a gene, it cannot be part of the RegulonMember relation, since it corefers with abrB, which is a more precise denotation. If there are several maximally precise denotations, all relations should be given. Example: “Phosphorylated Spo0A (Spo0A P) regulates cotD”. Note that finding the most precise denotations in a set of coreferences can be tricky, and even rather subjective. That is why we applied the following rules for ordering denotations:

      • two terms with identifiers are equally informative

      • a term with an identifier is more informative than any term without

      • a term with indications on the function or nature of the object is more informative than a term without it, or a pronoun

    3. A common phrasing pattern in this corpus expresses that a protein regulates transcription by a given sigma factor. That is, the transcription of a gene G initiated by a sigma factor S is controlled by a factor F. In that case it is tempting to conclude that there is an interaction between F and S; this is often true in practice, but not safe. To avoid any argument over interpretation, we decided to discard it and to report only an interaction between F and G.
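The ordering rules listed under special rule 2 can be sketched as a three-level rank. This is a simplified reading, not the annotators' procedure: the Mention class, its two flags and the numeric ranks are assumptions of the sketch, and the flags are set by hand here since the challenge data does not provide them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    has_identifier: bool          # the term names the object, e.g. "abrB"
    is_descriptive: bool = False  # the term indicates function or nature


def rank(mention):
    """Higher rank means a more informative denotation."""
    if mention.has_identifier:
        return 2  # all terms with identifiers are equally informative
    if mention.is_descriptive:
        return 1  # more informative than a bare term or a pronoun
    return 0


def most_precise(coref_set):
    """Keep only the maximally precise denotations of one object."""
    best = max(rank(m) for m in coref_set)
    return [m for m in coref_set if rank(m) == best]
```

On the first example of rule 2, a set containing "abrB" (with identifier) and "this gene" (without) keeps only "abrB". On the second, "Phosphorylated Spo0A" and "Spo0A P" both carry an identifier, so both are kept and both relations with cotD would be reported.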

Evaluation

Participants will be evaluated and ranked according to two scores:

    1. F-score (precision and recall) for all event types together

    2. F-score for the Interaction event type

In order for a predicted event to count as a hit, both arguments must be the same as in the reference in the right order and the event type must be the same as in the reference.
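The hit criterion can be sketched by treating each event as a (type, first argument, second argument) tuple and counting exact matches. The text only says "F-score (precision and recall)", so the balanced harmonic mean used below is an assumption, as are the function name f_score and the entity identifiers in the usage note.

```python
def f_score(predicted, reference):
    """Return (precision, recall, F) for exact-match, order-sensitive events.

    Each event is a tuple (event_type, arg1, arg2). A prediction is a hit
    only if the type and both arguments, in order, match a reference event.
    """
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score (assumed): harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


def only_type(events, event_type):
    """Restrict a collection of events to one type, e.g. "Interaction"."""
    return {e for e in events if e[0] == event_type}
```

For instance, with reference events ("Interaction", "T1", "T2") and ("PromoterOf", "T3", "T4"), predicting ("Interaction", "T1", "T2") and the reversed ("Interaction", "T2", "T1") yields one hit out of two predictions: the reversed pair does not count because the relation is directed. The second ranking score would be obtained by applying only_type(..., "Interaction") to both sets before scoring.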

Results

Detailed Results for team UTurku

References

Manine A.P., Alphonse E., Bessières P. (2010). Extraction of genic interactions with the recursive logical theory of an ontology. Lecture Notes in Computer Science 6008:549-63.

Manine A.P., Alphonse E., Bessières P. (2009). Learning ontological rules to extract multiple relations of genic interactions from text. Int. J. Medical Informatics 78(12):31-8.

Manine A.P., Alphonse E., Bessières P. (2008). Information extraction as an ontology population task and its application to genic interactions. 20th IEEE Intl. Conf. Tools with Artificial Intelligence (ICTAI'08) pp. 74-81.