WARNING: The specification of the task changed on December 10th.
This task consists in the full extraction of genetic processes mentioned in scientific texts concerning the bacterium Bacillus subtilis. This organism is a long-standing model species, and there is an active systems biology community around B. subtilis that makes extensive use of such information. Unfortunately, this information is available almost exclusively in the literature and can rarely be found in public databases.
Knowledge about genetic regulation in B. subtilis is advanced and detailed. The corpus reflects this by providing a wide range of annotation and event types.
Corpus
The INRA-GI corpus is a set of sentences from abstracts of selected PubMed references concerning the genetic regulation of B. subtilis. Most sentences are the same as those in the LLL challenge corpus.
The annotation was revised and enriched by a joint effort of the Bibliome team of the MIG Laboratory at the Institut National de Recherche Agronomique (INRA) and the Laboratoire d'Informatique de Paris Nord at the Université Paris 13.
Entities
The corpus was annotated with a rich set of entity types divided into two main groups. The first group, genic entities, denotes biological objects representing a gene, a group of genes or a gene product. This entity type has the following sub-types:
- GeneProduct : the result of the transcription and possibly the translation of a gene; this entity type thus includes RNAs and proteins.
- Protein : a protein.
- PolymeraseComplex : RNA polymerase, possibly containing a sigma factor.
- Gene : a gene.
- ProteinFamily : a family of proteins mentioned by their common function or by their common ancestor.
- GeneFamily : a family of genes mentioned by their common function or by their common ancestor.
- GeneComplex : a group of adjacent genes; this entity type thus includes operons and gene fusions.
- Regulon : a set of genes that are regulated by a common protein or mechanism.
- Site : a (short) genomic location that corresponds to a binding site for the transcription machinery or a transcription factor.
- Promoter : upstream region of a gene or operon that binds the polymerase for gene transcription.
The second group of entities consists of phrases expressing either molecular processes or the molecular state of the bacteria. They represent some kind of action that can be performed on a genic entity. This entity type has the following sub-types:
- Action: a molecular process in a broad sense, or the (molecular) state of the bacteria (level or concentration of a protein or transcript).
- Transcription: a particular case of Action, corresponding to the transcription of a gene.
- Expression: a particular case of Action that differs from Transcription in that it relates to the gene product and not the transcript (the gene product being a protein or an RNA molecule).
Events
The events to predict are binary, directed relations. They can only hold between relevant entities (see previous section). Interaction events are broken down into several distinct types:
- RegulonDependence: a protein is said to be part of the molecular mechanism underlying a regulon.
- BindTo: a protein is explicitly said to bind to another protein, or to DNA.
- TranscriptionFrom: the transcription action is said to start from a given genomic location.
- RegulonMember: part-of relation between genes and regulons.
- SiteOf: a site is near or inside a promoter or a gene to which it is functionally related.
- TranscriptionBy: a protein is, or is part of, the protein complex that actually performs the transcription of a gene.
- PromoterOf: a promoter is said to be related to, or located near, a gene or an operon.
- PromoterDependence: a promoter is said to be controlled by a protein.
- ActionTarget: relation between an action entity and its target genic entity.
- Interaction: an interaction between two molecules, in a very broad sense (could be regulation, binding, regulon membership).
The arguments of these events are labeled and typed, as specified in the following table:
Event type | Arguments |
RegulonDependence | Regulon : [ Regulon ] ; Target : [ GeneEntity | ProteinEntity ] |
BindTo | Agent : [ ProteinEntity ] ; Target : [ Site | Promoter | Gene | GeneComplex ] |
TranscriptionFrom | Transcription : [ Transcription | Expression ] ; Site : [ Site | Promoter ] |
RegulonMember | Regulon : [ Regulon ] ; Member : [ GeneEntity | ProteinEntity ] |
SiteOf | Site : [ Site ] ; Entity : [ Site | Promoter | GeneEntity ] |
TranscriptionBy | Transcription : [ Transcription ] ; Agent : [ ProteinEntity ] |
PromoterOf | Promoter : [ Promoter ] ; Gene : [ GeneEntity | ProteinEntity ] |
PromoterDependence | Promoter : [ Promoter ] ; Protein : [ GeneEntity | ProteinEntity ] |
ActionTarget | Action : [ Action | Expression | Transcription ] ; Target : [ * ] |
Interaction | Agent : [ GeneEntity | ProteinEntity ] ; Target : [ GeneEntity | ProteinEntity ] |
In this specification, [ X | Y ] stands for the union of types X and Y, meaning that the argument can be either of type X or of type Y. The type GeneEntity is an abbreviation for [ Gene | GeneFamily | GeneComplex ], and ProteinEntity is an abbreviation for [ Protein | ProteinFamily | PolymeraseComplex | GeneProduct ]. The notation [ * ] means any type.
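As an illustration, the argument typing above can be written down as a lookup table and used to reject ill-typed candidate events before any prediction is scored. The following Python sketch is not part of the task material: the names EVENT_SIGNATURES and is_well_typed are hypothetical, and entity types are assumed to be available as plain strings matching the labels used in this specification.

```python
# Hypothetical encoding of the argument table; not part of the official task data.
GENE_ENTITY = {"Gene", "GeneFamily", "GeneComplex"}
PROTEIN_ENTITY = {"Protein", "ProteinFamily", "PolymeraseComplex", "GeneProduct"}
GENE_OR_PROTEIN = GENE_ENTITY | PROTEIN_ENTITY
ANY = None  # stands for the notation [ * ]

# event type -> ((first label, allowed types), (second label, allowed types))
EVENT_SIGNATURES = {
    "RegulonDependence": (("Regulon", {"Regulon"}), ("Target", GENE_OR_PROTEIN)),
    "BindTo": (("Agent", PROTEIN_ENTITY),
               ("Target", {"Site", "Promoter", "Gene", "GeneComplex"})),
    "TranscriptionFrom": (("Transcription", {"Transcription", "Expression"}),
                          ("Site", {"Site", "Promoter"})),
    "RegulonMember": (("Regulon", {"Regulon"}), ("Member", GENE_OR_PROTEIN)),
    "SiteOf": (("Site", {"Site"}),
               ("Entity", {"Site", "Promoter"} | GENE_ENTITY)),
    "TranscriptionBy": (("Transcription", {"Transcription"}),
                        ("Agent", PROTEIN_ENTITY)),
    "PromoterOf": (("Promoter", {"Promoter"}), ("Gene", GENE_OR_PROTEIN)),
    "PromoterDependence": (("Promoter", {"Promoter"}),
                           ("Protein", GENE_OR_PROTEIN)),
    "ActionTarget": (("Action", {"Action", "Expression", "Transcription"}),
                     ("Target", ANY)),
    "Interaction": (("Agent", GENE_OR_PROTEIN), ("Target", GENE_OR_PROTEIN)),
}


def is_well_typed(event_type, first_type, second_type):
    """Return True if both argument entity types fit the event's signature."""
    signature = EVENT_SIGNATURES.get(event_type)
    if signature is None:
        return False  # unknown event type
    for (_label, allowed), actual in zip(signature, (first_type, second_type)):
        if allowed is not ANY and actual not in allowed:
            return False
    return True


print(is_well_typed("BindTo", "Protein", "Promoter"))  # True
print(is_well_typed("BindTo", "Gene", "Promoter"))     # False
```

Such a check only filters candidates by type; it says nothing about whether the sentence actually expresses the event.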
The task consists in predicting interaction events for texts in which the entities and syntactic dependencies are given.
Special rules
- An Interaction may only occur between molecules that are explicitly named, that is, referred to by an identifier. For instance, in the sentence “Thus, this gene is not controlled by sigmaK”, there is no interaction.
- Several entities in the same sentence often denote the same object (the same molecule, for instance). These coreferences are not given in the challenge data but are important for the predictions. In a coreference relation, one of the denotations may be more precise than the others, which defines a partial order on the set of coreferences to the same object. In a set of coreferences, only maximally precise denotations can be the argument of a relation. For example, in “Not only abrB is repressed by sigmaK, but this gene is also a member of sigmaA regulon”, the phrase “this gene”, although annotated as a gene, cannot be part of the RegulonMember relation, since it corefers with abrB, which is a more precise denotation. If there are several maximally precise denotations, all relations should be given, as in “Phosphorylated Spo0A (Spo0A~P) regulates cotD”. Finding the most precise denotations in a set of coreferences can be tricky, and even rather subjective. For this reason, the following rules were applied for ordering denotations:
  - two terms with identifiers are equally informative
  - a term with an identifier is more informative than any term without one
  - a term with indications on the function or nature of the object is more informative than a term without them, or a pronoun
- A common phrasing pattern in this corpus expresses that a protein regulates transcription by a given sigma factor. That is, the transcription of a gene G initiated by a sigma factor S is controlled by a factor F. In that case it is tempting, and often true in practice, but not safe to conclude that there is an interaction between F and S. To avoid any argument over interpretation, this interaction was discarded, and only an interaction between F and G is reported.
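The ordering rules for coreferent denotations given above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the Mention class and its two flags are hypothetical, coreference sets are assumed to be supplied by some upstream step (they are not part of the challenge data), and the partial order is simplified to three ranks (identifier, descriptive term, anything else).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One denotation of an object; the two flags are hypothetical annotations."""
    text: str
    has_identifier: bool = False   # named with an identifier, e.g. "abrB"
    has_description: bool = False  # indicates function or nature, e.g. "this gene"


def precision_rank(mention):
    """Map the ordering rules to a rank: identifier > descriptive term > other."""
    if mention.has_identifier:
        return 2
    if mention.has_description:
        return 1
    return 0  # pronouns and other uninformative terms


def maximally_precise(coref_set):
    """Return every mention of the set that shares the highest rank."""
    if not coref_set:
        return []
    best = max(precision_rank(m) for m in coref_set)
    return [m for m in coref_set if precision_rank(m) == best]


coref = [
    Mention("abrB", has_identifier=True),
    Mention("this gene", has_description=True),
]
print([m.text for m in maximally_precise(coref)])  # ['abrB']
```

When two mentions both carry an identifier, both are returned, which matches the requirement that all relations be given for several maximally precise denotations.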
Evaluation
Participants will be evaluated and ranked according to two scores:
- F-score (precision and recall) for all event types together
- F-score for the Interaction event type
In order for a predicted event to count as a hit, both arguments must be the same as in the reference, in the right order, and the event type must be the same as in the reference.
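The matching criterion can be illustrated with a short Python sketch. It is not the official evaluation program: events are assumed to be represented as (event type, first argument, second argument) tuples with hypothetical entity identifiers such as "T1", and the F-score is assumed to be the balanced harmonic mean of precision and recall.

```python
def score(predicted, reference, event_type=None):
    """Return (precision, recall, F-score) under exact matching of events.

    Events are (type, first_argument, second_argument) tuples, so a predicted
    event with swapped arguments or a different type does not count as a hit.
    If event_type is given, only events of that type are scored.
    """
    predicted, reference = set(predicted), set(reference)
    if event_type is not None:
        predicted = {e for e in predicted if e[0] == event_type}
        reference = {e for e in reference if e[0] == event_type}
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical toy data: the second prediction has its arguments reversed.
reference = [("Interaction", "T1", "T2"), ("BindTo", "T1", "T3"), ("SiteOf", "T4", "T5")]
predicted = [("Interaction", "T1", "T2"), ("Interaction", "T2", "T1"), ("BindTo", "T1", "T3")]

print(score(predicted, reference))                 # all event types together
print(score(predicted, reference, "Interaction"))  # Interaction events only
```

The two calls correspond to the two ranking scores: one over all event types together and one restricted to the Interaction type.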
Results
Team | Interaction Recall | Interaction Precision | Interaction F-Score | Global Recall | Global Precision | Global F-Score |
UTurku | 0.56 | 0.75 | 0.64 | 0.71 | 0.85 | 0.77 |
Detailed Results for team UTurku
Type | Precision | Recall | F-score |
ActionTarget | 0.94 | 0.92 | 0.93 |
BindTo | 0.75 | 0.75 | 0.75 |
Interaction | 0.75 | 0.56 | 0.64 |
PromoterDependence | 1.00 | 1.00 | 1.00 |
PromoterOf | 1.00 | 1.00 | 1.00 |
RegulonDependence | 1.00 | 1.00 | 1.00 |
RegulonMember | 1.00 | 0.50 | 0.67 |
SiteOf | 1.00 | 0.17 | 0.29 |
TranscriptionBy | 0.67 | 0.50 | 0.57 |
TranscriptionFrom | 1.00 | 1.00 | 1.00 |
Global | 0.85 | 0.71 | 0.77 |