The task consists in extracting bacteria localization events, that is, mentions of a given species together with the place where it lives. For example:
Escherichia coli is commonly found in the lower intestine of warm-blooded organisms (endotherms).
Bacteria localizations range from plant or animal hosts for pathogenic or symbiotic bacteria to natural environments like soil or water. This task also targets specific environments of interest like medical environments (hospitals, surgery devices, etc.), processed food (dairy) and geographical localizations.
The high number of sequenced genomes (around 2,000 according to the NCBI) has led to the rapid development of high-throughput experiments and systems biology in the microbiology domain. However, these studies require environmental information about bacteria in order to make sense of the enormous amount of data produced. This information is scarcely available in databases and is almost exclusively expressed in natural language texts (articles and educational material), hence the motivation for a text-extraction task.
Bacteria localization extraction systems can prove useful to microbiologists whose studies have applications in medicine, agronomy, bioremediation, bio-energy, etc.
The biotopes corpus is a set of textbook documents that give general information about bacteria species in common language. These documents were taken from relevant public web sites. There are more than 20 source sites but the most important sources are:
The annotation guidelines are provided at the bottom of this page.
The annotation was provided by the Bibliome team at the Mathématiques Informatique et Génome laboratory at the Institut National de la Recherche Agronomique (INRA).
Bacteria taxon names are annotated as text-bound entities. Entities of the "Bacteria" type may be at any taxonomic level, from phylum (Eubacteria) to strain.
Localizations have also been annotated and broken into several types:
Host : living organisms in which pathogenic and symbiotic bacteria can live, denoted by non-bacterial taxonomic names (common or scientific).
HostPart : host parts in which bacteria can live; this type of entity includes organs, tissues, cell types, organelles and organic fluids.
Geographical : named places including cities, countries, continents, oceans, etc.
Environment : any place where bacteria live other than hosts. The Environment type is divided into sub-types of interest for the target audience:
Food : includes processed human, pet or cattle food.
Medical : medical environments including hospitals and medical devices.
Soil : includes all types of soils (e.g. agricultural, natural, industrial).
Water : includes all aquatic environments.
Localization events relate a bacterium to the place where it lives. This type of event has two mandatory arguments: the first is of type Bacterium and the second is one of the localization types. PartOf events denote an organ that belongs to an organism. This type of event has two mandatory arguments of type HostPart and Host respectively.
The task consists in predicting Localization events and PartOf events for texts with bacteria and localizations given as input.
Evaluation of participants' predictions will be based on precision, recall and F-score; participants will be ranked according to the global F-score. Only events will be evaluated: predicted entities that are not part of any predicted event will be ignored (they will not penalize the score). Participants will not be evaluated on coreference equivalence prediction.
Each event in the reference set is matched to the predicted event that maximizes an event matching score (see below for the description of the matching scores). The recall is the sum of matching scores divided by the number of events in the reference set.
Each event in the predicted set is matched to the reference event that maximizes the same matching score. The precision is the sum of matching scores divided by the number of events in the predicted set.
The F-score is, as usual, the harmonic mean of precision and recall.
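The recall, precision and F-score described above can be sketched in Python as follows. This is a minimal illustration, not the official evaluation program: the function name `evaluate` and the `score` callback are hypothetical, and returning 0 for an empty reference or prediction set is an assumption that the task description does not cover.

```python
from typing import Callable, Sequence, Tuple, TypeVar

E = TypeVar("E")


def evaluate(
    reference: Sequence[E],
    predicted: Sequence[E],
    score: Callable[[E, E], float],
) -> Tuple[float, float, float]:
    """Return (recall, precision, f_score) for two sets of events.

    `score(a, b)` is the matching score between reference event `a`
    and predicted event `b`, a value between 0 and 1.
    """
    # Recall: each reference event takes its best-scoring prediction.
    recall_sum = sum(
        max((score(a, b) for b in predicted), default=0.0) for a in reference
    )
    # Precision: each predicted event takes its best-scoring reference.
    precision_sum = sum(
        max((score(a, b) for a in reference), default=0.0) for b in predicted
    )
    # Assumption: an empty set yields 0 rather than a division error.
    recall = recall_sum / len(reference) if reference else 0.0
    precision = precision_sum / len(predicted) if predicted else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_score
```

For instance, with four reference events, three predicted events and a score of 1 for exactly two pairs (0 elsewhere), the recall is 2/4, the precision is 2/3 and the F-score is 4/7.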
Eab, the event matching score between a reference Localization event a and a predicted Localization event b, is defined as:
Eab = Bab . Lab
Bab is a matching score between the Bacterium arguments of a and b respectively, defined as:
if both have the same start and end boundaries, then Bab = 1
else Bab = 0
Lab is a matching score between the Localization arguments of a and b respectively, defined as:
Lab = Tab . Jab
Where Tab is a localization type matching score between the two localizations, defined as:
if both have the same type, then Tab = 1
else Tab = 0.5
Jab is the Jaccard index between the boundary spans of reference and predicted localizations.
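The Localization matching score defined above can be illustrated with a short Python sketch. The `Entity` class and the function names are hypothetical, and the sketch assumes that each entity is a single contiguous span given as character offsets (end exclusive) and that the Jaccard index is computed over characters; the task description does not state these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    type: str   # e.g. "Bacterium", "Host", "Water"
    start: int  # offset of the first character
    end: int    # offset just past the last character


def jaccard(a: Entity, b: Entity) -> float:
    """Jaccard index of two character spans: overlap / union."""
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - overlap
    return overlap / union if union else 0.0


def localization_score(
    ref_bact: Entity, ref_loc: Entity, pred_bact: Entity, pred_loc: Entity
) -> float:
    """Eab = Bab * Tab * Jab for one reference/predicted event pair."""
    # Bab: strict boundary match on the Bacterium argument.
    same_bounds = (ref_bact.start, ref_bact.end) == (pred_bact.start, pred_bact.end)
    b = 1.0 if same_bounds else 0.0
    # Tab: full credit for the same type, half credit otherwise.
    t = 1.0 if ref_loc.type == pred_loc.type else 0.5
    # Jab: relaxed boundary match on the Localization argument.
    return b * t * jaccard(ref_loc, pred_loc)
```

As a worked example with hypothetical annotations on the sentence given earlier: if the reference localization is "lower intestine" (offsets 42 to 57, type HostPart) and the prediction is "intestine" (offsets 48 to 57, type Host), with identical Bacterium boundaries, then Bab = 1, Tab = 0.5 and Jab = 9/15 = 0.6, so Eab = 0.3.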
For PartOf events, the matching score Pab is defined as:
if both arguments of reference and predicted events overlap, then Pab = 1
else Pab = 0
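A minimal Python sketch of the PartOf matching score follows. It reads "both arguments overlap" as: the HostPart arguments overlap each other and the Host arguments overlap each other. The names `Span`, `overlaps` and `part_of_score` are hypothetical, and spans are assumed to be character offsets with an exclusive end.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return max(a[0], b[0]) < min(a[1], b[1])


def part_of_score(
    ref_part: Span, ref_host: Span, pred_part: Span, pred_host: Span
) -> float:
    """Pab = 1 when both argument pairs overlap, 0 otherwise."""
    both = overlaps(ref_part, pred_part) and overlaps(ref_host, pred_host)
    return 1.0 if both else 0.0
```

Under this reading, any overlap at all on both arguments earns the full score, regardless of how closely the predicted boundaries follow the reference.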
Coreference equivalence: an event argument can be matched to its corresponding entity in the reference or prediction set, or to any of its coreference equivalents. In other words, it does not matter whether the argument in the predicted event is one entity or an equivalent one. Coreference equivalence is considered commutative and transitive.
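Because the equivalence is commutative and transitive, equivalent entities can be grouped into classes before matching. The following Python sketch uses a small union-find structure for this; the function name, the entity identifiers and the representation of equivalences as pairs are all assumptions made for illustration, not part of the task definition.

```python
from typing import Dict, Hashable, Iterable, Tuple


def equivalence_classes(
    pairs: Iterable[Tuple[Hashable, Hashable]]
) -> Dict[Hashable, Hashable]:
    """Map each entity id to a representative of its coreference class."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merging the classes of each pair gives the transitive closure.
    for a, b in pairs:
        parent[find(a)] = find(b)
    return {x: find(x) for x in list(parent)}
```

Two entities are then interchangeable as event arguments when they map to the same representative, for example `classes["T1"] == classes["T3"]` after declaring the pairs ("T1", "T2") and ("T2", "T3").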
Notice that this evaluation has the following properties:
Bacteria names boundary matching is strict
Localization entities boundary matching in Localization events is relaxed, though the Jaccard index rewards predictions that approach the reference
Localization entities boundary matching in PartOf events is super-relaxed since boundary mistakes are already penalized in Localization matching
Matching of localization types is relaxed