Protein/Gene Coreference Task

Online submission closed. Thank you very much for your participation!

The Protein Coreference (COREF) task is one of the supporting tasks in the BioNLP Shared Task 2011.

One lesson from BioNLP-ST'09 is that anaphoric expressions pose a non-trivial obstacle to further improvement of event extraction. The COREF task addresses the problem of finding anaphoric references to proteins or genes. We expect that addressing this task has the potential to significantly improve event extraction performance. Below is an example of text involving coreferring expressions: the spans highlighted in red are anaphoric expressions, whose referents are indicated by arrows.

In the example, the definite noun phrase "this transcription factor" (T32) refers to "NF-kappa B p65" (T31), or more specifically "p65" (T10). Knowing this connection should help in finding the event expressed in "nuclear exclusion of this transcription factor", namely the localization of p65 (out of the nucleus).

With this task, we concentrate on finding anaphoric expressions that refer to proteins (or genes). Following the tradition of BioNLP-ST, we begin with protein annotations: the gold protein annotations are given, e.g. those highlighted in purple in the above example.

The first step is then to find candidate anaphoric expressions that may refer to proteins. In this task, pronouns (e.g. it or they) and definite noun phrases that may refer to proteins (e.g. the transcription factor or the inhibitor) are regarded as candidate anaphoric protein references.

The next step is to find the antecedents of such anaphoric expressions. The training and test materials of this task include annotations that link candidate anaphoric protein references to their antecedents, if these exist in the text.

Note that an anaphoric expression, e.g. "which" (T29), is sometimes connected to more than one protein reference, e.g. "p65" (T4) and "p50" (T5). Sometimes, coreference structures do not involve any specific protein reference, e.g. T30 and T27.

In order to establish a stable evaluation, we only focus on coreferencing structures that involve specific protein references, e.g. T29 and T28, and T32 and T31.

R1 Coref Anaphora:T29 Antecedent:T28 [T4, T5]

R2 Coref Anaphora:T30 Antecedent:T27

R3 Coref Anaphora:T32 Antecedent:T31 [T10]

The coreference relations are represented in predicate-argument structure as above. Of the three, only two, R1 and R3, involve specific protein references (T4 and T5 for R1, T10 for R3). R2 will therefore be ignored in evaluation. However, relations that do not involve specific protein references are provided in the training data to help system development.

Task Definition

Participants will be given gold annotations of protein references, e.g. the purple ones in the above example. They then have to find expressions that have a coreference relation with the protein mentions, e.g. R1 and R3. Note that the boundaries of the spans T28 and T31 do not need to be precise: it is sufficient that they contain T4 and T5, or T10, respectively. Correctly finding R1 is credited with 2 points, since it involves two protein references, while finding R3 is credited with 1 point.

Annotation

The *.a1 files include annotations for specific protein/gene mentions. These files will be given to the participants. In other words, the participants will begin this task with gold annotations of proteins/genes. The following are the protein/gene annotations corresponding to the above example:

T4 Protein 275 278 p65

T5 Protein 294 297 p50

T6 Protein 367 372 v-rel

T7 Protein 406 409 p65

T8 Protein 597 600 p50

T9 Protein 843 848 MAD-3

T10 Protein 879 882 p65
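The protein lines above can be read with a few lines of code. The following is a minimal sketch, assuming each line has the layout "ID Protein begin end text" with whitespace-separated fields as displayed here; the names `Protein` and `parse_a1` are illustrative and not part of any official task tool.

```python
from typing import Dict, Iterable, NamedTuple


class Protein(NamedTuple):
    id: str
    begin: int  # character offset where the mention starts
    end: int    # character offset just past the mention
    text: str


def parse_a1(lines: Iterable[str]) -> Dict[str, Protein]:
    """Parse lines such as 'T4 Protein 275 278 p65' (assumed layout)."""
    proteins = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        ann_id, label, begin, end, text = line.split(None, 4)
        if label == "Protein":
            proteins[ann_id] = Protein(ann_id, int(begin), int(end), text)
    return proteins


sample = """T4 Protein 275 278 p65
T5 Protein 294 297 p50
T10 Protein 879 882 p65"""
proteins = parse_a1(sample.splitlines())
print(proteins["T4"])
```

In these sample lines, the difference between the end and begin offsets equals the length of the mention text, which gives a simple sanity check when loading the files.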

The *.a2 files include annotations for coreferencing expressions.

T27 Exp 179 222 the NF-kappa B transcription factor complex 215 222 complex

T28 Exp 264 297 NF-kappa B p65 and NF-kappa B p50

T29 Exp 307 312 which

T30 Exp 459 471 this complex 464 471 complex

T31 Exp 868 882 NF-kappa B p65

T32 Exp 1022 1047 this transcription factor 1027 1047 transcription factor

R1 Coref Anaphora:T29 Antecedent:T28 [T5, T4]

R2 Coref Anaphora:T30 Antecedent:T27

R3 Coref Anaphora:T32 Antecedent:T31 [T10]

The expressions that may participate in coreference relations are annotated with Exp labels. They include the following:

    1. The anaphoric expressions that may refer to protein/genes ("protein markables"; T27, T29, T30, T32).

      • definite noun phrases, pronouns, ...

      • Note that the expression "the molecular basis" is a definite noun phrase, but it is not annotated because it is unlikely to be a protein reference.

    2. The expressions containing protein/gene names that are antecedents of the anaphoric expressions (T28, T31).

      • These are the target of evaluation in the atom link evaluation mode (see below).

    3. The antecedents of the anaphoric expressions that are not linked to expressions containing protein/gene names.

      • These are included in the annotation to support machine learning-based approaches.

      • These are included in the target of evaluation in the surface link evaluation mode (see below).

The coreference relations are annotated with Coref labels, connecting anaphor-antecedent pairs. The protein/gene IDs appearing in square brackets indicate the specific proteins/genes that are related to the coreference relations. Participants do not need to produce the protein/gene IDs, as they are clear from the corresponding *.a1 files. Note that the boundary of an Exp annotation can be arbitrary to some extent. For example, the definite article "the" can be omitted from the expression T27. Since the minimal span of T27 is "complex", at least that span must be included.

The following annotations are equivalent to the annotations T28 and R1:

T28-1 Exp 264 278 NF-kappa B p65

T28-2 Exp 283 297 NF-kappa B p50

R1 Coref Anaphora:T29 Antecedent:T28-1 Antecedent2:T28-2 [T5, T4]
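The Exp and Coref lines described in this section can be parsed along the following lines. This is a sketch under stated assumptions: fields are separated by single whitespace characters as displayed here, the covered text is exactly end minus begin characters long, and an optional minimal span follows as "begin end text". The names `Exp`, `Coref`, `parse_exp` and `parse_coref` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Exp(NamedTuple):
    id: str
    begin: int
    end: int
    text: str
    min_span: Optional[Tuple[int, int]]  # minimal boundary, if annotated


class Coref(NamedTuple):
    id: str
    anaphor: str
    antecedents: Tuple[str, ...]
    proteins: Tuple[str, ...]  # IDs listed in square brackets, if any


def parse_exp(line: str) -> Exp:
    ann_id, _label, begin, end, rest = line.strip().split(None, 4)
    begin, end = int(begin), int(end)
    text = rest[: end - begin]          # covered text has end - begin characters
    tail = rest[end - begin:].split()   # optional "<min begin> <min end> <text>"
    min_span = (int(tail[0]), int(tail[1])) if len(tail) >= 2 else None
    return Exp(ann_id, begin, end, text, min_span)


def parse_coref(line: str) -> Coref:
    ann_id = line.split(None, 1)[0]
    args = re.findall(r"(\w+):(\S+)", line)  # role:ID pairs
    anaphor = next(v for k, v in args if k == "Anaphora")
    antecedents = tuple(v for k, v in args if k.startswith("Antecedent"))
    bracket = re.search(r"\[([^\]]*)\]", line)
    proteins = ()
    if bracket:
        proteins = tuple(p.strip() for p in bracket.group(1).split(",") if p.strip())
    return Coref(ann_id, anaphor, antecedents, proteins)


t27 = parse_exp("T27 Exp 179 222 the NF-kappa B transcription factor complex 215 222 complex")
t29 = parse_exp("T29 Exp 307 312 which")
r1 = parse_coref("R1 Coref Anaphora:T29 Antecedent:T28 [T4, T5]")
r2 = parse_coref("R2 Coref Anaphora:T30 Antecedent:T27")
print(t27, t29, r1, r2, sep="\n")
```

Slicing the text by its offset length avoids guessing where the expression ends and the minimal span begins when the fields are not tab-separated.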

Evaluation

The evaluation is carried out in two steps: evaluation of mention detection, and evaluation of mention linking to produce coreference links.

  • Evaluation of mention detection

According to the task definition, a gene/protein mention in this task can be:

(Type1) an expression that contains gene/protein name annotations, which are called name containing mentions. Note that not all expressions containing gene/protein name annotations refer to the gene/protein entities.

(Type2) an apposition of a Type1 mention

(Type3) an anaphoric expression coreferring with a Type1, Type2, or Type3 mention

All of the mentions are represented by 'Exp' in the '.a2' files of the corpus. Mention detection is the detection of these mentions, which include both anaphors and antecedents. The evaluation is based on standard precision, recall, and F-score, calculated as below:

P = number of correctly detected mentions / number of detected mentions

R = number of correctly detected mentions / number of gold mentions

F = 2PR/(P + R)
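The three formulas above translate directly into code. The sketch below uses made-up counts purely for illustration; the function name `precision_recall_f` is hypothetical, and returning 0.0 for an empty denominator is an assumption, since the task description does not say how that case is handled.

```python
def precision_recall_f(num_correct: int, num_detected: int, num_gold: int):
    """Return (P, R, F) from mention counts; 0.0 when a denominator is zero."""
    p = num_correct / num_detected if num_detected else 0.0
    r = num_correct / num_gold if num_gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


# Illustrative counts only: 4 mentions detected, 3 of them correct, 6 gold mentions.
p, r, f = precision_recall_f(num_correct=3, num_detected=4, num_gold=6)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.75 R=0.50 F=0.60
```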

Recall is sometimes called the coverage rate of detected mentions. While low coverage can be a bottleneck for the next step of linking mentions, high coverage raises the complexity of that step, since the number of antecedent candidates increases.

In order to provide different views for the results, we use different criteria to judge whether a detected mention is correct or not. They are:

(1) Exact match:

begin(detected mention)= begin(gold mention) & end(detected mention)= end(gold mention)

(2) Partial match based on minimal and maximal boundaries of gold mentions:

begin(detected mention)>=begin(maximal boundary) & end(detected mention)<=end(maximal boundary)

begin(detected mention)<=begin(minimal boundary) & end(detected mention)>=end(minimal boundary)
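The two matching criteria can be written as small predicates over (begin, end) offset pairs. This is a sketch that assumes the two partial-match conditions must hold together, i.e. the detected span lies inside the maximal boundary and covers the minimal boundary; the function names are illustrative.

```python
from typing import Tuple

Span = Tuple[int, int]  # (begin, end) character offsets


def exact_match(detected: Span, gold: Span) -> bool:
    return detected[0] == gold[0] and detected[1] == gold[1]


def partial_match(detected: Span, maximal: Span, minimal: Span) -> bool:
    begin, end = detected
    inside_max = begin >= maximal[0] and end <= maximal[1]
    covers_min = begin <= minimal[0] and end >= minimal[1]
    return inside_max and covers_min  # assumed: both conditions are required


# T27 from the annotation example: maximal span 179-222, minimal span 215-222.
t27_max, t27_min = (179, 222), (215, 222)
without_article = (183, 222)   # "NF-kappa B transcription factor complex"
missing_head = (179, 214)      # stops before "complex"

print(exact_match(without_article, t27_max))              # False
print(partial_match(without_article, t27_max, t27_min))   # True
print(partial_match(missing_head, t27_max, t27_min))      # False
```

Dropping the article still counts as a partial match, whereas a span that omits the minimal span "complex" does not.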

  • Evaluation of mention linking

Evaluation of the mention linking task is reported using precision (P), recall (R), and F-score (F).

A response coreference link is correct when:

- the antecedent and anaphor mentions of the link are correct, following one of the above criteria for mention detection.

- there is a gold coreference link between the corresponding gold mentions.

We calculate evaluation scores from two perspectives: surface coreference links and atom coreference links. A surface coreference link is represented by the 'Coref' type in the '.a2' files of the corpus. An atom coreference link is a link from an anaphoric expression to a name containing mention (Type3->Type1) or (Type3->(Type2)*->Type1). Atom links are generated from surface links. While a surface coreference link gives us a general view of the problem, a successful atom coreference link helps us to trace from an anaphoric expression to a gene/protein name. Such atom links may help increase the recall of information extraction systems.

For evaluation purposes, atom coreference links can also be considered equivalent to links between anaphoric expressions and the genes/proteins included in antecedents. A third evaluation perspective, called protein coreference links, has been added in order to loosen the expression boundary matching criteria.

The following are examples of the above three evaluation perspectives for mention linking.

Example 1:

Gold:

R1 Coref Anaphora:T29 Antecedent:T28 [T4, T5]

R2 Coref Anaphora:T30 Antecedent:T27

R3 Coref Anaphora:T32 Antecedent:T31 [T10]

Surface coreference links = {(T29->T28)/1 score, (T30->T27)/1 score, (T32->T31)/1 score}

Atom coreference links = {(T29->T28)/2 score, (T32->T31)/1 score}

Protein coreference links = {(T29->T4)/1 score, (T29->T5)/1 score, (T32->T10)/1 score}

Example 2:

Gold:

R1 Coref Anaphora:T29 Antecedent:T28 [T4, T5]

R2 Coref Anaphora:T30 Antecedent:T27

R3 Coref Anaphora:T32 Antecedent:T31 [T10]

R4 Coref Anaphora:T33 Antecedent:T32

Surface coreference links = {(T29->T28)/1 score, (T30->T27)/1 score, (T32->T31)/1 score, (T33->T32)/1 score}

Atom coreference links = {(T29->T28)/2 score, (T32->T31)/1 score, (T33->T31)/1 score}

Protein coreference links = {(T29->T4)/1 score, (T29->T5)/1 score, (T32->T10)/1 score, (T33->T10)/1 score}
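The derivation of atom and protein links from surface links in Example 2 can be sketched as follows. This is an illustration, not the official evaluation tool: it assumes each anaphor has a single antecedent, that a chain is followed until an antecedent with bracketed protein IDs is reached, and that an atom link scores one point per protein, as the examples suggest. The name `derive_links` is hypothetical.

```python
from typing import Dict, List, Tuple


def derive_links(surface: Dict[str, Tuple[str, List[str]]]):
    """surface maps anaphor -> (antecedent, protein IDs of that antecedent)."""
    atom: Dict[Tuple[str, str], int] = {}
    protein: List[Tuple[str, str]] = []
    for anaphor in surface:
        antecedent, prots = surface[anaphor]
        seen = {anaphor}  # guards against cyclic chains
        # Follow the chain while the current antecedent names no protein
        # and is itself the anaphor of another surface link.
        while not prots and antecedent in surface and antecedent not in seen:
            seen.add(antecedent)
            antecedent, prots = surface[antecedent]
        if prots:
            atom[(anaphor, antecedent)] = len(prots)
            protein.extend((anaphor, p) for p in prots)
    return atom, protein


# Gold surface links of Example 2.
gold = {
    "T29": ("T28", ["T4", "T5"]),
    "T30": ("T27", []),
    "T32": ("T31", ["T10"]),
    "T33": ("T32", []),
}
atom_links, protein_links = derive_links(gold)
print(atom_links)
print(protein_links)
```

The link T30->T27 yields neither an atom nor a protein link, because no protein reference is reachable from it.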

Recall is calculated in two ways:

>> to evaluate the coreference resolution algorithm:

- R = total correct links / total gold links after removing broken links caused by failure of mention detection

>> to evaluate the coreference resolution system as a whole:

- R = total correct links / total gold links
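The difference between the two recall figures can be illustrated with a short sketch. It rests on an interpretation that the text does not spell out: a gold link is treated as "broken" when either of its two mentions was not detected. Mention IDs in the response are assumed to be already aligned with gold mention IDs, and the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[str, str]  # (anaphor, antecedent)


def recall_two_ways(gold_links: Iterable[Link],
                    response_links: Iterable[Link],
                    detected_mentions: Set[str]):
    gold = set(gold_links)
    correct = len(gold & set(response_links))
    # Assumed reading of "broken links": gold links with an undetected mention.
    reachable = {(a, b) for a, b in gold
                 if a in detected_mentions and b in detected_mentions}
    r_algorithm = correct / len(reachable) if reachable else 0.0
    r_system = correct / len(gold) if gold else 0.0
    return r_algorithm, r_system


gold = [("T29", "T28"), ("T30", "T27"), ("T32", "T31")]
detected = {"T29", "T28", "T32", "T31"}   # T30 and T27 were missed
response = [("T29", "T28")]
r_algorithm, r_system = recall_two_ways(gold, response, detected)
print(r_algorithm, r_system)
```

In this toy case the algorithm-level recall is 1/2, because only two gold links remain after mention detection, while the system-level recall is 1/3.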

Corpus

The coreference annotations for BioNLP-ST'11 were produced based on the GENIA-MedCo coreference corpus, which is a product of collaboration between the GENIA project and the MedCo Annotation Project. For BioNLP-ST'11, annotations relevant to proteins or genes were selected, cleaned and augmented. Two other sets of annotations, the GENIA event annotation and the GENIA syntactic tree annotation, were consulted during this polishing.

* We declare that the use of the GENIA-MedCo coreference corpus in any way for this coreference task is prohibited.

Task Results

The CO supporting task is completed. Final submissions were received from six teams, whose evaluation results are summarized in the following table (protein coreference link perspective):

* The protein coreference link evaluation was chosen as the primary evaluation perspective, as it reflects the task definition faithfully.

* For example, the performance of the first-ranked system may be interpreted as follows:

"22.18% of hidden protein references can be found at the precision of 73.26%."

The primary performance metric is overall F-score, shown in bold in the table above.
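As a consistency check, the F-score implied by the recall and precision quoted above for the top-ranked system follows from the formula F = 2PR/(P + R) given in the Evaluation section. The value below is derived from those two quoted figures, not read from the results table:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.7326 \times 0.2218}{0.7326 + 0.2218}
  = \frac{0.32498}{0.9544}
  \approx 0.3405
```

That is, an overall F-score of roughly 34.05%.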