File formats

The BioNLP Shared Task 2011 data uses standoff formats similar to the format of the BioNLP'09 Shared Task. In the standoff representation, the texts of the documents are kept separate from the annotations, which are connected to specific spans of text through character offsets. Annotation files are associated with their text files by sharing the same base name (file name without suffix): for example, the file PMID-1000.a1 contains annotations for the file PMID-1000.txt.

The BioNLP'11 Shared Task file formats are identified by file name suffixes (".txt", ".a1", etc.) and are described in detail below.
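The base-name convention can be applied mechanically when loading a data directory. The following is a minimal sketch, not part of any official shared task tooling: the function name group_by_base_name is hypothetical, and the code assumes that the suffix is everything from the last "." in the file name onward.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_base_name(file_names):
    """Map each base name to a {suffix: file name} dictionary.

    Assumption: the suffix starts at the last "." of the file name.
    """
    groups = defaultdict(dict)
    for name in file_names:
        path = PurePath(name)
        groups[path.stem][path.suffix] = name
    return dict(groups)


files = ["PMID-1000.txt", "PMID-1000.a1", "PMID-1000.a2"]
for base, by_suffix in group_by_base_name(files).items():
    print(base, sorted(by_suffix))
```

Grouping by base name first makes it easy to detect a text file that lacks an annotation file, or the reverse.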

General annotation structure

All annotation file formats follow the same basic structure: Each line contains one annotation, and each annotation is given an ID that appears first on the line, separated from the rest of the annotation by a single TAB character. The rest of the structure varies by annotation type. Examples of annotation for an entity (T1), an event trigger (T2), an event (E1), an event modification (M1) and a relation (R1) are shown in the following.

T1 Protein 0 7 RFLAT-1

T2 Positive_regulation 53 62 activates

E1 Positive_regulation:T2 Theme:E2 Cause:T1

M1 Speculation E1

R1 Subunit-Complex Arg1:T1 Arg2:T3

Detailed descriptions of these annotations are given below.
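Because every annotation line begins with an ID followed by a single TAB, a reader can split off the ID before deciding how to interpret the rest of the line. A minimal sketch under that assumption (the function name split_annotation_line is hypothetical):

```python
def split_annotation_line(line):
    """Split one annotation line into (ID, rest) at the first TAB."""
    ann_id, tab, rest = line.rstrip("\n").partition("\t")
    if not tab:
        raise ValueError(f"no TAB after the annotation ID: {line!r}")
    return ann_id, rest


for line in ["T1\tProtein 0 7\tRFLAT-1", "M1\tSpeculation E1"]:
    print(split_annotation_line(line))
```

Only the first TAB is used as the delimiter here, so the second TAB of a text-bound annotation stays in the remainder for a type-specific parser to handle.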

Text-bound annotations

Text-bound annotations are an important category of annotation used in many of the file formats. A text-bound annotation identifies a specific span of text as an entity mention or event trigger and assigns it a type.

T1 Protein 0 7 RFLAT-1

T2 Positive_regulation 53 62 activates

All text-bound annotations follow the same structure. As in all annotations, the ID occurs first and is delimited from the rest of the line with a TAB character. The primary annotation is given as a SPACE-separated triple (type, start-offset, end-offset). The start-offset is the index of the first character of the annotated span in the text (".txt" file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span. Thus, the character in the end-offset position is not included in the annotated span. For reference, the text spanned by the annotation is included, separated by a TAB character.
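The offset convention described above (start inclusive, end exclusive) matches Python string slicing, so the reference text can be checked directly against the ".txt" content. The following is a minimal sketch; the names TextBound, parse_text_bound and span_matches are hypothetical, and the document string is the title line of the example text shown in this description.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    ann_type: str
    start: int  # index of the first character of the span
    end: int    # index of the first character after the span
    text: str   # reference copy of the spanned text


def parse_text_bound(line):
    """Parse 'ID<TAB>type start end<TAB>text' into a TextBound."""
    ann_id, triple, text = line.rstrip("\n").split("\t")
    ann_type, start, end = triple.split(" ")
    return TextBound(ann_id, ann_type, int(start), int(end), text)


def span_matches(ann, document):
    """Check that the offsets select exactly the reference text."""
    return document[ann.start:ann.end] == ann.text


document = ("RFLAT-1: a new zinc finger transcription factor that "
            "activates RANTES gene expression in T lymphocytes.")
ann = parse_text_bound("T2\tPositive_regulation 53 62\tactivates")
print(ann, span_matches(ann, document))
```

A check of this kind is a cheap way to catch offset errors caused by newline or encoding changes to the text file.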

Annotation ID conventions

All annotation IDs consist of a single upper-case character identifying the annotation type and a number. The initial ID characters relate to annotation types as follows:

    • T : text-bound annotation (entity / event trigger)

    • E : event

    • M : event modification

    • R : relation

Additionally, an asterisk ("*") can be used as a placeholder for an ID in special cases in the gold data, but should not be used in system output.
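The ID conventions above can be turned into a small classifier. This is a sketch only: the names ID_KINDS and id_kind are hypothetical, and it assumes that the number part consists of one or more decimal digits.

```python
import re

# Initial ID character -> annotation type, as listed above.
ID_KINDS = {
    "T": "text-bound annotation",
    "E": "event",
    "M": "event modification",
    "R": "relation",
}
_ID_PATTERN = re.compile(r"([TEMR])([0-9]+)")


def id_kind(ann_id):
    """Return the annotation type implied by an ID, or 'placeholder' for '*'."""
    if ann_id == "*":
        return "placeholder"
    match = _ID_PATTERN.fullmatch(ann_id)
    if match is None:
        raise ValueError(f"not a valid annotation ID: {ann_id!r}")
    return ID_KINDS[match.group(1)]


for ann_id in ["T13", "E1", "M2", "R4", "*"]:
    print(ann_id, "->", id_kind(ann_id))
```

A system writing output could use the same check and additionally reject "placeholder", since "*" should not appear in system output.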

Main task file formats

These file formats are relevant to participation in any of the main tasks.

Text files (.txt)

These files contain text from the original documents.

RFLAT-1: a new zinc finger transcription factor that activates RANTES gene expression in T lymphocytes.

RANTES (Regulated upon Activation, Normal T cell Expressed and Secreted) is a chemoattractant cytokine (chemokine) important in the generation of inflammatory infiltrate and human immunodeficiency virus entry into immune cells. RANTES is expressed late (3-5 days) after activation in T lymphocytes.

The texts are given as plain text files with ASCII characters and UNIX-style newline convention. The titles of documents and sections are separated from body text by a newline. However, sentence segmentation is not provided and abstract and section content text is given as a single long line without newlines.
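Since annotations refer to the text by character offset, it helps to read the files without any newline translation and to keep track of where the body text starts. A minimal sketch under those assumptions follows; the function names decode_document and split_title are hypothetical, and the sample bytes are a shortened stand-in for a real file, not actual task data.

```python
def decode_document(raw):
    """Decode the raw bytes of a .txt file (ASCII, UNIX newlines)."""
    return raw.decode("ascii")


def split_title(text):
    """Split a document into (title, body, body_start).

    body_start is the character offset of the body in the full text,
    so offsets from annotation files can still be related to it.
    """
    title, newline, body = text.partition("\n")
    return title, body, len(title) + len(newline)


raw = b"A short title.\nBody text given as a single long line."
text = decode_document(raw)
title, body, body_start = split_title(text)
print(repr(title), body_start, text[body_start:] == body)
```

Reading the file in binary mode (for example with open(path, "rb")) and decoding explicitly leaves every character at the position the offsets expect.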

Entity annotation files (.a1)

These files contain annotation for given entities found in the text. All entity annotations are given a unique ID and are defined by type (e.g. Protein or Chemical) and the span of characters containing the entity mention (represented as a "start end" offset pair).

T1 Protein 0 7 RFLAT-1

T2 Protein 63 69 RANTES

T3 Protein 105 111 RANTES

T4 Protein 113 176 Regulated upon Activation, Normal T cell Expressed and Secreted

[...]

Each line contains one text-bound annotation identifying the entity.

Note that the entity annotation .a1 files with human-created "gold standard" annotations will be provided to participants for both training and test data. Named entity recognition is thus not necessary for participation.

Event annotation files (.a2)

These files contain annotation for the events stated in the text and related information. All event annotations are given a unique ID and are defined by type (e.g. Binding or Localization), event trigger (the text stating the event) and arguments.

T13 Positive_regulation 53 62 activates

T14 Gene_expression 75 85 expression

T15 Gene_expression 343 352 expressed

T16 Phosphorylation 600 614 phosphorylated

[...]

E1 Positive_regulation:T13 Theme:E2 Cause:T1

E2 Gene_expression:T14 Theme:T2

E3 Gene_expression:T15 Theme:T5

E4 Phosphorylation:T16 Theme:T8

The event triggers, annotations marking the word or words stating each event, are text-bound annotations and their format is identical to that for entities. The IDs of triggers must not overlap with those of entities.

As for all annotations, the event ID occurs first, separated by a TAB character. The event trigger is specified as TYPE:ID and identifies the event type and its trigger through the ID. By convention, the event type is specified both in the trigger annotation and the event annotation. The event trigger is separated from the event arguments by SPACE. The event arguments are a SPACE-separated set of ROLE:ID pairs, where ROLE is one of the event- and task-specific argument roles (e.g. Theme, Cause, Site) and the ID identifies the entity or event filling that role. Note that several events can share the same trigger and that while the event trigger should be specified first, the event arguments can appear in any order.
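The event line structure described above can be parsed as follows. This is a minimal sketch with hypothetical names (parse_event and the dictionary keys). Arguments are kept as a list of (role, ID) pairs so that their order is preserved and nothing is lost should a role occur more than once; whether roles can repeat is not stated here.

```python
def parse_event(line):
    """Parse 'ID<TAB>TYPE:TRIGGER ROLE:ID ROLE:ID ...' into a dictionary."""
    ann_id, body = line.rstrip("\n").split("\t")
    trigger, *role_fields = body.split()
    event_type, trigger_id = trigger.split(":")
    arguments = [tuple(field.split(":", 1)) for field in role_fields]
    return {
        "id": ann_id,
        "type": event_type,
        "trigger": trigger_id,
        "arguments": arguments,
    }


event = parse_event("E1\tPositive_regulation:T13 Theme:E2 Cause:T1")
print(event)
```

Argument IDs beginning with "E" refer to other events, so resolving them requires all event lines of the file to be read first.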

These event annotations are the primary extraction target in the main tasks. Participants will be provided with human-created gold standard event annotations for the training and development data, but will need to create both event trigger and event annotations for the test data.

Additional entity annotations

The event annotation files (.a2) for some main tasks contain annotation identifying additional entities that are relevant to events but not among the given core entities found in the .a1 files. These annotations identify, for example, the cellular component to which a protein is moved in a Localization event or the domain that is bound in a Binding event. The annotations are specified as text-bound annotations, that is, their format is identical to that for the entities in the .a1 files (see above).

These annotations are only provided for training and development data, not test data, and they are a target of extraction. Systems participating in (sub)tasks involving these entities will thus need to extract them and include them in the output.

Event modification annotations

The event annotation files (.a2) for some main tasks contain an additional class of annotation identifying when events are stated speculatively or in a negative context.

M1 Speculation E1

M2 Negation E2

Event modification annotations begin with an ID, separated by TAB from the modification type (Speculation or Negation), which is in turn separated by SPACE from the ID of the event the modification applies to.
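A minimal sketch of reading event modification lines, assuming only the two modification types named above (the names MODIFICATION_TYPES and parse_modification are hypothetical):

```python
MODIFICATION_TYPES = ("Speculation", "Negation")


def parse_modification(line):
    """Parse 'ID<TAB>TYPE EVENT_ID' into (id, type, event_id)."""
    ann_id, body = line.rstrip("\n").split("\t")
    mod_type, event_id = body.split(" ")
    if mod_type not in MODIFICATION_TYPES:
        raise ValueError(f"unknown modification type: {mod_type!r}")
    return ann_id, mod_type, event_id


lines = ["M1\tSpeculation E1", "M2\tNegation E2"]
negated = {event_id for _, mod_type, event_id in map(parse_modification, lines)
           if mod_type == "Negation"}
print(negated)
```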

Entity equivalence annotations

The event annotation files (.a2) contain an additional class of annotation identifying equivalence stated through simple local abbreviations and other aliasing between given entities, such as between interleukin-2 and IL-2 in the text "interleukin-2 (IL-2)".

* Equiv T3 T4

Equiv annotations are given a placeholder "*" in place of an ID, separated by TAB. The primary annotation consists of the relation type ("Equiv") and a set of two or more IDs separated by SPACE. These annotations specify that the listed IDs are mutually interchangeable, so that any other annotation (e.g. an event) referencing such an ID would be interpreted identically if this ID were replaced with any other in the set.
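The interchangeability of IDs can be sketched with sets. The names parse_equiv and equivalents are hypothetical, and the code assumes that each ID appears in at most one Equiv line, so no merging of overlapping sets is attempted.

```python
def parse_equiv(line):
    """Parse '*<TAB>Equiv ID ID ...' into a frozenset of IDs."""
    placeholder, body = line.rstrip("\n").split("\t")
    relation_type, *ids = body.split(" ")
    if placeholder != "*" or relation_type != "Equiv" or len(ids) < 2:
        raise ValueError(f"not an Equiv annotation: {line!r}")
    return frozenset(ids)


def equivalents(ann_id, equiv_sets):
    """Return every ID interchangeable with ann_id, including itself."""
    for group in equiv_sets:
        if ann_id in group:
            return group
    return frozenset([ann_id])


equiv_sets = [parse_equiv("*\tEquiv T3 T4")]
print(sorted(equivalents("T3", equiv_sets)))
print(sorted(equivalents("T1", equiv_sets)))
```

With such a lookup, an event argument referring to T3 can be compared against one referring to T4 without treating them as different.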

Note that Equiv annotations will not be provided for test data. They are nevertheless not extraction targets in the task, and participating systems should not output Equiv annotations.

Supporting task file formats

These file formats are relevant to participation in the supporting tasks.

Text files (.txt) and entity annotation files (.a1)

These file formats are identical to those in the main tasks (see definitions above).

Relation annotation files (.rel)

These files contain annotation for directed pairwise relations stated in the text, specified through the type of relation and the two entities.

R1 Subunit-Complex Arg1:T11 Arg2:T32

R2 Subunit-Complex Arg1:T10 Arg2:T32

R3 Protein-Component Arg1:T22 Arg2:T34

R4 Protein-Component Arg1:T22 Arg2:T36

The format is similar to that used for events in the main tasks, with the exception that the annotation does not identify a specific piece of text expressing the relation ("text binding").
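A minimal sketch of reading relation lines follows. The name parse_relation is hypothetical. The code requires exactly the roles Arg1 and Arg2 but accepts them in either order; the latter is an assumption, as the required order is not stated here.

```python
def parse_relation(line):
    """Parse 'ID<TAB>TYPE Arg1:ID Arg2:ID' into (id, type, arg1, arg2)."""
    ann_id, body = line.rstrip("\n").split("\t")
    relation_type, *role_fields = body.split(" ")
    roles = dict(field.split(":", 1) for field in role_fields)
    if len(role_fields) != 2 or set(roles) != {"Arg1", "Arg2"}:
        raise ValueError(f"expected Arg1 and Arg2: {line!r}")
    return ann_id, relation_type, roles["Arg1"], roles["Arg2"]


print(parse_relation("R1\tSubunit-Complex Arg1:T11 Arg2:T32"))
```

Because the relations are directed, the two arguments are returned in role order rather than in the order they appear on the line.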

Additional entity annotations and entity equivalence annotations

Relation annotation files can also contain additional entity annotations and entity equivalence (Equiv) annotations, both in a format identical to the corresponding format in the main tasks (see definitions above).

File naming conventions

All files in the shared task follow the same naming convention, with the suffixes identifying the file format (see above) and the base name the text source the file relates to, as follows:

ID_SYSTEM - ID_NUMBER - SECTION_SPECIFICATION - SUBSECTION_NUMBER

Where

    • ID_SYSTEM identifies the system from which IDs are drawn, e.g. "PMID" for files for which the original source is PubMed or "PMC" for files for which the source is PubMed Central.

    • ID_NUMBER is the ID number within the ID system, e.g. "1234567" for a file with PubMed/PMC ID 1234567.

    • SECTION_SPECIFICATION identifies the top-level section of the document that the file relates to, consisting of

      • SECTION_NUMBER a running two-digit section number, "01" for the first top-level section etc. By convention, files relating to the title and abstract are given the number "00".

      • SECTION_TITLE the title of the section, as in the original document except with space replaced by underscore, e.g. "Materials_and_Methods". By convention, files relating to the title and abstract are given the title "TIAB".

    • SUBSECTION_NUMBER a running two-digit subsection number ("01" for the first subsection etc.) in the top-level section that the file relates to. The number is incremented by one for each subsection, sub-subsection or similar, thus "flattening" sub-subsection or further structure. For top-level sections with no subsections this string is empty. For text before the first subsection in top-level sections with subsections, the number is "00".

If the document has no sections, both SECTION_SPECIFICATION and SUBSECTION_NUMBER are empty.

Thus, for example,

    • PMID-123456: entire content (i.e. title and abstract) of the PubMed document with PMID 123456

    • PMC-1234567-00-TIAB: title and abstract of PubMed Central document with PMC ID 1234567

    • PMC-1234567-01-Introduction: the 1st top-level section, "Introduction", of PubMed Central document with PMC ID 1234567. The section has no subsections.

    • PMC-1234567-04-Results-07: 7th sequential subsection (or sub-subsection etc.) of the 4th top-level section, "Results", of PubMed Central document with PMC ID 1234567.

    • PMC-1234567-04-Results-00: text before first subsection of the 4th top-level section, "Results", of PubMed Central document with PMC ID 1234567.

Note that in cases where a top-level section has no text before the first subsection, files with SUBSECTION_NUMBER "00" would have no text content to refer to and are thus not included.
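The naming convention can be sketched as a regular expression over the base name. This is an illustration rather than an official parser: the names BASE_NAME_PATTERN and parse_base_name are hypothetical, and the pattern assumes that components are joined by hyphens without spaces, as in the examples above. A section title that itself ends in a hyphen followed by two digits would be misread as carrying a subsection number.

```python
import re

BASE_NAME_PATTERN = re.compile(
    r"(?P<id_system>[A-Za-z]+)-(?P<id_number>[0-9]+)"
    r"(?:-(?P<section_number>[0-9]{2})-(?P<section_title>.+?)"
    r"(?:-(?P<subsection_number>[0-9]{2}))?)?"
)


def parse_base_name(base_name):
    """Split a base name into its components; absent parts are None."""
    match = BASE_NAME_PATTERN.fullmatch(base_name)
    if match is None:
        raise ValueError(f"unrecognized base name: {base_name!r}")
    return match.groupdict()


for name in ["PMID-123456", "PMC-1234567-00-TIAB",
             "PMC-1234567-01-Introduction", "PMC-1234567-04-Results-07"]:
    print(name, parse_base_name(name))
```

Returning None for absent components keeps the distinction between a section without subsections (no subsection number) and the text before the first subsection (number "00").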