The BioNLP Shared Task 2011 data uses standoff formats similar to the BioNLP'09 Shared Task format. In the standoff representation, the texts of the documents are kept separate from the annotations, which are connected to specific spans of text through character offsets. Annotations are associated with their texts by a file naming convention: the base name (the file name without its suffix) is the same. For example, the file PMID-1000.a1 contains annotations for the file PMID-1000.txt. The BioNLP'11 Shared Task file formats are identified by file name suffixes (".txt", ".a1", etc.) and are described in detail in the following.

General annotation structure

All annotation file formats follow the same basic structure: each line contains one annotation, and each annotation is given an ID that appears first on the line, separated from the rest of the annotation by a single TAB character. The rest of the structure varies by annotation type. Examples of annotation for an entity (T1), an event trigger (T2), an event (E1), an event modification (M1) and a relation (R1) are shown in the following.
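As an illustration of the ID-first layout, the Python sketch below splits each line at its first TAB. The sample lines are invented for this example (types, offsets, texts, and the `Example-Relation` / `Arg1` / `Arg2` names are assumptions, not task data), and `split_annotation` is a hypothetical helper name.

```python
# Hypothetical sample lines: the IDs mirror those named in the text, but the
# types, offsets, texts, and relation role names are invented for illustration.
SAMPLE = (
    "T1\tProtein 0 5\tBMP-6\n"
    "T2\tPhosphorylation 14 29\tphosphorylation\n"
    "E1\tPhosphorylation:T2 Theme:T1\n"
    "M1\tNegation E1\n"
    "R1\tExample-Relation Arg1:T1 Arg2:T3\n"
)


def split_annotation(line):
    """Split one annotation line into (ID, rest) at the first TAB only."""
    ann_id, rest = line.rstrip("\n").split("\t", 1)
    return ann_id, rest


# Map each ID to the remainder of its line, keeping file order.
annotations = dict(split_annotation(line) for line in SAMPLE.splitlines())
for ann_id, rest in annotations.items():
    print(ann_id, "->", rest)
```

Splitting only at the first TAB matters because text-bound annotations contain a second TAB before the reference text.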
Detailed descriptions of these annotations are given below.

Text-bound annotations

Text-bound annotations are an important category of annotation used in many of the file formats. A text-bound annotation identifies a specific span of text as an entity mention or event trigger and assigns it a type.
All text-bound annotations follow the same structure. As in all annotations, the ID occurs first and is delimited from the rest of the line with a TAB character. The primary annotation is given as a SPACE-separated triple (type, start-offset, end-offset). The start-offset is the index of the first character of the annotated span in the text (".txt" file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span. Thus, the character in the end-offset position is not included in the annotated span. For reference, the text spanned by the annotation is included, separated by a TAB character.

Annotation ID conventions

All annotation IDs consist of a single upper-case character identifying the annotation type and a number. The initial ID characters relate to annotation types as follows:
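Going by the example IDs named earlier (T1 and T2 for an entity and an event trigger, E1 for an event, M1 for an event modification, R1 for a relation), these conventions can be sketched in Python as below. The `ID_KINDS` table is inferred from those examples rather than copied from an official listing, and the names `TextBound`, `parse_text_bound`, and `check_span`, as well as the sample text, are assumptions made for this sketch.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    ann_type: str
    start: int  # index of the first character of the span
    end: int    # index of the first character after the span
    text: str   # reference copy of the spanned text


# ID prefixes inferred from the example IDs in the text (T1, E1, M1, R1).
ID_KINDS = {
    "T": "text-bound annotation",
    "E": "event",
    "M": "event modification",
    "R": "relation",
}


def parse_text_bound(line):
    """Parse 'ID<TAB>type start end<TAB>text' into a TextBound."""
    ann_id, triple, text = line.rstrip("\n").split("\t")
    ann_type, start, end = triple.split(" ")
    return TextBound(ann_id, ann_type, int(start), int(end), text)


def check_span(doc_text, ann):
    """True if the half-open slice [start, end) equals the reference text."""
    return doc_text[ann.start:ann.end] == ann.text


# Invented document text and annotation, for illustration only.
doc = "IL-2 gene expression requires NF-kappa B."
ann = parse_text_bound("T1\tProtein 0 4\tIL-2")
print(ID_KINDS[ann.ann_id[0]], ann, check_span(doc, ann))
```

Because the end-offset is exclusive, the pair maps directly onto a Python slice, which makes the reference text a convenient consistency check.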
Main task file formats

These file formats are relevant to participation in any of the main tasks.

Text files (.txt)

These files contain text from the original documents.
The texts are given as plain text files with ASCII characters and the UNIX-style newline convention. The titles of documents and sections are separated from body text by a newline. However, sentence segmentation is not provided, and abstract and section content text is given as a single long line without newlines.

Entity annotation files (.a1)

These files contain annotation for the given entities found in the text. All entity annotations are given a unique ID and are defined by type (e.g. Protein or Chemical) and the span of characters containing the entity mention (represented as a "start end" offset pair).
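Since a .a1 file is tied to its .txt file only through the shared base name, a loader might pair them as in the following sketch. The function names `group_by_base_name` and `load_document` are hypothetical, the file contents in the demo are invented, and reading with the `ascii` codec simply follows the statement that the texts are ASCII.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_base_name(directory):
    """Map each base name to the set of file suffixes found for it."""
    groups = defaultdict(set)
    for path in Path(directory).iterdir():
        if path.is_file():
            groups[path.stem].add(path.suffix)
    return dict(groups)


def load_document(directory, base_name):
    """Return (text, entity_lines) for the document with this base name."""
    directory = Path(directory)
    text = (directory / f"{base_name}.txt").read_text(encoding="ascii")
    a1 = (directory / f"{base_name}.a1").read_text(encoding="ascii")
    # One annotation per line; skip a trailing empty line if present.
    return text, [line for line in a1.splitlines() if line]


# Demo on a throwaway directory; the file contents are invented.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "PMID-1000.txt").write_text("IL-2 binds.\n", encoding="ascii")
    Path(tmp, "PMID-1000.a1").write_text("T1\tProtein 0 4\tIL-2\n", encoding="ascii")
    groups = group_by_base_name(tmp)
    text, entity_lines = load_document(tmp, "PMID-1000")

print(groups, repr(text), entity_lines)
```

The text is returned unmodified so that character offsets in the annotations remain valid indices into it.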
Each line contains one text-bound annotation identifying the entity. Note that the entity annotation .a1 files with human-created "gold standard" annotations will be provided to participants for both training and test data. Named entity recognition is thus not necessary for participation.

Event annotation files (.a2)

These files contain annotation for the events stated in the text and related information. All event annotations are given a unique ID and are defined by type (e.g. Binding or Localization), event trigger (the text stating the event) and arguments.
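As a rough sketch of how an event line might be read, the Python below assumes the layout detailed in the following paragraph: the event ID, a TAB, then a TYPE:ID trigger followed by SPACE-separated ROLE:ID arguments. The `Event` class, the `parse_event` helper, and the sample line are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    event_id: str
    event_type: str
    trigger_id: str
    arguments: List[Tuple[str, str]]  # (ROLE, ID) pairs in file order


def parse_event(line):
    """Parse 'ID<TAB>TYPE:TRIGGER ROLE:ID ROLE:ID ...' into an Event."""
    event_id, body = line.rstrip("\n").split("\t", 1)
    fields = body.split()
    # The first field is always the trigger; arguments may follow in any order.
    event_type, trigger_id = fields[0].split(":", 1)
    arguments = [tuple(field.split(":", 1)) for field in fields[1:]]
    return Event(event_id, event_type, trigger_id, arguments)


# Invented example line, for illustration only.
event = parse_event("E1\tBinding:T5 Theme:T1 Site:T7")
print(event)
```

Arguments are kept as a list of pairs rather than a dictionary, on the assumption that a role could occur more than once in a single event.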
The event triggers, annotations marking the word or words stating each event, are text-bound annotations, and their format is identical to that for entities. The IDs of triggers must not overlap with those of entities. As for all annotations, the event ID occurs first, separated by a TAB character. The event trigger is specified as TYPE:ID and identifies the event type and its trigger through the ID. By convention, the event type is specified both in the trigger annotation and the event annotation. The event trigger is separated from the event arguments by SPACE. The event arguments are a SPACE-separated set of ROLE:ID pairs, where ROLE is one of the event- and task-specific argument roles (e.g. Theme, Cause, Site) and the ID identifies the entity or event filling that role. Note that several events can share the same trigger and that, while the event trigger should be specified first, the event arguments can appear in any order.

These event annotations are the primary extraction target in the main tasks. Participants will be provided with human-created gold standard event annotations for the training and development data, but will need to create both event trigger and event annotations for the test data.

Additional entity annotations

The event annotation files (.a2) for some main tasks contain annotation identifying additional entities that are relevant to events but not among the given core entities found in the .a1 files. These annotations identify, for example, the cellular component to which a protein is moved in a Localization event or the domain that is bound in a Binding event. The annotations are specified as text-bound annotations, that is, their format is identical to that for the entities in the .a1 files (see above).

These annotations are only provided for training and development data, not test data, and they are a target of extraction. Systems participating in (sub)tasks involving these entities will thus need to extract them and include them in the output.

Event modification annotations

The event annotation files (.a2) for some main tasks contain an additional class of annotation identifying when events are stated speculatively or in a negative context.
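A modification line could be handled as in the sketch below, which assumes the layout given in the next paragraph (ID, TAB, modification type, SPACE, event ID). The helper name `parse_modification` and the sample lines are hypothetical.

```python
MODIFICATION_TYPES = ("Speculation", "Negation")


def parse_modification(line):
    """Parse 'ID<TAB>TYPE EVENT_ID' into (mod_id, mod_type, event_id)."""
    mod_id, body = line.rstrip("\n").split("\t", 1)
    mod_type, event_id = body.split(" ")
    if mod_type not in MODIFICATION_TYPES:
        raise ValueError(f"unexpected modification type: {mod_type}")
    return mod_id, mod_type, event_id


# Invented example lines, for illustration only.
lines = ["M1\tNegation E1", "M2\tSpeculation E3"]
negated = {
    event_id
    for _, mod_type, event_id in map(parse_modification, lines)
    if mod_type == "Negation"
}
print(negated)
```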
Event modification annotations begin with an ID, separated by TAB from the modification type (Speculation or Negation), which is in turn separated by SPACE from the ID of the event the modification applies to.

Entity equivalence annotations

The event annotation files (.a2) contain an additional class of annotation identifying equivalence stated through simple local abbreviations and other aliasing between given entities, such as between interleukin-2 and IL-2 in the text "interleukin-2 (IL-2)".
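One way to make use of such equivalences is to collect them into sets of interchangeable IDs, as sketched below. The sketch assumes the "*", TAB, "Equiv", then SPACE-separated IDs layout described in the next paragraph; the helper names and sample lines are invented, and merging overlapping Equiv lines into one class is a design choice of this example rather than something the format description prescribes.

```python
def parse_equiv(line):
    """Parse '*<TAB>Equiv ID ID ...' and return the list of equivalent IDs."""
    marker, body = line.rstrip("\n").split("\t", 1)
    fields = body.split(" ")
    if marker != "*" or fields[0] != "Equiv" or len(fields) < 3:
        raise ValueError(f"not an Equiv annotation: {line!r}")
    return fields[1:]


def equivalence_classes(lines):
    """Map every ID to the full set of IDs it is interchangeable with."""
    classes = {}
    for line in lines:
        ids = parse_equiv(line)
        merged = set(ids)
        # Pull in any class already recorded for one of these IDs.
        for ann_id in ids:
            merged |= classes.get(ann_id, set())
        for ann_id in merged:
            classes[ann_id] = merged
    return classes


# Invented example lines, for illustration only.
classes = equivalence_classes(["*\tEquiv T1 T2", "*\tEquiv T2 T3", "*\tEquiv T5 T6"])
print(sorted(classes["T1"]), sorted(classes["T5"]))
```

With such a mapping, an evaluator or post-processor can treat an event argument as matching if it refers to any member of the same class.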
Equiv annotations are given a placeholder "*" in place of an ID, separated by TAB. The primary annotation consists of the relation type ("Equiv") and a set of two or more IDs separated by SPACE. These annotations specify that the listed IDs are mutually interchangeable, so that any other annotation (e.g. an event) referencing such an ID would be interpreted identically if this ID were replaced with any other in the set. Note that although Equiv annotations will not be provided for test data, they are not extraction targets in the task, and participating systems should not output Equiv annotations.

Supporting task file formats

These file formats are relevant to participation in the supporting tasks.

Text files (.txt) and entity annotation files (.a1)

These file formats are identical to those in the main tasks (see definitions above).

Relation annotation files (.rel)

These files contain annotation for directed pairwise relations stated in the text, specified through the type of relation and the two entities.
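A relation line might be read along the lines of the sketch below. Note the assumptions: the exact layout is not spelled out here, so the sketch supposes an event-like form with the relation type followed by two ROLE:ID pairs, and the role names `Arg1` and `Arg2`, the type `Example-Relation`, and the helper `parse_relation` are all hypothetical.

```python
def parse_relation(line):
    """Parse an assumed 'ID<TAB>TYPE ROLE:ID ROLE:ID' relation line."""
    rel_id, body = line.rstrip("\n").split("\t", 1)
    rel_type, *pairs = body.split(" ")
    # Relations are pairwise, so exactly two distinct roles are expected.
    roles = dict(pair.split(":", 1) for pair in pairs)
    if len(pairs) != 2 or len(roles) != 2:
        raise ValueError(f"expected two distinct arguments: {line!r}")
    return rel_id, rel_type, roles


# Invented example line; type and role names are assumptions.
relation = parse_relation("R1\tExample-Relation Arg1:T1 Arg2:T2")
print(relation)
```

Keeping the roles named preserves the direction of the relation, which matters because the relations are directed.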
The format is similar to that applied for events in the main tasks, with the exception that the annotation does not identify a specific piece of text expressing the relation ("text binding").

Additional entity annotations and entity equivalence annotations

Relation annotation files can also contain additional entity annotations and entity equivalence (Equiv) annotations, both in a format identical to the corresponding format in the main tasks (see definitions above).

File naming conventions

All files in the shared task follow the same naming convention, with the suffixes identifying the file format (see above) and the base name identifying the text source the file relates to, as follows:

ID_SYSTEM - ID_NUMBER - SECTION_SPECIFICATION - SUBSECTION_NUMBER

Where
Thus, for example,