Curation Guidelines
TnCentral Curation Guidelines
For TnCentral, we are interested in curating the following features: Protein-coding genes (e.g., transposase genes, accessory genes, passenger genes), Mobile Elements (e.g., transposons, insertion sequences, integrons), Repeat Elements (e.g., IRL, IRR), and Recombination Sites (e.g., Res and attC). These features will all be documented in the enhanced GenBank files according to the guidelines below.
General Guidelines
- In some cases, the GenBank files contain additional features that we are not interested in capturing for TnCentral. Therefore, each feature that we do want to extract is marked with Capture = yes in its /note field.
- Multiple items within any annotation field should be separated by “||”
- Label (#1 in Figure 1) refers to the text entered in the Feature text box in the SnapGene feature editing interface. Name (#2 in Figure 1) is a sub-field within the /note field for some feature types. For a given feature, Name may or may not be the same as Label.
- Annotation of split features: If a feature is interrupted by insertion of another sequence, its 5’ and 3’ ends should be annotated separately. For example, if the merA gene is interrupted, two features (merA 5’-end and merA 3’-end) should be created. If the feature is an ORF, “N/A” should be entered in the /product field. The 5’-end and 3’-end features should be captured (i.e., fill in Capture = yes in the /note field) and saved in the SnapGene library. However, they should not be shown on the map (uncheck them in the feature list in the SnapGene file). For display on the map, a third feature should be created with the coordinates of the 5’ and 3’ ends shown as two disjoint regions (see the Split Feature option below). This split feature should not be captured (i.e., do not fill in Capture = yes in the /note field). Since it is not captured, it needs no annotation other than the Label.
- Feature Coordinates (#3 in Figure 1) are captured automatically by SnapGene when a new feature is defined by selecting a sequence region in the SnapGene sequence or map or when a feature from the SnapGene library is detected using Detect Common Features…. If a feature is interrupted by insertion of another sequence, its coordinates can be defined as two disjoint regions using the Split Feature option (#4 in Figure 1, see example in Figure 2).
Figure 1
Figure 2
- Sometimes, a mobile element may contain other mobile elements inserted within it (“internal” mobile elements). Features within these internal elements should be associated with the internal element. For example, Figure 3 shows that transposon Tn1546.2 contains insertion sequence IS1216E. The tnp gene of IS1216E should be named tnp (IS1216E), not tnp (Tn1546.2).
Figure 3
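The two disjoint regions of a split feature correspond to the join(...) location syntax used in GenBank feature tables. The following is a minimal sketch only, assuming 1-based inclusive coordinates on the forward strand; the function name split_location and the coordinates are hypothetical, and SnapGene's own export may format locations differently.

```python
def split_location(five_prime, three_prime):
    """Return a GenBank-style location for a feature split into two regions.

    Each argument is a (start, end) tuple of 1-based inclusive coordinates.
    """
    (s1, e1), (s2, e2) = five_prime, three_prime
    # The 5' region must end before the 3' region begins.
    if not (s1 <= e1 < s2 <= e2):
        raise ValueError("regions must be ordered and non-overlapping")
    return f"join({s1}..{e1},{s2}..{e2})"


# Hypothetical coordinates for a gene interrupted by an inserted sequence
print(split_location((100, 250), (1300, 1800)))
```

The check on ordering catches the common mistake of entering the 3’ end before the 5’ end.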
Mobile Element Annotation
In addition to capturing features of interest within a mobile element, the mobile element will be captured as a feature itself. This section covers annotation fields for mobile element features. Transposons should be oriented with the transposase gene running left to right. This orientation then defines which inverted repeat is the left one (IRL) and which is the right one (IRR).
Mobile Element Sequence Variants
There are often minor sequence variants of the same mobile element. If the sequence of the variant is >95% identical at the DNA level to the reference sequence for that element, it can be considered equivalent to an already annotated element and does not need a new name. For Insertion Sequences, the reference sequence is defined by ISfinder (https://www-is.biotoul.fr/index.php). For other elements, the reference sequence is defined by TnCentral. Variant sequences can be checked for their % identity by BLASTing against ISfinder (for Insertion Sequences) or against TnCentral (for other mobile element sequences). A separate SnapGene file should not be made for the sequence variant. Related transposons that have slightly different complements of passenger genes are sometimes named TnXXX.1, TnXXX.2, etc., but this notation is not used consistently.
Note: Ideally, the SnapGene library will contain the reference sequences for Insertion Sequences (i.e., the sequence in the SnapGene library will exactly match the sequence in ISfinder for that Insertion Sequence). However, this is difficult to implement systematically. Therefore, some Insertion Sequences may be represented temporarily in the SnapGene library with a sequence that doesn’t exactly match the reference. If a closer match to the reference sequence is found during subsequent annotations, the library copy should be updated with the more closely matching sequence. However, the TnCentral Accession Number should NOT be changed. In these cases, an updated separate SnapGene file should be made for the Insertion Sequence as well.
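The >95% identity rule can be illustrated with a small sketch. In practice the percent identity comes from the BLAST report; the code below only shows the arithmetic and the threshold test. It assumes the two sequences are already aligned to equal length, with "-" marking gaps, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns; gap columns count as mismatches."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


def is_equivalent_variant(identity_percent, threshold=95.0):
    """True when the variant is strictly more than `threshold` percent identical."""
    return identity_percent > threshold


# Hypothetical 40-column alignment with a single mismatch in the last column
reference = "ACGT" * 10
variant = "ACGT" * 9 + "ACGA"
identity = percent_identity(reference, variant)
print(identity, is_equivalent_variant(identity))
```

Note that a variant at exactly 95% does not pass, because the guideline states ">95%".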
Feature Label
The label for mobile element features will simply be the mobile element name (see below).
Feature Type (selected from SnapGene Pull-Down Menu)
mobile_element
Feature Graphic
Select the graphic with no arrowheads (Figure 4).
Figure 4
Feature Color
mobile element
Mobile Element Type and Name (/mobile_element_type field)
- Select type from pull-down menu (e.g., transposon, insertion sequence, integron).
- Enter the name of the mobile element (e.g., Tn21) in the text box.
- If one mobile element is interrupted by insertion of another mobile element, it will sometimes be named element1::element2 (e.g., IS1326::IS1353, read as “IS1326 interrupted by IS1353”). However, this convention is not always followed.
- Integrons should be named In_<parent TE name> (e.g., In_Tn7 for an integron found in Tn7) unless they are given a specific name in a paper about the transposon (e.g., In2 in Tn21, In4 in Tn1696).
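The naming conventions above can be sketched as two small helpers. This is illustrative only; the function names are hypothetical, and because the element1::element2 convention is not always followed, the parser simply reports no inserted element when the "::" separator is absent.

```python
def integron_name(parent_element, published_name=None):
    """Use the published integron name if one exists, otherwise In_<parent TE name>."""
    return published_name if published_name else f"In_{parent_element}"


def parse_interrupted(name):
    """Split 'element1::element2' into (interrupted, inserted).

    Returns (name, None) when the name does not use the '::' convention.
    """
    interrupted, separator, inserted = name.partition("::")
    return interrupted, (inserted if separator else None)


print(integron_name("Tn7"))                 # integron without a published name
print(integron_name("Tn21", "In2"))         # integron named in the literature
print(parse_interrupted("IS1326::IS1353"))  # IS1326 interrupted by IS1353
```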
Other Annotation (/note field)
The /note field can contain the following sub-fields. Sub-fields should be separated with semi-colons.
- Accession = Accession Number of the enhanced GenBank file for this mobile element. It is generally of the form MobileElementName-OriginalGenBankAccession, where the OriginalGenBankAccession is the GenBank Accession of the large DNA sequence (e.g., plasmid or chromosome) from which the mobile element was isolated (e.g., Tn21-AF071413).
- Family = Mobile element family (e.g., Family = Tn3). For Insertion Sequences (IS), this information can often be found in ISfinder.
- Group = Mobile element group (e.g., Group = Tn21). For Insertion Sequences, this can often be found in ISfinder
- Synonyms = Other names for the mobile element
- Partial = Enter “yes” if only a partial copy of the transposon is present
- Transposition = Enter “yes”, “no”, or “ND” (not determined) depending on whether the transposition ability of the transposon has been observed. The presence of 5 bp direct repeats just outside of the transposon sequence can be used as evidence of transposition. If the original GenBank file covers a larger region than just the transposon itself (e.g., a full plasmid sequence), then the presence of direct repeats can be checked.
- OtherInformation = Miscellaneous information. For Insertion Sequences, indicate the ISfinder accession number. For minor sequence variants, indicate the percent identity to the reference sequence and include the accession number of the reference sequence (e.g., OtherInformation = 99% identical to reference sequence for IS26 (ISfinder:X00011)). If no % identity is noted, the element is assumed to be 100% identical to the reference (although 100% identity can be noted if desired). Other miscellaneous information can also be included here.
- Hosts: the Hosts sub-field has several sub-sub-fields; these are separated from each other by the “|” symbol (see Examples to Copy and Paste). Organism and Molecular Source information should be taken from the Description Panel of the SnapGene file. Epidemiological information (e.g., Region, Country, OtherLocInfo, DateIdentified) can be taken from articles in the references. No Host information should be recorded for Insertion Sequences; ISfinder is considered the definitive resource for the host range of Insertion Sequences.
  - Organism = Genus, species, and strain of the organism in which the element was identified
  - Taxonomy = NCBI taxon ID of the organism in which the element was identified
  - BacGroup = Group (informal, not taxonomic) that the host bacterium belongs to (e.g., enterobacteria)
  - MolecularSource = Plasmid or chromosome on which the element was found
  - Region = Geographic region where the element was identified
  - Country = Country where the element was identified
  - OtherLocInfo = Other information about the location where identified
  - DateIdentified = Date the mobile element was identified
  - First = Enter “yes” if this is the first host in which the transposon was discovered
- Capture = Include Capture = yes for all features to be included in the database
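The separators described above (semi-colons between sub-fields, “|” between Hosts sub-sub-fields, “||” between multiple items) can be illustrated with a small parser. This is a sketch under stated assumptions, not the TnCentral tooling: the exact layout of the Hosts sub-field is defined in Examples to Copy and Paste, which is not reproduced here, so the "Hosts: Key = value | Key = value" layout, the function name parse_note, and the placeholder values in the example are all assumptions. Values that themselves contain semi-colons are not handled.

```python
import re

# Matches a single "|" that is not part of the "||" multi-item separator.
SINGLE_PIPE = re.compile(r"(?<!\|)\|(?!\|)")


def parse_note(note):
    """Parse a /note string into a dict (assumed layout, see lead-in)."""
    fields = {}
    for chunk in note.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        if chunk.startswith("Hosts:"):
            host = {}
            for part in SINGLE_PIPE.split(chunk[len("Hosts:"):]):
                if part.strip():
                    key, _, value = part.partition("=")
                    host[key.strip()] = value.strip()
            fields["Hosts"] = host
        else:
            key, _, value = chunk.partition("=")
            # Multiple items within one field are separated by "||".
            items = [item.strip() for item in value.split("||")]
            fields[key.strip()] = items if len(items) > 1 else items[0]
    return fields


# Placeholder values only; they are not taken from a real record.
example = (
    "Accession = Tn21-AF071413; Family = Tn3; Group = Tn21; "
    "Synonyms = nameA || nameB; "
    "Hosts: Organism = placeholder organism | MolecularSource = placeholder plasmid | First = yes; "
    "Capture = yes"
)
parsed = parse_note(example)
print(parsed["Hosts"]["MolecularSource"], parsed["Capture"])
```

A parser of this kind makes it easy to check that every feature intended for the database carries Capture = yes.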
Annotation of Internal Mobile Elements
Sometimes, a mobile element (“main” mobile element) may contain other mobile elements inserted within it (“internal” mobile elements). If an internal mobile element is already in the SnapGene library and is fully annotated, no further changes are needed. If the internal mobile element has not been previously annotated (i.e., does not have a gold star in the SnapGene library), then it should be annotated according to the guidelines for annotating mobile elements above. The host information should be the same as for the main mobile element. After annotation, regions spanned by internal mobile elements should be copied to their own SnapGene files (Figure 5). The Description Panel information of the main mobile element should be copied to the new file. Regardless of their orientation in the original transposon, in the separate files the elements should be oriented in the conventional way, with the transposase gene oriented left to right, the IRL on the left, and the IRR on the right. The Flip Sequence option in the View menu of SnapGene can be used to flip the orientation of the element.
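Flipping an element so that its transposase gene reads left to right amounts to taking the reverse complement of the sequence and remapping feature coordinates. The sketch below assumes that this is what Flip Sequence does, uses 1-based inclusive coordinates, and handles only unambiguous bases (A, C, G, T); the function names are hypothetical.

```python
# Translation table for complementing unambiguous DNA bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_interval(start, end, length):
    """Map a 1-based inclusive interval onto the flipped sequence of given length."""
    return length - end + 1, length - start + 1


print(reverse_complement("ATGCC"))
print(flip_interval(1, 3, 10))
```

After flipping, a feature that was on the reverse strand lies on the forward strand, so the inverted repeat that was at the right end becomes the IRL.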
Figure 5
Annotation of Protein Coding Genes
This section covers annotation of genes that code for proteins, including transposase genes, accessory genes (e.g., resolvases), and passenger genes (e.g., antibiotic resistance, heavy metal resistance, plant pathogen, and toxin/anti-toxin genes).
Feature Label
Gene name (see below) followed by the AssociatedElement (see below) in parentheses. For example: merA (Tn3).
Feature Color
Feature Type (selected from SnapGene Pull-Down Menu)
CDS
Gene Name (/gene field)
- Use the official gene symbol when possible. By convention, gene names start with a lowercase letter (e.g., merA).
- If a gene is interrupted by insertion of another sequence, the name should be of the form geneName_disrupted (e.g., merA_disrupted).
Protein Name (/product field)
- Usually the protein name will be the same as the gene name, but starting with an uppercase letter (e.g., MerA).
- For disrupted genes, enter N/A in the /product field.
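The label, gene name, and product conventions above can be sketched as simple string helpers. The function names are hypothetical, and the product helper only covers the usual case in which the protein name is the gene name with an initial capital; curators should still enter a different protein name where one applies.

```python
def feature_label(gene, associated_element):
    """Feature Label: gene name followed by the AssociatedElement in parentheses."""
    return f"{gene} ({associated_element})"


def disrupted_gene_name(gene):
    """Gene name for a gene interrupted by insertion of another sequence."""
    return f"{gene}_disrupted"


def default_product(gene, disrupted=False):
    """Usual /product value: capitalized gene name, or N/A for disrupted genes."""
    if disrupted:
        return "N/A"
    return gene[:1].upper() + gene[1:]


print(feature_label("merA", "Tn3"))
print(disrupted_gene_name("merA"), default_product("merA", disrupted=True))
print(default_product("merA"))
```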
Function (/function field)
- Description of function (e.g., GO terms, UniProt keywords, Antibiotic Resistance Ontology (ARO) terms or free text). Use defined vocabularies/ontologies for annotating function whenever possible
- If terms from defined vocabularies are used, include the term identifier in parentheses after the term (e.g., antibiotic inactivation (ARO:0001004); response to mercury ion (GO:0046689)).
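Because each vocabulary term carries its identifier in parentheses, the terms and identifiers can be pulled back out of a /function value with a regular expression. This is a sketch only: it assumes identifiers have the form PREFIX:digits (as in the ARO and GO examples above), and the function name extract_terms is hypothetical. Free-text descriptions without an identifier are simply not matched.

```python
import re

# A term (no parentheses or semi-colons) followed by "(PREFIX:digits)".
TERM_WITH_ID = re.compile(r"\s*([^();]+?)\s*\(([A-Za-z]+:\d+)\)")


def extract_terms(function_text):
    """Return (term, identifier) pairs found in a /function value."""
    return [
        (match.group(1).strip(), match.group(2))
        for match in TERM_WITH_ID.finditer(function_text)
    ]


text = "antibiotic inactivation (ARO:0001004); response to mercury ion (GO:0046689)"
for term, identifier in extract_terms(text):
    print(identifier, term)
```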
Other Annotation (/note field)
The /note field can contain the following sub-fields. Sub-fields should be separated with semi-colons. Not all sub-fields will be relevant for all genes.
- AssociatedElement = the name of the mobile element that the protein-coding gene is associated with. In most cases, this will be the main mobile element being curated. However, some mobile elements contain other mobile elements inside them. In these cases, some protein-coding genes might be associated with these internal mobile elements. The Feature Label (see above) is composed of the gene name and the contents of the AssociatedElement field in parentheses.
- LibraryName = the unique identifier for this sequence in the Custom Feature library. Usually, the LibraryName will be based on the first example of the sequence found, so it will have a form that resembles a Feature Label (e.g., merA (Tn21)), but that format is not mandatory.
- Class = Major classification of gene. Possible options: Transposase, Accessory Gene, Integron Integrase, Passenger Gene, Unclassified, Hypothetical
  - Subclass = Secondary classification of gene (see examples in section Protein Coding Gene Classification)
- Target = molecular target of the gene (e.g., type of antibiotics targeted by antibiotic resistance genes or metal targeted by heavy metal resistance genes; see examples in section Protein Coding Gene Classification)
- Chemistry = mechanism of action of transposase or resolvase genes (see examples in section Protein Coding Gene Classification)
- SequenceFamily = group or family to which gene belongs
- OtherInformation = miscellaneous information
- Capture = Include Capture = yes for all features to be included in the database
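The sub-fields above can be assembled into a /note string with a small helper that also checks Class against the options listed. This is a minimal sketch, not the TnCentral tooling: the function name build_gene_note is hypothetical, the "Key = value" layout joined by semi-colons follows the separator rule stated above, and the OtherInformation value in the example is a placeholder.

```python
ALLOWED_CLASSES = {
    "Transposase",
    "Accessory Gene",
    "Integron Integrase",
    "Passenger Gene",
    "Unclassified",
    "Hypothetical",
}


def build_gene_note(associated_element, library_name, gene_class, capture=True, **extra):
    """Assemble a /note value for a protein-coding gene feature."""
    if gene_class not in ALLOWED_CLASSES:
        raise ValueError(f"unknown Class: {gene_class!r}")
    fields = {
        "AssociatedElement": associated_element,
        "LibraryName": library_name,
        "Class": gene_class,
    }
    # Optional sub-fields such as Subclass, Target, Chemistry, SequenceFamily.
    fields.update(extra)
    if capture:
        fields["Capture"] = "yes"
    return "; ".join(f"{key} = {value}" for key, value in fields.items())


print(
    build_gene_note(
        "Tn21",
        "merA (Tn21)",
        "Passenger Gene",
        OtherInformation="placeholder text",
    )
)
```

Rejecting an unlisted Class at this point keeps misspelled values out of the database.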