
Gene cassettes and cassette arrays in mobile resistance integrons

Sally R. Partridge, Guy Tsafnat, Enrico Coiera, Jonathan R. Iredell
DOI: http://dx.doi.org/10.1111/j.1574-6976.2009.00175.x. Pages 757-784. First published online: 1 July 2009


Gene cassettes are small mobile elements, consisting of little more than a single gene and recombination site, which are captured by larger elements called integrons. Several cassettes may be inserted into the same integron forming a tandem array. The discovery of integrons in the chromosome of many species has led to the identification of thousands of gene cassettes, mostly of unknown function, while integrons associated with transposons and plasmids carry mainly antibiotic resistance genes and constitute an important means of spreading resistance. An updated compilation of gene cassettes found in sequences of such ‘mobile resistance integrons’ in GenBank was facilitated by a specially developed automated annotation system. At least 130 different (<98% identical) cassettes that carry known or predicted antibiotic resistance genes were identified, along with many cassettes of unknown function. We list exemplar GenBank accession numbers for each and address some nomenclature issues. Various modifications to cassettes, some of which may be useful in tracking cassette epidemiology, are also described. Despite potential biases in the GenBank dataset, preliminary analysis of cassette distribution suggests interesting differences between cassettes and may provide useful information to direct more systematic studies.

  • gene cassette
  • integron
  • antibiotic resistance
  • nomenclature
  • bioinformatics
  • computer-aided discovery


Antibiotic resistance in bacteria is a global problem and the genes conferring this resistance have received considerable attention. In Gram-negative bacteria particularly, many of these genes are associated with mobile genetic elements, enabling movement between different DNA molecules (e.g. a bacterial chromosome and a plasmid) and thus transfer between cells, including those of different genera. Genes conferring resistance to many different classes of antibiotics and to disinfectants are found in the form of particular ‘gene cassettes’ that collectively form an important gene pool. These cassettes can exist transiently in a free circular form (Collis & Hall, 1992) but do not include all of the functions necessary for their own movement and are usually associated with gene capture and expression elements called integrons (Stokes & Hall, 1989).

A gene cassette typically consists of little more than a single promoter-less gene and a recombination site. These recombination sites differ in length and sequence but share conserved regions at their ends and are generally imperfect inverted repeats predicted to form stem–loop structures. They were initially termed 59-base elements from a consensus of the first examples to be identified (Cameron et al., 1986) and this name was retained even when it became apparent that their lengths are quite variable (Hall et al., 1991). The term attC site (Hansson et al., 1997), consistent with terminology used in other site-specific recombination systems, has since been widely adopted and will be used in the remainder of this review.

An integron is generally defined by the presence of an intI gene, encoding an integrase (IntI) of the tyrosine recombinase family, and an attI recombination site. IntI-catalysed recombination between attI and/or attC sites results in insertion or excision of cassettes (Fig. 1). Several cassettes may be inserted in tandem in the same integron to create an array, with cassettes always inserted in the same orientation. Integrons were first identified as a result of their association with antibiotic resistance genes and mobile elements (Stokes & Hall, 1989), but it is now clear that they are found on the chromosome of many species and constitute an important and near-ubiquitous class of genetic elements (Mazel, 2006; Boucher et al., 2007).

Figure 1

Integration and excision of gene cassettes by site-specific recombination. IntI encoded by the intI gene in the integron catalyses recombination between the attI1 site (open box) of the integron and/or the attC site(s) of gene cassette(s) (black box) resulting in insertion or excision of a cassette. Horizontal arrows indicate the opposite orientations of intI and cassette-borne genes.

The amino acid sequences of IntI integrases have been used as a basis for dividing integrons into ‘classes’, with those carrying intI1 defined as ‘class 1’, intI2 as ‘class 2’, intI3 as ‘class 3’, etc. intI1, intI2 and intI3 were first identified in association with mobile genetic elements and intI4 and others with chromosomal integrons. However, intI1 (Stokes et al., 2006; Gillings et al., 2008) and intI3 (Xu et al., 2007) have recently been found in different contexts in environmental isolates and a ‘chromosomal’ intI appears to have been transferred to a Vibrio cholerae plasmid (Szekeres et al., 2007). Thus the same intI gene may be found as part of different structures, and using the class designation alone, which does not indicate the context of intI, may no longer be sufficient. Here, we use the term ‘mobile resistance integrons’ (MRI) to refer to those with intI1, intI2 or intI3 that carry mainly antibiotic resistance genes and are associated with mobile or potentially mobile elements and we largely restrict our analysis to cassettes found in these integrons.

Chromosomal integrons typically have long arrays with related attC sites that exhibit some species specificity (Rowe-Magnus et al., 2001). For example, the attC sites of the V. cholerae chromosomal integron were initially identified as V. cholerae repetitive DNA sequences (VCR) (Barker et al., 1994). In contrast, MRI generally contain few cassettes but the associated attC sites may be quite varied. Some attC sites found in cassettes carried by MRI are closely related to those in chromosomal integrons, for example ‘group 1’ (Recchia & Hall, 1997) or ‘classical’ attC sites are related to those found in Xanthomonas chromosomal integrons (Rowe-Magnus et al., 2001). Thus, although the mechanism by which cassettes are assembled remains unknown, this process may take place in different species with chromosomal integrons, which then act as reservoirs of cassettes that can be acquired by MRI (Rowe-Magnus et al., 2001). Completely unrelated genes may be associated with similar attC sites, probably reflecting acquisition from a common chromosomal integron or species (Rowe-Magnus et al., 2001). Closely related genes may also be associated with almost identical attC sites, suggesting divergence from a common ancestral cassette, or with different attC sites, suggesting different origins (Recchia & Hall, 1997).

Capture of cassettes and expression of cassette-borne genes

IntI-mediated site-specific recombination has important differences from reactions mediated by other tyrosine recombinases. A typical tyrosine recombinase ‘simple site’ is minimally comprised of a pair of highly conserved 9–13-bp inverted integrase binding sites separated by a 6–8-bp spacer region. Two such sites are recombined by sequential strand exchange events at the 5′ boundaries of the spacer and identity between the spacers of the recombination partners is generally required for activity (Grindley et al., 2006). In contrast, the most efficient reaction catalysed by an IntI integrase (Collis et al., 1993, 2001) appears to be that between two architecturally distinct sites, the cognate attI site and the attC site of a gene cassette, and only a single strand exchange occurs. Each attI site includes one simple site and two additional integrase binding sites (Collis et al., 1998; Gravel et al., 1998; Partridge et al., 2000), while the attC sites are more complex and variable.

attC site structure and the recombination reaction

An attC site contains two simple sites, each composed of a pair of conserved ‘core sites’ (7 or 8 bp), referred to as 1L and 2L, 2R and 1R (Stokes et al., 1997) or R″ and L″, L′ and R′ (Recchia & Sherratt, 2002). 1L and 2L are separated by a 7-bp spacer and 2R and 1R by a 7- or 8-bp spacer (Fig. 2a). The 1L/2L and 2R/1R pairs are separated by a central region that varies in length and sequence between different attC sites. 1L and 1R are usually reverse complements of one another and 2L and 2R are generally complementary except for an extra base present in 2L. The spacer regions are not complementary, but the central region usually forms an imperfect inverted repeat. The GTT of 1R (and complementary AAC of 1L) are completely conserved and the recombination site is between the G and first T of 1R (dotted line in Fig. 2a). In the linear integrated form of a cassette opened up at this position, the G of 1R defines the end of the cassette and the remainder of 1R is found at the start of the cassette, separated from the rest of the attC site by the cassette gene.
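The relationships between the outer core sites and the crossover point can be illustrated with a short Python sketch. It checks whether 1L is the reverse complement of 1R and opens a circular cassette between the G and the first T of 1R. The function names and the 30-bp toy sequence are hypothetical and are not taken from any real cassette; the toy omits the 2L and 2R core sites and assumes that the 1R sequence occurs only once.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def cores_are_paired(core_1l: str, core_1r: str) -> bool:
    """True when 1L is the exact reverse complement of 1R."""
    return reverse_complement(core_1l) == core_1r.upper()


def linearize_at_1r(circular: str, core_1r: str) -> str:
    """Open a circular cassette between the G and first T of its 1R core site."""
    seq = circular.upper()
    core = core_1r.upper()
    if not core.startswith("GTT"):
        raise ValueError("1R is expected to begin with the conserved GTT")
    # Search the doubled sequence so that a core site spanning the
    # arbitrary start of the circular string is still found.
    idx = (seq + seq).find(core)
    if idx == -1 or idx >= len(seq):
        raise ValueError("1R core site not found")
    cut = (idx + 1) % len(seq)  # cut immediately after the G of GTT
    return seq[cut:] + seq[:cut]


# Hypothetical toy cassette: gene, 1L, filler for the rest of attC, then 1R.
GENE = "ATGCCCTAA"
CORE_1L = "GCCTAAC"
CENTRAL = "CAGTCAG"
CORE_1R = "GTTAGGC"
circular = GENE + CORE_1L + CENTRAL + CORE_1R
linear = linearize_at_1r(circular, CORE_1R)
```

In the linear form, the remainder of 1R (TTAGGC in this toy) precedes the gene and the G of 1R is the final base, matching the description of the integrated cassette.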

Figure 2

attC site architecture. (a) Double-stranded circular form of the aadA7 cassette showing the sequence of the attC site. The direction of the aadA7 gene is indicated by a horizontal arrowhead and the start and stop codons on the top strand are in bold and underlined. Core sites are boxed and labeled, with the end points of the spacers indicated and extrahelical bases on the bottom strand are in bold. The position at which the cassette ‘opens up’ to give the linear form is shown by a vertical dotted line. (b) Folded bottom strand of the aadA7 attC site showing the extrahelical bases ‘flipped out’. The position of the recombination crossover is shown by a vertical arrowhead.

The current model for IntI1-catalysed site-specific recombination, summarized in Mazel (2006), involves the bottom strand of the attC site only (Francia et al., 1999), folded into a bulged hairpin structure (Fig. 2b) (Bouvier et al., 2005; MacDonald et al., 2006). The folded attC recombines with a double-stranded attI1 site or another folded attC site and a subsequent replication step is required to resolve the Holliday junction intermediate (Bouvier et al., 2005; MacDonald et al., 2006).

Conservation of the first three bases (GTT) of the 1R site and the complementary AAC of 1L is important for attC site activity, but the identity of the remaining bases in these sites is less critical (Stokes et al., 1997). Sequence complementarity in and flanking the core sites (Biskri et al., 2005) appears very important, as are two extrahelical bases that are ‘flipped out’ of the folded structure (Johansson et al., 2004; MacDonald et al., 2006) (Fig. 2b). One, labeled 39 in Johansson et al. (2004) and 20″ in Demarre et al. (2007), corresponds to the extra base in 2L compared with 2R. While this is G in the attC sites examined in detail, base substitution here does not appear to greatly affect integrase binding (Johansson et al., 2004). The second extrahelical base, at position 32 (or 12″), is needed for optimal integrase binding but its positioning may be less important than that of G39 (Johansson et al., 2004). The five bases of 2L and 2R furthest from the central spacer region appear to be important for function, but mismatches in the remainder of these sites seemed to have less effect (Johansson et al., 2004).

Expression of cassette-borne genes

Most cassettes include only a short region between the end of the 1R site and the predicted start codon of the cassette gene. A suitably spaced ribosome-binding site (RBS) can usually be identified in this region, but only a few cassettes, for example qac (Guerineau et al., 1990), cmlA (Bissonnette et al., 1991), and ereA (Biskri & Mazel, 2003), appear to carry a promoter. Expression of most cassette-borne genes thus relies on their location in an integron. In class 1 integrons, a promoter (Pc; formerly Pant) located within intI1 drives expression of cassette genes (Collis & Hall, 1995). In some cases, a second promoter (P2) is created by insertion of three G residues that increase the spacing between potential −35 and −10 sites to the optimum 17 bp (Collis & Hall, 1995). Several Pc variants have been identified and shown to result in different levels of expression (Lévesque et al., 1994; Bunny et al., 1995; Collis & Hall, 1995; Brízio et al., 2006; Papagiannitsis et al., 2008). A similar promoter has been demonstrated in a class 3 integron (Collis et al., 2002).
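The effect of the three inserted G residues on promoter spacing is simple arithmetic: if the spacing becomes 17 bp after insertion, it was 14 bp beforehand. The Python sketch below makes this explicit. The hexamers are the general Escherichia coli consensus −35 and −10 sequences used as placeholders, and the spacer is an invented filler; none of these are the actual P2 sequences.

```python
from typing import Optional


def spacer_length(seq: str, minus35: str, minus10: str) -> Optional[int]:
    """Number of bases between the end of the -35 box and the start of the -10 box."""
    start = seq.find(minus35)
    if start == -1:
        return None
    spacer_start = start + len(minus35)
    end = seq.find(minus10, spacer_start)
    if end == -1:
        return None
    return end - spacer_start


# Placeholder boxes and an invented 14-bp spacer (not real P2 sequence).
MINUS35, MINUS10 = "TTGACA", "TATAAT"
SPACER = "ACGTACGTACGTAC"

before = MINUS35 + SPACER + MINUS10
after = MINUS35 + SPACER[:7] + "GGG" + SPACER[7:] + MINUS10

spacing_before = spacer_length(before, MINUS35, MINUS10)
spacing_after = spacer_length(after, MINUS35, MINUS10)
```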

Analysis of transcripts originating from Pc suggests that the stem–loop structures formed by attC sites might be acting as transcription terminators, so that the position of a cassette in an array may be an important determinant of cassette gene expression (Collis & Hall, 1995). However, little is currently known about whether attC sites of different lengths and sequences have noticeably different effects on the expression of downstream genes. Recombination between attC and attI appears to be the preferred reaction, resulting in insertion of an incoming cassette at the start of an array, closest to Pc. Cassettes have also been observed to ‘move up’ to the first position in an array following exposure to the relevant antibiotic (e.g. Rowe-Magnus et al., 2002). Such rearrangement of cassettes in an array could potentially occur via a number of IntI-mediated processes or by homologous recombination (Hall & Collis, 1995).
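The positional logic described here can be sketched as a small Python data structure in which an array is an ordered list read outward from Pc. The class and method names are hypothetical, the pairing of cassette names in the usage below is arbitrary, and the model deliberately ignores the alternative attC × attC reactions and homologous recombination.

```python
from dataclasses import dataclass, field


@dataclass
class Integron:
    """Toy model: cassettes are listed in order of increasing distance from Pc."""

    cassettes: list = field(default_factory=list)

    def insert_at_attI(self, name: str) -> None:
        # attI x attC recombination places the incoming cassette first in the array.
        self.cassettes.insert(0, name)

    def excise(self, name: str) -> None:
        self.cassettes.remove(name)

    def position_from_pc(self, name: str) -> int:
        # 1 means closest to Pc.
        return self.cassettes.index(name) + 1

    def move_to_first(self, name: str) -> None:
        # One possible route for a cassette to "move up": excision, then reinsertion at attI.
        self.excise(name)
        self.insert_at_attI(name)


integron = Integron()
integron.insert_at_attI("aadA9")
integron.insert_at_attI("dfrA1")  # the newer cassette ends up closest to Pc
```

Calling `integron.move_to_first("aadA9")` then brings the older cassette back to the first position, which is the kind of rearrangement the text describes after antibiotic exposure.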

Structure of MRI

Integrons with intI1

The ancestor of mobile class 1 integrons may have been generated by acquisition of intI1 and attI1 by a transposon of the Tn5053 family (Stokes et al., 2006; Gillings et al., 2008) to give a structure related to Tn402 (also called Tn5090; Radstrom et al., 1994). Tn402 contains a complete tni transposition region (Fig. 3a), making it both a functional transposon and an integron, and is bounded by 25-bp inverted repeats (IRi, integrase end; IRt, tni end). The specific sequence from IRi to the start of the first cassette is referred to as the 5′-conserved segment (5′-CS).

Figure 3

Structures of MRI. Filled vertical bars represent inverted repeats, attI sites are shown as small open boxes (not to scale) and Pc promoters are indicated. Selected genes are shown by labeled arrows. Dotted lines joining arrowheads represent typical cassette array PCR products. (a) Tn402-like class 1 integron with the 5′-CS and a complete tni transposition region. (b) Typical class 1 integron structures with only part of the tni region and different extents of the 3′-CS. IS1353 may also be present at the position indicated and ISCR1 and one or more noncassette resistance genes may be inserted at the position indicated (after nucleotide 1313 of the 3′-CS). (c) Class 2 integrons are part of the large transposon Tn7. The asterisk represents the internal stop codon usually present in intI2. The ybeA gene is found within a cassette with an incomplete attC site. (d) The first class 3 integron characterized, in which intI3 and cassettes are associated with a Tn5053 family transposon.

The most frequently identified type of class 1 integron retains the 5′-CS but includes part of a region referred to as the 3′-CS and only part of the Tn402 tni region. The 3′-CS is composed of several elements and different extents are present in different integrons. The start of the 3′-CS corresponds to the first 390 bp of qacE, the last cassette in Tn402, consistent with derivation from a Tn402-like transposon (Radstrom et al., 1994). The truncated qacEΔ1 gene overlaps with sul1 (encoding sulphonamide resistance), which may also have once been part of a cassette (Stokes & Hall, 1989). Two ORFs of unknown function (orf5 and orf6) are found beyond sul1 in some integrons and are also considered part of the 3′-CS. By convention, the end of the 3′-CS in each particular class 1 integron is defined as its boundary with another identifiable region. Common adjacent structures (Fig. 3b) include insertion sequences (IS) such as IS1326 (with or without IS1353; ‘In5-like’) or IS6100 flanked by inverted repeats of the end of the tni region (‘In4-like’). In so-called ‘complex’ integrons, ISCR1 and associated resistance genes are found between partial duplications of the 3′-CS (Toleman et al., 2006).

The first few class 1 integrons to be identified were given integron (In) numbers intended to indicate the whole structure. For example, In2 carries the complete 5′-CS, the aadA1a gene cassette, 2025 bp of the 3′-CS followed by IS1326, IS1353 and 2678 bp of tni including IRt (Liebert et al., 1999). The conserved nature of the 5′-CS and that part of the 3′-CS immediately adjacent to the inserted cassette array enables amplification of the ‘variable region’ using a single primer pair (e.g. Lévesque & Roy, 1993; White et al., 2000) (Fig. 3b). Many cassette arrays in class 1 integrons identified in this way have been given integron numbers, even though nothing is known about the structures beyond the start of the 3′-CS. This is counterproductive, as cassettes may move independently, or entire arrays move between integrons with different structures (Partridge et al., 2002b). Even continuing to assign unique numbers to specific ‘complete’ integrons may not be helpful, as the epidemiology of class 1 integrons and that of gene cassettes appear to be somewhat independent.

Typical class 1 integrons lacking a complete tni region are defective transposon derivatives that are unable to move themselves, but if they retain both IRi and IRt they may still be transposed (e.g. Partridge et al., 2002a). This transposition is presumably mediated by Tni proteins encoded by related intact transposons present in the same cell and is likely to happen only rarely. However, as Tn402-like transposons target the resolution (res) sites (Minakhina et al., 1999) of plasmids (Kamali-Moghaddam & Sundstrom, 2000) and Tn21-like transposons (Liebert et al., 1999), inserted integrons may then move as part of larger structures.

Integrons with intI2

The intI2 gene was first described as part of the c. 14-kb transposon Tn7 (Fig. 3c). Tn7 includes the tns transposition region and is bounded by short segments, containing transposase-binding sites, called Tn7-L (c. 150 bp) and Tn7-R (c. 90 bp), which are necessary for transposition (Peters & Craig, 2001). Tn7 inserts at high frequency into a single specific site in bacterial chromosomes but is also able to transpose to many other sites at low frequency, with a marked preference for certain replicons on conjugative plasmids (Peters & Craig, 2001). Most examples of the intI2 gene described to date contain an internal stop codon that renders IntI2 inactive, but natural suppression of the stop codon in IntI2 or the action of other IntI in trans (Hansson et al., 2002) may allow occasional acquisition of new cassettes. In Tn7 itself, the cassette array appears to end with a truncated cassette known as orfX or ybeA (Fig. 3c) and primers in ybeA and the conserved intI2 region of Tn7 (e.g. White et al., 2001) have been used to amplify cassette arrays in class 2 integrons.

Integrons with intI3

In the first class 3 integron identified (Arakawa et al., 1995) intI3 is associated with a Tn5053-family transposon (Collis et al., 2002) but in the opposite orientation compared with intI1 in Tn402, with IRi found just beyond the end of the last cassette (Fig. 3d). intI3 genes are generally not detected in surveys for intI genes in clinical isolates (e.g. Yu et al., 2003).

Gene cassette nomenclature

Although the process by which gene cassettes are assembled remains unknown, it is possible that the same gene could become associated with different attC sites. However, this has generally not been observed and cassettes are traditionally given the same name as the gene they carry. Unfortunately, the naming of new genes/cassettes is not formally regulated and several authors have commented on the confusion caused by the same name being given to two different genes/cassettes (Vanhoof et al., 1998; White et al., 2000; Lee & Jeong, 2005). The same gene/cassette may also be given different names, the simplest examples being the use of Roman vs. Arabic numerals or an updated gene name that is not universally adopted (e.g. dhfrVII vs. dfrA7 for a dihydrofolate reductase; see Table 1).

Table 1

Gene cassettes carrying known antibiotic resistance and related genes

Name in FDB | Other names | Named variants | GenBank accession number | Start | End | attC | No.
Aminoglycoside (6′) acetyltransferases
aacA2aac(6)-Id orfBX12618.18961421721
Aminoglycoside (3) acetyltransferases
aacC7aacC-A7CP000282.12 333 6012 334 159751
Aminoglycoside (3″) adenylyltransferases (streptomycin/spectinomycin resistance)
aadA9AJ420072.126 76427 664608
aadA11aadA11bAM261282.123 61124 466601
Aminoglycoside (2″) adenylyltransferases
Aminoglycoside (3′) phosphotransferases
Class A β-lactamases
blaP1PSE-1,CARB-2P2 , PSE-4,5 CARB-3,8Z18955.1102114511139
blaP7CARB-7P9 (CARB-9)AF409092.181918791282
Class B metallo-β-lactamases
Class D β-lactamases
Chloramphenicol acetyltransferases
Chloramphenicol exporters
Dihydrofolate reductases (trimethoprim resistance)
dfrA1dhfrIb dfr1 dhfrIX00926.121679295162
dfrA5dhfrV dfrVX12868.1128718548713
dfrA7dhfrVII dfrVII dfrA17X58425.1573118913422
dfrA12dhfrXII dfr12Z21672.13028859052
dfrA16dhfrXVI dfr16AF174129.31333192010711
dfrA17dhfrXVII dfr17AF169041.114175613354
dfrA22dfr22 dfr23AJ968952.1237820902
dfrA29dfrVII dfrA7AM237806.159412091332
dfrB1dhfrIIa dfr2a dfrIIAY139601.198508576
dfrB2dhfrIIb dhfrJ01773.17071090576
dfrB3dhfrIIc dfr2cU67194.433 06133 468574
dfrB4dhfr2 dfr2dAJ429132.12409577
dfrB5dhfrB5 dfrIIe dhfr2eAY943084.127863196578
dfrB7Not annotatedDQ993182.164540721
Streptothricin acetyltransferases
Quaternary ammonium compound efflux
qacEU67194.433 78934 3751412
Small multidrug resistance proteins
smr2smr orfOAY260546.354555859607
ADP-ribosyl transferases (rifampicin resistance)
Erythromycin esterases
Lincomycin nucleotidyltransferases
Fosfomycin resistance
fosEorfI, orfiAY029772.121932654602
fosGORFV fosCAY907717.129733469782
fosHorf2, orf2aDQ342344.114061872584
Quinolone resistance
  • * Amino acid changes in variants of β-lactamase cassettes are listed in Tables S1–S8 and for ges in Table 9.

  • † The length of the attC site (this may differ slightly in variants).

  • ‡ The number of GenBank entries with a complete version of the gene cassette is given, including examples in all contexts (MRI, chromosomal integrons and secondary sites).

  • § Early examples of the aac(6′) group were designated aac(6′)-I or aac(6′)-II on the basis of resistance phenotype but this does not always reflect genetic relatedness and these genes fall into several clusters. Here, all cassettes of this type have been given an aacA number, corresponding to the letter in the aac(6′) name where possible. Low numbers that appear not to have been used for other genes have been used for cassettes first identified some time ago and numbers >26 have been used for more recently identified cassettes.

  • ¶ aacA1:gcuG contains two ORFs but only one attC site. The whole sequence was used as a feature and all partial examples identified here were due to the sequence in GenBank starting or ending within the cassette.

  • ∥ The only available versions of these cassettes have precise truncations in the attC site. In the case of aacA33, it is not possible to tell whether the 7 bp at the end correspond to the left or right spacer.

  • ** The number of GenBank entries with incomplete versions of the cassettes is given, as the complete versions have not yet been identified.

  • †† The cassette gene was annotated as aac(6′)-IIa in EU912537 but the protein is only 78% identical to AAC(6′)-IIa.

  • ‡‡ aacCA (Levings et al., 2005) or aacC-A (Elbourne & Hall, 2006) have also been used for this gene family.

  • §§ Part of the attC site is missing from the cassette sequence.

  • ¶¶ These cassettes have only been identified outside an MRI context to date. It is possible that other antibiotic resistance cassettes that are only found outside an MRI context were not identified here.

  • ∥∥ blaP2 is listed as a separate cassette in previous compilations, but is >98% identical to blaP1.

  • *** Only the sequence of the imp5 gene is available in GenBank accession number AF290912. The full cassette sequence is from Da Silva et al. (2002).

  • ††† The original sequence of the oxa1 cassette was found to contain an error (Boyd & Mulvey, 2006) and the correct sequence matches oxa30.

  • ‡‡‡ The cassette was annotated as encoding a ‘Cat-like protein’ that is 87% identical to CatB2.

  • §§§ dfrA13 was identified first, but has probable errors compared with the related dfrA12 cassette; hence, the dfrA21 sequence was used here.

  • ¶¶¶ This cassette was annotated as dhfrVII in AM237806 and called dfrA7 in O'Mahony et al. (2006) but is only 80% identical to the dfrA7 gene cassette.

  • ∥∥∥ This cassette was annotated as dhfrV in AM997279 but is only 93% identical to the dfrA5 cassette.

  • **** The gene was annotated as dfr6 in AB200915, but the cassette is only 90% identical to the dfrA6 cassette.

  • †††† Neither a cassette nor a gene was annotated in DQ993182. The protein is 85% identical to DfrB2.

  • ‡‡‡‡ The estX cassette (see Table 2) is sometimes incorrectly identified as sat (Partridge & Hall, 2005).

  • §§§§ This cassette is annotated as qacH in GenBank AF205943, but called qacI in the accompanying publication (Naas et al., 2001). qacI has been used here, as there is a distinct qacH gene in Staphylococcus.

  • ¶¶¶¶ Neither a cassette nor a gene was annotated in EF522838. The protein is 81% identical to QacE and was designated qacK, as qacJ had already been assigned to a distinct qac gene in Staphylococcus aureus.

In other cases, alternative nomenclature systems are well established. One scheme for genes encoding aminoglycoside modifying enzymes is based on the type (acetylation, aac; adenylylation, ant; phosphorylation, aph) and site (3, 3′, 6′, etc.) of modification and the resistance profile (I, II, etc.), with a, b, etc. used to distinguish unique proteins (Shaw et al., 1993). This system is becoming more difficult to apply, as new genes encoding potential aminoglycoside-modifying enzymes (especially those sequenced as part of large plasmids) are increasingly recognized and named on the basis of sequence similarity alone. An alternative scheme based on guidelines for naming plasmid genes (Novick et al., 1976) is often used for cassette-borne aminoglycoside resistance genes and has been used here. In this system, aacC equates with aac(3), aacA with aac(6′), aadA with ant(3″), aadB with ant(2″) and aphA with aph(3′), with 1, 2, 3, etc. distinguishing different genes/gene cassettes.
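The correspondence between the two naming schemes can be written as a lookup table. The Python sketch below is only an illustration: the function name is hypothetical, ASCII apostrophes stand in for the prime symbols, and the resistance-profile part of the Shaw et al. names (I, II, a, b, etc.) cannot be derived from the cassette number, so it is not attempted.

```python
import re
from typing import Optional, Tuple

# Prefix equivalences stated in the text (' stands for the prime symbol).
PLASMID_STYLE_TO_SITE_STYLE = {
    "aacC": "aac(3)",
    "aacA": "aac(6')",
    "aadA": "ant(3'')",
    "aadB": "ant(2'')",
    "aphA": "aph(3')",
}


def enzyme_family(cassette_name: str) -> Optional[Tuple[str, int]]:
    """Split a name such as 'aadA7' into its modification family and cassette number."""
    match = re.fullmatch(r"([a-z]{3}[A-Z])(\d+)[a-z]?", cassette_name)
    if match is None:
        return None
    family = PLASMID_STYLE_TO_SITE_STYLE.get(match.group(1))
    if family is None:
        return None  # not an aminoglycoside-modifying enzyme prefix
    return family, int(match.group(2))
```

For example, `enzyme_family("aadA7")` yields the ant(3″) family with cassette number 7, whereas a name such as dfrA1 returns `None` because it belongs to a different gene family.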

Each cassette may also have several minor variants and deciding when a variant cassette should be given a separate name is problematic. In some cases (e.g. β-lactamases) a single amino acid change can dramatically change the resistance phenotype and is considered sufficient for assigning a unique protein number (http://www.lahey.org/Studies/), but cassettes encoding the same protein can have silent nucleotide differences. For other gene families, often where the effects of minor sequence variations on resistance phenotype have not been investigated and/or are of little clinical importance, variants may not be given separate names.

Some gene cassettes identified in MRI include ORFs potentially encoding proteins of as yet unknown function. Those identified some time ago generally have well-established names (e.g. orfA and orfC), but these names are also likely to be used for other completely unrelated ORFs. More recently identified examples are often not annotated or are given generic names, frequently ‘orfX’ or ‘orf1’ (see Table 2). As well as creating confusion, some such cassettes may have functions other than to encode proteins (Holmes et al., 2003) and we believe that an alternative to the ‘orf’ designation is necessary. We propose the term gcu, for gene cassette of unknown function, for sequences found in MRI that have a credible attC site but for which a function cannot yet be assigned. We have retained the established letter designation for cassettes up to orfQ i.e. orfA becomes gcuA, orfC becomes gcuC, etc., with variants of orfD and orfE (originally mostly called ‘orfE-like’) designated gcuD1, gcuE1, gcuE2 etc. Other gcu were numbered in order of the date of the first GenBank submission.
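The lettered part of the proposed renaming is mechanical and can be sketched in Python as follows. The function name is hypothetical. Only the orfA to orfQ rule stated above is implemented; numbered gcu names depend on the date of the first GenBank submission, and the numbered variants of orfD and orfE need curation, so both are left unresolved here.

```python
import re
from typing import Optional


def orf_to_gcu(name: str) -> Optional[str]:
    """Map an established lettered ORF name (orfA to orfQ) to its gcu equivalent."""
    match = re.fullmatch(r"(?i:orf)([A-Q])", name)
    if match is None:
        # Generic names such as 'orfX' or 'orf1' need a number assigned by
        # submission date, which cannot be derived from the name itself.
        return None
    return "gcu" + match.group(1)
```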

Table 2

Gene cassettes in MRI not encoding known resistance genes

Name in FDB | Other names/annotations | GenBank accession number | Start | End | attC | No.
estXsat AB121039.16510167130
psp AB121039.1101716896014
lsp EU780012.1272133311111
gcuCorfC orfXAF455254.165011616035
gcuD1orf4; similar to orfDDQ278189.185403601
gcuE1 orf3600
gcuE2ORF1 orfE likeAJ487033.2689950605
gcuE3orfE likeAY139595.112481509601
gcuE4orfE likeAY139597.198360601
gcuE5orfE likeAJ564903.116 57916 321573
gcuE7Similar to orfE likeDQ522236.114221740592
gcuE8orfE likeEU434616.132058601
gcuF1Similar to orfDFJ207466.113171635601
gcuH orfHAF047479.233013752861
gcuI orfIAF047479.237534350771
gcuJ orfJAF047479.243514726741
gcuNorfN orfIAJ223604.137164404751
gcuOORFO ORFX ORFAJ251519.12467765
gcuPorfX orfXA orf9U90945.1199114547021
gcuQorfX′ orfY orfXB orf10U90945.1145310556921
gcu1Not annotatedAF318077.1984921171
gcu2orf2a, orf2bAY139592.112671606882
gcu4Not annotatedAJ536835.1103713071022
gcu5a UnknownAY220520.111631588912
gcu5b Not annotatedDQ520941.132953720911
gcu8a orf416, ORF1AJ704863.322 74222 299604
gcu8b Not annotatedAJ487033.29511393601
gcu9 ORF1, hypothetical proteinAJ784256.115632203762
gcu11Cassette without gene or orfAB195796.120292599714
gcu13ORFIV, ORFVI, orfviAY907717.111131424562
gcu16orf2, ORF IN682DQ278190.1324681042
gcu18Not annotatedAM237806.1655931322
gcu19GCN-5 acetyltransferaseAM237806.112101726722
gcu21Not annotatedDQ522237.119402403851
gcu22ORF1 and ORF2DQ533990.114442908721
gcu23ORF3 and ORF4DQ533990.129094323721
gcu27 Not annotatedEF614235.138285621601
gcu28Cassette, unknown functionEU165039.1151219891111
gcu29Hypothetical proteinEU284133.1204651601
gcu30 orf102DQ914960.2208645851
gcu31Not annotatedEU434611.1936574791
gcu33 JK0007EU591509.110 53211 161601
  • * The length of the attC site (this may differ slightly in variants).

  • † Number of GenBank entries containing the complete cassette.

  • ‡ The estX cassette is often mistakenly identified as sat, but the gene encodes a putative esterase (Partridge & Hall, 2005).

  • § The psp gene encodes a putative phosphoserine phosphatase.

  • ¶ The lsp gene encodes a putative lipoprotein signal peptidase.

  • ∥ The gcuE1 sequence is not available in GenBank and was obtained from Yano et al. (2001).

  • ** This cassette is 75% identical to the gcuF cassette and 74% identical to the gcuD cassette.

  • †† In AF047479 ORFs annotated as orfK, orfL and orfM follow the gcuJ cassette. The region after orfKL contains a potential 1L core site and could form a folded structure but the typical 2L/2R pairing is not evident. The region after orfM also contains a potential core site but an appropriately folded structure was not predicted. This region was included in the FDB as a noncassette insertion (designated KLM).

  • ‡‡ gcu5a and gcu5b are 96% identical, as are gcu8a and gcu8b.

  • §§ The ORF in gcu9 may encode a quinolinate synthetase.

  • ¶¶ The only available example of the gcu27 cassette is interrupted by ISUnCu1, positions 4199–5580 in EF614235.

  • ∥∥ The only example of gcu30 is inserted at a secondary recombination site in the 5′-CS.

  • *** The ORF in gcu33 encodes a predicted NADPH-dependent FMN reductase.

Automated annotation of cassettes and cassette arrays

Several compilations of gene cassettes have been published (e.g. Recchia & Hall, 1995b; Fluit & Schmitz, 1999, 2004; Rowe-Magnus & Mazel, 2002). However, wide use of PCR to amplify cassette arrays (particularly from class 1 integrons) has resulted in a huge proliferation of sequences in GenBank and no current summary of resistance gene cassettes is available. The task of analysing the available sequence data to produce such a compilation is daunting due to the large number of relevant sequences and is hampered by inconsistent nomenclature and poor annotation. Although methods for automated analysis of DNA sequences have been developed in recent years, the regions containing mobile resistance genes in Gram-negative bacteria present specific problems. In contrast to other systems, identifying and assigning a function to a newly sequenced resistance gene is often relatively simple, as these genes often encode proteins belonging to easily recognizable families. However, identification of the boundaries of potentially mobile regions (e.g. gene cassettes) and consistent annotation are both extremely important.

We have developed a novel bioinformatics tool to enable a detailed analysis of gene cassettes and cassette arrays in MRI, to be described in more detail elsewhere (G. Tsafnat et al., unpublished data). A database of defined sequence ‘features’ and automated blastn searches were used to identify and reannotate sequences containing gene cassettes. We then used similarities between DNA and natural languages (Baquero et al., 2004) to develop a context-sensitive grammar to define higher order genetic structures (cassette arrays) from annotations of ‘features’ (gene cassettes). While computational grammars have previously been used to decode genetic patterns at the letter-to-word (base pair to feature) level (e.g. Leung et al., 2001; Searls et al., 2002), our grammar operates at the word-to-sentence level.

The feature database

We compiled a feature database (FDB) of gene cassettes (‘features’) listed in previous reviews and those reported more recently. Cassettes were identified by nucleotide/protein similarity searches, by searches of GenBank and PubMed with the word ‘integron’ and iteratively during the automated annotation process. Closely related cassettes (>98% identical) were grouped, regardless of phenotype conferred, and an exemplar GenBank accession number chosen for each group. This was generally the first report, unless errors were apparent, or the most common sequence among minor variants. The span of the cassette (Tables 1 and 2) and the sequence were recorded in the FDB. A minimum percentage of base pair identity (usually 97%) required for part of a sequence to be considered a match was also recorded. Short sequences flanking cassette arrays in different MRI, i.e. the ends of the 5′-CS, 3′-CS and Tn402 tni of class 1 integrons, the intI2 region and the ybeA cassette of class 2 integrons and the int3 and IRi regions from the first class 3 integron, were also included as features.
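
As an illustration only, the grouping rule could be sketched in Python as follows. This is not the annotation system itself: the function names are hypothetical, sequences are assumed to be pre-aligned to equal length, and taking the first-listed member of each group as its exemplar is a simplifying assumption.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    same = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * same / len(a)


def group_cassettes(records, threshold=98.0):
    """Greedy grouping of (name, sequence) pairs listed in order of report.

    A sequence joins the first group whose exemplar it matches at more than
    `threshold` percent identity; otherwise it founds a new group.
    """
    groups = []  # each entry: (exemplar name, exemplar sequence, member names)
    for name, seq in records:
        for _, ex_seq, members in groups:
            if len(seq) == len(ex_seq) and percent_identity(seq, ex_seq) > threshold:
                members.append(name)
                break
        else:
            groups.append((name, seq, [name]))
    return {ex_name: members for ex_name, _, members in groups}
```

With real data the identity values would come from blastn alignments rather than from a position-by-position comparison.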

A single name was selected to represent each cassette sequence in the FDB, but alternative names are also listed in Tables 1 and 2. Where possible the name given in the original GenBank entry or publication was used. For cassettes that were not suitably annotated in the original GenBank entry, blastn or blastx searches were used to assign an appropriate gene family name and the next apparently available letter and/or number was used. For simplicity, cassettes for which the full name of the gene is of the form blaXYZ-1 are represented as xyz1 throughout.
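
The blaXYZ-1 to xyz1 shorthand can be expressed as a small helper. The function name is hypothetical, and names that do not fit the blaXYZ-1 pattern (e.g. blaP1 or aadA1a) are assumed to be left unchanged.

```python
import re


def short_cassette_name(gene: str) -> str:
    """Convert a name of the form blaXYZ-1 to xyz1; return other names unchanged."""
    match = re.fullmatch(r"bla([A-Za-z]+)-(\d+)", gene)
    if match is None:
        return gene
    return match.group(1).lower() + match.group(2)
```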

Annotation of cassette arrays in MRI

All features in the FDB were used in a blastn (Altschul et al., 1997) search against the complete nucleotide GenBank database (http://www.ncbi.nlm.nih.gov) (Benson et al., 2009). GenBank entries containing any segment that met the minimum identity match criteria for any feature in the FDB were collected. The species from which the sequence was obtained was acquired from the organism field and entries with the words ‘vector’, ‘synthetic construct’ or ‘artificial sequence’ were excluded, as were a few (e.g. DQ915939) containing long runs of Ns representing unsequenced regions. Sequences in the RefSeq collection (accession numbers of the form ‘NC_’, etc.) were also excluded, as each is derived from and has a sequence identical to an entry with a conventional accession number.
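
The exclusion criteria amount to a simple filter, sketched below under stated assumptions. The function name and arguments are hypothetical, the excluded words are assumed to be looked for in a free-text description of the entry, the cut-off for a 'long run' of Ns (100 here) is not given in the text, and RefSeq accessions are assumed to be recognizable by a two-letter prefix followed by an underscore.

```python
import re

EXCLUDED_WORDS = ("vector", "synthetic construct", "artificial sequence")
REFSEQ_ACCESSION = re.compile(r"^[A-Z]{2}_")  # e.g. NC_, NZ_ (assumed pattern)


def keep_entry(accession: str, description: str, sequence: str,
               n_run_cutoff: int = 100) -> bool:
    """Return True if a GenBank entry passes the exclusion criteria."""
    text = description.lower()
    if any(word in text for word in EXCLUDED_WORDS):
        return False
    if REFSEQ_ACCESSION.match(accession):
        return False
    # Reject entries containing a run of at least n_run_cutoff Ns.
    if "N" * n_run_cutoff in sequence.upper():
        return False
    return True
```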

Sequences were annotated with the most similar features from the FDB (as determined by the blast bit score) without reference to annotations in GenBank. Many sequences carry ‘partial’ features, created either because the end of the submitted sequence lies within the feature or because of an insertion, leading to gaps in annotations. Compiling a database from all sequences in the FDB and searching with the sequences of these gaps allowed partial features to be annotated; these were designated by # after the feature name. Allowing the system to register matches of <25 nt introduced an unacceptable number of false annotations, but many sequences of ‘cassette array PCR’ amplicons include <25 nt flanking sequence at one or both ends. The automated annotation process also missed a few short 5′-CS and/or 3′-CS that were >25 nt but contained a significant number of differences (possibly errors) from the standard sequences. In these cases, the sequence and, if possible, the corresponding paper were checked and annotations of short flanking regions were added manually as appropriate.
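
A minimal sketch of this selection step is given below. It is not the annotation system described here: the `Hit` record and `annotate` function are hypothetical, hits are assumed to have been parsed from blastn output already, and overlaps are resolved greedily in favour of the higher bit score, which is an assumption about how 'most similar' is applied.

```python
from dataclasses import dataclass

MIN_MATCH_NT = 25  # shorter matches gave too many false annotations


@dataclass
class Hit:
    feature: str
    start: int                 # 1-based, inclusive, on the GenBank sequence
    end: int
    bit_score: float
    covers_full_feature: bool  # False for a partial feature


def annotate(hits):
    """Keep the best-scoring non-overlapping hits; mark partial features with '#'."""
    chosen = []
    for hit in sorted(hits, key=lambda h: h.bit_score, reverse=True):
        if hit.end - hit.start + 1 < MIN_MATCH_NT:
            continue
        if any(hit.start <= c.end and c.start <= hit.end for c in chosen):
            continue  # region already annotated by a better-scoring feature
        chosen.append(hit)
    chosen.sort(key=lambda h: h.start)
    return [(h.feature if h.covers_full_feature else h.feature + "#", h.start, h.end)
            for h in chosen]
```

Short flanking regions below the 25 nt limit are dropped by this filter, which is why they had to be added manually.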

A context-sensitive grammar (Grune & Jacobs, 2007), consisting of 21 rules to identify cassette arrays flanked by end markers, was then applied to parse cassette arrays (in either orientation in sequences in GenBank) from annotated features. The parser also examined the context of annotation gaps, identifying those found within an array as potential cassettes that were missed when the FDB was first compiled. These were checked manually and, if appropriate, added to the FDB as gene cassettes. Other elements identified within cassette arrays were added to the FDB as ‘noncassette insertions’ and were registered by the grammar as features that do not ‘break’ a cassette array. These include group II introns, IS, short regions possibly corresponding to the beginning of a truncated cassette [designated ‘potential cassette starts’ (PCS)] and rare longer inserts of unknown origin.
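
The idea of parsing arrays from an ordered list of annotated features can be illustrated with a much simpler scan than the 21-rule grammar. The sketch below is an assumption-laden stand-in, not the grammar itself: the function names are hypothetical, only one start marker and a small set of end markers are considered, noncassette insertions are skipped rather than recorded, and any other feature or annotation gap is treated as breaking the array.

```python
def find_arrays(features, cassettes, insertions=frozenset(),
                start_marker="5'-CS", end_markers=("3'-CS", "tni")):
    """Return complete arrays (as |a|b| strings) found in one orientation."""
    arrays = []
    current = None  # None while outside an array
    for name in features:
        if name == start_marker:
            current = []
        elif current is not None and name in end_markers:
            if current:
                arrays.append("|" + "|".join(current) + "|")
            current = None
        elif current is not None and name in cassettes:
            current.append(name)
        elif current is not None and name in insertions:
            continue  # a noncassette insertion does not break the array
        else:
            current = None  # unknown feature or gap breaks the array
    return arrays


def find_arrays_both(features, cassettes, insertions=frozenset()):
    """Scan the feature list as given and in the reverse orientation."""
    forward = find_arrays(features, cassettes, insertions)
    reverse = find_arrays(features[::-1], cassettes, insertions)
    return forward + reverse
```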

Data were presented as a searchable list of all collected GenBank entries with the names and the spans of all annotated features indicated and any complete cassette arrays identified. The number of complete and partial versions of each cassette and the composition and number of each cassette array found in the different MRI were automatically compiled. In the following sections, we give a summary of gene cassettes found in sequences of MRI lodged in GenBank. We also list common modifications of cassettes and describe selected illustrative cassettes in more detail. The aim is to provide a reference tool, including extensive tables that will be periodically updated online (http://www2.chi.unsw.edu.au/genecassettes).

Gene cassettes and cassette arrays

Gene cassettes

All resistance gene cassettes identified here and used as features in the FDB are listed in Table 1, grouped by resistance conferred and gene type, with gcu in Table 2. If <98% identity is used as the cut-off for defining a new cassette, 132 different cassettes carrying known antibiotic resistance genes (or homologues assumed to confer similar resistance phenotypes) and 62 different gcu were found in MRI. If the same criteria are used, this compares with 40 cassettes (35 resistance gene cassettes+5 gcu) compiled in 1995 (Recchia & Hall, 1995b), 53 (47+6) in 1999 (Fluit & Schmitz, 1999), 69 (63+6) in 2002 (Rowe-Magnus & Mazel, 2002) with 7 (4+3) added in 2004 (Fluit & Schmitz, 2004) to give 76 (67+9), indicating that novel cassettes continue to be acquired by MRI.

While most cassettes identified since the last compilation belong to one of the known families, cassettes conferring resistance to fosfomycin (Yatsuyanagi et al., 2005) and lincomycin (Heir et al., 2004) and, most recently, presumed to confer quinolone resistance (Fonseca et al., 2008) have now been found. We identified several gcu that were not annotated in the original GenBank entries (Table 2) as well as two novel cassettes carrying putative antibiotic resistance genes, which we designated dfrB7 (EF522838). In several cases a resistance gene or cassette was annotated in the relevant GenBank entry, but the sequence was at most 93% identical to the exemplar with the same name (Table 1) and these were also designated as new cassettes.

attC sites

The sequences of all the attC sites of different cassettes identified here were compiled and their lengths are indicated in Tables 1 and 2, but presentation of a detailed analysis is beyond the scope of this review. Compilation of the attC sites from 39 cassette sequences available in 1997 suggested the consensus GTTAGSC/GYTCTAAC (top strand, completely conserved residues in bold) for 1R/1L (Stokes et al., 1997) but GTTRRRY/RYYYAAC is also commonly used. Analysis of 1R sequences identified here indicated that GTTAGGC, GTTAGCC and GTTAGAC dominated. While, as expected, the fourth position was most commonly A, or to a lesser extent G, C has now been seen at this position in one example (gcu13, GTTCTGT). The final nucleotide of 1R was A or G, rather than T or C, in a number of attC sites but few had mismatches between 1R and 1L.
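
Checking a candidate core site against a degenerate consensus such as GTTRRRY is a matter of expanding the IUPAC codes. The helper below is an illustrative sketch with a hypothetical function name; only the IUPAC codes listed in the table are supported.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}


def matches_consensus(site: str, consensus: str) -> bool:
    """Return True if `site` fits the IUPAC `consensus` over its full length."""
    pattern = "".join(IUPAC[code] for code in consensus.upper())
    return re.fullmatch(pattern, site.upper()) is not None
```

For example, `matches_consensus("GTTCTGT", "GTTRRRY")` is False, because the fourth position of the gcu13 1R site is C rather than A or G.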

Secondary structures of bottom strands were generated (http://mfold.bioinfo.rpi.edu/applications/hybrid/quikfold.php; Markham & Zuker, 2005, 2008) to help identify 2L and 2R and extrahelical bases. 2L and 2R sites were quite variable and difficult to identify conclusively for some attC sites. The extrahelical base in 2L was mostly G (c. 67%) or C (c. 25%) while A, and particularly T, were rarer. Where a second extrahelical base could be identified it was commonly T.

The shortest attC sites (55 nt) identified here were those of the aadA4 and oxa118 cassettes (two and three complete examples, respectively), while attC sites of the ‘classical’ 60 nt type (group 1 in Recchia & Hall, 1997) appeared most common. These ‘classical’ attC sites generally have G39 and T32 as extrahelical bases and several have been tested for activity: the aadB attC was reported to be highly active (Hall et al., 1991), and the aadA1a and aadA7 attC sites were used in mechanistic experiments (Francia et al., 1999; Johansson et al., 2004; Bouvier et al., 2005). The gcuD and gcuF attC sites are 60 nt and were classed as part of group 1, but have G31 rather than T32 as the second extrahelical base and the gcuD attC site appears to be active (Hall et al., 1991). The dfrB1-6 attC sites are only 57 nt and have an extrahelical G on the top strand opposite T32, as noted by Hall et al. (1991).

The bottom strands of some longer attC sites are predicted to have more complicated secondary structures and the second extrahelical base may be more difficult to identify. The longest attC sites (up to 141 bp, group 3 in Recchia & Hall, 1997) include those of the dfrA7, qacE and most imp cassettes. These attC sites appear to have a loop immediately after the 2L/2R complementary region (Recchia & Hall, 1997). The dfrA7 attC was found to be active in cointegration assays but its activity seemed lower than most others tested (Collis et al., 2001).

The typical attC in the V. cholerae chromosomal integron (VCR) has an additional extrahelical T at position 16″ (Demarre et al., 2007). Although cassettes with this type of attC can be integrated/excised by IntI1 (Rowe-Magnus et al., 2002) only two examples, blaP3 and qnrVC1, have been seen in class 1 integrons, with blaP7, catB9, dfrA6, dfrA31 and qnrVC2 only identified in V. cholerae chromosomal integrons to date. Some shorter attC sites, including those of the related oxa2, oxa21 and oxa53 cassettes (70 nt) also appear to have a third extrahelical base.

The sequences of a few other attC sites were also more difficult to fit to the expected pattern. For example, the veb1 attC is not closely related to that of any other cassette identified here and in this case altering spacer lengths appears to be necessary to give the best match between potential 2L and 2R sites.

Cassette arrays flanked by the 5′-CS and 3′-CS

Over 300 different complete cassette arrays flanked by the 5′-CS and 3′-CS were identified in GenBank, most of which were found only once (Fig. 4a). Most arrays had two or three gene cassettes (Fig. 4b), but this may partly reflect a bias against PCR amplification of longer arrays. The most frequently lodged arrays (Table 3) generally correspond to those commonly found in surveys (e.g. Yu et al., 2003; Machado et al., 2007). Some of the common arrays would yield cassette PCR products of similar size (Table 3), but these may be distinguishable by restriction digests.

Figure 4

(a) The frequencies of different cassette arrays flanked by the 5′-CS and 3′-CS in GenBank. (b) The number of cassettes in arrays flanked by the 5′-CS and 3′-CS.

Table 3

Common cassette arrays flanked by the 5′-CS and 3′-CS

Cassette array | Size (bp)*
  • * Size does not include any 5′-CS or 3′-CS sequence.

Most cassette array sequences in class 1 integrons with the 5′-CS and 3′-CS deposited in GenBank were from Escherichia coli (21%), Pseudomonas aeruginosa (19%), Salmonella spp. (14%), Acinetobacter baumannii, Klebsiella pneumoniae or V. cholerae (each c. 6%). Class 1 integrons carrying a few different cassette arrays have been reported in Gram-positive bacteria. These include Corynebacterium (Nesvera et al., 1998), Enterococcus (Clark et al., 1999) and, most recently, Staphylococcus spp. (e.g. Shi et al., 2006).

Cassette arrays flanked by the 5′-CS and a complete tni region

A recent publication lists several cassette arrays flanked by the 5′-CS and the tni region (Post et al., 2007). Three additional examples of the |aacA7|vim2|dfrB5|aacC5| array (AM749810-11, FM165436) and several additional arrays were identified here: |imp4|qacG|aacA4|aphA15| (AF288045.2), |dfrB1|aacA4|vim2| (AM993098), |aadB|qacI-ISKpn4a-qacI| (EF408254), |gcu33|qacG| (EU591509) and |vim2|gcu9| (FJ237530). The complete Tn402 tni region (4733 bp) is found in plasmids R751 (Tn402 itself; U67194) and pTB11 (AJ744860). In AY033653 the Tn402 tni region is truncated by another transposon and an almost complete version in AM993098 includes a short section only 85% identical to Tn402. EU591509 (Labbate et al., 2008) and AF288045 each include a tni region that is a hybrid of Tn402 and a related transposon, with the crossover in the vicinity of the res site at position 792. All other entries end after <700 bp of Tn402 tni sequence.

The apparent rarity of class 1 integrons with tni in place of the 3′-CS may reflect surveillance bias, but structures with the 3′-CS are likely to be at a selective advantage due to the presence of sul1 (resistance to sulphonamides). The potentially less-biased set of class 1 integrons from complete resistance plasmid sequences in GenBank includes only two with the Tn402 tni region (in R751 and pTB11), compared with at least 30 with part of the 3′-CS.

Cassette arrays with the 5′-CS but no 3′-CS or tni region

Surveys for cassette arrays usually identify some intI1-positive isolates from which no amplicon is obtained with standard primers in the 5′-CS and 3′-CS (e.g. Yu et al., 2003). One possible explanation is failure to amplify long arrays containing many cassettes or large insertion(s) under commonly used conditions. In other cases intI1 may be associated with the complete Tn402 tni region or may be found outside the Tn402 context (Stokes et al., 2006; Gillings et al., 2008). A ‘hybrid’ integron in which intI2 and the 3′-CS flank the cassette array has been reported (AJ289189) (Ploy et al., 2000) and the reciprocal structure with 5′-CS and the tns region may also occur, although no examples were identified by our methods. Recently, a region containing a transposase-like gene, commonly annotated as IS440, and the sul3 sulphonamide resistance gene has been found beyond arrays that include qacI as the last cassette (Table 4). All sequences of this structure currently in GenBank are from E. coli or Salmonella and the cassette arrays are related, suggesting an ancestral structure with subsequent recombination within cassettes. Long-range PCR pairing a primer in the 5′-CS with one in the IS440 region or sul3 might enable amplification of these arrays.

Table 4

Examination of other sequences in GenBank that contain the 5′-CS and a cassette array but lack the 3′-CS revealed several other possible explanations involving various IS. IS6100 is found beyond the 3′-CS in In4-like integrons (Partridge et al., 2001) and IS6100-mediated deletions into the cassette array may explain some structures. In other cases, the array is truncated by IS26 or IS1, both common components of multi-resistance regions and resistance plasmids. We have detected cassette arrays truncated by IS6100 using a primer in the 5′-CS paired with a reverse primer in this IS (unpublished data). Similar pairings with primers in IS1 or IS26 (both orientations) may also yield products for some isolates with intI1 from which a standard cassette array amplicon is not obtained.

Cassette arrays flanked by intI2 and ybeA

The most commonly identified array flanked by intI2 and ybeA was |dfrA1|sat2|aadA1a| (n=31), as seen in Tn7 itself, followed by |estX|sat2|aadA1a| (n=7), |sat2|aadA1a| (n=4), |dfrA1|sat2| (n=3) and |sat2|ereA1|aadA1a| (n=1). An additional array in A. baumannii with an unusual structure, |sat2|aadB|catB2#^|dfrA1|sat2|aadA1a| (DQ176450), where ^ represents the last 258 bp of the intI2 region, is proposed to have arisen by integrase-mediated intermolecular recombination (Ramirez et al., 2005). Where the context has been examined, these arrays generally appear to be associated with Tn7 tns genes, but some, for example a |dfrA1|sat2|aadB| array with IS911 inserted in sat2 (EU732664) (Gassama Sow et al., 2008), appear to be in a different context.

Arrays associated with intI2 were reported from several different species, most commonly E. coli (16/48) and Shigella (11/48), but these integrons are relatively unimportant clinically and are likely to remain so unless they acquire more varied cassettes by capture or recombination. However, the recent finding of two examples of an intI2 gene without the usual internal stop suggests that cassette arrays in class 2 integrons should continue to be monitored. One intact intI2 is associated with a dfrA14 cassette and a possible lipoprotein signal peptidase (lps) with no additional context information available (Márquez et al., 2008), the other is associated with several gcu and the Tn7 tns genes (Barlow & Gobius, 2006).

Cassette arrays associated with intI3

Only two different intI3-associated cassette arrays that include known resistance genes were detected in GenBank. The |imp1|aacA4| cassette array is found in D50438, AB070224 and AF416297, all of which were obtained from the same plasmid from S. marcescens (Arakawa et al., 1995; Collis et al., 2002). PCR with primers for intI3, imp1 and aacA4 detected similar integron structures in isolates of other species from Japan (Senda et al., 1996). The second array, |ges1|oxa10:aacA4|, was identified in Portugal (AY219651) (Correia et al., 2003). The sequence beyond intI3 is not available in this case but a rep gene related to that of plasmid RSF1010 was identified beyond the cassette array, rather than the region containing IRi of a Tn402-like transposon seen in the isolate from Japan.

Cassettes in secondary sites

A few cassettes found in MRI were also found in secondary integration sites (Recchia & Hall, 1995a). These included aadB in plasmid backbones (U14415, AF003958) and dfrA14 in the strA gene (e.g. AJ313522). The only example of the gcu30 cassette is inserted within the intI1 gene (DQ914960).

Detailed analysis of selected cassettes and arrays

A number of cassettes with modifications were identified here. Many of these modifications were overlooked in the original GenBank entry and/or publication, as sequence analysis and annotations are often limited to cassette-borne genes, rather than complete cassettes. These modifications may have implications for cassette movement or expression of cassette genes and/or be useful as epidemiological markers, as well as complicating nomenclature. In the sections below we provide instructions for identifying cassette boundaries and attC sites, and give examples of common types of cassette modifications.

Finding the boundaries of a cassette and correctly identifying the attC site

Identifying cassette boundaries and attC sites is useful for analysing array sequences (allowing separate searches with each component cassette), for annotation and for identifying modifications. For an array in a class 1 integron (see Fig. 5a) searching with aaaacaaagTT usually allows identification of the boundary between the end of the 5′-CS (the g) and the start of the first cassette (the first T, hereafter position 1). The gTT and the following four bases give the sequence of the 7 nt 1R core site. In most cases (e.g. the first cassette in Fig. 5a) searching for the reverse complement of 1R identifies 1L, corresponding to the first 7 nt of the attC site. A gap of 5 nt separates the final C of 1L from the start of the 8 nt 2L core site. In the case shown, taking the sequence of 2L, removing the fourth base (usually C) and searching with the reverse complement identifies the 7 nt 2R core site. A gap of 5 or 6 nt separates the end of 2R from the G that defines the end of the first cassette and of its attC site. The t following this G defines the start of the next cassette and the steps above can be repeated to identify the end and attC site of this and following cassettes. The start of the 3′-CS is defined by TTAGAT, but all qac cassettes also begin with this sequence.
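
The steps above can be sketched in Python for the simple case in which 1R/1L and 2L/2R are fully complementary. This is an illustration under stated assumptions rather than a validated tool: the function names are hypothetical, indices are 0-based, and the final G of the cassette is accepted only if it is followed by a T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def first_cassette(seq: str):
    """Locate the first cassette after the 5'-CS; returns 0-based indices or None."""
    seq = seq.upper()
    hit = seq.find("AAAACAAAGTT")
    if hit < 0:
        return None
    g = hit + 8                    # the g that ends the 5'-CS
    start = g + 1                  # position 1 of the first cassette
    one_r = seq[g:g + 7]           # gTT plus the following four bases
    one_l = seq.find(revcomp(one_r), start)
    if one_l < 0:
        return None                # 1R and 1L are not fully complementary
    two_l_start = one_l + 7 + 5    # 5 nt gap after the final C of 1L
    two_l = seq[two_l_start:two_l_start + 8]
    if len(two_l) < 8:
        return None
    query = revcomp(two_l[:3] + two_l[4:])  # 2L without its fourth base
    two_r = seq.find(query, two_l_start + 8)
    if two_r < 0:
        return None                # 2L and 2R are not fully complementary
    for gap in (5, 6):
        end = two_r + 7 + gap
        if seq[end:end + 2] == "GT":  # final G, then the t of the next unit
            return {"start": start, "end": end, "1R": one_r,
                    "1L": one_l, "2L": two_l_start, "2R": two_r}
    return None
```

A result of None signals that one of the expected pairings was not found, in which case the boundary has to be located by other means. Slicing the sequence from `end + 1` and prefixing the final G would allow the same search to be adapted for the next cassette.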

Figure 5

Cassette boundaries and attC site modifications. (a) Identifying cassette boundaries and attC sites. Different cassettes are shown in red and blue with the 5′-CS and 3′-CS in black and alternating upper and lower case letters are also used to emphasize boundaries between different regions. The start and stop codons of cassette genes are shown (bold and underlined) with most of the gene sequence omitted. Core sites are boxed and labeled, with the extra nucleotide in the 2L core site indicated. Vertical lines indicate recombination sites/cassette boundaries. The final G residue of each cassette is the first nucleotide of the 1R site of that cassette, but for simplicity is included as part of the 1R site of the following cassette. (b) L-spacer and R-spacer truncations illustrated by the vim2 attC site. (c) A-spacer truncation illustrated by the oxa10 attC site. (d) Insertions into attC sites illustrated by the aadA1a attC site. Vertical arrows on the top line show the insertion points of class C-attC GII introns and IS1111-attC elements. The position of the 4 nt intron-binding site (IBS1) is underlined here and also in the vim2 and oxa10 attC sites in parts (b) and (c). In the middle line, the intron is represented as an oval and the ends of the intron sequence are shown. In the bottom line the IS1111-attC element (ISUnCu1) is represented as an oval, and the ends of the IS are shown, with the subterminal inverted repeats underlined. The horizontal arrows indicate the direction of the IEP gene in the intron and the transposase gene in the IS.

In some cassettes (e.g. the second cassette in Fig. 5a) 1R and 1L are not completely complementary beyond GTT/AAC. Notable examples are the aadA1 (GTTAAAC/GTCTAAC), gcuE-like (GTTAGTC/GTCTAAC) and gcuF (GTTAGCA/TTCTAAC) cassettes. In such cases marking ORFs and searching for YAAC close to stop codons may allow identification of 1L. 1L sites ending in TAAC are most common and the TAA triplet often corresponds to the stop codon of the cassette gene. However, the ORF may also end before 1L (e.g. aadA1a), extend further into the attC site (e.g. aadA10) or even continue right through the attC site, ending within the 1R site of the next cassette (e.g. gcuD, gcuE, gcuF and related cassettes).

Cassettes (e.g. the second example in Fig. 5a) may also have mismatches between 2L and 2R. In these cases searching for Gttr sequences c. 40–130 nt beyond the end of 2L but before the start of the next ORF (up to c. 300 bases from position 1 in known cassettes) identifies possible boundaries with the next cassette. Counting back six or seven positions from the g of these gTTR motifs will generally identify the eighth nucleotide of potential 2R sites and the correct one should be almost complementary to 2L minus the fourth base.

Cassette boundaries in class 2 integrons can be identified in a similar way: TAATAAAATG is found adjacent to the start of the first cassette and TTAGAG defines the start of the ybeA cassette usually found at the end of arrays.

Cassettes with a truncated attC site

Short versions of some cassettes with precise deletions in attC have been identified. The first six bases of the 1L core site are still present, but the remainder of the attC site is replaced by the L or R spacer sequence (shown as -L or -R; Fig. 5b) or by the last seven bases of the attI1 site (‘attI1 spacer’, shown as -A; Fig. 5c), probably due to recombination between the incorrect pair of core sites (Partridge et al., 2000; Ramirez et al., 2008). Few studies have examined movement of such cassettes, but testing for the loss of resistance phenotype suggested that aadA10-A was not excised from |oxa10|aadB|aadA10-A| (Partridge et al., 2002b). PCR analysis of the |aacA4|aadA1a-A|oxa9| array indicated that oxa9 alone was not excised and aadA1a-A was excised rarely while aadA1a-A and oxa9 together were excised at higher levels (Ramirez et al., 2008). Such truncated cassettes may be more likely to travel with the next cassette in the array or, if located adjacent to the 3′-CS, may be unlikely to be excised from an integron. Loss of most of attC may also allow increased expression of downstream cassette gene(s) from the Pc promoter.

For a few cassettes (aacA33-, arr4-R, fosB-A) only a truncated version is currently available and was used in the FDB. For other cassettes of this type, the appropriate spacer was added as a manual annotation that appeared in the cassette array analysis. Examples of aadA1a-A, aadA1a-R, aadA2-L, aadA2-R, aadA5-R, aadA6-A, aadA10-A, aadA16-A, catB3-R, ges1-A, oxa10-A, oxa10-R, oxa13-A, vim2-L and vim2-R were identified here. In the case of oxa10, the truncated oxa10-A version was more common in GenBank than the complete cassette (n=19 vs. 7).

Group II introns targeting 1L of attC sites

Group II introns (Lambowitz & Zimmerly, 2004; Toro et al., 2007) belonging to bacterial class C are known to insert after potential Rho-independent terminators. Several have been identified within attC sites, inserted after the fifth base of 1L (Fig. 5d) with the gene for the intron-encoded protein (IEP) in the opposite orientation to cassette genes. Phylogenetic analysis of IEPs suggests that those targeting attC sites form a distinct clade, termed class C-attC GII introns (Quiroga et al., 2008). Several were identified here (Table 5) including some that have already been published (Centrón & Roy, 2002; Dai & Zimmerly, 2002; Sunde et al., 2005; Michael et al., 2008; Quiroga et al., 2008).

Table 5

Recent experiments (Quiroga et al., 2008) demonstrated that S.ma.I2 is able to insert into several different attC sites (aadA1, aacA1:gcuG, imp1, oxa10, sat2 and dfrA1). These attC all have putative intron-binding sites (IBS1, TTGT; IBS3, TAR) on the bottom strand, complementary to the proposed exon-binding site (EBS1, AACA; EBS3, A+N) in S.ma.I2 and also found in the introns listed in Table 5. The TTGT site overlaps 1L and the L-spacer (underlined in Fig. 5b–d) and is found in many attC sites. S.ma.I2 was not inserted into the aacA4 (GGGT) or gcuH (AGGT) attC sites (Quiroga et al., 2008). Insertion of S.ma.I2 required the secondary structure of the attC site in addition to these short RNA–DNA matches and a gene cassette with S.ma.I2 inserted in attC could still be excised (Quiroga et al., 2008).
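
The IBS1/EBS1 comparison can be made explicit with a short helper that counts positions failing to pair. This is only a sketch: the function name is hypothetical, both sites are compared position by position as written in the text, and only Watson–Crick pairs are counted.

```python
EBS1 = "AACA"  # proposed exon-binding site 1 in S.ma.I2, as written in the text
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}


def ibs1_mismatches(site: str) -> int:
    """Count positions of a 4 nt candidate IBS1 that do not pair with EBS1."""
    site = site.upper()
    if len(site) != len(EBS1):
        raise ValueError("candidate site must be 4 nt long")
    return sum(PAIR[e] != s for e, s in zip(EBS1, site))
```

By this count TTGT pairs at all four positions, whereas the aacA4 (GGGT) and gcuH (AGGT) sequences each fail to pair at two positions.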

A.g.I1 is found in several related arrays, all from P. aeruginosa apparently isolated in Korea. In one array, a partial vim2 cassette precedes A.g.I1 but the sequence after the intron matches the aadA1a attC site. This structure may result from recombination between introns inserted in different cassettes or intron-mediated deletions, but it is not clear whether A.g.I1 could insert into the vim2 attC site, which has ATGT at the IBS1 position (Fig. 5b). Other introns were also automatically annotated as flanked by one partial cassette and the attC site of a different cassette (Table 5), but in most cases (except |aadB-S.t.I1cmlA1|) the attC sites of the flanking cassettes are the same length and closely related, making it harder to exclude variations in attC site sequences as an explanation.

IS1111-attC IS targeting 2L of attC sites

IS of the IS1111-attC group of the IS1111-like family insert after the first nucleotide of the 2L core site of attC (Fig. 5d) with the transposase in the opposite orientation to cassette genes (Post & Hall, 2008; Tetu & Holmes, 2008). IS of the IS1111-like family target specific sequences and are also unusual in that their inverted repeats are located a few bases inside the IS boundaries (Fig. 5d) (Partridge & Hall, 2003), often leading to incorrect annotation of their ends. Several different IS1111-attC were identified here (Table 6) and more details are available in Tetu & Holmes (2008) and/or Post & Hall (2008). In two arrays, the regions flanking the IS were annotated as belonging to different cassettes but, as with some introns described above, the attC sites of the two cassettes were the same length and closely related.

Table 6

Insertion of an IS1111-attC element presumably reduces Pc-mediated transcription of downstream cassettes, but this may be overcome by the presence of an outward-facing promoter in the IS itself. ISPa21 has a suitably situated promoter (TTGGCC–17 bp–TTTCAT) (Poirel et al., 2005) and the same sequence is present in all ISPa21 variants and in ISUnCu1, while TTGGCC–17 bp–CTTCAT is found at the equivalent position in ISKpn4.
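
A search for this promoter motif in an IS sequence can be written as a regular expression. The sketch below uses a hypothetical function name, accepts either TTTCAT or CTTCAT as the second hexamer, and scans one strand only, so the reverse complement would need to be searched separately.

```python
import re

# TTGGCC, a 17 bp spacer, then TTTCAT or CTTCAT; the lookahead allows
# overlapping matches to be reported.
PROMOTER = re.compile(r"(?=(TTGGCC[ACGT]{17}[TC]TTCAT))")


def find_promoters(seq: str):
    """Return (0-based start, matched sequence) for each motif occurrence."""
    return [(m.start(), m.group(1)) for m in PROMOTER.finditer(seq.upper())]
```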

Hybrid gene cassettes

Some gene cassettes appear to be ‘hybrids’ formed by homologous recombination between two closely related cassettes. The vim cassette in DQ143913, identified as a hybrid but named vim12 (Pournaras et al., 2005), was automatically annotated as a partial vim1 cassette followed by a partial vim2 cassette. CARB-6 in AF030945 may be a blaP1/blaP7 hybrid and a cassette in AJ878850 could be a hybrid of a cassette designated aacA36 here and aacA4.

The majority of hybrids are those formed between the closely related (89% identical) aadA1a and aadA2 cassettes (Gestal et al., 2005). There are several known variants of each of these cassettes (Partridge et al., 2002a; Gestal et al., 2005) and a number of different recombination crossover regions. Previously unrecognized hybrids were therefore identified by adding ‘artificial hybrid’ sequences, consisting of one quarter aadA1, three-quarters aadA2 or half aadA1a and half aadA2 etc. to the FDB. aadA2/1 hybrids were most common (Supporting Information, Table S9), with 10 different crossover regions identified and are associated with a limited number of different arrays. Two different aadA1/2 hybrids and three examples of the same aadA2/1/2 hybrid were also identified.
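
Constructing an ‘artificial hybrid’ is straightforward once the two parent sequences are aligned. The following sketch is an illustration, not the procedure actually used: the function name is hypothetical and the parents are assumed to be aligned to the same length with no gaps, so that a single crossover coordinate applies to both.

```python
def artificial_hybrid(first: str, second: str, fraction: float) -> str:
    """Join the first `fraction` of `first` to the remainder of `second`."""
    if len(first) != len(second):
        raise ValueError("parent sequences must be aligned to the same length")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    crossover = round(len(first) * fraction)
    return first[:crossover] + second[crossover:]
```

Calling the function with fractions of 0.25, 0.5 and 0.75, and with the parents in either order, would give a small panel of hybrids of the kind described above.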

Hybrid cassettes raise nomenclature issues, as some have been given distinct names (e.g. aadA3, aadA8, aadA21, vim12), but identifying and annotating them as hybrids may be more useful. In combination with context data, such information may reveal more about the role of homologous recombination in movement of parts of integrons and creation of different multi-resistance regions and plasmid structures.

Cassettes with atypical attC sites

As stated above, each cassette gene is generally associated with one particular attC site, but there are exceptions. A variant of the aadA1 cassette (M95287.4, 3311–4166) known as aadA1b (Recchia & Hall, 1995b) has changes at the end of the attC site in 2R and the R-spacer (GCTTACCTTGGCCG vs. GCTTAACTCAAGCG) (Stokes & Hall, 1992). Our analysis identified other cassettes with an attC site that differed more substantially from the expected one. Cassettes named aacA29a and aacA29b carry almost identical genes (Poirel et al., 2001) but the attC sites differ near 2R, with part of the aacA29b version matching the end of the oxa20 attC. Three related arrays in P. aeruginosa isolates from Taiwan (DQ393784, 2461–3099; EF138817; EU090799) each contain one or two copies of an aacA4 gene associated with an attC site of the expected 72 nt but only c. 77% identical to the typical attC site and with two differences from the gcu3 attC. AF364344 (117–755) and AY139599 each include a gene >98% identical to the exemplar aadB gene associated with a 109-nt attC site most closely related to the one typically found in the aacC1 cassette, rather than the usual 60-nt attC. In EU851865 (172–699), an aacC1 gene is associated with the attC site usually found in the aadA1a cassette.

These variant cassettes all match the exemplar cassette from the start until part way through the attC site. They may have been created by recombination in the attC site, either by homologous recombination between related stretches in two different attC sites or by abnormal site-specific recombination events such as a second round of cleavage and transfer normally avoided during IntI1-mediated reactions (MacDonald et al., 2006). In contrast, a cassette in EU723083 (76–697) includes a gene that is 100% identical to aacA3 but the entire attC site matches the one typically found in the aadA5 cassette and the region from 1R to the start codon also differs from the usual aacA3 cassette. This may provide an example of the same progenitor gene becoming associated with two distinct attC sites. Cassettes with atypical attC sites appear rare (although it is possible that our analysis missed some), but they have implications for cassette nomenclature and may be useful epidemiological markers.

Other cassette variants

A number of cassettes that apparently consist of the start of one cassette and the end of an unrelated cassette were also identified (Table 7). These ‘fused’ cassettes have presumably arisen by deletions with endpoints in adjacent cassettes (Recchia & Hall, 1995b).

Table 7

Examples of cassette fusions

Fusion            1st*   2nd†     Overlap‡   GenBank accession number   Start      End
catB2:aacA38-A    94     ?-1L     ?          EF382672.1                 112 481    111 841
aadA1:dfrA1       13     23-end   GA         AY339625.2                 10 411     10 976
  • * The extent of the first cassette present, from position 1 to the position indicated.

  • † The extent of the second cassette present, where position 1 is start of the cassette.

  • ‡ Bases at the junction between the two cassette fragments that could be derived from either cassette.

  • § A T residue that does not appear to be derived from either cassette is present at the junction between the two cassette fragments.

  • ¶ The only example to date is found in a class 3 integron.

  • ∥ The sequence in S49888 ends within the cassette.

  • ** ISAeca1 is inserted in the qacG fragment at positions 1439–2529 in EF118171.1, flanked by a 2 nt direct duplication (AC).

  • †† A complete version of the aacA38 cassette has not yet been identified; thus, the start position and any possible overlapping bases cannot be defined.

Several cassette variants with internal tandem duplications were also found. A dfrA1 cassette with a 90-bp duplication in the coding sequence flanked by 15-bp direct repeats was associated with reduced levels of trimethoprim resistance (AJ400733.1, 2-end) (Gibreel & Skold, 2000). The first dfrB1 cassette identified (U36276.2, 573–1057) includes a 72-bp duplication compared with the version most common in GenBank (Levings et al., 2006). A vim1 cassette, often the vim4 variant, with a 170-bp duplication that includes all but the final C of the 1L site of attC (AY152821.1, 958–2041) (Patzer et al., 2004; Scoulica et al., 2004) is found 10 times in GenBank. A blaP1 cassette, which appears to have a 52-bp duplication resulting in a 5′ extension of the gene, was also identified here (AB126603.1, 77–1172). These variant cassettes were added to the FDB as separate features and we have designated them as dfrB1d72, dfrA1d90, vim1d170 and blaP1d52.

Enhancing transcription of cassette-borne genes, for example oxa10

Expression of most cassette genes relies on the integron-borne Pc promoter and is influenced by the position of the cassette in the array. While a few cassettes carry an internal promoter, the oxa10 gene may be expressed from a promoter in an adjacent region. As indicated above, oxa10-A is more common than the complete oxa10 cassette in GenBank and most copies are preceded by a 161-bp region (7331–7491 in AF205943.1) that may be the start of a cassette, but the complete version has never been identified. This region, designated PCS-1 here, contains two overlapping promoters (TTGAAG–17 bp–TAAAGT and TTTAAA–16 bp–TCTGAT) (Naas et al., 2001) and association with PCS-1 may thus allow transcription of oxa10 independently of Pc. |PCS-1|oxa10-A| has been found in a few related arrays, followed by either aadA1a or aacA4, suggesting limited movement, but has been seen in a number of species and geographic locations.
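
As an illustration of how the two PCS-1 promoters quoted above could be located in a sequence, the sketch below searches for each −35 hexamer, a spacer of the stated length and the −10 hexamer. It is a minimal example that only finds exact matches; the labels P1 and P2, the function name and the synthetic test sequences are assumptions made here and are not taken from the original reports.

```python
import re

# -35 hexamer, spacer length and -10 hexamer as quoted in the text for PCS-1.
# The labels "P1" and "P2" are hypothetical names used only in this sketch.
PROMOTERS = {
    "P1": ("TTGAAG", 17, "TAAAGT"),
    "P2": ("TTTAAA", 16, "TCTGAT"),
}


def find_promoters(seq):
    """Return sorted (label, start, end) tuples for exact promoter matches.

    Coordinates are 0-based with an exclusive end. A lookahead is used so
    that overlapping promoters are all reported.
    """
    seq = seq.upper()
    hits = []
    for label, (minus35, spacer, minus10) in PROMOTERS.items():
        pattern = re.compile(f"(?=({minus35}[ACGT]{{{spacer}}}{minus10}))")
        for match in pattern.finditer(seq):
            hits.append((label, match.start(1), match.end(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Synthetic sequences built for the example (not the real PCS-1 region).
toy_p1 = "GG" + "TTGAAG" + "A" * 17 + "TAAAGT" + "CC"
toy_p2 = "TTTAAA" + "C" * 16 + "TCTGAT"
print(find_promoters(toy_p1))
print(find_promoters(toy_p2))
```

A real search would normally tolerate mismatches in the hexamers and a small range of spacer lengths; exact matching is used here only to keep the example short.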

Variations in the aacA4 cassette that may provide translational signals

A few cassettes do not appear to carry a suitably positioned RBS, and in these cases translation of the cassette gene may be enhanced by proximity to a short ORF (ORF11) within the attI site when the cassette is first in the array (Hanau-Berçot et al., 2002). An example is aacA4, which has a GTG start codon at positions 25–27 of the cassette (Hanau-Berçot et al., 2002). In several sequences, the aacA4 gene is fused to ORF11 as a result of changes in the attI1 site (Fig. 6a) and several other modifications to aacA4 may also enhance expression. Many of the fused cassettes listed in Table 7 include aacA4 as the second partial cassette, and in these cases the RBS and start codon of the first gene are presumably used. A truncated cassette with a spacer instead of a complete attC site preceding a complete aacA4 cassette may also provide an RBS and/or start codon (Fig. 6b).

Figure 6

Variations in the start of the aacA4 gene cassette. Cassette sequences are shown in red and blue and other regions in black, with alternating upper and lower case letters used to emphasize boundaries between different regions. Potential RBSs are indicated by ovals and -35 and -10 regions are underlined and labeled. Start and stop codons are in bold and underlined and amino acid sequences are shown above. Core sites are boxed. (a) Alterations in the attI1 site. Top, standard attI1 showing ORF11; middle, an 11-bp deletion in attI1 (six in GenBank, e.g. AJ621187); bottom, a 19-bp duplication in attI1 (30 in GenBank, e.g. AB212941). (b) Modifications associated with aacA4cr variants. Top, aacA4 preceded by oxa10-A; middle, aacA4 cassette with a 36-bp insertion replacing positions 9–20; bottom, aacA4 preceded by the |qacE101|aadA1621-A| structure. The two lowercase residues in blue (aa) are presumably derived from the 1L site in aadA16-A.

Many of these modifications predict short N-terminal extensions of AacA4, and the variation in observed sizes of AacA4 proteins (e.g. Casin et al., 1998; Hanau-Berçot et al., 2002) and available N-terminal sequences (Tran van Nhieu & Collatz, 1987; Dery et al., 2003) confirm some of these. In some cases, for example |aadA6-A|aacA4| in AF453998 (Centrón & Roy, 2002) and |aacC2:aacA4| in AF355189 (Dubois et al., 2002), long ORFs are created by fusion of almost full-length genes.

A distinctive region associated with an AacA4 protein variant

In addition to the modifications to aacA4 described above, several point mutations are known to result in differences in the aminoglycoside resistance phenotype conferred. A change (T to C) at position 329 of the aacA4 cassette results in Leu102Ser (numbered from the GTG start codon) in the AacA4 protein. Both variants confer resistance to tobramycin, but additional resistance to amikacin correlates with T329/Leu102 [an aac(6′)-I phenotype, aacA4Ak here] and resistance to gentamicin with C329/Ser102 [an aac(6′)-II phenotype; aacA4Gm here] (Rather et al., 1992). A variant with Gln101Leu as well as Leu102Ser confers significant resistance to both amikacin and gentamicin (Casin et al., 2003). The more recently identified AAC(6′)-Ib-cr variant is less effective against aminoglycosides, but confers additional low-level resistance to fluoroquinolones (Robicsek et al., 2006). All currently known cassettes encoding this variant have T329/Leu102, the same mutation at position 514 (GAT to TAT; Asp164Tyr) and one of two alternative mutations at position 283, TGG to AGG (designated aacA4crA here) or to CGG (aacA4crC), both giving Trp87Arg.
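
The relationship between cassette positions and codon numbers used above follows directly from the GTG start codon at positions 25–27 of the aacA4 cassette. The sketch below works through that arithmetic and applies the variant names used here as a simple rule set. The function names are hypothetical, and the classifier is a deliberate simplification that looks only at positions 283, 329 and 514 (it ignores, for example, the Gln101Leu variant).

```python
GENE_START = 25  # cassette position of the first base of the GTG start codon


def codon_of(cassette_pos, gene_start=GENE_START):
    """Map a cassette position to (codon number, base within codon), both 1-based."""
    offset = cassette_pos - gene_start
    if offset < 0:
        raise ValueError("position lies upstream of the start codon")
    return offset // 3 + 1, offset % 3 + 1


def classify_aacA4(b283, b329, b514):
    """Name a variant from the bases at cassette positions 283, 329 and 514.

    Simplified rules based on the text: the cr variants have T329, T514 and
    A or C at position 283; otherwise T329 is aacA4Ak and C329 is aacA4Gm.
    """
    if b329 == "T" and b514 == "T" and b283 in ("A", "C"):
        return "aacA4cr" + b283
    if b329 == "T":
        return "aacA4Ak"
    if b329 == "C":
        return "aacA4Gm"
    return "unclassified"


for pos in (283, 329, 514):
    print(pos, codon_of(pos))

print(classify_aacA4("T", "T", "G"))
print(classify_aacA4("T", "C", "G"))
print(classify_aacA4("A", "T", "T"))
print(classify_aacA4("C", "T", "T"))
```

Position 329 falls on the second base of codon 102, while positions 283 and 514 are the first bases of codons 87 and 164, consistent with the amino acid changes listed above.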

Preliminary examination of the distribution of different aacA4 variants suggests that aacA4Ak and aacA4Gm are each found in a variety of arrays. However, the majority of aacA4cr cassettes are preceded by a distinctive structure consisting of the first 101 bp of the qacE cassette (or the 3′-CS, both are identical in this region; qacE101 here), the first 21 nt of the aadA16 cassette (aadA1621 here) and the sequence AAAACAAAG (Fig. 6b). qacE101 is also found preceding most examples of aacA29 and complete aadA16 cassettes identified to date and may have resulted from site-specific recombination at a secondary site (GATATA, where G is position 101) in the qacE cassette/3′-CS (Poirel et al., 2001). The remainder of this structure may have been created by deletion of 792 bp from |qacE101|aadA16-A|, which is found in AY740681 (Table 8). qacE101 includes the weak qacE promoter (Guerineau et al., 1990) and a potential RBS and translation from the aadA16 start codon would give an AacA4 protein with a 15 amino acid N-terminal extension (Fig. 6b) (Robicsek et al., 2006).

Table 8

In EF636461, positions 9–20 of the aacA4 cassette have been replaced by a 36-bp sequence that provides a potential RBS (Fig. 6b), but similarity of the downstream cassettes (Table 8) to those found in arrays with the |qacE101|aadA1621-A| structure suggests homologous recombination within the cassette array. The remaining aacA4cr cassette (EU161636) is preceded by oxa10-A (Fig. 6b). Determining the context of other examples of aacA4cr may reveal more about the spread of this important cassette variant.

Using silent mutations to track epidemiology, for example ges cassettes

Characteristic large-scale modifications to cassettes are clear epidemiological markers. However, point mutations, both silent and those resulting in amino acid changes that may or may not affect resistance phenotype, are also potentially useful. As part of developing automated methods to identify cassette variants, we found some interesting patterns in the distribution of variants of the ges1 cassette.

ges1 confers resistance to penicillins and some cephalosporins, but not aztreonam, β-lactamase inhibitors, or cephamycins (Poirel et al., 2000). Other ges genes are minor variants encoding proteins with 1–3 amino acid changes (Table 9). GES-2, GES-4, GES-5 and GES-6 (all Gly170Asn or Ser) have appreciable carbapenemase activity, particularly GES-5 (Smith et al., 2007), while GES-9 (Gly243Ser) has increased activity against aztreonam (Poirel et al., 2005).

Table 9

ges cassettes have been found in several unrelated arrays, often as the first cassette and sometimes in truncated form as ges-A (Table 9). Detailed sequence analysis suggested that ges cassettes from different geographic locations (Greece, Japan, China; Table 9) have characteristic combinations of silent mutations, with mutations that result in phenotypic changes superimposed. This provides evidence of local evolution and suggests that the mutations that affect phenotype, particularly to give ges5, may have occurred on separate occasions.
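
Separating silent mutations from those that change the encoded protein is a simple codon-by-codon comparison, which can be sketched as below. The function name and the two 9-nt toy sequences are hypothetical and are not ges sequences; the codon table is the standard genetic code, and the sketch assumes the two coding sequences are the same length and in frame.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def compare_coding(ref, var):
    """List codon differences between two in-frame coding sequences.

    Each entry is (codon number, ref codon, variant codon, ref amino acid,
    variant amino acid, is_silent), with codons numbered from 1.
    """
    ref, var = ref.upper(), var.upper()
    if len(ref) != len(var) or len(ref) % 3:
        raise ValueError("sequences must be equal in length and in frame")
    changes = []
    for i in range(0, len(ref), 3):
        r, v = ref[i:i + 3], var[i:i + 3]
        if r != v:
            aa_r, aa_v = CODON_TABLE[r], CODON_TABLE[v]
            changes.append((i // 3 + 1, r, v, aa_r, aa_v, aa_r == aa_v))
    return changes


# Toy example: codon 2 changes Gly to Ser, codon 3 carries a silent change.
reference = "ATGGGCCTG"
variant = "ATGAGCCTA"
changes = compare_coding(reference, variant)
for change in changes:
    print(change)

# The silent changes alone form a signature that could be compared between
# sequences from different locations.
silent_signature = tuple((n, v) for n, _, v, _, _, silent in changes if silent)
print(silent_signature)
```

Sequences sharing the same silent signature but differing in their amino acid changes would fit the pattern described above, in which phenotype-altering mutations are superimposed on a local background.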

It is important to distinguish variants that have arisen locally from those acquired from other regions, as their associations with other cassettes and resistance genes may differ. Moreover, a sudden predominance of ‘migrant’ genes may signal greater transmissibility risk than a local variant that has arisen by point mutation in the presence of strong selection pressure.

Trying to understand cassette epidemiology

A better understanding of how different cassettes move within the gene pool would greatly assist in tracking the spread of antibiotic resistance genes, but trying to find meaningful patterns in currently available information is difficult. The data available in GenBank and analysed here are unlikely to be representative of natural bacterial populations due to biases in, for example, the selection of isolates studied and of sequences deposited. However, these data are the most easily accessible and a limited analysis, with reference to complementary published data, may provide useful information to direct more systematic studies. The numbers of complete copies of each cassette or gcu identified in sequences in GenBank were therefore compiled (Tables 1 and 2) and the distribution patterns of selected cassettes examined.

Nearly half of the resistance cassettes (60/132) and most (49/62) gcu were found in ≤2 GenBank entries and some of these cassettes may indeed be uncommon. The aacA2 [aac(6′)-Id] gene was rare in early surveys of aminoglycoside-resistant strains (Shaw et al., 1993) and is found in only one sequence in GenBank (X12618, apparently deposited in 1988). In contrast, aacA4 and aadB were common in these surveys (Shaw et al., 1993) and are also common in sequences in GenBank (Table 1). vim2 was the most common metallo-β-lactamase (MBL) cassette in GenBank and VIM-2 appears to be the most widespread MBL worldwide, at least in P. aeruginosa (Walsh et al., 2008). A few other cassettes were identified frequently in GenBank (Table 1) and it seems reasonable to conclude that at least some of these are more widespread than the majority of cassettes that are found in only a few sequences.

A simple analysis of the distribution patterns of complete examples of these apparently ‘successful’ cassettes in arrays in class 1 integrons also revealed some potentially interesting differences, suggesting that different cassettes may spread in different ways. aacA4 was found in many different arrays (c. 90) with none dominating, but was generally at the first or second position (c. 40% for each). aadB and vim2 were each found in a variety of arrays (c. 35) and as the first cassette in most (c. 70% and c. 60%, respectively). aadA1a was commonly found as a lone cassette (c. 20%), in |dfrA1|aadA1a| (c. 25%) or the last cassette (in 54 of 56 other arrays). dfrA1 was largely restricted to two arrays, |dfrA1|aadA1| (c. 50%) and |dfrA1|gcuC| (c. 30%). The |dfrA12|gcuF|aadA2| array accounted for almost all examples of both dfrA12 and gcuF but only c. 30% of aadA2, which was also commonly seen as a lone cassette (c. 40%). Most examples of aadA5 (c. 80%) and all examples of dfrA17 were found in |dfrA17|aadA5|. Many of these apparently common arrays also dominate in surveys from which no sequences were deposited in GenBank (e.g. Yu et al., 2003; Machado et al., 2007).
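
A tally of this kind can be sketched from arrays written in the |cassette|cassette| notation used here. The example below is a minimal illustration: the function names are hypothetical, the short list of arrays is a toy input rather than the GenBank dataset, and fused cassettes written with ‘:’ are not handled.

```python
from collections import Counter


def parse_array(array):
    """Split an array such as '|dfrA1|aadA1a|' into ['dfrA1', 'aadA1a']."""
    return [name for name in array.strip().split("|") if name]


def position_summary(arrays, cassette):
    """Count occurrences of `cassette` as lone, first, last or internal."""
    counts = Counter()
    for array in arrays:
        names = parse_array(array)
        for index, name in enumerate(names):
            if name != cassette:
                continue
            if len(names) == 1:
                counts["lone"] += 1
            elif index == 0:
                counts["first"] += 1
            elif index == len(names) - 1:
                counts["last"] += 1
            else:
                counts["internal"] += 1
    return counts


# Toy input using array names mentioned in the text (counts are illustrative).
toy_arrays = [
    "|aadA1a|",
    "|dfrA1|aadA1a|",
    "|dfrA12|gcuF|aadA2|",
    "|dfrA17|aadA5|",
]
print(position_summary(toy_arrays, "aadA1a"))
print(position_summary(toy_arrays, "gcuF"))
print(Counter(tuple(parse_array(a)) for a in toy_arrays).most_common(1))
```

Dividing each count by the total number of occurrences of a cassette would give percentages comparable in form to those quoted above.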

Factors influencing cassette epidemiology

Many different factors are likely to contribute to the epidemiology of a gene cassette, one of which is attC site activity. However, relatively few attC sites have actually been tested for activity and the variety of methods used makes it difficult to compare results across studies and to relate attC site structure to activity. Several of the most common cassettes in GenBank (aadA1a, aadA2, aadB) have ‘classical’ 60-nt attC sites, the type that is probably the best studied and may be among the most active. However, this type of attC site is also found in many other cassettes that are rarer in GenBank. The aadA1b variant has a mismatch in the 2L/2R pair at a position that appears to be important in integrase binding and was much less common in GenBank than aadA1a (n=13 vs. 259). The dfrA1 attC site recombined only at an extremely low frequency (Biskri et al., 2005), which may help to explain why this cassette, although common, appears largely restricted to only two different arrays. It is also interesting that the veb1 cassette, which has an attC site that appears atypical compared with other attC sites found in MRI, may have limited mobility: available examples of the complete cassette are always found preceding aadB in related arrays and a truncated veb cassette has been found flanked by short repeated elements, which may provide an alternative means of movement (Zong et al., 2009).

The epidemiology of a cassette is also likely to be influenced by the phenotype it confers and how strongly this phenotype is selected. The phenotype will depend on the intrinsic activity of the encoded protein and the amount produced. Expression of the cassette gene is influenced by the strength of the associated Pc promoter, the position of the cassette in the array, the presence of any additional promoters and whether the cassette carries an RBS or is positioned to take advantage of an adjacent one. Of the cassettes discussed above, those conferring resistance to antibiotics that now have little clinical importance (aadA, streptomycin and spectinomycin resistance) were commonly found as the last cassette in an array, where they might be expected to be poorly expressed. In contrast, cassettes conferring resistance to antibiotics that are clinically relevant (aadB, gentamicin; aacA4, gentamicin or amikacin; vim2, β-lactams and potentially carbapenems) appeared most often in the first or second position, which is more favourable for expression. Many arrays in GenBank include both aminoglycoside and β-lactam resistance cassettes, which may reflect coselection, as well as a particular interest (and therefore surveillance/reporting bias) in these phenotypes.

The mobility of integrons themselves and of transposons and larger multiresistance regions carrying integrons will also influence the distribution of different gene cassettes. Selection pressure on linked resistance genes associated with other mobile elements (e.g. ISCR1) will also play a part, as will the copy number, transferability and host range of plasmid vehicles. The association of class 1 integrons with Tn21-like transposons is likely to be important. The aadA1a cassette may owe much of its apparent success to an early association with Tn21, derivatives of which have disseminated widely (Liebert et al., 1999). An association with a particularly successful larger mobile structure(s) or plasmid(s) may also help to explain why arrays such as |dfrA12|gcuF|aadA2| and |dfrA17|aadA5| appear common. However, as only about 10% of class 1 integron sequences examined here include any sequence beyond a complete 5′-CS or 3′-CS, little can be inferred about the contributions of these different factors at present.

Concluding remarks

Our novel automated approach (G. Tsafnat et al., unpublished data) enabled integration of large amounts of data that would have been very difficult and extremely time consuming to handle manually and allowed identification of cassettes based on their context. Future analyses could potentially be modified to incorporate automated text searching (Grivell et al., 2002) to extract and include data currently available only in published papers. It may also be possible to use this information to more easily distinguish isolates from different sources (e.g. clinical, animal, environmental), geographical regions and years.

It is clear that useful information in available sequences may be overlooked because cassette genes, rather than complete cassettes, are identified and annotated. Many different modifications to cassettes were revealed, some of which may act as useful epidemiological signals for tracking spread of cassettes and evolution of cassette arrays. Cassette sequences with characteristic minor variations may also provide extra information about the epidemiology of cassettes and the evolution of arrays and we are developing additional grammar rules to distinguish these variants.

Our analysis further suggests that new resistance gene cassettes, including those conferring resistance to additional classes of antibiotics, continue to be acquired by MRI. Any attempts to predict the epidemic potential of these emerging cassettes clearly require a much better understanding of the relative influences of all the factors contributing to the spread of a gene cassette and how much this varies between different cassettes. While this analysis of MRI sequences in GenBank may provide a few hints, more systematic experimental exploration of these factors, particularly the wider genetic contexts of cassette arrays, is needed.

Supporting Information

Additional Supporting Information may be found in the online version of this article:

Table S1. OXA-10 variants.

Table S2. OXA-13 variants.

Table S3. OXA-2 variants.

Table S4. VEB variants.

Table S5. IMP-1 variants.

Table S6. IMP-2 variants.

Table S7. VIM-1 variants.

Table S8. VIM-2 variants.

Table S9. aadA1/aadA2 hybrid cassettes.


S.R.P. and G.T. are supported by separate NSW Health Capacity Building and Infrastructure Grants to CIDM and CHI.


  • Editor: Eva Top

