Human coronavirus disease 2019 (COVID-19) emerged in late December 20191,2, and a novel betacoronavirus, subsequently named severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), was shown to be the cause. This virus transmits readily from person to person and rapidly spread worldwide3,4. SARS-CoV-2 belongs to the Order Nidovirales, Family Coronaviridae, Subfamily Orthocoronavirinae, Genus Betacoronavirus, Subgenus Sarbecovirus, Species Severe acute respiratory syndrome-related coronavirus, and individuum SARS-CoV-2, with the strain/sequence appended, e.g., SARS-CoV-2 Wuhan-Hu-1 as the reference strain5.
Similar to other coronaviruses, SARS-CoV-2 is an enveloped, positive sense, single stranded RNA virus with a genome of nearly 30,000 nucleotides6. After the virus has entered the host cell, coronavirus replication initially involves generation of a complementary negative sense genome length RNA for amplification of plus strand virus genome RNA, as well as transcription of a series of plus strand subgenomic RNAs, all with a common leader joined to gene sequences in the 3′-end of the virus genome. Virus replication and transcription both involve cytoplasmic membrane structures forming virus replication/transcription organelles. These structures include virus proteins derived from proteolytic processing of the polyprotein encoded in the 5′ two thirds of the virus genome (termed Open Reading Frame (Orf) 1a and 1b), with a minus 1 ribosomal frameshift between Orf1a and 1b, and translated from the full length plus sense virus genome RNA. A set of subgenomic RNAs is also generated, most likely by a complex mechanism involving paused negative sense RNA synthesis leading to a nested set of negative sense RNAs from the 3′-end of the virus genome joined to a common 5′-leader sequence of approximately 70 nucleotides7,8. The pausing of the virus replication/transcription complex occurs at so-called transcription-regulatory sequences (TRS) located immediately adjacent to open reading frames for these virus genes9,10. These nested negative sense RNAs in turn serve as templates for transcription of plus strands that act as a nested set of virus mRNAs for translation of specific proteins from the 3′-third of the virus genome7. These subgenomic mRNAs of SARS-CoV-2, as illustrated in Kim et al.9, are thought to encode the following virus proteins: the structural proteins spike (S), envelope (E), membrane (M), and nucleocapsid (N), and several accessory proteins thought to include 3a, 6, 7a, 7b, 8, and 109.
Furthermore, it appears that the expression of the N protein is required for efficient coronavirus subgenomic mRNA transcription7.
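The leader-to-body structure of the subgenomic RNAs described above can be sketched in a few lines of Python. This is a toy illustration only: the genome string is made up, the motif is used as a placeholder TRS, and the function `build_subgenomic_rnas` is a hypothetical helper. It models only the final plus strand products (leader joined to the sequence downstream of each body TRS), not the paused negative sense synthesis that is thought to produce them.

```python
# Toy illustration: the genome below is invented and far shorter than any
# real coronavirus genome; TRS is a placeholder motif (assumption).
TRS = "ACGAAC"


def build_subgenomic_rnas(genome: str, trs: str = TRS) -> list[str]:
    """Join the 5' leader (up to and including the first TRS) to the
    sequence downstream of every later TRS occurrence."""
    first = genome.find(trs)
    if first == -1:
        return []
    leader = genome[: first + len(trs)]
    sgrnas = []
    pos = genome.find(trs, first + len(trs))
    while pos != -1:
        body = genome[pos + len(trs):]      # everything 3' of this body TRS
        sgrnas.append(leader + body)        # leader-body fusion
        pos = genome.find(trs, pos + len(trs))
    return sgrnas


toy_genome = "GGTT" + TRS + "AAAATTTT" + TRS + "CCCC" + TRS + "GGGG"
for rna in build_subgenomic_rnas(toy_genome):
    print(rna)
```

Every sequence returned shares the same 5′ leader and the same 3′ end, which is the nested set arrangement the text describes.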
The subcellular site/s of coronavirus RNA replication and transcription in the cytoplasm of infected cells is not fully defined, but is thought to involve so-called “double-membrane vesicles” (DMV) in or on which the virus replication complex synthesizes the needed double and single stranded full length genomic and subgenomic RNAs7,8,11. While it is still unclear whether this RNA synthesis takes place inside or on the outside of these vesicles, it is thought that the membranes somehow “protect” the synthesized RNA, including double stranded RNA, from host cell recognition and response, and also from experimental exposure to RNase8,12. In addition, it has been shown that coronavirus cytosolic RNA is protected from so-called “nonsense-mediated decay” (NMD) by the virus N protein and is thus more stable in that environment than would be expected for nonspliced RNA13.
While it was originally thought that coronavirus virions contained subgenomic RNAs in addition to the virus plus strand genomic length RNA, it has now been shown that these subgenomic RNAs do not contain a packaging signal and are not found in highly purified, cellular membrane free, coronavirus virions14. However, it is important to stress that, unless specific steps to remove cellular membranes are used for sample preparation and virion purification, such subgenomic coronavirus RNAs remain tightly associated with membrane structures, and less purified coronavirus preparations are well known to include subgenomic RNAs that, similar to virion RNA, are nuclease resistant15.
One study has been published looking at the abundance of subgenomic RNAs for SARS-CoV-2 cultured in Vero cells9. That study indicated that while the predicted spike (S; Orf2), Orf3a, envelope (E; Orf4), membrane (M; Orf5), Orf6, Orf7a, and nucleocapsid protein (N; Orf9) subgenomic RNAs were found at high levels in cell culture, only low levels of the Orf7b subgenomic RNA were detected and the Orf10 subgenomic RNA (also sometimes referred to as Orf1510) was detected at an extremely low level (1 read detected, corresponding to only 0.000009% of reads analysed)9. Thus far, little has been published regarding the presence of SARS-CoV-2 subgenomic RNAs in samples from infected people. A single study by Wölfel et al.16 looked specifically for the presence of the E gene subgenomic RNA by PCR and took the presence of subgenomic RNA as an indication of active virus infection/transcription. That study could detect E gene subgenomic RNA at a level of only 0.4% of the virus genome RNA in sputum samples from days 4–9 of infection, but only up to day 5 in throat swab samples16. That study assumed a correlation between the presence of the subgenomic E mRNA and active virus replication/transcription, and thus active infection; however, this assumption may not be accurate considering what has been mentioned above about the membrane associated nature of coronavirus RNAs and their stability/protection from the host cell response and from RNases.
In this work we describe the detection of SARS-CoV-2 subgenomic RNAs in routine diagnostic oropharyngeal/nasopharyngeal swabs up to 17 and 11 days after first detection by next generation sequencing (NGS) and PCR, respectively. Our finding of extended detection of subgenomic RNA in diagnostic samples has subsequently been supported by another study (available as preprint)17 using the same E gene PCR mentioned above16. That very recent study detected subgenomic E RNA in swab samples from hospitalized patients up to 22 days after onset of clinical symptoms17. Thus, it is becoming clear that the presence, and thus detection, of SARS-CoV-2 subgenomic RNAs in diagnostic samples is rather prolonged and consequently not a good marker/indication of active virus replication/transcription or active/recent infection. Despite that, a number of high-profile studies18,19,20,21 have continued to use the presence of, or a reduction in, subgenomic RNA as evidence of active infection or of protection from it. Consequently, we believe it is important to understand that these subgenomic RNAs may be present for a significant time after active infection.
Detection and abundance of NGS reads mapped to subgenomic RNAs
Our analysis of subgenomic RNAs included 12 SARS-CoV-2 positive swab samples and a virus-negative control sample (Table 1). Manual inspection of reads indicated the presence of subgenomic RNAs, and mapping against a reference (fasta file available as Supplementary Data 1) designed to specifically map the ten potential subgenomic RNAs indicated the presence of a variable number of reads mapping to subgenomic RNAs in all SARS-CoV-2 positive samples (NCBI Sequence Read Archive (SRA): PRJNA636225), while no reads were found in the negative control sample (Table 2 and Fig. 1). Overall, of the 56 million NGS reads generated from the virus-positive samples, nearly 800,000 reads mapped to one of the ten SARS-CoV-2 subgenomic RNAs (Table 2). No reads mapped to the tentative Orf10 RNA and only five reads were mapped to the tentative Orf7b RNA (Table 2 and Fig. 1). In contrast, reads were mapped to the other eight subgenomic RNAs. Although the pattern differed among samples, S (Spike), Orf3a, and M were consistently mapped at a low level, followed in increasing order by subgenomic RNAs for Orf8, Orf6, and E, while Orf7a and N were mapped in the highest abundance, although this was not consistent for all samples (Table 3 and Fig. 1). The abundance, although overall more or less as expected based on assumed subgenomic RNA abundance7,8,9,10,15, differed widely among samples, most likely depending on sample quality and overall virus genomic and subgenomic RNA abundance. Comparing samples amplified with two different polymerases (Table 3; sample GC-11/34 compared with sample GC-11/38 and GC-14/33 compared with GC-14/37) and comparing samples with longer average read length and high virus coverage (Table 3; samples GC-26/66, GC-11/38, GC-24/61, GC-14/37, and GC-23/60) also generated a somewhat comparable pattern, although with some variability from sample to sample.
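As a worked example of the arithmetic behind these abundance comparisons, the sketch below converts read counts into percentages. The overall figure uses the rounded totals quoted above (nearly 800,000 of 56 million reads); the per-RNA counts in `example_counts` are invented for illustration and are not values from Table 2 or Table 3, and `relative_abundance` is a hypothetical helper name.

```python
def relative_abundance(counts: dict[str, int]) -> dict[str, float]:
    """Express each subgenomic RNA read count as a percentage of the
    total subgenomic reads in one sample."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


# Rounded totals quoted in the text: roughly 1.4% of all reads were subgenomic.
overall_percent = 100.0 * 800_000 / 56_000_000
print(f"overall subgenomic share: {overall_percent:.1f}%")

# Invented counts for a single hypothetical sample (not data from the study).
example_counts = {"Orf7a": 500, "N": 300, "E": 150, "S": 50}
print(relative_abundance(example_counts))
```

Normalising within each sample in this way makes patterns comparable between samples that differ widely in total read depth.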
Indeed, sample quality, as indicated by average read length (Table 3), appeared to strongly influence the levels of subgenomic RNAs detected, likely because these subgenomic RNA amplicons are incidentally shorter than many of the virus genome amplicons (Supplementary Table 1 and the Source Data File). To look at this, we analysed the mapping results of two samples already known to be of poor quality, having been suspended in water rather than PBS/transport medium before coming to our laboratory. Although these two samples had a high virus load in the diagnostic PCRs, the NGS generated mostly very short reads (Table 3; samples GC-25/65 and GC-55/68) and had a different pattern with a very high abundance of subgenomic RNAs dominated by the Orf7a subgenomic amplicon. This is most likely due to this amplicon being short (the sequence between the leader sequence forward primer and the nearest pool 2 reverse primer is only 85 nucleotides), although most other subgenomic amplicons would also be expected to be short, as are some genomic amplicons (Supplementary Table 1 and the Source Data File). Our sample set included multiple samples from two individuals sampled 11–17 days apart and representing early and late infection (Tables 1 and 3). As can be seen when comparing those samples, subgenomic RNAs are also detected in the late infection samples and may even be preferentially amplified. Although this may possibly indicate a rather long period of virus replication/transcription, we believe it is more likely due to coronavirus membrane-associated RNAs being partly, albeit not fully, protected from host and environmental degradation (see below). Partial degradation, represented as shorter average read lengths, may result in some shorter amplicon targets being preferentially amplified.
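The reasoning that partial degradation favours short amplicons can be illustrated with a simple model that is our own assumption and not part of the analysis: if strand breaks occur independently at a uniform rate per nucleotide, the chance that a template stretch of length L is still intact falls off as exp(-rate × L). The break rate and the 200-nucleotide comparison amplicon below are hypothetical; only the 85-nucleotide length comes from the text.

```python
import math


def intact_fraction(amplicon_len: int, breaks_per_nt: float) -> float:
    """Probability that a template stretch of the given length carries no
    break, assuming independent breaks at a uniform per-nucleotide rate."""
    return math.exp(-breaks_per_nt * amplicon_len)


def short_vs_long_ratio(short_len: int, long_len: int, breaks_per_nt: float) -> float:
    """How many times more often the short target survives than the long one."""
    return intact_fraction(short_len, breaks_per_nt) / intact_fraction(long_len, breaks_per_nt)


# 85 nt is the Orf7a subgenomic target length given in the text; 200 nt and
# the break rates are illustrative assumptions.
for rate in (0.0, 0.005, 0.01, 0.02):
    ratio = short_vs_long_ratio(85, 200, rate)
    print(f"breaks/nt={rate}: 85 nt target favoured {ratio:.2f}-fold over 200 nt")
```

Under this assumed model the advantage of the short target grows as degradation increases, which is in line with the pattern seen in the poorer quality samples, although the real amplification bias will also depend on primer efficiency and template abundance.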
Subgenomic RNA reads mapped to the virus genome by filtering
To validate the results detailed above, we took the NGS reads already mapped to the virus reference genome (Wuhan-Hu-1-NC_045512/MN908947.3 [https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/ and https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3]), filtered them so that only reads containing part of the leader sequence were included, and then looked at where these reads had been mapped. Between 8 and 256,123 reads containing the leader sequence were found in our positive samples, while none were detected in the negative sample GC-28/67 (Supplementary Table 2). Reads were mapped to the location of the TRS of nine of the ten known subgenomic RNAs; however, only samples GC-26/66, GC-11/38, and GC-14/37 possessed reads, in a low number, mapping to the start of Orf7b. The numbers of reads with a leader sequence mapped to the corresponding ORF in the SARS-CoV-2 genome are shown in Supplementary Table 2 and Supplementary Fig. 1. While the percentages varied among the samples, the four subgenomic RNAs with the highest median number of reads with the leader sequence were the E gene/Orf4 (4.1%), Orf7a (17.4%), Orf8 (4.3%) and N gene/Orf9 (10.7%) (Supplementary Fig. 1).
The samples with the highest number of reads mapping to cryptic or unknown TRS were the poorer quality samples GC-11/34, GC-21/64, and GC-25/65 and no consistent pattern was observed in the mapping of reads with the leader sequence to any individual unrecognized TRS site.
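The filtering step described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the analysis: the leader fragment, the ORF start coordinates, the 30-nucleotide window, and the function name `tally_leader_reads` are all hypothetical, and reads are represented simply as (sequence, mapped position) pairs.

```python
from collections import Counter

# Placeholder values for illustration; substitute the real leader fragment
# and annotated ORF starts from the reference genome in practice.
LEADER_TAG = "GGTTACGAAC"
ORF_STARTS = {"S": 100, "Orf3a": 200, "E": 300}


def tally_leader_reads(reads, orf_starts=ORF_STARTS, leader_tag=LEADER_TAG, window=30):
    """Keep only reads containing the leader fragment, then count them by
    the ORF whose start lies within `window` nt downstream of the mapped
    position. Leader reads near no annotated ORF are counted as
    'unassigned' (candidate cryptic or unknown TRS sites)."""
    counts = Counter()
    for seq, pos in reads:
        if leader_tag not in seq:
            continue  # not a leader-containing read
        label = "unassigned"
        for name, start in orf_starts.items():
            if 0 <= start - pos <= window:
                label = name
                break
        counts[label] += 1
    return counts


example_reads = [
    ("AAGGTTACGAACTT", 95),   # leader read just upstream of the toy S start
    ("GGTTACGAACCC", 290),    # leader read just upstream of the toy E start
    ("CCCCCC", 100),          # no leader fragment, ignored
    ("GGTTACGAACAA", 500),    # leader read away from any annotated start
]
print(tally_leader_reads(example_reads))
```

In real data the (sequence, position) pairs would come from an alignment file, and the window would need tuning to the distance between each TRS and its ORF start.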
Searching the NCBI SRA for reads mapping to subgenomic RNAs
Another step in our analysis included searching the NCBI SRA and selection of a few deposited NGS reads from studies using either the same SARS-CoV-2 Ampliseq panel or generated by other methods. Although not abundant for all of them, reads representing subgenomic RNAs rather than virus genomic RNA could be found by simple analysis using e.g., BlastN. Again, as in our own data, we detected no or very little subgenomic RNA of Orf7b and no evidence for Orf10 subgenomic RNA.
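A simple screen of raw reads for subgenomic junctions, in the spirit of the search described above, might look like the sketch below. It is an illustration under assumptions rather than the BlastN procedure itself: `leader_end` and `genomic_next` are placeholder sequences, the function names are hypothetical, and the parser assumes plain four-line FASTQ records without wrapped sequence lines. A read is called subgenomic when the end of the leader is followed by something other than the sequence that follows the leader in the full-length genome.

```python
import io
from collections import Counter


def iter_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def classify_read(seq: str, leader_end: str, genomic_next: str) -> str:
    """Label a read by what follows the end of the leader sequence."""
    idx = seq.find(leader_end)
    if idx == -1:
        return "no_leader"
    downstream = seq[idx + len(leader_end):]
    if len(downstream) < len(genomic_next):
        return "ambiguous"       # too little sequence past the leader to decide
    if downstream.startswith(genomic_next):
        return "genomic"         # leader continues into the genome as usual
    return "subgenomic"          # leader joined to some other region


# Toy FASTQ content with placeholder sequences (not real SARS-CoV-2 reads).
toy_fastq = io.StringIO(
    "@r1\nGGACGAACTTTAGG\n+\nIIIIIIIIIIIIII\n"
    "@r2\nGGACGAACATGTCC\n+\nIIIIIIIIIIIIII\n"
    "@r3\nCCCCCCCC\n+\nIIIIIIII\n"
    "@r4\nGGACGAACTT\n+\nIIIIIIIIII\n"
)
labels = Counter(
    classify_read(seq, leader_end="ACGAAC", genomic_next="TTTA")
    for seq in iter_fastq_sequences(toy_fastq)
)
print(labels)
```

Exact string matching ignores sequencing errors and reverse-complement reads, so an alignment-based search would be more sensitive on real data.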
To look at this in more detail, we downloaded a selection of SRA datasets generated from different sample types, on different sequencing platforms, and with different library strategies. Reads belonging to subgenomic RNA could be identified in all samples except one (SRR11454612), from RNA-Seq on a sputum sample from an infected human (Supplementary Table 3). The two selected Ion Torrent Ampliseq datasets (SRR11810731 and SRR11810737) produced the highest number of subgenomic reads, followed by an RNA-Seq experiment performed in cell culture using a Nanopore platform (SRR11267570). The selected RNA-Seq experiments performed on clinical samples typically generated very low levels of reads mapping to the virus genome and consequently to the leader sequence. The ARTIC Network primers22 also detected subgenomic reads in virus culture experiments (ERR4157962 and ERR4157960).
The subgenomic RNAs with the highest number of reads mapped in the SRA datasets were N and Orf7a, followed by Orf3a and the M gene. Reads for the subgenomic S gene and Orf6 were typically low, and no reads were mapped to the subgenomic Orf10 in any sample. Only samples SRR11267570 and SRR11810737 had any reads mapped to the subgenomic Orf7b (0.2–0.3% of reads having the leader sequence).
Further abundance analysis of mapped NGS amplicons
The number of reads mapped to either the first 21,500 nucleotides (nt) of the reference virus genome, to the subgenomic region from nucleotide 21,500 onward, to subgenomic RNA containing the leader sequence, to the included cellular control mRNA amplicons and reads not mapped to any of these are summarized in Fig. 2. Specific details about the abundance of cellular mRNA amplicons in each NGS sample are shown in Table 4. Some samples have very few reads mapped to cellular mRNA amplicons, e.g., samples GC-25/65 and GC-55/68 having been submitted in water, while other samples, such as the low virus load samples GC-23/60, GC-24/61, GC-51/62, GC-20/63, and GC-21/64 and the negative control sample GC-28/67, have many reads mapped to cellular mRNA amplicons (Table 4 and Fig. 2). Interestingly, samples GC-14/33/37 and GC-11/34/38 also had a low number of reads mapped to cellular mRNA amplicons. These samples have a high SARS-CoV-2 load and were taken early in infection and this may also be the case for sample GC-26/66 (Table 4), consistent with a likely reduced level of cellular mRNAs in early, high virus load infection (Table 4 and Fig. 2).
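The five read categories summarized in this section can be expressed as a small classification rule. The sketch below is an assumption-laden illustration: the `MappedRead` fields and category labels are hypothetical, position 21,500 itself is assigned to the first region (the text does not say which side of the boundary it belongs to), and a leader-containing read is only called subgenomic if it maps downstream of the boundary, since genomic reads from the 5′ end of the genome also contain the leader.

```python
from collections import Counter
from typing import NamedTuple, Optional

BOUNDARY = 21_500  # split point named in the text


class MappedRead(NamedTuple):
    target: Optional[str]  # "virus", "cellular", or None if unmapped (hypothetical labels)
    start: int             # 1-based start on the virus genome; ignored otherwise
    has_leader: bool       # read contains part of the leader sequence


def categorize(read: MappedRead) -> str:
    """Assign a read to one of the five categories."""
    if read.target == "cellular":
        return "cellular_control"
    if read.target != "virus":
        return "unmapped"
    if read.has_leader and read.start > BOUNDARY:
        return "subgenomic_with_leader"
    return "genome_first_21500" if read.start <= BOUNDARY else "genome_from_21500"


example = [
    MappedRead("virus", 100, True),      # 5' genomic read that includes the leader
    MappedRead("virus", 26_000, True),   # leader joined to a downstream gene
    MappedRead("virus", 26_000, False),  # downstream read without leader
    MappedRead("cellular", 0, False),
    MappedRead(None, 0, False),
]
print(Counter(categorize(r) for r in example))
```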