Ethics declaration

Our research complies with the Local Ethics Review Board of the Smorodintsev Research Institute of Influenza (approved on April 30, 2020, reference number: 131) and by the Biomedical Ethics Committee of the I.P. Pavlov First Saint Petersburg State Medical University. The research protocol was approved by IRB and ethics committees and participants gave written informed consent, according to CARE guidelines (See: https://www.care-statement.org/), and in compliance with the Declaration of Helsinki principles.

Sample collection and sequencing

RT-PCR of swabs and sequencing of viral RNA were performed at the Smorodintsev Influenza Research Institute. All specimens were obtained and transported according to standard sampling protocol. RNA from nasopharyngeal swabs was extracted using QIAamp Viral RNA Mini Kit (QIAGEN). RNA from patient A post-mortem FFPE specimens was extracted using RNeasy FFPE Kit (QIAGEN). Samples were tested for SARS-CoV-2 viral RNA by real-time RT-PCR on thermal cycler CFX96 (BioRad) using “Intifica SARS-CoV-2” Kit (Alkor Bio). Whole-genome amplification of SARS-CoV-2 virus genome for samples from August 2020 and from January 2021 was performed using BioMaster RT-PCR Premium kit (Biolabmix) and primers from ARTIC Network protocol version 342 and ARTIC Network protocol version 143 with modifications, respectively. Nextera XT (Illumina) kit was used for library preparation in August 2020 and DNA Prep (Illumina) kit was used for library preparation in January 2021, and the libraries were sequenced using the MiSeq platform (Illumina) with version 3 600-cycle chemistry.

The DNA of patient S was extracted from peripheral blood using QIAmp Blood DNA Mini kit. DNA sample was prepared and captured with the SureSelect Human All Exon kit v7 (Agilent), and whole exome was sequenced using MGISEQ-2000 at Pirogov Russian National Research Medical University (Moscow, Russia).

Flow cytometry assays

Flow cytometry assays were performed using cryopreserved PBMCs. Cells were isolated from patients’ heparinized blood by gradient centrifugation with lymphocyte separation medium Lymphosep (BioWest), frozen in freezing medium containing 10% DMSO (AppliChem) in FBS (Gibco) and stored in liquid nitrogen until usage.

For B-cells analysis presented in Supplementary Fig. 5, PBMCs samples were towed in a 37°С water bath and stained with fluorescently-labeled antibodies to surface markers CD19-APC/Fire 750 (Clone: SJ25C1, Biolegend), BV421-CD20 (Clone: 2H7, Biolegend), CD3-BV605 (Clone: OKT3, Biolegend) (RRIDs and catalog numbers are provided in Supplementary Data, Reagents). PBMCs from a healthy volunteer were used as a control. B-cells were identified as a live CD3-/CD19 + /CD20 + population.

The T-cell response was assessed by intracellular cytokine staining. For further analysis, cells were towed in a 37°С water bath and stimulated for 6 hours with 5 µg/ml of the commercial available peptide mixture of SARS-CoV-2 proteins S, N, M, ORF3a and ORF7a (Generium, Russia)(for Supplementary Fig. 6b, c) or one of the peptides PTDNYITTY, PADNYITTY or PPDNYITTY or peptide pools (YLQPRTFLL + STNVTIATY + KPRSQMEIDF + GPQNQRNAPRITF + VPLHGTIL and YLQPSTFLL + SINVTIATY + KLRSQMEIDF + GTQNQRNAPRITF + VPLHGTIR) (for Fig. 3) in the RPMI medium (Gibco), containing 10% of FBS (Gibco), 1% of penicillin-streptomycin solution (Gibco), Brefeldin A (BD) and costimulatory CD28/CD49d reagent (BD). Negative control samples were stimulated with the complete medium; for positive control, PMA/ionomycin (Sigma) combination was used. Surface markers were stained with fluorescent antibody panel containing CD3-APC/Fire (Clone: SK7, Biolegend), CD4-AF647 (Clone: SK3, Biolegend), CD8a-AF700 (Clone: HIT8a, Biolegend), CD45RA-PE/Dazzle (Clone: HI100, Biolegend), CD197-BV421 (Clone: 150503, BD). Intracellular cytokines were stained using IL-2-FITC (Clone: MQ1-17H12, Biolegend), IFNγ-PE (Clone: 45.15, Beckman Coulter), TNFα-BV785 (Clone: MAb11, Biolegend) antibodies (RRIDs and catalog numbers are provided in Supplementary Data 1, Reagents). Cells were permeabilized with BD Cytofix/Cytoperm™ Fixation/Permeabilization Solution Kit (BD) according to the manufacturer’s instructions. Data was collected on a CytoFlex flow cytometer (Beckman Coulter) using CytExpert software (Beckman Coulter). The results were analyzed using the Kaluza Analysis v2.1 program (Beckman Coulter). Interleukin (IL) 2, interferon γ (IFNγ) and tumor necrosis factor (TNFα) response was measured in effector memory T cells (Tem). To identify Tem, lymphocytes were gated based on their size and granularity. Live CD3+T cells were identified and subdivided into CD4 + and CD8 + T cells. These populations were further subdivided based on the expression of CD45RA and CD197(CCR7). CD3 + CD4 + or CD3 + CD8 + lymphocytes with the CD45RA-/CCR7- phenotype were considered Tem cells (Supplementary Fig. 6a). Cut-off values for the definition of cytokine-producing T cell responses stimulated with SARS-CoV-2 peptides were ≥5 events and a ≥ 2-fold difference in the magnitude of TNF + , IFNγ + or IL-2+ Tem cells compared to the non stimulated control.

Virus isolation and antigenicity

Live viruses (samples 30579 V and 30769 V from August 20, 2020 and 22748 V and 23680 V from February 19, 2020) were isolated from patient S swab samples in Vero E6 cells (IZSLER #BSCL87). Culture was inoculated for 2 hours with swab material diluted 1/10 in DMEM (Biolot) supplemented with 2% HI-FBS (Gibco), 1% anti-anti (Gibco) and then incubated for 3 days until first CPE signs. Samples were subsequently passaged one time in Vero cells (ATCC #CCL81). Virus culture media contained AlphaMEM (Biolot) supplemented with 2% HI-FBS (Gibco), 1% anti-anti (Gibco). (RRIDs and catalog numbers are provided in Supplementary Data 1, Reagents).

A total of 16 serum samples were obtained during the first wave of the COVID-19 pandemic in spring-summer 2020 from recovered volunteers with PCR-confirmed SARS-CoV-2 infection and tested in a microneutralization assay.

Microneutralization was performed with hCoV-19/St_Petersburg-3524V/2020 virus (GISAID EPI_ISL_415710, with the ΔF combination of mutations absent, designated as Reference), and 30769 V and 23680 V viruses isolated from the patient S (designated patient S August 2020 and patient S January 2021, respectively). Serum was heat inactivated (56 °C, 60 min), serially diluted 2-fold starting from 1/10, mixed with 25 TCID50 of virus, incubated for 1 h at 37 °C and inoculated into Vero cells in triplicates in 96-well plate. Culture media was the same as for virus isolation. 5 days after inoculation, neutralizing antibody titer was calculated as the reciprocal of the highest serum dilution preventing CPE.

Serum samples obtained from patient S were tested for virus specific antibodies in ELISA and in microneutralization assay with either Reference or patient S viruses. ELISA was performed with “SARS-CoV-2-IgG-EIA-BEST” (VEKTOR BEST #D-5501) according to the manufacturer’s instructions.

HLA genotyping

HLA genotyping was performed using a commercial kit according to the manufacturer’s instructions (PARallele™ HLA solution v3, Parseq Lab). HLA-A, -B, -C, -DRB1 and -DQB1 loci were genotyped with 3-field resolution. Simultaneously, HLA calling was performed from WES data using HLA-HD version 1.3.044 with IPD-IMGT/HLA database Release Version 3.43. The inferred alleles are listed in Supplementary Table 3.

Using HLA-2-Haplo software tool45 this set of alleles was split into two haplotypes presented in Supplementary Table 3. A European population database was used in this procedure. An a-posteriori probability of found combination was 97.6%. As one can see, the found haplotypes are among the most common variants in the European population.

Consensus calling

Raw reads were trimmed with Trimmomatic version 0.3946 to remove adapter sequences and low-quality ends. Trimmed reads were mapped onto the Wuhan-Hu-1 (MN908947.3) reference genome with BWA MEM version 0.7.1747. The following reads were then removed from the mapping: reads with abnormal insert length to read ratio (greater than 10 or less than 0.8), reads with insert length greater than 1100, reads with more than 50% soft-clipped bases. Soft-clipped ends were trimmed from the remaining reads, 10 nucleotides were cropped from read ends using custom scripts, and primer sequences were removed with ivar version 1.348. Only reads with at least 30 nucleotides remaining after the procedure were kept. SNV and short indel calling was done with LoFreq version 2.1.549, with SNVs considered consensus if they were covered by at least 4 reads and supported by more than 50% of those reads; indels were considered consensus if they were covered by at least 20 reads with at least 50% of those supporting the variant. Regions that were covered by fewer than 4 reads were masked as NC. We attributed several positions that were covered by 2 or 3 reads, but matched the reference and were conserved throughout all samples (22612, 23680, 24160, 27064, 30579 and 30769), to the reference; as these positions did not mutate, this decision did not affect any of our analyses. Consensus was created by bcftools version 1.950,51 consensus.

Phylogenetic analysis

255,389 genomes of SARS-CoV-2 were downloaded from GISAID on December 12, 2020, (Supplementary Data) and aligned with MAFFT v7.45352 against the reference genome Wuhan-Hu-1/2019 (NCBI ID: MN908947.353 with --addfragments --keeplength options. 100 nucleotides from the beginning and from the end of the alignment were trimmed. After that, we excluded sequences (1) with more than 300 positions of missing data (Ns) and gaps, (2) excluded by Nextstrain (https://github.com/nextstrain/ncov/blob/master/defaults/exclude.txt), or (3) from non-human animals other than minks, leaving us with 201,948 sequences. Identical sequences were then collapsed within the country and host and annotated by the Pangolin package version 2.1.054. To this dataset, we added the two patient S samples obtained in August, 2020 as well as the patient A sample. As sequences of patient S belonged to the B.1.1 lineage, we further only kept sequences annotated as B.1.1, excluding a large clade defined by mutation G25563T (GH clade in GISAID55 nomenclature). For the purposes of phylogenetic analysis, we additionally masked the highly homoplastic site 11083. The final set of 49,083 sequences was used to construct the phylogenetic tree with IQ-Tree v2.1.156 under the GTR substitution model and ‘-fast’ option. Ancestral sequences at the internal tree nodes were reconstructed with TreeTime v. 0.8.057. Having ensured that the two patient S samples form a clade rooted at the patient A sample and not carrying any samples other than those of patient S, we then separately reconstructed the phylogeny of all six samples of patient S, rooted it with patient A, and manually added the resulting clade to the downsampled B.1.1 tree. For visualization purposes, the tree was downsampled to contain 1% of samples, including the patient A sample and the complete clade containing all patient S samples.

To estimate the molecular clock rate of the patient S lineage (Fig. 1c), we downloaded all sequences available in GISAID on May 31, 2021, filtered them as described above, and subsampled the filtered dataset to 50,000 samples preserving all Russian sequences. To this dataset, we added the six patient S samples and the ancestral patient A sample. We then aligned the obtained 50,007 sequences against the reference sequence and reconstructed the phylogeny with Fasttree version 2.1.1158. Finally, we computed root-to-tip distances and calculated the slope of the root-to-tip distance vs. sampling dates regression line for the three separate datasets: (1) patient S samples, (2) B.1.1.7 samples, and (3) the remaining samples from the subsampled GISAID dataset. To validate the difference between the estimated clock rates for patient S samples and samples belonging to dataset (3), we subsampled this dataset, picking six random samples collected on the same dates as the patient S samples, and computed the linear regression slope, in each of the 10,000 trials. (For dataset (2), this procedure was impossible because there were no B.1.1.7 samples in August 2020). None of the 10,000 samples resulted in the estimated clock rate above 15.3 × 10−4, implying the p-value of <0.0001.

Effect of viral mutations on antigen presentation

To study the effect of mutations in SARS-CoV-2 proteins on their antigen presentation, we adapted metrics from Marty et al.30 (Fig. 2a). For each mutated site in both its ancestral and derived states, we inferred all possible peptides of certain lengths overlapping it, and calculated their percentile ranks (Rank_El) relative to a set of random natural peptides by netMHCpan version 4.1 and netMHCIIpan version 4.059 for HLA I and HLA II respectively. We used peptide lengths between 8 and 12 amino acids for HLA I alleles, and between 12 and 18 amino acids for HLA II. If the mutated site was not presented by any of the HLA alleles either in the ancestral or derived states, we excluded it from analysis. To exclude non-presenting peptides, we used the percentile rank <2% threshold for HLA I, and <10% threshold for HLA II, as recommended by the netMHCpan manual. For derived states of deletions, we extended the peptide in the C-direction as necessary to preserve its length. We paired the predicted A and B chains of HLA class II alleles as suggested in the tool allele list: HLA-DQA10101-DQB10501, HLA-DQA10501-DQB10201, HLA-DPA10103-DPB10402, HLA-DPA10103-DPB10401, DRB1_0301, DRB1_0101. We excluded the stop-codon producing mutation ORF8:Q18* from comparisons of ancestral and derived states, since the corresponding values for the derived state were undefined.

As in Marty et al.30, we used the best percentile rank (BR) among all possible peptides overlapping the mutated site as the presentation score of this site for the particular HLA allele. To estimate the overall presentation of the site in the patient, we calculated the patient harmonic best rank (PHBR), i.e., the harmonic mean of BRs of HLA alleles of the same class. To compare the effect of a mutation on site presentation, we calculated the fold change of PHBR score as the ratio of the derived PHBR to the ancestral PHBR (so that fold change > 1 indicates weakening of presentation).

To focus on the peptides shown to be immunogenic to T cells in other SARS-CoV-2 infected patients carrying the same HLA alleles as patient S, we used IEDB31 (Immune Epitope Database and Analysis Resource, accessed on June 1, 2021) with the “positive assay only” filter. IEDB identifiers of found epitopes are presented in the Supplementary Data file. For those sites inferred to be contained in immunogenic peptide, we calculated the best percentile rank of immunogenic peptide overlapping the site of mutation (imBR).

Population-level effects of mutations

To check the effect of detected SARS-CoV-2 mutations on presentation by the HLA alleles other than those of patient S, we calculated the BR scores as explained above for the most frequent classical HLA alleles of each family that together represented 95% of the HLA alleles in the world population. The list and frequencies of such alleles were taken from Sarkisova et al. and Solberg et al.36,37.

For most mutations detected in immunogenic epitopes, at least one of the HLA I alleles of patient S demonstrated extreme values of BR fold change in comparison with other alleles (Fig. 4a). To check the probability of such an observation happening by chance, we performed a permutation test, calculating the probability that a randomly chosen set of alleles has the same or a more extreme value of mean BR fold change across all mutations overlapped by immunogenic peptides as that of alleles of patient S. This was true for 33 out of 100000 permutations, corresponding to p = 0.0033 (Fig. 4b). None of the HLA II immunogenic epitopes overlapped any of the mutated sites; the only mutated site adjacent to such an epitope (S:S50L) did not stand out in the permutation test (p = 0.6996; Fig. 4c, d).

To compare the effects of mutations between different HLA alleles in Fig. 4e, f, we calculated the mean BR across all changed sites. This analysis again excluded ORF8:Q18*, which nevertheless prevented production of high-affinity epitopes for most alleles.

Data analysis and visualization

Data analysis was performed in R version 4.0.060, and figures were visualized with ggplot2 package version 3.3.261. SARS-CoV-2 phylogenetic tree was visualized with ITOL version 662.

Statistics and reproducibility

No statistical method was used to predetermine the sample size. No data were excluded from the analyses. The experiments were not randomized. The Investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.