Extracellular vesicles carry distinct proteo-transcriptomic signatures that are different from their cancer cell of origin

For quantification of gene expression, raw reads were aligned to the latest Ensembl GRCh38.p13 (GCA_000001405.28) using bowtie aligner (version 2.5.4b). FeatureCounts was then used to map the aligned reads to the GENCODE v26 primary gene annotation, including transcripts corresponding to ncRNAs such as lncRNA, miRNA as well as protein-coding RNA. To maximize recovery and minimize the noise, multimapping reads were quantified up to m = 10 and distributed using unique reads mapping distribution, as described in most recent best practices protocols.

Formal analysis

Data cleaning, filtering, and analysis were performed in R and under expressed genes or proteins with low or no counts across all samples of the similar phenotype were removed (at least one of the samples have CPM >10). Normalization via trimmed mean of M-values in edgeR ensures library sizes of all samples are scaled properly to minimize the influences of external factors. The limma package, originally designed for microarray data, performs linear modeling on normally distributed data. Thus, to accommodate for the non independent mean-variance relationship of RNA-seq data, the voom function assigns a precision weight derived from the library size and normalization factor of each sample itself to convert the raw counts to log2-CPM values. The log2-transformed counts minimize the changes in variance as the count size increases. Prior to examining differential expressions, we performed unsupervised clustering of samples to evaluate the similarities and dissimilarities between samples as well as across phenotypes of interest using the prcomp package in R. The result is reflected in the PCA plots.

Differentially expressed genes are discerned between 1) 22RV1 cell lines versus 22RV1 cell-line-derived EVs, and 2) Prostate cancer patient serum-derived EVs isolated using nanoDLD versus the EVs isolated using UC via the standard differential expression pipeline as illustrated in limma/edgeR packages. Results of the differentially expressed genes are represented in high-resolution heatmap as well as volcano plots made using pheatmap and ggplot2 packages. Likewise, differentially expressed proteins are discerned between 1) 22RV1 cell-line, 2) the EVs derived from the corresponding cell lines, and 3) patient serum EVs derived from UC versus those derived from nanoDLD.

Correlation analyses: Spearman Rho correlations were determined across cellular and EV genetic profiles as well as the proteomic profiles. Gene expressions were plotted in the x/y axis, where x/y axis are log2 (CPM), all RNA types were analyzed.

Biotype analysis: The gene biotype was recovered from the GTF annotation file for Ensembl GRCh38 (same as for alignment). Mapping resolution was kept as CDS with intron and exon annotation levels and combined to gene level when necessary. After differential expression quantification of gene biotype proportions, numbers and expression levels was taking into account. Thus, expressing gene biotype as (1) number of molecules per biotype (after lib. size adjustment) and (2) levels of expression using RPKM to adjust for gene/transcript length sizes.

Pathway Analysis: To effectively compare, not only enriched genes, but also against enriched proteins, pathway enrichment analyses are performed using the enrichR package in R. Specifically, we referenced databases including Kyoto Encyclopedia of Genes and Genomes (Versions: 2013, 2015, 2016, 2019, 2021), Gene Ontology Molecular Function (Versions: 2013, 2015, 2017, 2017b, 2018, 2021), Gene Ontology Cellular Component (Versions: 2013, 2015, 2017, 2017b, 2018, 2021), Gene Ontology Biological Process (Versions: 2013, 2015, 2017, 2017b, 2018, 2021), Reactome (Version: 2016), and WikiPathways (Version: 2019). Top 1500 enriched genetics and all of the proteomic signatures were used for the pathway analysis. Top 10 genomic and proteomic pathways from each database with FDR below 0.05 and at least three enriched genes present were selected. The overlaps across different cellular and EV datasets are outlined in the barplot made using the ggplot2 package in R. Similarly, we also performed the gene set enrichment analysis using our RNAseq result for hallmark cancer pathways. GSVA scores are generated per sample using the GSVA program in R. Wilcoxon test is then performed to identify significant GSVA scores across both cell and EV samples with a cutoff of 0.05 FDR. Of which, pathways with GSVA score differences greater than 0.1 across the cell and EV samples are then represented in a heatmap. Finally, upon the acceptance of manuscript our data and results will be uploaded to GEO and will be openly available.