From raw FASTQ files off the sequencer to a fully annotated UMAP — every step explained with commands, file formats, and the reasoning behind each decision.
Contents
1. Overview of the scRNA-seq Workflow 2. File Formats You Will Encounter 3. Step 1 — Alignment with Cell Ranger 4. Step 2 — Loading Data into Seurat 5. Step 3 — Quality Control (QC) 6. Step 4 — Normalization 7. Step 5 — Highly Variable Genes 8. Step 6 — Scaling and PCA 9. Step 7 — Clustering 10. Step 8 — UMAP Visualization 11. Step 9 — Finding Marker Genes 12. Step 10 — Cell Type Annotation 13. Sharing Your scRNA-seq DataSingle-cell RNA sequencing (scRNA-seq) lets you measure gene expression in thousands of individual cells simultaneously — rather than averaging across a bulk population. This reveals cell heterogeneity, rare cell types, and cell state transitions that bulk RNA-seq completely misses.
The full analysis pipeline looks like this:
| Stage | Tool | Input | Output |
|---|---|---|---|
| Alignment & counting | Cell Ranger | FASTQ files | MEX count matrix |
| QC & filtering | Seurat / Scanpy | MEX matrix | Filtered Seurat object |
| Normalization | Seurat | Raw counts | Normalized counts |
| Dimensionality reduction | Seurat | Normalized | PCA embeddings |
| Clustering | Seurat | PCA | Cell clusters |
| Visualization | Seurat | PCA/clusters | UMAP plot |
| Annotation | Manual / SingleR | Marker genes | Cell type labels |
Before touching any code, understand the files you are working with. Confusion about formats is the most common source of errors in scRNA-seq workflows.
The raw output from the sequencer. For 10x Chromium, each sample produces three FASTQ files per lane:
| File | Read | Content | Length |
|---|---|---|---|
*_R1_001.fastq.gz | Read 1 | Cell barcode (16 bp) + UMI (12 bp) | 28 bp |
*_R2_001.fastq.gz | Read 2 | cDNA insert (actual mRNA sequence) | 90–150 bp |
*_I1_001.fastq.gz | Index 1 | Sample index (for demultiplexing) | 8–10 bp |
After Cell Ranger runs, it produces a folder called filtered_feature_bc_matrix/ containing three files. This is called the MEX (Market Exchange Format):
| File | What It Contains | Format |
|---|---|---|
matrix.mtx.gz | Sparse count matrix (genes × cells) | Matrix Market sparse format |
barcodes.tsv.gz | Cell barcodes (one per line) | Plain text, one barcode per row |
features.tsv.gz | Gene IDs and names | Tab-separated: Ensembl ID, gene name, type |
These three files together represent a sparse matrix where rows are genes, columns are cells (barcodes), and values are UMI counts. Most of the values are zero — scRNA-seq data is extremely sparse, typically 90–95% zeros.
Cell Ranger also outputs a filtered_feature_bc_matrix.h5 file. This is the same data as the MEX folder but in a single HDF5 binary file. It loads faster and is easier to transfer. Both Seurat and Scanpy can read it directly.
After loading into R and running any processing steps, Seurat stores everything — raw counts, normalized counts, PCA, UMAP coordinates, cluster assignments, metadata — in a single R object saved as an .rds file. This is your primary working file throughout the analysis.
The Python equivalent of the Seurat object, used by Scanpy. If your collaborators use Python while you use R, you may need to convert between formats using SeuratDisk or zellkonverter.
Cell Ranger is 10x Genomics' proprietary pipeline that aligns FASTQ reads to a reference genome and generates the count matrix. It handles barcode correction, UMI deduplication, and cell calling automatically.
PATHrefdata-gex-GRCh38-2024-A for human)_S1_L001_R1_001.fastq.gz pattern)
# cellranger count — aligns reads and generates count matrix
cellranger count \
--id=sample_PBMC \ # Output folder name (anything you choose)
--transcriptome=/data/refs/refdata-gex-GRCh38-2024-A \ # Path to reference
--fastqs=/data/fastqs/PBMC/ \ # Directory containing your FASTQ files
--sample=PBMC_sample1 \ # The sample prefix in FASTQ filenames
--localcores=16 \ # Number of CPU cores to use
--localmem=64 # RAM in GB (Cell Ranger needs a lot — 32 GB minimum)
| Flag | Why It Matters |
|---|---|
--id | Name of the output directory. All results go here. Use descriptive names like patient01_tumor. |
--transcriptome | Pre-built reference package. Must match your organism and genome build. Using the wrong reference silently gives wrong gene names. |
--fastqs | Directory containing your FASTQs. Cell Ranger will find all files automatically based on naming conventions. |
--sample | The sample name prefix in the FASTQ filename. If your files are named PBMC_sample1_S1_L001_R1_001.fastq.gz, the sample name is PBMC_sample1. |
--localcores | Threads to use. More = faster. Don't exceed the number of physical cores on your machine. |
--localmem | RAM in GB. Cell Ranger needs 16–64 GB depending on dataset size. Running out of memory causes cryptic failures. |
sample_PBMC/outs/
├── filtered_feature_bc_matrix/ # MEX format — use this for analysis
│ ├── matrix.mtx.gz
│ ├── barcodes.tsv.gz
│ └── features.tsv.gz
├── raw_feature_bc_matrix/ # All barcodes (including empty droplets)
├── filtered_feature_bc_matrix.h5 # Same as MEX folder, single file
├── web_summary.html # QC report — CHECK THIS FIRST
├── metrics_summary.csv # Key QC metrics
└── molecule_info.h5 # Per-molecule data (needed for multi-sample)
web_summary.html first. Open it in a browser before doing anything else. Key metrics to check: Estimated Number of Cells (should match expected cell count), Median Genes per Cell (typically 1,000–5,000 for good data), and Sequencing Saturation (above 70% is good — means most transcripts were captured).
Once Cell Ranger finishes, open R and load the data into a Seurat object. All downstream analysis happens within this object.
# Install Seurat if you haven't already
install.packages("Seurat")
# Load the library
library(Seurat)
# Read10X reads the three MEX files (matrix.mtx, barcodes.tsv, features.tsv)
# Point it to the filtered_feature_bc_matrix directory
counts <- Read10X(
data.dir = "sample_PBMC/outs/filtered_feature_bc_matrix/"
)
# Create a Seurat object from the count matrix
seurat_obj <- CreateSeuratObject(
counts = counts,
project = "PBMC_tutorial", # Project name — appears in plots
min.cells = 3, # Keep genes detected in at least 3 cells
min.features = 200 # Keep cells with at least 200 genes detected
)
# Check what you have
seurat_obj
# Output: An object of class Seurat
# 33538 features across 8615 samples within 1 assay
min.cells and min.features domin.cells = 3 removes genes that appear in fewer than 3 cells. These are likely noise or sequencing errors, and keeping them inflates your feature space unnecessarily. min.features = 200 removes cells with fewer than 200 genes detected — these are typically empty droplets or debris that Cell Ranger failed to filter out.
counts <- Read10X_h5("sample_PBMC/outs/filtered_feature_bc_matrix.h5")
Even after Cell Ranger's filtering, your data contains low-quality cells that will corrupt your clustering if not removed. The three standard QC metrics are:
| Metric | What It Measures | Low Value Means | High Value Means |
|---|---|---|---|
nFeature_RNA | Number of unique genes per cell | Empty droplet or dead cell | Possible doublet (two cells in one droplet) |
nCount_RNA | Total UMI counts per cell | Low-quality / dying cell | Possible doublet |
percent.mt | % reads from mitochondrial genes | Healthy cell | Dead or damaged cell (cytoplasm leaked out) |
# Calculate mitochondrial gene percentage
# Human mito genes start with "MT-", mouse start with "mt-"
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(
seurat_obj,
pattern = "^MT-"
)
# Visualise QC metrics to decide on thresholds
VlnPlot(
seurat_obj,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3,
pt.size = 0.1 # Show individual cell points (set to 0 for large datasets)
)
# Filter cells based on QC thresholds
seurat_obj <- subset(
seurat_obj,
subset = nFeature_RNA > 200 & # Remove cells with too few genes
nFeature_RNA < 6000 & # Remove likely doublets
percent.mt < 20 # Remove dead/damaged cells
)
# Check how many cells remain after filtering
ncol(seurat_obj) # Should typically retain 70-90% of cells
There is no universal cutoff. Look at the violin plots and choose thresholds that remove the obvious outlier tails. A few guidelines:
Raw UMI counts are biased by sequencing depth — a cell with 10,000 UMIs will appear to express every gene at twice the level of an identical cell with 5,000 UMIs. Normalization removes this technical bias so we can compare gene expression across cells.
# NormalizeData divides each cell's counts by the total counts for that cell,
# multiplies by scale.factor (10,000 by default — makes it "counts per 10k"),
# then applies log1p transformation: log(x + 1)
seurat_obj <- NormalizeData(
seurat_obj,
normalization.method = "LogNormalize",
scale.factor = 10000
)
# After normalization, data is stored in seurat_obj[["RNA"]]@data
# Raw counts are still in seurat_obj[["RNA"]]@counts (never overwritten)
Raw count distributions are highly skewed — a few highly expressed genes (like ribosomal genes) dominate. The log transformation compresses this skew, bringing expression values into a roughly normal distribution that works better for linear methods like PCA. The +1 prevents log(0) errors for genes with zero counts.
# SCTransform is a more statistically principled normalization
# It uses regularized negative binomial regression to remove
# variation due to sequencing depth while preserving biological signal
# Recommended for datasets with high variation in cell depth
seurat_obj <- SCTransform(
seurat_obj,
vars.to.regress = "percent.mt", # Also regress out mito % while you're at it
verbose = FALSE
)
SCT assay instead of RNA.
Your dataset has ~20,000+ genes but most are either not expressed or expressed uniformly across all cells (housekeeping genes). These uninformative genes add noise to downstream analysis. We select only the most highly variable genes (HVGs) — genes whose expression varies substantially between cells — for dimensionality reduction and clustering.
# FindVariableFeatures selects the top 2000 most variable genes by default
# Method "vst" (variance-stabilizing transformation) is recommended
seurat_obj <- FindVariableFeatures(
seurat_obj,
selection.method = "vst",
nfeatures = 2000 # 2000 is standard; some analyses use 3000-5000
)
# See which genes were selected
top20 <- head(VariableFeatures(seurat_obj), 20)
print(top20)
# These will typically include marker genes for the major cell types in your sample
# Visualise — shows mean expression vs variance for all genes
VariableFeaturePlot(seurat_obj)
The VST method fits a local polynomial regression of log(variance) ~ log(mean) across genes, then standardizes each gene's variance using the expected variance from the fitted model. Genes that deviate upward from the expected variance are "variable" — they carry cell-type information rather than just technical noise.
Before PCA, we scale the data so that highly expressed genes do not dominate simply because of their absolute expression level.
# ScaleData centers and scales each gene to mean=0, variance=1
# Only run on the variable genes for speed
all_genes <- rownames(seurat_obj)
seurat_obj <- ScaleData(seurat_obj, features = all_genes)
# PCA on the scaled variable genes
seurat_obj <- RunPCA(
seurat_obj,
features = VariableFeatures(seurat_obj),
npcs = 50, # Compute 50 principal components
verbose = FALSE
)
# How many PCs are meaningful? Use ElbowPlot
ElbowPlot(seurat_obj, ndims = 50)
# Look for where the curve flattens — that's your "elbow"
# Typically between PC 10-30. Choose a value slightly past the elbow.
This is one of the most important parameter choices in the pipeline. Too few PCs and you miss biological variation. Too many and you include noise. The elbow plot shows the variance explained by each PC — use the number of PCs at the inflection point. A common approach is to also try JackStraw resampling for a statistical assessment, though this is slow on large datasets.
Seurat uses a graph-based clustering approach. It first builds a K-nearest neighbour (KNN) graph in PCA space, then applies the Louvain or Leiden algorithm to find communities (clusters) within the graph.
# FindNeighbors builds the KNN graph using the selected PCs
# dims = 1:20 means we use PC1 through PC20
seurat_obj <- FindNeighbors(
seurat_obj,
dims = 1:20 # Use the number of PCs you chose from the elbow plot
)
# FindClusters runs the Louvain algorithm on the KNN graph
# resolution controls cluster granularity:
# lower (0.1-0.3) = fewer, larger clusters
# higher (0.8-1.5) = more, smaller clusters
seurat_obj <- FindClusters(
seurat_obj,
resolution = 0.5 # Start with 0.5, adjust based on biology
)
# Check how many clusters were found
table(seurat_obj$seurat_clusters)
# Each number (0, 1, 2, ...) is a cluster
# The largest cluster is always numbered 0
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that projects your thousands of PCs down to 2 dimensions for visualization. Critically, UMAP is for visualization only — clustering decisions should be made in PCA space, not based on UMAP proximity.
# RunUMAP computes 2D coordinates for visualization
seurat_obj <- RunUMAP(
seurat_obj,
dims = 1:20, # Same dims as FindNeighbors — must be consistent
seed.use = 42 # Set seed for reproducibility (UMAP is stochastic)
)
# Plot UMAP colored by cluster
DimPlot(
seurat_obj,
reduction = "umap",
label = TRUE, # Add cluster number labels on plot
label.size = 5,
pt.size = 0.5
)
# Overlay QC metrics on UMAP to spot remaining low-quality cells
FeaturePlot(
seurat_obj,
features = c("nFeature_RNA", "percent.mt"),
ncol = 2
)
To identify what each cluster is, we find genes that are differentially expressed in each cluster compared to all other clusters. These are your marker genes.
# FindAllMarkers tests each cluster vs all others
# Wilcoxon rank-sum test is the default (fast and robust)
markers <- FindAllMarkers(
seurat_obj,
only.pos = TRUE, # Only return upregulated markers (positive fold change)
min.pct = 0.25, # Gene must be detected in 25% of cells in the cluster
logfc.threshold = 0.25 # Minimum log fold-change threshold
)
# Get top 5 markers per cluster
top5 <- markers %>%
group_by(cluster) %>%
top_n(n = 5, wt = avg_log2FC)
print(top5)
# Heatmap of top markers
DoHeatmap(
seurat_obj,
features = top5$gene
) + NoLegend()
| Column | Meaning |
|---|---|
p_val | Unadjusted p-value. Don't use this. |
avg_log2FC | Log2 fold change of expression in this cluster vs others. Higher = more specific. |
pct.1 | Fraction of cells in this cluster expressing the gene. |
pct.2 | Fraction of cells in OTHER clusters expressing the gene. |
p_val_adj | Bonferroni-corrected p-value. Use this for significance. <0.05 is significant. |
cluster | Which cluster this marker belongs to. |
gene | Gene name. |
A good marker gene has high avg_log2FC, high pct.1, low pct.2, and low p_val_adj. Genes with high pct.1 AND high pct.2 are not specific — they are expressed in many cell types and won't help you annotate.
The final step is assigning biological identities to clusters using your marker genes and known cell type signatures.
The most reliable approach is cross-referencing your top markers with known cell-type marker genes from literature. For human PBMCs, canonical markers are:
| Cell Type | Key Marker Genes |
|---|---|
| CD4+ T cells | CD3D, CD3E, CD4, IL7R, CCR7 |
| CD8+ T cells | CD3D, CD8A, CD8B, GZMB, PRF1 |
| NK cells | GNLY, NKG7, NCAM1, KLRD1 |
| B cells | CD19, MS4A1, CD79A, CD79B |
| Monocytes (classical) | CD14, LYZ, S100A8, S100A9 |
| Monocytes (non-classical) | FCGR3A, MS4A7, CX3CR1 |
| Dendritic cells | FCER1A, CST3, IL3RA |
| Platelets | PPBP, PF4, GP9 |
# Visualise known markers on UMAP to confirm cluster identities
FeaturePlot(
seurat_obj,
features = c("CD3D", "CD14", "CD19", "GNLY", "PPBP", "FCER1A"),
ncol = 3
)
# Once you've decided on identities, rename the clusters
new_labels <- c(
"0" = "CD14+ Monocytes",
"1" = "CD4+ T cells",
"2" = "CD8+ T cells",
"3" = "NK cells",
"4" = "B cells",
"5" = "CD16+ Monocytes",
"6" = "Dendritic cells",
"7" = "Platelets"
)
seurat_obj <- RenameIdents(seurat_obj, new_labels)
# Final annotated UMAP
DimPlot(
seurat_obj,
reduction = "umap",
label = TRUE,
repel = TRUE, # Prevents label overlaps
pt.size = 0.5
)
# Save your Seurat object — don't lose your work!
saveRDS(seurat_obj, file = "PBMC_annotated.rds")
SingleR package, which automatically annotates cells by comparing expression profiles to reference datasets (Human Cell Atlas, ImmGen, etc.). It is not a replacement for manual review but a useful starting point.
scRNA-seq projects generate large files at multiple stages, each of which you may need to share with collaborators:
| Stage | Files to Share | Typical Size | Encryption? |
|---|---|---|---|
| Raw data | FASTQ files | 20–200 GB | Yes (if human subjects) |
| Cell Ranger output | filtered_feature_bc_matrix/ folder or .h5 file | 100 MB – 5 GB | Depends on IRB |
| Seurat object | .rds file | 500 MB – 10 GB | Usually not required |
| Count matrix only | .h5ad or .csv.gz | 50 MB – 2 GB | Usually not required |
For raw FASTQ files from human subjects, use BioTransfer's Secure Transfer mode to ensure end-to-end encryption. For Cell Ranger outputs and Seurat objects derived from de-identified data, standard transfer is typically sufficient. Always confirm with your IRB protocol what level of protection is required at each data tier.
Use BioTransfer's batch/folder transfer to send the full filtered_feature_bc_matrix/ directory in one go, preserving the three-file structure Cell Ranger outputs.
Large FASTQ and Seurat objects transferred with integrity verification and optional E2EE.
Start a Transfer