scRNA-seq Analysis: Complete Step-by-Step Tutorial

Overview of the scRNA-seq Workflow

Single-cell RNA sequencing (scRNA-seq) lets you measure gene expression in thousands of individual cells simultaneously — rather than averaging across a bulk population. This reveals cell heterogeneity, rare cell types, and cell state transitions that bulk RNA-seq completely misses.

The full analysis pipeline looks like this:

Stage	Tool	Input	Output
Alignment & counting	Cell Ranger	FASTQ files	MEX count matrix
QC & filtering	Seurat / Scanpy	MEX matrix	Filtered Seurat object
Normalization	Seurat	Raw counts	Normalized counts
Dimensionality reduction	Seurat	Normalized	PCA embeddings
Clustering	Seurat	PCA	Cell clusters
Visualization	Seurat	PCA/clusters	UMAP plot
Annotation	Manual / SingleR	Marker genes	Cell type labels

Platform note: This tutorial uses 10x Genomics Chromium data (the most common platform). The upstream steps (Cell Ranger) are 10x-specific. The downstream Seurat steps work with data from any platform.

File Formats You Will Encounter

Before touching any code, understand the files you are working with. Confusion about formats is the most common source of errors in scRNA-seq workflows.

FASTQ (.fastq.gz)

The raw output from the sequencer. For 10x Chromium, each sample produces three FASTQ files per lane:

File	Read	Content	Length
`*_R1_001.fastq.gz`	Read 1	Cell barcode (16 bp) + UMI (12 bp)	28 bp
`*_R2_001.fastq.gz`	Read 2	cDNA insert (actual mRNA sequence)	90–150 bp
`*_I1_001.fastq.gz`	Index 1	Sample index (for demultiplexing)	8–10 bp

Common mistake: R1 is NOT the biological read — it contains only the barcode and UMI. R2 contains the actual mRNA sequence. If you try to align R1 to a genome you will get near-zero mapping rates.

MEX Format (Cell Ranger output)

After Cell Ranger runs, it produces a folder called filtered_feature_bc_matrix/ containing three files. This is called the MEX (Market Exchange Format):

File	What It Contains	Format
`matrix.mtx.gz`	Sparse count matrix (genes × cells)	Matrix Market sparse format
`barcodes.tsv.gz`	Cell barcodes (one per line)	Plain text, one barcode per row
`features.tsv.gz`	Gene IDs and names	Tab-separated: Ensembl ID, gene name, type

These three files together represent a sparse matrix where rows are genes, columns are cells (barcodes), and values are UMI counts. Most of the values are zero — scRNA-seq data is extremely sparse, typically 90–95% zeros.

HDF5 / H5 (.h5)

Cell Ranger also outputs a filtered_feature_bc_matrix.h5 file. This is the same data as the MEX folder but in a single HDF5 binary file. It loads faster and is easier to transfer. Both Seurat and Scanpy can read it directly.

Seurat Object (.rds)

After loading into R and running any processing steps, Seurat stores everything — raw counts, normalized counts, PCA, UMAP coordinates, cluster assignments, metadata — in a single R object saved as an .rds file. This is your primary working file throughout the analysis.

AnnData (.h5ad)

The Python equivalent of the Seurat object, used by Scanpy. If your collaborators use Python while you use R, you may need to convert between formats using SeuratDisk or zellkonverter.

Step 1 Alignment with Cell Ranger

Cell Ranger is 10x Genomics' proprietary pipeline that aligns FASTQ reads to a reference genome and generates the count matrix. It handles barcode correction, UMI deduplication, and cell calling automatically.

What you need before running Cell Ranger

Cell Ranger installed and in your PATH
A reference genome package — download from 10x Genomics website (e.g., refdata-gex-GRCh38-2024-A for human)
Your FASTQ files, named according to 10x conventions (with _S1_L001_R1_001.fastq.gz pattern)

The core Cell Ranger command



# cellranger count — aligns reads and generates count matrix
cellranger count \
  --id=sample_PBMC \          # Output folder name (anything you choose)
  --transcriptome=/data/refs/refdata-gex-GRCh38-2024-A \ # Path to reference
  --fastqs=/data/fastqs/PBMC/ \ # Directory containing your FASTQ files
  --sample=PBMC_sample1 \      # The sample prefix in FASTQ filenames
  --localcores=16 \            # Number of CPU cores to use
  --localmem=64               # RAM in GB (Cell Ranger needs a lot — 32 GB minimum)

What each flag means

Flag	Why It Matters
`--id`	Name of the output directory. All results go here. Use descriptive names like `patient01_tumor`.
`--transcriptome`	Pre-built reference package. Must match your organism and genome build. Using the wrong reference silently gives wrong gene names.
`--fastqs`	Directory containing your FASTQs. Cell Ranger will find all files automatically based on naming conventions.
`--sample`	The sample name prefix in the FASTQ filename. If your files are named `PBMC_sample1_S1_L001_R1_001.fastq.gz`, the sample name is `PBMC_sample1`.
`--localcores`	Threads to use. More = faster. Don't exceed the number of physical cores on your machine.
`--localmem`	RAM in GB. Cell Ranger needs 16–64 GB depending on dataset size. Running out of memory causes cryptic failures.

Cell Ranger output structure

sample_PBMC/outs/
├── filtered_feature_bc_matrix/     # MEX format — use this for analysis
│   ├── matrix.mtx.gz
│   ├── barcodes.tsv.gz
│   └── features.tsv.gz
├── raw_feature_bc_matrix/          # All barcodes (including empty droplets)
├── filtered_feature_bc_matrix.h5  # Same as MEX folder, single file
├── web_summary.html               # QC report — CHECK THIS FIRST
├── metrics_summary.csv            # Key QC metrics
└── molecule_info.h5               # Per-molecule data (needed for multi-sample)

Always check web_summary.html first. Open it in a browser before doing anything else. Key metrics to check: Estimated Number of Cells (should match expected cell count), Median Genes per Cell (typically 1,000–5,000 for good data), and Sequencing Saturation (above 70% is good — means most transcripts were captured).

Step 2 Loading Data into Seurat

Once Cell Ranger finishes, open R and load the data into a Seurat object. All downstream analysis happens within this object.

# Install Seurat if you haven't already
install.packages("Seurat")

# Load the library
library(Seurat)

# Read10X reads the three MEX files (matrix.mtx, barcodes.tsv, features.tsv)
# Point it to the filtered_feature_bc_matrix directory
counts <- Read10X(
  data.dir = "sample_PBMC/outs/filtered_feature_bc_matrix/"
)

# Create a Seurat object from the count matrix
seurat_obj <- CreateSeuratObject(
  counts = counts,
  project = "PBMC_tutorial",  # Project name — appears in plots
  min.cells = 3,              # Keep genes detected in at least 3 cells
  min.features = 200          # Keep cells with at least 200 genes detected
)

# Check what you have
seurat_obj
# Output: An object of class Seurat
# 33538 features across 8615 samples within 1 assay

What `min.cells` and `min.features` do

min.cells = 3 removes genes that appear in fewer than 3 cells. These are likely noise or sequencing errors, and keeping them inflates your feature space unnecessarily. min.features = 200 removes cells with fewer than 200 genes detected — these are typically empty droplets or debris that Cell Ranger failed to filter out.

Alternative — Load from .h5 file: If you prefer the single H5 file:

counts <- Read10X_h5("sample_PBMC/outs/filtered_feature_bc_matrix.h5")

Step 3 Quality Control (QC)

Even after Cell Ranger's filtering, your data contains low-quality cells that will corrupt your clustering if not removed. The three standard QC metrics are:

Metric	What It Measures	Low Value Means	High Value Means
`nFeature_RNA`	Number of unique genes per cell	Empty droplet or dead cell	Possible doublet (two cells in one droplet)
`nCount_RNA`	Total UMI counts per cell	Low-quality / dying cell	Possible doublet
`percent.mt`	% reads from mitochondrial genes	Healthy cell	Dead or damaged cell (cytoplasm leaked out)

# Calculate mitochondrial gene percentage
# Human mito genes start with "MT-", mouse start with "mt-"
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(
  seurat_obj,
  pattern = "^MT-"
)

# Visualise QC metrics to decide on thresholds
VlnPlot(
  seurat_obj,
  features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
  ncol = 3,
  pt.size = 0.1   # Show individual cell points (set to 0 for large datasets)
)

# Filter cells based on QC thresholds
seurat_obj <- subset(
  seurat_obj,
  subset = nFeature_RNA > 200 &    # Remove cells with too few genes
           nFeature_RNA < 6000 &   # Remove likely doublets
           percent.mt < 20         # Remove dead/damaged cells
)

# Check how many cells remain after filtering
ncol(seurat_obj)  # Should typically retain 70-90% of cells

How to choose QC thresholds

There is no universal cutoff. Look at the violin plots and choose thresholds that remove the obvious outlier tails. A few guidelines:

nFeature_RNA upper limit: Look for a gap or inflection in the distribution. Cells with >2x the median are likely doublets. Common cutoffs: 4,000–8,000 for most tissues.
percent.mt: In most tissues, <20% is standard. In metabolically active cells (neurons, cardiomyocytes), mitochondrial content is naturally higher — consider relaxing to 25–30%.
Never use the same thresholds for every dataset — always look at your data first.

Step 4 Normalization

Raw UMI counts are biased by sequencing depth — a cell with 10,000 UMIs will appear to express every gene at twice the level of an identical cell with 5,000 UMIs. Normalization removes this technical bias so we can compare gene expression across cells.

Log normalization (standard approach)

# NormalizeData divides each cell's counts by the total counts for that cell,
# multiplies by scale.factor (10,000 by default — makes it "counts per 10k"),
# then applies log1p transformation: log(x + 1)
seurat_obj <- NormalizeData(
  seurat_obj,
  normalization.method = "LogNormalize",
  scale.factor = 10000
)

# After normalization, data is stored in seurat_obj[["RNA"]]@data
# Raw counts are still in seurat_obj[["RNA"]]@counts (never overwritten)

Why log1p and not just division?

Raw count distributions are highly skewed — a few highly expressed genes (like ribosomal genes) dominate. The log transformation compresses this skew, bringing expression values into a roughly normal distribution that works better for linear methods like PCA. The +1 prevents log(0) errors for genes with zero counts.

SCTransform — the modern alternative

# SCTransform is a more statistically principled normalization
# It uses regularized negative binomial regression to remove
# variation due to sequencing depth while preserving biological signal
# Recommended for datasets with high variation in cell depth
seurat_obj <- SCTransform(
  seurat_obj,
  vars.to.regress = "percent.mt",  # Also regress out mito % while you're at it
  verbose = FALSE
)

SCTransform vs LogNormalize: SCTransform is more principled but slower and memory-intensive. For standard PBMC/tissue datasets, LogNormalize works well. For datasets with strong depth variation or if you want to regress covariates, use SCTransform. If you use SCTransform, the downstream steps use SCT assay instead of RNA.

Step 5 Highly Variable Genes (HVGs)

Your dataset has ~20,000+ genes but most are either not expressed or expressed uniformly across all cells (housekeeping genes). These uninformative genes add noise to downstream analysis. We select only the most highly variable genes (HVGs) — genes whose expression varies substantially between cells — for dimensionality reduction and clustering.

# FindVariableFeatures selects the top 2000 most variable genes by default
# Method "vst" (variance-stabilizing transformation) is recommended
seurat_obj <- FindVariableFeatures(
  seurat_obj,
  selection.method = "vst",
  nfeatures = 2000    # 2000 is standard; some analyses use 3000-5000
)

# See which genes were selected
top20 <- head(VariableFeatures(seurat_obj), 20)
print(top20)
# These will typically include marker genes for the major cell types in your sample

# Visualise — shows mean expression vs variance for all genes
VariableFeaturePlot(seurat_obj)

The VST method fits a local polynomial regression of log(variance) ~ log(mean) across genes, then standardizes each gene's variance using the expected variance from the fitted model. Genes that deviate upward from the expected variance are "variable" — they carry cell-type information rather than just technical noise.

Step 6 Scaling and PCA

Before PCA, we scale the data so that highly expressed genes do not dominate simply because of their absolute expression level.

# ScaleData centers and scales each gene to mean=0, variance=1
# Only run on the variable genes for speed
all_genes <- rownames(seurat_obj)
seurat_obj <- ScaleData(seurat_obj, features = all_genes)

# PCA on the scaled variable genes
seurat_obj <- RunPCA(
  seurat_obj,
  features = VariableFeatures(seurat_obj),
  npcs = 50,       # Compute 50 principal components
  verbose = FALSE
)

# How many PCs are meaningful? Use ElbowPlot
ElbowPlot(seurat_obj, ndims = 50)
# Look for where the curve flattens — that's your "elbow"
# Typically between PC 10-30. Choose a value slightly past the elbow.

Choosing the number of PCs

This is one of the most important parameter choices in the pipeline. Too few PCs and you miss biological variation. Too many and you include noise. The elbow plot shows the variance explained by each PC — use the number of PCs at the inflection point. A common approach is to also try JackStraw resampling for a statistical assessment, though this is slow on large datasets.

Step 7 Clustering

Seurat uses a graph-based clustering approach. It first builds a K-nearest neighbour (KNN) graph in PCA space, then applies the Louvain or Leiden algorithm to find communities (clusters) within the graph.

# FindNeighbors builds the KNN graph using the selected PCs
# dims = 1:20 means we use PC1 through PC20
seurat_obj <- FindNeighbors(
  seurat_obj,
  dims = 1:20    # Use the number of PCs you chose from the elbow plot
)

# FindClusters runs the Louvain algorithm on the KNN graph
# resolution controls cluster granularity:
#   lower (0.1-0.3) = fewer, larger clusters
#   higher (0.8-1.5) = more, smaller clusters
seurat_obj <- FindClusters(
  seurat_obj,
  resolution = 0.5   # Start with 0.5, adjust based on biology
)

# Check how many clusters were found
table(seurat_obj$seurat_clusters)
# Each number (0, 1, 2, ...) is a cluster
# The largest cluster is always numbered 0

Resolution is the most subjective parameter in scRNA-seq. There is no "correct" resolution — it depends on the biological question. For identifying major cell types (T cells, B cells, macrophages), use lower resolution (0.3–0.5). For resolving subtypes within a lineage, use higher resolution (0.8–1.2). Try multiple resolutions and compare with known marker genes.

Step 8 UMAP Visualization

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that projects your thousands of PCs down to 2 dimensions for visualization. Critically, UMAP is for visualization only — clustering decisions should be made in PCA space, not based on UMAP proximity.

# RunUMAP computes 2D coordinates for visualization
seurat_obj <- RunUMAP(
  seurat_obj,
  dims = 1:20,    # Same dims as FindNeighbors — must be consistent
  seed.use = 42   # Set seed for reproducibility (UMAP is stochastic)
)

# Plot UMAP colored by cluster
DimPlot(
  seurat_obj,
  reduction = "umap",
  label = TRUE,        # Add cluster number labels on plot
  label.size = 5,
  pt.size = 0.5
)

# Overlay QC metrics on UMAP to spot remaining low-quality cells
FeaturePlot(
  seurat_obj,
  features = c("nFeature_RNA", "percent.mt"),
  ncol = 2
)

UMAP is not a map — it's a visualization. Distances between clusters in UMAP space are not meaningful. Two clusters that appear close together might be biologically distant and vice versa. Always validate with marker gene expression, not UMAP proximity.

Step 9 Finding Marker Genes

To identify what each cluster is, we find genes that are differentially expressed in each cluster compared to all other clusters. These are your marker genes.

# FindAllMarkers tests each cluster vs all others
# Wilcoxon rank-sum test is the default (fast and robust)
markers <- FindAllMarkers(
  seurat_obj,
  only.pos = TRUE,     # Only return upregulated markers (positive fold change)
  min.pct = 0.25,      # Gene must be detected in 25% of cells in the cluster
  logfc.threshold = 0.25  # Minimum log fold-change threshold
)

# Get top 5 markers per cluster
top5 <- markers %>%
  group_by(cluster) %>%
  top_n(n = 5, wt = avg_log2FC)

print(top5)

# Heatmap of top markers
DoHeatmap(
  seurat_obj,
  features = top5$gene
) + NoLegend()

Understanding the output columns

Column	Meaning
`p_val`	Unadjusted p-value. Don't use this.
`avg_log2FC`	Log2 fold change of expression in this cluster vs others. Higher = more specific.
`pct.1`	Fraction of cells in this cluster expressing the gene.
`pct.2`	Fraction of cells in OTHER clusters expressing the gene.
`p_val_adj`	Bonferroni-corrected p-value. Use this for significance. <0.05 is significant.
`cluster`	Which cluster this marker belongs to.
`gene`	Gene name.

A good marker gene has high avg_log2FC, high pct.1, low pct.2, and low p_val_adj. Genes with high pct.1 AND high pct.2 are not specific — they are expressed in many cell types and won't help you annotate.

Step 10 Cell Type Annotation

The final step is assigning biological identities to clusters using your marker genes and known cell type signatures.

Manual annotation using canonical markers

The most reliable approach is cross-referencing your top markers with known cell-type marker genes from literature. For human PBMCs, canonical markers are:

Cell Type	Key Marker Genes
CD4+ T cells	CD3D, CD3E, CD4, IL7R, CCR7
CD8+ T cells	CD3D, CD8A, CD8B, GZMB, PRF1
NK cells	GNLY, NKG7, NCAM1, KLRD1
B cells	CD19, MS4A1, CD79A, CD79B
Monocytes (classical)	CD14, LYZ, S100A8, S100A9
Monocytes (non-classical)	FCGR3A, MS4A7, CX3CR1
Dendritic cells	FCER1A, CST3, IL3RA
Platelets	PPBP, PF4, GP9

# Visualise known markers on UMAP to confirm cluster identities
FeaturePlot(
  seurat_obj,
  features = c("CD3D", "CD14", "CD19", "GNLY", "PPBP", "FCER1A"),
  ncol = 3
)

# Once you've decided on identities, rename the clusters
new_labels <- c(
  "0" = "CD14+ Monocytes",
  "1" = "CD4+ T cells",
  "2" = "CD8+ T cells",
  "3" = "NK cells",
  "4" = "B cells",
  "5" = "CD16+ Monocytes",
  "6" = "Dendritic cells",
  "7" = "Platelets"
)

seurat_obj <- RenameIdents(seurat_obj, new_labels)

# Final annotated UMAP
DimPlot(
  seurat_obj,
  reduction = "umap",
  label = TRUE,
  repel = TRUE,    # Prevents label overlaps
  pt.size = 0.5
)

# Save your Seurat object — don't lose your work!
saveRDS(seurat_obj, file = "PBMC_annotated.rds")

Automated annotation with SingleR: For large datasets or unfamiliar tissues, consider the SingleR package, which automatically annotates cells by comparing expression profiles to reference datasets (Human Cell Atlas, ImmGen, etc.). It is not a replacement for manual review but a useful starting point.

Sharing Your scRNA-seq Data

scRNA-seq projects generate large files at multiple stages, each of which you may need to share with collaborators:

Stage	Files to Share	Typical Size	Encryption?
Raw data	FASTQ files	20–200 GB	Yes (if human subjects)
Cell Ranger output	filtered_feature_bc_matrix/ folder or .h5 file	100 MB – 5 GB	Depends on IRB
Seurat object	.rds file	500 MB – 10 GB	Usually not required
Count matrix only	.h5ad or .csv.gz	50 MB – 2 GB	Usually not required

For raw FASTQ files from human subjects, use BioTransfer's Secure Transfer mode to ensure end-to-end encryption. For Cell Ranger outputs and Seurat objects derived from de-identified data, standard transfer is typically sufficient. Always confirm with your IRB protocol what level of protection is required at each data tier.

Use BioTransfer's batch/folder transfer to send the full filtered_feature_bc_matrix/ directory in one go, preserving the three-file structure Cell Ranger outputs.