RNA-Seq Data Sharing Best Practices for Researchers

Why RNA-Seq Data Sharing Is Complicated

RNA sequencing has become one of the most widely used techniques in molecular biology, enabling researchers to profile gene expression across entire transcriptomes. But the datasets it produces are large, complex, and contextually sensitive in ways that make sharing genuinely challenging.

A typical bulk RNA-Seq experiment with 12 samples might generate 50–150 GB of raw FASTQ data. A single-cell RNA-Seq (scRNA-Seq) experiment with tens of thousands of cells can produce terabytes of data across multiple file types. When this data needs to move from a sequencing core to an analysis lab, from one institution to a collaborator, or from a lab to a public repository, the transfer must preserve not just the files but the metadata required to make sense of them.

Beyond scale, RNA-Seq data from human subjects carries privacy implications. Transcriptome profiles can reveal information about disease state, immune status, and — for studies involving human primary cells — potentially re-identifiable genomic variants present in the RNA-Seq reads. This adds a security dimension to what might otherwise seem like a routine data handoff.

RNA-Seq File Types: What You Will Be Transferring

File Type	Description	Typical Size
`.fastq.gz`	Raw reads, compressed. Primary output from sequencer.	5–30 GB / sample
`.bam`	Aligned reads. Output from STAR, HISAT2, etc.	3–15 GB / sample
`.bai`	BAM index file. Required alongside .bam.	<10 MB
`.tsv / .csv`	Count matrices. Output from featureCounts, HTSeq, etc.	1–100 MB
`.h5ad / .loom`	Single-cell expression matrices (AnnData, Loom formats).	100 MB – 10 GB
`.rds`	R data objects. Seurat objects, DESeq2 results, etc.	100 MB – 5 GB
`.gtf / .gff`	Genome annotation files used for alignment and quantification.	50–300 MB

When transferring data to a collaborator, always consider which of these files they actually need. Raw FASTQ files give maximum flexibility but require significant compute for re-processing. Count matrices or R objects are far smaller and immediately usable if the collaborator trusts your pre-processing pipeline. Communicate clearly which pipeline versions and reference genome builds were used.

Metadata: The Most Undervalued Part of RNA-Seq Sharing

Many failed collaborations trace back not to the data itself but to incomplete metadata. A count matrix without a sample sheet is nearly useless. A FASTQ file without information about the library preparation kit, read length, or strandedness may require guesswork that introduces errors in downstream analysis.

What Metadata to Include

Sample sheet: One row per sample with: sample ID, condition/group, batch, replicate number, tissue/cell type, organism, and any relevant clinical covariates.
Sequencing parameters: Platform (Illumina NovaSeq, etc.), read length (e.g., 2×150 bp), sequencing depth (average reads per sample), strandedness (unstranded, forward, reverse).
Library preparation: Kit name and version (e.g., Illumina TruSeq Stranded mRNA), any modifications (e.g., poly-A selection vs. ribo-depletion).
Pre-processing pipeline: Trimming tool and parameters (e.g., Trim Galore v0.6.7), aligner and version (e.g., STAR 2.7.10a), reference genome and annotation version (e.g., GRCh38 / Ensembl 107).
Quality metrics: MultiQC report or individual FastQC reports for each sample.
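
A quick automated check of the sample sheet before sending can catch the most common gaps. The following is a minimal sketch in Python, assuming a CSV sample sheet; the column names in REQUIRED_COLUMNS and the function validate_sample_sheet are hypothetical and should be adapted to your own naming conventions.

```python
import csv

# Hypothetical column names (an assumption); adapt to your own sample sheet.
REQUIRED_COLUMNS = ["sample_id", "condition", "batch", "replicate", "tissue", "organism"]


def validate_sample_sheet(csv_path, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in the sample sheet."""
    problems = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in required if col not in header]
        if missing:
            # Without the expected columns, row-level checks are meaningless.
            problems.append(f"missing columns: {', '.join(missing)}")
            return problems
        seen_ids = set()
        for row_number, row in enumerate(reader, start=2):  # row 1 is the header
            sample_id = (row.get("sample_id") or "").strip()
            if sample_id and sample_id in seen_ids:
                problems.append(f"row {row_number}: duplicate sample_id '{sample_id}'")
            seen_ids.add(sample_id)
            for col in required:
                if not (row.get(col) or "").strip():
                    problems.append(f"row {row_number}: empty value in '{col}'")
    return problems
```

An empty list means the sheet passed these basic checks. It does not confirm that the values themselves are correct, so a manual review is still worthwhile.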

A well-documented README file included in the transfer bundle can save a collaborator days of back-and-forth. BioTransfer's batch transfer feature preserves folder structure, so you can organise your transfer as a project directory with subdirectories for raw data, processed data, and metadata.

Integrity Verification: Never Skip This Step

RNA-Seq analysis pipelines are sensitive to data corruption. A truncated or corrupted FASTQ file can cause an aligner such as STAR to fail outright or, worse, to finish with subtly incorrect output. Unlike an obvious crash, silent corruption is especially dangerous because it may only become apparent after weeks of downstream analysis.

Always generate and share MD5 checksums for every file in your transfer. The standard workflow:

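Before transfer: generate checksums with md5sum *.fastq.gz > checksums.md5 (extend the pattern to cover BAM files, count matrices, and any other files in the bundle, not only the FASTQ files)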
Include checksums.md5 in the transfer bundle
After receipt: verify with md5sum -c checksums.md5
Any mismatch indicates corruption — request a re-transfer immediately
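
For bundles with nested folders, the same workflow can be scripted so that every file is covered, not just those matching one pattern. The sketch below uses only the Python standard library; the function names md5_of_file, write_manifest, and verify_manifest are illustrative, and the manifest is written in the "digest, two spaces, relative path" layout so that it should also be readable by md5sum -c when run from the bundle root.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bundle_dir, manifest_name="checksums.md5"):
    """Write one 'digest  relative/path' line per file under bundle_dir."""
    bundle = Path(bundle_dir)
    manifest = bundle / manifest_name
    lines = []
    for path in sorted(p for p in bundle.rglob("*") if p.is_file()):
        if path == manifest:
            continue  # do not checksum the manifest itself
        lines.append(f"{md5_of_file(path)}  {path.relative_to(bundle).as_posix()}")
    manifest.write_text("\n".join(lines) + "\n")
    return manifest


def verify_manifest(bundle_dir, manifest_name="checksums.md5"):
    """Return the relative paths that are missing or whose digest differs."""
    bundle = Path(bundle_dir)
    failures = []
    for line in (bundle / manifest_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split("  ", 1)
        target = bundle / rel_path
        if not target.is_file() or md5_of_file(target) != expected:
            failures.append(rel_path)
    return failures
```

The sender would call write_manifest("project_dir") before uploading, and the recipient would call verify_manifest("project_dir") after download; an empty list means every listed file matched.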

BioTransfer automatically computes an MD5 checksum of each uploaded file and stores it with the transfer record, giving both sender and recipient a verifiable integrity reference without manual checksum generation.

When to Use Secure Encrypted Transfer for RNA-Seq Data

Not all RNA-Seq data requires end-to-end encryption. Here is a practical decision framework:

Mouse / model organism data, no human involvement: Standard transfer (TLS in transit) is generally appropriate. No specific encryption requirement.
Human cell lines (HEK293, HeLa, etc.): Generally low re-identification risk. Standard transfer is acceptable unless your IRB or a line-specific data-use policy (as exists for HeLa genomic data) specifies otherwise.
Human primary cells, patient-derived samples, or clinical specimens: Use Secure Transfer (E2EE) mode. Data may contain re-identifiable genomic variants in RNA reads.
Single-cell RNA-Seq from human subjects: Always use Secure Transfer. scRNA-Seq data is particularly information-dense, and the reads can expose donor-specific genetic variants alongside detailed cell-level phenotypes.
Data governed by a DUA or data sharing agreement that specifies encryption: Use Secure Transfer regardless of data type.
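
The framework above can be written down as a small rule function, which is handy if transfers are scripted or if a lab wants the policy recorded in one place. This is only a sketch of the rules as listed here: the Dataset fields and the recommend_transfer_mode function are hypothetical names, and any IRB or institutional requirement takes precedence over its output.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a dataset; field names are illustrative."""
    human_derived: bool = False                  # any human material at all
    established_cell_line: bool = False          # e.g. HEK293, HeLa
    agreement_requires_encryption: bool = False  # DUA or sharing agreement says so


def recommend_transfer_mode(dataset):
    """Return 'secure' (end-to-end encrypted) or 'standard' (TLS in transit only)."""
    # A DUA or sharing agreement that specifies encryption overrides everything else.
    if dataset.agreement_requires_encryption:
        return "secure"
    # Mouse / model organism data with no human involvement.
    if not dataset.human_derived:
        return "standard"
    # Established human cell lines: standard unless the IRB says otherwise.
    if dataset.established_cell_line:
        return "standard"
    # Primary cells, patient-derived samples, clinical specimens, and
    # scRNA-Seq from human subjects all fall through to secure transfer.
    return "secure"
```

For example, recommend_transfer_mode(Dataset(human_derived=True)) returns "secure", while a mouse dataset with no agreement constraints returns "standard".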

Sharing Data With Public Repositories

Many funding agencies (NIH, Wellcome Trust, ERC) and journals now require raw RNA-Seq data to be deposited in a public repository upon publication. The primary repositories are:

GEO (Gene Expression Omnibus): NCBI's repository for functional genomics data. Accepts raw FASTQ and processed count matrices. Commonly used to meet data-sharing requirements for NIH-funded studies.
ArrayExpress / BioStudies: EMBL-EBI's functional genomics repository, widely used by European researchers and broadly comparable to GEO.
SRA (Sequence Read Archive): The primary archive for raw sequencing reads. GEO submissions are often linked to SRA for the raw FASTQ files.
dbGaP: For controlled-access human genomic data. Typically used when data cannot be fully de-identified or when participant consent restricts open sharing.

BioTransfer is designed for researcher-to-researcher collaboration during the active phase of a project — before public deposition, when data is still being processed and shared with co-investigators. For final public archiving, use the repositories above. For the working transfers that happen throughout a project — sharing raw data with a bioinformatics core, sending processed results to a collaborating PI, distributing a Seurat object to a co-first author — BioTransfer provides the speed, security, and simplicity that institutional FTP and consumer cloud drives cannot.

Organising a Transfer Bundle for Maximum Clarity

When sending RNA-Seq data to a collaborator, structure your transfer as a clear project directory. A recommended layout:

raw_data/ — FASTQ files, one sub-folder per sample
aligned/ — BAM files and .bai indices
counts/ — Count matrices (featureCounts output, etc.)
qc/ — MultiQC HTML report, individual FastQC outputs
metadata/ — Sample sheet (CSV), experimental design notes
README.txt — Pipeline versions, reference genome, key parameters, checksums
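
The layout above can be created and checked with a short script before upload. The sketch below assumes exactly the folder names listed here; create_bundle_skeleton and check_bundle are illustrative helper names, and the index check accepts either sample.bam.bai or sample.bai next to each BAM file.

```python
from pathlib import Path

# Folder and file names taken from the recommended layout above.
EXPECTED_DIRS = ["raw_data", "aligned", "counts", "qc", "metadata"]
EXPECTED_FILES = ["README.txt"]


def create_bundle_skeleton(root):
    """Create the empty project layout; existing folders and files are left alone."""
    root = Path(root)
    for name in EXPECTED_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in EXPECTED_FILES:
        (root / name).touch(exist_ok=True)
    return root


def check_bundle(root):
    """Return a list of problems: missing folders or files, or BAMs without an index."""
    root = Path(root)
    problems = []
    for name in EXPECTED_DIRS:
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}/")
    for name in EXPECTED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    aligned = root / "aligned"
    if aligned.is_dir():
        for bam in sorted(aligned.rglob("*.bam")):
            # Accept both common index naming styles.
            candidates = [bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai")]
            if not any(c.is_file() for c in candidates):
                problems.append(f"BAM without index: {bam.relative_to(root).as_posix()}")
    return problems
```

Running check_bundle on the project directory just before sending gives a short list of anything that is missing; an empty list means the structure matches the recommended layout.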

BioTransfer's folder transfer feature preserves this directory structure end-to-end. The recipient downloads a ZIP that reconstructs the exact folder hierarchy — no manual reorganisation required.