Preprocess 10X Genomics Gene Expression + ATAC Data#
[1]:
import precellar
Create the SeqSpec object#
We first create a SeqSpec object from the 10X multiome seqspec template file.
[2]:
assay = precellar.SeqSpec("https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_rna_atac.yaml")
[2024-10-01T15:16:56Z INFO cached_path::cache] Cached version of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_rna_atac.yaml is up-to-date
The SeqSpec object can be visualized as a tree structure.
[3]:
assay
[3]:
├── rna(153-1150)
│ ├── rna-illumina_p5(29)
│ ├── rna-truseq_read1(29) [↓rna-R1(28)]
│ ├── rna-cell_barcode(16)
│ ├── rna-umi(12)
│ ├── rna-cDNA(1-1000)
│ ├── rna-truseq_read2(34) [↓rna-I1(8), ↑rna-R2(1-98)]
│ ├── rna-index7(8)
│ └── rna-illumina_p7(24)
└── atac(153-1150)
├── atac-illumina_p5(29)
├── atac-cell_barcode(16)
├── atac-linker(8)
├── atac-nextera_read1(33) [↓atac-R1(1-98), ↑atac-I2(24)]
├── atac-gDNA(1-1000)
├── atac-nextera_read2(34) [↓atac-I1(8), ↑atac-R2(1-98)]
├── atac-index7(8)
└── atac-illumina_p7(24)
We then add ATAC data to the SeqSpec object. Specifically, we add R1, R2 and index fastq files. Note that the index read, ususally named “R2” in the 10X multiome data, can be sequenced in either direction, depending on the sequencing platform. Older platforms such as NovaSeq 6000 with v1.0 reagent kits, MiniSeq with Rapid Reagent kits, MiSeq, HiSeq 2500, and HiSeq 2000 use the forward strand workflow. iSeq 100, MiniSeq with Standard reagent kits, NextSeq Systems, NovaSeq 6000 with v1.5 reagent kits, HiSeq X, HiSeq 4000, and HiSeq 3000 use the reverse strand workflow.
We need to pay special attention to the orientation of the index read. In this example, the index read is sequenced in the reverse direction.
[4]:
assay.update_read("atac-R1", fastq="/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R1_001.fq.zst")
assay.update_read("atac-I2", fastq="/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R2_001.fq.zst")
assay.update_read("atac-R2", fastq="/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R3_001.fq.zst")
We can visualize the SeqSpec object again after adding the ATAC data. Note that read information has been updated, reflecting the real read length of the fastq files.
[5]:
assay
[5]:
├── rna(153-1150)
│ ├── rna-illumina_p5(29)
│ ├── rna-truseq_read1(29) [↓rna-R1(28)]
│ ├── rna-cell_barcode(16)
│ ├── rna-umi(12)
│ ├── rna-cDNA(1-1000)
│ ├── rna-truseq_read2(34) [↓rna-I1(8), ↑rna-R2(1-98)]
│ ├── rna-index7(8)
│ └── rna-illumina_p7(24)
└── atac(153-1150)
├── atac-illumina_p5(29)
├── atac-cell_barcode(16)
├── atac-linker(8)
├── atac-nextera_read1(33) [↓atac-R1(150), ↑atac-I2(24)]
├── atac-gDNA(1-1000)
├── atac-nextera_read2(34) [↓atac-I1(8), ↑atac-R2(150)]
├── atac-index7(8)
└── atac-illumina_p7(24)
We can save the SeqSpec object to a yaml file and load it next time.
[6]:
with open("10x_rna_atac.yaml", "w") as f:
f.write(assay.to_yaml())
Align ATAC reads#
We can now start reads mapping and create the fragment file. If you do not have the genome index, you can create it using the make_genome_index
function.
[7]:
qc = precellar.align(
assay, "/data/kzhang/genome/mm10",
modality="atac",
output_fragment="atac_fragments.tsv.zst",
num_threads=32,
)
[2024-10-01T15:17:03Z INFO precellar::align] Counting barcodes...
[2024-10-01T15:17:04Z INFO cached_path::cache] Cached version of https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/atac_737K-arc-v1.txt.gz is up-to-date
[2024-10-01T15:17:04Z WARN seqspec] Reads (atac-R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:17:11Z INFO precellar::align] Found 12387063 barcodes. 94.19% of them have an exact match in whitelist
[2024-10-01T15:17:11Z INFO precellar::align] Aligning reads...
[2024-10-01T15:17:11Z WARN seqspec] Reads (atac-R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:17:11Z WARN seqspec] Reads (atac-R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
100%|██████████| 12387063/12387063 [05:33<00:00, 37127.51it/s]
QC metrics:
[9]:
print(qc)
{'frac_fragment_flanking_single_nucleosome': 2.912761037492815e-05, 'frac_confidently_mapped': 0.9292046064511015, 'frac_q30_bases_read1': 0.9173089515515771, 'frac_valid_barcode': 0.9815396111249293, 'frac_duplicates': 0.33183332695206896, 'frac_unmapped': 0.008819422106194463, 'frac_q30_bases_barcode': 0.9219054690365263, 'frac_nonnuclear': 7.443249461151526e-05, 'frac_q30_bases_read2': 0.9229901653577338, 'sequenced_read_pairs': 12387063.0, 'sequenced_reads': 24774126.0, 'frac_fragment_in_nucleosome_free_region': 4.518862917979508e-05}