Preprocess 10X Genomics Gene Expression + ATAC Data#

[1]:
import precellar

Create the SeqSpec object#

We first create a SeqSpec object from the 10X multiome seqspec template file.

[2]:
assay = precellar.SeqSpec("https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_rna_atac.yaml")
[2024-10-01T15:16:56Z INFO  cached_path::cache] Cached version of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_rna_atac.yaml is up-to-date

The SeqSpec object can be visualized as a tree structure.

[3]:
assay
[3]:

├── rna(153-1150)
│   ├── rna-illumina_p5(29)
│   ├── rna-truseq_read1(29) [↓rna-R1(28)]
│   ├── rna-cell_barcode(16)
│   ├── rna-umi(12)
│   ├── rna-cDNA(1-1000)
│   ├── rna-truseq_read2(34) [↓rna-I1(8), ↑rna-R2(1-98)]
│   ├── rna-index7(8)
│   └── rna-illumina_p7(24)
└── atac(153-1150)
    ├── atac-illumina_p5(29)
    ├── atac-cell_barcode(16)
    ├── atac-linker(8)
    ├── atac-nextera_read1(33) [↓atac-R1(1-98), ↑atac-I2(24)]
    ├── atac-gDNA(1-1000)
    ├── atac-nextera_read2(34) [↓atac-I1(8), ↑atac-R2(1-98)]
    ├── atac-index7(8)
    └── atac-illumina_p7(24)

We then add ATAC data to the SeqSpec object. Specifically, we add R1, R2 and index fastq files. Note that the index read, ususally named “R2” in the 10X multiome data, can be sequenced in either direction, depending on the sequencing platform. Older platforms such as NovaSeq 6000 with v1.0 reagent kits, MiniSeq with Rapid Reagent kits, MiSeq, HiSeq 2500, and HiSeq 2000 use the forward strand workflow. iSeq 100, MiniSeq with Standard reagent kits, NextSeq Systems, NovaSeq 6000 with v1.5 reagent kits, HiSeq X, HiSeq 4000, and HiSeq 3000 use the reverse strand workflow.

We need to pay special attention to the orientation of the index read. In this example, the index read is sequenced in the reverse direction.

[4]:
assay.update_read("atac-R1", fastq="/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R1_001.fq.zst")
assay.update_read("atac-I2", fastq="/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R2_001.fq.zst")
assay.update_read("atac-R2", fastq="/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R3_001.fq.zst")

We can visualize the SeqSpec object again after adding the ATAC data. Note that read information has been updated, reflecting the real read length of the fastq files.

[5]:
assay
[5]:

├── rna(153-1150)
│   ├── rna-illumina_p5(29)
│   ├── rna-truseq_read1(29) [↓rna-R1(28)]
│   ├── rna-cell_barcode(16)
│   ├── rna-umi(12)
│   ├── rna-cDNA(1-1000)
│   ├── rna-truseq_read2(34) [↓rna-I1(8), ↑rna-R2(1-98)]
│   ├── rna-index7(8)
│   └── rna-illumina_p7(24)
└── atac(153-1150)
    ├── atac-illumina_p5(29)
    ├── atac-cell_barcode(16)
    ├── atac-linker(8)
    ├── atac-nextera_read1(33) [↓atac-R1(150), ↑atac-I2(24)]
    ├── atac-gDNA(1-1000)
    ├── atac-nextera_read2(34) [↓atac-I1(8), ↑atac-R2(150)]
    ├── atac-index7(8)
    └── atac-illumina_p7(24)

We can save the SeqSpec object to a yaml file and load it next time.

[6]:
with open("10x_rna_atac.yaml", "w") as f:
    f.write(assay.to_yaml())

Align ATAC reads#

We can now start reads mapping and create the fragment file. If you do not have the genome index, you can create it using the make_genome_index function.

[7]:
qc = precellar.align(
    assay, "/data/kzhang/genome/mm10",
    modality="atac",
    output_fragment="atac_fragments.tsv.zst",
    num_threads=32,
)
[2024-10-01T15:17:03Z INFO  precellar::align] Counting barcodes...
[2024-10-01T15:17:04Z INFO  cached_path::cache] Cached version of https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/atac_737K-arc-v1.txt.gz is up-to-date
[2024-10-01T15:17:04Z WARN  seqspec] Reads (atac-R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:17:11Z INFO  precellar::align] Found 12387063 barcodes. 94.19% of them have an exact match in whitelist
[2024-10-01T15:17:11Z INFO  precellar::align] Aligning reads...
[2024-10-01T15:17:11Z WARN  seqspec] Reads (atac-R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:17:11Z WARN  seqspec] Reads (atac-R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
100%|██████████| 12387063/12387063 [05:33<00:00, 37127.51it/s]

QC metrics:

[9]:
print(qc)
{'frac_fragment_flanking_single_nucleosome': 2.912761037492815e-05, 'frac_confidently_mapped': 0.9292046064511015, 'frac_q30_bases_read1': 0.9173089515515771, 'frac_valid_barcode': 0.9815396111249293, 'frac_duplicates': 0.33183332695206896, 'frac_unmapped': 0.008819422106194463, 'frac_q30_bases_barcode': 0.9219054690365263, 'frac_nonnuclear': 7.443249461151526e-05, 'frac_q30_bases_read2': 0.9229901653577338, 'sequenced_read_pairs': 12387063.0, 'sequenced_reads': 24774126.0, 'frac_fragment_in_nucleosome_free_region': 4.518862917979508e-05}