Preprocess 10X Genomics Gene Expression + ATAC Data#

[1]:

import precellar

Create the SeqSpec object#

We first create a Assay object from the 10X multiome seqspec template file.

[ ]:

assay = precellar.Assay("https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_rna_atac.yaml")

The Assay object can be visualized as a tree structure.

[3]:

assay

[3]:


├── rna(153-1150)
│   ├── rna-illumina_p5(29)
│   ├── rna-truseq_read1(29)
│   ├── rna-cell_barcode(16)
│   ├── rna-umi(12)
│   ├── rna-polyT(10-250)
│   ├── rna-cDNA(1-1000)
│   ├── rna-truseq_read2(34)
│   ├── rna-index7(8)
│   └── rna-illumina_p7(24)
└── atac(153-1150)
    ├── atac-illumina_p5(29)
    ├── atac-cell_barcode(16)
    ├── atac-linker(8)
    ├── atac-nextera_read1(33)
    ├── atac-gDNA(1-1000)
    ├── atac-nextera_read2(34)
    ├── atac-index7(8)
    └── atac-illumina_p7(24)

We then add ATAC data to the Assay object. Specifically, we add R1, R2 and index fastq files. Note that the index read, ususally named “R2” in the 10X multiome data, can be sequenced in either direction, depending on the sequencing platform. Older platforms such as NovaSeq 6000 with v1.0 reagent kits, MiniSeq with Rapid Reagent kits, MiSeq, HiSeq 2500, and HiSeq 2000 use the forward strand workflow. iSeq 100, MiniSeq with Standard reagent kits, NextSeq Systems, NovaSeq 6000 with v1.5 reagent kits, HiSeq X, HiSeq 4000, and HiSeq 3000 use the reverse strand workflow.

We need to pay special attention to the orientation of the index read. In this example, the index read is sequenced in the reverse direction.

[4]:

assay.add_illumina_reads('rna')
assay.add_illumina_reads('atac')
assay

[2024-12-28T06:29:57Z WARN  seqspec] 'rna-R1' may read through and contain sequences from: 'rna-truseq_read2'
[2024-12-28T06:29:57Z WARN  seqspec] 'rna-R2' may read through and contain sequences from: 'rna-polyT'
[2024-12-28T06:29:57Z WARN  seqspec] 'atac-R1' may read through and contain sequences from: 'atac-nextera_read2'
[2024-12-28T06:29:57Z WARN  seqspec] 'atac-R2' may read through and contain sequences from: 'atac-nextera_read1'

[4]:


├── rna(153-1150)
│   ├── rna-illumina_p5(29)
│   ├── rna-truseq_read1(29) [↓rna-R1(150)✗]
│   ├── rna-cell_barcode(16)
│   ├── rna-umi(12)
│   ├── rna-polyT(10-250)
│   ├── rna-cDNA(1-1000)
│   ├── rna-truseq_read2(34) [↑rna-R2(150)✗, ↓rna-I1(8)✗]
│   ├── rna-index7(8)
│   └── rna-illumina_p7(24)
└── atac(153-1150)
    ├── atac-illumina_p5(29)
    ├── atac-cell_barcode(16)
    ├── atac-linker(8)
    ├── atac-nextera_read1(33) [↓atac-R1(150)✗, ↑atac-I2(24)✗]
    ├── atac-gDNA(1-1000)
    ├── atac-nextera_read2(34) [↑atac-R2(150)✗, ↓atac-I1(8)✗]
    ├── atac-index7(8)
    └── atac-illumina_p7(24)

[4]:

assay.update_read("atac-R1", fastq="/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R1_001.fq.zst")
assay.update_read("atac-I2", fastq="/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R2_001.fq.zst")
assay.update_read("atac-R2", fastq="/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R3_001.fq.zst")

[2024-10-04T06:26:36Z WARN  seqspec] 'atac-R1' may read through and contain sequences from: 'atac-nextera_read2'
[2024-10-04T06:26:36Z WARN  seqspec] 'atac-R2' may read through and contain sequences from: 'atac-nextera_read1'

We can visualize the SeqSpec object again after adding the ATAC data. Note that read information has been updated, reflecting the real read length of the fastq files.

[5]:

assay

[5]:


├── rna(153-1150)
│   ├── rna-illumina_p5(29)
│   ├── rna-truseq_read1(29) [↓rna-R1(28)✗]
│   ├── rna-cell_barcode(16)
│   ├── rna-umi(12)
│   ├── rna-cDNA(1-1000)
│   ├── rna-truseq_read2(34) [↓rna-I1(8)✗, ↑rna-R2(1-98)✗]
│   ├── rna-index7(8)
│   └── rna-illumina_p7(24)
└── atac(153-1150)
    ├── atac-illumina_p5(29)
    ├── atac-cell_barcode(16)
    ├── atac-linker(8)
    ├── atac-nextera_read1(33) [↓atac-R1(150)✓, ↑atac-I2(24)✓]
    ├── atac-gDNA(1-1000)
    ├── atac-nextera_read2(34) [↓atac-I1(8)✗, ↑atac-R2(150)✓]
    ├── atac-index7(8)
    └── atac-illumina_p7(24)

We can save the SeqSpec object to a yaml file and load it next time.

[6]:

with open("10x_rna_atac.yaml", "w") as f:
    f.write(assay.to_yaml())

Align ATAC reads#

We can now start reads mapping and create the fragment file. If you do not have the genome index, you can create it using the make_genome_index function.

[7]:

qc = precellar.align(
    assay, "/data/kzhang/genome/mm10",
    modality="atac",
    output="atac_fragments.tsv.zst",
    output_type="fragment",
    num_threads=32,
)

[2024-10-04T06:26:44Z INFO  precellar::align] Counting barcodes...
[2024-10-04T06:26:45Z INFO  cached_path::cache] Cached version of https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/atac_737K-arc-v1.txt.gz is up-to-date
[2024-10-04T06:26:52Z INFO  precellar::align] Found 12387063 barcodes. 94.19% of them have an exact match in whitelist
[2024-10-04T06:26:52Z INFO  precellar::align] Aligning reads...
100%|██████████| 12387063/12387063 [05:38<00:00, 36643.94it/s]

QC metrics:

[8]:

print(qc)

{'frac_fragment_flanking_single_nucleosome': 2.912761037492815e-05, 'frac_q30_bases_read2': 0.9229901653577338, 'sequenced_read_pairs': 12387063.0, 'frac_duplicates': 0.33183332695206896, 'frac_valid_barcode': 0.9815396111249293, 'frac_q30_bases_read1': 0.9173089515515771, 'sequenced_reads': 24774126.0, 'frac_nonnuclear': 7.443249461151526e-05, 'frac_unmapped': 0.008819422106194463, 'frac_confidently_mapped': 0.9292046064511015, 'frac_fragment_in_nucleosome_free_region': 4.518862917979508e-05, 'frac_q30_bases_barcode': 0.9219054690365263}

Align RNA reads#

[ ]:

qc = precellar.align(
    assay, "/data/kzhang/genome/mm10",
    modality="rna",
    output="gene_matrix.h5ad",
    output_type="gene_quantification",
    num_threads=32,
)