Processing barcoded Fastq files#

You would likely encounter barcoded fastq files when working with single cell ATAC-seq data. As on early days of single cell ATAC-seq, cell barcodes are usually added to the read name of the fastq files. This notebook demonstrates how to process these barcoded fastq files.

[1]:
import precellar

Extracting cell barcodes from read names#

[2]:
!zcat R1.fq.gz | head
@CCAGCACAAGCCATCCTATCGT:A00953:155:HVCHLDRXX:1:1101:1036:1031 1:N:0:1
ANCTTGGATCATCAGGTTTGTCTGTAGCTGATTTATTTCTTTAAGTTTCCC
+
F#FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@TAACCACTACGAATGACTGACA:A00953:155:HVCHLDRXX:1:1101:1127:1031 1:N:0:1
TNCCAGGACCAGTGACCGTCACCCGCAGTAAGGATCGGGGCGGCTCCGCCA
+
F#:FFFFFFFFF:FFFFF:FF,F,FFFFFFFF,FFF:FFFF:FFFFFF,FF
@CGATATGTAGGGGACTAATTCC:A00953:155:HVCHLDRXX:1:1101:1145:1031 1:N:0:1
GNCGGATCACAAGGTCAGGAGTTCGAGACCTGGCTGGCCAACACGGTGAAA

gzip: stdout: Broken pipe
[3]:
precellar.utils.strip_barcode_from_fastq(
    'R1.fq.gz',
    'R1_processed.fq.zst',
    out_barcode='I1.fq.zst',
    regex="^([ACTG]+):",
    right_add=1,
)

precellar.utils.strip_barcode_from_fastq(
    'R2.fq.gz',
    'R2_processed.fq.zst',
    regex="^([ACTG]+):",
    right_add=1,
)
[4]:
assay = precellar.SeqSpec("https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml")
[2024-10-01T15:18:02Z INFO  cached_path::cache] Starting download of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml
[2024-10-01T15:18:02Z INFO  cached_path::cache] Downloaded 2643 bytes
[2024-10-01T15:18:02Z INFO  cached_path::cache] New version of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml cached
[5]:
assay
[5]:

└── atac(153-1150)
    ├── atac-illumina_p5(29)
    ├── atac-read1(34) [↓R1(1-98)]
    ├── gDNA(1-1000)
    ├── atac-read2(34) [↑R2(1-98), ↓I1(22)]
    ├── atac-cell_barcode(22)
    └── atac-illumina_p7(24)
[6]:
assay.update_read("R1", fastq="R1_processed.fq.zst")
assay.update_read("I1", fastq="I1.fq.zst")
assay.update_read("R2", fastq="R2_processed.fq.zst")
[7]:
assay
[7]:

└── atac(153-1150)
    ├── atac-illumina_p5(29)
    ├── atac-read1(34) [↓R1(51)]
    ├── gDNA(1-1000)
    ├── atac-read2(34) [↑R2(51), ↓I1(22)]
    ├── atac-cell_barcode(22)
    └── atac-illumina_p7(24)
[8]:
qc = precellar.align(
    assay, "/data/kzhang/GRCh38/hg38.fa.gz",
    modality="atac",
    output_fragment="atac_fragments.tsv.zst",
    num_threads=32,
)
[2024-10-01T15:18:10Z INFO  precellar::align] Counting barcodes...
[2024-10-01T15:18:10Z WARN  seqspec] Reads (R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:18:10Z WARN  seqspec] Reads (R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:18:10Z INFO  precellar::align] Found 2500 barcodes. 100.00% of them have an exact match in whitelist
[2024-10-01T15:18:10Z INFO  precellar::align] Aligning reads...
[2024-10-01T15:18:10Z WARN  seqspec] Reads (R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:18:10Z WARN  seqspec] Reads (R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
100%|██████████| 2500/2500 [00:00<00:00, 15545.42it/s]
[9]:
qc
[9]:
{'frac_q30_bases_read1': 0.8179764705882353,
 'frac_valid_barcode': 1.0,
 'sequenced_read_pairs': 2500.0,
 'frac_q30_bases_barcode': 1.0,
 'frac_unmapped': 0.07640000000000002,
 'sequenced_reads': 5000.0,
 'frac_fragment_flanking_single_nucleosome': 0.0029791459781529296,
 'frac_confidently_mapped': 0.8524,
 'frac_fragment_in_nucleosome_free_region': 0.010427010923535254,
 'frac_q30_bases_read2': 0.9442745098039216,
 'frac_nonnuclear': 0.0128,
 'frac_duplicates': 0.004940711462450593}