Processing barcoded Fastq files#

You would likely encounter barcoded fastq files when working with single cell ATAC-seq data. As on early days of single cell ATAC-seq, cell barcodes are usually added to the read name of the fastq files. This notebook demonstrates how to process these barcoded fastq files.

[1]:
import precellar

Extracting cell barcodes from read names#

[2]:
!zcat R1.fq.gz | head
@CCAGCACAAGCCATCCTATCGT:A00953:155:HVCHLDRXX:1:1101:1036:1031 1:N:0:1
ANCTTGGATCATCAGGTTTGTCTGTAGCTGATTTATTTCTTTAAGTTTCCC
+
F#FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@TAACCACTACGAATGACTGACA:A00953:155:HVCHLDRXX:1:1101:1127:1031 1:N:0:1
TNCCAGGACCAGTGACCGTCACCCGCAGTAAGGATCGGGGCGGCTCCGCCA
+
F#:FFFFFFFFF:FFFFF:FF,F,FFFFFFFF,FFF:FFFF:FFFFFF,FF
@CGATATGTAGGGGACTAATTCC:A00953:155:HVCHLDRXX:1:1101:1145:1031 1:N:0:1
GNCGGATCACAAGGTCAGGAGTTCGAGACCTGGCTGGCCAACACGGTGAAA

gzip: stdout: Broken pipe
[3]:
precellar.utils.strip_barcode_from_fastq(
    'R1.fq.gz',
    'R1_processed.fq.zst',
    out_barcode='I1.fq.zst',
    regex="^([ACTG]+):",
    right_add=1,
)

precellar.utils.strip_barcode_from_fastq(
    'R2.fq.gz',
    'R2_processed.fq.zst',
    regex="^([ACTG]+):",
    right_add=1,
)
[4]:
assay = precellar.Assay("https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml")
[2024-10-04T06:24:33Z INFO  cached_path::cache] Cached version of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml is up-to-date
[5]:
assay
[5]:

└── atac(153-1150)
    ├── atac-illumina_p5(29)
    ├── atac-read1(34) [↓R1(1-98)✗]
    ├── gDNA(1-1000)
    ├── atac-read2(34) [↑R2(1-98)✗, ↓I1(22)✗]
    ├── atac-cell_barcode(22)
    └── atac-illumina_p7(24)
[6]:
assay.update_read("R1", fastq="R1_processed.fq.zst")
assay.update_read("I1", fastq="I1.fq.zst")
assay.update_read("R2", fastq="R2_processed.fq.zst")
[2024-10-04T06:24:33Z WARN  seqspec] 'R1' may read through and contain sequences from: 'atac-read2'
[2024-10-04T06:24:33Z WARN  seqspec] 'R2' may read through and contain sequences from: 'atac-read1'
[7]:
assay
[7]:

└── atac(153-1150)
    ├── atac-illumina_p5(29)
    ├── atac-read1(34) [↓R1(51)✓]
    ├── gDNA(1-1000)
    ├── atac-read2(34) [↑R2(51)✓, ↓I1(22)✓]
    ├── atac-cell_barcode(22)
    └── atac-illumina_p7(24)
[8]:
qc = precellar.align(
    assay, "/data/kzhang/GRCh38/hg38.fa.gz",
    modality="atac",
    output_fragment="atac_fragments.tsv.zst",
    num_threads=32,
)
[2024-10-04T06:24:40Z INFO  precellar::align] Counting barcodes...
[2024-10-04T06:24:40Z INFO  precellar::align] Found 2500 barcodes. 100.00% of them have an exact match in whitelist
[2024-10-04T06:24:40Z INFO  precellar::align] Aligning reads...
100%|██████████| 2500/2500 [00:00<00:00, 18920.08it/s]
[9]:
qc
[9]:
{'sequenced_reads': 5000.0,
 'frac_q30_bases_barcode': 1.0,
 'frac_fragment_in_nucleosome_free_region': 0.010427010923535254,
 'frac_q30_bases_read2': 0.9442745098039216,
 'frac_fragment_flanking_single_nucleosome': 0.0029791459781529296,
 'frac_valid_barcode': 1.0,
 'frac_unmapped': 0.07640000000000002,
 'frac_duplicates': 0.004940711462450593,
 'frac_nonnuclear': 0.0128,
 'sequenced_read_pairs': 2500.0,
 'frac_q30_bases_read1': 0.8179764705882353,
 'frac_confidently_mapped': 0.8524}