Processing barcoded Fastq files#
You would likely encounter barcoded fastq files when working with single cell ATAC-seq data. As on early days of single cell ATAC-seq, cell barcodes are usually added to the read name of the fastq files. This notebook demonstrates how to process these barcoded fastq files.
[1]:
import precellar
Extracting cell barcodes from read names#
[2]:
!zcat R1.fq.gz | head
@CCAGCACAAGCCATCCTATCGT:A00953:155:HVCHLDRXX:1:1101:1036:1031 1:N:0:1
ANCTTGGATCATCAGGTTTGTCTGTAGCTGATTTATTTCTTTAAGTTTCCC
+
F#FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@TAACCACTACGAATGACTGACA:A00953:155:HVCHLDRXX:1:1101:1127:1031 1:N:0:1
TNCCAGGACCAGTGACCGTCACCCGCAGTAAGGATCGGGGCGGCTCCGCCA
+
F#:FFFFFFFFF:FFFFF:FF,F,FFFFFFFF,FFF:FFFF:FFFFFF,FF
@CGATATGTAGGGGACTAATTCC:A00953:155:HVCHLDRXX:1:1101:1145:1031 1:N:0:1
GNCGGATCACAAGGTCAGGAGTTCGAGACCTGGCTGGCCAACACGGTGAAA
gzip: stdout: Broken pipe
[3]:
precellar.utils.strip_barcode_from_fastq(
'R1.fq.gz',
'R1_processed.fq.zst',
out_barcode='I1.fq.zst',
regex="^([ACTG]+):",
right_add=1,
)
precellar.utils.strip_barcode_from_fastq(
'R2.fq.gz',
'R2_processed.fq.zst',
regex="^([ACTG]+):",
right_add=1,
)
[4]:
assay = precellar.SeqSpec("https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml")
[2024-10-01T15:18:02Z INFO cached_path::cache] Starting download of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml
[2024-10-01T15:18:02Z INFO cached_path::cache] Downloaded 2643 bytes
[2024-10-01T15:18:02Z INFO cached_path::cache] New version of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml cached
[5]:
assay
[5]:
└── atac(153-1150)
├── atac-illumina_p5(29)
├── atac-read1(34) [↓R1(1-98)]
├── gDNA(1-1000)
├── atac-read2(34) [↑R2(1-98), ↓I1(22)]
├── atac-cell_barcode(22)
└── atac-illumina_p7(24)
[6]:
assay.update_read("R1", fastq="R1_processed.fq.zst")
assay.update_read("I1", fastq="I1.fq.zst")
assay.update_read("R2", fastq="R2_processed.fq.zst")
[7]:
assay
[7]:
└── atac(153-1150)
├── atac-illumina_p5(29)
├── atac-read1(34) [↓R1(51)]
├── gDNA(1-1000)
├── atac-read2(34) [↑R2(51), ↓I1(22)]
├── atac-cell_barcode(22)
└── atac-illumina_p7(24)
[8]:
qc = precellar.align(
assay, "/data/kzhang/GRCh38/hg38.fa.gz",
modality="atac",
output_fragment="atac_fragments.tsv.zst",
num_threads=32,
)
[2024-10-01T15:18:10Z INFO precellar::align] Counting barcodes...
[2024-10-01T15:18:10Z WARN seqspec] Reads (R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:18:10Z WARN seqspec] Reads (R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:18:10Z INFO precellar::align] Found 2500 barcodes. 100.00% of them have an exact match in whitelist
[2024-10-01T15:18:10Z INFO precellar::align] Aligning reads...
[2024-10-01T15:18:10Z WARN seqspec] Reads (R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
[2024-10-01T15:18:10Z WARN seqspec] Reads (R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.
100%|██████████| 2500/2500 [00:00<00:00, 15545.42it/s]
[9]:
qc
[9]:
{'frac_q30_bases_read1': 0.8179764705882353,
'frac_valid_barcode': 1.0,
'sequenced_read_pairs': 2500.0,
'frac_q30_bases_barcode': 1.0,
'frac_unmapped': 0.07640000000000002,
'sequenced_reads': 5000.0,
'frac_fragment_flanking_single_nucleosome': 0.0029791459781529296,
'frac_confidently_mapped': 0.8524,
'frac_fragment_in_nucleosome_free_region': 0.010427010923535254,
'frac_q30_bases_read2': 0.9442745098039216,
'frac_nonnuclear': 0.0128,
'frac_duplicates': 0.004940711462450593}