Processing barcoded Fastq files#
You would likely encounter barcoded fastq files when working with single cell ATAC-seq data. As on early days of single cell ATAC-seq, cell barcodes are usually added to the read name of the fastq files. This notebook demonstrates how to process these barcoded fastq files.
[1]:
import precellar
Extracting cell barcodes from read names#
[2]:
!zcat R1.fq.gz | head
@CCAGCACAAGCCATCCTATCGT:A00953:155:HVCHLDRXX:1:1101:1036:1031 1:N:0:1
ANCTTGGATCATCAGGTTTGTCTGTAGCTGATTTATTTCTTTAAGTTTCCC
+
F#FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@TAACCACTACGAATGACTGACA:A00953:155:HVCHLDRXX:1:1101:1127:1031 1:N:0:1
TNCCAGGACCAGTGACCGTCACCCGCAGTAAGGATCGGGGCGGCTCCGCCA
+
F#:FFFFFFFFF:FFFFF:FF,F,FFFFFFFF,FFF:FFFF:FFFFFF,FF
@CGATATGTAGGGGACTAATTCC:A00953:155:HVCHLDRXX:1:1101:1145:1031 1:N:0:1
GNCGGATCACAAGGTCAGGAGTTCGAGACCTGGCTGGCCAACACGGTGAAA
gzip: stdout: Broken pipe
[3]:
precellar.utils.strip_barcode_from_fastq(
'R1.fq.gz',
'R1_processed.fq.zst',
out_barcode='I1.fq.zst',
regex="^([ACTG]+):",
right_add=1,
)
precellar.utils.strip_barcode_from_fastq(
'R2.fq.gz',
'R2_processed.fq.zst',
regex="^([ACTG]+):",
right_add=1,
)
[4]:
assay = precellar.Assay("https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml")
[2024-10-04T06:24:33Z INFO cached_path::cache] Cached version of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml is up-to-date
[5]:
assay
[5]:
└── atac(153-1150)
├── atac-illumina_p5(29)
├── atac-read1(34) [↓R1(1-98)✗]
├── gDNA(1-1000)
├── atac-read2(34) [↑R2(1-98)✗, ↓I1(22)✗]
├── atac-cell_barcode(22)
└── atac-illumina_p7(24)
[6]:
assay.update_read("R1", fastq="R1_processed.fq.zst")
assay.update_read("I1", fastq="I1.fq.zst")
assay.update_read("R2", fastq="R2_processed.fq.zst")
[2024-10-04T06:24:33Z WARN seqspec] 'R1' may read through and contain sequences from: 'atac-read2'
[2024-10-04T06:24:33Z WARN seqspec] 'R2' may read through and contain sequences from: 'atac-read1'
[7]:
assay
[7]:
└── atac(153-1150)
├── atac-illumina_p5(29)
├── atac-read1(34) [↓R1(51)✓]
├── gDNA(1-1000)
├── atac-read2(34) [↑R2(51)✓, ↓I1(22)✓]
├── atac-cell_barcode(22)
└── atac-illumina_p7(24)
[8]:
qc = precellar.align(
assay, "/data/kzhang/GRCh38/hg38.fa.gz",
modality="atac",
output_fragment="atac_fragments.tsv.zst",
num_threads=32,
)
[2024-10-04T06:24:40Z INFO precellar::align] Counting barcodes...
[2024-10-04T06:24:40Z INFO precellar::align] Found 2500 barcodes. 100.00% of them have an exact match in whitelist
[2024-10-04T06:24:40Z INFO precellar::align] Aligning reads...
100%|██████████| 2500/2500 [00:00<00:00, 18920.08it/s]
[9]:
qc
[9]:
{'sequenced_reads': 5000.0,
'frac_q30_bases_barcode': 1.0,
'frac_fragment_in_nucleosome_free_region': 0.010427010923535254,
'frac_q30_bases_read2': 0.9442745098039216,
'frac_fragment_flanking_single_nucleosome': 0.0029791459781529296,
'frac_valid_barcode': 1.0,
'frac_unmapped': 0.07640000000000002,
'frac_duplicates': 0.004940711462450593,
'frac_nonnuclear': 0.0128,
'sequenced_read_pairs': 2500.0,
'frac_q30_bases_read1': 0.8179764705882353,
'frac_confidently_mapped': 0.8524}