10X ATAC-seq pipeline#
[1]:
import precellar
precellar.__version__
[1]:
'0.1.1-dev2'
We now create a Assay
object from a seqspec template, specifically for 10X ATAC-seq data. The Assay object contains detailed information about the sequence library structure. This object will be used to run the pipeline.
[10]:
assay = precellar.Assay("https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_atac.yaml")
assay
[10]:
└── atac(145-1144)
├── illumina_p5(29)
├── cell_bc(16)
├── nextera_read1(33)
├── gdna(1-1000)
├── nextera_read2(34)
├── index7(8)
└── illumina_p7(24)
As you can see, the seqspec template does not contain sequenced read information. We can use add_illumina_reads
to add this information.
It is important to note that illumina has two different sequencing workflows: forward strand workflow and reverse strand workflow. The key difference between the two workflows is the orientation of the index reads. The reverse strand workflow is the default workflow for newer Illumina sequencers. For more details, see the Indexed Sequencing Overview Guide from Illumina.
As the example dataset of this tutorial is generated from the old illumina sequencer, we will use the forward strand workflow.
[12]:
assay.add_illumina_reads('atac', forward_strand_workflow=True)
assay
[2024-12-28T06:00:01Z WARN seqspec] 'atac-R1' may read through and contain sequences from: 'nextera_read2'
[2024-12-28T06:00:01Z WARN seqspec] 'atac-R2' may read through and contain sequences from: 'nextera_read1'
[12]:
└── atac(145-1144)
├── illumina_p5(29) [↓atac-I2(16)✗]
├── cell_bc(16)
├── nextera_read1(33) [↓atac-R1(150)✗]
├── gdna(1-1000)
├── nextera_read2(34) [↑atac-R2(150)✗, ↓atac-I1(8)✗]
├── index7(8)
└── illumina_p7(24)
We now need to attach files containing sequenced reads to the Assay
object. The test data can be downloaded from the 10X website: https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_500_v1/atac_pbmc_500_v1_fastqs.tar. We only need the “R1”, “R2”, “R3” files as “I1” is the sample index. We can use precellar.update_read
to add these files to the Assay
object.
[13]:
assay.update_read('atac-I2', fastq='atac_pbmc_500_v1_fastqs/*R2*.fastq.gz')
assay.update_read('atac-R1', fastq='atac_pbmc_500_v1_fastqs/*R1*.fastq.gz')
assay.update_read('atac-R2', fastq='atac_pbmc_500_v1_fastqs/*R3*.fastq.gz')
assay
[2024-12-28T06:01:07Z INFO cached_path::cache] Cached version of https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/737K-cratac-v1.txt.gz is up-to-date
[2024-12-28T06:01:08Z WARN seqspec] 'atac-R1' may read through and contain sequences from: 'nextera_read2'
[2024-12-28T06:01:08Z WARN seqspec] 'atac-R2' may read through and contain sequences from: 'nextera_read1'
[13]:
└── atac(145-1144)
├── illumina_p5(29) [↓atac-I2(16)✓]
├── cell_bc(16)
├── nextera_read1(33) [↓atac-R1(50)✓]
├── gdna(1-1000)
├── nextera_read2(34) [↑atac-R2(49)✓, ↓atac-I1(8)✗]
├── index7(8)
└── illumina_p7(24)
[15]:
precellar.align(
assay,
modality='atac',
genome_index='/data/Public/BWA_MEM2_index/GRCh38',
output='out.fragment.tsv.zst',
output_type='fragment',
num_threads=32,
)
[2024-12-28T06:02:35Z INFO precellar::align::fastq] Counting barcodes...
[2024-12-28T06:02:35Z INFO cached_path::cache] Cached version of https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/737K-cratac-v1.txt.gz is up-to-date
[2024-12-28T06:03:06Z INFO precellar::align::fastq] Found 22772118 barcodes. 94.20% of them have an exact match in whitelist cell_bc
[2024-12-28T06:03:06Z INFO precellar::align::fastq] Aligning 22772118 reads...
100%|██████████| 22772118/22772118 [19:29<00:00, 19467.43it/s]
[15]:
{'frac_q30_bases_read2': 0.9548379213706222,
'frac_duplicates': 0.576675957646637,
'frac_fragment_in_nucleosome_free_region': 0.004232557108438031,
'frac_fragment_flanking_single_nucleosome': 0.0002958156859493062,
'sequenced_read_pairs': 22772118.0,
'frac_confidently_mapped': 0.9141162147760858,
'frac_q30_bases_read1': 0.9518468910094353,
'frac_unmapped': 0.012643659546708985,
'frac_q30_bases_barcode': 0.85864855104826,
'frac_valid_barcode': 0.9816151927545782,
'frac_nonnuclear': 0.005350223904964677,
'sequenced_reads': 45544236.0}
Generate cell by bin matrix#
We have generated the fragment files in the previous step. We can use the “SnapATAC2” package to further process the data.
[17]:
import snapatac2 as snap
data = snap.pp.import_fragments(
'out.fragment.tsv.zst',
snap.genome.hg38,
)
data
[17]:
AnnData object with n_obs × n_vars = 736 × 0
obs: 'n_fragment', 'frac_dup', 'frac_mito'
uns: 'reference_sequences'
obsm: 'fragment_paired'
[18]:
snap.metrics.tsse(data, gene_anno=snap.genome.hg38)
[19]:
snap.pl.tsse(data)
/data/kzhang/dev/SnapATAC2/snapatac2-python/python/snapatac2/plotting/__init__.py:97: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
x = adata.obs["n_fragment"][selected_cells]
/data/kzhang/dev/SnapATAC2/snapatac2-python/python/snapatac2/plotting/__init__.py:98: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
y = adata.obs["tsse"][selected_cells]
[20]:
snap.pp.filter_cells(data, min_tsse=10, min_counts=1000)
data
[20]:
AnnData object with n_obs × n_vars = 445 × 0
obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse'
uns: 'reference_sequences', 'library_tsse', 'frac_overlap_TSS', 'TSS_profile'
obsm: 'fragment_paired'
[21]:
snap.pp.add_tile_matrix(data)
[22]:
snap.pp.select_features(data)
2024-12-28 14:25:52 - INFO - Selected 500000 features.
[23]:
snap.tl.spectral(data)
[24]:
snap.pp.knn(data)
snap.tl.umap(data)
/data/kzhang/micromamba/lib/python3.10/site-packages/umap/umap_.py:1945: UserWarning:
n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
[25]:
snap.tl.leiden(data)
[26]:
snap.pl.umap(data, color='leiden')