precellar.align

precellar.align(assay, aligner, *, output, modality=None, output_type='alignment', mito_dna=['chrM', 'M'], shift_left=4, shift_right=-5, compute_snv=False, strandedness=None, compression=None, compression_level=None, temp_dir=None, num_threads=8, chunk_size=10000000)

Align FASTQ reads to the reference genome and generate unique fragments.

Parameters:
  • assay (Assay | Path | list[Assay | Path]) – An Assay object or a file path to the YAML sequencing specification file (see pachterlab/seqspec). The assay can also be a list of Assay objects or file paths, in which case the results are concatenated into a single output file.

  • aligner (STAR | BWAMEM2) – The aligner to use. Available aligners can be found in the precellar.aligners submodule.

  • output (Path) – File path to the output file. The type of the output file is determined by the output_type parameter (see below).

  • modality (str | None) – The modality of the sequencing data, e.g., “rna” or “atac”.

  • output_type (Literal["alignment", "fragment", "gene_quantification"]) – The type of the output file. If “alignment”, the output is a BAM file containing the alignments. If “fragment”, the output is a fragment file containing the unique fragments. If “gene_quantification”, the output is an h5ad file containing the gene quantification.

  • mito_dna (list[str]) – List of mitochondrial DNA names.

  • shift_left (int) – The number of bases to shift the left end of the fragment. For example, in ATAC-seq, this is usually set to 4 to account for the Tn5 transposase insertion bias. Available only when output_type='fragment'.

  • shift_right (int) – The number of bases to shift the right end of the fragment. For example, in ATAC-seq, this is usually set to -5 to account for the Tn5 transposase insertion bias. Available only when output_type='fragment'.

  • compute_snv (bool) – Whether to compute single nucleotide variants (SNVs) from the alignments. If True, the SNVs will be computed and added to the fragment file.

  • strandedness (Literal['forward', 'reverse', 'unstranded', 'auto'] | None) – The strand specificity of the assay: “forward”, “reverse”, “unstranded”, or “auto”. “forward” means that read 1 is expected to align to the same strand as the original RNA molecule. For example, in 10x Genomics 3’ scRNA-seq, read 1 aligns to the reverse strand of the original RNA molecule, so the strandedness should be set to “reverse”. If “auto”, the strand specificity is inferred from the data. Automatic inference may not always be accurate, especially in samples where antisense transcription is prevalent, e.g., brain samples.

  • compression (str | None) – The compression algorithm to use for the output fragment file. If None, the compression algorithm will be inferred from the file extension.

  • compression_level (int | None) – The compression level to use for the output fragment file.

  • temp_dir (Path | None) – The temporary directory to use.

  • num_threads (int) – The number of threads to use.

  • chunk_size (int) – The number of bases processed in each chunk per thread. The total number of bases in each chunk is chunk_size * num_threads.
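To make the shift_left, shift_right, and chunk_size descriptions above concrete, here is a minimal Python sketch. It is not precellar's implementation: the helper names shift_fragment and bases_per_chunk are hypothetical, and the sketch assumes that each shift is simply added to the corresponding fragment coordinate.

```python
from __future__ import annotations


def shift_fragment(
    start: int, end: int, shift_left: int = 4, shift_right: int = -5
) -> tuple[int, int]:
    """Shift both ends of a fragment (assumed semantics, for illustration only)."""
    new_start = start + shift_left   # left end moves by shift_left bases
    new_end = end + shift_right      # right end moves by shift_right bases
    if new_start >= new_end:
        raise ValueError("fragment is too short for the requested shifts")
    return new_start, new_end


def bases_per_chunk(chunk_size: int = 10_000_000, num_threads: int = 8) -> int:
    """Total bases in each chunk, as stated above: chunk_size * num_threads."""
    return chunk_size * num_threads


print(shift_fragment(1000, 1200))  # (1004, 1195)
print(bases_per_chunk())           # 80000000
```

With the documented defaults, each chunk therefore covers 80 million bases in total, so lowering either chunk_size or num_threads reduces the amount of data held per chunk.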

Returns:

A dictionary containing the QC metrics of the alignment and fragment generation.

Return type:

dict
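A call that produces a fragment file might look like the sketch below. It uses only the keyword arguments listed in the signature above. The helper names fragment_kwargs and run_fragment_alignment are hypothetical, as are the file names in the usage comment, and the aligner is passed in already constructed because the constructor arguments of STAR and BWAMEM2 are not described on this page.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def fragment_kwargs(output: Path, num_threads: int = 8) -> dict[str, Any]:
    """Keyword arguments for an ATAC-style fragment run (documented defaults)."""
    return {
        "output": output,
        "modality": "atac",
        "output_type": "fragment",
        "shift_left": 4,
        "shift_right": -5,
        "num_threads": num_threads,
    }


def run_fragment_alignment(assay: Any, aligner: Any, output: Path) -> dict:
    """Call precellar.align and return its dictionary of QC metrics."""
    import precellar  # imported lazily so fragment_kwargs works without it

    return precellar.align(assay, aligner, **fragment_kwargs(output))


# Usage (hypothetical file names; `aligner` is a STAR or BWAMEM2 instance):
# qc = run_fragment_alignment("spec.yaml", aligner, Path("fragments.tsv.gz"))
# for name, value in qc.items():
#     print(name, value)
```

The keys of the returned dictionary are not listed here, so the usage comment iterates over whatever metrics are present rather than assuming specific names.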