gdata.GenomeDataBuilder

class gdata.GenomeDataBuilder(self, /, location, genome_fasta, window_size, *, segments=None, step_size=None, resolution=32, padding=0, chroms=None, temp_dir=None)

Represents a builder for genomic data, allowing for the creation and management of genomic datasets.

This class provides methods to create a new genomic dataset, open an existing dataset, and manage genomic data chunks. It supports adding files, retrieving chromosome information, and iterating over data chunks.

Parameters:
  • location – The directory where the genomic data will be stored.

  • genome_fasta – The path to the FASTA file containing the genome sequences.

  • window_size – The size of the genomic windows to be processed.

  • segments – Optional list of genomic segments to include in the dataset. The genomic segments should be provided as strings in the format “chrom:start-end”. If None, the entire genome will be used.

  • step_size – The step size for sliding the window across the genome (default is None, which uses window_size).

  • resolution – The resolution of the stored genomic data (default is 32).

  • padding – The amount of padding to add around each genomic segment (default is 0). Extra padding allows shifting the window within the padded region during data loading.

  • chroms – A list of chromosomes to include in the dataset. If None, all chromosomes in the FASTA file will be used.

  • temp_dir – Optional temporary directory for intermediate files. If None, a system temporary directory will be used.
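
To make the relationship between segments, window_size, and step_size concrete, the following is a minimal sketch in plain Python, not gdata's actual implementation. The helpers parse_segment and tile_windows are hypothetical names introduced only for illustration. The sketch assumes that windows must fit entirely inside a segment and that a trailing partial window is dropped; gdata may handle segment edges, padding, and coordinate conventions differently.

```python
def parse_segment(segment):
    """Split a "chrom:start-end" string into (chrom, start, end).

    Hypothetical helper for illustration; not part of gdata.
    """
    # rpartition keeps any ':' inside the chromosome name on the left side.
    chrom, _, span = segment.rpartition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


def tile_windows(segment, window_size, step_size=None):
    """Yield (chrom, start, end) windows that fit entirely inside the segment.

    Assumption: a trailing partial window is dropped. gdata may behave
    differently at segment edges.
    """
    if step_size is None:
        step_size = window_size  # mirrors the documented default for step_size
    chrom, start, end = parse_segment(segment)
    for pos in range(start, end - window_size + 1, step_size):
        yield chrom, pos, pos + window_size


# Overlapping windows: step_size smaller than window_size.
windows = list(tile_windows("chr1:0-1000", window_size=400, step_size=200))
for window in windows:
    print(window)
```

With step_size left as None, the same call yields non-overlapping windows, since each window starts where the previous one ended.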

See also

GenomeDataLoader

Examples

>>> from gdata import GenomeDataBuilder
>>> regions = ["chr11:35041782-35238390", "chr11:35200000-35300000"]
>>> tracks = {'DNase:CD14-positive monocyte': 'ENCSR464ETX.w5z', 'DNase:keratinocyte': 'ENCSR000EPQ.w5z'}
>>> builder = GenomeDataBuilder("genome.gdata", 'genome.fa.gz', segments=regions, window_size=196_608, resolution=128)
>>> builder.add_files(tracks)
>>> builder.finish()

Methods

add_file(key, path)

Adds a single file to the dataset under the given key.

add_files(files)

Adds multiple w5z files to the dataset, given as a mapping of track names to file paths.

finish()

Finalizes the dataset creation.

segments()

Returns the genomic segments in the dataset as a list of strings.

tracks()

Returns the keys (track names) in the dataset.