gdata.GenomeDataBuilder
- class gdata.GenomeDataBuilder(self, /, location, genome_fasta, window_size, *, segments=None, step_size=None, chunk_size=None, resolution=1, chroms=None, overwrite=False)
Represents a builder for genomic data, allowing for the creation and management of genomic datasets.
This class provides methods to create a new genomic dataset, open an existing one, and manage its data chunks. It supports adding files, retrieving chromosome information, and iterating over data chunks.
The builder creates a structured dataset in a specified location:
    root/
    ├── metadata.json
    ├── chr1/
    │   ├── 0/
    │   │   ├── data
    │   │   ├── data.index
    │   │   ├── sequence.dat
    │   │   └── names.txt
    │   ├── 1/
    │   │   ├── data
    │   │   ├── data.index
    │   │   ├── sequence.dat
    │   │   └── names.txt
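The layout above can be inspected with the standard library alone. The following is a minimal sketch, not part of the gdata API: the helper name `list_chunks` is hypothetical, and it assumes only what the tree shows, namely that each chromosome is a directory under the root and each chunk is a numbered subdirectory.

```python
from pathlib import Path


def list_chunks(root):
    """Map each chromosome directory under `root` to its sorted chunk indices.

    Hypothetical helper: it relies only on the directory layout shown above
    and does not read metadata.json or any chunk contents.
    """
    layout = {}
    for chrom_dir in sorted(Path(root).iterdir()):
        if not chrom_dir.is_dir():
            continue  # skips plain files such as metadata.json
        # Chunk directories are assumed to be named 0, 1, 2, ...
        layout[chrom_dir.name] = sorted(
            int(p.name)
            for p in chrom_dir.iterdir()
            if p.is_dir() and p.name.isdigit()
        )
    return layout
```

Calling `list_chunks("genome")` on a dataset built at `genome/` would return a dictionary keyed by chromosome name, which is a quick way to check how many chunks each chromosome was split into.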
- Parameters:
location – The directory where the genomic data will be stored.
genome_fasta – The path to the FASTA file containing the genome sequences.
window_size – The size of the genomic windows to be processed.
segments – Optional list of genomic segments to include in the dataset. The genomic segments should be provided as strings in the format “chrom:start-end”. If None, the entire genome will be used.
step_size – The step size for sliding the window across the genome (default is None, which uses window_size).
chunk_size – The number of segments to store in each chunk. If None, it will be calculated based on the formula 2^(25 - log2(window_size)).
resolution – The resolution of the stored genomic data (default is 1).
chroms – A list of chromosomes to include in the dataset. If None, all chromosomes in the FASTA file will be used.
overwrite – If True, existing data at the specified location will be overwritten (default is False).
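To make the segments, window_size, step_size, and chunk_size parameters concrete, here is a small sketch in plain Python. It does not call gdata, and every function name in it is hypothetical. It assumes half-open coordinates and keeps only full windows; how the library treats partial windows at the end of a segment, and how it rounds the chunk-size formula, is not stated in this reference.

```python
import math


def parse_segment(segment):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    chrom, span = segment.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)


def window_starts(start, end, window_size, step_size=None):
    """Start coordinates of every full window inside [start, end).

    Assumption: a step_size of None falls back to window_size, which
    gives non-overlapping windows, as the parameter description says.
    """
    step = window_size if step_size is None else step_size
    return list(range(start, end - window_size + 1, step))


def default_chunk_size(window_size):
    """Evaluate 2^(25 - log2(window_size)) as a float, without rounding."""
    return 2 ** (25 - math.log2(window_size))
```

The formula is equivalent to 2^25 / window_size, so each chunk covers about 2^25 (roughly 33.5 million) window positions regardless of window size. For window_size=196_608 it evaluates to roughly 170.7, and for window_size=1024 it is exactly 32768.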
Examples
>>> from gdata import GenomeDataBuilder
>>> regions = ["chr11:35041782-35238390", "chr11:35200000-35300000"]
>>> tracks = {'DNase:CD14-positive monocyte': 'ENCSR464ETX.w5z', 'DNase:keratinocyte': 'ENCSR000EPQ.w5z'}
>>> builder = GenomeDataBuilder("genome", 'genome.fa.gz', segments=regions, window_size=196_608, resolution=128)
>>> builder.add_files(tracks)
Methods
add_file(key, w5z) – Adds a single file to the dataset.
add_files(files) – Adds w5z files to the dataset.
chroms() – Returns a list of chromosome names present in the genomic data.
Returns the size of each chunk in the dataset.
open() – Opens an existing GenomeDataBuilder instance from a specified location.
segments() – Returns the segments of the genome as a list of strings.
tracks() – Returns the keys (track names) in the dataset.