Base Dataset Classes

Here are the two base classes for MusPy datasets.

class muspy.Dataset

Base class for MusPy datasets.

To build a custom dataset, inherit from this class and override the methods __getitem__ and __len__ as well as the class attribute _info. __getitem__ should return the i-th data sample as a muspy.Music object. __len__ should return the size of the dataset. _info should be a muspy.DatasetInfo instance storing the dataset information.

classmethod citation(): Print the citation information.

classmethod info(): Return the dataset information.

save(root: Union[str, pathlib.Path], kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, verbose: bool = True, **kwargs)

Save all the music objects to a directory.

Parameters:

root (str or Path) – Root directory to save the data.
kind ({'json', 'yaml'}, default: 'json') – File format to save the data.
n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.
ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.
verbose (bool, default: True) – Whether to be verbose.
**kwargs – Keyword arguments to pass to muspy.save().
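
The effect of ignore_exceptions can be pictured with a minimal sketch. It is not MusPy's implementation: save_all is a hypothetical helper, plain dictionaries stand in for muspy.Music objects, and the {index}.json file naming is an assumption made for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_all(samples, root, ignore_exceptions=True):
    """Write each sample to its own JSON file and return the saved paths.

    Hypothetical helper: `samples` are plain dicts standing in for
    Music objects, and the `{index}.json` naming is an assumption.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, sample in enumerate(samples):
        try:
            text = json.dumps(sample)
        except (TypeError, ValueError):
            if ignore_exceptions:
                continue  # skip the sample that failed to convert
            raise
        path = root / f"{index}.json"
        path.write_text(text, encoding="utf-8")
        saved.append(path)
    return saved


# The second sample cannot be serialized, so it is skipped.
samples = [{"pitch": 60}, {"pitch": object()}, {"pitch": 62}]
with tempfile.TemporaryDirectory() as tmp:
    saved_names = [path.name for path in save_all(samples, tmp)]
```

With ignore_exceptions=False, the same call would stop at the first failing sample and re-raise its error instead of skipping it.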

split(filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None) → Dict[str, List[int]]

Return the dataset split as a dictionary mapping each split name to a list of sample indices.

Parameters:

filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If None or the file does not exist, path to save the split.
splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
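
How the ratios might translate into index lists can be sketched as follows. This is an illustration, not MusPy's implementation: split_indices is a hypothetical helper, the key names "train", "test" and "validation" are assumed, a single float is read as the train ratio, and the standard-library random.Random takes the place of numpy.random.RandomState.

```python
import random
from typing import Dict, List, Optional, Sequence, Union


def split_indices(
    n_samples: int,
    splits: Union[float, Sequence[float]],
    seed: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Shuffle sample indices and cut them into named splits (a sketch)."""
    # Assumption: a single float is the train ratio; the rest goes to test.
    ratios = [splits, 1.0 - splits] if isinstance(splits, float) else list(splits)
    if len(ratios) not in (2, 3):
        raise ValueError("expected a float or a list of two or three floats")
    if any(ratio < 0 for ratio in ratios) or sum(ratios) > 1.0 + 1e-9:
        raise ValueError("ratios must be non-negative and sum to at most 1")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed gives a repeatable split

    names = ("train", "test", "validation")  # assumed key names
    result: Dict[str, List[int]] = {}
    start, cumulative = 0, 0.0
    for name, ratio in zip(names, ratios):
        cumulative += ratio
        stop = min(n_samples, round(cumulative * n_samples))
        result[name] = indices[start:stop]
        start = stop
    return result


example = split_indices(10, [0.8, 0.1, 0.1], seed=0)
```

Cutting at cumulative boundaries rather than rounding each ratio separately keeps the parts disjoint and prevents rounding errors from accumulating across splits.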

to_pytorch_dataset(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TorchDataset, Dict[str, TorchDataset]]

Return the dataset as a PyTorch dataset.

Parameters:

factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
split_filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If None or the file does not exist, path to save the split.
splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.

Returns:	Converted PyTorch dataset(s).
Return type:	torch.utils.data.Dataset or Dict of torch.utils.data.Dataset

to_tensorflow_dataset(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TFDataset, Dict[str, TFDataset]]

Return the dataset as a TensorFlow dataset.

Parameters:

factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
split_filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If None or the file does not exist, path to save the split.
splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.

Returns:	Converted TensorFlow dataset(s).
Return type:	tensorflow.data.Dataset or Dict of tensorflow.data.Dataset

class muspy.RemoteDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, verbose: bool = True)

Base class for remote MusPy datasets.

This class extends muspy.Dataset to support remote datasets. To build a custom remote dataset, please refer to the documentation of muspy.Dataset for details. In addition, set the class attribute _sources to the URLs to the source files (see Notes).

root

Root directory of the dataset.

Type:	str or Path

Parameters:

download_and_extract (bool, default: False) – Whether to download and extract the dataset.
overwrite (bool, default: False) – Whether to overwrite existing file(s).
cleanup (bool, default: False) – Whether to remove the source archive(s).
verbose (bool, default: True) – Whether to be verbose.

Raises:	RuntimeError – If download_and_extract is False but file {root}/.muspy.success does not exist (see below).

Important

muspy.RemoteDataset.exists() depends solely on a special file named .muspy.success in directory {root}/. This file serves as an indicator of the existence and integrity of the dataset. It is created automatically when the dataset is successfully downloaded and extracted by muspy.RemoteDataset.download_and_extract(). If the dataset is downloaded manually, make sure to create the .muspy.success file in directory {root}/ to prevent errors.
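
For a manually downloaded dataset, the marker file could be created with a few lines of standard-library code. The sketch below uses hypothetical helper names (mark_as_ready, is_ready) and takes the target directory as an argument, so pass whichever directory the note above names for the marker.

```python
import tempfile
from pathlib import Path

SUCCESS_FILENAME = ".muspy.success"  # marker name taken from the note above


def mark_as_ready(directory) -> Path:
    """Create an empty marker file in `directory` and return its path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    marker = directory / SUCCESS_FILENAME
    marker.touch(exist_ok=True)
    return marker


def is_ready(directory) -> bool:
    """Return True if the marker file is present in `directory`."""
    return (Path(directory) / SUCCESS_FILENAME).is_file()


# Usage on a throwaway directory standing in for the dataset location.
with tempfile.TemporaryDirectory() as tmp:
    before = is_ready(tmp)
    marker_name = mark_as_ready(tmp).name
    after = is_ready(tmp)
```

Because the marker is only a presence check, creating it by hand asserts that the files are complete, so it is worth verifying the download first.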

Notes

The class attribute _sources is a dictionary storing the following information for each source file:

filename (str): Name to save the file.
url (str): URL to the file.
archive (bool): Whether the file is an archive.
md5 (str, optional): Expected MD5 checksum of the file.
sha256 (str, optional): Expected SHA256 checksum of the file.

Here is an example:

_sources = {
    "example": {
        "filename": "example.tar.gz",
        "url": "https://www.example.com/example.tar.gz",
        "archive": True,
        "md5": None,
        "sha256": None,
    }
}
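
The md5 and sha256 fields suggest that a downloaded file can be checked against its expected digest. A minimal sketch of such a check, using only hashlib, might look like the following. The helper names file_digest and checksums_match are hypothetical, and treating a missing or None checksum as "do not check" is an assumption rather than documented MusPy behavior.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, algorithm: str) -> str:
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_match(path, source: dict) -> bool:
    """Check a file against the md5/sha256 fields of one source entry.

    Assumption: a missing or None checksum means "do not check".
    """
    for algorithm in ("md5", "sha256"):
        expected = source.get(algorithm)
        if expected is not None and file_digest(path, algorithm) != expected.lower():
            return False
    return True


# Usage on a small temporary file with a known and a wrong digest.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example.bin"
    sample.write_bytes(b"hello")
    good = {"md5": hashlib.md5(b"hello").hexdigest(), "sha256": None}
    bad = {"md5": "0" * 32, "sha256": None}
    matches_good = checksums_match(sample, good)
    matches_bad = checksums_match(sample, bad)
```

Reading in chunks keeps memory use flat, which matters for large dataset archives.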

See also

muspy.Dataset: Base class for MusPy datasets.

classmethod citation(): Print the citation information.

download(overwrite: bool = False, verbose: bool = True) → RemoteDatasetType

Download the dataset source(s).

Parameters:

overwrite (bool, default: False) – Whether to overwrite existing file(s).
verbose (bool, default: True) – Whether to be verbose.

Returns:	Object itself.

download_and_extract(overwrite: bool = False, cleanup: bool = False, verbose: bool = True) → RemoteDatasetType

Download source datasets and extract the downloaded archives.

Parameters:

overwrite (bool, default: False) – Whether to overwrite existing file(s).
cleanup (bool, default: False) – Whether to remove the source archive(s).
verbose (bool, default: True) – Whether to be verbose.

Returns:	Object itself.

exists() → bool: Return True if the dataset exists, otherwise False.

extract(cleanup: bool = False, verbose: bool = True) → RemoteDatasetType

Extract the downloaded archive(s).

Parameters:

cleanup (bool, default: False) – Whether to remove the source archive after extraction.
verbose (bool, default: True) – Whether to be verbose.

Returns:	Object itself.

classmethod info(): Return the dataset information.

save(root: Union[str, pathlib.Path], kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, verbose: bool = True, **kwargs)

Save all the music objects to a directory.

Parameters:

root (str or Path) – Root directory to save the data.
kind ({'json', 'yaml'}, default: 'json') – File format to save the data.
n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.
ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.
verbose (bool, default: True) – Whether to be verbose.
**kwargs – Keyword arguments to pass to muspy.save().

source_exists() → bool: Return True if all the sources exist, otherwise False.

split(filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None) → Dict[str, List[int]]

Return the dataset split as a dictionary mapping each split name to a list of sample indices.

Parameters:

filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If None or the file does not exist, path to save the split.
splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.

to_pytorch_dataset(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TorchDataset, Dict[str, TorchDataset]]

Return the dataset as a PyTorch dataset.

Parameters:

factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
split_filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If None or the file does not exist, path to save the split.
splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.

Returns:	Converted PyTorch dataset(s).
Return type:	torch.utils.data.Dataset or Dict of torch.utils.data.Dataset

to_tensorflow_dataset(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TFDataset, Dict[str, TFDataset]]

Return the dataset as a TensorFlow dataset.

Parameters:

factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
split_filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If None or the file does not exist, path to save the split.
splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.

Returns:	Converted TensorFlow dataset(s).
Return type:	tensorflow.data.Dataset or Dict of tensorflow.data.Dataset