Base Dataset Classes

Here are the two base classes for MusPy datasets.

class muspy.Dataset[source]

Base class for MusPy datasets.

To build a custom dataset, inherit from this class and override the methods __getitem__ and __len__ as well as the class attribute _info. __getitem__ should return the i-th data sample as a muspy.Music object. __len__ should return the size of the dataset. _info should be a muspy.DatasetInfo instance storing the dataset information.
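
The contract described above can be sketched as follows. To keep the sketch runnable without MusPy installed, Music, DatasetInfo and Dataset are small hypothetical stand-ins defined here, not the real classes; in real code they would be imported from muspy, and the toy ScaleDataset is likewise an illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Music:
    """Stand-in for muspy.Music (hypothetical, minimal)."""

    pitches: List[int] = field(default_factory=list)


@dataclass
class DatasetInfo:
    """Stand-in for muspy.DatasetInfo (hypothetical, minimal)."""

    name: str
    description: Optional[str] = None


class Dataset:
    """Stand-in for muspy.Dataset, reduced to the documented contract."""

    _info = None

    @classmethod
    def info(cls):
        return cls._info

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError


class ScaleDataset(Dataset):
    """Toy dataset: the i-th sample is a major scale starting at MIDI pitch 60 + i."""

    # Class attribute holding the dataset information.
    _info = DatasetInfo(name="Toy scales", description="Illustrative only.")

    def __init__(self, size: int = 12):
        self.size = size

    def __getitem__(self, index: int) -> Music:
        # Return the i-th sample as a Music object.
        if not 0 <= index < self.size:
            raise IndexError(index)
        steps = [0, 2, 4, 5, 7, 9, 11, 12]
        return Music(pitches=[60 + index + step for step in steps])

    def __len__(self) -> int:
        # Return the size of the dataset.
        return self.size


dataset = ScaleDataset(size=3)
print(len(dataset), ScaleDataset.info().name, dataset[0].pitches)
```

With the real library, only the ScaleDataset part would remain, built on muspy.Dataset, muspy.Music and muspy.DatasetInfo.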

classmethod citation()[source]

Print the citation information.

classmethod info()[source]

Return the dataset information.

save(root: Union[str, pathlib.Path], kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, verbose: bool = True, **kwargs)[source]

Save all the music objects to a directory.

Parameters:
  • root (str or Path) – Root directory to save the data.
  • kind ({'json', 'yaml'}, default: 'json') – File format to save the data.
  • n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.
  • ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.
  • verbose (bool, default: True) – Whether to be verbose.
  • **kwargs – Keyword arguments to pass to muspy.save().
split(filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None) → Dict[str, List[int]][source]

Return the dataset splits as a dictionary of sample-index lists.

Parameters:
  • filename (str or Path, optional) – Path to the split file. If the file exists, the split is read from it; otherwise, the newly created split is saved to it.
  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
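
As a rough illustration of how the splits ratios could translate into index lists, here is a minimal sketch. It is not MusPy's implementation: make_splits is a hypothetical helper, the standard library's random.Random replaces numpy.random.RandomState, the key names "train", "test" and "validation" are assumed from the parameter description, and a single float is assumed to be the training ratio.

```python
import random
from typing import Dict, List, Sequence, Union


def make_splits(
    n_samples: int,
    splits: Union[float, Sequence[float]],
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Shuffle the sample indices and cut them according to the ratios."""
    if isinstance(splits, (int, float)):
        # Assumption: a single float is the training ratio; the rest is test.
        splits = [splits, 1.0 - splits]
    if len(splits) not in (2, 3):
        raise ValueError("expected a float or a list of two or three ratios")
    if any(ratio < 0 for ratio in splits) or sum(splits) > 1.0 + 1e-9:
        raise ValueError("ratios must be non-negative and sum to at most 1")

    # Assumed key names, in the order given by the parameter description.
    names = ["train", "test", "validation"][: len(splits)]

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    result: Dict[str, List[int]] = {}
    start, cumulative = 0, 0.0
    for name, ratio in zip(names, splits):
        # Cut at cumulative boundaries so the parts never overlap.
        cumulative += ratio
        stop = min(n_samples, round(cumulative * n_samples))
        result[name] = indices[start:stop]
        start = stop
    return result


parts = make_splits(10, [0.8, 0.1, 0.1], seed=42)
print({name: len(ids) for name, ids in parts.items()})
```

Fixing the seed makes the split reproducible, which is the role random_state plays in the documented method.
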
to_pytorch_dataset(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TorchDataset, Dict[str, TorchDataset]][source]

Return the dataset as a PyTorch dataset.

Parameters:
  • factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
  • representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
  • split_filename (str or Path, optional) – Path to the split file. If the file exists, the split is read from it; otherwise, the newly created split is saved to it.
  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
Returns:

Converted PyTorch dataset(s).

Return type:

torch.utils.data.Dataset or Dict of torch.utils.data.Dataset
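
To make the factory parameter concrete, here is a minimal sketch of a callable that turns a Music-like object into a fixed-size numeric structure. The name note_factory and the max_notes parameter are hypothetical, and the object passed in is a SimpleNamespace stand-in rather than a real muspy.Music; the sketch only assumes that a Music object exposes tracks, each holding notes with time, pitch and duration attributes. It returns a nested list to stay dependency-free; in practice the result would typically be wrapped in an array or tensor.

```python
from types import SimpleNamespace
from typing import List


def note_factory(music, max_notes: int = 4) -> List[List[int]]:
    """Flatten a Music-like object into fixed-length (time, pitch, duration) rows."""
    rows = [
        [note.time, note.pitch, note.duration]
        for track in music.tracks
        for note in track.notes
    ]
    rows.sort()  # order by onset time, then pitch
    rows = rows[:max_notes]
    # Pad with zero rows so that every sample has the same shape.
    rows += [[0, 0, 0] for _ in range(max_notes - len(rows))]
    return rows


# Stand-in for a muspy.Music object, built only for this illustration.
music = SimpleNamespace(
    tracks=[
        SimpleNamespace(
            notes=[
                SimpleNamespace(time=24, pitch=64, duration=24),
                SimpleNamespace(time=0, pitch=60, duration=24),
            ]
        )
    ]
)
sample = note_factory(music)
print(sample)
```

Such a callable would then be passed as the factory argument, for example dataset.to_pytorch_dataset(factory=note_factory).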

to_tensorflow_dataset(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TFDataset, Dict[str, TFDataset]][source]

Return the dataset as a TensorFlow dataset.

Parameters:
  • factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
  • representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
  • split_filename (str or Path, optional) – Path to the split file. If the file exists, the split is read from it; otherwise, the newly created split is saved to it.
  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
Returns:

Converted TensorFlow dataset(s).

Return type:

tensorflow.data.Dataset or Dict of tensorflow.data.Dataset

class muspy.RemoteDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, verbose: bool = True)[source]

Base class for remote MusPy datasets.

This class extends muspy.Dataset to support remote datasets. To build a custom remote dataset, please refer to the documentation of muspy.Dataset for details. In addition, set the class attribute _sources to the URLs to the source files (see Notes).

root

Root directory of the dataset.

Type: str or Path
Parameters:
  • download_and_extract (bool, default: False) – Whether to download and extract the dataset.
  • overwrite (bool, default: False) – Whether to overwrite existing file(s).
  • cleanup (bool, default: False) – Whether to remove the source archive(s).
  • verbose (bool, default: True) – Whether to be verbose.
Raises:

RuntimeError – If download_and_extract is False but file {root}/.muspy.success does not exist (see below).

Important

muspy.RemoteDataset.exists() depends solely on a special file named .muspy.success in directory {root}/_converted/. This file serves as an indicator of the existence and integrity of the dataset. It is created automatically when the dataset is successfully downloaded and extracted by muspy.RemoteDataset.download_and_extract(). If the dataset is downloaded manually, make sure to create the .muspy.success file in directory {root}/_converted/ to prevent errors.
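
For a manually downloaded dataset, the marker file can be created with a few lines of standard-library code. This is a minimal sketch: mark_as_ready is a hypothetical helper, and a temporary directory stands in for the real dataset root. The note above places the file under {root}/_converted/, whereas the Raises entry refers to {root}/.muspy.success, so the helper takes the target directory as an argument rather than assuming one.

```python
import tempfile
from pathlib import Path


def mark_as_ready(directory: Path) -> Path:
    """Create an empty .muspy.success file in `directory` and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    marker = directory / ".muspy.success"
    marker.touch(exist_ok=True)
    return marker


# A temporary directory stands in for the real dataset root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_dataset"
    marker = mark_as_ready(root / "_converted")
    marker_created = marker.is_file()
    marker_relative = marker.relative_to(root).as_posix()

print(marker_created, marker_relative)
```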

Notes

The class attribute _sources is a dictionary storing the following information about each source file.

  • filename (str): Name to save the file.
  • url (str): URL to the file.
  • archive (bool): Whether the file is an archive.
  • md5 (str, optional): Expected MD5 checksum of the file.
  • sha256 (str, optional): Expected SHA256 checksum of the file.

Here is an example:

_sources = {
    "example": {
        "filename": "example.tar.gz",
        "url": "https://www.example.com/example.tar.gz",
        "archive": True,
        "md5": None,
        "sha256": None,
    }
}
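
The md5 and sha256 fields make it possible to check a downloaded file before extracting it. The sketch below shows one way such a check could be done with hashlib; file_digest and verify_source are hypothetical helpers written for this illustration and are not part of MusPy, and the file contents are dummy bytes rather than a real archive.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Any, Dict


def file_digest(path: Path, algorithm: str) -> str:
    """Return the hex digest of a file, read in chunks to bound memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_source(path: Path, source: Dict[str, Any]) -> bool:
    """Return True if every checksum given in `source` matches the file."""
    for algorithm in ("md5", "sha256"):
        expected = source.get(algorithm)
        # A missing or None checksum is skipped, since both fields are optional.
        if expected is not None and file_digest(path, algorithm) != expected.lower():
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    payload = b"not a real archive"
    path = Path(tmp) / "example.tar.gz"
    path.write_bytes(payload)
    source = {
        "filename": path.name,
        "url": "https://www.example.com/example.tar.gz",
        "archive": True,
        "md5": None,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    ok = verify_source(path, source)
    tampered = verify_source(path, {**source, "sha256": "0" * 64})

print(ok, tampered)
```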

See also

muspy.Dataset
Base class for MusPy datasets.

classmethod citation()

Print the citation information.

download(overwrite: bool = False, verbose: bool = True) → RemoteDatasetType[source]

Download the dataset source(s).

Parameters:
  • overwrite (bool, default: False) – Whether to overwrite existing file(s).
  • verbose (bool, default: True) – Whether to be verbose.
Returns:

Object itself.

download_and_extract(overwrite: bool = False, cleanup: bool = False, verbose: bool = True) → RemoteDatasetType[source]

Download source datasets and extract the downloaded archives.

Parameters:
  • overwrite (bool, default: False) – Whether to overwrite existing file(s).
  • cleanup (bool, default: False) – Whether to remove the source archive(s).
  • verbose (bool, default: True) – Whether to be verbose.
Returns:

Object itself.

exists() → bool[source]

Return True if the dataset exists, otherwise False.

extract(cleanup: bool = False, verbose: bool = True) → RemoteDatasetType[source]

Extract the downloaded archive(s).

Parameters:
  • cleanup (bool, default: False) – Whether to remove the source archive after extraction.
  • verbose (bool, default: True) – Whether to be verbose.
Returns:

Object itself.

classmethod info()

Return the dataset information.

save(root: Union[str, pathlib.Path], kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, verbose: bool = True, **kwargs)

Save all the music objects to a directory.

Parameters:
  • root (str or Path) – Root directory to save the data.
  • kind ({'json', 'yaml'}, default: 'json') – File format to save the data.
  • n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.
  • ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.
  • verbose (bool, default: True) – Whether to be verbose.
  • **kwargs – Keyword arguments to pass to muspy.save().
source_exists() → bool[source]

Return True if all the sources exist, otherwise False.

split(filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None) → Dict[str, List[int]]

Return the dataset splits as a dictionary of sample-index lists.

Parameters:
  • filename (str or Path, optional) – Path to the split file. If the file exists, the split is read from it; otherwise, the newly created split is saved to it.
  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
to_pytorch_dataset(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TorchDataset, Dict[str, TorchDataset]]

Return the dataset as a PyTorch dataset.

Parameters:
  • factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
  • representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
  • split_filename (str or Path, optional) – Path to the split file. If the file exists, the split is read from it; otherwise, the newly created split is saved to it.
  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
Returns:

Converted PyTorch dataset(s).

Return type:

torch.utils.data.Dataset or Dict of torch.utils.data.Dataset

to_tensorflow_dataset(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TFDataset, Dict[str, TFDataset]]

Return the dataset as a TensorFlow dataset.

Parameters:
  • factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
  • representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
  • split_filename (str or Path, optional) – Path to the split file. If the file exists, the split is read from it; otherwise, the newly created split is saved to it.
  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
Returns:

Converted TensorFlow dataset(s).

Return type:

tensorflow.data.Dataset or Dict of tensorflow.data.Dataset