muspy.datasets¶
Dataset classes.
This module provides an easy-to-use dataset management system. Each supported dataset in MusPy comes with a class that inherits from the base MusPy Dataset class. The module also provides interfaces to PyTorch and TensorFlow for creating input pipelines for machine learning.
Base Classes¶
- ABCFolderDataset
- Dataset
- DatasetInfo
- FolderDataset
- RemoteABCFolderDataset
- RemoteDataset
- RemoteFolderDataset
- RemoteMusicDataset
- MusicDataset
Dataset Classes¶
- EssenFolkSongDatabase
- EMOPIADataset
- HaydnOp20Dataset
- HymnalDataset
- HymnalTuneDataset
- JSBChoralesDataset
- LakhMIDIAlignedDataset
- LakhMIDIDataset
- LakhMIDIMatchedDataset
- MAESTRODatasetV1
- MAESTRODatasetV2
- MAESTRODatasetV3
- Music21Dataset
- MusicNetDataset
- NESMusicDatabase
- NottinghamDatabase
- WikifoniaDataset
-
class
muspy.datasets.
ABCFolderDataset
(root: Union[str, pathlib.Path], convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None)[source]¶ Class for datasets storing ABC files in a folder.
See also
muspy.FolderDataset
- Class for datasets storing files in a folder.
-
class
muspy.datasets.
Dataset
[source]¶ Base class for MusPy datasets.
To build a custom dataset, inherit from this class and override the methods __getitem__ and __len__ as well as the class attribute _info. __getitem__ should return the i-th data sample as a muspy.Music object. __len__ should return the size of the dataset. _info should be a muspy.DatasetInfo instance storing the dataset information.
-
save
(root: Union[str, pathlib.Path], kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, verbose: bool = True, **kwargs)[source]¶ Save all the music objects to a directory.
Parameters: - root (str or Path) – Root directory to save the data.
- kind ({'json', 'yaml'}, default: 'json') – File format to save the data.
- n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.
- ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.
- verbose (bool, default: True) – Whether to be verbose.
- **kwargs – Keyword arguments to pass to
muspy.save()
.
-
split
(filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None) → Dict[str, List[int]][source]¶ Split the dataset and return the sample indices of each split.
Parameters: - filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If the file does not exist, path to save the split to.
- splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
- random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it is used directly to create the splits.
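The ratio semantics described above can be illustrated with a minimal, standard-library sketch. This is not MusPy's implementation: the helper name split_indices, the use of random.Random instead of numpy.random.RandomState, the reading of a single float as the train ratio, and the requirement that ratios sum to 1 are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Sequence, Union


def split_indices(
    n_samples: int,
    splits: Union[float, Sequence[float]],
    seed: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Shuffle sample indices and partition them by the given ratios."""
    # Assumption: a single float is the train ratio; the rest goes to test.
    if isinstance(splits, float):
        splits = [splits, 1.0 - splits]
    ratios = list(splits)
    if len(ratios) not in (2, 3):
        raise ValueError("expected a float or a list of two or three floats")
    # Assumption of this sketch: the ratios must sum to 1.
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("split ratios must sum to 1")

    # Assumption: ratios are ordered train, test, validation.
    names = ["train", "test", "validation"][: len(ratios)]

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    result: Dict[str, List[int]] = {}
    start = 0
    for position, (name, ratio) in enumerate(zip(names, ratios)):
        if position == len(ratios) - 1:
            end = n_samples  # the last split takes whatever is left
        else:
            end = min(n_samples, start + round(ratio * n_samples))
        result[name] = sorted(indices[start:end])
        start = end
    return result


parts = split_indices(10, [0.8, 0.1, 0.1], seed=0)
```

Every index lands in exactly one split, and reusing the same seed reproduces the same partition, which is the role random_state plays in the method above.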
-
to_pytorch_dataset
(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TorchDataset, Dict[str, TorchDataset]][source]¶ Return the dataset as a PyTorch dataset.
Parameters: - factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
- representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
- split_filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If the file does not exist, path to save the split to.
- splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
- random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it is used directly to create the splits.
Returns: Converted PyTorch dataset(s).
Return type: torch.utils.data.Dataset or Dict of torch.utils.data.Dataset
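As a rough illustration of what a factory callable might look like, the sketch below flattens note pitches into a flat integer array. The name pitch_factory is hypothetical, the attribute layout (tracks, notes, pitch) is assumed to mirror muspy.Music, and SimpleNamespace stand-ins are used so the snippet runs without MusPy installed.

```python
from array import array
from types import SimpleNamespace


def pitch_factory(music) -> array:
    """Flatten every note pitch in a Music-like object into one integer array."""
    # Assumption: `music.tracks` is a list of tracks, each with a `notes` list
    # whose items expose an integer `pitch`, mirroring muspy.Music.
    return array(
        "i", (note.pitch for track in music.tracks for note in track.notes)
    )


# Stand-in object (not a real muspy.Music) so the sketch is self-contained.
demo = SimpleNamespace(
    tracks=[
        SimpleNamespace(
            notes=[SimpleNamespace(pitch=60), SimpleNamespace(pitch=64)]
        ),
        SimpleNamespace(notes=[SimpleNamespace(pitch=67)]),
    ]
)
pitches = pitch_factory(demo)
```

A callable of this shape is what the factory parameter expects: one Music object in, one array-like out. A real pipeline would more likely return a NumPy array or a tensor than a standard-library array.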
-
to_tensorflow_dataset
(factory: Callable = None, representation: str = None, split_filename: Union[str, pathlib.Path] = None, splits: Sequence[float] = None, random_state: Any = None, **kwargs) → Union[TFDataset, Dict[str, TFDataset]][source]¶ Return the dataset as a TensorFlow dataset.
Parameters: - factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
- representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
- split_filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If the file does not exist, path to save the split to.
- splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
- random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it is used directly to create the splits.
Returns: Converted TensorFlow dataset(s).
Return type: tensorflow.data.Dataset or Dict of tensorflow.data.Dataset
-
-
class
muspy.datasets.
DatasetInfo
(name: str = None, description: str = None, homepage: str = None, license: str = None)[source]¶ A container for dataset information.
-
class
muspy.datasets.
EMOPIADataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ EMOPIA Dataset.
-
class
muspy.datasets.
EssenFolkSongDatabase
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Essen Folk Song Database.
-
class
muspy.datasets.
FolderDataset
(root: Union[str, pathlib.Path], convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None)[source]¶ Class for datasets storing files in a folder.
This class extends muspy.Dataset to support folder datasets. To build a custom folder dataset, please refer to the documentation of muspy.Dataset for details. In addition, set the class attribute _extension to the extension to look for when building the dataset and set read to a callable that takes the filename of a source file as input and returns the converted Music object.
Parameters: - convert (bool, default: False) – Whether to convert the dataset to MusPy JSON/YAML files. If False, check whether converted data exists. If so, disable on-the-fly mode. If not, enable on-the-fly mode and emit a warning.
- kind ({'json', 'yaml'}, default: 'json') – File format to save the data.
- n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.
- ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.
- use_converted (bool, optional) – Force to disable on-the-fly mode and use converted data. Defaults to True if converted data exist, otherwise False.
Important
muspy.FolderDataset.converted_exists() depends solely on a special file named .muspy.success in the folder {root}/_converted/, which serves as an indicator for the existence and integrity of the converted dataset. If the converted dataset is built by muspy.FolderDataset.convert(), the .muspy.success file will be created as well. If the converted dataset is created manually, make sure to create the .muspy.success file in the folder {root}/_converted/ to prevent errors.
Notes
Two modes are available for this dataset. When the on-the-fly mode is enabled, a data sample is converted to a music object on the fly when being indexed. When the on-the-fly mode is disabled, a data sample is loaded from the precomputed converted data.
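The marker-file check and the mode selection described above can be sketched with pathlib alone. This is an illustration under stated assumptions, not MusPy's code: the helper names converted_exists, use_on_the_fly and mark_converted are hypothetical, and only the .muspy.success filename and the {root}/_converted/ location come from the text.

```python
import tempfile
from pathlib import Path
from typing import Optional

SUCCESS_MARKER = ".muspy.success"  # marker filename named in the text


def converted_exists(root: Path) -> bool:
    """Report whether the converted dataset is marked as complete."""
    return (root / "_converted" / SUCCESS_MARKER).is_file()


def use_on_the_fly(root: Path, use_converted: Optional[bool] = None) -> bool:
    """Decide the mode; use_converted defaults to whether converted data exist."""
    if use_converted is None:
        use_converted = converted_exists(root)
    return not use_converted


def mark_converted(root: Path) -> None:
    """Create the marker by hand, e.g. after converting files manually."""
    converted_dir = root / "_converted"
    converted_dir.mkdir(parents=True, exist_ok=True)
    (converted_dir / SUCCESS_MARKER).touch()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = use_on_the_fly(root)  # no marker yet: on-the-fly mode
    mark_converted(root)
    after = use_on_the_fly(root)  # marker present: converted data are used
```

Creating the marker manually is only appropriate once the converted files are actually in place, since the marker is the sole signal of their existence and integrity.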
See also
muspy.Dataset
- Base class for MusPy datasets.
-
convert
(kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, verbose: bool = True, **kwargs) → FolderDatasetType[source]¶ Convert and save the Music objects.
Each converted file is named by its index and saved to root/_converted. The original filenames can be found in the filenames attribute. For example, the file at filenames[i] will be converted and saved to {i}.json.
Parameters: - kind ({'json', 'yaml'}, default: 'json') – File format to save the data.
- n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.
- ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.
- verbose (bool, default: True) – Whether to be verbose.
- **kwargs – Keyword arguments to pass to
muspy.save()
.
Returns: Object itself.
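The index-based naming scheme can be made concrete with a small sketch. The helper converted_paths is hypothetical, and using .yaml as the extension when kind is 'yaml' is an assumption; the text only gives the {i}.json case.

```python
from pathlib import Path
from typing import List, Sequence


def converted_paths(
    root: str, filenames: Sequence[str], kind: str = "json"
) -> List[Path]:
    """Map each source file to its index-named target in root/_converted."""
    if kind not in ("json", "yaml"):
        raise ValueError("kind must be 'json' or 'yaml'")
    converted_dir = Path(root) / "_converted"
    # The file at filenames[i] is saved as {i}.{kind}; the source name is not kept.
    return [converted_dir / f"{i}.{kind}" for i in range(len(filenames))]


paths = converted_paths("data", ["a/song.mid", "b/tune.mid"])
```

Because the target name carries only the index, the filenames attribute is the place to look up which source file a converted file came from.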
-
converted_dir
¶ Path to the root directory of the converted dataset.
-
load
(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]¶ Load a file into a Music object.
-
class
muspy.datasets.
HaydnOp20Dataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Haydn Op.20 Dataset.
-
class
muspy.datasets.
HymnalDataset
(root: Union[str, pathlib.Path], download: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None)[source]¶ Hymnal Dataset.
-
class
muspy.datasets.
HymnalTuneDataset
(root: Union[str, pathlib.Path], download: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None)[source]¶ Hymnal Dataset (tune only).
-
class
muspy.datasets.
JSBChoralesDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Johann Sebastian Bach Chorales Dataset.
-
class
muspy.datasets.
LakhMIDIAlignedDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Lakh MIDI Dataset - aligned subset.
-
class
muspy.datasets.
LakhMIDIDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Lakh MIDI Dataset.
-
class
muspy.datasets.
LakhMIDIMatchedDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Lakh MIDI Dataset - matched subset.
-
class
muspy.datasets.
MAESTRODatasetV1
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ MAESTRO Dataset V1 (MIDI only).
-
class
muspy.datasets.
MAESTRODatasetV2
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ MAESTRO Dataset V2 (MIDI only).
-
class
muspy.datasets.
MAESTRODatasetV3
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ MAESTRO Dataset V3 (MIDI only).
-
class
muspy.datasets.
Music21Dataset
(composer: str = None)[source]¶ A class of datasets containing files in the music21 corpus.
Parameters: - composer (str) – Name of a composer or a collection. Please refer to the music21 corpus reference page for a full list [1].
- extensions (list of str) – File extensions of desired files.
References
[1] https://web.mit.edu/music21/doc/about/referenceCorpus.html
-
convert
(root: Union[str, pathlib.Path], kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True) → muspy.datasets.base.MusicDataset[source]¶ Convert and save the Music objects.
Parameters: - root (str or Path) – Root directory to save the data.
- kind ({'json', 'yaml'}, default: 'json') – File format to save the data.
- n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.
- ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.
-
class
muspy.datasets.
MusicDataset
(root: Union[str, pathlib.Path], kind: str = None)[source]¶ Class for datasets of MusPy JSON/YAML files.
Parameters: - root (str or Path) – Root directory of the dataset.
- kind ({'json', 'yaml'}, optional) – File formats to include in the dataset. Defaults to include both JSON and YAML files.
-
root
¶ Root directory of the dataset.
Type: Path
-
filenames
¶ Path to the files, relative to root.
Type: list of Path
See also
muspy.Dataset
- Base class for MusPy datasets.
-
class
muspy.datasets.
MusicNetDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ MusicNet Dataset (MIDI only).
-
class
muspy.datasets.
NESMusicDatabase
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ NES Music Database.
-
class
muspy.datasets.
NottinghamDatabase
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Nottingham Database.
-
class
muspy.datasets.
RemoteABCFolderDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Base class for remote datasets storing ABC files in a folder.
See also
muspy.ABCFolderDataset
- Class for datasets storing ABC files in a folder.
muspy.RemoteDataset
- Base class for remote MusPy datasets.
-
class
muspy.datasets.
RemoteDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, verbose: bool = True)[source]¶ Base class for remote MusPy datasets.
This class extends muspy.Dataset to support remote datasets. To build a custom remote dataset, please refer to the documentation of muspy.Dataset for details. In addition, set the class attribute _sources to the URLs to the source files (see Notes).
Raises: RuntimeError – If download_and_extract is False but file {root}/.muspy.success does not exist (see below).
Important
muspy.Dataset.exists() depends solely on a special file named .muspy.success in directory {root}/_converted/. This file serves as an indicator for the existence and integrity of the dataset. It will automatically be created if the dataset is successfully downloaded and extracted by muspy.Dataset.download_and_extract(). If the dataset is downloaded manually, make sure to create the .muspy.success file in directory {root}/_converted/ to prevent errors.
Notes
The class attribute _sources is a dictionary storing the following information of each source file.
- filename (str): Name to save the file.
- url (str): URL to the file.
- archive (bool): Whether the file is an archive.
- md5 (str, optional): Expected MD5 checksum of the file.
- sha256 (str, optional): Expected SHA256 checksum of the file.
Here is an example:
    _sources = {
        "example": {
            "filename": "example.tar.gz",
            "url": "https://www.example.com/example.tar.gz",
            "archive": True,
            "md5": None,
            "sha256": None,
        }
    }
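The optional md5 and sha256 fields suggest a checksum comparison after download. The following is a minimal sketch of such a check using hashlib; the helper names file_digest and verify_source are hypothetical, and how MusPy itself performs the verification is not shown in this page.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Any, Dict


def file_digest(path: Path, algorithm: str) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_source(path: Path, source: Dict[str, Any]) -> bool:
    """Check a file against whichever checksums the source entry provides."""
    for algorithm in ("md5", "sha256"):
        expected = source.get(algorithm)
        # A missing or None checksum is treated as "nothing to verify".
        if expected is not None and file_digest(path, algorithm) != expected.lower():
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "example.tar.gz"
    archive.write_bytes(b"demo")  # placeholder bytes, not a real archive
    source = {
        "filename": "example.tar.gz",
        "url": "https://www.example.com/example.tar.gz",
        "archive": True,
        "md5": None,
        "sha256": hashlib.sha256(b"demo").hexdigest(),
    }
    ok = verify_source(archive, source)
    bad = verify_source(archive, dict(source, md5="0" * 32))
```

Leaving both checksums as None, as in the example entry, makes the check pass trivially, so supplying at least one digest is what gives it any value.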
See also
muspy.Dataset
- Base class for MusPy datasets.
-
download
(overwrite: bool = False, verbose: bool = True) → RemoteDatasetType[source]¶ Download the dataset source(s).
Returns: Object itself.
-
download_and_extract
(overwrite: bool = False, cleanup: bool = False, verbose: bool = True) → RemoteDatasetType[source]¶ Download source datasets and extract the downloaded archives.
Returns: Object itself.
-
class
muspy.datasets.
RemoteFolderDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Base class for remote datasets storing files in a folder.
Parameters: - download_and_extract (bool, default: False) – Whether to download and extract the dataset.
- cleanup (bool, default: False) – Whether to remove the source archive(s).
- convert (bool, default: False) – Whether to convert the dataset to MusPy JSON/YAML files. If False, check whether converted data exists. If so, disable on-the-fly mode. If not, enable on-the-fly mode and emit a warning.
- kind ({'json', 'yaml'}, default: 'json') – File format to save the data.
- n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.
- ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.
- use_converted (bool, optional) – Force to disable on-the-fly mode and use converted data. Defaults to True if converted data exist, otherwise False.
See also
muspy.FolderDataset
- Class for datasets storing files in a folder.
muspy.RemoteDataset
- Base class for remote MusPy datasets.
-
class
muspy.datasets.
RemoteMusicDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, kind: str = None, verbose: bool = True)[source]¶ Base class for remote datasets of MusPy JSON/YAML files.
Parameters: - root (str or Path) – Root directory of the dataset.
- download_and_extract (bool, default: False) – Whether to download and extract the dataset.
- overwrite (bool, default: False) – Whether to overwrite existing file(s).
- cleanup (bool, default: False) – Whether to remove the source archive(s).
- kind ({'json', 'yaml'}, optional) – File formats to include in the dataset. Defaults to include both JSON and YAML files.
- verbose (bool, default: True) – Whether to be verbose.
-
root
¶ Root directory of the dataset.
Type: Path
-
filenames
¶ Path to the files, relative to root.
Type: list of Path
See also
muspy.MusicDataset
- Class for datasets of MusPy JSON/YAML files.
muspy.RemoteDataset
- Base class for remote MusPy datasets.
-
class
muspy.datasets.
WikifoniaDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, overwrite: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: bool = None, verbose: bool = True)[source]¶ Wikifonia Dataset.