muspy.datasets
Dataset classes.
This module provides an easy-to-use dataset management system. Each supported dataset in MusPy comes with a class inherited from the base MusPy Dataset class. It also provides interfaces to PyTorch and TensorFlow for creating input pipelines for machine learning.
Base Classes
- ABCFolderDataset
- Dataset
- DatasetInfo
- FolderDataset
- RemoteABCFolderDataset
- RemoteDataset
- RemoteFolderDataset
- RemoteMusicDataset
- MusicDataset
Dataset Classes
- EssenFolkSongDatabase
- HymnalDataset
- HymnalTuneDataset
- JSBChoralesDataset
- LakhMIDIAlignedDataset
- LakhMIDIDataset
- LakhMIDIMatchedDataset
- MAESTRODatasetV1
- MAESTRODatasetV2
- Music21Dataset
- NESMusicDatabase
- NottinghamDatabase
- WikifoniaDataset
-
class
muspy.datasets.
ABCFolderDataset
(root: Union[str, pathlib.Path], convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Class for datasets storing ABC files in a folder.
See also
muspy.FolderDataset
- Class for datasets storing files in a folder.
-
class
muspy.datasets.
Dataset
[source]¶ Base class for MusPy datasets.
To build a custom dataset, inherit from this class and override the methods __getitem__ and __len__ as well as the class attribute _info. __getitem__ should return the i-th data sample as a muspy.Music object. __len__ should return the size of the dataset. _info should be a muspy.DatasetInfo instance storing the dataset information.
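As a rough illustration of this contract, the sketch below uses small stand-in classes instead of importing MusPy, so that it runs on its own. FakeMusic, FakeDatasetInfo and ToyDataset are hypothetical names introduced only for this example; in real code the dataset would inherit from muspy.Dataset, return muspy.Music objects, and store a muspy.DatasetInfo in _info.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeDatasetInfo:
    """Stand-in for muspy.DatasetInfo (same four optional fields)."""
    name: Optional[str] = None
    description: Optional[str] = None
    homepage: Optional[str] = None
    license: Optional[str] = None


@dataclass
class FakeMusic:
    """Stand-in for muspy.Music; holds only a title and some pitches."""
    title: str
    pitches: List[int] = field(default_factory=list)


class ToyDataset:
    """Hypothetical dataset; a real one would inherit from muspy.Dataset."""

    # Class attribute describing the dataset.
    _info = FakeDatasetInfo(name="Toy", description="Three tiny melodies")

    def __init__(self) -> None:
        self._samples = [
            FakeMusic("scale", [60, 62, 64]),
            FakeMusic("triad", [60, 64, 67]),
            FakeMusic("octave", [60, 72]),
        ]

    def __getitem__(self, index: int) -> FakeMusic:
        # Return the index-th data sample as a music object.
        return self._samples[index]

    def __len__(self) -> int:
        # Return the size of the dataset.
        return len(self._samples)


dataset = ToyDataset()
print(len(dataset), dataset[1].title)
```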
save
(root: Union[str, pathlib.Path], kind: Optional[str] = 'json', n_jobs: int = 1, ignore_exceptions: bool = True)[source]¶ Save all the music objects to a directory.
Parameters: - root (str or Path) – Root directory to save the data.
- kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
- n_jobs (int, optional) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing. Defaults to 1.
- ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted. Defaults to True.
Notes
The converted files will be named by their indices. The original filenames can be found in the filenames attribute. For example, the file at filenames[i] will be converted and saved to {i}.json.
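The index-based naming can be pictured with a small helper. This is only a sketch of the naming rule stated in the note, not MusPy's implementation: plan_saved_paths is a hypothetical function, and the assumption that YAML output uses a .yaml extension is mine.

```python
from pathlib import Path
from typing import Dict, List


def plan_saved_paths(
    filenames: List[str], root: str, kind: str = "json"
) -> Dict[str, Path]:
    """Map each original filename to the index-based path it would be saved to."""
    if kind not in ("json", "yaml"):
        raise ValueError("kind must be 'json' or 'yaml'")
    # The file at filenames[i] is saved as {i}.{kind} under root
    # (the '.yaml' extension is an assumption of this sketch).
    return {name: Path(root) / f"{i}.{kind}" for i, name in enumerate(filenames)}


filenames = ["bwv1.mid", "bwv2.mid", "bwv3.mid"]
plan = plan_saved_paths(filenames, "converted")
for name, path in plan.items():
    print(name, "->", path.name)
```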
-
split
(filename: Union[str, pathlib.Path, None] = None, splits: Optional[Sequence[float]] = None, random_state: Any = None) → Dict[str, List[int]][source]¶ Return the dataset splits as lists of sample indices.
Parameters: - filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If None or the file does not exist, path to save the split to.
- splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
- random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
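To make the ratio and random-state behaviour concrete, here is a minimal standalone sketch of a seeded, file-backed split. It is not MusPy's implementation: make_split is a hypothetical helper, the standard-library random.Random stands in for numpy.random.RandomState, the split file is assumed to be JSON, and the ratios are assumed to sum to 1.

```python
import json
import random
from pathlib import Path
from typing import Dict, List, Optional, Sequence


def make_split(
    n_samples: int,
    splits: Sequence[float],
    seed: Optional[int] = None,
    filename: Optional[str] = None,
) -> Dict[str, List[int]]:
    """Return index lists per subset, reusing a saved split if one exists."""
    if filename is not None and Path(filename).is_file():
        # A split was saved earlier: read it back instead of reshuffling.
        return json.loads(Path(filename).read_text())

    if len(splits) not in (2, 3):
        raise ValueError("expected two or three ratios")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # stand-in for numpy.random.RandomState

    names = ["train", "test", "validation"][: len(splits)]
    result: Dict[str, List[int]] = {}
    start = 0
    for i, (name, ratio) in enumerate(zip(names, splits)):
        # The last subset takes the remainder so rounding never drops a sample.
        is_last = i == len(names) - 1
        stop = n_samples if is_last else start + round(n_samples * ratio)
        result[name] = sorted(indices[start:stop])
        start = stop

    if filename is not None:
        # Persist the split so later calls return the same indices.
        Path(filename).write_text(json.dumps(result))
    return result


split = make_split(10, [0.8, 0.1, 0.1], seed=0)
print({name: len(idx) for name, idx in split.items()})
```

Fixing the seed (or reusing the saved file) is what keeps the train, test and validation indices stable between runs.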
-
to_pytorch_dataset
(factory: Optional[Callable] = None, representation: Optional[str] = None, split_filename: Union[str, pathlib.Path, None] = None, splits: Optional[Sequence[float]] = None, random_state: Any = None, **kwargs) → Union[TorchDataset, Dict[str, TorchDataset]][source]¶ Return the dataset as a PyTorch dataset.
Parameters: - factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
- representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
- split_filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If None or the file does not exist, path to save the split to.
- splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
- random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
Returns: Converted PyTorch dataset(s).
Return type: torch.utils.data.Dataset or Dict of torch.utils.data.Dataset
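The role of the factory argument can be sketched without PyTorch. In the example below, FactoryView and pitch_histogram are hypothetical names, and plain lists of MIDI pitch numbers stand in for Music objects; the point is only that the factory is applied to each sample when it is indexed, which is the same __getitem__/__len__ shape a map-style torch.utils.data.Dataset exposes.

```python
from typing import Callable, List, Sequence


class FactoryView:
    """Hypothetical wrapper: applies `factory` to a sample when it is indexed."""

    def __init__(self, samples: Sequence, factory: Callable) -> None:
        self.samples = samples
        self.factory = factory

    def __getitem__(self, index: int):
        # Convert lazily, one sample at a time.
        return self.factory(self.samples[index])

    def __len__(self) -> int:
        return len(self.samples)


def pitch_histogram(pitches: List[int]) -> List[int]:
    """Example factory: count notes per pitch class (12 bins)."""
    hist = [0] * 12
    for pitch in pitches:
        hist[pitch % 12] += 1
    return hist


# Lists of MIDI pitch numbers stand in for Music objects here.
melodies = [[60, 62, 64, 72], [67, 67, 69]]
view = FactoryView(melodies, pitch_histogram)
print(view[0])
```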
-
to_tensorflow_dataset
(factory: Optional[Callable] = None, representation: Optional[str] = None, split_filename: Union[str, pathlib.Path, None] = None, splits: Optional[Sequence[float]] = None, random_state: Any = None, **kwargs) → Union[TFDataset, Dict[str, TFDataset]][source]¶ Return the dataset as a TensorFlow dataset.
Parameters: - factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
- representation (str, optional) – Target representation. See muspy.to_representation() for available representations.
- split_filename (str or Path, optional) – If given and the file exists, path to the file to read the split from. If None or the file does not exist, path to save the split to.
- splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
- random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
Returns: Converted TensorFlow dataset(s).
Return type: tensorflow.data.Dataset or Dict of tensorflow.data.Dataset
-
-
class
muspy.datasets.
DatasetInfo
(name: Optional[str] = None, description: Optional[str] = None, homepage: Optional[str] = None, license: Optional[str] = None)[source]¶ A container for dataset information.
-
class
muspy.datasets.
EssenFolkSongDatabase
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Essen Folk Song Database.
-
class
muspy.datasets.
FolderDataset
(root: Union[str, pathlib.Path], convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Class for datasets storing files in a folder.
This class extends muspy.Dataset to support folder datasets. To build a custom folder dataset, please refer to the documentation of muspy.Dataset for details. In addition, set the class attribute _extension to the extension to look for when building the dataset and set read to a callable that takes the filename of a source file as input and returns the converted Music object.
Parameters: - convert (bool, optional) – Whether to convert the dataset to MusPy JSON/YAML files. If False, will check if converted data exists. If so, disable on-the-fly mode. If not, enable on-the-fly mode and warn. Defaults to False.
- kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
- n_jobs (int, optional) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing. Defaults to 1.
- ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted. Defaults to True.
- use_converted (bool, optional) – Force to disable on-the-fly mode and use stored converted data.
Important
muspy.FolderDataset.converted_exists() depends solely on a special file named .muspy.success in the folder {root}/_converted/, which serves as an indicator for the existence and integrity of the converted dataset. If the converted dataset is built by muspy.FolderDataset.convert(), the .muspy.success file will be created as well. If the converted dataset is created manually, make sure to create the .muspy.success file in the folder {root}/_converted/ to prevent errors.
Notes
Two modes are available for this dataset. When the on-the-fly mode is enabled, a data sample is converted to a music object on the fly when being indexed. When the on-the-fly mode is disabled, a data sample is loaded from the precomputed converted data.
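The marker-file check and the choice between the two modes might look roughly like the sketch below. It is written against the description above rather than MusPy's source: converted_exists and choose_mode are standalone, hypothetical helpers, and the fallback to the marker check when use_converted is None is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Optional


def converted_exists(root: Path) -> bool:
    """Check only for the marker file {root}/_converted/.muspy.success."""
    return (root / "_converted" / ".muspy.success").is_file()


def choose_mode(root: Path, use_converted: Optional[bool] = None) -> str:
    """Pick a mode; when use_converted is None, fall back to the marker check."""
    if use_converted is None:
        use_converted = converted_exists(root)
    return "converted" if use_converted else "on-the-fly"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    mode_before = choose_mode(root)  # no marker file yet

    # Simulate a manually prepared converted dataset by creating the marker.
    (root / "_converted").mkdir()
    (root / "_converted" / ".muspy.success").touch()
    mode_after = choose_mode(root)

print(mode_before, mode_after)
```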
See also
muspy.Dataset
- Base class for MusPy datasets.
-
convert
(kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True) → FolderDatasetType[source]¶ Convert and save the Music objects.
The converted files will be named by their indices and saved to root/_converted. The original filenames can be found in the filenames attribute. For example, the file at filenames[i] will be converted and saved to {i}.json.
Parameters: - kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
- n_jobs (int, optional) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing. Defaults to 1.
- ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted. Defaults to True.
Returns: Object itself.
-
converted_dir
¶ Path to the root directory of the converted dataset.
-
load
(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]¶ Load a file into a Music object.
-
class
muspy.datasets.
HymnalDataset
(root: Union[str, pathlib.Path], download: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Hymnal Dataset.
-
class
muspy.datasets.
HymnalTuneDataset
(root: Union[str, pathlib.Path], download: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Hymnal Dataset (tune only).
-
class
muspy.datasets.
JSBChoralesDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Johann Sebastian Bach Chorales Dataset.
-
class
muspy.datasets.
LakhMIDIAlignedDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Lakh MIDI Dataset - aligned subset.
-
class
muspy.datasets.
LakhMIDIDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Lakh MIDI Dataset.
-
class
muspy.datasets.
LakhMIDIMatchedDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Lakh MIDI Dataset - matched subset.
-
class
muspy.datasets.
MAESTRODatasetV1
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ MAESTRO Dataset V1 (MIDI only).
-
class
muspy.datasets.
MAESTRODatasetV2
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ MAESTRO Dataset V2 (MIDI only).
-
class
muspy.datasets.
Music21Dataset
(composer: Optional[str] = None)[source]¶ A class of datasets containing files in music21 corpus.
Parameters: - composer (str) – Name of a composer or a collection. Please refer to the music21 corpus reference page for a full list [1].
- extensions (list of str) – File extensions of desired files.
References
[1] https://web.mit.edu/music21/doc/about/referenceCorpus.html
-
convert
(root: Union[str, pathlib.Path], kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True) → muspy.datasets.base.MusicDataset[source]¶ Convert and save the Music objects.
Parameters: - root (str or Path) – Root directory to save the data.
- kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
- n_jobs (int, optional) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing. Defaults to 1.
- ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted. Defaults to True.
-
class
muspy.datasets.
MusicDataset
(root: Union[str, pathlib.Path], kind: str = 'json')[source]¶ Class for datasets of MusPy JSON/YAML files.
-
kind
¶ File format of the data. Defaults to ‘json’.
Type: {‘json’, ‘yaml’}, optional
See also
muspy.Dataset
- Base class for MusPy datasets.
-
-
class
muspy.datasets.
NESMusicDatabase
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ NES Music Database.
-
class
muspy.datasets.
NottinghamDatabase
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Nottingham Database.
-
class
muspy.datasets.
RemoteABCFolderDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Base class for remote datasets storing ABC files in a folder.
See also
muspy.ABCFolderDataset
- Class for datasets storing ABC files in a folder.
muspy.RemoteDataset
- Base class for remote MusPy datasets.
-
class
muspy.datasets.
RemoteDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False)[source]¶ Base class for remote MusPy datasets.
This class extends muspy.Dataset to support remote datasets. To build a custom remote dataset, please refer to the documentation of muspy.Dataset for details. In addition, set the class attribute _sources to the URLs to the source files (see Notes).
Raises: RuntimeError – If download_and_extract is False but file {root}/.muspy.success does not exist (see below).
Important
muspy.Dataset.exists() depends solely on a special file named .muspy.success in directory {root}/_converted/. This file serves as an indicator for the existence and integrity of the dataset. It will automatically be created if the dataset is successfully downloaded and extracted by muspy.Dataset.download_and_extract(). If the dataset is downloaded manually, make sure to create the .muspy.success file in directory {root}/_converted/ to prevent errors.
Notes
The class attribute _sources is a dictionary storing the following information for each source file.
- filename (str): Name to save the file.
- url (str): URL to the file.
- archive (bool): Whether the file is an archive.
- md5 (str, optional): Expected MD5 checksum of the file.
Here is an example:
_sources = {
    "example": {
        "filename": "example.tar.gz",
        "url": "https://www.example.com/example.tar.gz",
        "archive": True,
        "md5": None,
    }
}
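The optional md5 entry suggests a checksum comparison after download. A minimal sketch of such a check, using only the standard library, is shown below; md5_matches is a hypothetical helper, and treating md5 = None as "skip the check" is an assumption rather than documented behaviour.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def md5_matches(path: Path, expected: Optional[str], chunk_size: int = 1 << 20) -> bool:
    """Compare a file's MD5 digest with the expected hex string."""
    if expected is None:
        return True  # assumption: no checksum given means nothing to verify
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "example.tar.gz"
    archive.write_bytes(b"not a real archive")
    good = hashlib.md5(b"not a real archive").hexdigest()

    ok = md5_matches(archive, good)
    corrupted = md5_matches(archive, "0" * 32)
    skipped = md5_matches(archive, None)

print(ok, corrupted, skipped)
```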
See also
muspy.Dataset
- Base class for MusPy datasets.
-
download
() → RemoteDatasetType[source]¶ Download the source datasets.
Returns: Object itself.
-
download_and_extract
(cleanup: bool = False) → RemoteDatasetType[source]¶ Download the source datasets and extract the downloaded archives.
Parameters: cleanup (bool, optional) – Whether to remove the original archive. Defaults to False.
Returns: Object itself.
Notes
Equivalent to RemoteDataset.download().extract(cleanup).
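The equivalence works because each step returns the object itself, which allows the calls to be chained. The stand-in class below only records which steps ran; FakeRemoteDataset and its log attribute are hypothetical and perform no network or file access.

```python
from typing import List


class FakeRemoteDataset:
    """Hypothetical stand-in that records calls instead of downloading."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def download(self) -> "FakeRemoteDataset":
        self.log.append("download")
        return self  # returning self is what makes chaining possible

    def extract(self, cleanup: bool = False) -> "FakeRemoteDataset":
        self.log.append(f"extract(cleanup={cleanup})")
        return self

    def download_and_extract(self, cleanup: bool = False) -> "FakeRemoteDataset":
        # Same as calling download() and then extract(cleanup).
        return self.download().extract(cleanup)


ds = FakeRemoteDataset()
returned = ds.download_and_extract(cleanup=True)
print(ds.log)
```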
-
class
muspy.datasets.
RemoteFolderDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Base class for remote datasets storing files in a folder.
Parameters: - download_and_extract (bool, optional) – Whether to download and extract the dataset. Defaults to False.
- cleanup (bool, optional) – Whether to remove the original archive(s). Defaults to False.
- convert (bool, optional) – Whether to convert the dataset to MusPy JSON/YAML files. If False, will check if converted data exists. If so, disable on-the-fly mode. If not, enable on-the-fly mode and warns. Defaults to False.
- kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
- n_jobs (int, optional) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing. Defaults to 1.
- ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted. Defaults to True.
- use_converted (bool, optional) – Force to disable on-the-fly mode and use stored converted data.
See also
muspy.FolderDataset
- Class for datasets storing files in a folder.
muspy.RemoteDataset
- Base class for remote MusPy datasets.
-
class
muspy.datasets.
RemoteMusicDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, kind: str = 'json')[source]¶ Base class for remote datasets of MusPy JSON/YAML files.
-
kind
¶ File format of the data. Defaults to ‘json’.
Type: {‘json’, ‘yaml’}, optional
See also
muspy.MusicDataset
- Class for datasets of MusPy JSON/YAML files.
muspy.RemoteDataset
- Base class for remote MusPy datasets.
-
-
class
muspy.datasets.
WikifoniaDataset
(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]¶ Wikifonia dataset.