muspy.datasets

Dataset classes.

This module provides an easy-to-use dataset management system. Each supported dataset in MusPy comes with a class inherited from the base MusPy Dataset class. It also provides interfaces to PyTorch and TensorFlow for creating input pipelines for machine learning.

Base Classes

  • ABCFolderDataset
  • Dataset
  • DatasetInfo
  • FolderDataset
  • RemoteABCFolderDataset
  • RemoteDataset
  • RemoteFolderDataset
  • RemoteMusicDataset
  • MusicDataset

Dataset Classes

  • EssenFolkSongDatabase
  • HymnalDataset
  • HymnalTuneDataset
  • JSBChoralesDataset
  • LakhMIDIAlignedDataset
  • LakhMIDIDataset
  • LakhMIDIMatchedDataset
  • MAESTRODatasetV1
  • MAESTRODatasetV2
  • Music21Dataset
  • NESMusicDatabase
  • NottinghamDatabase
  • WikifoniaDataset
class muspy.datasets.ABCFolderDataset(root: Union[str, pathlib.Path], convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

A class of local datasets containing ABC files in a folder.

on_the_fly() → FolderDatasetType[source]

Enable on-the-fly mode and convert the data on the fly.

Returns: Object itself.

read(filename: Tuple[str, Tuple[int, int]]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.Dataset[source]

Base class for all MusPy datasets.

To build a custom dataset, inherit from this class and override the methods __getitem__ and __len__ as well as the class attribute _info. __getitem__ should return the i-th data sample as a muspy.Music object. __len__ should return the size of the dataset. _info should be a muspy.DatasetInfo instance containing the dataset information.
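
The subclassing contract described here can be sketched roughly as follows. To keep the sketch runnable without MusPy installed, FakeMusic, FakeDatasetInfo and FakeDataset are hypothetical stand-ins for muspy.Music, muspy.DatasetInfo and muspy.Dataset; in real code one would presumably inherit from muspy.Dataset instead. ScaleDataset is an invented toy example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeMusic:
    """Hypothetical stand-in for muspy.Music."""
    title: str
    pitches: List[int] = field(default_factory=list)


@dataclass
class FakeDatasetInfo:
    """Hypothetical stand-in for muspy.DatasetInfo."""
    name: Optional[str] = None
    description: Optional[str] = None
    homepage: Optional[str] = None
    license: Optional[str] = None


class FakeDataset:
    """Hypothetical stand-in for the muspy.Dataset base class."""
    _info = FakeDatasetInfo()

    @classmethod
    def info(cls):
        return cls._info


class ScaleDataset(FakeDataset):
    """Toy dataset holding one ascending major scale per root pitch."""

    # Class attribute describing the dataset.
    _info = FakeDatasetInfo(name="Toy scales", description="Ascending major scales.")

    def __init__(self, roots):
        self.roots = list(roots)

    def __getitem__(self, index):
        # Return the index-th sample as a music object.
        root = self.roots[index]
        steps = [0, 2, 4, 5, 7, 9, 11, 12]
        return FakeMusic(title=f"scale-{root}", pitches=[root + s for s in steps])

    def __len__(self):
        # Return the size of the dataset.
        return len(self.roots)


dataset = ScaleDataset([60, 62])
```

Only the two special methods and the _info attribute are specific to the subclass; everything else is inherited from the base class.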

classmethod citation()[source]

Print the citation information.

classmethod info()[source]

Return the dataset information.

save(root: Union[str, pathlib.Path], kind: Optional[str] = 'json', n_jobs: int = 1, ignore_exceptions: bool = True)[source]

Save all the music objects to a directory.

Each converted file is named by its index and saved to root/.

Parameters:
  • root (str or Path) – Root directory to save the data.
  • kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
  • n_jobs (int, optional) – Maximum number of concurrently running jobs in multiprocessing. If equal to 1, disable multiprocessing. Defaults to 1.
  • ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some of the source files are known to be corrupted. Defaults to True.

Notes

The original filenames can be found in the filenames attribute. For example, the file at filenames[i] will be converted and saved to {i}.json.

split(filename: Union[str, pathlib.Path, None] = None, splits: Optional[Sequence[float]] = None, random_state: Any = None) → Dict[str, List[int]][source]

Return the dataset split as a dictionary that maps each split name to a list of sample indices.

Parameters:
  • filename (str or Path, optional) – Path to the split file. If given and the file exists, the split is read from it. If given and the file does not exist, the newly created split is saved to it.
  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
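
The splitting behaviour described by these parameters might look roughly like the sketch below. It is not MusPy's implementation: split_indices is a hypothetical helper, random.Random stands in for numpy.random.RandomState, and the key names "train", "test" and "validation" as well as the JSON file format are assumptions.

```python
import json
import random
from pathlib import Path


def split_indices(n_samples, splits, filename=None, seed=None):
    """Return a dict mapping split names to sorted lists of sample indices."""
    # Reuse a previously saved split so that experiments stay comparable.
    if filename is not None and Path(filename).is_file():
        with open(filename, encoding="utf-8") as f:
            return json.load(f)

    # A single float r is treated as the pair (r, 1 - r).
    ratios = [splits, 1.0 - splits] if isinstance(splits, float) else list(splits)
    if len(ratios) not in (2, 3):
        raise ValueError("Expected two or three split ratios.")
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("Split ratios must sum to 1.")

    # Shuffle the indices reproducibly.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    names = ("train", "test", "validation")  # assumed key names
    result = {}
    start = 0
    for i, (name, ratio) in enumerate(zip(names, ratios)):
        # The last split takes the remainder so no index is left out.
        if i == len(ratios) - 1:
            stop = n_samples
        else:
            stop = min(n_samples, start + round(ratio * n_samples))
        result[name] = sorted(indices[start:stop])
        start = stop

    # Save the split for later runs if a path was given.
    if filename is not None:
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(result, f)
    return result
```

Calling split_indices(10, [0.8, 0.1, 0.1], seed=0) yields three disjoint index lists that together cover all ten samples; passing the same filename again returns the stored split regardless of the other arguments.
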
to_pytorch_dataset(factory: Optional[Callable] = None, representation: Optional[str] = None, split_filename: Union[str, pathlib.Path, None] = None, splits: Optional[Sequence[float]] = None, random_state: Any = None, **kwargs) → Union[TorchDataset, Dict[str, TorchDataset]][source]

Return the dataset as a PyTorch dataset.

Parameters:
  • factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
  • representation ({'pitch', 'piano-roll', 'event', 'note'}, optional) – Target representation.
  • split_filename (str or Path, optional) – Path to the split file. If given and the file exists, the split is read from it. If given and the file does not exist, the newly created split is saved to it.
  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
Returns:

  • torch.utils.data.Dataset or Dict of torch.utils.data.Dataset – Converted PyTorch dataset(s).
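
As a rough illustration of how a factory and a split combine, the sketch below wraps any indexable dataset in a map-style view. A map-style PyTorch dataset only needs __getitem__ and __len__, so the sketch avoids importing torch. FactoryView and to_views are hypothetical names, not part of MusPy, and the real conversion may differ.

```python
class FactoryView:
    """Map-style view that applies `factory` to each selected sample."""

    def __init__(self, dataset, factory, indices=None):
        self.dataset = dataset
        self.factory = factory
        # Without explicit indices, expose the whole dataset.
        if indices is None:
            indices = range(len(dataset))
        self.indices = list(indices)

    def __getitem__(self, i):
        # Look up the underlying sample, then convert it.
        return self.factory(self.dataset[self.indices[i]])

    def __len__(self):
        return len(self.indices)


def to_views(dataset, factory, split=None):
    """Return one view, or a dict of views keyed by split name."""
    if split is None:
        return FactoryView(dataset, factory)
    return {name: FactoryView(dataset, factory, idx) for name, idx in split.items()}


# Toy data: each "song" is a list of pitches; the factory counts notes.
songs = [[60, 62], [64], [65, 67, 69]]
views = to_views(songs, len, {"train": [0, 2], "test": [1]})
```

Here views["train"] exposes the first and third songs and views["test"] the second, each converted by the factory on access.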

to_tensorflow_dataset(factory: Optional[Callable] = None, representation: Optional[str] = None, split_filename: Union[str, pathlib.Path, None] = None, splits: Optional[Sequence[float]] = None, random_state: Any = None, **kwargs) → Union[TFDataset, Dict[str, TFDataset]][source]

Return the dataset as a TensorFlow dataset.

Parameters:
  • factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.
  • representation ({'pitch', 'piano-roll', 'event', 'note'}, optional) – Target representation.
  • split_filename (str or Path, optional) – Path to the split file. If given and the file exists, the split is read from it. If given and the file does not exist, the newly created split is saved to it.
  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.
  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
Returns:

  • tensorflow.data.Dataset or Dict of tensorflow.data.Dataset – Converted TensorFlow dataset(s).

class muspy.datasets.DatasetInfo(name: Optional[str] = None, description: Optional[str] = None, homepage: Optional[str] = None, license: Optional[str] = None)[source]

A container for dataset information.

class muspy.datasets.EssenFolkSongDatabase(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

Essen Folk Song Database.

class muspy.datasets.FolderDataset(root: Union[str, pathlib.Path], convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

A class of datasets containing files in a folder.

Two modes are available for this dataset. When the on-the-fly mode is enabled, a data sample is converted to a music object on the fly when being indexed. When the on-the-fly mode is disabled, a data sample is loaded from the precomputed converted data.
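
The difference between the two modes can be sketched with a small stand-in class. TwoModeFolder is hypothetical and stores plain JSON-serializable objects rather than Music objects; its method names mirror the documented ones only for readability.

```python
import json
from pathlib import Path


class TwoModeFolder:
    """Hypothetical stand-in illustrating on-the-fly versus converted access."""

    def __init__(self, source_files, converted_dir, read):
        self.source_files = list(source_files)
        self.converted_dir = Path(converted_dir)
        self.read = read  # callable: source file -> JSON-serializable object
        self._on_the_fly = True

    def on_the_fly(self):
        self._on_the_fly = True
        return self

    def use_converted(self):
        self._on_the_fly = False
        return self

    def convert(self):
        # Precompute every sample once and name each file by its index.
        self.converted_dir.mkdir(parents=True, exist_ok=True)
        for i, path in enumerate(self.source_files):
            target = self.converted_dir / f"{i}.json"
            target.write_text(json.dumps(self.read(path)), encoding="utf-8")
        return self.use_converted()

    def __len__(self):
        return len(self.source_files)

    def __getitem__(self, index):
        if self._on_the_fly:
            # Parse the source file each time it is indexed.
            return self.read(self.source_files[index])
        # Load the precomputed result instead of parsing again.
        target = self.converted_dir / f"{index}.json"
        return json.loads(target.read_text(encoding="utf-8"))
```

On-the-fly mode needs no extra disk space but repeats the parsing work on every access; converted mode pays the parsing cost once up front.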

root

Root directory of the dataset.

Type:str or Path
Parameters:
  • convert (bool, optional) – Whether to convert the dataset to MusPy JSON/YAML files. If False, will check if converted data exists. If so, disable on-the-fly mode. If not, enable on-the-fly mode and issue a warning. Defaults to False.
  • kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
  • n_jobs (int, optional) – Maximum number of concurrently running jobs in multiprocessing. If equal to 1, disable multiprocessing. Defaults to 1.
  • ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some of the source files are known to be corrupted. Defaults to True.
  • use_converted (bool, optional) – Force to disable on-the-fly mode and use stored converted data.

Important

muspy.FolderDataset.converted_exists() depends solely on a special file named .muspy.success in the folder {root}/_converted/, which serves as an indicator for the existence and integrity of the converted dataset. If the converted dataset is built by muspy.FolderDataset.convert(), the .muspy.success file will be created as well. If the converted dataset is created manually, make sure to create the .muspy.success file in the folder {root}/_converted/ to prevent errors.
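
A minimal sketch of the marker-file convention, using only pathlib. The helper names below are hypothetical standalone functions, not MusPy methods; only the file name .muspy.success and the _converted folder come from the text above.

```python
from pathlib import Path

SUCCESS_MARKER = ".muspy.success"  # file name taken from the text above


def converted_dir(root):
    """Return the folder that holds the converted files."""
    return Path(root) / "_converted"


def converted_exists(root):
    """Report whether the marker file is present."""
    # Only the marker is inspected, not the converted files themselves.
    return (converted_dir(root) / SUCCESS_MARKER).is_file()


def mark_converted(root):
    """Create the marker after placing converted files by hand."""
    directory = converted_dir(root)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / SUCCESS_MARKER).touch()
```

Because the check looks at the marker alone, creating it before all converted files are in place would make an incomplete dataset look valid.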

Notes

This class is extended from muspy.Dataset. To build a custom dataset based on this class, please refer to muspy.Dataset for the documentation of the methods __getitem__ and __len__, and the class attribute _info.

In addition, the attribute _extension and the method read should be properly set. _extension is the extension to look for when building the dataset. All files with the given extension will be included as source files. read is a callable that takes the filename of a source file as input and returns the converted Music object.
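
How source files could be collected from an extension is sketched below. find_source_files is a hypothetical helper; searching subfolders recursively, matching the extension case-insensitively and sorting the result are assumptions, not documented behaviour.

```python
from pathlib import Path
from typing import List


def find_source_files(root, extension) -> List[Path]:
    """Collect all files under `root` whose suffix matches `extension`."""
    # Accept both "mid" and ".mid" as the extension.
    suffix = "." + extension.lstrip(".").lower()
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == suffix
    )
```

A read callable would then be applied to each path returned here to produce one sample per file.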

See also

muspy.Dataset
The base class for all MusPy datasets.
convert(kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True) → FolderDatasetType[source]

Convert and save the Music objects.

Each converted file is named by its index and saved to root/_converted. The original filenames can be found in the filenames attribute. For example, the file at filenames[i] will be converted and saved to {i}.json.

Parameters:
  • kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
  • n_jobs (int, optional) – Maximum number of concurrently running jobs in multiprocessing. If equal to 1, disable multiprocessing. Defaults to 1.
  • ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some of the source files are known to be corrupted. Defaults to True.
Returns: Object itself.

converted_dir

Return the path to the root directory of the converted dataset.

converted_exists() → bool[source]

Return True if the saved dataset exists, otherwise False.

exists() → bool[source]

Return True if the dataset exists, otherwise False.

load(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

on_the_fly() → FolderDatasetType[source]

Enable on-the-fly mode and convert the data on the fly.

Returns: Object itself.

read(filename: Any) → muspy.music.Music[source]

Read a file into a Music object.

use_converted() → FolderDatasetType[source]

Disable on-the-fly mode and use converted data.

Returns: Object itself.

class muspy.datasets.HymnalDataset(root: Union[str, pathlib.Path], download: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

Hymnal Dataset.

download() → muspy.datasets.base.FolderDataset[source]

Download the source datasets.

Returns: Object itself.

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.HymnalTuneDataset(root: Union[str, pathlib.Path], download: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

Hymnal Dataset (tune only).

download() → muspy.datasets.base.FolderDataset[source]

Download the source datasets.

Returns: Object itself.

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.JSBChoralesDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

Johann Sebastian Bach Chorales Dataset.

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.LakhMIDIAlignedDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

Lakh MIDI Dataset - aligned subset.

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.LakhMIDIDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

Lakh MIDI Dataset.

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.LakhMIDIMatchedDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

Lakh MIDI Dataset - matched subset.

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.MAESTRODatasetV1(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

MAESTRO Dataset V1 (MIDI only).

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.MAESTRODatasetV2(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

MAESTRO Dataset V2 (MIDI only).

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.Music21Dataset(composer: Optional[str] = None)[source]

A class of datasets containing files in music21 corpus.

Parameters:
  • composer (str) – Name of a composer or a collection.
  • extensions (list of str) – File extensions of desired files.

Notes

Please refer to the music21 corpus reference page for a full list [1].

[1] https://web.mit.edu/music21/doc/about/referenceCorpus.html

convert(root: Union[str, pathlib.Path], kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True) → muspy.datasets.base.MusicDataset[source]

Convert and save the Music objects; return a MusicDataset instance.

Parameters:
  • root (str or Path) – Root directory to save the data.
  • kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
  • n_jobs (int, optional) – Maximum number of concurrently running jobs in multiprocessing. If equal to 1, disable multiprocessing. Defaults to 1.
  • ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some of the source files are known to be corrupted. Defaults to True.

class muspy.datasets.MusicDataset(root: Union[str, pathlib.Path], kind: str = 'json')[source]

A local dataset containing MusPy JSON/YAML files in a folder.

root

Root directory of the dataset.

Type:str or Path
kind

File format of the data. Defaults to ‘json’.

Type:{‘json’, ‘yaml’}, optional
class muspy.datasets.NESMusicDatabase(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

NES Music Database.

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.NottinghamDatabase(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

Nottingham Database.

class muspy.datasets.RemoteABCFolderDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

A class of remote datasets containing ABC files in a folder.

class muspy.datasets.RemoteDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False)[source]

Base class for remote MusPy datasets.

This class is extended from muspy.Dataset to support remote datasets. To build a custom dataset based on this class, please refer to muspy.Dataset for the documentation of the methods __getitem__ and __len__, and the class attribute _info. In addition, the class attribute _sources containing the URLs to the source files should be properly set (see Notes).

root

Root directory of the dataset.

Type:str or Path
Parameters:
  • download_and_extract (bool, optional) – Whether to download and extract the dataset. Defaults to False.
  • cleanup (bool, optional) – Whether to remove the original archive(s). Defaults to False.
Raises:

RuntimeError – If download_and_extract is False but the file {root}/.muspy.success does not exist (see below).

Important

muspy.RemoteDataset.exists() depends solely on a special file named .muspy.success in the folder {root}/, which serves as an indicator for the existence and integrity of the dataset. This file will automatically be created if the dataset is successfully downloaded and extracted by muspy.RemoteDataset.download_and_extract().

If the dataset is downloaded manually, make sure to create the .muspy.success file in the folder {root}/ to prevent errors.

Notes

The class attribute _sources is a dictionary containing the following information of each source file.

  • filename (str): Name to save the file.
  • url (str): URL to the file.
  • archive (bool): Whether the file is an archive.
  • md5 (str, optional): Expected MD5 checksum of the file.

Here is an example:

_sources = {
    "example": {
        "filename": "example.tar.gz",
        "url": "https://www.example.com/example.tar.gz",
        "archive": True,
        "md5": None,
    }
}
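
The optional md5 field suggests a checksum comparison after download. The sketch below shows one way such a check could be written with hashlib; md5_of and check_source are hypothetical helpers, and treating a missing or None checksum as "nothing to verify" is an assumption.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_source(root, source):
    """Return True if the source file exists and matches its checksum."""
    path = Path(root) / source["filename"]
    if not path.is_file():
        return False
    expected = source.get("md5")
    # A missing or None checksum means there is nothing to compare against.
    return expected is None or md5_of(path) == expected.lower()
```

Each entry of a _sources dictionary shaped like the example above could be passed to check_source together with the dataset root.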

See also

muspy.Dataset
The base class for all MusPy datasets.
download() → RemoteDatasetType[source]

Download the source datasets.

Returns: Object itself.

download_and_extract(cleanup: bool = False) → RemoteDatasetType[source]

Download the source datasets and extract the downloaded archives.

This is equivalent to RemoteDataset.download().extract(cleanup).

Parameters:cleanup (bool, optional) – Whether to remove the original archive. Defaults to False.
Returns: Object itself.

exists() → bool[source]

Return True if the dataset exists, otherwise False.

extract(cleanup: bool = False) → RemoteDatasetType[source]

Extract the downloaded archive(s).

Parameters:cleanup (bool, optional) – Whether to remove the original archive. Defaults to False.
Returns: Object itself.

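
The effect of the cleanup flag can be illustrated with the standard library. extract_archive is a hypothetical helper built on shutil.unpack_archive; it is a sketch of the described behaviour rather than MusPy's own code, and it assumes the archive comes from a trusted source.

```python
import shutil
from pathlib import Path


def extract_archive(root, filename, cleanup=False):
    """Unpack `root/filename` into `root` and optionally delete the archive."""
    root = Path(root)
    archive = root / filename
    # The archive format is inferred from the file extension.
    shutil.unpack_archive(str(archive), str(root))
    if cleanup:
        # Remove the original archive once its contents are extracted.
        archive.unlink()
    return root
```

With cleanup=True the archive no longer exists after the call, so a later integrity check on the source files would require downloading it again.
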
source_exists() → bool[source]

Return True if all the sources exist, otherwise False.

class muspy.datasets.RemoteFolderDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

A class of remote datasets containing files in a folder.

This class extends muspy.RemoteDataset and muspy.FolderDataset. Please refer to their documentation for details.

root

Root directory of the dataset.

Type:str or Path
Parameters:
  • download_and_extract (bool, optional) – Whether to download and extract the dataset. Defaults to False.
  • cleanup (bool, optional) – Whether to remove the original archive(s). Defaults to False.
  • convert (bool, optional) – Whether to convert the dataset to MusPy JSON/YAML files. If False, will check if converted data exists. If so, disable on-the-fly mode. If not, enable on-the-fly mode and warns. Defaults to False.
  • kind ({'json', 'yaml'}, optional) – File format to save the data. Defaults to ‘json’.
  • n_jobs (int, optional) – Maximum number of concurrently running jobs in multiprocessing. If equal to 1, disable multiprocessing. Defaults to 1.
  • ignore_exceptions (bool, optional) – Whether to ignore errors and skip failed conversions. This can be helpful if some of the source files are known to be corrupted. Defaults to True.
  • use_converted (bool, optional) – Force to disable on-the-fly mode and use stored converted data.

See also

muspy.RemoteDataset
Base class for remote MusPy datasets.
muspy.FolderDataset
A class of datasets containing files in a folder.
read(filename: str) → muspy.music.Music[source]

Read a file into a Music object.

class muspy.datasets.RemoteMusicDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, kind: str = 'json')[source]

A dataset containing MusPy JSON/YAML files in a folder.

This class extends muspy.RemoteDataset and muspy.FolderDataset. Please refer to their documentation for details.

root

Root directory of the dataset.

Type:str or Path
kind

File format of the data. Defaults to ‘json’.

Type:{‘json’, ‘yaml’}, optional
Parameters:
  • download_and_extract (bool, optional) – Whether to download and extract the dataset. Defaults to False.
  • cleanup (bool, optional) – Whether to remove the original archive(s). Defaults to False.
class muspy.datasets.WikifoniaDataset(root: Union[str, pathlib.Path], download_and_extract: bool = False, cleanup: bool = False, convert: bool = False, kind: str = 'json', n_jobs: int = 1, ignore_exceptions: bool = True, use_converted: Optional[bool] = None)[source]

Wikifonia dataset.

read(filename: Union[str, pathlib.Path]) → muspy.music.Music[source]

Read a file into a Music object.

muspy.datasets.get_dataset(key: str) → Type[muspy.datasets.base.Dataset][source]

Return a certain dataset class by key.

Parameters:key (str) – Dataset key (case-insensitive).
Returns: The corresponding dataset class.

muspy.datasets.list_datasets()[source]

Return all supported dataset classes as a list.

Returns: A list of all supported dataset classes.
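
A case-insensitive lookup of this kind is commonly backed by a registry with lowercase keys. The sketch below is illustrative only: the keys, the two placeholder classes and the error raised for unknown keys are all assumptions and do not reflect MusPy's actual registry.

```python
class LakhDemo:
    """Placeholder class standing in for a dataset class."""


class NottinghamDemo:
    """Placeholder class standing in for a dataset class."""


# Hypothetical registry; keys are stored in lowercase.
_REGISTRY = {"lakh": LakhDemo, "nottingham": NottinghamDemo}


def get_class(key):
    """Return the class registered under `key`, ignoring case."""
    try:
        return _REGISTRY[key.lower()]
    except KeyError:
        raise KeyError(f"Unknown dataset key: {key!r}") from None


def list_classes():
    """Return all registered classes as a list."""
    return list(_REGISTRY.values())
```

Normalizing the key once at lookup time keeps the registry itself simple and avoids storing several spellings of the same name.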