soundmentations package

Module contents

class soundmentations.BaseCompose(transforms)[source]

Bases: object

Base class for composing multiple transforms into a sequential pipeline.

This class provides the fundamental functionality for chaining transforms together, where each transform is applied sequentially to the audio data.

Parameters:

transforms (list) – List of transform objects to apply sequentially. Each transform must have a __call__ method that accepts (samples, sample_rate) parameters.

Notes

This is an internal base class. Use the Compose class instead.
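The sequential chaining described above can be sketched with a minimal, hypothetical re-implementation (an illustration of the idea only, not the package's actual internals):

```python
import numpy as np

class MiniCompose:
    """Hypothetical minimal version of sequential transform chaining."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, samples, sample_rate):
        # each transform receives the output of the previous one
        for transform in self.transforms:
            samples = transform(samples, sample_rate)
        return samples

# toy transforms with the (samples, sample_rate) call signature
double = lambda samples, sr: samples * 2.0
clip = lambda samples, sr: np.clip(samples, -1.0, 1.0)

out = MiniCompose([double, clip])(np.array([0.3, 0.8]), 44100)
# 0.3 doubles to 0.6; 0.8 doubles to 1.6 and is then clipped to 1.0
```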

class soundmentations.Compose(transforms)[source]

Bases: BaseCompose

Compose multiple audio transforms into a sequential pipeline.

This class allows you to chain multiple transforms together into a single callable object. Transforms are applied in the order they appear in the list, with each transform receiving the output of the previous one.

Parameters:

transforms (list) – List of transform objects to apply sequentially. Each transform must implement __call__(samples, sample_rate).

Examples

Create a basic augmentation pipeline:

>>> import soundmentations as S
>>>
>>> # Build a pipeline from individual transforms
>>> pipeline = S.Compose([
...     S.RandomTrim(duration=(1.0, 3.0), p=0.8),
...     S.Pad(pad_length=44100, p=0.6),
...     S.Gain(gain=6.0, p=0.5)
... ])
>>>
>>> # Apply to audio
>>> augmented = pipeline(audio_samples, sample_rate=44100)

Complex preprocessing pipeline:

>>> # ML training data preparation
>>> ml_pipeline = S.Compose([
...     S.CenterTrim(duration=2.0),              # Extract 2s from center
...     S.PadToLength(pad_length=88200),         # Normalize to exactly 2s
...     S.Gain(gain=3.0, p=0.7),                # Boost volume 70% of time
...     S.FadeIn(duration=0.1, p=0.5),          # Smooth start 50% of time
...     S.FadeOut(duration=0.1, p=0.5)          # Smooth end 50% of time
... ])
>>>
>>> # Process a batch of audio files (44.1 kHz, matching the comments above)
>>> for audio in audio_batch:
...     processed = ml_pipeline(audio, sample_rate=44100)

Audio enhancement pipeline:

>>> # Clean up audio recordings
>>> enhance_pipeline = S.Compose([
...     S.StartTrim(start_time=0.5),            # Remove first 0.5s
...     S.EndTrim(end_time=10.0),               # Keep max 10s
...     S.Gain(gain=6.0),                       # Boost volume
...     S.FadeIn(duration=0.2),                 # Smooth fade-in
...     S.FadeOut(duration=0.2)                 # Smooth fade-out
... ])
>>>
>>> enhanced = enhance_pipeline(noisy_audio, sample_rate=44100)

Notes

  • Transforms are applied in order: first transform in list is applied first

  • Each transform receives the output of the previous transform

  • Probability parameters (p) in individual transforms are respected

  • The pipeline preserves mono audio format throughout

  • All transforms must accept (samples, sample_rate) parameters

See also

Individual transform classes: Trim, Pad, RandomTrim, FadeIn, FadeOut

class soundmentations.Trim(start_time: float = 0.0, end_time: float | None = None, p: float = 1.0)[source]

Bases: BaseTrim

Trim audio to keep only the portion between start_time and end_time.

This is the most basic trimming operation that allows specifying exact start and end times for the audio segment to keep.

Parameters:
  • start_time (float, optional) – Start time in seconds to begin keeping audio, by default 0.0. Must be non-negative.

  • end_time (float, optional) – End time in seconds to stop keeping audio, by default None. If None, keeps audio until the end. Must be greater than start_time.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Trim audio to specific time range:

>>> import numpy as np
>>> from soundmentations.transforms.time import Trim
>>>
>>> # Create 5 seconds of audio at 44.1kHz
>>> audio = np.random.randn(220500)
>>>
>>> # Keep audio from 1.5 to 3.0 seconds
>>> trim_transform = Trim(start_time=1.5, end_time=3.0)
>>> trimmed = trim_transform(audio, sample_rate=44100)
>>> print(len(trimmed) / 44100)  # 1.5 seconds

Use in a pipeline:

>>> import soundmentations as S
>>>
>>> # Extract middle portion and apply gain
>>> pipeline = S.Compose([
...     S.Trim(start_time=1.0, end_time=4.0, p=1.0),
...     S.Gain(gain=6.0, p=0.5)
... ])
>>>
>>> result = pipeline(audio, sample_rate=44100)

class soundmentations.RandomTrim(duration: float | Tuple[float, float], p: float = 1.0)[source]

Bases: BaseTrim

Randomly trim audio by selecting a random segment of specified duration.

This transform randomly selects a continuous segment from the audio, useful for data augmentation where you want random crops of fixed or variable duration.

Parameters:
  • duration (float or tuple of float) – If float, exact duration to keep in seconds. If tuple (min_duration, max_duration), random duration in range.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Fixed duration random trimming:

>>> import numpy as np
>>> from soundmentations.transforms.time import RandomTrim
>>>
>>> # Always keep 2 seconds randomly
>>> trim_transform = RandomTrim(duration=2.0)
>>> trimmed = trim_transform(audio, sample_rate=44100)
>>> print(len(trimmed) / 44100)  # 2.0 seconds

Variable duration random trimming:

>>> # Keep 1-3 seconds randomly
>>> variable_trim = RandomTrim(duration=(1.0, 3.0))
>>> result = variable_trim(audio, sample_rate=44100)

Use for data augmentation:

>>> import soundmentations as S
>>>
>>> # Random crop and normalize for training
>>> augment = S.Compose([
...     S.RandomTrim(duration=(0.5, 2.5), p=0.8),
...     S.PadToLength(pad_length=88200, p=1.0),  # 2 seconds
...     S.Gain(gain=(-6, 6), p=0.5)
... ])
>>>
>>> augmented = augment(training_audio, sample_rate=44100)

class soundmentations.StartTrim(start_time: float = 0.0, p: float = 1.0)[source]

Bases: BaseTrim

Trim audio to keep only the portion starting from start_time to the end.

This removes the beginning of the audio up to start_time, keeping everything after that point.

Parameters:
  • start_time (float, optional) – Start time in seconds to begin keeping audio, by default 0.0. Must be non-negative.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Remove silence from beginning:

>>> import numpy as np
>>> from soundmentations.transforms.time import StartTrim
>>>
>>> # Remove first 2 seconds
>>> trim_transform = StartTrim(start_time=2.0)
>>> trimmed = trim_transform(audio, sample_rate=44100)

Use in preprocessing pipeline:

>>> import soundmentations as S
>>>
>>> # Remove intro and normalize
>>> preprocess = S.Compose([
...     S.StartTrim(start_time=1.5, p=1.0),
...     S.PadToLength(pad_length=132300, p=1.0)  # 3 seconds
... ])
>>>
>>> processed = preprocess(raw_audio, sample_rate=44100)

class soundmentations.EndTrim(end_time: float, p: float = 1.0)[source]

Bases: BaseTrim

Trim audio to keep only the portion from the start to end_time.

This removes the end of the audio after end_time, keeping everything before that point.

Parameters:
  • end_time (float) – End time in seconds to stop keeping audio. Must be positive.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Keep only first part of audio:

>>> import numpy as np
>>> from soundmentations.transforms.time import EndTrim
>>>
>>> # Keep first 5 seconds only
>>> trim_transform = EndTrim(end_time=5.0)
>>> trimmed = trim_transform(audio, sample_rate=44100)

Use for consistent audio lengths:

>>> import soundmentations as S
>>>
>>> # Ensure maximum 10 seconds
>>> limit_length = S.Compose([
...     S.EndTrim(end_time=10.0, p=1.0),
...     S.Gain(gain=3.0, p=0.3)
... ])
>>>
>>> limited = limit_length(long_audio, sample_rate=44100)

class soundmentations.CenterTrim(duration: float, p: float = 1.0)[source]

Bases: BaseTrim

Trim audio to keep only the center portion of specified duration.

This extracts a segment from the middle of the audio, useful for focusing on the main content while removing silence at the beginning and end.

Parameters:
  • duration (float) – Duration of the center portion to keep in seconds. Must be positive.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Extract center content:

>>> import numpy as np
>>> from soundmentations.transforms.time import CenterTrim
>>>
>>> # Keep 3 seconds from center
>>> trim_transform = CenterTrim(duration=3.0)
>>> trimmed = trim_transform(audio, sample_rate=44100)
>>> print(len(trimmed) / 44100)  # 3.0 seconds

Use for focusing on main content:

>>> import soundmentations as S
>>>
>>> # Extract center and enhance
>>> focus_pipeline = S.Compose([
...     S.CenterTrim(duration=4.0, p=1.0),
...     S.Gain(gain=6.0, p=0.6),
...     S.PadToLength(pad_length=176400, p=1.0)  # 4 seconds
... ])
>>>
>>> focused = focus_pipeline(noisy_audio, sample_rate=44100)

class soundmentations.Pad(pad_length: int, p: float = 1.0)[source]

Bases: BasePad

Pad audio to minimum length by adding zeros at the end.

If the input audio is shorter than pad_length, zeros are appended to reach the minimum length. If already longer or equal, returns unchanged.

Parameters:
  • pad_length (int) – Minimum length for the audio in samples.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Apply end padding to ensure minimum length:

>>> import numpy as np
>>> from soundmentations.transforms.time import Pad
>>>
>>> # Create short audio sample
>>> audio = np.array([0.1, 0.2, 0.3])
>>>
>>> # Pad to minimum 1000 samples
>>> pad_transform = Pad(pad_length=1000)
>>> padded = pad_transform(audio)
>>> print(len(padded))  # 1000

Use in a pipeline:

>>> import soundmentations as S
>>>
>>> # Ensure all audio is at least 2 seconds (44.1kHz)
>>> augment = S.Compose([
...     S.Pad(pad_length=88200, p=1.0),
...     S.Gain(gain=3.0, p=0.5)
... ])
>>>
>>> result = augment(audio, sample_rate=44100)

class soundmentations.CenterPad(pad_length: int, p: float = 1.0)[source]

Bases: BasePad

Pad audio to minimum length by adding zeros symmetrically on both sides.

If the input audio is shorter than pad_length, zeros are added equally to both sides. For odd padding amounts, the extra zero goes to the right.

Parameters:
  • pad_length (int) – Minimum length for the audio in samples.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Apply symmetric padding:

>>> import numpy as np
>>> from soundmentations.transforms.time import CenterPad
>>>
>>> audio = np.array([1, 2, 3])
>>> pad_transform = CenterPad(pad_length=7)
>>> result = pad_transform(audio)
>>> print(result)  # [0 0 1 2 3 0 0]

Use for centering audio in fixed-length windows:

>>> # Center audio in 5-second windows (44.1kHz)
>>> center_pad = CenterPad(pad_length=220500)
>>> centered_audio = center_pad(audio_sample)

class soundmentations.StartPad(pad_length: int, p: float = 1.0)[source]

Bases: BasePad

Pad audio to minimum length by adding zeros at the beginning.

If the input audio is shorter than pad_length, zeros are prepended to reach the minimum length. If already longer or equal, returns unchanged.

Parameters:
  • pad_length (int) – Minimum length for the audio in samples.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Apply start padding:

>>> import numpy as np
>>> from soundmentations.transforms.time import StartPad
>>>
>>> audio = np.array([1, 2, 3])
>>> pad_transform = StartPad(pad_length=6)
>>> result = pad_transform(audio)
>>> print(result)  # [0 0 0 1 2 3]

Use for aligning audio to end of fixed windows:

>>> # Align audio to end of 3-second windows
>>> start_pad = StartPad(pad_length=132300)  # 3 seconds at 44.1kHz
>>> aligned_audio = start_pad(audio_sample)

class soundmentations.PadToLength(pad_length: int, p: float = 1.0)[source]

Bases: BasePad

Pad or trim audio to exact target length using end operations.

  • If shorter: adds zeros at the end to reach exact length

  • If longer: trims from the end to reach exact length

  • If equal: returns unchanged

Parameters:
  • pad_length (int) – Exact target length for the audio in samples.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Normalize all audio to exact length:

>>> import numpy as np
>>> from soundmentations.transforms.time import PadToLength
>>>
>>> # Short audio
>>> short_audio = np.array([1, 2, 3])
>>> # Long audio
>>> long_audio = np.arange(10)
>>>
>>> pad_transform = PadToLength(pad_length=5)
>>>
>>> result1 = pad_transform(short_audio)
>>> print(result1)  # [1 2 3 0 0]
>>>
>>> result2 = pad_transform(long_audio)
>>> print(result2)  # [0 1 2 3 4]

Use for fixed-length model inputs:

>>> # Ensure all audio is exactly 2 seconds for ML model
>>> normalize_length = PadToLength(pad_length=88200)  # 2s at 44.1kHz
>>> model_input = normalize_length(variable_length_audio)

class soundmentations.CenterPadToLength(pad_length: int, p: float = 1.0)[source]

Bases: BasePad

Pad or trim audio to exact target length using center operations.

  • If shorter: adds zeros symmetrically on both sides

  • If longer: trims symmetrically from both sides (keeps center)

  • If equal: returns unchanged

Parameters:
  • pad_length (int) – Exact target length for the audio in samples.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Center-normalize audio to exact length:

>>> import numpy as np
>>> from soundmentations.transforms.time import CenterPadToLength
>>>
>>> # Short audio - will be center-padded
>>> short_audio = np.array([1, 2, 3])
>>> # Long audio - will be center-trimmed
>>> long_audio = np.arange(9)
>>>
>>> pad_transform = CenterPadToLength(pad_length=7)
>>>
>>> result1 = pad_transform(short_audio)
>>> print(result1)  # [0 0 1 2 3 0 0]
>>>
>>> result2 = pad_transform(long_audio)
>>> print(result2)  # [1 2 3 4 5 6 7]

Use for preserving important audio content in center:

>>> # Keep center 3 seconds for speech processing
>>> center_normalize = CenterPadToLength(pad_length=132300)
>>> processed_audio = center_normalize(speech_audio)

class soundmentations.PadToMultiple(pad_length: int, p: float = 1.0)[source]

Bases: BasePad

Pad audio to make its length a multiple of the specified value.

This is useful for STFT operations where frame sizes must be multiples of certain values. Only adds padding at the end, never trims.

Parameters:
  • pad_length (int) – The multiple value. Audio length will be padded to next multiple of this value. Common values: 1024, 512, 256 for STFT operations.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

Pad for STFT-friendly lengths:

>>> import numpy as np
>>> from soundmentations.transforms.time import PadToMultiple
>>>
>>> # Audio with length 2050 samples
>>> audio = np.random.randn(2050)
>>>
>>> # Pad to multiple of 1024 (STFT frame size)
>>> pad_transform = PadToMultiple(pad_length=1024)
>>> result = pad_transform(audio)
>>> print(len(result))  # 3072 (3 * 1024)

Use in spectral processing pipeline:

>>> import soundmentations as S
>>>
>>> # Prepare audio for spectral analysis
>>> spectral_prep = S.Compose([
...     S.PadToMultiple(pad_length=512, p=1.0),  # STFT-friendly
...     S.Gain(gain=(-3, 3), p=0.5)
... ])
>>>
>>> stft_ready_audio = spectral_prep(raw_audio, sample_rate=44100)

class soundmentations.Gain(gain: float = 1.0, clip: bool = True, p: float = 1.0)[source]

Bases: BaseGain

Apply a fixed gain (in dB) to audio samples.

This transform multiplies the audio samples by a gain factor derived from the specified gain in decibels. Optionally clips the output to prevent values from exceeding the [-1, 1] range.

Parameters:
  • gain (float, optional) – Gain in decibels, by default 1.0. Positive values increase volume, negative values decrease volume.

  • clip (bool, optional) – Whether to clip the output to [-1, 1] range, by default True. Prevents audio distortion from excessive gain.

  • p (float, optional) – Probability of applying the gain transform, by default 1.0.
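The "gain factor derived from the specified gain in decibels" is presumably the standard amplitude convention, factor = 10^(dB/20). A small NumPy sketch of that arithmetic (an assumption about the internals, not the library's source):

```python
import numpy as np

def db_to_amplitude(gain_db: float) -> float:
    # amplitude ratio for a gain in decibels (20*log10 convention):
    # +6 dB is roughly a doubling, +20 dB is exactly 10x
    return 10.0 ** (gain_db / 20.0)

samples = np.array([0.1, 0.2, -0.1, 0.3])
# apply +6 dB and clip to [-1, 1], mirroring Gain(gain=6.0, clip=True)
boosted = np.clip(samples * db_to_amplitude(6.0), -1.0, 1.0)
```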

Examples

Apply a fixed gain to audio samples:

>>> import numpy as np
>>> from soundmentations.transforms.amplitude import Gain
>>>
>>> # Create audio samples
>>> samples = np.array([0.1, 0.2, -0.1, 0.3])
>>>
>>> # Apply +6dB gain
>>> gain_transform = Gain(gain=6.0)
>>> amplified = gain_transform(samples)
>>>
>>> # Apply -12dB gain with 50% probability
>>> quiet_transform = Gain(gain=-12.0, p=0.5)
>>> result = quiet_transform(samples)

Use in a pipeline with other transforms:

>>> import soundmentations as S
>>>
>>> # Create augmentation pipeline
>>> augment = S.Compose([
...     S.RandomTrim(duration=(1.0, 3.0), p=0.8),
...     S.Gain(gain=6.0, clip=True, p=0.7),
...     S.PadToLength(pad_length=44100, p=0.5)
... ])
>>>
>>> # Apply pipeline to audio
>>> audio_samples = np.random.randn(220500)  # 5 seconds at 44.1kHz
>>> augmented = augment(samples=audio_samples, sample_rate=44100)

Different gain scenarios:

>>> # Boost quiet audio
>>> boost = Gain(gain=12.0, clip=True)
>>>
>>> # Attenuate loud audio
>>> attenuate = Gain(gain=-6.0, clip=False)
>>>
>>> # Randomized gain (note: sampled once at construction, not per call)
>>> random_volume = Gain(gain=np.random.uniform(-10, 10), p=0.6)

class soundmentations.Limiter(threshold: float = 0.9, p: float = 1.0)[source]

Bases: BaseLimiter

Apply hard limiting to audio samples to prevent clipping.

This transform clips audio samples that exceed the specified threshold, preventing digital clipping and maintaining signal integrity within the specified dynamic range.

Parameters:
  • threshold (float, optional) – The threshold level for limiting, by default 0.9. Values above this threshold will be clipped. Must be between 0.0 and 1.0.

  • p (float, optional) – Probability of applying the transform, by default 1.0. Must be between 0.0 and 1.0.

Examples

Apply hard limiting to prevent clipping:

>>> import numpy as np
>>> from soundmentations.transforms.amplitude import Limiter
>>>
>>> # Create audio with some peaks above 0.9
>>> audio = np.array([0.5, 1.2, -1.5, 0.8, 0.95])
>>>
>>> # Apply limiting at 0.9 threshold
>>> limiter = Limiter(threshold=0.9)
>>> limited = limiter(audio, sample_rate=44100)
>>> print(limited)  # [0.5, 0.9, -0.9, 0.8, 0.9]

Use in audio processing pipeline:

>>> import soundmentations as S
>>>
>>> # Safe audio processing with limiting
>>> safe_pipeline = S.Compose([
...     S.Gain(gain=12.0, p=1.0),           # Boost signal
...     S.Limiter(threshold=0.95, p=1.0),   # Prevent clipping
...     S.FadeOut(duration=0.1, p=0.5)      # Smooth ending
... ])
>>>
>>> processed = safe_pipeline(audio, sample_rate=44100)

Protect against digital distortion:

>>> # Conservative limiting for pristine quality
>>> conservative_limiter = Limiter(threshold=0.8, p=1.0)
>>> clean_audio = conservative_limiter(loud_audio, sample_rate=44100)

class soundmentations.FadeIn(duration: float = 0.1, p: float = 1.0)[source]

Bases: BaseFade

Apply a fade-in effect to the beginning of audio samples.

This transform gradually increases the amplitude from silence (0) to full amplitude over the specified duration, creating a smooth fade-in effect.

Parameters:
  • duration (float, optional) – Duration of the fade-in effect in seconds, by default 0.1. Must be positive and less than the audio duration.

  • p (float, optional) – Probability of applying the transform, by default 1.0.
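Conceptually, a fade-in scales the first duration seconds by a ramp rising from 0 to 1. A minimal NumPy sketch of a linear fade-in (an illustration; the library's exact fade curve may differ):

```python
import numpy as np

def linear_fade_in(samples, sample_rate, duration=0.1):
    # number of samples covered by the fade
    n_fade = min(int(duration * sample_rate), len(samples))
    out = samples.astype(float).copy()
    if n_fade > 0:
        # scale the beginning up from silence to full amplitude
        out[:n_fade] *= np.linspace(0.0, 1.0, n_fade)
    return out

faded = linear_fade_in(np.ones(1000), sample_rate=1000, duration=0.5)
# first sample is silent; samples after the fade are unchanged
```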

class soundmentations.FadeOut(duration: float = 0.1, p: float = 1.0)[source]

Bases: BaseFade

Apply a fade-out effect to the end of audio samples.

This transform gradually decreases the amplitude from full amplitude to silence (0) over the specified duration, creating a smooth fade-out effect.

Parameters:
  • duration (float, optional) – Duration of the fade-out effect in seconds, by default 0.1. Must be positive and less than the audio duration.

  • p (float, optional) – Probability of applying the transform, by default 1.0.
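Conceptually, a fade-out scales the final duration seconds by a ramp falling from 1 to 0. A minimal NumPy sketch (an illustration; the library's exact fade curve may differ):

```python
import numpy as np

def linear_fade_out(samples, sample_rate, duration=0.1):
    # number of samples covered by the fade
    n_fade = min(int(duration * sample_rate), len(samples))
    out = samples.astype(float).copy()
    if n_fade > 0:
        # scale the tail down from full amplitude to silence
        out[-n_fade:] *= np.linspace(1.0, 0.0, n_fade)
    return out

faded_out = linear_fade_out(np.ones(1000), sample_rate=1000, duration=0.25)
# samples before the fade are unchanged; the last sample is silent
```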

class soundmentations.PitchShift(semitones: float, p: float = 1.0)[source]

Bases: BasePitchShift

Shift the pitch of audio by a specified number of semitones.

Parameters:
  • semitones (float) – Number of semitones to shift (positive or negative). 12 semitones = 1 octave; positive values shift pitch up, negative values shift pitch down.

  • p (float, optional) – Probability of applying the transform, by default 1.0.
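The semitone-to-rate relationship (12 semitones per octave) can be illustrated with a naive resampling sketch. Note that this toy version also shortens or lengthens the audio, which a real pitch shifter (e.g. a phase-vocoder implementation) compensates for; it illustrates the frequency-ratio arithmetic only, not the library's algorithm:

```python
import numpy as np

def semitones_to_rate(semitones: float) -> float:
    # 12 semitones = 1 octave = a factor of 2 in frequency
    return 2.0 ** (semitones / 12.0)

def naive_pitch_shift(samples, semitones):
    # read the signal faster/slower by the rate factor; this raises/lowers
    # pitch but, unlike a real pitch shifter, also changes the duration
    rate = semitones_to_rate(semitones)
    positions = np.arange(0, len(samples), rate)
    return np.interp(positions, np.arange(len(samples)), samples)

shifted = naive_pitch_shift(np.zeros(100), 12.0)  # one octave up
```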

class soundmentations.RandomPitchShift(min_semitones: float = -2.0, max_semitones: float = 2.0, p: float = 1.0)[source]

Bases: BasePitchShift

Randomly shift the pitch within a specified semitone range.

This class wraps PitchShift to provide random pitch variations for data augmentation purposes.

Parameters:
  • min_semitones (float, optional) – Minimum semitone shift, by default -2.0.

  • max_semitones (float, optional) – Maximum semitone shift, by default 2.0.

  • p (float, optional) – Probability of applying the transform, by default 1.0.

Examples

>>> # Random pitch variation for training data
>>> random_pitch = RandomPitchShift(min_semitones=-1.0, max_semitones=1.0, p=0.8)
>>> augmented = random_pitch(audio, sample_rate=44100)

soundmentations.load_audio(file_path: str, sample_rate: int | None = None) → Tuple[ndarray, int][source]

Load an audio file and return the audio data as a mono numpy array.

Parameters:
  • file_path (str) – Path to the audio file.

  • sample_rate (int, optional) – Desired sample rate. If None, uses the original sample rate.

Returns:
  • Tuple[np.ndarray, int] – Mono audio data as numpy array and sample rate.

Raises:
  • FileNotFoundError – If the audio file doesn't exist.

  • ValueError – If the audio file format is unsupported.

  • RuntimeError – If resampling fails.
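The stereo-to-mono conversion implied by "return the audio data as a mono numpy array" is typically a per-sample average across channels. A minimal sketch of that downmix (an assumption about the helper's behavior, not its actual source):

```python
import numpy as np

def to_mono(audio: np.ndarray) -> np.ndarray:
    # (n_samples,) is already mono; (n_samples, n_channels) is averaged
    if audio.ndim == 1:
        return audio
    return audio.mean(axis=1)

stereo = np.array([[0.2, 0.4],
                   [0.0, 1.0]])   # two samples, two channels
mono = to_mono(stereo)            # [0.3, 0.5]
```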