
Creating an Audio Analyzer with Qdrant Vector Database

Introduction

Vector databases have revolutionized how we handle high-dimensional data, especially in domains like audio processing, image recognition, and natural language processing. In this blog post, I’ll walk you through my experience building and testing a Qdrant vector database implementation for audio feature extraction and similarity search.

As a music producer and synthesizer enthusiast, I was particularly interested in analyzing Serum 2 - my personal favorite wavetable synthesizer - to extract meaningful features from its wavetables and search for similar sounds. The project combines Python’s powerful audio analysis library (librosa) with Qdrant’s efficient vector storage into a system that can analyze audio files, extract meaningful features, and find similar audio based on acoustic characteristics.

The Goal: Analyzing Serum 2 Wavetables

Serum 2 is a powerful wavetable synthesizer that allows users to create complex, evolving sounds by manipulating wavetables - essentially arrays of single-cycle waveforms. My aim was to:

  1. Extract comprehensive audio features from Serum 2 wavetables
  2. Store these features as vectors in Qdrant for efficient similarity search
  3. Build a recommendation system that could suggest similar wavetables based on acoustic characteristics
  4. Understand the harmonic content of different wavetable types for synthesis applications

This would enable producers to find wavetables with similar timbral characteristics, discover new sounds, and understand the harmonic relationships between different wavetable types.

What is Qdrant?

Qdrant is a vector similarity search engine that provides a production-ready service with a convenient API to store, search, and manage points (vectors) with additional payload data. Unlike traditional databases that organize data in rows and columns, vector databases are optimized for storing and querying high-dimensional vectors efficiently.

Key features of Qdrant:

  1. Fast approximate nearest-neighbor search built on an HNSW index
  2. Multiple distance metrics, including Cosine, Euclidean, and Dot product
  3. JSON payloads attached to each point, with filtering applied during search
  4. REST and gRPC APIs, with an official Python client
  5. Simple local deployment via Docker, plus a managed cloud offering
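
If you want to follow along, the quickest way to get a local instance is the official Docker image (assuming Docker is installed; 6333 is the REST port the Python client uses by default):

# Start Qdrant locally; data lives inside the container unless you mount a volume
docker run -p 6333:6333 qdrant/qdrant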

Project Architecture

The audio analysis system consists of three main components:

  1. AudioAnalyzer: Extracts comprehensive audio features using librosa
  2. QdrantAudioDatabase: Manages vector storage and similarity search
  3. Main Application: Orchestrates the analysis and storage process

Let’s dive into each component:

Audio Feature Extraction

The AudioAnalyzer class is the heart of our feature extraction system. It uses librosa to extract multiple types of audio features that capture different aspects of the audio signal.

Core Feature Sets

from pathlib import Path
from typing import Dict, List, Optional, Union

import librosa
import numpy as np

class AudioAnalyzer:
    def __init__(self, sample_rate: int = 22050, n_mels: int = 128,
                 n_mfcc: int = 13, max_harmonics: int = 50):
        self.sample_rate = sample_rate
        self.n_mels = n_mels
        self.n_mfcc = n_mfcc
        self.max_harmonics = max_harmonics
        self._audio_cache = {}  # Cache for efficient processing

The analyzer extracts seven different feature sets:

1. Basic Features

def extract_basic_features(self, audio_path: Union[str, Path]) -> Dict[str, float]:
    y, sr = self.load_audio(audio_path)

    duration = librosa.get_duration(y=y, sr=sr)
    rms = librosa.feature.rms(y=y)[0]

    return {
        "duration": duration,
        "rms_mean": float(np.mean(rms)),
        "rms_std": float(np.std(rms)),
    }

2. Spectral Features

def extract_spectral_features(self, audio_path: Union[str, Path]) -> Dict[str, float]:
    y, sr = self.load_audio(audio_path)

    spectral_centroids = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
    spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)[0]
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)[0]

    return {
        "spectral_centroid_mean": float(np.mean(spectral_centroids)),
        "spectral_centroid_std": float(np.std(spectral_centroids)),
        "spectral_rolloff_mean": float(np.mean(spectral_rolloff)),
        # ... more spectral features
    }

3. MFCC Features

MFCCs (Mel-Frequency Cepstral Coefficients) are crucial for audio similarity. They capture what makes sounds perceptually distinct (like instruments or voices) while ignoring details we don’t notice (like exact pitch or phase).

That makes them perfect for tasks such as speech recognition, speaker identification, and music similarity search.

MFCC Mean: averaging across time compresses the whole clip into a fixed-length vector. This makes clips of different durations directly comparable and highlights the overall timbre rather than every frame.

MFCC Delta: measuring how MFCCs change across frames captures the dynamics of the sound: how notes attack, decay, or transition. This adds motion information on top of the static timbre.

The mean and delta turn MFCCs into a feature set that describes both what the sound is (its timbre) and how it evolves over time (its dynamics).

def extract_mfcc_features(self, audio_path: Union[str, Path]) -> Dict[str, np.ndarray]:
    y, sr = self.load_audio(audio_path)

    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=self.n_mfcc)
    mfcc_mean = np.mean(mfccs, axis=1)
    mfcc_delta = librosa.feature.delta(mfccs)

    return {
        "mfcc_mean": mfcc_mean.astype(np.float32),
        "mfcc_std": np.std(mfccs, axis=1).astype(np.float32),
        "mfcc_delta_mean": np.mean(mfcc_delta, axis=1).astype(np.float32),
        # ... more MFCC features
    }

4. Harmonic Analysis

In addition to MFCCs, another powerful way to describe audio is by looking at its fundamental frequency and harmonics. This helps us understand not just the timbre, but the musical character of a sound.

from scipy.fft import fft, fftfreq

def extract_harmonic_features(self, audio_path: Union[str, Path]) -> Dict[str, Union[float, int, str, np.ndarray]]:
    y, sr = self.load_audio(audio_path)

    # Get fundamental frequency using librosa's pitch detection
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
    )
    # Collapse the per-frame f0 track into a single estimate (pyin returns NaN for unvoiced frames)
    fundamental_freq = float(np.nanmedian(f0))

    # Analyze harmonics using FFT
    fft_result = fft(y)
    freqs = fftfreq(len(y), 1 / sr)
    magnitude = np.abs(fft_result)

    harmonic_data = self._analyze_harmonics(freqs, magnitude, fundamental_freq)

    return {
        "fundamental_frequency": fundamental_freq,
        "total_harmonics": harmonic_data["total_harmonics"],
        "odd_harmonics": harmonic_data["odd_harmonics"],
        "even_harmonics": harmonic_data["even_harmonics"],
        "waveform_type": self._classify_waveform_type(harmonic_data),
        # ... more harmonic features
    }

Fundamental Frequency (f₀)

Using librosa.pyin, we estimate the pitch — the lowest frequency that defines the note being played or sung. This is essential for recognizing melodies or matching sounds at the note level.
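
A handy follow-up (not part of the analyzer itself): librosa can map the estimated f0 straight to a note name, which is useful for sanity-checking wavetable pitches. Here f0 is the track returned by pyin above:

fundamental = float(np.nanmedian(f0))   # nanmedian skips unvoiced (NaN) frames
print(librosa.hz_to_note(fundamental))  # e.g. "F1" for the ~43 Hz wavetable analyzed later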

Harmonics via FFT

Every real-world sound isn’t just a pure sine wave. Instruments and voices generate overtones — multiples of the fundamental frequency — called harmonics. By applying the Fast Fourier Transform (FFT), we can break the signal into its frequency components and measure how strong each harmonic is.
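
The post doesn’t show _analyze_harmonics, so here is a minimal sketch of the idea (the function name and tolerance are my assumptions): for each integer multiple of f0, take the strongest FFT bin within a small window around the expected frequency.

import numpy as np

def scan_harmonics(freqs, magnitude, f0, max_harmonics=50, tolerance=0.05):
    """Measure each harmonic n * f0 in an FFT spectrum (illustrative sketch)."""
    positive = freqs > 0  # keep only the positive half of the spectrum
    freqs, magnitude = freqs[positive], magnitude[positive]

    harmonics = []
    for n in range(1, max_harmonics + 1):
        # Look within +/- tolerance * f0 of the expected harmonic frequency
        window = np.abs(freqs - n * f0) <= tolerance * f0
        if not window.any():
            continue
        peak = np.argmax(magnitude[window])
        harmonics.append({
            "n": n,
            "freq": float(freqs[window][peak]),
            "magnitude": float(magnitude[window][peak]),
            "type": "odd" if n % 2 else "even",
        })
    return harmonics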

Why this matters

The balance of harmonics (odd vs. even) is what makes a flute sound different from a violin playing the same note.

Counting harmonics and comparing their strengths lets us classify waveform types (sawtooth, square, triangle, etc.) and better capture timbre.

For similarity tasks, this adds another dimension: two sounds might share MFCCs but differ in harmonic structure.
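
That intuition is what a classifier like _classify_waveform_type encodes. A hypothetical sketch with illustrative thresholds (not the project’s actual values):

def classify_waveform(odd_ratio: float, even_ratio: float, total_harmonics: int) -> str:
    """Rough waveform classification from harmonic content (illustrative only)."""
    if total_harmonics <= 2:
        return "sine"  # essentially just the fundamental
    if odd_ratio > 0.9:
        # Squares and triangles contain only odd harmonics; triangle
        # partials roll off much faster (1/n^2 vs 1/n), so fewer are detectable
        return "square" if total_harmonics > 10 else "triangle"
    if abs(odd_ratio - even_ratio) < 0.3:
        return "sawtooth"  # saws contain all harmonics, odd and even
    return "complex"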

Feature Vector Creation

All features are combined into a normalized vector for similarity search:

def create_feature_vector(self, features: Dict[str, Union[float, int, str, np.ndarray]]) -> np.ndarray:
    vector_components = []

    for key, value in features.items():
        if key == "waveform_type":  # Skip string features
            continue

        if isinstance(value, (int, float)):
            # Use 0.0 for non-finite scalars so every file produces the same vector length
            vector_components.append([float(value) if np.isfinite(value) else 0.0])
        elif isinstance(value, np.ndarray):
            flat_value = value.flatten()
            flat_value = np.where(np.isfinite(flat_value), flat_value, 0.0)
            vector_components.append(flat_value.astype(np.float32))

    # Normalize the feature vector
    feature_vector = np.concatenate(vector_components)
    norm = np.linalg.norm(feature_vector)
    if norm > 1e-8:
        feature_vector = feature_vector / norm

    return feature_vector.astype(np.float32)

Qdrant Database Integration

The QdrantAudioDatabase class provides a high-level interface for storing and retrieving audio features:

Database Initialization

class QdrantAudioDatabase:
    def __init__(self, host: str = "localhost", port: int = 6333,
                 collection_name: str = "audio_features",
                 analyzer: Optional[AudioAnalyzer] = None):
        self.client = QdrantClient(host=host, port=port)
        self.collection_name = collection_name
        self.analyzer = analyzer if analyzer is not None else AudioAnalyzer()
        self.vector_size = self._calculate_vector_size()
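
The _calculate_vector_size helper isn’t shown in the post. One straightforward way to implement it, sketched here under the assumption that the analyzer API above is all we have: synthesize a short test tone, run it through the full pipeline, and measure the resulting vector.

import tempfile
from pathlib import Path

import numpy as np
import soundfile as sf

def _calculate_vector_size(self) -> int:
    """Probe the analyzer with a one-second sine tone to discover the vector length."""
    sr = self.analyzer.sample_rate
    t = np.linspace(0, 1.0, sr, endpoint=False)
    probe = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # A4 test tone

    with tempfile.TemporaryDirectory() as tmpdir:
        probe_path = Path(tmpdir) / "probe.wav"
        sf.write(probe_path, probe, sr)
        features = self.analyzer.extract_all_features(probe_path)
        return len(self.analyzer.create_feature_vector(features))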

Collection Setup

def initialize_collection(self, recreate: bool = False) -> None:
    collection_exists = self.client.collection_exists(self.collection_name)

    if recreate and collection_exists:
        self.client.delete_collection(self.collection_name)
        collection_exists = False

    if not collection_exists:
        self.client.create_collection(
            collection_name=self.collection_name,
            vectors_config=VectorParams(
                size=self.vector_size,
                distance=Distance.COSINE,  # Best for normalized feature vectors
            ),
        )

Storing Audio Features

def store_audio_features(self, audio_path: Union[str, Path],
                        metadata: Optional[Dict] = None,
                        feature_sets: Optional[List[str]] = None) -> str:
    audio_path = Path(audio_path)  # accept plain strings; .name is used below

    # Extract audio features
    features = self.analyzer.extract_all_features(audio_path, feature_sets)
    feature_vector = self.analyzer.create_feature_vector(features)

    # Generate unique point ID
    point_id = str(uuid.uuid4())

    # Prepare payload with metadata
    payload = {
        "file_path": str(audio_path),
        "file_name": audio_path.name,
        "analysis_timestamp": str(np.datetime64("now")),
    }

    # Add relevant features to payload for filtering
    self._add_features_to_payload(payload, features)
    if metadata:
        payload.update(metadata)

    # Store in Qdrant
    self.client.upsert(
        collection_name=self.collection_name,
        points=[PointStruct(
            id=point_id,
            vector=feature_vector.tolist(),
            payload=payload
        )],
    )

    return point_id

Similarity Search

def find_similar_audio(self, audio_path: Union[str, Path],
                     limit: int = 5,
                     score_threshold: Optional[float] = None) -> List[Dict]:
    # Extract features from query audio
    features = self.analyzer.extract_all_features(audio_path)
    query_vector = self.analyzer.create_feature_vector(features)

    # Search for similar vectors
    search_results = self.client.search(
        collection_name=self.collection_name,
        query_vector=query_vector.tolist(),
        limit=limit,
        with_payload=True,
        score_threshold=score_threshold
    )

    # Format results
    results = []
    for result in search_results:
        result_data = {
            "id": result.id,
            "score": result.score,
            "file_path": result.payload.get("file_path"),
            "file_name": result.payload.get("file_name"),
        }
        result_data.update(result.payload)
        results.append(result_data)

    return results

Usage Examples

Command Line Interface

The main application provides a comprehensive CLI for processing audio files:

# Process all audio files in a directory
python main.py --directory /path/to/audio/files

# Generate sample files for testing
python main.py --generate-samples

# Process directory and show similarity results
python main.py --directory Tables/ --similarity-search

# Extract only specific features
python main.py --directory Tables/ --features basic spectral harmonic
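
The argument parsing itself isn’t shown in the post; here is a minimal sketch of the wiring, inferring the flag names from the commands above (the choices list is my assumption):

import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Audio analysis with Qdrant")
    parser.add_argument("--directory", help="Directory of audio files to process")
    parser.add_argument("--generate-samples", action="store_true",
                        help="Create synthetic waveforms for testing")
    parser.add_argument("--similarity-search", action="store_true",
                        help="Run a similarity search after processing")
    parser.add_argument("--features", nargs="+",
                        choices=["basic", "spectral", "mfcc", "harmonic"],
                        help="Restrict extraction to specific feature sets")
    return parser.parse_args()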

Programmatic Usage

from audio_analyzer import AudioAnalyzer
from qdrant_database import QdrantAudioDatabase

# Initialize components
analyzer = AudioAnalyzer(sample_rate=22050, n_mels=128, n_mfcc=13)
audio_db = QdrantAudioDatabase(
    host="localhost",
    port=6333,
    collection_name="audio_features",
    analyzer=analyzer
)

# Initialize collection
audio_db.initialize_collection(recreate=True)

# Store audio features
point_id = audio_db.store_audio_features(
    "path/to/audio.wav",
    metadata={"genre": "electronic", "instrument": "synthesizer"}
)

# Find similar audio
similar_audio = audio_db.find_similar_audio("query_audio.wav", limit=5)

for result in similar_audio:
    print(f"File: {result['file_name']}")
    print(f"Similarity Score: {result['score']:.4f}")
    print(f"Tempo: {result['tempo']:.1f} BPM")
    print(f"Waveform Type: {result['waveform_type']}")
    print(f"Fundamental Freq: {result['fundamental_frequency']:.1f} Hz")

Serum 2 Wavetable Analysis

Real Wavetable Analysis

One of the most exciting aspects of this project was analyzing actual Serum 2 wavetables. I processed wavetables from Serum’s built-in library, including the “4088” wavetable from the Analog category. Here’s what the analysis revealed:

================================================================================
SERUM WAVETABLE ANALYSIS REPORT
================================================================================
File: Tables/Analog/4088.wav
Analysis Date: 2025-08-03T21:26:46

SUMMARY STATISTICS:

Total Harmonics: 31
Odd Harmonics: 8
Even Harmonics: 23
Odd Content Ratio: 0.699
Even Content Ratio: 0.301
Fundamental Frequency: 43.07 Hz

HARMONIC ANALYSIS:

Harm  Freq (Hz)  Magnitude  Phase (rad)  Type

1 43.07 204.979 2.506 odd
3 129.20 175.082 2.593 odd
6 258.40 46.229 -3.113 even
9 387.60 33.811 -2.383 odd
14 581.40 18.030 1.817 even
18 753.66 13.658 2.877 even
22 925.93 12.952 -2.308 even
24 1055.13 9.457 -1.549 even
28 1184.33 15.732 -0.705 even
30 1270.46 4.255 -0.024 even
32 1356.59 3.945 0.479 even
34 1442.72 10.751 0.948 even
37 1593.46 5.442 -1.436 odd
40 1722.66 5.774 -0.585 even
44 1894.92 4.624 0.473 even
48 2067.19 4.127 1.543 even
52 2239.45 4.508 2.671 even
55 2368.65 3.777 -2.926 odd
58 2497.85 6.147 -2.035 even
64 2777.78 4.983 2.756 even
68 2906.98 2.904 -2.830 even
70 3036.18 2.842 -1.942 even
74 3208.45 2.415 -0.903 even
79 3402.25 2.243 -3.095 odd
83 3574.51 2.369 -1.883 odd
86 3682.18 2.334 1.947 even
89 3832.91 3.406 -0.287 odd
96 4112.84 3.127 -1.915 even
118 5103.37 2.120 -2.562 even
120 5167.97 2.475 1.184 even
126 5426.37 2.242 2.967 even

REVERSE ENGINEERING NOTES:

• This wavetable has mixed harmonic content
• Consider using custom wavetable or complex waveform
• Low fundamental frequency - good for bass sounds
================================================================================

Key Insights from Serum 2 Analysis

The analysis of the 4088 wavetable revealed fascinating characteristics:

  1. Mixed Harmonic Content: The wavetable contains both odd and even harmonics, making it suitable for complex, evolving sounds
  2. Low Fundamental Frequency: At 43.07 Hz, this wavetable is perfect for bass sounds
  3. Rich Harmonic Spectrum: 31 harmonics detected, indicating a complex timbre
  4. Synthesis Formula: The analysis provided an exact mathematical formula for recreating the wavetable using additive synthesis (sketched below)
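
In essence, that formula is a sum of sinusoidal partials, one per row of the harmonic table. A rough sketch of the reconstruction (the textbook additive-synthesis sum, not the tool’s literal output):

import numpy as np

def resynthesize(harmonics, duration=1.0, sample_rate=44100):
    """Additive resynthesis: sum one sinusoid per analyzed harmonic."""
    t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
    y = np.zeros_like(t)
    for h in harmonics:
        y += h["magnitude"] * np.sin(2 * np.pi * h["freq"] * t + h["phase"])
    return y / np.max(np.abs(y))  # normalize to avoid clipping

# The two strongest partials from the 4088 report above:
partials = [
    {"freq": 43.07, "magnitude": 204.979, "phase": 2.506},
    {"freq": 129.20, "magnitude": 175.082, "phase": 2.593},
]
wave = resynthesize(partials)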

This level of analysis would be incredibly valuable for sound design, preset discovery, and synthesis education; the practical applications below build on exactly these insights.

Practical Applications for Serum 2 Users

The system I built could revolutionize how producers work with Serum 2:

1. Intelligent Wavetable Search

# Find wavetables similar to your favorite bass sound
similar_wavetables = audio_db.find_similar_audio(
    "Tables/Analog/4088.wav",
    limit=10,
    filter_conditions={"category": "Analog"}  # Only search within Analog category
)

for result in similar_wavetables:
    print(f"Similar wavetable: {result['file_name']}")
    print(f"Similarity: {result['score']:.3f}")
    print(f"Fundamental: {result['fundamental_frequency']:.1f} Hz")
    print(f"Harmonic richness: {result['harmonic_richness']:.3f}")

2. Harmonic Content Analysis

The system provides detailed insights into wavetable characteristics: fundamental frequency, odd/even harmonic balance, and waveform type classification.

3. Wavetable Recommendation Engine

# Find wavetables suitable for bass sounds
bass_wavetables = audio_db.filter_audio_by_metadata({
    "fundamental_frequency": {"$lt": 100},  # Low fundamental frequency
    "harmonic_richness": {"$gt": 0.5}       # Rich harmonic content
})
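
filter_audio_by_metadata is a wrapper from this project, not a qdrant-client method. In the underlying client API, that Mongo-style query would translate to roughly the following (a sketch assuming the payload field names used earlier):

from qdrant_client.models import Filter, FieldCondition, Range

bass_filter = Filter(
    must=[
        FieldCondition(key="fundamental_frequency", range=Range(lt=100)),  # low fundamental
        FieldCondition(key="harmonic_richness", range=Range(gt=0.5)),      # rich harmonics
    ]
)

# scroll() pages through matching points without needing a query vector
points, next_offset = audio_db.client.scroll(
    collection_name="audio_features",
    scroll_filter=bass_filter,
    with_payload=True,
    limit=50,
)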

4. Educational Tool for Synthesis

The analysis reports help producers understand how harmonic structure shapes timbre, and how classic waveforms like saws, squares, and triangles differ in their odd and even harmonic content.

Testing and Results

Sample Audio Generation

I created synthetic audio files to test the system:

import numpy as np
import soundfile as sf

def create_sample_audio_file(output_path: Union[str, Path],
                           waveform_type: str = "sine",
                           frequency: float = 440.0,
                           duration: float = 3.0,
                           sample_rate: int = 22050,
                           amplitude: float = 0.5) -> Path:
    t = np.linspace(0, duration, int(sample_rate * duration), False)

    if waveform_type.lower() == "sine":
        audio_data = amplitude * np.sin(2 * np.pi * frequency * t)
    elif waveform_type.lower() == "square":
        audio_data = amplitude * np.sign(np.sin(2 * np.pi * frequency * t))
        # Layer extra odd harmonics for a more realistic sound
        for n in range(3, 20, 2):
            audio_data += (amplitude / n) * np.sin(2 * np.pi * frequency * n * t)
    # ... more waveform types

    sf.write(output_path, audio_data, sample_rate)
    return Path(output_path)

Performance Testing

The system successfully processed various audio files and demonstrated:

  1. Feature Extraction: Comprehensive analysis of audio characteristics
  2. Vector Similarity: Accurate similarity matching based on acoustic features
  3. Scalability: Efficient processing of multiple audio files
  4. Metadata Storage: Rich payload data for filtering and display

Sample Results

When testing with synthetic waveforms, the system correctly identified each generated waveform type from its harmonic content.

Conclusion

Key Takeaways

The system successfully demonstrates how vector databases can be applied to synthesizer analysis, opening possibilities for intelligent wavetable search, harmonic reverse engineering, and recommendation-driven sound design.

This project bridges the gap between music production and data science, showing how modern vector database technology can enhance creative workflows in music production.
