Global Image Descriptors: From HOG to Deep Learning Features

Jared Chung

Global image descriptors are fundamental to computer vision, providing compact yet informative representations of entire images. These descriptors encode important visual characteristics and have been crucial for tasks like image retrieval, classification, and similarity matching.

In this comprehensive guide, we'll explore the evolution from traditional hand-crafted descriptors to modern deep learning-based global features.

What are Global Image Descriptors?

Global image descriptors capture characteristics of an entire image in a fixed-size vector representation. Unlike local descriptors that focus on specific keypoints, global descriptors summarize the overall visual content.

Key Properties

  1. Fixed dimensionality: Consistent vector size regardless of image size
  2. Invariance: Robustness to transformations such as translation, rotation, or illumination changes, to the degree the descriptor is designed for
  3. Discriminative power: Ability to distinguish between different image classes
  4. Computational efficiency: Fast extraction and comparison
  5. Compact representation: Small memory footprint
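
The examples throughout this guide rely on the following Python libraries:
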
import cv2
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from skimage import feature
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms

Traditional Global Descriptors

1. Color Histograms

Color histograms are the simplest global descriptors, summarizing the distribution of colors across the entire image:

class ColorHistogramDescriptor:
    """Extract color histogram features from images"""
    
    def __init__(self, bins=(8, 8, 8), color_space='RGB'):
        self.bins = bins
        self.color_space = color_space
    
    def extract(self, image):
        """Extract color histogram descriptor"""
        # Convert color space if needed (input is assumed to be RGB)
        if self.color_space == 'HSV':
            image = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
            ranges = [0, 180, 0, 256, 0, 256]  # OpenCV 8-bit hue spans [0, 180)
        elif self.color_space == 'LAB':
            image = cv2.cvtColor(image, cv2.COLOR_RGB2LAB)
            ranges = [0, 256, 0, 256, 0, 256]
        else:
            ranges = [0, 256, 0, 256, 0, 256]
        
        # Joint 3D histogram over all three channels
        hist = cv2.calcHist([image], [0, 1, 2], None, self.bins, ranges)
        
        # Normalize and flatten to a 1D feature vector
        hist = cv2.normalize(hist, hist).flatten()
        
        return hist
    
    def compare(self, hist1, hist2, method='correlation'):
        """Compare two histograms"""
        methods = {
            'correlation': cv2.HISTCMP_CORREL,
            'chi_square': cv2.HISTCMP_CHISQR,
            'intersection': cv2.HISTCMP_INTERSECT,
            'bhattacharyya': cv2.HISTCMP_BHATTACHARYYA
        }
        
        return cv2.compareHist(hist1, hist2, methods[method])

# Usage example
color_desc = ColorHistogramDescriptor(bins=(16, 16, 16), color_space='HSV')

# Extract features from image
image = cv2.imread('sample_image.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
hist_features = color_desc.extract(image)

print(f"Color histogram shape: {hist_features.shape}")

2. Histogram of Oriented Gradients (HOG)

HOG captures shape and structure through gradient orientations:

class HOGDescriptor:
    """Extract HOG (Histogram of Oriented Gradients) features"""
    
    def __init__(self, orientations=9, pixels_per_cell=(8, 8), 
                 cells_per_block=(2, 2), block_norm='L2-Hys'):
        self.orientations = orientations
        self.pixels_per_cell = pixels_per_cell
        self.cells_per_block = cells_per_block
        self.block_norm = block_norm
    
    def extract(self, image):
        """Extract HOG descriptor from image"""
        # Convert to grayscale if needed
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        else:
            gray = image
        
        # Extract HOG features
        features = feature.hog(
            gray,
            orientations=self.orientations,
            pixels_per_cell=self.pixels_per_cell,
            cells_per_block=self.cells_per_block,
            block_norm=self.block_norm,
            visualize=False,
            feature_vector=True
        )
        
        return features
    
    def extract_with_visualization(self, image):
        """Extract HOG features with visualization"""
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        else:
            gray = image
        
        features, hog_image = feature.hog(
            gray,
            orientations=self.orientations,
            pixels_per_cell=self.pixels_per_cell,
            cells_per_block=self.cells_per_block,
            block_norm=self.block_norm,
            visualize=True,
            feature_vector=True
        )
        
        return features, hog_image

# Visualize HOG features
def visualize_hog(image_path):
    """Visualize HOG features"""
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    hog_desc = HOGDescriptor()
    features, hog_image = hog_desc.extract_with_visualization(image)
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    
    ax1.imshow(image)
    ax1.set_title('Original Image')
    ax1.axis('off')
    
    ax2.imshow(hog_image, cmap='gray')
    ax2.set_title('HOG Features')
    ax2.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print(f"HOG feature vector length: {len(features)}")
    return features
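
The HOG vector length follows directly from the parameters. A quick sketch of the standard calculation, assuming a hypothetical 128×128 input with the defaults above:

# HOG dimensionality for a 128x128 image with default parameters
h = w = 128
cell, block, n_orient = 8, 2, 9  # pixels_per_cell, cells_per_block, orientations

cells_y, cells_x = h // cell, w // cell                        # 16 x 16 cells
blocks_y, blocks_x = cells_y - block + 1, cells_x - block + 1  # 15 x 15 overlapping blocks
hog_length = blocks_y * blocks_x * block * block * n_orient
print(hog_length)  # 15 * 15 * 4 * 9 = 8100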

3. Local Binary Patterns (LBP)

LBP encodes local texture: each pixel's neighbours are thresholded against the centre pixel, and the resulting binary codes are summarized in a histogram:

class LBPDescriptor:
    """Extract Local Binary Pattern features"""
    
    def __init__(self, radius=3, n_points=None, method='uniform'):
        self.radius = radius
        self.n_points = n_points if n_points else 8 * radius
        self.method = method
    
    def extract(self, image):
        """Extract LBP descriptor"""
        # Convert to grayscale
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        else:
            gray = image
        
        # Calculate LBP
        lbp = feature.local_binary_pattern(
            gray, self.n_points, self.radius, method=self.method
        )
        
        # Calculate histogram of LBP values
        if self.method == 'uniform':
            n_bins = self.n_points + 2
        else:
            n_bins = 2 ** self.n_points
        
        hist, _ = np.histogram(lbp.ravel(), bins=n_bins, 
                             range=(0, n_bins), density=True)
        
        return hist
    
    def extract_with_visualization(self, image):
        """Extract LBP with visualization"""
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        else:
            gray = image
        
        lbp = feature.local_binary_pattern(
            gray, self.n_points, self.radius, method=self.method
        )
        
        # Calculate histogram
        if self.method == 'uniform':
            n_bins = self.n_points + 2
        else:
            n_bins = 2 ** self.n_points
        
        hist, _ = np.histogram(lbp.ravel(), bins=n_bins, 
                             range=(0, n_bins), density=True)
        
        return hist, lbp

# Visualize LBP
def visualize_lbp(image_path):
    """Visualize LBP features"""
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    lbp_desc = LBPDescriptor(radius=3, method='uniform')
    hist, lbp_image = lbp_desc.extract_with_visualization(image)
    
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
    
    ax1.imshow(image)
    ax1.set_title('Original Image')
    ax1.axis('off')
    
    ax2.imshow(lbp_image, cmap='gray')
    ax2.set_title('LBP Image')
    ax2.axis('off')
    
    ax3.plot(hist)
    ax3.set_title('LBP Histogram')
    ax3.set_xlabel('LBP Value')
    ax3.set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()
    
    return hist
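
Because the LBP descriptor is a normalized histogram, two textures can be compared with a simple histogram distance. A minimal sketch using a hand-rolled chi-square distance (the file names are placeholders):

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms (smaller = more similar)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

# Placeholder texture images
tex_a = cv2.cvtColor(cv2.imread('texture_a.jpg'), cv2.COLOR_BGR2RGB)
tex_b = cv2.cvtColor(cv2.imread('texture_b.jpg'), cv2.COLOR_BGR2RGB)

lbp_desc = LBPDescriptor(radius=3, method='uniform')
dist = chi_square_distance(lbp_desc.extract(tex_a), lbp_desc.extract(tex_b))
print(f"Chi-square distance: {dist:.4f}")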

4. Haralick Texture Features

Statistical texture features from gray-level co-occurrence matrices:

class HaralickDescriptor:
    """Extract Haralick texture features"""
    
    def __init__(self, distances=[1], angles=[0, 45, 90, 135], 
                 levels=256, symmetric=True, normed=True):
        self.distances = distances
        self.angles = np.radians(angles)
        self.levels = levels
        self.symmetric = symmetric
        self.normed = normed
    
    def extract(self, image):
        """Extract Haralick features"""
        # Convert to grayscale and reduce levels
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        else:
            gray = image
        
        # Reduce gray levels for GLCM computation
        gray = (gray // (256 // self.levels)).astype(np.uint8)
        
        # Calculate Haralick features
        features = []
        
        for distance in self.distances:
            for angle in self.angles:
                # Calculate GLCM (skimage >= 0.19 names these graycomatrix/graycoprops)
                glcm = feature.graycomatrix(
                    gray, [distance], [angle], 
                    levels=self.levels,
                    symmetric=self.symmetric,
                    normed=self.normed
                )
                
                # Extract Haralick properties
                contrast = feature.graycoprops(glcm, 'contrast')[0, 0]
                dissimilarity = feature.graycoprops(glcm, 'dissimilarity')[0, 0]
                homogeneity = feature.graycoprops(glcm, 'homogeneity')[0, 0]
                energy = feature.graycoprops(glcm, 'energy')[0, 0]
                correlation = feature.graycoprops(glcm, 'correlation')[0, 0]
                
                features.extend([contrast, dissimilarity, homogeneity, 
                               energy, correlation])
        
        return np.array(features)

# Example usage
haralick_desc = HaralickDescriptor(distances=[1, 2], angles=[0, 45, 90, 135])
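
With two distances and four angles, the descriptor is 2 × 4 × 5 = 40-dimensional (five properties per GLCM). A quick sketch, reusing the earlier placeholder image path:

image = cv2.imread('sample_image.jpg')  # placeholder path
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
haralick_features = haralick_desc.extract(image)
print(f"Haralick feature length: {haralick_features.shape[0]}")  # 40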

Modern Deep Learning Descriptors

1. CNN Feature Extraction

Using pre-trained CNN models for global features:

class CNNDescriptor:
    """Extract CNN-based global descriptors"""
    
    def __init__(self, model_name='resnet50', layer='avgpool', 
                 pretrained=True, device='cuda'):
        self.device = device
        self.model_name = model_name
        self.layer = layer
        
        # Load pre-trained model
        if model_name == 'resnet50':
            self.model = models.resnet50(pretrained=pretrained)
        elif model_name == 'vgg16':
            self.model = models.vgg16(pretrained=pretrained)
        elif model_name == 'densenet121':
            self.model = models.densenet121(pretrained=pretrained)
        elif model_name == 'efficientnet_b0':
            self.model = models.efficientnet_b0(pretrained=pretrained)
        else:
            raise ValueError(f"Unsupported model: {model_name}")
        
        self.model = self.model.to(device)
        self.model.eval()
        
        # Set up feature extraction
        self.features = None
        self._register_hook()
        
        # Image preprocessing
        self.preprocess = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                               std=[0.229, 0.224, 0.225])
        ])
    
    def _register_hook(self):
        """Register forward hook to extract features"""
        def hook(module, input, output):
            self.features = output.detach()
        
        # Find the target layer
        for name, module in self.model.named_modules():
            if name == self.layer:
                module.register_forward_hook(hook)
                break
    
    def extract(self, image):
        """Extract CNN features from image"""
        # Preprocess image
        if isinstance(image, np.ndarray):
            if len(image.shape) == 3 and image.shape[2] == 3:
                # RGB image
                input_tensor = self.preprocess(image)
            else:
                # Grayscale - convert to RGB
                image_rgb = np.stack([image] * 3, axis=-1)
                input_tensor = self.preprocess(image_rgb)
        else:
            input_tensor = self.preprocess(image)
        
        input_batch = input_tensor.unsqueeze(0).to(self.device)
        
        # Extract features
        with torch.no_grad():
            _ = self.model(input_batch)
        
        # Global average pooling if needed
        features = self.features.squeeze()
        if len(features.shape) > 1:
            features = torch.mean(features.view(features.size(0), -1), dim=1)
        
        return features.cpu().numpy()
    
    def extract_batch(self, images):
        """Extract features from batch of images"""
        # Preprocess each image and stack into a single batch tensor
        batch_tensors = [self.preprocess(image) for image in images]
        input_batch = torch.stack(batch_tensors).to(self.device)
        
        # Extract features
        with torch.no_grad():
            _ = self.model(input_batch)
        
        # Process features
        features = self.features
        if len(features.shape) > 2:
            features = torch.mean(features.view(features.size(0), features.size(1), -1), dim=2)
        
        return features.cpu().numpy()

# Example usage (fall back to CPU when no GPU is available)
cnn_desc = CNNDescriptor(model_name='resnet50', layer='avgpool',
                         device='cuda' if torch.cuda.is_available() else 'cpu')
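
Deep features from the same network can be compared directly. A minimal sketch using cosine similarity (the image paths are placeholders):

from sklearn.metrics.pairwise import cosine_similarity

img_a = cv2.cvtColor(cv2.imread('image_a.jpg'), cv2.COLOR_BGR2RGB)  # placeholder
img_b = cv2.cvtColor(cv2.imread('image_b.jpg'), cv2.COLOR_BGR2RGB)  # placeholder

feat_a = cnn_desc.extract(img_a)
feat_b = cnn_desc.extract(img_b)

sim = cosine_similarity([feat_a], [feat_b])[0, 0]
print(f"Cosine similarity: {sim:.3f}")  # close to 1.0 for near-duplicate images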

2. Self-Supervised Global Descriptors

Using self-supervised models for feature extraction:

class SelfSupervisedDescriptor:
    """Extract features using self-supervised models"""
    
    def __init__(self, model_type='simclr', checkpoint_path=None, device='cuda'):
        self.device = device
        self.model_type = model_type
        
        if model_type == 'simclr':
            self.model = self._load_simclr_model(checkpoint_path)
        elif model_type == 'swav':
            self.model = self._load_swav_model(checkpoint_path)
        elif model_type == 'dino':
            self.model = self._load_dino_model(checkpoint_path)
        else:
            raise ValueError(f"Unsupported model type: {model_type}")
        
        self.model = self.model.to(device)
        self.model.eval()
        
        # Preprocessing
        self.preprocess = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                               std=[0.229, 0.224, 0.225])
        ])
    
    def _load_simclr_model(self, checkpoint_path):
        """Load SimCLR model"""
        # Simplified SimCLR encoder
        encoder = models.resnet50(pretrained=False)
        encoder.fc = nn.Identity()  # Remove classification head
        
        if checkpoint_path:
            checkpoint = torch.load(checkpoint_path, map_location='cpu')
            encoder.load_state_dict(checkpoint['encoder'])
        
        return encoder
    
    def _load_swav_model(self, checkpoint_path):
        """Load SwAV model"""
        # Load SwAV model (simplified)
        encoder = models.resnet50(pretrained=False)
        encoder.fc = nn.Identity()
        
        if checkpoint_path:
            checkpoint = torch.load(checkpoint_path, map_location='cpu')
            encoder.load_state_dict(checkpoint['encoder'])
        
        return encoder
    
    def _load_dino_model(self, checkpoint_path):
        """Load DINO model"""
        # Simplified: use a supervised timm ViT as a stand-in backbone.
        # For true DINO features, load the official DINO checkpoint instead.
        import timm
        model = timm.create_model('vit_base_patch16_224', pretrained=True)
        model.head = nn.Identity()
        return model
    
    def extract(self, image):
        """Extract features from single image"""
        input_tensor = self.preprocess(image).unsqueeze(0).to(self.device)
        
        with torch.no_grad():
            features = self.model(input_tensor)
        
        return features.squeeze().cpu().numpy()
    
    def extract_batch(self, images):
        """Extract features from batch of images"""
        batch_tensors = []
        for image in images:
            tensor = self.preprocess(image)
            batch_tensors.append(tensor)
        
        input_batch = torch.stack(batch_tensors).to(self.device)
        
        with torch.no_grad():
            features = self.model(input_batch)
        
        return features.cpu().numpy()
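
A minimal usage sketch; without a checkpoint path, the 'dino' option above falls back to the simplified stand-in weights:

ssl_desc = SelfSupervisedDescriptor(
    model_type='dino',
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

image = cv2.cvtColor(cv2.imread('sample_image.jpg'), cv2.COLOR_BGR2RGB)  # placeholder path
ssl_features = ssl_desc.extract(image)
print(f"Self-supervised feature shape: {ssl_features.shape}")  # (768,) for ViT-Base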

3. Vision Transformer Global Features

Using Vision Transformers for global descriptors:

class ViTDescriptor:
    """Extract global features using Vision Transformers"""
    
    def __init__(self, model_name='vit_base_patch16_224', 
                 pretrained=True, device='cuda'):
        self.device = device
        
        import timm
        self.model = timm.create_model(model_name, pretrained=pretrained)
        
        # Remove classification head to get features
        if hasattr(self.model, 'head'):
            self.model.head = nn.Identity()
        elif hasattr(self.model, 'classifier'):
            self.model.classifier = nn.Identity()
        
        self.model = self.model.to(device)
        self.model.eval()
        
        # Get input size
        self.input_size = self.model.default_cfg['input_size'][1]
        
        self.preprocess = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize(self.input_size),
            transforms.CenterCrop(self.input_size),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=self.model.default_cfg['mean'],
                std=self.model.default_cfg['std']
            )
        ])
    
    def extract(self, image):
        """Extract ViT features"""
        input_tensor = self.preprocess(image).unsqueeze(0).to(self.device)
        
        with torch.no_grad():
            features = self.model(input_tensor)
        
        return features.squeeze().cpu().numpy()
    
    def extract_patch_features(self, image):
        """Extract patch-level features (not just CLS token)"""
        input_tensor = self.preprocess(image).unsqueeze(0).to(self.device)
        
        with torch.no_grad():
            # Get patch embeddings before final pooling, following timm's
            # VisionTransformer internals (method names may vary by version)
            x = self.model.patch_embed(input_tensor)
            x = self.model._pos_embed(x)
            x = self.model.norm_pre(x)
            
            for block in self.model.blocks:
                x = block(x)
            
            x = self.model.norm(x)
        
        # Return both CLS token and patch tokens
        cls_token = x[:, 0]  # CLS token
        patch_tokens = x[:, 1:]  # Patch tokens
        
        return cls_token.cpu().numpy(), patch_tokens.cpu().numpy()
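
A sketch of the two granularities, assuming the default vit_base_patch16_224 (a 224×224 input yields a 14×14 grid of 196 patches):

vit_desc = ViTDescriptor(device='cuda' if torch.cuda.is_available() else 'cpu')

image = cv2.cvtColor(cv2.imread('sample_image.jpg'), cv2.COLOR_BGR2RGB)  # placeholder path
cls_feat, patch_feats = vit_desc.extract_patch_features(image)

print(f"CLS token shape:    {cls_feat.shape}")     # (1, 768) - one global vector
print(f"Patch tokens shape: {patch_feats.shape}")  # (1, 196, 768) - per-patch features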

Descriptor Comparison and Evaluation

Performance Evaluation Framework

class DescriptorEvaluator:
    """Evaluate and compare different image descriptors"""
    
    def __init__(self, descriptors, dataset_path):
        self.descriptors = descriptors
        self.dataset_path = dataset_path
        self.features_cache = {}
        
    def extract_features(self, descriptor_name, image_paths):
        """Extract features for all images using specified descriptor"""
        if descriptor_name in self.features_cache:
            return self.features_cache[descriptor_name]
        
        descriptor = self.descriptors[descriptor_name]
        features = []
        
        print(f"Extracting {descriptor_name} features...")
        for i, img_path in enumerate(image_paths):
            if i % 100 == 0:
                print(f"Processing {i}/{len(image_paths)}")
            
            image = cv2.imread(img_path)
            if image is None:
                continue
                
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            
            try:
                feat = descriptor.extract(image)  # 'feat' avoids shadowing skimage's feature module
                features.append(feat)
            except Exception as e:
                print(f"Error processing {img_path}: {e}")
                continue
        
        features = np.array(features)
        self.features_cache[descriptor_name] = features
        
        return features
    
    def evaluate_retrieval(self, descriptor_name, query_features, 
                          gallery_features, query_labels, gallery_labels, k=10):
        """Evaluate image retrieval performance"""
        from sklearn.metrics.pairwise import cosine_similarity
        
        # Calculate similarities
        similarities = cosine_similarity(query_features, gallery_features)
        
        # Calculate metrics (cast labels to arrays so boolean indexing works)
        query_labels = np.asarray(query_labels)
        gallery_labels = np.asarray(gallery_labels)
        precisions = []
        recalls = []
        
        for i, query_label in enumerate(query_labels):
            # Get top-k similar images
            sim_scores = similarities[i]
            top_k_indices = np.argsort(sim_scores)[::-1][:k]
            
            # Calculate precision and recall
            retrieved_labels = gallery_labels[top_k_indices]
            relevant_retrieved = np.sum(retrieved_labels == query_label)
            total_relevant = np.sum(gallery_labels == query_label)
            
            precision = relevant_retrieved / k if k > 0 else 0
            recall = relevant_retrieved / total_relevant if total_relevant > 0 else 0
            
            precisions.append(precision)
            recalls.append(recall)
        
        return {
            'precision@k': np.mean(precisions),
            'recall@k': np.mean(recalls),
            'descriptor': descriptor_name
        }
    
    def evaluate_classification(self, descriptor_name, features, labels, test_size=0.3):
        """Evaluate classification performance"""
        from sklearn.model_selection import train_test_split
        from sklearn.svm import SVC
        from sklearn.metrics import accuracy_score, classification_report
        
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            features, labels, test_size=test_size, random_state=42, stratify=labels
        )
        
        # Normalize features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Train classifier
        classifier = SVC(kernel='rbf', random_state=42)
        classifier.fit(X_train_scaled, y_train)
        
        # Predict
        y_pred = classifier.predict(X_test_scaled)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        
        return {
            'accuracy': accuracy,
            'descriptor': descriptor_name,
            'report': classification_report(y_test, y_pred, output_dict=True)
        }
    
    def compare_descriptors(self, image_paths, labels, evaluation_type='classification'):
        """Compare all descriptors"""
        results = {}
        
        for desc_name in self.descriptors.keys():
            print(f"\nEvaluating {desc_name}...")
            
            # Extract features
            features = self.extract_features(desc_name, image_paths)
            
            if evaluation_type == 'classification':
                result = self.evaluate_classification(desc_name, features, labels)
            elif evaluation_type == 'retrieval':
                # For retrieval, split into query and gallery
                split_idx = len(features) // 2
                query_features = features[:split_idx]
                gallery_features = features[split_idx:]
                query_labels = labels[:split_idx]
                gallery_labels = labels[split_idx:]
                
                result = self.evaluate_retrieval(
                    desc_name, query_features, gallery_features,
                    query_labels, gallery_labels
                )
            
            results[desc_name] = result
        
        return results
    
    def visualize_results(self, results, metric='accuracy'):
        """Visualize comparison results"""
        descriptors = list(results.keys())
        scores = [results[desc][metric] for desc in descriptors]
        
        plt.figure(figsize=(12, 6))
        bars = plt.bar(descriptors, scores)
        plt.title(f'Descriptor Comparison - {metric.title()}')
        plt.ylabel(metric.title())
        plt.xlabel('Descriptor')
        plt.xticks(rotation=45)
        
        # Add value labels on bars
        for bar, score in zip(bars, scores):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                    f'{score:.3f}', ha='center', va='bottom')
        
        plt.tight_layout()
        plt.show()
        
        return plt.gcf()
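
Wiring it together, a minimal sketch comparing a traditional and a deep descriptor (the paths and labels below are placeholders, and images should be resized to a common resolution first so HOG vectors have matching lengths):

descriptors = {
    'color_hist': ColorHistogramDescriptor(bins=(8, 8, 8), color_space='HSV'),
    'hog': HOGDescriptor(),
    'resnet50': CNNDescriptor(model_name='resnet50', layer='avgpool',
                              device='cuda' if torch.cuda.is_available() else 'cpu'),
}

image_paths = ['path/to/dataset/img_001.jpg', 'path/to/dataset/img_002.jpg']  # placeholders
labels = np.array([0, 1])                                                     # placeholders

evaluator = DescriptorEvaluator(descriptors, dataset_path='path/to/dataset')
results = evaluator.compare_descriptors(image_paths, labels, evaluation_type='classification')
evaluator.visualize_results(results, metric='accuracy')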

Advanced Descriptor Techniques

1. Bag of Visual Words (BoVW)

BoVW quantizes local descriptors (e.g., SIFT) against a learned visual vocabulary and represents an image as a histogram of word occurrences:

class BagOfVisualWords:
    """Bag of Visual Words descriptor using local features"""
    
    def __init__(self, vocab_size=500, detector_type='sift'):
        self.vocab_size = vocab_size
        self.detector_type = detector_type
        self.vocabulary = None
        self.kmeans = None
        
        # Initialize feature detector
        if detector_type == 'sift':
            self.detector = cv2.SIFT_create()
        elif detector_type == 'orb':
            self.detector = cv2.ORB_create()
        else:
            raise ValueError(f"Unsupported detector: {detector_type}")
    
    def build_vocabulary(self, image_paths, max_images=1000):
        """Build visual vocabulary from training images"""
        print("Building visual vocabulary...")
        
        all_descriptors = []
        
        # Extract descriptors from training images
        for i, img_path in enumerate(image_paths[:max_images]):
            if i % 100 == 0:
                print(f"Processing {i}/{min(len(image_paths), max_images)}")
            
            image = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            if image is None:
                continue
            
            # Detect keypoints and extract descriptors
            keypoints, descriptors = self.detector.detectAndCompute(image, None)
            
            if descriptors is not None:
                all_descriptors.append(descriptors)
        
        # Combine all descriptors
        all_descriptors = np.vstack(all_descriptors)
        
        # Cluster descriptors to create vocabulary
        print(f"Clustering {len(all_descriptors)} descriptors into {self.vocab_size} words...")
        self.kmeans = KMeans(n_clusters=self.vocab_size, random_state=42, n_init=10)
        self.kmeans.fit(all_descriptors)
        self.vocabulary = self.kmeans.cluster_centers_
        
        print("Vocabulary built successfully!")
    
    def extract(self, image):
        """Extract BoVW descriptor from image"""
        if self.vocabulary is None:
            raise ValueError("Vocabulary not built. Call build_vocabulary() first.")
        
        # Convert to grayscale if needed
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        else:
            gray = image
        
        # Detect keypoints and extract descriptors
        keypoints, descriptors = self.detector.detectAndCompute(gray, None)
        
        if descriptors is None:
            return np.zeros(self.vocab_size)
        
        # Assign descriptors to vocabulary words
        words = self.kmeans.predict(descriptors)
        
        # Create histogram of word occurrences
        hist, _ = np.histogram(words, bins=self.vocab_size, range=(0, self.vocab_size))
        
        # Normalize histogram
        hist = hist.astype(np.float32)
        if np.sum(hist) > 0:
            hist = hist / np.sum(hist)
        
        return hist

# Usage example
bovw = BagOfVisualWords(vocab_size=500, detector_type='sift')
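
The vocabulary must be built once on a training set before extraction. A minimal sketch; the paths are placeholders, and in practice the vocabulary needs many training images (K-means requires at least vocab_size descriptors):

train_paths = ['path/to/train/img_001.jpg', 'path/to/train/img_002.jpg']  # placeholders
bovw.build_vocabulary(train_paths, max_images=1000)

image = cv2.cvtColor(cv2.imread('sample_image.jpg'), cv2.COLOR_BGR2RGB)
bovw_hist = bovw.extract(image)
print(f"BoVW descriptor shape: {bovw_hist.shape}")  # (500,)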

2. Fisher Vector Encoding

Fisher Vectors encode how an image's local descriptors deviate from a Gaussian Mixture Model fitted on training data, capturing richer first- and second-order statistics than BoVW:

class FisherVectorDescriptor:
    """Fisher Vector encoding of local features"""
    
    def __init__(self, n_components=64, detector_type='sift'):
        self.n_components = n_components
        self.detector_type = detector_type
        self.gmm = None
        
        # Initialize feature detector
        if detector_type == 'sift':
            self.detector = cv2.SIFT_create()
        elif detector_type == 'surf':
            self.detector = cv2.xfeatures2d.SURF_create()
        else:
            raise ValueError(f"Unsupported detector: {detector_type}")
    
    def fit_gmm(self, image_paths, max_images=1000):
        """Fit Gaussian Mixture Model on training descriptors"""
        from sklearn.mixture import GaussianMixture
        
        print("Fitting GMM for Fisher Vector...")
        
        all_descriptors = []
        
        # Extract descriptors from training images
        for i, img_path in enumerate(image_paths[:max_images]):
            if i % 100 == 0:
                print(f"Processing {i}/{min(len(image_paths), max_images)}")
            
            image = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            if image is None:
                continue
            
            keypoints, descriptors = self.detector.detectAndCompute(image, None)
            
            if descriptors is not None:
                all_descriptors.append(descriptors)
        
        # Combine all descriptors
        all_descriptors = np.vstack(all_descriptors)
        
        # Fit GMM with diagonal covariances, matching the elementwise
        # variance terms used in _compute_fisher_vector below
        print(f"Fitting GMM with {self.n_components} components...")
        self.gmm = GaussianMixture(n_components=self.n_components,
                                   covariance_type='diag', random_state=42)
        self.gmm.fit(all_descriptors)
        
        print("GMM fitted successfully!")
    
    def extract(self, image):
        """Extract Fisher Vector from image"""
        if self.gmm is None:
            raise ValueError("GMM not fitted. Call fit_gmm() first.")
        
        # Convert to grayscale if needed
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        else:
            gray = image
        
        # Extract local descriptors
        keypoints, descriptors = self.detector.detectAndCompute(gray, None)
        
        if descriptors is None:
            # No keypoints detected: return a zero vector of the correct size
            d = self.gmm.means_.shape[1]
            return np.zeros(2 * self.n_components * d)
        
        # Compute Fisher Vector
        fisher_vector = self._compute_fisher_vector(descriptors)
        
        return fisher_vector
    
    def _compute_fisher_vector(self, descriptors):
        """Compute Fisher Vector encoding"""
        # Get GMM parameters
        means = self.gmm.means_
        covariances = self.gmm.covariances_
        weights = self.gmm.weights_
        
        # Compute soft assignments
        posteriors = self.gmm.predict_proba(descriptors)
        
        # Initialize Fisher Vector
        d = descriptors.shape[1]
        fv = np.zeros(2 * self.n_components * d)
        
        # Compute Fisher Vector components
        for k in range(self.n_components):
            # Deviation from mean
            diff = descriptors - means[k]
            
            # Weighted deviations
            weighted_diff = posteriors[:, k:k+1] * diff
            
            # First order statistics (gradient w.r.t. mean)
            first_order = np.sum(weighted_diff, axis=0) / np.sqrt(weights[k])
            
            # Second order statistics (gradient w.r.t. variance)
            second_order = np.sum(posteriors[:, k:k+1] * 
                                (diff**2 / covariances[k] - 1), axis=0) / np.sqrt(2 * weights[k])
            
            # Store in Fisher Vector
            fv[k*d:(k+1)*d] = first_order
            fv[(self.n_components + k)*d:(self.n_components + k + 1)*d] = second_order
        
        # Power normalization
        fv = np.sign(fv) * np.sqrt(np.abs(fv))
        
        # L2 normalization
        fv = fv / np.linalg.norm(fv)
        
        return fv
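
Usage mirrors BoVW: fit the GMM on a training set, then encode. For K components and D-dimensional descriptors, the Fisher Vector has 2·K·D dimensions (2 × 64 × 128 = 16,384 for SIFT). A sketch with placeholder paths:

fisher_desc = FisherVectorDescriptor(n_components=64, detector_type='sift')

train_paths = ['path/to/train/img_001.jpg', 'path/to/train/img_002.jpg']  # placeholders
fisher_desc.fit_gmm(train_paths)

image = cv2.cvtColor(cv2.imread('sample_image.jpg'), cv2.COLOR_BGR2RGB)
fv = fisher_desc.extract(image)
print(f"Fisher Vector shape: {fv.shape}")  # (16384,) = 2 * 64 * 128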

3. VLAD (Vector of Locally Aggregated Descriptors)

VLAD accumulates, for each visual word, the residuals between assigned local descriptors and the word's centroid, then concatenates and normalizes them:

class VLADDescriptor:
    """VLAD (Vector of Locally Aggregated Descriptors) encoding"""
    
    def __init__(self, n_clusters=64, detector_type='sift'):
        self.n_clusters = n_clusters
        self.detector_type = detector_type
        self.kmeans = None
        
        # Initialize feature detector
        if detector_type == 'sift':
            self.detector = cv2.SIFT_create()
        elif detector_type == 'orb':
            self.detector = cv2.ORB_create()
        else:
            raise ValueError(f"Unsupported detector: {detector_type}")
    
    def build_codebook(self, image_paths, max_images=1000):
        """Build visual codebook for VLAD"""
        print("Building VLAD codebook...")
        
        all_descriptors = []
        
        # Extract descriptors from training images
        for i, img_path in enumerate(image_paths[:max_images]):
            if i % 100 == 0:
                print(f"Processing {i}/{min(len(image_paths), max_images)}")
            
            image = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            if image is None:
                continue
            
            keypoints, descriptors = self.detector.detectAndCompute(image, None)
            
            if descriptors is not None:
                all_descriptors.append(descriptors)
        
        # Combine all descriptors
        all_descriptors = np.vstack(all_descriptors)
        
        # Build codebook using K-means
        print(f"Clustering into {self.n_clusters} visual words...")
        self.kmeans = KMeans(n_clusters=self.n_clusters, random_state=42, n_init=10)
        self.kmeans.fit(all_descriptors)
        
        print("Codebook built successfully!")
    
    def extract(self, image):
        """Extract VLAD descriptor from image"""
        if self.kmeans is None:
            raise ValueError("Codebook not built. Call build_codebook() first.")
        
        # Convert to grayscale if needed
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        else:
            gray = image
        
        # Extract local descriptors
        keypoints, descriptors = self.detector.detectAndCompute(gray, None)
        
        if descriptors is None:
            # No keypoints detected: return a zero vector of the correct size
            d = self.kmeans.cluster_centers_.shape[1]
            return np.zeros(self.n_clusters * d)
        
        # Compute VLAD encoding
        vlad_vector = self._compute_vlad(descriptors)
        
        return vlad_vector
    
    def _compute_vlad(self, descriptors):
        """Compute VLAD encoding"""
        # Assign descriptors to nearest cluster centers
        cluster_assignments = self.kmeans.predict(descriptors)
        
        # Get cluster centers
        centers = self.kmeans.cluster_centers_
        
        # Initialize VLAD vector
        vlad = np.zeros((self.n_clusters, descriptors.shape[1]))
        
        # Accumulate residuals for each cluster
        for i in range(self.n_clusters):
            # Find descriptors assigned to cluster i
            cluster_mask = cluster_assignments == i
            
            if np.any(cluster_mask):
                # Compute residuals (descriptor - cluster center)
                residuals = descriptors[cluster_mask] - centers[i]
                # Sum residuals
                vlad[i] = np.sum(residuals, axis=0)
        
        # Flatten VLAD vector
        vlad = vlad.flatten()
        
        # Power normalization
        vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
        
        # L2 normalization
        vlad = vlad / np.linalg.norm(vlad)
        
        return vlad
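
VLAD keeps only the first-order residuals, giving a K·D vector (64 × 128 = 8,192 for SIFT), a useful middle ground between BoVW and Fisher Vectors. A sketch with placeholder paths:

vlad_desc = VLADDescriptor(n_clusters=64, detector_type='sift')

train_paths = ['path/to/train/img_001.jpg', 'path/to/train/img_002.jpg']  # placeholders
vlad_desc.build_codebook(train_paths)

image = cv2.cvtColor(cv2.imread('sample_image.jpg'), cv2.COLOR_BGR2RGB)
vlad = vlad_desc.extract(image)
print(f"VLAD shape: {vlad.shape}")  # (8192,) = 64 * 128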

Practical Applications

Image Retrieval System

class ImageRetrievalSystem:
    """Complete image retrieval system using global descriptors"""
    
    def __init__(self, descriptor, similarity_metric='cosine'):
        self.descriptor = descriptor
        self.similarity_metric = similarity_metric
        self.database_features = None
        self.database_paths = None
        
    def build_database(self, image_paths):
        """Build feature database"""
        print("Building image database...")
        
        features = []
        valid_paths = []
        
        for i, img_path in enumerate(image_paths):
            if i % 100 == 0:
                print(f"Processing {i}/{len(image_paths)}")
            
            try:
                image = cv2.imread(img_path)
                if image is None:
                    continue
                
                image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                feature = self.descriptor.extract(image)
                
                features.append(feature)
                valid_paths.append(img_path)
                
            except Exception as e:
                print(f"Error processing {img_path}: {e}")
                continue
        
        self.database_features = np.array(features)
        self.database_paths = valid_paths
        
        print(f"Database built with {len(valid_paths)} images")
    
    def search(self, query_image_path, top_k=10):
        """Search for similar images"""
        if self.database_features is None:
            raise ValueError("Database not built. Call build_database() first.")
        
        # Extract query features
        query_image = cv2.imread(query_image_path)
        query_image = cv2.cvtColor(query_image, cv2.COLOR_BGR2RGB)
        query_features = self.descriptor.extract(query_image)
        
        # Compute similarities
        if self.similarity_metric == 'cosine':
            from sklearn.metrics.pairwise import cosine_similarity
            similarities = cosine_similarity([query_features], self.database_features)[0]
        elif self.similarity_metric == 'euclidean':
            from sklearn.metrics.pairwise import euclidean_distances
            distances = euclidean_distances([query_features], self.database_features)[0]
            similarities = 1 / (1 + distances)  # Convert to similarity
        
        # Get top-k results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            results.append({
                'path': self.database_paths[idx],
                'similarity': similarities[idx]
            })
        
        return results
    
    def visualize_results(self, query_path, results, max_display=5):
        """Visualize search results"""
        fig, axes = plt.subplots(1, min(len(results) + 1, max_display + 1), 
                                figsize=(20, 4))
        
        # Display query image
        query_img = cv2.imread(query_path)
        query_img = cv2.cvtColor(query_img, cv2.COLOR_BGR2RGB)
        axes[0].imshow(query_img)
        axes[0].set_title('Query')
        axes[0].axis('off')
        
        # Display top results
        for i, result in enumerate(results[:max_display]):
            if i + 1 >= len(axes):
                break
            
            result_img = cv2.imread(result['path'])
            result_img = cv2.cvtColor(result_img, cv2.COLOR_BGR2RGB)
            axes[i + 1].imshow(result_img)
            axes[i + 1].set_title(f'Rank {i+1}\nSim: {result["similarity"]:.3f}')
            axes[i + 1].axis('off')
        
        plt.tight_layout()
        plt.show()

# Example usage
def example_retrieval_system():
    """Example of using the image retrieval system"""
    
    # Initialize descriptor
    descriptor = CNNDescriptor(model_name='resnet50', layer='avgpool')
    
    # Create retrieval system
    retrieval_system = ImageRetrievalSystem(descriptor, similarity_metric='cosine')
    
    # Build database by globbing an image directory (replace with your own)
    import glob
    database_paths = glob.glob('path/to/database/images/*.jpg')
    retrieval_system.build_database(database_paths)
    
    # Search for similar images
    query_path = 'path/to/query/image.jpg'  # Replace with actual path
    results = retrieval_system.search(query_path, top_k=10)
    
    # Visualize results
    retrieval_system.visualize_results(query_path, results)
    
    return results

Conclusion

Global image descriptors remain fundamental to computer vision, evolving from hand-crafted features to sophisticated deep learning representations. Key insights:

Traditional vs. Modern Descriptors

Traditional strengths:

  • Interpretability: Clear understanding of what features represent
  • Efficiency: Fast extraction and low memory requirements
  • Robustness: Well-understood invariance properties
  • Specialization: Tailored for specific visual properties

Deep learning advantages:

  • Representation power: Learn complex, hierarchical features
  • Transfer learning: Pre-trained models work across domains
  • End-to-end optimization: Features optimized for specific tasks
  • Scalability: Handle large-scale datasets effectively

Best Practices

  1. Task-specific selection: Choose descriptors based on application requirements
  2. Preprocessing importance: Proper image normalization and resizing
  3. Feature normalization: L2 normalization often improves performance (see the sketch after this list)
  4. Dimensionality considerations: Balance between discriminative power and efficiency
  5. Evaluation methodology: Use appropriate metrics for your specific task
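
As a concrete example of point 3, L2-normalizing the rows of a feature matrix makes cosine similarity a plain dot product. A minimal sketch with random stand-in features:

from sklearn.preprocessing import normalize

features = np.random.rand(100, 2048).astype(np.float32)  # e.g., 100 ResNet-50 vectors
features_l2 = normalize(features, norm='l2')              # each row now has unit length

# After L2 normalization, pairwise cosine similarity is just a matrix product
similarities = features_l2 @ features_l2.T
print(similarities.shape)  # (100, 100)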

Future Directions

  • Self-supervised learning: More robust features without labeled data
  • Multi-modal descriptors: Combining visual with textual information
  • Efficient architectures: Lightweight models for mobile applications
  • Domain adaptation: Descriptors that transfer across different domains

The choice of global descriptor depends on your specific requirements: computational constraints, accuracy needs, interpretability requirements, and available training data.

References

  • Dalal, N., & Triggs, B. (2005). "Histograms of oriented gradients for human detection."
  • Ojala, T., et al. (2002). "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns."
  • Perronnin, F., & Dance, C. (2007). "Fisher kernels on visual vocabularies for image categorization."
  • Jégou, H., et al. (2010). "Aggregating local descriptors into a compact image representation."
  • Simonyan, K., & Zisserman, A. (2014). "Very deep convolutional networks for large-scale image recognition."