Model Soup: Improving Deep Learning Through Weight Averaging

Model Soup is a simple yet powerful technique that improves deep learning model performance by averaging the weights of multiple models trained on the same task. Unlike traditional ensembling, which requires running every model at inference time, a model soup is a single model with averaged weights: it keeps the inference cost of one model while often outperforming each of its ingredients.

In this comprehensive guide, we'll explore the theory behind model soup, its variants, and practical implementations.

Introduction to Model Soup

The Core Idea

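Model Soup leverages the observation that when multiple models are fine-tuned from the same pretrained initialization with different hyperparameters or random seeds, their weights often end up in the same low-loss region of the loss landscape. Averaging these weights can then yield a point that generalizes better than any individual model. Models trained from independent random initializations usually do not share such a region, so averaging them directly tends to hurt accuracy.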
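
In symbols, the idea can be sketched as follows. The notation is introduced here for illustration and is not from the original text: $\theta_k$ are the weights of ingredient $k$, $K$ is the number of ingredients, $w_k$ are optional per-model weights, and $f(x;\theta)$ is the network output.

```latex
\theta_{\text{soup}} = \sum_{k=1}^{K} \alpha_k \,\theta_k,
\qquad
\alpha_k = \frac{w_k}{\sum_{j=1}^{K} w_j}
\quad\left(\alpha_k = \tfrac{1}{K} \text{ for a uniform soup}\right)

f_{\text{soup}}(x) = f\!\left(x;\ \theta_{\text{soup}}\right)
\qquad\text{vs.}\qquad
f_{\text{ens}}(x) = \sum_{k=1}^{K} \alpha_k \, f\!\left(x;\ \theta_k\right)
```

The soup needs one forward pass per input, while the ensemble needs $K$.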

Key Benefits

Improved accuracy: Often outperforms the best individual model and can approach the accuracy of a traditional ensemble
No inference overhead: Single model with same computational cost
Simple implementation: Just weight averaging - no complex training procedures
Robust performance: Reduces variance across different training runs
Better calibration: Often produces more reliable confidence estimates

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import numpy as np
import matplotlib.pyplot as plt
import copy
from collections import OrderedDict
from typing import List, Dict, Optional, Tuple
import os
import json

Basic Model Soup Implementation

Simple Weight Averaging

class ModelSoup:
    """Basic Model Soup implementation"""
    
    def __init__(self, model_class, model_kwargs=None):
        self.model_class = model_class
        self.model_kwargs = model_kwargs or {}
        self.ingredient_models = []
        self.soup_model = None
    
    def add_model(self, model_path_or_state_dict, weight=1.0):
        """Add a model to the soup ingredients"""
        if isinstance(model_path_or_state_dict, str):
            # Load model from path
            state_dict = torch.load(model_path_or_state_dict, map_location='cpu')
        else:
            # Use provided state dict
            state_dict = model_path_or_state_dict
        
        self.ingredient_models.append({
            'state_dict': state_dict,
            'weight': weight
        })
    
    def create_soup(self, normalize_weights=True):
        """Create soup by averaging model weights"""
        if not self.ingredient_models:
            raise ValueError("No models added to soup")
        
        # Initialize soup model
        self.soup_model = self.model_class(**self.model_kwargs)
        soup_state_dict = OrderedDict()
        
        # Get weights for normalization
        weights = [model_info['weight'] for model_info in self.ingredient_models]
        if normalize_weights:
            total_weight = sum(weights)
            weights = [w / total_weight for w in weights]
        
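        # Initialize float accumulators. Integer buffers (e.g. BatchNorm's
        # num_batches_tracked) cannot accumulate fractional values in place.
        reference = self.ingredient_models[0]['state_dict']
        for key, param in reference.items():
            soup_state_dict[key] = torch.zeros_like(param, dtype=torch.float32)
        
        # Average weights
        for i, model_info in enumerate(self.ingredient_models):
            state_dict = model_info['state_dict']
            weight = weights[i]
            
            for key, param in state_dict.items():
                soup_state_dict[key] += weight * param.float()
        
        # Cast each entry back to its original dtype
        for key, param in reference.items():
            soup_state_dict[key] = soup_state_dict[key].to(param.dtype)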
        
        # Load averaged weights
        self.soup_model.load_state_dict(soup_state_dict)
        return self.soup_model
    
    def evaluate_ingredients_and_soup(self, test_loader, device='cuda'):
        """Evaluate individual models and soup"""
        results = {}
        
        # Evaluate individual ingredients
        for i, model_info in enumerate(self.ingredient_models):
            model = self.model_class(**self.model_kwargs)
            model.load_state_dict(model_info['state_dict'])
            model.to(device)
            
            accuracy = self._evaluate_model(model, test_loader, device)
            results[f'ingredient_{i}'] = accuracy
            print(f"Ingredient {i} accuracy: {accuracy:.4f}")
        
        # Evaluate soup
        if self.soup_model is not None:
            self.soup_model.to(device)
            soup_accuracy = self._evaluate_model(self.soup_model, test_loader, device)
            results['soup'] = soup_accuracy
            print(f"Soup accuracy: {soup_accuracy:.4f}")
        
        return results
    
    def _evaluate_model(self, model, test_loader, device):
        """Evaluate a single model"""
        model.eval()
        correct = 0
        total = 0
        
        with torch.no_grad():
            for inputs, targets in test_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += targets.size(0)
                correct += (predicted == targets).sum().item()
        
        return correct / total

# Example usage
def basic_soup_example():
    """Basic example of creating and evaluating model soup"""
    
    # Assume we have multiple trained model checkpoints
    model_paths = [
        'model_seed_0.pth',
        'model_seed_1.pth', 
        'model_seed_2.pth',
        'model_seed_3.pth'
    ]
    
    # Create soup
    soup = ModelSoup(models.resnet18, {'num_classes': 10})
    
    # Add models with equal weights
    for path in model_paths:
        soup.add_model(path, weight=1.0)
    
    # Create averaged model
    soup_model = soup.create_soup()
    
    return soup_model
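
To make the arithmetic concrete, here is a minimal framework-free sketch of the same weighted averaging. It assumes toy "state dicts" that map parameter names to plain lists of floats. The helper name `average_state_dicts` and the example values are hypothetical, chosen only for illustration.

```python
def average_state_dicts(state_dicts, weights=None):
    """Weighted average of toy 'state dicts' (dict: name -> list of floats)."""
    if not state_dicts:
        raise ValueError("No models added to soup")
    if weights is None:
        weights = [1.0] * len(state_dicts)
    total = sum(weights)
    norm = [w / total for w in weights]  # same normalization as create_soup

    # All ingredients must expose the same parameter names
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise KeyError("Ingredients must share the same parameter names")

    soup = {}
    for key in keys:
        length = len(state_dicts[0][key])
        soup[key] = [
            sum(w * sd[key][j] for w, sd in zip(norm, state_dicts))
            for j in range(length)
        ]
    return soup


# Two toy ingredients (values are made up for illustration)
a = {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]}
b = {"fc.weight": [3.0, 6.0], "fc.bias": [4.0]}

uniform = average_state_dicts([a, b])                       # weights 0.5 / 0.5
weighted = average_state_dicts([a, b], weights=[3.0, 1.0])  # weights 0.75 / 0.25
print(uniform)   # {'fc.weight': [2.0, 4.0], 'fc.bias': [2.0]}
print(weighted)  # {'fc.weight': [1.5, 3.0], 'fc.bias': [1.0]}
```

Raw weights of 3 and 1 normalize to 0.75 and 0.25, which mirrors what `create_soup(normalize_weights=True)` does with the `weight` values passed to `add_model`.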

Advanced Model Soup Techniques

1. Greedy Soup

Greedy soup starts from the best individual model and adds an ingredient only if doing so improves accuracy on a held-out validation set:

class GreedySoup(ModelSoup):
    """Greedy Model Soup that selectively adds beneficial models"""
    
    def __init__(self, model_class, model_kwargs=None, validation_loader=None, device='cuda'):
        super().__init__(model_class, model_kwargs)
        self.validation_loader = validation_loader
        self.device = device
        self.selected_models = []
        self.best_performance = 0.0
    
    def create_greedy_soup(self, candidate_models, max_models=None):
        """Create soup by greedily selecting models"""
        if self.validation_loader is None:
            raise ValueError("Validation loader required for greedy soup")
        
        # Start with best individual model
        best_model_idx = self._find_best_individual_model(candidate_models)
        self.selected_models = [candidate_models[best_model_idx]]
        
        # Current soup performance
        current_soup = self._create_soup_from_selected()
        self.best_performance = self._evaluate_model(
            current_soup, self.validation_loader, self.device
        )
        
        print(f"Starting with model {best_model_idx}, performance: {self.best_performance:.4f}")
        
        # Remaining candidates
        remaining_candidates = [
            candidate_models[i] for i in range(len(candidate_models)) 
            if i != best_model_idx
        ]
        
        # Greedily add models
        iteration = 0
        while remaining_candidates and (max_models is None or len(self.selected_models) < max_models):
            best_candidate_idx = None
            best_improvement = 0.0
            
            # Try adding each remaining candidate
            for i, candidate in enumerate(remaining_candidates):
                # Create temporary soup with candidate added
                temp_selected = self.selected_models + [candidate]
                temp_soup = self._create_soup_from_models(temp_selected)
                
                # Evaluate performance
                performance = self._evaluate_model(
                    temp_soup, self.validation_loader, self.device
                )
                
                improvement = performance - self.best_performance
                if improvement > best_improvement:
                    best_improvement = improvement
                    best_candidate_idx = i
            
            # Add best candidate if it improves performance
            if best_candidate_idx is not None and best_improvement > 0:
                best_candidate = remaining_candidates.pop(best_candidate_idx)
                self.selected_models.append(best_candidate)
                self.best_performance += best_improvement
                
                print(f"Added model {len(self.selected_models)-1}, "
                      f"performance: {self.best_performance:.4f}")
            else:
                print("No beneficial candidates found, stopping")
                break
            
            iteration += 1
        
        # Create final soup
        self.soup_model = self._create_soup_from_selected()
        return self.soup_model
    
    def _find_best_individual_model(self, models):
        """Find the best performing individual model"""
        best_performance = 0.0
        best_model_idx = 0
        
        for i, model_info in enumerate(models):
            model = self.model_class(**self.model_kwargs)
            model.load_state_dict(model_info['state_dict'])
            model.to(self.device)
            
            performance = self._evaluate_model(
                model, self.validation_loader, self.device
            )
            
            if performance > best_performance:
                best_performance = performance
                best_model_idx = i
        
        return best_model_idx
    
    def _create_soup_from_selected(self):
        """Create soup from currently selected models"""
        return self._create_soup_from_models(self.selected_models)
    
    def _create_soup_from_models(self, models):
        """Create soup from given list of models"""
        if not models:
            return None
        
        # Create soup model
        soup_model = self.model_class(**self.model_kwargs)
        soup_state_dict = OrderedDict()
        
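        # Initialize float accumulators (integer buffers such as BatchNorm's
        # num_batches_tracked cannot take fractional in-place updates)
        reference = models[0]['state_dict']
        for key, param in reference.items():
            soup_state_dict[key] = torch.zeros_like(param, dtype=torch.float32)
        
        # Average weights
        weight = 1.0 / len(models)  # Equal weights
        for model_info in models:
            for key, param in model_info['state_dict'].items():
                soup_state_dict[key] += weight * param.float()
        
        # Restore original dtypes
        for key, param in reference.items():
            soup_state_dict[key] = soup_state_dict[key].to(param.dtype)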
        
        soup_model.load_state_dict(soup_state_dict)
        return soup_model
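
The class above re-evaluates every remaining candidate in each round, which costs a quadratic number of validation passes. A cheaper single-pass variant ranks the ingredients once by their individual score and then tries each in order. The sketch below illustrates that variant on toy weight vectors. The names `greedy_soup`, `mean_vector`, `toy_score`, and `TARGET` are hypothetical, and the toy score merely stands in for validation accuracy.

```python
def mean_vector(vectors):
    """Uniform average of equal-length weight vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def greedy_soup(candidates, score_fn):
    """Single-pass greedy selection.

    candidates: list of weight vectors (lists of floats)
    score_fn:   held-out metric on a weight vector, higher is better
    """
    # Rank ingredients by their individual held-out score, best first
    ranked = sorted(candidates, key=score_fn, reverse=True)

    selected = [ranked[0]]
    best = score_fn(mean_vector(selected))

    for cand in ranked[1:]:
        trial = score_fn(mean_vector(selected + [cand]))
        if trial > best:  # keep the ingredient only if the soup improves
            selected.append(cand)
            best = trial
    return mean_vector(selected), selected, best


# Toy "validation" metric: negative squared distance to a target point
TARGET = [1.0, 1.0]


def toy_score(w):
    return -sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))


candidates = [[0.5, 1.0], [1.5, 1.0], [4.0, 4.0]]
soup, selected, best = greedy_soup(candidates, toy_score)
print(soup)           # [1.0, 1.0]
print(len(selected))  # 2: the outlier [4.0, 4.0] is rejected
```

The two nearby ingredients average to the target, while adding the distant one would lower the score, so it is left out.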

2. Learned Mixing Weights

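Learn mixing weights for the ingredient models on held-out data. Note that the implementation below mixes model outputs (logits) rather than parameters, so every ingredient still runs at inference time and the cost grows with the number of models, unlike a weight-space soup: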

class LearnedWeightSoup(nn.Module):
    """Model Soup with learned mixing weights"""
    
    def __init__(self, models, num_classes, device='cuda'):
        super().__init__()
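        # Keep base models in eval mode: they live in a plain list, so
        # train()/eval() calls on this module do not reach them
        self.models = [model.to(device).eval() for model in models]
        self.num_models = len(models)
        self.device = device
        
        # Learnable mixing weights (created on the same device as the model outputs)
        self.mixing_weights = nn.Parameter(
            torch.ones(self.num_models, device=device) / self.num_models
        )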
        
        # Freeze base model parameters
        for model in self.models:
            for param in model.parameters():
                param.requires_grad = False
    
    def forward(self, x):
        """Forward pass with learned mixing"""
        # Get predictions from all models
        outputs = []
        for model in self.models:
            with torch.no_grad():
                output = model(x)
            outputs.append(output)
        
        # Stack outputs
        stacked_outputs = torch.stack(outputs, dim=0)  # (num_models, batch_size, num_classes)
        
        # Apply softmax to mixing weights
        normalized_weights = F.softmax(self.mixing_weights, dim=0)
        
        # Weighted combination
        mixed_output = torch.sum(
            normalized_weights.view(-1, 1, 1) * stacked_outputs, dim=0
        )
        
        return mixed_output
    
    def get_mixing_weights(self):
        """Get current mixing weights"""
        return F.softmax(self.mixing_weights, dim=0).detach().cpu().numpy()

def train_learned_weights(learned_soup, train_loader, val_loader, 
                         num_epochs=10, lr=0.01, device='cuda'):
    """Train the mixing weights"""
    optimizer = torch.optim.Adam([learned_soup.mixing_weights], lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    best_val_acc = 0.0
    
    for epoch in range(num_epochs):
        # Training
        learned_soup.train()
        train_loss = 0.0
        
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            
            optimizer.zero_grad()
            outputs = learned_soup(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
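        # Validation (accuracy computed inline; no external helper is assumed)
        learned_soup.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                predicted = learned_soup(inputs).argmax(dim=1)
                total += targets.size(0)
                correct += (predicted == targets).sum().item()
        val_acc = correct / total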
        
        print(f"Epoch {epoch+1}/{num_epochs}")
        print(f"  Train Loss: {train_loss/len(train_loader):.4f}")
        print(f"  Val Accuracy: {val_acc:.4f}")
        print(f"  Mixing Weights: {learned_soup.get_mixing_weights()}")
        
        if val_acc > best_val_acc:
            best_val_acc = val_acc
    
    return best_val_acc
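
The effect of the softmax on the mixing parameters can be checked with a small framework-free sketch. The helper names `softmax` and `mix_logits` and the example logits are hypothetical. The sketch only mirrors the arithmetic of the `forward` method above.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mix_logits(raw_weights, per_model_logits):
    """Combine per-model logits with softmax-normalized mixing weights."""
    alphas = softmax(raw_weights)
    num_classes = len(per_model_logits[0])
    return [
        sum(a * logits[c] for a, logits in zip(alphas, per_model_logits))
        for c in range(num_classes)
    ]


# Equal raw parameters give a uniform mixture, whatever their value
uniform = softmax([0.5, 0.5])
# A larger raw parameter shifts the mixture toward that model
skewed = softmax([2.0, 0.0])

# Two models, two classes, uniform mixing
mixed = mix_logits([0.0, 0.0], [[2.0, 0.0], [0.0, 4.0]])
print(uniform)  # [0.5, 0.5]
print(mixed)    # [1.0, 2.0]
```

Because the raw parameters pass through a softmax, the initial value `torch.ones(n) / n` in `LearnedWeightSoup` yields a uniform mixture, and the learned weights always stay positive and sum to one.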

3. Layer-wise Soup

Create soups at different layer granularities:

class LayerwiseSoup:
    """Layer-wise model soup with selective averaging"""
    
    def __init__(self, model_class, model_kwargs=None):
        self.model_class = model_class
        self.model_kwargs = model_kwargs or {}
        self.ingredient_models = []
        
    def add_model(self, state_dict, weight=1.0):
        """Add a model to ingredients"""
        self.ingredient_models.append({
            'state_dict': state_dict,
            'weight': weight
        })
    
    def create_layerwise_soup(self, layer_selection_strategy='all', 
                            validation_loader=None, device='cuda'):
        """Create soup with layer-wise selection"""
        
        if layer_selection_strategy == 'all':
            # Average all layers
            return self._create_full_soup()
        
        elif layer_selection_strategy == 'selective':
            # Selectively average layers based on performance
            return self._create_selective_soup(validation_loader, device)
        
        elif layer_selection_strategy == 'classifier_only':
            # Only average classifier layers
            return self._create_classifier_soup()
        
        else:
            raise ValueError(f"Unknown strategy: {layer_selection_strategy}")
    
    def _create_full_soup(self):
        """Create soup by averaging all layers"""
        soup_model = self.model_class(**self.model_kwargs)
        soup_state_dict = OrderedDict()
        
        # Get normalized weights
        weights = [model_info['weight'] for model_info in self.ingredient_models]
        total_weight = sum(weights)
        weights = [w / total_weight for w in weights]
        
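        # Initialize float accumulators (handles integer buffers safely)
        reference = self.ingredient_models[0]['state_dict']
        for key, param in reference.items():
            soup_state_dict[key] = torch.zeros_like(param, dtype=torch.float32)
        
        # Average, then restore each entry's original dtype
        for i, model_info in enumerate(self.ingredient_models):
            for key, param in model_info['state_dict'].items():
                soup_state_dict[key] += weights[i] * param.float()
        for key, param in reference.items():
            soup_state_dict[key] = soup_state_dict[key].to(param.dtype)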
        
        soup_model.load_state_dict(soup_state_dict)
        return soup_model
    
    def _create_selective_soup(self, validation_loader, device):
        """Create soup by selectively averaging beneficial layers"""
        if validation_loader is None:
            raise ValueError("Validation loader required for selective soup")
        
        # Get base model (first ingredient)
        base_model = self.model_class(**self.model_kwargs)
        base_model.load_state_dict(self.ingredient_models[0]['state_dict'])
        base_performance = self._evaluate_model(base_model, validation_loader, device)
        
        soup_state_dict = copy.deepcopy(self.ingredient_models[0]['state_dict'])
        layer_names = list(soup_state_dict.keys())
        
        # Test averaging each layer group
        for layer_prefix in ['conv', 'bn', 'fc', 'classifier']:
            matching_layers = [name for name in layer_names if layer_prefix in name]
            
            if not matching_layers:
                continue
            
            # Create temporary soup with this layer group averaged
            temp_state_dict = copy.deepcopy(soup_state_dict)
            
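            # Average matching layers, accumulating in float so that integer
            # buffers (e.g. bn*.num_batches_tracked) do not raise dtype errors
            total_weight = sum(m['weight'] for m in self.ingredient_models)
            for layer_name in matching_layers:
                original = self.ingredient_models[0]['state_dict'][layer_name]
                averaged = torch.zeros_like(original, dtype=torch.float32)
                
                for model_info in self.ingredient_models:
                    weight = model_info['weight'] / total_weight
                    averaged += weight * model_info['state_dict'][layer_name].float()
                
                temp_state_dict[layer_name] = averaged.to(original.dtype)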
            
            # Test performance
            temp_model = self.model_class(**self.model_kwargs)
            temp_model.load_state_dict(temp_state_dict)
            temp_performance = self._evaluate_model(temp_model, validation_loader, device)
            
            # Keep changes if improvement
            if temp_performance > base_performance:
                soup_state_dict = temp_state_dict
                base_performance = temp_performance
                print(f"Averaged {layer_prefix} layers: {temp_performance:.4f}")
        
        # Create final soup
        soup_model = self.model_class(**self.model_kwargs)
        soup_model.load_state_dict(soup_state_dict)
        return soup_model
    
    def _create_classifier_soup(self):
        """Create soup by only averaging classifier layers"""
        # Use first model as base
        soup_state_dict = copy.deepcopy(self.ingredient_models[0]['state_dict'])
        
        # Find classifier layers (typically contain 'fc' or 'classifier')
        classifier_keys = [key for key in soup_state_dict.keys() 
                         if 'fc' in key or 'classifier' in key or 'head' in key]
        
        if not classifier_keys:
            print("No classifier layers found, using full soup")
            return self._create_full_soup()
        
        # Average only classifier layers
        weights = [model_info['weight'] for model_info in self.ingredient_models]
        total_weight = sum(weights)
        weights = [w / total_weight for w in weights]
        
        for key in classifier_keys:
            soup_state_dict[key] = torch.zeros_like(soup_state_dict[key])
            
            for i, model_info in enumerate(self.ingredient_models):
                soup_state_dict[key] += weights[i] * model_info['state_dict'][key]
        
        # Create soup model
        soup_model = self.model_class(**self.model_kwargs)
        soup_model.load_state_dict(soup_state_dict)
        return soup_model
    
    def _evaluate_model(self, model, test_loader, device):
        """Evaluate model performance"""
        model.to(device)
        model.eval()
        correct = 0
        total = 0
        
        with torch.no_grad():
            for inputs, targets in test_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += targets.size(0)
                correct += (predicted == targets).sum().item()
        
        return correct / total
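
Because the layer groups are selected by substring match on parameter names, it is worth checking which keys each group actually covers. The sketch below does this for a handful of ResNet-style key names. The key list is a hand-picked illustrative subset, not a full state dict, and the helper `keys_matching` is hypothetical.

```python
# Illustrative subset of ResNet-style parameter names
resnet_style_keys = [
    "conv1.weight",
    "bn1.weight",
    "bn1.num_batches_tracked",
    "layer2.0.conv1.weight",
    "layer2.0.downsample.0.weight",
    "layer2.0.downsample.1.weight",
    "fc.weight",
    "fc.bias",
]

# Same groups as in _create_selective_soup
groups = ["conv", "bn", "fc", "classifier"]


def keys_matching(keys, substring):
    """Keys whose name contains the substring (the test used above)."""
    return [k for k in keys if substring in k]


covered = set()
for group in groups:
    covered.update(keys_matching(resnet_style_keys, group))

# Keys that no group matches keep the base model's values in the selective soup
uncovered = [k for k in resnet_style_keys if k not in covered]

# Same filter as in _create_classifier_soup
classifier_keys = [
    k for k in resnet_style_keys
    if any(p in k for p in ("fc", "classifier", "head"))
]

print(uncovered)        # ['layer2.0.downsample.0.weight', 'layer2.0.downsample.1.weight']
print(classifier_keys)  # ['fc.weight', 'fc.bias']
```

In this example the downsample layers match none of the four groups, so the selective strategy would never average them. Adjusting the group list to the architecture at hand avoids that kind of silent gap.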

Domain-Specific Model Soups

1. Vision Model Soup

class VisionModelSoup:
    """Specialized soup for computer vision models"""
    
    def __init__(self, architecture='resnet50', num_classes=1000, pretrained=True):
        self.architecture = architecture
        self.num_classes = num_classes
        self.pretrained = pretrained
        self.models = []
        
    def train_diverse_models(self, train_loader, val_loader, num_models=5, 
                           epochs_per_model=10, device='cuda'):
        """Train diverse models for soup ingredients"""
        
        training_configs = [
            {'lr': 0.01, 'weight_decay': 1e-4, 'optimizer': 'sgd'},
            {'lr': 0.001, 'weight_decay': 1e-3, 'optimizer': 'adam'},
            {'lr': 0.005, 'weight_decay': 5e-4, 'optimizer': 'sgd'},
            {'lr': 0.002, 'weight_decay': 1e-4, 'optimizer': 'adamw'},
            {'lr': 0.01, 'weight_decay': 1e-5, 'optimizer': 'sgd'},
        ]
        
        for i in range(min(num_models, len(training_configs))):
            print(f"\nTraining model {i+1}/{num_models}")
            config = training_configs[i]
            
            # Create model
            if self.architecture == 'resnet50':
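                # torchvision >= 0.13 uses `weights=`; `pretrained=` is deprecated
                weights = models.ResNet50_Weights.DEFAULT if self.pretrained else None
                model = models.resnet50(weights=weights)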
                model.fc = nn.Linear(model.fc.in_features, self.num_classes)
            else:
                raise ValueError(f"Unsupported architecture: {self.architecture}")
            
            model = model.to(device)
            
            # Setup optimizer
            if config['optimizer'] == 'sgd':
                optimizer = torch.optim.SGD(
                    model.parameters(), 
                    lr=config['lr'], 
                    weight_decay=config['weight_decay'],
                    momentum=0.9
                )
            elif config['optimizer'] == 'adam':
                optimizer = torch.optim.Adam(
                    model.parameters(),
                    lr=config['lr'],
                    weight_decay=config['weight_decay']
                )
            elif config['optimizer'] == 'adamw':
                optimizer = torch.optim.AdamW(
                    model.parameters(),
                    lr=config['lr'],
                    weight_decay=config['weight_decay']
                )
            
            # Train model
            trained_model = self._train_single_model(
                model, train_loader, val_loader, optimizer, epochs_per_model, device
            )
            
            self.models.append(trained_model.state_dict())
            
    def _train_single_model(self, model, train_loader, val_loader, 
                          optimizer, epochs, device):
        """Train a single model"""
        criterion = nn.CrossEntropyLoss()
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs)
        
        best_val_acc = 0.0
        best_state_dict = None
        
        for epoch in range(epochs):
            # Training
            model.train()
            train_loss = 0.0
            
            for inputs, targets in train_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                loss.backward()
                optimizer.step()
                
                train_loss += loss.item()
            
            # Validation
            model.eval()
            val_acc = 0.0
            with torch.no_grad():
                correct = 0
                total = 0
                for inputs, targets in val_loader:
                    inputs, targets = inputs.to(device), targets.to(device)
                    outputs = model(inputs)
                    _, predicted = torch.max(outputs.data, 1)
                    total += targets.size(0)
                    correct += (predicted == targets).sum().item()
                val_acc = correct / total
            
            scheduler.step()
            
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                best_state_dict = copy.deepcopy(model.state_dict())
            
            print(f"  Epoch {epoch+1}: Loss: {train_loss/len(train_loader):.4f}, "
                  f"Val Acc: {val_acc:.4f}")
        
        # Load best weights (skipped if no epoch ever improved on 0.0 accuracy)
        if best_state_dict is not None:
            model.load_state_dict(best_state_dict)
        return model
    
    def create_vision_soup(self, soup_type='uniform'):
        """Create vision-specific soup"""
        if not self.models:
            raise ValueError("No models trained. Call train_diverse_models first.")
        
        # Create soup. Pass the constructor itself: ModelSoup calls
        # model_class(**model_kwargs), which a zero-argument lambda would reject
        soup = ModelSoup(models.resnet50, {'num_classes': self.num_classes})
        
        if soup_type == 'uniform':
            # Equal weights
            for state_dict in self.models:
                soup.add_model(state_dict, weight=1.0)
        
        elif soup_type == 'performance_weighted':
            # Weight by validation performance (would need to track this)
            # Simplified: use uniform for now
            for state_dict in self.models:
                soup.add_model(state_dict, weight=1.0)
        
        return soup.create_soup()
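
The `performance_weighted` branch above falls back to uniform weights because validation scores are not tracked. One possible way to fill that gap, assuming the best validation accuracy of each ingredient were recorded during training, is sketched below. The helper `performance_weights` and the accuracy values are hypothetical.

```python
def performance_weights(val_accuracies):
    """Turn tracked validation accuracies into mixing weights that sum to 1."""
    if not val_accuracies:
        raise ValueError("Need at least one validation accuracy")
    if any(a < 0 for a in val_accuracies):
        raise ValueError("Accuracies must be non-negative")
    total = sum(val_accuracies)
    if total == 0:
        # Degenerate case: fall back to a uniform soup
        return [1.0 / len(val_accuracies)] * len(val_accuracies)
    return [a / total for a in val_accuracies]


# Made-up accuracies for three ingredients
weights = performance_weights([0.75, 0.5, 0.75])
print(weights)  # [0.375, 0.25, 0.375]
```

Each value could then be passed as `weight=` to `soup.add_model`. Since ingredients usually reach similar accuracies, proportional weights of this kind stay close to uniform, so any gain over a uniform soup should be verified on held-out data.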

2. NLP Model Soup

class NLPModelSoup:
    """Model soup for NLP tasks"""
    
    def __init__(self, model_class, tokenizer, num_labels):
        self.model_class = model_class
        self.tokenizer = tokenizer
        self.num_labels = num_labels
        self.models = []
    
    def fine_tune_diverse_models(self, train_dataset, val_dataset, num_models=3):
        """Fine-tune diverse models for different seeds/hyperparameters"""
        
        configs = [
            {'learning_rate': 2e-5, 'warmup_steps': 500, 'seed': 42},
            {'learning_rate': 3e-5, 'warmup_steps': 1000, 'seed': 123},
            {'learning_rate': 5e-5, 'warmup_steps': 250, 'seed': 456},
        ]
        
        for i, config in enumerate(configs[:num_models]):
            print(f"Training model {i+1} with config: {config}")
            
            # Set seed
            torch.manual_seed(config['seed'])
            
            # Create model
            model = self.model_class.from_pretrained(
                'bert-base-uncased',
                num_labels=self.num_labels
            )
            
            # Train model (simplified training loop)
            trained_model = self._fine_tune_model(model, train_dataset, val_dataset, config)
            self.models.append(trained_model.state_dict())
    
    def _fine_tune_model(self, model, train_dataset, val_dataset, config):
        """Fine-tune a single model"""
        # This would contain the actual fine-tuning logic
        # For brevity, returning the model as-is
        return model
    
    def create_nlp_soup(self):
        """Create NLP model soup"""
        if not self.models:
            raise ValueError("No models fine-tuned")
        
        # Create base model
        soup_model = self.model_class.from_pretrained(
            'bert-base-uncased',
            num_labels=self.num_labels
        )
        
        # Average weights
        soup_state_dict = OrderedDict()
        
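        # Initialize float accumulators (some checkpoints include integer buffers)
        for key, param in self.models[0].items():
            soup_state_dict[key] = torch.zeros_like(param, dtype=torch.float32)
        
        # Average, then restore original dtypes
        for state_dict in self.models:
            for key, param in state_dict.items():
                soup_state_dict[key] += param.float() / len(self.models)
        for key, param in self.models[0].items():
            soup_state_dict[key] = soup_state_dict[key].to(param.dtype)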
        
        soup_model.load_state_dict(soup_state_dict)
        return soup_model

Evaluation and Analysis

Comprehensive Evaluation Framework

class SoupEvaluator:
    """Comprehensive evaluation of model soups"""
    
    def __init__(self, test_loaders, device='cuda'):
        self.test_loaders = test_loaders  # Dict of test loaders
        self.device = device
    
    def evaluate_comprehensive(self, soup_model, ingredient_models):
        """Comprehensive evaluation of soup vs ingredients"""
        results = {}
        
        # Evaluate on multiple test sets
        for test_name, test_loader in self.test_loaders.items():
            print(f"\nEvaluating on {test_name}")
            
            # Individual models
            individual_scores = []
            for i, model_state in enumerate(ingredient_models):
                model = self._create_model_from_state(model_state)
                score = self._evaluate_single(model, test_loader)
                individual_scores.append(score)
                print(f"  Ingredient {i}: {score:.4f}")
            
            # Soup model
            soup_score = self._evaluate_single(soup_model, test_loader)
            print(f"  Soup: {soup_score:.4f}")
            
            # Traditional ensemble
            ensemble_score = self._evaluate_ensemble(ingredient_models, test_loader)
            print(f"  Ensemble: {ensemble_score:.4f}")
            
            results[test_name] = {
                'individual': individual_scores,
                'soup': soup_score,
                'ensemble': ensemble_score,
                'improvement_over_best': soup_score - max(individual_scores),
                'improvement_over_avg': soup_score - np.mean(individual_scores)
            }
        
        return results
    
    def _evaluate_single(self, model, test_loader):
        """Evaluate single model"""
        model.to(self.device)
        model.eval()
        
        correct = 0
        total = 0
        
        with torch.no_grad():
            for inputs, targets in test_loader:
                inputs, targets = inputs.to(self.device), targets.to(self.device)
                outputs = model(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += targets.size(0)
                correct += (predicted == targets).sum().item()
        
        return correct / total
    
    def _evaluate_ensemble(self, ingredient_models, test_loader):
        """Evaluate traditional ensemble"""
        models = [self._create_model_from_state(state) for state in ingredient_models]
        for model in models:
            model.to(self.device)
            model.eval()
        
        correct = 0
        total = 0
        
        with torch.no_grad():
            for inputs, targets in test_loader:
                inputs, targets = inputs.to(self.device), targets.to(self.device)
                
                # Average predictions
                ensemble_output = None
                for model in models:
                    output = model(inputs)
                    if ensemble_output is None:
                        ensemble_output = output
                    else:
                        ensemble_output += output
                
                ensemble_output /= len(models)
                _, predicted = torch.max(ensemble_output.data, 1)
                total += targets.size(0)
                correct += (predicted == targets).sum().item()
        
        return correct / total
    
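    def _create_model_from_state(self, state_dict):
        """Create model from state dict - must be implemented for your model"""
        # Placeholder: build your architecture and load `state_dict` into it.
        # Raising here avoids silently returning None to the evaluation code.
        raise NotImplementedError(
            "Implement _create_model_from_state for your model architecture"
        )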
    
    def plot_results(self, results):
        """Plot evaluation results"""
        datasets = list(results.keys())
        soup_scores = [results[d]['soup'] for d in datasets]
        avg_individual = [np.mean(results[d]['individual']) for d in datasets]
        ensemble_scores = [results[d]['ensemble'] for d in datasets]
        
        x = np.arange(len(datasets))
        width = 0.25
        
        fig, ax = plt.subplots(figsize=(12, 6))
        
        bars1 = ax.bar(x - width, avg_individual, width, label='Avg Individual', alpha=0.8)
        bars2 = ax.bar(x, soup_scores, width, label='Soup', alpha=0.8)
        bars3 = ax.bar(x + width, ensemble_scores, width, label='Ensemble', alpha=0.8)
        
        ax.set_xlabel('Test Dataset')
        ax.set_ylabel('Accuracy')
        ax.set_title('Model Soup vs Individual Models vs Ensemble')
        ax.set_xticks(x)
        ax.set_xticklabels(datasets)
        ax.legend()
        
        # Add value labels
        for bars in [bars1, bars2, bars3]:
            for bar in bars:
                height = bar.get_height()
                ax.text(bar.get_x() + bar.get_width()/2., height + 0.005,
                       f'{height:.3f}', ha='center', va='bottom')
        
        plt.tight_layout()
        plt.show()
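
The `_create_model_from_state` placeholder above has to be filled in before the ensemble evaluation can run. One way to do that, sketched here under the assumption that every ingredient shares a single architecture, is a small factory-based helper. The names `build_model_from_state`, `model_factory`, and `tiny_classifier` are hypothetical and not part of the code above.

```python
import torch
import torch.nn as nn


def build_model_from_state(model_factory, state_dict, device="cpu"):
    """Instantiate a fresh model and load one ingredient's weights into it.

    model_factory is a hypothetical zero-argument callable that returns a new
    instance of the architecture shared by all ingredients.
    """
    model = model_factory()
    # load_state_dict is strict by default, so the keys must match exactly
    model.load_state_dict(state_dict)
    return model.to(device).eval()


def tiny_classifier():
    """Toy architecture standing in for the real one."""
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))


source = tiny_classifier()
restored = build_model_from_state(tiny_classifier, source.state_dict())

# The restored model should reproduce the source model's outputs
x = torch.randn(2, 4)
with torch.no_grad():
    same_output = torch.allclose(source(x), restored(x))
print("outputs match:", same_output)
```

Inside the evaluator class, the method body could delegate to a helper like this, with the factory stored on the instance at construction time.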

Theoretical Analysis

class SoupAnalyzer:
    """Analyze model soup from theoretical perspective"""
    
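    def __init__(self, model_factory=None):
        # model_factory is an assumed hook: a zero-argument callable that returns
        # a freshly constructed model with the ingredients' architecture
        self.model_factory = model_factory
    
    def _create_model_from_state(self, state_dict):
        """Build a model from a state dict (used by analyze_loss_landscape)"""
        if self.model_factory is None:
            raise NotImplementedError(
                "Pass model_factory to SoupAnalyzer to enable loss landscape analysis"
            )
        model = self.model_factory()
        model.load_state_dict(state_dict)
        return model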
    
    def analyze_weight_similarity(self, ingredient_models):
        """Analyze similarity between ingredient model weights"""
        similarities = {}
        
        # Get all parameter names
        param_names = list(ingredient_models[0].keys())
        
        for param_name in param_names:
            if 'weight' in param_name:  # Only analyze weight parameters
                # Get parameters from all models
                params = [model[param_name].flatten() for model in ingredient_models]
                param_matrix = torch.stack(params)
                
                # Compute pairwise cosine similarities via normalized dot products
                # (avoids a large N x N x D broadcast intermediate)
                normalized = F.normalize(param_matrix.float(), dim=1)
                cosine_sim = normalized @ normalized.T
                
                # Select each unordered pair of models once (strict upper triangle)
                pair_mask = torch.triu(torch.ones_like(cosine_sim), diagonal=1).bool()
                pair_sims = cosine_sim[pair_mask]
                
                # Store statistics
                similarities[param_name] = {
                    'mean_similarity': pair_sims.mean().item(),
                    'min_similarity': pair_sims.min().item(),
                    'max_similarity': pair_sims.max().item()
                }
        
        return similarities
    
    def analyze_loss_landscape(self, soup_model, ingredient_models, test_loader, device='cuda'):
        """Analyze loss landscape around soup and ingredient models"""
        
        def compute_loss(model, loader):
            model.to(device)  # inputs are moved to `device`, so the model must be too
            model.eval()
            total_loss = 0.0
            criterion = nn.CrossEntropyLoss()
            
            with torch.no_grad():
                for inputs, targets in loader:
                    inputs, targets = inputs.to(device), targets.to(device)
                    outputs = model(inputs)
                    loss = criterion(outputs, targets)
                    total_loss += loss.item()
            
            return total_loss / len(loader)
        
        # Compute losses
        soup_loss = compute_loss(soup_model, test_loader)
        
        individual_losses = []
        for state_dict in ingredient_models:
            model = self._create_model_from_state(state_dict)
            loss = compute_loss(model, test_loader)
            individual_losses.append(loss)
        
        # Analyze interpolation paths
        interpolation_losses = []
        alphas = np.linspace(0, 1, 11)
        
        for i in range(len(ingredient_models)):
            for j in range(i+1, len(ingredient_models)):
                losses_path = []
                
                for alpha in alphas:
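                    # Interpolate between models i and j
                    # (alpha=1 recovers model i, alpha=0 recovers model j)
                    weight_i = float(alpha)
                    interpolated_state = {}
                    for key in ingredient_models[i].keys():
                        tensor_i = ingredient_models[i][key]
                        tensor_j = ingredient_models[j][key]
                        if torch.is_floating_point(tensor_i):
                            interpolated_state[key] = (
                                weight_i * tensor_i + (1 - weight_i) * tensor_j
                            )
                        else:
                            # Integer buffers (e.g. BatchNorm's num_batches_tracked)
                            # are copied rather than blended
                            interpolated_state[key] = tensor_i.clone()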
                    
                    model = self._create_model_from_state(interpolated_state)
                    loss = compute_loss(model, test_loader)
                    losses_path.append(loss)
                
                interpolation_losses.append(losses_path)
        
        return {
            'soup_loss': soup_loss,
            'individual_losses': individual_losses,
            'interpolation_paths': interpolation_losses,
            'alphas': alphas
        }
    
    def plot_loss_landscape(self, landscape_results):
        """Plot loss landscape analysis"""
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Individual vs soup losses
        individual_losses = landscape_results['individual_losses']
        soup_loss = landscape_results['soup_loss']
        
        ax1.bar(range(len(individual_losses)), individual_losses, alpha=0.7, label='Individual Models')
        ax1.axhline(y=soup_loss, color='red', linestyle='--', label='Soup Model')
        ax1.set_xlabel('Model Index')
        ax1.set_ylabel('Test Loss')
        ax1.set_title('Individual vs Soup Model Losses')
        ax1.legend()
        
        # Interpolation paths
        alphas = landscape_results['alphas']
        paths = landscape_results['interpolation_paths']
        
        for i, path in enumerate(paths):
            ax2.plot(alphas, path, alpha=0.7, label=f'Path {i+1}')
        
        ax2.axhline(y=soup_loss, color='red', linestyle='--', label='Soup Loss')
        ax2.set_xlabel('Interpolation Factor (α)')
        ax2.set_ylabel('Test Loss')
        ax2.set_title('Loss Along Interpolation Paths')
        ax2.legend()
        
        plt.tight_layout()
        plt.show()
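
The interpolation paths returned by `analyze_loss_landscape` can be condensed into one number per pair of models. A common summary is the loss barrier: the largest amount by which the loss along the path exceeds the straight line connecting the two endpoint losses. A barrier near zero suggests the two ingredients sit in the same low-loss region, which is the situation where weight averaging tends to work. The sketch below assumes paths in the list-of-losses format returned above; the helper name `loss_barrier` and the sample numbers are illustrative, not measured results.

```python
import numpy as np


def loss_barrier(path_losses, alphas):
    """Maximum excess of the path loss over the chord between the endpoints."""
    path = np.asarray(path_losses, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Chord: linear interpolation between the loss at alpha=0 and at alpha=1
    chord = (1 - alphas) * path[0] + alphas * path[-1]
    return float(np.max(path - chord))


alphas = np.linspace(0, 1, 5)

# Made-up paths for illustration only
connected_path = [0.50, 0.46, 0.44, 0.45, 0.48]  # dips below both endpoints
separated_path = [0.50, 0.90, 1.40, 0.85, 0.48]  # bump in the middle

connected_barrier = loss_barrier(connected_path, alphas)
separated_barrier = loss_barrier(separated_path, alphas)
print(f"connected: {connected_barrier:.3f}, separated: {separated_barrier:.3f}")
```

Applying this to each entry of `interpolation_paths` gives a quick way to flag ingredient pairs that are unlikely to average well.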

Best Practices and Guidelines

When to Use Model Soup

def soup_recommendation_system(task_type, num_models, diversity_score, computational_budget):
    """Recommend whether to use model soup based on context"""
    
    recommendations = []
    
    # Check basic requirements
    if num_models < 2:
        return ["Need at least 2 models for soup"]
    
    # Task-specific recommendations
    if task_type in ['classification', 'regression']:
        recommendations.append("✓ Classification/regression tasks are well-suited for model soup")
    elif task_type in ['generation', 'structured_prediction']:
        recommendations.append("⚠ Generation tasks may require more careful soup preparation")
    
    # Diversity recommendations
    if diversity_score > 0.7:
        recommendations.append("✓ High model diversity suggests good soup potential")
    elif diversity_score < 0.3:
        recommendations.append("⚠ Low diversity may limit soup benefits")
    
    # Computational considerations
    if computational_budget == 'low':
        recommendations.append("✓ Soup can improve accuracy without adding inference overhead")
    elif computational_budget == 'high':
        recommendations.append("? Consider comparing soup vs traditional ensemble")
    
    return recommendations

# Example usage
task = 'classification'
num_models = 5  # not named `models`, which would shadow the torchvision import
diversity = 0.8
budget = 'low'

recommendations = soup_recommendation_system(task, num_models, diversity, budget)
for rec in recommendations:
    print(rec)
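
`soup_recommendation_system` takes a `diversity_score` but leaves its computation open. One possible definition, offered here as an assumption rather than an established metric, is the average pairwise disagreement between the models' predicted labels on a held-out set: 0 means all models predict identically, and larger values mean they disagree more often. The helper name `prediction_diversity` and the toy predictions are hypothetical, and the 0.3 and 0.7 thresholds in the function above would need recalibrating for whichever diversity measure is chosen.

```python
from itertools import combinations

import numpy as np


def prediction_diversity(predictions):
    """Mean pairwise disagreement rate between models' predicted labels.

    predictions: array-like of shape (num_models, num_examples).
    """
    preds = np.asarray(predictions)
    pairs = list(combinations(range(len(preds)), 2))
    if not pairs:
        raise ValueError("Need predictions from at least 2 models")
    # Fraction of examples on which each pair of models disagrees
    rates = [np.mean(preds[i] != preds[j]) for i, j in pairs]
    return float(np.mean(rates))


# Toy predicted labels from 3 models on 5 held-out examples
toy_predictions = [
    [0, 1, 2, 1, 0],
    [0, 1, 2, 2, 0],
    [0, 2, 2, 1, 0],
]
toy_diversity = prediction_diversity(toy_predictions)
print(f"diversity: {toy_diversity:.3f}")
```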

Conclusion

Model Soup offers a compelling approach to improving deep learning performance through simple weight averaging. Key insights:

Advantages

No inference overhead: Single model with same computational cost
Improved robustness: Reduces variance across training runs
Simple implementation: Just weight averaging
Better calibration: Often more reliable confidence estimates

Best Practices

Diverse ingredients: Use models with different hyperparameters, seeds, or training procedures
Validation-guided selection: Use greedy soup to keep only the ingredients that improve held-out validation accuracy
Layer-wise analysis: Consider selective averaging of different layer types
Domain-specific adaptations: Tailor soup recipes to specific task requirements

Limitations

Requires multiple trained models: Initial training cost
Task dependency: Effectiveness varies across different tasks
Mode connectivity: Assumes the ingredients lie in the same low-loss region, which is most likely when they are fine-tuned from a shared pre-trained initialization

Future Directions

Learned soup recipes: Automatically discovering optimal mixing strategies
Online soups: Dynamically updating soup during training
Structured soups: Incorporating architectural constraints in averaging
Theoretical understanding: Better characterization of when soups work

Model Soup represents an elegant solution to the ensemble vs. efficiency trade-off, providing a practical way to improve model performance without adding inference cost.

References

Wortsman, M., et al. (2022). "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time."
Garipov, T., et al. (2018). "Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs."
Frankle, J., et al. (2020). "Linear mode connectivity and the lottery ticket hypothesis."
Fort, S., et al. (2019). "Deep ensembles: A loss landscape perspective."