Production Python: Project Structure and Best Practices for ML

Author: Jared Chung

Introduction

The gap between a Jupyter notebook and production code is vast. Most ML tutorials stop at model.fit(), but real value comes from maintainable, testable, deployable code. This guide covers the practices that separate hobby projects from production systems.

Project Structure

A well-organized ML project:

my_ml_project/
├── src/
│   └── my_ml_project/
│       ├── __init__.py
│       ├── config.py           # Configuration management
│       ├── data/
│       │   ├── __init__.py
│       │   ├── loaders.py      # Data loading
│       │   └── processors.py   # Data transformations
│       ├── models/
│       │   ├── __init__.py
│       │   ├── architectures.py
│       │   └── training.py
│       ├── inference/
│       │   ├── __init__.py
│       │   └── predictor.py
│       ├── utils/
│       │   ├── __init__.py
│       │   └── logging.py
│       └── scripts/            # CLI entrypoints (see [project.scripts])
│           ├── __init__.py
│           ├── train.py
│           ├── evaluate.py
│           └── serve.py
├── tests/
│   ├── __init__.py
│   ├── conftest.py             # Pytest fixtures
│   ├── unit/
│   │   └── test_processors.py
│   └── integration/
│       └── test_pipeline.py
├── notebooks/                   # Exploration only
│   └── exploration.ipynb
├── configs/
│   ├── base.yaml
│   └── production.yaml
├── pyproject.toml              # Project metadata & deps
├── Makefile                    # Common commands
├── Dockerfile
├── .env.example
├── .gitignore
└── README.md

Key principles:

  • Source code lives under src/ in a directory named after the package (see the __init__.py sketch below)
  • Tests mirror the source structure
  • Configuration is kept separate from code
  • CLI entrypoints live in the scripts subpackage and are exposed via [project.scripts]
  • Notebooks are for exploration only, never production
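
With the src layout, the package's top-level __init__.py is a good place to pin the version and re-export the small public API that callers actually need. A minimal sketch, with the re-exported names taken from the modules shown above (purely illustrative):

# src/my_ml_project/__init__.py
"""Public API for my_ml_project."""
__version__ = "0.1.0"

from my_ml_project.config import get_settings
from my_ml_project.inference.predictor import Predictor

__all__ = ["__version__", "get_settings", "Predictor"]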

Modern Dependency Management

pyproject.toml

The modern standard for Python projects:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my-ml-project"
version = "0.1.0"
description = "Production ML pipeline"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.0",
    "transformers>=4.30",
    "pandas>=2.0",
    "pydantic>=2.0",
    "pydantic-settings>=2.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "pytest-cov>=4.0",
    "ruff>=0.1",
    "mypy>=1.0",
    "pre-commit>=3.0",
]

[project.scripts]
train = "my_ml_project.scripts.train:main"
serve = "my_ml_project.scripts.serve:main"

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I", "N", "W", "UP"]

[tool.mypy]
python_version = "3.11"
strict = true
ignore_missing_imports = true

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --cov=src"
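
After an editable install, the entries under [project.scripts] above become console commands in your environment, so training can be launched without remembering module paths:

# Install the project, then call the entry points directly
pip install -e ".[dev]"
train        # runs my_ml_project.scripts.train:main
serve        # runs my_ml_project.scripts.serve:main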

uv for Fast Dependency Management

Modern alternative to pip:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment
uv venv

# Install dependencies (10-100x faster than pip)
uv pip install -e ".[dev]"

# Lock dependencies
uv pip compile pyproject.toml -o requirements.lock

# Sync from lock file
uv pip sync requirements.lock

Configuration Management

Never hardcode configuration. Use environment variables and config files.

Pydantic Settings

# src/my_ml_project/config.py
from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import Field
from pathlib import Path
from typing import Literal

class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
        protected_namespaces=(),  # allow fields like model_name without pydantic warnings
    )

    # Environment
    environment: Literal["development", "staging", "production"] = "development"
    debug: bool = False

    # Model settings
    model_name: str = "bert-base-uncased"
    model_path: Path = Field(default=Path("models/"))
    max_sequence_length: int = 512

    # Training
    batch_size: int = 32
    learning_rate: float = 2e-5
    epochs: int = 3

    # API settings
    api_host: str = "0.0.0.0"
    api_port: int = 8000

    # External services
    database_url: str = Field(default="sqlite:///./data.db")
    redis_url: str = Field(default="redis://localhost:6379")

    @property
    def is_production(self) -> bool:
        return self.environment == "production"

# Singleton pattern
_settings = None

def get_settings() -> Settings:
    global _settings
    if _settings is None:
        _settings = Settings()
    return _settings
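
Values then come from real environment variables or a local .env file (mirroring .env.example in the tree); pydantic-settings matches variable names case-insensitively. A file like this, with illustrative values, overrides the defaults:

# .env
ENVIRONMENT=production
MODEL_NAME=bert-base-uncased
BATCH_SIZE=64
DATABASE_URL=postgresql://user:pass@db:5432/ml

Anywhere in the codebase, call get_settings() once and read typed values:

from my_ml_project.config import get_settings

settings = get_settings()
print(settings.batch_size, settings.is_production)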

YAML Configuration for Experiments

# configs/base.yaml
model:
  name: bert-base-uncased
  max_length: 512

training:
  batch_size: 32
  learning_rate: 2e-5
  epochs: 3
  warmup_steps: 100

data:
  train_path: data/train.csv
  val_path: data/val.csv

Load it with OmegaConf (or Hydra):

from omegaconf import OmegaConf

config = OmegaConf.load("configs/base.yaml")
print(config.model.name)  # bert-base-uncased
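
Environment-specific files like configs/production.yaml can then layer on top of the base config. OmegaConf.merge combines them (later values win), and OmegaConf.from_cli turns dotted command-line arguments into overrides:

# Layer base config, environment overrides, and CLI overrides
from omegaconf import OmegaConf

base = OmegaConf.load("configs/base.yaml")
prod = OmegaConf.load("configs/production.yaml")
cli = OmegaConf.from_cli()  # parses sys.argv, e.g. training.batch_size=64
config = OmegaConf.merge(base, prod, cli)

print(config.training.batch_size)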

Logging

Structured logging for observability:

# src/my_ml_project/utils/logging.py
import logging
import sys
from pythonjsonlogger import jsonlogger

def setup_logging(level: str = "INFO", json_format: bool = False):
    root_logger = logging.getLogger()
    root_logger.setLevel(level)

    handler = logging.StreamHandler(sys.stdout)

    if json_format:
        formatter = jsonlogger.JsonFormatter(
            "%(timestamp)s %(level)s %(name)s %(message)s",
            rename_fields={"levelname": "level", "asctime": "timestamp"}
        )
    else:
        formatter = logging.Formatter(
            "%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S"
        )

    handler.setFormatter(formatter)
    root_logger.addHandler(handler)

    # Reduce noise from libraries
    logging.getLogger("urllib3").setLevel(logging.WARNING)
    logging.getLogger("transformers").setLevel(logging.WARNING)

def get_logger(name: str) -> logging.Logger:
    return logging.getLogger(name)

Usage:

from my_ml_project.utils.logging import get_logger

logger = get_logger(__name__)

def train_model(config):
    logger.info("Starting training", extra={"config": config.model_dump()})

    for epoch in range(config.epochs):
        loss = ...  # computed by your training loop
        logger.info(f"Epoch {epoch}", extra={"epoch": epoch, "loss": loss})

    metrics = ...  # computed by your evaluation step
    logger.info("Training complete", extra={"final_metrics": metrics})

Error Handling

Define custom exceptions:

# src/my_ml_project/exceptions.py
class MLProjectError(Exception):
    """Base exception for the project."""
    pass

class DataValidationError(MLProjectError):
    """Raised when data validation fails."""
    pass

class ModelNotFoundError(MLProjectError):
    """Raised when a model file is not found."""
    pass

class InferenceError(MLProjectError):
    """Raised when inference fails."""
    pass

Use context managers for resources:

from contextlib import contextmanager
import torch

@contextmanager
def inference_mode():
    """Context manager for inference."""
    was_training = torch.is_grad_enabled()
    try:
        torch.set_grad_enabled(False)
        yield
    finally:
        torch.set_grad_enabled(was_training)

# Usage
with inference_mode():
    predictions = model(inputs)
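
Recent PyTorch releases (including the torch>=2.0 pinned above) also ship a built-in torch.inference_mode() context manager, which disables autograd bookkeeping more aggressively than torch.no_grad(); prefer it when you simply want gradients off during prediction:

with torch.inference_mode():
    predictions = model(inputs)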

Testing

Test Structure

# tests/conftest.py
import pytest
from pathlib import Path

@pytest.fixture
def sample_data():
    return {
        "texts": ["Hello world", "Test input"],
        "labels": [0, 1]
    }

@pytest.fixture
def model_path(tmp_path):
    return tmp_path / "test_model"

@pytest.fixture
def settings():
    from my_ml_project.config import Settings
    return Settings(environment="development", debug=True)

# tests/unit/test_processors.py
import pytest
from my_ml_project.data.processors import TextProcessor

class TestTextProcessor:
    def test_tokenize_basic(self):
        processor = TextProcessor(max_length=128)
        result = processor.tokenize("Hello world")

        assert "input_ids" in result
        assert len(result["input_ids"]) <= 128

    def test_tokenize_empty_raises(self):
        processor = TextProcessor()

        with pytest.raises(ValueError, match="empty"):
            processor.tokenize("")

    @pytest.mark.parametrize("text,expected_length", [
        ("Short", 3),
        ("A bit longer text here", 6),
    ])
    def test_tokenize_lengths(self, text, expected_length):
        processor = TextProcessor()
        result = processor.tokenize(text)
        # Approximate token count
        assert len(result["input_ids"]) >= expected_length

Integration Tests

# tests/integration/test_pipeline.py
import pytest
from my_ml_project.inference.predictor import Predictor

@pytest.mark.integration
class TestPredictionPipeline:
    @pytest.fixture
    def predictor(self, model_path):
        return Predictor(model_path=model_path)

    def test_end_to_end_prediction(self, predictor, sample_data):
        predictions = predictor.predict(sample_data["texts"])

        assert len(predictions) == len(sample_data["texts"])
        assert all(0 <= p <= 1 for p in predictions)

Run tests:

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run only unit tests
pytest tests/unit

# Run integration tests
pytest -m integration
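
For pytest -m integration to run cleanly, register the marker in pyproject.toml by extending the [tool.pytest.ini_options] section shown earlier:

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --cov=src"
markers = [
    "integration: slower tests that exercise the full pipeline",
]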

Pre-commit Hooks

Automate code quality:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
      - id: ruff-format

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.0
    hooks:
      - id: mypy
        additional_dependencies:
          - pydantic>=2.0
          - types-requests

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
        args: ['--maxkb=1000']

Setup:

pip install pre-commit
pre-commit install
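
Hooks only run against changed files on commit; to check the whole repository once after adding them:

pre-commit run --all-files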

Makefile for Common Commands

.PHONY: install test lint format clean

install:
	uv pip install -e ".[dev]"
	pre-commit install

test:
	pytest tests/ -v --cov=src

test-unit:
	pytest tests/unit -v

test-integration:
	pytest tests/integration -v -m integration

lint:
	ruff check src tests
	mypy src

format:
	ruff format src tests
	ruff check --fix src tests

clean:
	rm -rf .pytest_cache .mypy_cache .ruff_cache
	rm -rf dist build *.egg-info
	find . -type d -name __pycache__ -exec rm -rf {} +

docker-build:
	docker build -t my-ml-project .

docker-run:
	docker run -p 8000:8000 my-ml-project

CI/CD with GitHub Actions

# .github/workflows/ci.yaml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.10', '3.11']

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh

      - name: Install dependencies
        run: |
          uv venv
          source .venv/bin/activate
          uv pip install -e ".[dev]"

      - name: Lint
        run: |
          source .venv/bin/activate
          ruff check src tests
          mypy src

      - name: Test
        run: |
          source .venv/bin/activate
          pytest tests/ --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: coverage.xml

Docker

# Dockerfile
FROM python:3.11-slim as base

WORKDIR /app

# Install uv
RUN pip install uv

# Copy dependency files
COPY pyproject.toml .
COPY requirements.lock .

# Install dependencies
RUN uv venv && \
    . .venv/bin/activate && \
    uv pip sync requirements.lock

# Copy source
COPY src/ src/

# Production stage
FROM base as production

# Non-root user
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

ENV PATH="/app/.venv/bin:$PATH"

CMD ["python", "-m", "my_ml_project.scripts.serve"]

Conclusion

Production Python is about consistency, maintainability, and reliability. These practices might seem like overhead for small projects, but they pay dividends as projects grow. Start with the essentials: project structure, testing, and linting; then add more as needed.

The goal isn't perfection. It's a codebase that you and your team can maintain and extend with confidence.