GridGulp Testing Guide¶

Overview¶

This guide covers testing procedures for GridGulp, including unit tests, integration tests, and performance testing.

Test Structure¶

tests/
├── unit/              # Unit tests for individual components
├── integration/       # Integration tests for end-to-end workflows
├── detectors/        # Detector-specific tests
├── fixtures/         # Shared test fixtures
├── manual/           # Manual test files organized by complexity
│   ├── level0/       # Basic single-table files
│   ├── level1/       # Medium complexity files
│   └── level2/       # Complex multi-table files
└── outputs/          # Test output captures

Running Tests¶

All Tests¶

# Run all tests
pytest

# Run with coverage
pytest --cov=gridgulp

# Run with verbose output
pytest -v

Specific Test Categories¶

# Unit tests only
pytest tests/unit/

# Integration tests only
pytest tests/integration/

# Detector tests
pytest tests/detectors/

Individual Test Files¶

# Test specific detector
pytest tests/detectors/test_format_analyzer.py

# Test file detection
pytest tests/unit/test_file_detection.py

Unit Tests¶

SimpleCaseDetector Tests¶

Tests for the fast single-table detector:

# tests/test_simple_detector.py
- test_single_table_detection
- test_offset_table_detection
- test_empty_sheet_handling
- test_sparse_data_handling

IslandDetector Tests¶

Tests for multi-table detection:

# tests/detectors/test_island_detector.py
- test_multiple_tables_detection
- test_connected_component_analysis
- test_density_calculation
- test_edge_cases

File Type Detection Tests¶

# tests/unit/test_file_detection.py
- test_excel_detection
- test_csv_detection
- test_text_file_detection
- test_encoding_detection
- test_malformed_files

Integration Tests¶

End-to-End Detection¶

# tests/integration/test_complex_tables.py
async def test_financial_report():
    """Test complete detection pipeline on financial data."""
    porter = GridGulp()
    result = await porter.detect_tables("tests/manual/level1/complex_table.xlsx")

    assert result.total_tables >= 3
    assert result.detection_time < 5.0

Text File Processing¶

# tests/integration/test_text_files.py
async def test_scientific_data():
    """Test text file with tab-delimited scientific data."""
    porter = GridGulp()
    result = await porter.detect_tables("examples/proprietary/NOV_PEGDA6000.txt")

    assert len(result.sheets) == 1
    assert result.sheets[0].tables[0].shape[1] > 50  # Wide table

Performance Testing¶

Benchmark Script¶

# scripts/testing/benchmark_detection.py
import time
import asyncio
from gridgulp import GridGulp

async def benchmark_file(file_path):
    porter = GridGulp()
    start = time.time()
    result = await porter.detect_tables(file_path)
    duration = time.time() - start

    print(f"File: {file_path}")
    print(f"Tables: {result.total_tables}")
    print(f"Time: {duration:.2f}s")
    print(f"Cells/sec: {calculate_cells_per_sec(result, duration)}")

Performance Targets¶

Small files (<1MB): < 0.5 seconds
Medium files (1-10MB): < 2 seconds
Large files (10-50MB): < 10 seconds
Cell processing rate: > 100,000 cells/second

Test Data¶

Level 0: Basic Files¶

Simple single-table files for baseline testing: - test_basic.xlsx - Basic Excel table - test_comma.csv - Standard CSV - test_tab.tsv - Tab-separated values - test_formatting.xlsx - Formatted Excel table

Level 1: Medium Complexity¶

Real-world files with multiple tables: - complex_table.xlsx - Financial report with sections - large_table.csv - Large dataset (>10k rows) - simple_table.xlsx - Clean multi-sheet workbook

Level 2: Complex Files¶

Edge cases and challenging layouts: - creative_tables.xlsx - Unusual table layouts - weird_tables.xlsx - Non-standard structures

Writing New Tests¶

Test Structure¶

import pytest
from gridgulp import GridGulp, Config

@pytest.mark.asyncio
async def test_my_feature():
    """Test description."""
    # Arrange
    config = Config(confidence_threshold=0.8)
    porter = GridGulp(config)

    # Act
    result = await porter.detect_tables("path/to/test/file.xlsx")

    # Assert
    assert result.total_tables == expected_count
    assert all(table.confidence > 0.7 for sheet in result.sheets
               for table in sheet.tables)

Using Fixtures¶

@pytest.fixture
def sample_sheet_data():
    """Create sample sheet data for testing."""
    sheet = SheetData(name="test_sheet")
    # Add test data
    return sheet

async def test_with_fixture(sample_sheet_data):
    detector = SimpleCaseDetector()
    result = detector.detect_simple_table(sample_sheet_data)
    assert result.is_simple_table

Continuous Integration¶

GitHub Actions Workflow¶

# .github/workflows/test.yml
name: Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install -e ".[dev]"
      - run: pytest --cov=gridgulp

Debugging Tests¶

Verbose Output¶

# See detailed test execution
pytest -vv

# Show print statements
pytest -s

# Stop on first failure
pytest -x

Test Isolation¶

# Run tests in random order to detect dependencies
pytest --random-order

# Run specific test by name
pytest -k "test_scientific_data"

Coverage Reports¶

Generate Coverage¶

# Generate coverage report
pytest --cov=gridgulp --cov-report=html

# View report
open htmlcov/index.html

Coverage Goals¶

Overall coverage: > 80%
Core detectors: > 90%
File readers: > 85%
Error handling paths: > 70%