GridGulp Architecture¶
Overview¶
GridGulp is a streamlined table detection framework that uses proven algorithms to extract tables from spreadsheets and text files. The architecture prioritizes simplicity, performance, and accuracy.
Core Design Principles¶
- Fast Path First: Most use cases are handled by simple algorithms
- No External Dependencies: Table detection is purely algorithmic and calls no AI/ML services (Magika is used only to identify file types)
- Format Agnostic: Unified interface for Excel, CSV, and text files
- Memory Efficient: Streaming processing for large files
- Type Safe: Pydantic models for all data structures
Architecture Diagram¶
┌─────────────────────────────────────────────────────────┐
│ GridGulp API │
├─────────────────────────────────────────────────────────┤
│ File Type Detection │
│ (Magika + Magic) │
├─────────────────────────────────────────────────────────┤
│ File Readers │
│ ┌─────────────┬──────────────┬────────────────────┐ │
│ │ ExcelReader │ CSVReader │ TextReader │ │
│ │ (openpyxl) │ (csv.reader)│ (encoding detect) │ │
│ └─────────────┴──────────────┴────────────────────┘ │
├─────────────────────────────────────────────────────────┤
│ Detection Pipeline │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 1. SimpleCaseDetector (single table near A1) │ │
│ │ 2. IslandDetector (multi-table detection) │ │
│ │ 3. ExcelMetadataExtractor (ListObjects) │ │
│ └─────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────┤
│ Output Models │
│ DetectionResult → SheetResult → TableInfo │
└─────────────────────────────────────────────────────────┘
Component Details¶
1. File Type Detection¶
Purpose: Accurately identify file types regardless of extension
Components:
- FileFormatDetector: Main detection class
- Magika: Google's AI-powered file type detection
- python-magic: Fallback using libmagic
- EncodingResult: Sophisticated encoding detection for text files
Key Features:
- BOM (Byte Order Mark) detection
- Multi-layer encoding detection with chardet
- Pattern-based detection for scientific data
- Handles misnamed files (e.g., CSV with .xlsx extension)
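BOM detection is the cheapest of the encoding layers listed above, so a minimal sketch of it may help. The function name `sniff_bom` and the lookup table are assumptions made for illustration, not GridGulp's actual code; only the `codecs` constants come from the Python standard library.

```python
import codecs
from typing import Optional

# Longest signatures first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
# The codec names returned here consume the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head: bytes) -> Optional[str]:
    """Return a codec name if `head` starts with a known BOM, else None."""
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return None  # no BOM: fall through to statistical detection (e.g. chardet)
```

A caller would typically pass the first few bytes of the file and only move on to a heavier detector when the result is `None`.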
2. File Readers¶
Purpose: Extract cell data from various file formats
ExcelReader¶
- Uses openpyxl for .xlsx/.xlsm files
- Uses xlrd for legacy .xls files
- Preserves formatting information
- Handles merged cells
CSVReader¶
- Automatic delimiter detection
- Encoding detection with fallbacks
- Type inference for cell values
- Memory-efficient streaming
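As a rough illustration of delimiter detection combined with streaming, the sketch below leans on the standard library's `csv.Sniffer`. The helper names `detect_delimiter` and `iter_rows`, the candidate delimiter set, and the sample size are assumptions; the real CSVReader may use a different strategy.

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter from a text sample, defaulting to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return ","  # Sniffer could not decide, e.g. a single-column file


def iter_rows(text_stream, sample_size: int = 4096):
    """Yield rows one at a time so the whole file is never held in memory."""
    sample = text_stream.read(sample_size)
    text_stream.seek(0)  # rewind so the sampled rows are not lost
    yield from csv.reader(text_stream, delimiter=detect_delimiter(sample))


# Usage: any seekable text stream works, including an open file.
rows = list(iter_rows(io.StringIO("id;name\n1;Ada\n2;Grace\n")))
```

Sampling only the head of the file keeps detection cost constant regardless of file size, at the price of occasionally misjudging files whose first rows are atypical.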
TextReader¶
- Sophisticated encoding detection (UTF-8, UTF-16, etc.)
- Automatic CSV/TSV detection
- Scientific instrument data support
- Handles wide tables (100+ columns)
3. Detection Pipeline¶
Purpose: Identify table boundaries within sheets
SimpleCaseDetector¶
- Use Case: Single table starting near cell A1
- Performance: < 1ms for most sheets
- Accuracy: 100% for standard tables
- Algorithm: Find data bounds, check density
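The "find data bounds, check density" idea can be sketched as follows. This assumes a sheet is represented as a dict mapping zero-based `(row, col)` positions to values; the function name mirrors the one used later in this document, but the signature and the thresholds `max_offset` and `min_density` are hypothetical and chosen only for illustration.

```python
from typing import Any, Optional

Cell = tuple[int, int]  # (row, col), zero-based


def detect_simple_table(
    cells: dict[Cell, Any],
    max_offset: int = 2,       # assumed: how far from A1 the table may start
    min_density: float = 0.7,  # assumed: share of filled cells required
) -> Optional[tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) if the sheet looks like one table."""
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    top, left, bottom, right = min(rows), min(cols), max(rows), max(cols)
    if top > max_offset or left > max_offset:
        return None  # data does not start near A1
    area = (bottom - top + 1) * (right - left + 1)
    if len(cells) / area < min_density:
        return None  # too sparse: likely several tables or scattered notes
    return top, left, bottom, right
```

Both checks need only a single pass over the filled cells, which is consistent with the fast path being much cheaper than island detection.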
IslandDetector¶
- Use Case: Multiple disconnected tables
- Performance: < 100ms for complex sheets
- Accuracy: 95%+ for well-formatted data
- Algorithm: Connected component analysis
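Connected component analysis on a grid can be sketched with a breadth-first flood fill. The version below is a minimal illustration under assumptions: cells are keyed by zero-based `(row, col)`, neighbours are 8-connected, and each island is reported as a bounding box. The name `find_islands` is hypothetical, and the real IslandDetector may merge or score islands differently.

```python
from collections import deque
from typing import Any

Cell = tuple[int, int]  # (row, col), zero-based


def find_islands(cells: dict[Cell, Any]) -> list[tuple[int, int, int, int]]:
    """Group filled cells into 8-connected islands; return bounding boxes."""
    unvisited = set(cells)
    boxes = []
    while unvisited:
        start = unvisited.pop()
        queue = deque([start])
        top = bottom = start[0]
        left = right = start[1]
        while queue:
            r, c = queue.popleft()
            top, bottom = min(top, r), max(bottom, r)
            left, right = min(left, c), max(right, c)
            # Visit the eight surrounding cells; (r, c) itself is already removed.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbour = (r + dr, c + dc)
                    if neighbour in unvisited:
                        unvisited.remove(neighbour)
                        queue.append(neighbour)
        boxes.append((top, left, bottom, right))
    return sorted(boxes)  # (top, left, bottom, right), ordered top-to-bottom
```

Each filled cell is visited once, so the cost grows with the number of non-empty cells rather than with the sheet's nominal dimensions.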
ExcelMetadataExtractor¶
- Use Case: Excel tables with defined ListObjects
- Performance: < 10ms
- Accuracy: 100% for defined tables
- Algorithm: Direct metadata extraction
4. Data Models¶
All models use Pydantic v2 for validation and serialization:
from typing import Any

from pydantic import BaseModel

# Core detection result
class DetectionResult(BaseModel):
    file_info: FileInfo
    sheets: list[SheetResult]
    metadata: dict[str, Any]

# Table information
class TableInfo(BaseModel):
    range: TableRange
    confidence: float
    detection_method: str
    headers: list[str] | None
    shape: tuple[int, int]
Processing Flow¶
1. File Loading¶
# Detect file type
file_info = detector.detect_file_type(file_path)
# Create appropriate reader
reader = ReaderFactory.create_reader(file_path, file_info)
# Read file data
file_data = await reader.read()
2. Table Detection¶
# Collect (sheet, tables) pairs for result assembly
detected_tables = []
# For each sheet
for sheet in file_data.sheets:
    # Try simple case first (fast path)
    if simple_detector.is_simple_case(sheet):
        tables = [simple_detector.detect_simple_table(sheet)]
    else:
        # Fall back to island detection
        tables = island_detector.detect_tables(sheet)
    detected_tables.append((sheet, tables))
3. Result Assembly¶
# Create detection result
result = DetectionResult(
    file_info=file_info,
    sheets=[
        SheetResult(
            name=sheet.name,
            tables=tables
        )
        for sheet, tables in detected_tables
    ]
)
Performance Characteristics¶
Memory Usage¶
- Streaming: Large files processed in chunks
- Cell Storage: Only non-empty cells stored
- Format Data: Minimal formatting preserved
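Storing only non-empty cells can be pictured as a sparse mapping built while rows stream in. The sketch below is an assumption about the general technique, not GridGulp's internal representation; `sparse_cells` is a hypothetical helper, and it treats `None` and the empty string as "empty".

```python
from typing import Any, Iterable, Iterator


def sparse_cells(
    rows: Iterable[Iterable[Any]],
) -> Iterator[tuple[tuple[int, int], Any]]:
    """Yield ((row, col), value) pairs for non-empty cells only."""
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value is not None and value != "":
                yield (r, c), value


# Usage: rows can be any iterator, so the source is consumed one row at a time.
sheet = dict(sparse_cells([["a", "", None], ["", "b", 0]]))
```

With this layout, memory use scales with the amount of data rather than with the sheet's rectangular extent, which matters for sheets that have a few values far from A1.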
Processing Speed¶
- Simple Tables: 1M+ cells/second
- Complex Tables: 100K+ cells/second
- File Loading: Limited by I/O speed
Scalability¶
- File Size: Tested up to 100MB files
- Row Limit: 1M rows (configurable)
- Column Limit: 16K columns (Excel limit)
Extension Points¶
Adding New File Formats¶
- Create reader class extending BaseReader
- Implement read() method
- Register in ReaderFactory
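The three steps above could look roughly like the sketch below. `BaseReader` and `ReaderFactory` are simplified stand-ins written for this example, so their constructors, the `register` method, and the return type of `read()` are assumptions; GridGulp's real interfaces may differ. `PipeReader` and the `.psv` extension are hypothetical.

```python
import asyncio
from abc import ABC, abstractmethod
from pathlib import Path


class BaseReader(ABC):  # stand-in for GridGulp's base class
    def __init__(self, file_path: Path) -> None:
        self.file_path = file_path

    @abstractmethod
    async def read(self) -> list[list[str]]:
        """Return the file's cell data."""


class ReaderFactory:  # stand-in registry keyed by file extension
    _readers: dict[str, type[BaseReader]] = {}

    @classmethod
    def register(cls, extension: str, reader_cls: type[BaseReader]) -> None:
        cls._readers[extension.lower()] = reader_cls

    @classmethod
    def create_reader(cls, file_path: Path, file_info=None) -> BaseReader:
        # file_info is accepted to mirror the call shown in Processing Flow,
        # but this sketch dispatches on the extension only.
        return cls._readers[file_path.suffix.lower()](file_path)


class PipeReader(BaseReader):  # step 1: extend BaseReader
    async def read(self) -> list[list[str]]:  # step 2: implement read()
        # Run the blocking file read in a worker thread.
        text = await asyncio.to_thread(self.file_path.read_text, encoding="utf-8")
        return [line.split("|") for line in text.splitlines()]


ReaderFactory.register(".psv", PipeReader)  # step 3: register the reader
```

Keeping registration separate from the reader class means a new format can be added without touching existing readers.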
Adding New Detectors¶
- Create detector class
- Implement detection algorithm
- Add to detection pipeline
Custom Output Formats¶
- Extend base models
- Add serialization methods
- Configure in GridGulp
Error Handling¶
Reader Errors¶
- FileNotFoundError: File doesn't exist
- PermissionError: No read access
- ReaderError: Format-specific issues
Detection Errors¶
- NoTablesFoundError: No tables detected
- DetectionTimeoutError: Processing timeout
- InvalidSheetError: Corrupted sheet data
Configuration¶
Key configuration options:
- confidence_threshold: Minimum confidence (0.0-1.0)
- max_tables_per_sheet: Limit table count
- min_table_size: Minimum rows/columns
- timeout_seconds: Processing timeout
- enable_magika: Use AI file detection
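To show how these options might be grouped and validated, here is a minimal sketch using a standard-library dataclass. The class name `DetectionConfig`, every default value, and the `(rows, columns)` reading of `min_table_size` are assumptions made for illustration; GridGulp's actual configuration object and defaults may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionConfig:  # hypothetical container; defaults are illustrative only
    confidence_threshold: float = 0.7
    max_tables_per_sheet: int = 50
    min_table_size: tuple[int, int] = (2, 2)  # assumed to mean (rows, columns)
    timeout_seconds: float = 30.0
    enable_magika: bool = True

    def __post_init__(self) -> None:
        # Reject values outside the documented 0.0-1.0 range early.
        if not 0.0 <= self.confidence_threshold <= 1.0:
            raise ValueError("confidence_threshold must be within 0.0-1.0")
        if self.max_tables_per_sheet < 1:
            raise ValueError("max_tables_per_sheet must be at least 1")
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")


# Usage: override only what differs from the defaults.
strict = DetectionConfig(confidence_threshold=0.9, enable_magika=False)
```

Validating at construction time means a bad threshold fails before any file is read, rather than surfacing as an empty result later.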
Testing Strategy¶
Unit Tests¶
- Individual detector algorithms
- Reader format handling
- Model validation
Integration Tests¶
- End-to-end detection
- Cross-format compatibility
- Performance benchmarks
Test Data¶
- Level 0: Basic single tables
- Level 1: Real-world files
- Level 2: Edge cases