Changelog¶
All notable changes to GridGulp will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.3.0] - 2025-07-29¶
Added¶
- DataFrame Extraction: Extract detected tables as pandas DataFrames
DataFrameExtractor: Intelligent header detection with multi-row support- Quality scoring for extracted tables (0-1 scale)
- Automatic header detection including merged cells
- Type consistency analysis across columns (checks up to 100 rows)
- Plate map format detection (6, 24, 96, 384, 1536 wells)
-
Export to CSV, JSON, and summary formats
-
Enhanced TSV/Text File Detection: Specialized detection for instrument output
StructuredTextDetector: Handles TSV files and instrument output- Column consistency analysis for better table separation
- Wide table detection for plate data (50+ columns)
- Structural analysis using row patterns and empty row detection
-
Better handling of multiple table formats in single file
-
Improved Island Detection: Smarter heuristics for table separation
- Column consistency threshold (80% similarity for grouping)
- Configurable gap settings per file type (gap=0 for TSV, gap=1 for Excel)
- Structural analysis mode for text files
- Empty row detection for natural table boundaries
-
Aspect ratio and size-based confidence scoring
-
Documentation and Packaging:
- Streamlined README.md for clarity
- NOTICE file with all dependency licenses
- GitHub Action for PyPI releases with OIDC
- Updated all documentation to reflect new features
Changed¶
- Island detector now uses structural analysis for text files by default
- Constants centralized with file type-specific defaults
- Detection pipeline selects strategies based on file type
- DataFrame extraction integrated into main workflow
Fixed¶
- TSV file detection now properly identifies multiple tables
- Wide plate data detection for instrument output files
- Header detection handles merged cells correctly
- Empty cells in headers properly handled
[0.2.2] - 2025-07-28¶
Added¶
- Excel Metadata Extraction: Leverage Excel's built-in table definitions
ExcelMetadataExtractor: Extract ListObjects (Excel tables) with high confidence- Named ranges detection for better table boundaries
- Zero API cost for Excel native table detection
-
Integration with hybrid detection pipeline
-
Traditional Detection Methods: Fast, zero-cost table detection
SimpleCaseDetector: Optimized for single tables starting near A1IslandDetector: Connected component analysis for multi-table sheets- Flood-fill algorithm for finding disconnected data regions
- Confidence scoring based on density, shape, and header patterns
-
Both methods completely free (no API calls)
-
Cost Optimization Framework: Intelligent routing and budget management
CostOptimizer: Track API usage and costs per session/file- Budget limits with automatic fallback to free methods
- Real-time cost tracking and reporting
- Intelligent method selection based on complexity and budget
OpenAIPricing: Real OpenAI cost calculation with current pricing-
OpenAICostsAPI: Integration with OpenAI's costs endpoint (admin key required) -
Code Organization Improvements: Major refactoring for maintainability
- Centralized Constants: All constants moved to
src/gridgulp/core/constants.py - Core Module: New organizational structure with:
constants.py- Categorized constants (Island Detection, Format Analysis, etc.)exceptions.py- Custom exception classes for better error handlingtypes.py- Shared type definitions and aliasesconfigurable_constants.py- Mechanism for config overrides
- Contextual Logging: New logging framework with file/sheet/table context
-
Better separation of concerns and cleaner imports
-
Enhanced Detection Pipeline: Hybrid approach for optimal cost/quality
- Try free methods first (Simple Case, Island Detection, Excel Metadata)
- Fall back to vision processing only when needed
- Confidence-based routing and early termination
- Methods used tracking in detection results
- Cost reporting in metadata
Changed¶
- Configuration System: New options for cost control and traditional methods
max_cost_per_session: Budget limit for entire session (default: $1.00)max_cost_per_file: Budget limit per file (default: $0.10)enable_simple_case_detection: Enable fast single-table detection (default: True)enable_island_detection: Enable multi-table detection (default: True)use_excel_metadata: Use Excel ListObjects/named ranges (default: True)openai_admin_key: Optional admin key for cost API access-
Configurable detection thresholds for fine-tuning
-
GridGulp Core: Enhanced with hybrid detection and cost tracking
- Integrated traditional methods into main detection pipeline
- Added cost tracking to all detection results
- Enhanced metadata with methods used and performance metrics
-
Better handling of budget constraints
-
ComplexTableAgent: Updated with cost-aware routing
- Intelligent fallback from expensive to free methods
- Enhanced confidence scoring combining multiple detection strategies
- Better integration with Excel metadata extraction
- Contextual logging throughout detection process
Fixed¶
- Improved constants organization reduces code duplication
- Better error handling with custom exception classes
- Enhanced logging provides better debugging information
- More robust detection with multiple fallback strategies
- Fixed magic number scattered throughout codebase
Examples¶
week6_excel_metadata_example.py: Demonstrates Excel ListObjects extractionweek6_hybrid_detection_example.py: Shows hybrid detection approach with cost tracking
Documentation¶
docs/WEEK6_SUMMARY.md: Comprehensive overview of Week 6 featuresdocs/testing/WEEK6_TESTING_GUIDE.md: Manual testing procedures for new featuresCONSTANTS_REFACTORING.md: Details of code organization improvements
Developer Experience¶
- Constants centralization improves maintainability
- Better type safety with custom types module
- Enhanced debugging with contextual logging
- Cleaner code organization with core module structure
- All code continues to pass strict linting standards
[0.2.1] - 2025-07-27¶
Added¶
- Complex Table Detection: New agent-based system for detecting complex spreadsheet structures
ComplexTableAgent: Orchestrates multi-row header detection, semantic analysis, and format preservation- Handles financial reports, pivot tables, and hierarchical data structures
- Confidence scoring based on multiple detection strategies
-
Format-aware detection preserving semantic meaning
-
Multi-Row Header Detection: Advanced header analysis with merged cell support
MultiHeaderDetector: Identifies hierarchical headers spanning multiple rows- Column hierarchy mapping for nested headers
- Merged cell analysis with span detection
- Support for complex pivot table structures
-
Header confidence scoring based on formatting and content
-
Semantic Structure Analysis: Understanding table meaning beyond layout
SemanticFormatAnalyzer: Detects sections, subtotals, and grand totals- Row type classification (header, data, total, separator, section)
- Format pattern detection for consistent styling
- Preserves semantic blank rows and formatting
-
Section boundary detection for grouped data
-
Merged Cell Analysis: Comprehensive merged cell handling
MergedCellAnalyzer: Detects and maps merged cell regions- Column span calculation for proper data alignment
- Header cell hierarchy construction
-
Support for both Excel native and custom merge formats
-
Feature Collection System: Telemetry for continuous improvement
FeatureCollector: Records detailed detection metrics- SQLite-based feature storage with 40+ metrics
- Geometric features (rectangularness, density, contiguity)
- Pattern features (type, orientation, headers)
- Format features (bold headers, totals, sections)
- Export to CSV for analysis in pandas/Excel
-
Configurable retention and privacy-preserving
-
Comprehensive Test Suite: 100% test coverage for semantic features
- 20 test scenarios covering all detection strategies
- Integration tests for complex real-world patterns
- Performance benchmarks for large files
- Feature collection validation tests
Changed¶
- Configuration System: Enhanced with new options
use_vision: Toggle vision-based detection (default: True)enable_feature_collection: Enable telemetry (default: False)feature_db_path: SQLite database locationfeature_retention_days: Data retention period (default: 30)-
All options configurable via environment variables
-
GridGulp Core: Enhanced with semantic understanding
- Integrated ComplexTableAgent into main detection pipeline
- Added metadata fields for tracking LLM usage
- Improved confidence scoring with multi-factor analysis
- Better handling of sparse and complex spreadsheets
Fixed¶
- Improved handling of sparse spreadsheets with many empty cells
- Better detection of table boundaries in complex layouts
- More accurate confidence scoring for multi-table sheets
- Fixed edge cases in merged cell detection
- Resolved issues with format preservation in detection results
Examples¶
week5_complex_tables_with_features.py: Demonstrates complex financial report detectionweek5_feature_collection_example.py: Shows feature collection and analysis workflowfeature_collection_example.py: Basic feature collection usage
Developer Experience¶
- All code now passes ruff linting standards
- Improved type hints throughout the codebase
- Better error messages for debugging
- Comprehensive docstrings for all new components
[0.2.0] - 2025-07-25¶
Added¶
- Region Verification System: AI proposal validation using geometry analysis
- RegionVerifier class with configurable thresholds
- Geometry metrics: rectangularness, filledness, density, contiguity
- Pattern-specific verification for header-data, matrix, and hierarchical patterns
- Feedback generation for invalid regions
- Integration with vision pipeline for automatic filtering
- Verification Configuration: New config options for region verification
- enable_region_verification, verification_strict_mode
- Configurable thresholds for filledness, rectangularness, contiguity
- Feedback loop settings for iterative refinement
Changed¶
- Vision Infrastructure: Complete bitmap-based vision pipeline for table detection
- Bitmap generation with adaptive compression (2-bit, 4-bit, sampled modes)
- Multi-scale visualization with quadtree optimization
- Pattern detection for sparse spreadsheets (header-data, matrix, form, time series)
- Hierarchical detector for financial statements with indentation
- Integrated 4-phase detection pipeline
- Vision Model Integration:
- OpenAI GPT-4 Vision support
- Ollama local vision model support (qwen2-vl)
- Region proposal parsing with confidence scoring
- Batch processing and caching for performance
- Reader Implementations:
- ExcelReader with full formatting support (openpyxl/xlrd)
- CSVReader with encoding detection and delimiter inference
- CalamineReader (Rust-based) for 10-100x faster Excel processing
- Factory pattern for automatic reader selection
- Async/sync adapters for flexible usage
- Telemetry System:
- OpenTelemetry integration for LLM usage tracking
- Token usage metrics and cost tracking
- Performance monitoring
- Large File Support:
- Handle full Excel limits (1M×16K cells for .xlsx)
- Memory-efficient processing with streaming
- Adaptive sampling for oversized sheets
Planned¶
- Region verification algorithms
- Geometry analysis tools
- Excel ListObjects detection integration
- LLM-powered range naming suggestions
- CLI tool with progress indicators
- Full agent implementation with openai-agents-python
[0.2.0] - 2025-07-25¶
Added¶
- Vision Module Implementation: Complete vision-based table detection system
- 9 specialized modules for different aspects of vision processing
- Support for sparse, hierarchical, and complex table patterns
- Integration with both cloud and local vision models
- High-Performance Readers: CalamineReader for fast Excel processing
- Pattern Detection: Automatic detection of common spreadsheet patterns
- Telemetry: OpenTelemetry-based monitoring and cost tracking
Changed¶
- Default Excel reader changed to CalamineReader for performance
- Enhanced file type detection with magic byte verification
Technical Details¶
- Implemented adaptive bitmap compression for large spreadsheets
- Added quadtree spatial indexing for efficient processing
- Created hierarchical pattern detector for financial statements
- Built integrated pipeline coordinating multiple detection strategies
[0.1.0] - 2025-07-23¶
Added¶
- Project Foundation: Complete project structure and build system
- Pydantic 2 Models: Type-safe data models for FileInfo, TableInfo, and DetectionResult
- Configuration System: Comprehensive config with environment variable support
- Main API Structure: GridGulp class with cost-efficient architecture design
- File Type Detection: Basic file magic detection utilities (placeholder implementation)
- Development Tooling:
- Pre-commit hooks with Black, Ruff, mypy
- GitHub Actions CI/CD pipeline
- uv-based build system with hatchling
- EditorConfig and development guidelines
- Documentation:
- Comprehensive README with usage examples
- AGENT_ARCHITECTURE.md with cost-optimization strategy
- PROJECT_PLAN.md with weekly breakdown
- CONTRIBUTING.md with development guidelines
- Test Infrastructure: Complete test structure with pytest configuration
- Examples: Basic usage examples showing different configuration patterns
- Cost-Efficient Design:
- Local-first processing architecture
- Optional LLM integration with local model support
- Token usage tracking and optimization
Changed¶
- Build system updated to use uv instead of standard setuptools
- Project timeline restructured from 2-week phases to weekly milestones
Security¶
- Input validation framework for file operations
- File size limits and processing timeouts
- Safe file type detection before processing
- No execution of Excel macros (by design)