Detection Methods¶
GridGulp uses multiple detection algorithms to find tables in spreadsheets. Each method is optimized for different scenarios.
Detection Pipeline¶
The detection process follows this pipeline:
graph LR
A[File Input] --> B[File Type Detection]
B --> C[Read File Data]
C --> D{Simple Case?}
D -->|Yes| E[SimpleCaseDetector]
D -->|No| F[Excel Metadata]
F --> G[Island Detection]
G --> H[Multi-Header Detection]
H --> I[Semantic Analysis]
I --> J[Results]
E --> J
SimpleCaseDetector¶
Purpose: Quickly handles the most common case - a single table starting near cell A1.
When it's used: - First detector to run - Handles ~80% of real-world spreadsheets - Very fast (typically <50ms)
How it works: 1. Checks if data starts in the top-left region 2. Looks for continuous data without large gaps 3. Identifies clear boundaries 4. Returns high confidence if pattern matches
Example:
# Ideal for SimpleCaseDetector
| Name | Age | City |
|-------|-----|----------|
| Alice | 30 | New York |
| Bob | 25 | Boston |
| Carol | 35 | Chicago |
Configuration:
config = Config(
enable_simple_case_detection=True, # Enable/disable
simple_case_max_start_row=5, # How far from A1
simple_case_max_start_col=5 # to look for data
)
ExcelMetadataExtractor¶
Purpose: Uses Excel's native table definitions and named ranges.
When it's used: - Only for Excel files (.xlsx, .xlsm) - When Excel tables (ListObjects) are defined - When named ranges might contain tables
How it works: 1. Reads Excel's internal table definitions 2. Extracts named ranges 3. Converts to GridGulp table format 4. Highest confidence (1.0) for native tables
Benefits: - 100% accurate for defined tables - Preserves table names and settings - No computation needed
Configuration:
config = Config(
enable_excel_metadata=True, # Enable/disable
read_excel_table_styles=True, # Include style info
expand_excel_tables=False # Expand beyond defined range
)
IslandDetector¶
Purpose: Finds multiple disconnected data regions (islands) on a single sheet.
When it's used: - Multiple tables on one sheet - Tables not starting at A1 - Complex layouts with scattered data
How it works: 1. Creates binary mask of non-empty cells 2. Uses connected component analysis 3. Groups adjacent cells into islands 4. Filters by size and density
Example:
A1: Summary Stats F1: Product Sales
A2: Total: 1000 F2: Product | Revenue
A3: Average: 50 F3: Widget | $500
F4: Gadget | $300
F5: Tool | $200
Configuration:
config = Config(
enable_island_detection=True,
island_min_cells=5, # Minimum cells per island
island_max_gap=1, # Max gap between cells
island_density_threshold=0.3, # Min filled ratio
island_neighbor_distance=3 # Distance to merge islands
)
Multi-Header Detection¶
Purpose: Identifies tables with complex, multi-row headers including merged cells.
When it's used: - Hierarchical headers - Merged cells in headers - Financial reports with nested categories
How it works: 1. Analyzes formatting patterns in top rows 2. Detects merged cell regions 3. Builds header hierarchy 4. Maps columns to header paths
Example:
| Sales by Region |
| Q1 | Q2 | Q3 | Q4 |
| Jan|Feb| Apr|May| Jul|Aug| Oct |
|--------+--------+--------+-----|
| 100|120| 150|140| 180|190| 200 |
Configuration:
config = Config(
enable_multi_header_detection=True,
max_header_rows=10, # Max rows to check
header_confidence_threshold=0.7, # Min confidence
detect_merged_cells=True, # Handle merged cells
header_format_weight=0.4 # Formatting importance
)
Semantic Analysis¶
Purpose: Uses content analysis to understand table structure and boundaries.
When it's used: - After structural detection - To refine table boundaries - To identify special rows (totals, subtotals)
How it works: 1. Analyzes data types per column 2. Identifies consistent patterns 3. Detects summary rows 4. Finds natural boundaries
Features detected: - Header rows (bold, different type) - Total/subtotal rows - Section breaks - Data type consistency
Configuration:
config = Config(
enable_semantic_analysis=True,
semantic_blank_row_threshold=0.9, # Row is blank if 90%+ empty
semantic_total_keyword_list=[ # Total row keywords
'total', 'sum', 'subtotal',
'grand total', 'overall'
],
semantic_header_keyword_boost=0.2 # Confidence boost for headers
)
Confidence Scoring¶
Each detection method produces a confidence score:
Base Confidence¶
- SimpleCaseDetector: 0.85-0.95 (high for clear cases)
- ExcelMetadata: 1.0 (perfect for defined tables)
- IslandDetector: 0.5-0.9 (based on island quality)
- Multi-Header: 0.6-0.95 (based on header structure)
Confidence Adjustments¶
# Factors that increase confidence:
+ Large table size (more data points)
+ High data density (fewer gaps)
+ Consistent column types
+ Clear headers
+ Formatting patterns
# Factors that decrease confidence:
- Small table size
- Sparse data
- Mixed data types
- No clear headers
- Irregular shape
Method Selection Strategy¶
GridGulp automatically selects appropriate methods:
def select_detection_methods(file_info, sheet_data):
methods = []
# Always try simple case first (fast)
if config.enable_simple_case_detection:
methods.append(SimpleCaseDetector)
# Excel metadata if available
if file_info.is_excel and config.enable_excel_metadata:
methods.append(ExcelMetadataExtractor)
# Island detection for complex layouts
if not simple_case_found and config.enable_island_detection:
methods.append(IslandDetector)
# Header detection for tables with headers
if has_potential_headers and config.enable_multi_header_detection:
methods.append(MultiHeaderDetector)
# Semantic analysis to refine results
if config.enable_semantic_analysis:
methods.append(SemanticAnalyzer)
return methods
Performance Considerations¶
| Method | Speed | Memory | Best For |
|---|---|---|---|
| SimpleCaseDetector | Very Fast (<50ms) | Low | Single tables at A1 |
| ExcelMetadata | Very Fast (<10ms) | Low | Defined Excel tables |
| IslandDetector | Fast (<200ms) | Medium | Multiple tables |
| Multi-Header | Medium (<500ms) | Medium | Complex headers |
| Semantic | Slow (<1s) | High | Accuracy refinement |
Custom Detection Patterns¶
For specific use cases, combine methods:
Financial Reports¶
config = Config(
enable_multi_header_detection=True,
enable_semantic_analysis=True,
semantic_total_keyword_list=['total', 'sum', 'net', 'gross'],
header_format_weight=0.5
)
Scientific Data¶
config = Config(
enable_island_detection=True,
island_min_cells=10,
text_min_delimiter_consistency=0.8,
expected_header_rows=1
)
Simple CSV Files¶
config = Config(
enable_simple_case_detection=True,
enable_island_detection=False,
enable_multi_header_detection=False,
enable_semantic_analysis=False
)
Debugging Detection¶
Enable detailed logging to understand detection:
import logging
logging.basicConfig(level=logging.DEBUG)
# Detection logs will show:
# - Which methods were tried
# - Confidence scores from each
# - Why tables were accepted/rejected
# - Performance metrics
Next Steps¶
- Learn about configuration options to tune detection
- See file formats for format-specific detection
- Explore DataFrame extraction after detection