Basic Usage¶
This guide covers the fundamental concepts and common usage patterns for GridGulp.
Core Concepts¶
Detection Result¶
When you detect tables in a file, GridGulp returns a DetectionResult object containing:
- file_info: Information about the processed file
- sheets: List of sheets with detected tables
- file_data: Raw data for extraction
- metadata: Detection statistics and timing
Table Information¶
Each detected table includes:
- range: Cell boundaries (start/end row/column)
- confidence: Detection confidence score (0.0-1.0)
- shape: Tuple of (rows, columns)
- headers: List of column headers (if detected)
- detection_method: Algorithm used for detection
Basic Workflows¶
Single File Processing¶
from gridgulp import GridGulp
async def process_single_file():
gg = GridGulp()
# Detect tables
result = await gg.detect_tables("report.xlsx")
# Check if any tables were found
if result.total_tables == 0:
print("No tables found!")
return
# Process each sheet
for sheet in result.sheets:
print(f"\nSheet: {sheet.name}")
# Process each table
for table in sheet.tables:
print(f" Found table at {table.range.excel_range}")
# Extract as DataFrame
df = await gg.extract_dataframe(result.file_data, table)
# Do something with the data
process_dataframe(df)
Batch Processing¶
Process multiple files efficiently:
async def process_multiple_files(file_paths: list[str]):
gg = GridGulp()
# Process files in batch
results = await gg.batch_detect(file_paths)
# Process results
for result in results:
if result.error:
print(f"Error in {result.file_info.path}: {result.error}")
continue
print(f"\n{result.file_info.path.name}:")
print(f" Tables: {result.total_tables}")
print(f" Sheets: {len(result.sheets)}")
Working with Different Table Types¶
Simple Tables¶
Most spreadsheets contain simple tables that start near cell A1:
# GridGulp automatically handles simple cases
result = await gg.detect_tables("simple_data.xlsx")
# Usually returns one table per sheet with high confidence
for sheet in result.sheets:
if sheet.tables:
table = sheet.tables[0]
if table.confidence > 0.9:
print(f"High confidence table found: {table.range.excel_range}")
Multiple Tables per Sheet¶
When sheets contain multiple tables:
result = await gg.detect_tables("multi_table_sheet.xlsx")
for sheet in result.sheets:
if len(sheet.tables) > 1:
print(f"{sheet.name} contains {len(sheet.tables)} tables:")
# Sort tables by position (top to bottom, left to right)
sorted_tables = sorted(
sheet.tables,
key=lambda t: (t.range.start_row, t.range.start_col)
)
for i, table in enumerate(sorted_tables, 1):
print(f" Table {i}: {table.range.excel_range}")
Tables with Merged Cells¶
Handle complex headers with merged cells:
from gridgulp import Config
# Enable merged cell detection
config = Config(detect_merged_cells=True)
gg = GridGulp(config=config)
result = await gg.detect_tables("merged_headers.xlsx")
for sheet in result.sheets:
for table in sheet.tables:
if table.has_merged_cells:
print(f"Table with merged cells at {table.range.excel_range}")
# Extract with special handling
df = await gg.extract_dataframe(
result.file_data,
table,
handle_merged_cells=True
)
Data Extraction Options¶
Basic Extraction¶
Advanced Extraction¶
# Extract with options
df = await gg.extract_dataframe(
result.file_data,
table,
include_headers=True, # Include headers in DataFrame
infer_types=True, # Infer column data types
clean_values=True, # Clean whitespace and formatting
handle_merged_cells=True, # Unmerge cells properly
parse_dates=True # Parse date columns
)
Direct Cell Access¶
For custom processing, access cells directly:
# Get sheet data
sheet_data = result.file_data.sheets[0]
# Access specific cell
cell = sheet_data.get_cell(row=5, col=2)
if cell:
print(f"Value: {cell.value}")
print(f"Format: {cell.data_type}")
print(f"Bold: {cell.is_bold}")
Error Handling¶
Common Errors¶
from gridgulp import (
GridGulp,
FileNotFoundError,
ReaderError,
UnsupportedFormatError,
DetectionError
)
gg = GridGulp()
try:
result = await gg.detect_tables("data.xlsx")
except FileNotFoundError:
print("File not found!")
except UnsupportedFormatError as e:
print(f"Unsupported format: {e}")
except ReaderError as e:
print(f"Could not read file: {e}")
except DetectionError as e:
print(f"Detection failed: {e}")
Partial Results¶
GridGulp tries to return partial results when possible:
result = await gg.detect_tables("partially_corrupted.xlsx")
# Check for sheet-level errors
for sheet in result.sheets:
if sheet.error:
print(f"Error in sheet {sheet.name}: {sheet.error}")
else:
print(f"Successfully processed {sheet.name}")
Performance Tips¶
1. Use Appropriate Detection Methods¶
from gridgulp import Config
# For simple files, disable complex detection
config = Config(
enable_island_detection=False,
enable_multi_header_detection=False
)
2. Set Reasonable Limits¶
config = Config(
max_tables_per_sheet=10, # Don't look for too many tables
timeout_seconds=30 # Set timeout for large files
)
3. Configure for Your Use Case¶
# Use appropriate settings for your files
config = Config() # Default configuration works well for most files
Logging and Debugging¶
Enable detailed logging for troubleshooting:
import logging
# Set up logging
logging.basicConfig(level=logging.DEBUG)
# GridGulp will now output detailed logs
gg = GridGulp()
result = await gg.detect_tables("problem_file.xlsx")
Next Steps¶
- Learn about configuration options for fine-tuning
- Explore detection methods in detail
- See DataFrame extraction for data processing
- Read about file format support