Gemma-first spreadsheet schematization for Excel, CSV, TSV, and text files.
GridGulp now supports two explicit execution modes:
gemma_first: the supported primary path. Heuristics propose candidate regions, then Gemma finalizes boundaries, names, and merges/splits/additions. Structured text runs first; normalized sheet images are used only as a Gemma fallback for visually ambiguous sheets. The resulting tables are then lifted into canonical entities, fields, records, and relationships.
heuristic_only: an explicit offline/debug escape hatch. This is useful for local debugging and for legacy test coverage, but it is not the main product path.
Only Gemma-family models are accepted for LLM execution.
pip install gridgulp
For local development:
uv pip install -e ".[dev]"
GridGulp loads config in this order:
1. gridgulp.toml
2. GRIDGULP_* environment variables
Start from gridgulp.toml.example.
Minimal Gemma-first config:
detection_strategy = "gemma_first"
[gemma]
enabled = true
base_url = "http://localhost:11434/v1"
model_id = "gemma4:26b"
timeout_seconds = 60
max_retries = 2
max_tokens = 1200
[vision]
enabled = true
mode = "fallback_only"
max_proposal_crops = 4
crop_margin_cells = 1
Validate your config first:
gridgulp doctor
Validate multimodal support explicitly against Ollama:
gridgulp --gemma-base-url http://localhost:11434/v1 --gemma-model-id gemma4:26b doctor
Schematize a file:
gridgulp schematize path/to/report.xlsx
Emit JSON:
gridgulp schematize path/to/report.xlsx --json
The lower-level table detector remains available when you need raw table boundaries:
gridgulp detect path/to/report.xlsx --save-debug-images .gridgulp-debug
Process a directory:
gridgulp schematize path/to/reports --recursive
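To preview which files a directory run might pick up, a rough standalone sketch is shown below. It is not how GridGulp walks directories: collect_inputs is a hypothetical helper, and the suffix filter simply reuses the readable extensions this README lists (.xlsb is left out because it is detected but not readable).

```python
from pathlib import Path

# Extensions this README lists as readable.
READABLE_SUFFIXES = {".xlsx", ".xls", ".xlsm", ".csv", ".tsv", ".txt"}


def collect_inputs(root: Path, recursive: bool = False) -> list[Path]:
    """List readable files directly under root, or in the whole tree if recursive."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        p
        for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower() in READABLE_SUFFIXES
    )


for path in collect_inputs(Path("."), recursive=True):
    print(path)
```

Matching on the lowercased suffix is a choice made for this sketch; whether GridGulp itself treats extensions case-insensitively is not stated here.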
Gemma-first schematization:
from gridgulp import GridGulp
from gridgulp.config import Config
config = Config.from_toml("gridgulp.toml")
gg = GridGulp(config=config)
result = gg.schematize_sync("sales_report.xlsx")
print(result.total_entities)
for entity in result.entities:
    print(entity.suggested_name, entity.record_count_estimate)
    for field in entity.fields:
        print(" ", field.id, field.semantic_type, field.physical_type)
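If you would rather work with plain dictionaries than print attributes, the same traversal can be folded into a small helper. This is a sketch that relies only on the attributes used in the example above (entities, suggested_name, record_count_estimate, fields, id, semantic_type, physical_type); summarize_entities is a hypothetical name, and the stand-in object below uses made-up values so the snippet runs without GridGulp installed.

```python
from types import SimpleNamespace


def summarize_entities(result) -> dict[str, dict]:
    """Map each entity name to its record estimate and a field-type table."""
    summary = {}
    for entity in result.entities:
        summary[entity.suggested_name] = {
            "records": entity.record_count_estimate,
            "fields": {
                f.id: (f.semantic_type, f.physical_type) for f in entity.fields
            },
        }
    return summary


# Stand-in for a real schematization result; all values are invented.
fake_result = SimpleNamespace(
    entities=[
        SimpleNamespace(
            suggested_name="orders",
            record_count_estimate=120,
            fields=[
                SimpleNamespace(
                    id="order_id",
                    semantic_type="identifier",
                    physical_type="integer",
                )
            ],
        )
    ]
)
print(summarize_entities(fake_result))
```

With a real run you would pass the object returned by gg.schematize_sync(...) instead of fake_result.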
Lower-level table debugging:
from gridgulp import GridGulp
from gridgulp.config import Config, DetectionStrategy
config = Config(detection_strategy=DetectionStrategy.HEURISTIC_ONLY)
gg = GridGulp(config=config)
result = gg.detect_tables_sync("sales_report.xlsx")
The primary output is WorkbookSchemaResult, which includes the canonical entities (with their fields), records, and relationships produced by schematization.
DetectionResult remains available as a lower-level primitive for debugging table boundaries.
For gemma_first, GridGulp runs:
keep, merge, split, add, and reject decisions
TableInfo objects with clipped ranges, deduplicated overlaps, header metadata, and semantic summaries
Supported file formats: .xlsx, .xls, .xlsm, .csv, .tsv, .txt
.xlsb is detected but not supported for reading.
gridgulp doctor
gridgulp schematize examples/spreadsheets/simple/basic_table.csv --json
pytest -q tests/test_detection_agent.py
To compare Gemma-first schema output against the deterministic baseline, switch detection_strategy to heuristic_only in config or via env:
GRIDGULP_DETECTION_STRATEGY=heuristic_only gridgulp schematize file.xlsx --json
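The comparison can also be scripted. The sketch below runs the same CLI command under both strategies and reports which top-level keys differ. It assumes that --json writes a single JSON object to stdout and that GRIDGULP_DETECTION_STRATEGY accepts gemma_first as well as heuristic_only; neither is confirmed here. The helper names are hypothetical, and the runner parameter exists only so the function can be exercised without GridGulp installed.

```python
import json
import os
import subprocess

_MISSING = object()


def schematize_json(path: str, strategy: str, runner=subprocess.run) -> dict:
    """Run `gridgulp schematize <path> --json` with the strategy set via env."""
    env = {**os.environ, "GRIDGULP_DETECTION_STRATEGY": strategy}
    completed = runner(
        ["gridgulp", "schematize", path, "--json"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: stdout holds exactly one JSON object.
    return json.loads(completed.stdout)


def changed_keys(baseline: dict, candidate: dict) -> list[str]:
    """Top-level keys whose values differ or exist on one side only."""
    keys = set(baseline) | set(candidate)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


def compare_strategies(path: str, runner=subprocess.run) -> list[str]:
    """Diff heuristic_only output against gemma_first output for one file."""
    baseline = schematize_json(path, "heuristic_only", runner)
    candidate = schematize_json(path, "gemma_first", runner)
    return changed_keys(baseline, candidate)
```

Calling compare_strategies("file.xlsx") requires gridgulp on your PATH and, for the gemma_first run, a reachable Gemma endpoint.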