Texture is a framework for extracting structured data from PDF documents by first identifying document structure (titles, sections, lists, tables) and then running extraction rules on top of that structure.

PDFs do not expose structure in a reliable way. Hard-coded scripts tend to break as soon as layouts change, while machine-learning approaches usually require large labeled datasets.
Texture takes a middle approach: document structure is inferred using many small heuristics, improved with human feedback, and then used as a stable layer for data extraction through simple rules.

Texture first identifies document structure by running multiple heuristics that analyze layout, fonts, alignment, and content to detect elements such as titles, paragraphs, lists, and figures.

Human feedback is then used to improve accuracy. Users can manually annotate documents using SelfLabel, or rely on CrowdCollect, which generates Mechanical Turk tasks with training and validation steps. These annotations are stored as ground truth.
Heuristics are evaluated against ground truth using precision and recall, then combined using weighted voting to improve results across document collections.

Once structure is identified, users write simple extraction rules (SBEL) that operate on structure rather than raw text—for example, extracting list items under a section titled Ingredients.