u/Adventurous-Bus7657

I’m working on extracting structured data from PDFs using an LLM, and I’m running into a schema design issue with LanceDB.

The problem is that LLM outputs are not type-consistent. For example, a field might sometimes be a number (123.45), but other times be "N/A" or some descriptive text.

In my Pydantic schema, I defined a flexible type like this:

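from pydantic import Field
from lancedb.pydantic import LanceModel

# StrictBaseModel is a project-specific Pydantic base class (definition not shown).
# Any field may arrive as a number, free text, or nothing at all.
SchemaFieldValue = float | str | None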

class StudyExtractionMetadata(StrictBaseModel):
    study_title: SchemaFieldValue = None
    study_category: SchemaFieldValue = None
    study_objective: SchemaFieldValue = None
    row_kind: SchemaFieldValue = None

class StructureDataRowSchema(LanceModel):
    doc_id: str
    doc_name: str
    study_extraction_metadata: StudyExtractionMetadata = Field(default_factory=StudyExtractionMetadata)

Then I insert into LanceDB like this:

if structured_row is not None:
    append_rows_to_lancedb(
        database=database,
        table_name=database.structured_data_table,
        rows=[structured_row],
        schema=StructureDataRowSchema,
    )

My questions:

  1. Is my understanding correct that LanceDB won’t handle float | str well in the same column?
  2. What’s the best practice for storing LLM-extracted fields with inconsistent types? Store everything as string?
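
For context on question 2, here is a minimal sketch of one commonly used pattern: keep the original text in a string column and add a nullable float column next to it, so each column holds a single type. This assumes the table is Arrow-backed with one declared type per column. The names `split_value`, `MISSING_MARKERS`, `raw`, and `num` are hypothetical and not part of LanceDB or Pydantic.

```python
from __future__ import annotations

import math

# Hypothetical list of placeholder strings to treat as "no value"; adjust to your data.
MISSING_MARKERS = {"", "n/a", "na", "none", "null", "-"}


def split_value(value: float | int | str | None) -> dict[str, float | str | None]:
    """Split one extracted value into a text part ("raw") and a nullable number ("num")."""
    if value is None:
        return {"raw": None, "num": None}

    # bool is a subclass of int, so handle it first and keep it as text only.
    if isinstance(value, bool):
        return {"raw": str(value), "num": None}

    if isinstance(value, (int, float)):
        num = float(value)
        # NaN and infinity are kept as text but not as a usable number.
        return {"raw": str(value), "num": num if math.isfinite(num) else None}

    text = str(value).strip()
    if text.lower() in MISSING_MARKERS:
        return {"raw": None, "num": None}

    try:
        num = float(text)
    except ValueError:
        # Descriptive text: keep it, leave the numeric column empty.
        return {"raw": text, "num": None}

    return {"raw": text, "num": num if math.isfinite(num) else None}


if __name__ == "__main__":
    for sample in (123.45, "123.45", "N/A", "Phase II trial", None):
        print(repr(sample), "->", split_value(sample))
```

With this shape, each logical field would map to two columns (for example a `str | None` column and a `float | None` column), which avoids a `float | str` union in the schema while keeping the original text for auditing.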

Would really appreciate any advice or patterns you’ve used!

u/Adventurous-Bus7657 — 16 days ago