u/BricksterJ

Hey r/databricks!

Native Excel ingestion on Databricks is now Generally Available across AWS, Azure, and GCP.

With this release, you can ingest, parse, and query .xls / .xlsx / .xlsm files directly.

Public docs: https://docs.databricks.com/aws/en/query/formats/excel

📂 What is it?

Native Excel support that lets you:

  • Directly read .xls, .xlsx, and .xlsm files using Spark (spark.read.excel(...)) or SQL (read_files, COPY INTO).
  • Upload Excel files through the "Create or modify table" UI and land them as Delta.
  • Specify exact sheets and cell ranges (e.g., "Sheet1!A2:D10") for complex layouts.
  • Infer schema, headers, and data types automatically, or bring your own.
  • Stream Excel files with Auto Loader using cloudFiles.format = "excel".
  • List sheets in a workbook programmatically before ingesting.

🤷 Why?

Until now, Databricks didn't have a native Excel reader. That meant writing custom Python with pandas / openpyxl to convert Excel → DataFrame → Delta, manually exporting sheets to CSV before you could ingest them, or giving up on workflows because the Databricks file-upload UI rejected .xlsx.
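For reference, the old workaround usually looked something like this (a minimal sketch; the path and table name are placeholders, and note pandas pulls the whole workbook into driver memory):

import pandas as pd

# Pre-GA workaround: parse the workbook with pandas on the driver,
# convert to a Spark DataFrame, then land it as Delta.
pdf = pd.read_excel("<path to excel file>", sheet_name="Sheet1")
df = spark.createDataFrame(pdf)
df.write.mode("overwrite").saveAsTable("<catalog>.<schema>.my_table")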

GA makes Excel a first-class file format across Spark, SQL, Auto Loader, and the table-creation UI. It also opens the door to Excel ingestion via our managed file connectors (SharePoint, Google Drive, SFTP, and more coming soon).

🧑‍💻 How do I try it?

1️⃣ Requirements

  • Databricks Runtime 18.1 or above.

2️⃣ Try it in the UI

  • Click New → Add Data → Create or modify table.
  • Upload an .xls, .xlsx, or .xlsm file.
  • Pick the sheet. Adjust header rows or cell range if needed.
  • Preview the inferred schema.
  • Click Create table. It lands as a Delta table in Unity Catalog.

3️⃣ Try it in Spark (batch)

# Read the first sheet of a workbook
df = spark.read.excel("<path to excel file>")

# Use a header row and a specific sheet + range
df = (
  spark.read
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A1:E10")
    .excel("<path to excel directory or file>")
)

df.write.mode("overwrite").saveAsTable("<catalog>.<schema>.my_table")

4️⃣ Try it in SQL with read_files

CREATE TABLE my_sheet_table AS
SELECT * FROM read_files(
  "<path to excel directory or file>",
  format              => "excel",
  headerRows          => 1,
  dataAddress         => "Sheet1!A2:D10",
  schemaEvolutionMode => "none"
);

5️⃣ Try it with COPY INTO

COPY INTO excel_demo_table
FROM "<path to excel directory or file>"
FILEFORMAT = EXCEL;
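If your sheets have a header row, you can presumably pass the same reader options through FORMAT_OPTIONS, as COPY INTO supports for other file formats. A hedged sketch (I'm assuming the Excel options from the table below pass through; FORMAT_OPTIONS values are strings), wrapped in spark.sql to stay in Python:

spark.sql("""
  COPY INTO excel_demo_table
  FROM '<path to excel directory or file>'
  FILEFORMAT = EXCEL
  FORMAT_OPTIONS ('headerRows' = '1', 'dataAddress' = 'Sheet1')
""")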

6️⃣ Try it with Auto Loader (streaming)

df = (
  spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "excel")
    .option("cloudFiles.inferColumnTypes", True)
    .option("headerRows", 1)
    .option("cloudFiles.schemaLocation", "<schema location>")
    .load("<path to excel directory or file>")
)

(df.writeStream
  .format("delta")
  .option("checkpointLocation", "<checkpoint path>")
  .table("<catalog>.<schema>.excel_stream"))

7️⃣ List sheets in a workbook

sheets = (
  spark.read
    .option("operation", "listSheets")
    .excel("<path to workbook>")
)
sheets.show()  # returns sheetIndex, sheetName
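
Until multi-sheet ingestion lands (see "What's next?" below), you can combine listSheets with dataAddress to roll your own one-table-per-sheet load. A rough sketch, with hypothetical table-name sanitization:

# Hypothetical multi-sheet ingest: one Delta table per sheet.
for row in sheets.collect():
    sheet = row["sheetName"]
    df = (
      spark.read
        .option("headerRows", 1)
        .option("dataAddress", sheet)  # a bare sheet name selects the whole sheet
        .excel("<path to workbook>")
    )
    table = sheet.lower().replace(" ", "_")  # naive; adapt to your naming rules
    df.write.mode("overwrite").saveAsTable(f"<catalog>.<schema>.{table}")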

🎛️ Supported options

| Option | Description |
| --- | --- |
| dataAddress | Cell range in Excel syntax, e.g. "MySheet!C5:H10", "C5:H10", "Sheet1". Defaults to all valid cells on the first sheet. |
| headerRows | Number of header rows inside dataAddress (0 or 1). Default: 0. |
| operation | "readSheet" (default) or "listSheets". |
| dateFormat | Custom date format. Default: yyyy-MM-dd. |
| timestampNTZFormat | Custom timestamp (no time zone) format. Default: yyyy-MM-dd'T'HH:mm:ss[.SSS]. |
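
As a quick example, if a sheet stores European-style dates like 31/12/2025, you could override the parse format (a sketch using the options above):

df = (
  spark.read
    .option("headerRows", 1)
    .option("dateFormat", "dd/MM/yyyy")  # matches values like 31/12/2025
    .excel("<path to excel file>")
)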

⚠️ Known limitations + behaviors

  • Password-protected files are not supported.
  • One header row max (headerRows = 0 or 1).
  • "Strict OOXML" format is not supported.
  • Schema evolution is not supported with Auto Loader streaming.
  • Merged cells: only the top-left value is retained; other cells in the merge become NULL.
  • Duplicate column headers are not supported (workaround: headerRows = 0 and rename post-read; see the sketch after this list).
  • .xlsm macros are not evaluated (computed values come through, but macros don't run).
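
The duplicate-header workaround from the list above, sketched out: skip the header row via dataAddress, read with headerRows = 0, then assign your own unique names (the column names here are made up):

df = (
  spark.read
    .option("headerRows", 0)
    .option("dataAddress", "Sheet1!A2:D10")  # start below the duplicated header row
    .excel("<path to excel file>")
)
df = df.toDF("id", "amount_q1", "amount_q2", "notes")  # your own unique names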

⏭️ What's next?

  • Writing to Excel files.
  • Multi-sheet → multi-table ingestion in a single pass.
  • .xlsb binary format support.
  • Excel ingestion via managed connectors (SharePoint, Google Drive, SFTP, OneDrive, Box, Dropbox).

💬 Feedback

  • Drop a comment below or reach out to your Databricks account team. We'd love to hear which Excel workflows you want us to prioritize next.