Extract Structured Data

When building browser automations in code, you can extract structured data from any page using page.extract with a JSON schema, or by passing a data_extraction_schema to page.agent.run_task. By default, Skyvern returns extracted data in whatever format makes sense for the action. Pass a schema to enforce a specific shape using JSON Schema. If you’re using the Cloud UI workflow editor instead, extraction works through the Extract block. See Block Types and Configuration for setup.

Define a schema

Pass a JSON Schema object to page.extract via the schema parameter, or to run_task via data_extraction_schema:

page.extract (browser automation)
run_task

data = await page.extract(
    "Get the title of the top post",
    schema={
        "type": "object",
        "properties": {
            "title": {
                "type": "string",
                "description": "The title of the top post"
            }
        }
    },
)

result = await client.run_task(
    prompt="Get the title of the top post",
    url="https://news.ycombinator.com",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "title": {
                "type": "string",
                "description": "The title of the top post"
            }
        }
    }
)

The description field in each property helps Skyvern understand what data to extract. Be specific.

description fields drive extraction quality. Vague descriptions like “the data” produce vague results. Be specific: “The product price in USD, without currency symbol.”

Schema format

Skyvern uses standard JSON Schema. Common types:

Type	JSON Schema	Example value
String	`{"type": "string"}`	`"Hello world"`
Number	`{"type": "number"}`	`19.99`
Integer	`{"type": "integer"}`	`42`
Boolean	`{"type": "boolean"}`	`true`
Array	`{"type": "array", "items": {...}}`	`[1, 2, 3]`
Object	`{"type": "object", "properties": {...}}`	`{"key": "value"}`

A schema doesn’t guarantee all fields are populated. If the data isn’t on the page, fields return null. Design your code to handle missing values.

Build your schema

Use the interactive builder to generate a schema, then copy it into your code.

Examples

Single value

Extract one piece of information, such as the current price of Bitcoin:

result = await client.run_task(
    prompt="Get the current Bitcoin price in USD",
    url="https://coinmarketcap.com/currencies/bitcoin/",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "price": {
                "type": "number",
                "description": "Current Bitcoin price in USD"
            }
        }
    }
)

Output (when completed):

{
  "price": 104521.37
}

List of items

Extract multiple items with the same structure, such as the top posts from a news site:

result = await client.run_task(
    prompt="Get the top 5 posts",
    url="https://news.ycombinator.com",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "posts": {
                "type": "array",
                "description": "Top 5 posts from the front page",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {
                            "type": "string",
                            "description": "Post title"
                        },
                        "points": {
                            "type": "integer",
                            "description": "Number of points"
                        },
                        "url": {
                            "type": "string",
                            "description": "Link to the post"
                        }
                    }
                }
            }
        }
    }
)

Output (when completed):

{
  "posts": [
    {
      "title": "Running Claude Code dangerously (safely)",
      "points": 342,
      "url": "https://blog.emilburzo.com/2026/01/running-claude-code-dangerously-safely/"
    },
    {
      "title": "Linux kernel framework for PCIe device emulation",
      "points": 287,
      "url": "https://github.com/cakehonolulu/pciem"
    },
    {
      "title": "I'm addicted to being useful",
      "points": 256,
      "url": "https://www.seangoedecke.com/addicted-to-being-useful/"
    },
    {
      "title": "Level S4 solar radiation event",
      "points": 198,
      "url": "https://www.swpc.noaa.gov/news/g4-severe-geomagnetic-storm"
    },
    {
      "title": "WebAssembly Text Format parser performance",
      "points": 176,
      "url": "https://blog.gplane.win/posts/improve-wat-parser-perf.html"
    }
  ]
}

Arrays without limits extract everything visible on the page. Specify limits in your prompt (e.g., “top 5 posts”) or the array description to control output size.

Nested objects

Extract hierarchical data, such as a product with its pricing and availability:

result = await client.run_task(
    prompt="Get product details including pricing and availability",
    url="https://www.amazon.com/dp/B0EXAMPLE",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "product": {
                "type": "object",
                "description": "Product information",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "Product name"
                    },
                    "pricing": {
                        "type": "object",
                        "description": "Pricing details",
                        "properties": {
                            "current_price": {
                                "type": "number",
                                "description": "Current price in USD"
                            },
                            "original_price": {
                                "type": "number",
                                "description": "Original price before discount"
                            },
                            "discount_percent": {
                                "type": "integer",
                                "description": "Discount percentage"
                            }
                        }
                    },
                    "availability": {
                        "type": "object",
                        "description": "Stock information",
                        "properties": {
                            "in_stock": {
                                "type": "boolean",
                                "description": "Whether the item is in stock"
                            },
                            "delivery_estimate": {
                                "type": "string",
                                "description": "Estimated delivery date"
                            }
                        }
                    }
                }
            }
        }
    }
)

Output (when completed):

{
  "product": {
    "name": "Wireless Bluetooth Headphones",
    "pricing": {
      "current_price": 79.99,
      "original_price": 129.99,
      "discount_percent": 38
    },
    "availability": {
      "in_stock": true,
      "delivery_estimate": "Tomorrow, Jan 21"
    }
  }
}

Accessing extracted data

How you access extracted data depends on which method you used.

page.extract (browser automation)

page.extract returns the extracted data directly as the return value:

data = await page.extract(
    "Get the top post",
    schema={
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Post title"},
            "points": {"type": "integer", "description": "Points"}
        }
    },
)

# data is the extracted result directly
print(f"Title: {data['title']}")
print(f"Points: {data['points']}")

run_task (async task)

The extracted data appears in the output field of the completed run. Poll until the task reaches a terminal state, then access the output.

result = await client.run_task(
    prompt="Get the top post",
    url="https://news.ycombinator.com",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Post title"},
            "points": {"type": "integer", "description": "Points"}
        }
    }
)

run_id = result.run_id

while True:
    run = await client.get_run(run_id)

    if run.status in ["completed", "failed", "terminated", "timed_out", "canceled"]:
        break

    await asyncio.sleep(5)

# Access the extracted data
print(f"Output: {run.output}")

If using webhooks, the same output field appears in the webhook payload.

Next steps

Actions Reference

All available page actions and agent methods

Build a Browser Automation

Launch a browser, navigate pages, and extract data

Getting Started

Core Features

Browser Automation

Handling Authentication

Optimization

Going to Production

Debugging

Self-Hosted Deployment

Extract Structured Data

Define a schema

Schema format

Build your schema

Examples

Single value

List of items

Nested objects

Accessing extracted data

page.extract (browser automation)

run_task (async task)

Next steps

Actions Reference

Build a Browser Automation

Getting Started

Core Features

Browser Automation

Handling Authentication

Optimization

Going to Production

Debugging

Self-Hosted Deployment

​Define a schema

​Schema format

​Build your schema

​Examples

​Single value

​List of items

​Nested objects

​Accessing extracted data

​page.extract (browser automation)

​run_task (async task)

​Next steps

Actions Reference

Build a Browser Automation

Define a schema

Schema format

Build your schema

Examples

Single value

List of items

Nested objects

Accessing extracted data

page.extract (browser automation)

run_task (async task)

Next steps