Datasets

A dataset is a collection of inputs and expected outputs that is used to test your application. Both UI-based and SDK-based experiments support Langfuse Datasets.

Langfuse Dataset View

Why use datasets?

  • Create test cases for your application with real production traces
  • Collaboratively create and collect dataset items with your team
  • Have a single source of truth for your test data

Get Started

Creating a dataset

Datasets have a name that is unique within a project.

langfuse.create_dataset(
    name="<dataset_name>",
    # optional description
    description="My first dataset",
    # optional metadata
    metadata={
        "author": "Alice",
        "date": "2022-01-01",
        "type": "benchmark"
    }
)

See Python SDK docs for details on how to initialize the Python client.
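
For reference, a minimal initialization sketch (assuming the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables are set) looks like this:

from langfuse import Langfuse

# Reads credentials from LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and
# LANGFUSE_HOST; keys can also be passed as constructor arguments.
langfuse = Langfuse()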

Upload or create new dataset items

Dataset items can be added to a dataset by providing the input and optionally the expected output. If preferred, dataset items can be imported using the CSV uploader in the Langfuse UI.

langfuse.create_dataset_item(
    dataset_name="<dataset_name>",
    # any python object or value, optional
    input={
        "text": "hello world"
    },
    # any python object or value, optional
    expected_output={
        "text": "hello world"
    },
    # metadata, optional
    metadata={
        "model": "llama3",
    }
)

See Python SDK docs for details on how to initialize the Python client.

Dataset Folders

Datasets can be organized into virtual folders to group datasets that serve similar use cases. To create a folder, add slashes (/) to a dataset name. The UI automatically displays every name segment before a / as a folder.

Create and fetch a dataset in a folder

Use the Langfuse UI or SDK to create and fetch a dataset in a folder by adding a slash (/) to a dataset name.

dataset_name = "evaluation/qa-dataset"
 
# When creating a dataset, use the full dataset name
langfuse.create_dataset(
    name=dataset_name,
)
 
# When fetching a dataset in a folder, use the full dataset name
langfuse.get_dataset(
    name=dataset_name
)
 

This creates and fetches a dataset named qa-dataset in a folder named evaluation. The full dataset name remains evaluation/qa-dataset.

URL Encoding: When using dataset names with slashes as path parameters in the API or JS/TS SDK, use URL encoding. For example, in TypeScript: encodeURIComponent(name).
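
For example, when calling the REST API directly from Python, a sketch using urllib.parse.quote (the endpoint path and requests-based call are illustrative; check the API reference for the exact route) could look like this:

import urllib.parse

import requests

dataset_name = "evaluation/qa-dataset"

# Encode the slash so it is not treated as a path separator
encoded_name = urllib.parse.quote(dataset_name, safe="")

# Illustrative direct REST call; host, route, and auth depend on your setup
response = requests.get(
    f"https://cloud.langfuse.com/api/public/v2/datasets/{encoded_name}",
    auth=("<public_key>", "<secret_key>"),
)
dataset = response.json()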

Versioning

To access dataset versions in the Langfuse UI, navigate to Datasets > select a specific dataset > Items tab. On this page you can toggle the version view.

Every add, update, delete, or archive of dataset items produces a new dataset version. Versions track changes over time using timestamps.

GET APIs return the latest version at query time by default. You can fetch datasets at specific version timestamps using the version parameter.

Versioning applies to dataset items only, not dataset schemas. Dataset schema changes do not create new versions.

Fetch dataset at a specific version

You can retrieve a dataset as it existed at a specific point in time by providing a version timestamp. This returns only the items that existed at that timestamp.

from langfuse import get_client
from datetime import datetime, timezone
 
langfuse = get_client()
 
# Capture dataset state as of 2025-12-15 at 06:30:00 UTC
version_timestamp = datetime(2025, 12, 15, 6, 30, 0, tzinfo=timezone.utc)
 
# Fetch dataset at version timestamp
dataset_at_version = langfuse.get_dataset(
    name="my-dataset",
    version=version_timestamp
)
 
# Fetch latest version
dataset_latest = langfuse.get_dataset(name="my-dataset")

Run experiments on versioned datasets

You can run experiments directly on versioned datasets. This is useful for comparing how your model performs against different dataset versions or reproducing experiment results with the exact dataset state from a specific point in time.

from datetime import datetime, timezone
from langfuse import Langfuse
 
langfuse = Langfuse()
 
version_timestamp = datetime(2025, 12, 15, 6, 30, 0, tzinfo=timezone.utc)
 
# Fetch versioned dataset 
versioned_dataset = langfuse.get_dataset("qa-dataset", version=version_timestamp)
 
# Run experiment on the versioned dataset
def my_llm_application(*, item, **kwargs):
    # Your LLM application logic here
    # For this example, we'll just return the expected output
    return item.expected_output
 
result = versioned_dataset.run_experiment(
    name="Baseline Experiment v1",
    description="Running on dataset v1",
    task=my_llm_application
)

This approach ensures reproducibility by allowing you to:

  • Re-run experiments on historical dataset versions even after items are updated or deleted
  • Compare model performance before and after dataset changes (see the sketch after this list)
  • Maintain experiment consistency and reproduce exact results from previous runs
  • Test improvements against the same baseline dataset version
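
For example, a minimal sketch (reusing my_llm_application from above; the two timestamps are placeholders) that runs the same experiment against the dataset before and after a change:

from datetime import datetime, timezone
from langfuse import Langfuse

langfuse = Langfuse()

# Placeholder timestamps marking the dataset state before and after a change
version_before = datetime(2025, 12, 1, 0, 0, 0, tzinfo=timezone.utc)
version_after = datetime(2025, 12, 15, 6, 30, 0, tzinfo=timezone.utc)

for label, version in [("before", version_before), ("after", version_after)]:
    dataset = langfuse.get_dataset("qa-dataset", version=version)
    dataset.run_experiment(
        name=f"Baseline Experiment ({label} dataset change)",
        task=my_llm_application,
    )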

Schema Enforcement

Optionally add JSON Schema validation to your datasets to ensure all dataset items conform to a defined structure. This helps maintain data quality, catch errors early, and ensure consistency across your team.

You can define JSON schemas for input and/or expectedOutput fields when creating or updating a dataset. Once set, all dataset items are automatically validated against these schemas. Valid items are accepted; invalid items are rejected with detailed error messages describing the validation issue.

langfuse.create_dataset(
    name="qa-conversations",
    input_schema={
        "type": "object",
        "properties": {
            "messages": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "role": {"type": "string", "enum": ["user", "assistant", "system"]},
                        "content": {"type": "string"}
                    },
                    "required": ["role", "content"]
                }
            }
        },
        "required": ["messages"]
    },
    expected_output_schema={
        "type": "object",
        "properties": {"response": {"type": "string"}},
        "required": ["response"]
    }
)
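
A dataset item that conforms to these schemas can then be added as usual; for example (a sketch reusing the schema above):

langfuse.create_dataset_item(
    dataset_name="qa-conversations",
    # input must match input_schema: an object with a "messages" array
    input={
        "messages": [
            {"role": "user", "content": "What is Langfuse?"}
        ]
    },
    # expected_output must match expected_output_schema
    expected_output={"response": "Langfuse is an open-source LLM engineering platform."}
)

An item whose input or expected output does not match the schema is rejected at creation time with a validation error.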

Create synthetic datasets

Often you want to bootstrap your dataset with synthetic examples for testing your application. LLMs are great at generating these when prompted with common questions and tasks.

To get started, have a look at this cookbook for examples of how to generate synthetic datasets.
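
As a rough illustration, synthetic items could be generated with an LLM and uploaded to a dataset. The sketch below assumes the OpenAI Python SDK, a placeholder model name and prompt, and a hypothetical synthetic-qa dataset:

import json

from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()
openai_client = OpenAI()

# Ask an LLM for a handful of synthetic user questions (model and prompt are placeholders)
completion = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Return a JSON array of 5 common user questions about invoices."
    }],
)
questions = json.loads(completion.choices[0].message.content)

# Add each generated question as a dataset item
for question in questions:
    langfuse.create_dataset_item(
        dataset_name="synthetic-qa",
        input={"text": question},
    )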

Create items from production data

A common workflow is to select production traces where the application did not perform as expected, and then have an expert add the expected output so you can test new versions of your application on the same data.

langfuse.create_dataset_item(
    dataset_name="<dataset_name>",
    input={ "text": "hello world" },
    expected_output={ "text": "hello world" },
    # link to a trace
    source_trace_id="<trace_id>",
    # optional: link to a specific span, event, or generation
    source_observation_id="<observation_id>"
)

Batch add observations to datasets

You can batch add multiple observations to a dataset directly from the observations table. This is useful for quickly building test datasets from production data.

The field mapping system gives you control over how observation data is transformed into dataset items. You can use the entire field as-is (e.g., map the full observation input to the dataset item input), extract specific values using JSON path expressions or build custom objects from multiple fields.

  1. Navigate to the Observations table
  2. Use filters to find relevant observations
  3. Select observations using the checkboxes
  4. Click Actions > Add to dataset
  5. Choose to create a new dataset or select an existing one
  6. Configure field mapping to control how observation data maps to dataset item fields
  7. Preview the mapping and confirm

Batch operations run in the background with support for partial success. If some observations fail validation against a dataset schema, valid items are still added and errors are logged for review. You can monitor progress in Settings > Batch Actions.

Edit/archive dataset items

You can edit or archive dataset items. Archiving items will remove them from future experiment runs.

You can upsert items by providing the id of the item you want to update.

langfuse.create_dataset_item(
    dataset_name="<dataset_name>",
    id="<item_id>",
    # example: update status to "ARCHIVED"
    status="ARCHIVED"
)
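
Similarly, a sketch of editing an existing item by upserting a new expected output (the id is assumed to reference an existing item in the dataset):

langfuse.create_dataset_item(
    dataset_name="<dataset_name>",
    id="<item_id>",
    # overwrite the expected output of the existing item
    expected_output={"text": "hello world, revised"}
)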

Dataset runs

Once you have created a dataset, you can use it to test and evaluate your application.

Learn more about the Experiments data model.
