lilac.data: Querying#

class lilac.data.Column#

A column in the dataset.

param alias: str | None = None#
param path: PathTuple [Required]#
param signal_udf: Signal | None = None#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

class lilac.data.ConceptSearch#

A concept search query on a column.

param concept_name: str [Required]#
param concept_namespace: str [Required]#
param embedding: str [Required]#
param path: Path [Required]#
param type: Literal['concept'] = 'concept'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

class lilac.data.Dataset(namespace: str, dataset_name: str, project_dir: str | Path | None = None)#

The database implementation to query a dataset.

abstract add_labels(name: str, row_ids: Sequence[str] | None = None, searches: Sequence[ConceptSearch | SemanticSearch | KeywordSearch | MetadataSearch] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, sort_by: Sequence[tuple[str, ...] | str] | None = None, sort_order: SortOrder | None = SortOrder.DESC, limit: int | None = None, offset: int | None = 0, include_deleted: bool = False, value: str | None = 'true') int#

Adds a label to a row, or a set of rows defined by searches and filters.

Returns the number of added labels.
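
A minimal sketch of labeling rows, assuming `ds` is an existing Dataset instance and the dataset has a text field at path `text` (both assumptions):

```python
# Label two specific rows; the ids are placeholders from a prior query.
ds.add_labels('reviewed', row_ids=['rowid1', 'rowid2'])

# Label every row matching a filter tuple of (path, op, value).
num_added = ds.add_labels('short_text', filters=[('text', 'length_shorter', 20)])
print(num_added, 'labels added')
```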

add_media_field(path: tuple[str, ...] | str, markdown: bool = False) None#

Add a media field to the dataset.

abstract cluster(input: tuple[str, ...] | str | Callable[[Any], str] | DatasetFormatInputSelector, output_path: tuple[str, ...] | str | None = None, min_cluster_size: int = 5, topic_fn: Callable[[list[list[tuple[str, float]]]], list[str]] | Callable[[list[tuple[str, float]]], str] | None = None, overwrite: bool = False, use_garden: bool = False, task_id: str | None = None, category_fn: Callable[[list[list[tuple[str, float]]]], list[str]] | Callable[[list[tuple[str, float]]], str] | None = None, skip_noisy_assignment: bool = False) None#

Compute clusters for a field of the dataset.

Parameters:
  • input – The path to the text field to cluster, a function that returns a string for each row in the dataset, or a DatasetFormatInputSelector, a format-specific input selector such as ShareGPT.human.

  • output_path – The name of the output path to write to. Defaults to the input path + “.cluster”.

  • min_cluster_size – The minimum number of docs in a cluster.

  • topic_fn – A function that returns a topic summary for each cluster. It takes a list of (doc, membership_score) tuples and returns a single topic. This is used to compute the topic for a given cluster of docs. It defaults to a function that summarizes users’ requests.

  • overwrite – Whether to overwrite an existing output.

  • use_garden – Whether to run the clustering remotely on Lilac Garden.

  • task_id – The TaskManager task_id for this process run. This is used to update the progress of the task.

  • category_fn – A function that returns a category for a set of related titles. It takes a list of (doc, membership_score) tuples and returns a single category name.

  • skip_noisy_assignment – If true, noisy points will not be assigned to the nearest cluster. This only has an effect when the clustering is done locally (use_garden=False) and will speed up clustering.
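
A minimal sketch, assuming `ds` is an existing Dataset instance with a text field at path `text` (assumed):

```python
# Cluster the 'text' field locally; results are written next to the input
# (the output path defaults to the input path + '.cluster').
ds.cluster(
  'text',
  min_cluster_size=10,
  use_garden=False,
  skip_noisy_assignment=True,  # skip assigning noisy points to speed up local runs
)
```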

compute_concept(namespace: str, concept_name: str, embedding: str, path: tuple[str, ...] | str, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, limit: int | None = None, include_deleted: bool = False, overwrite: bool = False, task_id: str | None = None) None#

Compute concept scores for a given field path.
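
A minimal sketch, assuming `ds` exists, a concept lilac/positive-sentiment is available (assumed), and `gte-small` is a registered embedding (name taken from the get_embeddings docstring below):

```python
# The embedding is typically computed on the path beforehand.
ds.compute_embedding('gte-small', 'text')
ds.compute_concept(
  namespace='lilac',
  concept_name='positive-sentiment',
  embedding='gte-small',
  path='text',
  overwrite=True,
)
```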

abstract compute_embedding(embedding: str, path: tuple[str, ...] | str, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, limit: int | None = None, include_deleted: bool = False, overwrite: bool = False, task_id: str | None = None, use_garden: bool = False) None#

Compute an embedding for a given field path.

Parameters:
  • embedding – The name of the embedding to compute.

  • path – The path to compute the embedding on.

  • filters – Filters to apply to the row; only matching rows will have the embedding computed.

  • limit – Limit the number of rows to compute the embedding on.

  • include_deleted – Whether to include deleted rows in the computation.

  • overwrite – Whether to overwrite an existing embedding computed at this path.

  • task_id – The TaskManager task_id for this process run. This is used to update the progress of the task.

  • use_garden – Whether to run the computation remotely on Lilac Garden.
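
A minimal sketch, assuming `ds` exists, a `text` field with a sibling `lang` field (assumed), and the `gte-small` embedding name mentioned in get_embeddings below:

```python
# Compute gte-small embeddings over 'text' for rows whose 'lang' equals 'en'
# (filters are (path, op, value) tuples).
ds.compute_embedding(
  'gte-small',
  'text',
  filters=[('lang', 'equals', 'en')],
  overwrite=True,
)
```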

abstract compute_signal(signal: Signal, path: tuple[str, ...] | str, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, limit: int | None = None, include_deleted: bool = False, overwrite: bool = False, task_id: str | None = None, use_garden: bool = False) None#

Compute a signal for a column.

Parameters:
  • signal – The signal to compute over the given columns.

  • path – The leaf path to compute the signal on.

  • filters – Filters to apply to the row; only matching rows will have the signal computed.

  • limit – Limit the number of rows to compute the signal on.

  • include_deleted – Whether to include deleted rows in the computation.

  • overwrite – Whether to overwrite an existing signal computed at this path.

  • task_id – The TaskManager task_id for this process run. This is used to update the progress of the task.

  • use_garden – Whether to run the computation remotely on Lilac Garden.
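
A minimal sketch; the signal class and its import path are assumptions, so substitute any registered Signal instance:

```python
from lilac.signals import LangDetectionSignal  # assumed signal; any Signal works

# Compute the signal over the first 1000 rows of the 'text' field (path assumed).
ds.compute_signal(LangDetectionSignal(), 'text', limit=1000, overwrite=True)
```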

config() DatasetConfig#

Return the dataset config for this dataset.

abstract count(filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, limit: int | None = None, include_deleted: bool = False) int#

Count the number of rows in the dataset.

Parameters:
  • filters – Filters to apply to the row; only matching rows will be counted.

  • limit – Limit the number of rows to count. The returned count may be smaller than the limit if filters reduce the row count below it.

  • include_deleted – Whether to include deleted rows in the count.

Returns:

The number of rows in the dataset matching the selection parameters.
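
A minimal sketch, assuming `ds` exists and has `lang` and `score` fields (both assumed):

```python
# Total number of (non-deleted) rows.
total = ds.count()

# Rows where 'lang' exists and 'score' is greater than 0.5.
n = ds.count(filters=[('lang', 'exists'), ('score', 'greater', 0.5)])
```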

abstract delete() None#

Deletes the dataset.

abstract delete_column(path: tuple[str, ...] | str) None#

Delete a computed column from the dataset.

Parameters:

path – The path of the computed column.

abstract delete_embedding(embedding: str, path: tuple[str, ...] | str) None#

Delete a computed embedding from the dataset.

Parameters:
  • embedding – The name of the embedding.

  • path – The path of the computed embedding.

delete_rows(row_ids: Sequence[str] | None = None, searches: Sequence[ConceptSearch | SemanticSearch | KeywordSearch | MetadataSearch] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, sort_by: Sequence[tuple[str, ...] | str] | None = None, sort_order: SortOrder | None = SortOrder.DESC, limit: int | None = None, offset: int | None = 0) int#

Deletes rows from the dataset.

Returns the number of deleted rows.
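
A minimal sketch of deleting and then restoring rows by filter, with the `text` field path assumed:

```python
# Soft-delete rows matching a regex filter, then undo the deletion.
n_deleted = ds.delete_rows(filters=[('text', 'regex_matches', r'(?i)lorem ipsum')])
n_restored = ds.restore_rows(filters=[('text', 'regex_matches', r'(?i)lorem ipsum')])
```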

abstract delete_signal(signal_path: tuple[str, ...] | str) None#

Delete a computed signal from the dataset.

Parameters:

signal_path – The path holding the computed data of the signal.

abstract get_embeddings(embedding: str, rowid: str, path: tuple[str | int, ...] | str) list[Any]#

Returns the span-level embeddings associated with a specific row value.

Parameters:
  • embedding – The embedding name (e.g. gte-small, or jina-v2-small).

  • rowid – The row id to get embeddings for.

  • path – The path within a row to get embeddings for. If the row is a struct, e.g. {person: {document: {text: …}}}, and we want the embeddings for text, the path would be person.document.text. If the row has a list of strings, e.g. {docs: [{text: …}, {text: …}]}, the path would be docs.0.text for the 1st doc and docs.1.text for the 2nd doc.

Returns:

A list of ll.Item dicts holding the span coordinates and the embedding. These come from ll.chunk_embedding, which looks like: ` {'__span__': {'start': 0, 'end': 6}, 'embedding': array([1., 0., 0.])} `
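
A minimal sketch, assuming `gte-small` embeddings were already computed on `text` and using a placeholder row id from a prior query:

```python
items = ds.get_embeddings('gte-small', rowid='some-row-id', path='text')
for item in items:
  span, vector = item['__span__'], item['embedding']
  print(span['start'], span['end'], len(vector))
```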

abstract get_label_names() list[str]#

Returns the list of label names that have been added to the dataset.

abstract load_embedding(load_fn: Callable[[Any], ndarray | list[Any]], index_path: tuple[str, ...] | str, embedding: str, overwrite: bool = False, task_id: str | None = None) None#

Loads embeddings from an external source.

Parameters:
  • load_fn – A function that takes an item and returns an embedding. load_fn should return either a numpy array for full-document embeddings, or a list of ll.chunk_embeddings for chunked embeddings.

  • index_path – The path to the index to load the embeddings into.

  • embedding – The name of the embedding to load under. This should be a registered embedding.

  • overwrite – Whether to overwrite an existing embedding.

  • task_id – The TaskManager task_id for this process run. This is used to update the progress of the task.
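
A minimal sketch, assuming `ds` exists, `text` is the index path, and a hypothetical embedding name that has been registered separately:

```python
import numpy as np

# Hypothetical lookup of precomputed full-document vectors, keyed by text.
precomputed: dict[str, np.ndarray] = {}

def load_fn(item):
  # Return a numpy array for a full-document embedding, or a list of
  # chunk embeddings for chunked embeddings.
  return precomputed.get(item, np.zeros(384, dtype=np.float32))

# 'my-embedding' must already be registered as an embedding (assumed here).
ds.load_embedding(load_fn, index_path='text', embedding='my-embedding')
```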

abstract manifest() DatasetManifest#

Return the manifest for the dataset.

abstract map(map_fn: MapFn, input_path: tuple[str, ...] | str | None = None, output_path: tuple[str, ...] | str | None = None, overwrite: bool = False, resolve_span: bool = False, batch_size: int | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, limit: int | None = None, sort_by: tuple[str, ...] | str | None = None, sort_order: SortOrder | None = SortOrder.ASC, include_deleted: bool = False, num_jobs: int = 1, execution_type: Literal['processes', 'threads'] = 'threads', embedding: str | None = None, schema: Field | None = None) Iterable[Any]#

Maps a function over all rows in the dataset and writes the result to a new column.

Parameters:
  • map_fn – A callable that takes a full row item dictionary, and returns an Item for the result. The result Item can be a primitive, like a string.

  • input_path – The path to the input column to map over. If not specified, the map function will be called with the full row item dictionary. If specified, the map function will be called with the value at the given path, flattened. The output column will be written in the same shape as the input column, paralleling its nestedness.

  • output_path – The name of the output path to write to. It is often useful to emit results next to the input, so they will get hierarchically shown in the UI.

  • overwrite – Set to true to overwrite this column if it already exists. If False, an error is thrown when the column already exists.

  • resolve_span – Whether to resolve the spans into text before calling the map function.

  • batch_size – If provided, the map function will be called with a list of rows. Useful for batching API requests or other expensive operations. If unspecified, the map will receive one item at a time.

  • filters – Filters limiting the set of rows to map over. At the moment, we do not support incremental computations; the output column will be null for rows that do not match the filter, and there is no way to fill in those nulls without recomputing the entire map with a less restrictive filter and overwrite=True.

  • limit – How many rows to map over. If not specified, all rows will be mapped over.

  • sort_by – The path to sort by. If specified, the map will be called with rows sorted by this path. This is useful for map functions that need to maintain state across rows.

  • sort_order – The sort order. Defaults to ascending.

  • include_deleted – Whether to include deleted rows in the query.

  • num_jobs – The number of jobs to shard the work, defaults to 1. When set to -1, the number of jobs corresponds to the number of processors. If num_jobs is greater than the number of processors, the work is split into num_jobs shards and distributed amongst the processors.

  • execution_type – The local execution type of the map. Either “threads” or “processes”. Threads are better for network bound tasks like making requests to an external server, while processes are better for CPU bound tasks, like running a local LLM.

  • embedding – If specified, the map function will be called with the `lilac.SpanVector`s of each item, instead of the original text. This is needed for embedding-based computations (e.g. clustering).

  • schema – The schema for the output of the map function. If not provided, the schema will be auto inferred.

Returns:

An iterable of items that are the result of the map. A result item does not include the column name as part of a dictionary; it is exactly what the map function returned.
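
A minimal sketch, with the `text` field path assumed:

```python
# Without input_path, the function receives the full row dict.
def text_length(item):
  return len(item.get('text') or '')

ds.map(text_length, output_path='text_length', overwrite=True)

# With input_path, the function receives just the value at that path.
ds.map(lambda text: (text or '').lower(), input_path='text',
       output_path='text_lower', overwrite=True)
```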

abstract media(item_id: str, leaf_path: tuple[str, ...] | str) MediaResult#

Return the media for a leaf path.

Parameters:
  • item_id – The item id to get media for.

  • leaf_path – The leaf path for the media.

Returns:

A MediaResult.

petals() dict[tuple[str, ...], Field]#

Return the leaf fields of the dataset.

abstract pivot(outer_path: tuple[str, ...] | str, inner_path: tuple[str, ...] | str, searches: Sequence[ConceptSearch | SemanticSearch | KeywordSearch | MetadataSearch] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, sort_by: GroupsSortBy | None = GroupsSortBy.COUNT, sort_order: SortOrder | None = SortOrder.DESC) PivotResult#

Generate a pivot table with counts over two fields (outer and inner).

For each value of outer_path in the dataset, the pivot table will contain the count of that value, as well as the internal counts of all the values of inner_path for the given outer_path value.

For example, if outer_path is split=train|test and inner_path is source=wiki|cc the pivot table will look like:

```
[
  {
    value: 'train', count: 100,
    inner: [{value: 'wiki', count: 94}, {value: 'cc', count: 6}],
  },
  {
    value: 'test', count: 10,
    inner: [{value: 'wiki', count: 3}, {value: 'cc', count: 7}],
  },
]
```
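
A minimal sketch mirroring the example above; the `split` and `source` field paths are assumptions:

```python
result = ds.pivot(outer_path='split', inner_path='source')
print(result)  # PivotResult: outer values with counts and nested inner counts
```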

abstract remove_labels(name: str, row_ids: Sequence[str] | None = None, searches: Sequence[ConceptSearch | SemanticSearch | KeywordSearch | MetadataSearch] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, sort_by: Sequence[tuple[str, ...] | str] | None = None, sort_order: SortOrder | None = SortOrder.DESC, limit: int | None = None, offset: int | None = 0, include_deleted: bool = False) int#

Removes labels from a row, or a set of rows defined by searches and filters.

Returns the number of removed labels.

remove_media_field(path: tuple[str, ...] | str) None#

Remove a media field from the dataset.

restore_rows(row_ids: Sequence[str] | None = None, searches: Sequence[ConceptSearch | SemanticSearch | KeywordSearch | MetadataSearch] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, sort_by: Sequence[tuple[str, ...] | str] | None = None, sort_order: SortOrder | None = SortOrder.DESC, limit: int | None = None, offset: int | None = 0) int#

Undeletes rows from the dataset.

Returns the number of restored rows.

abstract select_groups(leaf_path: tuple[str, ...] | str, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, sort_by: GroupsSortBy | None = None, sort_order: SortOrder | None = SortOrder.DESC, limit: int | None = None, bins: Sequence[tuple[str, float | int | None, float | int | None]] | Sequence[float] | None = None, include_deleted: bool = False, searches: Sequence[ConceptSearch | SemanticSearch | KeywordSearch | MetadataSearch] | None = None) SelectGroupsResult#

Select grouped columns to power a histogram.

Parameters:
  • leaf_path – The leaf path to group by. The path can be a dot-separated string path, or a tuple of fields.

  • filters – The filters to apply to the query.

  • sort_by – What to sort by, either “count” or “value”.

  • sort_order – The sort order.

  • limit – The maximum number of rows to return.

  • bins – The bins to use when bucketizing a float column.

  • include_deleted – Whether to include deleted rows in the query.

  • searches – The searches to apply to the query.

Returns:

A SelectGroupsResult iterator where each row is a group.
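
A minimal sketch, assuming a `lang` field exists; GroupsSortBy and SortOrder are documented further down in this module:

```python
from lilac.data import GroupsSortBy, SortOrder

# Histogram of the 'lang' field: top 10 values by count.
groups = ds.select_groups('lang', sort_by=GroupsSortBy.COUNT,
                          sort_order=SortOrder.DESC, limit=10)
for value, count in groups.counts:
  print(value, count)
```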

abstract select_rows(columns: Sequence[tuple[str, ...] | str | Column] | None = None, searches: Sequence[ConceptSearch | SemanticSearch | KeywordSearch | MetadataSearch] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, sort_by: Sequence[tuple[str, ...] | str] | None = None, sort_order: SortOrder | None = SortOrder.DESC, limit: int | None = 100, offset: int | None = 0, resolve_span: bool = False, combine_columns: bool = False, include_deleted: bool = False, exclude_signals: bool = False, user: UserInfo | None = None) SelectRowsResult#

Select a set of rows that match the provided filters, analogous to SQL SELECT.

Parameters:
  • columns – The columns to select. A column is an instance of Column which can either define a path to a feature, or a column with an applied Transform, e.g. a Concept. If none, it selects all columns.

  • searches – The searches to apply to the query.

  • filters – The filters to apply to the query.

  • sort_by – An ordered list of what to sort by. When defined, this is a list of aliases of column names defined by the “alias” field in Column. If no alias is provided for a column, an automatic alias is generated by combining each path element with a “.”. For example, (‘person’, ‘name’) => person.name. For columns that are transform columns, an alias must be provided explicitly. When sorting by a (nested) list of values, the sort takes the minimum value when sort_order is ASC, and the maximum value when sort_order is DESC.

  • sort_order – The sort order.

  • limit – The maximum number of rows to return.

  • offset – The offset to start returning rows from.

  • resolve_span – Whether to resolve the span of the row.

  • combine_columns – Whether to combine columns into a single object. The object will be pruned to only include sub-fields that correspond to the requested columns.

  • include_deleted – Whether to include deleted rows in the query.

  • exclude_signals – Whether to exclude fields produced by signals.

  • user – The authenticated user, if auth is enabled and the user is logged in. This is used to apply ACL to the query, especially for concepts.

Returns:

A SelectRowsResult iterator with rows of `Item`s.
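
A minimal sketch combining a keyword search with a metadata filter; the `text` and `lang` field paths are assumptions:

```python
from lilac.data import KeywordSearch

rows = ds.select_rows(
  columns=['text', 'lang'],
  searches=[KeywordSearch(path='text', query='error')],
  filters=[('lang', 'equals', 'en')],
  limit=20,
)
for item in rows:
  print(item['text'][:80])

df = rows.df()  # or convert the whole result to a pandas DataFrame
```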

abstract select_rows_schema(columns: Sequence[tuple[str, ...] | str | Column] | None = None, sort_by: Sequence[tuple[str, ...] | str] | None = None, sort_order: SortOrder | None = SortOrder.DESC, searches: Sequence[ConceptSearch | SemanticSearch | KeywordSearch | MetadataSearch] | None = None, combine_columns: bool = False) SelectRowsSchemaResult#

Returns the schema of the result of select_rows above with the same arguments.

settings() DatasetSettings#

Return the persistent settings for the dataset.

abstract stats(leaf_path: tuple[str, ...] | str, include_deleted: bool = False) StatsResult#

Compute stats for a leaf path.

Parameters:
  • leaf_path – The leaf path to compute stats for.

  • include_deleted – Whether to include deleted rows in the stats.

Returns:

A StatsResult.

abstract to_csv(filepath: str | Path, columns: Sequence[tuple[str, ...] | str | Column] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, include_labels: Sequence[str] | None = None, exclude_labels: Sequence[str] | None = None, include_deleted: bool = False, include_signals: bool = False) None#

Export the dataset to a CSV file.

Parameters:
  • filepath – The path to the file to export to.

  • columns – The columns to export.

  • filters – The filters to apply to the query.

  • include_labels – The labels to include in the export.

  • exclude_labels – The labels to exclude from the export.

  • include_deleted – Whether to include deleted rows in the export.

  • include_signals – Whether to include fields produced by signals.
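
A minimal sketch, with the output filename and the `text`/`lang` field paths assumed:

```python
# Export only English rows, keeping signal-derived fields out of the file.
ds.to_csv(
  'english_rows.csv',
  columns=['text', 'lang'],
  filters=[('lang', 'equals', 'en')],
  include_signals=False,
)
```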

abstract to_huggingface(columns: Sequence[tuple[str, ...] | str | Column] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, include_labels: Sequence[str] | None = None, exclude_labels: Sequence[str] | None = None, include_deleted: bool = False, include_signals: bool = False) Dataset#

Export the dataset to a Hugging Face dataset.

Parameters:
  • columns – The columns to export.

  • filters – The filters to apply to the query.

  • include_labels – The labels to include in the export.

  • exclude_labels – The labels to exclude from the export.

  • include_deleted – Whether to include deleted rows in the export.

  • include_signals – Whether to include fields produced by signals.

abstract to_json(filepath: str | Path, jsonl: bool = True, columns: Sequence[tuple[str, ...] | str | Column] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, include_labels: Sequence[str] | None = None, exclude_labels: Sequence[str] | None = None, include_deleted: bool = False, include_signals: bool = False) None#

Export the dataset to a JSON file.

Parameters:
  • filepath – The path to the file to export to.

  • jsonl – Whether to export to JSONL or JSON.

  • columns – The columns to export.

  • filters – The filters to apply to the query.

  • include_labels – The labels to include in the export.

  • exclude_labels – The labels to exclude from the export.

  • include_deleted – Whether to include deleted rows in the export.

  • include_signals – Whether to include fields produced by signals.

abstract to_pandas(columns: Sequence[tuple[str, ...] | str | Column] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, include_labels: Sequence[str] | None = None, exclude_labels: Sequence[str] | None = None, include_deleted: bool = False, include_signals: bool = False) DataFrame#

Export the dataset to a pandas DataFrame.

Parameters:
  • columns – The columns to export.

  • filters – The filters to apply to the query.

  • include_labels – The labels to include in the export.

  • exclude_labels – The labels to exclude from the export.

  • include_deleted – Whether to include deleted rows in the export.

  • include_signals – Whether to include fields produced by signals.
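
A minimal sketch pulling labeled rows into pandas; the `text` path and the `reviewed` label name (see add_labels above) are assumptions:

```python
df = ds.to_pandas(
  columns=['text'],
  include_labels=['reviewed'],
  include_signals=True,
)
print(len(df))
```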

abstract to_parquet(filepath: str | Path, columns: Sequence[tuple[str, ...] | str | Column] | None = None, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, include_labels: Sequence[str] | None = None, exclude_labels: Sequence[str] | None = None, include_deleted: bool = False, include_signals: bool = False) None#

Export the dataset to a Parquet file.

Parameters:
  • filepath – The path to the file to export to.

  • columns – The columns to export.

  • filters – The filters to apply to the query.

  • include_labels – The labels to include in the export.

  • exclude_labels – The labels to exclude from the export.

  • include_deleted – Whether to include deleted rows in the export.

  • include_signals – Whether to include fields produced by signals.

transform(transform_fn: Callable[[Iterator[Any]], Iterator[Any]], input_path: tuple[str, ...] | str | None = None, output_path: tuple[str, ...] | str | None = None, overwrite: bool = False, resolve_span: bool = False, filters: Sequence[Filter | tuple[tuple[str, ...] | str, Literal['equals', 'not_equal', 'greater', 'greater_equal', 'less', 'less_equal'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['length_longer', 'length_shorter', 'ilike', 'regex_matches', 'not_regex_matches'], bool | int | float | str | bytes | datetime] | tuple[tuple[str, ...] | str, Literal['exists', 'not_exists', None]] | tuple[tuple[str, ...] | str, Literal['in', None], list[str]]] | None = None, limit: int | None = None, sort_by: tuple[str, ...] | str | None = None, sort_order: SortOrder | None = SortOrder.ASC, embedding: str | None = None, schema: Field | None = None) Iterable[Any]#

Transforms the entire dataset (or a column) and writes the result to a new column.

Parameters:
  • transform_fn – A callable that takes an iterable of all items in the dataset and returns an iterable of the same length for the result.

  • input_path – The path to the input column to map over. If not specified, the map function will be called with the full row item dictionary. If specified, the map function will be called with the value at the given path, flattened. The output column will be written in the same shape as the input column, paralleling its nestedness.

  • output_path – The name of the output path to write to. It is often useful to emit results next to the input, so they will get hierarchically shown in the UI.

  • overwrite – Set to true to overwrite this column if it already exists. If False, an error is thrown when the column already exists.

  • resolve_span – Whether to resolve the spans into text before calling the map function.

  • filters – Filters limiting the set of rows to map over. At the moment, we do not support incremental computations; the output column will be null for rows that do not match the filter, and there is no way to fill in those nulls without recomputing the entire map with a less restrictive filter and overwrite=True.

  • limit – How many rows to map over. If not specified, all rows will be mapped over.

  • sort_by – The path to sort by. If specified, the map will be called with rows sorted by this path. This is useful for map functions that need to maintain state across rows.

  • sort_order – The sort order. Defaults to ascending.

  • embedding – If specified, the transform function will be called with the `lilac.SpanVector`s of each item, instead of the original text. This is needed for embedding-based computations (e.g. clustering).

  • schema – The schema for the output of the map function. If not provided, the schema will be auto inferred.
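
A minimal sketch of a stateful transform; the `text` and `timestamp` field paths are assumptions:

```python
from typing import Iterator

# Running total of text length over rows sorted by 'timestamp'.
def cumulative_length(texts: Iterator[str]) -> Iterator[int]:
  total = 0
  for text in texts:
    total += len(text or '')
    yield total  # one output per input, same length overall

ds.transform(
  cumulative_length,
  input_path='text',
  output_path='text_cumlen',
  sort_by='timestamp',
  overwrite=True,
)
```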

update_settings(settings: DatasetSettings) None#

Update the persistent settings for the dataset.

class lilac.data.DatasetManifest#

The manifest for a dataset.

param data_schema: Schema [Required]#
param dataset_format: DatasetFormat | None = None#
param dataset_name: str [Required]#
param namespace: str [Required]#
param num_items: int [Required]#
param source: SerializeAsAny[Source] [Required]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

class lilac.data.Filter#

A filter on a column.

param op: FilterOp [Required]#

The type of filter. Available filters are [‘equals’, ‘exists’, ‘greater’, ‘greater_equal’, ‘ilike’, ‘in’, ‘length_longer’, ‘length_shorter’, ‘less’, ‘less_equal’, ‘not_equal’, ‘not_exists’, ‘not_regex_matches’, ‘raw_sql’, ‘regex_matches’]

param path: PathTuple [Required]#
param value: FeatureValue | FeatureListValue | None = None#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

enum lilac.data.GroupsSortBy(value)#

The sort for groups queries.

Either “count” which sorts by the count of feature value, or “value” which sorts by the feature value itself.

Member Type:

str

Valid values are as follows:

COUNT = <GroupsSortBy.COUNT: 'count'>#
VALUE = <GroupsSortBy.VALUE: 'value'>#
class lilac.data.KeywordSearch#

A keyword search query on a column.

param path: Path [Required]#
param query: SearchValue [Required]#
Constraints:
  • strict = True

param type: Literal['keyword'] = 'keyword'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

class lilac.data.MetadataSearch#

A metadata search query on a column.

param op: FilterOp [Required]#
param path: Path [Required]#
param type: Literal['metadata'] = 'metadata'#
param value: FeatureValue | FeatureListValue | None = None#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

class lilac.data.Schema#

Database schema.

param fields: dict[str, Field] [Required]#
get_field(path: tuple[str, ...]) Field#

Returns the field at the given path.

has_field(path: tuple[str, ...]) bool#

Returns if the field is found at the given path.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

property all_fields: list[tuple[tuple[str, ...], Field]]#

Return all the fields, including nested and repeated fields, as a flat list.

property leafs: dict[tuple[str, ...], Field]#

Return all the leaf fields in the schema. A leaf is defined as a node that contains a value.

NOTE: Leafs may contain children. Leafs can be found as any node that has a dtype defined.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

class lilac.data.SelectGroupsResult#

The result of a select groups query.

param bins: list[Bin] | None = None#
param counts: list[tuple[FeatureValue | None, int]] [Required]#
param too_many_distinct: bool [Required]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

class lilac.data.SelectRowsResult(df: DataFrame, total_num_rows: int)#

The result of a select rows query.

df() DataFrame#

Convert the result to a pandas DataFrame.

class lilac.data.SemanticSearch#

A semantic search on a column.

param embedding: str [Required]#
param path: Path [Required]#
param query: SearchValue [Required]#
Constraints:
  • strict = True

param query_type: EmbeddingInputType = 'document'#

The input type of the query, used for the query embedding.

param type: Literal['semantic'] = 'semantic'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

enum lilac.data.SortOrder(value)#

The sort order for a database query.

Member Type:

str

Valid values are as follows:

DESC = <SortOrder.DESC: 'DESC'>#
ASC = <SortOrder.ASC: 'ASC'>#