lilac.signals: Enrichment#

Signals enrich a document with additional metadata.

class lilac.signals.ConceptSignal#

Compute scores along a given concept for documents.

param concept_name: str [Required]#
param draft: str = 'main'#
param namespace: str [Required]#
param version: int | None = None#
compute(examples: Iterable[str | bytes]) Iterator[Any | None]#

Get the scores for the provided examples.

fields() Field#

Return the schema for the signal output. If None, the schema will be automatically inferred.

Returns:

A Field object that describes the schema of the signal.

key(is_computed_signal: bool | None = False) str#

Get the key for a signal.

This is used to make sure signals with multiple arguments do not collide.

NOTE: Overriding this method is sensitive. If you override it, make sure that it is globally unique. It will be used as the dictionary key for enriched values.

Parameters:

is_computed_signal – True when the signal is computed over the column and written to disk. False when the signal is used as a preview UDF.
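The uniqueness requirement can be sketched with a toy key builder (the hashing scheme below is an assumption for illustration, not Lilac's actual implementation):

```python
import hashlib
import json

def make_signal_key(name: str, params: dict) -> str:
  """Build a key from a signal name and its parameters.

  Serializing the parameters with sorted keys guarantees that two signals
  configured with the same arguments in a different order still collapse
  to the same key, while different arguments produce distinct keys.
  """
  if not params:
    return name
  digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()[:8]
  return f'{name}/{digest}'

key_a = make_signal_key('concept_score', {'namespace': 'lilac', 'concept_name': 'toxicity'})
key_b = make_signal_key('concept_score', {'concept_name': 'toxicity', 'namespace': 'lilac'})
assert key_a == key_b  # argument order does not change the key
```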

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

set_user(user: UserInfo | None) None#

Set the user for this signal.

setup() None#

Set up the signal.

vector_compute(span_vectors: Iterable[list[SpanVector]]) Iterator[Any | None]#

Compute the signal over precomputed span vectors of documents or images.

Parameters:

span_vectors – Precomputed embeddings over spans of a document.

Returns:

An iterable of items. Sparse signals should return “None” for skipped inputs.

vector_compute_topk(topk: int, vector_index: VectorDBIndex, rowids: Iterable[str] | None = None) list[tuple[tuple[str | int, ...], Any | None]]#

Return signal results only for the top k documents or images.

Signals decide how to rank each document/image in the dataset, usually by a similarity score obtained via the vector store.

Parameters:
  • topk – The number of items to return, ranked by the signal.

  • vector_index – The vector index used to look up precomputed embeddings.

  • rowids – Optional iterable of row ids to restrict the search to.

Returns:

A list of (key, signal_output) tuples containing the topk items. Sparse signals should return “None” for skipped inputs.
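The return shape can be sketched with a toy cosine-similarity ranking, where a plain dict stands in for the VectorDBIndex (the keys and vectors are hypothetical):

```python
import math

def cosine(a: list, b: list) -> float:
  """Cosine similarity between two vectors."""
  dot = sum(x * y for x, y in zip(a, b))
  return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def toy_topk(topk: int, query_vec: list, vectors: dict) -> list:
  """Rank precomputed embeddings against a query and return the top-k
  (key, score) tuples, mirroring the documented return shape."""
  scored = [(key, cosine(query_vec, vec)) for key, vec in vectors.items()]
  scored.sort(key=lambda kv: kv[1], reverse=True)
  return scored[:topk]

vectors = {('row1', 0): [1.0, 0.0], ('row2', 0): [0.0, 1.0], ('row3', 0): [0.7, 0.7]}
top = toy_topk(2, [1.0, 0.0], vectors)
assert [key for key, _ in top] == [('row1', 0), ('row3', 0)]
```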

display_name: ClassVar[str] = 'Concept'#
input_type: ClassVar[SignalInputType] = 'text'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'concept_score'#
class lilac.signals.LangDetectionSignal#

Detects the language code in text.

Supports 55 languages, returning their [ISO 639-1 codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).

param split_by_paragraph: bool = False#

Compute language scores for each paragraph.

compute(data: list[str]) list[Any | None]#

Compute a signal over a field.

Just like in dataset.map, if the field is a repeated field, then the repeated values will be flattened and presented as a continuous stream.

A signal can choose to return None for any input, but it must always return some value, so that alignment errors don’t occur when Lilac attempts to unflatten repeated fields.

This function is polymorphic and its signature depends on the local_batch_size class var.

If batch_size = -1 (default), then we hand you the iterator stream and you choose how to process the stream; the function signature is:

compute(self, data: Iterable[RichData]) -> Iterator[Optional[Item]]

If batch_size > 0, then we will batch items and provide them to you in the requested batch size; the function signature is:

compute(self, data: list[RichData]) -> list[Optional[Item]]

If batch_size is None, then we hand you items one by one, and the function signature is:

compute(self, data: RichData) -> Optional[Item]

Parameters:
  • data – An iterable of rich data to compute the signal over.

  • user – User information, if the user is logged in. This is useful if signals are access controlled, like concepts.

Returns:

An iterable of items. Sparse signals should return “None” for skipped inputs.
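The batching contract can be sketched with a toy stand-in class (not part of Lilac's API), using the list-based signature that applies when local_batch_size > 0:

```python
from typing import Optional

class ToyLengthSignal:
  """A stand-in for a batched signal: with local_batch_size > 0, compute
  receives a list of items and must return a list of the same length."""
  local_batch_size = 4

  def compute(self, data: list) -> list:
    # Return None for inputs we skip (a sparse output), but never drop an
    # item: the output length must equal the input length so that repeated
    # fields can be unflattened without alignment errors.
    return [len(text) if isinstance(text, str) else None for text in data]

signal = ToyLengthSignal()
out = signal.compute(['hello', None, 'hi'])
assert len(out) == 3          # same length as the input batch
assert out == [5, None, 2]    # None marks the skipped input
```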

fields() Field#

Return the schema for the signal output. If None, the schema will be automatically inferred.

Returns:

A Field object that describes the schema of the signal.

setup() None#

Set up the signal.

display_name: ClassVar[str] = 'Language detection'#
input_type: ClassVar[SignalInputType] = 'text'#
local_batch_size: ClassVar[int | None] = 1024#
local_parallelism: ClassVar[int] = -1#
local_strategy: ClassVar[Literal['processes', 'threads']] = 'processes'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'lang_detection'#
class lilac.signals.NearDuplicateSignal#

Find near duplicate documents in a dataset using n-grams.

Documents are fingerprinted using n-grams with [minhash LSH](https://en.wikipedia.org/wiki/MinHash). Documents are assigned the same cluster id if their Jaccard similarity is above the provided threshold.

param threshold: float = 0.85#

The similarity threshold for detecting a near duplicate.
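The underlying idea can be sketched with exact Jaccard similarity over character n-grams (a simplification: the real signal uses MinHash LSH to approximate Jaccard at scale, and the documents below are made up):

```python
def ngrams(text: str, n: int = 3) -> set:
  """The set of character n-grams of a document."""
  return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: str, b: str) -> float:
  """Exact Jaccard similarity between two documents' n-gram sets."""
  sa, sb = ngrams(a), ngrams(b)
  return len(sa & sb) / len(sa | sb)

doc1 = 'the quick brown fox jumps over the lazy dog near the quiet river bank'
doc2 = 'the quick brown fox jumped over the lazy dog near the quiet river bank'
# A one-word edit keeps similarity above the 0.85 default threshold,
# so both documents would share a cluster id.
assert jaccard(doc1, doc2) > 0.85
assert jaccard(doc1, 'completely unrelated text') < 0.85
```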

compute(data: Iterable[str | bytes]) Iterator[Any | None]#

Compute a signal over a field.

Just like in dataset.map, if the field is a repeated field, then the repeated values will be flattened and presented as a continuous stream.

A signal can choose to return None for any input, but it must always return some value, so that alignment errors don’t occur when Lilac attempts to unflatten repeated fields.

This function is polymorphic and its signature depends on the local_batch_size class var.

If batch_size = -1 (default), then we hand you the iterator stream and you choose how to process the stream; the function signature is:

compute(self, data: Iterable[RichData]) -> Iterator[Optional[Item]]

If batch_size > 0, then we will batch items and provide them to you in the requested batch size; the function signature is:

compute(self, data: list[RichData]) -> list[Optional[Item]]

If batch_size is None, then we hand you items one by one, and the function signature is:

compute(self, data: RichData) -> Optional[Item]

Parameters:
  • data – An iterable of rich data to compute the signal over.

  • user – User information, if the user is logged in. This is useful if signals are access controlled, like concepts.

Returns:

An iterable of items. Sparse signals should return “None” for skipped inputs.

fields() Field#

Return the schema for the signal output. If None, the schema will be automatically inferred.

Returns:

A Field object that describes the schema of the signal.

display_name: ClassVar[str] = 'Near duplicate documents'#
input_type: ClassVar[SignalInputType] = 'text'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'near_dup'#
class lilac.signals.PIISignal#

Find personally identifiable information (emails, phone numbers, secret keys, etc).
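As a toy illustration of span-style PII detection (a single email regex; Lilac's actual detector covers far more categories and edge cases):

```python
import re

# A deliberately simple email pattern for illustration only.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def find_emails(text: str) -> list:
  """Return (start, end) character spans of email-like substrings."""
  return [m.span() for m in EMAIL_RE.finditer(text)]

text = 'Contact alice@example.com or bob@test.org for details.'
spans = find_emails(text)
assert [text[s:e] for s, e in spans] == ['alice@example.com', 'bob@test.org']
```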

compute(data: list[str | bytes]) list[Any | None]#

Compute a signal over a field.

Just like in dataset.map, if the field is a repeated field, then the repeated values will be flattened and presented as a continuous stream.

A signal can choose to return None for any input, but it must always return some value, so that alignment errors don’t occur when Lilac attempts to unflatten repeated fields.

This function is polymorphic and its signature depends on the local_batch_size class var.

If batch_size = -1 (default), then we hand you the iterator stream and you choose how to process the stream; the function signature is:

compute(self, data: Iterable[RichData]) -> Iterator[Optional[Item]]

If batch_size > 0, then we will batch items and provide them to you in the requested batch size; the function signature is:

compute(self, data: list[RichData]) -> list[Optional[Item]]

If batch_size is None, then we hand you items one by one, and the function signature is:

compute(self, data: RichData) -> Optional[Item]

Parameters:
  • data – An iterable of rich data to compute the signal over.

  • user – User information, if the user is logged in. This is useful if signals are access controlled, like concepts.

Returns:

An iterable of items. Sparse signals should return “None” for skipped inputs.

compute_garden(docs: Iterator[str]) Iterator[Any]#

Compute a signal over a field, accelerated through Lilac Garden.

This method gets an iterator of the entire data, and should return an iterator of the same length, with the processed results.
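The same-length contract can be sketched with a toy generator (uppercasing stands in for the accelerated remote computation):

```python
from typing import Iterator

def toy_compute_garden(docs: Iterator[str]) -> Iterator[str]:
  """Illustrates the compute_garden contract: consume the entire input
  stream and yield exactly one result per input document."""
  for doc in docs:
    yield doc.upper()  # stand-in for the remote, accelerated processing

docs = ['a', 'bb', 'ccc']
results = list(toy_compute_garden(iter(docs)))
assert len(results) == len(docs)  # output length matches input length
```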

fields() Field#

Return the schema for the signal output. If None, the schema will be automatically inferred.

Returns:

A Field object that describes the schema of the signal.

display_name: ClassVar[str] = 'Personal Information (PII)'#
input_type: ClassVar[SignalInputType] = 'text'#
local_batch_size: ClassVar[int | None] = 128#
local_parallelism: ClassVar[int] = -1#
local_strategy: ClassVar[Literal['processes', 'threads']] = 'processes'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'pii'#
supports_garden: ClassVar[bool] = True#
class lilac.signals.SemanticSimilaritySignal#

Compute semantic similarity for a query and a document.

This is done by embedding the query with the same embedding as the document and computing a similarity score between them.

param query: str [Required]#
param query_type: Literal['question', 'document'] = 'document'#

The input type of the query, used for the query embedding.
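A minimal sketch of the idea, using plain cosine similarity over made-up embedding vectors (the real signal reads precomputed document embeddings from the vector index):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
  """Cosine similarity between two embedding vectors."""
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = math.sqrt(sum(x * x for x in a))
  norm_b = math.sqrt(sum(x * x for x in b))
  return dot / (norm_a * norm_b)

# Hypothetical embeddings for a query and two documents.
query = [0.9, 0.1, 0.0]
doc_related = [0.8, 0.2, 0.1]
doc_unrelated = [0.0, 0.1, 0.9]
assert cosine_similarity(query, doc_related) > cosine_similarity(query, doc_unrelated)
```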

compute(data: Iterable[str | bytes]) Iterator[Any | None]#

Compute a signal over a field.

Just like in dataset.map, if the field is a repeated field, then the repeated values will be flattened and presented as a continuous stream.

A signal can choose to return None for any input, but it must always return some value, so that alignment errors don’t occur when Lilac attempts to unflatten repeated fields.

This function is polymorphic and its signature depends on the local_batch_size class var.

If batch_size = -1 (default), then we hand you the iterator stream and you choose how to process the stream; the function signature is:

compute(self, data: Iterable[RichData]) -> Iterator[Optional[Item]]

If batch_size > 0, then we will batch items and provide them to you in the requested batch size; the function signature is:

compute(self, data: list[RichData]) -> list[Optional[Item]]

If batch_size is None, then we hand you items one by one, and the function signature is:

compute(self, data: RichData) -> Optional[Item]

Parameters:
  • data – An iterable of rich data to compute the signal over.

  • user – User information, if the user is logged in. This is useful if signals are access controlled, like concepts.

Returns:

An iterable of items. Sparse signals should return “None” for skipped inputs.

fields() Field#

Return the schema for the signal output. If None, the schema will be automatically inferred.

Returns:

A Field object that describes the schema of the signal.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

vector_compute(span_vectors: Iterable[list[SpanVector]]) Iterator[Any | None]#

Compute the signal over precomputed span vectors of documents or images.

Parameters:

span_vectors – Precomputed embeddings over spans of a document.

Returns:

An iterable of items. Sparse signals should return “None” for skipped inputs.

vector_compute_topk(topk: int, vector_index: VectorDBIndex, rowids: Iterable[str] | None = None) list[tuple[tuple[str | int, ...], Any | None]]#

Return signal results only for the top k documents or images.

Signals decide how to rank each document/image in the dataset, usually by a similarity score obtained via the vector store.

Parameters:
  • topk – The number of items to return, ranked by the signal.

  • vector_index – The vector index used to look up precomputed embeddings.

  • rowids – Optional iterable of row ids to restrict the search to.

Returns:

A list of (key, signal_output) tuples containing the topk items. Sparse signals should return “None” for skipped inputs.

display_name: ClassVar[str] = 'Semantic Similarity'#
input_type: ClassVar[SignalInputType] = 'text'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'semantic_similarity'#
class lilac.signals.Signal#

Interface for signals to implement. A signal can score individual documents or an entire dataset column.

compute(data: Any) Any#

Compute a signal over a field.

Just like in dataset.map, if the field is a repeated field, then the repeated values will be flattened and presented as a continuous stream.

A signal can choose to return None for any input, but it must always return some value, so that alignment errors don’t occur when Lilac attempts to unflatten repeated fields.

This function is polymorphic and its signature depends on the local_batch_size class var.

If batch_size = -1 (default), then we hand you the iterator stream and you choose how to process the stream; the function signature is:

compute(self, data: Iterable[RichData]) -> Iterator[Optional[Item]]

If batch_size > 0, then we will batch items and provide them to you in the requested batch size; the function signature is:

compute(self, data: list[RichData]) -> list[Optional[Item]]

If batch_size is None, then we hand you items one by one, and the function signature is:

compute(self, data: RichData) -> Optional[Item]

Parameters:
  • data – An iterable of rich data to compute the signal over.

  • user – User information, if the user is logged in. This is useful if signals are access controlled, like concepts.

Returns:

An iterable of items. Sparse signals should return “None” for skipped inputs.

compute_garden(data: Iterator[Any]) Iterator[Any]#

Compute a signal over a field, accelerated through Lilac Garden.

This method gets an iterator of the entire data, and should return an iterator of the same length, with the processed results.

fields() Field | None#

Return the schema for the signal output. If None, the schema will be automatically inferred.

Returns:

A Field object that describes the schema of the signal.

key(is_computed_signal: bool | None = False) str#

Get the key for a signal.

This is used to make sure signals with multiple arguments do not collide.

NOTE: Overriding this method is sensitive. If you override it, make sure that it is globally unique. It will be used as the dictionary key for enriched values.

Parameters:

is_computed_signal – True when the signal is computed over the column and written to disk. False when the signal is used as a preview UDF.

serialize_model(serializer: Callable[[...], dict[str, Any]]) dict[str, Any]#

Serialize the model to a dictionary.

setup() None#

Set up the signal.

setup_garden() None#

Set up the signal for remote execution.

teardown() None#

Tears down the signal.

display_name: ClassVar[str | None]#
input_type: ClassVar[SignalInputType]#
local_batch_size: ClassVar[int | None] = -1#
local_parallelism: ClassVar[int] = 1#
local_strategy: ClassVar[Literal['processes', 'threads']] = 'threads'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str]#
output_type: ClassVar[Literal['embedding', 'cluster'] | None] = None#
supports_garden: ClassVar[bool] = False#
enum lilac.signals.SignalInputType(value)#

Enum holding the signal input type.

Member Type:

str

Valid values are as follows:

TEXT = text#
TEXT_EMBEDDING = text_embedding#
IMAGE = image#
ANY = any#
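Because the member type is str, enum members compare equal to their plain string values, which is why class attributes like input_type are shown above as bare strings such as 'text'. A minimal stand-in mirroring the documented members:

```python
from enum import Enum

class ToySignalInputType(str, Enum):
  """A stand-in for lilac.signals.SignalInputType; the member names and
  values mirror the documented enum."""
  TEXT = 'text'
  TEXT_EMBEDDING = 'text_embedding'
  IMAGE = 'image'
  ANY = 'any'

# str-backed enums compare equal to plain strings.
assert ToySignalInputType.TEXT == 'text'
assert ToySignalInputType('image') is ToySignalInputType.IMAGE
```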
class lilac.signals.SpacyNER#

Named entity recognition with SpaCy.

For details see: [spacy.io/models](https://spacy.io/models).

param model: str = 'en_core_web_sm'#
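NER output is naturally represented as labeled character spans. A toy illustration of that shape, using a capitalized-word regex in place of the spaCy pipeline (the (start, end, label) format is an assumption for illustration):

```python
import re

def toy_ner(text: str) -> list:
  """Toy 'NER' that tags capitalized words, standing in for a spaCy
  pipeline. Returns (start, end, label) character spans."""
  return [(m.start(), m.end(), 'PROPN')
          for m in re.finditer(r'\b[A-Z][a-z]+\b', text)]

text = 'Alice met Bob in Paris.'
spans = toy_ner(text)
assert [(text[s:e], label) for s, e, label in spans] == [
  ('Alice', 'PROPN'), ('Bob', 'PROPN'), ('Paris', 'PROPN')]
```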
compute(data: Iterable[str | bytes]) Iterator[Any | None]#

Compute a signal over a field.

Just like in dataset.map, if the field is a repeated field, then the repeated values will be flattened and presented as a continuous stream.

A signal can choose to return None for any input, but it must always return some value, so that alignment errors don’t occur when Lilac attempts to unflatten repeated fields.

This function is polymorphic and its signature depends on the local_batch_size class var.

If batch_size = -1 (default), then we hand you the iterator stream and you choose how to process the stream; the function signature is:

compute(self, data: Iterable[RichData]) -> Iterator[Optional[Item]]

If batch_size > 0, then we will batch items and provide them to you in the requested batch size; the function signature is:

compute(self, data: list[RichData]) -> list[Optional[Item]]

If batch_size is None, then we hand you items one by one, and the function signature is:

compute(self, data: RichData) -> Optional[Item]

Parameters:
  • data – An iterable of rich data to compute the signal over.

  • user – User information, if the user is logged in. This is useful if signals are access controlled, like concepts.

Returns:

An iterable of items. Sparse signals should return “None” for skipped inputs.

fields() Field#

Return the schema for the signal output. If None, the schema will be automatically inferred.

Returns:

A Field object that describes the schema of the signal.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Set up the signal.

display_name: ClassVar[str] = 'Named Entity Recognition'#
input_type: ClassVar[SignalInputType] = 'text'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'spacy_ner'#
class lilac.signals.TextEmbeddingSignal#

An interface for signals that compute embeddings for text.

param embed_input_type: Literal['question', 'document'] = 'document'#

The input type to the embedding.

fields() Field#

NOTE: Override this method at your own risk if you want to add extra metadata.

Embeddings should not come with extra metadata.

key(is_computed_signal: bool | None = False) str#

Get the key for an embedding. This is exactly the embedding name, regardless of whether it is computed on Lilac Garden.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

input_type: ClassVar[SignalInputType] = 'text'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

output_type: ClassVar[Literal['embedding', 'cluster'] | None] = 'embedding'#
class lilac.signals.TextSignal#

An interface for signals that compute over text.

key(is_computed_signal: bool | None = False) str#

Get the key for a signal.

This is used to make sure signals with multiple arguments do not collide.

NOTE: Overriding this method is sensitive. If you override it, make sure that it is globally unique. It will be used as the dictionary key for enriched values.

Parameters:

is_computed_signal – True when the signal is computed over the column and written to disk. False when the signal is used as a preview UDF.

input_type: ClassVar[SignalInputType] = 'text'#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

class lilac.signals.TextStatisticsSignal#

Compute text statistics for a document, such as readability scores, type-token ratio, etc.
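Type-token ratio, one of the statistics mentioned, is simple to sketch (a naive whitespace tokenizer; the real signal's tokenization may differ):

```python
def type_token_ratio(text: str) -> float:
  """Ratio of unique words (types) to total words (tokens);
  a rough measure of lexical diversity."""
  tokens = text.lower().split()
  if not tokens:
    return 0.0
  return len(set(tokens)) / len(tokens)

assert type_token_ratio('the cat sat on the mat') == 5 / 6  # 'the' repeats
assert type_token_ratio('all unique words here') == 1.0
```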

compute(docs: list[str]) list[Any | None]#

Compute a signal over a field.

Just like in dataset.map, if the field is a repeated field, then the repeated values will be flattened and presented as a continuous stream.

A signal can choose to return None for any input, but it must always return some value, so that alignment errors don’t occur when Lilac attempts to unflatten repeated fields.

This function is polymorphic and its signature depends on the local_batch_size class var.

If batch_size = -1 (default), then we hand you the iterator stream and you choose how to process the stream; the function signature is:

compute(self, data: Iterable[RichData]) -> Iterator[Optional[Item]]

If batch_size > 0, then we will batch items and provide them to you in the requested batch size; the function signature is:

compute(self, data: list[RichData]) -> list[Optional[Item]]

If batch_size is None, then we hand you items one by one, and the function signature is:

compute(self, data: RichData) -> Optional[Item]

Parameters:
  • data – An iterable of rich data to compute the signal over.

  • user – User information, if the user is logged in. This is useful if signals are access controlled, like concepts.

Returns:

An iterable of items. Sparse signals should return “None” for skipped inputs.

fields() Field#

Return the schema for the signal output. If None, the schema will be automatically inferred.

Returns:

A Field object that describes the schema of the signal.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Set up the signal.

display_name: ClassVar[str] = 'Text Statistics'#
local_batch_size: ClassVar[int] = 128#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'text_statistics'#
lilac.signals.register_embedding(embedding_cls: Type[TextEmbeddingSignal], exists_ok: bool = False) None#

Register an embedding in the global registry.

lilac.signals.register_signal(signal_cls: Type[Signal], exists_ok: bool = False) None#

Register a signal in the global registry.

Parameters:
  • signal_cls – The signal class to register.

  • exists_ok – Whether to allow overwriting an existing signal.