# Embeddings
Embeddings are used in Lilac for Concepts, for Semantic Similarity, and for embedding-based signals.
The choice of an embedding can be crucial for a well-performing downstream signal.
Lilac has built-in on-device embeddings:
gte-small
: General Text Embeddings (GTE) model (small).

gte-base
: General Text Embeddings (GTE) model (base).

sbert
: SentenceTransformers text embeddings.
Lilac has built-in remote embeddings. Using these will send data to an external server:
openai
: OpenAI embeddings. You will need to define `OPENAI_API_KEY` in your environment variables.

cohere
: Cohere embeddings. You will need to define `COHERE_API_KEY` in your environment variables.
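One way to set these variables from Python before starting Lilac is via `os.environ` (the key values below are placeholders, not real credentials):

```python
import os

# Placeholder values; substitute your real keys before using remote embeddings.
os.environ['OPENAI_API_KEY'] = 'sk-your-openai-key'
os.environ['COHERE_API_KEY'] = 'your-cohere-key'
```

You can equally export them in your shell before launching the Lilac server.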
## Register your own embedding
You can register your own embedding in Python:
```python
class MyEmbedding(ll.TextEmbeddingSignal):
  name = 'my_embedding'

  def setup(self):
    # Do your one-time setup here.
    pass

  def compute(self, docs):
    def embed_fn(texts: list[str]):
      # Compute your embedding matrix for the batch of texts here. This returns
      # a matrix with dimensions [batch_size, embedding_dims].
      return your_embedding(texts)

    # Split the text and compute embeddings for each split.
    yield from ll.compute_split_embeddings(
      docs=docs,
      batch_size=64,
      embed_fn=embed_fn,
      # Use the lilac chunk splitter.
      split_fn=ll.split_text,
      # How many batches to request as a single unit.
      num_parallel_requests=1)

ll.register_signal(MyEmbedding)
```
After you create and register a custom embedding, you can refer to it by its name, `my_embedding`.
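The `embed_fn` you pass to `ll.compute_split_embeddings` must map a batch of strings to a matrix of shape `[batch_size, embedding_dims]`. Here is a minimal, deterministic stand-in that satisfies that contract; `toy_embed` and the 16-dimension size are illustrative only, not a real model:

```python
import hashlib

import numpy as np

EMBED_DIM = 16  # Illustrative; real models typically use e.g. 384 or 768 dims.

def toy_embed(texts: list[str]) -> np.ndarray:
  """Deterministic stand-in for a real embedding model.

  Hashes each text into a fixed-size vector and L2-normalizes it, so the
  result has the [batch_size, embedding_dims] shape embed_fn must return.
  """
  rows = []
  for text in texts:
    digest = hashlib.sha256(text.encode('utf-8')).digest()
    vec = np.frombuffer(digest[:EMBED_DIM], dtype=np.uint8).astype(np.float32)
    rows.append(vec / np.linalg.norm(vec))
  return np.stack(rows)

batch = ['hello world', 'lilac embeddings']
matrix = toy_embed(batch)
print(matrix.shape)  # (2, 16)
```

In a real signal you would replace the hashing with a call to your model; the only requirement Lilac imposes on `embed_fn` is the batched input and the matrix-shaped output.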