Compute or load embeddings#

Indexing data with embeddings allows us to search text data semantically or conceptually. This guide shows how to compute an embedding over a dataset. For more details on choosing an embedding, see Embeddings.

From the UI#

Important

Embedding indexes can be expensive to compute. However, this is a one-time task: once computed, the index lets you search semantically, or by concept.

We can compute an embedding index with one of two approaches:

1b. From the schema#

From the schema, we can open the field’s hamburger menu and click “Compute Embedding”.

From Python#

From Python, we can compute embedding indexes, or load them from an external vector store.

Computing embeddings in Lilac#

We can compute any registered embedding by using Dataset.compute_embedding. For more details on choosing an embedding, see Embeddings.

Let’s compute gte-small over the text field:

import lilac as ll

dataset = ll.get_dataset('local', 'imdb')
dataset.compute_embedding('gte-small', path='text')

Custom embeddings#

Lilac also allows you to register your own embedding function, and/or load your own embeddings from an existing vector store.

We have an accompanying notebook for this workflow.

To do this, we first must register a custom embedding function so that we can compute embeddings on future data.

To demonstrate this, we’ll load an embedding model from HuggingFace SentenceTransformers, and wrap it in a custom Lilac embedding class.

import lilac as ll
import numpy as np
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('thenlper/gte-small')


def _embed(text):
  # Call the gte-small embedding model.
  return np.array(embedding_model.encode(text))


# Make an embedding class.
class MyEmbedding(ll.TextEmbeddingSignal):
  name = 'my_embedding'

  def compute(self, data):
    for text in data:
      embedding = _embed(text)
      # Yield a single chunk embedding that spans the whole text. To chunk your
      # text, yield a list with one chunk embedding per span instead.
      yield [ll.chunk_embedding(0, len(text), embedding)]



# Register the embedding under 'my_embedding' so it can be used by Lilac.
ll.register_embedding(MyEmbedding, exists_ok=True)

Once we’ve registered the embedding, we can test it out.

print('Testing the embedding on a single item...')
print(next(MyEmbedding().compute(['This is some text'])))

Output:

[{'__span__': {'start': 0, 'end': 17}, 'embedding': array([-4.39735241e-02, -9.28446930e-03,  4.57611308e-02, -3.19548771e-02,...

We can now compute this embedding from the UI or from Python, as shown above.
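The comment in the compute method above mentions chunking. As a minimal sketch of what that could look like, the helper below splits a text into sentence-like character spans using only the standard library. The name sentence_spans and the punctuation-based splitting rule are assumptions made for illustration; they are not part of Lilac.

```python
import re


def sentence_spans(text):
  """Return (start, end) character offsets for each sentence-like chunk.

  Hypothetical helper: splits on '.', '!' and '?' and trims surrounding
  whitespace from each span.
  """
  spans = []
  for match in re.finditer(r'[^.!?]+[.!?]*', text):
    chunk = match.group()
    stripped = chunk.strip()
    if not stripped:
      continue
    # Shift the start past any leading whitespace in the matched chunk.
    start = match.start() + (len(chunk) - len(chunk.lstrip()))
    spans.append((start, start + len(stripped)))
  return spans


text = 'I love plants. They are green! Really'
for start, end in sentence_spans(text):
  print(start, end, repr(text[start:end]))
```

Inside compute, each span could then be embedded on its own and yielded as ll.chunk_embedding(start, end, embedding), following the call shown above, so that one text produces a list of chunk embeddings.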

Loading pre-computed embeddings#

We can also load pre-computed embeddings from an existing vector store by using the Dataset.load_embedding method.

Let’s first make a dummy vector store: a simple dictionary that maps each item id to its embedding.

items = [
  {'id': '0_', 'text': 'This is some fake data'},
  {'id': '1_', 'text': 'This is some more fake data'},
  {'id': '2_', 'text': 'This is even more fake data'},
  {'id': '3_', 'text': 'I love plants'},
]

vector_store = {}
for item in items:
  vector_store[item['id']] = _embed(item['text'])
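Before loading from a real vector store, it can help to confirm that every row id has a vector and that all vectors share one shape. The following is a minimal sketch of such a check. The helper check_vector_store is hypothetical and not part of Lilac, and random 8-dimensional vectors stand in for real embeddings so the example runs without a model.

```python
import numpy as np


def check_vector_store(rows, store, id_key='id'):
  """Return (missing_ids, shapes) for a dict-based vector store.

  Hypothetical helper: missing_ids lists row ids with no vector, and shapes
  is the set of distinct vector shapes found in the store.
  """
  missing_ids = [row[id_key] for row in rows if row[id_key] not in store]
  shapes = {np.asarray(vector).shape for vector in store.values()}
  return missing_ids, shapes


rng = np.random.default_rng(0)
demo_rows = [{'id': '0_'}, {'id': '1_'}, {'id': '2_'}, {'id': '3_'}]
# Deliberately leave out the last id to show what a gap looks like.
demo_store = {row['id']: rng.normal(size=8) for row in demo_rows[:-1]}

missing_ids, shapes = check_vector_store(demo_rows, demo_store)
print('Missing ids:', missing_ids)
print('Vector shapes:', shapes)
```

An empty list of missing ids and a single shape suggest the store is consistent with the dataset; anything else is worth fixing before loading.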

Now let’s load the embeddings. We’ll use a custom function that reads the id from each dataset item and looks it up in our vector store.

ds = ll.from_dicts('local', 'load_embedding', items)

# Look up the pre-computed embedding for an item by its id.
def _load_embedding(item):
  return vector_store[item['id']]

# Load the embeddings into Lilac.
ds.load_embedding(
  load_fn=_load_embedding, index_path='text', embedding='my_embedding', overwrite=True
)

Embeddings are now loaded, and can be used from the UI or from Python.

# Select rows using a semantic search.
rows = ds.select_rows(
  ['text'],
  searches=[
    ll.SemanticSearch(path='text', query='This is some data', embedding='my_embedding'),
  ],
)

for row in rows:
  print(
    row['text'],
    row['text.semantic_similarity(embedding=my_embedding,query=This is some data)'][0]['score'],
  )

Output:

Computing signal "semantic_similarity" on local/load_embedding:('text',) took 0.016s.
This is some fake data 0.9254916310310364
This is some more fake data 0.9084776043891907
This is even more fake data 0.8841889500617981
I love plants 0.7808101177215576
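To build intuition for scores like these, the sketch below ranks a few toy vectors against a query by cosine similarity using NumPy. It assumes cosine similarity as the measure, which is a common choice for semantic search; this guide does not state how Lilac computes its score, so treat the function cosine_scores and the 2-dimensional vectors as illustrative only.

```python
import numpy as np


def cosine_scores(query_vec, vectors):
  """Return the cosine similarity between query_vec and each row of vectors."""
  query_unit = query_vec / np.linalg.norm(query_vec)
  row_units = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
  return row_units @ query_unit


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])

scores = cosine_scores(query, vectors)
order = np.argsort(-scores)  # Indices sorted from most to least similar.
for index in order:
  print(index, round(float(scores[index]), 4))
```

The vector pointing in the same direction as the query scores highest, the orthogonal one scores lowest, and the diagonal one falls in between.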