In this quick start we’re going to:

- Load OpenOrca, a popular instruction dataset for tuning LLMs.
- Cluster the dataset and delete specific clusters.
- Find profanity in the remaining rows (using powerful text embeddings).
- Download the enriched dataset as a JSON file so we can clean it in a Python notebook.
Start the web server#
Start a new Lilac project.
```sh
pip install lilac[all]
lilac start ~/my_project
```
This should open a browser tab pointing to http://localhost:5432.
Add a dataset#
Let’s load OpenOrca, a popular instruction dataset used for tuning LLMs.
Click the Add dataset button on the Getting Started page and fill in:

- The dataset name in the Lilac project:
- Select the huggingface dataset loader checkbox.
- Fill in the HuggingFace-specific fields:
  - HuggingFace dataset name:
  - Sample size: 10000
Lilac’s sweet spot is ~100K-1M rows of data, although up to 10 million rows are possible. This quickstart uses 10,000 rows so that clustering and embedding operations finish locally in ~10 minutes even without a GPU.
Click the “Add” button at the bottom.
See the console output to track progress of the dataset download from HuggingFace.
When we load a dataset, Lilac creates a default UI configuration, inferring which fields are media (e.g. unstructured documents), and which are metadata fields. The two types of fields are presented differently in the UI.
Let’s edit the configuration by clicking the Dataset settings button in the top-right corner. If your media field contains markdown, you can enable markdown rendering.
Clustering#
Lilac can detect clusters in your dataset. Clusters are a powerful way to understand the types of content present in your dataset, as well as to target subsets for removal from the dataset.
To cluster, open up the dataset schema tray to reveal the fields in your dataset. Here, you can choose which field will get clustered.
The cluster visualizer shows two hierarchical levels of clusters by default. You can also group over other fields in your dataset by changing the Explore and Group By selections.
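For intuition, each clustered row carries a category title and a cluster title, corresponding to the two hierarchical levels shown in the visualizer. A minimal sketch of grouping rows by those two levels, using made-up rows and hypothetical field names (the exact names Lilac emits may differ):

```python
from collections import defaultdict

# Hypothetical rows as they might look after clustering; field names are
# illustrative, not Lilac's actual schema.
rows = [
    {"response": "...", "category_title": "Math", "cluster_title": "Arithmetic word problems"},
    {"response": "...", "category_title": "Math", "cluster_title": "Geometry proofs"},
    {"response": "...", "category_title": "Translation", "cluster_title": "English to French"},
]

# Build the two-level hierarchy: category -> cluster -> rows.
hierarchy = defaultdict(lambda: defaultdict(list))
for row in rows:
    hierarchy[row["category_title"]][row["cluster_title"]].append(row)

for category, clusters in hierarchy.items():
    print(category, {name: len(items) for name, items in clusters.items()})
```

Grouping over a different field in the UI corresponds to swapping the outer key in this sketch.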
Tagging and deleting rows#
Lilac can curate your dataset by tagging or deleting rows.
Deleting is not permanent - you can toggle visibility of deleted items - but it is a convenient way to iterate on your dataset by removing undesired slices of data. Later on, when you export data from Lilac, deleted rows will be excluded by default.
Enrich#
Lilac can enrich your media fields with additional metadata by:

- Running a signal (e.g. PII detection, language detection, text statistics, etc.)
- Running a concept (e.g. profanity, sentiment, etc., or a custom concept that you create)
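For intuition, a signal is just a function that maps each media field value to structured metadata, which Lilac then attaches as new columns. A hand-rolled stand-in for a text-statistics-style signal (not Lilac’s actual implementation) could look like:

```python
def text_statistics(text: str) -> dict:
    """Toy per-row signal: derive simple metadata from a text field."""
    words = text.split()
    return {
        "num_characters": len(text),
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

# Applied to every row, the signal's output becomes metadata you can
# filter and sort on in the UI.
print(text_statistics("The quick brown fox jumps over the lazy dog."))
```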
Let’s also run the profanity concept on the response field to see if the LLM produced any profane content. To see the results, we need to index the response field using a text embedding; we only need to index once. For a fast on-device embedding, we recommend the gte-small embedding.
Indexing the 10,000 responses takes a few minutes on a MacBook M1. Now that the field is indexed, we can do semantic search and concept search on the field (in addition to the usual keyword search).
Let’s search by the profanity concept and see if the LLM produced any profane content. Results in the video are blurred due to sensitive content.
Concepts run in preview mode by default, where concept scores are computed only for the top K results. To compute the concept score over the entire dataset, we click the blue button next to lilac/profanity/gte-small in the schema.
Computing the concept takes ~20 seconds on a MacBook M1 laptop. Now that the concept is computed, we can open the statistics panel to see the distribution of concept scores.
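To build intuition for what a concept score is: each row is embedded as a vector, and the concept acts as a lightweight classifier over those vectors. A toy stand-in that ranks rows by cosine similarity to a single “concept direction” (real concepts are trained models, and the 3-d vectors below are made up — real embeddings like gte-small have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up embeddings for illustration only.
concept_direction = [1.0, 0.0, 0.0]
rows = {
    "row_a": [0.9, 0.1, 0.0],  # close to the concept direction
    "row_b": [0.0, 1.0, 0.2],  # unrelated content
}

scores = {rid: cosine(vec, concept_direction) for rid, vec in rows.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Preview mode corresponds to scoring only the top K rows from the index; the blue compute button scores every row.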
Download#
Now that we’ve clustered, curated, and enriched the dataset, let’s download it by clicking the Download data button in the top-right corner. This will download a JSON file with the same name as the dataset. Once we have the data, we can continue working with it in a Python notebook or any other tool.
You can also get the dataset as a Pandas dataframe through the Python API.
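The downloaded file is a JSON list of row objects. A minimal sketch of continuing the cleanup in a notebook with just the standard library — note that the `profanity_score` field name below is an assumption for illustration; check the actual column names in your export:

```python
import json

# In a notebook you would read the exported file, e.g.:
#   with open("open-orca.json") as f:
#       rows = json.load(f)
# Inline sample data here so the sketch is self-contained; the
# "profanity_score" field name is hypothetical.
rows = json.loads("""[
  {"question": "...", "response": "ok text", "profanity_score": 0.02},
  {"question": "...", "response": "bad text", "profanity_score": 0.91}
]""")

# Keep only rows the profanity concept scored below a threshold.
THRESHOLD = 0.5
clean = [row for row in rows if row.get("profanity_score", 0.0) < THRESHOLD]
print(f"kept {len(clean)} of {len(rows)} rows")
```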