Quick Start#

Overview#

In this quick start we’re going to:

  • Load OpenOrca, a popular instruction dataset for tuning LLMs.

  • Compute clusters.

  • Delete specific clusters.

  • Find profanity in the remaining rows (using powerful text embeddings).

  • Download the enriched dataset as a JSON file so we can clean it in a Python notebook.

Start the web server#

Start a new Lilac project.

pip install lilac[all]

lilac start ~/my_project

This should open a browser tab pointing to http://localhost:5432.

Add a dataset#

Let’s load OpenOrca, a popular instruction dataset used for tuning LLMs.

Click the Add dataset button on the Getting Started page and fill in:

  1. The dataset name in the Lilac project: open-orca-10k

  2. Select the huggingface dataset loader checkbox

Fill in HuggingFace-specific fields:

  1. HuggingFace dataset name: Open-Orca/OpenOrca

  2. Sample size: 10000

Note

Lilac’s sweet spot is ~100K-1M rows of data, although up to 10 million rows are possible. This quickstart uses 10,000 rows so that clustering and embedding operations finish locally in ~10 minutes even without a GPU.

Finally:

  1. Click the “Add” button at the bottom.

Note

See the console output to track progress of the dataset download from HuggingFace.
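The same Add dataset step can also be scripted. The sketch below shows the equivalent Python configuration, assuming lilac is installed and the project directory from above; treat it as a configuration fragment (calling create_dataset triggers the actual download), and check the Lilac API reference for exact signatures.

```python
import lilac as ll

# Point Lilac at the project directory started above.
ll.set_project_dir('~/my_project')

# Equivalent of the "Add dataset" form: the HuggingFace loader,
# sampled down to 10,000 rows.
config = ll.DatasetConfig(
    namespace='local',
    name='open-orca-10k',
    source=ll.HuggingFaceSource(
        dataset_name='Open-Orca/OpenOrca',
        sample_size=10_000))

# Creating the dataset triggers the download from HuggingFace.
dataset = ll.create_dataset(config)
```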

Configure#

When we load a dataset, Lilac creates a default UI configuration, inferring which fields are media (e.g. unstructured documents), and which are metadata fields. The two types of fields are presented differently in the UI.

Let’s edit the configuration by clicking the Dataset settings button in the top-right corner. If your media field contains markdown, you can enable markdown rendering.

Cluster#

Lilac can detect clusters in your dataset. Clusters are a powerful way to understand the types of content present in your dataset, as well as to target subsets for removal.

To cluster, open up the dataset schema tray to reveal the fields in your dataset. Here, you can choose which field will get clustered.

The cluster visualizer shows two hierarchical levels of clusters by default. You can also group over other fields in your dataset by changing the Explore and Group By selections.
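Conceptually, the visualizer's two levels are a group-by over per-row cluster labels. Here is a toy, stdlib-only sketch with made-up category and cluster titles (the field names are illustrative, not Lilac's internal schema):

```python
from collections import defaultdict

# Made-up rows with two hierarchical cluster levels, mirroring the
# visualizer: a broad category and a finer cluster title.
rows = [
    {'id': 1, 'category': 'Math', 'cluster': 'Arithmetic word problems'},
    {'id': 2, 'category': 'Math', 'cluster': 'Algebra'},
    {'id': 3, 'category': 'Translation', 'cluster': 'English to French'},
]

# Group row ids under (category, cluster) pairs.
groups = defaultdict(list)
for row in rows:
    groups[(row['category'], row['cluster'])].append(row['id'])

for (category, cluster), ids in sorted(groups.items()):
    print(category, '/', cluster, '->', ids)
```

Changing the Explore and Group By selections in the UI amounts to swapping which fields play the role of `category` and `cluster` here.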

Tagging and deleting rows#

Lilac can curate your dataset by tagging or deleting rows.

Deleting is not permanent (you can toggle the visibility of deleted items), but it is a convenient way to iterate on your dataset by removing undesired slices of data. Later, when you export data from Lilac, deleted rows are excluded by default.
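Excluding deleted rows at export time is, conceptually, just a filter over a per-row flag. A minimal stdlib sketch of that behavior (the `deleted` key is illustrative, not Lilac's internal field name):

```python
# Toy rows standing in for dataset items; `deleted` is an
# illustrative flag, not Lilac's internal field name.
rows = [
    {'question': 'q1', 'response': 'r1', 'deleted': False},
    {'question': 'q2', 'response': 'r2', 'deleted': True},
    {'question': 'q3', 'response': 'r3', 'deleted': False},
]

def export_rows(rows, include_deleted=False):
    """Mimic export: drop deleted rows unless explicitly included."""
    return [r for r in rows if include_deleted or not r['deleted']]

print(len(export_rows(rows)))                        # 2
print(len(export_rows(rows, include_deleted=True)))  # 3
```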

Enrich#

Lilac can enrich your media fields with additional metadata by:

  • Running a signal (e.g. PII detection, language detection, text statistics, etc.)

  • Running a concept (e.g. profanity, sentiment, etc. or a custom concept that you create)

Profanity detection#

Let’s run the profanity concept on the response field to see if the LLM produced any profane content. To see the results, we need to index the response field using a text embedding; this indexing only needs to happen once. For a fast on-device embedding, we recommend the GTE-Small embedding.

It takes ~20 minutes to index the 10,000 responses on a MacBook M1. Once the field is indexed, we can do semantic search and concept search on it (in addition to the usual keyword search).
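Under the hood, semantic search over an indexed field boils down to comparing the query embedding against each document embedding, commonly by cosine similarity. A toy stdlib sketch with made-up 3-dimensional vectors standing in for GTE-Small outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Made-up embeddings standing in for indexed responses.
docs = {
    'resp1': [0.9, 0.1, 0.0],
    'resp2': [0.0, 1.0, 0.2],
}
query = [1.0, 0.0, 0.1]

# Semantic search: rank documents by similarity to the query.
best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)  # resp1
```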

Now let’s search by the profanity concept. (Note that matching results may contain sensitive content.)

Concepts run in preview mode by default, where concept scores are computed only for the top K results. To compute the concept score over the entire dataset, click the blue Compute signal button next to lilac/profanity/gte-small in the schema.
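The preview-mode idea, scoring only the current top K search results instead of every row, can be sketched with stdlib tools (the scores and the scoring function below are made up for illustration):

```python
import heapq

# Made-up similarity scores from a semantic search over the field.
search_scores = {'row1': 0.91, 'row2': 0.12, 'row3': 0.75, 'row4': 0.40}

def expensive_concept_score(row_id):
    # Stand-in for running the concept model on one row.
    return search_scores[row_id] * 0.9

# Preview mode: run the expensive scorer only on the top K rows.
K = 2
top_k = heapq.nlargest(K, search_scores, key=search_scores.get)
preview = {row_id: expensive_concept_score(row_id) for row_id in top_k}
print(sorted(preview))  # ['row1', 'row3']
```

Clicking Compute signal corresponds to dropping the top-K restriction and scoring every row.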

Computing the concept over the whole dataset takes ~20 seconds on a MacBook M1. Once it finishes, we can open the statistics panel to see the distribution of concept scores.

Download#

Now that we’ve clustered, curated, and enriched the dataset, let’s download it by clicking the Download data button in the top-right corner. This downloads a JSON file with the same name as the dataset. From there, we can continue working with the data in a Python notebook or any other environment.
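Once downloaded, the export is ordinary JSON. A minimal sketch of loading it in a notebook and keeping only low-profanity rows; the inline sample, field name, and 0.5 threshold are assumptions for illustration, not Lilac's actual export schema:

```python
import json

# Inline stand-in for the downloaded export file; the field name
# `profanity_score` and the values are hypothetical.
raw = '''[
  {"response": "a polite answer", "profanity_score": 0.02},
  {"response": "a rude answer", "profanity_score": 0.97}
]'''

rows = json.loads(raw)

# Keep only rows below an (assumed) profanity threshold.
clean = [r for r in rows if r['profanity_score'] < 0.5]
print(len(clean))  # 1
```

In a real notebook you would replace `raw` with `open('open-orca-10k.json').read()` (or whatever your export is named) and inspect the exported schema before filtering.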

You can also get the dataset as a Pandas dataframe through the Python API.
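Assuming the project and dataset names from this walkthrough, the Pandas route looks roughly like this; check the Lilac API reference for exact signatures.

```python
import lilac as ll

# Point Lilac at the project directory used throughout this quickstart.
ll.set_project_dir('~/my_project')

# Fetch the dataset created earlier and materialize it as a DataFrame.
df = ll.get_dataset('local', 'open-orca-10k').to_pandas()
print(df.shape)
```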