Edit a dataset#
Note
This page goes into the technical details of editing a dataset in Lilac. For a real-world example, see the blog post on Curate a coding dataset with Lilac.
Once you bring the data into Lilac, you can start editing it. The main edit operation is creating a new column via Dataset.map, which takes API inspiration from HuggingFace’s Dataset.map(). The function being mapped runs against every row in a table, enriching that data with new information. For example, we can call GPT to extract structure from a piece of text, or rewrite a piece of text in a different style or language.
The benefits of using Lilac’s map include the ability to track lineage information for every computed column, and the ability to resume computation if the processing fails mid-way. We are also working on the ability to write to an existing column, undo and redo edits, and see a history of all the edits made.
Dataset.map#
Dataset.map is the main vehicle for processing data in Lilac.
It is fundamentally a row-oriented operation. As a text-oriented tool, Lilac assumes that text processing (via LLMs, etc.) is more expensive than memory access. It is not like pandas, which optimizes for vectorized numerical computation over rows, columns, or entire dataframes.
- Lilac will save your map progress: if the map fails mid-way (e.g. with an exception, or your computer dies), you can resume computation without losing any intermediate results. This is important when the map function is expensive or slow (e.g. calling GPT to edit data, or calling an expensive embedding model).
- map operates seamlessly over repeated subfields: your map only ever sees a flattened stream of items, and Lilac keeps track of the association between each item and its source row.
- The output of Lilac’s map is always written to a new column in the same dataset. We’re working on versioned columns, which will allow you to overwrite the same column while being able to compare/undo the change.
- While the computation is running, the Lilac UI will show a progress bar. When it completes, the UI will auto-refresh and we can use the new column (see statistics, filter by values, etc.).
Example usage#
Let’s start with a simple example where we add a Q: prefix to each question in the dataset. By default, Lilac reads the entire row and your map function will receive the row as a dictionary.
import lilac as ll
items = [{'question': 'A'}, {'question': 'B'}, {'question': 'C'}]
dataset = ll.from_dicts('local', 'questions', items)
def add_prefix(item):
  return 'Q: ' + item['question']
dataset.map(add_prefix, output_path='question_prefixed')
dataset.to_pandas()
question question_prefixed
0 A Q: A
1 B Q: B
2 C Q: C
input_path#
If we want to map over a single field, we can provide input_path. Let’s tell Lilac to only read the question field. The map will no longer see the entire row, but just a single value from the field we care about:
def add_prefix(question):
  return 'Q: ' + question
dataset.map(add_prefix, input_path='question', output_path='question_prefixed2')
dataset.to_pandas()
question question_prefixed question_prefixed2
0 A Q: A Q: A
1 B Q: B Q: B
2 C Q: C Q: C
input_path is very useful for:
- keeping the map code reusable, by decoupling the processing logic from the input schema.
- transforming repeated fields, by letting Lilac handle flattening and unflattening. Your map code will only ever see the flattened stream of fields.
Let’s make a new dataset with a nested list of questions:
items = [
{'questions': ['A', 'B']},
{'questions': ['C']},
{'questions': ['D', 'E']},
]
dataset = ll.from_dicts('local', 'nested_questions', items)
dataset.to_pandas()
questions
0 [A, B]
1 [C]
2 [D, E]
Let’s do the map again, but this time we’ll use input_path=('questions', '*') to tell Lilac to map over each individual item in the questions list. This is equivalent to mapping over the flattened list ['A', 'B', 'C', 'D', 'E'].
def add_prefix(question):
  return 'Q: ' + question
dataset.map(add_prefix, input_path=('questions', '*'), output_path='questions_prefixed')
dataset.to_pandas()
questions questions_prefixed
0 [A, B] [Q: A, Q: B]
1 [C] [Q: C]
2 [D, E] [Q: D, Q: E]
We can see that the questions_prefixed column is a nested list with the same nesting as the questions column.
output_path#
To test the map function while developing, without writing to a new column, we can omit the output_path argument and print the result of map (here we use the original questions dataset again, whose rows have a single question field):
result = dataset.map(add_prefix, input_path='question')
print(list(result))
> ['Q: B', 'Q: C', 'Q: A']
Structured output#
Often our maps will output multiple values for a given item, e.g. when calling GPT to extract structure from a piece of text. If the output of the map function is a dict, Lilac will automatically unpack the dict and create nested columns under the output_path. This is useful when we want to output multiple values for a single input item. For example, we can use a map function to compute the length of each question, and whether it ends with a question mark:
items = [
{'question': 'How are you today?'},
{'question': 'What kind of food'},
{'question': 'Are you sure?'},
]
dataset = ll.from_dicts('local', 'questions3', items)
def enrich(question):
  return {'length': len(question), 'ends_with_?': question[-1] == '?'}
dataset.map(enrich, input_path='question', output_path='metadata')
dataset.to_pandas()
question metadata.length metadata.ends_with_?
0 How are you today? 18 True
1 What kind of food 17 False
2 Are you sure? 13 True
If we start the Lilac web server via ll.start_server() and open the browser, we can see the new column statistics in the UI and filter by their values.
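For reference, starting the server from Python is a single call:
import lilac as ll

# Start the Lilac web server; the UI is then available in the browser.
ll.start_server()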
Batching#
When setting the batch_size kwarg, map will provide your function with a list of batch_size items at once. You may receive a partial batch at the end of the map, so your function should handle receiving a batch smaller than batch_size.
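As a sketch under those assumptions (reusing the flat questions dataset from the first example; the function and output column names are illustrative, and we assume the function returns one output per input item):
def add_prefix_batched(questions: list[str]) -> list[str]:
  # Receives up to batch_size items at once; the final batch may be smaller.
  return ['Q: ' + q for q in questions]

dataset.map(add_prefix_batched, input_path='question', batch_size=2,
            output_path='question_prefixed_batched')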
You may also set batch_size=-1 in order to receive the entire dataset as a single batch. This may be useful if some computation requires seeing all rows at once, for example a duplicate text detector. This mode will load your entire dataset (or a single column of it if input_path is specified) into memory, so please ensure that you have sufficient memory on your machine.
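For the whole-dataset case, a hedged sketch of a simple duplicate flagger (again on the questions dataset; the names are illustrative):
def flag_duplicates(questions: list[str]) -> list[bool]:
  # With batch_size=-1, the entire 'question' column arrives as one list,
  # so every value can be compared against the values seen before it.
  seen = set()
  flags = []
  for q in questions:
    flags.append(q in seen)
    seen.add(q)
  return flags

dataset.map(flag_duplicates, input_path='question', batch_size=-1, output_path='is_duplicate')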
Filtering and limiting#
To run a preview computation on a few rows and sanity-check your function, run map with limit=5.
To run a computation on a subset of rows, you can pass a set of Filters. For example, to limit your map to longer strings, you could run map(fn, filters=[Filter(path='column', op='length_greater', value=20)]). Multiple filters are combined with AND: only rows matching all provided filters will be mapped.
items = [
{'question': 'A', 'source': 'foo'},
{'question': 'B', 'source': 'bar'},
{'question': 'C', 'source': 'bar'}
]
dataset = ll.from_dicts('local', 'questions', items, overwrite=True)
result = dataset.map(
  lambda x: x['question'].lower(),
  filters=[ll.Filter(path=('source',), op='equals', value='bar')],
  limit=1)
print(list(result))
> ['b']
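To illustrate the AND behavior, here is a sketch with two filters (the second filter's value is chosen purely for illustration):
result = dataset.map(
  lambda x: x['question'].lower(),
  filters=[
    ll.Filter(path=('source',), op='equals', value='bar'),
    ll.Filter(path=('question',), op='equals', value='B'),
  ])
print(list(result))
# Only the row with source == 'bar' AND question == 'B' is mapped,
# so we would expect ['b'].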
Parallelism#
By default Lilac will run the map on a single thread. To speed up computation, we can provide execution_type and num_jobs. The execution type can be either threads or processes. Threads are better for network-bound tasks, like making requests to an external server, while processes are better for CPU-bound tasks, like running a local LLM.
The number of jobs defaults to the number of physical cores on the machine. However, if our map function is making requests to an external server, we can increase the number of jobs to reach the desired number of requests per second.
def compute(text_batch: list[str]) -> list[str]:
  # Make a single request to an external server for the whole batch.
  ...

dataset.map(compute, batch_size=32, input_path='question', execution_type='threads', num_jobs=10)
Assuming a latency of 100ms per request, we can expect to make 10 requests per second with a single job, and 100 requests per second with 10 jobs.
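As a concrete, hypothetical stand-in for the external request, we can simulate that latency with a sleep (the output column name and sleep duration are illustrative):
import time

def compute(text_batch: list[str]) -> list[str]:
  # Stand-in for a network call: roughly 100ms of latency per batch.
  time.sleep(0.1)
  return [text.upper() for text in text_batch]

dataset.map(compute, batch_size=32, input_path='question',
            execution_type='threads', num_jobs=10, output_path='question_upper')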
Annotations#
Often our map will extract relevant information that we want to associate with the input text. For example, when detecting company names, we want to know the location where each company name was found. We can do this by using the lilac.span annotation.
import re
items = [
{'text': 'Company name: Apple Inc.\n Apple Inc is a ...'},
{'text': 'Google LLC is a ... Company name: Google LLC'},
{'text': 'There is no company name here'},
]
dataset = ll.from_dicts('local', 'company', items)
def extract_company(text):
  pattern = r'Company name: (.*)?\s'
  matches = re.finditer(pattern, text)
  return [ll.span(m.span(1)[0], m.span(1)[1]) for m in matches]
dataset.map(extract_company, input_path='text', output_path='company')
dataset.to_pandas()
Lilac will then highlight the spans in the UI when we filter by the company column.
Nested output_path#
If we want to nest the output of a map under an existing column, we can prefix the output_path with the name of the existing column:
items = [
{'question': 'How are you today?'},
{'question': 'What kind of food'},
{'question': 'Are you sure?'},
]
dataset = ll.from_dicts('local', 'questions3', items, overwrite=True)
def enrich(question):
  return {'length': len(question), 'ends_with_?': question[-1] == '?'}
dataset.map(enrich, input_path='question', output_path='question.metadata')
dataset.to_pandas()
question question.metadata.length question.metadata.ends_with_?
0 How are you today? 18 True
1 What kind of food 17 False
2 Are you sure? 13 True