lilac.sources: Ingestion#

Sources for ingesting data into Lilac.

class lilac.sources.CSVSource#

CSV or TSV data loader

CSV or TSV files can live locally as a filepath, point to an external URL, or live on S3, GCS, or R2.

For TSV files, use \t as the delimiter.

For more details on authorizing access to S3, GCS or R2, see: https://duckdb.org/docs/guides/import/s3_import.html

param delim: str | None = ','#

The CSV file delimiter to use. For TSV files, use \t.

param filepaths: list[str] [Required]#

A list of paths to CSV files. Paths can be local, point to an HTTP(S) URL, or live on GCS, S3 or R2.

param header: bool | None = True#

Whether the CSV file has a header row.

param names: list[str] = []#

Provide header names if the file does not contain a header.
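
For example, a minimal sketch of ingesting a local CSV file (the namespace, dataset name, and file path below are placeholders):

```python
import lilac as ll

# Ingest a local CSV file. For a headerless TSV file, you would instead pass
# delim='\t', header=False, and names=[...].
ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='my_csv_dataset',
    source=ll.CSVSource(filepaths=['./data/my_file.csv']),
  )
)
```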

load_to_parquet(output_dir: str, task_id: str | None) SourceManifest#

Process the source by directly writing a parquet file.

You should only override one of yield_items or load_to_parquet.

This fast path exists for sources where we are able to avoid the overhead of creating python dicts for every row by using non-Python parquet writers like DuckDB.

The output parquet files should have a {schema.ROWID} column defined. This ROWID should not be part of source_schema, however.

Finally, self.source_schema().num_items is usually computed in setup(), but can be left as None for sources implementing load_to_parquet, since this count is only used to display a progress bar. load_to_parquet doesn’t have progress bar support so the count is unnecessary. However, you must still keep track of how many items were processed in total, because fields like self.sample_size should reflect the actual size of the dataset if len(dataset) < sample_size.

Parameters:
  • output_dir – The directory to write the parquet files to.

  • task_id – The TaskManager id for this task. This is used to update the progress of the task.

Returns:

A SourceManifest that describes schema and parquet file locations.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Prepare the source for processing.

This allows the source to do setup outside the constructor, but before it is processed. This avoids potentially expensive computation when the pydantic model is deserialized.

source_schema() SourceSchema#

Return the source schema.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'csv'#
class lilac.sources.DictSource#

Loads data from an iterable of dict objects.
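
A minimal sketch of ingesting an in-memory list of dicts; it assumes the iterable is passed as an items argument, so check the constructor signature in your Lilac version:

```python
import lilac as ll

items = [
  {'question': 'What is Lilac?', 'answer': 'A tool for exploring and curating datasets.'},
  {'question': 'What is a source?', 'answer': 'A data loader used for ingestion.'},
]

ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='qa_pairs',
    # Passing the iterable as `items=` is an assumption; see the class signature.
    source=ll.DictSource(items=items),
  )
)
```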

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Prepare the source for processing.

This allows the source to do setup outside the constructor, but before it is processed. This avoids potentially expensive computation when the pydantic model is deserialized.

source_schema() SourceSchema#

Return the source schema.

yield_items() Iterable[Any]#

Process the source request.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'dict'#
class lilac.sources.GithubSource#

GitHub source code loader

Loads source code from GitHub repositories using the LlamaIndex GithubRepositoryReader.

Each file becomes a separate row.

The following extensions are automatically ignored as Lilac does not yet support media:

.png, .jpg, .jpeg, .gif, .mp4, .mov, .avi

param branch: str | None = 'main'#

The branch to load from. Defaults to the main branch.

param github_token: str | None = None#

The GitHub token to use for authentication. If not specified, uses the GITHUB_TOKEN environment variable.

param ignore_directories: list[str] | None = None#

A list of directories to ignore. Can only be used if filter_directories is not specified.

param ignore_file_extensions: list[str] | None = None#

A list of file extensions to ignore. Can only be used if filter_file_extensions is not specified.

param repo: str [Required]#

The GitHub repository to load from. Format: <owner>/<repo>.
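
A minimal sketch of loading a repository (the repository name and ignore list are placeholders; GITHUB_TOKEN can also be supplied via the environment instead of the github_token parameter):

```python
import lilac as ll

ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='lilac_source_code',
    source=ll.GithubSource(
      repo='lilacai/lilac',         # <owner>/<repo>
      branch='main',
      ignore_directories=['docs'],  # cannot be combined with filter_directories
    ),
  )
)
```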

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

scrub_github_token(github_token: str) str#

Scrubs the github token so it isn’t stored on disk.

setup() None#

Prepare the source for processing.

This allows the source to do setup outside the constructor, but before it is processed. This avoids potentially expensive computation when the pydantic model is deserialized.

source_schema() SourceSchema#

Return the source schema.

yield_items() Iterable[Any]#

Read from GitHub.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'github'#
class lilac.sources.GmailSource#

Connects to your Gmail and loads the text of your emails.

One-time setup

Download the OAuth credentials file from the [Google Cloud Console](https://console.cloud.google.com/apis/credentials) and save it to the correct location. See [guide](https://developers.google.com/gmail/api/quickstart/python#authorize_credentials_for_a_desktop_application) for details.

param credentials_file: str [Optional]#

Path to the OAuth credentials file. Defaults to .gmail/credentials.json in your Lilac project directory.
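
A minimal sketch, assuming the one-time setup above has been completed and the credentials file lives at the default location:

```python
import lilac as ll

ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='my_emails',
    # credentials_file defaults to .gmail/credentials.json in the project directory.
    source=ll.GmailSource(),
  )
)
```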

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Prepare the source for processing.

This allows the source to do setup outside the constructor, but before it is processed. This avoids potentially expensive computation when the pydantic model is deserialized.

source_schema() SourceSchema#

Return the source schema for this source.

Returns:

A SourceSchema with:
  • fields – a mapping of top-level columns to fields that describes the schema of the source.

  • num_items – the number of items in the source, used for progress.

yield_items() Iterable[Any]#

Process the source by yielding individual rows of the source data.

You should only override one of yield_items or load_to_parquet.

This method is easier to use, and simply requires you to return an iterator of Python dicts. Lilac will take your iterable of items and handle writing it to parquet. You will still have to override source_schema.
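
As a rough illustration of this contract (not specific to the Gmail source), a hypothetical source could override only setup, source_schema, and yield_items. The import path and the field() helper below are assumptions to verify against your Lilac version:

```python
from typing import ClassVar, Iterable

import lilac as ll
from lilac.source import Source, SourceSchema  # assumed import path


class GreetingSource(Source):
  """A hypothetical source that yields a small, fixed list of rows."""

  name: ClassVar[str] = 'greeting'

  _items: list[dict] = []

  def setup(self) -> None:
    # Expensive preparation belongs here, not in the constructor.
    self._items = [{'text': 'hello'}, {'text': 'world'}]

  def source_schema(self) -> SourceSchema:
    # Map top-level columns to fields; num_items is used for the progress bar.
    return SourceSchema(fields={'text': ll.field('string')}, num_items=len(self._items))

  def yield_items(self) -> Iterable[dict]:
    # Lilac handles writing these dicts to parquet.
    yield from self._items
```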

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'gmail'#
class lilac.sources.HuggingFaceSource#

HuggingFace data loader

For a list of datasets see: [huggingface.co/datasets](https://huggingface.co/datasets).

For documentation on dataset loading see:

[huggingface.co/docs/datasets/index](https://huggingface.co/docs/datasets/index)

param config_name: str | None = None#

The name of the dataset configuration. Some datasets require this.

param dataset: Dataset | DatasetDict | None = None#

The pre-loaded HuggingFace dataset

param dataset_name: str [Required]#

Either in the format user/dataset or dataset.

param load_from_disk: bool | None = False#

Load from local disk instead of the hub.

param revision: str | None = None#
param sample_size: int | None = None#

Number of rows to sample from the dataset, for each split.

param split: str | None = None#

Loads all splits by default.

param token: str | None = None#

Huggingface token for private datasets.
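
For example, a minimal sketch of loading a public dataset from the HuggingFace hub (the dataset name and sample size are placeholders; pass token for private datasets):

```python
import lilac as ll

ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='imdb',
    source=ll.HuggingFaceSource(dataset_name='imdb', split='train', sample_size=10_000),
  )
)
```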

load_to_parquet(output_dir: str, task_id: str | None) SourceManifest#

Process the source by directly writing a parquet file.

You should only override one of yield_items or load_to_parquet.

This fast path exists for sources where we are able to avoid the overhead of creating python dicts for every row by using non-Python parquet writers like DuckDB.

The output parquet files should have a {schema.ROWID} column defined. This ROWID should not be part of source_schema, however.

Finally, self.source_schema().num_items is usually computed in setup(), but can be left as None for sources implementing load_to_parquet, since this count is only used to display a progress bar. load_to_parquet doesn’t have progress bar support so the count is unnecessary. However, you must still keep track of how many items were processed in total, because fields like self.sample_size should reflect the actual size of the dataset if len(dataset) < sample_size.

Parameters:
  • output_dir – The directory to write the parquet files to.

  • task_id – The TaskManager id for this task. This is used to update the progress of the task.

Returns:

A SourceManifest that describes schema and parquet file locations.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Prepare the source for processing.

This allows the source to do setup outside the constructor, but before it is processed. This avoids potentially expensive computation when the pydantic model is deserialized.

source_schema() SourceSchema#

Return the source schema for this source.

Returns:

A SourceSchema with:
  • fields – a mapping of top-level columns to fields that describes the schema of the source.

  • num_items – the number of items in the source, used for progress.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'huggingface'#
class lilac.sources.JSONSource#

JSON data loader

Supports both JSON and JSONL.

JSON files can live locally as a filepath, point to an external URL, or live on S3, GCS, or R2.

For more details on authorizing access to S3, GCS or R2, see: https://duckdb.org/docs/guides/import/s3_import.html

param filepaths: list[str] [Required]#

A list of filepaths to JSON files. Paths can be local, point to an HTTP(S) URL, or live on GCS, S3 or R2.

param sample_size: int | None = None#

Number of rows to sample from the dataset.
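
A minimal sketch of ingesting a local JSONL file (the dataset name and file path are placeholders; paths may also be HTTP(S), S3, GCS, or R2 URLs):

```python
import lilac as ll

ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='my_json_dataset',
    source=ll.JSONSource(filepaths=['./data/records.jsonl'], sample_size=1_000),
  )
)
```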

load_to_parquet(output_dir: str, task_id: str | None) SourceManifest#

Process the source by directly writing a parquet file.

You should only override one of yield_items or load_to_parquet.

This fast path exists for sources where we are able to avoid the overhead of creating python dicts for every row by using non-Python parquet writers like DuckDB.

The output parquet files should have a {schema.ROWID} column defined. This ROWID should not be part of source_schema, however.

Finally, self.source_schema().num_items is usually computed in setup(), but can be left as None for sources implementing load_to_parquet, since this count is only used to display a progress bar. load_to_parquet doesn’t have progress bar support so the count is unnecessary. However, you must still keep track of how many items were processed in total, because fields like self.sample_size should reflect the actual size of the dataset if len(dataset) < sample_size.

Parameters:
  • output_dir – The directory to write the parquet files to.

  • task_id – The TaskManager id for this task. This is used to update the progress of the task.

Returns:

A SourceManifest that describes schema and parquet file locations.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Prepare the source for processing.

This allows the source to do setup outside the constructor, but before it is processed. This avoids potentially expensive computation when the pydantic model is deserialized.

source_schema() SourceSchema#

Return the source schema.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'json'#
class lilac.sources.LangSmithSource#

LangSmith data loader.

param dataset_name: str [Required]#

LangSmith dataset name
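
A minimal sketch (the dataset name is a placeholder; the LangSmith client is typically authenticated via an API key in the environment, e.g. LANGCHAIN_API_KEY):

```python
import lilac as ll

ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='my_langsmith_dataset',
    source=ll.LangSmithSource(dataset_name='my-langsmith-dataset'),
  )
)
```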

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Prepare the source for processing.

This allows the source to do setup outside the constructor, but before it is processed. This avoids potentially expensive computation when the pydantic model is deserialized.

source_schema() SourceSchema#

Return the source schema.

yield_items() Iterable[Any]#

Process the source.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'langsmith'#
router: ClassVar[APIRouter] = <fastapi.routing.APIRouter object>#
class lilac.sources.LlamaIndexDocsSource#

LlamaIndex document source

Loads documents from a LlamaIndex Document Iterable.

Usage:

```python
import lilac as ll
from llama_index import download_loader

# See: https://llamahub.ai/l/papers-arxiv
ArxivReader = download_loader('ArxivReader')

loader = ArxivReader()
documents = loader.load_data(search_query='au:Karpathy')

# Create the dataset.
ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='arxiv-karpathy',
    source=ll.LlamaIndexDocsSource(
      # `documents` comes from the loader.load_data call above.
      documents=documents,
    ),
  )
)
```

A detailed example notebook [can be found here](https://github.com/lilacai/lilac/blob/main/notebooks/LlamaIndexLoader.ipynb)

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Setup the source.

source_schema() SourceSchema#

Return the source schema.

yield_items() Iterable[Any]#

Ingest the documents.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'llama_index_docs'#
class lilac.sources.PandasSource#

Pandas source.
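
A minimal sketch of ingesting a DataFrame; it assumes the DataFrame is passed directly to the constructor, so check the signature in your Lilac version:

```python
import lilac as ll
import pandas as pd

df = pd.DataFrame({'text': ['hello', 'world'], 'label': [0, 1]})

ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='my_pandas_dataset',
    # Passing the DataFrame positionally is an assumption; see the class signature.
    source=ll.PandasSource(df),
  )
)
```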

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Prepare the source for processing.

This allows the source to do setup outside the constructor, but before it is processed. This avoids potentially expensive computation when the pydantic model is deserialized.

source_schema() SourceSchema#

Return the source schema.

yield_items() Iterable[Any]#

Process the source.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'pandas'#
class lilac.sources.ParquetSource#

Parquet data loader

Parquet files can live locally as a filepath, or remotely on GCS, S3, or Hadoop.

For more details on authentication with private objects, see: https://arrow.apache.org/docs/python/filesystems.html

param filepaths: list[str] [Required]#

A list of paths to parquet files which live locally or remotely on GCS, S3, or Hadoop.

param pseudo_shuffle: bool = False#

If true, the reader will read a fraction of rows from each shard, avoiding a pass over the entire dataset.

param pseudo_shuffle_num_shards: int = 10#

Number of shards to sample from when using pseudo shuffle.

param sample_size: int | None = None#

Number of rows to sample from the dataset

param seed: int | None = None#

Random seed for sampling
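
A minimal sketch of sampling rows from remote parquet files (the bucket path is hypothetical; see the filesystem docs linked above for authenticating access):

```python
import lilac as ll

ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='my_parquet_dataset',
    source=ll.ParquetSource(
      filepaths=['s3://my-bucket/data/part-00000.parquet'],  # hypothetical path
      sample_size=10_000,
      pseudo_shuffle=True,  # sample from a few shards instead of scanning everything
      seed=42,
    ),
  )
)
```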

load_to_parquet(output_dir: str, task_id: str | None = None) SourceManifest#

Process the source by directly writing a parquet file.

You should only override one of yield_items or load_to_parquet.

This fast path exists for sources where we are able to avoid the overhead of creating python dicts for every row by using non-Python parquet writers like DuckDB.

The output parquet files should have a {schema.ROWID} column defined. This ROWID should not be part of source_schema, however.

Finally, self.source_schema().num_items is usually computed in setup(), but can be left as None for sources implementing load_to_parquet, since this count is only used to display a progress bar. load_to_parquet doesn’t have progress bar support so the count is unnecessary. However, you must still keep track of how many items were processed in total, because fields like self.sample_size should reflect the actual size of the dataset if len(dataset) < sample_size.

Parameters:
  • output_dir – The directory to write the parquet files to.

  • task_id – The TaskManager id for this task. This is used to update the progress of the task.

Returns:

A SourceManifest that describes schema and parquet file locations.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • __context – The context.

setup() None#

Prepare the source for processing.

This allows the source to do setup outside the constructor, but before it is processed. This avoids potentially expensive computation when the pydantic model is deserialized.

source_schema() SourceSchema#

Return the source schema.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

name: ClassVar[str] = 'parquet'#