Bring your data into Airfold
In Airfold, a Source is the foundational storage unit, comparable to a table in a traditional database. It stores structured data in rows and columns based on a predefined schema and serves as the entry point for data ingestion into the platform.
Once data is ingested, it’s stored in a Source and made immediately available for querying with SQL.
Sources power all downstream transformations and API endpoints in Airfold, making them a critical part of your analytics workflow. They are designed for speed, schema clarity, and reliability, helping ensure your data remains fresh, queryable, and production-ready.
To ingest your data into a Source, you have two options:

- Connectors: seamlessly sync data from your existing systems into Airfold using built-in connectors for platforms like S3, Postgres, or Snowflake. Ideal for scheduled or high-volume ingestion.
- API: push data directly to Airfold in real time using our API. Instantly populate your Source and begin querying; perfect for syncing event data as it happens, with no batching required (see the sketch below).

Whether you ingest data with a connector or through the Airfold API, once your data lands in a Source, it is instantly queryable and ready to power your analytics pipelines.
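As a rough illustration of the API route, the sketch below pushes a single JSON event with curl. This is a hypothetical sketch: the endpoint URL, auth header, and payload shape are assumptions, so check the Airfold API reference for the exact contract.

```bash
# Hypothetical sketch: endpoint path, header, and payload shape are
# assumptions; consult the Airfold API reference for the real contract.
curl -X POST "https://api.airfold.co/v1/events/web_events" \
  -H "Authorization: Bearer $AIRFOLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "u_123", "event": "page_view", "ts": "2025-01-01T00:00:00Z"}'
```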
Each Source has its own schema that defines the table’s columns, data types, and table settings.
Defining a clear, well-structured schema for your Source is key to ensuring fast, efficient queries with minimal latency and resource usage. There are several Airfold-specific things to consider when finalizing your schema, such as:
- Table settings
  - Engine type (required)
  - Primary Key: defines how data is stored on disk (required)
  - Partitions: used mainly for data management (optional)
- Data types: there are many data types in Airfold that you might not find in other platforms; choosing them carefully can cut storage costs and improve query efficiency (illustrated in the sketch below).
Having an optimized schema is crucial, especially when your Source is storing data at scale. What’s more, your schema settings cannot be changed once the table is created. We strongly recommend reading the schema section in the docs!
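To make this concrete, here is a minimal schema fragment in the YAML style used by the CLI (shown in full later on this page). It assumes a cols key mapping column names to types and uses ClickHouse-style types; treat the key name and exact type support as assumptions and confirm them against the Data Types docs.

```yaml
# Hypothetical schema fragment: the `cols` key name is an assumption
# based on the CLI example later on this page.
cols:
  event_time: DateTime             # compact timestamp instead of a raw string
  country: LowCardinality(String)  # dictionary-encodes a low-cardinality column
  status_code: UInt16              # narrow integer instead of a generic Int64
```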
To create a Source, you can either use the Airfold UI (the simplest and quickest method) or define your Source in YAML files and push them to your workspace using the Airfold CLI. Below is an example of each method.
Navigate to “Sources” on the left menu bar, and click on “+”:
From this window, you can either choose to create a Source that uses a connector or to create a Source that will infer a schema from a text snippet, a file upload, or an external URL.
💡 Tip: If you plan on ingesting data using the Airfold API, you need to create a Source without a connector. Selecting the Text option in this window will allow you to create an unconnected Source while still defining its schema, ready for data to be ingested into.
Below is an example of creating a Source using a file upload. This will create an unconnected Source that infers its schema from the uploaded file. You can choose whether to actually ingest the data from the file or to use the file only as a means of inferring a schema.
After clicking File Upload, you will be asked to specify the file.
On the next page, you can confirm the Name of your Source. The Source name is how you will reference your data in your SQL queries, so make sure to choose a SQL-friendly name!
You can also choose whether or not to actually ingest the data from the file (or just infer its schema). Check the box if you want to ingest the data from the uploaded file into your Source.
In this window, you can also modify the schema and the table settings: you can update column names, data types, engine type, etc. Refer to the schema page of these docs for best practices.
Click Create to finalize the Source.
Once your Source is successfully created, the UI has several tabs that show useful information. The graphs at the top of the page show Usage metrics so that you can monitor your key Source metrics. The tabs below the graphs are:
- Data: gives you a preview of your data. You can explore the data using:
  - Filters: filter out certain rows based on a condition
  - Sort: order your rows in ascending/descending order based on a specified column
  - Group by: organize your rows by a specified column, grouping related rows together based on a shared value in that column
- Schema: provides the table schema and ClickHouse settings.
- Data Graph: shows all the dependencies for the Source. As you build downstream analytics from the Source using Pipes, the dependency chart will populate, showing all nodes that reference this Source.
- Logs: provides error logs. Any ingestion errors can be seen there.
If you are using the CLI, Sources are defined with YAML files, usually in your /Sources directory. For a Source that isn’t using a connector, your YAML should look like this:
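Below is a minimal sketch of an unconnected Source definition, assembled from the field descriptions that follow; the exact key names (type, name, description, cols, settings) are assumptions, so verify them against the CLI reference.

```yaml
# web_events.yaml - hypothetical unconnected Source definition.
# Key names are assumptions based on the field descriptions below.
type: Table                        # no connector, so a plain table
name: web_events
description: Raw page-view events pushed via the Airfold API
cols:                              # column name -> data type
  event_time: DateTime
  user_id: String
  path: String
settings:                          # engine settings: table engine, ORDER BY, etc.
  - ENGINE MergeTree()
  - ORDER BY event_time
```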
- Type: the type of table that gets created. For Sources that don’t use connectors, this value should be Table.
- Name: the name given to the Source.
- Description: a brief overview of the Source’s content or purpose (optional).
- Columns: defines the schema of the Source as a list of columns, where each column name is a key and its data type is the value (see Data Types).
- Settings: engine settings for the Source, which can include ORDER BY, PARTITION BY, the table engine, etc. The settings can be a string, a key-value pair, or an array comprising either or both (optional).
Push Sources to your workspace using the CLI command af push. For example, to push web_events.yaml, run:
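A minimal invocation, assuming you run it from the directory containing the file (adjust the path if it lives in /Sources):

```bash
af push web_events.yaml
```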
Once you have successfully pushed your new Source, it should be visible in the UI and ready to ingest data!
For your next steps, try writing some SQL queries against your new Source in a Pipe.
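For instance, a starter query might count events per day, assuming a Source named web_events with an event_time column (both are hypothetical names carried over from the examples above):

```sql
-- Starter query: daily event counts, assuming a hypothetical
-- web_events Source with an event_time column.
SELECT
  toDate(event_time) AS day,
  count() AS events
FROM web_events
GROUP BY day
ORDER BY day DESC
```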