Use the S3 connector to ingest data directly from your AWS S3 buckets into Airfold. Configure a scheduled sync to ensure your Airfold source stays up-to-date with the latest files, giving you control over the freshness and frequency of your data updates.

The following walkthrough demonstrates how to set up an S3 connector within your Airfold workspace.

Create Source

Navigate to Sources and click on Create new Source. Alternatively, click on the + next to Sources in the left-hand toolbar.

In the pop-up window, click on Amazon S3.

Bucket

This will take you to the first page of the setup wizard. Here you can either select a previously connected bucket or connect a new one. If this is your first time connecting Airfold to S3, select “Add New” and enter the name of the S3 bucket you want to sync data from. Enter the name exactly as it appears in your AWS account.

Policy and Role

In this section of the setup wizard, you will create a role and a policy in your AWS account that Airfold will use to access your bucket. You may need administrative privileges on your AWS account to complete this step.

A role is an AWS identity that can be assumed to perform actions in your account.
A policy is a set of rules and permissions that can be attached to a role.

The following steps create a role for Airfold within your AWS account and attach a policy to that role granting read permissions on the S3 bucket you specified.

First, let’s create the policy. Follow the instructions in the setup wizard to navigate to the correct section of the IAM page in your AWS account.

Copy the entire JSON from the setup wizard and then paste it into the policy editor. It should look something like this:
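(The exact JSON is generated by the wizard for your bucket; the bucket name below is a placeholder. A read-only policy of this shape grants Airfold permission to list the bucket and read its objects.)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}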

Click Next, and on the following page give the policy a name. Remember what you have called your new policy, as you will need to attach it to the new role in the next step.
Once you have successfully created the policy, click the “Mark as created” checkbox.
Next let’s create the role. Again, follow the steps in the setup wizard to navigate to the Roles section of the IAM page. Paste the full JSON into the trust policy editor; it should look something like this:
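(The exact JSON comes from the wizard; the account ID and external ID below are placeholders for the values Airfold provides. A trust policy of this shape allows Airfold to assume the role.)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "your-external-id"
        }
      }
    }
  ]
}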

Assign the previously created policy to this new role and then click “Create role”.
Once you have completed these steps, click the “Mark as created” checkbox and move on to the next step.

Role ARN and Path

For this section you will need to retrieve two pieces of information from your AWS account: the Role ARN (Amazon Resource Name) of the role you just created, and the path to the file(s) you wish to sync into Airfold.

First, let’s find the ARN. Navigate to the Roles section of the IAM page in AWS and search for the role you created in the previous step. It should look something like the following:
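(The account ID and role name below are placeholders; yours will reflect the role you just created.)

arn:aws:iam::111122223333:role/airfold-s3-access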

Copy the ARN and paste it into the setup wizard.

Next, let’s get the path. This path defines the directory that Airfold will scan for files during each scheduled sync. All files matching this path will be ingested into the same Airfold source, provided they share a consistent schema. Some examples that you could use:

logs/                 # Ingest all files in the 'logs' folder
data/*.parquet        # Ingest only Parquet files in the 'data' folder
snapshots/2025-05-*   # Ingest all files with a matching date prefix
raw-data/**/*.json    # Recursively ingest all JSON files in 'raw-data' and its subfolders

Configure Schedule

In this section of the setup wizard, you will specify how often Airfold pulls data from S3 and the Strategy that determines how the data populates your Source.

Set the refresh schedule by writing a cron expression and specifying a timezone.
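Assuming standard five-field cron syntax, two common schedules would look like:

0 * * * *    # Sync at the top of every hour
0 6 * * *    # Sync daily at 06:00 in the selected timezone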

The Strategy selection gives you two options: append or replace. Append takes the files in the bucket that match your path and adds their data to any rows already present in the Source. Replace overwrites the current rows in the Source with the latest contents of the bucket.

💡 Tip:
For most use cases involving the S3 connector, the append strategy is the recommended choice. A common setup is to upload a new file to the bucket each day, with a time-to-live (TTL) policy configured to automatically remove older files—typically after 7 days. In this scenario, your bucket will always contain the most recent 7 files.

By default, Airfold’s append strategy is designed to be idempotent: it will only ingest new files that haven’t previously been processed, or files that have been modified. This ensures that each scheduled sync pulls in only the latest data while safely ignoring files that have already been ingested, keeping your Source accurate and avoiding duplication.
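As an illustrative sketch of this behavior with the daily-file setup above (file names are placeholders):

Sync 1: bucket holds 2025-05-01.csv and 2025-05-02.csv      # both files ingested
Sync 2: 2025-05-03.csv is uploaded, older files unchanged   # only 2025-05-03.csv ingested
Sync 3: 2025-05-02.csv is re-uploaded with new contents     # 2025-05-02.csv ingested again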

Finalize

On the final step, you’ll name your new Source and define its schema.

💡 Tip:
It’s worth spending some time on this section, as neither the schema nor the table settings can be changed once the table is created. Choosing optimal data types and table settings (like the primary key, order by, and engine) is crucial in Airfold for ensuring that your queries run with low latency. If you later need to change data types or table settings, you will have to drop the source, recreate the table with the new settings, and ingest your data again. This can be a costly backfill, especially for a large dataset, so it’s important to ensure correctness before proceeding.

Your new Source will now update on the schedule you defined, and will appear under the Sources tab in your workspace.