Ingest data directly from your S3 buckets
Go to Sources and click on Create new Source. Alternatively, click on the + next to Sources in the left-hand toolbar.
Select Amazon S3.
Enter the bucket that you need to sync data from. Enter the name of the bucket exactly as it appears in your AWS account.
Create a role and a policy in your AWS account for Airfold to use to access your bucket. To complete this step, you may need administrative privileges on your AWS account.
A role is an identity in your AWS account that a trusted party, such as Airfold, can assume.
A policy is a set of rules and permissions that can be attached to a role.
Once both exist, select the Mark as Created checkbox.
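As a rough illustration of what a policy for this step can contain, the sketch below builds a read-only policy document for a single bucket. It is an assumption, not the policy Airfold prescribes: the two actions (`s3:ListBucket`, `s3:GetObject`) are standard AWS permissions for listing and reading objects, and the bucket name `my-example-bucket` and the helper `build_read_only_policy` are hypothetical. Use the exact policy shown in the Airfold setup flow if it differs.

```python
import json


def build_read_only_policy(bucket_name: str) -> dict:
    """Return an illustrative read-only IAM policy for one S3 bucket.

    Assumption: listing and reading objects is sufficient for a sync.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing objects is a bucket-level permission.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading object contents is an object-level permission.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    print(json.dumps(build_read_only_policy("my-example-bucket"), indent=2))
```

The printed JSON can be pasted into the IAM policy editor and then attached to the role.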
Enter the Role ARN (Amazon Resource Name) of the role that you just created, and the path to the file(s) that you wish to sync into Airfold.
ARN: Navigate to the Roles section of the IAM page in AWS and search for the role that you created in the previous step. It should look something like the following:
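IAM role ARNs in the standard AWS partition generally follow the pattern `arn:aws:iam::<12-digit account ID>:role/<role name>`. The sketch below is an optional sanity check you could run before pasting a value in. The helper name `looks_like_role_arn`, the account ID, and the role name are hypothetical, and the pattern deliberately ignores other partitions such as `aws-cn`.

```python
import re

# arn:aws:iam::<12-digit account ID>:role/<optional path/><role name>
# Role names and paths may contain letters, digits, and the characters _+=,.@/-
ROLE_ARN_PATTERN = re.compile(
    r"arn:aws:iam::\d{12}:role/[\w+=,.@/-]+", re.ASCII
)


def looks_like_role_arn(value: str) -> bool:
    """Return True if the value has the general shape of an IAM role ARN."""
    return ROLE_ARN_PATTERN.fullmatch(value.strip()) is not None


if __name__ == "__main__":
    # Placeholder account ID and role name, for illustration only.
    print(looks_like_role_arn("arn:aws:iam::123456789012:role/airfold-s3-reader"))
    # A bucket ARN is not a role ARN.
    print(looks_like_role_arn("arn:aws:s3:::my-example-bucket"))
```

This only checks the shape of the string. It does not confirm that the role exists or that Airfold is allowed to assume it.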
Path: This path defines the directory that Airfold will scan for files during each scheduled sync. All files matching this path will be ingested into the same Airfold source, provided they share a consistent schema. Some examples that you could use:
Choose a Strategy for how the data will populate your Source.
The Strategy selection gives you two options: append or replace. Append takes the files in the bucket that match your path and adds their data to any rows already present in the Source. Replace overwrites the current rows in the Source with the latest contents of the bucket.
💡 Tip: For most use cases involving the S3 connector, the append strategy is the recommended choice. A common setup is to upload a new file to the bucket each day, with a time-to-live (TTL) policy configured to automatically remove older files, typically after 7 days. In this scenario, your bucket will always contain the most recent 7 files. By default, Airfold’s append strategy is designed to be idempotent: it only ingests new files that haven’t previously been processed, or files that have been modified. Each scheduled sync therefore pulls in only the latest data and skips files that have already been ingested, keeping your Source accurate and avoiding duplication.
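To make the difference between the two strategies concrete, here is a small in-memory simulation. It is a sketch of the behavior described above, not Airfold's implementation: the `S3File` and `Source` classes, the use of a last-modified timestamp to detect changes, and the choice to simply append rows from a modified file are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class S3File:
    key: str            # object key in the bucket
    last_modified: str  # e.g. an ISO 8601 timestamp
    rows: tuple         # parsed rows contained in the file


@dataclass
class Source:
    rows: list = field(default_factory=list)
    seen: dict = field(default_factory=dict)  # key -> last_modified at ingest


def sync_append(source: Source, files: list) -> list:
    """Ingest only new or modified files; return the keys that were ingested."""
    ingested = []
    for f in files:
        if source.seen.get(f.key) == f.last_modified:
            continue  # already processed and unchanged: skip
        source.rows.extend(f.rows)
        source.seen[f.key] = f.last_modified
        ingested.append(f.key)
    return ingested


def sync_replace(source: Source, files: list) -> None:
    """Overwrite the Source with whatever the bucket currently contains."""
    source.rows = [row for f in files for row in f.rows]
    source.seen = {f.key: f.last_modified for f in files}


if __name__ == "__main__":
    source = Source()
    day1 = S3File("2024-01-01.csv", "2024-01-01T00:05:00Z", ("a", "b"))
    day2 = S3File("2024-01-02.csv", "2024-01-02T00:05:00Z", ("c",))

    print(sync_append(source, [day1]))        # first sync ingests day1
    print(sync_append(source, [day1, day2]))  # second sync ingests only day2
    print(source.rows)
```

Running the same append sync twice against an unchanged bucket ingests nothing the second time, which is the idempotency property the tip relies on.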
Your new Source will now update on the schedule you defined, and will appear under the Sources tab in your workspace.
💡 Tip: It’s worth spending some time on this section, as both the schema and table settings cannot be changed once the table is created. Choosing optimal data types and table settings (like the primary key, order by, and engine) is a crucial step in Airfold for ensuring that your queries run with low latency. If you wish to change data types or table settings, you will have to drop the source, recreate the table with the new settings, and ingest your data again. This can be a costly backfill, especially if you have a large dataset, so it’s important to ensure correctness before proceeding.