Use-case

In this guide, we’ll develop an API to identify the top 10 most searched products on an e-commerce platform.

We’ll process e-commerce events to track user interactions like item views, cart additions, and checkout completions. Our dataset, a 50-million-row CSV file, captures these activities.

Our steps include:

  • Ingesting the dataset.
  • Executing queries to filter, aggregate, and refine the data into a top 10 list.
  • Exposing the top 10 results through an HTTP API.

Create a Workspace

Before we begin, we need to create a workspace to store our data and resources, as well as a token to authenticate the CLI.

  1. Go to Airfold and create a new workspace.
  2. Copy an Admin token from the workspace’s Token page.

The token should look like this: aft_6eab8fcd902e4cbfb63ba174469989cd.Ds1PME5dQsJKosKQWVcZiBSlRFBbmhzIocvHg8KQddV.
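
If you plan to follow the HTTP API examples later in this guide from a shell, it can be convenient to keep the token in an environment variable so you don't have to paste it repeatedly (the variable name AIRFOLD_TOKEN is just an illustrative choice, not something the CLI requires):

$ export AIRFOLD_TOKEN='<your token>'

You can then pass it to curl later with --header "Authorization: Bearer $AIRFOLD_TOKEN".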

Set up the CLI

The CLI requires Python 3.10 or higher.

  1. Install the CLI using pip install airfold-cli.
  2. Run af config and paste your token when prompted.
$ af config
Configuring for API URL: https://api.airfold.co
? Api key: <your token>

To see a list of available commands, just run af.

Create a Source

Every source is defined by a YAML file. Instead of writing one by hand, let's generate it by inferring the schema from a CSV file:

$ af source create https://airfold-demo-data.s3.us-east-2.amazonaws.com/quickstart/events.csv .

The CLI will infer the schema from the CSV file and generate a YAML file with the following contents:

./sources/events.yaml
name: events
cols:
  acceptcharset: String
  acceptencoding: String
  acceptlanguage: String
  browsername: String
  browserversion: String
  cachecontrol: String
  city: String
  connection: String
  contentlength: String
  contenttype: String
  country: String
  cpuarchitecture: String
  devicemodel: String
  devicetype: String
  devicevendor: String
  enginename: String
  engineversion: String
  event_ts: DateTime64(3)
  from: String
  headers: String
  host: String
  ip_address: String
  is_bot: Int64
  latitude: Float64
  log_level: String
  log_message: String
  longitude: Float64
  method: String
  origin: String
  osname: String
  osversion: String
  protocol: String
  referer: String
  region: String
  url: String
  useragent: String
  via: String
  xforwaredforip: String
settings:
  engine: MergeTree()
  order_by: (`event_ts`, `ip_address`)
  partition_by: toYYYYMM(event_ts)
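
The settings block describes how the data is laid out in the underlying table: a MergeTree engine, a sort key, and a monthly partition key. These are ClickHouse concepts; as a rough mental model (a sketch you don't need to run, with the column list abbreviated), the source corresponds to a table definition like:

CREATE TABLE events
(
    event_ts   DateTime64(3),
    ip_address String,
    log_level  String
    -- ...plus the remaining columns listed in the YAML above
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_ts)
ORDER BY (event_ts, ip_address)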

Push the source to your workspace:

$ af push ./sources/events.yaml

List the sources in your workspace to confirm that the source was created:

$ af source ls
- events: 0 rows | 0 bytes | 0 errors | created 1 min ago | updated 1 min ago

Ingest Data

Now that we have a source, we can ingest data into it.

Let’s ingest the file we used to generate the source:

$ af source append events https://airfold-demo-data.s3.us-east-2.amazonaws.com/quickstart/events.csv
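
Ingestion can take a moment for a file of this size. To verify it, list the sources again and check that the row count reported for events is no longer zero:

$ af source ls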

Create a Pipe

All SQL queries in Airfold are written inside pipes.
A pipe is a sequence of one or more SQL queries, called nodes, that are executed in order; each node is simply a SELECT statement.

Use pipes to build features over your data!
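
For instance, a pipe with more than one node could be sketched like this, with each node building on the previous one (the file and node names here are purely illustrative, and the sketch assumes a later node can select from an earlier node by name):

errors.yaml
nodes:
  - error_events:
      sql: |
        SELECT
          event_ts,
          log_message
        FROM
          events
        WHERE
          log_level = 'ERROR'
  - error_count:
      sql: |
        SELECT
          count()
        FROM
          error_events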

Let's create a pipe that counts events by log level:

levels.yaml
nodes:
  - node1:
      sql: |
        SELECT
          log_level,
          count()
        FROM
          events
        GROUP BY
          log_level

Push the pipe to your workspace:

$ af push levels.yaml

List the pipes in your workspace to confirm that the pipe was created:

$ af pipe ls
- levels: draft | created 1 min ago | updated 1 min ago

Publish the Results

We've pushed the pipe to our workspace, but we have not yet published an endpoint from it.
To do so, set the publish field to the name of the endpoint we want to expose:

levels.yaml
nodes:
  - node1:
      sql: |
        SELECT
          log_level,
          count()
        FROM
          events
        GROUP BY
          log_level
publish: levels

Push the pipe to your workspace:

$ af push levels.yaml

List the pipes in your workspace to confirm that the pipe is now published:

$ af pipe ls
- levels: published | created 1 min ago | updated 1 min ago

Query the results using the API:

$ curl --request GET \
  --url https://api.airfold.co/v1/pipes/levels.json \
  --header 'Authorization: Bearer <your token>'

{
        "meta":
        [
                {
                        "name": "log_level",
                        "type": "String"
                },
                {
                        "name": "count()",
                        "type": "UInt64"
                }
        ],

        "data":
        [
                {
                        "log_level": "WARN",
                        "count()": "2"
                },
                {
                        "log_level": "INFO",
                        "count()": "8"
                }
        ],

        "rows": 2,

        "statistics":
        {
                "elapsed": 0.001772939,
                "rows_read": 10,
                "bytes_read": 130
        }
}
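
If you only care about the result rows, you can filter the response with a JSON tool such as jq (jq is not part of Airfold, just a convenient way to slice the output):

$ curl --request GET \
  --url https://api.airfold.co/v1/pipes/levels.json \
  --header 'Authorization: Bearer <your token>' | jq '.data'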

Or using the CLI:

$ af pipe query levels --format json

[...]

Next Steps

Congratulations! You’ve successfully ingested, queried, and published data in Airfold.
Now that you’ve learned the basics, you can explore workspaces, sources, pipes, and tokens.
Or, check the API and CLI references to learn more about how to interact with Airfold.