Writing to Iceberg tables
Apache Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
Starting with version 1.19.0, NiFi supports a PutIceberg processor to add rows to an existing Iceberg table.
As of NiFi version 2.7.2 only PutIceberg is supported; you need to create and compact your tables with other tools such as Trino or Spark (both included in the Stackable Data Platform).
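For illustration, compaction can for example be triggered from Trino's Iceberg connector with a statement along these lines (the catalog, schema and table names are placeholders matching the Trino example further below):
-- Example only: rewrite small data files of the Iceberg table into larger ones.
ALTER TABLE lakehouse.demo.test EXECUTE optimize;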
NiFi 2.7 and above
In NiFi 2.7.0 Iceberg support was re-added after the removal in 2.0.0.
The 2.7.x version has the following changes over the 2.0.x and 2.6.x versions, so you need to adapt your setup accordingly:
- HDFS and Kerberos support was dropped
- Hive metastore support was dropped
- Iceberg REST catalog support was added
- It now uses the Iceberg S3 IO instead of the Hadoop S3 client libraries
- It has far fewer dependencies and therefore reduces the number of CVEs
There have been efforts from Stackable to re-add at least Hive metastore support, but we ran into NiFi class loader issues that we have not been able to solve so far.
NiFi 2.0 - 2.6
In NiFi 2.0.0, Iceberg support was removed from upstream NiFi.
We forked the nifi-iceberg-bundle and made it available at https://github.com/stackabletech/nifi-iceberg-bundle.
Starting with SDP 25.7, the necessary bundle is added to NiFi by default, so you don't need to explicitly add Iceberg support to the Stackable NiFi.
Please read its documentation to learn how to ingest data into Iceberg tables.
You don't need any special configuration on the NifiCluster if you are using S3 without Kerberos.
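For orientation, a cluster definition can then look roughly like the following sketch; the authentication, ZooKeeper and node settings are placeholders and depend on your environment:
# Sketch only: no Iceberg-specific settings are required for S3 without Kerberos.
apiVersion: nifi.stackable.tech/v1alpha1
kind: NifiCluster
metadata:
  name: nifi
spec:
  clusterConfig:
    # ... your usual authentication, sensitive properties and ZooKeeper configuration
  nodes:
    roleGroups:
      default:
        replicas: 1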
HDFS and Kerberos are also supported; please have a look at the Iceberg integration test for details.
NiFi 1
Starting with 1.19.0, NiFi supports writing to Iceberg tables.
The following shows an example NiFi setup using the Iceberg integration.
apiVersion: nifi.stackable.tech/v1alpha1
kind: NifiCluster
metadata:
  name: nifi
spec:
  clusterConfig:
    # ...
    extraVolumes:
      # Will be mounted at /stackable/userdata/nifi-hive-s3-config/
      - name: nifi-hive-s3-config
        secret:
          secretName: nifi-hive-s3-config
---
apiVersion: v1
kind: Secret
metadata:
  name: nifi-hive-s3-config
stringData:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
      </property>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>http://minio:9000</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>xxx</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>xxx</value>
      </property>
      <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
      </property>
      <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
        <description>Enables or disables SSL connections to S3.</description>
      </property>
    </configuration>
Please fill in the correct endpoint, access key and secret key for your S3 store; this is a classic Hadoop config file.
Use e.g. Trino to create a table for NiFi to write into, using something like:
CREATE SCHEMA IF NOT EXISTS lakehouse.demo WITH (location = 's3a://lakehouse/demo/');
CREATE TABLE IF NOT EXISTS lakehouse.demo.test (
  test varchar
);
In NiFi you first need to create a HiveCatalogService, which allows you to access the Hive Metastore storing the Iceberg metadata.
Set Hive Metastore URI to something like thrift://hive-iceberg.default.svc.cluster.local:9083,
Default Warehouse Location to s3a://lakehouse
and Hadoop Configuration Resources to /stackable/userdata/nifi-hive-s3-config/core-site.xml.
Afterwards you can create the PutIceberg processor and configure the HiveCatalogService.
Also set Catalog Namespace to your schema name (demo in the example above) and Table Name to your table name (test).
For the File Format it is recommended to use PARQUET or ORC rather than AVRO for performance reasons, but you can also leave it empty or choose your desired format.
You should end up with the following PutIceberg processor:
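Once data is flowing, you can verify from Trino that rows actually arrive in the table, for example:
-- Example check: count the rows NiFi has written so far.
SELECT count(*) FROM lakehouse.demo.test;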