Apache NiFi logo

As a COE, I have many tasks that I need to do on my own. In my new quest to delegate as much as possible to AI, I wanted to take Apache NiFi for a spin.

What is Apache NiFi?

Apache NiFi is a powerful, easy-to-use, and reliable system to process and distribute data. It is designed to automate the flow of data between systems, making it easy to get data from a variety of sources and processes into other systems and data stores. NiFi is highly configurable and can be used to connect to a wide variety of sources and destinations. It can be used to process and distribute data in real-time, and can scale to handle very large data flows. NiFi is used in a wide range of applications, including data ingestion, data processing, and data distribution.

Although not directly related to AI, the tasks that Apache NiFi performs fall under what is now commonly called "data engineering". Before data can actually be used, it has to be gathered and put into a usable form, and that takes a number of practical steps.

What is the problem I am trying to solve?

A COE needs to be extremely careful with their time, and with where they choose to invest their energy. As a COE myself, I often use my own experiences as a proxy for other COEs.

In my case, for some time now, I have had to perform a number of nagging tasks that require loading data from one system, transforming the data to make it fit a certain schema or structure, then using it in another system. In general, this is a problem common to many companies. Larger organizations can tackle this problem by hiring experts. Oftentimes, the COE does not have that luxury. I therefore wanted to experiment with some tool to get an idea of whether or not it could be useful for the typical COE.

For this experiment, I have chosen a particular use case in my actual work. A segment of my customers fill out a "form", which is used to generate a license key for a particular product. The number of customers is large enough to make this a repetitive task that I would prefer to avoid, but too small to make me actually invest time in solving it, either by automating or delegating. So, I have continued to do this manually over the past few years, a few minutes at a time, several times a week.

It is, I believe, exactly this type of task that kills the COE's productivity: it is death by a thousand cuts. So one of my hypotheses is: by automating this type of repetitive task, I will have better focus, productivity, and enjoyment of my work as a COE. However, to make automation practical, the work necessary to automate must take less than an hour or so. Obviously, the less time, the better, but I'm starting with an hour for now.

To accomplish this type of automation, it is important to understand the context of the COE, namely:

  • I don't have a team of IT or DevOps experts to manage infrastructure for me
  • I need to focus my efforts on more value-adding work, and not so much on infrastructure
  • I don't have a lot of time to learn new concepts that are only very specific to a single application

As a technical COE, I could develop an application, but it would take more than an hour, and the solution would not be generalizable to other COEs.

Why Apache NiFi?

Or perhaps the more appropriate question is, why not something else?

There are a number of potential solutions available, including other Apache projects such as Kafka, Flume, Beam, and Spark. There are also quite a few commercial services, such as Keboola, Stitch, and Segment, and large infrastructure players like AWS and Google offer services of their own.

I eventually chose Apache NiFi for the simple reason that I was able to understand its purpose immediately, and it seemed to fit my immediate needs.

Most of all, I found that it was relatively easy to install, and the graphical user interface was quite intuitive. It took me very little time to figure out how to get up and running with Apache NiFi.

My first pipeline

It was almost trivial to set up Apache NiFi on my Mac:

$ brew install nifi

Following installation, the application was immediately available locally at https://localhost:8443/nifi. I was able to get up and running, and try out my first pipeline in less than 20 minutes.

My initial test pipeline grabbed an email from a specific email folder using IMAP, then sent out an email with identical content to a different email address. It is not a useful pipeline, but it did show me how easy it is to set up Apache NiFi, and get a data pipeline working that integrates with my email. Pretty cool!
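
For readers who think in code rather than flow diagrams, the test pipeline does roughly what the Python sketch below does. This is only an illustration built on the standard library, not how NiFi implements the flow. The function names, and any host, folder, or address you would pass in, are hypothetical placeholders.

```python
import imaplib
import smtplib
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_copy(raw: bytes, sender: str, recipient: str) -> EmailMessage:
    """Build a new message with the same subject and plain-text body as the original."""
    original = message_from_bytes(raw, policy=policy.default)
    body = original.get_body(preferencelist=("plain",))
    copy = EmailMessage()
    copy["From"] = sender
    copy["To"] = recipient
    copy["Subject"] = str(original["Subject"] or "")
    copy.set_content(body.get_content() if body is not None else "")
    return copy


def fetch_messages(host: str, user: str, password: str, folder: str) -> list:
    """Return the raw bytes of every message in one IMAP folder."""
    with imaplib.IMAP4_SSL(host) as imap:
        imap.login(user, password)
        imap.select(folder, readonly=True)
        _, data = imap.search(None, "ALL")
        raws = []
        for num in data[0].split():
            _, parts = imap.fetch(num, "(RFC822)")
            raws.append(parts[0][1])
        return raws


def send(message: EmailMessage, host: str, user: str, password: str) -> None:
    """Send one message over SMTP with TLS."""
    with smtplib.SMTP_SSL(host) as smtp:
        smtp.login(user, password)
        smtp.send_message(message)
```

Wiring these together would be a loop over `fetch_messages(...)` that calls `build_copy` and then `send` for each message. In NiFi the same steps are boxes connected on a canvas, which is what made the first experiment so quick.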

How could Apache NiFi help a COE?

Now that I have Apache NiFi set up, and am beginning to experiment with it, my impression is that it will be a useful tool for me, so I will continue investing time to build more data pipelines. Eventually, I would need to move this to my Kubernetes cluster, but I think that it shouldn't be too difficult to do this.

However, the big question is: could this be used by a typical COE?

To manage data, a typical busy COE would need something that:

  • Is very simple, with an intuitive graphical interface
  • Does not require any expertise in the data sciences
  • Does not require any setup or additional infrastructure
  • Is preconfigured to handle typical COE use cases

Apache NiFi has a pretty good GUI, but it is quite technical, as it is focused on the low-level nitty-gritty of working with data. For instance, a typical COE will likely not know how to write a regular expression to transform their data.
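
To make that concrete, here is the kind of small transformation a flow typically needs, shown as a Python sketch rather than as NiFi configuration. The form layout and the field names (`Name`, `Email`, `Product`) are invented for illustration; they are assumptions, not the actual form from my use case.

```python
import re

# Hypothetical form text; the real form fields are assumptions.
FORM_TEXT = """\
Name: Ada Lovelace
Email: ada@example.com
Product: Widget Pro
"""

# One "Key: value" pair per line; capture the key and the trimmed value.
FIELD_PATTERN = re.compile(
    r"^(?P<key>[A-Za-z ]+):[ \t]*(?P<value>.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_form(text: str) -> dict:
    """Turn 'Key: value' lines into a dict with lowercase keys."""
    return {
        m.group("key").strip().lower(): m.group("value")
        for m in FIELD_PATTERN.finditer(text)
    }


record = parse_form(FORM_TEXT)
print(record)
```

Even in this small case you have to think about anchors, capture groups, and stray whitespace, which is exactly the kind of detail a busy COE should not have to learn.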

The installation was relatively simple, but it is still a step too far for a typical COE. Also, there are no preconfigured templates that would address the needs of a typical COE.

So, although I think I can make good use of it, it is not the right tool for a COE in general.

A step in the right direction

Apache NiFi is not a panacea for the COE. It is aimed more at data scientists, which in practice means a well-staffed team of people with specific expertise. Although it is quite lightweight and simple, it is still a piece of infrastructure that requires some technical ability to set up and use. It is not intended for the average COE.

However, for me it is a step in the right direction. I hope to benefit from this new toy, and eventually distill the necessary elements that would benefit COEs everywhere. I will continue using it as part of my AI/ML journey to see where this leads.