NiFi for Apache - the flow using records and registry


In a previous guide, we set up MiNiFi on web servers to export Apache access log events to a central NiFi server. Then we saw an example of a flow built on this NiFi server to handle these events. That flow used standard NiFi processors, manipulating each event as a string. Now, we will start a new flow, achieving the same purpose but using a record-oriented approach.
We will then discover the ease of use of record-oriented flow files and how they can speed up the deployment of a flow.
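A record-oriented flow relies on a schema that describes each record. As an illustration only (the field names and types below are hypothetical, not taken from the original flow), an Apache access log event could be described with an Avro schema like this:

```json
{
  "type": "record",
  "name": "apache_access_log",
  "fields": [
    {"name": "client_ip", "type": "string"},
    {"name": "timestamp", "type": "string"},
    {"name": "request",   "type": "string"},
    {"name": "status",    "type": "int"},
    {"name": "bytes",     "type": ["null", "long"], "default": null}
  ]
}
```

Such a schema is what you would store in the Registry and reference from the record readers and writers of the flow.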

Pieces needed from before

NiFi and the Hortonworks Registry


The HortonWorks Registry is a service running on your Hortonworks Data Flow cluster that allows you to centrally store and distribute schemas describing how the data you are manipulating is organized.
The Registry is a web application offering:
  • A web interface to add and modify schemas
  • A REST API that can be used by any other service to retrieve schema information
The Registry retains every previous version of a schema: each update of an existing schema creates a new version.
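As a sketch of how another service could use the REST API, the snippet below fetches the latest version of a schema. The base URL, schema name, and default port (7788) are assumptions, and the REST path shown matches the Hortonworks Schema Registry at the time of writing; check it against your Registry version.

```python
# Sketch: retrieve the latest version of a schema from the Schema
# Registry REST API. Base URL, port and schema name are assumptions.
import json
import urllib.request


def latest_schema_url(base_url, schema_name):
    """Build the REST URL for the latest version of a schema."""
    return (base_url.rstrip("/")
            + "/api/v1/schemaregistry/schemas/"
            + schema_name + "/versions/latest")


def fetch_latest_schema(base_url, schema_name):
    """Retrieve and decode the latest schema version as JSON."""
    with urllib.request.urlopen(latest_schema_url(base_url, schema_name)) as resp:
        return json.load(resp)


# Example (hypothetical host and schema name):
# schema = fetch_latest_schema("http://registry-host:7788", "apache-access-log")
# print(schema["schemaText"])
```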

NiFi for Apache - the flow


In the previous guide, you installed, configured, and enabled the MiNiFi agent on each of your web servers. Now, it is time to build a flow on your central NiFi server to do something with the information that will be sent to it.


Building up a flow on the NiFi server

We are now back to the workspace of our NiFi server.
If you have followed this guide line by line, you should have only one input port, called “RemoteMiNiFi”, on it.

NiFi for Apache - using MiNiFi


In this guide, we will use MiNiFi, the lightweight version of NiFi, which will run on an Apache web server, looking for new events written to the Apache access logs.
MiNiFi is a lightweight version of NiFi, without the web interface and with only a limited set of processors. It consumes few resources on the host it runs on.
It can be used as a “forward-only” agent, sending data to any central NiFi server you have previously set up.
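As a sketch, a MiNiFi config.yml for this scenario pairs a TailFile processor with a Remote Process Group pointing at the central NiFi input port. The log path, URL, scheduling values, and port id placeholder below are assumptions; the key names follow the MiNiFi (Java) config.yml format, so check them against your MiNiFi version.

```yaml
# Sketch of a MiNiFi config.yml fragment (values are examples)
Processors:
  - name: TailApacheAccessLog
    class: org.apache.nifi.processors.standard.TailFile
    scheduling strategy: TIMER_DRIVEN
    scheduling period: 10 sec
    Properties:
      File to Tail: /var/log/apache2/access.log

Remote Process Groups:
  - name: CentralNiFi
    url: http://nifi-host:8080/nifi
    Input Ports:
      - id: <input-port-uuid>      # id of the "RemoteMiNiFi" input port on the NiFi server
        name: RemoteMiNiFi
```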

Configuring MiNiFi

NiFi and SSL for authorization


By default, your NiFi installation is not protected at all. Anyone knowing the hostname of your NiFi hosts can connect to them with a simple web browser.
To protect access to NiFi by adding user authentication and authorization, you will need to enable SSL. Client-side certificates generated by the NiFi CA will be used not only to set up an encrypted link to the NiFi hosts but also to provide user authentication.
Once SSL has been enabled for NiFi, it is no longer possible to connect using plain HTTP.
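For reference, the relevant settings end up in nifi.properties; the values below are examples only (hostname, port, and paths are assumptions), and on an Ambari-managed HDF cluster you would set them through the Ambari GUI rather than edit the file by hand:

```properties
# nifi.properties - HTTPS enabled, plain HTTP disabled (example values)
nifi.web.http.port=
nifi.web.https.host=nifi-host
nifi.web.https.port=9091
nifi.security.keystore=/etc/nifi/conf/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.truststore=/etc/nifi/conf/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.needClientAuth=true
```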

In the Ambari GUI

NiFi for Syslog

Let’s build with NiFi a flow similar to the one we built with Logstash to store syslog messages into an ElasticSearch index.


Receiving the messages

We start with NiFi’s ListenSyslog processor, which can be configured to listen for syslog on any UDP or TCP port. When listening on TCP, you must specify the maximum number of concurrent TCP connections. This parameter depends on the number of systems sending syslog messages simultaneously to your listener.
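To check that the listener receives data, you can hand-craft a test message. The sketch below builds a minimal RFC 3164 syslog line (PRI = facility × 8 + severity) and sends it over TCP; the host, port, tag, and hardcoded timestamp are assumptions to adapt to your setup.

```python
# Sketch: build and send a minimal RFC 3164 syslog message over TCP,
# e.g. to verify that the ListenSyslog processor is receiving data.
import socket


def syslog_message(facility, severity, hostname, tag, text):
    """Format a minimal RFC 3164 message; PRI = facility * 8 + severity.
    The timestamp is fixed for brevity; a real sender would use the
    current time."""
    pri = facility * 8 + severity
    return f"<{pri}>Jun 30 12:00:00 {hostname} {tag}: {text}\n".encode()


def send_test_message(host, port):
    # facility 1 (user-level), severity 6 (informational) -> PRI 14
    msg = syslog_message(1, 6, "web01", "nifi-test", "hello from the guide")
    with socket.create_connection((host, port)) as sock:
        sock.sendall(msg)


# send_test_message("nifi-host", 514)   # port must match ListenSyslog
```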

NiFi and ElasticSearch

Custom mapping for the index you will update with NiFi flows

Unlike the Logstash “ElasticSearch” output, you cannot associate a customized mapping with the processor. Therefore, if the dynamic mapping of ElasticSearch doesn’t assign the type you really want to one of your fields, you will have to use a default mapping template (see this chapter in the ElasticSearch section of the site).
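As an illustration, such a template could force the type of a few fields for every index matching a pattern. The index pattern, mapping type, and field names below are hypothetical, and the `template` key shown is the pre-6.x syntax; adapt it to your ElasticSearch version before loading it with a PUT to the `_template` endpoint:

```json
{
  "template": "apache-access-*",
  "mappings": {
    "logs": {
      "properties": {
        "client_ip": { "type": "ip" },
        "status":    { "type": "integer" },
        "bytes":     { "type": "long" }
      }
    }
  }
}
```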
If doing that, remember that:

NiFi and JSON


Remark: with the introduction of record-oriented flow files, managing JSON with NiFi became easier than ever.
The how-to below about JSON manipulation makes extensive use of message content and attribute extraction and modification.
You will find pages about the usage of records in NiFi later on.

NiFi installation and implementation

NiFi introduction

NiFi allows you to create various data pipelines in a very nice web GUI.
Inside NiFi, one event sent to and handled by the system is called a flow file. Each event is stored as a file, with associated attributes. Flow files are received, transformed, routed, split, and transferred by processors. Tons of processors are provided by default; for example, there are processors to:
  • Receive messages from Syslog, HTTP, FTP, HDFS, Kafka, …

Creating a HDF cluster

Setting up a HDF cluster with Ambari

To have a fully functional cluster running HortonWorks Data Flow


Attention, read this first before starting the deployment of an HDF cluster

(Valid end of June 2017)
The latest version of Ambari (2.5.1) is well supported on Ubuntu 16 LTS and Ubuntu 14 LTS, as is the full Hortonworks Data Platform stack (HDP, version 2.6.1). Both are also supported on Oracle Linux, SUSE, CentOS, RedHat, and Debian.