Flink WordCount in Scala | Complete tutorial in 2022

Apache Flink is a distributed processing engine that can handle both batch and streaming data. In this tutorial, we will write a Flink WordCount job in Scala.

We will use IntelliJ IDEA to write the code and build the jar, and Maven to set up the Flink dependencies. So let's get started.

Set up the Flink development environment

Before you start writing Flink code, make sure the following tools are installed and configured on your system.

Set up IntelliJ IDEA for Flink

You can download the IntelliJ IDEA build for your operating system from the JetBrains website.

Set up Maven for Flink

Since we will use Maven to manage the Flink dependencies for the WordCount project, make sure Maven is installed on your machine.

On macOS, you can install Maven with Homebrew:

brew install maven

Set up Flink locally

Follow the Flink installation guide to install Flink on your system. Once Flink is installed and a local cluster has been started, the web UI can be accessed at localhost:8081.

Flink project template for Scala

Flink provides a Maven quickstart archetype that generates a project template for your Flink job. Run the following command to generate the quickstart project:

mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-scala \
-DarchetypeVersion=1.14.4

When prompted, provide the groupId, artifactId, and package name:

Define value for property 'groupId': org.apache.flink
Define value for property 'artifactId': flink-scala-wc
Define value for property 'version' 1.0-SNAPSHOT: : 
Define value for property 'package' org.apache.flink: : 
Confirm properties configuration:
groupId: org.apache.flink
artifactId: flink-scala-wc
version: 1.0-SNAPSHOT
package: org.apache.flink
 Y: : y

After you confirm these details, a Maven project named flink-scala-wc is created. Open this project in IntelliJ IDEA.

Flink WordCount example in Scala

In this section, we will write the WordCount application in Scala.

Open the flink-scala-wc project that was generated with the Maven archetype. Delete the existing Scala sources and create a new Scala class.

Enter wordCount as the name, select Object as the kind, and click OK.

Paste the following code into the wordCount file:

package org.apache.flink

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.api.scala._

object wordCount {
  def main(args: Array[String]): Unit = {

    // parse command-line arguments such as --input and --output
    val params: ParameterTool = ParameterTool.fromArgs(args)

    // set up the batch execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment

    // make parameters available in the web interface
    env.getConfig.setGlobalJobParameters(params)

    // read the input file line by line; fails fast if --input is missing
    val text = env.readTextFile(params.getRequired("input"))

    val counts = text
      .flatMap(value => value.split("\\s+")) // split each line into words
      .filter(value => value.nonEmpty)       // drop empty tokens caused by leading whitespace
      .map(value => (value, 1))              // pair each word with a count of 1
      .groupBy(0)                            // group by the word (tuple field 0)
      .sum(1)                                // sum the counts (tuple field 1)

    // write one "word count" pair per line, separated by a space
    counts.writeAsCsv(params.getRequired("output"), "\n", " ")

    // sinks are lazy; execute() actually runs the job
    env.execute("Scala WordCount Example")
  }

}
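
To get a feel for what the flatMap, map, groupBy and sum steps compute, here is a minimal sketch of the same logic on ordinary Scala collections, with no Flink involved. The names WordCountSketch and countWords are made up for this illustration and are not part of the Flink API. The sketch also filters out empty tokens, which split yields for lines that start with whitespace.

```scala
object WordCountSketch {
  // Plain-collections analogue of the Flink pipeline: split, pair, group, sum.
  // This only illustrates the logic; it is not how Flink executes the job.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+"))   // one element per whitespace-separated token
      .filter(word => word.nonEmpty)          // drop "" produced by leading whitespace
      .map(word => (word, 1))                 // pair each word with a count of 1
      .groupBy { case (word, _) => word }     // collect the pairs that share a word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // add up the 1s

  def main(args: Array[String]): Unit = {
    val sample = Seq("to be or not", " to be")
    // print the counts sorted by word, in the same "word count" layout as the job output
    countWords(sample).toSeq.sortBy(_._1).foreach { case (w, c) => println(s"$w $c") }
  }
}
```

In the Flink job, groupBy(0) and sum(1) play the role of the groupBy and the final map here, using tuple field positions instead of pattern matching.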

Update the main class in pom.xml to match your package and object name:

<mainClass>org.apache.flink.wordCount</mainClass>

Now run Maven clean and then Maven compile, and make sure compilation finishes without errors.

Flink WordCount jar

In this section, we will generate the jar file for the WordCount job, which is required to run the Flink application.

There are several ways to generate a jar file. One of the easiest is to use IntelliJ IDEA itself. Before generating the jar, make sure the project compiles without any errors.

To generate the jar, click the Maven package button.

Once the package command finishes successfully, the jar is generated under the target folder with the name:

flink-scala-wc-1.0-SNAPSHOT.jar

Run the Flink WordCount job in Scala

We will now use the above jar file to submit the Flink job. The WordCount job takes two parameters:

input: the file to read the data from

output: the path where the output is written in CSV format
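
The job reads these two values through ParameterTool, which turns flags of the form --key value into key/value pairs. The snippet below is a simplified stand-in written only to show that mapping. ArgsSketch and parse are hypothetical names, and the real ParameterTool handles more cases (for example single-dash keys and flags without a value).

```scala
object ArgsSketch {
  // Simplified stand-in for what ParameterTool.fromArgs does with "--key value" pairs.
  // Assumption: every flag starts with "--" and is followed by exactly one value.
  def parse(args: Array[String]): Map[String, String] =
    args
      .grouped(2)                                   // walk the arguments two at a time
      .collect {
        case Array(key, value) if key.startsWith("--") =>
          key.stripPrefix("--") -> value            // "--input" becomes the key "input"
      }
      .toMap

  def main(args: Array[String]): Unit = {
    val params = parse(Array("--input", "README.txt", "--output", "flink_out.txt"))
    println(params.get("input"))
    println(params.get("output"))
  }
}
```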

Run the following command from the Flink installation directory to submit the job:

./bin/flink run --detached ~/flink-scala-wc-1.0-SNAPSHOT.jar --input ~/flink/flink-1.14.4/README.txt --output ~/flink_out.txt
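
Once the job has finished, the output contains one word and its count per line, separated by a space. As a rough sketch of how you might inspect it, the code below parses such lines and picks the most frequent words. OutputSketch, parseLine and topN are names invented for this example, and it assumes the output is a single file; if the job runs with a parallelism greater than 1, Flink writes a directory of part files instead.

```scala
import scala.util.Try

object OutputSketch {
  // Parses a line of the form "<word> <count>"; returns None for anything else.
  def parseLine(line: String): Option[(String, Int)] =
    line.split(" ") match {
      case Array(word, count) => Try(count.toInt).toOption.map(n => (word, n))
      case _                  => None
    }

  // Returns the n most frequent words, highest count first.
  def topN(lines: Seq[String], n: Int): Seq[(String, Int)] =
    lines.flatMap(line => parseLine(line)).sortBy { case (_, count) => -count }.take(n)

  def main(args: Array[String]): Unit = {
    // args(0) is assumed to be the path passed as --output to the job
    val source = scala.io.Source.fromFile(args(0))
    try topN(source.getLines().toList, 10).foreach { case (w, c) => println(s"$w $c") }
    finally source.close()
  }
}
```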

Conclusion

I hope you liked this short tutorial on Flink WordCount. In this post, we learned how to create a Flink WordCount application in Scala, how to generate the jar file, and how to submit the jar as a Flink job.

Please do let me know if you are facing any issues while following along.
