
Performing Spark Scala Word Count with Example: 2023 Edition

In this tutorial, we will understand how to create and run a Spark Scala word count program. We will be using IntelliJ IDEA to develop the Spark application. So let’s get started:

Download IntelliJ for Spark development

IntelliJ IDEA is an integrated development environment (IDE) for creating computer software. It is developed by JetBrains and is primarily used for Java programming, but it also supports other programming languages such as Kotlin, Scala, Groovy, and more.

It includes features such as code completion, code analysis, debugging, and version control integration.

Download and install IntelliJ from the JetBrains website, choosing the build that matches your operating system.

Spark Scala word count example in IntelliJ

Once IntelliJ is downloaded, open the application and add support for Scala via the Scala plugin. We will be using Maven to set up the Spark-related dependencies, and a built-in archetype to bootstrap the Spark Scala word count project.

Now open IntelliJ IDEA and click on New Project > Maven.

Select the Create from archetype checkbox, choose scala-archetype-simple, and click Next.


Give the project the name scala_wc, click Next, and then click the OK button to create a sample Scala project.
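If you prefer the command line over the IDE wizard, the same project skeleton can be generated with the Maven archetype plugin; a sketch, assuming the commonly used net.alchim31.maven:scala-archetype-simple coordinates:

mvn archetype:generate \
  -DarchetypeGroupId=net.alchim31.maven \
  -DarchetypeArtifactId=scala-archetype-simple \
  -DgroupId=org.example \
  -DartifactId=scala_wc \
  -Dversion=1.0-SNAPSHOT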

Set up the pom.xml file for the Spark Scala project

A pom.xml file will be created with the required dependencies. We need to modify it and add the Spark-related dependencies.

Now add the spark-core dependency under the <dependencies> tag:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.2.0</version>
    </dependency>

Similarly, add the spark-sql Maven dependency:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.2.0</version>
    </dependency>

NOTE: Make sure to remove <scope>provided</scope> from the Spark dependencies when running from the IDE; otherwise you will get a ClassNotFoundException at runtime.
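For reference, this is what the dependency declaration looks like when it carries the scope that must be removed for local IDE runs (the same applies to spark-sql):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.2.0</version>
      <scope>provided</scope> <!-- remove this line to run from the IDE -->
    </dependency>

When you later deploy to a cluster where the Spark jars are already available, you can add the provided scope back to keep the application JAR small.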

After performing some cleanup, the final pom.xml file looks like the one below.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>scala_wc</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <scala.version>2.12.10</scala.version>
    <target.java.version>1.8</target.java.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.2.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.2.0</version>
    </dependency>

  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <args>
            <arg>-nobootcp</arg>
            <arg>-target:jvm-${target.java.version}</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Please change the Scala and Spark versions as per your requirements, keeping the Scala binary version in the artifact suffix (_2.12) consistent with the scala.version property.
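One way to keep the versions consistent is to declare them once as Maven properties and reference them from each dependency; a sketch, with spark.version and scala.binary.version as assumed property names you would add to the <properties> block:

    <properties>
      <scala.version>2.12.10</scala.version>
      <scala.binary.version>2.12</scala.binary.version>
      <spark.version>3.2.0</spark.version>
    </properties>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>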

Now open the App.scala file under src > main > scala (inside the org.example package) and add the following word count Scala code:

package org.example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

object App {
  def main(args: Array[String]): Unit = {
    // Create a local SparkSession
    val spark = SparkSession.builder().master("local").appName("wordcount").getOrCreate()

    import spark.implicits._

    // Create a list of lines of text
    val lines = List("this is a list", "of words", "this is a list")

    // Convert the list to a single-column DataFrame
    val linesDF = lines.toDF("line")

    // Split each line on whitespace and explode it into one word per row
    val wordsDF = linesDF.select(explode(split($"line", "\\s+")).as("word"))

    // Group by word and count the occurrences of each word
    val wordCountDF = wordsDF.groupBy("word").count()

    // Show the results
    wordCountDF.show()

    spark.stop()
  }
}

The program creates a SparkSession, converts a list of lines into a DataFrame, splits each line into individual words, and then groups and counts the rows to get the number of occurrences of each word.

Finally, the results are displayed using the “show” method and the SparkSession is stopped.
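For comparison, the same word count can also be written against the lower-level RDD API; a minimal sketch that reuses the SparkSession created above:

    // Classic RDD-based word count, shown for comparison with the DataFrame version
    val counts = spark.sparkContext
      .parallelize(Seq("this is a list", "of words", "this is a list"))
      .flatMap(_.split("\\s+"))  // split each line into words
      .map(word => (word, 1))    // pair each word with an initial count of 1
      .reduceByKey(_ + _)        // sum the counts for each distinct word

    counts.collect().foreach(println)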

Submit a Spark Scala job from the IntelliJ IDE

In this section, we will understand how to run a Spark job from the IntelliJ IDE.

Before running the job, you need to configure the main class. To do so, click on Edit Configurations in the IntelliJ IDE.


Now set the main class name (in this example, the main class is App).


Now right-click on the App.scala file and click on Run.


Once the job runs successfully, you will see the word count output in the IntelliJ console, similar to the result below.

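With the sample input used above, the show() output looks something like this (row order may vary between runs):

    +-----+-----+
    | word|count|
    +-----+-----+
    | this|    2|
    |   is|    2|
    |    a|    2|
    | list|    2|
    |   of|    1|
    |words|    1|
    +-----+-----+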

Create a JAR file for the Spark Scala project

Now open the terminal or command prompt and navigate to the root directory of your project.

Run the following command to clean the project:

mvn clean

Run the following command to compile the project:

mvn compile

Finally, run the following command to package the project into a JAR file:

mvn package

The JAR file will be created in the “target” directory of your project.
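The three steps can also be combined into a single command:

mvn clean package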

Submit a Spark Scala job

Run the below command in an environment where Spark is installed to get the result of the Spark Scala word count:

spark-submit --class "org.example.App" ./scala_wc/target/scala_wc-1.0-SNAPSHOT.jar

The above command performs the below operations:

  • spark-submit: This is the command to submit a Spark job.
  • --class "org.example.App": This specifies the entry point of the application, that is, the main class of the Scala code. In this case, the main class is org.example.App.
  • ./scala_wc/target/scala_wc-1.0-SNAPSHOT.jar: This is the path to the JAR file that contains the compiled Scala code you want to run.

When you run this command, Spark starts a driver program that executes the code in the JAR file. Note that because the example hardcodes master("local") in the code, the job runs locally even if you pass a --master flag, since settings made on the SparkSession builder take precedence over spark-submit options. To run on a cluster, remove the hardcoded master and specify it on the command line instead, as shown in the sketch below.
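For example, to run the same JAR on a YARN cluster (assuming you have removed the .master("local") call from the code), you would pass the master on the command line:

spark-submit --class "org.example.App" --master yarn ./scala_wc/target/scala_wc-1.0-SNAPSHOT.jar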

Conclusion

I hope you have liked this tutorial on the Spark Scala word count program. Please do let me know if you are facing any issues while following along.

More to Explore?

How to write a Flink Kafka consumer

Flink socket wordcount
