In this tutorial, we will learn how to create and run a Spark Scala word count program. We will be using IntelliJ to develop the Spark application. So let's get started:
Download IntelliJ for Spark development
IntelliJ IDEA is an integrated development environment (IDE) for creating computer software. It is developed by JetBrains and is primarily used for Java programming, but it also supports other programming languages such as Kotlin, Scala, Groovy, and more.
It includes features such as code completion, code analysis, debugging, and version control integration.
Download and configure IntelliJ from here based on the OS version you have.
Spark Scala word count example in IntelliJ
Once IntelliJ is downloaded, open the application and add support for the Scala framework. We will be using Maven to set up the Spark-related dependencies, and we will use a built-in archetype to create the Spark Scala word count project.
Now open the IntelliJ IDE and click on New Project > select Maven.
Select the Create from archetype checkbox, select scala-archetype-simple, and click Next.
Give the project name as scala_wc, click Next, and click the OK button to create a sample Scala project.
Create the pom.xml file for the Spark Scala project
A pom.xml file will be created with the basic dependencies. We need to modify it to add the Spark-related dependencies.
Now add the Spark Core dependency under the <dependencies> tag:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>
Similarly, add the Spark SQL Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>
NOTE: Make sure to remove <scope>provided</scope> from these dependencies; otherwise, you will get a ClassNotFoundException when running the job from the IDE.
After performing some cleanup, the final pom.xml file looks like the one below.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>scala_wc</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <scala.version>2.12.10</scala.version>
    <target.java.version>1.8</target.java.version>
  </properties>
  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.2.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.2.0</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <args>
            <arg>-nobootcp</arg>
            <arg>-target:jvm-${target.java.version}</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
Please change the Scala and Spark versions as per your requirements.
Now open the App file under src > main > scala > App and add the following word count Scala code:
package org.example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

object App {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("wordcount").getOrCreate()
    import spark.implicits._

    // Create a list of strings
    val words = List("this", "is", "a", "list", "of", "words")

    // Convert the list to a DataFrame with a single "words" column
    val wordsDF = words.toDF("words")

    // Split each entry on spaces and explode into one word per row
    val wordsSplitDF = wordsDF.select(explode(split(col("words"), " ")).as("word"))

    // Group by word and count the occurrences of each word
    val wordCountDF = wordsSplitDF.groupBy("word").count()

    // Show the results (each word in this list appears once, so every count is 1)
    wordCountDF.show()

    spark.stop()
  }
}
The program creates a SparkSession, converts a list of words into a DataFrame, and uses various DataFrame transformations and aggregations to count the occurrences of each word in the list.
Finally, the results are displayed using the show method, and the SparkSession is stopped.
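If the DataFrame pipeline feels opaque, the same group-and-count logic can be sketched with plain Scala collections, with no Spark required. The object and method names below are illustrative only and not part of the tutorial project:

```scala
object WordCountSketch {
  // Count occurrences of each word, mirroring groupBy("word") followed by count
  def count(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val words = List("this", "is", "a", "list", "of", "words")
    // Every word appears once in this list, so each count is 1
    count(words).foreach { case (word, n) => println(s"$word: $n") }
  }
}
```

This runs in any plain Scala environment and produces the same counts the DataFrame version computes, just without the distributed execution.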
Submit a Spark Scala job from the IntelliJ IDE
In this section, we will understand how to run a Spark job from the IntelliJ IDE.
Before running the job, you need to configure the main class. To do so, click on Edit Configurations in the IntelliJ IDE.
Now select the main class by name (in this example, the main class name is App).
Now right-click on the App.scala file and click Run.
Once the job runs successfully, you will see the word count output in the IntelliJ console.
Create a JAR file from IntelliJ for the Spark Scala project
Now open the terminal or command prompt and navigate to the root directory of your project.
Run the following command to clean the project:
mvn clean
Run the following command to compile the project:
mvn compile
Finally, run the following command to package the project into a JAR file:
mvn package
The JAR file will be created in the “target” directory of your project.
Submit a Spark Scala job
Run the below command in an environment where Spark is installed to get the result of the Spark Scala word count:
spark-submit --class "org.example.App" ./scala_wc/target/scala_wc-1.0-SNAPSHOT.jar
The above command consists of the following parts:
- spark-submit: This is the command to submit a Spark job.
- --class "org.example.App": This specifies the entry point of the application, which is the main class of the Scala code. In this case, the main class is org.example.App.
- ./scala_wc/target/scala_wc-1.0-SNAPSHOT.jar: This is the path to the JAR file that contains the compiled Scala code you want to run.
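You can also pass an explicit master URL with the --master flag. For example, to run the same JAR locally using all available cores (a sketch, assuming the JAR path from the command above):

```shell
spark-submit \
  --class "org.example.App" \
  --master "local[*]" \
  ./scala_wc/target/scala_wc-1.0-SNAPSHOT.jar
```

When --master is omitted, spark-submit falls back to the master configured in your environment (for example, in spark-defaults.conf).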
When you run this command, Spark starts a driver program (locally in client mode, or on a cluster node in cluster deploy mode), which then executes the code in the JAR file.
Conclusion
I hope you have liked this tutorial on the Spark Scala word count program. Please let me know if you face any issues while following along.