The latest Spark and Kafka setup and development

As of today, the latest versions in the big data streaming ecosystem are as follows.

Hadoop 3.3.4

Spark 3.3.1 for Scala 2.12

Kafka 3.3.1 for Scala 2.13

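Please note that the Spark build used here targets Scala 2.12 while the Kafka build targets Scala 2.13, so the project below is compiled with Scala 2.12 to match Spark.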

How do you set them up and program with them? Please refer to my earlier blog post.

Start spark3.3 kafka3.2 streaming from scratch

The main changes are in build.sbt, which now reads:

name := "spark-consumer"

version := "0.1"

scalaVersion := "2.12.17"

libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.3.1" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.1" % "provided"

Check mvnrepository.com for the latest dependency versions.

And this is the submit script,

#!/bin/bash

# The Kafka connector is fetched at submit time via --packages
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 \
  --class "KafkaConsumer" \
  --master "local[*]" \
  target/scala-2.12/spark-consumer_2.12-0.1.jar
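The `_2.12` suffix in the --packages coordinate and in the jar path comes from scalaVersion in build.sbt. The sketch below is plain Scala, not sbt's own code, and only illustrates the naming convention. The object and method names (`CrossNames`, `binaryVersion` and so on) are made up for this example, and it assumes the project name is already lowercase with no spaces, so sbt does not need to normalize it.

```scala
// Illustration only: mimics the naming convention, it is not part of sbt.
object CrossNames {
  // "2.12.17" -> "2.12" (the Scala binary version is major.minor)
  def binaryVersion(scalaVersion: String): String =
    scalaVersion.split('.').take(2).mkString(".")

  // What the %% operator appends to a module name
  def crossArtifact(module: String, scalaVersion: String): String =
    s"${module}_${binaryVersion(scalaVersion)}"

  // Coordinate in the form expected by spark-submit --packages
  def packageCoordinate(org: String, module: String,
                        scalaVersion: String, version: String): String =
    s"$org:${crossArtifact(module, scalaVersion)}:$version"

  // Default location of the jar produced by `sbt package`
  def jarPath(name: String, scalaVersion: String, version: String): String = {
    val bin = binaryVersion(scalaVersion)
    s"target/scala-$bin/${name}_$bin-$version.jar"
  }

  def main(args: Array[String]): Unit = {
    println(packageCoordinate("org.apache.spark", "spark-sql-kafka-0-10", "2.12.17", "3.3.1"))
    println(jarPath("spark-consumer", "2.12.17", "0.1"))
  }
}
```

If scalaVersion or version changes in build.sbt, both the --packages coordinate and the jar path in the submit script have to change with it.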

This is the main program,

import org.apache.spark.sql.SparkSession

object KafkaConsumer {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder.appName("Mykafka").getOrCreate()

    // Subscribe to the "mytest" topic on the local broker
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "mytest")
      .load()

    import spark.implicits._

    // Kafka delivers key and value as binary, so cast both to strings
    val messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]

    // Running count of messages per key
    val myCount = messages.groupBy("key").count()

    // Print the full result table to the console after every trigger
    val query = myCount.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
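To make the "complete" output mode more concrete, here is a rough simulation in plain Scala with no Spark involved. It only illustrates the idea that every micro-batch updates a running count per key and the whole result table is emitted again each time. The names `CompleteModeSketch`, `update` and `run`, and the sample keys, are invented for this sketch.

```scala
// Illustration only: simulates running counts per key across micro-batches.
object CompleteModeSketch {
  // Merge the keys of one micro-batch into the running counts
  def update(counts: Map[String, Long], batch: Seq[String]): Map[String, Long] =
    batch.foldLeft(counts) { (acc, key) =>
      acc.updated(key, acc.getOrElse(key, 0L) + 1L)
    }

  // The full result table as it stands after each micro-batch
  def run(batches: Seq[Seq[String]]): Seq[Map[String, Long]] =
    batches.scanLeft(Map.empty[String, Long])(update).tail

  def main(args: Array[String]): Unit = {
    // Two hypothetical micro-batches of message keys
    val batches = Seq(Seq("a", "b", "a"), Seq("b", "c"))
    run(batches).zipWithIndex.foreach { case (table, i) =>
      println(s"Batch $i: " + table.toSeq.sortBy(_._1).mkString(", "))
    }
  }
}
```

In the second table the count for "a" is still present even though no new "a" arrived in that batch, which is the behaviour to expect from complete mode on the console sink.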

Compile and submit the job, and the streaming aggregation (the count per key) is printed to the console after each batch.

Finally, this is the development environment.

$ lsb_release -cd
Description: Ubuntu 22.04.1 LTS
Codename: jammy

$ java -version
openjdk version "11.0.17" 2022-10-18

$ scala -version
Scala code runner version 2.12.17 -- Copyright 2002-2022, LAMP/EPFL and Lightbend, Inc.

$ ruby -v
ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-linux-gnu]