Another 10 million dataset testing for Spark and Hive

2024-05-12

In my last posting I have compared the execution time for Spark and Postgresql for about 10 million data.

Execution time comparison for Spark and Postgresql

This time I generated another 10 million data which are used for testing execution time of Spark and Hive.

You can download the dataset from the URL below.

Download 10 million dataset for books (172M)

Let's create a table in Hive and load the data into it.

 > create table books (
           id int,
           name string,
           year int,
           publisher string,
           types array<string>,
           price float )
           ROW FORMAT DELIMITED
           FIELDS TERMINATED BY '|'
           COLLECTION ITEMS TERMINATED BY ',';
   
 > load data inpath "/tmp/books.txt" overwrite into table books;

Let's issue the query to Hive as follows.

 > select name,count(*) as count from books group by name order by count desc limit 30;
 +------------------------------------------------+--------+
 |                      name                      | count  |
 +------------------------------------------------+--------+
 | dive into mysql, the 4th edition               | 625    |
 | get started with julia, the 2nd edition        | 606    |
 | notes for spark, the 5th edition               | 603    |
 | notes for openpose, the 5th edition            | 603    |
 | mastering julia, the 1st edition               | 601    |
 | professional on item2vec, the 5th edition      | 601    |
 | programming inception, the 4th edition         | 601    |
 | the hard way item2vec, the 3rd edition         | 601    |
 | an introduction for haskell, the 1st edition   | 600    |
 | the hard way drill, the 1st edition            | 599    |
 | a guide to DNS, the 5th edition                | 598    |
 | introduce to bash, the 3rd edition             | 598    |
 | 24 days for forbase, the 2nd edition           | 598    |
 | learning erlang, the 1st edition               | 598    |
 | in action big data, the 2nd edition            | 596    |
 | step into NSD, the 5th edition                 | 596    |
 | core javascript, the 6th edition               | 596    |
 | definitive guide for cpp, the 3rd edition      | 596    |
 | cloud native with resnet, the 4th edition      | 596    |
 | by examples DNS, the 6th edition               | 595    |
 | action for recommand systems, the 3rd edition  | 595    |
 | teach yourself kolin, the 1st edition          | 594    |
 | cook book kolin, the 2nd edition               | 594    |
 | 24 days for alibaba cloud, the 4th edition     | 594    |
 | definitive guide for dovecot, the 2nd edition  | 594    |
 | get started with java, the 5th edition         | 593    |
 | programming DNS, the 3rd edition               | 592    |
 | a guide to rocky linux, the 2nd edition        | 592    |
 | play for postgres, the 4th edition             | 592    |
 | a guide to erlang, the 5th edition             | 592    |
 +------------------------------------------------+--------+

It takes 13s to finish the job.

Now let's load the data into Spark.

 > val schema = "id INT,name STRING,year INT,publisher STRING,types STRING,price FLOAT";
 > val df = spark.read.schema(schema).option("delimiter","|").csv("books.csv");

And issue the similar query to Spark,

 > df.groupBy("name").count.orderBy(desc("count")).show(30,false)
 +---------------------------------------------+-----+                           
 |name                                         |count|
 +---------------------------------------------+-----+
 |dive into mysql, the 4th edition             |625  |
 |get started with julia, the 2nd edition      |606  |
 |notes for spark, the 5th edition             |603  |
 |notes for openpose, the 5th edition          |603  |
 |the hard way item2vec, the 3rd edition       |601  |
 |mastering julia, the 1st edition             |601  |
 |programming inception, the 4th edition       |601  |
 |professional on item2vec, the 5th edition    |601  |
 |an introduction for haskell, the 1st edition |600  |
 |the hard way drill, the 1st edition          |599  |
 |introduce to bash, the 3rd edition           |598  |
 |learning erlang, the 1st edition             |598  |
 |24 days for forbase, the 2nd edition         |598  |
 |a guide to DNS, the 5th edition              |598  |
 |core javascript, the 6th edition             |596  |
 |step into NSD, the 5th edition               |596  |
 |cloud native with resnet, the 4th edition    |596  |
 |in action big data, the 2nd edition          |596  |
 |definitive guide for cpp, the 3rd edition    |596  |
 |by examples DNS, the 6th edition             |595  |
 |action for recommand systems, the 3rd edition|595  |
 |cook book kolin, the 2nd edition             |594  |
 |definitive guide for dovecot, the 2nd edition|594  |
 |24 days for alibaba cloud, the 4th edition   |594  |
 |teach yourself kolin, the 1st edition        |594  |
 |get started with java, the 5th edition       |593  |
 |play for postgres, the 4th edition           |592  |
 |a guide to erlang, the 5th edition           |592  |
 |programming DNS, the 3rd edition             |592  |
 |a guide to rocky linux, the 2nd edition      |592  |
 +---------------------------------------------+-----+

It takes 12.2s to finish the job.

Hive uses the default mr engine, not the optimized tez engine.

As you see, Hive 3 today has big performance improvement comparing to the old versions. It's even as fast as Spark.

The versions,

Apache Hive - 3.1.3
Apache Spark - 3.3.1