Execution time comparison for Spark and Postgresql
Apache Spark is fast enough for big data query, and Postgresql is the fastest RDB service I have seen.
I have a dataset which has about 10 million rows. It's a csv file with the content for taxi trip data. You can see the sample dataset below.
+--------+-------------+-----------+-----+-------+----------+------------+---------------------+------------+
|VendorID|trip_distance|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|
+--------+-------------+-----------+-----+-------+----------+------------+---------------------+------------+
| 2| 1.1| 7.5| 0.5| 0.5| 0.0| 0.0| 0.3| 8.8|
| 2| 4.9| 18.0| 0.5| 0.5| 0.0| 0.0| 0.3| 19.3|
| 2| 10.54| 33.0| 0.5| 0.5| 0.0| 0.0| 0.3| 34.3|
| 2| 4.75| 16.5| 0.0| 0.5| 0.0| 0.0| 0.3| 17.3|
| 2| 1.76| 8.0| 0.0| 0.5| 0.0| 0.0| 0.3| 8.8|
| 2| 5.52| 19.0| 0.5| 0.5| 0.0| 0.0| 0.3| 20.3|
| 2| 7.45| 26.0| 0.5| 0.5| 0.0| 0.0| 0.3| 27.3|
| 1| 1.2| 9.0| 0.5| 0.5| 0.0| 0.0| 0.3| 10.3|
| 1| 6.0| 18.0| 0.5| 0.5| 0.0| 0.0| 0.3| 19.3|
| 2| 3.21| 11.5| 0.5| 0.5| 0.0| 0.0| 0.3| 12.8|
| 2| 0.79| 6.0| 0.5| 0.5| 0.0| 0.0| 0.3| 7.3|
| 2| 2.43| 12.0| 0.5| 0.5| 3.99| 0.0| 0.3| 17.29|
| 2| 0.01| 2.5| 0.5| 0.5| 0.0| 0.0| 0.3| 3.8|
| 1| 3.7| 14.0| 0.5| 0.5| 3.05| 0.0| 0.3| 18.35|
| 1| 2.2| 11.0| 0.5| 0.5| 1.5| 0.0| 0.3| 13.8|
| 2| 0.54| 6.0| 0.5| 0.5| 0.0| 0.0| 0.3| 7.3|
| 2| 1.92| 7.5| 0.5| 0.5| 0.0| 0.0| 0.3| 8.8|
| 1| 1.7| 7.0| 0.5| 0.5| 1.65| 0.0| 0.3| 9.95|
| 2| 1.38| 7.0| 0.5| 0.5| 1.66| 0.0| 0.3| 9.96|
| 1| 4.9| 19.0| 0.5| 0.5| 4.06| 0.0| 0.3| 24.36|
+--------+-------------+-----------+-----+-------+----------+------------+---------------------+------------+
For a simple aggregate calculation in Spark, I can write it as the follows.
scala> df.groupBy("VendorID").agg(avg("total_amount").alias("avg_amount")).orderBy(desc("avg_amount")).show
+--------+------------------+
|VendorID| avg_amount|
+--------+------------------+
| 2|15.794205288031264|
| 1|15.465564699774234|
+--------+------------------+
This job takes 11.6s to get finished.
Now I load the data into Postgresql,
postgres=# \copy taxi FROM '/home/pyh/dataset/taxi.csv' DELIMITER ',' CSV;
COPY 10906858
And run the similar statistics in Postgresql below.
postgres=# select vendorid,avg(total_amount) as avg_amount from taxi group by vendorid order by avg_amount desc;
vendorid | avg_amount
----------+--------------------
2 | 15.794205288345042
1 | 15.465564700052573
(2 rows)
It takes 4.6s to finish the job.
So can we say Postgresql is much faster than Spark? I am afraid it can't. Because the dataset is too small, only 10 million items.
Once I had another big dataset which has 80 million items, when running the similar statistics jobs, Spark took 106s, while Postgresql took 156s.
For real bigdata jobs Spark are clearly faster than the traditional RDB from what I have seen.
Here are the environment info for Postgresql,
postgres=# show shared_buffers;
shared_buffers
----------------
384MB
(1 row)
postgres=# show work_mem;
work_mem
----------
16MB
(1 row)
postgres=# \d taxi;
Table "public.taxi"
Column | Type | Collation | Nullable | Default
-----------------------+-----------------------------+-----------+----------+---------
vendorid | integer | | |
tpep_pickup_datetime | timestamp without time zone | | |
tpep_dropoff_datetime | timestamp without time zone | | |
passenger_count | integer | | |
trip_distance | double precision | | |
pickup_longitude | double precision | | |
pickup_latitude | double precision | | |
ratecodeid | integer | | |
store_and_fwd_flag | text | | |
dropoff_longitude | double precision | | |
dropoff_latitude | double precision | | |
payment_type | integer | | |
fare_amount | double precision | | |
extra | double precision | | |
mta_tax | double precision | | |
tip_amount | double precision | | |
tolls_amount | double precision | | |
improvement_surcharge | double precision | | |
total_amount | double precision | | |
Indexes:
"vendor" btree (vendorid)
postgres=# select version();
version
----------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 12.12 (Ubuntu 12.12-0ubuntu0.20.04.1)