PostgreSQL Concurrent Scan Performance: 8.3beta1 versus 8.2.5

Introduction

This test is designed to show the great improvements in scan performance achieved in 8.3 (currently in beta). Many other areas of performance have also been improved, but these tests are specifically designed to measure the gains in scan performance.

The features in 8.3 that make this possible do not even take effect unless the table is at least 25% of the size of shared_buffers.

The Script

This script first tests the performance of a single (no concurrent queries) scan. Then it forks num_procs processes, and each open a connection and wait until it's time to start a scan. The waiting is to stagger the scans, so that they do not start all at once. When it's time, each process starts a scan (which will be at the same time other scans are already in progress, unless it's the first one), and reports how long that scan took.

#!/usr/bin/ruby

require 'postgres'

num_procs = ARGV[0].to_i
iterations = ARGV[1].to_i
offset = ARGV[2].to_i

pids = []

db = PGconn.connect('/tmp',nil,nil,nil,'test','test',nil)
t1 = Time.now
db.exec "SELECT COUNT(*) FROM test"
t2 = Time.now
db.close

puts "single process time: #{t2-t1} seconds"

(0...num_procs).each { |i|
        pid = fork
        if(pid)
                pids << pid
        else
                db = PGconn.connect('/tmp',nil,nil,nil,'test','test',nil)
                sleep i*offset
                puts [ Time.now.strftime("[%Y-%m-%d %H:%M:%S]"),
                     Process.pid.to_s, "starting" ].join("\t")
                t1 = Time.now
                (0...iterations).each{ |i|
                        res = db.exec "SELECT COUNT(*) FROM test"
                }
                t2 = Time.now
                puts [ Time.now.strftime("[%Y-%m-%d %H:%M:%S]"),
                     Process.pid.to_s, 
                         "finished: #{t2-t1}\tseconds"
                         ].join("\t")
                db.close
                exit 0
        end
}

pids.each{ |pid| Process::waitpid(pid) }

The Results

All settings in postgresql.conf are default, except autovacuum is disabled. Locale is en_US.UTF-8. No special compile-time options were given. Compiled for x86-64.

The test table: 150M tuples. Each tuple is a simple integer followed by 100 characters (=100 bytes) in a text attribute. To generate the table do:

=> create table test(i int, t text);
=> insert into test 
	select generate_series, 
	'abcdefghijklmnopqrstuvwxyabcdefghijklmnopqrstuvwxyabcdefghijklmnopqrstuvwxyabcdefghijklmnopqrstuvwxy' 
	from generate_series(1,150000000);

The test hardware:

All tests are on linux with the anticipatory I/O scheduler. The default I/O scheduler for Linux is CFQ (Completely Fair Queue), which does not work well for PostgreSQL in my tests.

PostgreSQL 8.2.5

Table size in 8.2.5: approximately 20.8 GB

10 scans, staggered to start 30 seconds apart:

single process time: 305.897248 seconds
[2007-10-06 19:23:45]   16967   starting
[2007-10-06 19:24:15]   16968   starting
[2007-10-06 19:24:45]   16969   starting
[2007-10-06 19:25:15]   16970   starting
[2007-10-06 19:25:45]   16971   starting
[2007-10-06 19:26:15]   16972   starting
[2007-10-06 19:26:45]   16973   starting
[2007-10-06 19:27:15]   16974   starting
[2007-10-06 19:27:45]   16975   starting
[2007-10-06 19:28:15]   16976   starting
[2007-10-06 19:32:14]   16967   finished: 508.988534    seconds
[2007-10-06 19:32:14]   16968   finished: 479.013674    seconds
[2007-10-06 19:37:47]   16976   finished: 571.819937    seconds
[2007-10-06 19:37:47]   16971   finished: 721.831606    seconds
[2007-10-06 19:37:47]   16972   finished: 681.512183    seconds
[2007-10-06 19:37:47]   16974   finished: 631.828676    seconds
[2007-10-06 19:37:47]   16973   finished: 661.834094    seconds
[2007-10-06 19:37:47]   16975   finished: 601.826697    seconds
[2007-10-06 19:37:47]   16969   finished: 781.843272    seconds
[2007-10-06 19:37:47]   16970   finished: 751.83741     seconds
20 scans, staggered to start 15 seconds apart:

$ ./bench.rb 20 1 15
single process time: 306.405574 seconds
[2007-10-06 19:50:50]   17672   starting
[2007-10-06 19:51:05]   17673   starting
[2007-10-06 19:51:20]   17674   starting
[2007-10-06 19:51:35]   17675   starting
[2007-10-06 19:51:50]   17676   starting
[2007-10-06 19:52:05]   17677   starting
[2007-10-06 19:52:20]   17678   starting
[2007-10-06 19:52:35]   17679   starting
[2007-10-06 19:52:50]   17680   starting
[2007-10-06 19:53:05]   17681   starting
[2007-10-06 19:53:20]   17682   starting
[2007-10-06 19:53:35]   17683   starting
[2007-10-06 19:53:50]   17684   starting
[2007-10-06 19:54:05]   17685   starting
[2007-10-06 19:54:20]   17686   starting
[2007-10-06 19:54:35]   17687   starting
[2007-10-06 19:54:50]   17688   starting
[2007-10-06 19:55:05]   17689   starting
[2007-10-06 19:55:20]   17690   starting
[2007-10-06 19:55:35]   17691   starting
[2007-10-06 20:28:31]   17685   finished: 2065.646385   seconds
[2007-10-06 20:28:31]   17684   finished: 2080.654577   seconds
[2007-10-06 20:28:31]   17683   finished: 2095.656818   seconds
[2007-10-06 20:28:31]   17675   finished: 2215.665656   seconds
[2007-10-06 20:28:31]   17676   finished: 2200.662957   seconds
[2007-10-06 20:28:31]   17672   finished: 2260.648018   seconds
[2007-10-06 20:28:31]   17682   finished: 2110.633246   seconds
[2007-10-06 20:28:31]   17677   finished: 2185.664061   seconds
[2007-10-06 20:28:31]   17687   finished: 2035.629152   seconds
[2007-10-06 20:28:31]   17679   finished: 2155.663292   seconds
[2007-10-06 20:28:31]   17674   finished: 2230.665842   seconds
[2007-10-06 20:28:31]   17681   finished: 2125.663343   seconds
[2007-10-06 20:28:31]   17673   finished: 2245.674556   seconds
[2007-10-06 20:28:31]   17680   finished: 2140.665867   seconds
[2007-10-06 20:28:31]   17678   finished: 2170.67249    seconds
[2007-10-06 20:28:31]   17686   finished: 2050.664088   seconds
[2007-10-06 20:28:34]   17690   finished: 1994.382316   seconds
[2007-10-06 20:28:34]   17691   finished: 1979.391717   seconds
[2007-10-06 20:28:34]   17688   finished: 2024.400597   seconds
[2007-10-06 20:28:34]   17689   finished: 2009.437021   seconds

PostgreSQL 8.3beta1

Table size in 8.3beta1: approximately 19.7 GB

10 scans, staggered to start 30 seconds apart:

$ ./bench.rb 10 1 30
single process time: 310.992567 seconds
[2007-10-07 00:13:39]   24327   starting
[2007-10-07 00:14:09]   24328   starting
[2007-10-07 00:14:39]   24329   starting
[2007-10-07 00:15:09]   24330   starting
[2007-10-07 00:15:39]   24331   starting
[2007-10-07 00:16:09]   24332   starting
[2007-10-07 00:16:39]   24333   starting
[2007-10-07 00:17:09]   24334   starting
[2007-10-07 00:17:39]   24335   starting
[2007-10-07 00:18:09]   24336   starting
[2007-10-07 00:18:57]   24327   finished: 318.092037    seconds
[2007-10-07 00:19:27]   24328   finished: 317.486874    seconds
[2007-10-07 00:19:57]   24329   finished: 317.56133     seconds
[2007-10-07 00:20:26]   24330   finished: 316.905676    seconds
[2007-10-07 00:20:56]   24331   finished: 316.63102     seconds
[2007-10-07 00:21:26]   24332   finished: 316.313321    seconds
[2007-10-07 00:21:56]   24333   finished: 316.234245    seconds
[2007-10-07 00:22:24]   24334   finished: 314.901694    seconds
[2007-10-07 00:22:54]   24335   finished: 314.846691    seconds
[2007-10-07 00:23:23]   24336   finished: 314.135473    seconds
20 scans, staggered to start 15 seconds apart:

$ ./bench.rb 20 1 15
single process time: 309.949542 seconds
[2007-10-07 00:40:44]   25097   starting
[2007-10-07 00:40:59]   25098   starting
[2007-10-07 00:41:14]   25099   starting
[2007-10-07 00:41:29]   25100   starting
[2007-10-07 00:41:44]   25101   starting
[2007-10-07 00:41:59]   25102   starting
[2007-10-07 00:42:14]   25103   starting
[2007-10-07 00:42:29]   25104   starting
[2007-10-07 00:42:44]   25105   starting
[2007-10-07 00:42:59]   25106   starting
[2007-10-07 00:43:14]   25107   starting
[2007-10-07 00:43:29]   25108   starting
[2007-10-07 00:43:44]   25109   starting
[2007-10-07 00:43:59]   25110   starting
[2007-10-07 00:44:14]   25111   starting
[2007-10-07 00:44:29]   25112   starting
[2007-10-07 00:44:44]   25113   starting
[2007-10-07 00:44:59]   25114   starting
[2007-10-07 00:45:14]   25115   starting
[2007-10-07 00:45:29]   25116   starting
[2007-10-07 00:46:03]   25097   finished: 318.567256    seconds
[2007-10-07 00:46:19]   25098   finished: 319.490271    seconds
[2007-10-07 00:46:34]   25099   finished: 319.897278    seconds
[2007-10-07 00:46:49]   25100   finished: 320.0251      seconds
[2007-10-07 00:47:04]   25101   finished: 319.868858    seconds
[2007-10-07 00:47:19]   25102   finished: 319.48718     seconds
[2007-10-07 00:47:34]   25103   finished: 319.670489    seconds
[2007-10-07 00:47:49]   25104   finished: 319.932719    seconds
[2007-10-07 00:48:04]   25105   finished: 319.788767    seconds
[2007-10-07 00:48:19]   25106   finished: 319.877593    seconds
[2007-10-07 00:48:34]   25107   finished: 319.523455    seconds
[2007-10-07 00:48:49]   25108   finished: 319.252667    seconds
[2007-10-07 00:49:04]   25109   finished: 319.314306    seconds
[2007-10-07 00:49:18]   25110   finished: 318.703788    seconds
[2007-10-07 00:49:32]   25111   finished: 317.934656    seconds
[2007-10-07 00:49:47]   25112   finished: 317.769151    seconds
[2007-10-07 00:50:02]   25113   finished: 317.707941    seconds
[2007-10-07 00:50:17]   25114   finished: 317.642323    seconds
[2007-10-07 00:50:31]   25115   finished: 316.932149    seconds
[2007-10-07 00:50:46]   25116   finished: 316.351529    seconds

Conclusion

8.2 degrades quickly when multiple large sequential scans are executing at once, and in 8.3 the scans execute almost as fast as if no other scans were in progress. This is a major win for these types of queries. The performance advantage comes from the Synchronized Scan feature of 8.3.

There is also an unseen benefit: even though 8.3 is much faster, these large sequential scans actually use much less space in the shared buffers. This is due to the buffer recycling feature of 8.3, which avoids "cache pollution" that could happen with large sequential scans.

Notice that the single scans go slightly faster in 8.2 versus 8.3 in this test. When I saw this result, I decided to find out the cause, so I tried reading the files directly.

Reading 8.2.5 data files:

$ time cat data82-big/base/16384/16386* > /dev/null

real    5m5.673s
user    0m0.876s
sys     0m15.697s

Reading 8.3beta1 data files:

$ time cat data83-big/base/16387/16395* > /dev/null

real    5m10.196s
user    0m0.908s
sys     0m15.089s
So, it must be the placement on disk that explains the slight difference. Even though the 8.3 data files are slightly smaller, they take longer to read from disk, regardless of PostgreSQL. So, straight COUNT(*) on a large table appears to go at disk speed in both 8.2 and 8.3.

This is actually a win for 8.3 also, because 8.3 has reduced row header size (saving either 4 or 8 bytes per tuple, depending on platform), and also varvarlena, which saves space in length headers for variable-length attributes. These two optimizations equal about 1GB of total space savings in this test (or about 5%), meaning that there is less to read from disk. If both databases happened to be placed identically on disk by the filesystem, I would expect these two 8.3 optimizations to show about a 5% advantage over 8.2.

One thing these benchmarks didn't entirely explain is why, in 8.2, all of the scans finished at almost exactly the same time. I suspect that this is due in part to the characteristics of the linux anticipatory I/O scheduler. What I think happened in this test is that the scans very slowly synchronized over the course of the scan, and then eventually all finished at once. It would take more testing to determine exactly what's going on, but it's surely much slower than 8.3.