For fun (and for no particularly better reason), I have a long tradition of writing programs that generate prime numbers: all prime numbers up to some limit.
Here, I do it in C++ and try to squeeze performance out of what I do.
Some background on this is on my home page.
A positive integer is prime if it has exactly two positive divisors. Any number n is divisible by 1 and by n itself. If any other divisors exist on top of these, the number isn't prime. Note that we need two different divisors: this rules out 1, which has only one positive divisor, namely itself. So 1 isn't prime.

That definition straightforwardly leads to a first stupid_prime program:
// ui is unsigned long or unsigned long long
template<typename ui> void driven(ui max_prime,
                                  std::function<void(ui)> prime_receiver) {
  for(ui maybe_prime = 2; maybe_prime <= max_prime; ++maybe_prime) {
    bool is_prime = true;
    for(ui divisor = 2; is_prime && divisor < maybe_prime; ++divisor)
      is_prime = 0 != maybe_prime % divisor;
    if(is_prime)
      prime_receiver(maybe_prime);
  }
}
This, of course, works:
import primes_driver as pd
pd.run("stupid_prime", "50")
But this is a slow algorithm. While it features a small (constant) memory footprint, its runtime is quadratic.
It's not worth spending much time on this algorithm, but for the record, here is a (crude) performance estimate:
stupid_prime_data = pd.report("stupid_prime");
pd.runtime_plot(stupid_prime_data)
A prime number generation algorithm is run with increasing max_prime values: 1e4, 3e4, 1e5, 3e5, 1e6, 3e6, and so on.
Resource usage of each run is recorded and collected.
The runs terminate when their combined (wall-clock) runtime exceeds a fixed amount of time,
presently a mere 30 s. Afterwards, a few "usual suspects" are used to provide a fit to that resource usage data, in particular, total time used (user + system) and memory
consumption. The best fit is chosen and displayed, as a formula.
The max_prime variable is abbreviated as N in those formulas.
So runtime performance is not duly analyzed, but only measured. The resulting formulas should be taken with a decent lump of salt.
I'm usually a fan of automated testing. Here, I restrict myself to doing end-result testing only.
The prime numbers are generated in a straightforward "one prime number per line" format (regular expression (\d+\n)*). So, when driven with the same max_prime input parameter, different prime number generators should all generate identical outputs. This (alone) is checked: a SHA256 checksum of each entire output stream is taken and compared with previous results.
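The idea of hashing the whole output stream can be sketched in a few lines. In this sketch, std::hash is only a stand-in for SHA256 (my actual checks use real SHA256 checksums); the helper name output_digest is made up for illustration:

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Render primes in the "one prime number per line" format and hash
// the whole stream. std::hash stands in for SHA256 here; the point
// is only that equal output streams yield equal digests.
std::size_t output_digest(const std::vector<unsigned long>& primes) {
  std::ostringstream out;
  for(unsigned long p: primes)
    out << p << '\n';
  return std::hash<std::string>{}(out.str());
}
```

Two generators that emit the same primes then produce the same digest, regardless of how they computed them internally.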
Back to prime number generation. We want it to become faster.
A first observation: when a * b = c (all assumed positive), not both of a and b can exceed sqrt(c).

A second observation: if one divides c by some sequence of numbers that increases without limit, e.g., a = 1, 2, 3, ..., then sqrt(c) is surpassed at the point where c / a becomes smaller than a.
Based on these observations, a faster algorithm is this:
template<typename ui> void gen_primes(ui max_prime,
                                      std::function<void(ui)> prime_receiver) {
  for(ui maybe_prime = 2; maybe_prime <= max_prime; ++maybe_prime) {
    bool is_prime = true;
    for(ui divisor = 2; is_prime; ++divisor) {
      // Division and remainder is really the same calculation.
      // We hope the compiler sees that, too, and optimizes it:
      ui q = maybe_prime / divisor;
      ui rem = maybe_prime % divisor;
      if (q < divisor) {
        // divisor is already above the sqrt of maybe_prime,
        // so we are done:
        break;
      }
      is_prime = 0 != rem;
    }
    if(is_prime)
      prime_receiver(maybe_prime);
  }
}
In passing, I would have liked to use something like std::div instead of writing the above comment, but found it is specified for signed integral types only.
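As a quick sanity check, gen_primes can be exercised with a lambda that collects the results into a vector. The definition is repeated here so the snippet stands alone; the helper primes_upto is made up for this illustration:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Same trial-division-up-to-the-root algorithm as above,
// repeated so this snippet compiles on its own.
template<typename ui> void gen_primes(ui max_prime,
                                      std::function<void(ui)> prime_receiver) {
  for(ui maybe_prime = 2; maybe_prime <= max_prime; ++maybe_prime) {
    bool is_prime = true;
    for(ui divisor = 2; is_prime; ++divisor) {
      ui q = maybe_prime / divisor;
      ui rem = maybe_prime % divisor;
      if(q < divisor)
        break;  // divisor exceeds the root of maybe_prime
      is_prime = 0 != rem;
    }
    if(is_prime)
      prime_receiver(maybe_prime);
  }
}

// Collect the generated primes into a vector:
std::vector<unsigned long> primes_upto(unsigned long max_prime) {
  std::vector<unsigned long> primes;
  gen_primes<unsigned long>(max_prime,
                            [&primes](unsigned long p) { primes.push_back(p); });
  return primes;
}
```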
div_by_all_upto_root_data = pd.report("div_by_all_upto_root")
Runtime is down from O(N**2) to O(N**1.5 / log(N)).
However, this needs to be taken with rather a grain of salt,
as this is not the result of a thorough analysis of the algorithm,
but only of data fitting on a handful of measurements taken on my laptop.
That said, here is a double-log plot of the time the algorithm actually took, vs. the time predicted by the approximation formula:
pd.runtime_plot(div_by_all_upto_root_data)
How to speed up further?
The next obvious remark: we do not need to test all numbers. Just testing the odd ones is enough; beyond 2, even numbers cannot be prime.
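A minimal sketch of that idea (not the code used later in this article): handle 2 separately, then step both candidates and divisors by 2. The function name odd_only_primes is made up for illustration:

```cpp
#include <cassert>
#include <vector>

// Trial division with odd candidates and odd divisors only.
std::vector<unsigned long> odd_only_primes(unsigned long max_prime) {
  std::vector<unsigned long> primes;
  if(max_prime >= 2)
    primes.push_back(2);  // the only even prime, handled up front
  for(unsigned long maybe_prime = 3; maybe_prime <= max_prime; maybe_prime += 2) {
    bool is_prime = true;
    for(unsigned long divisor = 3; divisor * divisor <= maybe_prime; divisor += 2)
      if(maybe_prime % divisor == 0) {
        is_prime = false;
        break;
      }
    if(is_prime)
      primes.push_back(maybe_prime);
  }
  return primes;
}
```

This halves the candidate count and the divisor count, a constant-factor gain that the wheel below generalizes.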
This reasoning can be extended straightforwardly. Instead of multiples of 2, disregard multiples of the first few primes. To give an example, let us assume we want to generate a stream of those numbers not divisible by any of the initial primes 2, 3, and 5.
The pattern of these numbers repeats with a period of 2 * 3 * 5 = 30, in this sense: one can decide whether a number n is divisible by any of those three if one knows n % (2 * 3 * 5) = n % 30. For actual computational purposes, one can store the numbers <= 30 not divisible by 2, 3, and 5, and obtain any larger such numbers by adding appropriate multiples of 30.

This is called the "wheel" trick. Each addition of 30 is thought to start a new revolution of the wheel.
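For the {2, 3, 5} example, the stored residues can be computed mechanically. A minimal sketch (not the article's wheel class; the function name wheel_residues is made up):

```cpp
#include <cassert>
#include <vector>

// Collect all residues up to the period that are coprime to the
// wheel's initial primes; here, hard-coded {2, 3, 5}, period 30.
std::vector<unsigned long> wheel_residues() {
  const unsigned long period = 2 * 3 * 5;
  std::vector<unsigned long> data;
  for(unsigned long r = 1; r <= period; ++r)
    if(r % 2 != 0 && r % 3 != 0 && r % 5 != 0)
      data.push_back(r);
  return data;
}
```

This yields exactly the eight residues {1, 7, 11, 13, 17, 19, 23, 29} that the wheel stores.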
The gist of my implementation: it stores the low non-divisibles (e.g., {1, 7, 11, 13, 17, 19, 23, 29}) in some std::vector<ui> data_, and the period (e.g., 30) in ui period_.

To consume those numbers, a std::function<bool(ui)> receiver is provided that knows what to do with them. Its return value has the semantics "keep going", so false is a signal for stopping the wheel.
The actual original implementation had some additional sophistication, but the gist of the wheel is code like this:
for(ui base{0}; true; base += period_) {
  for(ui offset: data_) {
    ui next = base + offset;
    if(upto < next)
      return;
    if( ! receiver(next))
      return;
  }
}
(I later rewrote the wheel
to iterators, so this code is gone.)
Such a wheel can be used at two levels: for the generator of maybe_prime candidates, and for the list of divisors used to test maybe_prime. This results in the wheel_generator prime number generation algorithm. Here is an example using the initial primes list {2, 3, 5, 7, 11, 13}:
wheel_data = pd.report("wheel_generator")
pd.runtime_plot(wheel_data)
The wheel data size increases (p - 1)-fold when a new prime p is added to the wheel logic.
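That (p - 1)-fold growth follows from Euler's totient function being multiplicative. A quick brute-force check (the helper wheel_size is made up for illustration and not part of my actual code):

```cpp
#include <cassert>
#include <vector>

// Count residues up to the period that are coprime to all wheel
// primes: that is the wheel's data size, i.e., Euler's totient
// of the period.
unsigned long wheel_size(const std::vector<unsigned long>& primes) {
  unsigned long period = 1;
  for(unsigned long p: primes)
    period *= p;
  unsigned long count = 0;
  for(unsigned long r = 1; r <= period; ++r) {
    bool coprime = true;
    for(unsigned long p: primes)
      if(r % p == 0) {
        coprime = false;
        break;
      }
    if(coprime)
      ++count;
  }
  return count;
}
```

Adding 7 to a {2, 3, 5} wheel, for instance, takes the data size from 8 to 48, a 6-fold increase.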
The larger the wheel gets, the more RAM is needed and the less effective CPU caches are. For this (and other) reasons, it is not wise to grow the wheel indefinitely.
To experiment, I introduced a configuration environment variable WHEEL_INITIAL_PRIMES_UPTO. On my laptop, increasing this from 19 to 23 already increases the runtime.
Going one step further, using initial primes up to 29, leads to "catastrophic failure": the machine starts to swap heavily, becomes next to unusable, and prime generation performance drops to dismal levels. (Lesson learned: remove swap space. The machine remains usable, and a fast, hard crash of the prime generator results.)
Here is a parameter study (maximum_mem_size in kB and time in seconds):
pd.wheel_generation_parameter_study(max_prime = 10000000)
Now we have wheels. How to filter further? Two ideas bring further improvements: (1) we only need to filter multiples of primes; (2) we do not need to filter what has already been filtered by the wheel.
To this end, a sieve class is designed. We want one object of this class for each prime p below the root of max_prime that wasn't already used in wheel construction. The sieve should filter out all n * p, where n is any number produced by the wheel.

(A multiple n * p is produced by the wheel if and only if n * p is not a multiple of one of the wheel's initial primes. As p is a prime and not among those initial primes, this is true exactly if n itself is not a multiple of the initial primes. This is true exactly if n is produced by the wheel.)
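That equivalence can be spot-checked by brute force. A hypothetical verification for the {2, 3, 5} wheel and p = 7 (both helper names are made up for this check):

```cpp
#include <cassert>

// A number survives the {2, 3, 5} wheel exactly if it is coprime
// to 2, 3 and 5:
bool on_wheel(unsigned long n) {
  return n % 2 != 0 && n % 3 != 0 && n % 5 != 0;
}

// Since 7 is coprime to 2, 3 and 5, n * 7 survives the wheel
// exactly when n itself does. Check this for all n up to a bound:
bool equivalence_holds(unsigned long upto) {
  for(unsigned long n = 1; n <= upto; ++n)
    if(on_wheel(n * 7) != on_wheel(n))
      return false;
  return true;
}
```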
We also need an iterator that produces all numbers generated by the wheel, that is, a wheel_iterator. This can be coded easily.
Here is a sieve implementation. For the intended algorithm, it needs to report the next number it wants to have removed.
template <typename ui, class InputIt> class sieve {
private:
  const ui p_;
  InputIt base_;
  ui next_removal_;
public:
  explicit sieve(InputIt base, ui p) : p_(p), base_(base) {
    next_removal_ = *base_ * p_;
  }
  ui next_removal() const {return next_removal_;}
  sieve<ui, InputIt>& operator++() {
    ++base_;
    next_removal_ = *base_ * p_;
    return *this;
  }
};
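A hypothetical usage example, with a plain pointer serving as the InputIt in place of the real wheel_iterator: a sieve for p = 7, started at the wheel number 7, announces the removals 49, 77, 91, and so on.

```cpp
#include <cassert>

// The sieve class from above, repeated so this snippet compiles
// on its own:
template <typename ui, class InputIt> class sieve {
private:
  const ui p_;
  InputIt base_;
  ui next_removal_;
public:
  explicit sieve(InputIt base, ui p) : p_(p), base_(base) {
    next_removal_ = *base_ * p_;
  }
  ui next_removal() const {return next_removal_;}
  sieve<ui, InputIt>& operator++() {
    ++base_;
    next_removal_ = *base_ * p_;
    return *this;
  }
};

// Stand-in for the wheel_iterator: a pointer into a few numbers
// produced by a {2, 3, 5} wheel, starting at 7.
const unsigned long wheel_vals[] = {7, 11, 13, 17, 19, 23};

sieve<unsigned long, const unsigned long*> make_seven_sieve() {
  return sieve<unsigned long, const unsigned long*>(wheel_vals, 7);
}
```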
To mix what these various sieves have to say, a priority queue can be used conveniently:
typedef sieve<ui, wheel_iterator<ui>> sv;
// Sort by which sieve filters next:
struct lowest_first {
  bool operator()(const sv* lhs, const sv* rhs) const {
    return lhs->next_removal() > rhs->next_removal();
  }
};
std::priority_queue<sv*, std::vector<sv*>, lowest_first> queue;
This queue gets filled with the sieves. For the utmost in performance, plain raw pointers are used. (There is a std::vector<std::unique_ptr<sv>> container in the same scope which will take care of sieve cleanup.)
The basic algorithm step involves incrementing the iterator at the top of the queue, so its next_removal() increases, and then reshuffling the queue.

As I could not find a "replace top with new value" method, I implemented this functionality with a pop, followed by a push. (There goes "utmost in performance". However, a planned future algorithm will make this non-consequential.)
auto step_queue = [&queue]() -> ui {
  auto active_sieve = queue.top();
  queue.pop();
  ++(*active_sieve);
  queue.push(active_sieve);
  return queue.top()->next_removal();
};
With that, the inner loop of this prime generation algorithm becomes:
ui next_to_filter = queue.top()->next_removal();
for(ui maybe_prime: wheel.range(2, max_prime)) {
  if(maybe_prime < next_to_filter) {
    prime_receiver(maybe_prime);
  } else {
    while(next_to_filter < maybe_prime)
      next_to_filter = step_queue();
    if(next_to_filter == maybe_prime)
      do
        next_to_filter = step_queue();
      while(next_to_filter == maybe_prime);
    else
      prime_receiver(maybe_prime);
  }
}
Inspired by our previous measurements, initial primes up to 19 are used in the wheel.
Here is the result:
import os
os.environ["WHEEL_INITIAL_PRIMES_UPTO"] = "19"
sieve_data = pd.report("sieve_generator")
pd.runtime_plot(sieve_data)
So, we apparently have improved performance from O(N**1.5 / log(N)) to O(N * log(N)). Our performance measurements (which squeeze as many runs as possible into 30 seconds) have for the first time allowed an algorithm to generate all primes up to 3e8.
(To be continued.)