Troubleshooters.Com
and Code Corner
Present

Fun With Prime Numbers

All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.



# Introduction

What was the name of that 1983 Cyndi Lauper hit? "Geeks Just Want To Have Fun"?

This document has absolutely no redeeming social value. The prime number algorithms you find in this document won't break any records. If you want a REALLY fast prime algorithm, plenty are scattered all over the net.

This document contains a bunch of programs to figure out prime numbers. The first rendition requires 1:23 (one minute 23 seconds) to calculate all primes below ten million. From there, improvements are made, dead ends are explored and backed out of, leading to the final version, which finds all primes below 40 billion in less than 4 hours, and can find all primes below 100 million in less than 13 seconds. From there, if you really enjoy prime numbers, you can venture out on the net to find truly optimized algorithms.

But meantime, have fun!

# The Obvious Method for Finding Primes

##### By Steve Litt
We'll start with an obvious method, and attack it with brute force. We test each number, starting with 2 and continuing up through a stop point. The test consists of dividing the number by every possible divisor up to the square root of the number. If no factors are found, it's prime. If a factor is found, it's not prime (it's composite). Either way, we move up to the next number. Here's the code:

```c
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long bignum;

/* %llu is the correct printf() conversion for unsigned long long */
void printPrime(bignum bn)
{
	printf("%llu\n", bn);
}

void findPrimes(bignum topCandidate)
{
	bignum candidate = 2;
	while(candidate <= topCandidate) {
		bignum trialDivisor = 2;
		int prime = 1;
		while(trialDivisor * trialDivisor <= candidate) {
			if(candidate % trialDivisor == 0) {
				prime = 0;
				break;
			}
			trialDivisor++;
		}
		if(prime)
			printPrime(candidate);
		candidate++;
	}
}

int main(int argc, char *argv[])
{
	bignum topCandidate = 1000;
	if(argc > 1)
		topCandidate = atoll(argv[1]);
	findPrimes(topCandidate);
	return 0;
}
```

We define a type called bignum in order to have huge numbers that are the same type throughout the prime number finders in this magazine. We define it as unsigned long long so as to get some really huge numbers, if we need them.

The main() routine of the preceding program simply acquires the top number to be tested, then calls findPrimes() with that number. The findPrimes() routine repeatedly increments the candidate, then loops through all possible divisors up to the candidate's square root. If none of the divisors evenly divide, printPrime() is called to print the number.

Here's the output for candidates up to 30:

```
[slitt@mydesk primes]$ ./a.out 30
2
3
5
7
11
13
17
19
23
29
[slitt@mydesk primes]$
```

Here's how long it takes for my Athlon 2400XP with 1.5GB RAM to check the first 10,000,000 candidates:
```
[slitt@mydesk primes]$ time ./a.out 10000000 > /dev/null
83.82user 0.00system 1:23.84elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+108minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

A minute and 23 seconds to find all primes up to 10,000,000. Not too shabby!

But wait -- there's more. If you read this document all the way to the end, you'll see this performance bettered more than tenfold.

Our first improvement will be using only primes as divisors...

# Using Only Primes As Divisors

##### By Steve Litt
The preceding article featured a prime finding program that was obvious, brute force, simple and fast. Can we improve the performance even more? One obvious improvement would be to test each candidate only with numbers that have already been proven prime. To do so we'll need to keep a list of already found primes. We could keep the list three ways:
1. In an array
2. On disk
3. In a linked list

Keeping it in an array is simplest, but one must declare the array before finding primes. How big do you declare it? Making its size topCandidate elements long would obviously do the trick. Even topCandidate/2 + 1 would work, because the only even prime is 2. But even that is much too big, because as you get out into the tens of millions, primes occur maybe once per ten numbers. But how much less?

Another problem with an array is, where are you going to put it? If you make it a non-static local variable you're a braver man than I, because it's going to come straight off the stack.

Choice #2 is keeping the prime list on disk. No memory problems, no need to pre-estimate the number of primes. What a deal. Except that the time taken to recover these numbers to use them as divisors would slow the program horribly.

So I'll use choice #3, a linked list. Watch this...

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long bignum;

struct primerec {
	bignum bn;
	struct primerec *next;
};

void printPrime(bignum bn)
{
	printf("%llu\n", bn);
}

void freeList(struct primerec *head)
{
	struct primerec *thisPrime = head;
	while(1) {
		struct primerec *nextPrime = thisPrime->next;
		free(thisPrime);
		if(nextPrime == NULL)
			break;
		else
			thisPrime = nextPrime;
	}
}

void findPrimes(bignum topCandidate)
{
	struct primerec *firstPrime = malloc(sizeof(struct primerec));
	assert(firstPrime != NULL);
	struct primerec *latestPrime = firstPrime;
	firstPrime->bn = 2;
	firstPrime->next = NULL;
	bignum candidate = 2;
	while(candidate <= topCandidate) {
		struct primerec *thisPrime = firstPrime;
		int prime = 1;
		while(thisPrime->bn * thisPrime->bn <= candidate) {
			if(candidate % thisPrime->bn == 0) {
				prime = 0;
				break;
			}
			thisPrime = thisPrime->next;
		}
		if(prime) {
			printPrime(candidate);
			latestPrime->next = malloc(sizeof(struct primerec));
			assert(latestPrime->next != NULL);
			latestPrime = latestPrime->next;
			latestPrime->bn = candidate;
			latestPrime->next = NULL;
		}
		candidate++;
	}
	freeList(firstPrime);
}

int main(int argc, char *argv[])
{
	bignum topCandidate = 1000;
	if(argc > 1)
		topCandidate = atoll(argv[1]);
	findPrimes(topCandidate);
	return 0;
}
```

In the preceding we create a struct primerec such that each instance holds the prime number in its bn element, and the element next points to the next such struct, creating a linked list. We substitute thisPrime->bn for all instances of trialDivisor, thereby dividing only by found primes. When we find a prime, we malloc() a new struct onto latestPrime->next, set the bn element of that new struct to the found prime, and that struct becomes the new latestPrime. Last but not least, at the very end we call freeList() on the head element to free all allocated structs.

That's a lot of extra work the computer must do just to divide only by primes. Will it save time? Check this out...

```
[slitt@mydesk primes]$ time ./jj 10000000 > /dev/null
17.20user 0.05system 0:18.19elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+5814minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

Elapsed time has fallen from 1:23 to 0:18.19 -- more than a fourfold reduction in runtime. Pretty darned good! But is the output correct? A regression test between this new version and the all-divisors version reveals that the two output identical data.

Now we have a fast prime finder. We can't make it any faster, can we?

# Odd Candidates Only

##### By Steve Litt
Well here's a deep thought -- why are we testing even candidates? We could make 2 a special case and just print it, not include it in the list of primes, start our candidates with 3, and increment by 2 every time. Watch this:
```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long bignum;

struct primerec {
	bignum bn;
	struct primerec *next;
};

void printPrime(bignum bn)
{
	printf("%llu\n", bn);
}

void freeList(struct primerec *head)
{
	struct primerec *thisPrime = head;
	while(1) {
		struct primerec *nextPrime = thisPrime->next;
		free(thisPrime);
		if(nextPrime == NULL)
			break;
		else
			thisPrime = nextPrime;
	}
}

void findPrimes(bignum topCandidate)
{
	printf("2\n");
	struct primerec *firstPrime = malloc(sizeof(struct primerec));
	assert(firstPrime != NULL);
	struct primerec *latestPrime = firstPrime;
	firstPrime->bn = 3;
	firstPrime->next = NULL;
	bignum candidate = 3;
	while(candidate <= topCandidate) {
		struct primerec *thisPrime = firstPrime;
		int prime = 1;
		while(thisPrime->bn * thisPrime->bn <= candidate) {
			if(candidate % thisPrime->bn == 0) {
				prime = 0;
				break;
			}
			thisPrime = thisPrime->next;
		}
		if(prime) {
			printPrime(candidate);
			latestPrime->next = malloc(sizeof(struct primerec));
			assert(latestPrime->next != NULL);
			latestPrime = latestPrime->next;
			latestPrime->bn = candidate;
			latestPrime->next = NULL;
		}
		candidate += 2;
	}
	freeList(firstPrime);
}

int main(int argc, char *argv[])
{
	bignum topCandidate = 1000;
	if(argc > 1)
		topCandidate = atoll(argv[1]);
	findPrimes(topCandidate);
	return 0;
}
```

The preceding code has only minor changes from the first primes-only code. We hard code it to print 2 as a prime, and start testing at 3, both for divisors and candidates. We then increment candidates by 2 so they're always odd. How much time do we save? About 1.5 seconds...

```
[slitt@mydesk primes]$ time ./jj 10000000 > /dev/null
16.53user 0.04system 0:16.58elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+5814minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

And once again, regression tests show the same output.

If we can eliminate candidates and divisors divisible by 2, why not also 3? Same idea. Hardcode printing of 2 and 3, start candidates at 5, start divisors at 5, and alternate between incrementing by 2 and incrementing by 4. Here it is...
```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long bignum;

struct primerec {
	bignum bn;
	struct primerec *next;
};

void printPrime(bignum bn)
{
	printf("%llu\n", bn);
}

void freeList(struct primerec *head)
{
	struct primerec *thisPrime = head;
	while(1) {
		struct primerec *nextPrime = thisPrime->next;
		free(thisPrime);
		if(nextPrime == NULL)
			break;
		else
			thisPrime = nextPrime;
	}
}

void findPrimes(bignum topCandidate)
{
	printf("2\n");
	printf("3\n");
	struct primerec *firstPrime = malloc(sizeof(struct primerec));
	assert(firstPrime != NULL);
	struct primerec *latestPrime = firstPrime;
	firstPrime->bn = 5;
	firstPrime->next = NULL;
	bignum candidate = 5;
	int inc = 2;
	while(candidate <= topCandidate) {
		struct primerec *thisPrime = firstPrime;
		int prime = 1;
		while(thisPrime->bn * thisPrime->bn <= candidate) {
			if(candidate % thisPrime->bn == 0) {
				prime = 0;
				break;
			}
			thisPrime = thisPrime->next;
		}
		if(prime) {
			printPrime(candidate);
			latestPrime->next = malloc(sizeof(struct primerec));
			assert(latestPrime->next != NULL);
			latestPrime = latestPrime->next;
			latestPrime->bn = candidate;
			latestPrime->next = NULL;
		}
		candidate += inc;
		if(inc == 2)
			inc = 4;
		else
			inc = 2;
	}
	freeList(firstPrime);
}

int main(int argc, char *argv[])
{
	bignum topCandidate = 1000;
	if(argc > 1)
		topCandidate = atoll(argv[1]);
	findPrimes(topCandidate);
	return 0;
}
```

The preceding is pretty obviously a way to eliminate testing of 2, 3 and all their multiples. Perhaps the only tricky part is this:
```c
candidate += inc;
if(inc == 2)
	inc = 4;
else
	inc = 2;
```
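If you want to convince yourself the alternation never lands on a multiple of 2 or 3, note that starting at 5 it visits exactly the numbers congruent to 1 or 5 mod 6. Here's a quick check; the function name is mine, not from the program above:

```c
typedef unsigned long long bignum;

/* Walk candidates 5, 7, 11, 13, 17, 19, 23, 25, ... with the
   alternating 2/4 increment and confirm every one is coprime to 6,
   i.e. congruent to 1 or 5 mod 6.  Returns 1 on success. */
int incrementsSkipMultiplesOf2And3(bignum top)
{
	bignum candidate = 5;
	int inc = 2;
	while (candidate <= top) {
		bignum r = candidate % 6;
		if (r != 1 && r != 5)
			return 0;	/* landed on a multiple of 2 or 3 */
		candidate += inc;
		inc = (inc == 2) ? 4 : 2;
	}
	return 1;
}
```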
Once again, regression tests on the first million candidates reveal no problem. How much time do we save? About three tenths of a second:

```
[slitt@mydesk primes]$ time ./jj 10000000 > /dev/null
16.19user 0.05system 0:16.43elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+5821minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

Clearly we've gone as far as we can in optimizing this obvious algorithm. For further significant time reductions, we must find a better algorithm.

# The Array Method (Sieve of Eratosthenes)

##### By Steve Litt
The trouble with the obvious method is that it's a heavy-duty double nested loop. For each of millions of candidates, it must test up to thousands of possible divisors.

So I invented a better method. Of course, a guy named Eratosthenes invented it 2244 years ago, but what the heck.

What if we turned the process on its ear and, for each possible divisor, just marked out its every multiple? We could then go through and conclude that the unmarked numbers are prime.

In other words, to find all primes under 1,000,000, we create an array of char 1,000,001 elements long, filled with ones. For each found prime, we then run a loop that changes to 0 every element whose subscript is a multiple of that found prime. Finally, we scan the entire array, outputting the numeric value of each subscript whose element still contains a 1.

On its surface this might not be any faster. We still have a double nested loop. This time the outer loop picks the next trial divisor, and each inner loop marks the next multiple of that divisor. The good news is that we've at least gotten rid of the expensive modulus (%) test. The only way to see whether it's really better is to try it and see what happens.

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long bignum;

void printPrime(bignum bn)
{
	printf("%llu\n", bn);
}

void findPrimes(bignum topCandidate)
{
	char *array = malloc(sizeof(unsigned char) * (topCandidate + 1));
	assert(array != NULL);

	/* SET ALL BUT 0 AND 1 TO PRIME STATUS */
	bignum ss;	/* bignum, not int, so huge ranges don't overflow */
	for(ss = 0; ss <= topCandidate; ss++)
		*(array + ss) = 1;
	array[0] = 0;
	array[1] = 0;

	/* MARK ALL THE NON-PRIMES */
	bignum thisFactor = 2;
	bignum lastSquare = 0;
	bignum thisSquare = 0;
	while(thisFactor * thisFactor <= topCandidate) {
		/* MARK THE MULTIPLES OF THIS FACTOR */
		bignum mark = thisFactor + thisFactor;
		while(mark <= topCandidate) {
			*(array + mark) = 0;
			mark += thisFactor;
		}

		/* PRINT THE PROVEN PRIMES SO FAR */
		thisSquare = thisFactor * thisFactor;
		for(; lastSquare < thisSquare; lastSquare++) {
			if(*(array + lastSquare))
				printPrime(lastSquare);
		}

		/* SET thisFactor TO NEXT PRIME */
		thisFactor++;
		while(*(array + thisFactor) == 0)
			thisFactor++;
		assert(thisFactor <= topCandidate);
	}

	/* PRINT THE REMAINING PRIMES */
	for(; lastSquare <= topCandidate; lastSquare++) {
		if(*(array + lastSquare))
			printPrime(lastSquare);
	}
	free(array);
}

int main(int argc, char *argv[])
{
	bignum topCandidate = 1000;
	if(argc > 1)
		topCandidate = atoll(argv[1]);
	findPrimes(topCandidate);
	return 0;
}
```

The preceding is a fairly straightforward implementation of the algorithm discussed at the beginning of this article. You might wonder why I print some of the primes within the loop, instead of waiting until the entire array is marked and then printing it.

The answer is user interface. When you get up to ranges like 1 to 100,000,000, you don't know whether the program is running or whether it's hung unless you see some output. The printing within the loop lets you see progress.

Even with the print in the middle, at high ranges you'll wait quite a while to see output, because the first factor takes quite a while to mark all its multiples up to, let's say, a billion. But even with the 15 seconds it takes to print the first couple of primes, you can be glad you didn't have to wait several minutes, the way you would have if the entire prime-finding process had needed to complete before printing began.

Another comment about this algorithm is that when processing really high ranges, the lower primes are output very slowly, and in small groups. As you move more toward the top of the assigned range, primes are output much faster. This is in distinct contrast to the modulus based algorithm, where the lower primes print the fastest.

You're probably wondering whether we saved any time using the array algorithm. Here's a printout of the time taken to pipe all primes from 2 to 10,000,000 to /dev/null:

```
[slitt@mydesk primes]$ time ./jj 10000000 > /dev/null
2.25user 0.04system 0:02.30elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+5625minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

HELLO! We've just printed the primes up to 10,000,000 in 2.30 seconds. If you remember, our best time with any of the obvious modulus based algorithms was 16.4 seconds, so this algorithm represents a speed improvement factor of 7.13 from our former best. Remembering our initial brute force attempt was 1:23.8 (83.8 seconds), this algorithm represents an improvement factor over that first attempt of 36.43.

Don't start celebrating yet, because this algorithm has a major problem -- it's not scalable. Given enough time, the modulus based algorithm could run forever, producing ever increasing primes. The array algorithm runs out of RAM as the array size approaches the available semiconductor memory. Once the array size becomes so big that swap memory must be used, this lightning fast program becomes snail slow.

There are three ways we can deal with this:
1. Improve the information density of the array, scaling by the factor of density increase, but leaving the program unscalable after that. This will be demonstrated in this document.
2. Make the program truly scalable by exchanging arrays. If done right, the limiting factor would be disk size, because the permanent record of primes would be kept on disk. This will be demonstrated in this document.
3. Make the program ultimately scalable by having it run on multiple machines with multiple RAM and disks. If done right, machines could be added and removed while the program is running. If done right, the limiting factor would be the number of computers, which could be added in real time. This method will be discussed in principle, but not demonstrated, in this document.

# The Bit Array Improvement

##### By Steve Litt
When it comes to memory usage, the quickest and most obvious improvement comes with the realization that each candidate number either is or is not prime, and that's all you need to know. It's binary. That means each candidate number could be represented by a single bit, instead of the 8-bit unsigned char used in the preceding article. Theoretically, this would allow an eightfold increase in the range to be looked at.

To accomplish this we need to write put and get routines to map info into single bits of an array whose byte count is now topCandidate/8. The bit arithmetic inside those put and get routines will doubtless slow the algorithm. The question is, how much? The answer is simple enough -- it depends on how well you write them.

We'll start by creating a struct to contain a bit array and everything needed to handle it.
```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long bignum;

typedef struct {
	unsigned int *p;	/* pointer to array */
	int bitsPerByte;	/* 8 on normal systems */
	int bytesPerInt;	/* sizeof(unsigned int) */
	int bitsPerInt;		/* for bit arithmetic */
	bignum bitsInArray;	/* how many bits in array */
	bignum intsInArray;	/* how many uints to give necessary bits */
} BITARRAY;

void freeBitArray(BITARRAY *ba)
{
	free(ba->p);
	free(ba);
}

BITARRAY *createBitArray(bignum bits)
{
	BITARRAY *ba = malloc(sizeof(BITARRAY));
	assert(ba != NULL);
	ba->bitsPerByte = 8;
	ba->bytesPerInt = sizeof(unsigned int);
	ba->bitsPerInt = ba->bitsPerByte * ba->bytesPerInt;
	ba->bitsInArray = bits;
	ba->intsInArray = bits/ba->bitsPerInt + 1;
	ba->p = malloc(ba->intsInArray * ba->bytesPerInt);
	assert(ba->p != NULL);
	return ba;
}

void setBit(BITARRAY *ba, bignum bitSS)
{
	unsigned int *pInt = ba->p + (bitSS/ba->bitsPerInt);
	unsigned int remainder = (bitSS % ba->bitsPerInt);
	*pInt |= (1u << remainder);
}

void clearBit(BITARRAY *ba, bignum bitSS)
{
	unsigned int *pInt = ba->p + (bitSS/ba->bitsPerInt);
	unsigned int remainder = (bitSS % ba->bitsPerInt);
	unsigned int mask = 1u << remainder;
	mask = ~mask;
	*pInt &= mask;
}

int getBit(BITARRAY *ba, bignum bitSS)
{
	unsigned int *pInt = ba->p + (bitSS/ba->bitsPerInt);
	unsigned int remainder = (bitSS % ba->bitsPerInt);
	unsigned int ret = *pInt;
	ret &= (1u << remainder);
	return(ret != 0);
}

void clearAll(BITARRAY *ba)
{
	bignum intSS;
	for(intSS = 0; intSS < ba->intsInArray; intSS++) {
		*(ba->p + intSS) = 0;
	}
}

void setAll(BITARRAY *ba)
{
	bignum intSS;
	for(intSS = 0; intSS < ba->intsInArray; intSS++) {
		*(ba->p + intSS) = ~0u;
	}
}

void printPrime(bignum bn)
{
	printf("%llu\n", bn);
}

void findPrimes(bignum topCandidate)
{
	BITARRAY *ba = createBitArray(topCandidate);
	assert(ba != NULL);

	/* SET ALL BUT 0 AND 1 TO PRIME STATUS */
	setAll(ba);
	clearBit(ba, 0);
	clearBit(ba, 1);

	/* MARK ALL THE NON-PRIMES */
	bignum thisFactor = 2;
	bignum lastSquare = 0;
	bignum thisSquare = 0;
	while(thisFactor * thisFactor <= topCandidate) {
		/* MARK THE MULTIPLES OF THIS FACTOR */
		bignum mark = thisFactor + thisFactor;
		while(mark <= topCandidate) {
			clearBit(ba, mark);
			mark += thisFactor;
		}

		/* PRINT THE PROVEN PRIMES SO FAR */
		thisSquare = thisFactor * thisFactor;
		for(; lastSquare < thisSquare; lastSquare++) {
			if(getBit(ba, lastSquare))
				printPrime(lastSquare);
		}

		/* SET thisFactor TO NEXT PRIME */
		thisFactor++;
		while(getBit(ba, thisFactor) == 0)
			thisFactor++;
		assert(thisFactor <= topCandidate);
	}

	/* PRINT THE REMAINING PRIMES */
	for(; lastSquare <= topCandidate; lastSquare++) {
		if(getBit(ba, lastSquare))
			printPrime(lastSquare);
	}
	freeBitArray(ba);
}

void test()
{
	int ss;
	BITARRAY *ba = createBitArray(77);
	clearAll(ba);
	/* setAll(ba); */
	setBit(ba, 0);
	setBit(ba, 64);
	setBit(ba, 10);
	clearBit(ba, 10);
	clearBit(ba, 0);
	setBit(ba, 64);
	setBit(ba, 10);
	setBit(ba, 0);
	for(ss = 0; ss < 78; ss++) {
		if(getBit(ba, ss) != 0) {
			printf("%d", ss);
			printf("=%d ON!!!", (getBit(ba, ss)));
			printf("\n");
		}
	}
	printf("First int is %u.\n", *(ba->p));
	freeBitArray(ba);
}

int main(int argc, char *argv[])
{
	bignum topCandidate = 1000;
	if(argc > 1)
		topCandidate = atoll(argv[1]);
	/* test(); */
	findPrimes(topCandidate);
	return 0;
}
```

```
[slitt@mydesk primes]$ time ./jj 10000000 > /dev/null
4.24user 0.03system 0:04.49elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+3624minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

Ouch! That 4.49 seconds represents an almost doubling of the runtime due to making this bitwise. So we decreased the memory footprint by a factor of approximately 8, but we halved the performance. How can we get back some of that performance?

Perhaps much of the performance hit is due to function calls and function returns. If that were true, we could create macros msetBit(), mgetBit() and mclearBit() to substitute for their functional equivalents, as follows:

```c
#define msetBit(ba, bitSS) \
	{ \
	*(ba->p + (bitSS/ba->bitsPerInt)) |= \
		(1 << (bitSS % ba->bitsPerInt)); \
	}

#define mclearBit(ba, bitSS) \
	{ \
	*(ba->p + (bitSS/ba->bitsPerInt)) &= \
		~(1 << (bitSS % ba->bitsPerInt)); \
	}

#define mgetBit(ba, bitSS) \
	( \
		( \
		0 != \
			( \
			*(ba->p + (bitSS/ba->bitsPerInt)) & \
			(1 << (bitSS % ba->bitsPerInt)) \
			) \
		) \
	)
```

The preceding macros would be inline code without function calls and returns. Of course, such macros are difficult to write and debug, and have no typechecking, but let's see what happens after substituting these macros for the equivalent functions:

```
[slitt@mydesk primes]$ time ./jj 10000000 > /dev/null
3.92user 0.03system 0:04.15elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+3678minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

So the macros knock the time down from 4.49 to 4.15 seconds -- less than a 10% reduction. In my opinion, that's not enough benefit to give up the robustness and typechecking of functions.

Can this time be trimmed? Perhaps. I have a feeling the divide and modulus operations in all three functions are expensive, and I know just the way to eliminate them if we let in just a smidgen of hardware dependence. The key is that modern machines have natural integer sizes that are a power of 2. That means we can divide by shifting, and we can modulus by anding. In order to limit calculations during prime testing, we add two elements to the BITARRAY structure:
```c
int bpiShift;	/* 8 bit words = 3, 16 = 4, etc. */
int bpiMask;	/* bitsPerInt - 1, for modulus */
```
In the case of a 32 bit integer, you can divide by bits per int by shifting right 5 places, and you can modulus by bits per int by anding with 31. Here is the code:
```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long bignum;

typedef struct {
	unsigned int *p;	/* pointer to array */
	int bitsPerByte;	/* 8 on normal systems */
	int bytesPerInt;	/* sizeof(unsigned int) */
	int bitsPerInt;		/* for bit arithmetic */
	int bpiShift;		/* 8 bit words = 3, 16 = 4, etc. */
	int bpiMask;		/* bitsPerInt - 1, for modulus */
	bignum bitsInArray;	/* how many bits in array */
	bignum intsInArray;	/* how many uints to give necessary bits */
} BITARRAY;

void freeBitArray(BITARRAY *ba)
{
	free(ba->p);
	free(ba);
}

BITARRAY *createBitArray(bignum bits)
{
	BITARRAY *ba = malloc(sizeof(BITARRAY));
	assert(ba != NULL);
	ba->bitsPerByte = 8;
	ba->bytesPerInt = sizeof(unsigned int);
	ba->bitsPerInt = ba->bitsPerByte * ba->bytesPerInt;
	switch (ba->bitsPerInt) {
		case 8: ba->bpiShift = 3; break;
		case 16: ba->bpiShift = 4; break;
		case 32: ba->bpiShift = 5; break;
		case 64: ba->bpiShift = 6; break;
		case 128: ba->bpiShift = 7; break;
		case 256: ba->bpiShift = 8; break;
		case 512: ba->bpiShift = 9; break;
		default:
			fprintf(stderr, "ABORTING: Non-standard bits per int\n");
			exit(1);
	}
	ba->bpiMask = ba->bitsPerInt - 1;
	ba->bitsInArray = bits;
	ba->intsInArray = bits/ba->bitsPerInt + 1;
	ba->p = malloc(ba->intsInArray * ba->bytesPerInt);
	assert(ba->p != NULL);
	return ba;
}

void setBit(BITARRAY *ba, bignum bitSS)
{
	unsigned int *pInt = ba->p + (bitSS >> ba->bpiShift);
	unsigned int remainder = (bitSS & ba->bpiMask);
	*pInt |= (1u << remainder);
}

void clearBit(BITARRAY *ba, bignum bitSS)
{
	unsigned int *pInt = ba->p + (bitSS >> ba->bpiShift);
	unsigned int remainder = (bitSS & ba->bpiMask);
	unsigned int mask = 1u << remainder;
	mask = ~mask;
	*pInt &= mask;
}

int getBit(BITARRAY *ba, bignum bitSS)
{
	unsigned int *pInt = ba->p + (bitSS >> ba->bpiShift);
	unsigned int remainder = (bitSS & ba->bpiMask);
	unsigned int ret = *pInt;
	ret &= (1u << remainder);
	return(ret != 0);
}

void clearAll(BITARRAY *ba)
{
	bignum intSS;
	for(intSS = 0; intSS < ba->intsInArray; intSS++) {
		*(ba->p + intSS) = 0;
	}
}

void setAll(BITARRAY *ba)
{
	bignum intSS;
	for(intSS = 0; intSS < ba->intsInArray; intSS++) {
		*(ba->p + intSS) = ~0u;
	}
}

void printPrime(bignum bn)
{
	printf("%llu\n", bn);
}

void findPrimes(bignum topCandidate)
{
	BITARRAY *ba = createBitArray(topCandidate);
	assert(ba != NULL);

	/* SET ALL BUT 0 AND 1 TO PRIME STATUS */
	setAll(ba);
	clearBit(ba, 0);
	clearBit(ba, 1);

	/* MARK ALL THE NON-PRIMES */
	bignum thisFactor = 2;
	bignum lastSquare = 0;
	bignum thisSquare = 0;
	while(thisFactor * thisFactor <= topCandidate) {
		/* MARK THE MULTIPLES OF THIS FACTOR */
		bignum mark = thisFactor + thisFactor;
		while(mark <= topCandidate) {
			clearBit(ba, mark);
			mark += thisFactor;
		}

		/* PRINT THE PROVEN PRIMES SO FAR */
		thisSquare = thisFactor * thisFactor;
		for(; lastSquare < thisSquare; lastSquare++) {
			if(getBit(ba, lastSquare))
				printPrime(lastSquare);
		}

		/* SET thisFactor TO NEXT PRIME */
		thisFactor++;
		while(getBit(ba, thisFactor) == 0)
			thisFactor++;
		assert(thisFactor <= topCandidate);
	}

	/* PRINT THE REMAINING PRIMES */
	for(; lastSquare <= topCandidate; lastSquare++) {
		if(getBit(ba, lastSquare))
			printPrime(lastSquare);
	}
	freeBitArray(ba);
}

int main(int argc, char *argv[])
{
	bignum topCandidate = 1000;
	if(argc > 1)
		topCandidate = atoll(argv[1]);
	findPrimes(topCandidate);
	return 0;
}
```

But is there a significant speed improvement? Let's see...

```
[slitt@mydesk primes]$ time ./jj 10000000 > /dev/null
2.14user 0.03system 0:02.18elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+3614minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

Oh yeah, that's what I'm talking about! We're right back in the neighborhood of the non-array algorithm. In fact, we're still a little slower, even though the preceding printout shows us being faster. The reason is that this run happened on a new day, after a reboot, with fewer programs running. The no-modulus, no-division bit array version runs around 2.15 seconds, while the byte version runs at 2.07. If you wanted to, you could further improve the time using macros, but I'm not going to go there.

So once we got rid of the division and modulus, the bit array method yields an eightfold increase in the number of candidates that can be processed by a specific machine. It comes at less than a 10% speed cost, which to me is perfectly acceptable. If you need to squeeze out the very most speed, you could write macros using the no-division, no-modulus operations.
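The shift-and-mask identities this speedup rests on are easy to verify in isolation. Here's a minimal check for 32-bit ints; the function name is mine, not from the program above:

```c
typedef unsigned long long bignum;

/* For a 32-bit word, dividing a bit subscript by 32 equals shifting
   it right 5 places, and taking it mod 32 equals ANDing with 31 --
   the identities behind bpiShift and bpiMask.  Returns 1 when they
   hold for every subscript up to top. */
int shiftMaskMatchesDivMod(bignum top)
{
	bignum bitSS;
	for (bitSS = 0; bitSS <= top; bitSS++) {
		if ((bitSS / 32) != (bitSS >> 5))
			return 0;
		if ((bitSS % 32) != (bitSS & 31))
			return 0;
	}
	return 1;
}
```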

This is great news. We can test 100,000,000 numbers for primeness in less than a minute. But, somewhere around testing 10,000,000,000 (ten billion) numbers, we run out of memory on my 1.5GB machine. We're still limited by the semiconductor RAM on your machine. To achieve scalability, we need to page the bit arrays in and out of RAM, while storing the derived prime sets on disk, and paging those in and out one at a time. Obviously this comes at a speed cost, but given enough time it raises the upper limit to the disk space of the computer. Read on...

# Design Musings: Scalability Through Paging

##### By Steve Litt
The preceding article demonstrated a fast algorithm that, on a 1.5GB machine, can efficiently test somewhere around ten billion numbers. Beyond that, the program goes into swap memory and basically takes forever. This happens because the bit array grows past the size of the semiconductor memory.

We can get around this limit by using a smaller array to test a specific range, and then when done there, page in another array. Not only that, but we need to page in and out trial divisors. In a paging situation, we cannot count on the trial divisors being in the same range as the numbers being tested. In fact, except for the initial range, the range being tested probably WILL be different from the range of trial divisors. This means that not only will we encounter paging overhead, but our ranges must be smaller to accommodate two of them. This is the price we pay for scalability.

## Data forms

Obviously, the candidate range, the numbers to be tested, must be in a bitarray. This bitarray stays in memory until all possible factors have marked their territory, at which time the primes revealed are recorded somewhere, and the bit array is discarded.

In the previous articles, the prime trial factors were taken from the same bit array as the numbers being tested. This will often not be the case in a paging algorithm. For instance, let's say we're testing prime candidates in pages of one million, and we're now testing numbers between one million and two million. In that case we would need to try factors from 2 through 1414 (the square root of two million). Those numbers are definitely not in the one to two million range.
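One way to sketch that page-at-a-time idea: sieve a range from lo to hi using a pre-computed list of base primes reaching at least the square root of hi. This is my illustration of the design being discussed, not code from any later section, and the names are hypothetical:

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

typedef unsigned long long bignum;

/* Sieve one page [lo, hi] given every base prime <= sqrt(hi).
   On return, isPrime[i] is nonzero exactly when lo + i is prime.
   Caller frees the returned array.  Hypothetical sketch. */
unsigned char *sievePage(bignum lo, bignum hi,
                         const bignum *basePrimes, int nBasePrimes)
{
	bignum len = hi - lo + 1;
	unsigned char *isPrime = malloc(len);
	assert(isPrime != NULL);
	memset(isPrime, 1, len);

	int i;
	for (i = 0; i < nBasePrimes; i++) {
		bignum p = basePrimes[i];
		if (p * p > hi)
			break;
		/* first multiple of p at or above lo, but never below p*p,
		   so p itself survives when it falls inside the page */
		bignum start = (lo + p - 1) / p * p;
		if (start < p * p)
			start = p * p;
		bignum mark;
		for (mark = start; mark <= hi; mark += p)
			isPrime[mark - lo] = 0;
	}
	if (lo == 0) {		/* 0 and 1 are not prime */
		isPrime[0] = 0;
		if (len > 1)
			isPrime[1] = 0;
	}
	return isPrime;
}
```

Notice that the base primes come in from outside the page, which is exactly why the text argues the first page deserves its own program.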

In fact, for all ranges below one trillion, we'd use only primes in the 0 to one million range. And except for the very first page, the factors will always be on a different page than the candidates. This makes a very compelling case for having two different programs -- one for the first page, and one for every other page. Trying to make one program fit both situations would introduce if statements that would slow all processing.

As mentioned, the candidate numbers must be in a bitarray. The factors could be stored either in a different bitarray, or in an array of prime numbers. The former has space advantages, the latter speed advantages because one needn't navigate non-primes to find the next prime. To achieve the speed, how much space advantage are you giving up?

My observation in the 100,000,000 neighborhood is that about 1/20 of the numbers are prime, and that as the range gets higher, the frequency of primes decreases slightly. It takes a 64 bit number to store numbers up to 18 million trillion (1.84467440737096e+19). That's probably sufficient. So with integers, 64 bits can store one number, whereas with a bit array 64 bits could store 64 numbers, 1/20 of which are prime. So storing factors in integers requires 3 times the storage (or 1/3 the page sizes). I'm guessing (and this is pure guess, gentle reader), that based on speed and space, an array of primes is the better way to go.

Of course, there are even more exotic ways to store primes. For instance, perhaps a single 64 bit integer could store the first prime in the range, and then an array of 16 bit integers could store each prime's increase over the last. This would bring the space usage down below the level of a bitarray, but it would involve an add operation for each new prime, and perhaps more importantly, who can be sure there aren't two consecutive primes somewhere that are more than 65536 apart? So I'll still go with the array of 64 bit numbers.

As I mentioned, once a candidate bit array is complete, its primes are written out and then it is discarded. I think the primes should be written to disk. That way, if the computer crashes, it can pick up where it left off. Also, if we enhance the algorithm so multiple machines can process primes, the prime lists will most likely exist on several computers and must be transferred via NFS or remote processing or some other mechanism. Files are easy to handle with NFS.

Files are a huge performance drag. They should be as small as possible, and should be read and written with large buffers to maximize throughput. I'd recommend that the prime arrays be written bit for bit to their respective files. Such files are binary in nature and cannot be handled with an editor, but they're much faster to read and write than text files. A separate program could easily be created to convert the binary representations into text. Certainly this would need to be done in order to port the results to machines whose integer representations are different (endian, etc). As a matter of fact, the endian problem would absolutely crop up if the prime finding task was split among programs on several different computers.

Even using text representations of the numbers, if you use a hundred million as the candidate range, the size of the file created will be in the 50MB range. Primefinding the first hundred million numbers takes about 24 seconds on my machine, and copying the resulting text file takes about 1 second. So at high, but easily achievable ranges like a hundred million, outputting text representations of the primes is doable. My timings show that even at ranges like ten million, which is easily done on even the most anemic machines, the file writing is a small part of the overall processing time.

But what about when you need to read one of these behemoths? In the case of a 100 million candidate range, that's nearly 6 million primes you need to convert from text to integers. Luckily, at 100 million candidate ranges, you needn't do that until you begin testing candidates above ten thousand trillion. By the time you hit that point, 4.67 seconds to decode those primes is the least of your concerns. How did I get the 4.67 seconds? With this quick, dirty, and incredibly inefficient program:

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int main(int argc, char *argv[])
{
    char buf[100], *pbuf;
    unsigned long long prime;
    FILE *fin, *fout;
    fin = fopen("./junk.jnk", "r");
    assert(fin != NULL);
    fout = fopen("./junkout.jnk", "w");
    assert(fout != NULL);
    while ((pbuf = fgets(buf, 90, fin)) != NULL) {
        prime = atoll(buf);
        fprintf(fout, "%llu\n", prime);
    }
    fclose(fin);
    fclose(fout);
    return 0;
}
```

The preceding program reads, converts to int, converts back to text and writes. It took about 24 seconds to write out the primes in the first 100,000,000 numbers, and it took less than 5 seconds to read, convert, convert and write them back. When I commented out the fprintf() to more accurately reflect the task of simply reading and converting, the time dropped to 2.51 seconds -- about 1/10 of the time required to mark the 100,000,000 candidates that created the ascii list of primes. If this 2.5 seconds ever becomes an issue, a faster reading or conversion algorithm can probably be written.

There's one more piece of data -- the run log. It should list each range, who it was assigned to (for clustering), the filename and host of the prime output, and whether it's complete.

However, the list of trial factor prime numbers just might be better served as an array of integers.

# Implementation: Scalability Through Paging

##### By Steve Litt
So I wrote the code. It's ugly, with little exception handling. It segfaults at the drop of a hat, especially if you don't input the right command line arguments. But it works. Here's the code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long bignum;

typedef struct {
    unsigned int *p;    /* pointer to array */
    int bitsPerByte;    /* 8 on normal systems */
    int bytesPerInt;    /* sizeof(unsigned int) */
    int bitsPerInt;     /* for bit arithmetic */
    int bpiShift;       /* 8 bit words = 3, 16=4, etc */
    int bpiMask;        /* bitsPerInt - 1, for modulus */
    bignum bitsInArray; /* how many bits in array */
    bignum intsInArray; /* how many uints to give necessary bits */
    bignum pageno;      /* page 0 is 0-a million or whatever */
    bignum offset;      /* does ba start at 0, a million, 2 mil? */
} BITARRAY;

void freeBitArray(BITARRAY *ba)
{
    free(ba->p);
    free(ba);
}

BITARRAY *createBitArray(bignum bits, bignum pageno)
{
    BITARRAY *ba = malloc(sizeof(BITARRAY));
    assert(ba != NULL);
    ba->bitsPerByte = 8;
    ba->bytesPerInt = sizeof(unsigned int);
    ba->bitsPerInt = ba->bitsPerByte * ba->bytesPerInt;
    switch (ba->bitsPerInt) {
    case 8:   ba->bpiShift = 3; break;
    case 16:  ba->bpiShift = 4; break;
    case 32:  ba->bpiShift = 5; break;
    case 64:  ba->bpiShift = 6; break;
    case 128: ba->bpiShift = 7; break;
    case 256: ba->bpiShift = 8; break;
    case 512: ba->bpiShift = 9; break;
    default:
        fprintf(stderr, "ABORTING: Non-standard bits per int\n");
        exit(1);
    }
    ba->bpiMask = ba->bitsPerInt - 1;
    ba->bitsInArray = bits;
    ba->intsInArray = bits / ba->bitsPerInt + 1;
    ba->p = malloc(ba->intsInArray * ba->bytesPerInt);
    assert(ba->p != NULL);
    ba->pageno = pageno;
    ba->offset = ba->pageno * ba->bitsInArray;
    return ba;
}

void setBit(BITARRAY *ba, bignum bitSS)
{
    unsigned int *pInt = ba->p + (bitSS >> ba->bpiShift);
    unsigned int remainder = (bitSS & ba->bpiMask);
    *pInt |= (1 << remainder);
}

void clearBit(BITARRAY *ba, bignum bitSS)
{
    unsigned int *pInt = ba->p + (bitSS >> ba->bpiShift);
    unsigned int remainder = (bitSS & ba->bpiMask);
    unsigned int mask = 1 << remainder;
    mask = ~mask;
    *pInt &= mask;
}

int getBit(BITARRAY *ba, bignum bitSS)
{
    unsigned int *pInt = ba->p + (bitSS >> ba->bpiShift);
    unsigned int remainder = (bitSS & ba->bpiMask);
    unsigned int ret = *pInt;
    ret &= (1 << remainder);
    return (ret != 0);
}

void clearAll(BITARRAY *ba)
{
    bignum intSS;
    for (intSS = 0; intSS < ba->intsInArray; intSS++)
        *(ba->p + intSS) = 0;
}

void setAll(BITARRAY *ba)
{
    bignum intSS;
    for (intSS = 0; intSS < ba->intsInArray; intSS++)
        *(ba->p + intSS) = ~0;
}

void printPrime(bignum bn)
{
    printf("%llu\n", bn);
}

void fprintPrime(FILE *fp, bignum bn)
{
    fprintf(fp, "%llu\n", bn);
}

bignum *makePrimeArray(const char *fname)
{
    FILE *fin;
    char buf[1000], *pbuf;
    bignum *factors;
    fin = fopen(fname, "r");
    assert(fin != NULL);

    /* FIND NUMBER OF LINES */
    bignum lineCount = 0;
    while (fgets(buf, 90, fin) != NULL)
        lineCount++;

    /* ALLOCATE ARRAY */
    factors = malloc(sizeof(bignum) * (lineCount + 3));
    assert(factors != NULL);

    /* READ THE PRIMES INTO factors */
    rewind(fin);
    lineCount = 0;
    while ((pbuf = fgets(buf, 90, fin)) != NULL) {
        *(factors + lineCount) = atoll(pbuf);
        lineCount++;
    }
    fclose(fin);
    *(factors + lineCount) = 0;   /* 0 terminates the factor list */
    return factors;
}

void findPrimesInOnePage(bignum topCandidate, bignum pageno, bignum *factors, BITARRAY *ba)
{
    bignum ss;
    char buf[100];
    FILE *fout;

    /* SET ba ELEMENTS */
    ba->pageno = pageno;
    ba->offset = ba->pageno * ba->bitsInArray;

    /* SET ALL TO PRIME STATUS */
    setAll(ba);

    /* MARK ALL THE NON-PRIMES */
    bignum factorSS = 0;
    bignum thisFactor = *(factors + factorSS);
    while ((thisFactor != 0) && (thisFactor * thisFactor <= topCandidate + ba->offset)) {
        /* MARK THE MULTIPLES OF THIS FACTOR */
        bignum mark = thisFactor + thisFactor;
        if (mark < ba->offset) {
            mark = (ba->offset / thisFactor) * thisFactor;
            if (mark < ba->offset)
                mark += thisFactor;
        }
        mark -= ba->offset;
        while (mark <= topCandidate) {
            clearBit(ba, mark);
            mark += thisFactor;
        }
        /* SET thisFactor TO NEXT PRIME */
        factorSS++;
        thisFactor = *(factors + factorSS);
        assert(thisFactor <= topCandidate + ba->offset);
    }

    /* PRINT ALL THE PRIMES */
    sprintf(buf, "pri%08llu.pri", ba->pageno);
    fout = fopen(buf, "w");
    assert(fout != NULL);
    for (ss = 0; ss <= topCandidate; ss++) {
        if (getBit(ba, ss))
            fprintPrime(fout, ss + ba->offset);
    }
    fclose(fout);
}

void findPrimes(bignum numbersPerPage, bignum startPageno, bignum pagesToProcess)
{
    bignum pageCount;

    /* PUT FACTORS IN AN ARRAY */
    bignum *factors = makePrimeArray("0.pri");

    /* CREATE BITARRAY */
    BITARRAY *ba = createBitArray(numbersPerPage, startPageno);
    assert(ba != NULL);

    /* PROCESS EVERY PAGE */
    for (pageCount = 0; pageCount < pagesToProcess; pageCount++) {
        findPrimesInOnePage(numbersPerPage, pageCount + startPageno, factors, ba);
    }

    /* FREE THE BITARRAY */
    freeBitArray(ba);

    /* FREE THE FACTOR ARRAY */
    free(factors);
}

void test()
{
    bignum *factors = makePrimeArray("0.pri");
    bignum lineCount = 0;
    bignum factor;
    FILE *fout = fopen("junky.jnk", "w");
    assert(fout != NULL);
    for (lineCount = 0; (factor = *(factors + lineCount)) != 0; lineCount++)
        fprintPrime(fout, factor);
    fprintPrime(fout, 0);
    fclose(fout);
}

int main(int argc, char *argv[])
{
    bignum numbersPerPage = atoll(argv[1]);
    bignum startPageno = atoll(argv[2]);
    bignum pagesToProcess = atoll(argv[3]);
    printf("numbersPerPage = %llu\n", numbersPerPage);
    printf("startPageno = %llu\n", startPageno);
    printf("pagesToProcess = %llu\n", pagesToProcess);
    /* test(); */
    findPrimes(numbersPerPage, startPageno, pagesToProcess);
    return 0;
}
```

In the preceding, note that I've added two elements to the BITARRAY struct:
• bignum pageno;
• bignum offset;
pageno refers to the page being done. Assuming a page size (ba->bitsInArray) of 1000, page 0 covers 0-999, page 1 covers 1000-1999, page 2 covers 2000-2999, and so on. The preceding program is not set up to handle page 0. Page 0 is produced by the non-paging bit array program that uses shift and AND instead of divide and modulus. The paging program in this article uses the output of the page 0 producing program as its input -- in other words, as its prime factors.

Basically, this paging program reads in the prime factors that the first program deposited on a disk, and then for each candidate page, marks multiples on a bit array and when done writes the remaining primes to a file determined by the page number. This program's command line arguments are as follows:

• arg0 -- the program name.
• arg1 -- page size. A page size of 1000 means a page would test everything from 4000 through 4999, or 7000 through 7999. If the page size were 10,000,000, a page would test everything from 30,000,000 through 39,999,999, or everything between 80,000,000 and 89,999,999.
• arg2 -- starting page number. Page 0 starts with 0. Page 1 starts with the page size. Page 2 starts with 2 x pagesize. Page 3 starts with 3 x pagesize.
• arg3 -- number of pages to process. It's nice to have a stopping point for the program, even though if the program aborts, all files produced except the last one should be accurate. Arg3 tells how many pages to process, including the first page processed (arg2). So if arg1 were 1000, arg2 were 3, and arg3 were 4, it would process pages 3 through 6, or numbers from 3000 through 6999.

1. Is it accurate?
2. How fast or slow is it?
3. How scalable is it?
4. How robust is it?
5. What can we learn from it?

## Is it accurate?

I believe this program to be dead on accurate. First I ran the non-paging program to produce all primes in the range 2 to 100,000,000 (a hundred million). I put the resulting file away as firstresults.txt for comparison.

Next I ran the non-paging program to produce all primes 2 to 1,000,000 (one million), saving the results to 0.pri, which is the prime factor input file used by the paging program.

Next I ran the paging program using a page size of 1,000,000 (one million), a start page of 1, and number of pages to process being 99. That produced files pri00000001.pri through pri00000099.pri. Then I ran the following command:
```
cat 0.pri pri*.pri > junk.jnk
diff firstresults.txt junk.jnk
```
The result was gratifying -- the files were identical. So I had regression tested the paging program against the non-paging, and came up clean, at least for all primes under 100,000,000. How accurate is it in the billions and trillions? I just can't say, but at this point am assuming it's as accurate as below 100,000,000. I should mention that when I ran it out to 40 billion, the results I saw at the end of the last file produced looked reasonable.

## How fast or slow is it?

This program does a lot of work. It must repeatedly clear out the BITARRAY, then mark it with multiples of every prime below and including its high end's square root. Each succeeding page must try higher prime factors. Decreased performance, even running at half or a third of the original speed, would be an acceptable price for the added scalability.

To test, I first ran the fastest non-paging program on numbers 0 through 99,999,999, and then ran it on the first 1,000,000 numbers followed by the paging program on numbers 1,000,000 through 99,999,999.

Here are the results:
```
[slitt@mydesk primes]$ time ./array_bit_shift.exe 100000000 > /dev/null
21.92user 0.04system 0:22.95elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+3169minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

Almost 23 seconds. Then I ran the non paging on the first million to create an input file for the paging program, and ran the paging program on 99 pages of size one million:

```
[slitt@mydesk primes]$ time ./array_bit_shift.exe 1000000 > 0.pri
0.11user 0.00system 0:00.13elapsed 85%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+144minor)pagefaults 0swaps
[slitt@mydesk primes]$ time ./array_page_ba.exe 1000000 1 99
numbersPerPage = 1000000
startPageno = 1
pagesToProcess = 99
11.51user 0.22system 0:11.79elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+408minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

The elapsed time was 11.79 + 0.13 seconds -- 11.92 seconds, or a little more than half the elapsed time as the non-paging version. The paging version was faster!

Now multiply everything by 10 -- up to a billion:

```
[slitt@mydesk primes]$ time ./array_bit_shift.exe 1000000000 > /dev/null
246.64user 0.36system 4:09.60elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+30635minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

```
[slitt@mydesk primes]$ time ./array_bit_shift.exe 10000000 > 0.pri
1.84user 0.03system 0:01.89elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+420minor)pagefaults 0swaps
[slitt@mydesk primes]$ time ./array_page_ba.exe 10000000 1 99
numbersPerPage = 10000000
startPageno = 1
pagesToProcess = 99
222.72user 2.30system 3:45.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1829minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

Total elapsed time here is 3:45.89 + 1.89, or 3:47.78. Still less runtime than the non-paging version, but now we're getting closer. Next, let's try doing a billion, but a million at a time:

```
[slitt@mydesk primes]$ time ./array_bit_shift.exe 1000000 > 0.pri
0.11user 0.00system 0:00.14elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+144minor)pagefaults 0swaps
[slitt@mydesk primes]$ time ./array_page_ba.exe 1000000 1 999
numbersPerPage = 1000000
startPageno = 1
pagesToProcess = 999
118.66user 2.43system 2:01.88elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1308minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

2:01.88 + 0.14 is 2:02.02 -- significantly less time using pages of a million rather than ten million. What happens if we go to a billion using pages of a hundred thousand?

```
[slitt@mydesk primes]$ time ./array_bit_shift.exe 100000 > 0.pri
0.01user 0.00system 0:00.03elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+117minor)pagefaults 0swaps
[slitt@mydesk primes]$ time ./array_page_ba.exe 100000 1 9999
numbersPerPage = 100000
startPageno = 1
pagesToProcess = 9999
115.22user 7.51system 2:04.90elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+10145minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

2:04.90 + 0.03 is 2:04.93, or 124.93 seconds -- the time increased slightly. I'm not surprised, given that the process produced 10,000 files. By the time there are a couple thousand files in a directory, each file creation requires a sequential search through that directory. To test this hypothesis, let's see how fast it produces 100,000,000 instead of a billion, once again using pages of 100,000. This way we'll create only 1000 files. Let's see whether it's proportional...

```
[slitt@mydesk primes]$ time ./array_bit_shift.exe 100000 > 0.pri
0.01user 0.00system 0:00.01elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+117minor)pagefaults 0swaps
[slitt@mydesk primes]$ time ./array_page_ba.exe 100000 1 999
numbersPerPage = 100000
startPageno = 1
pagesToProcess = 999
11.15user 0.29system 0:11.96elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1145minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

11.96 + 0.01 is 11.97 seconds. There isn't that much difference, proportionally speaking. If you feel the need for speed, page sizes of a million might be ideal.

Regardless, for a given range over ten million, the paging version will be faster than the non-paging version.

One more measurement. I used pages of a hundred million to go all the way up to 40 billion. The 40 billion took 3 hours and 12 minutes...

```
[slitt@mydesk primes]$ time ./jj 100000000 1 400
numbersPerPage = 100000000
startPageno = 1
pagesToProcess = 400
11404.32user 90.31system 3:12:43elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (80major+18036minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

It produced 401 files, containing a total of 1,716,050,469 primes.

```
   5761455 0.pri
   5317482 pri00000001.pri
   5173388 pri00000002.pri
   5084001 pri00000003.pri
   5019541 pri00000004.pri
   4968836 pri00000005.pri
       ...
   4097354 pri00000398.pri
   4096531 pri00000399.pri
   4095036 pri00000400.pri
```

```
1716050469 total
```

1,716,050,469 primes in three hours and twelve minutes. But wait -- there's more. I tried using pages of one million instead of one hundred million, and look what happened:

```
[slitt@mydesk primes]$ time ./array_bit_shift.exe 1000000 > 0.pri
0.11user 0.00system 0:00.13elapsed 82%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+144minor)pagefaults 0swaps
[slitt@mydesk primes]$ time ./array_page_ba.exe 1000000 1 40000
numbersPerPage = 1000000
startPageno = 1
pagesToProcess = 40000
4950.74user 203.87system 1:28:41elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+40309minor)pagefaults 0swaps
[slitt@mydesk primes]$
```
1:28:41, or 88:41, or 5321 seconds -- less than half the time of the hundred million sized pages. This program tested an unbelievable 7,517,383.9 numbers per second for primacy. It actually found 322,505.25 primes per second on average. Considering this occurred on an $800 commodity box running standard Mandrake 10 Linux with X running along with several other programs, I'm impressed.

## How scalable is it?

The Array method is fast, but it also limits the size of the candidate range to somewhere below the number of bytes of RAM. The bit array method helps, but there's still an upper limit somewhere below RAM/8.

The paging of bit arrays dispenses with RAM as a limit. Theoretically, my machine, with 1.5 GB of RAM, could find primes to infinity. But other limits rear their ugly heads.

On my box, the hard disk is the limit. Consider how much disk space was used by my little 40 billion number scan:

```
[slitt@mydesk primes]$ df /volatile
Filesystem                              Size  Used Avail Use% Mounted on
/dev/ide/host0/bus0/target0/lun0/part10  58G   38G   18G  69% /volatile
[slitt@mydesk primes]$
```

We used 38 GB of space to store the primes under 40 billion. As ranges increase, prime frequency decreases, but the text length of the primes increases. The effect is that over time, the size of range files decreases slightly, stabilizing at just under half the candidate range, in bytes. In other words, we used one million as the candidate range, and the last written files jitter around 490,000 bytes. So the biggest range you can investigate is roughly (partitionSize) * 2,000,000,000, where partitionSize is measured in GB. Today's commodity 240GB disks yield a check range around 480,000,000,000 -- not even a trillion.

One could also create a huge RAID array to go past a trillion. However, as the number of files increases, the process of finding files takes longer and longer. Programs to handle these files have increasing trouble. For instance, the ls command stops working somewhere before the first 10,000 files, because the shell expands *.pri into an argument list longer than the kernel allows:

```
[slitt@mydesk primes]$ ls *.pri
bash: /bin/ls: Argument list too long
[slitt@mydesk primes]$
```

There are several ways around this problem. First, the program could be modified to place output files in a tree instead of a single directory. For instance:
`/primes/0/0/0/1/4/2/4/5/pri00014245.pri`
That would speed the program considerably, and also make programs that handle those primes easier and faster.

Another possibility is to rewrite the program so that every so many pages, it stops long enough to cat the files into a larger file and then delete the smaller ones. This would cure the "too many files" problem without requiring increased memory. This could be combined with the tree idea so that even up into the trillions you'd never have too many files in a directory. Unfortunately, long before the trillions, raw disk space would become the limiting factor.

The way to get around the disk space problem is to use a second machine to move the primes off the prime machine and into a single file on the second machine, or better yet to repeatedly write the primes to DVD and erase them. That could theoretically go on forever. Except for a few other nagging problems...

The longer a program runs, the more likely an extended power failure or a hardware failure will take it down. With commodity hardware and without great backup generators, I sure wouldn't count on a program staying up for more than a month. How many primes can we test in a month?

Let's assume (and I think it's unlikely, but assume it for now) that the 7 million candidates per second could be maintained linearly. A day has 60*60*24 seconds, or 86,400 seconds per day. Thirty days contain 2,592,000 seconds. Multiplying that by 7 million candidates per second, we come up with all primes between 2 and 18,144,000,000,000 (a little over 18 trillion).

18 trillion is big, but nothing earth shattering. It's still CompSci 101. Luckily, this algorithm writes separate files, so in the event of a crash one could dispose of the last file and be fairly confident that the second to last file and all before it were OK, and simply restart at the proper page. Thus, with human intervention, one could go on indefinitely (except for one limitation that will be discussed later). But even if you went on a year, you'd only reach about 200 trillion. The bottom line is that a single commodity box is just too slow to help you reach the REALLY big numbers any time soon. To go REALLY big, you need either a supercomputer, or a network of commodity computers and an algorithm that can be spread among them.

And now for the hard limit...

The paging algorithm has a mathematical upper limit: the square of the topmost prime factor in 0.pri. That's because a number bigger than that square could be composite with all its factors above the topmost prime factor, and would therefore be passed as prime. For instance, suppose 0.pri contains all primes between 2 and a million.

0.pri goes from 0 to a million:

```
[slitt@mydesk primes]$ time ./array_bit_shift.exe 1000000 > 0.pri
0.11user 0.00system 0:00.14elapsed 78%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+144minor)pagefaults 0swaps
[slitt@mydesk primes]$ cp 0.pri 0.prm
[slitt@mydesk primes]$ time ./array_page_ba.exe 1000000 1 999
numbersPerPage = 1000000
startPageno = 1
pagesToProcess = 999
118.52user 2.43system 2:01.71elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1308minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

The preceding executed in 121.71 seconds. That's 8,216,251.7 candidates per second -- a respectable speed.

0.pri goes from 0 to a billion:

```
[slitt@mydesk primes]$ time ./array_page_ba.exe 1000000 1 999
numbersPerPage = 1000000
startPageno = 1
pagesToProcess = 999
145.16user 5.94system 2:33.80elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+100466minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

When the prime factor array goes up to the billion range -- in other words, somewhere around 50 million prime factors in the array of 64 bit integers -- things slow down for the same amount of work. Here it took 153.8 seconds, which is 6,501,950.6 candidates per second. Slower, but not a showstopper.

Here's a question. Do you really need prime factors up to a billion? Prime factors up to a billion let you test all numbers up to a million trillion. We've already established that a single commodity machine would be lucky to process 18 trillion numbers per month -- let's say 20 trillion for simplicity. The square root of 20 trillion is 4,472,136 -- let's say 5 million for simplicity. Let's try finding prime numbers to a billion using a list of primes from 2 to 5 million:

0.pri goes from 0 to 5 million:

```
[slitt@mydesk primes]$ time ./array_page_ba.exe 1000000 1 999
numbersPerPage = 1000000
startPageno = 1
pagesToProcess = 999
118.60user 2.53system 2:01.64elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1835minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

So running this process with enough prime factors for a month's worth of processing is essentially the same speed as running with factors only up to a million.

What about the 200 trillion you could reach in a year -- how high would the prime factors need to go? The square root of 200 trillion is 14,142,135.62 -- say 15 million for simplicity. How would the calculations go with factors up to 15 million? See the following:

0.pri goes from 0 to 15 million:

```
[slitt@mydesk primes]$ time ./array_page_ba.exe 1000000 1 999
numbersPerPage = 1000000
startPageno = 1
pagesToProcess = 999
119.63user 2.46system 2:03.14elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+3050minor)pagefaults 0swaps
[slitt@mydesk primes]$
```

It went up less than two seconds -- insignificant. You can do a year's worth of prime finding on a commodity computer with factors below 15 million, and doing so will be fast.

Why was the process so significantly slower using prime factors up to a billion? I suspect they consumed too much memory. Each number is 8 bytes, and there are 50,847,534 primes between 2 and a billion. At 8 bytes per prime, that's 406,780,272 bytes -- 407MB. That's enough memory consumption to make a noticeable difference. If you'd gone up to 3.5 billion, the prime factor array alone would have consumed all the machine's semiconductor memory.

### Scalability Summary

The bit array algorithm with paging is scalable well into the billions, and with some nursing into the low trillions. The first two logjams you'll hit are number of files and raw disk space. The number of files can easily be handled by simple changes to the program. The raw disk space must be handled with human intervention to get the files onto tape or DVD or whatever. If you use a large RAID array, you can store enough files that the human intervention need only happen at reasonable intervals.

Even if you miss a few interventions and the program crashes from lack of disk space, the good news is it can be restarted where it left off.

With the disk space limitation out of the way, we find time to be the next limitation. We all have limited lifetimes, and waiting a year for the computer to find primes to 220 trillion isn't rewarding. There's got to be a better way.

If we were to get past the 220 trillion mark, we'd find ourselves rapidly approaching a roadblock in which the necessary prime factors would consume all RAM. Getting up to numbers far beyond a million trillion would necessitate paging the prime factors in much the same way the program in this article pages candidate arrays. That would work, but at the really high numbers we'd find ourselves swapping factors 10 or 20 or 1000 times for each page of candidates tested, sending performance into the toilet. Making the candidate arrays bigger would help, but the current algorithm is nonlinear in such a way that doubling the size of the candidate array significantly more than doubles the time required. Perhaps the algorithm needs improvement. Or perhaps a series of machines could work on a page, like an assembly line, with each machine containing a successive prime factor list.

## How robust is it?

It's not. You can segfault this program with a single bad command line parameter. I've seen evidence of segfaults without user error. To make this into a robust app you'd need to put some work into it.

That being said, given the nature of this app and the fact that the person running it will probably be a geek, it's robust enough to get the job done, always assuming the operator puts out a little effort.

## What can we learn from it?

We learned quite a bit. We learned that the frequency of prime numbers decreases as the numbers go up, but it decreases at a decreasing rate, leveling off somewhere around an average of 1 out of 25. This makes a lot of sense, because as candidate numbers go up, more and more prime factors come into play, marking more numbers as composite and therefore lessening the frequency of primes.

Contrastingly, as the prime factors go up, so does the distance between them, and therefore so does the distance between their squares, meaning that the rate of decrease decreases.

We've learned that for some reason, the bit array method decreases in speed efficiency with larger bit arrays, and that it does so long before the bit arrays challenge the boundaries of installed RAM. We also learned that for a given bit array size, the paging method is much more linear than increasing bit arrays, so that when we go up to the hundreds of millions, the low billions and beyond, paging with bit arrays holding a million candidates is the faster way to go.

We've learned that it's possible to page prime factors, so that mathematically speaking we can find primes forever. However, we've also learned that a single commodity machine poops out in the very low trillions, even with considerable operator intervention, and that to go farther we need either a supercomputer with copious RAM and disk space, or a networked cluster of computers.

# Design Musings: The Cluster Solution

##### By Steve Litt
First of all, there's no economic reason to attempt this. If you create a prime finding cluster, you do it for the same reason people climb Mount Everest -- because it's there.

As we saw in the last article, somewhere in the low to mid billions the commodity machine begins hitting all sorts of walls -- RAM, disk space and speed. You can partially fix the RAM wall by paging, the disk space wall with human intervention, and the speed wall by running for long times (a month or a year), but without a doubt, a single commodity box's scalability breaks down in all sorts of ways as you approach a trillion.

Ten commodity machines have ten times the disk space, ten times the RAM, and ten times the CPU power of a single machine. If they can be efficiently grouped to solve the prime problem, they should approximate a tenfold decrease in time.

## Disk Space

Imagine all the machines networked and NFS connected. If each has a 200GB disk devoted only to the operating system and the output files, then at 8 bytes per prime each machine can store roughly 25 billion proven prime numbers. That means if the machines divide the work evenly (big if), the ten machines can store about 250 billion proven primes before human intervention becomes necessary.

To provide prime factors, one of the machines must store a list of all prime factor files, including their host computer, and their filename, such that any process on any computer can access them, to use as prime factors, via NFS.

## CPU

Prime number detection can be a textbook example of cluster computing. Each computer works on its own range of candidate numbers, independent of the other computers. At least that's how things will be until a computer works on a range higher than the square of the highest number covered by range 0, at which time prime factors must be either paged or assembly-lined. We already saw in the preceding article that a factor range of 0-15 million is practical, yielding prime searching to 200 trillion.

Past that upper limit, we must implement either factor list swapping or an assembly line where successive computers test the same range using successive factor lists.

Another CPU related problem is that the different computers will have different speeds. Whatever governs handing out new ranges must know when a given computer is ready to take a new range.

## RAM

This is the tough wall, because we can't cluster RAM from many machines into one contiguous piece of RAM. If we could, the algorithm could be much easier, and we wouldn't need to do as much paging. But with the IBM AT bus (you know, the commodity Intel computers you buy today), you can't do that, at least not by stringing the boxes together with Ethernet. So we page both candidates, and at very high numbers, factors.

One way to simulate conglomerating the RAM is to use an assembly line, where each computer contains a range of factors, in which case we can find a way to pass bitarrays between such computers. Depending on our candidate ranges, the bit arrays could be as small as 1/8 meg -- max a few meg. These would presumably pass fast, although all machines would need to compatibly read the same bit arrays (same "endian", etc).

The "assembly line" method is a great way to eliminate the need to page factors, but it means every assembly line must have a machine for each factor array -- so as the numbers get higher, you must keep adding machines. It's not a particularly great way of scaling.

But it may be an excellent alternative compared to paging factors. No matter how small the candidate bit array, you must page through all the factor arrays up to the square root of the highest number in the bit array. At very high numbers with small bit arrays, the majority of the time would be spent swapping factor arrays.

## Minimizing Obstacles to Scalability

With any array type algorithm, you cannot achieve absolute scalability. At some point, some resource will either get in the way, or suddenly render performance completely dead. With this in mind, our job isn't to achieve absolute scalability, but simply to see how high we can raise the limits of practical prime discovery. I see two important leverage points:
1. Make bit array performance more linear with respect to size so as to make large bitarrays practical
2. Enable more factors to be retained in memory to reduce the frequency of prime factor swaps
With respect to #1, I see no particular reason why doubling the bit array should more than double the time to find all primes in that bit array, considering that the paging method seems pretty linear with respect to number of candidates. My suspicion is that the culprit is the need to re-iterate the bit array in order to retrieve the prime factors. Friendly reader, I leave it to you to research how to make bitarray performance linear with respect to size.

Item #2 might be fairly easy. If one can assume that no two consecutive primes are separated by more than 65535, then one could represent the prime array by a single 64 bit starting point, followed by an array of 16 bit increments. This would require one addition for every factor, but would basically allow a quadrupling of the size of prime factor arrays. Because prime frequency seems to approach 1 in 25 rather than 1 in 65535, consecutive primes varying by more than 65535 would be exceedingly rare. In the rare event that two consecutive primes are more than 65535 apart, the increment for the second one would be zero. When a zero increment is discovered during factor marking, a small special table of 64 bit integers would be consulted to retrieve the value.

If a prime factor array held all the primes up to 500 million, and if the bit array held 500 million candidates, the first prime factor swap would not occur until a candidate reached 250,000 trillion (500 million squared). The next swap, with a second factor array reaching up to a billion, would not occur until candidate 1,000,000 trillion. Even at very high values, where each bitarray must be marked by many factor arrays, at least the bitarray has 500 million candidates over which to amortize those factor swaps.

If huge bitarrays are impossible, there's always the possibility of swapping bitarrays while keeping the factor array constant. In other words, one could keep a single factor array, marking a succession of 1000 bitarrays of a million candidates apiece, and storing the marked arrays. Then swap in the next factor array and mark the 1000 bitarrays again, and so on until everything has been marked. This would be a way to limit the detrimental effects of factor swapping. Because a million-number bitarray takes only 125,000 bytes, storing and retrieving it to and from disk isn't horribly inefficient.

## Stepwise Prime Discovery

Ultimately, as numbers become huge, prime discovery becomes ever more expensive. The first billion can be done on a modern commodity computer with a gig of RAM, using a simple non-paging algorithm with a bit array. This is very fast.

The next 500 billion, using a simple bit array paging algorithm and a disk capable of holding 250GB or so, can be done without factor swapping in less than 24 hours. This is extremely fast. Faster yet would be the use of several machines: if one could manually assign the necessary intervals to, let's say, 20 machines, this task could be done in less than 3 hours.

At some point -- in the high sub-trillions, the low trillions, or with enough RAM even far beyond that -- factor swapping becomes necessary. I suspect that when it does, performance plummets. This might be a good point to spread the work to several computers, finding a way to minimize the effects of factor swapping.

There's no reason to burden the first 200 billion with a (presumably slower) factor paging algorithm. Indeed, the easiest way to do the first 200 billion is to start with the standard bit array paging algorithm using a factor array of primes up to 15 million. If you have multiple computers available, the best way to use them is simply to hand-assign separate candidate ranges to each. Once you get to the high sub-trillions or low trillions, you'll need a real factor switching algorithm, complete with job assignments.

So here are the steps I suggest:

| Number range | Suggested solution |
| --- | --- |
| 0 to a few million | One computer, bitarray, paging neither candidates nor factors. |
| A few million to high sub-trillions or low trillions | Several computers, bitarray paging candidates but not factors, with candidate ranges manually assigned. |
| High sub-trillions or low trillions through 2^64 or 2^128 | Several computers running programs that are for the most part peer to peer, and hot-swappable at the computer level. The algorithm pages both candidates and factors. Perhaps include facilities for several computers to act as an assembly line, each with its own factor range, moving ranges of candidates from computer to computer to minimize swapping of the (probably huge) factor arrays. |

## Clustering

To spread the work and data holding, I envision the computers operating mostly on a peer to peer basis. But I'd have one computer keeping track of certain facts:
• The residence of various prime factor array files
• Various computers and their host names and NFS names and their speeds and disk capacities
• Which ranges have been assigned to which computers
The residence of various prime factor array files (I envision each computer dumping its output to its own hard disk) is vital so that any computer can query for the location of a needed factor array file.

The list of various computers keeps track of host names, IP addresses, NFS data and the like. It also contains info on each computer's speed and disk space. The disk space figure is necessary so you don't overload a computer's disk, and the speed figure is necessary to determine how much time a computer should need to perform its prime discovery task.

The assignment tracking system is vital to make sure everyone gets work promptly. I envision it like this: after completing its task, each computer submits a report to the main computer giving the range it completed and the hostname and filename of the output file, and requesting another assignment. The main computer then assigns another range, along with the host and filename of the starting factor file. When a computer needs the next factor file, it asks the main computer, which replies with that file's host and filename; the satellite then communicates directly with that host to retrieve the file. I envision the main computer running the prime finding program itself, but also replying to requests from satellites. One other task for the central computer is to periodically check on assigned tasks; if a task has not completed in the time it should have, the task is marked incomplete and given to someone else. If the timed-out computer later comes back with a completion report, that report is discarded -- and perhaps the discarding is reported back to that computer.

# Conclusion

Geeks just gotta have fun. You've just gone through several prime number discovery algorithms, starting with a poke-slow brute force method, and then graduating to various array methods (Sieve of Eratosthenes). None of the algorithms you've coded in this document will break any records, but I'll bet they made you think, and I'll bet you got some great ideas while going through them. If so, celebrate your success.


Copyright (C) 2004 by Steve Litt.