Troubleshooters.Com and Code Corner Present

PHP Power Pointers:

PHP Data Security

Copyright (C) 2003 by Steve Litt



Introduction

This document assumes you have computer programming knowledge and that you know your way around PHP, Linux and UNIX. Rather than spending huge amounts of time discussing cookies or PHP4's sessions, this document discusses the ramifications of HTTP's statelessness, alternatives to get around it, database variable lookup by passed session key, and session key generation tips and techniques.

This document assumes PHP is installed and functioning on your system, and that you're reasonably familiar with PHP.

Security Warning!

Throughout this document it is assumed that PostgreSQL data web access is through user apache or whatever user your httpd daemon runs as. Although that's the easiest way to do it, it's by no means the most secure. Anyone co-hosted on the same box as your website can access your data by writing their own PHP scripts, because their access to user apache is the same as yours. Ident-based methods of authentication share the same weakness, because every co-hosted script connects as the same user.

Data enabled web apps have many gotchas, especially if there are multiple website owners on a single host computer sharing one Apache and one DBMS. A further discussion of this, and some solutions, is beyond the scope of this document.

Challenges of Statelessness

Most web app challenges revolve around the fact that HTTP is a stateless protocol. Statelessness means that when you transition from one web page to another, all variable values are forgotten. This is true whether the page is static or rendered by a script. All variables, and all their values, are forgotten.

As a programmer, a good way of envisioning this challenge is to imagine a programming language without global variables. Any variable set in one web page is out of scope in another.

There are three basic ways to retain variables and values between pages:
  1. Cookies
  2. Passing the values
  3. Database lookup

Cookies

I lied about no global variables between web pages. Cookies can hold global variables accessible from multiple web pages. Cookies are little files kept on the user's computer that store the desired values. While this might seem ideal, cookies have several potential problems:
  1. Older browsers don't recognize cookies
  2. Many people set their browsers to reject cookies for security reasons
Reason #2 is critical. Cookies can be made to track your browsing activities and other very personal information. Cookie-compliant browsers all provide methods of rejecting cookies entirely, or on a per-domain basis.

The bottom line is that if you need your website to work with all users, cookies might help with some, but you need a backup plan for those who reject or cannot process cookies.

Value passing

All computer programmers know global variables can spell disaster through side-effect problems. So instead of using global variables, we pass necessary information as subroutine arguments. It's inconvenient but much safer.

You can pass values between web pages. As in subroutines, it's inconvenient (actually much more inconvenient than passing between subroutines). But it can be done, and it doesn't rely on cookies.

The two methods of passing data between web pages are the GET method and the POST method. GET is much easier; POST is much more secure. Professional websites are usually better off with the POST method.

The GET Method

The GET method simply passes variable name/value pairs on the URL after a question mark. This method can be used from any web page, whether the transfer mechanism is a link or a form's submit button.
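
For illustration, here's a sketch of building such a URL in PHP. The page and variable names are hypothetical; urlencode() keeps special characters legal in the URL:

<?php
// Build a GET link: name/value pairs after the question mark.
function buildGetUrl($page, $vars)
{
    $pairs = array();
    foreach ($vars as $name => $value)
    {
        $pairs[] = urlencode($name) . "=" . urlencode($value);
    }
    return $page . "?" . implode("&", $pairs);
}

echo buildGetUrl("catalog.php", array("item" => "1234", "qty" => "2"));
// catalog.php?item=1234&qty=2
?>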

Unfortunately, the variables are plain text and are subject to tampering by the user. On visiting an ecommerce site using the GET method, I was able to modify the URL such that I could purchase a $10.00 book for $1.00. Naturally I didn't complete the transaction, but an unscrupulous person could have easily ordered books at 1/10 the price and then resold them. Only a separate validation script or a human audit could have uncovered the problem.

The GET method is best used for only the most harmless data, or data which cannot easily be intelligently forged.

It's possible to generate a session id variable from microsecond-seeded random numbers, sometimes combined with other data. If constructed right, such session id's are for all practical purposes impossible to reverse engineer. Sometimes such a session id can be passed via the URL without undue risk.

One more problem with passing data in the URL is that crackers and script kiddies can tack malicious commands onto the end of an otherwise good URL. Be sure every page receiving GET requests immediately truncates excess characters from the URL; after splitting the URL into variables, immediately check each for the right length and other validity, and immediately look up the database session info with the session id. If any of these steps fails, immediately terminate the script.
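
Here's a sketch of that defensive sequence, assuming a hypothetical 12-character upper case session id format; adjust the length and pattern to whatever format your ids actually use:

<?php
// Truncate, then validate, a GET-passed session id. The 12-character
// A-Z format and the function name are illustrative assumptions.
function cleanSessionId($raw)
{
    $sid = substr($raw, 0, 12);              // truncate excess characters
    if (strlen($sid) != 12)
        return false;                         // wrong length: reject
    if (!preg_match('/^[A-Z]{12}$/', $sid))
        return false;                         // wrong characters: reject
    return $sid;
}

// In the receiving page:
// $sid = cleanSessionId($_GET['sid']);
// if ($sid === false) die("Invalid session.");
// ...then look the session up in the database; terminate if not found.
?>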

The POST Method

The POST method of value passing is available ONLY from HTML forms. It cannot be applied by links, even if those links are contained within a form. Therefore, to retain state without resorting to the GET method, every exit point on every page must be a button on a form. While inconvenient, this is the most secure method of passing data between pages.
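
A minimal sketch of one such exit point follows. The page name and field names are hypothetical; the hidden field carries the session id forward, and htmlspecialchars() keeps a tampered value from breaking out of the attribute:

<?php
// Each exit point is a submit button on a method="post" form.
$sid = isset($_POST['sid']) ? $_POST['sid'] : '';
?>
<form action="nextpage.php" method="post">
  <input type="hidden" name="sid" value="<?php echo htmlspecialchars($sid); ?>">
  <input type="submit" name="checkout" value="Proceed to checkout">
</form>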

Database lookup

Oops, I lied again. You can have the equivalent of global variables without cookies. You can simply keep the information in a database. But there's one problem -- how do you retrieve the information once you've switched pages?

The new page must receive at least one piece of data -- the lookup key for the remaining data, by means other than the database. That means either cookies or value passing.

The way this is typically done is that the first page visited generates a session id. The session id is a unique string defining a specific visit by a specific visitor. The string typically involves the concatenation or combination of a random number seeded by a timestamp and an autoincrement. For additional resistance to cracking and reduced likelihood of duplicates, other information might be involved, such as the user's IP address or the Apache process id. Whatever is chosen, it's vital that it not be guessable, because if it can be guessed, a bad guy can guess at a session id and then masquerade as a different user. PHP's md5() function can change an otherwise guessable session id into a completely unguessable 32 character string which can be used as the session id.
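
Here's a sketch of that recipe. The exact mix of ingredients is up to you; everything here beyond md5() and mt_rand() is one possible choice, and the function name is an assumption:

<?php
// Combine timestamp, random number, autoincrement, IP address and
// process id, then hide them all behind md5().
function makeSessionId($autoincrement)
{
    list($usec, $sec) = explode(" ", microtime());
    $raw = $sec . $usec                       // timestamp
         . mt_rand()                          // random number
         . $autoincrement                     // guaranteed-unique part
         . (isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '')
         . getmypid();                        // httpd process id
    return md5($raw);                         // 32 hex characters
}

echo makeSessionId(100001);
?>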

As soon as the session id is created, a row for that session id is created in the database. This row has all the necessary information for the various web pages on the site. For instance, it might contain the invoice info for a shopping cart, and a relation to the user's shopping cart.
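
For instance, a sketch of creating that row; the sessions table, its columns, and the connect string are assumptions, so use whatever fits your schema:

<?php
// Create the session row the moment the id exists.
$connection = pg_Connect("dbname=mydb port=5432 user=apache");
$sid = "ABCDEFGHIJKL";   // from your session id generator
$result = pg_Exec($connection,
    "insert into sessions (sid, created) values ('$sid', now());");
if (!$result)
    die("Could not create session row\n");
pg_close($connection);
?>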

The combination of storing most variables in a database and passing a unique key through POST, cookies, or even GET (provided the session id has been made unguessable) is an excellent solution. It's convenient in that only a single variable need be passed around, with all other variables looked up afresh in each page. If a website enhancement necessitates a new piece of information, that new piece can be added to the database, rather than simultaneously adding a variable to 20 web scripts. Another advantage of the key/database hybrid is confusion reduction.

Without the database repository, every variable used by every script would need to be passed by every other script, because even if neither the passing nor receiving script needs the variable, once it's dropped there's no way that the one script needing it can regenerate it.

The session id's MUST be unguessable. When you generate unique session ids, depending on how you create them they might be guessable. As a simple case, session ids could simply be incremented. If so, a bad guy could use your site, and then decrement the session id to hack into somebody else's session. If you're passing the session id in the URL (GET method), even a 12 year old could perform this crack. However, using PHP's md5() function, you can md5sum your unique key into a new 32 character string that bears no resemblance to the old string and is impossible to convert back to the old string. Simply use the 32 character string as the session id, and the likelihood of a cracker being able to change it to an existing good session id (good meaning a session id that's been updated in, let's say, the past hour) approaches 0.

If your app doesn't use a database, it's possible to use a flat file or some other mechanism to store all the non-session-id variables. However, finding a way for the http server to write and read those variables without allowing crackers to read them is difficult, and out of the scope of this article.

In summary, a hybrid with a single session id passed everywhere, and the rest looked up by session id by each script, is probably the best methodology. The session id can be passed by cookies where possible, and by POST or GET where it must. If the session id is sufficiently unforgeable (perhaps through random numbers or a checksum of some sort), it could even be passed in the URL as a GET request.

Generating Session IDs

The stateless nature of HTTP is always a hassle, but many developers find partial solace by keeping most variables in the database, passing only the session id, and looking up the rest of the variables based on the session id. Use of session id's presents two different risks which must be handled:
  1. Risk of a user or cracker forging a legitimate session id
  2. Risk of accidentally handing out duplicate session id's
These are two very different risks, and both must be handled. The more important, valuable or private the information being handled, the more precautions must be taken against these two risks.

Absolute duplication prevention can be handled by a well designed autoincrement mechanism. Unfortunately, autoincrements are extremely easy to forge. Risk of forgery can be limited to one in trillions with a well designed random number generator. Unfortunately, a random number's uniqueness is only as good as its seed, and the best that can be hoped for is about one in a million on a system with a true microsecond clock. For clocks with less precision, the odds are worse. So it's a good idea to seed the random number generator with numbers other than the microsecond clock alone.

Therefore, to minimize both the risk of forgery and the risk of accidental duplication, you can use a combination (concatenation) of a random number and an autoincrement. Or, if you completely trust the autoincrementer, you can run the md5() function on the increment to turn it into a 32 digit hex number, thereby reducing the chance of successful forgery to one in trillions even on the most heavily trafficked sites. However, unless you completely trust your autoincrement, you're better off using it in combination with a random number, so even if the autoincrementer occasionally returns a dup, the random number will likely provide uniqueness.
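
A sketch of that combination in PHP; the separator and variable names are arbitrary choices:

<?php
// Autoincrement supplies uniqueness, the random number supplies
// forgery resistance, and md5() hides both from the user.
$increment = 100002;              // from your autoincrementer
$random = mt_rand();              // unguessable component
$sid = md5($increment . "-" . $random);
echo $sid;                        // 32 hex characters
?>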

Minimizing forging risks

Passed variables can be changed, and a passed session id is no exception. Changing the session id is trivial if the session id is passed in the URL (GET method), but a skillful cracker can also change the session id even if the POST method is used.

You cannot prevent the user from changing the session id, but you can take steps to minimize the chance of such forgeries mimicking a legitimate session id. What you need is a random number, and the random number must be large enough that its range overwhelms the number of legitimate numbers. For example, if your session id's are 12 character base 26 random numbers (each "digit" is an upper case letter), that offers 9.54 x 10^16 possible session id's. If your site receives a million visitors a day, and if you regularly purge sessions over 24 hours old, then at a given time you'll have a million legitimate session id's. The chance of someone forging a legitimate one is (1 x 10^6)/(9.54 x 10^16), or about 1 in 95 billion. The risk is minuscule compared to more pressing risks facing your business, and if you want to further decrease the risk of a successful forgery, you can add more digits. Adding 5 more digits would decrease the risk by another factor of about 12 million, to approximately 1 in a million trillion.
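
You can check the arithmetic with a couple of lines of PHP:

<?php
// The space of 12-character A-Z ids versus a million live sessions.
$space = pow(26, 12);           // about 9.54e16 possible ids
$live  = 1000000;               // legitimate ids at any moment
$odds  = $space / $live;        // forgery odds: 1 in $odds
printf("%.3g possible ids; forgery odds 1 in %.3g\n", $space, $odds);
?>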

Minimizing accidental duplicate risks

There's no such thing as a truly random algorithm. PHP's mt_rand(lowerlimit,upperlimit) function is excellent at producing seemingly random numbers evenly distributed across the range defined by its arguments, but in fact its starting place is always determined by its seed. If you don't seed it, you have no idea how random it will really be. And if you seed it with the microsecond clock alone (explained later), the odds of duplication are still about one in a million. For a well trafficked site that's not good enough for the session id.

Including the seconds since epoch in the session ID seed drops the duplication risk much further -- for practical purposes limiting the possibility to cases where users arrive within the same microsecond, or within a number of microseconds corresponding to the time granularity of the system. Depending on your traffic and the degree of loss created by an inadvertent duplicate, the appending of seconds since epoch might be sufficient. But for more safety, it's all too easy to add in a third number -- an unseeded random number. Unseeded random numbers aren't necessarily random, but when combined with a time element the two make it very difficult to reverse engineer the produced random number.

Other values that could be either concatenated into the session id or used for the seed include:

  1. The incoming IP address
  2. The difference between a second microtime() call and the first one, multiplied by a suitable multiplier
The second microtime() might seem like an opportunity to exploit system variation for more variability, but my tests on my box indicate it cuts the risk of duplication by maybe a factor of 30. That's better than nothing, but not much. The incoming IP address might help, but given the fact that large ISP's funnel many users through the same IP address, its added safety is unpredictable. 

Probably the easiest way to autoincrement in a secure way is with a single row of a table containing a key, and a data column containing an integer to be incremented. When someone needs a new number, they lock the table, grab the value, increment it,  write it back, and unlock it. If two people try to perform this process at the same time, Postgres acts as a traffic cop.
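
Here's a sketch of that lock/read/increment/write/unlock cycle, using PostgreSQL's table-level locking; the connect string is an assumption:

<?php
// Lock, read, increment, write back, unlock. Competing callers
// queue at the lock statement until this transaction commits.
$connection = pg_Connect("dbname=mydb port=5432 user=apache");
pg_Exec($connection, "begin;");
pg_Exec($connection, "lock table increments in exclusive mode;");
$result = pg_Exec($connection,
    "select number from increments where type='sid';");
$row = pg_fetch_row($result, 0);
$number = $row[0] + 1;
pg_Exec($connection,
    "update increments set number=$number where type='sid';");
pg_Exec($connection, "commit;");   // releases the lock
echo "New increment is: $number\n";
pg_close($connection);
?>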

Locking can be a mess for the application programmer to implement, so later we'll discuss how to have Postgres do the whole thing with a stored procedure (a function in Postgresese).

Random number generation

PHP gives you two outstanding tools to use in generating random number session ids: microtime() and mt_rand(). The microtime() function returns a string consisting of a float, a space, and an integer. The float is between 0 and 1, representing the fractional (microsecond) portion of the current second on the system clock. The integer is the number of seconds elapsed since 1/1/1970, the UNIX epoch. To test it, create the following microtime.php file:

<?php
echo microtime();
?>

Pull up microtime.php in a browser, and you'll see something like this:

0.52037700 1039818902

Click your browser's refresh button several times and note that the first number changes almost randomly, while the second number corresponds to elapsed seconds and changes only in the rightmost couple of digits.

To actually acquire the numeric value of the microseconds and seconds parts of the string, do the following:
$timestring = microtime();
$microseconds = (double) $timestring;
$seconds = (integer) substr($timestring, strrpos($timestring, " ") + 1);
First you capture the time in a snapshot, and from that point forward you work only on the snapshot. PHP's type conversion considers the space as the end of the floating point number, so the typecast retrieves the correct amount. The second part of the string is more difficult: you need to find the space's position, then retrieve only the portion of the string after that space.
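
An alternative, shown here on a captured sample value, is to split on the space with explode(), which avoids the substr() arithmetic entirely:

<?php
// microtime() returns "fraction second", so splitting on the
// space gives both parts directly.
$timestring = "0.52037700 1039818902";   // a captured microtime() value
list($microseconds, $seconds) = explode(" ", $timestring);
$microseconds = (double) $microseconds;
$seconds = (integer) $seconds;
echo $microseconds . " " . $seconds . "\n";
?>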

Prove this to yourself by changing microtime.php to the following:

<?php
$timestring = microtime();
$microseconds = (double) $timestring;
$seconds = (integer) substr($timestring, strrpos($timestring, " ") + 1);
echo "<pre>";
echo $timestring . "\n";
echo $microseconds . " " . $seconds . "\n";
echo "</pre>";
?>

The preceding should print the string, and then a concatenation of the numeric values, and they should obviously be the same. Note that the numeric line won't align exactly under the string line: the casts drop the space padding, and if $microseconds ends in one or more zeros those are dropped as well, shifting $seconds left. But this should be enough to demonstrate the workings of the microtime() function.

As discussed earlier in this section, using only microseconds as a random number seed leaves you open to inadvertent duplicates. Much better is to include the seconds since Epoch, and even better is to also throw in an unseeded random number. Between those three, it's almost impossible for a cracker to set up a machine to duplicate your seed and thus guess your random numbers.

The easiest way to generate a random session id is in base 26, using A-Z as the digits, with the random number generator seeded as described above. Here's the code to generate a 12 character base 26 string:

<?php
function randomString($randStringLength)
{
    // Seed from an unseeded random number, the microsecond clock,
    // and seconds since epoch, then draw base 26 digits.
    $timestring = microtime();
    $secondsSinceEpoch = (integer) substr($timestring, strrpos($timestring, " ") + 1);
    $microseconds = (double) $timestring;
    $seed = mt_rand(0, 1000000000) + 10000000 * $microseconds + $secondsSinceEpoch;
    mt_srand($seed);
    $randstring = "";
    for ($i = 0; $i < $randStringLength; $i++)
    {
        $randstring .= chr(ord('A') + mt_rand(0, 25));
    }
    return($randstring);
}
echo "<pre><big><big>\n";
echo randomString(12);
echo "</big></big></pre>\n";
?>

The preceding code generates a 12 character base 26 string. There are 9.54 x 10^16 such strings, so even a site getting a billion visits per year that purges old session records daily runs only a minuscule daily risk of handing out a duplicate. If those aren't good enough odds for you, tack on an additional 5 characters to shrink the likelihood of duplicates by another factor of about 12 million. At that point you're more at risk of being killed by an alligator than you are of dealing out a duplicate, especially because the seed is based on seconds since epoch, microseconds, and an unseeded random number.

What is the performance effect of the randomString() function? Let's run 30000 iterations on my unloaded dual Celeron 450 with 512MB of RAM. Here's the loop:

<?php
echo "<pre><big><big>\n";
$iterations = 30000;
echo "Starting $iterations iterations of random number generation...\n";
$randstring = "";
$startTimeString = microtime();
for ($i = 0; $i < $iterations; $i++)
{
    $randstring = randomString(12);
}
$endTimeString = microtime();
$startTime = (integer) substr($startTimeString, strrpos($startTimeString, " ") + 1);
$endTime = (integer) substr($endTimeString, strrpos($endTimeString, " ") + 1);
$elapsed = $endTime - $startTime;
$elapsedPerCall = $elapsed / $iterations;
echo "Final random number is $randstring\n";
echo "Elapsed time is $elapsed seconds\n";
echo "That's $elapsedPerCall seconds per call.\n";
echo "</big></big></pre>\n";
?>

The preceding code produced the following output on my browser:

Starting 30000 iterations of random number generation...
Final random number is XKLYEQWAWBVC
Elapsed time is 11 seconds
That's 0.00036666666666667 seconds per call.

366 microseconds isn't bad unless you're getting thousands of visits per hour, and if you are, you're probably running more than a dual Celeron 450.

I believe that this base 26 representation of a random number, when concatenated with an autoincrement, is the best compromise between performance and security.

Autoincrementing

Autoincrementing on a busy site is anything but trivial. It's perfectly possible for two users to appear at the same nanosecond. Will the autoincrement work properly, will it grant the two users duplicate autoincrements, or will it malfunction in some other way? If you're already working with a database, perhaps the simplest way is to use the database. Let's use PostgreSQL as an example.

Using psql, create a table called increments with columns type and number:
create table increments (type char(8), number integer);
Pre-load it with this row:
insert into increments (type, number) values ('sid', 100001);
For the purposes of this exercise, be sure to grant all privileges for this table to the user under which httpd runs (user apache on my box).

Now create a test-jig program called autoincrement.php to test it:

<?php
echo "<pre>";
$starttime = microtime();
echo $starttime . "\n";
$connection = pg_Connect("dbname=mydb port=5432 user=apache");
if (!$connection)
{
    die("Connection failed\n");
}
else
{
    echo "<p>Connection succeeded</p>\n";
}

// Read the current value...
$result = pg_Exec($connection,
    "select number from increments where type='sid';");
$row = pg_fetch_row($result, 0);
$number = $row[0] + 1;

// ...then write back the incremented value. Note that nothing stops
// another request from sneaking in between these two statements.
$result = pg_Exec($connection,
    "update increments set number=$number where type='sid';");

echo "\nNew increment is: " . $number . "\n";
pg_close($connection);

$endtime = microtime();
echo $endtime . "\n";
echo "</pre>";
?>

Hit it with a browser, and you'll see that each refresh increments the number.

Autoincrement traffic copping

The preceding code is cute, but what happens if autoincrement requests occur within nanoseconds of each other? Watch the following simulated disaster:

<?php
echo "<pre>";
$starttime = microtime();
echo $starttime . "\n";
$connection = pg_Connect("dbname=mydb port=5432 user=apache");
if (!$connection)
{
    die("Connection failed\n");
}
else
{
    echo "<p>Connection succeeded</p>\n";
}

// The first request reads the current value...
$result = pg_Exec($connection,
    "select number from increments where type='sid';");
$row = pg_fetch_row($result, 0);
$number1 = $row[0] + 1;

// ...and before it can write back, a second request reads the
// same value.
$result = pg_Exec($connection,
    "select number from increments where type='sid';");
$row = pg_fetch_row($result, 0);
$number2 = $row[0] + 1;

// Both requests write back the same incremented number.
$result = pg_Exec($connection,
    "update increments set number=$number1 where type='sid';");

$result = pg_Exec($connection,
    "update increments set number=$number2 where type='sid';");

echo "\nNew first increment is: " . $number1 . "\n";
echo "\nNew second increment is: " . $number2 . "\n";
pg_close($connection);

$endtime = microtime();
echo $endtime . "\n";
echo "</pre>";
?>

The preceding code shows what happens if a second autoincrement request comes in between the select and the update of the first one: both requests get the same number -- a disaster when driving a website with session id's. Concatenating the autoincrement with a random number helps, but don't depend on the random number to save you in this case: when two requests arrive this close together, two of the seed's three factors are time based and may match, and the third is unpredictable but not guaranteed to differ.

You could fix this with locks, timeout code and anti-deadlock code. Ughhh! It has the advantage of database portability (more or less), but it can get nasty.

My preference is to work directly at the database level, using a stored procedure to accomplish both the increment and the return of the number as a single transaction. Thus the database queues all the requests. Everybody gets incremented, and nobody gets a duplicate or any other bogus problem.

Create the following text file, called incr.sql:

drop function incr(text);

create function incr(text) returns int8 as '
declare
    mytype char(8);
    rtrn record;
begin
    mytype := $1;
    -- FOR UPDATE locks the row, so concurrent callers queue here
    -- instead of reading the same value.
    select number into rtrn from increments where type=mytype for update;
    rtrn.number := rtrn.number + 1;
    update increments set number=rtrn.number where type=mytype;
    return rtrn.number;
end;
' language 'plpgsql';

Now, within the psql environment, run the following command:
\i incr.sql
Depending on where you started psql from, you might need to put the complete path on the filename in the preceding command. If all goes well, psql should issue a message saying "DROP" followed by another saying "CREATE". What has happened is that it dropped function incr(text) and then created it again. If psql gripes about "permission denied", the user from which you ran psql doesn't have permission to create and drop functions (stored procedures). Either grant those rights or run the command as the postgres user. Also, only the owner of a function can drop it, so if the incr(text) function was previously created by a different user, that user must drop it.

Once the function is in place, you can test it from within psql because the function is implemented in PostgreSQL, not in PHP code. Within psql issue the following command:
select incr('sid');
If you run the preceding command twice, you'll see the number increment. Within the psql environment it should look something like this:

mydb=> select incr('sid');
incr
--------
100029
(1 row)

mydb=> select incr('sid');
incr
--------
100030
(1 row)

mydb=>


If the command doesn't work, perhaps you need to grant the user proper permissions. Try this from user postgres:
grant select,update on increments to apache;
And if that doesn't work, temporarily try brute force:
grant all on increments to apache;
Later you can take away privileges with the revoke command.

Once you can autoincrement within the psql environment, you can do it in the PHP environment. Create the following inctest.php:

<?php
echo "<pre><big><big>\n";
$number = 0;
$iterations = 1000;
echo "Starting $iterations iterations of autoincrementing...\n";
$startTimeString = microtime();
// getNextIncrement() is assumed to call the incr('sid') stored
// procedure through an open PostgreSQL connection.
for ($i = 0; $i < $iterations; $i++)
{
    $number = getNextIncrement('sid');
}
$endTimeString = microtime();
$startTime = (integer) substr($startTimeString, strrpos($startTimeString, " ") + 1);
$endTime = (integer) substr($endTimeString, strrpos($endTimeString, " ") + 1);
$elapsed = $endTime - $startTime;
$elapsedPerCall = $elapsed / $iterations;
echo "Final increment number is $number\n";
echo "Elapsed time is $elapsed seconds\n";
echo "That's $elapsedPerCall seconds per call.\n";
echo "</big></big></pre>\n";
?>

The result is ugly. It takes 6 seconds for 1000 iterations, and every time you run it, the elapsed time goes up from there. Can you guess what went wrong?

You might think it's because I never closed the connection, or that repeatedly opening connections is very expensive. Although these might be true, the problem is more subtle. Internally PostgreSQL processes the update as a delete followed by an insert, and those "deleted" records lie around bloating the database. As user postgres you can actually see the bloat by first running a du . command, then accessing the preceding PHP program from a browser, then once more running a du . command. Certain directories will show an increase. Trace it down to a specific file, and you can see that file grow every time you refresh your browser.

Performing the following psql command as user postgres will cure the problem:
vacuum full;
But of course the database will bloat again as more increments are done.

If you're using PostgreSQL, you can triple the best case incrementation speed by substituting a sequence for the stored procedure. Do the following, as user postgres, within the psql environment:

create sequence sidseq;
grant update on sidseq to apache;

In a typical database app the connection is likely to be open, so why not move the open outside of the getNextIncrement() function. Here's the complete code, function and all:
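
A sketch of such a program, assuming the connect string used earlier and the sidseq sequence; here getNextIncrement() wraps nextval() rather than the stored procedure:

<?php
// The connection is opened once, outside getNextIncrement().
function getNextIncrement($type)
{
    global $connection;
    // $type is unused with a single sequence; a real app might map
    // each type to its own sequence name.
    $result = pg_Exec($connection, "select nextval('sidseq');");
    $row = pg_fetch_row($result, 0);
    return $row[0];
}

echo "<pre><big><big>\n";
$connection = pg_Connect("dbname=mydb port=5432 user=apache");
if (!$connection)
    die("Connection failed\n");

$number = 0;
$iterations = 1000;
$startTimeString = microtime();
for ($i = 0; $i < $iterations; $i++)
{
    $number = getNextIncrement('sid');
}
$endTimeString = microtime();
$startTime = (integer) substr($startTimeString, strrpos($startTimeString, " ") + 1);
$endTime = (integer) substr($endTimeString, strrpos($endTimeString, " ") + 1);
$elapsed = $endTime - $startTime;
echo "Final increment number is $number\n";
echo "Elapsed time is $elapsed seconds\n";
echo "That's " . ($elapsed / $iterations) . " seconds per call.\n";
pg_close($connection);
echo "</big></big></pre>\n";
?>

Because nextval() is atomic, no locking code is needed in PHP, and sequences don't bloat the table the way repeated updates do.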