File input and output is an integral part of every programming
language.
Perl has complete file input and output capabilities, but it has
especially
handy syntax for line-at-a-time sequential input. Since Perl's
strengths are
in text manipulation/parsing, this is especially important, and will be
well
covered on this web page. Also covered will be sequential file output. This
website will not discuss fixed record reads or random I/O.
Because writing files in Perl is actually simpler, we'll start
with output,
then move to input.
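Here's a minimal sketch of sequential output (the filename is an assumption):
open the file with a > prefix to create or overwrite it (>> would append),
print to the handle, and close it:

#!/usr/bin/perl -w
use strict;

# Open for write: > creates/overwrites, >> would append
open(OUF, ">filename.out") or die "Could not open file\n";
print OUF "line one\n";
print OUF "line two\n";
close(OUF);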
Opening a file for read requires no angle bracket before the filename.
If you wish, you can put in a left angle bracket (<), which means "input
file".
It's
good practice to close any files you open. Files can be read line by
line,
or the entire contents of the file can be dumped into a list, with each
list
element being a line. Here is an example of a program that reads a
file,
capitalizes each line, and prints it to the screen.
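A minimal sketch (the filename is illustrative):

#!/usr/bin/perl -w
use strict;

# Read line by line, capitalize each line, and print it to the screen
open(INF, "<filename.in") or die "Could not open file\n";
while (<INF>) {
    my $line = $_;
    chomp($line);
    print uc($line), "\n";
}
close(INF);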
Sometimes it's easier to read a whole file into a list,
especially with
complex break logic, read-ahead totals, or sorting. Here's a program
that
reads a file and prints it in sorted order.
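A sketch of that approach (the filename is illustrative):

#!/usr/bin/perl -w
use strict;

# Dump the whole file into a list, one line per element, then sort it
open(INF, "<filename.in") or die "Could not open file\n";
my @lines = <INF>;
close(INF);

foreach my $line (sort @lines) {
    print $line;
}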
Perl is exceptionally good at file conversion. Here's an
example where
each line in the file has 3 fields (in this order): A 5 digit zip code,
a
20 char name (first last) and a mm/dd/yy birth date. You want to change
it
to a 16 char last name, a 10 char first name, a mm/dd/yyyy birth date,
and
a 5 digit zip. For simplicity, assume names have no spaces (no Mary
Anns,
no Van Gelders).
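Here's a sketch of a program to do the conversion; the filenames and the
yy-to-yyyy century rule are assumptions:

#!/usr/bin/perl -w
use strict;

# Input : 5 char zip, 20 char "first last" name, mm/dd/yy birth date
# Output: 16 char last name, 10 char first name, mm/dd/yyyy birth date, 5 char zip
open(INF, "<people.in")  or die "Could not open input\n";   # filenames are assumptions
open(OUF, ">people.out") or die "Could not open output\n";
while (<INF>) {
    chomp;
    my ($zip, $name, $date) = unpack("A5 A20 A8", $_);
    my ($first, $last) = split /\s+/, $name;
    my ($mm, $dd, $yy) = split /\//, $date;
    my $yyyy = ($yy < 30) ? "20$yy" : "19$yy";   # century cutoff is an assumption
    printf OUF "%-16s%-10s%s/%s/%s%s\n", $last, $first, $mm, $dd, $yyyy, $zip;
}
close(OUF);
close(INF);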
File Slurping
You might occasionally want to grab an entire file without paying
attention
to line termination. You can do that by undef-ing the $/ built-in
variable, and then assigning the <file> to a scalar. This
is called
"slurping" the file.
The following code slurps the STDIN file, then splits it into lines,
then
reassembles the lines into a single string, and prints the string:
#!/usr/bin/perl -w
use strict;

my $holdTerminator = $/;
undef $/;
my $buf = <STDIN>;
$/ = $holdTerminator;

my @lines = split /$holdTerminator/, $buf;
$buf = "init";
$buf = join $holdTerminator, @lines;

print $buf;
print "\n";
The preceding code works like this:
- First we store the terminator character, which by default
on Linux
systems is linefeed -- "\n".
- Now we undef the line terminator character
- Now we slurp the entirety of STDIN
- Now we restore the line terminator character
- Now we split the string we read, using the terminator as a
border
- Now we join the array back into a string
- We print the string
- Last but not least, we print an extra newline to fix a
picket fence
condition
Slurping isn't as handy as it might seem. If you're a C programmer
accustomed
to using the
read() and
write()
functions with a large
buffer to accomplish incredibly fast I/O, you might think
file-at-a-time
I/O would be much faster than line oriented I/O. Not in Perl! For
whatever
reason, line oriented is faster.
One reason is the need for huge amounts of memory, which on UNIX
systems
translates into huge disk usage as swap file space is used. But this
doesn't
account for the whole thing, as you'll see in the following test
program:
#!/usr/bin/perl -w
use strict;

my $bigfileName   = "/scratch/bigfile.txt";
my $sipfileName   = "/scratch/sip.out";
my $arrayfileName = "/scratch/array.out";
my $slurpfileName = "/scratch/slurp.out";

sub slurp() {
    my $inf;
    my $ouf;
    my $holdTerminator = $/;
    undef $/;
    open $inf, "<" . $bigfileName;
    my $buf = <$inf>;
    close $inf;
    $/ = $holdTerminator;
    my @lines = split /$holdTerminator/, $buf;
    $buf = "init";
    $buf = join $holdTerminator, @lines;
    open $ouf, ">" . $slurpfileName;
    print $ouf $buf;
    print $ouf "\n";
    close $ouf;
}

sub sip() {
    my $inf;
    my $ouf;
    open $inf, "<" . $bigfileName;
    open $ouf, ">" . $sipfileName;
    while(<$inf>) {
        my $line = $_;
        chomp $line;
        print $ouf $line, "\n";
    }
    close $ouf;
    close $inf;
}

sub buildarray() {
    my $inf;
    my $ouf;
    my @array;
    open $inf, "<" . $bigfileName;
    while(<$inf>) {
        my $line = $_;
        chomp $line;
        push @array, ($line);
    }
    close $inf;
    open $ouf, ">" . $arrayfileName;
    foreach my $line (@array) {
        print $ouf $line, "\n";
    }
    close $ouf;
}

sub main() {
    my $time1 = time();

    print "Starting sip\n";
    sip();
    print "End sip\n";

    my $time2 = time();

    print "Starting array\n";
    buildarray();
    print "End array\n";

    my $time3 = time();

    print "Starting slurp\n";
    slurp();
    print "End slurp\n";

    my $time4 = time();

    print "Sip time is ", $time2-$time1, " seconds\n";
    print "Array time is ", $time3-$time2, " seconds\n";
    print "Slurp time is ", $time4-$time3, " seconds\n";
}

main();
The preceding program creates the following output:
[slitt@mydesk littperl]$ ./slurp.pl
Starting sip
End sip
Starting array
End array
Starting slurp
End slurp
Sip time is 14 seconds
Array time is 74 seconds
Slurp time is 279 seconds
[slitt@mydesk littperl]
As you can see in the preceding program and output, the line in, line
out method copied a 50 MB file in 14 seconds. A line at a time input that
pushed each line onto an array and then wrote it out a line at a time took 74 seconds.
Note
that this stores the full file in memory. The slurp method, which reads
the
file into a string and then copies it to an array, takes 279 seconds.
Looking
more closely, the slurp version actually has two copies of the file in
memory
-- one in the array and one in the scalar. Indeed, if you add the
following
line to the array method, right after the building of the array is
complete,
array runtime more closely approximates that of the slurp method:
my @arraycopy = @array;
Adding the preceding statement means storing 2 copies of the file in
memory,
just like the slurp method. Here are the run results with the extra
copy:
[slitt@mydesk littperl]$ ./slurp.pl
Starting sip
End sip
Starting array
End array
Starting slurp
End slurp
Sip time is 14 seconds
Array time is 304 seconds
Slurp time is 258 seconds
[slitt@mydesk littperl]$
The Moral of the Story
The moral of the story is clear. Large buffer I/O is not efficient the
way
it is in C. If the file is large enough to save time by whole file
reads,
then it's so large as to exhaust electronic RAM memory, thus incurring
swap
penalties.
The most efficient algorithm reads a line, writes a line, and stores
nothing.
That's not always practical, and it's certainly not the easiest way to
design
code.
A further advantage of read a line, write a line occurs when dealing
with
pipes. This is in the
Piping
section, later in
this document.
If you really want to get faster I/O in Perl, you might experiment with
the
sysopen(),
sysread(),
sysseek(),
and
syswrite()
functions. But beware, they interact quirkily with normal Perl I/O
functions.
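For instance, here's a minimal sketch of a large-buffer file copy using those
calls; the filenames and the buffer size are assumptions:

#!/usr/bin/perl -w
use strict;
use Fcntl;   # supplies the O_* flags used by sysopen

my ($in, $out);
sysopen($in,  "/scratch/bigfile.txt", O_RDONLY)
    or die "Cannot open input: $!\n";
sysopen($out, "/scratch/copy.out", O_WRONLY|O_CREAT|O_TRUNC, 0644)
    or die "Cannot open output: $!\n";

my ($buf, $got);
while ($got = sysread($in, $buf, 1024 * 1024)) {
    # syswrite may write less than asked, so loop until the chunk is gone
    my $written = 0;
    while ($written < $got) {
        my $n = syswrite($out, $buf, $got - $written, $written);
        die "Write failed: $!\n" unless defined $n;
        $written += $n;
    }
}
close($out);
close($in);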
Passing Files as Arguments
Given the Perl syntax, it's not obvious how to pass files as arguments.
There are three methods: as globs, as filehandles, and as variables.
Globs
The Glob method of passing files is very Perlistic, and as such appears
quite opaque to general-purpose programmers not using Perl on a regular
basis.
The Glob method is useful when retrofitting file passing in programs
using
Perl's <FILENAME> syntax. If you're starting fresh,
consider filehandles.
Here's the Glob method:
sub printFile($) {
    my $fileHandle = $_[0];
    while (<$fileHandle>) {
        my $line = $_;
        chomp($line);
        print "$line\n";
    }
}

open(MYINPUTFILE, "<filename.in");
printFile(\*MYINPUTFILE);
close(MYINPUTFILE);
Output files work similarly.
If you need to assign the glob to an actual variable, you can do that
also.
The code in the subroutine remains the same, and the following is the
code
doing the passing:
open(MYINPUTFILE, "<filename.in");
my $fileGlob = \*MYINPUTFILE;
printFile($fileGlob);
close(MYINPUTFILE);
Use of an actual variable makes the code much more obvious to the
programmer
with only casual Perl experience.
Once again, Globs are the old method, and they're compatible with older
Perl
file methods, but for new construction you'll probably prefer to use
the FileHandle
module.
FileHandles
This is the modern, preferred way. With the FileHandle module you can
assign
a file handle to a variable that can be passed, just like in C. Unlike
Globs,
its use is obvious to any experienced programmer.
use FileHandle;

sub printFile($) {
    my $fileHandle = $_[0];
    while (<$fileHandle>) {
        my $line = $_;
        chomp($line);
        print "$line\n";
    }
}

my $fh = new FileHandle;
$fh->open("<filename.in") or die "Could not open file\n";
printFile($fh);
$fh->close();    # automatically closes file
The FileHandle class also has methods like gets(), print(), printf().
This
gives the programmer much better control, and helps in OOP programs.
Variables
We usually see filehandles written as bareword uppercase text, as in
<INF>, but a filehandle can also be held in a variable, such as
<$inf>. As such, the
variable can be passed between subroutines. Usually the FileHandle
method
is preferred, but if you're an oldschool perl guy who wants to use the
oldschool
syntax but be able to pass open files without resorting to cumbersome
globs,
variables are just what's needed. Watch this:
#!/usr/bin/perl -w
use strict;

sub printFile($) {
    my $fileHandle = $_[0];
    while (<$fileHandle>) {
        my $line = $_;
        chomp($line);
        print "$line\n";
    }
}

my $fh;
open($fh,"<filename.in") or die "Could not open file\n";
printFile($fh);
close($fh);
Piping
One really quick, modular and high quality method of program
design/coding
is to build the program out of small executables connected with pipes.
For
instance, the following CGI shellscript, let's call it
showrpt.cgi,
illustrates such a piping situation:
#!/bin/bash
./get_mainframe_data.pl | ./zap_extraneous_text.pl | ./parse_data.pl | ./make_into_web_page.pl
In the preceding, zap_extraneous_text.pl, parse_data.pl, and
make_into_web_page.pl are Perl scripts receiving their data through STDIN
and outputting data through STDOUT. They're what are called filters in the
UNIX world. The get_mainframe_data.pl program generates its own data and
passes it out through STDOUT. The pipeline route is defined by showrpt.cgi,
which calls all four in a pipe.
Now ask yourself this: what if a Perl program had to decide the pipe route?
This is a very real question. Perhaps a parsing program starts with a complex
parse to determine which parser units to use, then assembles the pipe, and
then pipes data into it. You do that with a Perl Pipe.
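Here's a minimal sketch of a Perl Pipe; the commands inside $pipestring and
the use of %ENV for the environment lines are assumptions:

#!/usr/bin/perl -w
use strict;

# The pipe route is decided at runtime; these commands are illustrative
my $pipestring = "| ./parse_data.pl | ./make_into_web_page.pl";

my $pipe;
open($pipe, $pipestring) or die "Could not open pipe\n";

# Send the environment lines down the pipe
foreach my $key (sort keys %ENV) {
    print $pipe "$key=$ENV{$key}\n";
}

# Then send this program's STDIN down the same pipe
while (<STDIN>) {
    print $pipe $_;
}

close($pipe);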
In the
Perl Pipe code, the
open $pipe, $pipestring
sets it up so
anything printed to
$pipe is sent to the STDIN of
the pipe laid out
in
$pipestring. From there, the environment lines
are sent to that
pipe, and then this program's STDIN is sent to that pipe.
Piping Efficiency Issues
Small executables piped together are a great way to rapidly develop an
application.
They're a great way to quickly rearrange an application. Applications
built
with piped executables are so modular that bugs are few, shallow, and
easy
to test for. The main problem with piped executables, especially those
made
with Perl, is that piping data is slow. Perl programs handle STDIN and
STDOUT
about half the speed of
awk, and about 1/5 the
speed of equivalently
written C programs.
Beyond that, assuming you're running a Linux, UNIX or BSD box, order
counts.
Ideally, you read a little, process a little, write a little:
while(<STDIN>) {
    my $line = $_;
    chomp($line);
    $line = process_one_line($line);
    print $pipe $line, "\n";
}
The preceding code implements a true bucket brigade, where each process
on
the pipeline has something to do, and they can all work concurrently.
This
is especially important on multiprocessor machines.
Often, however, you cannot output until all the input has been read and
processed.
This means that the next stage must wait until completion of the
previous
stage, and only then begin. Compound that by several stages, and
processing
time balloons. Unfortunately, it's often very difficult to write an
executable
so that it outputs before completion of input.
Other File Algorithms
truncate()
This is a way of emptying a file without deleting it. This is wonderful
for
web apps, where the Apache user can be given write rights to the file,
but
not write rights to the whole directory. As a privileged user, create the
the
file with touch, and then change its permissions to be writeable by the
Apache
user. From there, it never gets deleted, so it's always modifiable by
the
Apache user.
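A minimal sketch (the path is illustrative):

#!/usr/bin/perl -w
use strict;

# Empty the file without deleting it, so its ownership and permissions survive
my $filename = "/var/www/data/report.txt";
truncate($filename, 0) or die "Could not truncate $filename: $!\n";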
unlink()
This deletes a file.
rename()
This renames a file, like the UNIX mv command.
mkdir, rmdir, chdir, chmod, chown, chroot
These perform identical functions to their UNIX counterparts.
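A few of these in action (paths and modes are illustrative):

mkdir("/scratch/reports", 0755) or die "mkdir failed: $!\n";
chmod(0644, "report.txt")       or die "chmod failed: $!\n";
rename("report.txt", "/scratch/reports/report.txt") or die "rename failed: $!\n";
unlink("stale.txt")             or die "unlink failed: $!\n";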
-X
In this case the "X" is actually one of the following letters:
-r   File is readable by effective uid/gid.
-w   File is writable by effective uid/gid.
-x   File is executable by effective uid/gid.
-o   File is owned by effective uid.
-R   File is readable by real uid/gid.
-W   File is writable by real uid/gid.
-X   File is executable by real uid/gid.
-O   File is owned by real uid.
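For example (the filename is illustrative):

my $filename = "filename.in";
print "readable\n"   if -r $filename;
print "writable\n"   if -w $filename;
print "executable\n" if -x $filename;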
sysopen(), sysread(), sysseek(), and syswrite()
These are low level calls corresponding to C's open(), read() and
write()
functions. Due to lack of buffering, they interact strangely with calls
to
Perl's buffered file I/O. But if you really want to speed up Perl's
I/O,
this might (or might not) be a way to do it. This is beyond the scope
of
this document.