Anatomy of an Output Class - Writing PDFs
Adding Links
I've mentioned already that one of the early challenges I faced was dynamically adding hyperlinks to the PDF. It's a tricky thing to do because, unlike HTML, PDF associates links not with the words they're attached to, but with grid coordinates. So in order to add a hyperlink, you have to know the starting x and y coordinates for the hot spot, and you have to know the width and height of the hot spot. If you want to have your hyperlinks appear as a different color than the standard text, you also have to do that manually. So the routine goes as follows:
- Determine that a given piece of text should be a hyperlink.
- Get its x and y coordinates and its height and width.
- If need be, change the text color.
- Write the text.
- If need be, change the text color back.
- Add the hyperlink based on the x and y coordinates and the width and height.
- Move on to the next piece of text.
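To make that routine concrete, here's a minimal sketch that plans the operations for one run of text rather than calling a real PDF library. The function name plan_text_run, the operation names, and the assumption that the origin sits at the bottom left of the page are all mine for illustration; none of this is the class's actual code.

```php
<?php
// Hypothetical sketch: list the operations needed to write one run of text,
// adding a link hot spot when a URL is supplied.
// Assumes ($x, $y) is the lower-left corner of the text and that y grows upward.
function plan_text_run(string $text, string $link, float $x, float $y,
                       float $width, float $height): array
{
    $ops = [];
    $isLink = ($link !== "");
    if ($isLink) {
        $ops[] = ["set_color", "link"];   // switch to the link color
    }
    $ops[] = ["show_text", $text, $x, $y];
    if ($isLink) {
        $ops[] = ["set_color", "text"];   // switch back to the normal color
        // The hot spot is a rectangle: lower-left corner, then upper-right corner.
        $ops[] = ["add_link", $link, $x, $y, $x + $width, $y + $height];
    }
    return $ops;
}

$ops = plan_text_run("Acme Corp", "http://www.somewhere.com", 36.0, 500.0, 48.0, 10.0);
$plain = plan_text_run("No link here", "", 36.0, 480.0, 60.0, 10.0);
```

The point is the ordering: the color changes bracket the text, and the link rectangle is derived entirely from numbers you already have on hand when you write the text.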
If you're building your columns and rows manually with set widths and coordinates, this is no big deal, but remember that we're putting all of this together dynamically, so we don't know from the start where these coordinates are going to fall. At all times while I'm writing a page of the PDF, I keep track of the x and y coordinates of my cursor and how wide and tall each bit of text is. This makes it easy to add hyperlinks if they've been specified. As I move across the page, I increment my y value by the height of the given text plus any vertical spacing and my x value by the width of the given text plus any horizontal spacing and gutter. But how do we determine these widths on the fly?
Calculating Column Widths
Calculating column widths and row heights was actually the hardest part of this project. On the surface, it seems fairly easy: Just take the page width, subtract from it the total of the widest columns, and divide the difference to calculate the gutter to place between columns. This would be great if you had unlimited page width, but I'm usually working with legal sized reports, some of which contain 20 or 30 columns, including one for comments of (potentially) several hundred characters.
It doesn't take long in these circumstances for the total of the widest columns to exceed the page width, giving you negative gutter sizes, which makes for really ugly PDFs. There are two ways to handle this. The first is to determine the difference between the actual and the available widths and, if the actual exceeds the available, to truncate any columns wider than the available width divided by the number of columns (less a little gutter). The second is to wrap long columns over multiple rows.
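Here's a rough sketch of the naive gutter calculation and of the truncation approach. I'm assuming the gutter goes only between columns (so n columns have n - 1 gutters) and that the truncation cap is an even share of the available width less the gutter; the function names are hypothetical.

```php
<?php
// Gutter between columns when each column is as wide as its widest cell.
// A negative result means the columns don't fit on the page.
function gutter_width(float $pageWidth, array $colWidths): float
{
    $gaps = max(count($colWidths) - 1, 1);
    return ($pageWidth - array_sum($colWidths)) / $gaps;
}

// Truncation approach: cap every column at an even share of the page,
// less a fixed gutter. Columns already under the cap are left alone.
function cap_column_widths(float $available, array $colWidths, float $gutter): array
{
    if (count($colWidths) === 0) {
        return [];
    }
    $cap = $available / count($colWidths) - $gutter;
    return array_map(fn($w) => min($w, $cap), $colWidths);
}

// 19 narrow columns of 12 units plus one 972-unit comments column.
$widths = array_merge(array_fill(0, 19, 12), [972]);
$gutter = gutter_width(1000, $widths);           // negative: too wide
$capped = cap_column_widths(1000, $widths, 4);   // comments column gets cut down
```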
This is where the textarray element of the Datum object comes in handy, though this whole procedure has its own problems: For example, if your output includes horizontal and vertical rules separating cells, your calculations for those are thrown off by multiple-row-spanning data; pagination also becomes a little more difficult to manage; you still have to reconcile the actual width with the available width and adjust the whole grid accordingly. And of course there are still limits to how well this can work.
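A minimal sketch of the wrapping step follows. It breaks text on spaces into an array of lines that each fit a given width. To keep it self-contained I'm assuming every character is the same width ($charWidth); real code would ask the PDF library for the string width instead. The function name is hypothetical, and the returned array only stands in for something like the textarray element.

```php
<?php
// Break $text into lines no wider than $maxWidth, splitting only on whitespace.
// Assumed width model: each character is $charWidth units wide.
function wrap_to_width(string $text, float $maxWidth, float $charWidth = 5.0): array
{
    $lines = [];
    $current = "";
    foreach (preg_split('/\s+/', trim($text)) as $word) {
        $candidate = ($current === "") ? $word : $current . " " . $word;
        if ($current !== "" && strlen($candidate) * $charWidth > $maxWidth) {
            // Adding this word would overflow, so close the line and start a new one.
            $lines[] = $current;
            $current = $word;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== "") {
        $lines[] = $current;
    }
    return $lines;
}

$rows = wrap_to_width("Shipment delayed pending customs review", 100.0);
```

Note that a single word wider than the column still gets a line to itself and will overflow, which is exactly why the minimum column width matters.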
There is always a point at which, no matter how many adjustments and calculations you do, the actual page width and the available page width just can't be reconciled in a way that makes for pleasing output. I never said the class was perfect, but it does make for pretty rapid development of PDFs where circumstances and display are pretty mundane.
Before I move on, I want to address one more issue with column width calculation. It's not always appropriate to adjust column widths uniformly. Imagine you've got 20 columns to display. Nineteen of them will contain one or two characters of text. The remaining column is a comments column that could contain several hundred characters. Imagine further that your available width is 1000 pixels and that the total of the widths of the widest columns is 1200 (so in our example, 19 of the columns are maybe 10 - 15 pixels apiece and the last column makes up the rest). Let's go through the basic logic we'd use to reconcile the widths.
- Actual width is 200 wider than available width, so we've got to subtract 200 from the overall width.
- Divide 200 by 20 to deduct evenly from all columns. This'll screw up the shorter columns and won't take enough off the wide column. So instead:
- Subtract only from the widest column. But that could force it to break into multiple rows, giving us 19 columns of equal size (and probably too wide for the text they actually contain) and one shortish column that spans, potentially, 5 or 10 rows -- not the most appealing output. So instead:
- Find the minimum possible width of each column based on the longest word unbroken by a space.
- Find the current width of each column.
- While the current column width exceeds the minimum possible width, subtract from each column an amount proportional to its share of the full width until you reach the minimum possible width for the given column.
- Break columns into rows or truncate as needed.
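The proportional approach can be sketched as a loop like the one below. This is my own illustration under stated assumptions, not the class's code: the function name is hypothetical, and I redistribute whatever excess remains after a column hits its minimum across the columns that can still give.

```php
<?php
// Shrink columns toward $available, taking from each column in proportion
// to its width and never going below that column's minimum width.
function shrink_columns(array $widths, array $minWidths, float $available): array
{
    $excess = array_sum($widths) - $available;
    while ($excess > 0.0001) {
        // Total width of the columns that are still above their minimum.
        $shrinkable = 0.0;
        foreach ($widths as $i => $w) {
            if ($w > $minWidths[$i]) {
                $shrinkable += $w;
            }
        }
        if ($shrinkable <= 0) {
            break; // Every column is at its minimum; nothing left to take.
        }
        foreach ($widths as $i => $w) {
            if ($w > $minWidths[$i]) {
                $cut = $excess * $w / $shrinkable;
                $widths[$i] = max($minWidths[$i], $w - $cut);
            }
        }
        $excess = array_sum($widths) - $available;
    }
    return $widths;
}

// 19 narrow columns (12 wide, minimum 11) and one comments column (972 wide, minimum 60).
$widths = array_merge(array_fill(0, 19, 12), [972]);
$mins   = array_merge(array_fill(0, 19, 11), [60]);
$fitted = shrink_columns($widths, $mins, 1000);
```

In this example the narrow columns give up very little before hitting their minimums, and the comments column absorbs nearly all of the 200-unit reduction.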
The result is that columns containing little text will be short and columns containing more text will be wider. We're doing the best we can here to subtract proportionally from each column until it reaches its breaking point. Of course we're still somewhat limited. For example, imagine your available page width was 1000, and you had 20 columns, each of whose longest word was 52 pixels wide (like "supercalifragilisticexpialidocious" or "antidisestablishmentarianism"). The sum of the minimum possible widths in this case is 1040, and there's simply nowhere else to subtract from. The output will go haywire. These limitations apply to PDFs created manually too, of course.
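A quick feasibility check along these lines can tell you up front whether any amount of shrinking will work. Again this is only a sketch: the function names are hypothetical, and I'm assuming a fixed width per character in place of a real string-width call.

```php
<?php
// Minimum width of a column: the width of the longest word, unbroken by
// whitespace, found in any of its cells.
// Assumed width model: each character is $charWidth units wide.
function min_column_width(array $cells, float $charWidth = 5.0): float
{
    $longest = 0;
    foreach ($cells as $cell) {
        foreach (preg_split('/\s+/', trim($cell)) as $word) {
            $longest = max($longest, strlen($word));
        }
    }
    return $longest * $charWidth;
}

// The grid can only fit if the minimum widths alone fit on the page.
function can_fit(array $minWidths, float $available): bool
{
    return array_sum($minWidths) <= $available;
}

$minWidth = min_column_width(["late shipment", "ok"]);
$twentyFit = can_fit(array_fill(0, 20, 52), 1000);   // 20 columns at 52 each
```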
Adding Recurring Elements and Graphs
Our reports tend to have a number of recurring elements, including headers, page numbers, logos, and grid lines. Because PDF generation is page-centric -- that is, because you open a page, do everything you wish to do on that page, close it, and open the next -- it makes sense to add each recurring element to an array of recurring elements of its type and then to print these on each page by developing a function for that purpose that's called during the generation of each page.
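The pattern looks roughly like the sketch below. The class name RecurringElements, its method signatures, and the image path are hypothetical stand-ins, and the "drawing" is just a list of strings so the example runs without a PDF library.

```php
<?php
// Hypothetical sketch: one array per type of recurring element, plus a single
// function that replays them all while a page is open.
class RecurringElements
{
    private $texts = [];
    private $images = [];

    public function add_text(string $text, float $x, float $y): void
    {
        $this->texts[] = ['text' => $text, 'x' => $x, 'y' => $y];
    }

    public function add_image(string $path, float $x, float $y): void
    {
        $this->images[] = ['path' => $path, 'x' => $x, 'y' => $y];
    }

    // Called once during the generation of each page.
    public function print_recurring(): array
    {
        $ops = [];
        foreach ($this->images as $img) {
            $ops[] = "image {$img['path']} at {$img['x']},{$img['y']}";
        }
        foreach ($this->texts as $t) {
            $ops[] = "text '{$t['text']}' at {$t['x']},{$t['y']}";
        }
        return $ops;
    }
}

$recurring = new RecurringElements();
$recurring->add_text("Generated on Jan 01, 2006", 848, 20);
$recurring->add_image("/images/logo.png", 0, 562);   // illustrative path only

$pages = [];
for ($page = 1; $page <= 3; $page++) {
    // Begin the page and draw its share of the grid here, then:
    $pages[$page] = $recurring->print_recurring();
    // End the page.
}
```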
Grid lines are a little different in that they're not added to an array. They're drawn based on the coordinates of the current text element. Nevertheless, we write the code for drawing lines only once and let the looping and the math do the rest.
The reports my company formerly wrote made use of a nifty class that draws graphs on the fly and returns PNG images that can be embedded into PDFs. The images are a little fuzzy, however, and they come with some overhead. So I decided to write my own graph handler that generates cleaner graphs complete with drop shadows and that, even better, interacts nicely with my Datum class.
As with adding recurring images or titles to the pages, you simply call the add_graph() function, passing in the array of Datum objects containing the graph info, and the Output class builds scaled graphs for you on the fly. Though I've got hooks in place that one day may allow for the insertion of graphs at any point within the reports, the behavior now is to add graphs at the end of the PDF. I usually build one array of Datum objects for grid data and another for graph data.
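The scaling itself is simple arithmetic. As an illustration only (this is not the graph handler's code, and scale_bars is a name I've made up), here's how a set of values might be scaled to bar heights that fit a plot area of a given height:

```php
<?php
// Scale raw values so the largest one exactly fills $plotHeight.
// Assumes non-negative values; an all-zero series yields zero-height bars.
function scale_bars(array $values, float $plotHeight): array
{
    if (count($values) === 0) {
        return [];
    }
    $max = max($values);
    if ($max <= 0) {
        return array_fill(0, count($values), 0.0);
    }
    return array_map(fn($v) => $v / $max * $plotHeight, $values);
}

$heights = scale_bars([5, 10, 20], 200.0);
```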
Putting it All Together
I've referred so far to a series of set and get methods used to initialize the Datum class. The Output class has its own such methods, and of course there are a number of other methods that do things like add recurring elements to their arrays and perform repetitive functions such as the printing of these recurring elements. But the real meat of the PDF_Output class is the execute method, whose basic flow is outlined below.
- Open a PDF with dimensions x and y.
- Begin the first page and set the font specifications.
- Print headers if we've got that option set.
- For each Datum object, determine the PDF string width of its text and set that property of the Datum object to the value returned. Also count the number of rows and columns based on the "row" and "col" attributes of the Datum objects.
- Find the minimum width for each column based on the longest single word. Also set the row height for each row to 1.
- Determine the page width difference (available - actual) and distribute among columns proportional to their widths.
- For each Datum object, [1] if we're truncating, truncate text and print text and vertical lines or [2] if we're wrapping, wrap the given text as needed, print vertical lines, and set cursor position to accommodate wrapping, pagination, etc. If our adjusted row count is greater than the number of rows we've determined will fit on the page, print recurring elements, end the current page, start a new page, and reset the cursor to the origin. Print headers if we've got that option set.
- At the end of the loop, end the page and move on to the graph code if any graphs have been set.
A lot more goes on in the code than what might be apparent from this brief outline, but you see the basic idea. One pitfall of the method I've chosen to do my output is that it involves looping through the dataset twice from within the class (and that on top of looping through the results once to build the Datum objects) -- certainly not the most efficient routine, and less so the larger the data set. So convenience of up-front coding using this class is counterbalanced by less than optimal performance.
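To give a feel for the first of those two loops, here's a sketch of a measuring pass. It uses plain arrays with 'text', 'row', and 'col' keys in place of Datum objects, a hypothetical function name, and an assumed fixed character width instead of a real PDF string-width call.

```php
<?php
// First pass: count rows and columns and record the widest text in each column.
// Assumed width model: each character is $charWidth units wide.
function measure_grid(array $cells, float $charWidth = 5.0): array
{
    $rows = 0;
    $cols = 0;
    $widest = [];
    foreach ($cells as $cell) {
        // Row and column indexes are zero-based, so the count is index + 1.
        $rows = max($rows, $cell['row'] + 1);
        $cols = max($cols, $cell['col'] + 1);
        $width = strlen($cell['text']) * $charWidth;
        $widest[$cell['col']] = max($widest[$cell['col']] ?? 0.0, $width);
    }
    ksort($widest);
    return ['rows' => $rows, 'cols' => $cols, 'widest' => $widest];
}

$grid = measure_grid([
    ['text' => 'ab',     'row' => 0, 'col' => 0],
    ['text' => 'abcdef', 'row' => 0, 'col' => 1],
    ['text' => 'abcd',   'row' => 1, 'col' => 0],
    ['text' => 'a',      'row' => 1, 'col' => 1],
]);
```

Only once this pass has finished do you know enough to reconcile the widths, which is why the printing has to wait for a second loop.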
For small-to-mid-sized reports, I find that this is a big time saver that lets me keep my sanity. For larger reports, the wait gets a little tiresome, especially since our larger reports are typically doing multiple complex queries to put the data set together. In one case, I converted an existing complex report to the Output class. In order to compensate for the slower load time, I also converted it from using an ODBC connection to using the Sybase driver and found that the final performance more or less matched the original (pre-Output class) performance.
To give you an idea of what kind of up-front code savings this class can lead to, I'm providing below some code that might be used to generate a simple PDF. Compare this tidy code to the couple thousand lines of repetitive code that are now managed more concisely from within the class. This also gives you a chance to see the black box in action, though as I mentioned at the start, the code's not yet ready to have the inside of the box exposed.
<?php
//Database connection stuff goes here...
$data=array(); // Will hold Datum objects
$rows1=array();
$rowcount=0;
$colcount=0;
//For each row in the result set...
//(odbc_num_rows() returns -1 for SELECT results with many drivers, so we loop on the fetch itself.)
while(odbc_fetch_into($result,$rows1)){
    $colcount=0;
    //For each column in the current row...
    foreach($rows1 as $r){
        //Perform any data validation, e.g. converting dates to a readable format.
        //Also, in this case, set links for every fifth column in every third row.
        if($rowcount % 3 == 0 && $colcount % 5 == 0){
            $link="http://www.somewhere.com";
        }
        else{
            $link="";
        }
        //Create a new Datum object.
        array_push($data, new Datum($r,$link,"",$rowcount,$colcount));
        $colcount++;
    }
    $rowcount++;
}
$fontsize=9; //Assumed value; the original listing uses $fontsize without defining it.
$o=new PDF_Output();
$o->set_link_color(1,0,0);
$o->set_page_height(612);
$o->set_page_width(1008);
$o->set_data($data); //This is where we send the $data array to the object.
$o->set_border(0.1);
$o->set_font_size($fontsize);
$o->set_x_margin(36);
$o->show_page_numbers("Page "); //"Page " here is a prefix for the actual page number and is optional
$o->set_font("Times-Roman");
$date=date('M d, Y');
$o->set_x_spacing(6); //Both horizontal and vertical spacing can be set
$o->set_col_wrap(1);
$o->add_text("Generated on " . $date, $o->get_page_width() - 160, 20, $o->get_font(), 9); //Add text blurb to repeating objects array.
$o->add_image(0,$o->get_page_height()-50,"png","/var/www/navreports/images/poweredby.png", "http://www.somewhere.com",.1); //Add image to repeating objects array.
$pdf=$o->execute(); //Put the returned PDF into a buffer.
$len = strlen($pdf);
//Send the headers before any output, then the PDF itself.
header("Content-Type: application/pdf");
header("Content-Length: $len");
header('Content-Disposition: inline; filename="' . $o->get_filename() . '"');
print $pdf;
?>
More By Daryl Houston