the crazed path of file conversions, pt I

This summer’s project (other than finishing the kitchen, dealing with the septic/plumbing, and completely re-landscaping the entire front yard) was to convert a large archive from a PHP flat-file world into WordPress. Yep, an archive of almost two thousand static HTML-quasi-PHP files.

The first step was creating a development site, where I tested the basics and decided after a bit of comparison that Drupal could go back into its shoebox and freaking stay there, because no way would I ever condemn a non-tech person to using that backend. In fact, the biggest reason for selecting WordPress (outside a few technical reasons) was that its backend is the friendliest out there.

It’s also friendly to developers, I should note, which was the reason that doing the front-end design — while tedious at times, and no small amount of energy-investment — was not actually difficult, per se. No, the difficulty doesn’t come into the picture until after all that is in place and suddenly one looks at the number of files to be converted and starts to despair of ever finishing the damn project.

Out in the big world, there are several ways to mass-import into WordPress. One is to use one of the CSV plugins, which are basically little applications that let you import comma-delimited files created in either a text editor or a spreadsheet. The second most common approach is to reverse-engineer the process: export a post out of WP, then use it as the basis for manually creating an XML file that imports the to-be-added posts back into WP.

The drawback of both systems is not just that WP’s importing skills are limited in terms of categories and tags, but that both the plugins and WP itself don’t always recognize that an imported file has the same category or tag as the existing system, even when the match is identical. For whatever reason, WP just doesn’t always ‘see’ them as the same, and will gleefully create a brand-new category. I discovered this the hard way after doing all the work to get a wallop of extracted information to line up neatly in a test XML file, only to realize that the import had created an entirely new set of ratings, colors, codes, and character tags. It took me nearly three hours to carefully undo the damage done in a five-minute upload, and that’s after several hours of setting up the reverse-engineered XML file in the first place.

Also known as: back to the drawing board.

(This project’s drawing board is on its sixteenth-go-round, I think.)

What makes the conversion significantly more complex is the element of custom taxonomies being used on this site: not just categories and tags, but pairings, notes, warnings, series, books, and groups. It’s certainly true that the site could have been designed without all of these additional taxonomies, but it was a choice I made early on because of the power it affords the design, and because it was the only way to address certain elements that aren’t native to WP (such as forcing WP to allow/use series of posts in a connected format akin to chapters).

But when it comes time to mass-import, WP doesn’t recognize these non-native taxonomies, which means there are only two ways to get that information into a post: either each post must be edited one at a time, or you go in through the back door and keep your fingers crossed you don’t muck up the database.

The first three hundred or so converted posts were added as I tested and designed. Some were multiparters added while I was coaxing WP into ignoring sequential chapters while at the same time exporting the first chapter’s meta down the line to all following chapters — which, once it finally worked, meant that adding chapters required only identifying the author, entering the title and content, selecting an excerpt, and picking a book-tag… and then done. But in the getting-there, a lot of chapters were uploaded as I went through to find stories with some kind of exception, to bullet-proof the CSS and the coding.

More got added in the process of creating back-end metaboxes in the Add New page. These allow users to easily select or enter the meta for each post: when it was archived; the story’s continuity; whether there’s a sequel or prequel; what awards the story has won; any notes (neutral content) and warnings (strong content); whether the story is a collaborative work (group), a collection of related short stories (series), or a sequential work with posts-as-chapters (book); and so on, and so on. The converted short stories are a mish-mash of any exception-laden oneshot I could find: ones that have two prequels, or are a collaborative work, or have more than one author, or use the category-custom-field ties, and all sorts of other back-end crazy goodness that makes the front end look so purty.

If I had some basic familiarity with a post and its meta, uploading was quick: on average, about two to three minutes. That covers time to copy from the original site, paste into WP, select the related details, and save. What slowed me down was when I started to hit stories that were completely unfamiliar to me, which meant cut, paste, and then flip back to the original archive site to collect the meta — a process far from fool-proof. Going back and forth like that, copying the wrong excerpt, or forgetting to also note the story’s rating, or realizing I’d left off the archive date (or put in the wrong one), or just plain tiredness making my memory burn out — okay, the story is humor, gotcha, and then I get back to the draft-post and ask myself, ‘what genre is this? I can’t recall…’

At that point, it dawned on me that what had previously been a quick conversion process had slowed dramatically to fifteen to twenty minutes per post. I’m just no good with constant, endless, routine processes. My brain gets bored, I start to make mistakes, and it doesn’t help when you do the math and realize that with the remaining number of files to convert, it’s going to take the better part of the next ten weeks — assuming I really wanted to sit down, five days a week, eight hours a day, and do nothing but copy, paste, and click. No, thanks.

That made some kind of mass-conversion really, really important, and made figuring out a way around WP’s limitations a Very Important Thing — even if that did mean putting in a week’s worth of work to come up with something myself. Better to spend forty hours now and save four hundred, kthxbai.

As if it’s not enough that WP has issues with importing non-native taxonomies and custom fields, the original archive site presents its own set of difficulties: mainly, there aren’t any single files that could be brought over. That is to say, the information for each file (each separate short story or chapter) is contained in at least two places, sometimes three, and only two or three pieces of it, at most, are in the actual story-file itself.

For instance, on the old site, the visual presentation for an index looks like this:

Picture 21

And the code looks like this:

<?php
$title = "greywing";
include_once("inc/header1.php");
include_once("inc/title1.php");
?>
<p><img src="images/c-green.jpg" alt="code green"> <a href="01fiction/01_greywi_aao.php?title=Apples
and Oranges&author=Greywing&list=01_greywi">Apples
and Oranges</a>
<?php $today = date("Ymd"); $then = date("Ymd", filemtime("01fiction/01_greywi_aao.php"));
$file = $then +14; if ($today < $file ) echo " <img src=\"images/new-l.gif\" alt=\"new story!\">"; ?>
<br>
She didn't know it, but she was smiling.<br>
<b>GA</b> &#149; Hawkeye-centric &#149; added
<?php echo date("m/d/y", filemtime("01fiction/01_greywi_aao.php")); ?>
</p>
<?php
include_once("inc/footer.php");
?>

All the information is contained in each segment: the title, author, story-link, categories, genre, and archive date. What you see on the story’s actual page:

Picture 22

In fact looks like this:

<?php
ini_set('include_path', '.:../inc/');
include_once("header1.php");
// slot for any author's notes
//$notes = "";
include_once("title2.php");
?>
<p> Ed's mission reports were always both good and bad. Good, because the boy
cut his reading teeth on a diet of scientific papers, so his writing style absorbed
all the principles of concise, clear, detailed writing to a greater extent than
most adults Roy knew. ...</p>

<?php
include_once("ficend.php");
?>

And that meant I couldn’t just take a single post, arrange its meta to make WP happy, and then upload. I had to find a way to extract the information from an author’s index page, match it up with the file that contains the story, pull out the information that is located with the story (chapter subtitle, chapter number, author start notes, attribution quotes, any footnotes, and author endnotes), and then put it all together into something that WP will understand.

Well, after I had a drink. Several of them, actually.
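
The first part of that job is less scary than it sounds, because the link in each index entry already carries the title, author, and list code in its query string. A minimal sketch, assuming nothing more than the link format shown in the index code above; `parse_index_link()` is a made-up name for illustration, not a function from the actual converter:

```php
<?php
// Hypothetical helper: split an index-page link into its story file and its meta.
function parse_index_link($href) {
    // The old site let titles wrap across lines, so collapse that whitespace first.
    $href = preg_replace('/\s+/', ' ', trim($href));

    // Everything before the "?" is the story file; everything after is the meta.
    $pieces = explode('?', $href, 2);
    $meta = array();
    if (isset($pieces[1])) {
        parse_str($pieces[1], $meta);   // fills in title, author, list
    }
    $meta['file'] = $pieces[0];
    return $meta;
}

$entry = parse_index_link("01fiction/01_greywi_aao.php?title=Apples\nand Oranges&author=Greywing&list=01_greywi");
echo $entry['file'] . ' | ' . $entry['title'] . ' | ' . $entry['author'] . ' | ' . $entry['list'] . "\n";
```

The `file` value is what ties the index entry to the story file that holds the actual text.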

the first order of business: preparation

No two ways around it, I did end up spending a lot of time in jEdit (a Java-based programming app that works pretty well in the Mac environment), even if most of that time was spent on global find-and-replace. Where things were consistent, I could remove freely; then I’d run the result through the first rudimentary extraction page and discover six other things that needed more straightening out.

Fortunately, the majority of the setup was already in the original site’s design. Each story-entry is marked by a paragraph tag, which meant that could be used as a ‘new story starts here’ marker. Breaking the rest of the entries into consistent parts was a bit harder, because so many of the posts vary. Never by a huge amount, but just enough: in one story, it’s “humor; angst; Alphonse-centric” — notice the semi-colons — and in another it’s commas, and in a third there’s an extra space, and others use bullets. These are near-seamless to an end-user, honestly, but their slight differences have a huge impact when you’re trying to extract that information.

It’s not even something to complain about, so much as to note, because when an archivist is creating/maintaining flat files like these, there’s no reason to stress about comma versus semi-colon, let alone British spelling (“humour”) versus American (“humor”). Common issues in fandom like character names aren’t that big a deal, because readers are versed in knowing that “Lisa” is the same as “Liza” is the same as “Riza”, and the use of a particular name often indicates when the story was created (before or after the official translation).

That’s where jEdit was truly helpful, in its ability to search/replace all open files at the same time — although I did discover the hard way that jEdit seems to have an upper limit of about five hundred or so open files at a time. While it’s true that PHP does have functions that would have allowed me to do effectively the same thing, it was just easier and faster to do it with jEdit than to try and write the code to address every possible permutation prior to running an index.
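
For comparison, here is a rough sketch of what the PHP route could look like for just the genre/category line. It only covers the variations mentioned above (semi-colons, commas, bullets, stray spaces, British spelling); the function name and the one-entry spelling map are invented for the example, and every new permutation would mean another line of code.

```php
<?php
// Hypothetical normalizer for one genre/category line from an index entry.
function normalize_genres($line) {
    // Treat the bullet entity, the bullet character, semi-colons, and commas as one separator.
    $line = str_replace(array('&#149;', '&bull;', '•'), ';', $line);
    $parts = preg_split('/\s*[;,]\s*/', $line);

    // Illustrative spelling map only; a real list would be longer.
    $spelling = array('humour' => 'humor');

    $clean = array();
    foreach ($parts as $part) {
        // Lowercasing is a choice made for this sketch so that matches are case-blind.
        $part = strtolower(trim($part));
        if ($part === '') {
            continue;
        }
        if (isset($spelling[$part])) {
            $part = $spelling[$part];
        }
        $clean[] = $part;
    }
    // Drop repeats and reindex from zero.
    return array_values(array_unique($clean));
}

print_r(normalize_genres('humour;  angst , Alphonse-centric &#149; humor'));
```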

Eventually, over the course of the past few weeks (in between all the other projects!), I finally got most of the story-entries into the following format. Remove all <br> and replace with ##; remove all <a href="www.domain… and replace with ##; remove all <b> and replace with… You get the idea. If it held still and wasn’t a paragraph-tag, it was replaced with ##. (The value of doing this will be explained in a bit, but for now, let’s just say it’s breaking up every story-post into a bunch of chunks of information.) Keep at it, until you end up with something like this:

<p>4##01_tobuis_wire##tobu ishi##tobuis##Wired Wired##01_tobuis_wire##She looked the wire
over for a minute, noting where the insulation had been stripped away for retuning, then tugged it
gently, careful not to pull too hard.##17##humor##ed_winry##01_tobuis_wire

Even at that, you can see quirks remain: the author’s name is repeated, as is the title and the file. Some of the categories I converted into their numerical values (4, 17) while others retain their semantic title (humor, ed_winry). The repetition is because all information in the second page comes through the html link; it made coding a single page cleaner and easier, but it also made for some ugly URLs — and, at this stage, for a lot of duplication.
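
To show where the ## markers are headed, here is a minimal sketch of splitting a prepared index file back into chunks. It assumes only what is described above: a paragraph tag starts each story and ## separates the pieces. The function name is hypothetical, and the sample excerpt is shortened.

```php
<?php
// Hypothetical splitter: one prepared index file in, one array of chunks per story out.
function split_entries($text) {
    $stories = array();
    // Each <p> marks the start of a new story entry.
    foreach (explode('<p>', $text) as $entry) {
        // Excerpts wrap across lines, so flatten the whitespace first.
        $entry = trim(preg_replace('/\s+/', ' ', $entry));
        if ($entry === '') {
            continue;
        }
        $chunks = array_map('trim', explode('##', $entry));
        // Drop the repeats left over from the old link parameters, keeping first-seen order.
        // Caution: this would also merge two different fields that happen to share a value.
        $stories[] = array_values(array_unique($chunks));
    }
    return $stories;
}

$sample = "<p>4##01_tobuis_wire##tobu ishi##tobuis##Wired Wired##01_tobuis_wire##She looked the wire\n"
        . "over for a minute.##17##humor##ed_winry##01_tobuis_wire\n";
$stories = split_entries($sample);
print_r($stories[0]);
```

Whether blanket de-duplication is safe depends on the data. If a rating code and a category code could ever be the same number, it would be better to pick fields out by position instead.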

The individual story pages were easier on one level (having the same start and end, basically) but took a helluva lot more time to get the internal coding straightened out. Some stories had been coded in a Word-based environment, complete with Word-induced styles; some stories had one kind of <hr> marker while another story might use ‘xXx’, centered, instead. There are PHP functions that could strip this down, but those functions don’t strip the styles, nor do they remove non-breaking spaces. The only way to deal with it was to mass find-and-replace and hope no move rendered half the files whacked. The one kink that remains is actually related to the notes in each story file, and those will probably have to be edited manually.
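
For the curious, a sketch of the kind of clean-up pass being described, written in PHP rather than as find-and-replace. This is an illustration under assumptions, not the process actually used: the allowed-tag list, the attribute names, and the function name are all guesses, and a regex like this is only as good as the markup is predictable.

```php
<?php
// Hypothetical clean-up pass for one story body.
function clean_story_html($html) {
    // strip_tags() keeps the allowed tags but leaves their attributes alone,
    // so inline style/class attributes have to be removed separately.
    $html = strip_tags($html, '<p><br><i><em><b><strong><hr>');
    $html = preg_replace('/\s+(style|class|lang)\s*=\s*("[^"]*"|\'[^\']*\')/i', '', $html);

    // Non-breaking spaces, whether written as an entity or as raw UTF-8 bytes.
    $html = str_replace(array('&nbsp;', "\xC2\xA0"), ' ', $html);

    // Turn a paragraph holding only "xXx" into a plain <hr>.
    $html = preg_replace('/<p[^>]*>\s*xXx\s*<\/p>/i', '<hr>', $html);

    // Squeeze runs of spaces and tabs left behind by the steps above.
    return trim(preg_replace('/[ \t]+/', ' ', $html));
}

echo clean_story_html('<p class="MsoNormal" style="margin:0in">Ed&nbsp;sighed.</p><p align="center">xXx</p><span>End.</span>') . "\n";
```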

But in the end, I had forty-five index files and twelve hundred story files, ready — mostly — to go. (This doesn’t include two more sets of authors and stories, which are hosted separately on the original archive — guest authors and stories entered into archive competitions. I decided to do the biggest chunk first, since the largest hurdle isn’t the act of conversion but creating the process to get it done. So, yes, there are somewhere between two hundred and four hundred files remaining, after these. I’m trying not to think about it.)

the second order of business: extraction

If preparing the files sounds like a hassle (and it was), it was still a relatively quick and somewhat painless hassle. The real headache was in coming up with a reliable and consistent way to pull the information out of the text files and get them into any kind of holding spot — the database, in this case — where the information could then be sorted properly and readied for importing.

Eventually I managed to break it down into four major steps: a Start page for selecting the author index to extract, an index page for reviewing and identifying the auto-processed fields and values, a two-part proc-page (one for review to confirm entry, the second to actually enter it into the database), and an upload page (which actually contains three processes on its own).

Ignoring the rough drafts, this is what the Start page looks like currently.

Picture 18

In the code, I ended up using an amalgamation of WP-native functions, a few make-do functions that coax my hosting service’s PHP4 into doing the work of PHP5, and some direct connections to the database (which is normally frowned upon in WP, because it’s not translatable, but that wasn’t an issue in this case).

First step is the form that pulls up available author-indices, and a native WP function to list all registered users. Normally I don’t like to use GET, and prefer the cleaner POST, but using method="get" makes it a lot easier to trouble-shoot, as well as to manually change a destination by tweaking the URL instead of going back into the code and messing with things.

<b>START</b></p><p>
<form action="main" method="get">
<input type="hidden" name="start" value="A">

<?php
// manual array of all author-index text files
$authos = array('01_murder.txt', '01_mirabe.txt', '01_vikki.txt', '01_velvet.txt', '01_veatar.txt', '01_tobuis.txt', '01_spinny.txt');
?>
<select name="autho">
<?php foreach ($authos as $auth) { ?>
<option value="<?php echo $auth; ?>"><?php echo $auth; ?></option>
<?php } ?>

</select> &nbsp;  &nbsp;  &nbsp;

<?php
// regular WP drop-down turned into select option
wp_dropdown_users('show_option_none=');
?>
<!-- set $offset variable & carry over -->
<input type="hidden" name="offset" value="1">
<input type="submit" name="submit" value="continue" /> &nbsp; &nbsp;
<input type="reset" value="clear">
</form>
<br><br>
<hr>
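
To make the GET point concrete, here is a small sketch of the URL that form would produce and how the receiving page could read it back. The user ID of 3 is made up, and the `user` field name assumes `wp_dropdown_users()` is left on its default select name.

```php
<?php
// Rebuild the query string the START form would send (values are examples only).
$fields = array(
    'start'  => 'A',
    'autho'  => '01_tobuis.txt',
    'user'   => 3,        // made-up user ID
    'offset' => 1,
    'submit' => 'continue',
);
$url = 'main?' . http_build_query($fields);
echo $url . "\n";

// On the receiving page the same values come back out of the query string.
// (In a real request they would already be sitting in $_GET.)
parse_str(substr($url, strpos($url, '?') + 1), $incoming);
$offset = isset($incoming['offset']) ? (int) $incoming['offset'] : 1;
echo 'entry to process: ' . $offset . "\n";
```

Changing `offset=1` to some other number in the address bar is all it takes to jump straight to a different entry while trouble-shooting, which is the whole appeal.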

The drawback of the manually-created array is that authors don’t drop off the list once they’ve been converted, but whatever. Going for quick and dirty (or at least trying).

The next section is for dealing with the information after it’s all logged into the database. I considered having the entire action a series of unbroken steps, but decided against it. Just in terms of process, it’s easier to have a mindset of extraction and looking for those details, and do it all before moving onto importing. Again, using WP native functions to turn the author lists into a select-option.

<b>CONVERT</b></p><p>
<form action="upload" method="get">
<select name="user"><option value=""></option>
<?php
// declare globals
global $wpdb, $table_prefix, $tempspot;
$tempspot = $wpdb->prefix.'temptable';

$doneuser = $wpdb->get_results("SELECT DISTINCT user FROM $tempspot WHERE filenokey='title' AND conv='1'");
foreach ($doneuser as $user) {
$user = $user->user;
$author = get_userdata($user);
echo '<option value="'.$user.'">'.$author->display_name.'</option>';}
?>

</select>
<input type="submit" name="submit" value="convert" />
</form>
<hr>

Yes, my semantics get a little crazy at times. It’s variables; it doesn’t really matter what it’s called, most of the time, so these are often meaningless names, if repetitious ones. Besides, it gets hard to think up good names when it’s late and your brain is sunburned.

The most important pattern is based on the $offset variable: that number starts at 1, and with each succeeding paragraph-tag in an author’s index, the $offset value goes up by one. By adding this to the author’s name, it creates a semi-unique value that designates a distinct story, which I’ve called $fileno (for ‘file number’, not for ‘not-file’, mmkay?). Thus, the key and value related to $fileno become $filenokey and $filenovalue.
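
A sketch of that pattern in isolation, with the database left out. The exact way the offset gets joined to the author’s name isn’t spelled out here, so the `author_offset` format below is a guess at the idea rather than the real scheme, and `build_rows()` and the sample stories are invented for the example.

```php
<?php
// Hypothetical: build the key/value rows for one author's index.
// The "author_offset" format for $fileno is an assumption made for this sketch.
function build_rows($author, $entries) {
    $rows = array();
    $offset = 1;   // starts at 1, goes up by one with each <p> entry
    foreach ($entries as $fields) {
        $fileno = $author . '_' . $offset;
        foreach ($fields as $filenokey => $filenovalue) {
            $rows[] = array(
                'fileno'      => $fileno,
                'filenokey'   => $filenokey,
                'filenovalue' => $filenovalue,
            );
        }
        $offset++;
    }
    return $rows;
}

$rows = build_rows('tobuis', array(
    array('title' => 'Wired Wired', 'genre' => 'humor'),
    array('title' => 'Second Story', 'genre' => 'angst'),   // made-up second entry
));
foreach ($rows as $row) {
    echo $row['fileno'] . ' ' . $row['filenokey'] . ' = ' . $row['filenovalue'] . "\n";
}
```

Storing one key/value pair per row means a story with three notes and a story with none fit in the same table without adding columns.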

I didn’t bother setting up cross-field uniqueness, so it’s entirely possible to have two different fields with $filenokey equal to ‘title’ and both attached to the same $fileno value. I figured I could catch those in the prep-for-conversion stage, and in the meantime, just use SELECT DISTINCT to prevent duplicates when drawing out of the temporary db table.
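
One caveat worth keeping in mind: SELECT DISTINCT only collapses rows that are identical in every selected column, so two ‘title’ rows with different values for the same $fileno would both come through. A minimal sketch of the kind of check that could catch those in the prep stage, assuming rows shaped like the key/value fields described above (the function name and sample data are hypothetical):

```php
<?php
// Hypothetical check: list every fileno/key pair that carries more than one distinct value.
function find_conflicts($rows) {
    $seen = array();
    foreach ($rows as $row) {
        // Identical repeats land on the same slot, much as SELECT DISTINCT would merge them.
        $seen[$row['fileno']][$row['filenokey']][$row['filenovalue']] = true;
    }
    $conflicts = array();
    foreach ($seen as $fileno => $keys) {
        foreach ($keys as $key => $values) {
            if (count($values) > 1) {
                $conflicts[] = $fileno . ':' . $key;
            }
        }
    }
    return $conflicts;
}

$rows = array(
    array('fileno' => 'tobuis_1', 'filenokey' => 'title', 'filenovalue' => 'Wired Wired'),
    array('fileno' => 'tobuis_1', 'filenokey' => 'title', 'filenovalue' => 'Wired Wired'),  // harmless repeat
    array('fileno' => 'tobuis_2', 'filenokey' => 'title', 'filenovalue' => 'Draft Title'),
    array('fileno' => 'tobuis_2', 'filenokey' => 'title', 'filenovalue' => 'Final Title'),  // real conflict
);
print_r(find_conflicts($rows));
```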

The last section just lists all authors already found, as a visual reminder when selecting from the drop-down list in the top section. That’s mostly just because it’s helpful for knowing which text files I could ignore when I picked the next author to prep.

<?php
$doneuser = $wpdb->get_results("SELECT DISTINCT user FROM $tempspot WHERE filenokey='title' AND conv='1'");
foreach ($doneuser as $user) {
$user = $user->user;
$author = get_userdata($user);
echo $author->display_name.'<br>';}
?>

Next, the Index page and the extraction process.
