the crazed path of file conversions: pt II

From the top, on the index page, I select a file and an author. Because the files don’t have consistent author-naming in their headers, and I didn’t want to bother with making sure all 45 did, it was just easier to programmatically insert the author name via the second drop-down list. What that really gets me is the user_ID for each author, which then becomes another identifier in the database.

start the extraction engines

With those two values, and all the prep-work, I get a page that looks something like this:

[Screenshot: Picture 23]

At the very top of the page is the display_name, which is WP’s version of “the pretty name you see on the end-user side of things”. That’s followed by the actual file name getting extracted, the author’s user_ID, and the offset number. Then it moves into the meat of the page, starting with running through all stories already converted and comparing those titles to all the fields (that is, all chunks of information separated by double hashmarks).

It’s not a perfect system; I used Velvet as an example here because you can see one of the converted titles is EdT3, and that down in the fields there’s one with the value of 3, which has nothing to do with a title but is identified as a ‘match’, unfortunately. I had originally hidden the rest of the page when a match was made, until I realized it had skipped past every story-listing that had a category equal to 3 because of this miscomparison. Regardless, the program gives the option of skipping an entry if it appears to be a match with an already-converted story title.
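The false match is easy to reproduce. Here's a minimal Python sketch (not the converter's own PHP), assuming the check is a plain substring test of each field value against each converted title. Both function names are hypothetical.

```python
def loose_match(value, converted_titles):
    """Substring test: flags the value if it appears anywhere inside a title."""
    return any(value in title for title in converted_titles)


def exact_match(value, converted_titles):
    """Whole-string test, ignoring case and surrounding whitespace."""
    wanted = value.strip().lower()
    return any(wanted == title.strip().lower() for title in converted_titles)


converted = ["EdT3"]                    # a title already in the database
print(loose_match("3", converted))      # True: the category number is wrongly flagged
print(exact_match("3", converted))      # False: no longer flagged
print(exact_match("edt3", converted))   # True: a real title still matches
```

A whole-string comparison like this would stop a bare category number from tripping the warning, at the cost of missing titles that were retyped slightly differently.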

Then it lists the converted titles for me, for my own recollection and comparison, to make sure the story-is-converted message is not in error. The category numbers on the following line are for my own reference, and are most useful in those cases where, for whatever reason, any of the categories got dropped, such as when there’s a line like the following:

<p>##3##
<p>##5##Story Title##file.php##

It’s not that often, but enough that I’ve realized I just have to check for it, each time; the pattern used in the original setup may not be the best for a one-to-one conversion (such as a definitive order of things), but once I got used to it, I could predict somewhat accurately if a category-designation was missing, and insert it based on reasonable guesswork.

There are two parts to the fields: the field name, and the value. They’re all input boxes, because at least once every five or six entries there’ll be something off, that needs editing. Putting it all in input boxes just made it easier to do this, without having to go back to the original text file.

The final blank lines are because I soon realized that sometimes I need to add information, or break an existing field-value combination into two or three distinct fields (i.e., when a value is something like “humor, parody” and must be edited to be “humor”, with “parody” added as a new term in one of the empty boxes).

On the surface, I guess it looks pretty straightforward: a bunch of chunks separated by ##, and the little application makes reasonable guesses as to what each value might be: a title, a term (category, tag, note, pair, etc: these are all ‘terms’ in WP-lingo, really), an excerpt, a file, and so on. On the backend, though, getting it to do all this means a whole tonne of moving parts.
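To make the "chunks separated by ##" idea concrete, here's a small Python sketch of the two-level split. It is an illustration only, not the converter's PHP; the function name `split_index` is made up, and it assumes anything before the first <p> is header material that can be thrown away.

```python
import re


def split_index(text):
    """Split an author-index file into <p> chunks, then each chunk into ## fields."""
    chunks = []
    # Assumption: whatever precedes the first <p> is not a story entry, so drop it.
    for raw in re.split(r"<p>", text, flags=re.IGNORECASE)[1:]:
        fields = [field.strip() for field in raw.split("##")]
        chunks.append([field for field in fields if field])  # discard empty fragments
    return chunks


sample = "<p>##3##\n<p>##5##Story Title##file.php##\n"
for number, fields in enumerate(split_index(sample), start=1):
    print(number, fields)
# 1 ['3']
# 2 ['5', 'Story Title', 'file.php']
```

Each inner list is one entry's worth of values, still unlabelled; deciding which one is the title, the file, or a term is the hard part.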

For those of you curious (hi, Mom!), here’s a basic introduction to what a loop is, and what you can do with it, and how the $offset value ties into that.

for each of A, do B, and keep doing it

If there’s any one thing I am good at, it’s logic, and it’s a damn good thing programming is nothing more than glorified logic — even if sometimes it does remind me of taking Logic 101 and hitting the final question on the final exam and not being entirely certain how to prove the point… so when I stalled out, I’d just refer back to a previous line and try a different route. (In the end, it took me eighty-two steps to prove; when I asked the professor later whether I’d gotten the answer right, he said he’d lost track around line fifty and gave me the ten points anyway, on the grounds that even if I hadn’t gotten it, I’d damn well tried above and beyond the call of duty. This probably says a lot about me.)

With that in mind, working out the logic for extracting ends up going something like this:

  • Open the author-index text file.
  • Break the index each time there’s a <p> tag.
  • Treat each chunk of text between <p> tags as a separate body of information.
  • Number each chunk so it can be distinguished from the rest on the page: chunk #1, chunk #2, etc.
  • Now, break that chunk into another set of mini-chunks, with each one starting where there’s a ##.
  • Take each of those mini-chunks (aka values), and for each value, do the following:
    1. Does the value match the author’s name? If so, skip it, because it doesn’t need to be displayed. (We already have that info.)
    2. Does the value match a title of a story by that author, that’s already been converted? If so, skip it as well.
    3. Does the value start with “ep ” or with “ch ”? That means it’s a spoiler warning, so identify this value as being SP, for spoiler.
      1. If the value starts with “ep ”, then it’s from the broadcast, so add a key/value set that identifies the story as being from the television show.
      2. If the value starts with “ch ”, then it means the story’s continuity is from the original story, so add a key/tag for this, too.
    4. If the value is simply “movie”, then tag it with “SP” for movie spoilers, and add a key/tag at the end that identifies the continuity as ‘movie’.**
    5. Does the value start with “DF”? That means it’s a Divergent Future warning, so tag this value with a key of DF.
    6. Does the value end with “/04”? That means its date was hard-coded prior to the site’s redesign in ’05, so tag this value as the archived date (aka ‘whenarc’).
    7. Does the value start with “01_”? That means it’s a file… and that requires some mini-steps to figure out how to deal with that:
      1. Does the value start with “01_” and end with a numerical value? Then it’s the first of a series of files, which means this story is a multi-parter. Take all the files that have the “01_ … #” pattern, and sort them. Now, for each of those, tag it with a key that has the same number as the part: so we get file1 for “01_story_01”, and file2 for “01_story_02”, and so on.
      2. If there’s at least one instance of this “01_… #” pattern, count up the total number. If the total is less than 11, create a new key and value that identify the story as a short multipart. If the total is more than 10 but less than 25, add a key and value that identify the story as a medium-length multipart. If the total is more than 24, the key and value should say the story is a long multipart.*
      3. Does the value start with “01_” but not end with a numerical value? Then it’s a single file linked to this mini-chunk, which means it’s a short story. Tag this value as simply “file”, and add an additional key-value of ‘short story’.
    8. How long is the value? Count the number of characters. If it’s over 100, it’s probably the excerpt. Identify this as ‘excerpt’, and make the input field into a textarea large enough to show the entire excerpt.
    9. Does the value match one of the items in the list of ratings categories? Then it’s a term (category). Mark it as term, and append a counting-number to that. Then up the count by one so it’s ready for the next term that comes along.
    10. Does the value contain a comma or a semi-colon? Then it’s probably a list of meta-information. Turn the value into an array and break it apart into its items, as separated by either comma or semi-colon, and treat each one as a separate value — and tag each as being a ‘term’. Also append that counting-number to these, as well. (But list the original value anyway, as a distinct field/value combination, just in case.)
    11. If the value is simply “post-series” or “pre-series”, then tag it as “timeline”.
    12. Has the value already been logged in the database as a title? If so, tag this value as ‘prev’, to indicate it’s a prequel.
    13. List all values with tags, adjusting for those values with a character count over 100 (as text areas).
    14. List all values that didn’t get tagged, putting in a select-box instead, with the most-common tags remaining (term, title, series, etc).
    15. List all values that are automatic adds (story length, continuity), along with identified tags.
    16. Add a few blank boxes at the end, both drop-down and text-box options, for last-minute additions.***

*I warned you about the number of custom fields, didn’t I?

** Also, sometimes I really hate fandom, especially multi-continuity fandoms. (This is why I never got into Batman, and we won’t even mention Star Trek here.)

*** Because for all this, it still happens.

Yes, that’s a whole bunch of nested foreaches. Fortunately, I was pretty much raised on flowcharts — when you’re a second-grader and your post-grad father is telling you to flowchart what you do between waking up and going to school, it pretty much becomes ingrained — and I didn’t do a formal flowchart for this, since the pseudo-code worked well enough. Besides, I think in flow-charts, really. I can’t not. (Hi, Dad!)
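For anyone who'd rather read code than a numbered list, here's a rough Python sketch of the per-value checks, in the same order as the steps. It is not the converter's PHP. The `RATINGS` set, the function names, and the length labels are all placeholders, and it leaves out the continuity key/value adds, the counting-number on terms, and the ‘prev’ check.

```python
import re

RATINGS = {"g", "pg", "pg-13", "r"}  # placeholder: the real ratings list isn't given


def classify(value, author, converted_titles):
    """Return a tag for one ## value, "skip" to hide it, or None if untagged."""
    v = value.strip()
    low = v.lower()
    if low == author.lower() or v in converted_titles:
        return "skip"                                   # steps 1-2
    if low.startswith(("ep ", "ch ")) or low == "movie":
        return "SP"                                     # steps 3-4
    if v.startswith("DF"):
        return "DF"                                     # step 5
    if v.endswith("/04"):
        return "whenarc"                                # step 6
    if v.startswith("01_"):
        return "file"                                   # step 7
    if len(v) > 100:
        return "excerpt"                                # step 8
    if low in RATINGS or re.search(r"[,;]", v):
        return "term"                                   # steps 9-10
    if low in ("post-series", "pre-series"):
        return "timeline"                               # step 11
    return None                                         # left for the select-box


def tag_files(file_values):
    """Number the multipart files and pick a length label (steps 7.1 and 7.2)."""
    parts = sorted(f for f in file_values if re.match(r"01_.*\d$", f))
    if not parts:
        return {}, None
    keys = {f"file{n}": name for n, name in enumerate(parts, start=1)}
    total = len(parts)
    if total < 11:
        length = "short multipart"
    elif total < 25:
        length = "medium multipart"
    else:
        length = "long multipart"
    return keys, length


print(classify("ep 12", "Velvet", []))           # SP
print(classify("humor, parody", "Velvet", []))   # term
print(tag_files(["01_story_02", "01_story_01"]))
```

The order of the checks matters: because the length test comes before the comma test, a long excerpt full of commas is tagged as the excerpt rather than being split into terms.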

numbering these puppies

Now, a foreach command is a great thing: it loops through each instance, performing functions as it’s told, and when it gets to the bottom it goes back to the top and starts over. Yes, just like the Dylan song. The problem here is that some of these authors have sixty, seventy, over a hundred instances of those <p> tags in their index files.

Note that I didn’t streamline or compress any of this code. I didn’t see the need, honestly, given this is a one-time thing (I sure as hell don’t plan on doing this again anytime soon!). The price paid is that between the number of moving parts and the number of database-connections, it’s not visibly slow but it’s not instantaneous, either. Doing it once, a tiny lag. Doing it a hundred times on one page… wow, we could be here for a while.

So that meant figuring out a way to prevent the page from loading everything at once. There are three ways to halt a foreach loop: you break it, which ends it completely; or you tell it to ‘continue’ (skip), and it loops invisibly until it ends; or you set it to sleep, which only pauses it. I’m sure there’s got to be a way to set it to sleep and have it ‘wake’ at a key-press, but I suspect all that really does is just about the same thing I came up with, anyway. Roughly.

This is where the $offset and $p_count variables come in. We open the page bringing in a value for $offset of 1, and on the page we set $p_count as 1. Then we start the loop, and compare the two.

The first time around, $offset is not less than $p_count, so we don’t skip the entry (aka, ‘continue’ to the next <p> marker and do a new loop). The two are equal, so the loop runs and you get a page that looks like the last screenshot. At the end, though, it checks and if $offset is equal to $p_count (which it is, in this case), it breaks the loop. The page ends right there, closes up with whatever code remains, and the page is done and awaiting edit/entry/response.

Now, when I click on “submit” and it carries my keys, values, and edits into the next page, $offset goes with me. On the next page, $offset is incremented by 1, and now it’s equal to 2. When I’ve approved all the tags and entries for this chunk, the application returns me to the index page, along with my newly-increased $offset value.

That means, as the page loads and the loop starts all over again, it compares $offset (equal to 2) to $p_count (equal to 1, because it’s the first iteration of the loop). But $offset is greater, so that means the iteration with a $p_count of 1 should be skipped. It adds 1 to $p_count and starts over. If I came back to this page with an $offset of 10, the page would go through the loop nine times: and each time it would compare its on-page count — that’s $p_count — to the base variable — that’s $offset — and it keeps doing so until the two are equal.
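The skip-then-stop dance described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the PHP on the page; `chunk_to_show` is a made-up name, and `offset` and `p_count` stand in for $offset and $p_count.

```python
def chunk_to_show(chunks, offset):
    """Skip chunks until the on-page count reaches the offset, show one, then stop."""
    p_count = 1
    shown = None
    for chunk in chunks:
        if offset > p_count:
            p_count += 1
            continue          # already handled on an earlier visit: skip it
        shown = chunk         # not skipped, so this chunk gets displayed
        if offset == p_count:
            break             # stop here so the rest of the file never loads
        p_count += 1
    return p_count, shown


chunks = ["first story", "second story", "third story"]
print(chunk_to_show(chunks, 1))   # (1, 'first story')
print(chunk_to_show(chunks, 3))   # (3, 'third story')
```

Only one chunk is ever processed in full per page load; the skipped ones cost a comparison and nothing more.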

This lets me do two things. One, I can back up. There are a number of author-index pages where the coding pattern was altered to indicate a series of stories — not chapters of a book, but a bunch of stories in what’s often called an ‘arc’. These aren’t marked with <p> counters, I realized, but <li>. It’s easy to realize when I’ve stumbled over one of those: I get a single page with something like twenty or more values, and what looks like six titles and it just keeps running down the page — because the chunk doesn’t end until a <p> shows up.

That’s where the field called FILENO comes in. It’s both a leftover from debugging, and a way to create multiple $fileno from a single $p_count. It’s made up of the author’s name plus the $p_count, but it can be adjusted manually.

[Screenshot: Picture 23b]

When I get to a page with multiple entries inside a single <p> chunk, it slows things down but at least it no longer halts me. I do have to go back to the original site to determine which title, excerpt, and information goes together (using the sort function to list all numerical values together wrecks the chances of being able to say the first X values are likely from story #1, and the next are from story #2 — because those have been rearranged). The positive is that most often, authors tend to use the same terms (rating, characters, warnings, etc) for all stories in an arc, with some minor variation. Once I get it all straightened out, I manually change $fileno by appending a letter on the end: a for the first, b for the second, and so on. Because $fileno is set all over again at the top of this page, doing this doesn’t have any downstream ill-effects.

When I get to the final send-to-database page, I can send and continue, or send and repeat. The first option ups $offset by one, and will display the next <p> chunk. But in this case, I don’t want $offset going up, because I need to see ‘velvet-mace5’ a second time. So the ‘send and repeat’ option subtracts from $offset, the page loads and runs until $p_count is equal to the corrected $offset, I add a ‘b’ at the end of $fileno, and I repeat this until all sub-stories have been recorded in the database.
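Here's a Python sketch of the two send options and the lettered $fileno, under a couple of assumptions: that the net effect of ‘send and repeat’ is to land back on the same chunk, and that $fileno is simply the author's name with $p_count stuck on the end. Both function names are hypothetical, and in the real tool the letter is typed in by hand.

```python
import string


def next_offset(offset, action):
    """Return the offset carried back to the index page after a submit."""
    if action == "continue":
        return offset + 1     # move on to the next <p> chunk
    if action == "repeat":
        return offset         # assumed net effect: same chunk again
    raise ValueError(f"unknown action: {action}")


def fileno(author_slug, p_count, sub_story=None):
    """Build 'velvet-mace5', or 'velvet-mace5b' for the second sub-story."""
    suffix = "" if sub_story is None else string.ascii_lowercase[sub_story]
    return f"{author_slug}{p_count}{suffix}"


print(next_offset(5, "repeat"))       # 5
print(fileno("velvet-mace", 5))       # velvet-mace5
print(fileno("velvet-mace", 5, 1))    # velvet-mace5b
```

Because the identifier is rebuilt from the author and the count on every page load, a hand-added letter only affects the one record being saved.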

Next up, the actual code behind all this.
