First Data

When I first typed the title “First Data” above, it came out as “First Date”. I almost kept that in, because in some ways diving into our data set for the first time is exactly like getting to know someone - there is some excitement, some trepidation, some sense of not wanting to move too fast, or to screw anything up.

Yesterday, thanks to some help from the great Rodrigo Baeza, we were able to look at our master list for the first time. If you’re just joining us: in our attempt to sample two per cent of all “American comic books” published between 1934 and 2014, we have worked out a preliminary definition. Something is an “American comic book” if it was published in the United States and if it appears in at least two of these three sources: the Grand Comics Database (GCD); the Overstreet Price Guide; and mycomicshop.com. For the underground period, we are also using Jay Kennedy’s price guide, since Overstreet intentionally omits underground comix.
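For anyone who prefers to see the rule written out, here is a minimal sketch of the two-of-three test. The function name and its boolean arguments are made up for illustration, and the sketch deliberately leaves out the underground period, since how Kennedy’s guide will be counted there is not settled above.

```python
def is_american_comic_book(published_in_us, in_gcd, in_overstreet, in_mycomicshop):
    """Preliminary rule: US-published and listed in at least two of three sources.

    All argument names are hypothetical; each is True or False for one item.
    Jay Kennedy's guide (underground period) is not modelled here.
    """
    sources_listing_it = sum([bool(in_gcd), bool(in_overstreet), bool(in_mycomicshop)])
    return bool(published_in_us) and sources_listing_it >= 2


# An item in the GCD and mycomicshop.com but not Overstreet still qualifies.
print(is_american_comic_book(True, in_gcd=True, in_overstreet=False, in_mycomicshop=True))
```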

So far, so good.

Our first step has been to take the MySQL data dump from the GCD and sort it in Excel. This is primitive, to be sure, but all we are interested in at this point is making a series of eighty-one annual lists that can be cross-checked against the other reference sources and then randomized.
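We are doing this by hand in Excel, but the same steps could be sketched in a few lines of code. What follows is only an illustration under some assumptions: each record is a (title, year) pair with the year set to None when it is missing, the two per cent is drawn separately from each year, and the sample size is rounded to the nearest whole number. None of those choices is fixed yet, and the function names are invented for the example.

```python
import random

FIRST_YEAR, LAST_YEAR = 1934, 2014  # eighty-one years, inclusive


def annual_lists(records):
    """Split (title, year) pairs into one list per year, plus an undated pile.

    Items dated before 1934 or after 2014 are dropped.
    """
    by_year = {year: [] for year in range(FIRST_YEAR, LAST_YEAR + 1)}
    undated = []
    for title, year in records:
        if year is None:
            undated.append(title)
        elif FIRST_YEAR <= year <= LAST_YEAR:
            by_year[year].append(title)
    return by_year, undated


def sample_year(titles, fraction=0.02, seed=0):
    """Shuffle one year's list and keep the first `fraction` of it.

    The fixed seed is an assumption, used so the draw can be repeated.
    """
    rng = random.Random(seed)
    shuffled = list(titles)
    rng.shuffle(shuffled)
    return shuffled[:round(len(shuffled) * fraction)]
```

Keeping the undated items in their own pile mirrors what we did in the spreadsheet: they cannot be assigned to a year until someone looks them up.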

At first glance, our data dump includes 293,381 items. Some of these were cut immediately: things published before 1934, and things published in 2015 (if we do add 2015, it will be after the year has concluded). Simple. 

Sorting by year and moving comics to individual pages was also a cut-and-dried process for the most part. Interestingly, just moving all those issues electronically provides a good historical overview of the rises and falls in publishing volume over time. We’ll have more to share on this point when we get precise data.

The first big stumbling block came when I looked at all of the comics that have no dates attached to them in the GCD. That number was much higher than I anticipated: 69,630 items, or more than twenty-three per cent of everything indexed, lack this basic piece of information.

What to do? Well, the obvious answer is that we need to generate those dates by hand ourselves. Starting at the first entry, I began typing titles into the mycomicshop.com database to see what is there. Fortunately, quite a lot is there, including good dating information. An added bonus is that this has already demonstrated that mycomicshop.com is not simply taking data from GCD and duplicating it, which was a back-of-the-mind worry. The two are independent of each other, so they can be used to corroborate each other. Items with dates in mycomicshop.com are being added to our master list, and items that are absent are being flagged as absent so that they can be checked against Overstreet for inclusion or exclusion. The bad news? It’s a time-consuming and dull process. I checked about 300 records in an hour. At that pace, it will take a research assistant six and a half weeks of full-time work just to put dates on entries that are missing them, and even then we will need to check the validity of the 230,000 comics that already had dates. It will be a long summer.
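For the curious, the back-of-the-envelope arithmetic behind those figures runs roughly as follows. The 35-hour working week is an assumption made for the sketch; the other numbers come straight from the counts above.

```python
TOTAL_ITEMS = 293_381      # items in the GCD data dump
UNDATED = 69_630           # items with no date attached
RECORDS_PER_HOUR = 300     # roughly what I managed in one hour
HOURS_PER_WEEK = 35        # assumption: length of a full-time week

undated_share = UNDATED / TOTAL_ITEMS
hours = UNDATED / RECORDS_PER_HOUR
weeks = hours / HOURS_PER_WEEK

print(f"{undated_share:.1%} undated, {hours:.0f} hours, {weeks:.1f} weeks")
```

With a 40-hour week the same calculation comes out a little under six weeks, so the estimate is sensitive to that assumption but stays in the same range.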

One other issue has already raised its head: GCD indexes some material that instinctively seems not to be comics, like Toyfare Magazine and Amazing Heroes, and mycomicshop.com also lists it for sale. We may need to modify our search parameters based on the rule GCD states for such series: “The individual issues of this series are each less than 50% comics. Only comics sequences are indexed and cover scans are accepted only if the issue has 10% indexed comics content.” This would mean restricting our searches to items that are “cover scan eligible” from GCD. Thoughts?