Snapshot 20120923 - Storage and Processing

< previously, I talk about source documents >

As part of the continuing snapshot series, I wanted to talk some more about the filing and processing. Again, this is really boring detail, but I feel it should be here to increase the transparency of my data process, because the original source documents are a very important foundation. You don't have to read it either; it helps me focus, and I want a diary of this project. One of the reasons I tackle big projects like this is to challenge myself and push my skills envelope, and this has been no exception. On the software side I have taught myself polymorphism (a programming concept) as well as how to tune and manage really large databases.
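For the curious, here's a minimal sketch of the kind of polymorphism I mean - one parser class per source document type, all behind a common interface. Every name in it is illustrative rather than my actual project code:

```python
# A minimal sketch of the polymorphism idea: each source document type
# gets its own parser class behind a common interface. Class and method
# names here are hypothetical, not the actual project code.

class DocumentParser:
    """Common interface that every source-specific parser implements."""
    def parse(self, text: str) -> list[dict]:
        raise NotImplementedError

class SLVParser(DocumentParser):
    def parse(self, text: str) -> list[dict]:
        # The SLV bar list has its own layout; parse accordingly.
        return [{"source": "SLV", "line": line}
                for line in text.splitlines() if line.strip()]

class GLDParser(DocumentParser):
    def parse(self, text: str) -> list[dict]:
        return [{"source": "GLD", "line": line}
                for line in text.splitlines() if line.strip()]

def load(parser: DocumentParser, text: str) -> list[dict]:
    # The loading code doesn't care which concrete parser it was given.
    return parser.parse(text)
```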

Currently the source documents (every source) are downloaded every day, as described in the last snapshot. Source documents which are considered duplicates are not stored. The disk space is filling up pretty quickly: there's about 22 GB of files in the cloud, with nearly half of that being the extracted text files (which I also keep in case I have to re-process them). I may need to start compressing the SLV files simply because there's so many of the things (about 5 GB of PDF files). Locally, though, the additional storage required is no real strain on available disk space - the main database files including indexes take up less than 50 GB, plus double that again for the transaction log.
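To give a flavour of the duplicate check, here's a rough sketch of one way it could be done, using a content hash. The hashing approach, the paths and the names are illustrative assumptions, not necessarily how my own scripts do it:

```python
# A sketch of one way to skip duplicate source documents: hash the file
# contents and only keep files whose hash hasn't been seen before.
# In a real pipeline the set of seen hashes would be persisted somewhere.

import hashlib
from pathlib import Path

seen_hashes: set[str] = set()

def is_duplicate(path: Path) -> bool:
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

for pdf in sorted(Path("downloads").glob("*.pdf")):  # hypothetical folder
    if is_duplicate(pdf):
        pdf.unlink()  # duplicates are not stored
```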

The PDF text extraction takes the longest - particularly the SLV document, which is nearly 5,000 pages long. Once an SLV document has been extracted, the bulk upload into the database can take up to 10 minutes per document. Generally that runs in the background, so it doesn't matter how long it takes.
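In miniature, the extract-then-load step looks something like this. It shells out to the poppler pdftotext tool and uses sqlite3 as a stand-in for the real database; the table and column names are made up for illustration:

```python
# Sketch of the extract-then-bulk-load step. Assumes the poppler
# `pdftotext` command-line tool is installed, and that a raw_lines
# table (a name invented for this example) already exists.

import sqlite3
import subprocess
from pathlib import Path

def extract_text(pdf: Path) -> Path:
    """Extract a PDF to a plain text file alongside it."""
    txt = pdf.with_suffix(".txt")
    subprocess.run(["pdftotext", str(pdf), str(txt)], check=True)
    return txt

def bulk_load(txt: Path, conn: sqlite3.Connection) -> None:
    """Load the extracted text one row per line, in a single batch."""
    rows = [(txt.stem, n, line)
            for n, line in enumerate(txt.read_text().splitlines(), 1)]
    conn.executemany(
        "INSERT INTO raw_lines (document, line_no, line) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
```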

Once the database is loaded up, depending on the analysis steps involved it can take up to 1½ minutes to process an entire SLV document, and about 30 seconds to process GLD. This doesn't sound like much, but with around 300 individual source documents for each, it still takes nearly 8 hours to crunch the entire SLV set (which I normally leave running overnight), so I'm trying to find ways of speeding things up, like table indexes and query optimization (there's a small sketch of the indexing idea below the photo). Happily, just this week I took delivery of a new database server - a second-hand refurbished HP XW6600 workstation. I fully realize this is super-geeky, but here's a photo of the internals:

It sports dual Xeon processors (marked 1 and 2) and 32 GB of RAM - I fitted it out with my existing SSD drives, and this thing is very fast. Not supersonic, as there are still some bottlenecks, but basically it has cut my database processing time in half. The Xeon processors themselves are tiny, but they run so hot that the big metal casings are additional extraction fans, one for each.
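And here's the indexing idea mentioned above, sketched against the same sqlite3 stand-in. The table and column names are invented; the point is simply that repeated lookups by bar serial number become an index seek instead of a full table scan:

```python
# Index tuning in miniature. Everything here (database file, table and
# column names) is a stand-in for illustration only.

import sqlite3

conn = sqlite3.connect("warehouse.db")

# Minimal stand-in table so the sketch is self-contained.
conn.execute(
    "CREATE TABLE IF NOT EXISTS silver_warehouse "
    "(serial_no TEXT, document_date TEXT, weight_oz REAL)"
)

# Covering index for the most common lookup: a bar's serial number.
conn.execute(
    "CREATE INDEX IF NOT EXISTS ix_silver_serial "
    "ON silver_warehouse (serial_no, document_date)"
)
conn.commit()

# EXPLAIN QUERY PLAN should now report an index search, not a scan.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM silver_warehouse WHERE serial_no = ?",
    ("AB1234",),
).fetchall())
```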

Eventually all the processing and storage for the project will move to this machine exclusively. The machine is paid for by combining the costs from about three different projects (yeah, I'm a busy boy - I am learning AutoCAD too).

The database structure itself is (now) reasonably simple. Originally I had a table for each document ... I broke the golden rule of always starting with a high level of normalization, but I'm there now :) This is as close as I will ever get to showing the database structure itself, but on the left is what the current set of tables looks like. The PALLADIUM and PLATINUM tables are not yet populated, but at the time of writing the silver and gold warehouse tables hold 108 million and 31 million rows respectively.
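For a rough idea of what one warehouse table per metal might look like, here's a guess at the shape of it. The actual schema stays private, so every name and column below is illustrative only:

```python
# A guess at the "one warehouse table per metal" layout, again using
# sqlite3 as a stand-in. None of these names come from the real schema.

import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS silver_warehouse (
    id            INTEGER PRIMARY KEY,
    document_date TEXT NOT NULL,   -- date of the source bar list
    serial_no     TEXT NOT NULL,   -- bar serial number
    refiner       TEXT,
    weight_oz     REAL
);
CREATE TABLE IF NOT EXISTS gold_warehouse (
    id            INTEGER PRIMARY KEY,
    document_date TEXT NOT NULL,
    serial_no     TEXT NOT NULL,
    refiner       TEXT,
    weight_oz     REAL
);
""")
```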

The Bars structures can be erased and re-built at any time using the data in the warehouse tables (there's a tiny sketch of this rebuild pattern at the end of the post). This is a good approach, since I'm still finding the best processing structures.

There is also a second database (not pictured) which holds the analysis work for the specific studies, like the work on Sprott, the Perth Mint, etc. Because the secondary database is a lot smaller, it's my intent to publish it online, but only if we have some really interesting studies on the data where I think people can contribute, and only if there is an expression of interest (none so far). Basically, it costs more to host the database online, and I would not be able to leave it there indefinitely. The cloud-based archive is different: as far as I am able, I would like to keep it as a permanent archive for the public - the only one of its kind. I figure it may be an important reference if FreeGold ever comes about. Generally speaking, storage costs will get cheaper over time as new technologies (like holographic storage) come into play.

Anyway, that's it. Once I am done with my research (by which I mean completing Phase 3 of my plans) I will have the difficult choice of what to do with the database itself - do I keep processing the files? Do I keep downloading? Maybe I can publish the processed data on a data market and find out what the commercial value is.
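As promised, here's the erase-and-rebuild pattern in miniature - the derived Bars table treated as disposable and recomputed from the warehouse data. The table names and the summary columns are hypothetical:

```python
# Derived tables are disposable: drop and recompute them from the
# warehouse tables whenever the processing structure changes.
# Assumes a silver_warehouse table like the one sketched earlier.

import sqlite3

def rebuild_bars(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        DROP TABLE IF EXISTS bars;
        CREATE TABLE bars AS
        SELECT serial_no,
               MIN(document_date) AS first_seen,
               MAX(document_date) AS last_seen,
               COUNT(*)           AS appearances
        FROM silver_warehouse
        GROUP BY serial_no;
    """)
```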

Not sure of the answers yet, but at least I have the questions. I believe the answers will make themselves evident over time. Anyway, thanks for reading, and I hope it wasn't too boring for you.

< next, I talk about motivations for why I am doing this project >
