Finding the right data is too often an exercise in frustration…

I haven’t written in a while, first because I was slamming busy in my role as a VR dev at Meta, and most recently because I’ve been battling stage IV lymphoma. (Bummer.)

BUT along the way I’ve learned a lot about medical scanning and various medical image data. I’ll post some experiments in Houdini soon, and perhaps look at bringing in some of that medical data into VR.

As I explored this interest, I ran into a phenomenon I’ve encountered many times before: the sloppy, poorly organized nature of academic and medical data sets. The ad-hoc approach of every organization and university throwing its data into often poorly managed databases and sharing it via cheaply made websites built by interns or undergraduates is, frankly, pretty pathetic. I often spend more time identifying a source of usable data, then navigating the myriad organizational schemes, cryptic project- or paper-based naming, unclear file structures, and so on, than I spend doing anything interesting with that data.

Worse, I’ve been appalled at how academics and medical professionals share data amongst themselves. I recently ran across a world-class physician who was reduced to mailing CDs of radiology image data to colleagues (!) to discuss time-critical patient issues (!!).

This is, frankly, unacceptable. Academics and physicians, take some time and clean up your data infrastructure. You’ll be glad of it, I promise.

A few fields have done this, however. Those in geosciences, physics, and astronomy have some amazing resources. They have learned a few principles:

  • Don’t hoard data for no good reason. Sure, data has value, but sharing it can prove equally valuable by accelerating progress and collaboration. Open-source models are popular for a reason – they pay long-term dividends.
  • Agree on standards. If you are generating datasets, take the time to document them and provide tools to convert the data to widely used formats (“standardization”). If possible, make the data fit a useful range (“normalization”).
  • Document your data sets in plain English. Don’t assume the data is only of interest to specialists. Avoid acronyms, trade jargon, or project names or numbers.
  • Provide more than one download option. Don’t force thousands of files to be downloaded individually; make .zips available. Conversely, don’t make a 100 GB zip file the only option, either.
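To make the “normalization” point above concrete, here’s a minimal sketch of min-max rescaling – one common way to map raw values into a useful range. It assumes the data arrives as a NumPy array of raw scanner intensities; the function name and sample values are hypothetical, just for illustration.

```python
import numpy as np

def normalize_to_unit_range(values: np.ndarray) -> np.ndarray:
    """Linearly rescale values into [0, 1] (min-max normalization)."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Degenerate case: all values identical, so map everything to 0.
        return np.zeros_like(values, dtype=np.float64)
    return (values.astype(np.float64) - lo) / (hi - lo)

# Hypothetical example: raw 16-bit scanner intensities
raw = np.array([1200, 340, 2890, 512], dtype=np.int16)
normalized = normalize_to_unit_range(raw)
```

A dataset shipped alongside a small script like this (or with the values already rescaled and the original range documented) is far easier for outsiders to pick up than one where every consumer must reverse-engineer the units.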

With all that said, there are excellent resources for all kinds of data. NASA, JPL, The Visible Human Project, and mapping and GIS data in general come to mind. But if I’m looking for something a little more esoteric, I often check the Harvard Dataverse first.