Thursday, April 8, 2010

Data Sets & Data Portals: What Are They and Why Do They Matter?

One component of the Dataverse Project is a compilation of external links. These links can be grouped into two main categories: data sets and data portals.

We make this distinction because data set sites house the data themselves, while data portals point toward other sites that house the data (in a way, then, our Dataverse Project is a portal to other portals).

Below are some frequently asked questions (and their corresponding answers) that I get when I talk to people about this project:

What’s the point?
The main goal of the Dataverse Project is the creation of a comprehensive, long-term, world-historical database. By gathering external data sources:
  1. We are identifying potential collaborators who we hope will contribute relevant data toward this objective.
  2. We are informally gauging the availability of long-term historical data.
  3. We are providing an auxiliary data resource for people who visit our website.
  4. We are analyzing innovations from data collectors.

Why do we compile data in these two categories?
There was some back-and-forth discussion about what we should call the sites we were collecting. In the end we used our own logic to name them, because there were no established naming conventions to speak of. For example, what we call a data portal is also known as a data warehouse, gateway, or clearinghouse, among other names. Although one could technically describe the differences between these terms, the differences are small, so an encompassing term like "portal" meets our needs.

We divided the sites into two categories, but in all honesty, some sites do not fit neatly into either description. For example, there are sites that act as both data set sites and data portals (similar to our Dataverse Project), but they are few and far between, and we found no justification for creating a new category for them (for now).

How do we display them?
Currently, they’re set up as a bunch of links on two web pages.

How would we like to display them?
Going forward, we hope to separate the portals and data sets by category (discipline), region and timeline. Adding metadata and tags to individual data sites would be the next step. Tagging would let users pull up data sites by terms that go beyond category, region and timeline. Tags could be added by users, and metadata could be added by staff.
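To make the idea concrete, here is a minimal Python sketch of how links could be filtered by category, region, timeline and tags. It is not how our site works today. The `DataSite` record, its field names, the `find_sites` function and the two sample entries are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataSite:
    """One external link. Every field name here is hypothetical."""
    name: str
    kind: str          # "portal" or "data set"
    discipline: str
    region: str
    start_year: int    # earliest year the data covers
    end_year: int      # latest year the data covers
    staff_metadata: set = field(default_factory=set)  # added by staff
    user_tags: set = field(default_factory=set)       # added by users


def find_sites(sites, kind=None, discipline=None, region=None, year=None, tag=None):
    """Return the sites that match every filter that was supplied."""
    results = []
    for site in sites:
        if kind is not None and site.kind != kind:
            continue
        if discipline is not None and site.discipline != discipline:
            continue
        if region is not None and site.region != region:
            continue
        if year is not None and not (site.start_year <= year <= site.end_year):
            continue
        # A tag matches whether staff or a user added it.
        if tag is not None and tag not in (site.staff_metadata | site.user_tags):
            continue
        results.append(site)
    return results


# Made-up entries, for illustration only.
sites = [
    DataSite("Example Trade Archive", "data set", "economics", "Europe", 1500, 1900,
             staff_metadata={"trade"}, user_tags={"prices"}),
    DataSite("Example Census Gateway", "portal", "demography", "Asia", 1800, 2000,
             user_tags={"population"}),
]

european_1700 = find_sites(sites, region="Europe", year=1700)
```

Keeping staff metadata and user tags in separate sets means the two sources can be searched together, as above, while still letting staff review or remove user tags later.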

There are many sites that already implement our proposed categorization techniques. We will be featuring them on our blog in the future.

What are some of the issues we face?
Besides naming conventions and the lack of data display options, another issue we ran into is attribution. We did not want to give the impression that these sites belong to us, so we explicitly have users agree that the information they obtain comes from a third-party site and that we are merely a conduit for these data. There is also no universal set of rules for data usage across these sites, so we made it clear that usage is conditional on the terms that each website has set.

There are also issues of viruses and expired links. We make our best effort to avoid sites that pose these problems. We run virus scans on a sample of files from the sites and we routinely check for broken links, to give users the best browsing experience. That said, we still include a disclaimer that we cannot guarantee virus-free links.
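For readers curious what a routine broken-link check can look like, here is a minimal Python sketch using only the standard library. It is an illustration, not the tool we actually run. The function names `fetch_status` and `find_broken_links` are made up, and the rule that any status of 400 or above counts as broken is an assumption. It does not cover virus scanning.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        # HEAD asks for headers only, so the page body is not downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error code such as 404.
        return error.code
    except (urllib.error.URLError, OSError, ValueError):
        # No answer at all: bad host name, timeout, or a malformed URL.
        return None


def find_broken_links(urls, get_status=fetch_status):
    """Return the URLs that are unreachable or answer with a 4xx/5xx status."""
    broken = []
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            broken.append(url)
    return broken
```

Some servers reject HEAD requests even when the page is fine, so a link flagged here is a candidate for a manual look rather than proof that it has expired.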

We are also having issues naming the external sites. At the time of writing, they are arranged alphabetically based on our description of each site. We are addressing this issue by providing a search link, but a naming convention has to be established to improve the functionality of our website.

What else can we do to improve the external links section?
As of now, I can’t think of anything beyond what has already been said. I sincerely hope that you can provide some feedback on our website. We’re new and we’re willing to listen.

