Finding Data + Data Reuse
Format
This is a short writing/thinking exercise. Best done with a partner or small group, but can also be done alone.
Target Audience
Project leads looking for great data with open licenses.
Materials
- Pen/pencil & paper or text editor
- Data Reuse Plan Template
Introduction
Data reuse saves time and accelerates the pace of scientific discovery, so investigating how to put together existing data sets is totally a worthwhile effort.
This exercise will walk you through some open data sources and some templates for assessing and documenting the data you collect/use.
Steps to Complete
- Browse Data Repos
Break into groups of 2-5 people. Take a look at the following open data resources and consider the type of data you might like to work with; keep data formats and licenses in mind before you pick your data.
Data Sets/APIs
- NASA Data: choice datasets for space scientists
- Compendium of Awesome Public Datasets: great datasets themed by discipline and domain
- Index of Public APIs: awesome open-source listing
- Forecast.io API: well-structured API for global weather information
- Personify JS: a JS library for using IBM Watson and Twitter/Social APIs
- Web Audio API: tutorial for leveraging web audio, if that's your jam
- Quandl: the "Wikipedia" of time-series data; they provide datasets and format parsers for converting to the format you need
- Data.gov: almost every government/city has an "open data portal" initiative where you can download data of interest to you and search for different formats
- Federal Data Listing Etherpad: loads of federal data resources for the taking
- Enigma.io: loads more open data, larger datasets, and tools for correlating multiple datasets
- Exversion: similar data catalog, they also have a fabulous newsletter (subscribe for cool datasets in your inbox)
- NYC Open Data: loads of cities have "socrata" portals where you can download data
- IRE Database Library: IRE also provides a lot of open data to investigative journos, along with data dictionaries telling you how to read it
Tools to Convert Data
- Mr. Data Converter: an online tool for converting data from Excel to other formats (HTML/JSON/XML)
- Open Refine: not unlike Excel but way more powerful for large datasets; you can also convert formats in Refine (e.g. from JSON to CSV or vice versa)
- DSV: a parser and formatter for delimiter-separated values
- Tabula: tool for extracting data from PDFs
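If you would rather script a conversion than use one of the tools above, here is a minimal sketch of a JSON-to-CSV conversion using only the Python standard library. It assumes the data is a JSON array of flat (non-nested) objects; the function name `json_records_to_csv` and the sample records are made up for illustration.

```python
import csv
import io
import json


def json_records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect every key in first-seen order so no column is dropped
    # when some records have fields that others lack.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


# Hypothetical sample data, not taken from any of the sources listed above.
sample = '[{"city": "Oslo", "temp_c": 4}, {"city": "Lima", "temp_c": 19, "humid": 80}]'
csv_text = json_records_to_csv(sample)
print(csv_text)
```

Nested JSON (objects inside objects) would need flattening first, which is where a tool like Open Refine tends to be more convenient.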
Identify a note taker (or do it yourself if you are working alone) who can flag the datasets you like, then start looking into using them for your project.
- Interview your data
Using the Data Reuse Plan Template as a guide, members of the group ask questions about these data sets while the note taker records responses. The note taker can (and is encouraged to) ask questions too. As you ask questions, think about how you would (or if you could) respond to a similar question about your own data set. The purpose of this is to investigate the origin and validity of the data and know loads about it before you start munging.
Glossary
Open Data
Data that is made easily and freely available for anyone to access, use, and share without restrictions, the possible exception being a requirement of attribution.
Metadata
Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data). For example, metadata for a photograph may include the name of the photographer, when and where it was taken, as well as the type of camera and settings used to take the photograph.
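As a rough illustration, the photograph example could be captured as a small machine-readable record. This is only a sketch: the field names and all values below are hypothetical and do not follow any particular metadata standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PhotoMetadata:
    """Hypothetical metadata fields for a single photograph."""
    photographer: str
    taken_on: str   # ISO 8601 date, e.g. "2016-03-14"
    location: str
    camera: str
    settings: dict  # free-form camera settings


# Placeholder values for illustration only.
record = PhotoMetadata(
    photographer="A. Example",
    taken_on="2016-03-14",
    location="Toronto, Canada",
    camera="Example Cam X100",
    settings={"iso": 200, "aperture": "f/2.8"},
)

# Serialize to JSON so the metadata can be stored next to the file it describes.
metadata_json = json.dumps(asdict(record), indent=2, sort_keys=True)
print(metadata_json)
```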
Licensing
A license gives explicit permissions for the use of something. This is particularly important if you want to make your data open as some jurisdictions assign copyrights to data sets which limit their use. There are several types of licenses that are in common use for data. You can read more about them here: http://www.dcc.ac.uk/resources/how-guides/license-research-data.
Naming Conventions
These are a set of predefined rules for the naming and structure of folders, files, field names, etc. (E.g. All files begin with a date, location and project name.) Naming conventions help provide context to a data set, as well as make sure a standard of data collection and management is being followed by all members of a team.
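A naming convention is easiest to follow when it can be generated and checked automatically. The sketch below assumes one possible convention based on the example above (date, then location, then project name, separated by underscores); the helper names and the exact pattern are illustrative choices, not a standard.

```python
import re
from datetime import date

# Assumed convention: YYYY-MM-DD_location_project.extension, all lowercase.
FILENAME_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}_[a-z0-9-]+_[a-z0-9-]+\.[a-z0-9]+$")


def slug(text: str) -> str:
    """Lowercase the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(day: date, location: str, project: str, extension: str) -> str:
    """Build a filename that follows the assumed convention."""
    return f"{day.isoformat()}_{slug(location)}_{slug(project)}.{extension}"


def follows_convention(filename: str) -> bool:
    """Check whether an existing filename matches the assumed convention."""
    return FILENAME_PATTERN.match(filename) is not None


name = build_filename(date(2016, 3, 14), "New York", "Air Quality", "csv")
print(name, follows_convention(name))
print(follows_convention("final_data_v2.csv"))
```

A check like `follows_convention` can be run over a shared folder to spot files that drifted from the agreed rules.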
Permanent Identifiers
A permanent identifier (or PID) is a set of numbers and/or characters, frequently in the form of a URL, that points to the location of a resource. PIDs are set up in such a way that even though the storage location of the resource may change over time (e.g. moving data from one university server to another), the PID will always point to the correct location. DOI is a commonly known type of PID.
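For DOIs specifically, the identifier becomes a resolvable link by placing it after the `https://doi.org/` resolver address. The sketch below normalizes a few common ways a DOI gets written; the function name is hypothetical and the DOI used is a made-up placeholder, not a real record.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI written in a few common styles into a resolver URL."""
    doi = doi.strip()
    # Strip prefixes people commonly paste along with the identifier.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the "/" separator.
    return DOI_RESOLVER + quote(doi, safe="/")


# Placeholder DOI for illustration; it does not point at a real dataset.
print(doi_to_url("doi:10.1234/example.5678"))
```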
Follow-up Resources & Materials
You may find it useful to review this handout early on in the planning stages of your project to help design the workflows of your project.
The following resources offer more information on documenting your data, along with research best practices that make documentation easier.
- Metadata Guide from Australian National Data Service (ANDS)
- Best Practices for Data Management from DataONE