The (Re)usable Data Project

Inspired by the efforts of scientists around the world and the game-changing efforts of projects like the Creative Commons, the Wikimedia Foundation, and the Free Software movement, we hope to engage the larger community in an open and fruitful discussion on issues concerning the use and reuse of scientific data, including the balance of openness and how to make ends meet in an increasingly competitive environment.

If you would like to join our efforts to highlight the use and reuse of data in the sciences, please feel free to contact us on our tracker, create a pull request against our repository, or join our forum.

[Figure: High-level summary of curated data resources. For each criterion (Clearly stated; Comprehensive and non-negotiated; Accessible; Avoid restrictions on kinds of (re)use; Avoid restrictions on who may (re)use), the chart shows the number of resources that have a violation, are unknown, or have no violations.]

Who we are

We are not lawyers and this is not legal advice: all institutions and groups have their own perspectives and counsel. We are a group of scientists, engineers, librarians, and specialists that are concerned about the use and reuse of increasingly interconnected, derived, and reprocessed data. We want to make sure that data-driven scientific endeavors can work with one another in meaningful ways without undue legal concerns.

The (Re)usable Data Project is meant to provide a resource that looks at some of the issues around the reuse of scientific data and to open a conversation about how to deal with them.

We also want to actively work with the community in considering our criteria and in making sure that our information about scientific data resources is up-to-date and correct. If you have any questions, concerns, or see any problems, please open a ticket on our GitHub tracker.

What this is

The initial driving concern of this project is the use and reuse of biological and biomedical data. However, this is a general problem in the scientific community and needs to be addressed directly.
For each resource, using our criteria, we attempt to objectively assign zero to five stars for how well we believe a resource's data may be built upon, edited, modified, and redistributed.
Roughly speaking:

  • 5 stars ★ ★ ★ ★ ★
    The license unambiguously allows the unfettered (re)use and redistribution of the data.
  • 4 stars ★ ★ ★ ★
    The license unambiguously allows (re)use and redistribution of the data under some terms.
  • 3 stars ★ ★ ★
    The license is clearly stated, unambiguous, and of a standard type, and has clear access, but has terms that may greatly impact the (re)use and redistribution of the data.
  • 2.5 or fewer stars ★ ★ ½ - ∅
    There are likely issues in definitively finding the license, ambiguities in the license that hamper further analysis, issues with clean data access, or terms that require legal advice.

If you see any problems with our determinations or would like to make corrections or clarifications, please open a ticket for us on our issue tracker.

Our criteria

This is a short overview of the criteria that we use when evaluating a resource's data license for use and reuse. We have attempted to balance many needs (credit, mutability, commercialization, redistribution, etc.) and focused on trying to objectively see how licenses can interact across resources.

To learn more about how we look at resource data licenses, please see our criteria and license type pages.

  • Clearly stated (A)
    A clearly stated, unambiguous, and hopefully standard license for data use is critical for any (re)use of data: if there is no license to be found, then rights are unclear and one needs to assume the default: all rights reserved.
  • Comprehensive and non-negotiated (B)
    Data that is mixed under different licenses, only partially available, or must be in some way negotiated creates barriers to the (re)use of data.
  • Accessible (C)
    Data must be accessible in a reasonable manner to be useful to the broader community.
  • Avoid restrictions on kinds of (re)use (D)
    Data should be able to be copied, built upon, edited, and modified as freely as possible.
  • Avoid restrictions on who may (re)use (E)
    Data should be available to as many people as possible for their (re)use.

Our sources data

You may also explore our data with simple visualizations here.

Each entry below lists: Name, Tags, Grade, Description, License Info, License Issues.

Alliance of Genome Resources (AGR)
  Tags: biology, MOD, functional annotation, disease-gene association, orthology, phenotype and disease models
  Grade: ★ ★ ★ ★ ★
  Description: The primary mission of the Alliance of Genome Resources (the Alliance) is to develop and maintain sustainable genome information resources that facilitate the use of diverse model organisms in understanding the genetic and genomic basis of human biology, health and disease.
  License Info: permissive
  License Issues: none listed

ArrayExpress
  Tags: biology, microarray experiments, functional genomics, high-throughput, microarray, sequencing
  Grade: ★ ★ ★ ★ ½
  Description: ArrayExpress Archive of Functional Genomics Data stores data from high-throughput functional genomics experiments, and provides these data for reuse to the research community.
  License Info: permissive
  License Issues:
    • Criteria A.2.2: Minimal custom permissive terms.

Bgee
  Tags: biomedical, x-species, expression data, curated data, biology, evo-devo, curated experiment annotations, RNA-Seq experiments, scRNA-Seq experiments, microarray experiments, in situ hybridization experiments, EST libraries
  Grade: ★ ★ ★ ★ ★
  Description: Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, scRNA-Seq, Affymetrix, in situ hybridization, and EST data).
  License Info: permissive
  License Issues: none listed

BioCyc Database Collection (BioCyc, public)
  Tags: biology, genomic resource, sequence, gene structure, pathways, reactions, functional annotation
  Grade: ★ ★ ★
  Description: BioCyc is a collection of 20,028 Pathway/Genome Databases (PGDBs) for model eukaryotes and for thousands of microbes, plus software tools for exploring them. BioCyc is an encyclopedic reference that contains curated data from 130,000 publications.
  License Info: restrictive
  License Issues:
    • Criteria A.2.2: Non-standard/custom license.
    • Criteria B.1: One term of the license is that you must "Notify SRI that you are making BIOCYC DATABASES available in this manner"; this, combined with somewhat bulky access (see comments), I believe rises to a barrier to reuse, as a manual step involving people has been added.
    • Criteria D.1.2: The license specifically notes that it is being licensed to you and you do not have rights beyond that; I believe that this puts second-hand reuse in question: you may remix, but a downstream party from you would have to register and comply directly with BioCyc.

BioGRID
  Tags: biology, cross-species, protein-protein interaction
  Grade: ★ ★ ★ ★ ★
  Description: BioGRID is an interaction repository with data compiled through comprehensive curation efforts. Our current index is version 3.4.155 and searches 63,959 publications for 1,507,991 protein and genetic interactions, 27,785 chemical associations and 38,559 post translational modifications from major model organism species. All data are freely provided via our search index and available for download in standardized formats.
  License Info: permissive
  License Issues: none listed

BRENDA Tissue Ontology
  Tags: biology, ontology, enzyme sources
  Grade: ★ ★ ★ ★
  Description: A structured controlled vocabulary for the source of an enzyme. It comprises terms of tissues, cell lines, cell types and cell cultures from uni- and multicellular organisms.
  License Info: permissive
  License Issues:
    • Criteria A.2.2: While they are obviously attempting to be permissive by listing "CC-BY", this does not map onto any of the numbered CC-BY versions (e.g. 3.0 or 4.0), thereby breaking the reference and meaning that we would have to contact them for terms.
    • Criteria C.2: Once one knows the location of the BTO (ontology file), access is fine. However, we were unable to locate the ontology file after some searching through the main BRENDA website; most text would imply there is no free ontology file to be had.

Cancer Biomarkers database
  Tags: oncology, interaction, cancer, drug, biomarker, oncology
  Grade: ★ ★ ★ ★ ★
  Description: The Cancer Biomarkers database is curated and maintained by several clinical and scientific experts in the field of precision oncology.
  License Info: permissive
  License Issues: none listed

Catalogue of Life
  Tags: biology, custom, biodiversity, distribution, biogeography, taxonomy, ontology
  Grade: ★ ★
  Description: The Catalogue of Life is the most comprehensive and authoritative global index of species currently available. It consists of a single integrated species checklist and taxonomic hierarchy. The Catalogue holds essential information on the names, relationships and distributions of over 1.6 million species.
  License Info: restrictive
  License Issues:
    • Criteria A.2.2: The resource uses custom terms.
    • Criteria B.1: Use seems to hinge on some contact with Sp2000. For example: "If you wish to use the Catalogue of Life content on a public portal or webpage you are required to notify the Species 2000 Secretariat, and to assist with a check that the correct credits are given." This check assistance especially seems to violate B.1.
    • Criteria C.1: The data "download" is quite complicated and should not actually be considered a download (see commentary). The API as given would likely require a custom spider to obtain the data in bulk.
    • Criteria D.1.2: Distribution seems to be prohibited without negotiation; example on the main ToS: "Commercial use of this compilation or any of the species datasets contained within...or dissemination on the Internet, requires written permission from Species 2000 and ITIS."
    • Criteria E.1.1: Non-commercial restrictions exist on the data from the ToS.

CATH Protein Structure Database
  Tags: biology, protein families, protein family, superfamily, classification protein structure
  Grade: ★ ★ ★ ★ ★
  Description: CATH is a classification of protein structures downloaded from the Protein Data Bank.
  License Info: permissive
  License Issues: none listed

ChEMBL
  Tags: biology, biochemical, bioactive drug-like small molecules
  Grade: ★ ★ ★
  Description: ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data).
  License Info: copyleft
  License Issues:
    • Criteria D.1.2: CC SA prevents some types of reuse, such as modification and redistribution with data from different license types.
    • Criteria E.1.2: CC SA prevents all parties from reusing the data, as in D.1.2.

Showing 1 to 10 of 67 entries
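
The License Issues entries above cite criteria by code (for example A.2.2 or D.1.2), where the leading letter is one of the five top-level criteria (A to E). As a minimal sketch, and not the project's own scoring code, the following Python tallies how many of the ten resources shown on this page have at least one cited issue under each top-level criterion. The `ISSUES` mapping is transcribed by hand from the listing above, and the names `ISSUES`, `CRITERIA`, and `tally_by_criterion` are hypothetical. A cited code is treated here simply as "an issue was noted"; how the project weighs each issue into a star grade is not described on this page and is not modeled.

```python
from collections import Counter

# Top-level criteria as named on this page.
CRITERIA = {
    "A": "Clearly stated",
    "B": "Comprehensive and non-negotiated",
    "C": "Accessible",
    "D": "Avoid restrictions on kinds of (re)use",
    "E": "Avoid restrictions on who may (re)use",
}

# Criteria codes cited under License Issues for the ten resources
# shown on this page (hand-transcribed; an assumption of this sketch).
ISSUES = {
    "Alliance of Genome Resources (AGR)": [],
    "ArrayExpress": ["A.2.2"],
    "Bgee": [],
    "BioCyc Database Collection (BioCyc, public)": ["A.2.2", "B.1", "D.1.2"],
    "BioGRID": [],
    "BRENDA Tissue Ontology": ["A.2.2", "C.2"],
    "Cancer Biomarkers database": [],
    "Catalogue of Life": ["A.2.2", "B.1", "C.1", "D.1.2", "E.1.1"],
    "CATH Protein Structure Database": [],
    "ChEMBL": ["D.1.2", "E.1.2"],
}


def tally_by_criterion(issues):
    """Count resources with at least one cited issue per top-level criterion."""
    counts = Counter()
    for codes in issues.values():
        # The letter before the first dot is the top-level criterion.
        # A set ensures each resource is counted once per criterion.
        for letter in {code.split(".")[0] for code in codes}:
            counts[letter] += 1
    # Report every criterion, including those with no cited issues.
    return {letter: counts.get(letter, 0) for letter in CRITERIA}


if __name__ == "__main__":
    for letter, count in tally_by_criterion(ISSUES).items():
        print(f"{letter} ({CRITERIA[letter]}): {count} of {len(ISSUES)}")
```

Because only 10 of the 67 entries appear on this page, the tallies describe this excerpt only and should not be read as the project-wide summary.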

Contact us

All copyrightable materials on this site are © 2019 the (Re)usable Data Project under the CC-BY 4.0 license.
The (Re)usable Data Project is funded by the National Center for Advancing Translational Sciences (NCATS) OT3 TR002019 as part of the Biomedical Data Translator project and U24TR002306 as part of the CTSA Program National Center for Data to Health (CD2H).
The (Re)usable Data Project would like to acknowledge the assistance of many more people than can be listed here. Please visit the about page for the full list.