occCite: Tools for querying and managing large biodiversity occurrence datasets

The amount of observational and specimen-based biodiversity data available to researchers is increasing exponentially, yet the ability to manage and cite large, complex biodiversity datasets lags behind. This management and citation gap impedes reproducibility for data users and the ability of data publishers to track data use and accumulate citations, ultimately harming the longer-term sustainability of the still-emerging enterprise of research data sharing. Here we present an R package, occCite (v. 0.4.7), that helps researchers query large species occurrence data aggregators (specifically, the Global Biodiversity Information Facility, GBIF, and the Botanical Information and Ecology Network, BIEN) and store metadata such as primary data providers, database accession dates, DOIs, and the taxonomic source used for search terms. occCite also includes tools to summarize and visualize query results and to generate citation lists of all data providers and software packages used during the query process. We provide examples of a basic occurrence search and citation workflow, as well as an advanced workflow that uses features for custom optimized searches, visualization, and summary procedures. occCite improves upon existing R packages by uniting data from powerful API-based query packages (rgbif and BIEN) in a single object-based framework, while maintaining the metadata vital to best-practice recommendations for documenting biodiversity analysis workflows. occCite aims to efficiently close the gap in the citation cycle between primary data providers and final research products, allowing researchers to meet dataset documentation standards without sacrificing time and resources to the growing demand for detailed dataset documentation.