---
title: "Uploading plot data with vegbankr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Uploading plot data with vegbankr}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---

```{r, message = FALSE, echo = FALSE}
library(dplyr)
library(DT)

docs <- read.csv("../inst/loader-table-fields.csv")
docs$required <- factor(docs$required, levels = c("required", "best practice", "commonly used", "sometimes used"))
docs <- docs %>% arrange(required)
```

## Introduction to `vegbankr`

This package is an R client for VegBank, the vegetation plot database of
the Ecological Society of America's [Panel on Vegetation
Classification](https://esa.org/vegpanel/), hosted by the [National
Center for Ecological Analysis and Synthesis](https://www.nceas.ucsb.edu)
(NCEAS). VegBank contains vegetation plot data, community types
recognized by the U.S. National Vegetation Classification and others,
and all ITIS/USDA plant taxa along with other taxa recorded in plot
records.

As a VegBank API client, the `vegbankr` package currently supports
querying and downloading vegetation plot records and other supporting
information from the VegBank database, as well as validating and
uploading new data.

## Contributing data to VegBank

To upload data to VegBank, you must **first request contributor
permission** from the ESA Panel on Vegetation Classification. You can
request to be a contributor by emailing
[help@vegbank.org](mailto:help@vegbank.org). The panel will evaluate
your request with the goal of maintaining high-quality vegetation data
in the system. Once your contributor role is granted, you will be able
to log in and upload new plot data with the [vegbankr R
package](https://nceas.github.io/vegbankr).

Uploading data with `vegbankr` involves three key steps:

1.  Model and transform your data to the VegBank Loader Table format
2.  Validate your data
3.  Upload your data using `vb_upload_plot_observations(...)`

This vignette walks through these three steps, with an emphasis on
modeling and validating data.

## VegBank Loader Tables

Loader tables are the data format used to upload data into VegBank. To
publish your data to VegBank, the first step is to model your data
against the loader table format, and then transform the data into that
format. Modeling your data means identifying how each piece of
information in your original dataset (like species names, plot
locations, survey dates) corresponds to specific fields in the VegBank
loader table format.

There are a total of 12 loader tables that can be used for plot
observation data, though not all are required for data ingest. This
section describes the loader tables and their fields, in the order in
which it is recommended to prepare them. Each loader table you create is
an R `data.frame` that is eventually passed to a VegBank upload
function. In the documentation below, interactive tables display the
allowed `field` (column name), whether the field is required, best
practice, commonly used, or sometimes used, and a description of the
field.

A number of fields act as codes that serve as primary and foreign keys
linking the loader tables together. Codes that begin with `user_` are
supplied by the data contributor. Codes that begin with `vb_` are
created by the database upon data upload.

### Projects

This table stores information about a project established to collect
vegetation plot data. The `user_pj_code` is the project code primary
key, and appears as a foreign key in several other tables. An example
project code might be `MOJA` with project name "Mojave Desert Vegetation
Surveys."
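As a minimal sketch of this modeling step, the base R snippet below maps
a hypothetical source table onto a projects loader table. Only
`user_pj_code` comes from the description above; `raw_projects`, its
column names, and the `project_name` field are assumptions made for
illustration, so consult the field table for the actual column names.

```r
# Hypothetical source data: one row per survey project.
raw_projects <- data.frame(
  code = "MOJA",
  name = "Mojave Desert Vegetation Surveys",
  stringsAsFactors = FALSE
)

# Map the source columns onto loader table fields.
projects <- data.frame(
  user_pj_code = raw_projects$code,
  project_name = raw_projects$name,  # assumed field name, not confirmed
  stringsAsFactors = FALSE
)

# The primary key should not contain duplicates.
stopifnot(anyDuplicated(projects$user_pj_code) == 0)
```

Keeping the mapping in a script like this makes it easy to return to the
modeling step if later validation reports a problem.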
```{r, echo = FALSE}
docs %>%
  filter(loader_table == "projects") %>%
  select(-loader_table) %>%
  datatable()
```

### Parties

The Parties loader table is used to upload new parties (people)
associated with plots, projects, taxa, and classifications. The primary
key is `user_py_code`, which is used as a foreign key in the
Contributors loader table. Once uploaded, VegBank will create a
`vb_py_code` for each party to be used in the Contributors table.

```{r, echo = FALSE}
docs %>%
  filter(loader_table == "parties") %>%
  select(-loader_table) %>%
  datatable()
```

### Contributors

The contributors loader table is fairly code-heavy. It is closely linked
to both the Projects table and the Parties table, and is used to link
parties (people) with their contributions to plots, projects, taxa, and
classifications.

`user_py_code` is a foreign key to the `parties` table, so any values
present in this field must be present there as well. Alternatively,
`vb_py_code` can be used instead of `user_py_code` if the party is
already in VegBank. Exactly one of `user_py_code` or `vb_py_code` must
be present in each row; supplying both in the same row is not allowed.

`vb_ar_code` is the VegBank role code, in the format `ar.{nn}`. A table
of allowed values and their meanings is listed below the loader table
variables.

Finally, the `contributor_type` indicates whether this contributor is
linked to an Observation, Project, or Classification. The
`record_identifier` value depends on which of these three types the
contributor is associated with. If the contributor should be associated
with a project with `user_pj_code` `MOJA`, the `record_identifier` for
that contributor should be `MOJA` and the `contributor_type` should be
Project. This also transitively associates the contributor with all
plots associated with that project.
If the contributor should only be associated with a particular
observation with `user_ob_code` `MOJA_0214`, the `record_identifier` is
that observation code, and the `contributor_type` should be
Observation.

```{r, echo = FALSE}
docs %>%
  filter(loader_table == "contributors") %>%
  select(-loader_table) %>%
  datatable()
```

```{r, echo = FALSE}
roles <- dplyr::tribble(
  ~ar_code, ~role_name,
  "ar.16", "Author",
  "ar.17", "Contact",
  "ar.18", "PI",
  "ar.19", "Data Manager",
  "ar.34", "Classifier",
  "ar.36", "Plot author",
  "ar.38", "Co-PI",
  "ar.39", "Computer (automated)",
  "ar.40", "Consultant",
  "ar.43", "Field assistant",
  "ar.44", "Guide",
  "ar.45", "Land owner",
  "ar.46", "Not specified",
  "ar.47", "Not specified/Unknown",
  "ar.48", "Passive observer",
  "ar.50", "Plot contributor",
  "ar.51", "Publication author",
  "ar.53", "Research advisor",
  "ar.54", "System manager",
  "ar.55", "Taxonomist",
  "ar.56", "Data aggregator"
)

datatable(roles)
```

### Plot Observations

The plot observations loader table contains all data that is consistent
across a plot. This includes information on the plot name, location,
physical features, non-vegetation cover, etc. This table has many
optional fields that may or may not be applicable to your project.

Similar to the pattern described in contributors, one of `vb_pl_code` or
`user_pl_code` is required, and only one of those two fields may be used
in each row. `vb_pl_code` should only be used when the intention is to
add a new observation record to a plot that already exists in VegBank.
These codes are used as foreign keys in various other tables.

`author_plot_code` is often the same as these codes (e.g., `MOJA_0214`),
but they could differ for valid reasons. `author_plot_code` is
prominently displayed in the VegBank user interface as the plot
identifier.

`user_ob_code` is the primary key for an observation on a plot, and may
be the same as the plot code if there is only one observation of each
plot.
If there are multiple observations on the same plot, however,
`user_ob_code` must be unique for each observation.

`user_pj_code` or `vb_pj_code` can link a plot and its observation back
to a project. Values in these fields must be present either in the
`projects` loader table (`user_pj_code`) or in VegBank if the project is
already uploaded (`vb_pj_code`).

```{r, echo = FALSE}
docs %>%
  filter(loader_table == "plot_observations") %>%
  select(-loader_table) %>%
  datatable()
```

### Community Classifications

The community classifications loader table contains the community
classification of an observation. The primary key is `user_cl_code`.
The foreign key `user_ob_code` corresponds to the key in the plot
observations loader table and is required. All values in this field must
also be present in the plot observations table.

`vb_cc_code` is the VegBank community concept identifier. It is a
required field, and its value must already be present in VegBank. To
retrieve a list of possible `vb_cc_code` values, use the `vegbankr`
function `vb_get_community_concepts()`.

```{r, echo = FALSE}
docs %>%
  filter(loader_table == "community_classifications") %>%
  select(-loader_table) %>%
  datatable()
```

### Strata Cover

This loader table contains the plant names, cover, and strata recorded
in a plot observation.

The `user_ob_code` is a required foreign key that links the plant to a
plot observation. All values in this field must be present in the plot
observations loader table.

`user_to_code` is a key that is unique for each combination of
`user_ob_code`, plant name, and stratum in this table. `user_tm_code` is
a key that is unique for each combination of `user_ob_code` and plant
name, so `user_tm_code` may be repeated in this table if a plant occurs
in multiple strata in a plot observation.

`user_sr_code` is a foreign key that corresponds to the `strata` loader
table, described below.
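The difference between the two keys can be sketched in base R. This is
an illustrative example under stated assumptions, not the package's own
method: the `plant_name` column name, the `tm_`/`to_` code formats, and
the use of `user_sr_code` to stand in for the stratum are all
hypothetical choices.

```r
# Hypothetical cover records: the same plant occurs in two strata.
cover <- data.frame(
  user_ob_code = c("MOJA_0214", "MOJA_0214", "MOJA_0214"),
  plant_name   = c("Larrea tridentata", "Larrea tridentata", "Ambrosia dumosa"),
  user_sr_code = c("S1", "S2", "S2"),
  stringsAsFactors = FALSE
)

# user_tm_code: one code per observation + plant name.
tm_key <- paste(cover$user_ob_code, cover$plant_name, sep = "|")
cover$user_tm_code <- paste0("tm_", match(tm_key, unique(tm_key)))

# user_to_code: one code per observation + plant name + stratum.
to_key <- paste(tm_key, cover$user_sr_code, sep = "|")
cover$user_to_code <- paste0("to_", match(to_key, unique(to_key)))

cover[, c("plant_name", "user_sr_code", "user_tm_code", "user_to_code")]
```

Here the two *Larrea tridentata* rows share a `user_tm_code` but receive
distinct `user_to_code` values, because they fall in different strata.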
```{r, echo = FALSE}
docs %>%
  filter(loader_table == "strata_cover_data") %>%
  select(-loader_table) %>%
  datatable()
```

### Strata Methods

Each stratum value must also have a VegBank strata method associated
with it. This is represented by `vb_sy_code`. To see the available
codes, see the code snippet below the table. The strata method is linked
to an observation via the required `user_ob_code` field. `user_sr_code`
is a required identifier that provides a key to each unique stratum
observation in the strata cover table.

```{r, echo = FALSE}
docs %>%
  filter(loader_table == "strata") %>%
  select(-loader_table) %>%
  datatable()
```

```{r, eval = FALSE}
library(vegbankr)
library(dplyr)
library(tidyr)  # provides unnest()

# Retrieve stratum methods with their nested stratum types, one row per type
vb_strata <- vb_get_stratum_methods(with_nested = TRUE) %>%
  unnest(stratum_types) %>%
  mutate(stratum_index = tolower(stratum_index)) %>%
  rename(Stratum = stratum_index)
```

### Taxon Interpretations

Taxon interpretations associate the plants in the Strata Cover table
with an existing VegBank plant concept code. To get a list of existing
plant concepts, use the `vb_get_plant_concepts()` function. Note that a
person with a role is also required for this table, so one of
`user_py_code` (present in the `parties` loader table) or `vb_py_code`
(an existing VegBank party) must be supplied, along with their role in
`vb_ar_code`.

```{r, echo = FALSE}
docs %>%
  filter(loader_table == "taxon_interpretations") %>%
  select(-loader_table) %>%
  datatable()
```

### Disturbances

The disturbances loader table contains information about disturbances
observed at a plot, such as fire, grazing, logging, or other events that
have impacted the vegetation.

The primary key is `user_do_code`. The foreign key `user_ob_code` links
each disturbance record to a specific plot observation and is required.

`type` describes the kind of disturbance and is a required field. This
field is a closed list in VegBank, with options listed below.
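Because `type` is a closed list, it can help to check your values
locally before validating. The sketch below uses base R and only a
handful of values from the list of allowed disturbance types; the
`allowed_types` vector and the example rows are hypothetical, and a real
check would use the complete list.

```r
# A subset of the allowed disturbance types, for illustration only.
allowed_types <- c("Fire, general", "Grazing, domestic stock", "Floods", "unknown")

# Hypothetical disturbances loader table with one non-conforming value.
disturbances <- data.frame(
  user_do_code = c("do_1", "do_2", "do_3"),
  user_ob_code = c("MOJA_0214", "MOJA_0214", "MOJA_0215"),
  type         = c("Fire, general", "fire", "Floods"),
  stringsAsFactors = FALSE
)

# Rows whose type is not an exact (case-sensitive) match to the list.
invalid <- disturbances[!(disturbances$type %in% allowed_types), ]
invalid
```

In this example the row with `type` `"fire"` is flagged, since it does
not exactly match any entry in the list.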
```{r, echo = FALSE}
docs %>%
  filter(loader_table == "disturbances") %>%
  select(-loader_table) %>%
  datatable()
```

Below are the allowed disturbance types:

```{r, echo = FALSE}
disturbance_types <- data.frame(
  type = c(
    "Animal, general",
    "Grazing, domestic stock",
    "Grazing, native ungulates",
    "Herbivory, invertebrate",
    "Herbivory, vertebrates",
    "Human, general",
    "Cultivation",
    "Fire suppression",
    "Herbicide or chemical",
    "Mowing",
    "Roads and vehicular traffic",
    "Timber harvest, general",
    "Timber harvest, clearcut",
    "Timber harvest, selective",
    "Trampling and trails",
    "Natural, general",
    "Avalanche and snow",
    "Cryoturbation",
    "Erosion",
    "Floods",
    "Hydrologic alteration",
    "Ice",
    "Mass movements (landslides)",
    "Plant disease",
    "Salt spray",
    "Tides",
    "Wind, chronic",
    "Wind event",
    "Fire, general",
    "Fire, canopy",
    "Fire, ground",
    "Other disturbances",
    "unknown"
  )
)

datatable(disturbance_types)
```

### Soils

The Soils loader table is used to describe soils collected from a plot.
This includes information on soil horizons, texture, color, depth, and
chemical properties.

The primary key is `user_so_code`, which can be a simple row number.
The foreign key `user_ob_code`, when used, links each soil record to a
specific plot observation in the plot observations table.

`horizon` is a required field that identifies the soil horizon being
described.

```{r, echo = FALSE}
docs %>%
  filter(loader_table == "soils") %>%
  select(-loader_table) %>%
  datatable()
```

### Stem Data

The Stem Data loader table is used to describe individual plant stems
measured at a plot. This table supports detailed tree and shrub
demographic data collection, including stem diameter, height, location,
and health status.

The primary key is `user_sc_code`, which is the stem count identifier.
The required foreign key `user_tm_code` links each stem record to a
specific taxon observation in the strata cover table, associating stems
with their species identification.
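A foreign key requirement like this one can be checked by hand with
`setdiff()`. The snippet below is a rough base R sketch with made-up
codes, shown only to illustrate what "all values must be present in the
other table" means in practice; it is not the package's validator.

```r
# Hypothetical strata cover table holding the valid taxon observation keys.
strata_cover <- data.frame(
  user_tm_code = c("tm_1", "tm_2"),
  stringsAsFactors = FALSE
)

# Hypothetical stem data; "tm_9" has no match in strata_cover.
stem_data <- data.frame(
  user_sc_code = c("sc_1", "sc_2", "sc_3"),
  user_tm_code = c("tm_1", "tm_2", "tm_9"),
  stringsAsFactors = FALSE
)

# Foreign key values in stem_data that are missing from strata_cover.
orphans <- setdiff(stem_data$user_tm_code, strata_cover$user_tm_code)
orphans
```

An empty result means every stem record points at a known taxon
observation; any value returned here would need to be fixed in the
modeling code.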
```{r, echo = FALSE}
docs %>%
  filter(loader_table == "stem_data") %>%
  select(-loader_table) %>%
  datatable()
```

## Data Validation

Validating that your data conform to the VegBank schema is an important
step toward a successful data upload. Validation always occurs before
ingest into the database, and some validation is also done by the API,
but users often benefit from getting easy-to-read validation results
before even attempting a data upload. The `vb_validate` family of
functions checks for the presence of required fields, checks that unique
fields contain no duplicates, and cross-checks required foreign keys
across tables.

To validate your plot observations data before submitting, pass all of
your loader tables as data frames to the appropriate arguments in
`vb_validate_plot_observations()`.

```{r, eval = FALSE}
vb_validate_plot_observations(plot_observations = plots,
                              projects = projects,
                              parties = party,
                              contributors = contrib,
                              disturbances = dist,
                              community_classifications = comm,
                              strata_cover_data = strata_cover,
                              taxon_interpretations = tax,
                              strata = strata)
```

If there are issues in the data, the validator will return output that
looks like this:

```
✖ disturbances.user_ob_code values not found in plot_observations.user_ob_code: DO001, DO002, DO003
ℹ soils table not provided - skipping validation
ℹ stem_data table not provided - skipping validation
ℹ references table not provided - skipping validation
```

In this example, `user_ob_code` in the `disturbances` table contains
values that are not found in `user_ob_code` in `plot_observations`. As
described in the disturbances section of the loader table requirements
above, all values in this foreign key must be present in `user_ob_code`
in the plot observations loader table.

To correct this mistake, you would need to return to the code that
performed the data modeling to determine the cause of the issue.
It could be that the wrong column in the original data was mapped to one
of the two `user_ob_code` fields, or that differences in capitalization
or white space prevent otherwise identical codes from matching across
tables. Note that validation is case and white space sensitive across
all checks.

## Data Upload

Once data are validated, you are ready to try to upload data. First you
need to point your R session to the correct VegBank instance and set a
token. To do a test upload, point to the test instance.

```{r, eval = FALSE}
vb_set_base_url("https://api-dev.vegbank.org")
```

Next, get a token by logging into
[http://api-dev.vegbank.org/login](http://api-dev.vegbank.org/login).
After logging in, you should see a JSON document that contains an access
token and a refresh token. The easiest way to get this information into
R is to select the "Raw Data" option in your browser (if available) to
get the plain text JSON. Copy this to your clipboard, and paste it into
R to save to the variable `token`. You'll need to wrap this string in
**single quotes**. Because the JSON itself contains double quotes,
wrapping it in double quotes will give a syntax error.

```{r, eval = FALSE}
token <- '{"message":"Authorization successful","token":{"access_token":".......","refresh_token":"......."}}'
```

Then set your token using:

```{r, eval = FALSE}
vb_set_token(tokens = token)
```

From here, uploading is easy using the `vb_upload_plot_observations()`
function. This function takes as arguments data frames for each of the
loader tables described above. Note that not all loader tables are
required. To run the function, take the data frames you validated and
pass them as arguments to the function as shown below.

Note the `dry_run` argument. This is a way to do one final round of
validation before inserting the data into VegBank. Setting a dry run
allows the API to go through all of the steps except the very last
insert call.
If the dry run is successful, the output will report that rows were
inserted, but nothing is actually written to the database. If it is not
successful, the call will error, and more work is needed to ensure the
loader tables conform to the required schema.

```{r, eval = FALSE}
vb_upload_plot_observations(plot_observations = plots,
                            projects = projects,
                            parties = party,
                            contributors = contrib,
                            disturbances = dist,
                            community_classifications = comm,
                            strata_cover_data = strata_cover,
                            taxon_interpretations = tax,
                            strata = strata,
                            soils = soils,
                            dry_run = TRUE)
```

Once you get a successful dry run, set `dry_run` to `FALSE` to complete
your upload!