---
title: "Uploading plot data with vegbankr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Uploading plot data with vegbankr}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r, message = FALSE, echo = FALSE} 
library(dplyr)
library(DT)

docs <- read.csv("../inst/loader-table-fields.csv")

docs$required <- factor(docs$required, levels = c("required", "best practice", "commonly used", "sometimes used"))

docs <- docs %>% 
  arrange(required)
```

## Introduction to `vegbankr`

This package is an R client for VegBank, the vegetation plot database of
the Ecological Society of America's [Panel on Vegetation
Classification](https://esa.org/vegpanel/), hosted by the [National
Center for Ecological Analysis and
Synthesis](https://www.nceas.ucsb.edu) (NCEAS). VegBank contains
vegetation plot data, community types recognized by the U.S. National
Vegetation Classification and others, and all ITIS/USDA plant taxa along
with other taxa recorded in plot records. As a VegBank API client, the
`vegbankr` package currently supports querying and downloading
vegetation plot records and other supporting information from the
VegBank database, and supports validating and uploading new
data to the VegBank database as well.

## Contributing data to VegBank

To upload data to VegBank, you must **first request contributor permission**
from the ESA Vegetation Classification Panel. To request contributor status,
email [help@vegbank.org](mailto:help@vegbank.org). The panel will evaluate
your request with the goal of maintaining high-quality vegetation data in the
system. Once your contributor role is granted, you will be able to log in and
upload new plot data with the
[vegbankr R package](https://nceas.github.io/vegbankr).

To use `vegbankr` to upload data, there are 3 key steps:

1.  Model and transform your data to the VegBank Loader Table format
2.  Validate your data
3.  Upload your data using `vb_upload_plot_observations(...)`

This vignette will walk through these 3 steps, with an emphasis on
modeling and validating data.

## VegBank Loader Tables

Loader tables are the data format used to upload data into VegBank. To
publish your data to VegBank, the first step is to model your data,
whatever its current format, against the loader table format, and then
transform the data into that format. Modeling your data means
identifying how each piece of information in your original dataset (like
species names, plot locations, survey dates) corresponds to specific
fields in the VegBank loader table format. There are a total of 12
loader tables that can be used for plot observation data, though not all
are required for data ingest. In this section, the loader tables and
their fields will be described, in the order in which it is recommended
to prepare them.

Each loader table you create is an R `data.frame` that is eventually
passed to a VegBank upload function. In the documentation below,
interactive tables display the allowed `field` (column names), whether
the field is required, best practice, commonly used, or sometimes used,
and a description of the field.

Several fields act as codes and serve as primary and foreign keys that link loader tables together. Codes that begin with `user_` are supplied by the data contributor. Codes that begin with `vb_` are created by the database upon data upload.
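
As a small illustration of how these codes tie tables together, the sketch below builds two toy loader tables and checks that every `user_pj_code` used in a plot observations table is defined in a projects table. The data values and the `project_name` column are made up for illustration; the field tables in each section below list the columns that are actually allowed.

```{r, eval = FALSE}
# Toy loader tables; values and the project_name column are illustrative only
projects <- data.frame(
  user_pj_code = "MOJA",
  project_name = "Mojave Desert Vegetation Surveys"
)

plot_observations <- data.frame(
  user_pl_code = c("MOJA_0214", "MOJA_0215"),
  user_ob_code = c("MOJA_0214", "MOJA_0215"),
  user_pj_code = c("MOJA", "MOJA")
)

# Foreign key values that have no matching primary key in projects
missing_projects <- setdiff(plot_observations$user_pj_code, projects$user_pj_code)
length(missing_projects) == 0  # TRUE when every project code resolves
```

The same `setdiff()` pattern applies to any pair of tables linked by a `user_` code.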

### Projects

This table stores information about a project established to collect
vegetation plot data. The `user_pj_code` is the project code primary
key, and is found as a foreign key in several other tables. An example
project code might be `MOJA` with project name "Mojave Desert Vegetation
Surveys."

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "projects") %>% 
  select(-loader_table) %>% 
  datatable()
```

### Parties

The Parties loader table is used to upload new parties (people) associated with plots, projects, taxa, and classifications. The primary key is `user_py_code`, which is used as a foreign key in the Contributors loader table. Once the parties are uploaded, VegBank creates a `vb_py_code` for each party, which can be used in the Contributors table.

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "parties") %>% 
  select(-loader_table) %>% 
  datatable()
```

### Contributors

The contributors loader table consists mostly of codes. It is closely linked to
both the Projects table and the Parties table, and is used to link parties (people) with their contributions to plots, projects, taxa, and classifications.

`user_py_code` is a foreign key to the `parties` table, so any values present in this field must be present there as well. Optionally, instead of `user_py_code`, `vb_py_code` can be used if the party is already in VegBank. Exactly one of `user_py_code` or `vb_py_code` must be present in each row; having both in the same row is disallowed.

`vb_ar_code` is the VegBank role code - a code in the format `ar.{nn}`. A table of allowed values and their meanings is listed below the loader table variables.

Finally, `contributor_type` indicates whether the contributor is linked to an Observation, Project, or Classification. The `record_identifier` value depends on which of these three types the contributor is associated with. For example, to associate a contributor with the project whose `user_pj_code` is `MOJA`, set `record_identifier` to `MOJA` and `contributor_type` to Project. This also transitively associates the contributor with all plots in that project. To associate a contributor with only a particular observation, such as the one with `user_ob_code` `MOJA_0214`, set `record_identifier` to that observation code and `contributor_type` to Observation.


```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "contributors") %>% 
  select(-loader_table) %>% 
  datatable()
```

```{r, echo = FALSE}

roles <- dplyr::tribble(
  ~ar_code, ~role_name,
  "ar.16",  "Author",
  "ar.17",  "Contact",
  "ar.18",  "PI",
  "ar.19",  "Data Manager",
  "ar.34",  "Classifier",
  "ar.36",  "Plot author",
  "ar.38",  "Co-PI",
  "ar.39",  "Computer (automated)",
  "ar.40",  "Consultant",
  "ar.43",  "Field assistant",
  "ar.44",  "Guide",
  "ar.45",  "Land owner",
  "ar.46",  "Not specified",
  "ar.47",  "Not specified/Unknown",
  "ar.48",  "Passive observer",
  "ar.50",  "Plot contributor",
  "ar.51",  "Publication author",
  "ar.53",  "Research advisor",
  "ar.54",  "System manager",
  "ar.55",  "Taxonomist",
  "ar.56",  "Data aggregator"
) 

datatable(roles)

```
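
To make the linking rules concrete, here is a hedged sketch of a contributors table with one project-level and one observation-level contributor, followed by a check that each row carries exactly one of `user_py_code` or `vb_py_code`. The party codes `PY_001` and `PY_002` are hypothetical, and representing an unused code as `NA` is an assumption; the role codes come from the table above.

```{r, eval = FALSE}
contributors <- data.frame(
  user_py_code      = c("PY_001", "PY_002"),           # hypothetical party codes
  vb_py_code        = c(NA_character_, NA_character_), # assumed NA when unused
  vb_ar_code        = c("ar.18", "ar.43"),             # PI, Field assistant
  contributor_type  = c("Project", "Observation"),
  record_identifier = c("MOJA", "MOJA_0214")
)

# Exactly one of the two party codes must be filled in on each row
has_user <- !is.na(contributors$user_py_code)
has_vb   <- !is.na(contributors$vb_py_code)
all(xor(has_user, has_vb))
```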

### Plot Observations

The plot observations loader table contains all data that is consistent across a plot. This includes information on the plot name, location, physical features, non-vegetation cover, etc. This table has many optional fields that may or may not be applicable to your project.

Similar to the pattern described for contributors, exactly one of `vb_pl_code` or `user_pl_code` is required in each row. Use `vb_pl_code` only when the intention is to add a new observation record to a plot that already exists in VegBank. These codes are used as foreign keys in various other tables. `author_plot_code` is often the same as these codes (e.g., `MOJA_0214`), but it can differ for valid reasons. `author_plot_code` is prominently displayed in the VegBank user interface as the plot identifier.

`user_ob_code` is the primary key for an observation on a plot, and may be the same as the plot code if there is only one observation of each plot. If there are multiple observations on the same plot, however, `user_ob_code` must be unique for each observation. 

`user_pj_code` or `vb_pj_code` can link a plot and its observation back to a project. Values in these fields must be present either in the `projects` loader table (for `user_pj_code`) or in VegBank if the project is already uploaded (for `vb_pj_code`).

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "plot_observations") %>% 
  select(-loader_table) %>% 
  datatable()
```
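
The sketch below illustrates the key rules for a resampled plot: two observations of the same plot share a `user_pl_code` but carry distinct `user_ob_code` values. The year suffix is just one possible naming scheme, and repeating the plot code on each observation row is an assumption about how repeat visits are laid out; all values are illustrative.

```{r, eval = FALSE}
# Two visits to plot MOJA_0214 and one visit to MOJA_0215 (illustrative values)
plot_observations <- data.frame(
  user_pl_code     = c("MOJA_0214", "MOJA_0214", "MOJA_0215"),
  user_ob_code     = c("MOJA_0214_2019", "MOJA_0214_2023", "MOJA_0215_2019"),
  author_plot_code = c("MOJA_0214", "MOJA_0214", "MOJA_0215"),
  user_pj_code     = "MOJA"
)

# user_ob_code is the primary key, so it must not repeat
anyDuplicated(plot_observations$user_ob_code) == 0
```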

### Community Classifications

The community classifications loader table contains the community classification of an observation. 

The primary key is `user_cl_code`. The foreign key `user_ob_code` corresponds to the key present in the plot observations loader table, and is required. All values in this field must also be present in the plot observations table.

`vb_cc_code` is the VegBank community concept identifier, a required field, the value of which must already be present in VegBank. To retrieve a list of possible `vb_cc_code` values, use the `vegbankr` function `vb_get_community_concepts`.

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "community_classifications") %>% 
  select(-loader_table) %>% 
  datatable()
```

### Strata Cover

This loader table contains data from a plot observation of the plant names, cover, and strata in a given plot. The `user_ob_code` is a required foreign key that links the plant to a plot observation. All values in this field must be present in the plot observations loader table. `user_to_code` is a key that is unique for each combination of `user_ob_code`, plant name, and strata in this table. `user_tm_code` is a key that is unique for each combination of `user_ob_code` and plant name - so `user_tm_code` may be repeated in this table if a plant exists in multiple strata in a plot observation. `user_sr_code` is a foreign key that corresponds to the `strata` loader table, described below.

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "strata_cover_data") %>% 
  select(-loader_table) %>% 
  datatable()
```
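
One possible way to derive `user_tm_code` and `user_to_code` from existing columns is sketched below in base R. The `plant_name` column name, the stratum labels, and the `TM_`/`TO_` prefixes are assumptions made for illustration; any scheme works as long as the uniqueness rules described above hold.

```{r, eval = FALSE}
# Illustrative cover records; plant_name is a hypothetical column name
strata_cover <- data.frame(
  user_ob_code = c("MOJA_0214", "MOJA_0214", "MOJA_0214"),
  plant_name   = c("Larrea tridentata", "Larrea tridentata", "Ambrosia dumosa"),
  user_sr_code = c("shrub", "herb", "shrub")
)

# user_tm_code: one value per observation x plant combination
tm_key <- paste(strata_cover$user_ob_code, strata_cover$plant_name, sep = "|")
strata_cover$user_tm_code <- paste0("TM_", match(tm_key, unique(tm_key)))

# user_to_code: one value per observation x plant x stratum combination
to_key <- paste(tm_key, strata_cover$user_sr_code, sep = "|")
strata_cover$user_to_code <- paste0("TO_", match(to_key, unique(to_key)))

strata_cover[, c("user_tm_code", "user_to_code")]
```

Here the first two rows share a `user_tm_code` because they describe the same plant in the same observation, while every row gets its own `user_to_code`.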

### Strata Methods

Each stratum must also have a VegBank strata method associated with it, represented by `vb_sy_code`. To see the available codes, run the code snippet below the table. The strata method is linked to an observation via the required `user_ob_code` field. `user_sr_code` is a required identifier that provides a key to each unique strata observation in the strata cover table.

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "strata") %>% 
  select(-loader_table) %>% 
  datatable()
```

```{r, eval = FALSE}
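library(vegbankr)
library(dplyr)  # %>%, mutate(), rename()
library(tidyr)  # unnest()
# Expand the nested stratum types so each stratum gets its own row
vb_strata <- vb_get_stratum_methods(with_nested = TRUE) %>% 
  unnest(stratum_types) %>% 
  mutate(stratum_index = tolower(stratum_index)) %>% 
  rename(Stratum = stratum_index)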
```


### Taxon Interpretations

Taxon interpretations associate the plants in the Strata Cover table with an existing VegBank plant concept code. To get a list of existing plant concepts, use the `vb_get_plant_concepts` function. Note that a person with a role is also required for this table, so one of `user_py_code` (present in the `parties` loader table) or `vb_py_code` (an existing VegBank party) must exist, along with their role, in `vb_ar_code`.

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "taxon_interpretations") %>% 
  select(-loader_table) %>% 
  datatable()
```

### Disturbances

The disturbances loader table contains information about disturbances observed
at a plot, such as fire, grazing, logging, or other events that have impacted
the vegetation. The primary key is `user_do_code`. The foreign key `user_ob_code` links each
disturbance record to a specific plot observation and is required. `type` describes the kind of disturbance and is a required field. This field is a closed list in VegBank, with options listed below.

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "disturbances") %>% 
  select(-loader_table) %>% 
  datatable()
```

Below are the allowed disturbance types:

```{r, echo = FALSE}
disturbance_types <- data.frame(
  type = c(
    "Animal, general",
    "Grazing, domestic stock",
    "Grazing, native ungulates",
    "Herbivory, invertebrate",
    "Herbivory, vertebrates",
    "Human, general",
    "Cultivation",
    "Fire suppression",
    "Herbicide or chemical",
    "Mowing",
    "Roads and vehicular traffic",
    "Timber harvest, general",
    "Timber harvest, clearcut",
    "Timber harvest, selective",
    "Trampling and trails",
    "Natural, general",
    "Avalanche and snow",
    "Cryoturbation",
    "Erosion",
    "Floods",
    "Hydrologic alteration",
    "Ice",
    "Mass movements (landslides)",
    "Plant disease",
    "Salt spray",
    "Tides",
    "Wind, chronic",
    "Wind event",
    "Fire, general",
    "Fire, canopy",
    "Fire, ground",
    "Other disturbances",
    "unknown"
  )
)

datatable(disturbance_types)
```
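
Because `type` is a closed list, it can help to compare your values against the allowed types before validating. The sketch below uses only a subset of the list above and made-up disturbance records. It assumes values must match exactly, including case, which is consistent with the case-sensitive validation described later in this vignette.

```{r, eval = FALSE}
# A subset of the allowed disturbance types, copied from the list above
allowed_types <- c("Fire, general", "Grazing, domestic stock", "Floods", "unknown")

# Illustrative records; two of the type values differ from the list in case
disturbances <- data.frame(
  user_do_code = c("DO001", "DO002", "DO003"),
  user_ob_code = c("MOJA_0214", "MOJA_0214", "MOJA_0215"),
  type         = c("Fire, general", "floods", "Unknown")
)

# Values that do not match the closed list exactly
setdiff(disturbances$type, allowed_types)
```

In this toy example, `"floods"` and `"Unknown"` are flagged because the allowed spellings are `"Floods"` and `"unknown"`.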

### Soils

The Soils loader table is used to describe soils collected from a plot. This includes information on soil horizons, texture, color, depth, and chemical properties.

The primary key is `user_so_code`, which can be a simple row number. The foreign key `user_ob_code`, when used, links each soil record to a specific plot observation.
`horizon` is a required field that identifies the soil horizon being described.

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "soils") %>% 
  select(-loader_table) %>% 
  datatable()
```
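
As a minimal sketch of the row-number approach to `user_so_code`, the example below assigns the key after the soil records are assembled. The horizon labels and observation code are illustrative values only.

```{r, eval = FALSE}
# Illustrative soil records: two horizons on one plot observation
soils <- data.frame(
  user_ob_code = c("MOJA_0214", "MOJA_0214"),
  horizon      = c("A", "B")
)

# Use the row number as the primary key
soils$user_so_code <- seq_len(nrow(soils))
```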

### Stem Data

The Stem Data loader table is used to describe individual plant stems measured at a plot. 
This table supports detailed tree/shrub demographic data collection, including
stem diameter, height, location, and health status.

The primary key is `user_sc_code`, which is the stem count identifier. The
required foreign key `user_tm_code` links each stem record to a specific taxon
observation in the strata cover table, associating stems with their species
identification.

```{r, echo = FALSE}
docs %>% 
  filter(loader_table == "stem_data") %>% 
  select(-loader_table) %>% 
  datatable()
```

## Data Validation

Validating that your data conform to the VegBank schema is an important step toward a successful data upload. Validation also occurs before ingest into the database, and the API performs some validation as well, but you will often benefit from getting easy-to-read validation results before even attempting an upload.

The `vb_validate` family of functions checks for the presence of required fields, checks fields that must be unique, and cross-checks required foreign keys across tables.

To validate your plot observations data before submitting, you pass all of your loader tables as `data.frames` to the appropriate arguments in `vb_validate_plot_observations`.

```{r, eval = FALSE}
vb_validate_plot_observations(plot_observations = plots,
                              projects = projects,
                              parties = party,
                              contributors = contrib,
                              disturbances = dist,
                              community_classifications = comm,
                              strata_cover_data = strata_cover,
                              taxon_interpretations = tax,
                              strata = strata)
```

If there are issues in the data, the validator will return output that looks like this:

```
✖ disturbances.user_ob_code values not found in plot_observations.user_ob_code: DO001, DO002, DO003
ℹ soils table not provided - skipping validation
ℹ stem_data table not provided - skipping validation
ℹ references table not provided - skipping validation
```

In this example, `user_ob_code` in the `disturbances` table contains values that are not found in `user_ob_code` in `plot_observations`. As described in the disturbances section of the loader tables requirements above, all values in this foreign key must be present in `user_ob_code` in the plot observations loader table.

To correct this mistake, return to the code that performed the data modeling to determine the cause. It could be that the wrong column in the original data was mapped to one of the two `user_ob_code` fields, or that capitalization or whitespace differences prevent the same code from being recognized as equivalent across tables. Note that validation is case- and whitespace-sensitive across all checks.
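
One way to tell a formatting problem from a genuinely wrong mapping is to compare the codes again after normalizing case and whitespace. The sketch below uses made-up code vectors and a hypothetical `normalize()` helper; it is a diagnostic aid, not part of `vegbankr`.

```{r, eval = FALSE}
# Illustrative codes: one has trailing whitespace, one differs in case
plot_codes <- c("MOJA_0214", "MOJA_0215")
dist_codes <- c("MOJA_0214 ", "moja_0215")

# Exact comparison: neither code matches as written
unmatched <- setdiff(dist_codes, plot_codes)

# Normalize both sides to see whether formatting alone explains the mismatch
normalize <- function(x) toupper(trimws(x))
formatting_only <- normalize(unmatched) %in% normalize(plot_codes)

data.frame(code = unmatched, formatting_only = formatting_only)
```

Codes flagged as `formatting_only` can usually be fixed by cleaning the source column; codes that still fail to match after normalization point to a mapping error.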


## Data Upload

Once data are validated, you are ready to try to upload data. First you need to point your R session to the correct VegBank instance and set a token.

To do a test upload, point to the test instance.

```{r, eval = FALSE}
vb_set_base_url("https://api-dev.vegbank.org")
```

Next, get a token by logging into [http://api-dev.vegbank.org/login](http://api-dev.vegbank.org/login). After logging in, you should see a JSON document that contains an access token and a refresh token. The easiest way to get this information into R is to select the "Raw Data" option in your browser (if available) to get the plain-text JSON. Copy it to your clipboard and paste it into R, saving it to the variable `token`. Enclose the string in **single quotes**: the JSON itself contains double quotes, so wrapping it in double quotes will give a syntax error.

```{r, eval = FALSE}
token <- '{"message":"Authorization successful","token":{"access_token":".......","refresh_token":"......."}}'
```
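
If you prefer not to rely on single quotes, R 4.0.0 and later also support raw string literals, which leave embedded double quotes untouched. The sketch below uses a shortened placeholder JSON string with the token values elided, and shows that both forms produce the same string.

```{r, eval = FALSE}
# Placeholder JSON with the token values elided
single_quoted <- '{"token":{"access_token":"...","refresh_token":"..."}}'
raw_string    <- r"({"token":{"access_token":"...","refresh_token":"..."}})"

identical(single_quoted, raw_string)
```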

Then set your token using:

```{r, eval = FALSE}
vb_set_token(tokens = token)
```

From here, uploading is easy using the `vb_upload_plot_observations()` function. This function takes as arguments data frames for each of the loader tables described above. Note that not all loader tables are required.

To run the function, pass the same `data.frames` used in the validation section as arguments, as shown below. Note the `dry_run` argument, which provides one final round of validation before inserting the data into VegBank. A dry run lets the API go through all of the steps except the very last insert call. If the dry run is successful, the output will report that rows were inserted (but they weren't!). If it is not successful, the function will error, and more work is needed to ensure the loader tables conform to the required schema.

```{r eval = FALSE}
vb_upload_plot_observations(plot_observations = plots,
                            projects = projects,
                            parties = party,
                            contributors = contrib,
                            disturbances = dist,
                            community_classifications = comm,
                            strata_cover_data = strata_cover,
                            taxon_interpretations = tax,
                            strata = strata,
                            soils = soils,
                            dry_run = TRUE)
```

Once you get a successful dry run, set `dry_run` to `FALSE` to complete your upload!
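
The dry-run-then-upload sequence can also be wrapped in a small helper so the real upload only happens after a clean dry run. The `upload_after_dry_run()` function below is a hypothetical convenience wrapper, not part of `vegbankr`. It relies on the behavior described above, namely that a failed dry run raises an error, and it takes the upload function as an argument so the sketch does not depend on a live connection.

```{r, eval = FALSE}
# Hypothetical helper: run a dry run first, and only upload if it succeeds.
# upload_fn would normally be vb_upload_plot_observations; tables is a named
# list of loader tables matching that function's arguments.
upload_after_dry_run <- function(upload_fn, tables) {
  dry_ok <- tryCatch({
    do.call(upload_fn, c(tables, list(dry_run = TRUE)))
    TRUE
  }, error = function(e) {
    message("Dry run failed: ", conditionMessage(e))
    FALSE
  })
  if (!dry_ok) return(invisible(NULL))
  do.call(upload_fn, c(tables, list(dry_run = FALSE)))
}

# Usage, assuming the loader tables from the validation section:
# upload_after_dry_run(
#   vb_upload_plot_observations,
#   list(plot_observations = plots, projects = projects, parties = party)
# )
```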