Working with PubChemR to Access Chemical Data

Selcuk Korkmaz

2023-11-23

1. Introduction to the PubChemR Package

Overview

PubChemR is an R package designed to facilitate seamless interaction with the PubChem database, a comprehensive resource for chemical information. This package provides a user-friendly interface to query and retrieve data from PubChem, which includes detailed information on chemical compounds, substances, and biological assays.

Purpose

The primary purpose of PubChemR is to enable researchers, chemists, and data scientists to access and utilize the wealth of chemical information available in the PubChem database directly from their R environment. This integration allows for efficient data retrieval, manipulation, and analysis within a familiar R framework, enhancing the workflow for chemical data analysis.

Main Features

Data Retrieval: PubChemR offers functions to query and retrieve detailed information about chemical compounds, substances, and assays. This includes structural information, properties, biological activities, and more.

Data Processing: The package includes utilities to process and transform the retrieved data into user-friendly formats, such as data frames or lists, suitable for further analysis in R.

Integration with R Environment: The package is designed to integrate smoothly with the R ecosystem, allowing users to leverage other R packages and tools for data analysis, visualization, and reporting.

Prerequisites and Dependencies

R Environment: PubChemR is an R package and requires an R environment to run. It is recommended to use the latest version of R for optimal performance.

Required R Packages: PubChemR may depend on other R packages for certain functionalities, such as httr for HTTP requests, jsonlite for JSON processing, and dplyr or tidyverse for data manipulation.

API Keys: While PubChemR primarily interacts with public APIs that do not require authentication, users should check if any specific functions or extended usage require an API key from PubChem or related services.

2. Installation

PubChemR can be installed from CRAN or, for the development version, directly from its GitHub repository. Both routes use standard R package installation commands.

# Install from CRAN
install.packages("PubChemR")

# Or, install the development version from GitHub
install.packages("devtools")
devtools::install_github("selcukorkmaz/PubChemR")

3. Setup

Before diving into the functionalities of the PubChemR package, it’s essential to set up your R environment by loading PubChemR along with any other necessary libraries. This step ensures that all the functions and features of the package are readily available for use.

Loading the PubChemR Package

First, you need to load the PubChemR package into your R session. If you haven’t already installed the package, refer to the instructions in the Installation section.

# Load the PubChemR package
library(PubChemR)

Verifying the Setup

After loading the package and any additional libraries, it’s good practice to verify that everything is set up correctly. You can do this by calling a simple function from the package to check if it executes without errors.

# Example function call to verify setup
example_result <- pubchem_summary("aspirin", "name")
#> Successfully retrieved compound data.
#> Successfully retrieved CIDs.
#> Successfully retrieved substance data.
#> Successfully retrieved SIDs
example_result$CIDs
#> # A tibble: 1 × 2
#>   Compound CID  
#>   <chr>    <chr>
#> 1 aspirin  2244

This setup section ensures that users have everything they need to start working with the PubChemR package effectively. The next sections of the vignette will delve into specific functionalities and use cases of the package.

4. Overview of Functions

The PubChemR package offers a suite of functions designed to interact with the PubChem database, allowing users to retrieve and manipulate chemical data efficiently. Below is an overview of the main functions provided by the package, along with brief descriptions of their purposes:

1. pubchem_summary()

Purpose

The pubchem_summary function is designed to fetch and summarize various types of data from the PubChem database. It can retrieve information about compounds, substances, assays, and their associated properties, synonyms, and structural data files (SDF).

Parameters

Functionality

1. Data Retrieval: Based on the namespace and type, the function fetches data from PubChem. It can retrieve data about compounds, substances, and assays.

2. Error Handling: The function uses tryCatch to handle any errors during data retrieval, providing informative messages about the success or failure of each operation.

3. Synonyms and Properties: If requested, the function can also fetch synonyms and specified properties of the identifier.

4. SDF File Download: If include_sdf is TRUE, the function downloads the SDF file for the compound and saves it either in the specified path or the current working directory.
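
The error-handling step described above can be sketched in base R. This is only an illustration of the tryCatch pattern, not the package's actual source: the wrapper name safe_retrieve and the stand-in fetch functions are hypothetical, and no request is sent to PubChem.

```r
# Hypothetical wrapper illustrating the tryCatch pattern; not PubChemR's own code.
safe_retrieve <- function(fetch_fun, label) {
  tryCatch(
    {
      result <- fetch_fun()
      message("Successfully retrieved ", label, ".")
      result
    },
    error = function(e) {
      # Report the failure and return NULL instead of stopping the whole call
      message("Failed to retrieve ", label, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in fetchers so the sketch runs without network access
ok  <- safe_retrieve(function() list(CID = 2244), "compound data")
bad <- safe_retrieve(function() stop("request failed"), "assay data")

str(ok)
is.null(bad)
```

Returning NULL on failure lets a summary function keep collecting the remaining pieces (for example synonyms or properties) even when one query fails.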

Usage

The function is used to aggregate various types of information from PubChem for a given identifier. It simplifies the process of fetching detailed data from PubChem by wrapping multiple queries into a single function call.

Example

pubchemSummary <- pubchem_summary(
  identifier = "aspirin",
  namespace = 'name',
  type = c("compound", "substance", "assay"),
  properties = "IsomericSMILES",
  include_synonyms = TRUE,
  include_sdf = FALSE,
  sdf_path = NULL
)
#> Successfully retrieved compound data.
#> Successfully retrieved CIDs.
#> Successfully retrieved substance data.
#> Successfully retrieved SIDs
#> Successfully retrieved synonyms data.
#> Successfully retrieved properties data.
pubchemSummary$CIDs
#> # A tibble: 1 × 2
#>   Compound CID  
#>   <chr>    <chr>
#> 1 aspirin  2244

This example retrieves data for aspirin, including compound details and synonyms, and stores the results in the variable pubchemSummary. To save the SDF file, set include_sdf = TRUE, then specify the folder for the downloaded file via sdf_path and the file name via sdf_file_name. If both sdf_path and sdf_file_name are NULL, the downloaded file is saved into a temporary folder with a file name taken from the identifier argument. See below for an example.

# Save downloaded SDF file into a temporary folder.
pubchemSummary <- pubchem_summary(
  identifier = "aspirin",
  namespace = 'name',
  type = c("compound", "substance", "assay"),
  properties = "IsomericSMILES",
  include_synonyms = TRUE,
  include_sdf = TRUE, 
  sdf_path = NULL, 
  sdf_file_name = "Aspirin"
)
pubchemSummary$CIDs

Notes

2. get_aids()

Purpose

The get_aids function in the PubChemR package is designed to retrieve Assay IDs (AIDs) from the PubChem database for a given set of identifiers. This function is particularly useful for researchers and scientists who need to access assay information related to specific compounds, substances, or other entities in PubChem.

Parameters

Functionality

1. Data Retrieval: The function sends a request to PubChem to fetch AIDs based on the provided identifiers and parameters.

2. Response Parsing: It parses the JSON response to extract AIDs.

3. Data Formatting: Depending on the as_data_frame flag, the function either formats the results into a tibble (data frame) or returns them as a list.

4. Error Handling: The function includes error handling to manage and report any issues during the data retrieval process.
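
The parsing and formatting steps can be illustrated with jsonlite on a hard-coded response. This is a minimal sketch rather than the package's implementation: the JSON text is an assumed shape for a PubChem AID listing, trimmed to the first three AIDs shown for aspirin in the example below, and the variable names are illustrative.

```r
library(jsonlite)

# Assumed response shape for an AID listing; hard-coded so no request is sent.
response_text <- '{
  "InformationList": {
    "Information": [
      {"CID": 2244, "AID": [1, 3, 9]}
    ]
  }
}'

# Keep the nested list structure so each record can be handled explicitly
parsed <- fromJSON(response_text, simplifyVector = FALSE)
info <- parsed$InformationList$Information

# Data frame form: one row per (CID, AID) pair
rows <- lapply(info, function(x) {
  data.frame(CID = x$CID, AID = unlist(x$AID))
})
aids_df <- do.call(rbind, rows)

# List form: one vector of AIDs per record
aids_list <- lapply(info, function(x) unlist(x$AID))

aids_df
```

The two results correspond to the choice that the as_data_frame flag makes between a tabular result and a plain list.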

Usage

This function is used to obtain assay information from PubChem, which is essential for various biochemical and pharmacological research purposes. It simplifies the process of querying PubChem for assay data.

Example

getAIDs <- get_aids(
  identifier = "aspirin",
  namespace = "name"
)
head(getAIDs)
#> # A tibble: 6 × 3
#>   Compound   CID   AID
#>   <chr>    <dbl> <dbl>
#> 1 aspirin   2244     1
#> 2 aspirin   2244     3
#> 3 aspirin   2244     9
#> 4 aspirin   2244    15
#> 5 aspirin   2244    19
#> 6 aspirin   2244    21

In this example, the function retrieves AIDs associated with “aspirin” from PubChem and returns them in a data frame format.

Notes

3. get_cids()

Purpose

The get_cids function is designed to interact with the PubChem database to retrieve Compound IDs (CIDs) based on a given set of identifiers. This function is particularly useful for users who need to convert various types of chemical identifiers into CIDs, which are unique numerical identifiers assigned to chemical compounds by PubChem.

Parameters

* identifier: A vector of identifiers. These can be positive integers (such as cid, sid, aid) or strings (such as source, inchikey, formula). The function is flexible enough to handle different types of identifiers, including names, SMILES strings, and more.

Functionality

The function returns a tibble (a modern version of a data frame in R), where each row corresponds to an identifier and its associated CID(s). The tibble contains the columns ‘Compound’ and ‘CID’, making the data easy to understand and manipulate.

Usage

getCIDs <- get_cids(
  identifier = "aspirin",
  namespace = "name"
)
getCIDs
#> # A tibble: 1 × 2
#>   Compound CID  
#>   <chr>    <chr>
#> 1 aspirin  2244

Additional Notes

* The function uses tryCatch to handle any errors during the API request and parsing of the response.

* It employs various tidyverse functions like unnest_wider, unnest_longer, and mutate to reshape the data into a user-friendly format.

* The function assumes the existence of a get_json function in the package, which is responsible for making the actual API request and returning the JSON response.

* This function is a key component of the PubChemR package, enabling users to seamlessly convert various chemical identifiers into PubChem CIDs, which are essential for further chemical data exploration and analysis.
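
The reshaping mentioned in the notes can be sketched as follows. This is an illustration of how unnest_wider, unnest_longer, and mutate flatten nested records into a Compound/CID table, not the package's actual code. The nested input is built by hand, and the second record (placeholder_name, with CIDs 1 and 2) is made up purely to show an identifier that maps to more than one CID.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hand-built nested input; "placeholder_name" and its CIDs are made-up values.
nested <- tibble(
  Compound = c("aspirin", "placeholder_name"),
  info = list(
    list(CID = 2244L),
    list(CID = c(1L, 2L))
  )
)

cids_tbl <- nested %>%
  unnest_wider(info) %>%            # spread each record's fields into columns
  unnest_longer(CID) %>%            # one row per CID
  mutate(CID = as.character(CID))   # character CIDs, as in the output above

cids_tbl
```

Each identifier ends up with one row per associated CID, which matches the two-column layout returned by get_cids.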