An Introduction to polarisR

Introduction

Welcome to polarisR.

This document is a guide to using polarisR. It walks through each tab of the application and explains the features available for exploring and understanding your high-dimensional data.

What does polarisR stand for?

polarisR stands for Projective Output Layouts And Reduced Interactive Surfaces in R.

Application Overview

The polarisR interface is organized into five main tabs, each designed for specific aspects of your data analysis workflow:

Dataset Preview - Load and explore your data, select columns, and manage datasets
Non-linear dimension reduction (NLDR) - Apply NLDR methods (t-SNE/UMAP) with parameter configuration and visualization
Dynamic Tour - Explore high-dimensional structure through animated projections
Diagnosing - Assess embedding quality using quantitative methods
2-D Layout Comparison - Compare different NLDR configurations and results

Each tab builds upon the previous ones, creating a comprehensive workflow from data loading to advanced comparative analysis. Let’s explore each tab in detail.

Dataset Preview Tab

The Dataset Preview tab is the starting point of your analysis. Here, you can load your data, select relevant columns, and get a quick overview of your dataset.

Figure: The Dataset Preview tab.

Features:

Upload Dataset: You can upload your own dataset in CSV format using the “Upload Dataset” button. The application will automatically validate the file and handle potential errors.
Example Datasets: polarisR comes with two pre-loaded datasets: four_clusters and pdfsense. You can select either of them to explore the application’s features without needing your own data.

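# Load the bundled datasets directly (works without attaching the package)
data(four_clusters, package = "polarisR")
data(pdfsense, package = "polarisR")

# Open the help page for each dataset
help("four_clusters", package = "polarisR")
help("pdfsense", package = "polarisR")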

Dataset Descriptions:

four_clusters: A synthetic dataset with four distinct clusters, useful for checking how well a layout separates known groups
pdfsense: A high-energy physics dataset representing parton distribution function fits

Additional Features:

Column Selection: After loading a dataset, you can choose which columns to include in the NLDR analysis. By default, all columns are selected. You can manually select or deselect columns and apply the changes.
Data Preview: A table displays the first few rows of your dataset, allowing you to inspect the data and ensure it has been loaded correctly.
Dataset Information: This section provides a summary of your dataset, including the number of rows, columns, and the types of columns (numeric or categorical).
NLDR Datasets: As you run different NLDR analyses, the results will be stored and listed here. You can easily switch between different results to compare them.
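
The upload, preview, summary, and column-selection steps described above can be approximated in plain R. The sketch below is illustrative only and is not polarisR's internal code: the helper name load_dataset and the specific validation rules are assumptions, and the built-in iris data is written to a temporary CSV as a stand-in for an uploaded file.

```r
# Stand-in for an uploaded file: write a small CSV to a temporary path.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Hypothetical helper (not part of polarisR): read a CSV and stop with a
# clear message if the file cannot be used for NLDR.
load_dataset <- function(path) {
  dat <- tryCatch(
    read.csv(path, stringsAsFactors = FALSE),
    error = function(e) {
      stop("Could not read CSV: ", conditionMessage(e), call. = FALSE)
    }
  )
  if (nrow(dat) == 0L) {
    stop("The file contains no rows.", call. = FALSE)
  }
  if (!any(vapply(dat, is.numeric, logical(1)))) {
    stop("The file contains no numeric columns.", call. = FALSE)
  }
  dat
}

dat <- load_dataset(csv_path)

# Data preview: the first few rows.
head(dat)

# Dataset information: row count, column count, and column types.
col_type <- ifelse(vapply(dat, is.numeric, logical(1)), "numeric", "categorical")
info <- list(n_rows = nrow(dat), n_cols = ncol(dat), col_types = col_type)
str(info)

# Column selection: keep a chosen subset of columns for the NLDR step.
selected_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
nldr_input <- dat[, selected_cols, drop = FALSE]
```

Here nldr_input holds only the numeric measurements, which is the kind of table you would typically pass on to t-SNE or UMAP.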

Non-linear dimension reduction (NLDR) Tab

The Non-linear dimension reduction (NLDR) tab is where the main NLDR analysis happens. You can choose between t-SNE and UMAP, configure their parameters, and visualize the results.

Figure: The Dataset Visualization tab.

Features:

Choose Method: Select either t-SNE or UMAP as your NLDR method.
t-SNE Parameters:
- Perplexity: Adjust the perplexity value, which sets the effective number of nearest neighbors considered for each point.
- Max Iterations: Set the maximum number of iterations for the t-SNE algorithm.
- Auto-adjust perplexity: Let the application automatically choose a suitable perplexity value based on your data. The formula used is: perplexity = max(5, min(30, floor(n_samples / 3) - 1)), which ensures perplexity is between 5 and 30, and scales with dataset size to maintain effective neighborhood structure.
UMAP Parameters:
- Number of Neighbors: Control the size of the local neighborhood UMAP will use.
- Min. Distance: Set the minimum distance between embedded points.
Color Options: Choose a column from your dataset to color the points in the visualization. This is useful for identifying clusters or patterns.
Reproducibility Options: Set a random seed to ensure that your NLDR results are reproducible.
Run Visualization: Click this button to start the NLDR computation. The progress will be displayed, and the resulting visualization will be shown on the right.
Visualization Information: This panel displays the parameters used for the current visualization, making it easy to track your experiments.
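
The auto-adjust rule for perplexity quoted above is easy to check by hand. The sketch below transcribes the formula into a small function (the name auto_perplexity is an assumption, not a polarisR export) and evaluates it for a few dataset sizes. It also shows, with base R random numbers only, why fixing a seed makes a stochastic computation repeatable.

```r
# Auto-adjusted perplexity, transcribed from the formula quoted in the text.
# The function name is hypothetical; polarisR may implement this differently.
auto_perplexity <- function(n_samples) {
  max(5, min(30, floor(n_samples / 3) - 1))
}

n <- c(10, 30, 60, 93, 150, 1000)
perp <- vapply(n, auto_perplexity, numeric(1))
data.frame(n_samples = n, perplexity = perp)
# perplexity column: 5, 9, 19, 30, 30, 30

# Reproducibility: the same seed gives the same random draws, and therefore
# the same starting point for any algorithm that uses R's random number generator.
set.seed(123)
a <- rnorm(3)
set.seed(123)
b <- rnorm(3)
identical(a, b)
# TRUE
```

Note that the lower bound of 5 applies even to very small datasets. Some t-SNE implementations reject such a value (Rtsne, for example, stops when 3 * perplexity exceeds n_samples - 1), so a dataset with only a handful of rows may still need a manual setting. How polarisR handles that case is not described in this guide.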

Dynamic Tour Tab

The Dynamic Tour tab offers an interactive way to explore the high-dimensional space of your data. It provides a dynamic projection of the data, which can be viewed as a scatter plot, sage plot, or slice plot.

Figure: The Dynamic Tour tab.

Features:

Select Tour Display: Choose from three types of dynamic tours:
- Scatter: A standard scatter plot of the projected data.
- Sage: A scatter plot display that adjusts for the projected volume, defined in Laa et al. (2021).
- Slice: A scatter plot display that highlights points close to the projection plane, defined in Laa et al. (2020).
Tour Options:
- Show Axes: Toggle the visibility of the axes in the tour plot.
- Show Wireframe: Toggle the visibility of wireframe edges in the tour plot for enhanced structural visualization.
- Point Opacity (Alpha): Adjust the transparency of the points.
- Gamma (for Sage): Control the effective dimensionality parameter for the sage plot.
- Slice Relative Volume (for Slice): Adjust the thickness of the slice.
Enable Linked Brushing: When enabled, points you select in the NLDR plot are highlighted in the dynamic tour plot, and vice versa. This makes it easier to explore how the low-dimensional embedding relates to the original high-dimensional data.
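
To make the scatter and slice displays more concrete, the sketch below computes a single tour frame by hand in base R. It is a simplified illustration under stated assumptions, not the code polarisR uses to render tours: the data (iris), the random basis, and the slice half-thickness h are all stand-ins chosen for the example.

```r
set.seed(2024)

# n x p numeric data, centred and scaled so the data centre is the origin.
X <- scale(as.matrix(iris[, 1:4]))
p <- ncol(X)

# One tour frame: a random p x 2 orthonormal basis from a QR decomposition.
basis <- qr.Q(qr(matrix(rnorm(p * 2), nrow = p, ncol = 2)))

# Scatter display: project every point onto the 2-D plane spanned by the basis.
proj <- X %*% basis

# Slice display: orthogonal distance of each point from the projection plane
# (the plane is taken through the origin, i.e. the data centre).
resid <- X - proj %*% t(basis)
dist_to_plane <- sqrt(rowSums(resid^2))

# Hypothetical slice half-thickness; points closer than h count as "in the slice".
h <- 0.5
in_slice <- dist_to_plane < h
table(in_slice)
```

A call such as plot(proj, col = ifelse(in_slice, "black", "grey80"), pch = 16) would draw this frame with the in-slice points emphasized. A tour animates a smooth sequence of such bases rather than a single one.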

Diagnosing Tab

The Diagnosing tab provides tools to assess the selected NLDR layout. It uses the quollr package (https://github.com/JayaniLakshika/quollr) to analyze the layout quantitatively and helps you find the optimal binwidth for model fitting.

Features:

Binwidth Optimization: This feature automatically tests a range of binwidths for the quollr analysis and finds the optimal configuration based on the root mean square error (RMSE).
Run Quollr Analysis: After optimizing the binwidth, you can run the full quollr analysis to get a detailed assessment of your embedding.
Analysis Results: The results of the analysis are presented in several tabs:
- RMSE vs Binwidth: An interactive plot showing the RMSE for different binwidth values.
- Model Summary: A table with the detailed results of the binwidth optimization.
- Model Error: A visualization of the NLDR layout, with points colored according to the high-dimensional model error.
Configuration & Summary: This section displays the optimal binwidth configuration and a summary of the quollr model fit.
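
The idea of scanning binwidths and scoring each one by RMSE can be illustrated with a deliberately simplified stand-in. The sketch below is not quollr's algorithm or API: it uses square bins, a PCA layout in place of a t-SNE or UMAP layout, and a hypothetical helper bin_rmse that predicts each point by the high-dimensional mean of the points sharing its 2-D bin.

```r
# High-dimensional data and a stand-in 2-D layout (PCA is used here only so
# the example needs no extra packages).
X <- scale(as.matrix(iris[, 1:4]))
emb <- prcomp(X)$x[, 1:2]

# Hypothetical helper: RMSE between each point and the high-dimensional mean
# of the points that fall in the same square 2-D bin.
bin_rmse <- function(X, emb, binwidth) {
  bx <- floor((emb[, 1] - min(emb[, 1])) / binwidth)
  by <- floor((emb[, 2] - min(emb[, 2])) / binwidth)
  bin_id <- paste(bx, by, sep = "_")
  # Per-bin column means, expanded back to one row per point.
  fitted <- apply(X, 2, function(col) ave(col, bin_id))
  sqrt(mean(rowSums((X - fitted)^2)))
}

# Scan a range of candidate binwidths (values chosen for illustration).
binwidths <- seq(0.2, 2, by = 0.2)
rmse <- vapply(binwidths, function(b) bin_rmse(X, emb, b), numeric(1))
results <- data.frame(binwidth = binwidths, rmse = rmse)
results
```

Because finer bins follow the data more closely, RMSE in this simplified version generally falls as the binwidth shrinks, so the raw minimum is only a starting point for choosing a binwidth. The exact criterion polarisR and quollr apply is not spelled out in this guide.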

2-D Layout Comparison Tab

The 2-D Layout Comparison tab allows you to compare the results of different NLDR layouts. You can compare different methods (t-SNE vs. UMAP) or the same method with different hyper-parameters.

Features:

Choose Comparison Type:
- NLDR Settings Comparison: Compare the RMSE of different NLDR configurations. This is useful for finding the best method and hyper-parameters for your data.
- Side-by-Side Visualization: Display two NLDR plots next to each other for a direct visual comparison.
- Enable Linked Brushing: In the side-by-side visualization mode, you can enable linked brushing to select points in one plot and see the corresponding points highlighted in the other.
Dataset Selection: Select the NLDR results you want to compare from the list of stored results.
Run Comparison Plot: Generate the comparison plot based on your selection.
Best Configuration Summary: When comparing NLDR settings, this section will show you the best configuration found based on the RMSE.
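
Linked brushing between two layouts relies on both plots sharing the same row order, so a selection made in one can be looked up by row index in the other. The sketch below shows that idea in base R. It is an illustration under assumptions, not polarisR's implementation: the two layouts are stand-ins built from principal components, and brush_rows is a hypothetical helper that mimics a rectangular brush.

```r
# Two stand-in 2-D layouts of the same rows (in practice these would be two
# stored NLDR results for the same dataset).
X <- scale(as.matrix(iris[, 1:4]))
pcs <- prcomp(X)$x
layout_a <- pcs[, 1:2]
layout_b <- pcs[, 3:4]

# Hypothetical helper: row indices of the points inside a rectangular brush.
brush_rows <- function(layout, xlim, ylim) {
  inside <- layout[, 1] >= xlim[1] & layout[, 1] <= xlim[2] &
    layout[, 2] >= ylim[1] & layout[, 2] <= ylim[2]
  unname(which(inside))
}

# Brush the left quarter of layout A (by horizontal position).
selected <- brush_rows(
  layout_a,
  xlim = quantile(layout_a[, 1], c(0, 0.25)),
  ylim = range(layout_a[, 2])
)

# The same row indices identify the points to highlight in layout B.
highlight_b <- seq_len(nrow(layout_b)) %in% selected
sum(highlight_b) == length(selected)
```

The logical vector highlight_b can then drive point color or size in the second plot, which is the effect linked brushing produces interactively.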