# Research Computing Guidance ## Budget Guidance for HPC whether starting a new research project or continuing one at Mines this document will help you find the resources to get your High-Performance Computing workload done. Some useful free compute cycles through programs such as [RMACC via Alpine](https://curc.readthedocs.io/en/latest/access/rmacc.html) or [NSF ACCESS Program](https://access-ci.org/), will help you get some exploratory compute cycles to awarded a grant, and might be used through out the project. Mines HPC systems on-premise are operated under a Operational Expenditure (OpEX) cost model, where researchers pay for the CPU/hours cycles used. This document aims to help write future research grants that use HPC resources at Mines. Schedule a consultation appointment at [CIARC.mines.edu](https://helpcenter.mines.edu/TDClient/1946/Portal/Requests/TicketRequests/NewForm?ID=4GCQlvW5OYk_&RequestorType=Service) ### Life Cycle of Research Computing Projects ![Life Cycle of an HPC Project](/_static/life_cycle_hpc_project.png) Stage 1: Identify area for exploratory research to expand current scientific knowledge base - Choice of Software - Vet software for security and other restrictions - Open Source vs Commercial - Commercial: Licensing requirements - FlexLM server - Site-wide, single user, multi-user, etc. - Open Source - Optimized build for compute architecture - Self-authored software - Identify software languages, compilers, libraries required, etc. Stage 2: Development of scripts, scaling & benchmarking - Identify workload and workflows - Common Parallel Workflows: - Multi-node (e.g. MPI) - Multi-threaded (e.g. OpenMP) - Single or Multi GPU (NVIDIA using CUDA, AMD using OpenCL, etc) - Embarrassingly Parallel – Parameter Sweeps, Markov Chain Monte Carlo Simulations - Other considerations - High memory (relative to CPU requirements) - High data I/O (e.g. machine learning) Stage 2 (alternative): Find published benchmark (little to no experience with software) or manual benchmarking (established codebase or simulation input deck) - Run test problem across multiple hardware configurations: - Local workstation or laptop - Freely Accessible HPC - Performance Analysis Consultation Checklist - A working test problem of model/code - What compute resources were used to verify code? (No. of cores, memory, etc) - Model can use multiple cpus and/or nodes - Anticipated number of runs of model - Parameter matrix - Outcomes: Performance Analysis report - Weak scaling (vary problem size proportionally to no. of processors) - Strong scaling (fixed problem size -> increase no. of processors) - Scaling analysis will estimate to user the amount of core-hours per job they need Stage 3: Pre-Funded HPC options - No funding: - CU Boulder's Alpine via RMACC - Open Science Grid (if user has an Embarassingly Parallel workflow) - NSF ACCESS Program - Proposal required, but grant is not required Stage 3 (grant supported HPC): Perform baseline parallel and efficiency analysis on Pre-Funded HPC choice - Use as an approximate scale -\> estimate budget - Secure funding - Secure allocation on paid/NSF cluster, or fund AWS HPC Stage 4: Full-scale Computing & Visualization - Setting up computing environment - Request software install - Project Allocation - User access account - Shared storage and project storage - Performance batch jobs - Data management - Visualization setups Stage 5: Post-processing & continued training of new researchers - Data insights, data sharing, project publications - Training for new researchers Stage 6: Project close-out and data transfer to long term storage - Metadata tagging - Archive ready datasets ### Parameters to Determine 1. Nodes x Cores: total number of Cores used for simulation 2. Total jobs: Jobs needed to complete research project 3. Core Run Time (Cores * Hours): Wirk done by each CPU (CPUh) 4. Storage of Input and Output: Data sets inputs (multi-use or single-use) 5. Cost / (Core-Hour): Wendian = $0.02 (CPUh), Apex (AWS cloud) $0.04 and up and GPUs even more ie. $96/hour 6. Storage during & after simulation: S3-Bucket, Orebits, fast NVMe 7. Job Parametric Study: Number of simulations to assess science domain 8. GPU-Hour (GPUh): Increased costs for specialize resources. Wendian GPU V100 = 6 core-hour or $0.12.