Intro to Data Management on HPC

Checking data usage

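If you want to see the size of files in a given directory, the ls command with the -l, -a, and -h flags lists all files in the directory, including hidden ones, with human-readable file sizes. Note that for a subdirectory, ls shows the size of the directory entry itself, not the size of its contents: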

ls -lah

Using du

du reference: https://www.geeksforgeeks.org/du-command-linux/

To get a summary of data usage in your current directory, use

du -sh

The -s flag gives you a summary, while the -h flag makes it human readable.

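To get the 10 largest items under a given directory, we can pipe du into the sort command. The -a flag makes du report individual files as well as directories, and sort -n -r orders the sizes (reported in blocks, typically 1 KiB on Linux) from largest to smallest. The first line of the output is the total for the directory itself. For the example below, we use it for our scratch directory: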

du -a $SCRATCH | sort -n -r | head -n 10

NOTE: This command can get slow if you have a lot of files.
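
If you only care about which subdirectories are taking up space, a variant that stops at the first level and prints human-readable sizes can be easier to read. The following is a minimal sketch, assuming a du that supports -d and a sort that supports -h (both are available in current GNU coreutils). The function name largest_dirs is made up for this example.

```shell
# List the N largest immediate subdirectories of a directory, largest first.
# largest_dirs is a hypothetical helper name, not a standard command.
largest_dirs() {
    dir="${1:-.}"      # directory to inspect (defaults to the current directory)
    count="${2:-10}"   # number of lines to show (defaults to 10)
    # -d 1 limits du to one level; sort -rh sorts human-readable sizes in reverse
    du -h -d 1 "$dir" | sort -rh | head -n "$count"
}

# Example usage on the scratch directory:
# largest_dirs "$SCRATCH" 10
```

As with the command above, one of the lines is the total for the directory itself, and the scan still has to visit every file, so it can be slow on large directory trees.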

Archive Data

Using tar

Tar reference: https://www.geeksforgeeks.org/tar-command-linux-examples/

To archive data with tar, use the following format:

tar -czf <filename>.tar.gz <list of directories>

For example, if we wanted to bundle our data from yesterday's lab, with the job data:

cd $SCRATCH
tar -czf Workshop_Fall2023_day2.tar.gz Workshop_Fall2023 jobs
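
Before relying on an archive, it can be worth checking that tar is able to read it back. The -t flag lists the contents of an archive without extracting anything. The sketch below wraps both steps in one function; the name make_archive is made up for this example and is not part of tar.

```shell
# Create a gzip-compressed archive, then confirm that it can be read back.
# make_archive is a hypothetical helper name, not a tar feature.
make_archive() {
    archive="$1"   # first argument: output file, e.g. data.tar.gz
    shift          # remaining arguments: files and directories to bundle
    tar -czf "$archive" "$@" || return 1
    # -t lists the contents; a non-zero exit status means tar could not read the archive
    listing=$(tar -tzf "$archive") || return 1
    echo "$archive: $(printf '%s\n' "$listing" | wc -l) entries"
}

# Example usage, matching the command above:
# cd "$SCRATCH"
# make_archive Workshop_Fall2023_day2.tar.gz Workshop_Fall2023 jobs
```

To look at the file list yourself, run tar -tzf on the archive and page through the output.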

Let's untar the data in a new folder:

cd $SCRATCH
mkdir -p new_data_folder
cd new_data_folder
cp ../Workshop_Fall2023_day2.tar.gz .
tar -xf Workshop_Fall2023_day2.tar.gz

Exercise: Try untarring the data to a specific folder, say ~/scratch/test_oct12, using the -C flag. Look at the documentation for tar to figure this out.

Transfer Data

Using scp

Transferring an individual file

Now transfer the tarball we created from Wendian to your home system (open a new terminal that is NOT logged into Wendian):

scp username@wendian.mines.edu:~/scratch/Workshop_Fall2023_day2.tar.gz .
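
One way to confirm that the tarball arrived intact is to compare a checksum computed on Wendian with one computed on your local machine. The sketch below uses cksum, which is part of POSIX and should be present on both systems; the function name file_fingerprint is made up for this example. A CRC checksum catches accidental corruption only, so a stronger tool such as sha256sum may be preferable where it is installed.

```shell
# Print the CRC checksum and byte count of a file.
# Reading from stdin keeps the file name out of the output, so the results
# from two machines can be compared directly.
# file_fingerprint is a hypothetical helper name.
file_fingerprint() {
    cksum < "$1"
}

# Example usage: run once on Wendian and once locally, then compare the two lines.
# file_fingerprint Workshop_Fall2023_day2.tar.gz
```

If the two lines differ, the copy is incomplete or damaged and the transfer should be repeated.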

Transferring a directory

You can also transfer an entire directory using the recursive -r flag. Again, make sure you have a terminal open that is NOT logged into Wendian:

scp -r username@wendian.mines.edu:~/scratch/new_data_folder .

Using rsync

Rsync is similar to scp, but it can resume a transfer that was interrupted or cancelled. Here is a template for a typical rsync transfer:

rsync --rsh=ssh -rvP username@remote_host:/path/to/source /path/to/destination

The flag --rsh=ssh ensures rsync uses ssh. -rvP recursively pulls files from the directory (-r), prints verbose output to the screen (-v), and keeps partially transferred files while showing progress (-P), so that an interrupted transfer can be resumed. For example, to transfer the directory new_data_folder from Wendian to your local directory:

rsync --rsh=ssh -rvP username@wendian.mines.edu:~/scratch/new_data_folder .

To see this in action, press CTRL+C on your keyboard partway through the transfer to cancel it. Typing ls on your local machine will show that part of the file is still there. scp, by contrast, cannot resume an interrupted transfer; rerunning it starts the copy over from the beginning.

Now restart the transfer with the command above. You'll see that it picks up where it left off at the last cancellation.
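
Because rerunning the same rsync command continues from the files already on disk, a flaky connection can be handled by rerunning it in a loop until it succeeds. The following is a minimal sketch of that idea. The function name sync_with_retry, the default of 5 attempts, and the 10-second pause are all arbitrary choices for illustration.

```shell
# Rerun an rsync transfer until it succeeds or the attempt limit is reached.
# sync_with_retry is a hypothetical helper name; the limits below are arbitrary.
sync_with_retry() {
    src="$1"          # source, e.g. username@wendian.mines.edu:~/scratch/new_data_folder
    dest="$2"         # destination, e.g. .
    max="${3:-5}"     # maximum number of attempts (defaults to 5)
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # -P keeps partial files, so each retry continues rather than starting over
        rsync --rsh=ssh -rvP "$src" "$dest" && return 0
        echo "rsync attempt $attempt failed" >&2
        attempt=$((attempt + 1))
        # pause before the next attempt, but not after the last one
        [ "$attempt" -le "$max" ] && sleep 10
    done
    return 1
}

# Example usage, matching the transfer above:
# sync_with_retry username@wendian.mines.edu:~/scratch/new_data_folder .
```

Unless key-based ssh login is set up, each attempt will prompt for a password again.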

Using Graphical Applications

FileZilla

Go to https://filezilla-project.org/ and install FileZilla on your machine. Then open the application and fill in the connection information in the bar at the top.

Try to transfer the same tarball down using the FTP client.

Globus

Go to http://app.globus.org and log in using your Colorado School of Mines credentials. Try to pull down the tarball using this interface too.