Introduction

From a thesis at Ho Chi Minh City University of Technology (HCMUT) ...

Hello world !

Our HPC system problem

Specify problems in HCMUT's HPC system, its user story, demands ...

Why this project

This section should compare the project with other monitoring tools such as Zabbix, Prometheus ...

Which problems does this project solve? ...

System Architecture

Architecture

Sensor manager

Virtual Sensor

Virtual sensor descriptions.

Helpful libraries

Data collection

Payload's content interfaces

interface Process {
    name: string
    pid: number
    parentPid: number
    uid: number
    gid: number
    executePath: string
    command: string
    virtualMemoryUsage: number  // In KB
    physicalMemoryUsage: number // In KB
    cpuTime: number             // In ms
    cpuUsage: number            // In %
    networkInBandwidth: number  // TODO: decide which network interface this is measured on
    networkOutBandwidth: number
    ioWrite: number             // In KB
    ioRead: number              // In KB
}

interface NetworkInterface {
    name: string
    inBandwidth: number
    outBandwidth: number
}

interface Memory {
    used: number
    available: number
    swapUsed?: number
    swapFree?: number
}

interface Cpu {
    user: number
    nice: number
    system: number
    iowait: number
    steal: number
    idle: number
}

interface IOUsage {
    deviceName: string
    readPerSecond: number
    writePerSecond: number
}

interface DiskUsage {
    filesystemName: string
    used: number // In KB
    available: number // In KB
}

Full payload interface

interface KafkaMessage {
    nodeId: number
    timestamp: number
    payload: Process[] | NetworkInterface[] | Memory | Cpu | IOUsage | DiskUsage
    type: 'PROCESS' | 'NETWORK_INTERFACE' | 'MEMORY' | 'CPU' | 'IO_USAGE' | 'DISK_USAGE'
}
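As written, KafkaMessage allows any `type` value to be paired with any payload shape. One possible refinement is a discriminated union that ties each `type` to its payload, so the compiler narrows `payload` after a check on `type`. The following is a minimal sketch of that idea, not the project's actual definition: the names `MemorySample`, `CpuSample`, `TypedMessage` and `describeMessage` are hypothetical, and only two of the six message types are covered.

```typescript
// Standalone copies of two payload shapes from above, renamed with a
// "Sample" suffix (hypothetical names) so this sketch compiles on its own.
interface MemorySample {
  used: number;
  available: number;
  swapUsed?: number;
  swapFree?: number;
}

interface CpuSample {
  user: number;
  nice: number;
  system: number;
  iowait: number;
  steal: number;
  idle: number;
}

// Hypothetical discriminated variant of KafkaMessage:
// each `type` literal is bound to exactly one payload shape.
type TypedMessage =
  | { nodeId: number; timestamp: number; type: 'MEMORY'; payload: MemorySample }
  | { nodeId: number; timestamp: number; type: 'CPU'; payload: CpuSample };

// The switch on `type` narrows `payload`, so no casts are needed.
function describeMessage(msg: TypedMessage): string {
  switch (msg.type) {
    case 'MEMORY':
      return `node ${msg.nodeId}: ${msg.payload.used} used, ${msg.payload.available} available`;
    case 'CPU': {
      const p = msg.payload;
      const total = p.user + p.nice + p.system + p.iowait + p.steal + p.idle;
      // Assumption: all six fields share one unit, so idle / total is the idle share.
      const idleShare = total > 0 ? (p.idle / total).toFixed(2) : 'n/a';
      return `node ${msg.nodeId}: idle share ${idleShare}`;
    }
  }
}
```

A consumer could then branch on `msg.type` once and handle each payload with full type checking, instead of inspecting the payload's fields at run time.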

Data in namespace

Note that a process running in a container (such as Docker or LXC) or in a VM has its own namespace.

Sample data from the /proc/$PID/net/dev file:

Interface name | Receive                                                   | Transmit
               | bytes   packets errs drop fifo frame compressed multicast | bytes   packets errs drop fifo colls carrier compressed
lo             | 2469224 19558   0    0    0    0     0          0         | 2469224 19558   0    0    0    0     0       0
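The counters in this file are cumulative, so a bandwidth figure such as the `inBandwidth` and `outBandwidth` fields of NetworkInterface needs two samples taken some time apart. The sketch below shows one way to parse the file's text and turn two samples into rates. It assumes the usual Linux layout (two header lines, then `name:` followed by 16 numeric columns, with receive bytes first and transmit bytes ninth) and assumes bytes per second as the unit, which the interface above does not specify. `NetDevCounters`, `parseNetDev` and `toBandwidth` are hypothetical names.

```typescript
// Cumulative byte counters for one network interface (hypothetical shape).
interface NetDevCounters {
  name: string;
  rxBytes: number;
  txBytes: number;
}

// Parse the text content of /proc/$PID/net/dev.
function parseNetDev(text: string): NetDevCounters[] {
  const result: NetDevCounters[] = [];
  // The first two lines are column headers; data lines look like
  // "    lo: 2469224 19558 0 0 ...".
  for (const line of text.split('\n').slice(2)) {
    const colon = line.indexOf(':');
    if (colon < 0) continue; // skip blank or malformed lines
    const fields = line.slice(colon + 1).trim().split(/\s+/).map(Number);
    if (fields.length < 16 || fields.some((n) => isNaN(n))) continue;
    result.push({
      name: line.slice(0, colon).trim(),
      rxBytes: fields[0], // first receive column
      txBytes: fields[8], // first transmit column
    });
  }
  return result;
}

// Convert two samples of the same interface into rates.
// Assumption: the unit is bytes per second.
function toBandwidth(prev: NetDevCounters, curr: NetDevCounters, seconds: number) {
  if (seconds <= 0) throw new Error('seconds must be positive');
  return {
    name: curr.name,
    inBandwidth: (curr.rxBytes - prev.rxBytes) / seconds,
    outBandwidth: (curr.txBytes - prev.txBytes) / seconds,
  };
}
```

Reading the file itself (for example with Node's `fs.readFileSync`) is left out so the parsing logic stays independent of the runtime.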

Helpful tools

Some useful command-line tools:

  • sysstat (package that provides iostat, pidstat and sar)
  • df
  • free
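As an illustration of how the output of one of these tools could feed the payload interfaces, the sketch below parses the text printed by `df -kP` (1 KB blocks, POSIX single-line format) into objects shaped like DiskUsage. It is a sketch under assumptions, not the project's collector: `DiskUsageRow` and `parseDf` are hypothetical names, and filesystem names that contain spaces are not handled.

```typescript
// Same fields as the DiskUsage interface; renamed (hypothetical) so the
// sketch compiles on its own.
interface DiskUsageRow {
  filesystemName: string;
  used: number;      // In KB
  available: number; // In KB
}

// Parse the output of `df -kP`. Expected columns:
// Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on.
function parseDf(output: string): DiskUsageRow[] {
  const rows: DiskUsageRow[] = [];
  // slice(1) skips the header line.
  for (const line of output.trim().split('\n').slice(1)) {
    const cols = line.trim().split(/\s+/);
    if (cols.length < 6) continue; // not a complete data row
    const used = Number(cols[2]);
    const available = Number(cols[3]);
    if (isNaN(used) || isNaN(available)) continue;
    rows.push({ filesystemName: cols[0], used, available });
  }
  return rows;
}
```

The `-P` flag matters here because without it `df` may wrap long device names onto a second line, which would break a line-by-line parser.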

Build systemd service

https://www.tecmint.com/create-systemd-service-linux/

Appendix

The following sections show how to deploy & operate the entire system.

GitHub workflow settings

We recommend running GitHub runners inside a Linux container (LXC).

For details about self-hosted GitHub runners, visit https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners.

For LXC installation and usage, see the Linux container (LXC) section below.

Linux container (LXC)

Install LXC

Full documentation about installation goes here. For security reasons, we create an unprivileged container as a regular user with the following steps:

Init configurations

mkdir -p ~/.config/lxc
cp /etc/lxc/default.conf ~/.config/lxc/default.conf
MS_UID="$(grep "$(id -un)" /etc/subuid  | cut -d : -f 2)"
ME_UID="$(grep "$(id -un)" /etc/subuid  | cut -d : -f 3)"
MS_GID="$(grep "$(id -un)" /etc/subgid  | cut -d : -f 2)"
ME_GID="$(grep "$(id -un)" /etc/subgid  | cut -d : -f 3)"
echo "lxc.idmap = u 0 $MS_UID $ME_UID" >> ~/.config/lxc/default.conf
echo "lxc.idmap = g 0 $MS_GID $ME_GID" >> ~/.config/lxc/default.conf
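To make the shell steps above easier to follow: each line of /etc/subuid and /etc/subgid has the form `user:start:count`, and the two `lxc.idmap` lines map container ID 0 onto that range. The sketch below rebuilds the same two lines from file contents passed in as strings. It is only an illustration with hypothetical names (`findSubIdRange`, `idmapLines`); unlike the `grep` pipeline, it matches the user name exactly rather than as a substring.

```typescript
// Find the "start:count" range for `user` in /etc/subuid- or
// /etc/subgid-style text (one "user:start:count" entry per line).
function findSubIdRange(
  content: string,
  user: string,
): { start: number; count: number } | undefined {
  for (const line of content.split('\n')) {
    const [name, start, count] = line.trim().split(':');
    if (name === user && start !== undefined && count !== undefined) {
      return { start: Number(start), count: Number(count) };
    }
  }
  return undefined;
}

// Build the two lxc.idmap lines that the shell snippet appends to
// ~/.config/lxc/default.conf.
function idmapLines(user: string, subuid: string, subgid: string): string[] {
  const u = findSubIdRange(subuid, user);
  const g = findSubIdRange(subgid, user);
  if (!u || !g) throw new Error(`no subordinate ID range for ${user}`);
  return [
    `lxc.idmap = u 0 ${u.start} ${u.count}`,
    `lxc.idmap = g 0 ${g.start} ${g.count}`,
  ];
}
```

The exact match avoids picking up a second entry when one user name is a prefix of another, which would leave the shell variables holding two values.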

Download container

Run this command to start the download:

systemd-run --unit=hpc-unit --user --scope -p "Delegate=yes" -- lxc-create -t download -n hpc-container

The console then prints a list of distributions. Choose distribution centos, release 7, and the host computer's architecture. After a successful download, your terminal should print output like this:

Downloading the image index

---
DIST        RELEASE ARCH    VARIANT BUILD
---
almalinux   8       amd64   default 20230123_23:10
almalinux   8       arm64   default 20230123_23:14
almalinux   8       ppc64el default 20230123_23:08
..... Other distribution
---

Distribution: 
centos
Release: 
7
Architecture: 
amd64

Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs

---
You just created a Centos 7 x86_64 (20230123_22:38) container.

Start container

Start the LXC container inside an empty delegated cgroup:

systemd-run --unit=hpc-unit --user --scope -p "Delegate=yes" -- lxc-start hpc-container

To confirm its status:

lxc-info -n hpc-container
lxc-ls -f

And get a shell inside it with:

lxc-attach -n hpc-container

Stopping it can be done with:

lxc-stop -n hpc-container

And finally removing it with:

lxc-destroy -n hpc-container

User guide

User guide here ... (Web UI, how to interact ...)