Introduction
From a thesis at Ho Chi Minh City University of Technology (HCMUT) ...
Hello world !
Our HPC system problem
Specify problems in HCMUT's HPC system, its user story, demands ...
Why this project
Should compare to other monitoring tools such as Zabbix, Prometheus ...
Which problems does this project resolve? ...
System Architecture
Sensor manager
Virtual Sensor
Virtual sensor descriptions.
Helpful libraries
Data collection
Payload's content interfaces
interface Process {
  name: string
  pid: number
  parentPid: number
  uid: number
  gid: number
  executePath: string
  command: string
  virtualMemoryUsage: number // In KB
  physicalMemoryUsage: number // In KB
  cpuTime: number // In ms
  cpuUsage: number // In %
  networkInBandwidth: number // What interface ???
  networkOutBandwidth: number
  ioWrite: number // In KB
  ioRead: number // In KB
}
interface NetworkInterface {
  name: string
  inBandwidth: number
  outBandwidth: number
}
interface Memory {
  used: number
  available: number
  swapUsed?: number
  swapFree?: number
}
interface Cpu {
  user: number
  nice: number
  system: number
  iowait: number
  steal: number
  idle: number
}
interface IOUsage {
  deviceName: string
  readPerSecond: number
  writePerSecond: number
}
interface DiskUsage {
  filesystemName: string
  used: number // In KB
  available: number // In KB
}
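As an illustration of how a Cpu payload might be filled, the sketch below derives the six fields from two samples of the aggregate "cpu" line in /proc/stat. The helper names (parseCpuTicks, cpuUsageBetween) are hypothetical, and the choice to express each field as a percentage of the ticks elapsed between samples is an assumption, since the interface above does not state its units.

```typescript
interface Cpu {
  user: number;
  nice: number;
  system: number;
  iowait: number;
  steal: number;
  idle: number;
}

// Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
const STAT_FIELDS = [
  "user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal",
] as const;
type StatField = (typeof STAT_FIELDS)[number];
type CpuTicks = Record<StatField, number>;

// Parse the aggregate "cpu ..." line into raw tick counters.
function parseCpuTicks(statText: string): CpuTicks {
  const line = statText.split("\n").find((l) => l.startsWith("cpu "));
  if (!line) throw new Error("no aggregate cpu line found");
  const values = line.trim().split(/\s+/).slice(1).map(Number);
  const ticks = {} as CpuTicks;
  STAT_FIELDS.forEach((name, i) => {
    ticks[name] = values[i] ?? 0; // counters missing on old kernels count as 0
  });
  return ticks;
}

// Assumption: each Cpu field is a percentage of the ticks elapsed between samples.
function cpuUsageBetween(prev: CpuTicks, curr: CpuTicks): Cpu {
  const delta = (f: StatField): number => curr[f] - prev[f];
  const total = STAT_FIELDS.reduce((sum, f) => sum + delta(f), 0);
  const pct = (f: StatField): number => (total > 0 ? (100 * delta(f)) / total : 0);
  return {
    user: pct("user"),
    nice: pct("nice"),
    system: pct("system"),
    iowait: pct("iowait"),
    steal: pct("steal"),
    idle: pct("idle"),
  };
}
```

A sensor could read /proc/stat once per sampling interval, keep the previous CpuTicks, and publish the result of cpuUsageBetween as the payload.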
Full payload interface
interface KafkaMessage {
  nodeId: number
  timestamp: number
  payload: Process[] | NetworkInterface[] | Memory | Cpu | IOUsage | DiskUsage
  type: 'PROCESS' | 'NETWORK_INTERFACE' | 'MEMORY' | 'CPU' | 'IO_USAGE' | 'DISK_USAGE'
}
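In the interface above, type and payload are independent unions, so the compiler cannot tell which payload shape goes with which type. One possible alternative, shown here only as a sketch, is a discriminated union that pairs each type value with exactly one payload. The names MessageBase, TypedMessage, encode and summarize are hypothetical, and only two of the six variants are spelled out to keep the example short.

```typescript
interface Memory {
  used: number;
  available: number;
  swapUsed?: number;
  swapFree?: number;
}

interface Cpu {
  user: number;
  nice: number;
  system: number;
  iowait: number;
  steal: number;
  idle: number;
}

interface MessageBase {
  nodeId: number;
  timestamp: number;
}

// Hypothetical variant of KafkaMessage: `type` decides the shape of `payload`.
type TypedMessage =
  | (MessageBase & { type: "MEMORY"; payload: Memory })
  | (MessageBase & { type: "CPU"; payload: Cpu });

// Serialize a message to the JSON string that would be sent as the record value.
function encode(message: TypedMessage): string {
  return JSON.stringify(message);
}

// Inside each case, the compiler narrows `payload` to the matching interface.
function summarize(message: TypedMessage): string {
  switch (message.type) {
    case "MEMORY":
      return `node ${message.nodeId}: memory used=${message.payload.used}`;
    case "CPU":
      return `node ${message.nodeId}: cpu idle=${message.payload.idle}`;
  }
}
```

With this shape, building a message whose type is "CPU" but whose payload is a Memory object fails at compile time instead of reaching a consumer.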
Data in namespace
Note that a process running in a container (such as Docker or LXC) or in a VM has its own namespace.
Sample data from the /proc/$PID/net/dev file:
Interface name | Receive bytes | Receive packets | Receive errs | Receive drop | Receive fifo | Receive frame | Receive compressed | Receive multicast | Transmit bytes | Transmit packets | Transmit errs | Transmit drop | Transmit fifo | Transmit colls | Transmit carrier | Transmit compressed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
lo | 2469224 | 19558 | 0 | 0 | 0 | 0 | 0 | 0 | 2469224 | 19558 | 0 | 0 | 0 | 0 | 0 | 0
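The counters in this file are cumulative, so a bandwidth figure has to come from the difference between two reads. The following sketch parses the raw text of /proc/$PID/net/dev (two header lines, then one "name: counters" line per interface) and converts two samples into NetworkInterface entries. The helper names and the ByteCounters type are hypothetical, and bytes per second is an assumed unit, since the interface definition does not specify one.

```typescript
interface NetworkInterface {
  name: string;
  inBandwidth: number;
  outBandwidth: number;
}

// Cumulative byte counters for one interface (hypothetical intermediate type).
interface ByteCounters {
  name: string;
  rxBytes: number;
  txBytes: number;
}

// Parse the raw text of a net/dev file; the first two lines are headers.
function parseNetDev(text: string): ByteCounters[] {
  const counters: ByteCounters[] = [];
  for (const line of text.split("\n").slice(2)) {
    const sep = line.indexOf(":");
    if (sep < 0) continue; // skip blank or malformed lines
    const fields = line.slice(sep + 1).trim().split(/\s+/).map(Number);
    counters.push({
      name: line.slice(0, sep).trim(),
      rxBytes: fields[0], // first receive column: bytes
      txBytes: fields[8], // first transmit column: bytes
    });
  }
  return counters;
}

// Assumption: bandwidth is reported in bytes per second between two samples.
function toBandwidth(
  prev: ByteCounters[],
  curr: ByteCounters[],
  seconds: number,
): NetworkInterface[] {
  const before = new Map<string, ByteCounters>();
  for (const c of prev) before.set(c.name, c);
  const result: NetworkInterface[] = [];
  for (const c of curr) {
    const p = before.get(c.name);
    if (!p) continue; // interface appeared after the first sample
    result.push({
      name: c.name,
      inBandwidth: (c.rxBytes - p.rxBytes) / seconds,
      outBandwidth: (c.txBytes - p.txBytes) / seconds,
    });
  }
  return result;
}
```

Because the file is read through a specific PID, the interfaces it lists are those of that process's network namespace, which is what makes it usable for containerized processes.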
Helpful tools
Some useful command-line tools and packages:
sysstat
df
free
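As an example of turning one of these tools into a payload, the sketch below parses the text printed by df with the POSIX options -k (1024-byte units) and -P (one line per filesystem) into DiskUsage entries. The function name parseDf is hypothetical, and the parser assumes filesystem names contain no whitespace.

```typescript
interface DiskUsage {
  filesystemName: string;
  used: number; // In KB
  available: number; // In KB
}

// Parse the output of `df -kP`. Expected columns:
// Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on
function parseDf(output: string): DiskUsage[] {
  return output
    .trim()
    .split("\n")
    .slice(1) // drop the header line
    .map((line) => line.trim().split(/\s+/))
    .filter((cols) => cols.length >= 6) // ignore incomplete lines
    .map((cols) => ({
      filesystemName: cols[0],
      used: Number(cols[2]),
      available: Number(cols[3]),
    }));
}
```

In a Node.js sensor the input text could come from running df through the child_process module; the parsing is kept separate here so it can be tested without touching the host.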
Build systemd service
https://www.tecmint.com/create-systemd-service-linux/
Appendix
The following sections show how to deploy & operate the entire system.
GitHub workflow settings
We recommend running GitHub runners inside a Linux container (LXC).
For details about self-hosted GitHub runners, visit https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners.
For LXC installation and usage, you can visit our guides here.
Linux container (LXC)
Install LXC
Full installation documentation is available here. For security reasons, we create an unprivileged container as a regular user with the following steps:
Initialize configuration
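# Create a per-user LXC configuration based on the system default.
mkdir -p ~/.config/lxc
cp /etc/lxc/default.conf ~/.config/lxc/default.conf
# Read this user's subordinate ID range (name:start:count). The anchored
# pattern avoids matching other users whose names contain this one.
MS_UID="$(grep "^$(id -un):" /etc/subuid | cut -d : -f 2)"
ME_UID="$(grep "^$(id -un):" /etc/subuid | cut -d : -f 3)"
MS_GID="$(grep "^$(id -un):" /etc/subgid | cut -d : -f 2)"
ME_GID="$(grep "^$(id -un):" /etc/subgid | cut -d : -f 3)"
# Map container UID/GID 0 onto the start of that range.
echo "lxc.idmap = u 0 $MS_UID $ME_UID" >> ~/.config/lxc/default.conf
echo "lxc.idmap = g 0 $MS_GID $ME_GID" >> ~/.config/lxc/default.conf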
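To make clear what the commands above compute, the sketch below reproduces the same logic on in-memory text: it looks up a user's "name:start:count" entry in the contents of /etc/subuid and /etc/subgid and builds the two lxc.idmap lines. The function names are hypothetical and this is only an illustration, not a replacement for the shell steps.

```typescript
// One entry of /etc/subuid or /etc/subgid has the form "name:start:count".
interface SubIdRange {
  start: number;
  count: number;
}

// Find the first range whose name matches the user exactly.
function findSubIdRange(fileText: string, user: string): SubIdRange | undefined {
  for (const line of fileText.split("\n")) {
    const [name, start, count] = line.trim().split(":");
    if (name === user) {
      return { start: Number(start), count: Number(count) };
    }
  }
  return undefined;
}

// Build the lines appended to ~/.config/lxc/default.conf:
// container IDs starting at 0 are mapped onto the user's subordinate range.
function idmapLines(user: string, subuidText: string, subgidText: string): string[] {
  const uid = findSubIdRange(subuidText, user);
  const gid = findSubIdRange(subgidText, user);
  if (!uid || !gid) {
    throw new Error(`no subordinate ID range found for ${user}`);
  }
  return [
    `lxc.idmap = u 0 ${uid.start} ${uid.count}`,
    `lxc.idmap = g 0 ${gid.start} ${gid.count}`,
  ];
}
```

For a user whose subuid and subgid entries both read name:165536:65536, root inside the container corresponds to host ID 165536, which is why the container runs unprivileged from the host's point of view.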
Download container
Run this command to start the download:
systemd-run --unit=hpc-unit --user --scope -p "Delegate=yes" -- lxc-create -t download -n hpc-container
The console then prints a list of distributions. Choose distribution centos, release 7, and the host computer's architecture. After a successful download, your terminal should show output like this:
Downloading the image index
---
DIST RELEASE ARCH VARIANT BUILD
---
almalinux 8 amd64 default 20230123_23:10
almalinux 8 arm64 default 20230123_23:14
almalinux 8 ppc64el default 20230123_23:08
..... Other distribution
---
Distribution:
centos
Release:
7
Architecture:
amd64
Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs
---
You just created a Centos 7 x86_64 (20230123_22:38) container.
Start container
Start the LXC container inside an empty delegated cgroup:
systemd-run --unit=hpc-unit --user --scope -p "Delegate=yes" -- lxc-start hpc-container
To confirm its status:
lxc-info -n hpc-container
lxc-ls -f
And get a shell inside it with:
lxc-attach -n hpc-container
Stop it with:
lxc-stop -n hpc-container
And finally remove it with:
lxc-destroy -n hpc-container
User guide
User guide here ... (Web UI, how to interact ...)