├── .gitignore ├── README.md ├── cicd.md ├── cron.md ├── docker.md ├── download-curl.sh ├── git.md ├── history ├── README.md └── ace_cream.jemdoc ├── images ├── gnome_vnc2.png ├── mac_juniper.png ├── mac_vm_screenshot3.png ├── matlab_terminal.png ├── seafile-android-1.jpg ├── seafile-android-2.jpg ├── server_matlab.png ├── slurm_job_2_task.png ├── texworks.png ├── tutorial02.png ├── tutorial04.png ├── tutorial05.png ├── vscode_languagetool.png ├── windows_rdp_screenshot2.png ├── windows_remote_desktop_screenshot.png └── wolfram_player.png ├── languagetool.md ├── maintenance ├── .gitkeep ├── README.md ├── StorageCheck.py ├── bright-view.md ├── cpu_limit.sh ├── cpu_monitor ├── database.md ├── docker.md ├── down.md ├── jupyterhub.md ├── languagetool.md ├── limit.md ├── minikube.md ├── module.md ├── mysreport ├── mysreport.md ├── network.md ├── nfs.md ├── overleaf.md ├── quota.md ├── server_test.py ├── shorewall.md ├── slurm.md ├── sys-admin.md ├── user_management.md ├── verify_password.py ├── virtual_machine.md ├── vpn.md ├── vpn_on_bcm.md └── wiki.md ├── markdown.md ├── megaglest.md ├── module.md ├── programming_language.md ├── service.md ├── slurm.md ├── software.md ├── virtual_machine.md ├── vnc.md ├── vpn.md ├── website-config ├── index.php └── seafile-logo.png └── wiki ├── Admin.md ├── Configure_the_environment.md ├── Dataset.md ├── Doc-mirror.md ├── Food_Area_of_lab2c.md ├── Group_meeting.md ├── Guide_GFW.md ├── Large_Deviation_Theory.md ├── ServerAdmin.md ├── Software-Recommendation.md ├── Software-mirror.md ├── Spring_and_Autumn_of_lab2c.md ├── What_to_consider_when_revising_paper.md ├── images ├── 1600px-Pic_006.png ├── Pic_001.png ├── Pic_002.png ├── Pic_003.png ├── Pic_004.png ├── Pic_005.png └── Pic_007.png ├── zhaofeng_shu33.md ├── 出行途中.md ├── 国际交流渠道.md ├── 材料准备.md ├── 社会实践渠道.md ├── 行前准备.md └── 返程途中.md /.gitignore: -------------------------------------------------------------------------------- 1 | build 2 | .DS_Store 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # How to Use Our Cluster 2 | 3 | We have a high performance cluster for your research computation. 4 | 5 | This repo servers as a quick start tutorial for cluster usage. 6 | 7 | You can find further details in other chapters. 8 | 9 | ## 5 minutes Quick Start 10 | - Access server via SSH with following command. You may need [VPN](vpn.md) outside school. For SSH client, 11 | [ModaXterm](https://mobaxterm.mobatek.net/) is recommended for Windows users. For other operating systems, you can use the system default terminal. 12 | ```bash 13 | > ssh [username]@10.8.6.22 14 | ``` 15 | Then use the following command to load necessary modules to run GPU code: 16 | ```bash 17 | > module load slurm anaconda3/py3 cuda10.1 cudnn/7.6.5 18 | ``` 19 | - Submit a GPU job using 20 | ```bash 21 | > srun --gres=gpu:1 python /home/feng/mcr/k_mnist_xregression_simple.py 22 | ``` 23 | If you encounter any errors, please run `module purge` first and retry. If errors persist, please report your issue to the server admin. 24 | 25 | ## 1. Access the cluster 26 | 27 | ### 1.1 Account 28 | 29 | For new students, you need to apply for a cluster account from [server administrators](http://10.8.6.22/wiki/index.php/ServerAdmin). 
30 | The cluster account can be used in 3 purposes: shell access to manage node(10.8.6.22) and storage node(10.8.6.21); universal identity verification(seafile, jupyter, etc); slurm account, which is used to submit jobs to computing node. 31 | 32 | You will get an initial password from the server admin, which is complex and can be changed once you log in. Reset password using the `passwd` tool and follow the prompt. 33 | 34 | When setting new password, **Never** use simple codeword like 123456, cluster admin may have a regular check on password complexity and reset your weak password. 35 | 36 | Your home directory is located at `/home/[username]`. 37 | You can put/find commonly used datasets in `/home/dataset`. 38 | If you want to put data in this directory, please declare it on [Dataset](http://10.8.6.22/wiki/index.php/Dataset). 39 | Otherwise, your dataset in `/home/dataset` may be removed without notice. 40 | 41 | By default, your home directory has `700` permissions, which means others do not have read access to the files in your home. 42 | 43 | ### 1.2 VPN 44 | 45 | Our server can only be accessed within the school private network. If you are outside of our school, please check the [institute vpn](vpn.md) first. It is prohibited to use personal reverse proxies due to security concern. 46 | 47 | ### 1.3 Shell Access 48 | For Windows client, [ModaXterm](https://mobaxterm.mobatek.net/) is recommended though it is a commercial freeware. It integrates SFTP and SSH so that you can view and edit your file 49 | easily. Other client options are possible. For example, [Xshell](https://www.netsarang.com/en/xshell/), Powershell, [VSCode](https://code.visualstudio.com/), etc. 50 | 51 | For Mac client, you can use the terminal to connect to the server. See [ssh.md](./service.md#ssh) for more detailed. 52 | 53 | For both platform we specially recommend [VSCode](https://code.visualstudio.com/), which combines code editor, shell, git control and debug 54 | tools all-in-one. With the help of Remote-SSH extension of VSCode, you can manage your project directly on cluster as it is in your local. You can install 55 | many other extensions for your preference, such as Language support (Python/Matlab/C++/...), SFTP and UI Theme. 56 | 57 | ### 1.4 Cluster Home Page 58 | Our lab homepage is located at [http://10.8.6.22](http://10.8.6.22). From there you can find the link to our lab's wiki, gitlab, overleaf and jupyter etc. You can also connect to our manage node or storage node using remote desktop there. For wiki and gitlab web service you need to register first. For jupyter web service, you can login with your ssh username and password. 59 | 60 | ## 2. Setup a working environment 61 | 62 | ### 2.1 Load modules 63 | 64 | We have many popular software pre-installed on our server, which can be easily imported to your own working environment through "module" command. 65 | 66 | For example, the default python version is 2.7 on CentOS. To use python 3, you need to load the `anaconda3/py3` module. 67 | 68 | ```shell 69 | module load anaconda3/py3 70 | ``` 71 | 72 | This adds `anaconda 3`, a popular python platform, to your current session. Now you can use Python 3.6 by typing `python`. 73 | 74 | We have a lot of pre-installed modules like CUDA/cuDNN (for GPU programs), Tensorflow/Pytorch (for deep learning). You can use following command to see the complete software list. 75 | ``` 76 | module avail 77 | ``` 78 | 79 | Important: Module command only works in current session. 
To make modules automatic loaded every time you log in, or make it default environment for SLURM, you need to modify your session profile. For example, modify `~/.bashrc` file like this ( a script that runs at the start of every new terminal session. If you use `zsh`, put the command in `~/.zshrc`). 80 | ``` 81 | module load slurm anaconda3/py3 cuda10.1 cudnn/7.6.5 82 | ``` 83 | Here is a sample `.bashrc` file that include all the above packages: 84 | 85 | ```bash 86 | # .bashrc 87 | 88 | # Source global definitions 89 | if [ -f /etc/bashrc ]; then 90 | . /etc/bashrc 91 | fi 92 | 93 | # Uncomment the following line if you don't like systemctl's auto-paging feature: 94 | # export SYSTEMD_PAGER= 95 | 96 | # User specific aliases and functions 97 | module load slurm anaconda3/py3 cuda10.1 cudnn/7.6.5 98 | ``` 99 | 100 | The last line is where modules are loaded. This environment works for most deep learning requirements. 101 | 102 | ### 2.2 Python Environment 103 | #### Python Version 104 | We have pre-installed multiple versions of python interpreter, together with many popular packages. Load corresponding module to use them: 105 | 106 | | Module Name | Python Version | Package Version | 107 | |----------------|----------------|----------------------------------| 108 | | anaconda2/py2 | 2.7.16 | - | 109 | | anaconda3/py3 | 3.6.8 | tensorflow 1.8.0; pytorch 1.0.1 | 110 | | anaconda3/py38 | 3.8.5 | tensorflow 2.6.0; pytorch 1.9.0 | 111 | 112 | Important Notice: any problems about python 2 will not be supported by our lab's server admin. 113 | 114 | #### Recommended Deep Learning Environment 115 | Due to different GPU type and driver version, an specific environment may not work well on all computing node. For simplicity, 116 | the table below lists the recommended environment for different computing nodes, i.e., load modules in this table along with `anaconda3/py3`, 117 | and then choose appropriate nodes using `-w` in srun accordingly. 118 | 119 | | | node[01-03] | node[04-05] | 120 | |----------------|----------------------------------------|------------------------------------------------| 121 | | Pytorch | cuda10.1 cudnn/7.6.5 pytorch/1.7.1 | cuda11.1 cudnn8.0-cuda11.1 pytorch/1.10.0_cu113| 122 | | Tensorflow 1.x | cuda10.0 cudnn/7.6.5 tensorflow/1.15.0 | NOT SUPPORTED | 123 | | Tensorflow 2.x | cuda10.1 cudnn/7.6.5 tensorflow/2.3.0 | cuda11.1, cudnn8.0-cuda11.1, tensorflow/2.4.1 | 124 | 125 | Note: For `anaconda3/py38`, please load cuda11.1 and cudnn8.0-cuda11.1, and for now it only supports node[04-05]. 126 | 127 | #### PIP 128 | 129 | If you need any additional Python packages which is not installed by anaconda or system python by default, you can use `pip` to install it within your home directory (with `--user` option) 130 | 131 | For example, to install a package called graphviz, which is not bundled by anaconda. You can type: 132 | 133 | ```shell 134 | python -m pip install --user graphviz 135 | ``` 136 | #### Custom Environment 137 | If you need another version of Python package which is incompatible with existing installation. You need to configure your own Python environment using `conda`. 138 | See [Configure Environment](http://10.8.6.22/wiki/index.php/Configure_the_environment) for detail. 139 | 140 | ## 3. Submit a job using Slurm 141 | 142 | ### 3.1 Cluster Structure 143 | 144 | We have 7 nodes (servers) in our cluster. Once you login, you will be in the manage node called bcm. Please do not directly run your job on bcm, instead use SLURM to submit jobs to other nodes. 
SLURM will automatically allocates GPU/CPU resources to you based on your requests. 145 | 146 | SLURM is the **only way to use the GPU resources** on our lab server. 147 | 148 | | nodename | direct shell access | ip address | OS | 149 | | -------- | ------------------- | ---------- | ------------ | 150 | | bcm | Yes | 10.8.6.22 | CentOS 7.6 | 151 | | node01 | No | | CentOS 7.6 | 152 | | node02 | No | | CentOS 7.6 | 153 | | node03 | No | | CentOS 7.6 | 154 | | node04 | No | | CentOS 7.6 | 155 | | node05 | No | | CentOS 7.6 | 156 | | nfs | Yes | 10.8.6.21 | Ubuntu 16.04 | 157 | 158 | * `bcm`: management node, where you typically login; 159 | * `node01` : computing node with 8 TITAN V GPUs, 56 CPUs, 256GB RAM; 160 | * `node02`: computing node with 8 TITAN V GPUs, 256GB RAM; 161 | * `node03` : computing node with 4 Tesla K80 GPU cores and 2 TITAN V GPUs, 128GB RAM, fastest CPU among all nodes; 162 | * `node04`: computing node with 8 RTX 3090 GPUs, 256GB RAM; 163 | * `node05`: computing node with 8 RTX 3090 GPUs, 256GB RAM; 164 | * `nfs`: storage node that hosts the 120 TB file system `/home`; 165 | 166 | ### 3.2 Using srun 167 | 168 | You can use `srun` command to submit a single job to SLURM. 169 | 170 | ``` 171 | srun --gres=gpu:1 [option] [command] 172 | ``` 173 | 174 | * `--gres=gpu:1` requests one GPU for running the code. 175 | * `[command]` can be any terminal command such as `python test.py` 176 | * `[option]` can be any of following: 177 | 178 | | option | description | default value | 179 | | ---------- | --------------------------------------------------- | --------------------- | 180 | | -w node01 | use node01 for computation | automictic allocation | 181 | | --qos=high | use high quality of service | --qos=normal | 182 | | -t 200 | the maximum job running time is limited to 200 mins | -t 4320 (3 days) | 183 | | -c 4 | use 4 CPU cores for computation | -c 1 | 184 | | --gres=gpu:1 | use 1 GPU for computation | None | 185 | | --unbuffered | display output in real-time | None | 186 | 187 | The maximal time each GPU job is allowed to run is 3 days divided by the number of GPUs your job is using. 188 | 189 | Note that setting `--gres=gpu` to more than one will NOT automatically make your code faster! You also need to make sure your code supports multiple GPUs. See the following links on how to achieve this. 190 | 191 | * Keras: https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus 192 | * Tensorflow: https://www.tensorflow.org/guide/using_gpu#using_multiple_gpus 193 | * Pytorch: https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html 194 | 195 | ### 3.3 Using sbatch 196 | 197 | While `srun` run a single job and block your shell, `sbatch` command submits a list of your jobs together and run in background. And return the standard output into a `.out` file for later check. 198 | 199 | To submit a job, you need to wrap terminal commands within a `sbatch` script. Suppose you want to run a list of GPU program like `test.py`. 
First create a new file `submit_jobs.sh` with the following content: 200 | 201 | ```bash 202 | #!/bin/bash 203 | #SBATCH -J yang # job name, optional 204 | #SBATCH -N 1 # number of computing node 205 | #SBATCH -c 4 # number of cpus, for multi-thread programs 206 | #SBATCH --gres=gpu:1 # number of gpus allocated on each node 207 | 208 | python test1.py 209 | python test2.py --option on 210 | python test2.py --option off 211 | ``` 212 | 213 | * Lines starting with `#SBATCH` are slurm options, which act identically to the options in `srun` command. 214 | * The last few lines `python test1.py` is the command to be run. If multiple commands are listed, they will be always be executed sequentially, NOT in parallel. 215 | 216 | Then submit the job with `sbatch` 217 | 218 | ```bash 219 | sbatch submit_jobs.sh 220 | ``` 221 | 222 | The output log of this job will be saved to `slurm-[jobID].out` in the current directory. A useful way to display the log in real time is via the `tail` command. e.g 223 | 224 | ``` 225 | tail -f slurm-177.out 226 | ``` 227 | 228 | To exit, use Ctrl-C. 229 | 230 | ### 3.4 View and Cancel jobs 231 | 232 | You can view the job queue using `squeue`. (This applies to all jobs submitted with `srun` or `sbatch`) 233 | 234 | ```bash 235 | [yang@bcm ~]$ squeue 236 | JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 237 | 177 defq yang yang R 0:30 1 node01 238 | ``` 239 | 240 | * The column named ST displays the status of your job. 'R' stands for running and 'PD' stands for pending. 'NODELIST(REASON)' shows where this job runs or the reason of pending. 241 | 242 | 245 | 246 | To cancel a job, use `scancel` command: 247 | 248 | ``` 249 | scancel [jobID] 250 | ``` 251 | 252 | 267 | 268 | 269 | ## 4. Quality of Service (QoS) 270 | 271 | ### 4.1 How many GPUs can I use? 272 | 273 | * The home directory of each user is restricted to 10TB in maximal. 274 | * Task directly running on manage node is allowed to use up to 10 GB memory and 14 CPUs. See [cron.md](./cron.md) for detail. 275 | * Task submitted by Slurm can choose different Quality of Service (QoS): 276 | 277 | | QoS | Users | \#GPUs | Priority | Example | 278 | |:----------------:|:----------:|:------:|:--------:|:----------------------------------------------------:| 279 | | normal (Default) | Everyone | 3 | High | `srun [--qos=normal] --gres=gpu:1 python main.py` | 280 | | high | Applicants | ~7 | Normal | `srun --qos=high --gres=gpu:1 python main.py` | 281 | 282 | The high QoS have 7 extra GPUs for students submitting papers (and therefore 10 avaliable in total). You can apply it by consulting with Yang Li. 283 | 284 | Note that, the number of extra high QoS may change depends on overall workload of our sever, i.e., get larger at low-workload and smaller at high-workload. This kind of change will take effect without further notice and you can check the latest quota in this page. 285 | 286 | ### 4.2 Why my jobs are waiting? 287 | 288 | There are two reasons: 289 | - You have run out of your quota. In this case your waiting jobs will not be scheduled even though there's free rescources. Please wait for your previous job or apply more quota. 290 | - Your job is queued. In such case, our sever is very busy. The Priority decide the order of the queue of jobs waiting to be scheduled. As you see, normal QoS have hgher Priority than high QoS, and jobs of same Priority will follow a FIFO schedule. 291 | 292 | 293 | ## 5. 
Further documentation 294 | You can download the official user guide of how to user cluster at [User Manual](http://10.8.6.22/wiki/index.php/File:user-manual.pdf). 295 | 296 | You can submit issues on [our github](https://github.com/mace-cream/clusterhowto/issues) or [intranet gitlab](http://10.8.6.22:88/yang/clusterhowto/issues). 297 | For instant communication please join the [slack](https://join.slack.com/t/lab2c/shared_invite/enQtODQyMTY4OTcyNTMwLWRkOTlkYmM2MWI3NGYzOWMwYTRkYzEzMTBjNjcxMWMxNTMxZjg2N2U1YzE5ZjI4YTE3ZTQ2ZWU2YzEyODNmMmU) channel of our lab. 298 | Though we have a wechat group for server related issues, it is not recommended to use it compared with the above ways. 299 | 300 | 301 | -------------------------------------------------------------------------------- /cicd.md: -------------------------------------------------------------------------------- 1 | This guide introduces how to use CI feature of gitlab to help your experiment. 2 | 3 | Currently, this proposal is experimental. 4 | 5 | Since slurm does not provides a way to organize the log, it is very easy to end up with a lot of `slurm-1234.out`. Deleting these files will lose some useful informations. 6 | 7 | Actually you can use gitlab-runner to upload these log files, a sample project is at [self-hosted-runner](https://gitlab.com/zhaofeng-shu33/triangle_counting_self-hosted_runner/). 8 | Feel free to check the job logs at [log](https://gitlab.com/zhaofeng-shu33/triangle_counting_self-hosted_runner/pipelines). 9 | 10 | 1. Create a project in gitlab.com (or our self hosted gitlab). 11 | 12 | 1. You need to configure your runner at the project setting page. Your runner is private and project wide. That is, it will not be used by other people and other project. You do not need 13 | to install again `gitlab-runner` as it is already available on our manage node. You should register for your own purpose. Just type 14 | `gitlab-runner register` to go. After registeration, using `gitlab-runner run` to start the runner. 15 | 16 | 1. Using `srun` to submit your job in `.gitlab.yml` configuration file. 17 | 18 | 1. Check the CICD page of the project. 19 | 20 | ## How to create multiple gitlab-runner instance on the same machine? 21 | The default runner installed on our manage node is a service, which is managed by system admin. Your own gitlab-runner 22 | is run in user mode and will use your account to execute the task. 23 | 24 | ## Maintainance Note 25 | For maintainers, you can check the system gitlab-runner service by 26 | ```shell 27 | systemctl status gitlab-runner 28 | ``` 29 | ### How to run another gitlab-runner instance globally? (For shared runner purpose) 30 | 31 | To do this, we resort to the docker solution. That is, we start the gitlab-runner as a docker service. The configuration file is located at [docker-compose.yml](http://10.8.6.22:88/zhaofeng-shu33/lab2cnew/blob/master/docker-compose.yml). 32 | After using `docker-compose start` to start the service. We can register for each project we need. However, since this runner is a Docker container, the shell executor is within the 33 | container environment and is not usefully. We need to use the Docker executor as well. See [executor](https://docs.gitlab.com/runner/executors/README.html) how to configure Docker executor. 34 | 35 | For more information, please check [cicd doc](http://10.8.6.22:88/help/ci/yaml/README.md). 
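As a concrete starting point, a minimal CI configuration along these lines might look as follows. The file is conventionally named `.gitlab-ci.yml` in the repository root; the job name, runner tag, module list and script path below are placeholders rather than parts of any existing project:

```yaml
# minimal sketch: one pipeline job that hands the actual work to Slurm via srun
train-job:
  tags:
    - my-runner        # must match the tag you chose during `gitlab-runner register`
  script:
    - module load slurm anaconda3/py3 cuda10.1 cudnn/7.6.5
    - srun --gres=gpu:1 --unbuffered python train.py
```

Because the runner described above uses the shell executor under your own account, each `script` line runs on the manage node just like an interactive command, and the job output is collected on the project's CI/CD page instead of piling up as local `.out` files.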
36 | -------------------------------------------------------------------------------- /cron.md: -------------------------------------------------------------------------------- 1 | Currently, two root cron jobs are responsible to monitor the manage node. The two programs are run every one minites, configured by 2 | ``` 3 | * * * * * /usr/local/bin/cpu_limit.sh >> /var/log/cpu_limit_cron.log 2>&1 4 | * * * * * /usr/local/bin/cpu_monitor >> /var/log/cpu_monitor_cron.log 2>&1 5 | ``` 6 | 7 | You can find the source code of the two programs in `maintenance` folder. 8 | 9 | You can also check the log files. -------------------------------------------------------------------------------- /docker.md: -------------------------------------------------------------------------------- 1 | # Docker 2 | To use docker, you need to be within the docker group of node01. You can contact the server admin to add you to the docker group if needed. 3 | 4 | Currently, you can only use docker in node01 with GPU support. There are some security issues when using docker. Therefore it is not recommended to use docker container in service mode. 5 | 6 | ## Example 7 | ```shell 8 | srun -w node01 docker run --rm alpine:3.11 /bin/sh -c "echo 123;" 9 | ``` 10 | ## Run docker with current user 11 | ```shell 12 | srun -w node01 docker run -it --rm --user="$(id -u):$(id -g)" your_image 13 | ``` 14 | 15 | ## Build docker images using mainland mirror 16 | It is painful to build docker images at the step of installing overseas software stacks. There should be a mirror switch to deal with this problem, so that 17 | the following effect is achieved: 18 | * normally no mirror is used 19 | * mirror is used when certain environment variable is set 20 | 21 | To do this, use the following script switch in `build.sh` 22 | ```shell 23 | if [[ ! -z "$A_MIRROR" ]] # not empty 24 | do something 25 | fi 26 | ``` 27 | 28 | For `Dockerfile`, explicitly pass the environment variable to the build script: 29 | 30 | ``` 31 | ARG A_MIRROR 32 | ENV A_MIRROR=$A_MIRROR 33 | ``` 34 | 35 | Finnaly, run the `docker` command with custom args: 36 | ```shell 37 | srun -w node01 docker build -t your_target --build-arg A_MIRROR=1 . 38 | ``` 39 | 40 | ## Remove images with `none` tag 41 | ```shell 42 | srun -w node01 docker rmi $(srun -w node01 docker images --filter "dangling=true" -q) 43 | ``` 44 | ## Remove stopped containers 45 | ```shell 46 | docker container prune 47 | ``` -------------------------------------------------------------------------------- /download-curl.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Usage: sbatch --array=0-N download-curl.sh file_url 3 | # replace N with the number of thread you want 4 | # and file_url with the url you want 5 | # The file server should support partial downloading for multithreading download to work 6 | # Current limitation: the file prefix is hard coded, e.g. zip format. 
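# Illustrative usage (the URL below is a placeholder): fetch a file in 8 parts with
#   sbatch --array=0-7 download-curl.sh https://example.com/dataset.zip
# and, once every array task has finished, reassemble the pieces with
#   cat *.partial.zip > dataset.zip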
7 | set -e -x 8 | url=$1 9 | suffix="${url##*.}" 10 | total_bytes=$(curl --head $url | grep Content-Length | awk '{split($0,a," ");print a[2]}' | awk '{print substr($0, 1, length($0)-1)}') 11 | start_id=$SLURM_ARRAY_TASK_ID 12 | end_id=$((SLURM_ARRAY_TASK_ID+1)) 13 | total_id=$((SLURM_ARRAY_TASK_MAX+1)) 14 | echo $total_bytes 15 | echo $total_id 16 | start=$((start_id * total_bytes / total_id)) 17 | end=$((end_id * total_bytes / total_id -1)) 18 | filename=$(printf %02d $start_id).partial.$suffix 19 | curl -H "Range: bytes=$start-$end" $url -o $filename 20 | # after the downloading complete, please use `cat *.partial.$suffix > total.$suffix` to join the partial files together 21 | -------------------------------------------------------------------------------- /git.md: -------------------------------------------------------------------------------- 1 | # Git 2 | 3 | This document instructs you how to install a newer version of git on your local computer. 4 | 5 | ## Make a temporary working directory (optional) 6 | 7 | ```sh 8 | $ tmp_dir=$(mktemp -d -t git-XXXXXXXX) 9 | $ cd "$tmp_dir" 10 | ``` 11 | 12 | ## Download git 13 | 14 | Download the source code tarball of your preferred version (e.g. `2.26.2`) 15 | from [https://mirrors.edge.kernel.org/pub/software/scm/git/](https://mirrors.edge.kernel.org/pub/software/scm/git/) 16 | 17 | ```sh 18 | $ wget https://mirrors.edge.kernel.org/pub/software/scm/git/git-2.26.2.tar.gz 19 | ``` 20 | 21 | ## Build and install 22 | 23 | I prefer to install local applications to `~/.local`. 24 | If you do not want to install git to this path, 25 | you can change the `prefix` in the following commands. 26 | 27 | ```sh 28 | $ tar -xzvf git-2.26.2.tar.gz 29 | $ cd git-2.26.2 30 | $ make configure 31 | $ ./configure --prefix="$HOME/.local" 32 | $ make -j8 33 | $ make install 34 | ``` 35 | 36 | ## Test installation 37 | 38 | Make sure you have added `~/.local/bin/` to your `PATH` environment variable. 39 | 40 | ```sh 41 | $ echo 'export PATH=$HOME/.local/bin:$PATH' >> ~/.bashrc 42 | $ . ~/.bashrc 43 | ``` 44 | 45 | Try the following command to test if you install git successfully. 46 | 47 | ```sh 48 | $ git --version 49 | git version 2.26.2 50 | ``` 51 | 52 | ## Clean (optional) 53 | 54 | ```sh 55 | $ echo "$tmp_dir" ## Make sure the tmp_dir defined before haven't been modified 56 | $ cd && rm -rf "$tmp_dir" 57 | ``` 58 | -------------------------------------------------------------------------------- /history/README.md: -------------------------------------------------------------------------------- 1 | ## Startup 2 | Before 2018 May, our lab did not have any server. By the effects of Yang Li, postdoc of our lab at that time, 3 | we had one server machine during May. The initial three administrators also include Mingyang Li and Yue Zhang. 4 | The server is called "ace-cream" since Professor Shao-lun Huang, senior phd student xiangxiangxu and other 5 | promonient students shared research interests on this technique. 6 | Mingyang Li wrote the user documentation [ace_cream](./ace_cream.jemdoc). This documentation can also be 7 | accessible at the personal website of Yang: [web ace_cream](http://yangli-feasibility.com/wiki/doku.php?id=ace-cream). 8 | 9 | Student Feng Zhao also wrapped a python package called `ace_cream`, published on [pypi](https://pypi.org/project/ace-cream/). 10 | 11 | ## Upgrade 12 | Our old single server only had 4 GPUs and it was not enough when several students are in need. 
Competition for the occpucation 13 | of GPUs happened now and then inevitable. 14 | 15 | After around a year of running, our lab had bought a server cluster. The cluster solution is provided by Amax China, who were 16 | promoting their docker-based GPU cluster solution. Postdoc Yang was cautious enough and was skeptical about their new system. 17 | After detailed discussion with the engineers, Yang finally decided to use the solution of Bright Computing. The installation 18 | was conducted by engineers of Amax China. 19 | 20 | The server cluster became fully available from 2019 May. The new cluster uses `slurm` to schedule the computing tasks of users. 21 | 22 | -------------------------------------------------------------------------------- /history/ace_cream.jemdoc: -------------------------------------------------------------------------------- 1 | # jemdoc: menu{MENU}{index.html}, nofooter 2 | ==ace-cream使(食)用指南 3 | 4 | ~~~ 5 | 为了大家能更好的享用我们的ace-cream,请大家在使用前阅读此说明文档,里面包含了一些使用说明和注意事项,请大家爱护我们的服务器哦! 6 | ~~~ 7 | 8 | == 1.如何连接服务器? 9 | 一般我们使用ssh命令连接服务器。如果你是Linux或MacOS用户,建议直接在terminal中使用ssh命令连接;如果你是Windows用户,我个人建议使用[https://mobaxterm.mobatek.net/ MobaXterm]软件~ 10 | 11 | 无论怎样,请牢记我们的IP地址58.250.23.213,以及端口号2215。(现在这个是在哪里都能连的外网ip,另外内网ip是10.8.16.177) 12 | 13 | 所以在terminal里,你应该输入:\n 14 | ~~~ 15 | {}{} 16 | ssh 你的用户名@58.250.23.213 -p 2215 17 | ~~~ 18 | 19 | 如果你想使用GUI,也就是在服务器命令行中调用matlab、spyder3等程序时自动弹出图形界面,请使用如下命令连接服务器(MacOS用户请先安装[https://www.xquartz.org/ XQuartz]!): 20 | ~~~ 21 | {}{} 22 | ssh -Y 你的用户名@58.250.23.213 -p 2215 23 | ~~~ 24 | 25 | 请先联系管理员(名扬、张跃、阳姐)申请你的账号哦! 26 | 27 | == 2.服务器上有什么? 28 | 登陆之后你会发现在\/home文件夹下有一个以你的账户名命名的文件夹,这个就是属于你的主目录了,请把你的文件、代码等放到这个文件夹下。不过注意:由于\/home文件夹空间有限(平均每人10G),请把占用空间大的文件或数据集等放置在\/data文件夹下面!(关于\/data文件夹的使用方法见第4部分) 29 | 30 | 我们现在已经安装好了tensorflow, pytorch, keras深度学习框架(均支持GPU,请使用python3),以及scikit-learn, pandas,opencv等一些你可能会用到的包,另外还安装了matlab、spyder3。直接在命令行中输入如下命令:matlab、spyder3即可调用相应程序。如果你个人碰到一些python包需要安装,你可以使用如下命令安装到你个人的目录中: 31 | 32 | ~~~ 33 | {}{} 34 | pip install --user 你要安装的包的名字 35 | ~~~ 36 | 37 | 如上述方式无法安装,或你觉得这个包可能会被普遍使用,请联系管理员安装到系统目录中去~ 38 | 39 | == 3.使用GPU的方法 40 | 41 | 我们的服务器上现在有4个GPU供大家使用(实际上是两块Nvidia Tesla K80,但被分成了四块)。经过我们的初步估计,你一般的神经网络在一个GPU上训练即可达到你台式机10倍的训练速度,是不是很棒? 42 | 43 | 为了避免大家使用GPU撞车导致训练出错的情况,我们找到了一个非常好用的脚本,它可以实现帮你自动选择可用的GPU,并在你使用的时候把它锁起来不让别人使用,还能在你不使用的时候自动解锁~是不是很方便? 44 | 45 | 那么请你在使用GPU的时候一定记得在命令行开始时加上gpu_run这个命令!这样你就可以放心的训练你的程序啦!举例如下: 46 | ~~~ 47 | {}{} 48 | gpu_run python3 train.py 49 | ~~~ 50 | 你还可以查看GPU使用情况,直接在命令行输入: 51 | ~~~ 52 | {}{} 53 | gpu_lock_info 54 | ~~~ 55 | 如果你碰到程序停止了但GPU没有解锁的情况,请在命令行输入: 56 | ~~~ 57 | {}{} 58 | gpu_lock_release_dead 59 | ~~~ 60 | 61 | 注意哦!如果你没有使用gpu_run运行程序,你的程序会被默认不使用GPU运行的。比如你在spyder3中运行python3的话,你的程序是会用CPU跑的哦!所以建议大家可以使用spyder3调试程序,程序确认没有问题了,再在命令行中输入gpu_run来运行哈! 62 | 63 | == 4.其他注意事项 64 | - 在\/data文件夹下,大家可以将一些大型的、公用的数据集放在\/datasets文件夹中,一些公用的开发工具放在\/pkgs文件夹中,一些你觉得好用的、推荐的、小型的代码或工具放在\/tools文件夹中。建议大家在放东西的时候整理好到一个文件夹中,然后再放到上述的文件夹,以免造成混乱。让我们互帮互助,使我们的ace-cream变得越来越好用吧~。此外,如果你个人有比较大的且其他人不太可能会使用的程序、文件需要放到\/data下面,请在\/users下以你的用户名建一个文件夹,然后把东西放到里面吧~(以上提到的文件夹都是在\/data下面的哈!) 
65 | - 对linux命令行操作不熟悉的同学,推荐[http://cn.linux.vbird.org/linux_basic/linux_basic.php “鸟哥的Linux私房菜”]网站自学一下哦~ 66 | - 建议大家在本地电脑上开发测试各自的程序,仅在服务器上运行需要大量内存或GPU的大数据运算。 67 | - 如果有长时间运算(以天为单位)请提前告知管理员,避免维护时重启。 68 | - 为了降低数据意外丢失的可能,请阶段性地将计算/训练数据写入硬盘。 -------------------------------------------------------------------------------- /images/gnome_vnc2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/gnome_vnc2.png -------------------------------------------------------------------------------- /images/mac_juniper.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/mac_juniper.png -------------------------------------------------------------------------------- /images/mac_vm_screenshot3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/mac_vm_screenshot3.png -------------------------------------------------------------------------------- /images/matlab_terminal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/matlab_terminal.png -------------------------------------------------------------------------------- /images/seafile-android-1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/seafile-android-1.jpg -------------------------------------------------------------------------------- /images/seafile-android-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/seafile-android-2.jpg -------------------------------------------------------------------------------- /images/server_matlab.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/server_matlab.png -------------------------------------------------------------------------------- /images/slurm_job_2_task.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/slurm_job_2_task.png -------------------------------------------------------------------------------- /images/texworks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/texworks.png -------------------------------------------------------------------------------- /images/tutorial02.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/tutorial02.png -------------------------------------------------------------------------------- /images/tutorial04.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/tutorial04.png -------------------------------------------------------------------------------- /images/tutorial05.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/tutorial05.png -------------------------------------------------------------------------------- /images/vscode_languagetool.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/vscode_languagetool.png -------------------------------------------------------------------------------- /images/windows_rdp_screenshot2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/windows_rdp_screenshot2.png -------------------------------------------------------------------------------- /images/windows_remote_desktop_screenshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/windows_remote_desktop_screenshot.png -------------------------------------------------------------------------------- /images/wolfram_player.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/images/wolfram_player.png -------------------------------------------------------------------------------- /languagetool.md: -------------------------------------------------------------------------------- 1 | Languagetool is a grammar checker, deployed on our server with N-gram enhancement. 2 | It is useful for example when you write your paper. 3 | You can configure your source editor to connect to the server. 4 | 5 | ## VScode 6 | Install `LanguageTool Linter` and use external url: 7 |  8 | ## Texstudio 9 | Also change the server url to `http://10.8.6.21:8088`. 10 | -------------------------------------------------------------------------------- /maintenance/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mace-cream/clusterhowto/48d8087b4e163f6b1d5731e2ef5977f10be38782/maintenance/.gitkeep -------------------------------------------------------------------------------- /maintenance/README.md: -------------------------------------------------------------------------------- 1 | # Admin Wiki 2 | 3 | ## Log onto root: sudo su - 4 | 5 | ``` 6 | mongo -u root -p root_pass --authenticationDatabase admin 7 | ``` 8 | 9 | Test user: username: `ntfs`. This user is controled by LDAP and is a non-root user. The purpose of this user is used to test some functionality of our server. 10 | 11 | ## Add new user 12 | 13 | To add a new user, use the [web portal](https://10.8.6.22:8081/bright-view/#!/auth/login). 
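If the web portal is not reachable, the same account can usually be created from Bright Cluster Manager's `cmsh` shell. This is only a rough sketch; the exact prompts and required fields depend on the BCM version, and `alice` is a placeholder:

```shell
cmsh            # enter the cmsh shell as root
user            # switch to user mode
add alice       # create the account object
set password    # you will be prompted for the new password
commit          # write the change to the LDAP backend
```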
14 | 15 | ## slurm job accounting add new user 16 | 17 | To grant slurm access (for using GPU nodes) to the new user "alice", use 18 | ```shell 19 | sacctmgr create user name=alice DefaultAccount=users 20 | ``` 21 | ## grant sudo privilege to user 22 | 23 | Sudo users are in the `wheel` group (for rhel only, the group name is called `sudo` for debian based system). Grant sudo access by adding an existing user to `wheel` 24 | ```shell 25 | sudo usermod -aG wheel alice 26 | ``` 27 | ## change the slurm quota for all users 28 | The following line limits the number of CPUs per user to 32, number of GPUs per user to 6 29 | ```shell 30 | sacctmgr modify qos normal set MaxTRESPerUser=cpu=32,gres/gpu=6 31 | ``` 32 | 33 | ## add users to high usage GPU group 34 | See user qos association 35 | ``` 36 | sacctmgr show assoc format=user,qos 37 | ``` 38 | 39 | To divide high usage users into a separate group for higher computing resource allocation, follow the steps below: 40 | 41 | Create a new Quality of Service (QOS) group called high: 42 | ```shell 43 | sacctmgr add qos high 44 | ``` 45 | 46 | Set the GPU limit of the high QOS group to 10: 47 | ```shell 48 | sacctmgr modify qos high set MaxTRESPerUser=cpu=32,gres/gpu=10 49 | ``` 50 | 51 | Put a user named feima into the high QOS group (the default group that all the users are in is normal) 52 | ```shell 53 | sacctmgr -i modify user where name=feima set QOS=normal,high 54 | ``` 55 | 56 | Change the default QOS group of feima into high: 57 | ```shell 58 | sacctmgr -i modify user where name=feima set DefaultQOS=high 59 | ``` 60 | 61 | Now user feima can use 10 GPUs for multiple jobs at the same time. 62 | -------------------------------------------------------------------------------- /maintenance/StorageCheck.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import threading 4 | 5 | path = '/home/zhiyuan/' 6 | limit = 1*1024*1024 7 | log_file = '/home/zhiyuan/du.log' 8 | 9 | _f = open(log_file,'a') 10 | _f.write("[LOG]"+' '+time.strftime("%Y-%m-%d %H:%M:%S",time.localtime(time.time()))+' Program start.\n') 11 | threads = [] 12 | 13 | def check(target): 14 | size = os.popen('du -s ' + path + target).readlines()[0].split('\t')[0] 15 | if float(size) > limit: 16 | _f.write("[LOG]"+' '+time.strftime("%Y-%m-%d %H:%M:%S",time.localtime(time.time()))+' Exceed Limit: '+target+' '+str(size)+'\n') 17 | 18 | for target in os.listdir(path): 19 | threads.append(threading.Thread(target = check, args=(target,))) 20 | threads[-1].start() 21 | 22 | for x in threads: 23 | x.join() 24 | 25 | _f.write("[LOG]"+' '+time.strftime("%Y-%m-%d %H:%M:%S",time.localtime(time.time()))+' Program stop.\n') 26 | _f.close() -------------------------------------------------------------------------------- /maintenance/bright-view.md: -------------------------------------------------------------------------------- 1 | The maintainance web GUI interface is at [bright-view](https://10.8.6.22:8081/bright-view), you should use root user account to login. 
-------------------------------------------------------------------------------- /maintenance/cpu_limit.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # get the top 5 processes 3 | result=1490.0 4 | mapfile -t proc_list < <(ps --no-headers --sort=-pcpu -Ao pid,pcpu,user,cmd | head -n 5) 5 | for i in "${proc_list[@]}"; do 6 | process_cpu_list=($i) 7 | process_id=${process_cpu_list[0]} 8 | cpu_usage_percentage=${process_cpu_list[1]} 9 | compare_result=$(awk 'BEGIN{ print '$cpu_usage_percentage'<'$result' }') 10 | if [ ! "$compare_result" -eq 1 ]; then 11 | kill $process_id 12 | echo $(date) 13 | echo ${process_cpu_list[2]} 14 | echo ${process_cpu_list[3]} 15 | fi 16 | done 17 | -------------------------------------------------------------------------------- /maintenance/cpu_monitor: -------------------------------------------------------------------------------- 1 | #! /usr/bin/python 2 | # memory < 10g 3 | import psutil 4 | import os 5 | import sys 6 | import signal 7 | from datetime import datetime 8 | for pid in psutil.pids(): 9 | try: 10 | p = psutil.Process(pid) 11 | if int(p.memory_info()[0]) > 1073741824 * 10: 12 | print datetime.now(),kill,pid,p.username(),p.cmdline() 13 | os.kill(pid,signal.SIGKILL) 14 | except OSError: 15 | pass 16 | except: 17 | pass -------------------------------------------------------------------------------- /maintenance/database.md: -------------------------------------------------------------------------------- 1 | # Mongo 2 | ```shell 3 | mongo -u username -p password --authenticationDatabase db 4 | ``` 5 | `db` should be `admin` for root user and `user-data` for normal user. -------------------------------------------------------------------------------- /maintenance/docker.md: -------------------------------------------------------------------------------- 1 | The docker installed on manage node is used by server maintainers. 2 | 3 | How to run gui program in docker (manage node) 4 | ``` 5 | module load docker 6 | xhost +local:root 7 | docker run -it --rm -e 'DISPLAY=:3.0' -v /tmp/.X11-unit:/tmp/.X11-unix debian-xterm xterm 8 | ``` 9 | ## How to manage sharelatex service 10 | 11 | This service is deployed on storage node and maintained by `docker-compose`, configuration file at `/home/feng/sharelatex-config`. 12 | After modifying the configuration file, you need to stop the service, remove the container and create 13 | the service. Simply restarting the service does not work. 14 | 15 | In specific, the following commands should be executed (on storage node): 16 | ``` 17 | ./docker-compose -f docker-compose-storagenode.yml stop 18 | ./docker-compose -f docker-compose-storagenode.yml rm # do not use docker down to preserve the network 19 | ./docker-compose -f docker-compose-storagenode.yml create 20 | ./docker-compose -f docker-compose-storagenode.yml start # do not use docker up to start in backend 21 | ``` 22 | 23 | If you encounter the problem of cannot create bridged network, restarting docker daemon by 24 | `systemctl restart docker`. 
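After the restart sequence above, a quick way to confirm that the containers came back (assuming the same compose file, and that the web service is named `sharelatex` in it) is:

```shell
./docker-compose -f docker-compose-storagenode.yml ps                        # every service should report State "Up"
./docker-compose -f docker-compose-storagenode.yml logs --tail=50 sharelatex # recent output of the web container
```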
25 | 26 | ### Reference 27 | * [科学地部署自己的 Overleaf 服务](https://harrychen.xyz/2020/02/15/self-host-overleaf-scientifically/) -------------------------------------------------------------------------------- /maintenance/down.md: -------------------------------------------------------------------------------- 1 | # 服务器宕机记录 2 | 3 | ## 2020/3/13 4 | 下午六时左右,管理员接到信息办来电,称有关政府部门拦截到了从服务器发出的挖矿病毒通信数据,致函信息办要求调查。之后管理员请求网关临时放行了攻击者IP以便调查,确实发现luis账户下存在占用资源极高的计划任务,之后立即终止相关进程,删除相关文件,更改luis账户的密码,病毒活动随之消失。此后对攻击路径的猜测为:初步确定攻击方式为SSH扫描。系统日志显示,通过某同学自用的反向代理服务,3月2日起存在大量的针对root和其他用户的恶意登录拒绝记录;3月10日攻击者疑似攻击弱口令普通用户luis成功并创建了病毒;3月10日起,存在大量该病毒因占用资源过多而被终止的记录。此后关闭了涉事内网穿透,重新确认了所有用户的密码安全等级,以防止类似事件再次发生。 5 | 6 | ## 2020/2/20 7 | 北京时间 14:00 左右管理服务器(10.8.6.22)宕机,这个时间是通过穿透代理查到的。该问题首先由李阳在 Users 微信群报出来(14:13),多名同学反映问题确实存在,经测试 存储节点 10.8.6.21 也连不上,10.8.6.21 IPMI 也连不上,但 10.8.4.130 (2F 实验室服务器可以连上,赵丰尝试),初步断定是我方(2C)服务器集群全部挂掉,15.06 李阳决定亲自去机房重启服务器集群,15:50分赵丰在 Users 微信群发布服务器宕机通知。18点左右李阳 8 | 和学院IT某老师赶到机房,18:45 重启管理节点。各项测试本地正常,但仍无法远程 SSH 到服务器。19:20 线上咨询 Amax 工程师,初步断定是网线或者交换机的问题。20:30 通过将管理节点外网网线插到另一个实验室交换机上暂时解决远程连接的问题,从而得出 Amax 提供的外网交换机故障的结论。目前仍存在服务器不稳定的问题,可能是存 9 | 储节点外网网卡仍连接故障交换机的缘故。次日(2月21日)17:14 赵丰从管理节点系统日志发现通过IB网络连接 172.16.1.100(存储节点)经常连不上,可能是导致管理节点不正常的原因。通过使用备用存储节点暂时解决了这一问题,但原存储节点的数据暂时拿不到。2月22日下午3点左右,在 Amax 工程师的远程指导下,李阳前往机房,将原管理节点插到内网交换机的 IB 线换了一个插口,并把 10 | 管理节点接到别的实验室交换机上的以太网线重新插回到自己实验室的前置交换机上。测试各项指标正常。目前初步断定原管理节点插到内网交换机的 IB 线接口松动,以及前置交换机可能存在问题。 11 | 12 | ## 2020/1/5 13 | 由于电力检修,服务器集群于12:30左右正常关机。经核实,当天机房实际并未发生电力中断。15:30,在尝试重新启动服务器集群时,存储节点电源键失灵。随后管理节点和计算节点正常开机,赵丰同学临时挂载了其它存储源备用。次日,Amax工程师上门服务,对存储节点主板进行了未知操作后,电源键重新可用,存储节点顺利上线。 14 | 15 | ## 2020/4/13 16 | this morning from 10:30am-12pm engineers will perform some testing and maintenance tasks on our server. some network interruptions are expected. 现在我们换了一根网线,工程师说可能是以前的网线过长,如果有质量问题容易受电磁波干扰。。 17 | 18 | ## 2021/2/3 19 | 20.30 Weitao Tang report down of manage node 20 | 21 | 20.34 web access ok 22 | 23 | 21:00 call zhiyuan wu for remote assistant 24 | 25 | 0:30 restart the server 26 | 27 | Next day, the engineer tries to find the reason 28 | 29 | 11:20, computing nodes return to normal 30 | 31 | Afterwards: cluster suffer numerous down during February to March since this crash, with unknown reason. And everything seems to be OK after reboot. The AMAX engineer give his theory that it may be a storage unstable because of overheating. We decide to keep cabin door of our server open to alleviate the problem. It works, amazing. 32 | 33 | ## 2021/7/2 34 | 7/2 21:00 Power failure of iPark-C2 building, all connection lost. 35 | 36 | 7/5 14:00 Temporal Power supplement and cluster reboot. We only open bcm/storage/Node01 because IT refused further booting as concern of 'limited power', though we heard Electric Engineer said its safe. Also because of movement of institute network server, 10.8.4.170 no more works. We still can't access cluster. 37 | 38 | 7/9 17:00 和IT扯了几天皮,他们一直不给放新的IP,却推脱让我们找厂家,请出老师出马去问时却又称“不知道进展”。We have to move previous IPMI wire on Node05 (which use 10.8.6.0/24 network and not being influenced) to our cluster switch and use this as our new global interface. And cluster back to service after 7 days down. 39 | 40 | 7/26 Full recovery. Total shutdown time 24 days. 
-------------------------------------------------------------------------------- /maintenance/jupyterhub.md: -------------------------------------------------------------------------------- 1 | The configuration of jupyterhub on our lab is located at `/etc/jupyterhub/jupyterhub_config.py` -------------------------------------------------------------------------------- /maintenance/languagetool.md: -------------------------------------------------------------------------------- 1 | How to start the server 2 | ``` 3 | cd /home/feng/LanguageTool # on storage node 4 | nohup java -cp languagetool-server.jar org.languagetool.server.HTTPServer --port 8088 --allow-origin "*" --public --languageModel ./ > run.log 2>&1 & 5 | ``` 6 | 7 | Use `kill id` to stop the service. -------------------------------------------------------------------------------- /maintenance/limit.md: -------------------------------------------------------------------------------- 1 | This document applies to manage node only. This configuration is used for abuse use of manage node to run large experiment. 2 | 3 | 4 | # CPU Limit 5 | see `/usr/local/bin/cpu_limit.sh`. 6 | I set this to a root cronjob, see by `sudo crontab -e`. 7 | This cronjob run `cpu_limit.sh` every minute and check whether there are processes consumes more than 14 CPUs. 8 | If The program consumes more than 14 CPUs, a hook process is started to limit it to at most 15 CPUs. 9 | Also the program is run on backend. -------------------------------------------------------------------------------- /maintenance/minikube.md: -------------------------------------------------------------------------------- 1 | The executable is located at: 2 | ``` 3 | /home/feng/minikube 4 | ``` 5 | Before using it, make sure the service is started, which is a virutal machine using VirtualBox. 6 | ``` 7 | ./minukube status 8 | ``` 9 | 10 | If the service is not running, please start it first: 11 | ``` 12 | minikube start --driver=virtualbox 13 | ``` 14 | 15 | To simplify the management 16 | ``` 17 | alias kubectl="./minikube kubectl --" 18 | ``` 19 | 20 | To connect the docker daemon in the vm, run 21 | ``` 22 | eval $(./minikube -p minikube docker-env) 23 | ``` 24 | Then you can use `docker` to do some low-level management task. -------------------------------------------------------------------------------- /maintenance/module.md: -------------------------------------------------------------------------------- 1 | # module 2 | 3 | ## How to add python package as new module 4 | First, use `module load` command to load necessary environment(e.g. anaconda3/py3, cuda, etc.). Then use `pip` command to install target package at /cm/shared/apps/ rather than default python directory. You need sudo to write in this dir. 5 | 6 | ```shell 7 | pip install tensorflow-gpu==1.15 -t /cm/shared/apps/some/directory 8 | ``` 9 | 10 | It will be convinient to use tsinghua tuna by adding `-i https://pypi.tuna.tsinghua.edu.cn/simple`. 11 | 12 | Then create a modulefile at /cm/shared/modulefile/ to let module system know it. In this file, two terms have to be announced. 
Add your necessary environment as dependency like: 13 | 14 | ``` 15 | if { [is-loaded cuda10.0/]==0 } { 16 | module load cuda10.0 17 | } 18 | ``` 19 | 20 | and add the path of new package to PYTHONPATH to let python find it: 21 | 22 | ``` 23 | prepend-path PYTHONPATH /cm/shared/apps/some/directory 24 | ``` -------------------------------------------------------------------------------- /maintenance/mysreport: -------------------------------------------------------------------------------- 1 | #! /usr/bin/python 2 | import os 3 | import sys 4 | 5 | if len(sys.argv) == 1: 6 | command = 'sh sreport_new.sh' 7 | elif len(sys.argv) == 2: 8 | command = 'sh sreport_new.sh ' + str(sys.argv[1]) 9 | else: 10 | print 'argv are too much' 11 | command = 'sh sreport_new.sh' 12 | process = os.popen(command) 13 | output = process.read() 14 | process.close() 15 | row = str(output).split('\n') 16 | print '{0:<20}'.format('User'), '\t\t', '{0:<20}'.format('Used/Min') 17 | result_dict = {} 18 | for column in filter(None, row[8:]): 19 | column = filter(None, column.split(' ')) 20 | if column[2] == 'gres/gpu': 21 | continue 22 | result_dict[str(column[2])] = int(column[-1]) 23 | # print '{0:<20}'.format(column[2]), '\t\t', '{0:<20}'.format(column[-1]) 24 | sorted_dict = sorted(result_dict.items(), key=lambda d: d[1],reverse=True) 25 | 26 | for item in sorted_dict: 27 | print '{0:<20}'.format(item[0]), '\t\t', '{0:<20}'.format(item[1]) 28 | -------------------------------------------------------------------------------- /maintenance/mysreport.md: -------------------------------------------------------------------------------- 1 | ### How to show the GPU usage report 2 | *** 3 | Use the following command to show the usage report from a specific time. (`module load slurm` first.) 4 | 5 | ``` 6 | mysreport 2019-10-01 7 | ``` 8 | 9 | The time format is YYYY-mm-dd. 10 | 11 | The report time unit is minute. 12 | 13 | If you don't append a specific time, it will count from 7 days ago. 14 | 15 | ``` 16 | mysreport 17 | ``` 18 | 19 | -------------------------------------------------------------------------------- /maintenance/network.md: -------------------------------------------------------------------------------- 1 | # Network Setting 2 | 3 | ## 1. External network 4 | We use fixed IP for all external interface. Here are some useful information (update: 2021/7/10): 5 | 6 | Available IP pool: 10.8.6.10 - 10.8.6.30 7 | 8 | Mask: 255.255.255.0 9 | 10 | Gateway: 10.8.6.254 11 | 12 | DNS: 10.8.2.200 13 | 14 | ## 2. How to Set External Network Parameters 15 | 16 | Two options to change network parameters for bcm: 17 | 18 | (1) Temporary change: modify scripts in `/etc/sysconfig/network-scripts`. And use `systemctl restart network` to make it valid. This file will be reset by bright-view after reboot. 19 | 20 | (2) Permanent change: Check web interface of [bright-view] (https://10.8.6.22:8081/bright-view/#!/auth/login), modify (a). Network Tab (2nd Tab) -> Networks -> externalnet -> edit -> Settings, and (b). Devices Tab (5th Tab) -> Head node -> bcm -> Edit -> Settings -> Interfaces. Settings in (a) determines overall network parameter for cluster, while (b) determines setting for specific devices. 21 | 22 | Storage Node is not governed by bright-view. It's located in `/etc/NetworkManager/system-connections/` 23 | You can modify it with some Network Setting GUI or command line tool. Don't forget to restart the network interface before the new static 24 | ip address takes effect. 25 | 26 | ## 3. 
IP Pool Status 27 | Since we are using fixed IP setting, make sure there's no collision when you pick your favorite number. 28 | 29 | | IP | Usage | 30 | | --------- | ------------------------ | 31 | | 10.8.6.21 | Storage Node External IP | 32 | | 10.8.6.22 | Manage Node External IP | 33 | | 10.8.6.20 | Storage Node IPMI | 34 | | 10.8.6.23 | Manage Node IPMI | 35 | | 10.8.6.14 | Node04 IPMI | 36 | | 10.8.6.15 | Node05 IPMI | 37 | 38 | ## 4. Internal Network 39 | We also have three Internal Network for cluster: 40 | 41 | |Network|Description| 42 | | --------- | ------------------------ | 43 | |10.1.1.0/24|bright view management network, including intranet for node01-05| 44 | |10.1.2.0/24|IPMI network for Node01-03| 45 | |172.16.1.0/24|High speed NFS ib network| -------------------------------------------------------------------------------- /maintenance/nfs.md: -------------------------------------------------------------------------------- 1 | For clients, sometimes it is necessary to mount nfs as network disk. 2 | The command is 3 | ``` 4 | mkdir -p /media/data 5 | sudo mount -t nfs 10.8.6.21:/data /media/data 6 | ``` 7 | 8 | Persistent mounting (modify `/etc/fstab`, add one line) 9 | ``` 10 | 10.8.6.21:/data /media/data nfs rw 0 0 11 | ``` -------------------------------------------------------------------------------- /maintenance/overleaf.md: -------------------------------------------------------------------------------- 1 | The maintaince instruction is at [sharelatex-config](http://10.8.6.22:88/zhaofeng-shu33/sharelatex-config). -------------------------------------------------------------------------------- /maintenance/quota.md: -------------------------------------------------------------------------------- 1 | The user quota is setup on storage node to restrict the home directory of each user. 2 | 3 | For detailed instruction, please see [quota large than 4TB](https://serverfault.com/questions/348015/setup-user-group-quotas-4tib-on-ubuntu). 4 | 5 | Before executing these commands, make sure you are at storage node. 6 | 7 | ```shell 8 | quotaon /data # start the quota monitoring 9 | quotaoff /data # stop the quota monitoring 10 | sudo repquota -s /data # check the quota 11 | sudo setquota -u feng 1024G 1024G 0 0 /data # set the quota for user feng to use 1024G in maximum 12 | ``` 13 | 14 | -------------------------------------------------------------------------------- /maintenance/server_test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | import unittest 3 | import requests 4 | class TestHttpService(unittest.TestCase): 5 | def test_refbase_homepage(self): 6 | r = requests.get('http://10.8.6.22:8084/') 7 | self.assertEqual(r.status_code, 200) 8 | def test_wiki_homepage(self): 9 | r = requests.get('http://10.8.6.22/wiki') 10 | self.assertEqual(r.status_code, 200) 11 | def test_gitlab_homepage(self): 12 | r = requests.get('http://10.8.6.22:88') 13 | self.assertEqual(r.status_code, 200) 14 | def test_nonexist(self): 15 | r = requests.get('http://10.8.6.22/non_exist') 16 | self.assertEqual(r.status_code, 404) 17 | 18 | if __name__ == '__main__': 19 | unittest.main() -------------------------------------------------------------------------------- /maintenance/shorewall.md: -------------------------------------------------------------------------------- 1 | # Firewall 2 | configuration file located at `/etc/shorewall/rules`. After editting, use `sudo systemctl restart shorewall` to reload the firewall software. 
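For orientation, entries in `/etc/shorewall/rules` follow Shorewall's columnar ACTION/SOURCE/DEST/PROTO/PORT layout. The two rules below are purely illustrative and are not copied from the actual file:

```
#ACTION    SOURCE    DEST    PROTO    DPORT
ACCEPT     net       $FW     tcp      22        # SSH to the firewall host
ACCEPT     net       $FW     tcp      80,443    # web services
```

Running `shorewall check` before the restart validates the edited configuration without touching the running firewall.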
-------------------------------------------------------------------------------- /maintenance/slurm.md: -------------------------------------------------------------------------------- 1 | ## How to check slurm usage report 2 | 3 | Below command check the usage report form a specific time. 4 | ```shell 5 | sreport cluster all Start=2019-09-10 6 | ``` 7 | 8 | How to check GPU status of three computing nodes ( you should be root to execute this command) 9 | ```shell 10 | pdsh -w node01,node02,node03 /cm/local/apps/cuda/libs/current/bin/nvidia-smi | dshbak 11 | ``` 12 | 13 | The daemon service on computing node is called `slurmd` which read configuration file `/etc/slurm/gres.conf`. This file will be reset on rebooting (provision of computing node). 14 | The default file is located at `/cm/shared/apps/slurm/var/etc/gres.conf` which sayes the machine has 8 GPUs. This is true for node01 but node02 and node03 have only 6 gpus. Therefore, 15 | the configuration file should be modified to suit such purpose. Otherwise the slurm daemon on node02 (or node03) cannot be restarted. Currently, the replace file is located at 16 | `/data2/gres.conf`. All you should do after rebooting of node02 or node03 is: 17 | 18 | ```shell 19 | cp /data2/gres.conf /etc/slurm/gres.conf 20 | ``` 21 | 22 | ## How to undrain a node 23 | Using `root` account. 24 | ```shell 25 | scontrol update nodename=node01 state=resume 26 | ``` 27 | See [how-to-undrain-slurm-nodes-in-drain-state](https://stackoverflow.com/questions/29535118/how-to-undrain-slurm-nodes-in-drain-state) 28 | 29 | ## How to check cpu usage of a node 30 | Output the top 20 cpu consuming program on node01: 31 | ```shell 32 | srun -w node01 ps -Ao user,pcpu,args --sort=-pcpu | head -n 20 33 | ``` -------------------------------------------------------------------------------- /maintenance/sys-admin.md: -------------------------------------------------------------------------------- 1 | You should execute the following command in `root` user. 2 | 3 | ## Rebooting 4 | (计划之内的计算节点重启,由于安装新的软件等缘故) 5 | 6 | ```shell 7 | cmsh # enter cmsh shell 8 | device 9 | reboot node03 10 | ``` 11 | 12 | ## Swap Space 13 | Node03 has no swap space due to etcd installation on `node03`. That is node03 has etcd role in kubernetes cluster. 14 | The swap space can be checked with `free -m` command when logged in. 
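To check memory and swap on several nodes at once without logging into each of them, the `pdsh` pattern from slurm.md can be reused (run as root; the node list is illustrative):

```shell
pdsh -w node[01-05] free -m | dshbak   # per-node memory and swap summary
```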
15 | -------------------------------------------------------------------------------- /maintenance/user_management.md: -------------------------------------------------------------------------------- 1 | 集群中用户管理不同于普通单台linux 服务器,如果用 `useradd` 等命令只能在单机上建用户,使用LDAP 可以解决一个用户访问多台机器的问题。 2 | 目前服务器部分同学(2018级及之前)的账号存放在管理节点的 LDAP 数据库中,包括 NFS 的账号也在。 3 | 4 | 不在 LDAP 数据库中的同学账号无法ssh登录存储节点和管理节点 jupyter。 5 | 6 | ## Local user management 7 | ``` 8 | userdel username # preserve home 9 | ``` 10 | 11 | # Search 12 | All users: 13 | ```shell 14 | ldapsearch -LLL -x -b dc=cm,dc=cluster -s sub "(objectclass=posixAccount)" dn 15 | ``` 16 | One user: 17 | ``` 18 | ldapsearch -x 'uid=weijiafeng' 19 | ``` 20 | 21 | # Reset Password 22 | ``` 23 | ldappasswd -s 123456 -W -D 'cn=root,dc=cm,dc=cluster' -x 'uid=weijiafeng,dc=cm,dc=cluster' 24 | ``` 25 | `-s` 参数后面跟新的密码 26 | 27 | # LDAP setting 28 | ``` 29 | uri ldaps://10.8.6.22:636/ 30 | base dc=cm,dc=cluster 31 | ``` 32 | # User management GUI 33 | [lam](http://10.8.6.22/lam) 34 | -------------------------------------------------------------------------------- /maintenance/verify_password.py: -------------------------------------------------------------------------------- 1 | # this script is used to check whether username and password are 2 | # the same for given ip and port 3 | 4 | import socket 5 | from ssh2.session import Session 6 | from ssh2.exceptions import AuthenticationError 7 | 8 | ip = '10.8.6.22' 9 | port = 22 10 | # userlist = ['lewenyuan','lianjing','riccardo','tanyang','tianzhou','ziyanzheng'] 11 | # userlist = ['wangbin','lirong','jielian','liangcheng','tianyu','hutianyu','jiarongli','anzhicheng','bunchalit','jackkuo','jinggewang', 12 | # 'luis','jihongwang','lewenyuan','tanyang','ziyanzheng','yurunpeng'] 13 | userlist = [] 14 | for line in open('recore.txt','r'): 15 | userlist.append(line.replace('\n','')) 16 | def ssh_login(user): 17 | sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 18 | sock.connect((ip,port)) 19 | sock.setblocking(1) 20 | session = Session() 21 | session.handshake(sock) 22 | try: 23 | session.userauth_password(user,user) 24 | except AuthenticationError: 25 | return 26 | print(user) 27 | 28 | def ssh_login2(user): 29 | sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 30 | sock.connect((ip,port)) 31 | sock.setblocking(1) 32 | session = Session() 33 | session.handshake(sock) 34 | try: 35 | session.userauth_password(user,'password') 36 | except AuthenticationError: 37 | return 38 | print(user) 39 | 40 | def ssh_login3(user): 41 | sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 42 | sock.connect((ip,port)) 43 | sock.setblocking(1) 44 | session = Session() 45 | session.handshake(sock) 46 | try: 47 | session.userauth_password(user,'123456') 48 | except AuthenticationError: 49 | return 50 | print(user) 51 | 52 | if __name__ == '__main__': 53 | print('username------') 54 | for user in userlist: 55 | ssh_login(user) 56 | print('password-------') 57 | for user in userlist: 58 | ssh_login2(user) 59 | print('123456------') 60 | for user in userlist: 61 | ssh_login3(user) 62 | 63 | 64 | -------------------------------------------------------------------------------- /maintenance/virtual_machine.md: -------------------------------------------------------------------------------- 1 | ## Edit qemu qcow2 file 2 | This is needed to extract the linux kernel and boot temporary file system 3 | 4 | `/home/feng/qemu/debian-jessie/hda.qcow2` is the virtual disk of the qemu vm. 
You can actually mount this disk without 5 | starting the vm. The following steps are necessary to do this: 6 | ```shell 7 | cd /home/feng/qemu/debian-jessie/ 8 | sudo modprobe nbd 9 | sudo qemu-nbd --connect=/dev/nbd0 ./hda.qcow2 10 | fdisk /dev/nbd0 -l 11 | sudo partx -a /dev/nbd0 12 | sudo mount /dev/nbd0p1 /home/feng/mntpoint/ 13 | # the following commands revert to normal 14 | sudo umount /home/feng/mntpoint/ 15 | sudo qemu-nbd --disconnect /dev/nbd0 16 | sudo rmmod nbd 17 | ``` 18 | See [this gist](https://gist.github.com/shamil/62935d9b456a6f9877b5) for further detail. 19 | 20 | ## Compile QEMU source code 21 | ``` 22 | ./configure --help 23 | ./configure --enable-gtk # the missing deps can be installed by apt ... 24 | make install # needs sudo, use make -j28 to accelerate 25 | ``` 26 | 27 | ## GUI Arm64 28 | The following command is used to boot the debian buster system from the virtual disk 29 | ``` 30 | qemu-system-aarch64 -smp 2 -netdev user,id=mynet -device virtio-net-device,netdev=mynet \ 31 | -m 2G -M virt -cpu cortex-a72 -drive if=none,file=hdd01.qcow2,format=qcow2,id=hd0 \ 32 | -device virtio-blk-device,drive=hd0 -device VGA \ 33 | -kernel vmlinuz-4.19.0-9-arm64 -append 'root=/dev/vda2' -initrd initrd.img-4.19.0-9-arm64 \ 34 | -device virtio-scsi-device -device usb-ehci -device usb-kbd -device usb-mouse -usb 35 | ``` 36 | To install the system for the first time, first you need to download the iso file, for example 37 | [debian-10.4.0-arm64-netinst.iso](https://mirrors.tuna.tsinghua.edu.cn/debian-cd/10.4.0/arm64/iso-cd/debian-10.4.0-arm64-netinst.iso). 38 | 39 | The second step is to create a virtual disk file. The command to do this is `qemu-img create`. We recommend `qcow2` format. 40 | ``` 41 | qemu-img create -q qcow2 hdd0.qcow2 40G 42 | ``` 43 | Then using the following command to launch the installer: 44 | ``` 45 | qemu-system-aarch64 -M virt -m 2048 -cpu cortex-a72 -smp 2 \ 46 | -drive if=none,file=debian-10.4.0-arm64-netinst.iso,id=cdrom,media=cdrom \ 47 | -device virtio-scsi-device -device scsi-cd,drive=cdrom \ 48 | -drive if=none,file=hdd0.qcow2,id=hd0 -device virtio-blk-device,drive=hd0 49 | ``` 50 | 51 | For Ubuntu 18.04 you do not need to extract the kernel and initmf manually and the following command is enough to startup the system: 52 | ``` 53 | qemu-system-aarch64 -m 2048 -cpu cortex-a72 -smp 2 -M virt \ 54 | -device virtio-scsi-device -drive if=none,file=hdd0.qcow2,id=hd0 -device virtio-blk-device,drive=hd0 \ 55 | -display gtk -device bochs-display --bios QEMU_EFI.fd 56 | ``` 57 | within the directory `/home/feng/qemu/ubuntu-18.04` 58 | -------------------------------------------------------------------------------- /maintenance/vpn.md: -------------------------------------------------------------------------------- 1 | ## 自建内网穿透 2 | 如果你使用其他 Linux 发行版的操作系统,很可能无法使用官方VPN外网接入实验室服务器,此时可以用自建内网穿透的方式, 3 | 比如 [ssh 端口转发](https://www.cnblogs.com/zhaofeng-shu33/p/10685685.html) 或者 搭建[frp](https://github.com/fatedier/frp) 4 | 服务。由于智园网络不是很稳定以及这两种方式本身的局限性,可能会断掉。断掉后可以结合官方VPN重启。 5 | 6 | ## windows proxy server 与 proxychains 配合 7 | ### 背景 8 | * 装学校 Linux 版 Matlab 会遇到 matlab 启动时连学校服务器 matlab.cic.tsinghua.edu.cn,但在智园的网络环境下这个 ip 是校本部内网无法访问,因此 正版matlab 无法启动。 9 | * 假如你的工作站用 Linux 操作系统,已知 YangLi 用 Ubuntu, zhaofeng-shu33 在深研院用 Fedora,则无法使用 VPN 客户端连接特定的服务。 10 | ## Solution 11 | 为解决 Linux 操作系统不能用校本部和深研院 VPN 的问题,或更广泛地说,如何在 Linux OS 中使用 Windows VPN ,经过 zhaofeng-shu33 一番探索,提出如下的解决方案:通过 windows 虚拟机与特定的网络配置实现。 12 | 13 | ### Path 14 | 1. 
linux os 可以通过 wine 运行 windows exe,但对于涉及复杂网络配置的 vpn client, wine 的思路应该是行不通的,即使可以运行,linux os 也无法利用 client 弄出来的 network adapter。 15 | 1. windows 虚拟机的思路很容易想到,但如何将 windows 作为网关是一个难题。 win 10 虚拟机不是 win server,自带的 network 模块受限制,比如有名的 RRAS(routing and remote access service)虽然可以开启服务但无法使用管理工具,因此笔者在一番尝试下发现行不通。 16 | 1. 于是采用 windows proxy server 的方法, 同样因为 IIS 作 proxy 功能太弱,笔者费了一番功夫找到了一个开源的 cross-platform 的 proxy server 名字叫 [3proxy](https://github.com/z3APA3A/3proxy),这个项目文档做的不好,经过一番探索才跑起来 server。 17 | 1. linux os 如何使用搭建好的 proxy server 也是个难题,CentOS 7 实际并没有 system proxy 这一概念,有的 network client 是不会用你设置好的 http proxy 环境变量的。为此,只能在底层做一些改动强制软件如 matlab 使用 proxy。笔者找到了一个开源的 UNIX 平台的 [proxychains4](https://github.com/rofl0r/proxychains-ng) 18 | 19 | ### 最终网络结构 20 | eth0 是一个内网,中心型拓扑结构 21 | * 管理节点 10.1.1.254 (如下4台机器的网关和DNS服务器) 22 | * 计算节点1 10.1.1.1 23 | * 计算节点2 10.1.1.2 24 | * 计算节点3 10.1.1.3 25 | * Win10 10.1.1.4 (运行 3proxy service) 26 | 27 | 在其他机器上(非Win10),通过 3proxy 可访问到 matlab.csc.tsinghua.edu.cn 28 | 29 | ```shell 30 | curl --socks 10.1.1.4:1080 matlab.cic.tsinghua.edu.cn:27000 31 | curl --proxy 10.1.1.4:3128 matlab.cic.tsinghua.edu.cn:27000 32 | ``` 33 | curl 客户端通过命令行参数可以指定 proxy server,但 matlab 不行。 34 | 35 | ### 具体步骤 36 | 这里以服务器的管理节点为例,对于工作站,也是类似的。 37 | 38 | 1. 首先要在管理节点安装 VMWare Win10 虚拟机,配置网络环境为 Bridge,使用管理节点的 etho adapter,这里可能需要解锁 vmware 的一个高级功能,参考 [wiki](http://10.8.6.22/wiki/index.php/Admin): 39 | 配置静态ip, network mask, gateway, dns server ip , 然后确保 win10 可以上外网. 打开 win10 icmp firewall 的限制。在管理节点可以 ping 通 10.1.1.4。 40 | 41 | 1. 然后在 Win10 上安装 3proxy pre-built binary. 完成相关的配置,其中的要点有:取消 proxy 密码限制,打开对 3proxy service 的 firewall restriction。在管理节点使用 curl 命令测试 proxy server 是否正常。 42 | 43 | 1. 在管理节点编译安装 proxychains,修改 `/etc/proxychains.conf`,其中的要点有取消对 dns 请求的代理,对 localhost 的代理。然后用 `proxychains matlab` 启动 matlab图形化界面。matlab 命令行初始化需要一点时间,需要等待半分钟左右。 44 | 45 | 最终启动效果图如下: 46 |  47 | 48 | 目前 matlab 装到了集群共享目录 `/cm/shared/modulefiles/matlab` 下面 49 | 50 | 通过在其他计算节点下安装 proxychains 并集成好,配置好 module load/unload matlab,可以实现对最终用户无感地使用 matlab,并且能以 shell mode 启动,通过 slurm 以 batch mode 完成大型计算任务等。 51 | 52 | ### Known Issues 53 | * Win10 虚拟机目前是以 feng 用户启动的,有无可能以 daemon 启动。3proxy 已经注册为 Win 10 service, 不需要用户登录 Win10 即可在后台运行。 54 | * Win10 安装了校本部 VPN 客户端,用的是 zhaofeng-shu33 的账号,需要一直保持连接。 55 | 56 | ## OpenVPN 57 | 在实验室服务器管理节点部署 OpenVPN Server, 客户端安装 OpenVPN Client 后可直接访问到集群内网(计算节点和Win虚拟机),比如 ping 通 `10.1.1.2`(node02 的 ip)。 58 | 安装 Client 后,要手动导入根证书,root certificate 在服务器 `/usr/share/doc/openvpn-2.4.6/sample/sample-keys/ca.crt` 的位置上。 59 | 将该文件下载到本地,client config file 里 ca 换成本地 ca.crt 的绝对路径,注释掉 `cert` 和 `key` 的配置。 60 | 添加 server ip 地址: `remote 10.8.6.22 1194` 及 使用用户名密码校验:`auth-user-pass`(与 ssh 登陆用户名与密码相同)。 -------------------------------------------------------------------------------- /maintenance/vpn_on_bcm.md: -------------------------------------------------------------------------------- 1 | 出于科学上网、连接校本部vpn 等考虑, bcm 管理节点上部署了若干 vpn。 2 | 3 | ## 科学上网 4 | 5 | ## 校本部 vpn 6 | 目前采用 windows 虚拟机中转的方法,通过 `proxychain4` 可以在计算集群上使用 `matlab`。 7 | 8 | ### 替代方案 9 | 使用 openconnect 则可免去虚拟机常开、使用 `proxychain4`, `3proxy` 中间件的繁琐,但目前的问题是计算节点直接连接 `10.1.1.254` (学校网关)而管理 10 | 节点不是路由器,因此无法在计算节点上使用 `matlab`。 -------------------------------------------------------------------------------- /maintenance/wiki.md: -------------------------------------------------------------------------------- 1 | Our wiki is served by standard LAMP framework. The information page is at `http://10.8.6.22/wiki/index.php/Special:Version`. 2 | 3 | The codebase is at `/var/www/html/wiki`. 
If you need to add new extension to our wiki, register the extension at `/var/www/html/wiki/LocalSettings.php`. 4 | 5 | Our wiki is served by PHP 7.2 with extension `mbstring` and `xml`. The database we used is mariadb-server 5.5.60 6 | 7 | 8 | -------------------------------------------------------------------------------- /markdown.md: -------------------------------------------------------------------------------- 1 | To make markdown tables. You can use some visual editor, for example [tablesgenerator](https://www.tablesgenerator.com/markdown_tables). 2 | 3 | -------------------------------------------------------------------------------- /megaglest.md: -------------------------------------------------------------------------------- 1 | Megaglest is an online strategy game. We have deployed it on our server. Normally it is not started. 2 | To start it, at storage node terminal, go to `/home/feng/megaglest-3.13.0/` and run `./start_megaglest_gameserver --headless-server-mode=exit`. 3 | 4 | Then from client game ( windows binary can be downloaded by `scp username@10.8.6.22:/home/feng/Downloads/MegaGlest-Installer-3.13.0_windows_64bit.exe ./`) 5 | choose Network LAN mode and enter `10.8.6.21`. 6 | 7 | Notice: use `wget github_game_release_url` on the server cluster to download the binary for other platforms. -------------------------------------------------------------------------------- /module.md: -------------------------------------------------------------------------------- 1 | # module 2 | 3 | ## How to load different Tensorflow Version? 4 | We have two versions of Tensorflow available on our sever: a very old version 1.8.0 and the last version of TF1 1.15.0. 5 | 6 | TF version 1.8.0 is combined with anaconda3/py3. Beside this, you also need following modules: 7 | ```bash 8 | module load anaconda3 cuda90 cudnn openmpi/cuda 9 | ``` 10 | 11 | TF version 1.15.0 is a seperate module based on anaconda3/py3, you have to load it explicitly: 12 | ```bash 13 | module load tensorflow/1.15.0 # gpu 14 | # or 15 | module load tensorflow/1.15.0-cpu 16 | ``` 17 | 18 | TF2.x is not supported globally on server now. 19 | ## Small tips 20 | 21 | unload all loaded modules 22 | ```shell 23 | module purge 24 | ``` -------------------------------------------------------------------------------- /programming_language.md: -------------------------------------------------------------------------------- 1 | # Python 2 | ## 2.7 3 | This version of python retired in 2020.1.1. Do not develop new codes in this version. 4 | If you need to maintain old codes, consider transfer to Python 3.x. 5 | If you need to run Python2.7 codes of others. Here are some tips: 6 | 7 | * `--user` and `--editable` cannot be used simultaneously. 8 | 9 | ## 3+ 10 | ### Sympy 11 | Symoblic Computing 12 | 13 | ## Use CPUs 14 | For tensorflow, if you need to use cpus on compute node, try the following code 15 | ``` 16 | import os 17 | os.environ['CUDA_VISIBLE_DEVICES'] = '' 18 | ``` 19 | 20 | # PHP 21 | Sometimes you need to develop a web application or website to make your work accessible by others from browser client. You need to choose a 22 | backend programming language. If you use php, you can use our server as the developing machine since php developing on windows is not the best choice. 23 | 24 | Two versions of php are installed on the server. One is the default version 5.4 when you invoke `php` command. The other is version 7.2, 25 | If you need 7.2 toolchain, run `module load php7` on manage node. 
Then `php --version` will show you it is 7.2. Also `phpunit` is available. 26 | After your usage, you can run `module unload php7` to quit. 27 | 28 | To develop, you can invoke `php -S localhost:8213` in the shell with current working directory as your document root. Then you can access the port to 29 | debug your application. 30 | 31 | # C++ 32 | ## Compiler 33 | 34 | I recommend to use higher version of `g++`. You can type `module load gcc/8.2.0`. 35 | 36 | ## CMake 37 | 38 | The version of `cmake` is **2.8**. To use higher version, you can use `module load cmake/3.12.3`. 39 | 40 | Our cluster cannot identify loaded module like higher version of `g++`. To use 41 | it in cmake, you need to specify the abosolute path like the following. 42 | ```bash 43 | cmake -DCMAKE_CXX_COMPILER=/cm/local/apps/gcc/8.2.0/bin/g++ .. 44 | ``` 45 | ## ABI Version 46 | 47 | Some preinstalled libraries (for example `boost-1.53.0`) are built by gcc version <=4.8. 48 | Because of [ABI Compatability](https://gcc.gnu.org/onlinedocs/gcc-5.2.0/libstdc++/manual/manual/using_dual_abi.html) 49 | you need to add some extra preprocessor flags to `g++` to be able to link your `cpp`. For dual abi problem, add `add_definitions("-D_GLIBCXX_USE_CXX11_ABI=0")` 50 | in the `CMakeLists.txt`. 51 | 52 | ## Dependencies 53 | 54 | A typical C++ project dependends on some other libraries, which is recommended to be 55 | installed system-wide. The installation needs sudo privileges. If you are 56 | common users, you can contact the server admin to install them. For some cases, you can download the prebuilt `rpm` or `deb` package and install them to 57 | a custom directory. Then in you project you should specify this custom directory to your compiler. 58 | 59 | Once installed, many packages can be checked via `pkg-config`. 60 | For example, to see which compile flags are needed to use glib library, you can use: 61 | ```shell 62 | pkg-config --cflags glib-2.0 63 | ``` 64 | To see which link flags are needed to use glib library, use: 65 | ```shell 66 | pkg-config --libs glib-2.0 67 | ``` 68 | ## Python Binding 69 | 70 | To develop Python binding for your C++ project, Cython is needed. For complex binding, you need to customize your `setup.py` to build the extension. 71 | 72 | ## Debug 73 | Command line debugger `gdb` can be used on the manager node. On Storage node, you can launch vscode and use the GUI debugging tool, which uses `gdb` internally but provide many convenient functionalities. 74 | 75 | # TeX 76 | On our manage node, full scheme of texlive 2019 is installed. 77 | 78 | To use latex on our server. You need to add `/usr/local/texlive/2019/bin/x86_64-linux` to your path. 79 | 80 | Then 81 | ```shell 82 | mkdir -p build && xelatex -output-directory=build your_file.tex 83 | ``` 84 | 85 | After successful complication, you can view your pdf with: 86 | ```shell 87 | xdg-open build/your_file.pdf 88 | ``` -------------------------------------------------------------------------------- /service.md: -------------------------------------------------------------------------------- 1 | # Seafile 2 | Seafile is a cloud storage service. Our university has deployed a seafile service: 3 | [https://cloud.tsinghua.edu.cn/](https://cloud.tsinghua.edu.cn/). You can use it through your universal pass. 4 | 5 | Our server has also deployed an instance, whose account is managed by LDAP. 6 | You can use 7 | your ssh login name and password to use [seafile cloud storage](http://10.8.6.22:8030/). 
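If you want to use the service from the command line (for example to script uploads), the sketch below logs in with the same credentials through the Seafile Web API. The endpoint paths (`/api2/auth-token/`, `/api2/repos/`) follow the Seafile Web API v2 convention and the username, password, and token values are placeholders; double-check them against the API documentation of the Seafile version we deployed.
```shell
# exchange your ssh username and password for an API token (placeholders, illustrative only)
curl -d "username=your_username" -d "password=your_password" \
    http://10.8.6.22:8030/api2/auth-token/
# use the returned token to list the libraries visible to your account
curl -H "Authorization: Token your_token_here" http://10.8.6.22:8030/api2/repos/
```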
8 | Notice that the file you uploaded through seafile is stored in our storage node. 9 | 10 | 11 | # Overleaf 12 | The university has an instance [https://overleaf.tsinghua.edu.cn](https://overleaf.tsinghua.edu.cn). 13 | You can login using your university universal pass. 14 | 15 | Our server has also deployed an overleaf community server. 16 | To start use it, visit [overleaf](http://10.8.6.21:8031). 17 | 18 | 19 | ## Service starting up 20 | In the case where seafile and seahub services are down, we can reboot it manually by 21 | ```bash 22 | cd /data1/seafile-server/seafile-server-7.0.5 23 | bash seafile.sh start # root priviledge 24 | bash seahub.sh start # root priviledge 25 | ``` 26 | 27 | ## Advantages 28 | 29 | * 微信电脑与手机互传文件有诸多限制,替代之。 30 | * fast speed and large space 31 | * mobile client support 32 | 33 | ## Use cases 34 | 35 | 1. Download the `seafile` mobile client from the Huawei Market. 36 | 37 | 2. Add our server and login use your ssh account. 38 | 39 |  40 | 41 | 3. Upload your files from mobile to the server. 42 | 43 |  44 | 45 | 4. (optional) Download your files from desktop browser client. 46 | 47 | # SSH 48 | ## About ssh in general 49 | SSH client has a configuration file. On macos it is located at `/etc/ssh/ssh_config`. It forwards `LC_*` and `LANG` environment variables to the server you ssh to by default. If these environment variables are not set properly, they could cause problems. See [locale setting](https://askubuntu.com/questions/412495/setlocale-lc-ctype-cannot-change-locale-utf-8) and [ssh forwarding env](https://superuser.com/questions/513819/utf-8-locale-portability-and-ssh) for detailed explanation. 50 | 51 | If you encounter locale setting warning, please add 52 | `export LC_ALL=en_US.UTF-8` to your `~/.bash_profile`(if you use other shells, the profile name is different). 53 | 54 | ## How to upload files to server 55 | ### GUI 56 | `sftp` is recommended 57 | ### scp 58 | Using `scp` you can upload one file or file glob to or from the server. 59 | ### rsync 60 | If you need to upload multiple files within a directory or upload large files, you can use `rsync`. 61 | 62 | On Windows platform, you can use Mobaxterm local terminal to finish this job. First `cd /drives/c/[your large files on disk C]` to your file. Then 63 | ```shell 64 | rsync -av --progress your_file user@10.8.6.21:/home/user/[your path] 65 | ``` 66 | 67 | # NFS 68 | If you use unix system for your workstation, you can mount our nfs server by 69 | 70 | ```shell 71 | sudo apt-get install nfs-common # Ubuntu 72 | sudo mkdir -p /mnt/server 73 | sudo mount 10.8.6.21:/data /mnt/server 74 | ``` 75 | 76 | Limitations: you may not have proper access to your home directory. A possible solution is to change your local user id and group id to match the 77 | remote one. -------------------------------------------------------------------------------- /slurm.md: -------------------------------------------------------------------------------- 1 | `slurm` is our workload manager. Here are some tips: 2 | 3 | 1. If you need to specify the node to run your program use: 4 | 5 | ```shell 6 | #SBATCH --nodelist=node03 7 | ``` 8 | 9 | in your script. 10 | 11 | For `srun`, you can use `srun -w node01 echo "hello world". 12 | 13 | 2. If you need more cpu resources to run your program, add 14 | 15 | ```shell 16 | #SBATCH -c 4 17 | ``` 18 | Or you can use `srun -c 4 your_multi_threaded_or_multi_process_program`. 19 | 20 | That is, you require 4 cpus to run your program. 
21 | By default, you only have one physical cpu if not specified. This cpu has 2 logical cores. You can request maximum 32 cpus. 22 | 23 | 3. You can submit several serial programs within one job and run them in parallel. 24 | 25 | ```shell 26 | #!/bin/bash 27 | srun your_first_program -options & 28 | METHOD=cpu_forward srun your_second_program -options & 29 | wait 30 | ``` 31 | 32 | You should add `&` to each `srun`. Also, do not forget `wait` at the very end. 33 | 34 | After job finishes, you can check the detail by `sacct -j your_job_id`. It may shows how the job is decomposed into 2 tasks which are running in parallel. 35 | 36 |  37 | 38 | 4. Check cpu and memory usage of other nodes 39 | 40 | ```shell 41 | srun --nodelist=node02 ps -u feima -o pid,user,%mem,%cpu,cmd 42 | ``` 43 | 5. submit array jobs 44 | See [Introduction to array jobs](https://slurm.schedmd.com/job_array.html). 45 | For 2d array jobs. see usage example of [2d array jobs](https://wiki.anunna.wur.nl/index.php/Array_jobs) 46 | You can use this techniques to download file in multi-threaded way. See [download-curl.sh](./download-curl.sh) for example. 47 | 48 | 6. neural network library using cpus 49 | If your gpu resource limit is hit or you want to train your neural network using CPU. You can explicitly do this by specificying 50 | some environment variable. 51 | 52 | 7. Interactive Shell 53 | ```shell 54 | srun --gres=gpu:1 -t 500 --pty bash 55 | ``` 56 | You can use this method to debug your GPU program. Please quit it after your debugging session. This slurm job will also occupy one GPU. 57 | You can use `tmux` in this temporary session to open multiple windows. This is useful when you need to inspect the resource usage of current node with `top` or `/cm/local/apps/cuda/libs/current/bin/nvidia-smi`. 58 | Notice that all your tmux sessions are killed if you quit the shell. This behaviour is different with normal tmux usage. 59 | 60 | 8. openmpi 61 | ```shell 62 | module load openmpi/gcc/64/1.10.7 63 | # compile your programs 64 | salloc -N 2 mpirun your_mpi_program 65 | ``` 66 | If you need to use hybrid programming (e.g. OpenMP + MPI), you need to enable specifically: 67 | ```shell 68 | salloc -N 2 -c 8 69 | mpirun -n 2 --cpus-per-rank 4 --oversubscribe --bind-to none nvtc/nvtc-variant -f /home/dataset/triangle_counting_dataset/s28.e15.kron.edgelist.bin 70 | ``` 71 | You can have one line 72 | ```shell 73 | salloc -N 2 -c 6 mpirun -n 2 --cpus-per-rank 3 --bind-to none --oversubscribe ./hybridhello 74 | ``` 75 | If `mpirun` is invoked using `sbatch`, the sample file for `run.sh` is 76 | ```bash 77 | #!/bin/bash 78 | #SBATCH -N 2 79 | #SBATCH --gres=gpu:1 80 | #SBATCH -t 500 81 | #SBATCH -c 8 82 | mpirun -n 2 --bind-to none --cpus-per-rank 4 --oversubscribe /home/feng/triangle_counting_gpu_v2/build/nvtc/nvtc-variant -f /home/dataset/soc-LiveJournal1.bin 83 | ``` 84 | 9. Unbuffer 85 | If you are using `srun` to run your commands you can use the `--unbuffered` option which disables the output buffering. 86 | 87 | 10. mpich 88 | Run `mpich` program is similar but different with `openmpi`. First you need load `mpich/ge/gcc/64/3.3`, then compile your programs with `mpicc` provided by `mpich`. Finally run 89 | your program with `srun -N 2 --mpi=pmi2 your_program`. 90 | 91 | 11. Request nodes unavailable 92 | In some cases when some programs are running on `node01` and GPU is used out. Then you want to use CPU of `node01` to do some computation. You find your cpu job is pending. 
93 | You can add the time limit to make your job run. That is `srun -t 500 -w node01 python --version`. 94 | 95 | 12. Submit graphics jobs 96 | You can use `slurm` to request graphics jobs, for example `xterm`. First you need to use `ssh-keygen -m pem` on our server to generate default 97 | `~/.ssh/id_rsa.pub` and `~/,ssh/id_rsa`. You cannot specify the paraphrase. Then you should add `id_rsa.pub` to `~/.ssh/authorized_keys`. You can automatically 98 | do so by `ssh-copy-id username@10.8.6.21`. 99 | Finally make sure you `ssh -Y` to our server (within network displays). Then use `srun --pty --x11 xterm` and enjoy it. 100 | 101 | 13. Debug mpi programs within slurm 102 | For openmpi, first using `salloc -N 2 --x11` to allocate nodes. Then use `mpirun -n 2 xterm -e gdb --args nvtc/nvtc-variant -f test_io_nvgraph.bin` to launch user programs within `xterm`. -------------------------------------------------------------------------------- /software.md: -------------------------------------------------------------------------------- 1 | ## Matlab 2 | If you want to use, type `module load matlab`. 3 | 4 | ### GUI 5 | You can use GUI matlab to explore small dataset or plot figures for your paper. 6 | 7 | Warning: Do not use GUI mode to run large-scale experiment. GUI is run on manage node and will use all CPUs if parallelism is enabled. This will cause the instability of the system. 8 | 9 | If you connect the server via X11 or vnc desktop, you can start matlab by `proxychains4 matlab`. 10 |  11 | 12 | ### Non-GUI 13 | Under normal SSH login, you can invoke matlab by `proxychains4 matlab -nodesktop`. Notice this software is experimental supported by lab2c web admin. 14 | Matlab may not start as you want. 15 |  16 | 17 | ### Script Mode 18 | ```shell 19 | proxychains4 matlab -nodesktop -nosplash -nodisplay -r "run('path/to/your/script.m');exit;" 20 | ``` 21 | 22 | ### Large Scale Computation 23 | You can use `srun` to submit matlab job to compute node. 24 | 25 | ## Mathematica 26 | You can use the storage node to run mathematica player. This software can view the Mathematica notebook. Type `WolframPlayer` 27 | to open the software. 28 | 29 |  30 | 31 | You can also run Mathematica Computing Engine within Raspberry Pi. We have deployed [3 raspberries](http://10.8.6.22:88/zhaofeng-shu33/slurm-test-env) in our labs. 32 | See [wolfram](https://www.wolfram.com/raspberry-pi/) for detail. 33 | 34 | After `ssh pi@10.8.15.39` to the raspberry OS, type `wolfram` to start symbolic computing: 35 | ``` 36 | pi@raspberrypi:~ $ wolfram 37 | Mathematica 12.0.1 Kernel for Linux ARM (32-bit) 38 | Copyright 1988-2019 Wolfram Research, Inc. 39 | 40 | In[1]:= 1+2 41 | 42 | Out[1]= 3 43 | 44 | In[2]:= Quit[] 45 | ``` 46 | ## Texworks 47 | You can use the storage node to run texworks. 48 | 49 |  50 | 51 | ## Inkscape 52 | You can use `inkscape` to make vector graphs, which are suitable for paper embedded figures. 53 | -------------------------------------------------------------------------------- /virtual_machine.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | 本文档介绍虚拟机的使用,有关更轻量的 docker 虚拟化参见 [docker.md](./docker.md)。 4 | 5 | 由于历史原因, 6 | 目前实验室部署有3类虚拟机系统, 一类是 vmware player, 由于安装的是免费版,限制较多,但图形化界面相对比较成熟; 7 | 另一类是 virtualbox,为开源系统,采用 `valgrant` 进行管理。 8 | The third is `qemu` on storage node. 
9 | 10 | ## VMWare 11 | 用 VMWare 虚拟机部署的虚拟机本着实用的原则,配置较高,一般处于关闭状态,仅在有需要时开机并且对外开放端口。 12 | * 一台 Win 10,4核8GB内存,128GB SSD本机硬盘,已安装 VS2019 C#, Cygwin 开发环境 13 | (CPU慢于非虚拟化桌面,而由于存储类型不一样,Win10 VMWare, Disk IO 快于linux 系统中 Home读写) 14 | 15 | 这台机子同时有着连接校本部vpn使得正版matlab 可以在管理节点上使用的作用 16 | 17 | Windows 支持ssh 连接和rdp 连接,下图是用 rdp连接 windows 的截图: 18 | 19 | 连接的效果图 20 |  21 | 22 | Windows 10 虚拟机原来的网络配置是用 NAT 方法将远程桌面 3306 端口暴露给外网,后来由于要支持 matlab 的认证,改用 Bridge 方法加入 10.1.1.1/24 内网,虽然3306端口还开着,但原来配置的远程桌面不能使用了。目前可以用 VMWare (先登录 First Type 远程桌面)的方式连接。 23 | 24 | ### 连接Windows 远程桌面 25 | 在 windows 电脑 上首先 ssh 登陆管理节点,在管理节点终端上输入 26 | ```shell 27 | ssh -L 10.8.6.22:3389:10.1.1.4:3389 username@10.8.6.22 28 | ``` 29 | 然后打开远程桌面输入地址 10.8.6.22, 公共账号用户名:lab2c,密码:guest,是 Win10 普通用户。 30 | 后连接即可。RDP 会话过程 SSH 会话不能中断。 31 | 32 | 连接成功后可用 Juniper客户端 输入自己的用户名和密码开启校本部VPN,以使用服务器集群上的 Matlab, Windows 虚拟机未安装 Matlab。 33 | 34 | Known issue: Only One user is allowed each time. See [rdpwrapper](https://github.com/stascorp/rdpwrap) for a possible solution. 35 | 36 | * MacOS High Serria (10.13)(4核8GB内存,64GB SSD本机硬盘,暂无公共账号),已安装 command line tools, `Homebrew` 37 | 38 | 连接效果图 39 |  40 | 41 | ## VirtualBox 42 | 43 | VirtualBox 使用 `valgrant` 进行配置,目前下载了 Ubuntu 14.04 和 MacOS 12 的虚拟机。 44 | 45 | Available virtual machine can be found from [vagrantup](https://app.vagrantup.com). 46 | 47 | 默认是使用 ssh 登录。 48 | 49 | ## Debian Arm 50 | 51 | `ssh` to storage node(10.8.6.21) and then cd to `/home/feng/qemu/debian-jessie` working directory. 52 | 53 | Use `bash startup.sh` to start the virtual machine. You need to wait about half a minite for the VM to be started. 54 | Then the following prompt is available. 55 | ``` 56 | debianqemu login: lab2c 57 | Password: 58 | Linux debianqemu 3.16.0-6-armmp-lpae #1 SMP Debian 3.16.56-1+deb8u1 (2018-05-08) armv7l 59 | 60 | The programs included with the Debian GNU/Linux system are free software; 61 | the exact distribution terms for each program are described in the 62 | individual files in /usr/share/doc/*/copyright. 63 | 64 | Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent 65 | permitted by applicable law. 66 | lab2c@debianqemu:~$ pwd 67 | /home/lab2c 68 | ``` 69 | You can use the guest account `lab2c` with password `lab2c` to have a try. -------------------------------------------------------------------------------- /vnc.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | 本文档介绍远程桌面的使用。 4 | 5 | 目前实验室分别在管理节点(10.8.6.22)和存储节点(10.8.6.21)部署了 thinlinc 远程桌面服务器。 6 | 7 | The manage node supports 10 concurrent connections while the storage node supports 5 concurrent connections. If the maximum connection is hit, 8 | the server admins will coordinate on this issue and may cut off some sessions. 
9 | 10 | 欲使用,首先在[官网](https://www.cendio.com/thinlinc/download) 下载相应操作系统的客户端(**专用**)。 11 | 12 | 连接时只需输入IP地址,无需输入端口号,用户名与密码与ssh 登录的用户名和密码相同。下图是 windows 客户端使用截图: 13 | 14 |  15 | 16 | thinlinc 还支持用浏览器作为客户端登录(HTTPS),使用 300端口。https://10.8.6.22:300。 17 | 18 | 首次登录后选择桌面环境,管理节点建议选择 xfce 桌面环境, 存储节点选 gnome 桌面环境。 19 | 20 | 桌面环境与服务器的操作系统一致,可以打开图形化的终端进行操作。 21 | 22 | 23 | ## 注意 24 | 由于实验室很多同学在自己的`.bashrc` 文件中 load anaconda3 模块与系统 python 冲突,使用 vnc 会出现闪退的问题。建议在自己的HOME 目录下更改 25 | `.bash_profile`文件中,将以下行: 26 | ```shell 27 | if [ -f ~/.bashrc ]; then 28 | ``` 29 | 改成: 30 | ```shell 31 | if [ -f ~/.bashrc ] && [ -z "$TLPREFIX" ]; then 32 | ``` 33 | 34 | thinlinc 是商业软件,实验室买的授权是管理节点10个用户同时在线。 35 | 存储节点是安装的免费版, 支持最多5个用户同时在线。 36 | 37 | 下图是在存储节点打开 gnome 远程桌面的截图。 38 |  39 | 40 | 41 | -------------------------------------------------------------------------------- /vpn.md: -------------------------------------------------------------------------------- 1 | # 官方VPN 2 | 3 | ## 清华深研院的VPN -- easy connect 4 | 目前使用清华深研院的VPN可以在外网连接实验室服务器。请从 [http://vpn.sz.tsinghua.edu.cn](http://vpn.sz.tsinghua.edu.cn) 下载相应的客户端。 5 | 下载安装打开VPN,服务器地址填 [https://vpn.sz.tsinghua.edu.cn](https://vpn.sz.tsinghua.edu.cn), 6 | VPN 初始用户名是学号,初始密码是身份证后8位。 7 | 8 | 由于该VPN 5分钟内无连接即自动断开,给使用SSH 造成不便,建议通过设置客户端`ssh_config`中添加: 9 | ``` 10 | ServerAliveInterval 30 11 | ServerAliveCountMax 60 12 | ``` 13 | 本地 ssh 每隔30s向 server 端 sshd 发送 keep-alive 包,如果发送 60 次,server 无回应断开连接。 14 | 15 | 虽然官网上可以下载Linux 的客户端,但客户端比较旧了,据说只支持 Ubuntu 16.04, Ubuntu 17.04. 16 | 17 | ## VPN of library of University Town of Shenzhen 18 | If you need access to some electronical resources, you can also use VPN of the library. 19 | See [lib vpn how to](https://lib.utsz.edu.cn/page/id-544.html?locale=zh_CN) for detail. 20 | The username is the student id (prefix with 11). 21 | The password is the one of universal identification authentication for UTSZ service. 22 | 23 | ## 校本部的VPN 24 | 智园和大学城的网络做过特殊处理,可以直接访问 info 等网站,但有些站点还不行(比如软件中心下载软件、激活 Win10、Matlab 等),此时要用到 校本部 VPN 25 | 客户端 pulse secure。 26 | Windows 和移动端下载地址 在 [https://sslvpn.tsinghua.edu.cn](https://sslvpn.tsinghua.edu.cn) 页面可以找到。 27 | Mac 下载地址学校官方没有提供,可以自行搜索,这里给出开发这个客户端软件的公司提供的下载地址(国外网站,客户端下载速度较慢): 28 | * [MacOS](http://trial.pulsesecure.net/clients/ps-pulse-mac-9.0r4.0-b1731-installer.dmg) 29 | 30 | 官网上下载的Linux 的客户端,无法成功使用;[openconnect](https://www.infradead.org/openconnect/index.html) 可以用(不可用了): 31 | ``` 32 | sudo /usr/local/sbin/openconnect --protocol=nc https://sslvpn.tsinghua.edu.cn 33 | ``` 34 | 35 | 使用时,地址是 https://sslvpn.tsinghua.edu.cn, 用户名和密码与 info 相同。 36 | 以下是 Mac 版本的使用截图 37 |  38 | 39 | 40 | 41 | ## GFW 42 | See [How to break through GFW](http://10.8.6.22/wiki/index.php/Guild_gfw) 43 | 44 | -------------------------------------------------------------------------------- /website-config/index.php: -------------------------------------------------------------------------------- 1 | 64 | 65 |