Release: 1.5. This document was generated on 2015-05-27.
Copyright © 2015 init.at informationstechnologie GmbH
NESTOR® is a registered trademark of init.at informationstechnologie GmbH.
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.
Red Hat®, Red Hat Enterprise Linux®, Fedora® and RHCE® are trademarks of Red Hat, Inc., registered in the United States and other countries.
Ubuntu® and Canonical® are registered trademarks of Canonical Ltd.
Debian® is a registered trademark of Software in the Public Interest, Inc.
SUSE® is a registered trademark of Novell, Inc.
openSUSE® is a registered trademark of the openSUSE community.
All trademarks are the property of their respective owners.
Please write to <nestor@init.at>
to contact our developer and support team.
Abstract
This document is the official documentation for NESTOR®. It explains general concepts of the software, gives an overview of its components and walks you through various installation and administration tasks; the appendix documents and explains the stable part of the API.
Before you dive deep into the documentation, we think it is fair to let you know whether you will get the right information out of this document or not. This documentation is intended for system administrators who want to get into monitoring and cluster management. It is also intended for users who only operate the software but do not do any configuration.
The handbook also provides some background information about Linux® commands in general, e.g. in the context of package installation.
It does not deal with general monitoring topics or the basic principles of monitoring or cluster management. If you wish to learn about those, please save your time and look for a more suitable document on the web or in your local specialised bookstore.
The list below shows which prerequisites users and administrators should fulfil to get into monitoring or cluster management with software by init.at:
Experience with LINUX® in general
Experience with the LINUX® command line, e.g. bash, zsh or others
Experience with a standard HTML browser
Experience with network settings
For easier document handling you will notice small icons in some places, with the following meanings:
Table 1.1. Symbol table

Icon | Meaning
---|---
(icon) | Link inside the documentation
(icon) | Mailto link
(icon) | Internet HTTP link
(icon) | Link to the glossary
(icon) | Marks a very important statement
(icon) | Moving the mouse to the menu with the given name
(icon) | Work on this content is in progress
(icon) | Left mouse click
(icon) | Right mouse click
(icon) | Double mouse click
The purpose of NESTOR® is HPC cluster management. It allows administrators to set up and manage a huge number of nodes in a cluster, or even more than one cluster at once.
Especially in cooperation with monitoring, it offers both management and monitoring of your cluster, two essential parts every HPC admin sooner or later has to think about.
Even though the main task of NESTOR® is to ease configuration and administration of Icinga, there are some special and unique features that Icinga and other software solutions do not provide.
The following list contains the exclusive components that make our software unique on the software market:
Peering - Possibility to connect monitored devices to peers. Displays the whole network topology of connected devices.

Central database - One central storage for all data like configs, settings, logging data, user data and much more.
In this section you will find out which technical requirements NESTOR® has.
Like every other software, NESTOR® has certain system requirements. Because NESTOR® is open source software, anybody with enough programming knowledge could port it to other open source systems.
The good news: You don't have to port anything if you already use one of the following LINUX distributions:
Debian
Ubuntu
CentOS
openSUSE
SLES
For exact versions please take a look into Installation Chapter.
Monitoring configurations are stored in a database for faster access and therefore faster reaction time, as well as more flexible administration of data.
NESTOR® uses Django as its database interface, so every database which is compatible with Django can be used. The recommended database is PostgreSQL (mostly due to license issues). The MySQL database is not supported any more.
Software packages are available for the following operating systems:
Debian Squeeze (6.x)
Debian Wheezy (7.x)
Ubuntu 12.04
CentOS 6.5
openSuSE (12.1, 12.3, 13.1)
SLES 11 (SP1, SP2, SP3)
It may very well be possible to get the software up and running on other platforms, but this kind of setup is neither tested nor supported.
There is no public access to our repository directories; therefore you first have to contact us to get a valid LOGINNAME and PASSWORD.
After receiving your access data you are able to use the repositories mentioned below.
To install the software on your operating system there are two ways to go.
You can either install the software automatically by downloading and running our install_icsw.py install script, or manually by adding the repositories listed below with your individual access data and using your package manager as usual.
The automatic installation via script is recommended because it is very convenient and handles most installation scenarios.
In order to use the install script, first you have to contact us to get your individual access data.
Download the script named install_icsw.py
from our download portal.
As user root run the script with your repository access data as follows:
install_icsw.py [-u USERNAME] [-p PASSWORD] [-n CLUSTERNAME]
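For example, with hypothetical access data and cluster name, a call could look like this:

install_icsw.py -u mylogin -p mypassword -n testcluster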
The script does the following:
Determine which operating system is running
Add the necessary repositories with the required access data to your operating system's repository directory
Refresh your package cache
Install the software
Automatic integration of valid license files into the system
After receiving your individual access data, do the following steps to install the software manually:
Add suitable repositories for your operating system
Refresh your package cache with your OS package manager
Install the software (for details see section installation)
Integrate your received or downloaded license files
There are two main repositories you have to deal with to install the software.
Repositories are available for the stable (2.5) and the master (devel) version of the above mentioned operating systems. The operating system running on your hardware and the version of the software you want determine which repository configuration you must use for your package manager.
Repository for the latest releases of the init cluster software.
The current development version, containing the newest functions and modules. Very fast update and change cycle due to active development. Sometimes a bug could slip in, but usually it works fine. From time to time it is merged into stable.
Repository for stable releases of the init cluster software.
The current stable version for productive environments. Most features and functions are included and there are no known bugs.
Based on the operating system, repository and desired software version mentioned above, the resulting repositories can be added.
There are two different ways to add new repositories for monitoring software by init.at to the operating system. They can be added all at once in one central file or in a repository directory. For each operating system there are special repository directories.
Table 3.1. For Debian based systems

Operating system | Repository directory
---|---
Debian wheezy | /etc/apt/sources.list.d/
Debian squeeze | /etc/apt/sources.list.d/
Ubuntu 12.04 | /etc/apt/sources.list.d/
Below you can see some examples of sources.list content. These are the lines you must add to your /etc/apt/sources.list. The relevant part for deb based package managers looks like this for the devel version on wheezy:
deb http://LOGINNAME:PASSWORD@www.initat.org/cluster/DEBs/debian_wheezy/icsw-devel wheezy main
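For the stable branch (icsw-2.5) the line presumably follows the same pattern; please verify the exact path against the information you received with your access data:

deb http://LOGINNAME:PASSWORD@www.initat.org/cluster/DEBs/debian_wheezy/icsw-2.5 wheezy main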
For Ubuntu 12.04 you can also add repositories either in one file, /etc/apt/sources.list, or in the repository directory /etc/apt/sources.list.d.
Debian and Ubuntu use a different package manager than CentOS, openSUSE or SLES. For that reason, sources.list does not exist on rpm based operating systems. Instead there are several files for repository management, not only one. All relevant repository files stay in the directory /etc/zypp/repos.d/.
[cluster_devel_remote]
name=cluster_devel_remote
enabled=1
autorefresh=0
baseurl=http://LOGINNAME:PASSWORD@www.initat.org/cluster/RPMs/suse_13.1/icsw-devel
type=rpm-md

[cluster_stable_remote]
name=cluster_stable_remote
enabled=1
autorefresh=0
baseurl=http://LOGINNAME:PASSWORD@www.initat.org/cluster/RPMs/suse_13.1/icsw-2.5
type=rpm-md
Alternatively, it is possible to download repository files directly from the internet instead of editing files manually. There are two URLs you can get repositories from.
Don't forget to request access data for the repository directories, otherwise you cannot access them.
The repository directory is /etc/yum.repos.d/. Place your desired *.repo files inside this directory, do a yum check-update and you are ready to install the software.
[initat_cluster]
autorefresh=1
enabled=1
type=rpm-md
name=initat_cluster
baseurl=http://LOGINNAME:PASSWORD@www.initat.org/cluster/RPMs/rhel_6.2/icsw-devel
Before continuing with the server installation it is worth saying something about the database, because it is one of the most important parts of the software. It contains all settings, configurations, users and much more in one single database. This is the reason why it is easy to migrate or back up data, and why it lowers the monitoring effort in comparison to a plain Icinga installation.
After a basic installation of the server, normally only an SQLite database exists. To start with monitoring this is completely sufficient, but due to the limitations of SQLite we recommend switching to a full database server such as PostgreSQL.
So before installing the software, two scenarios are possible:

An SQL database already exists

An SQL database does not exist

The server can handle both states: in case of an existing database we want the server to run some migrations for us, in the other case we want to do an initial database setup.
icsw setup runs in an interactive mode and is responsible for a couple of basic settings:
Creates suitable database schemata
Creates an administrator account and an initial password for first login into the web front-end
Creates the database config file /etc/sysconfig/cluster/db.cf
Install postgresql-server and python-modules-psycopg2 before running the setup. The script then asks you for the database connection parameters.
Either run the following command to migrate an already existing database:

icsw setup --ignore-existing

or run the following command to create a completely new database:

icsw setup
After running icsw setup, the script expects some input from the admin:
possible choices for DB engine: psql, sqlite
DB engine (psql) :
DB host (localhost) :
DB user (cdbuser) :
DB name (cdbase) :
DB passwd (bTMZPUYMiR) :
DB port (5432) :
Accept the suggested defaults with the ENTER key or insert your own data. At this point you need a working PostgreSQL installation so that the software can connect to the database.
In case something goes wrong, the script displays possible steps that have to be done.
Most conflicts at this point are wrong permissions on the database or, more generally, a wrong database installation and setup. If the steps below cannot solve the problem, please take a look into your database manual or ask the database administrator how to set up the database with correct permissions.
Log in to your database (commonly done with su postgres followed by a simple psql) and type the following commands to create the right user and a new, empty database:
CREATE USER cdbuser LOGIN NOCREATEDB UNENCRYPTED PASSWORD 'my_password_123';
CREATE DATABASE cdbase OWNER cdbuser;
This creates a new database user with the name cdbuser, the desired password my_password_123 and a new, empty database called cdbase.
After successful creation of the database user and the database, we have to edit /var/lib/pgsql/data/pg_hba.conf (OpenSUSE) to set up the correct permissions for the database. Comment out other lines so that only the three below remain.
local   cdbase   cdbuser                  md5
host    cdbase   cdbuser   127.0.0.1/32   md5
host    cdbase   cdbuser   ::1/128        md5
To be on the safe side it is recommended to try to log in to the database manually. If you are able to connect manually, the script will most likely be able to as well.
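A manual connection test with the psql client, using the default values from the setup dialog above, could look like this:

psql -h localhost -p 5432 -U cdbuser cdbase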
If everything goes well we get a success message:
dsn is 'dbname=cdbase user=cdbuser host=localhost password=bTMZPUYMiR port=5432' connection successful
Once the script has connected successfully to the database, it runs the migration for you. Finally you have an installation with a PostgreSQL database.
The database access data for the server is stored in /etc/sysconfig/cluster/db.cf, created by icsw setup; a sample file is provided under /etc/sysconfig/cluster/db.cf.sample. If you want to connect via a local socket, leave DB_HOST empty. Either fill in the user and database information manually or run icsw setup for assisted config file creation.
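The file might then contain entries like the following (apart from DB_HOST, the key names are assumptions derived from the setup dialog above; compare with db.cf.sample on your system):

DB_ENGINE=psql
DB_HOST=localhost
DB_USER=cdbuser
DB_NAME=cdbase
DB_PASSWD=bTMZPUYMiR
DB_PORT=5432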
Every daemon and process of the server uses this file to gain access to the database. The file has to be readable for the following system entities:
The user of the uwsgi-processes (wwwrun on SUSE systems)
The system group idg
A typical set of permissions would look like this:
-rw-r----- 1 wwwrun idg 156 May 7 2013 /etc/sysconfig/cluster/db.cf
Although the software does periodic backups, it can be necessary to do a database backup by hand. For PostgreSQL there is a special dump command:
pg_dump -Fc -U cdbuser cdbase > DATABASE_BACKUP_NAME
This single line is enough to copy your whole database to a file.
A few more actions are needed to restore a database backup. Before we are able to restore the backup cleanly, we have to drop (delete) all database contents. After the database contents are dropped we can import the data into the existing, empty database.
Delete database contents:
su postgres -c "psql -c \"DROP DATABASE cdbase; \""
Restore database:
pg_restore -c -C -F c DATABASE_BACKUP_NAME | psql -U postgres
The Webfrontend for your server can be accessed via
http://SERVERNAME/cluster or by http://IP_ADDRESS:80/cluster/
In case you run the setup_noctua.sh script manually, the port number will be rewritten to 18080. You can then access the web front-end via this URL:

http://SERVERNAME:18080/cluster or http://IP_ADDRESS:18080/cluster

You can also use your server's localhost alias for accessing the front-end:

http://localhost:18080/cluster
After adding the desired repositories for your operating system, it is time to install the software packages themselves and configure them. There are three main packages you have to install to get a basic server running:
icsw-server
icsw-client
icsw-dependencies
These packages contain all necessary services, binaries, libraries and dependencies for a clean and proper installation.
If you also want to access the server via the web GUI, you additionally need to install the nginx-init package and run the nginx-init HTTP server.
For SUSE operating systems an installation command looks like the following:
zypper ref; zypper install icsw-server icsw-client icsw-dependencies
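On the other supported distributions the corresponding commands should look similar, assuming the same package names:

apt-get update; apt-get install icsw-server icsw-client icsw-dependencies   (Debian/Ubuntu)
yum makecache; yum install icsw-server icsw-client icsw-dependencies        (CentOS)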
While accessing the repositories you will be prompted for a valid username and password. Type in the access data you received on the terminal and continue the installation.
To guarantee the maximum possible flexibility, we decided to involve the system administrator in the installation procedure. After the installation you will get a note on stdout telling you how to create a new database configuration with the icsw setup command.
As an alternative to the usual installation of binary packages via repositories and the operating system package manager (zypper, apt-get or yum), you can use a virtual machine with a ready-to-go installation. We distribute two popular VM image file formats, running with libvirt/qemu and VMware. For information on how to set up your VM environment, please take a look at the corresponding documentation of your VM vendor.
The following steps have to be done to run a KVM libvirt/qemu virtual machine with preinstalled NESTOR®:
Download the KVM/libvirt image and move it into the right image directory, e.g. /usr/local/share/images/.
Copy an existing *.xml or create a new one
Edit your new *.xml file
Define your new virtual machine
Finally, if your machine is set up correctly, all you have to do is start the virtual machine (see the sketch below) and have fun with monitoring.
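With libvirt, defining and starting the machine can be done with virsh; the XML file path and domain name below are only placeholders:

virsh define /etc/libvirt/qemu/nestor.xml
virsh start nestor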
From time to time, new software packages are built and can be downloaded. Especially for the master development branch there are frequent updates which can be applied to get new functions or features or simply to fix some bugs. The update period for master is about every second day.
The stable branch gets less frequent updates than the master version. Because it is the stable branch, most updates for stable concern security issues and bugfixes. Really big updates are done only when the master is stable enough for a productive environment. The update period is about 4-6 months.
The update procedure is very convenient; it is based on the system's integrated package manager, for example zypper in openSUSE or apt-get in Debian.
Commands for updating/upgrading all installed software via the package manager are:
Refresh repositories and do a whole system upgrade in openSUSE (see the sketch below)

Refresh repositories and do a whole system upgrade in Debian (see the sketch below)
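Assuming the standard package manager syntax, these commands presumably look like this:

zypper refresh; zypper dist-upgrade      (openSUSE)
apt-get update; apt-get dist-upgrade     (Debian)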
Of course, you are also able to update only single packages, for example the package handbook-init. The command looks similar to the command used to update all packages:
Refresh repositories and do a single package upgrade, in this case of the package handbook-init, in openSUSE (see the sketch below)

Refresh repositories and do a single package upgrade, in this case of the package handbook-init, in Debian (see the sketch below)
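Again assuming the standard package manager syntax:

zypper refresh; zypper update handbook-init                       (openSUSE)
apt-get update; apt-get install --only-upgrade handbook-init      (Debian)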
For other distributions please look into your distributor's package management documentation.
After a successful installation of NESTOR®, first of all you have to check whether all necessary services are running. Type the following command into the terminal:
icsw service status
Get more information about possible flags with: icsw --help
One of the most common flags is -v. It additionally shows the version number of each package, as shown below.
Another common flag is -a. With this flag, the script shows additional information (see the example calls after this list):
Thread info
pids
runlevels
Memory
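Typical calls combining the status command with these flags could look like this (the exact placement of the flags may differ; check icsw --help):

icsw service status
icsw service status -v
icsw service status -a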
Take a look into our command reference to learn more about the icsw command.
The main command which can be used to manage the cluster components is the icsw command. The following table presents frequently used examples of the icsw tool:
Table 3.4. icsw - command overview

icsw-command | Functionality
---|---
icsw service | Show status or control icsw services.
 | Show the state overview or enable/disable services. These are the service states which are managed by the meta-server.
icsw logwatch | The logwatch command is intended to show logging messages on stdout. The -f option is used to append data while the file grows. The --system-filter option limits the logging output to a specific service. Used without any arguments it displays logging messages for all running services.
icsw license | With the icsw license command administrators are able to lock, unlock or show locked licenses and devices in the license system. You can also show your cluster ID or register a cluster. For more information about the lock and unlock commands, take a look at the section called "Lock command to fall below the parameter limitation".
icsw setup | Create the database and perform the initial setup. There are many options and arguments for this command; please take a look into the command reference.
Short overview about icsw commands
Table 3.5. icsw - service

icsw service subcommand | Functionality
---|---
status [service-name] | Displays the status of the server and all of its services. With a "service-name" it displays only the status of the given service.
start [service-name] | Starts the service "service-name". Without a service name the command initiates the start of all cluster components. Warning: if the service is disabled in the meta-server, the meta-server will stop the service again within a few minutes.
stop [service-name] | Stops the service "service-name" or all services of the cluster instance.
restart [service-name] | Restarts the service with the given "service-name" or, if no service name is used, restarts all services of the cluster instance.
debug service-name | Using the debug option is like starting a service in the foreground. In contrast to services started as a daemon (background), the service probably displays some stdout messages.
 | Provides the state/status info about all services, or about a specific service if a "service-name" is given.
enable service-name | Enables the service in the meta-server settings (database). That means the meta-server is responsible for keeping the service running. If the service is not active, the meta-server will start it again within a few minutes.
disable service-name | Disables the service in the meta-server settings (database). From then on the meta-server ensures that the service is not running.

The icsw service command
All core features, also called basic features, of the software are licensed under an open source license and are free of charge, but some enterprise features of NESTOR® are not. There are licenses for these enterprise features, and you have to buy them in order to get the features working.
This section guides you through the process of getting licenses, understanding the concept of license limitation and managing licenses.
If you have never applied licenses to your server, you are running an unlicensed software version. To remind you of this fact you will see notification messages on some pages, for example on the login page and also on the dashboard after successful login.
You can also check your license state by navigating to Session → License.
There you will see three different drop down windows:
Your licenses for this cluster
License packages
Upload license file
Shipped licenses are keyfiles containing information about licensed features, license period and license parameters. Each license is associated with one specific cluster ID. A keyfile can contain one or more license packages assigned to one or more cluster IDs.
The license package dropdown window shows you a content overview of the uploaded keyfiles.
Keyfiles containing more than one single license package will be displayed in separate tabs inside of the License package dropdown window.
Apart from the Upload license file dropdown, you cannot expand any dropdown menu by left clicking on the arrow beside it. Use the Choose File button to select your valid license keyfile. After selecting your keyfile its name will be displayed on the right side of the button:
Push the upload button to integrate your valid license keyfile into the server and activate the acquired enterprise features.
After uploading your valid license keyfile to the server, your license overview is updated immediately and shows your purchased licenses.
Generally, if you want to see your license overview, navigate to Session → License to display the license status.
License name
Description of the specific license
Parameter value is the limitation of licenses in context of
Device
Service
User
External license
Used licenses and amount will be displayed as info window
Valid in future, license will be valid from point of time in future
Valid, license is active and valid until displayed date
Grace, license is in grace time. It is still active until the grace timeperiod of 2 weeks is over.
Expired, license is out of grace time period.
Name of used license package
There are four different periods or states of licenses. Depending on the period or state, a licensed feature is working or not.
The second factor which decides whether a license is valid or not is the parameter limitation. Depending on the purchased parameter amount and the used parameters, the license can be valid, in grace time or expired.
Licenses can be violated by exceeding the license time period or by exceeding one of the license parameter limitations. In case of exceeding a license parameter limitation, a grace time period also starts, which is totally independent of the license time period grace time.
For a small violation of the parameter limitation there is a lock command to get back within the parameter limit.
In Figure 3.10, "License states", marked with (1), you can see a license violation caused by exceeding the parameter limitation. Because of the violated license the grace time period starts immediately. In your license overview you will notice a warning message for this violated license.
Now, you have two options.
Purchase an extended license for that feature to increase the parameter limitation
Lock the license parameter to get back below parameter limitation
There is a special command for the second option:
icsw license lock -d DEVICE -l LICENSE
After locking licenses [illustrated at (2) of Figure 3.10, "License states"], the used license amount falls below the parameter limit and the license is valid again.
If you want to know whether any licenses are locked, use the following command:
icsw license show_locks
It displays all locked licenses and corresponding devices.
This section explains the core concepts behind NESTOR®. It gives a top level overview of its components and capabilities.
In the following, some separate components are listed, together with the part of the software they belong to.
As relatively complex software, NESTOR® uses some well known frameworks and technologies we definitely have to mention.
Used frameworks

Django - web application framework written in Python.

Open source framework developed by Google™.

CSS framework developed by Twitter™.
Used software solutions

Icinga - industry standard software for monitoring devices.

RRDtool - the open source industry standard, high performance data logging and graphing system for time series data. RRDtool can be easily integrated in shell scripts, perl, python, ruby, lua or tcl applications.

nginx - powerful, small and fast web server.

SGE - Son of Grid Engine.
Parts which both software packages need to work properly.
Components
The meta-server daemon is responsible for restarting services that might have crashed or were otherwise killed. This functionality should be taken over by systemd. /var/lib/meta_server contains the relevant information about which services should be running.
Creates the structure needed for receiving logs via rsyslog or syslog-ng.
Monitoring, in the context of hardware and software, is performed to get information about specified systems.
Components
Responsible for configuration of icinga. Interacts with database.
Responsible for writing config files and the general coordination of the cluster. Listens on port TCP/8004. cluster-server.py is a general purpose server that handles various tasks like writing /etc/hosts, generating a valid DHCP configuration, configuring the BIND nameserver, feeding LDAP and/or YP servers, and more.
Client part of the host-monitoring collserver.py.
Frontend program to talk to collrelay.
NESTOR® consists of many different parts and services. Each of these services performs a specific set of tasks in the cluster. Most of these services are network enabled and listen on a specific port for commands. The following list tries to give an overview of the most important parts.
The general idea of NESTOR® is simple. Create an image, a kernel and a set of configuration files to be used by your nodes and distribute them to the nodes.
The distribution is done via PXE. NESTOR® enables you to describe the node specific configurations in Python.
Components
Generates the files for the Clusternodes (based on the config stored in the Database) to make the nodes distinguishable.
Creates the tftpboot/ethernet structure and monitors the installation progress of the nodes. Listens on ports TCP/8000 and TCP/8001. Mother provides access to IPMI as well.
Provides repositories available for installation by the package-client. Listens on port TCP/8007.
Installs required software by using the locally available package management commands: zypper, yum or apt-get.
A small program written in C that transmits node status messages to the cluster-server. hoststatus is written in C to be easily includable in the initial ramdisk. It listens on port TCP/2002. hoststatus is in the package child.
Does log rotation and deletes logs older than a specified time range.
Provides integration of NESTOR® with SGE. The commands sns and sjs rely on it.
Daemon for automatic configuration of devices.
The database where all the configuration data of a NESTOR® installation is stored is generally referred to as the "clusterdatabase".
Throughout this document we might refer to:
Server side scripts are generally services or scripts that run on the cluster server. Most of them need database connectivity to function properly.
Node side scripts, on the other hand, are daemons or scripts that generally run on a node of the cluster. Node side scripts don't require access to the cluster database.
Because NESTOR® consists of many different parts working together, it is not obligatory to run every service at once. Services like package-install or discovery-server are not essential to operate monitoring or cluster management.
For that reason, the default installation of NESTOR® is rudimentary. Special services and functions are not activated by default; activating certain services requires the user to push some buttons or move some levers.
Two spots where you can activate services are:
cluster server information
Device Config
Most configuration administrators have to do in NESTOR® is accessible via a standard HTML-compatible browser like Mozilla Firefox™ or Google Chrome™. Once NESTOR® is installed and all required services are running, all you have to do is connect to the server via the browser.
Type http://SERVER-IP-ADDRESS:80/cluster/ or http://SERVERNAME/cluster/ into your browser's address bar to connect to the server. If you connect to the server for the first time you will be redirected to the account info page.
The NESTOR® web front-end offers you a very clear view. There are three areas you will work with:
Menu area (1)
Sidebar device tree (2)
Main area (3)
In the menu area you'll find submenus, buttons, date, time and user section.
Submenus
Base
Users
Monitoring
Session
NESTOR® offers some additional menus:
RMS - Resource management System
Cluster
Buttons
cluster server information
show cluster handbook as pdf
show index
number of background jobs
In the tree area you can find your device group tree and the associated devices. On top there is a search field and two buttons.
Searchfield
use selection Button (green with arrow)
clear selection Button (red with circle)
Group
FQDN (Full Qualified Domain Name)
Category
Alternatively, there is a one-button selection method.
All configuration and input takes place in the main area. According to the selected or preselected devices and settings, the corresponding page appears.
Figure 5.8. Possible main area
One possible view of the main area after selecting some devices in "device network"
The cluster server information button shows three overview tabs: one tab with information about defined cluster roles, one tab with information about server results and one with information about the server itself.
Inside this upper tab, there is a table showing the name, reachable IP and defined cost of each of them. This tab is only for displaying information.
Each of the defined roles provides special functionality to the server.
Also a tab only for displaying general information.
valid
name
Result
max Memory
total Memory
This is the only tab inside the server information which allows you to control something. You are able to control services just as you do on the command line.
The following information will be displayed:
Server information
Name of service
Type of service {node, server, system}
Kind of Check
Status if service is installed or not
Version number of the installed service
Number of processes started
Displays memory usage as a number and as a status bar
Button to apply action to the services
The workflow inside the web front-end follows a special pattern. This workflow repeats for specific actions, and therefore it is worth mentioning and learning it. We divide this section into four subsections, listed below, to show the differences between them. Of course it is possible to get similar results in different ways, but each way has advantages and disadvantages or is more or less efficient.
There are also software regions like Nodeboot which are only accessible by one single way.
All preselections are done in the sidebar device tree, Section 5.2.2, "Sidebar device tree (2)". Select groups or devices and click the desired submenu in the top menu to access it.
Submenu method is recommended for working with multiple devices.
The following areas can be accessed by the submenu method:
Device Tree
Device Variable
Device Network
Device Configuration
Nodeboot
Package Install
Device Settings
Monitoring Overview
Livestatus
The home button method is more general. Again, preselection of devices or groups takes place in the sidebar device tree, but afterwards, instead of choosing a submenu, we click on the home button. As a result we get some useful tabs for each device.
The home button method is recommended for working with multiple devices.
The following areas can be accessed by the home button method:
By clicking directly on the device name in the sidebar device tree we get similar overview tabs as with the home button method, but only for one single device.
The direct method is recommended for working with single devices.
For graphing, the requirement is to first select devices, then select the wanted graphs and finally draw these graphs for the selected devices. The graph preselection remains unaffected if you change the device preselection. After changing the device preselection you must push the Apply button. The same is true for drawing graphs.
We implemented three little helper functions to ease the handling of large and complex tables inside the web front-end. These helper functions are located in several places/pages, e.g. the sidebar device tree, the device tree or the device configurations, mostly at the top of the specific place/page.
The most common auxiliary function is the filter input field on top of device trees or configuration tables. The simplest usage of this filter field is to insert some text string or number to filter for. If you do so, for example in the sidebar device tree, only matching devices will be selected.
Table 6.1. Examples for the filter function in the sidebar device tree

Regular expression match character | Matching description | Example | Result
---|---|---|---
^ | Starting position of line | ^node | Selects all devices whose device name begins with the string node.
[0-9] | Range of numbers | node[0-5] | Selects all devices called node immediately followed by a number between 0 and 5.
$ | End of line | [0-9]$ | Selects all devices whose name ends with a number between 0 and 9.
\d | Digit | \d$ | Selects all devices whose name ends with a digit.
Regular expressions for input fields.
Further information on regular expression filters can be found on the web by searching for javascript regex.
Figure 6.4. Input field filter for device configurations: displays all device configuration entries beginning with base, even if they are not selected.
Another auxiliary function for handling tables are the show and hide buttons on top of tables. With these buttons you can easily show or hide specific table columns.
To ease the display of longer lists and to avoid too much page scrolling, there is also a simple pagination function built into the software. With pagination we are able to limit the output on a page to a specific number of entries. Only the chosen number of entries will be displayed; the other entries, if there are any, are divided over separate pages which can be accessed via the page buttons.
Last but not least we'd like to mention the column sort function. It can also be very useful to display only the desired data.
Not all columns provide this sort function, but most of them do. The function is toggled by clicking on the column name. If the function is activated, there is a small triangle to the left of the column name, pointing with its tip either upwards for ascending or downwards for descending sorting. If no triangle is visible, the sorting function is deactivated.
Sorting method is:
First numerical
Second alphabetical
Figure 6.7. Sort column: activated ascending sorting, marked with a small black triangle pointing upwards.
Sometimes it can be necessary to undo previously applied changes, for example if you have a typo in a script, variable or wherever, or if someone else has applied some changes and you want to see the state before and after those changes.
Newest data and changes will be attached on top.
Our developers created a reversion function not only to display what changes were done, but also to go back in the change history to a desired state and drop changes which were done afterwards. History reversion can be found in the top menu at Cluster → History.
The reversion function is based upon the central database (default is PostgreSQL), so in principle every change written into the database can be reverted. Normally there is a lot of different data stored in the database to ensure every component works fine, but it makes no sense to provide the reversion feature for all of this collected data. For normal users and administrators it is completely sufficient to revert changes which were done via the web front-end.
For example, three new users were added in User Management. As shown below, the system history lists all relevant database entries for each of them.
Now let's suppose there was a typo in one of these names, e.g. Lucy was changed to Luci. If we take a look into the system history under User, we get exactly this change displayed in a diff-like style.
If you are not satisfied with only displaying changes but really want to go back to an earlier version, there is the revert to this version button.
For example, if someone changed directory paths in a script located at Base → Configurations and you would like to display those changes, simply navigate to the History tab of the Modification window to get a list containing all changes applied up to now.
The next step is to mark your desired version in the list. Now you can either apply the reversion by clicking on the Modify button, or just switch to the script editor to check how the script looks after reverting.
Table 6.2. Color code for reversion

Color | Meaning
---|---
green | inserted character
red | deleted character
black | unchanged character
We can see all changes at a glance in the figure shown above.
After the installation of NESTOR®, the user admin and the group admingrp already exist. This is the user whose password you have to change after the first login into your freshly installed system.
The user admin has all possible rights and permissions to add, modify and delete devices/groups etc. The user admin is also able to reconfigure the database and, of course, able to add or delete users.
If you want to set restrictions for some users or groups, for example for external staff, you have to create these new restricted users/groups with the following buttons:
To add a new group in user management, click the "create group" button, fill out the form and confirm your input by clicking the "Create" button.
The form is self-explanatory, but some input should be mentioned anyway:
Internal group ID
Set basic permissions to get access to selected devicegroup
Another, extended form can be shown by clicking the newly created group in the user/group tree:
A more complex permission system appears.
A similar structure and procedure applies to creating a new user.
Here, too, some fields should be mentioned:
Internal user ID
Is the superior group
Operating system group
Owns all rights and permissions, just like the admin does
The permission system is divided into several parts which cover certain functions. Some permissions depend on other permissions; in other words, they are chained permissions. The more permissions users get, the more powerfully they can act. The user "admin" or "superuser" is the most powerful user. Admin has all possible rights and permissions.
Below is a list with permissions and what their functions are.
background_job
Shows additional menu button:
Session Background Job Info
config
Shows additional menu button:
Base
Configurations
device
Shows the graphs tab for selected devices. Depends on the possibility to choose devices (access all devices)

Shows the disk tab for selected devices. Depends on the possibility to choose devices (access all devices)

Change basic settings (General) for selected devices. Depends on the possibility to choose devices (access all devices)
Shows new top-menu named Cluster
Shows the Config tab for selected devices. Depends on the possibility to choose devices (access all devices)

Shows the Category tab for selected devices. Depends on the possibility to choose devices (access all devices)
Shows new top-menu:
Base
Device connections
Shows the Location tab for selected devices. Depends on the possibility to choose devices (access all devices)
Shows 3 new tabs for selected devices:
Livestatus
Monconfig
MonHint
Shows new top menu content:
Base
Device network. Depends on the possibility to choose devices (access all devices)
Shows the vars tab for selected devices and a new top menu entry:

Base

Device variables.

Depends on the possibility to choose devices (access all devices)
The main permission to show devices. Most of the above permissions depend on it. Shows existing devices in the device tree on the left.
group
...
image
...
kernel
...
mon_check_command
Shows new top menu content under:
Monitoring
Basic Setup / Build Info
network
...
...
package
Shows new top menu under:
Cluster
Package install. Additional software packages can be chosen and installed via this menu button.
partition_fs
...
user
Shows new top menu content under:
Session
Admin
...
Shows new top menu content under:
Base
Category tree
Shows 2 new top menu entries under:
Base
Create new device / Device tree
Shows new top menu content under:
Base
Domain name tree
...
The permission level defines what can be done by users. In combination with the permission itself, administrators are more flexible in assigning rights and permissions to users or groups.
Below are 4 main permission levels which can be assigned.
Permits the user to read data. User can't change, create or delete data.
Permits the user to change existing data. Includes read-only level.
Permits user to change and create new data. Deletion is not possible.
All Permissions are granted.
Installation of packages via the web front-end is another helpful feature provided by NESTOR®. It allows you to install software packages on one or many systems via the web front-end, without the need to log in on each local machine and install packages manually via terminal commands.
Your NESTOR® operates as central package installation entity, stores its repositories in the database and can also distribute its repositories to connected nodes.
It is a huge relief for less experienced users to do software installations with a few clicks instead of typing long and cryptic terminal commands.
In this section you can learn how to set up this feature, how to configure it and how to use it.
Two important services for this function are: package-client, package-server
Before you are able to install packages via the web front-end, you have to configure your machines appropriately. Not only the server-side configuration but also the client-side configuration is essential to make the installation and distribution of packages work.
In the top menu, go to Session → Settings. Enable the button for package installation (package) and reload the page.
Click your server device in the device tree on the left side, go into the Config tab, and activate the package_server config.
Start the package-server by navigating to cluster server information and opening the lower dropdown menu called One Server checked with a click on the arrow. Push the "Action" button for package-server and choose start if it is not already running.
So far, your server is ready for package installation. The clients/nodes also have to be prepared for package installation.
The first step in setting up package installation is to enter your server's (package-server) IP address (or hostname) in /etc/packageserver on the client machine.
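Assuming your package-server is reachable at 192.168.1.1 (a placeholder address), this could be done like this:

echo 192.168.1.1 > /etc/packageserver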
Make sure the package-client service is installed and running on the nodes/clients. To check the status of package-client use the icsw service status command. The status of package-client should be "running".
Alternatively, use rcpackage-client status to display the service status.
The main configuration file for package-server is /etc/sysconfig/package-server. The content should be self-explanatory and looks like this:
Table 8.1. package-server config options

option | default value | description
---|---|---
PID_NAME= | package-client/package-client | Name of PID files
KILL_RUNNING= | True |
USER= | idpacks | Username
GROUP= | idg | Group name
GROUPS= | ['idg'] |
LOG_DESTINATION= | uds:/var/lib/logging-server/py_log_zmq | Destination of log files
LOG_NAME= | package-server | Name of log file
SERVER_PUB_PORT= | 8007 | Server port for communication with client
NODE_PORT= | 2003 | Client port for communication with server
DELETE_MISSING_REPOS= | False | Capability to delete missing repos
The main configuration file for package-client is /etc/sysconfig/package-client. Its content should be self-explanatory and looks like this:
Table 8.2. package-client config options

option | default value | description
---|---|---
PID_NAME= | package-client/package-client | Name of PID files
KILL_RUNNING= | True |
COM_PORT= | 2003 | Client port for communication with server
SERVER_COM_PORT= | 8007 | Server port for communication with client
LOG_DESTINATION= | uds:/var/lib/logging-server/py_log_zmq | Destination of log files
LOG_NAME= | package-client | Name of log file
NICE_LEVEL= | 15 | Nice level the log daemon runs at
MODIFY_REPOS= | False | Capability to modify repositories
PACKAGE_SERVER_FILE= | /etc/packageserver |
PACKAGE_SERVER_ID_FILE= | /etc/packageserver_id |
There are two common ways to install additional packages.
Package installation with operating system package manager
Package installation with package upload in directory
Usually the first method is recommended for the standard installation of available packages. All software and packages your running system provides can be installed via "Package install". It starts your system package manager in the background (apt-get, yum, zypper) and installs the selected packages on the selected nodes.
In the top menu, go to Cluster → Package install.
In the Package_repositories tab [Figure 8.1, "Package repositories tab"], push the reload button to update your repositories.
Go to Package search tab [Figure 8.2, “ Package search tab ”] and search for the packages you want to install on the system.
If there are results, list all matching packages with the show results button. In the list that appears below, choose your desired package version by pushing one of the buttons on the right (take exact/take latest).
Go to the Install tab [Figure 8.3, "Package install tab"], select the devices the package should be installed on and push the "attach" button.
On top, a new button named action appears. Push the button, choose the "Target state" install and submit your settings. The package will be installed automatically on your selected nodes.
If your system does not provide some packages you really want to install, there is another way to go. In this special case you can either download fitting binary packages from external sources and place them in the right directory, or you can compile and build your own package from source code.
Upload your package into the upload directory on your server: /opt/cluster/system/packages/. Create the directory in case it does not exist.
Execute the update script update_repo.sh in /opt/cluster/system/packages/ to refresh your repositories.
Script content for Red Hat based systems:
#!/bin/bash
cd /opt/cluster/system/packages
createrepo .
yum clean all
yum makecache
Script content for Suse based systems:
#!/bin/bash
cd /opt/cluster/system/packages
createrepo .
zypper ar /opt/cluster/system/packages local_repository
zypper ref
Maybe you have to "Sync to clients"/"Clear caches" to get the new repositories on all nodes.
Now, if you search for the uploaded package you should get some results. To install uploaded packages, follow the same procedure as for installing packages from the system package manager, mentioned in Section 8.2.1, "Install packages using package manager".
Download the source files and extract them.
Compile your software as usual and install it (./configure; make; make install).
Once your package is installed, use make_package.py to create a new *.rpm package.
Run update_repo.sh to refresh your repositories.
Maybe you have to "Sync to clients"/"Clear caches" to get the new repositories on all nodes.
On top, a new button "action" appears. Push the button, choose the "Target state" install and submit your settings. The package will be installed automatically on your selected nodes.
To delete packages do the following steps:
In the top menu navigate to Cluster → Package install and choose the Install tab.
Select the packages and the nodes to delete them from.
Push the Action button and choose erase from Target state dropdown menu. To finish deletion click on the Submit button.
An essential aspect of NESTOR® is the job management system. The main reason for using clusters is higher computing power to calculate jobs. The calculation of data is split into pieces, and every node or slot can calculate each piece separately; this results in a higher calculation speed. The organisation of slots, cluster and job distribution is done by the SGE (Son of Grid Engine). SGE provides special commands and tools to control the jobs distributed to the nodes.
The RMS is the coupling between the SGE and our web front-end. With RMS enabled, you are able to manage jobs without using any SGE commands.
As mentioned before, the RMS is a powerful add-on for managing jobs on clusters. It consists of packages and services working together to provide management functions for submitted jobs.
Important parts of RMS are:
SGE part
SGE - Son of Grid Engine
Commandline tools like:
qdel
qstat
qacct
See the command reference or the manual page of sge_intro for a complete list of commands:

man sge_intro
init.at part
rms-server - Server between SGE and Webfrontend
Webfrontend
Commandline tools like:
sjs
sns
Both commands, sjs and sns, are links to /opt/cluster/bin/sgestat.py.
Environment variables for setting up RMS can be found under /etc/ (see the sketch after this list):
/etc/sge_cell
Name of SGE
/etc/sge_server
Hostname or IP address of sge server.
/etc/sge_root
Directory sge installs to.
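A sketch of possible contents for these three files (the values are examples only; the SGE root matches the /opt/sge62 directory mentioned in the build steps below):

echo default > /etc/sge_cell
echo sgeserver.example.com > /etc/sge_server
echo /opt/sge62 > /etc/sge_root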
To get RMS working it is not enough to only install the package; you must also edit some config files and build the SGE part manually. The step-by-step instructions below will help you install RMS and run the required services.
Even if it should be obvious: before you are able to install RMS, make sure you have already installed NESTOR® and its dependencies.
The RMS package comes with the installation of icsw-server.
Set the environment variables in /etc/sge_cell, /etc/sge_server and /etc/sge_root. Setting the environment variables must be done before compiling SGE!
Download the latest version of the SGE package (the latest version as of 2014-09-25 is 8.1.7) from https://arc.liv.ac.uk/trac/SGE
wget http://arc.liv.ac.uk/downloads/SGE/releases/8.1.7/sge-8.1.7.tar.gz
Extract the sge-8.1.7.tar.gz archive to /src/, change into the extracted directory and run our build script located at /opt/cluster/sge/build_sge6x.sh.
tar xzf sge-8.1.7.tar.gz
cd /src/source/
/opt/cluster/sge/build_sge6x.sh
If your system cannot compile and outputs some error messages, make sure you have already installed the necessary build tools and development packages. Depending on your operating system, package names and their number may differ.
Now the directories under /opt/sge62 exist and the service sge_qmaster is running.
Test if sge_qmaster is running:
ps aux | grep sge_qmaster
Set the $PATH variables by sourcing the script located at /etc/profile.d/batchsys.sh:

. /etc/profile.d/batchsys.sh
Run the following scripts:

/opt/cluster/sge/create_sge_links.py and /opt/cluster/sge/modify_sge_config.sh
The RMS overview provides four tabs, not only for displaying information but also for controlling jobs. There are a couple of green buttons at the bottom of the overview page to hide or unhide columns.
The first tab of the RMS overview displays the jobs currently running in the grid engine. You can get some background information like job IDs, owner, runtime or node list of each job. On the right side there is an action button to delete or force delete running jobs.
The second tab of the RMS overview displays the currently waiting jobs. These are jobs waiting in the SGE queue for execution. Among other information, it shows the "WaitTime", "Depends" and "LeftTime".
The third tab of the RMS overview displays done jobs and specific columns like "ExitStatus", "Failed" or "RunTime".
For direct usage of the SGE there are a couple of commands. The commands the SGE provides are:
qconf (queue configuration) allows the system administrator to add, delete, and modify the current Grid Engine configuration, including queue management, host management, complex management and user management.
qlogin initiates a telnet or similar login session with automatic selection of a suitable host.
qmake is a replacement for the standard Unix make facility. It extends make with an ability to distribute independent make steps across a cluster of suitable machines.
qmod allows the owner(s) of a queue to suspend and enable queues, e.g. all queues associated with his machine (all currently active processes in this queue are also signaled) or to suspend and enable jobs executing in the queues.
qmon provides a Motif command interface to all Grid Engine functions. The status of all, or a private selection of, the configured queues is displayed on-line by changing colors at corresponding queue icons.
qquota provides a status listing of all currently used resource quotas (see sge_resource_quota(5)).
qrsh can be used for various purposes, such as providing remote execution of interactive applications via Grid Engine comparable to the standard Unix facility rsh, to allow for the submission of batch jobs which, upon execution, support terminal I/O (standard/error output and standard input) and terminal control, to provide a batch job submission client which remains active until the job has finished, or to allow for the Grid Engine-controlled remote execution of the tasks of parallel jobs.
qselect prints a list of queue names corresponding to specified selection criteria. The output of qselect is usually fed into other Grid Engine commands to apply actions on a selected set of queues.
qsh opens an interactive shell (in an xterm(1)) on a low loaded host. Any kind of interactive job can be run in this shell.
qtcsh is a fully compatible replacement for the widely known and used Unix C-Shell (csh) derivative tcsh. It provides a command shell with the extension of transparently distributing the execution of designated applications to suitable and lightly loaded hosts via Grid Engine.
The common way to submit jobs to the cluster is to use the grid engine's "q" commands. Assuming that your cluster configuration is correct, running jobs on the cluster is as easy as running jobs on local machines.
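A minimal sketch of a job submission and follow-up with the standard SGE client commands (the job script name and options are purely illustrative):

qsub -N testjob -cwd -o testjob.out -e testjob.err myjob.sh
qstat
qdel JOB_ID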
The following steps have to be done to transfer jobs to the queue: COMING SOON
Virtual desktop is a technology to transfer display output from remote graphic cards to your local machine graphic card.
Users are often forced to work on remote machines because of computation power, license issues or simply geographical distance. In these cases, users usually have to start their remote desktop manually via the command line or similar tools.
With our virtual desktop technology there is no need to start anything manually. The back-end of the server takes care of sessions, ports, passwords etc., makes the relevant settings and saves them in the global database for you. Not only are the settings and configurations handled automatically by the back-end; in cooperation with the web front-end it also provides the display output.
That way you are able to access and work on remote machines via the web front-end on your favorite browser.
To activate the virtual desktop technology, first of all you have to define a Virtual Desktop session in User Management.
In the main menu on top of the page navigate to Users Overview and left-click on the admin user.
Figure 10.2. Virtual desktop session
Before using virtual desktops you have to define a session for it.
Virtual desktop settings
Please insert text here...
Protocol which will be used for the virtual desktop session
Port number of the connecting client - if set to "0", the port will be chosen randomly
Port number of the VNC server
Window manager to use on systems with more than one window manager
Preset of the virtual desktop size. It is the window size the virtual desktop will be displayed in.
Checkbox to make sure the server is always running
After at least one virtual desktop session is defined, the back-end takes control of the further process. Every 5 minutes it checks for a running VNC server. After discovering a running VNC server, new entries and buttons appear in the virtual desktop tab.
Now you have the choice to view your remote desktop in the main home page or in a new browser tab.
Connecting to the remote desktop is as simple as logging in to your local system, if not simpler. Just push one of the buttons and enjoy your virtual desktop inline or in a newly opened tab.
Figure 10.3. Virtual KDE Session
KDE session inside the web front-end with a started 3D application and an xterminal
To change your window manager or the virtual desktop screen size, simply navigate to Users Overview and choose the user of the virtual desktop session.
Scroll down to the section "Virtual Desktops", change the setting and push the modify button.
Table of Contents
NESTOR® assists system administrators in setting up and managing their cluster with the comfortable support of the web front-end. Imagine having to set up every single cluster machine by hand: hardly anyone has the time for that, and even with enough time the process rarely runs smoothly.
Typically a cluster consists of many single nodes, each connected over network interfaces, together building a cluster. For easier understanding, the following section refers to one server and only three nodes. Other nodes can be set up exactly the same way or, if the node installation varies, with small adaptions.
Before we are able to run a cluster installation, we have to make sure that the basic system services and components are working properly. The following requirements must be fulfilled to set up a cluster.
Services and daemons provided by the operating system
Services and daemons provided by NESTOR®
We need two separate networks, one for internal communication with the nodes and another one for external communication. Find further information in section Separate network.
IP addresses are distributed automatically via DHCP requests. Therefore we must install and run a DHCP daemon on our NESTOR® server.
Because network access is required on the node and server side, and in order to be able to boot single nodes over the network, we have to run a TFTP daemon.
The function of the network file system server (NFS) is to supply file access over ethernet. While TFTP, being very simple, handles the node boot process, NFS is responsible for the high-performance file transfer over ethernet.
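As a hedged illustration only (the exported path and the 172.16 network are taken from the examples later in this chapter; your paths and networks may differ), an NFS export for the node installation could look like this in /etc/exports:
/opt/cluster/system/images   172.16.0.0/255.255.0.0(ro,no_root_squash,no_subtree_check)
exportfs -ra    # re-read the export table after editing /etc/exports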
The xinetd daemon is responsible for starting services after receiving network requests on defined ports.
To install all above mentioned and required services on SUSE systems, use the following command line: zypper ref; zypper in dhcp-server tftp nfs-kernel-server xinetd
Use the common operating system package manager for other distributions.
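For example, on Debian or Ubuntu the corresponding packages could be installed roughly like this (the package names are assumptions about those distributions, not covered by this handbook):
apt-get update
apt-get install isc-dhcp-server tftpd-hpa nfs-kernel-server xinetd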
It is not enough to only install the needed services; they also have to be configured properly to make a NESTOR® cluster installation possible.
To start with, insert one of your NESTOR® network devices into the dhcpd configuration file located in /etc/sysconfig/dhcpd:
DHCPD_INTERFACE="eth1"
Start or restart the daemon, e.g. with rcdhcpd restart
Later you have to do the extended configuration with the help of
cluster-server.py -c write_dhcpd_config -D authoritative:True
Set the value for disable in /etc/xinetd.d/tftp to no.
Also set the value for server_args in /etc/xinetd.d/tftp from /srv/tftpboot to /tftpboot. Restart the xinetd daemon with rcxinetd restart.
We also have to create a symbolic link in / pointing to /opt/cluster/system/tftpboot and name it tftpboot.
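Assuming the stock xinetd file of your distribution, the changed lines and the symbolic link would look roughly like this (only the lines to adapt are shown; the rest of the file stays untouched):
# excerpt of /etc/xinetd.d/tftp
service tftp
{
        ...
        server_args     = /tftpboot
        disable         = no
}
ln -s /opt/cluster/system/tftpboot /tftpboot
rcxinetd restart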
Make sure all mentioned services start properly, even after your cluster server reboots. To start a service in openSUSE use a command like rcnfsserver start.
To check if a service is already running in openSUSE use rcnfsserver status. If you can not start a service, for example the nfsserver, it may be helpful to start rpcbind first.
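How the services are enabled at boot time depends on your init system; on a SysV-style openSUSE this could, as a sketch, be done with chkconfig, on systemd-based systems with systemctl enable (service names may differ slightly):
for svc in dhcpd xinetd nfsserver rpcbind; do
    chkconfig $svc on       # SysV-style openSUSE
done
# systemd-based systems:
# systemctl enable dhcpd xinetd nfs-server rpcbind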
The following NESTOR® packages also have to be installed in order to get a working cluster installation setup.
One of the most important parts of NESTOR®. Mother is responsible for:
Scanning its kernel directories and adding new kernels into the database.
As the name suggests, in the context of NESTOR®, this service eases the cluster network setup.
Manages installation of software packages over the web front-end. You can find more details about it in section package-install
We need three ingredients to "cook" our node installation soup.
System partition template
LINUX ™ kernel and initrd with essential kernel modules
Basic image which should be used for the nodes
Create a new partition table under ClusterPartition overview.
An important aspect here is the Nodeboot checkbox, explained further down.
Columns in partition table overview
Partition name
More precise description of the partition
Insert a name for the partition and, if you want, also a partition description. Two problems will immediately be displayed on the web front-end: no defined discs and no present mountpoint. To solve these problems, do the following:
create a new disk
create a new partition, set the partition number and set a valid mount point
Set the size to "0" to use the whole disk.
Please pay attention to the naming of discs, which follows UNIX conventions and therefore uses the common device name standard:
/dev/sd
The first part /dev/ represents the device tree. The second part sd represents a mass storage device like a HDD or SSD. To create a disk, simply append a letter between a (first registered device) and z (last registered device).
Do not insert any partition number into the device name. This will be done in the create partition dialogue.
After creating a disk, the create partition button will become visible.
Make sure to set the nodeboot flag of the partition.
Next, we need a LINUX™ kernel. It will be handled as a template kernel for the nodes. There are two common ways to provide one: you can either compile a completely new kernel or use an existing one.
The copy_local_kernel.sh script located in /opt/cluster/bin is designed to copy the kernel and its associated modules of the cluster server into the right directories.
The script also copies the System.map, the kernel config file, the kernel modules and the kernel firmware, and it also generates a dummy initrd file.
copy_local_kernel.sh [KERNEL_NAME] [KERNEL_DIR]
In case you get the following error message:
system target directory /opt/cluster/system/tftpboot/kernels does not exist
try the command with the --init flag: copy_local_kernel.sh --init
If you go the manual way instead, the following steps have to be done:
Copy the compiled kernel to the newly created folder and rename it to bzImage.
Copy the System.map to the folder.
Copy the config file of the kernel to the newly created folder and rename it to .config. If you are running the same kernel you can do zcat /proc/config.gz > .config
Copy /lib/modules of the desired kernel to the newly created folder under lib/.
Copy /lib/firmware to the new folder under lib/firmware/(kernel-version).
Create a bz2 tar of the lib directory: tar -cj lib > modules.tar.bz2
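Put together as shell commands, and assuming the target directory /opt/cluster/system/tftpboot/kernels/<kernel-name> and that you copy the currently running kernel, the manual steps could look roughly like this:
KDIR=/opt/cluster/system/tftpboot/kernels/$(uname -r)    # assumed target directory
mkdir -p $KDIR/lib/modules $KDIR/lib/firmware
cp /boot/vmlinuz-$(uname -r)    $KDIR/bzImage             # the kernel itself
cp /boot/System.map-$(uname -r) $KDIR/System.map
zcat /proc/config.gz          > $KDIR/.config             # config of the running kernel
cp -a /lib/modules/$(uname -r)  $KDIR/lib/modules/
cp -a /lib/firmware             $KDIR/lib/firmware/$(uname -r)
(cd $KDIR && tar -cj lib > modules.tar.bz2)               # pack the module tree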
In order to get every single part of a properly booting system together, we have to create a suitable initrd. The modules and drivers essential for the boot procedure (e.g. filesystem drivers, network drivers etc.) have to be included in the initrd to make the boot process work.
Run the script populate_ramdisk.py to create the initial ramdisk.
populate_ramdisk.py [-m LIST_OF_MODULES] [KERNEL_DIRECTORY]
Use populate_ramdisk.py -L -L and populate_ramdisk.py -L to show all generated ramdisks and the included modules.
The list of built-in modules will most likely contain the modules used for your node network card or filesystem driver. Watch for "file missing" errors and install the missing binaries on your system; killall, for example, can be found in the package psmisc.
Restart mother. Mother scans its kernel directory and adds the new kernel into the database.
The populate_ramdisk.py script copies the stage scripts from /opt/cluster/lcs and some binaries into an initrd + kernel usable for PXE booting. Now you can associate your first node with the new kernel via operation > nodeboot.
So our command to populate the initrd should look like this:
populate_ramdisk.py -m virtio_net,virtio_pci --set-master-server -i /tftpboot/kernels/3.11.10-25-default/
The third component you absolutely need to run a cluster and install nodes over the network is an image. It contains all software packages needed for your node installation. We need a basic image which will be served to the local nodes and which represents a pattern for all node installations. Here too we have to choose whether we want to install all image components manually or be assisted by a script named make_image.py.
First of all we need an image directory structure. A good choice to start with is /opt/cluster/system/
Within it, create the following subdirectories:
/opt/cluster/system/images
/opt/cluster/system/images/suse_131/etc/zypp/repos.d
Copy your local repository files into the just created image repository directory.
cp -a /etc/zypp/repos.d/* suse_131/etc/zypp/repos.d/
Now we have to install the OS packages into our created directory. First do a repository refresh (ref), then install your desired packages.
zypper -R /opt/cluster/system/images/suse_131/ ref
zypper -R /opt/cluster/system/images/suse_131/ in icsw-client icsw-server icsw-dependencies bash aaa_base glibc net-tools sshd openssh psmisc util-linux pam libxml2-tools loadmodules vim
Apart from the required services, kernel, image and partitions, it is absolutely necessary to integrate our nodes into our network and peering. But first we have to build up a network configuration consisting of two separate networks. Once the hardware network settings are done, we can switch to the web front-end.
In order to boot nodes over PXE we have to define two separate networks. The first network is intended for the boot procedure of the node. The second network will be used for the node installation over NFS and later in the productive state.
We need these two different networks both as settings in NESTOR® and as real hardware network settings.
The figure below shows a typical NESTOR® network setting for one single cluster server and two different networks. Of course there are three networks if we also count the local area network, but for our purpose we concentrate only on the class B networks beginning with 172.
It is also important that your cluster server really has the same hardware network settings. In our example, we use two network devices (eth1 and eth1:prod) providing our two different networks for boot and production, and additionally the local area network (eth0). So we get the following ifconfig -a output:
eth0      Link encap:Ethernet  HWaddr 32:34:03:32:32:DD
          inet addr:192.168.1.239  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::3034:3ff:fe32:32dd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:71510 errors:0 dropped:10 overruns:0 frame:0
          TX packets:10156 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:7598935 (7.2 Mb)  TX bytes:5595522 (5.3 Mb)

eth1      Link encap:Ethernet  HWaddr 52:54:00:DE:96:95
          inet addr:172.17.1.1  Bcast:172.17.255.255  Mask:255.255.0.0
          inet6 addr: fe80::5054:ff:fede:9695/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:179865 errors:0 dropped:0 overruns:0 frame:0
          TX packets:193583 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17279193 (16.4 Mb)  TX bytes:306664808 (292.4 Mb)

eth1:prod Link encap:Ethernet  HWaddr 52:54:00:DE:96:95
          inet addr:172.16.1.1  Bcast:172.16.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
The device network settings are also very important for booting nodes over the network. These are the settings which must be done over the web front-end. Each node needs the following device network settings:
One network device
Two IP addresses, one for boot and one for prod
Peering connection to the cluster server
For a better understanding, there are already two nodes (node, node2) with a completely correct setup. Now we have to set up node3 the same way.
Let's start with adding a new node. In the web front-end, navigate to BaseDevice tree. Create a new device (optionally creating a device group before) and name it. Essential settings here are:
Bootserver
This should be the machine our nodes boot from
Root passwd
This will be your node root password; if left empty, the default password will be init4u
The root passwd can be set not only for the first installation of nodes but also for every boot procedure.
After creating the new node device, navigate to BaseDevice network to create a new netdevice, assign two IP addresses to it and set the peering.
Figure 11.4. Comparison of complete (node, node2) and incomplete (node3) settings
![]() |
Figure shows node3 without any netdevice, IP address and peer
Click on the Create new button and create a new netdevice.
In the hardware tab, type in the MAC address of your device. This is the only way your node can be identified by your cluster server because it has no IP address at boot time.
Also type in your network driver, for example virtio_net if you want to boot a virtual machine with this driver.
Mark the inter device routing checkbox in the basic settings tab.
After the creation of your new netdevice, create two IP addresses for your device, one for the boot network and one for the prod network, by clicking on the create new button beside your network device and choosing IP from the drop-down menu.
Type in your desired IP address and choose the right network.
Last but not least you have to create a peer or network topology connection to make sure the node is connected with the cluster server.
Like above, click the Create new button, but this time choose
network topology connection to make a link between node3 and cluster_server.
Another important point in the context of cluster setup is the right device configuration. Here we enable the NESTOR® cluster software, especially the cluster server device, to act as a cluster server. The analogue is true for the nodes.
You have to enable at least the following device configurations for the server:
server
mother-server
kernel-server
image-server
You also have to enable the node configuration for each node.
After all packages are installed, configured and running, it is time for the really fun part of NESTOR®. To control nodes before and after installation there is a tool called Clusternodeboot. Nodeboot enables you to control your nodes in every way.
You can also display log lines for each node or with the macbootlog button for all nodes.
It is possible to control one single node or to control all nodes at once. Nodeboot allows you to pick one of the following:
target state
kernel
image
partition
bootdevice
In addition to the above choices, Nodeboot provides a soft control button, a hard control button, an action button and a log button.
Figure 11.8. Overview of all existing nodes
![]() |
Nodeboot with two selected devices, ready for global actions
Use the "STC" global action buttons or the "sel" buttons to select nodes for modifications or for actions.
Possible soft controls for nodes are:
reboot
halt
poweroff
Possible hard controls for nodes are:
The content of the modify button depends on whether a header button is selected or not. With all header buttons selected, the modify popup window looks like this:
Table 11.1. basic settings
Setting | Option | Description
---|---|---
Target state | boot (link) | Boot the node
 | installation (link,ins) | Install node
 | boot_clean (link,retain) | Boot the node
 | installation_clean (link,ins,retain) | Install node
Special state | memtest (mem) | Run memory test at boot time
 | boot_local (loc) |
 | boot_iso (iso) |
New kernel | kernel_1, kernel_2, ... | List of available kernels
Stage 1 flavour | CPIO |
 | CramFS |
 | ext2 via loopback |
Kernel append | - |
New image | Image_1, Image_2, ... | List of available images
Partition table | Partition_table_1, Partition_table_2, ... | List of available partition tables
Table 11.2. bootdevice settings
Setting | Value | Description
---|---|---
greedy | - | Checkbox to activate
Dhcp write | - | Checkbox to activate
Macaddr | - | MAC address of the node
Driver | - | Network driver of the node
Logging is also possible with nodeboot. To display the logging messages of a node, simply click on the show button on the right side.
Another useful logging view is macbootlog.
Table of Contents
One of the main advantages over proprietary software is the ability to extend or adapt functionality to user-defined targets.
There are some documented APIs which allow you to customise and optimise the workflow and the integration into your company's facilities.
Table of Contents
To obtain information about the general status of your server use icsw service status.
To show the last errors from the logfile you can use lse .
lse [-l ERROR_NUMBER]
For more information type lse --help.
Example 13.2. Using lse to display the last error
clusterserver:~ #
lse -l 1
Found 40 error records
Error 40 occured yesterday, 17:12:47, pid 11507, uid/gid is (30/8 [wwwrun/www]), source init.at.cluster.srv_routing, 72 lines:
 0 (err) : IOS_type  : error
 1 (err) : args      : None
 2 (err) : created   : 1409152367.94
 3 (err) : exc_info  : None
 4 (err) : exc_text  : None
 5 (err) : filename  : routing.py
 6 (err) : funcName  : _build_resolv_dict
 7 (err) : gid       : 8
 8 (err) : levelname : err
 9 (err) : levelno   : 40
10 (err) : lineno    : 179
11 (err) : message   : device 'METADEV_server_group' (srv_type grapher) has an illegal device_type MD
Retrieving node information in an automated fashion is often useful in hunting down errors and bugs. To retrieve information about the nodes use collclient.py .
collclient.py [--host NODENAME] [command]
For more information execute collclient.py --help.
The server provides its own logging service. As usual in *NIX environments, there are dedicated directories the logfiles are written to. Access to these log files is given by the command lse. Of course it is also possible to read the logfiles directly with your favorite editor.
In case something goes wrong, the logging-server writes its logs under /var/log/cluster/logging-server/[HOSTNAME]/
The naming of the log files and subdirectories is related to the service which writes the log. For example, if the meta-server can not start some service, it will write its log into the corresponding subdirectory.
If you want to see background information about the package installation on some nodes, the file you have to check is package-client. The analogue is true for the server side; this time the filename is package-server.
Files called *.bz2
are compressed logging backup files.
Critical error logs will also be delivered by mail, so you do not have to check your logs permanently; you will be notified by mail about critical errors.
The setting for the recipient of error log mails is stored in /etc/sysconfig/logging-server.
Another configuration file for mail notification is /etc/sysconfig/meta-server.
Replace the given mail address in the line containing TO_ADDR= with your desired mail address.
By uncommenting and editing the line beginning with #FROM_ADDR= you are able to set the sender ("From") address of the received emails.
# from name and addr
FROM_NAME=pythonerror
#FROM_ADDR=localhost.localdomain
# to addr
TO_ADDR=mymail@gmail.com
# mailserver
MAILSERVER=localhost
After editing the logging-server configuration file, the logging-server daemon must be restarted:
icsw service restart logging-server
The new configuration takes effect after the restart.
A very handy command to read logfiles is icsw logwatch. Logwatch makes it possible to display logs of different services and daemons at once. Even if you don't know which file the logs are written to, you are still able to watch them. That's the reason why logwatch is an all-round tool for logging.
As usual for log files they have a typical output format.
Table 13.1. Logwatch columns
Column number | Column name | Example |
---|---|---|
1 | Date and time | Thu Apr 09 17:58:41 2015 |
2 | Device | 2_5_branch |
3 | System (logging daemon) | /collectd-init |
4 | Node | /--- |
5 | Loglevel | warn |
6 | Processname,processid | MainThread.19137 |
7 | Logmessage | sending 733 bytes to vector_socket |
Without any parameter, icsw logwatch displays the last 400 lines of logging messages of all services writing logfiles. With icsw logwatch -n 20 you can limit the output to the last 20 lines.
A very useful parameter for icsw logwatch is --system-filter. This flag restricts the output to one single daemon (service), e.g.
icsw logwatch --system-filter rrd
displays only log messages related to the rrd (daemon) service.
With the -f flag it is possible to view logs in realtime. Use
icsw logwatch [-f] [--machine MACHINE] [-n N] [--system-filter rrd]
to output appended data as the file grows. Try icsw logwatch --help to list all possible options.
In case of a malfunction it is very likely that the port number will be written into a logfile or appear in the web front-end. To find out which service or process causes the error, we have to know which service communicates on which port. The following table shows a little summary of common services and their communication ports.
Table 13.2. Portnumber and services
Service | Port |
---|---|
md-config-server | 8010 |
rrd-grapher | 8003, 8003 |
logging-server | 8011 |
meta-server | 8012 |
discovery-server | 8006 |
cluster-server | 8004 |
Table of Contents
This is a collection of frequently asked questions.
14.1.1. Bad looking font in RRD Graph
14.1.1.1. Why do my fonts look so ugly?
Ugly looking fonts due to wrong font setup
If you get something like in the picture above, you have to install the fetchmsttfonts (openSUSE) and ... (Debian) package.
14.1.2. Server Error (500)
14.1.2.1. Why do I get a Server Error (500)?
This is a server-internal error; most likely the server can't find some files. Take a look into /var/log/nginx/error.log for a detailed error message. The lse command can also be helpful.
14.1.3. Unable to connect to the web front end
14.1.3.1. Why can I not connect to the web front end?
For some reason the web server nginx is not running. Start it manually, for example with "icsw service start nginx"
14.1.4. An error occurred
14.1.4.1. Why do I get the message "An error occurred"?
Please wait a moment until the database connection is active and reload the page. If you still get this message after waiting a while, you have to start uwsgi-init, for example with "service uwsgi-init start". With top you can display a job list; if there is something like yuglify in the top rows, wait some time until it disappears. After that, and after reloading the page, the error message should disappear.
14.1.5. Configuration seems to be ignored
14.1.5.1. I changed my configuration but it seems to be ignored.
For some changes in your configuration you have to rebuild the config (cached, RC) first. If your config is stored in the cache, you even have to rebuild the config (refresh).
14.1.6. An error occurred
14.1.6.1. Why is my discovery not working?
Most likely the discovery-server service is not running. Make sure the discovery-server is installed and running.
Run the icsw service status command and look for "discovery-server".
If it is not running, start it either on the command line with icsw service start discovery-server or via the web front-end in the top menu under server information.
Another possible reason for that malfunction could be a disabled discovery server config for your monitoring server. To enable it, select your monitoring server device, navigate to the config tab and select the discovery server config.
After that you have to wait some time or refresh the memcached by
14.1.7. Slow network topology graph
14.1.7.1. Why is my network topology graph so slow?
Sometimes a complex network topology slows down the display output in Firefox. This issue affects Firefox up to version 31.0. The reason is likely bad javascript interpretation on the Firefox side. If you get bad graphic display performance, try another browser, e.g. Chromium or Google Chrome™.
14.1.8. Lost password
14.1.8.1. I lost my password, how can I get a new one?
A short guide on how to reset a login password by direct access to the database via clustershell follows:
From now on you are able to log in with your new password.
14.1.9. "Please wait..." after add location gfx
14.1.9.1. What if the "Please wait..." message is shown for a longer time?
If you have to wait a long time for a pending upload and the info label "Please wait..." is still shown after uploading an image with the add location gfx button, reload the page to resolve this issue.
14.1.10. Weird mouse events on virtual desktop
14.1.10.1. The mouse pointer position is wrong, what can I do to resolve this?
Some VNC servers tend to break correct mouse pointer handling in the virtual desktop. To get the correct mouse pointer back, log out of your session and back in again.
14.1.11. Asynchronous graphs
14.1.11.1. Why are my rrd graphs asynchronous?
If you get wrong graphs, for example 1 hour in the past or 1 hour in the future like the pink graph line below, make sure to set the correct timezone and time on the affected machines.
14.1.12. I have no permissions to icinga
14.1.12.1. How can I get the right permissions to access the icinga view?
To get the right permissions for icinga, you have to have at least one contact defined in Monitoring.
Rebuild your icinga config to apply your new contact entry.
Figure 14.4. No permission to icinga
Without at least one contact you are not able to use the icinga view.
14.1.13. Can not reach any network devices
14.1.13.1. Why can I not reach any network device?
14.1.14. Unable to delete group from device tree
14.1.14.1. How can I delete groups?
There is no delete button visible for preselected device groups in the device tree. The reason for this behavior is that there are disabled devices in this group. First delete these disabled devices; afterwards you are also able to delete the group.
14.1.15. No "IP address" dropdown in device network
14.1.15.1. Why is there no "IP address" dropdown in device network?
There is no "IP address" dropdown button visible in device network until at least one network is defined in Base
14.1.16. Could not connect to server: Connection refused
14.1.16.1. I get a server error on port 5432, how can I resolve the problem?
After running the icsw service status script, a python error mentioning port 5432 occurs. Generally, if you get error messages with the port number 5432, the reason is likely that your postgres server (which listens on port 5432 by default) is down.
[...] django.db.utils.OperationalError: could not connect to server: Connection refused Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5432?
To check if your postgres database server is running, type in one of the following commands, depending on your OS:
rcpostgres status
service postgresql status
systemctl status postgresql.service
Replace status with start to start the database server. Also make sure the postgres server is started at boot time by enabling your operating system start scripts, so the service comes up again after rebooting the system.
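Depending on the init system of your distribution, enabling the database server at boot time could look like one of the following (pick the one matching your system):
systemctl enable postgresql.service     # systemd-based systems
chkconfig postgresql on                 # SysV-based systems, e.g. older openSUSE
update-rc.d postgresql defaults         # Debian/Ubuntu with SysV init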
14.1.17. Internal Server Error
14.1.17.1. I get an internal server error in the web front-end, how can I resolve the problem?
After server installation and database setup, it can happen that you get an Internal Server Error. Try restarting your uwsgi-init service with the command rcuwsgi-init restart. The restart of the uwsgi service results in a running yuglify process which generates all static files. After all static files have been generated, you should get access to your web front-end.
14.1.18. License warning appears on every page
14.1.18.1. I get a license warning message on every single page, how can I resolve this?
A license warning message appears on the top right side each time a new page is loaded. This warning means that one of your licenses is either out of date and therefore in its grace period, or that the devices/services/users used for this license exceed its limitation. In both cases the license is violated and from that moment the grace period begins to run. The grace period for licenses is 2 weeks long. Within the grace period the functionality of the software is as usual, but you will see this license violation warning on each newly loaded page.
You have to get a new license or expand your existing one to avoid the license violation message. Please contact us by mail. Another method to get back within a valid license limit is to lock licenses for some devices.
14.1.19. Reverse domain tree node order
14.1.19.1. How can I reverse the domain tree node order?
Suppose you have a couple of devices with the following domain tree node order: The domain tree node structure for the above device domains looks like this: Now, for example, you want to reverse the order of the device domains.
The first step is to change the domain name tree itself. Navigate to Base Do the same with the other entries until your tree looks like this:
All you have to do now is to change the domain tree node for your devices. Select the desired devices in the device tree sidebar on the left side and navigate to
Base The result should look like this:
Dynamic Host Configuration Protocol, The Dynamic Host Configuration Protocol (DHCP) is a standardized network protocol used on Internet Protocol (IP) networks for dynamically distributing network configuration parameters, such as IP addresses for interfaces and services. With DHCP, computers request IP addresses and networking parameters automatically from a DHCP server, reducing the need for a network administrator or a user to configure these settings manually.
Django is a free and open source web application framework, written in Python
Domain Tree Node, is the tree structure of fully qualified domain names.
Industry standard software for monitoring devices.
Intelligent Platform Management Interface
Network File System, is a distributed file system protocol allowing a user on a client computer to access files over a network much like local storage is accessed.
Small and fast HTTP server, similar to Apache.
Preboot Execution Environment
RRDtool is the OpenSource industry standard, high performance data logging and graphing system for time series data.
The "Son of Grid Engine", community project of Sun Grid Engine. [https://arc.liv.ac.uk/trac/SGE]
Trivial File Transfer Protocol is a simple, lock-step, file transfer protocol which allows a client to get from or put a file onto a remote host. One of its primary uses is in the early stages of nodes booting from a Local Area Network. TFTP has been used for this application because it is very simple to implement.
An open-source super-server daemon which runs on many Unix-like systems and manages Internet-based connectivity. It offers a more secure extension to or version of inetd, the Internet daemon, thus most modern Linux distributions have switched to it.