/
Many people refer to ‘cloud’ and ‘virtualization’ in the same breath, and from there assume that the cloud is all about managing the virtual machines that run on your hypervisor. Currently, OpenStack supports virtual machine management through a number of hypervisors, the most widespread being KVM and Xen.
As it turns out, in certain circumstances using virtualization is not optimal: for example, when there are substantial performance requirements (e.g., for I/O and CPU) that are not compatible with the overhead of virtualization. However, it is still very convenient to use OpenStack features such as instance management, image management, authentication services, and so forth for IaaS use cases that require provisioning on bare metal. To address these cases, we implemented a driver for OpenStack Compute (Nova) that supports bare-metal provisioning.
When we undertook our first bare-metal provisioning implementation, there was already code from USC/ISI to support bare-metal provisioning on Tilera hardware. We weren’t going to be targeting Tilera hardware, but the other bits of that implementation were pretty useful. NTT Docomo also had code to support a more generic scheme using PXE boot and an IPMI-based power manager, but unfortunately it took some time to open source it, so we had to start development of the generic backend before the NTT Docomo code became available.
A blueprint on bare-metal provisioning can be found on the OpenStack wiki here: General Bare Metal Provisioning Framework.
Our driver implements the standard OpenStack hypervisor driver interface, with the difference that it doesn’t actually talk to any hypervisor. Instead, it manages a pool of physical nodes. Each physical node can be used to provision only one “Virtual” (sorry for the pun) Machine (VM) instance. When a new provisioning request arrives, the driver chooses a physical host from the pool to place the VM on, and the VM stays there until it is destroyed. The operator can add, remove, and modify the physical nodes in the pool.
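As a rough illustration of that shape, here is a heavily simplified sketch of such a driver in Python. The spawn/destroy signatures are abbreviated and the in-memory pool and its helper are invented for this example; the real driver keeps its pool in the Nova database and drives IPMI, PXE, and the agent described below.

# Heavily simplified sketch: one physical node per instance.
# Signatures are abbreviated; the pool helpers are invented for illustration.
from nova.virt import driver


class BareMetalDriver(driver.ComputeDriver):
    """Looks like a hypervisor driver, but hands out whole physical nodes."""

    def __init__(self, *args, **kwargs):
        super(BareMetalDriver, self).__init__(*args, **kwargs)
        self._pool = {}          # node_id -> node record (stand-in for the DB)
        self._allocations = {}   # instance uuid -> node_id

    def spawn(self, context, instance, image_meta, network_info=None,
              block_device_info=None, **kwargs):
        # Choose a free physical host; the instance stays on it until destroyed.
        node_id = self._pick_free_node()
        self._allocations[instance['uuid']] = node_id
        # ...power the node on via IPMI, netboot bootstrap-linux,
        # and queue the image-provisioning task for the agent...

    def destroy(self, instance, network_info, block_device_info=None, **kwargs):
        # Power the node off and return it to the pool.
        self._allocations.pop(instance['uuid'], None)

    def _pick_free_node(self):
        used = set(self._allocations.values())
        for node_id in self._pool:
            if node_id not in used:
                return node_id
        raise RuntimeError('no free bare-metal nodes in the pool')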
bare-metal provisioning architecture
The main components related to the bare-metal provisioning support are:
- nova-compute with the bare-metal driver. The bare-metal driver itself consists of several components:
  - dnsmasq: a netboot environment for instance provisioning.
  - nova-baremetal-agent: the agent that runs on bootstrap-linux (see the next bullet) and executes the various provisioning tasks spawned by the bare-metal driver.
  - bootstrap-linux: a tiny Linux image that is booted over the network and performs basic initialization. It is based on Tiny Core Linux and contains a basic set of packages, such as Python to run nova-baremetal-agent (which is implemented in Python) and curl to download an image from Glance. Additionally, it contains an init script that downloads nova-baremetal-agent using curl and executes it.
  - nova-baremetal-service: a service responsible for orchestrating the provisioning tasks (tasks are applied by nova-baremetal-agent directly to the bare-metal server it is running on).

Let’s see what each component actually does in the course of provisioning a new VM (i.e., when you call nova boot). I won’t focus on the details of this request until it reaches nova-compute and the spawn request has reached our bare-metal driver.
The following diagram illustrates this workflow:
bare-metal provisioning flow
1. The node netboots bootstrap-linux from the environment provided by dnsmasq.
2. bootstrap-linux performs its basic initialization, and its init script downloads nova-baremetal-agent with curl from nova-baremetal-service (which provides a REST interface for that) and executes it.
3. nova-baremetal-agent polls the nova-baremetal-service REST service for tasks.
4. nova-baremetal-service sees a task for this node and sends a response with the task, which includes a URL for the image from Glance and the authentication token needed to fetch it.
5. nova-baremetal-agent fetches the image from the URL specified in the task, ‘dd’s it to the hard drive, and then informs nova-baremetal-service that it is done with the task.
6. Once nova-baremetal-service is notified about task completion, it informs the driver that it is time to reboot the node.
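To make steps 3 through 5 concrete, here is a toy sketch of the agent’s poll/fetch/write loop in Python. The service URL, REST paths, task fields, and target device are assumptions made purely for illustration; they are not the real agent’s interface.

# Toy sketch of the nova-baremetal-agent loop; endpoints, task fields and
# the target device are assumed for illustration only.
import json
import subprocess
import time
import urllib2

SERVICE = 'http://baremetal-service:8443'   # assumed nova-baremetal-service URL
NODE_ID = 'node-42'                         # assumed identity of this node

while True:
    # 1. Poll nova-baremetal-service for a provisioning task for this node.
    task = json.load(urllib2.urlopen('%s/tasks?node=%s' % (SERVICE, NODE_ID)))
    if not task:
        time.sleep(10)
        continue

    # 2. Stream the image from Glance with curl and 'dd' it straight to disk.
    curl = subprocess.Popen(
        ['curl', '-s', '-H', 'X-Auth-Token: %s' % task['auth_token'],
         task['image_url']],
        stdout=subprocess.PIPE)
    subprocess.check_call(['dd', 'of=/dev/sda', 'bs=8M'], stdin=curl.stdout)
    curl.wait()

    # 3. Report completion so the service can ask the driver to reboot the node.
    urllib2.urlopen(urllib2.Request('%s/tasks/%s/done' % (SERVICE, task['id']),
                                    data='{}'))
    break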
A typical configuration for the compute node will look like this:

...
--connection_type=baremetal    # bare-metal support
--baremetal_driver=generic     # target generic hardware, i.e., IPMI management and PXE boot
--networkmgr_driver=nova.virt.baremetal.networkmgr.juniper.JuniperNetworkManager    # use the Juniper network manager
--powermanager_driver=nova.virt.baremetal.powermgr.freeipmi.FreeIPMIPowerManager    # use freeipmi-based power management
...
But before the system becomes useful, we have to register switches and nodes. Information about them is stored in the database. We have created an extension to the OpenStack REST API to manage these objects, along with two CLI clients for it: nova-baremetal-switchmanager and nova-baremetal-nodemanager. Let’s use them to show how to add new switches and nodes.
Switches could be added using a command like this:
nova-baremetal-switchmanager add <ip> <user> <passwd> <driver> <description>
You have to specify the IP address of the switch, credentials for the manager user, which switch driver to use, and an optional description.
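For example, with entirely made-up values (a switch at 10.0.0.2 managed through a Juniper driver), the call might look like this:

nova-baremetal-switchmanager add 10.0.0.2 admin secret juniper "rack 1 top-of-rack switch"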
nova-baremetal-switchmanager also supports other essential commands, such as list and delete. Once we have at least one switch, we can start adding nodes:
nova-baremetal-nodemanager add <ip> <mac_addr> <cpus> <ram> <hdd> <ipmihost> <ipmiuser> <ipmipass> <switchid> <switchport>
As you can see, it takes a few more arguments: the IP address of the node, the MAC address of its first network interface (used to identify the node), the number of CPUs, the amount of RAM in MB, the HDD capacity in GB, the IPMI information, the ID of the switch it is connected to, and the name of the port on that switch.
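Again with made-up values, registering an 8-core node with 16 GB of RAM and a 500 GB disk might look like this:

nova-baremetal-nodemanager add 10.0.1.5 00:25:90:aa:bb:cc 8 16384 500 10.0.2.5 ipmiadmin ipmipass 1 ge-0/0/1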
After successful execution of this command, the specified node will be added to the pool. With nova-baremetal-nodemanager you can also list and remove nodes in the pool, using the list and delete commands respectively.
Bare metal has proved to be a useful and stable feature for our customers. It has other specific features, such as networking management and image preparation, that we will cover in upcoming posts.
/
In a previous post, we introduced the bare-metal use cases for an OpenStack cloud and its provisioning capabilities. Here, we’re going to talk about how you can apply some of these approaches to a scenario mixing virtualization with isolation of key components.
Isolation requirements are pretty common for OpenStack deployments. In fact, one could simply say: “Without proper resource isolation you can wave goodbye to the public cloud.” OpenStack tries to fulfill this need in a number of ways.
However, if we go under the hood of OpenStack, we will see a bunch of well-known open source components, such as KVM, iptables, bridges, and iSCSI shares. How does OpenStack treat these components in terms of security? I could say that it does hardly anything here: it is up to the sysadmin to go to each compute node and harden the underlying components on their own.
At Mirantis, one OpenStack deployment we dealt with had especially heavy security requirements: all the systems had to comply with several governmental standards for processing sensitive data. Still, we had to provide multi-tenancy. To satisfy the standards, we decided that isolated compute nodes with a hardened configuration should be provided for “sensitive” tenants.
The component responsible for distributing instances across an OpenStack cluster is nova-scheduler. Its most sophisticated scheduler type, FilterScheduler, allows you to enforce many instance-placement policies based on “filters”. For a given user request to spawn an instance, the filters determine a set of compute nodes capable of running it. A number of filters are already provided with the default nova-scheduler installation (they are listed here). However, none of them fully satisfied our requirements, so we decided to implement our own and called it “PlacementFilter”.
The main goal of the PlacementFilter is to “reserve” a whole compute node for a single tenant’s instances, thus isolating them from other tenants’ instances at the hardware level. Upon tenant creation it can be specified whether the tenant is isolated from others or not (the default). For isolated tenants, only designated compute nodes are used for provisioning VM instances. We define and assign these nodes to specific tenants manually, by creating a number of host aggregates; in short, host aggregates are a way to group compute nodes with similar capabilities or purpose. The goal of the PlacementFilter is to pick the proper aggregate (set of compute nodes) for a given tenant. Usual (non-isolated) tenants use “shared” compute nodes for VM provisioning. In this deployment we also used OpenStack to provision bare-metal nodes. Bare-metal nodes are isolated by their nature, so there is no need to assign them to a pool of isolated nodes for isolated tenants. (In fact, this post builds a bit on one of my previous posts about bare-metal provisioning.)
During the initial cloud configuration, all servers dedicated to compute should be split into three pools:
- compute nodes dedicated to isolated tenants (grouped into per-tenant host aggregates);
- shared compute nodes for common tenants (the default aggregate);
- bare-metal nodes, which are isolated by nature.
Such grouping is required to introduce two types of tenants: “isolated tenants” and “common tenants”. For isolated tenants, aggregates are used to create dedicated sets of compute nodes; these aggregates are later taken into account during the scheduling phase by the PlacementFilter.
The PlacementFilter has two missions:
- If a ‘bare_metal’ value was given for the ‘compute_type’ parameter in scheduler_hints, the filter passes only bare-metal hosts.
- If a non-bare-metal instance is requested, the filter looks for the aggregate of the project this instance belongs to and passes only hosts from that aggregate. If no aggregate is found for the project, a host from the default aggregate is chosen.

NOTE: We can instruct the scheduler to take our provisioning requirements into account by giving it so-called “hints” (the “--hint” option of the “nova” command); e.g., to specify a compute node’s CPU architecture: --hint arch=i386. In the case above, the hint for bare metal is: nova boot ... --hint compute_type=bare_metal
The following diagram illustrates how the PlacementFilter works for both bare-metal and virtual instances:
(1) A member of project#1 requests an instance on his own isolated set of compute nodes. The instance lands within his dedicated host aggregate.
(2) A member of project#1 requests a bare-metal instance. This time no aggregate is needed as bare-metal nodes are by nature isolated on the hardware level, so the bare-metal node is taken from the general pool.
(3) Instances of tenants not assigned any host aggregate land in the default “public” aggregate, where compute nodes can be shared among the tenants’ instances.
This is the procedure we follow to implement instance placement control:

1. Create the default aggregate and add the shared compute nodes to it:

nova aggregate-create default nova
nova aggregate-add-host 1 compute-1

2. Enable the PlacementFilter in the nova-scheduler configuration:

--scheduler_driver=nova.scheduler.filter_scheduler.FilterScheduler
--scheduler_available_filters=placement_filter.PlacementFilter
--scheduler_default_filters=PlacementFilter

3. Create a tenant:

keystone tenant-create --name <project_name>

4. Create an aggregate for the tenant and bind it to the tenant through the project_id metadata key:

nova aggregate-create <aggregate_name> nova
nova aggregate-set-metadata <aggregate_id> project_id=<tenant_id>

5. Add dedicated compute nodes to the tenant’s aggregate:

nova aggregate-add-host <aggregate_id> <host_name>

6. Boot a virtual instance as usual:

nova boot --image <image_id> --flavor <flavor_id> <instance_name>

7. Or boot a bare-metal instance by passing the scheduler hint:

nova boot --image <image_id> --flavor <flavor_id> --hint compute_type=bare_metal <instance_name>
With the advent of FilterScheduler, implementing custom scheduling policies has become quite simple. Filter organization in OpenStack makes it formally as simple as overriding a single function called “host_passes”. However, the design of the filter itself can become quite complex and is left to the fantasies of sysadmins/devs (ha!). As for host aggregates, until recently there was no filter that would take them into account (that’s why we implemented the PlacementFilter). However, recently (in August 2012) a new filter appeared, called AggregateInstanceExtraSpecsFilter, which seems to do a similar job.
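To give a feel for how small such a filter can be, here is an illustrative sketch built around the host_passes hook. Only the BaseHostFilter/host_passes interface comes from Nova; the aggregate-lookup helpers, the hypervisor_type check, and the exact request fields are assumptions based on the description above, not the code we actually shipped.

# placement_filter.py -- illustrative sketch only, not the production filter.
from nova.scheduler import filters


def _aggregate_for_project(context, project_id):
    # Hypothetical helper: return the host aggregate whose 'project_id'
    # metadata matches this tenant, or None. A real filter would query
    # Nova's aggregate DB API here.
    raise NotImplementedError


def _aggregate_by_name(context, name):
    # Hypothetical helper: return the named aggregate (e.g., 'default').
    raise NotImplementedError


class PlacementFilter(filters.BaseHostFilter):
    """Pass a host only if the requesting tenant is allowed to land on it."""

    def host_passes(self, host_state, filter_properties):
        context = filter_properties['context']
        hints = filter_properties.get('scheduler_hints') or {}
        props = filter_properties.get('request_spec', {}) \
                                 .get('instance_properties', {})
        project_id = props.get('project_id')

        # Bare-metal requests need no aggregate: pass only bare-metal hosts.
        if hints.get('compute_type') == 'bare_metal':
            return getattr(host_state, 'hypervisor_type', '') == 'baremetal'

        # Isolated tenant: only hosts from its dedicated aggregate pass.
        project_aggregate = _aggregate_for_project(context, project_id)
        if project_aggregate is not None:
            return host_state.host in project_aggregate['hosts']

        # Common tenant: only hosts from the shared 'default' aggregate pass.
        default_aggregate = _aggregate_by_name(context, 'default')
        return host_state.host in default_aggregate['hosts']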
/
Recently we began a series of blog posts on provisioning bare-metal instances with OpenStack (see Beyond virtual machines and hypervisors and Placement control and multi-tenancy isolation). While installing VMs is relatively easy, since they support common image formats like OVF and qcow2, bare-metal servers are not always that simple.
This article describes how to prepare images for bare-metal nodes. For Linux-based images the process is straightforward, because the generic kernel includes support for common hardware by default. Windows-based images, however, require special attention: once installed, Windows only works with the same hardware/architecture it was installed on. Thus, images should be prepared on the same hardware nodes you plan to use them on.
The simplest way to prepare images for bare metal consists of three general steps:
- install the operating system on a bare-metal node;
- boot the node from a LiveCD;
- create a raw image of the disk and transfer it to your workstation.
First, you need access to a bare-metal node. In our case, we had a Dell 6105 server with an IPMI interface.
The IPMI interface allows you to interact with a remote host through a Java applet (JViewer in Dell’s case). Aside from screen redirection and keyboard forwarding, it has a very important facility: redirection of an ISO image or CD-ROM to the server side. In other words, the hardware management console allows us to use a local CD-ROM on the remote end.
To make a node boot from CD-ROM you might need to change the boot priority in the BIOS.
Once that is done, the server should be rebooted through the IPMI interface and you can begin the installation. The installation typically takes longer than usual, since the installation image physically resides on the operator’s workstation and the speed depends on the capacity of the data channel.
After a successful installation of the system, the configuration should be changed. In the case of Windows Server 2008 there are some tricks that I will cover below.
Next, we need to boot from a LiveCD. The boot process is similar to the one described above for the beginning of the installation; we used Ubuntu. Preconfigure the network in /etc/network/interfaces so that you can send image data to the remote server through netcat, then list the available hard drives with the command

# fdisk -l

and make sure that they are not mounted:

# df
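As an aside, the network preconfiguration mentioned above can be as simple as a static stanza in /etc/network/interfaces; the interface name and addresses below are assumptions chosen only to match the netcat example that follows.

# /etc/network/interfaces on the LiveCD-booted node (example addressing)
auto eth0
iface eth0 inet static
    address 192.168.11.2
    netmask 255.255.255.0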
Now you can proceed directly to image creation. On the client side execute:
# nc -l 3333 > name_of_your_snapshot.img
netcat will be waiting for an incoming connection on the specified port. At the same time, on the remote host execute:
# dd if=/dev/sda conv=sync,noerror bs=8M count=5000 | nc 192.168.11.1 3333
where:
- if=/dev/sda reads from the beginning of the sda disk;
- conv=sync,noerror continues past read errors and pads short blocks;
- bs=8M count=5000 grabs 5,000 blocks of 8 megabytes each (40 GB in total, the size of the Windows installation);
- nc 192.168.11.1 3333 sends the data with netcat to the remote server.
When everything is completed successfully you get:
524288+0 records in
524288+0 records out
268435456 bytes transferred in 231.665702 secs (1158719 bytes/sec)
The image has been created and transferred to the local host.
We encountered three ’gotchas’; we’ve listed them here in the hope that you’ll be able to avoid these problems.
First, we ran into problems when using JViewer.
These problems were solved by creating a virtual machine on a server located physically close to the bare-metal node and working with IPMI through it. The issues could also have been related to the installed version of Java, which we cured by installing JRE version 6.
The second issue arose during the installation of Windows family systems: we encountered a problem associated with the location of the MBR boot sector.
Windows allows you to work with hard disk partitions of up to approximately 2 terabytes. If the hard disk or RAID array assigned for installation is larger than 2 terabytes, the disk is divided into logical disks no larger than that maximum. In that case the system partition and MBR end up on a different logical disk than the installed system, which makes an image of the system taken from /dev/sda unusable, since the system partition and MBR live on /dev/sdb.
Thus, you should make sure to install Windows on a partition less than 2 terabytes.
The third glitch we faced during the testing of images was related to Windows Server 2008’s firewall. The image launched through OpenStack wasn’t accessible through the network, but we solved it by switching off the firewall at the command prompt:
netsh advfirewall set allprofiles state off