Running a Devnode Cluster Across Multiple Boxes

In my last post, we used devnodectl to fire up a simple 3-node cluster, with all nodes running on the same system.  This is interesting insofar as it demonstrates the functionality of the cluster, but we’re certainly quite resource limited trying to emulate three nodes on one box.  In this post, I’ll show how to get devnode running on different physical servers, so you can begin to see the potential of horizontal scaling available with Clustrix.

Along the way I’ll also cover:

  • Manual control of devnode (vs. devnodectl)
  • How, where, and what to look for in the logs

Requirements

  • Three Red Hat/CentOS 6 or equivalent clients
  • All clients must be on the same subnet
  • The clients should NOT be running mysqld (so port 3306 will be free for devnode)

Firing up the devnode instances

First install the DevKit RPMs on each of your clients, as described in my last blog post, and ensure each has a writable working directory available (I'll use /data/clustrix on all three).  If you followed along with the last exercise, please stop those nodes (devnodectl stop), and I also recommend cleaning out the prior state with rm -rf /data/clustrix/*.  The flags we'll specify below will overwrite old state as needed, but the old node 2 and node 3 state would still be sitting on your first client, which might become confusing.
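
For a copy/paste version of that cleanup (only your first client will have leftover state from the last post, and this assumes the same /data/clustrix working directory):

client1$ devnodectl stop
client1$ rm -rf /data/clustrix/*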

Now we're going to start up devnode directly, and also change the flags around quite a bit.  I recommend opening three terminal windows, one per node; connect (via ssh, presumably) to a different client in each, then run the following, one command per window:

client1$ /opt/clustrix/bin/devnode -clusterpath /data/clustrix -setpnid 1 -eth eth0 -clean -nclean 4 -vdev-size 2048 -logfile - 
client2$ /opt/clustrix/bin/devnode -clusterpath /data/clustrix -setpnid 2 -eth eth0 -clean -nclean 4 -vdev-size 2048 -logfile - -noautostart
client3$ /opt/clustrix/bin/devnode -clusterpath /data/clustrix -setpnid 3 -eth eth0 -clean -nclean 4 -vdev-size 2048 -logfile - -noautostart

The -clusterpath, -nclean, and -vdev-size flags we talked about last time (recap: where the simulated node stores its data, and how many disks of what size it should have; note that I'm allocating 8GB per node here).  Let's pick apart the other flags:

-setpnid sets a Physical Node ID — normally this would come from a node’s MAC address

-eth tells devnode to use the ethernet interface for inter-node communication; when we created our three node cluster on the same physical machine with devnodectl, it specified -unix to use UNIX sockets instead.  On real nodes, we’d be using InfiniBand for this purpose.
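
If your client's ethernet interface isn't actually named eth0, check what's available and substitute that name in the -eth argument above; for example:

client1$ ip addr show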

-clean tells the cluster to wipe all prior state

-noautostart for the second and third nodes avoids a devnode restart step, as will be explained further below

-logfile - means to log to stdout instead of to a log file

This last option is how I prefer to run, because staring at logfiles is how I live.  For our purposes today I think it will be most instructive for you as well.

Note that normally these logs go to /data/clustrix/p1/devnode.log (substitute p2, p3, etc. for the other nodes).
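
If you do log to files rather than stdout, you can still follow along live with a standard tail on each client (adjusting the pN directory per node):

client1$ tail -f /data/clustrix/p1/devnode.log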

If a bunch of FATAL errors scroll by, the most likely culprit is a port conflict:

2012-01-25 11:22:38 ERROR cp/cp_sock.c:74 cp_bind(): stream_listen(IPv4(0.0.0.0:2048)): Address already in use
2012-01-25 11:22:38 FATAL core/segv.c:93 main_segv_handler(): Program received a fatal signal on core 0  fiber 0
2012-01-25 11:22:38 FATAL core/segv.c:95 main_segv_handler(): C stack trace:
0x000000000052d976 bind_done() <no lines read>
0x00000000008c7c91 scheduler_run_one_item() <no lines read>
0x00000000008c8893 scheduler_main_loop() <no lines read>

The above indicates that port 2048 (the control port, which we'll cover later) is already in use.  You'd see this if you tried to run the above commands on the same box, instead of 3 different boxes.  If you're running mysqld on one of your boxes, it will fail thusly:

n1 2012-01-25 11:37:47 ERROR mysql/server/mysql_proto.c:1171 listen_on_port(): stream_listen(IPv4(0.0.0.0:3306)): Address already in use
n1 2012-01-25 11:37:47 FATAL dbcore/dbstate.c:104 dbconf_done(): Error handling dbconf chain: Address already in use: dbconf/INIT_MYSQL_PROTO failed (unable to create TCP socket)
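
In that second case the simplest fix is to stop mysqld on the offending client (or at least confirm what's holding the ports) before starting devnode; on a stock CentOS 6 box that's something like:

client2$ sudo service mysqld stop
client2$ sudo netstat -tlnp | grep -E ':(2048|3306)'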

Alternatively, you can work around these conflicts by specifying the -anyport flag, in which case you'll need to look back through the logs to find which ports it has chosen:

2012-01-25 11:41:17 INFO dbcore/driver.ct:90 driver_publish_address(): pnid p3 control port 33274 sw f71a74d89c512ac
n1 2012-01-25 11:41:17 INFO dbcore/driver.ct:90 driver_publish_address(): pnid p3 mysql port 36290 sw f71a74d89c512ac
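
Since we're logging to stdout those lines just scroll past in the terminal; if you're logging to files instead, a quick grep pulls them back out (the published mysql port is the one you'd hand to mysql -P once that node is serving queries, and the numbers will of course differ on your run):

client3$ grep driver_publish_address /data/clustrix/p3/devnode.log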

Adding nodes 2 and 3 to your cluster

Normally fresh Clustrix nodes start up in a “cluster-of-one”; you can connect to any node and then pull in the others to form a larger cluster.  This mechanism involves a process restart (on real nodes, an initd-like process called nanny takes care of this); to avoid this, we used the -noautostart flag when starting nodes 2 and 3, so they don’t start up as “cluster-of-one”, and can’t be accessed via mysql until they are added to a cluster.

So, connect to your first client (client1 above, the one started without the -noautostart flag):

[nparrish@hefty mainline1]$ mysql -h beta001 -u root
mysql> use system;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

mysql> select * from available_node_details;
+------+------------+----------------------------------------------------+
| pnid | name       | value                                              |
+------+------------+----------------------------------------------------+
| p3   | cluster    | nparrish                                           | 
| p3   | sw_version | 1112854534402937516                                | 
| p3   | version    | 5.0.45-clustrix-v3.2-7371-0f71a74d89c512ac-release | 
| p3   | iface_ip   | 10.2.12.188                                        | 
| p3   | iface_mac  | 00:25:90:34:69:04                                  | 
| p3   | hostname   | loeb.colo.sproutsys.com                            | 
| p3   | started    | 2012-01-25 20:44:33.037318                         | 
| p2   | cluster    | nparrish                                           | 
| p2   | sw_version | 1112854534402937516                                | 
| p2   | version    | 5.0.45-clustrix-v3.2-7371-0f71a74d89c512ac-release | 
| p2   | iface_ip   | 10.2.12.194                                        | 
| p2   | iface_mac  | 00:25:90:34:70:0a                                  | 
| p2   | hostname   | sainz.colo.sproutsys.com                           | 
| p2   | started    | 2012-01-24 23:52:17.085502                         | 
+------+------------+----------------------------------------------------+
14 rows in set (0.00 sec)

So we're looking at system.available_node_details, which shows the nodes that can be seen on the network (here over eth0; on real nodes, via InfiniBand) and that are not already part of another cluster.  It's essentially a key/value pair table, extended to include the pnid (Physical Node ID; recall we set this with -setpnid).  We're really most interested in the hostname: I can see my two other clients, great.
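
Since the hostname rows are what we really care about, you can also narrow the query down to just those (nothing new here, just a WHERE clause on the same table):

mysql> select pnid, value from available_node_details where name = 'hostname';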

Now we add these nodes into the cluster using the ALTER CLUSTER query:

mysql> alter cluster add p2;
Query OK, 0 rows affected (0.00 sec)


mysql> alter cluster add p3;
Query OK, 0 rows affected (0.02 sec)

mysql> select * from nodeinfo\G
*************************** 1. row ***************************
        nodeid: 1
       started: 2012-01-25 20:50:50
       ntptime: 2012-01-25 20:54:13
   node uptime: 2012-01-20 23:15:52
      hostname: beta001.colo.sproutsys.com
    iface_name: eth0
      iface_ip: 10.2.13.11
iface_mac_addr: 00:30:48:c3:e7:5c
          pnid: p1
         cores: 12288
*************************** 2. row ***************************
        nodeid: 3
       started: 2012-01-25 20:50:57
       ntptime: 2012-01-25 20:54:13
   node uptime: 2011-04-12 01:05:08
      hostname: loeb.colo.sproutsys.com
    iface_name: eth0
      iface_ip: 10.2.12.188
iface_mac_addr: 00:25:90:34:69:04
          pnid: p3
         cores: 805313044
*************************** 3. row ***************************
        nodeid: 2
       started: 2012-01-25 20:50:53
       ntptime: 2012-01-25 20:54:13
   node uptime: 2011-10-25 17:31:38
      hostname: sainz.colo.sproutsys.com
    iface_name: eth0
      iface_ip: 10.2.12.194
iface_mac_addr: 00:25:90:34:70:0a
          pnid: p2
         cores: 3145744
3 rows in set (0.01 sec)

And you're ready to rock and roll.  (Yes, the cores value is a little funny; on Clustrix nodes this would tell you how many CPU cores are available on each node.)

Accessing Your Cluster

With each node using the standard MySQL port, it’s a little less fussy to connect to the cluster, as you no longer need to find and specify a different port for each node.  As before, connect to any node and you see the same database instance.
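
To convince yourself it really is one database, fire the same query at any of the three machines (from any box that can reach them; substitute your actual hostnames for client1/client2):

anyhost$ mysql -h client1 -u root -e 'select * from system.nodeinfo'
anyhost$ mysql -h client2 -u root -e 'select * from system.nodeinfo'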

Normally a Clustrix cluster is configured with a Virtual IP (VIP), a distinct IP address that load-balances connections across all the nodes.  This is partially implemented within the base OS of our appliance nodes, so unfortunately we cannot provide this functionality with the DevKit.  You can, however, implement simple load balancing with a tool like HAProxy (we use this internally as a simple solution to cut over between clusters located on different subnets, where DSR is not possible); see the sketch below.  I'll see about writing this up properly as another blog post.
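
In the meantime, here's a rough sketch of what such a front end could look like: a TCP-mode HAProxy listener round-robining MySQL connections across the three clients.  The hostnames are placeholders, it should live on a fourth machine (or a non-3306 port) so it doesn't collide with a local devnode, and it's a bare-bones illustration rather than a tuned config:

listen devkit-mysql
    bind *:3306
    mode tcp
    balance roundrobin
    timeout connect 5s
    timeout client 30m
    timeout server 30m
    server node1 client1:3306 check
    server node2 client2:3306 check
    server node3 client3:3306 check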

Please bear in mind that while we've now got our devnode processes distributed across multiple servers, they are communicating over ethernet instead of via UNIX sockets (in memory).  Ethernet is a far cry from the InfiniBand that connects real Clustrix nodes: beyond IB being a higher-throughput, lower-latency interconnect, our software is tuned for those characteristics, so we're not going to see world-beating performance with our simulated cluster running over ethernet.  For that, I'll refer you to our prior blog post on Percona's TPCC test!

Recap

So we've shown how to get devnode running on separate servers, communicating with each other via ethernet.  While a little more "real" than having them all running on a single box, the performance characteristics are going to be orders of magnitude off from what proper Clustrix nodes are capable of.  This does provide a simulacrum of the platform to develop against, and we'd welcome the opportunity to move you from there to deploying on real hardware.  As always, your feedback and questions are welcome, in the comments or the support forum.