Friday, February 12, 2016

Setting up Chef with a Vagrant box

I certainly ran into some frustrating scenarios with this setup, either because my Engrish isn't good or because the documentation (for Chef) isn't as comprehensible as it should be.
Here's what I wanted to achieve:
- A Vagrantfile to spin up a VirtualBox instance
- Chef Server on the VirtualBox instance
- Chef Client on my Windows workstation

This was my guide, and a pretty good one at that: https://www.digitalocean.com/community/tutorials/how-to-set-up-a-chef-12-configuration-management-system-on-ubuntu-14-04-servers
1. I create a simple Vagrantfile with 'vagrant init'. Easy.
I also note that the default Vagrantfile has some stuff about setting up Chef Solo and linking to an existing Chef Server, but nothing along the lines of creating my own Chef Server in the VBox instance. Shame...
2. Run 'vagrant up' and follow the steps on the DigitalOcean website. I skipped setting a hostname initially, since Vagrant instances already map the hostname to 127.0.0.1, but I added it later during troubleshooting. It shouldn't make any difference unless you SSH in from a machine other than localhost.
3. Install Chef Server using rpm, then run 'chef-server-ctl reconfigure'. It takes 10-15 minutes, but everything seems functional.

Then Chinese New Year comes around and some spring cleaning is in order, so my PC gets shut down. When I start my VBox instance up again, I wonder how to start the Chef server, or whether that's even necessary.
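So I do some googling, find that it's started with 'chef-server-ctl reconfigure', and run that. After a few hours, NOTHING HAPPENS. There's nothing in /etc/init.d either. It turns out it uses an embedded nginx server... surely it's not this hard? (In hindsight, 'chef-server-ctl status' and 'chef-server-ctl start' are the commands for checking and starting the services; reconfigure re-applies the whole configuration.)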
After a night's sleep, I figure out why: Vagrant started my VirtualBox instance with 633MB of memory. Bugger that! I had also missed this part: the Chef documentation says your Chef server should have at least 4 cores and 4GB of RAM. That's quite a bit... anyway, I bumped it up to 2GB and reconfigure now takes a couple of minutes, yay!
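For reference, the memory bump can live in the Vagrantfile so it survives a 'vagrant destroy'. This is a minimal sketch using the standard VirtualBox provider settings; merge it into the Vagrantfile that 'vagrant init' generated. The commented-out cpus line is optional and reflects the Chef recommendation quoted above, not something tested here.

```ruby
# Vagrantfile fragment (sketch): give the Chef Server VM more than the box default.
Vagrant.configure("2") do |config|
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 2048 # in MB: the 2GB used here, still below Chef's documented 4GB
    # vb.cpus = 4    # optional: Chef's docs also recommend 4 cores
  end
end
```

Run 'vagrant reload' afterwards so VirtualBox picks up the new memory size.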

4. Create a simple chef-repo and put it in git. Set up Chef DK on my Windows PC. So far so good.
5. Create a couple of .pem files on the server as per the guide. I just copy them over to the workstation via the /vagrant shared folder, no biggie.
6. Create knife.rb in my .chef folder and run 'knife client list'. But instead of "certificate verify failed" like the guide says, I get "unknown protocol". The next step, 'knife ssl fetch', yields the same result. Ah shit, I skipped some of the earlier steps like setting up SSH keys; where have I gone wrong?!
7. I set up all my SSH keys, add the hosts to /etc/hosts like a studious boy, and make sure plain SSH works the way 'vagrant ssh' does... oh right, the default Vagrant SSH port is 2222, maybe that's it...
But take a look at the knife.rb file:

current_dir = File.dirname(__FILE__)
log_level                :info
log_location             STDOUT
node_name                "admin"
client_key               "#{current_dir}/admin.pem"
validation_client_name   "digitalocean-validator"
validation_key           "#{current_dir}/digitalocean-validator.pem"
chef_server_url          "https://server_domain_or_IP/organizations/digitalocean"
syntax_check_cache_path  "#{ENV['HOME']}/.chef/syntaxcache"
cookbook_path            ["#{current_dir}/../cookbooks"]

Hmmm, but it's using "https", so it must go over port 443, not SSH... OK, so nginx is exposing a Chef web service on port 443.
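That port reasoning can be checked with a few lines of Ruby, the same language knife.rb is written in. This is a small diagnostic sketch, not part of Chef: server_port shows which port a chef_server_url implies, and port_open? (a hypothetical helper) tries a plain TCP connection with a short timeout. Running it on the Windows workstation and on the VM can give different answers, because a forwarded port only exists on the host side.

```ruby
require "socket"
require "uri"

# The port a chef_server_url implies: with no explicit port, URI fills in
# 443 for an https URL, so knife talks to nginx on 443 rather than to sshd.
def server_port(chef_server_url)
  URI(chef_server_url).port
end

# Hypothetical helper: true if something accepts a TCP connection on
# host:port within `timeout` seconds, false on refusal, timeout or DNS failure.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

url = "https://vagrant-centos65.vagrantup.com/organizations/myapp"
puts server_port(url) # prints 443
```

For example, port_open?("vagrant-centos65.vagrantup.com", 443) from the workstation returns false until the port is actually reachable from there.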
Maybe my certs were generated with the wrong host in the /etc/hosts file... I regenerate the .pem files and put them in my .chef folder... no, that ain't working either!!
Oh. I haven't forwarded port 443 from VirtualBox. So I set my host port to 4443 and the guest port to 443 (and add :4443 to chef_server_url in knife.rb). Voila, that lets me run 'knife ssl fetch'! 'knife client list' shows me what I want too.
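The forward described above, as a Vagrantfile fragment (a sketch; merge it into your existing Vagrant.configure block). With it in place, knife.rb's chef_server_url has to spell the port out, e.g. https://vagrant-centos65.vagrantup.com:4443/organizations/myapp. Keep in mind that 4443 only exists on the Windows host; inside the VM, nginx is still listening on 443.

```ruby
# Vagrantfile fragment (sketch): expose the guest's HTTPS port on the host.
Vagrant.configure("2") do |config|
  # Host (Windows) port 4443 -> guest port 443, where Chef Server's nginx listens.
  config.vm.network "forwarded_port", guest: 443, host: 4443
end
```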
8. I run this command:
knife bootstrap vagrant-centos65.vagrantup.com:2222 -N testing -x vagrant -P vagrant --sudo --use-sudo-password
It starts up Chef Client version 12.7.0 and does this:

vagrant-centos65.vagrantup.com      [2016-02-11T17:37:09+00:00] ERROR: Error connecting to https://vagrant-centos65.vagrantup.com:4443/organizations/myapp/nodes/testing, retry 1/5
vagrant-centos65.vagrantup.com      [2016-02-11T17:38:17+00:00] ERROR: Error connecting to https://vagrant-centos65.vagrantup.com:4443/organizations/myapp/nodes/testing, retry 2/5
vagrant-centos65.vagrantup.com      [2016-02-11T17:39:25+00:00] ERROR: Error connecting to https://vagrant-centos65.vagrantup.com:4443/organizations/myapp/nodes/testing, retry 3/5
vagrant-centos65.vagrantup.com      [2016-02-11T17:40:33+00:00] ERROR: Error connecting to https://vagrant-centos65.vagrantup.com:4443/organizations/myapp/nodes/testing, retry 4/5
vagrant-centos65.vagrantup.com      [2016-02-11T17:41:41+00:00] ERROR: Error connecting to https://vagrant-centos65.vagrantup.com:4443/organizations/myapp/nodes/testing, retry 5/5
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com      ================================================================================
vagrant-centos65.vagrantup.com      Chef encountered an error attempting to load the node data for "testing"
vagrant-centos65.vagrantup.com      ================================================================================
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com      Networking Error:
vagrant-centos65.vagrantup.com      -----------------
vagrant-centos65.vagrantup.com      Error connecting to https://vagrant-centos65.vagrantup.com:4443/organizations/myapp/nodes/testing - Connection timed out - connect(2) for "vagrant-centos65.vagrantup.com" port 4443
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com      Your chef_server_url may be misconfigured, or the network could be down.

Oh noooo, what's happening... what's worse, if I Ctrl+C out of the timeout, I get this:
vagrant-centos65.vagrantup.com [2016-02-11T17:29:17+00:00] WARN: Chef client 18958 is running, will wait for it to finish and then run.
Firstly, what is this PID? Neither "ps -ef" in Cygwin nor Task Manager shows any PID anywhere near that number.
I find this blog and start searching for a "chef-client-running.pid" file on my system. The code references on the blog look a little outdated, so I search the Chef code for references to that file and find this:

/cygdrive/c/opscode
$ grep -ir "chef-client-running.pid"
...
chef/embedded/lib/ruby/gems/2.0.0/gems/chef-12.6.0-universal-mingw32/spec/unit/run_lock_spec.rb:  default_pid_location = windows? ? 'C:\chef\cache\chef-client-running.pid' : '/var/chef/cache/chef-client-running.pid'

Sure enough, there's a PID file in that location. I delete the file and re-run knife.
Knife doesn't seem to care about that file; it still shows the "will wait for it to finish" message. After a day of hunting around on my system, I take a pause and realize that knife is actually SSH-ing into my VirtualBox instance and running chef-client there! I run 'vagrant ssh' and 'ps -ef | grep chef' on the server and yes, there's that dang PID!
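Since the lock lives wherever chef-client runs, a quick check on the VM itself would have saved that day of hunting. A sketch with hypothetical helpers, assuming the PID file holds the process ID as plain text (which matches the number in the warning): it picks the default lock path from the spec line above and asks the OS whether that PID is alive using signal 0, which only checks for existence and delivers nothing. Run it on the VM, e.g. after 'vagrant ssh'.

```ruby
# Default run-lock path, following Chef 12.6's run_lock_spec.rb quoted above.
# On the Vagrant VM (Linux) this is the /var/chef/cache path.
def default_pid_location
  if Gem.win_platform?
    'C:\chef\cache\chef-client-running.pid'
  else
    "/var/chef/cache/chef-client-running.pid"
  end
end

# Hypothetical helper: returns the PID from the lock file if that process is
# alive on THIS machine, or nil if the file is missing or the process is gone.
def lock_holder(pid_file)
  return nil unless File.exist?(pid_file)
  pid = File.read(pid_file).strip.to_i
  return nil unless pid.positive?
  Process.kill(0, pid) # signal 0: existence check only
  pid
rescue Errno::ESRCH
  nil # stale lock file: no such process
rescue Errno::EPERM
  pid # process exists but belongs to another user, e.g. root
end

holder = lock_holder(default_pid_location)
puts(holder ? "chef-client #{holder} still holds the lock" : "no live lock holder")
```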
Alright, so now I can break out of a hanging knife. But why is it hanging? It turns out that when I use port 443 to access the URL "https://vagrant-centos65.vagrantup.com/organizations/myapp", it all works. I'm guessing the generated certificates force us to use port 443 on the host machine... either that or knife bootstrap really wants to use port 443. Or, more likely: bootstrap runs chef-client on the VM itself with the same chef_server_url, and port 4443 only exists on the Windows side of the port forward, so inside the VM nothing answers there. Anyway, that resolved the problem and now I get:

~/myapp/gitrepo/chef-repo
$ knife bootstrap -V vagrant@vagrant-centos65.vagrantup.com:2222 -N testing -x vagrant -P vagrant --sudo --use-sudo-password
INFO: Using configuration from D:/cygwin64/home/Alkaiser/myapp/gitrepo/chef-repo/.chef/knife.rb
Doing old-style registration with the validation key at D:/cygwin64/home/Alkaiser/myapp/gitrepo/chef-repo/.chef/myapp-validator.pem...
Delete your validation key in order to use your user credentials instead

Connecting to vagrant-centos65.vagrantup.com:2222
vagrant-centos65.vagrantup.com      -----> Existing Chef installation detected
vagrant-centos65.vagrantup.com      Starting the first Chef Client run...
vagrant-centos65.vagrantup.com      Starting Chef Client, version 12.7.0
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com      ================================================================================
vagrant-centos65.vagrantup.com      Chef encountered an error attempting to load the node data for "testing"
vagrant-centos65.vagrantup.com      ================================================================================
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com      Authentication Error:
vagrant-centos65.vagrantup.com      ---------------------
vagrant-centos65.vagrantup.com      Failed to authenticate to the chef server (http 401).
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com      Server Response:
vagrant-centos65.vagrantup.com      ----------------
vagrant-centos65.vagrantup.com      Failed to authenticate as 'testing'. Ensure that your node_name and client key are correct.
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com      Relevant Config Settings:
vagrant-centos65.vagrantup.com      -------------------------
vagrant-centos65.vagrantup.com      chef_server_url   "https://vagrant-centos65.vagrantup.com/organizations/myapp"
vagrant-centos65.vagrantup.com      node_name         "testing"
vagrant-centos65.vagrantup.com      client_key        "/etc/chef/client.pem"
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com      If these settings are correct, your client_key may be invalid, or
vagrant-centos65.vagrantup.com      you may have a chef user with the same client name as this node.
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com
vagrant-centos65.vagrantup.com      Running handlers:
vagrant-centos65.vagrantup.com      [2016-02-13T05:22:44+00:00] ERROR: Running exception handlers
vagrant-centos65.vagrantup.com      Running handlers complete
vagrant-centos65.vagrantup.com      [2016-02-13T05:22:44+00:00] ERROR: Exception handlers complete
vagrant-centos65.vagrantup.com      Chef Client failed. 0 resources updated in 07 seconds
vagrant-centos65.vagrantup.com      [2016-02-13T05:22:44+00:00] FATAL: Stacktrace dumped to /var/chef/cache/chef-stacktrace.out
vagrant-centos65.vagrantup.com      [2016-02-13T05:22:44+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
vagrant-centos65.vagrantup.com      [2016-02-13T05:22:44+00:00] ERROR: 401 "Unauthorized"
vagrant-centos65.vagrantup.com      [2016-02-13T05:22:44+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)

Unauthorized? Oh, I had registered the node and then removed it with my client so I could reproduce this for the blog... ahem. This kind soul here helped me figure it out: delete /etc/chef/client.pem on the VM and re-run knife bootstrap. Now I get "Chef Client finished, 0/0 resources updated in 09 seconds", alright!
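The same fix as a small Ruby sketch, for when you'd rather keep the old key around than delete it. retire_client_key is a hypothetical helper: it moves /etc/chef/client.pem aside so the next chef-client run on the node finds no client key and registers again with the validation key, as in the bootstrap output above. Run it on the VM as root, then re-run knife bootstrap.

```ruby
require "fileutils"

# Hypothetical helper: rename a stale client key instead of deleting it.
# Returns the backup path, or nil if there was no key to move.
def retire_client_key(path = "/etc/chef/client.pem")
  return nil unless File.exist?(path)
  backup = "#{path}.stale-#{Time.now.strftime('%Y%m%d%H%M%S')}"
  FileUtils.mv(path, backup)
  backup
end
```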

Now I want to run some cookbooks. More frustration...
8. Run "knife cookbook upload -a". This actually required me to manually clone the various dependent cookbooks (e.g. Tomcat relies on Java, which relies on yum-epel, openssl, etc.) before it would even work. Surely there's an automated way to do this...?
UPDATE: Just so I don't confuse anyone: yes, you can do this with "knife cookbook site install COOKBOOK_NAME [COOKBOOK_VERSION] (options)".
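What the manual cloning has to get right is a transitive dependency walk. Berkshelf, which ships with Chef DK, automates this ('berks install', then 'berks upload'); the sketch below only illustrates the idea. The dependency map is a simplified, hypothetical version of the Tomcat chain above, and parse_depends is a rough regex over metadata.rb 'depends' lines, not Chef's own parser.

```ruby
# Hypothetical, simplified dependency map for illustration only; real
# cookbooks declare theirs in metadata.rb and change between versions.
DEPS = {
  "tomcat"   => ["java"],
  "java"     => ["yum-epel", "openssl"],
  "yum-epel" => [],
  "openssl"  => []
}.freeze

# Rough parser: pulls cookbook names out of lines like
# `depends 'name'` or `depends "name", '>= 1.0'`.
def parse_depends(metadata_rb)
  metadata_rb.scan(/^\s*depends\s+['"]([^'"]+)['"]/).flatten
end

# Depth-first walk listing every transitive dependency before the cookbook
# that needs it; `seen` guards against cycles and duplicates.
def install_order(deps, root, seen = {}, order = [])
  return order if seen[root]
  seen[root] = true
  deps.fetch(root, []).each { |dep| install_order(deps, dep, seen, order) }
  order << root
end

p install_order(DEPS, "tomcat") # prints ["yum-epel", "openssl", "java", "tomcat"]
```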
9. Run "knife node edit testing" to update the run list, so I can actually install stuff. Instead I get this:
ERROR: RuntimeError: Please set EDITOR environment variable
So I find out I have to set knife[:editor] in my knife.rb file. I set it to the long file path of Notepad++.exe, but I just keep getting the same error, or this:
syntax error, unexpected tSTRING_BEG, expecting end-of-input
Argh... in the end I finally found this:
https://tickets.opscode.com/browse/CHEF-4503
It looks like you MUST set the value to a Windows short name using 8.3 notation. Mine ended up looking like this (the options are mandatory for this to work):
knife[:editor] = "D:\\PROGRA~1\\NOTEPA~1\\NOTEPA~1.EXE -nosession -multiInst"

Where on earth is this in the documentation here: https://docs.chef.io/config_rb_knife.html??? For the love of ...
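The doubled backslashes in that knife[:editor] line aren't optional either: knife.rb is plain Ruby, so the value follows Ruby's string-escaping rules. A quick illustration using the path from above (this only shows the escaping; it doesn't explain the 8.3 requirement from the ticket):

```ruby
# Inside double quotes a backslash starts an escape sequence, so every path
# separator has to be doubled; single quotes keep backslashes literally.
doubled = "D:\\PROGRA~1\\NOTEPA~1\\NOTEPA~1.EXE -nosession -multiInst"
single  = 'D:\PROGRA~1\NOTEPA~1\NOTEPA~1.EXE -nosession -multiInst'
puts doubled == single # prints true

# With single backslashes inside double quotes, "\P" and "\N" are not
# recognised escapes, so Ruby quietly drops the backslash.
mangled = "D:\PROGRA~1\NOTEPA~1"
puts mangled # prints D:PROGRA~1NOTEPA~1
```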
10. Run "chef-client" as root on the target server. Hmm, can't I manage the node remotely? Yes you can; I ended up with this:
knife ssh "name:testing" "sudo chef-client" -x vagrant -p 2222
In the end you get this:
Running handlers:
Running handlers complete
Chef Client finished, 14/15 resources updated in 04 minutes 51 seconds

Yay!
Extra point: I had to add the Guest Additions ISO matching the version on my base box to my Vagrantfile. For 4.3.14, for example, I downloaded it from http://download.virtualbox.org/virtualbox/4.3.14/VBoxGuestAdditions_4.3.14.iso and put this config (from the vagrant-vbguest plugin) in the Vagrantfile: config.vbguest.iso_path=