Wednesday, February 27, 2013

Setting Up SolrCloud on Windows for Beginners

Intro


I recently returned from Solr training by LucidWorks, and the training was excellent. I had been using Solr for a couple of months, experimenting with various queries to improve recall on my particular data. The clarification on the various caches was especially valuable. But enough of that: this post is about setting up SolrCloud on Windows.

This tutorial uses:
  • Solr 4.1.0
  • Java 1.7.0_13

Java

Install Java.

After installing Java I added this to the system Path:
C:\Program Files\Java\jdk1.7.0_13;C:\Program Files\Java\jre7\bin

I also added this environment variable:
JAVA_HOME

with the value:
C:\Program Files\Java\jdk1.7.0_13

Of course those values need to reflect where you installed Java.

Check your Java install by going to some directory other than the one where you installed Java and running the following:

java -version

java version "1.7.0_13"
Java(TM) SE Runtime Environment (build 1.7.0_13-b20)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
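
If you would rather check the version from a script than by eye, something like the following might do. It is only a sketch: parse_java_version is a name I made up for illustration, and the sample text is simply the output shown above.

```python
import re


def parse_java_version(output):
    """Return the quoted version string from `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None


# Sample text copied from the output shown in this post.
sample = '''java version "1.7.0_13"
Java(TM) SE Runtime Environment (build 1.7.0_13-b20)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)'''

print(parse_java_version(sample))
```

Note that java -version writes to standard error rather than standard output, so if you capture it with subprocess you need to read stderr.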



Solr

Download Solr 4.1.

http://lucene.apache.org/solr/downloads.html

I downloaded the zip file, solr-4.1.0.zip.

Install Solr

Unzip Solr.

Modify Contents of the Example Directory and Prepare for Shards and Replicas

Go into the directory that contains Solr. In my install the zip file created a directory structure like this:
C:\solr-4.1.0\solr-4.1.0

I will refer to the above path as SOLRHOME.

Here is the listing of the directory:

02/27/2013  03:31 PM    <DIR>          .
02/27/2013  03:31 PM    <DIR>          ..
02/27/2013  03:28 PM           286,759 CHANGES.txt
02/27/2013  03:29 PM    <DIR>          contrib
02/27/2013  03:30 PM    <DIR>          dist
02/27/2013  03:31 PM    <DIR>          docs
02/27/2013  03:31 PM    <DIR>          example
02/27/2013  03:28 PM            12,872 LICENSE.txt
02/27/2013  03:31 PM    <DIR>          licenses
02/27/2013  03:28 PM            24,495 NOTICE.txt
02/27/2013  03:28 PM             5,464 README.txt
02/27/2013  03:28 PM               805 SYSTEM_REQUIREMENTS.txt
               5 File(s)        330,395 bytes
               7 Dir(s)  50,699,972,608 bytes free



Go to SOLRHOME\example\solr

Here I want to simulate setting up a custom collection. So, rename the directory "collection1" to "junk".

Still in the same directory, edit the "solr.xml" file. We are interested in the portion at the bottom that specifies the cores.

Here are the original contents of the solr.xml file core entry:



Change each instance of collection1 in the XML file to junk. The results are shown here:





Make sure the host port in the solr.xml file is set to the Jetty port.
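
The rename can also be done with a few lines of script. This is a rough sketch only: the XML below is my assumption of what a stock Solr 4.x solr.xml core section looks like, not a copy of the file from this install, so compare it with your own file before relying on it. rename_collection is a made-up helper name.

```python
# ASSUMPTION: an approximation of a stock Solr 4.x solr.xml, for illustration only.
ASSUMED_SOLR_XML = '''<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1" hostPort="${jetty.port:}">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>'''


def rename_collection(xml_text, old="collection1", new="junk"):
    """Replace every occurrence of the old collection name with the new one."""
    return xml_text.replace(old, new)


renamed = rename_collection(ASSUMED_SOLR_XML)
print(renamed)
```

A plain text replace is enough here because the instruction is to change every instance of collection1. The hostPort attribute is left untouched.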

In the SOLRHOME directory I duplicated the "example" directory. I want to set up two shards, each with a replica, so I need four directories in total. I copied the example directory until I had four directories named example1, example2, example3, and example4. (By the way, I did this in Windows Explorer. You could do the same, or use a Command window; it doesn't matter.)
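
If you prefer a script to Windows Explorer, the copying could look roughly like this. It is a sketch under my own assumptions: make_node_dirs is a hypothetical helper, and it leaves the original "example" directory in place while creating the numbered copies.

```python
import shutil
from pathlib import Path


def make_node_dirs(solr_home, count=4, template="example"):
    """Copy solr_home/template to template1..templateN, skipping any that exist.

    Returns the list of directories that were actually created.
    """
    root = Path(solr_home)
    created = []
    for i in range(1, count + 1):
        target = root / f"{template}{i}"
        if not target.exists():
            shutil.copytree(root / template, target)
            created.append(target)
    return created
```

On my install the call would be make_node_dirs(r"C:\solr-4.1.0\solr-4.1.0"). Do the copy after editing solr.xml and renaming the collection, so every copy picks up the changes.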


Start Solr

Go to SOLRHOME\example1.

From a command prompt, execute the following:

java -Dbootstrap_confdir=./solr/junk/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

The value of -Dcollection.configName is just the name under which the configuration is stored in ZooKeeper, so a different name such as solrconfig works the same way:

java -Dbootstrap_confdir=./solr/junk/conf -Dcollection.configName=solrconfig -DzkRun -DnumShards=2 -jar start.jar

Notice that this command specifies the number of shards. Here is something to remember: re-sharding means re-indexing. If you set up two shards, load data into them, and then decide you want three shards, at the time of writing this blog you have to re-index (re-import) all of your data.

Also, the -DzkRun option tells the command to launch an instance of ZooKeeper. ZooKeeper, you say. What is ZooKeeper? It is an application for managing clusters. ZooKeeper comes with the Solr install (as does Jetty) and is launched for you as a convenience. In a production system you would not want ZooKeeper running on the same box as Solr, because that creates a single point of failure. ZooKeeper should also be run in an ensemble of at least three instances. You can look up ZooKeeper if you want more details. The command runs ZooKeeper at the Solr port + 1000. The default Solr port is 8983, therefore ZooKeeper is at 9983.
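
To make the port arithmetic and the pieces of the first command explicit, here is a small sketch. The function names zk_port and first_node_command are mine, not part of Solr. The flags are exactly the ones used in the command above.

```python
def zk_port(solr_port=8983):
    """Embedded ZooKeeper (started by -DzkRun) listens on the Solr port + 1000."""
    return solr_port + 1000


def first_node_command(conf_dir="./solr/junk/conf", config_name="myconf", num_shards=2):
    """Build the argument list for the first node, which also starts ZooKeeper."""
    return [
        "java",
        f"-Dbootstrap_confdir={conf_dir}",          # config directory to upload
        f"-Dcollection.configName={config_name}",   # name for the uploaded config
        "-DzkRun",                                  # launch the embedded ZooKeeper
        f"-DnumShards={num_shards}",                # fixed at bootstrap time
        "-jar",
        "start.jar",
    ]


print(zk_port())
print(" ".join(first_node_command()))
```

The argument list could be passed to subprocess.Popen with cwd set to SOLRHOME\example1 if you want to launch the node from a script instead of a command prompt.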

Open a browser (I use Firefox; I have experienced problems with IE) and go to this URL:

http://localhost:8983/solr/#/

You should see the Solr admin dashboard.


Click on the "Cloud" item on the left.

The page will show you the graph of the cloud. Remember, our collection is named "junk" and we set up two shards and two replicas by making four "example" directories.




Starting the Second Shard

Starting the second shard is very simple.


  • Launch another command window and go to SOLRHOME\example2.
  • Run this command:
    • java -Djetty.port=8984 -DzkHost=localhost:9983 -jar start.jar

You need to run this next instance of Solr on a different port than the first. The first defaulted to port 8983, so run this new instance on port 8984 by telling Jetty which port to run on. (Jetty, like Tomcat, is a web application server.)

The parameter -DzkHost specifies where ZooKeeper is running.

After running the command, the Solr Cloud page should show that shard2 is now running.




Starting the Replicas

Starting the replicas is like starting the second shard above: go into each remaining directory (example3 and example4) and run the same command, specifying a different port for each instance.


Replica 1:
  • Launch another command window and go to SOLRHOME\example3.
  • Run this command:
    • java -Djetty.port=8985 -DzkHost=localhost:9983 -jar start.jar



Replica 2:
  • Launch another command window and go to SOLRHOME\example4.
  • Run this command:
    • java -Djetty.port=8986 -DzkHost=localhost:9983 -jar start.jar
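
The three commands for example2, example3, and example4 differ only in the port, so they can be generated. A sketch, with node_commands as a hypothetical helper name and the directories, ports, and ZooKeeper address taken from this post:

```python
def node_commands(dirs=("example2", "example3", "example4"),
                  first_port=8984, zk_host="localhost:9983"):
    """Pair each node directory with its start command, one port per node."""
    commands = []
    for offset, directory in enumerate(dirs):
        port = first_port + offset
        command = f"java -Djetty.port={port} -DzkHost={zk_host} -jar start.jar"
        commands.append((directory, command))
    return commands


for directory, command in node_commands():
    print(f"{directory}: {command}")
```

Each command still has to be run from inside its own directory, in its own command window, after the first node (the one that runs ZooKeeper) is up.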

Miscellaneous

In the Solr Dashboard, if you select "Tree" under "Cloud", it shows you the information that is used by ZooKeeper to configure the shards and replicas.


Mistakes I Made Trying to Figure this Out

Before I tried to set up my first SolrCloud configuration I had been running Solr for about two months. During that time I was experimenting with various schemas, field types, and Lucene queries. My problem set is one of "recall" based on "scoring".

Sometime during this experimentation I had altered many of the files, and I must have messed up the solr.xml file. I would follow the same steps described above and no other shards would ever appear. Finally I just reinstalled Solr and everything started working.

I suspect one culprit was that I had experimented with DistributedSearch, where you manually set up shards. In the solr.xml file you can specify the core information with shard details, and that may have been "floating" around somewhere.

Another mistake I made was forgetting to specify the ZooKeeper param when launching what I thought would be a new shard or replica. So, make sure you don't forget to tell Solr where ZooKeeper is with the -DzkHost param.

If things don't seem to be working, you can go to "example1", delete the zoo_data directory, and try launching things again.
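
The cleanup can be scripted too. A minimal sketch, assuming all Solr instances are stopped first: reset_zookeeper_data is a made-up helper, and because I am not certain where under example1 the zoo_data directory ends up on every install, it searches for it instead of hard-coding a path.

```python
import shutil
from pathlib import Path


def reset_zookeeper_data(node_dir):
    """Delete every directory named zoo_data under node_dir.

    Returns the list of paths that were removed (empty if none were found).
    """
    removed = []
    # sorted() materializes the matches before anything is deleted.
    for path in sorted(Path(node_dir).rglob("zoo_data")):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

On my install the call would be reset_zookeeper_data(r"C:\solr-4.1.0\solr-4.1.0\example1"). Deleting this directory discards the ZooKeeper state, so the first node has to be started again with the bootstrap command.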

Good Luck!