Wednesday, March 13, 2013

Relationships in Solr

Approach to Relationships in Solr


Suppose we have a site that sells two products, shirts and shoes.

Shirt "100" is available in large red, large blue, and large black.
Shirt "200" is available in XL red, large blue, and XL blue.
Shoes "300" is available in brown size 10, black size 10, and brown size 12.
Shoes "400" is available in black size 10, brown size 10, black size 12, and blue size 12.

 PRODUCT TABLE  
 ----------------------  
 | ID | TYPE | SKU |  
 ----------------------  
 | 100 | SHIRT | 101 |  
 ----------------------  
 | 100 | SHIRT | 102 |  
 ----------------------  
 | 100 | SHIRT | 103 |  
 ----------------------  
 | 200 | SHIRT | 201 |  
 ----------------------  
 | 200 | SHIRT | 202 |  
 ----------------------  
 | 200 | SHIRT | 203 |  
 ----------------------  
 | 300 | SHOES | 301 |  
 ----------------------  
 | 300 | SHOES | 302 |  
 ----------------------  
 | 300 | SHOES | 303 |  
 ----------------------  
 | 400 | SHOES | 401 |  
 ----------------------  
 | 400 | SHOES | 402 |  
 ----------------------  
 | 400 | SHOES | 403 |  
 ----------------------  
 | 400 | SHOES | 404 |  
 ----------------------  
 SKU TABLE  
 ------------------------  
 | ID | Color | Size |  
 ------------------------  
 | 101 |  RED |  L |  
 ------------------------  
 | 102 |  BLUE |  L |  
 ------------------------  
 | 103 | BLACK |  L |  
 ------------------------  
 | 201 |  RED |  XL |  
 ------------------------  
 | 202 |  BLUE |  L |  
 ------------------------  
 | 203 |  BLUE |  XL |  
 ------------------------  
 | 301 | BROWN |  10 |  
 ------------------------  
 | 302 | BLACK |  10 |  
 ------------------------  
 | 303 | BROWN |  12 |  
 ------------------------  
 | 401 | BLACK |  10 |  
 ------------------------  
 | 402 | BROWN |  10 |  
 ------------------------  
 | 403 | BLACK |  12 |  
 ------------------------  
 | 404 |  BLUE |  12 |  
 ------------------------  

As I was approaching this problem I recalled the "union" structure from my "C" coding days. With a union you could implement a primitive type of polymorphism.

From Stack Overflow:

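 /* Tag that records which member of the union currently holds a value. */
 typedef enum { INTEGER, STRING, REAL, POINTER } Type;
 typedef struct
 {
     Type type;
     /* Only one of these members is meaningful at a time, as told by "type". */
     union {
         int integer;
         char *string;
         float real;
         void *pointer;
     } x;
 } Value;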

The "union" caused me to think about how to lay out my Solr schema in a similar denormalized fashion.

The fields needed are:
  • id - unique across all entries
  • Type
  • SKUS - a multiValued field to hold SKU IDs.
  • Color
  • Size
A Solr document's fields do not have to contain a value unless the schema specifies that the field is required.

   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />   
   <field name="Type" type="string" indexed="true" stored="true" required="true" multiValued="false" omitNorms="true" omitTermFreqAndPositions="true" />  
   <field name="SKUS" type="string" indexed="true" stored="true" required="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />  
   <field name="Color" type="string" indexed="true" stored="true" required="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />  
   <field name="Size" type="string" indexed="true" stored="true" required="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />  

When storing a product the only fields used are id, Type, and SKUS.
When storing a product SKU the only fields used are id, Type, Color, and Size.
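
To make the union analogy concrete, here is a minimal Python sketch (not part of Solr) that treats Type as the tag and checks that a document populates only the fields listed above for its kind. The helper names and the "999" document are hypothetical, chosen only for illustration.

```python
# Fields each kind of document is expected to populate (taken from the text above).
PRODUCT_FIELDS = {"id", "Type", "SKUS"}
SKU_FIELDS = {"id", "Type", "Color", "Size"}


def expected_fields(doc):
    """Return the field set a document should use, chosen by its Type tag."""
    # "SKU" marks a SKU document; any other Type (SHIRT, SHOES) is a product.
    return SKU_FIELDS if doc["Type"] == "SKU" else PRODUCT_FIELDS


def is_well_formed(doc):
    """True when the document populates exactly the fields its Type calls for."""
    return set(doc) == expected_fields(doc)


shirt = {"id": "100", "Type": "SHIRT", "SKUS": ["101", "102", "103"]}
red_large = {"id": "101", "Type": "SKU", "Color": "RED", "Size": "L"}
mixed_up = {"id": "999", "Type": "SKU", "SKUS": ["101"]}  # hypothetical bad document

print(is_well_formed(shirt), is_well_formed(red_large), is_well_formed(mixed_up))
```

Solr itself does not enforce this split; the schema only marks id and Type as required, so a check like this would live in the indexing code.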

Create a Solr core and add the fields above to the schema and then "post" the following data to the core.

 <add overwrite="true">  
 <doc>  
      <field name="id">100</field>  
      <field name="Type">SHIRT</field>  
      <field name="SKUS">101</field>  
      <field name="SKUS">102</field>  
      <field name="SKUS">103</field>  
 </doc>  
 <doc>  
      <field name="id">101</field>  
      <field name="Type">SKU</field>  
      <field name="Color">RED</field>  
      <field name="Size">L</field>  
 </doc>  
 <doc>  
      <field name="id">102</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLUE</field>  
      <field name="Size">L</field>  
 </doc>  
 <doc>  
      <field name="id">103</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLACK</field>  
      <field name="Size">L</field>  
 </doc>  
 <doc>  
      <field name="id">200</field>  
      <field name="Type">SHIRT</field>  
      <field name="SKUS">201</field>  
      <field name="SKUS">202</field>  
      <field name="SKUS">203</field>  
 </doc>  
 <doc>  
      <field name="id">201</field>  
      <field name="Type">SKU</field>  
      <field name="Color">RED</field>  
      <field name="Size">XL</field>  
 </doc>  
 <doc>  
      <field name="id">202</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLUE</field>  
      <field name="Size">L</field>  
 </doc>  
 <doc>  
      <field name="id">203</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLUE</field>  
      <field name="Size">XL</field>  
 </doc>  
 <doc>  
      <field name="id">300</field>  
      <field name="Type">SHOES</field>  
      <field name="SKUS">301</field>  
      <field name="SKUS">302</field>  
      <field name="SKUS">303</field>  
 </doc>  
 <doc>  
      <field name="id">301</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BROWN</field>  
      <field name="Size">10</field>  
 </doc>  
 <doc>  
      <field name="id">302</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLACK</field>  
      <field name="Size">10</field>  
 </doc>  
 <doc>  
      <field name="id">303</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BROWN</field>  
      <field name="Size">12</field>  
 </doc>  
 <doc>  
      <field name="id">400</field>  
      <field name="Type">SHOES</field>  
      <field name="SKUS">401</field>  
      <field name="SKUS">402</field>  
      <field name="SKUS">403</field>  
      <field name="SKUS">404</field>  
 </doc>  
 <doc>  
      <field name="id">401</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLACK</field>  
      <field name="Size">10</field>  
 </doc>  
 <doc>  
      <field name="id">402</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BROWN</field>  
      <field name="Size">10</field>  
 </doc>  
 <doc>  
      <field name="id">403</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLACK</field>  
      <field name="Size">12</field>  
 </doc>  
 <doc>  
      <field name="id">404</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLUE</field>  
      <field name="Size">12</field>  
 </doc>  
 </add>  
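
Typing this XML by hand gets tedious. The following is a rough Python sketch of how the same kind of message could be generated from the two tables at the top of the post. It only covers two of the products to stay short, and the function names are my own for illustration.

```python
import xml.etree.ElementTree as ET

# Rows copied from the PRODUCT table above (only two products, to keep this short).
PRODUCT_ROWS = [
    ("100", "SHIRT", "101"), ("100", "SHIRT", "102"), ("100", "SHIRT", "103"),
    ("300", "SHOES", "301"), ("300", "SHOES", "302"), ("300", "SHOES", "303"),
]
# Rows copied from the SKU table above: SKU id -> (Color, Size).
SKU_ROWS = {
    "101": ("RED", "L"), "102": ("BLUE", "L"), "103": ("BLACK", "L"),
    "301": ("BROWN", "10"), "302": ("BLACK", "10"), "303": ("BROWN", "12"),
}


def add_doc(parent, fields):
    """Append a <doc> holding one <field name=...> element per (name, value) pair."""
    doc = ET.SubElement(parent, "doc")
    for name, value in fields:
        ET.SubElement(doc, "field", name=name).text = value


def build_add_xml(product_rows, sku_rows):
    """Build the <add> message: one doc per product, followed by one doc per SKU."""
    # Group SKU ids by product, keeping the order in which the rows appear.
    skus_by_product = {}
    for product_id, product_type, sku_id in product_rows:
        skus_by_product.setdefault((product_id, product_type), []).append(sku_id)

    root = ET.Element("add", overwrite="true")
    for (product_id, product_type), sku_ids in skus_by_product.items():
        fields = [("id", product_id), ("Type", product_type)]
        fields += [("SKUS", sku_id) for sku_id in sku_ids]
        add_doc(root, fields)
        for sku_id in sku_ids:
            color, size = sku_rows[sku_id]
            add_doc(root, [("id", sku_id), ("Type", "SKU"), ("Color", color), ("Size", size)])
    return ET.tostring(root, encoding="unicode")


print(build_add_xml(PRODUCT_ROWS, SKU_ROWS))
```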


Querying


Query: Find any type of product that is red color and size large.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:RED AND Size:L)

 <?xml version="1.0" encoding="utf-8"?>  
 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">9</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:RED AND Size:L)</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="1" start="0">  
     <doc>  
       <str name="id">100</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>101</str>  
         <str>102</str>  
       </arr>  
       <long name="_version_">1429439617419968512</long>  
     </doc>  
   </result>  
 </response>  

Only one product, shirt 100, is available in large and red.
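
The query URLs in this post are written the way you would type them into a browser, which encodes the spaces and braces for you. From a script they need to be percent-encoded. Here is a small sketch using only Python's standard library; the build_select_url helper is hypothetical, and the endpoint is the one used throughout this post.

```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8983/solr/collection1/select"  # endpoint used in this post


def build_select_url(q, fq=None):
    """Return a select URL with q (and optionally fq) percent-encoded."""
    params = [("q", q)]
    if fq is not None:
        params.append(("fq", fq))
    return SELECT_URL + "?" + urlencode(params)


url = build_select_url("{!join from=id to=SKUS}(Color:RED AND Size:L)")
print(url)
```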

Query: Find any type of product that is red color or black color.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:RED OR Color:BLACK)


 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">1</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:RED OR Color:BLACK)</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="4" start="0">  
     <doc>  
       <str name="id">100</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>101</str>  
         <str>102</str>  
         <str>103</str>  
       </arr>  
       <long name="_version_">1429440962942205952</long>  
     </doc>  
     <doc>  
       <str name="id">200</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>201</str>  
         <str>202</str>  
         <str>203</str>  
       </arr>  
       <long name="_version_">1429440962945351681</long>  
     </doc>  
     <doc>  
       <str name="id">300</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>301</str>  
         <str>302</str>  
         <str>303</str>  
       </arr>  
       <long name="_version_">1429440962948497408</long>  
     </doc>  
     <doc>  
       <str name="id">400</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>401</str>  
         <str>402</str>  
         <str>403</str>  
         <str>404</str>  
       </arr>  
       <long name="_version_">1429440962951643136</long>  
     </doc>  
   </result>  
 </response>  

Notice that this returned shirts and shoes that are red or black.
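
If the join syntax looks like magic, this in-memory Python sketch shows the selection logic as I understand it: run the inner query, collect the "from" field values of the matches, then return the documents whose "to" field contains any of those values. It is only a model of the behavior on this post's data, not how Solr implements it, and it ignores scoring entirely.

```python
# The same data that was posted to the core, in compact form.
PRODUCTS = {
    "100": ("SHIRT", ["101", "102", "103"]),
    "200": ("SHIRT", ["201", "202", "203"]),
    "300": ("SHOES", ["301", "302", "303"]),
    "400": ("SHOES", ["401", "402", "403", "404"]),
}
SKUS = {
    "101": ("RED", "L"), "102": ("BLUE", "L"), "103": ("BLACK", "L"),
    "201": ("RED", "XL"), "202": ("BLUE", "L"), "203": ("BLUE", "XL"),
    "301": ("BROWN", "10"), "302": ("BLACK", "10"), "303": ("BROWN", "12"),
    "401": ("BLACK", "10"), "402": ("BROWN", "10"),
    "403": ("BLACK", "12"), "404": ("BLUE", "12"),
}
DOCS = [{"id": pid, "Type": ptype, "SKUS": skus} for pid, (ptype, skus) in PRODUCTS.items()]
DOCS += [{"id": sid, "Type": "SKU", "Color": c, "Size": s} for sid, (c, s) in SKUS.items()]


def join(docs, from_field, to_field, inner_match):
    """Model of {!join from=... to=...}: map inner-query matches onto related docs."""
    # Step 1: run the inner query and collect the "from" values of its matches.
    keys = {doc[from_field] for doc in docs if inner_match(doc)}
    # Step 2: keep documents whose "to" field shares at least one value with the keys.
    return [doc for doc in docs if keys & set(doc.get(to_field, []))]


# {!join from=id to=SKUS}(Color:RED OR Color:BLACK)
hits = join(DOCS, "id", "SKUS", lambda d: d.get("Color") in ("RED", "BLACK"))
print([doc["id"] for doc in hits])
```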

Query: Find any product that is black color.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:BLACK)


 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">0</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:BLACK)</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="3" start="0">  
     <doc>  
       <str name="id">100</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>101</str>  
         <str>102</str>  
         <str>103</str>  
       </arr>  
       <long name="_version_">1429440962942205952</long>  
     </doc>  
     <doc>  
       <str name="id">300</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>301</str>  
         <str>302</str>  
         <str>303</str>  
       </arr>  
       <long name="_version_">1429440962948497408</long>  
     </doc>  
     <doc>  
       <str name="id">400</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>401</str>  
         <str>402</str>  
         <str>403</str>  
         <str>404</str>  
       </arr>  
       <long name="_version_">1429440962951643136</long>  
     </doc>  
   </result>  
 </response>  

This query returned both shirts and shoes. What if you only wanted shoes?

Query: Find only shoes that are in blue.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:BLUE)&fq=Type:SHOES


 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">4</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:BLUE)</str>  
       <str name="fq">Type:SHOES</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="1" start="0">  
     <doc>  
       <str name="id">400</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>401</str>  
         <str>402</str>  
         <str>403</str>  
         <str>404</str>  
       </arr>  
       <long name="_version_">1429440962951643136</long>  
     </doc>  
   </result>  
 </response>  
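
Since the responses are plain XML, a client can pull out the matching products with a few lines of code. Below is a sketch using Python's standard XML parser on a trimmed copy of the response above; the function name is made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above (header and _version_ left out).
RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">400</str>
      <str name="Type">SHOES</str>
      <arr name="SKUS">
        <str>401</str>
        <str>402</str>
        <str>403</str>
        <str>404</str>
      </arr>
    </doc>
  </result>
</response>"""


def matching_products(response_xml):
    """Return (numFound, list of product dicts) from a Solr XML response."""
    root = ET.fromstring(response_xml)
    result = root.find("result[@name='response']")
    products = []
    for doc in result.findall("doc"):
        products.append({
            "id": doc.findtext("str[@name='id']"),
            "Type": doc.findtext("str[@name='Type']"),
            # Each value of the multiValued SKUS field is a <str> inside the <arr>.
            "SKUS": [s.text for s in doc.findall("arr[@name='SKUS']/str")],
        })
    return int(result.get("numFound")), products


print(matching_products(RESPONSE))
```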


Query: Find only shirts that are black color and size large.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:BLACK AND Size:L)&fq=Type:SHIRT


 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">1</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:BLACK AND Size:L)</str>  
       <str name="fq">Type:SHIRT</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="1" start="0">  
     <doc>  
       <str name="id">100</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>101</str>  
         <str>102</str>  
         <str>103</str>  
       </arr>  
       <long name="_version_">1429440962942205952</long>  
     </doc>  
   </result>  
 </response>  

Notice that this query did not return shirt 200, which comes in large but not in black. Both conditions have to hold on the same SKU. This is important.

Conclusion


A completely denormalized schema can represent relationships by using one field to specify the document type and multiValued fields to hold the IDs of the related documents.

By using Solr's pseudo-join and filter queries you can "select" documents without running into the "false positive" match problem that multiValued fields are prone to.
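
To show what a "false positive" looks like, here is a small Python comparison on the two shirts. The "flattened" layout, where each product carries multiValued Color and Size fields of its own, is my assumption about the alternative design this problem refers to; it is not a schema used anywhere in this post.

```python
# SKU-level data for the two shirts, as in the tables at the top of the post.
SHIRT_SKUS = {
    "100": [("RED", "L"), ("BLUE", "L"), ("BLACK", "L")],
    "200": [("RED", "XL"), ("BLUE", "L"), ("BLUE", "XL")],
}


def flattened_match(color, size):
    """One document per product with multiValued Color and Size (assumed layout)."""
    hits = []
    for product_id, skus in SHIRT_SKUS.items():
        colors = {c for c, _ in skus}
        sizes = {s for _, s in skus}
        # The color and the size may come from two different SKUs here.
        if color in colors and size in sizes:
            hits.append(product_id)
    return hits


def per_sku_match(color, size):
    """Each SKU is its own document, so both values must sit on the same SKU."""
    return [pid for pid, skus in SHIRT_SKUS.items() if (color, size) in skus]


print(flattened_match("RED", "L"))  # also reports shirt 200, whose only red SKU is XL
print(per_sku_match("RED", "L"))
```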

This solution works out of the box with Solr 4.1. You gotta like that. :-)



Wednesday, February 27, 2013

Setting Up SolrCloud on Windows for Beginners

Intro


I have recently returned from Solr training by LucidWorks. The training was excellent. I had been using Solr for a couple of months, experimenting with various queries to improve recall on my particular data. The clarification on the various caches was very valuable. But enough of that; this post is about setting up SolrCloud on Windows.

This tutorial uses:
  • Solr 4.1.0
  • Java 1.7.0_13

Java

Install Java.

After installing Java I added this to the system Path:
C:\Program Files\Java\jdk1.7.0_13;C:\Program Files\Java\jre7\bin

I also added this environment variable:
JAVA_HOME

with the value:
C:\Program Files\Java\jdk1.7.0_13

Of course those values need to reflect where you installed Java.

Check your install of Java by going to some directory other than the one where you installed Java and running the following:

java -version

java version "1.7.0_13"
Java(TM) SE Runtime Environment (build 1.7.0_13-b20)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)



Solr

Download Solr 4.1.

http://lucene.apache.org/solr/downloads.html

I downloaded the zip file, solr-4.1.0.zip.

Install Solr

Unzip Solr.

Modify Contents of the Example Directory and Prepare for Shards and Replicas

Go into the directory that contains Solr. In my install the zip file created a directory structure like this:
C:\solr-4.1.0\solr-4.1.0

I will refer to the above path as SOLRHOME.

Here is the listing of the directory:

02/27/2013  03:31 PM    <DIR>          .
02/27/2013  03:31 PM    <DIR>          ..
02/27/2013  03:28 PM           286,759 CHANGES.txt
02/27/2013  03:29 PM    <DIR>          contrib
02/27/2013  03:30 PM    <DIR>          dist
02/27/2013  03:31 PM    <DIR>          docs
02/27/2013  03:31 PM    <DIR>          example
02/27/2013  03:28 PM            12,872 LICENSE.txt
02/27/2013  03:31 PM    <DIR>          licenses
02/27/2013  03:28 PM            24,495 NOTICE.txt
02/27/2013  03:28 PM             5,464 README.txt
02/27/2013  03:28 PM               805 SYSTEM_REQUIREMENTS.txt
               5 File(s)        330,395 bytes
               7 Dir(s)  50,699,972,608 bytes free



Go to SOLRHOME\example\solr

Here I want to simulate setting up a custom collection. So, rename the directory "collection1" to "junk".

Still in the same directory, edit the "solr.xml" file. We are interested in the portion at the bottom that specifies the cores.

Here are the original contents of the solr.xml file core entry:



Change each instance of collection1 in the XML file to junk. The results are shown here:





Make sure you have the host port set to the Jetty port in the solr.xml file.

In the SOLRHOME directory I duplicated the "example" directory. I want to set up two shards, each with a replica, so I need four directories in total: copies of the example directory named example1, example2, example3, and example4. (By the way, I did this in Windows Explorer. You could do the same, or use a Command window; it doesn't matter.)
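
If you would rather script the copying, something along these lines should work. This is just a sketch with Python's standard library; make_node_dirs is a name I made up, and it assumes the example1 to example4 directories do not exist yet (shutil.copytree refuses to overwrite an existing directory).

```python
import shutil
from pathlib import Path


def make_node_dirs(solr_home, count=4, template="example"):
    """Copy SOLRHOME/example to example1 .. exampleN and return the new paths."""
    solr_home = Path(solr_home)
    created = []
    for n in range(1, count + 1):
        target = solr_home / f"{template}{n}"
        # Copies the whole tree, including start.jar and the solr directory.
        shutil.copytree(solr_home / template, target)
        created.append(target)
    return created


# Example call, using the SOLRHOME from this post:
# make_node_dirs(r"C:\solr-4.1.0\solr-4.1.0")
```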


Start Solr

Go to SOLRHOME\example1.

From a command prompt, execute the following:

java -Dbootstrap_confdir=./solr/junk/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

The value of -Dcollection.configName is just the name under which the configuration from bootstrap_confdir is stored in ZooKeeper. It worked the same for me when I used a different name, like this:

 java -Dbootstrap_confdir=./solr/junk/conf -Dcollection.configName=solrconfig -DzkRun -DnumShards=2 -jar start.jar

Notice that this command specifies the number of shards. Here is something to remember: re-sharding means re-indexing. If you set up two shards and load data into them and then decide you want three shards, at the time of writing this blog you have to re-index (re-import) all of your data.

Also, the command tells Solr to launch an instance of ZooKeeper with -DzkRun. ZooKeeper, you say? What is ZooKeeper? It is a coordination service that SolrCloud uses to manage the cluster's configuration and state. ZooKeeper comes with the Solr install (as does Jetty) and is launched for you. This is a convenience. In a production system you would not want ZooKeeper running on the same box as Solr, which creates a single point of failure. Also, ZooKeeper should be run in an ensemble of at least three instances. You can look up ZooKeeper if you want more details. The command runs ZooKeeper at the Solr port + 1000. The default Solr port is 8983, therefore ZooKeeper is at 9983.

Open a browser (I use Firefox; I have experienced problems with IE) and go to this URL:

http://localhost:8983/solr/#/

You should see this:


Click on the "Cloud" item on the left.

The page will show you the graph of the cloud. Remember, our collection is named "junk" and we set up two shards and two replicas by making four "example" directories.




Starting the Second Shard

Starting the second shard is very simple.


  • Launch another command window and go to SOLRHOME\example2.
  •  Run this command:
    • java -Djetty.port=8984 -DzkHost=localhost:9983 -jar start.jar
You need to run this next instance of Solr on a different port than the first. The first defaulted to port 8983, so run this new instance on port 8984 by telling Jetty which port to use. (Jetty, like Tomcat, is a Java web server and servlet container.)

The parameter -DzkHost specifies where ZooKeeper is running.

After running the command you should see on the Solr Cloud page that shard2 is now running.




Starting the Replicas

Starting the replicas is like starting the second shard above, just go into each remaining directory (example3 and example4) and run the command used before specifying a different port for each instance.


Replica 1:

  • Launch another command window and go to SOLRHOME\example3.
  •  Run this command:
    • java -Djetty.port=8985 -DzkHost=localhost:9983 -jar start.jar



Replica 2:
  • Launch another command window and go to SOLRHOME\example4.
  •  Run this command:
    • java -Djetty.port=8986 -DzkHost=localhost:9983 -jar start.jar
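
To summarize the pattern, here is a small Python sketch that prints the command for each of the four directories. It only builds the strings; it does not launch anything. The function and its parameter names are mine, and it assumes the default port 8983 for the first node with ZooKeeper at that port + 1000, as described earlier.

```python
BASE_SOLR_PORT = 8983  # default Jetty port, used by the first node


def start_commands(node_count=4, collection_dir="junk", config_name="myconf", num_shards=2):
    """Return (directory, command) pairs for example1 .. exampleN."""
    # The embedded ZooKeeper started by -DzkRun listens at the Solr port + 1000.
    zk_host = f"localhost:{BASE_SOLR_PORT + 1000}"
    commands = []
    for n in range(node_count):
        if n == 0:
            # First node: uploads the config, starts ZooKeeper, fixes the shard count.
            cmd = (f"java -Dbootstrap_confdir=./solr/{collection_dir}/conf "
                   f"-Dcollection.configName={config_name} -DzkRun "
                   f"-DnumShards={num_shards} -jar start.jar")
        else:
            # Other nodes: their own Jetty port, pointed at the first node's ZooKeeper.
            cmd = f"java -Djetty.port={BASE_SOLR_PORT + n} -DzkHost={zk_host} -jar start.jar"
        commands.append((f"example{n + 1}", cmd))
    return commands


for directory, command in start_commands():
    print(f"{directory}: {command}")
```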

Miscellaneous

In the Solr Dashboard, if you select "Tree" under "Cloud" it shows you the information that is used by ZooKeeper to configure the shards and replicas.


Mistakes I Made Trying to Figure this Out

Before I tried to set up my first SolrCloud configuration I had been running Solr for about two months. During that time I was experimenting with various schemas, field types, and Lucene queries. My problem set is one of "recall" based on "scoring".

Sometime during this experimentation I had altered many of the files, and I must have messed up the solr.xml file. I would do the same steps I have above and I would never get any other shards to appear. Finally I just reinstalled Solr and everything started working.

I suspect one culprit was that I had experimented with DistributedSearch, where you manually set up shards. In the solr.xml file you can specify the core information with shard details, and that may have been "floating" around somewhere.

Another mistake I made was forgetting to specify the ZooKeeper param when launching what I thought would be a new shard or replica. So, make sure you don't forget to tell Solr where ZooKeeper is with the -DzkHost param.

If things don't seem to be working you can go to "example1" and delete the zoo_data directory and try launching things again.

Good Luck!