Wednesday, March 13, 2013

Relationships in Solr

Approach to Relationships in Solr


Suppose we have a site that sells two products, shirts and shoes.

Shirt "100" is available in large red, large blue, and large black
Shirt "200" is available in XL red, large blue, and XL blue.
Shoes "300" is available in brown size 10, black size 10, and brown size 12.
Shoes "400" is available in black size 10, brown size 10, black size 12, and blue size 12.

 PRODUCT TABLE  
 ----------------------  
 | ID | TYPE | SKU |  
 ----------------------  
 | 100 | SHIRT | 101 |  
 ----------------------  
 | 100 | SHIRT | 102 |  
 ----------------------  
 | 100 | SHIRT | 103 |  
 ----------------------  
 | 200 | SHIRT | 201 |  
 ----------------------  
 | 200 | SHIRT | 202 |  
 ----------------------  
 | 200 | SHIRT | 203 |  
 ----------------------  
 | 300 | SHOES | 301 |  
 ----------------------  
 | 300 | SHOES | 302 |  
 ----------------------  
 | 300 | SHOES | 303 |  
 ----------------------  
 | 400 | SHOES | 401 |  
 ----------------------  
 | 400 | SHOES | 402 |  
 ----------------------  
 | 400 | SHOES | 403 |  
 ----------------------  
 | 400 | SHOES | 404 |  
 ----------------------  
 SKU TABLE  
 ------------------------  
 | ID | Color | Size |  
 ------------------------  
 | 101 |  RED |  L |  
 ------------------------  
 | 102 |  BLUE |  L |  
 ------------------------  
 | 103 | BLACK |  L |  
 ------------------------  
 | 201 |  RED |  XL |  
 ------------------------  
 | 202 |  BLUE |  L |  
 ------------------------  
 | 203 |  BLUE |  XL |  
 ------------------------  
 | 301 | BROWN |  10 |  
 ------------------------  
 | 302 | BLACK |  10 |  
 ------------------------  
 | 303 | BROWN |  12 |  
 ------------------------  
 | 401 | BLACK |  10 |  
 ------------------------  
 | 402 | BROWN |  10 |  
 ------------------------  
 | 403 | BLACK |  12 |  
 ------------------------  
 | 404 |  BLUE |  12 |  
 ------------------------  

As I was approaching this problem I recalled the "union" structure from my "C" coding days. With a union you could implement a primitive type of polymorphism.

From stackoverflow:

 typedef enum { INTEGER, STRING, REAL, POINTER } Type;  
 typedef struct  
 {  
  Type type;  
  union {  
  int integer;  
  char *string;  
  float real;  
  void *pointer;  
  } x;  
 } Value;  

The "union" caused me to think about how to layout my Solr schema in a similar denormalized fashion.

The fields needed are:
  • id - unique across all entries
  • Type
  • SKUS - a multiValued field to hold SKU IDs.
  • Color
  • Size
A Solr document's fields do not have to contain a value unless the schema specifies the fields is required.

   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />   
   <field name="Type" type="string" indexed="true" stored="true" required="true" multiValued="false" omitNorms="true" omitTermFreqAndPositions="true" />  
   <field name="SKUS" type="string" indexed="true" stored="true" required="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />  
   <field name="Color" type="string" indexed="true" stored="true" required="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />  
   <field name="Size" type="string" indexed="true" stored="true" required="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />  

When storing a product the only fields used are id, Type, and SKUS.
When storing a product SKU the only fields used are id, Type, Color, and Size.

Create a Solr core and add the fields above to the schema and then "post" the following data to the core.

 <add overwrite="true">  
 <doc>  
      <field name="id">100</field>  
      <field name="Type">SHIRT</field>  
      <field name="SKUS">101</field>  
      <field name="SKUS">102</field>  
      <field name="SKUS">103</field>  
 </doc>  
 <doc>  
      <field name="id">101</field>  
      <field name="Type">SKU</field>  
      <field name="Color">RED</field>  
      <field name="Size">L</field>  
 </doc>  
 <doc>  
      <field name="id">102</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLUE</field>  
      <field name="Size">L</field>  
 </doc>  
 <doc>  
      <field name="id">103</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLACK</field>  
      <field name="Size">L</field>  
 </doc>  
 <doc>  
      <field name="id">200</field>  
      <field name="Type">SHIRT</field>  
      <field name="SKUS">201</field>  
      <field name="SKUS">202</field>  
      <field name="SKUS">203</field>  
 </doc>  
 <doc>  
      <field name="id">201</field>  
      <field name="Type">SKU</field>  
      <field name="Color">RED</field>  
      <field name="Size">XL</field>  
 </doc>  
 <doc>  
      <field name="id">202</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLUE</field>  
      <field name="Size">L</field>  
 </doc>  
 <doc>  
      <field name="id">203</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLUE</field>  
      <field name="Size">XL</field>  
 </doc>  
 <doc>  
      <field name="id">300</field>  
      <field name="Type">SHOES</field>  
      <field name="SKUS">301</field>  
      <field name="SKUS">302</field>  
      <field name="SKUS">303</field>  
 </doc>  
 <doc>  
      <field name="id">301</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BROWN</field>  
      <field name="Size">10</field>  
 </doc>  
 <doc>  
      <field name="id">302</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLACK</field>  
      <field name="Size">10</field>  
 </doc>  
 <doc>  
      <field name="id">303</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BROWN</field>  
      <field name="Size">12</field>  
 </doc>  
 <doc>  
      <field name="id">400</field>  
      <field name="Type">SHOES</field>  
      <field name="SKUS">401</field>  
      <field name="SKUS">402</field>  
      <field name="SKUS">403</field>  
      <field name="SKUS">404</field>  
 </doc>  
 <doc>  
      <field name="id">401</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLACK</field>  
      <field name="Size">10</field>  
 </doc>  
 <doc>  
      <field name="id">402</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BROWN</field>  
      <field name="Size">10</field>  
 </doc>  
 <doc>  
      <field name="id">403</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLACK</field>  
      <field name="Size">12</field>  
 </doc>  
 <doc>  
      <field name="id">404</field>  
      <field name="Type">SKU</field>  
      <field name="Color">BLUE</field>  
      <field name="Size">12</field>  
 </doc>  
 </add>  


Querying


Query: Find any type of product that is red color and size large.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:RED AND Size:L)

 <?xml version="1.0" encoding="utf-8"?>  
 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">9</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:RED AND Size:L)</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="1" start="0">  
     <doc>  
       <str name="id">100</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>101</str>  
         <str>102</str>  
       </arr>  
       <long name="_version_">1429439617419968512</long>  
     </doc>  
   </result>  
 </response>  

There is only one shirt that is in large and red.

Query: Find any type of product that is red color or black color.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:RED OR Color:BLACK)


 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">1</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:RED OR Color:BLACK)</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="4" start="0">  
     <doc>  
       <str name="id">100</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>101</str>  
         <str>102</str>  
         <str>103</str>  
       </arr>  
       <long name="_version_">1429440962942205952</long>  
     </doc>  
     <doc>  
       <str name="id">200</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>201</str>  
         <str>202</str>  
         <str>203</str>  
       </arr>  
       <long name="_version_">1429440962945351681</long>  
     </doc>  
     <doc>  
       <str name="id">300</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>301</str>  
         <str>302</str>  
         <str>303</str>  
       </arr>  
       <long name="_version_">1429440962948497408</long>  
     </doc>  
     <doc>  
       <str name="id">400</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>401</str>  
         <str>402</str>  
         <str>403</str>  
         <str>404</str>  
       </arr>  
       <long name="_version_">1429440962951643136</long>  
     </doc>  
   </result>  
 </response>  

Notice that this returned shirts and shoes that are red or black.

Query: Find any product that is black color.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:BLACK)


 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">0</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:BLACK)</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="3" start="0">  
     <doc>  
       <str name="id">100</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>101</str>  
         <str>102</str>  
         <str>103</str>  
       </arr>  
       <long name="_version_">1429440962942205952</long>  
     </doc>  
     <doc>  
       <str name="id">300</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>301</str>  
         <str>302</str>  
         <str>303</str>  
       </arr>  
       <long name="_version_">1429440962948497408</long>  
     </doc>  
     <doc>  
       <str name="id">400</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>401</str>  
         <str>402</str>  
         <str>403</str>  
         <str>404</str>  
       </arr>  
       <long name="_version_">1429440962951643136</long>  
     </doc>  
   </result>  
 </response>  

This query returned both shirts and shoes. What if you only wanted shoes?

Query: Find only shoes that are in blue.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:BLUE)&fq=Type:SHOES


 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">4</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:BLUE)</str>  
       <str name="fq">Type:SHOES</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="1" start="0">  
     <doc>  
       <str name="id">400</str>  
       <str name="Type">SHOES</str>  
       <arr name="SKUS">  
         <str>401</str>  
         <str>402</str>  
         <str>403</str>  
         <str>404</str>  
       </arr>  
       <long name="_version_">1429440962951643136</long>  
     </doc>  
   </result>  
 </response>  


Query: Find only shirts that are black color and size large.

http://localhost:8983/solr/collection1/select?q={!join from=id to=SKUS}(Color:BLACK AND Size:L)&fq=Type:SHIRT


 <response>  
   <lst name="responseHeader">  
     <int name="status">0</int>  
     <int name="QTime">1</int>  
     <lst name="params">  
       <str name="q">{!join from=id to=SKUS}(Color:BLACK AND Size:L)</str>  
       <str name="fq">Type:SHIRT</str>  
     </lst>  
   </lst>  
   <result name="response" numFound="1" start="0">  
     <doc>  
       <str name="id">100</str>  
       <str name="Type">SHIRT</str>  
       <arr name="SKUS">  
         <str>101</str>  
         <str>102</str>  
         <str>103</str>  
       </arr>  
       <long name="_version_">1429440962942205952</long>  
     </doc>  
   </result>  
 </response>  

Notice that this query did not return shirts that were large and some other color. This is important.

Conclusion


Using a completely denormalized schema that uses a field to specify the type and multiValued fields to contain the id's of the related documents can represent relationships.

By using Solr's pseudo-join and filter queries you can "select" documents without getting "false positive multi-value field match problem".

This solution works out of the box with Solr 4.1. You gotta like that. :-)



1 comment:

Saqib Ali said...

Hello,

What does omitTermFreqAndPositions="true" do?

Thanks.