Wednesday, July 26, 2017

Lucy Queries and Plastic - A type of Lucene Query for ElasticSearch

Introduction

A Lucy Query is very similar to a Lucene query or a Solr query. It is not as robust at this time, nor do I feel it needs to be for my uses.

Plastic takes a Lucy Query and converts it at run-time into a boolean query for ElasticSearch.

Generating Lucene or Solr queries programmatically is relatively straightforward. Generating ElasticSearch queries programmatically is more difficult, because ElasticSearch uses a fluent (dot type) syntax in which you build objects and nest other objects inside them. It is much easier to generate a text string that specifies a query.
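To illustrate how little is needed to produce a query as text, here is a minimal sketch in Java. The class `LucyQueryText` and its `term` method are hypothetical names made up for this illustration and are not part of Plastic. The sketch assumes values contain no double quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Plastic): builds Lucy query text by plain concatenation.
final class LucyQueryText {
    // operator is "+" (required), "!" (not), or "" (not required).
    // Assumes the value contains no double quotes.
    static String term(String operator, String field, String value) {
        return operator + field + ":\"" + value + "\"";
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        terms.add(term("+", "name", "john"));
        terms.add(term("", "phone", "555-2121"));
        // Terms are separated by a single space.
        System.out.println(String.join(" ", terms));
    }
}
```

The whole query is one string, so it can be logged, stored, or assembled from user input without touching any ElasticSearch classes.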

A text string query specification also allows for the creation of a query template and facilitates Query by Example (QBE) queries.

Lucy Query

Lucy queries are similar to Solr queries and to Lucene queries, but they do not support the syntactic sugar for AND and OR found in Solr. Lucy queries use only prefix operators: required (+), not (!), and no prefix for a term that is not required.

For example:
+name:"john" phone:"555-2121"

The name is required and the phone is not. Run the above Lucy query through Plastic and the following ElasticSearch query is generated:

{
  "bool" : {
    "must" : [
      {
        "match" : {
          "name" : {
            "query" : "john",
            "boost" : 1.0
          }
        }
      }
    ],
    "should" : [
      {
        "match" : {
          "phone" : {
            "query" : "555-2121",
            "boost" : 1.0
          }
        }
      }
    ],
    "boost" : 1.0
  }
}
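The mapping from prefixes to boolean clauses can be sketched as follows. This is not Plastic's actual code, only a minimal illustration under some assumptions: the class `FlatLucyToBool` is a hypothetical name, the input has no groups or modifiers, every value is quoted, and the output is compact JSON without the default "boost" entries shown above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch (not Plastic itself): flat Lucy query -> ElasticSearch bool query JSON.
final class FlatLucyToBool {
    // One term: optional + or ! prefix, a field name, then a quoted value.
    private static final Pattern TERM = Pattern.compile("([+!]?)([\\w.]+):\"([^\"]*)\"");

    static String toBoolJson(String lucy) {
        Map<String, List<String>> clauses = new LinkedHashMap<>();
        clauses.put("must", new ArrayList<>());
        clauses.put("should", new ArrayList<>());
        clauses.put("must_not", new ArrayList<>());

        Matcher m = TERM.matcher(lucy);
        while (m.find()) {
            // "+" means must, "!" means must_not, no prefix means should.
            String occur = m.group(1).equals("+") ? "must"
                         : m.group(1).equals("!") ? "must_not" : "should";
            clauses.get(occur).add(
                "{\"match\":{\"" + m.group(2) + "\":{\"query\":\"" + m.group(3) + "\"}}}");
        }

        // Emit only the clause lists that actually contain terms.
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : clauses.entrySet()) {
            if (!e.getValue().isEmpty()) {
                parts.add("\"" + e.getKey() + "\":[" + String.join(",", e.getValue()) + "]");
            }
        }
        return "{\"bool\":{" + String.join(",", parts) + "}}";
    }
}
```

Calling `FlatLucyToBool.toBoolJson` with the Lucy query above yields a bool query with one "must" clause for name and one "should" clause for phone.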

Lucy queries support grouping of query terms. If the above query is modified to use groups like this:
+(name:"john") (phone:"555-2121")

Running this Lucy query through Plastic produces an ElasticSearch query that looks very different but is equivalent to the original query. Notice how each group becomes its own boolean query in ElasticSearch, so the query terms are nested exactly as the Lucy query specifies.

{
  "bool" : {
    "must" : [
      {
        "bool" : {
          "should" : [
            {
              "match" : {
                "name" : {
                  "query" : "john",
                  "boost" : 1.0
                }
              }
            }
          ],
          "boost" : 1.0
        }
      }
    ],
    "should" : [
      {
        "bool" : {
          "should" : [
            {
              "match" : {
                "phone" : {
                  "query" : "555-2121",
                  "boost" : 1.0
                }
              }
            }
          ],
          "boost" : 1.0
        }
      }
    ],
    "boost" : 1.0
  }
}
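One way to get this "one bool per group" behavior is a small recursive-descent parser, where each opening parenthesis starts a new recursive call. The sketch below is an illustration only and not Plastic's implementation. The class `GroupedLucyParser` is a hypothetical name, modifiers are not handled, "boost" entries are omitted, and the input is assumed to be well formed (there is no error handling).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Plastic itself): recursive-descent parsing of grouped Lucy queries.
final class GroupedLucyParser {
    private final String src;
    private int pos = 0;

    private GroupedLucyParser(String src) { this.src = src; }

    static String toJson(String lucy) {
        return new GroupedLucyParser(lucy).parseClauses();
    }

    // Parses clauses until ')' or end of input and returns them as one bool query.
    private String parseClauses() {
        List<String> must = new ArrayList<>();
        List<String> should = new ArrayList<>();
        List<String> mustNot = new ArrayList<>();
        while (true) {
            skipSpaces();
            if (pos >= src.length() || src.charAt(pos) == ')') break;
            List<String> target = should;           // no prefix: should
            char c = src.charAt(pos);
            if (c == '+') { target = must; pos++; }
            else if (c == '!') { target = mustNot; pos++; }
            if (src.charAt(pos) == '(') {
                pos++;                              // consume '('
                target.add(parseClauses());         // each group becomes a nested bool
                pos++;                              // consume ')'
            } else {
                target.add(parseTerm());
            }
        }
        List<String> parts = new ArrayList<>();
        addPart(parts, "must", must);
        addPart(parts, "should", should);
        addPart(parts, "must_not", mustNot);
        return "{\"bool\":{" + String.join(",", parts) + "}}";
    }

    // Parses field:"value" into a match query; assumes the value is quoted.
    private String parseTerm() {
        int colon = src.indexOf(':', pos);
        String field = src.substring(pos, colon);
        int open = src.indexOf('"', colon);
        int close = src.indexOf('"', open + 1);
        pos = close + 1;
        return "{\"match\":{\"" + field + "\":{\"query\":\""
             + src.substring(open + 1, close) + "\"}}}";
    }

    private void skipSpaces() {
        while (pos < src.length() && src.charAt(pos) == ' ') pos++;
    }

    private static void addPart(List<String> parts, String name, List<String> clauses) {
        if (!clauses.isEmpty()) {
            parts.add("\"" + name + "\":[" + String.join(",", clauses) + "]");
        }
    }
}
```

Because the group handling simply calls `parseClauses` again, groups inside groups need no extra code.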

Lucy supports nested queries of ElasticSearch (something that is not found in a Solr or Lucene Query). Nesting is specified by using the dot "." in the field name.

name.given:"john" name.surname:"smith"

In the above Lucy query the field values to query are nested. The ElasticSearch mapping (schema) that creates the nested field is:

"name": {
          "type": "nested",
          "properties": {
            "given": {
              "type": "text",
              "norms": false,
              "similarity": "boolean",
              "analyzer": "whitespace"
            },
            "surname": {
              "type": "text",
              "norms": false,
              "similarity": "boolean",
              "analyzer": "whitespace"
            }
          }
        }

After running the above Lucy query through Plastic this is the ElasticSearch query:

{
  "bool" : {
    "should" : [
      {
        "nested" : {
          "query" : {
            "match" : {
              "name.given" : {
                "query" : "john",
                "boost" : 1.0
              }
            }
          },
          "path" : "name",
          "score_mode" : "avg",
          "boost" : 1.0
        }
      },
      {
        "nested" : {
          "query" : {
            "match" : {
              "name.surname" : {
                "query" : "smith",
                "boost" : 1.0
              }
            }
          },
          "path" : "name",
          "score_mode" : "avg",
          "boost" : 1.0
        }
      }
    ],
    "boost" : 1.0
  }
}
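The step from a dotted field name to a nested query can be sketched like this. It is a minimal illustration, not Plastic's code: the class `NestedTermSketch` is a hypothetical name, and it assumes that every dotted field is a nested field whose path is everything before the last dot. A real converter might need the mapping to tell nested fields apart from other dotted field names.

```java
// Hypothetical sketch (not Plastic itself): wrap dotted fields in a nested query.
final class NestedTermSketch {
    static String toQuery(String field, String value) {
        String match = "{\"match\":{\"" + field + "\":{\"query\":\"" + value + "\"}}}";
        int dot = field.lastIndexOf('.');
        if (dot < 0) {
            return match;                       // no dot: plain match query
        }
        // Assumption: the nested path is the part of the field name before the last dot.
        String path = field.substring(0, dot);
        return "{\"nested\":{\"path\":\"" + path + "\",\"query\":" + match
             + ",\"score_mode\":\"avg\"}}";
    }
}
```

For the field name.given this produces a nested query with the path "name", matching the structure of the output above.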



A simple Lucy query is composed of:
  • field name followed by a colon
  • a field value, which may or may not be in quotes

For instance the following is a valid Lucy query:
  • names.given:"john mark"

A Lucy query can have modifiers for boost, fuzzy / slop, and constant score.
  • names.given:"john mark"~3 sets the slop for a phrase query to 3.
  • names.given:"john mark"^3 sets the boost to 3.
  • names.given:"john mark"^=3 sets the constant score to 3.
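The three modifiers can be picked apart with a single regular expression. The following is a minimal sketch and not Plastic's parser. The class `LucyTermModifiers` is a hypothetical name, the value is assumed to be quoted, and allowing a slop followed by a boost on the same term is an assumption of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch (not Plastic itself): parse the modifiers on a single Lucy term.
final class LucyTermModifiers {
    // field:"value", then optional ~N (slop), then optional ^N (boost) or ^=N (constant score).
    private static final Pattern TERM = Pattern.compile(
        "([\\w.]+):\"([^\"]*)\"(?:~(\\d+))?(?:\\^(=?)(\\d+(?:\\.\\d+)?))?");

    final String field;
    final String value;
    final Integer slop;           // null when no ~N is present
    final Double boost;           // null when no ^N is present
    final Double constantScore;   // null when no ^=N is present

    private LucyTermModifiers(String field, String value, Integer slop,
                              Double boost, Double constantScore) {
        this.field = field;
        this.value = value;
        this.slop = slop;
        this.boost = boost;
        this.constantScore = constantScore;
    }

    static LucyTermModifiers parse(String term) {
        Matcher m = TERM.matcher(term);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a Lucy term: " + term);
        }
        Integer slop = m.group(3) == null ? null : Integer.valueOf(m.group(3));
        Double number = m.group(5) == null ? null : Double.valueOf(m.group(5));
        // "^=" marks a constant score, a bare "^" marks a boost.
        boolean constant = "=".equals(m.group(4));
        return new LucyTermModifiers(m.group(1), m.group(2), slop,
                                     constant ? null : number,
                                     constant ? number : null);
    }
}
```

For example, `LucyTermModifiers.parse("names.given:\"john mark\"^=3")` leaves `boost` unset and sets `constantScore` to 3.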

A Lucy query can have modifiers for the operations of must, must not, and should.
  • names.given:"john mark", should match
  • +names.given:"john mark", must match
  • !names.given:"john mark", must not match

A Lucy query can have query terms grouped by parentheses, and those parentheses can have modifiers.
  • (names.given:"john mark" names.surname:"smith"), should match on the results of what is inside the parentheses.
  • +(names.given:"john mark" names.surname:"smith"), must match on the results of what is inside the parentheses.
  • !(names.given:"john mark" names.surname:"smith"), must not match on the results of what is inside the parentheses.

Grouped terms in a Lucy query can have modifiers as well.
  • (names.given:"john mark" names.surname:"smith")^3

Plastic

Plastic currently has two features.
  1. Convert Lucy Queries into ElasticSearch Queries
  2. Expand Lucy Query Templates into Lucy Queries

Converting a Lucy Query into an ElasticSearch query is done in a Java module I call Plastic. The conversion is shown above, but to keep this section self-contained here is the example again.

For example take the following Lucy Query:
+name:"john" phone:"555-2121"

The name is required and the phone is not. Run the above Lucy query through Plastic and the following ElasticSearch query is generated:

{
  "bool" : {
    "must" : [
      {
        "match" : {
          "name" : {
            "query" : "john",
            "boost" : 1.0
          }
        }
      }
    ],
    "should" : [
      {
        "match" : {
          "phone" : {
            "query" : "555-2121",
            "boost" : 1.0
          }
        }
      }
    ],
    "boost" : 1.0
  }
}

One advantage of having a query in text like a Lucy Query is that Query Templates can be made and used with a term expander to facilitate "Query By Example" (QBE).

A Lucy Query Template is:
FieldName:{EXPANSION_OPERATOR}

The expansion operators are:
  • FIRST
  • AND
  • OR
  • AND#N (where N is an integer)
  • OR#N (where N is an integer)

Single Value Expansion Operators

FIRST is replaced by the first value found

Multi-Value Expansion Operators

AND is replaced by a required term for each value.
names.given:{AND} for "John William"
+names.given:"John" +names.given:"William"

OR is replaced by a term for each value.
names.given:{OR} for "John William"
names.given:"John" names.given:"William"

AND#N and OR#N are used to limit multivalued terms to the first N values.
AND#2 means AND the first two values.
AND#3 means AND the first three values.
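The operator rules above can be sketched in a few lines of Java. This is an illustration, not Plastic's code: the class `ExpansionOperatorSketch` is a hypothetical name, and treating FIRST as a single term with no prefix is an assumption, since the text only says it is replaced by the first value found.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Plastic itself): expand one operator for one field.
final class ExpansionOperatorSketch {
    static String expand(String field, String operator, List<String> values) {
        String[] parts = operator.split("#");           // "AND#2" -> ["AND", "2"]
        String name = parts[0];
        // Without "#N" every value is used.
        int limit = parts.length > 1 ? Integer.parseInt(parts[1]) : values.size();
        if (name.equals("FIRST")) {
            limit = 1;                                  // single-value operator
        }
        // AND makes every term required; OR and FIRST add no prefix (assumption for FIRST).
        String prefix = name.equals("AND") ? "+" : "";
        List<String> terms = new ArrayList<>();
        for (String v : values.subList(0, Math.min(limit, values.size()))) {
            terms.add(prefix + field + ":\"" + v + "\"");
        }
        return String.join(" ", terms);
    }
}
```

With the values John, William, and a hypothetical third value Mark, the operator AND#2 yields required terms for John and William only.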

To expand a Lucy Query Template into a Lucy Query, Plastic takes the template and a HashMap (Dictionary) of term values. The resulting Lucy Query can in turn be transformed into an ElasticSearch query.

For example, given the following Lucy Query Template:
givenName:{AND} surname:{OR}

And the following HashMap, which maps each field name to a list of values:
{surname=[Smith, Schmidt], givenName=[john, mark]}

Plastic will expand the Lucy Query Template with the data found in the HashMap to be the following Lucy Query:
( +givenName:"john" +givenName:"mark" ) ( surname:"Smith" surname:"Schmidt" ) 
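The template expansion itself can be sketched as a regular-expression scan over the template with a lookup in the map for each slot. This is a minimal illustration and not Plastic's implementation. The class `LucyTemplateSketch` is a hypothetical name, only plain AND and OR are handled (FIRST and the #N limits are left out), and wrapping each expansion in parentheses follows the expanded query shown above.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch (not Plastic itself): expand a Lucy Query Template from a map of values.
final class LucyTemplateSketch {
    // A slot is a field name, an optional colon, then {AND} or {OR}.
    private static final Pattern SLOT = Pattern.compile("([\\w.]+):?\\{(AND|OR)\\}");

    static String expand(String template, Map<String, List<String>> values) {
        Matcher m = SLOT.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String field = m.group(1);
            // A field with no entry in the map expands to an empty group.
            List<String> found = values.getOrDefault(field, Collections.<String>emptyList());
            String prefix = m.group(2).equals("AND") ? "+" : "";
            StringBuilder group = new StringBuilder("( ");
            for (String v : found) {
                group.append(prefix).append(field).append(":\"").append(v).append("\" ");
            }
            group.append(")");
            // quoteReplacement keeps '$' and '\' in values from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(group.toString()));
        }
        m.appendTail(out);                  // copy any text after the last slot
        return out.toString();
    }
}
```

Text in the template that is not a slot is copied through unchanged, so fixed terms and expandable slots can be mixed in one template.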

Then Plastic will convert the above Lucy Query into an ElasticSearch query:

{
  "bool" : {
    "should" : [
      {
        "bool" : {
          "must" : [
            {
              "match" : {
                "givenName" : {
                  "query" : "john",
                  "boost" : 1.0
                }
              }
            },
            {
              "match" : {
                "givenName" : {
                  "query" : "mark",
                  "boost" : 1.0
                }
              }
            }
          ],
          "boost" : 1.0
        }
      },
      {
        "bool" : {
          "should" : [
            {
              "match" : {
                "surname" : {
                  "query" : "Smith",
                  "boost" : 1.0
                }
              }
            },
            {
              "match" : {
                "surname" : {
                  "query" : "Schmidt",
                  "boost" : 1.0
                }
              }
            }
          ],
          "boost" : 1.0
        }
      }
    ],
    "boost" : 1.0
  }
}

Conclusion

Lucy Queries allow for straightforward query generation at run-time. These Lucene-like queries are then converted to ElasticSearch queries by the Plastic module.

Once you have a text-based query language that can be converted into ElasticSearch queries, you can create Query Templates and perform Query By Example (QBE) queries in a straightforward manner.

This makes ElasticSearch more usable for those that have systems where the user can specify arbitrary queries via some mechanism such as a UI.

Lucy Query and Plastic are terms invented in June 2017 for querying ElasticSearch in a way that resembles querying Lucene and Solr.