Friday, July 15, 2016

Scouting and Reconnaissance in Software Development


by Geoffrey Slinker
v1.0 October 2004
v1.1 January 2005
v1.2, v1.3, v1.4 July 2005
v1.5 March 24, 2006

Maverick Development

Abstract

Scouting and reconnaissance are two well-known methods of discovery. By these means, information and experience are gained when faced with the unknown. Experience is critical to writing good software: it allows you to correctly identify problems and address them. Scouting and recon in software development are a great way to gain experience and avoid the pitfalls of the unknown.

Introduction

In the well-known book ‘The Mythical Man-Month’, Frederick P. Brooks states:
Where a new system concept or new technology is used, one has to build a system to throw away, for even the best planning is not so omniscient as to get it right the first time. Hence plan to throw one away; you will, anyhow.
As the years passed and systems grew in size and complexity, it became apparent that building a "throw away" was not the most efficient approach. In the 20th anniversary edition of the same book, Brooks states that developing a throwaway version is not as efficient as iterative approaches to software development.
In Extreme Programming Explained Second Edition, Kent Beck states:
"Defect Cost Increase is the second principle applied to XP to increase the cost-effectiveness of testing. DCI is one of the few empirically verified truths about software development: the sooner you find a defect, the cheaper it is to fix it."
Scouting and recon techniques are used to discover defects through experiments and to keep those defects out of the "real" software entirely. These techniques work within phased (phasic) development methodologies as well as within iterative methodologies, and they yield knowledge and experience through their use.

Gaining Experience

There are many software development activities concerned with gaining experience. Some of these activities include creating proofs of concept, prototyping, and experimenting. I will refer to all of these activities as experiments.
How much effort should be placed in an experiment? Enough to gain the experience needed to get you to the next step.

Software Scouting

“Scouting” will be the metaphor. During the exploration of the American frontier, scouts were sent out ahead of the company to determine the safest path through unknown and hostile territory. Through software “scouting missions” one can save time and money, and reduce the risks to the company.

Brooks’ first statement concerning building a "throw away" is akin to exploring the entire route first and then moving the company. His revised statement concerning iterative development is akin to scouting out a few hours (or days) ahead and returning to guide the company. This pattern of short scouting trips would continually repeat, making the technique both iterative and incremental. Through the scouting metaphor you can gain a certain feel for why building a "throw away" version is more costly than iterative development.

Scouting Tools

There are many ways to explore the unknown, and these activities have many similarities. One of the key differentiators is the stage of software development in which the activity occurs. In the following, various "tools" for scouting are defined, along with the stage in which each is typically used.
A "Proof of Concept" occurs after a solution has been conceptualized. Investigation is needed to gain confidence and verify the viability of the solution.

A "Prototype" is made after a design has been made. Investigation is needed to validate that the result of the design solves the problem. In software prototyping development activities are scaled back. In engineering prototypes may be scaled functioning models. In software there is no physical dimension so development activities are scaled back which include minimal effort for robustness and usually only implementing the “happy path” of the functionality. Also techniques to reduce coupling are skipped and cohesion is ignored as much as possible (Even though these activities are skipped the experience of prototyping bring to light how the software components should be coupled and an overall domain definition emerges that allows for better cohesion).
Ed Mauldin explains prototyping as thus:
“Prototyping is probably the oldest method of design. It is typically defined as the use of a physical model of a design, as differentiated from an analytical or graphic model. It is used to test physically the essential aspects of a design before closing the design process (e.g., completion and release of drawings, beginning reliability testing, etc.). Prototypes may vary from static "mockups" of tape, cardboard, and styrofoam, which optimize physical interfaces with operators or other systems, to actual functioning machines or electronic devices. They may be full or sub-scale, depending on the particular element being evaluated. In all cases, prototypes are characterized by low investment in tooling and ease of change.”

An "Experiment" occurs after software modules have been developed. Investigation into their behavior under varied conditions is needed. An experiment is conducted to observe the behavior.
A "Mock Object" is created during software implementation. Components have been developed and investigation into their behavior needs to be done. To isolate these components from the effects of other components the other components are replaced with "mocks" that have simple and specific behavior.
A "Driver" is created during software implementation. Components have been developed and investigation into their interfaces and usability need to occur. A driver is developed to interface with and drive the component. The interfaces or entry points of the components are confirmed correct and the pre-conditions of the components are exercised. The driver can validate the post-conditions of the component as well.
"Stub" is created during software implementation. Functionality has been developed and investigation of the code paths needs to occur. Called interfaces are developed with the simplest means in order to return specific results and exercise the code paths of the caller. These simple interface implementations are stubs.
"Simulation" is typically created after the system is implemented. A deliverable needs to be tested in various environments and conditions. A simulation of an environment is developed and it is used for testing. Common examples are simulated users, simulated load, simulated outages, and such.

When to Scout

Remember, scouting activities address the issue of gaining experience in unknown territory. These activities are not necessary when experience is present. Simply said, “If you know how to do the job, then do it!”

When one is in unknown territory scout ahead for information, then come back and apply the knowledge gained. Have enough discipline not to get distracted by the sights along the way. Stay focused, travel light, and get back to camp as quickly as possible.

Can you afford not to scout ahead? The answer to this question only comes at the end of the journey. Did you make it to your destination or not?

Scouting for Phasic Methodologies

One reason that experiments work is that they address issues and concerns in context and as they occur. It is a "learn as you go" approach. Below are some scenarios in which scouting can be used in a traditional phased (phasic) methodology.

Phase 1: Analysis and Requirements.

•    Paper prototypes of the user interface.
•    Proof-of-concept of a requirement (e.g., the database must support 500 simultaneous connections).

Phase 2: Design.

•    Refined paper prototypes of the user interface. Paper models of the architecture and model (e.g., UML).

Phase 3: Implementation.

•    Develop an experiment for the “happy path” to discover boundaries and interfaces.
•    Create prototypes ahead of implementing frameworks so that the framework's approach can be reviewed.

Phase 4: Testing.

•    Create experiments to test scenarios.
•    Create testing harnesses that allow for proxy users (a proxy user can be a user simulated by a computer program; see the sketch after this list).
•    Simulate extreme conditions such as system load.
(Testing is scouting ahead of the user to make sure the user’s experience will be a good one.)
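For illustration, here is a minimal Java sketch of the proxy-user idea. All names are hypothetical, and doWork is a stand-in for a real request into the system under test; each pooled thread plays one simulated user issuing requests.

 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.TimeUnit;

 public class LoadSimulation
 {
   public static void main(String[] args) throws InterruptedException
   {
     //50 simultaneous proxy users
     ExecutorService users = Executors.newFixedThreadPool(50);
     //500 total requests spread across the proxy users
     for (int i = 0; i < 500; i++)
     {
       final int userId = i % 50;
       users.submit(() -> doWork(userId));
     }
     users.shutdown();
     users.awaitTermination(1, TimeUnit.MINUTES);
   }

   static void doWork(int userId)
   {
     //placeholder for a real call, e.g. an HTTP request to the application
     System.out.println("user " + userId + " completed a request");
   }
 }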

Scouting for Iterative Methodologies

User Stories

  • Create a proof of concept to verify the User has conveyed their desires.

Project Planning

  • If the user story involves a User Interface, create paper prototypes of the interface to stimulate user input and direction.

Release Planning

  • Create a prototype to identify dependencies to facilitate iteration planning.

Iteration Planning

  • Create design prototypes using a modeling language such as UML.

Iteration

  • Create stubs, drivers, and mock objects to increase confidence in the behavior of isolated units.
  • Create an experiment to observe object behavior.
  • Create a simulation to test things like performance under a heavy load.
This list is meant to be thought-provoking, not complete. The idea behind scouting is to perform some scouting activity when faced with the unknown. When experiments are done in conjunction with an iterative development methodology and the customer/user takes an active role, the experiments are "lighter" than they would be in a phasic development methodology. With the customer present, one can prototype a user interface with a white board and some drawings. If the customer is not present, then a prototype for a user interface is usually mocked up with some kind of computer-aided drawing package, or even a "quick and dirty" user interface is developed with a GUI-building tool or scripting language.

Benefits of Scouting

  1. Scouting brings light to a situation.  Through scouting activities estimations become more accurate. The accuracy comes from the application of the experience, not from an improved ability to predict the future.
  2. Scouting reduces coupling and improves cohesion.  When writing software in the light of experience, the coupling between objects is reduced, and the experience unifies the system's terms and metaphors, which increases cohesion.
  3. Scouting builds trust and confidence by eliminating incorrect notions and avoiding drastic changes in design and implementation.

Risks of Software Scouting

  1. Is management mature enough to allow the proper use of an experiment and not try to “ship” the prototype and undermine the effort?
  2. Is development mature enough to keep features from creeping into the product just because an experiment revealed something interesting?

Project Management Ensures Adequate Software Recon

Project Management should scout and see if their development environment can support activities that rapidly gain experience. Probing questions include:
  • Are the software developers aware of all of the activities that can lead to experience?
  • Are the stakeholders aware of the benefits of prototypes and experiments?
  • Is everyone aware of the risks of not doing recon and the risks of doing recon? Remember, one of the risks of a prototype is that sometimes people try to ship it!
An interesting exercise would be to listen for concerns expressed by developers and ask them what activity would address their concern. Some concerns expressed by developers that can be addressed through recon are:
  • “If I just had time to write this right”
  • “I don’t think we know how difficult this is going to be”
  • “I really don’t have any idea how long this is going to take”
When a concern is expressed ask the developer what they would do to address it. Listen for solutions that bring experience and shed light.

Conclusion

Experience is key to writing good software. The sooner you discover a problem and correctly fix it, the cheaper the fix is. Scouting ahead in software by using prototypes and experiments is a great way to discover the right path without risking the entire company to the unknown.

Design By Use



"Design By Use" Development

by Geoffrey Slinker
version 1.6
March 25, 2006
April 22, 2005
July 1, 2005
July 25, 2005
August 23, 2005

Maverick Development

Abstract

"Design by Use" development (DBU) improves team resource utilization, software design, software quality, and software maintenance through a set of proven industry methods that have been shown to work together synergistically.

Introduction

Are you concerned with keeping your development staff adequately tasked? Would you like to improve design quality by reducing coupling, improving cohesion, and communicating the domain model? Is the quality of your software important? Do you maximize the R.O.I. of your software by using the software for as many years as possible? If you answered no to any of these questions, are you from another planet?

As part of my career I have specialized in the rendering of concise solutions to problems. Whether the problem was to be solved with code or with a methodology I have always strived to take the problem that was presented, boil it down to the essence, and provide a solution. I have studied software engineering processes now for over 20 years. I have distilled the essence of what I feel are the most useful methods to use as a foundation to build a process that is efficient and improving.

I have recently thrown all of the traditional methodologies, along with the agile methodologies that I know, into my soup pan and turned up the heat! Then I took the results and have been experimenting with them. It is like a soup that has been cooked in a big pot: you can taste each of the different ingredients if you try, or you can ignore the ingredients and just enjoy the combined flavor.

This paper presents a methodology for development that can work as a subcomponent of any encompassing methodology and deliver results in the areas mentioned in the abstract.

Executive Summary

Design by Use (DBU) follows these basic steps:
1) Create a high-level design
2) Identify systems and subsystems
3) Identify messages or calls between systems and subsystems
4) Use these identified messages or calls to specify to each team what they should code and how the message will be made (the message/method signature)

For example: there are two subsystems, S1 and S2, and two teams, T1 and T2. T1 is to write S1 and T2 is to write S2.
S1 calls into S2; let's suppose the message is GetStuffFromS2.
Team 1 writes a Usage Example:
void Main() 
{ 
 MyData data = GetStuffFromS2(1);
 assert(data.value == 3);
}

Team 1 gives this Usage Example to Team 2. T2 uses it to direct what they will develop, and the order of development naturally flows from this point. So T2 implements GetStuffFromS2 in their subsystem S2 and notifies T1 when it is available (or, if they are using unit tests, T1 will know GetStuffFromS2 is available when the build light goes green for that test).
S1 is immediately integrated with S2, and better still, it is integrated in a great way: the way the consumer wants to use the system.
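To make that concrete, here is a sketch (hypothetical names, deliberately minimal) of how Team 2 might satisfy the usage example. The signature comes straight from Team 1's example, and in the spirit of design by contract the implementation checks its pre-condition while the usage example asserts the post-condition:

 public class S2
 {
   public static MyData GetStuffFromS2(int id)
   {
     //pre-condition check: the caller must pass a valid id
     if (id < 1)
     {
       throw new IllegalArgumentException("id must be positive");
     }
     MyData data = new MyData();
     //just enough to satisfy the usage example's assert(data.value == 3)
     data.value = 3;
     return data;
   }
 }

 class MyData
 {
   int value;
 }

As soon as this compiles and Team 1's usage example passes, S1 and S2 are integrated.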
DBU goes beyond Test Driven Development (TDD) and Design by Contract (DbC). DBU is concerned with large software systems, multiple teams, coordination, and integration. TDD is a code design activity. DbC is a contract-driven process built around invariants, pre-conditions, and post-conditions.

The Approach

The "Why":
The problem is keeping all software development teams working and not waiting.
The "When":
When a large software system is being developed with many systems and subsystems and each of these is developed by different teams.
The "How":
The high-level design of the system is done with any method that the company agrees upon. A custom diagramming language, such as a simplified UML, works fine. Subsystems are identified and teams are assigned to each subsystem. The data flows, invocations, calls, dependencies, or whatever you want to call them, are identified at the subsystem boundaries. For example, "My subscription subsystem will need to ask the pricing subsystem for a price given a product Id."

At this point the development pump must be primed. All of the teams have their requirements. It doesn't matter if you use Use Cases or User Stories or another way to specify requirements. In an agile methodology this would be one of the last activities of Release Planning. The teams meet together as one, and the functionality that will be delivered during this release is decided upon. Each system and subsystem participating in this release is identified. If there are systems that are not part of this release, the teams responsible for them will not be needed and can work on other systems. Each call into an external system or subsystem that had been identified is listed. The "caller" starts out by writing a usage example. The usage examples are created for the calls identified from the high-level design (calls that cross system boundaries). The usage examples that call into subsystems other than your own are delivered to the proper team. In all software development there are upstream/downstream situations. (I do not go into the perils of being downstream in this paper.) All of the usage examples will be used to drive the design and development of what's inside a subsystem. This is the low-level design (code) and includes the details not covered in the high-level design (possibly UML).

When all of the usage examples that can be identified for the subsystem boundaries have been written, the teams can coordinate and prioritize the remaining development tasks. This gives a clear picture of who is doing what and how they should do it. There is no waiting, because each usage example carries with it sample data to drive the call. Therefore no one is waiting for someone upstream to finally call their code.

The usage examples test post-conditions after the call into the subsystem. The implementation in the subsystem checks pre-conditions, invariants, and post-conditions. If you taste the flavor of design by contract in this soup, you are correct.

From what has been stated so far in the DBU approach the design has presented the overall domain, has identified sub-domains, has exposed the boundaries and entry points, and has allowed for efficient use of resources and scheduling.

Quality is improved through the approach as thus far stated. Because usage examples drive development, integration has already been addressed: instead of "integrate often", this approach is "integrate immediately". As soon as a component is finished and satisfies its usage example it can be used by consumers. Through this approach the design is very cohesive, because sufficient consideration was given to the domain model and the boundary points. The idea that cohesive designs and correct models just emerge from some primordial ooze is a misunderstanding. Instead they come from the application of knowledge, consideration, experimentation, and application. This approach uses these four factors continuously.

With the usage examples defined and expectations set, there is no need for teams to reinvent the wheel. Too often teams will not use another team's code because its quality is suspect, its delivery date is unknown, or the solution is a near fit but not a good fit. Eliminating these concerns is as much a social problem as a procedural one. The approach specified here addresses the procedure.

So far, improvements in resource utilization, design, and quality have been described. Finally, this approach improves the R.O.I. by facilitating software maintenance. By running the usage examples, a developer can isolate a piece of code and step through it to understand a legacy system. Often documentation is lost or out of synchronization with the software, and a developer just wants to know what the system currently does. When modifying an existing system it is essential to know that changes have not affected the system in undesired ways. By running the usage examples in the role of a regression test, one can verify that the effects of a change are isolated to the desired areas. Since each call from one system into another is specified, the designer's and programmer's intent is specified. This specification can be used to replace entire systems and subsystems. Suppose we want to replace the pricing subsystem our subscription subsystem uses: the usage examples show exactly where to make the incision.

Summary


1)    Improves team resource utilization
a.    By specifying interfaces through usage examples, one team can clearly specify to another team the functionality that is desired. This is immediate integration. Through this there is less rework during integration than traditionally comes at the end of the development phase.
2)    Improves quality
a.    Eliminates issues with late integration
b.    Builds confidence in subsystems and reduces "silo-ing" and duplicated code.
3)    Improves design
a.    Rapidly defines interfaces and exposed entry points.
b.    Reduces coupling.
c.    Increases cohesion.
i.    through communication
ii.    through the dissemination of domain concepts
iii.    through the unification of domain models
4)    Improves software maintenance
a.    By running the usage examples as a regression test one can step through code that is not documented or that is not behaving according to documentation.
b.    Usage examples are run after every modification (small sets of changes) to verify that the changes have not caused problems through unknown side effects and couplings.

Conclusion

"Design by Use" development (DBU) improves team resource utilization, software design, software quality, and software maintenance through a set of proven industry methods that have been shown to work together synergistically.

When complex software has many systems and subsystems developed by several teams, it is difficult to schedule the order in which each part will be developed, and integration is often done late. Immediate integration is the key activity.
To flesh out more of the entire process please read "Reporting for Accountability".

Friday, April 15, 2016

Java 8 java.time LENIENT date parsing

I was porting some Java code that uses Joda Time to the new Java 8 Time package. I needed a way to parse a date of the format "YYYY-MM-DD" and be lenient, specifically on leap year date mistakes.

I also need this to be a hash of the day of the year, so that March 1, 2000 (a leap year) and March 1, 2001 (not a leap year) hash to the same value.

The key piece of code is this:
DateTimeFormatter formatter = DateTimeFormatter.ISO_LOCAL_DATE.withResolverStyle(ResolverStyle.LENIENT);
 
Notice this: ResolverStyle.LENIENT
 
 import java.time.LocalDate;
 import java.time.Year;
 import java.time.format.DateTimeFormatter;
 import java.time.format.ResolverStyle;

 public class DateUtils
 {
   //This is a hash of the day of the year.
   public static int hashDayOfYear(String stringDate)
   {
    DateTimeFormatter formatter = DateTimeFormatter.ISO_LOCAL_DATE.withResolverStyle(ResolverStyle.LENIENT);
    LocalDate date = LocalDate.parse(stringDate, formatter);
    int dayOfYear = date.getDayOfYear();
    //Leap year stuff
    boolean isLeap = Year.isLeap(date.getYear());
    //February 28th is the 59th day of the year.
    if (isLeap)
    {
      if (dayOfYear == 60)
      {
       //This allows for exact matching of dates but interferes with range matching
       dayOfYear = 366;
      }
      else if (dayOfYear > 59)
      {
       dayOfYear--;
      }
    }
    return dayOfYear;
   }
 }
 import org.junit.Test;

 public class DateUtilsTest
 {
   @Test
   public void testOne()
   {
    //Feb 28 of a leap year
    String date = "2000-02-28";
    int dayOfYear = DateUtils.hashDayOfYear(date);
    assert dayOfYear == 59;
    //Feb 29 of a leap year maps to 366
    date = "2000-02-29";
    dayOfYear = DateUtils.hashDayOfYear(date);
    assert dayOfYear == 366;
    //Feb 29, but not a leap year; needs to be lenient and map to March 1
    date = "2001-02-29";
    dayOfYear = DateUtils.hashDayOfYear(date);
    assert dayOfYear == 60;
    //Feb 30 of a leap year; needs to be lenient and map to March 1
    date = "2000-02-30";
    dayOfYear = DateUtils.hashDayOfYear(date);
    assert dayOfYear == 60;
   }
 }
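A side note if you reuse this test: bare assert statements only fire when the JVM runs with assertions enabled (the -ea flag); under JUnit you could use assertEquals instead.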

Thursday, February 11, 2016

Personalized Bash Shell, VIM, and other settings...

I want to capture my very simple, yet I think clean, settings I like to use.

Here is my .bashrc

---------------------------------------------------------
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions
LS_COLORS='di=1;33' ; export LS_COLORS

PROMPT_DIRTRIM=3
export PROMPT_DIRTRIM

PS1="[\u \w]\$"

---------------------------------------------------------

This sets some colors and makes the prompt show the directories in a way that I find useful because I am always deep into the directory structure.

[user1 ~/.../opt/solr-5.4.1/example]$

Here is my .vimrc

---------------------------------------------------------
colorscheme desert
---------------------------------------------------------

BASH Script to remove a trailing slash

I was working with a script that was called during an RPM build (from the SPEC file). I passed a directory as the first parameter to the script. The directory ended with a trailing slash, and I needed to remove that slash since all of my path variables started with a slash.

RPM_BUILD_DIR=""
if [ -n "$1" ]
then
   RPM_BUILD_DIR=${1%/} #REMOVE TRAILING /
fi

So, the bash script checks to see if we have a parameter in the first position, and if so removes the trailing slash.

If the directory string was "/usr/local/bin/" the result is:
"/usr/local/bin"

I searched for examples to do this and they all seemed overly complex. Maybe this one is too simple for some situations, but it works for my needs.

Thursday, February 04, 2016

BASH Script to Manipulate IP Addresses and Port Numbers

This post is just a place to store some bash script for reference. I don't work in bash often enough to remember how things work, so I make little snippets and such for reference. I do the same thing for regex and SQL.

Here are some of the things I do to manipulate IP addresses in bash scripts. Since I work a lot with Solr, specifying zookeeper machines and lists of servers for SolrCloud, I am always manipulating IP addresses.

#!/bin/bash
#set -x
echo -----------------BEGIN------------------------
servers="10.1.1.1:8080,10.1.1.2:8080,10.1.1.3:8080"
declare -a serversArray=("10.1.1.1:8080" "10.1.1.2:8080" "10.1.1.3:8080")

echo servers:"      "${servers}
echo serversArray:" "${serversArray[@]}

echo ----------------------------------------------

#use a for loop to convert the serversArray to a comma separated string
#Since arrays can be sparse, that is their indices do not have to be sequential
#get the indices as an array and iterate over the indices.
result1=
arrayIndices=(${!serversArray[*]})
for index in ${arrayIndices[*]}
{
   result1=${result1}${serversArray[$index]}
   if [ $index != ${arrayIndices[*]: -1} ]   #This gets the last element of an array
   then
      result1=$result1, #only append comma if this isn't the last entry
   fi
}
echo Array to comma separated string using for loop:" "${result1}

#convert array to comma delimited string using character substitution
cds=${serversArray[@]};
cds=${cds// /,}
echo Array to comma separated string using character substitution:"   "${cds}

#convert the comma delimited string into an array
atemp=($(IFS=','; x=($servers); echo ${x[@]}))
echo Comma delimited string to array:"   "${atemp[@]}
for a in "${atemp[@]}"
{
   echo []"${a}"
}
echo ----------------------------------------------

#remove the port from the servers, only works for port numbers of 4 digits
echo Remove Port Numbers from IP Addresses
servers2=${servers//:????/}
echo ${servers2}

#now use sed to remove port numbers from servers
echo ${servers} | sed -e "s/:[0-9]*//g"

echo ----------------------------------------------
#create an ip address with incrementing port numbers
echo Create ip addresses with incrementing port numbers
echo 10.1.1.0:{10001,10002,10003,10004}
echo 10.2.2.2:{10001..10004}
echo 10.1.1.0:{10001,10002,10003,10004},
#capture echo into var and strip the trailing comma
ips=$(echo 10.1.1.0:{10001,10002,10003,10004},)
ips=${ips/%,/}
echo ${ips}

Here is the output:
-----------------BEGIN------------------------
servers:      10.1.1.1:8080,10.1.1.2:8080,10.1.1.3:8080
serversArray: 10.1.1.1:8080 10.1.1.2:8080 10.1.1.3:8080
----------------------------------------------
Array to comma separated string using for loop: 10.1.1.1:8080,10.1.1.2:8080,10.1.1.3:8080
Array to comma separated string using character substitution:   10.1.1.1:8080,10.1.1.2:8080,10.1.1.3:8080
Comma delimited string to array:   10.1.1.1:8080 10.1.1.2:8080 10.1.1.3:8080
[]10.1.1.1:8080
[]10.1.1.2:8080
[]10.1.1.3:8080
----------------------------------------------
Remove Port Numbers from IP Addresses
10.1.1.1,10.1.1.2,10.1.1.3
10.1.1.1,10.1.1.2,10.1.1.3
----------------------------------------------
Create ip addresses with incrementing port numbers
10.1.1.0:10001 10.1.1.0:10002 10.1.1.0:10003 10.1.1.0:10004
10.2.2.2:10001 10.2.2.2:10002 10.2.2.2:10003 10.2.2.2:10004
10.1.1.0:10001, 10.1.1.0:10002, 10.1.1.0:10003, 10.1.1.0:10004,
10.1.1.0:10001, 10.1.1.0:10002, 10.1.1.0:10003, 10.1.1.0:10004

Monday, January 25, 2016

Introduction to Solr 5 SolrCloud


Collection - a complete logical index.
Collections are made up of one or more shards and a replication factor.
Shards have one or more replicas as defined by the replication factor.
Each replica is a core.



Special Note

The term “replica” in SolrCloud has caused me confusion. This post attempts to clarify the issue.

Here are my words trying to clarify the above terms:

Collection - a complete logical index.

Collections are made up of one or more shards and a replication factor.
There is always at least one instance of a Shard (this is a replication factor of one). There can be more than one instance of a Shard for redundancy (a replication factor greater than one).
An instance of a shard is called a core.

Collection

A SolrCloud Collection is a complete logical index.

A Collection can be divided into shards. This allows the data to be distributed.



The above picture represents a Collection with one shard and a replication factor of one. This results in a collection with one shard, and that one shard IS the one replica. Herein lies the confusion when trying to describe SolrCloud: when I read the word replica, I immediately imagine an original and a copy, an original and a replica.

Since there is always at least one replica, the terminology can be confusing. When I first started trying to build a mental picture of SolrCloud, I erroneously started with the idea that there was a master with replicas, and therefore that a replication factor of one would be a master and a replica. But that is not the case. What I thought of as a master is in reality “replica one”. Therefore, if you want your original index with one backup / failover copy, what you need to say is “I want a replication factor of two.”

Therefore I feel the best way to describe it is like this:
The above picture represents a Collection with one shard and a replication factor of one. This results in a collection with one shard, and that one shard is the only copy / instance of the data. Each instance of a shard is called a Core.

Shard

A Shard is a division of a Collection (complete logical index). Therefore a Shard is a portion or slice of a Collection (complete logical index).

Above represents a Collection sliced or divided into eight Shards.
Why would you want more than one shard? One reason would be if the total size of the collection is too large to fit on one computer.

In a Collection with one Shard all of the data will be in that single shard. For example, if you are doing a dictionary then with one shard the words from A to Z all go into the single shard. If you have two shards then the data for shard one could be A to M and the data for shard two could be N to Z.

Replica

Shards can be duplicated by using a replication factor. The replication factor specifies how many instances of each shard should exist. Each instance of a shard is called a Core. The confusion lies in that a Core is also called a Replica.

From the Solr documentation:

Collections can be divided into shards. Each shard can exist in multiple copies; these copies of the same shard are called replicas. One of the replicas within a shard is the leader, designated by a leader-election process. Each replica is a physical index, so one replica corresponds to one core.

The replication factor multiplied by the number of shards gives the total number of shard instances, or, better said, the total number of cores.

Shard instances show up in the Solr dashboard as “Cores”. In SolrCloud a Replica and a Core are the same thing.
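To tie the numbers together, here is an example (assuming a stock SolrCloud node on the default port; the collection name is made up) of creating a collection through the Collections API. Two shards times a replication factor of two should yield four cores:

$ curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&maxShardsPerNode=4"

The maxShardsPerNode parameter is raised here only so that a single test node is allowed to host all four cores.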




The picture above shows the “gettingstarted” collection with two shards and a replication factor of one, which results in two shards each with one core / replica. Since there are two shards, each with one core / replica, there are a total of two cores / replicas. That is why you see two “cores” in the Solr Dashboard.

It is interesting to see the state.json for the “gettingstarted” collection.

"gettingstarted": {
"maxShardsPerNode": "2",
"router": {
"name": "compositeId"
},
"replicationFactor": "1",
"autoAddReplicas": "false",
"shards": {
"shard1": {
"range": "80000000-ffffffff",
"state": "active",
"replicas": {
"core_node2": {
"state": "active",
"core": "gettingstarted_shard1_replica1",
"node_name": "10.211.1.126:8983_solr",
"base_url": "http://10.211.1.126:8983/solr",
"leader": "true"
}
}
},
"shard2": {
"range": "0-7fffffff",
"state": "active",
"replicas": {
"core_node1": {
"state": "active",
"core": "gettingstarted_shard2_replica1",
"node_name": "10.211.1.126:8983_solr",
"base_url": "http://10.211.1.126:8983/solr",
"leader": "true"
}
}
}
}
}



Below is a collection that has eight shards with a replication factor of three. What is the total number of cores / replicas? There are 24 cores / replicas.


Just remember, if you prefer the term Replica to the term Core, that “replica 1” is just the first instantiation of a shard and “replica 2” is the second instantiation of the shard.

Starting Solr

Please be familiar with “Getting Started with SolrCloud”.

What I am about to describe are not steps to take Solr into production. I am not setting up Linux users and permissions; this is just quick and dirty, and I run as the root user while doing it.

Download and untar/unpackage Solr. Follow the steps found in the link above, “Getting Started with SolrCloud”, or just run this:

$ bin/solr -e cloud -noprompt

Point a web browser to the Solr Dashboard (by default at http://localhost:8983/solr).



Zookeeper


If everything is running correctly, then we are going to check and see what is in zookeeper. If it isn't running, delete everything and start over. If you used the -noprompt option to start Solr, follow the steps on the webpage and include the -V option with the command.

The first way to examine part of what is in zookeeper is through the Solr Dashboard.

Click on the left panel as shown here:




In Solr’s install directory, go to:
$ cd server/scripts/cloud-scripts

Run:
$ ./zkcli.sh -zkhost localhost:9983 -cmd list | less

You will see how the Solr Dashboard is showing what is in zookeeper.

Now download zookeeper and install it.

Go to the zookeeper bin directory and run:
$ ./zkCli.sh -server localHost:9983

Just because Solr is running the embedded zookeeper doesn’t mean you can’t connect to it.
Note that zkCli.sh is completely different from the shell found in Solr with the name zkcli.sh.

At the zk prompt, do the following:
[zk: localHost:9983(CONNECTED) 1] ls /

You will see the following:
[configs, security.json, zookeeper, clusterstate.json, aliases.json, live_nodes, overseer, overseer_elect, collections]

You can examine the values, for example:
ls /live_nodes
[10.211.1.126:8983_solr, 10.211.1.126:7574_solr]

Why did I talk about zookeeper now? Because it is essential to understand where things are being stored and who retrieves the data.

Add a Node to the Cluster


Now, install Solr on another machine. Obviously it needs to be able to see the existing Solr machines on the network.

I decided to install Solr on a Windows machine since the currently running Solr is on a CentOS machine.

I started the Command Prompt by specifying “Run as Administrator”. At this time I am not looking for permission issues, I am avoiding them. I want to see Solr working.

From the DOS Command Prompt while in the Solr bin directory run:
solr.cmd start -c -z 10.211.1.126:9983
Obviously you use the I.P. address of your machine, not mine.

The -c means start in SolrCloud mode.
The -z specifies the zookeeper. Notice it is the embedded zookeeper already running in Solr.
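For the same step on Linux or Mac, the equivalent command (the port here is made up; any free port works) would be:

$ bin/solr start -c -z 10.211.1.126:9983 -p 8984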

Back to zookeeper, from the zookeeper command prompt (the real zookeeper, the capital C zookeeper, the zkCli.sh):
[zk: localHost:9983(CONNECTED) 3] ls /live_nodes
[10.61.130.207:8983_solr, 10.211.1.126:8983_solr, 10.211.1.126:7574_solr]
[zk: localHost:9983(CONNECTED) 4]

Notice that now there is a new entry in the live nodes.

You can also see this in the Solr Dashboard of the original Solr instance.




Now go to the Solr Dashboard of the original Solr instance, not the one you just started. Notice in the top right the link “Try New UI”.

You will see this:


Click on the “gettingst…” link in the middle pane.



Now click on the Shards on the right of the middle panel.



Now click “add replica” on the right for shard1.



Select the IP address of the “new” instance (the one I started on the Windows machine is the last entry) and click “Create Replica”.




Now go to the Solr Dashboard of the new instance and go to the “Cloud” (click Cloud in the left panel).


Notice on the new instance that it is running and is hosting a Core / Replica of Shard 1.

What did these actions create on the new instance?

D:\solr-5.2.1\server\solr\gettingstarted_shard1_replica3

Inside of that new directory there is a core.properties file that contains this:

#Written by CorePropertiesLocator
#Wed Jan 27 17:40:26 UTC 2016
name=gettingstarted_shard1_replica3
shard=shard1
collection=gettingstarted
coreNodeName=core_node5

On the original instance of Solr add another replica / core for shard2 on the new Solr instance.

When finished the SolrCloud for the gettingstarted collection should look like this:


Adding Data

The Solr documentation explains how to add data to the gettingstarted collection.

Before we add data we need to update the schema. The schema is stored in zookeeper and is called managed-schema. You can see it at this path in zkCli.sh (the real zookeeper shell):

[zk: localHost:9983(CONNECTED) 4] get /configs/gettingstarted/managed-schema

In the Solr install directory go to example/films and read README.txt. You will see that you need to update the Solr schema. You can run the following command or go into Solr and add fields through the Solr Dashboard.

curl http://localhost:8983/solr/gettingstarted/schema -X POST -H 'Content-type:application/json' --data-binary '{
   "add-field" : {
       "name":"name",
       "type":"text_general",
       "stored":true
   },
   "add-field" : {
       "name":"initial_release_date",
       "type":"tdate",
       "stored":true
   }
}'

I updated the schema through the UI of the Solr Dashboard.





After updating the schema I reloaded the cores through the Solr Dashboard.





On the original (first) instance of Solr (in my case the instance running on CentOS) run the command:

bin/post -c gettingstarted example/films/films.json

Next go to the Solr Dashboard of the original instance, select the gettingstarted collection, and execute the default query.





It looks like the post put 1,100 records into the database.

NOTE:
If you don’t update the schema before running the post command you will see all kinds of exceptions and errors. I think this is because the post tries to auto-detect field types and update the schema at runtime, and it erroneously picks the wrong field type.

Checking Replication

Now I am wondering whether the replicas on the second instance of Solr (for me, the instance started on Windows) actually have the data. To check this I am going to remove the cores / replicas on my first instance of Solr (running on CentOS) through the Solr Dashboard.




Just click the red X next to each core / replica running on the first box.

Now go to the Cloud panel and see if the original box has been removed.



Everything looks as expected.

Now select the collection “gettingstarted” and execute the default query.




There are still 1,100 records. It looks like everything is working correctly.

Just to double check, go into the index directory and see if there are any files on the original instance.

$ cd example/cloud/node1/solr/gettingstarted_shard1_replica2/data/index

There are no index files, just a write.lock file left there.