I manage an internal build system that creates a simple text file of key-value pairs containing build statistics. These statistics are then processed by a fairly gnarly shell script. When I first saw this years ago, it looked like a perfect candidate for ElasticSearch, and I finally had time to look into it.
ES is a text database and search engine, which is useful on its own, but it also has a neat frontend called Kibana that can be used to query and visualize the data. Since I manage the system, there was no need to set up Logstash to preprocess the data; I could just convert it to JSON myself.
The documentation for ElasticSearch covers installation using Docker, but there is one gotcha. The webpage that lists all the available docker images at https://www.docker.elastic.co/ only lists the images that contain the starter X-Pack with a 30-day trial license. I ended up using the docker.elastic.co/elasticsearch/elasticsearch-oss image, which contains only open-source content. The same goes for the Kibana image.
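For reference, pulling the OSS images directly looks like this; the 6.2.4 tag matches the compose file further down:

# Pull the open-source images (no bundled X-Pack trial)
docker pull docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.4
docker pull docker.elastic.co/kibana/kibana-oss:6.2.4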
I wanted to run ES and Kibana on the same server, but if you do that with two separate docker run commands, Kibana's auto-configuration doesn't work. I also wanted a local volume to hold the data, so I created a simple docker-compose file:
---
version: '2'
services:
  kibana:
    image: docker.elastic.co/kibana/kibana-oss:6.2.4
    environment:
      SERVER_NAME: $HOSTNAME
      ELASTICSEARCH_URL: http://elasticsearch:9200
    ports:
      - 5601:5601
    networks:
      - esnet
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.4
    container_name: elasticsearch
    environment:
      discovery.type: single-node
      bootstrap.memory_lock: "true"
      ES_JAVA_OPTS: "-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - 9200:9200
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    networks:
      - esnet
volumes:
  esdata1:
    driver: local
networks:
  esnet:
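Bringing the stack up and checking that ES responds is then just the following, assuming the file above is saved as docker-compose.yml in the current directory:

# Start both containers in the background
docker-compose up -d

# Verify ElasticSearch is answering on port 9200
curl http://localhost:9200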
Now I have ES and Kibana running on ports 9200 and 5601.
I have a large collection of files in the form:
key1: value1
key2: value2
<etc>
Converting this to JSON should be simple, but there were a few surprises; one was that I needed tr to delete all backslashes from the JSON. The final code looks like this:
convert_to_json()
{
    local ARR=()
    local FILE=$1

    # Split each "key: value" line into a key element and a value element
    while read -r LINE
    do
        ARR+=( "${LINE%%:*}" "${LINE##*: }" )
    done < "$FILE"

    local LEN=${#ARR[@]}
    local SEP=""
    echo "{"
    for (( i=0; i<LEN; i+=2 ))
    do
        # Print a comma before every pair except the first,
        # so there is no trailing comma to invalidate the JSON
        printf '%s "%s": "%s"' "$SEP" "${ARR[i]}" "${ARR[i+1]}"
        SEP=","
    done
    printf "\n}\n"
}

for FILE in "$@"; do
    # Delete backslashes that would otherwise produce invalid JSON escapes
    JSON=$(convert_to_json "$FILE" | tr -d '\\')
    echo "$JSON"
done
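The converted JSON then just needs to be posted to ES. A minimal sketch, assuming an index named buildstats (a placeholder of my own) and the single _doc type that ES 6.x expects:

# Index one converted file as a document (ES generates the document id)
curl -s -XPOST 'http://localhost:9200/buildstats/_doc' \
     -H 'Content-Type: application/json' \
     -d "$JSON"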
The data was being imported into ES properly, but when I tried to search and visualize it, I found it surprisingly hard. Every field was imported as both text and keyword types, which meant that the date and number fields could not be visualized as expected.
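You can see what ES guessed for each field by asking for the index's mapping; a quick check, using the same hypothetical buildstats index name:

# Show the dynamic mapping ES generated for the index
curl -s 'http://localhost:9200/buildstats/_mapping?pretty'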
The solution was to create a mapping that assigns a type to each field in the document. If the numbers had not been sent as strings, ES would have converted them automatically, but I had dates in epoch seconds, which are indistinguishable from plain large numbers. Date parsing is its own challenge and ES supports many different date formats; in my specific case, epoch_second was the only date format required.
I took the default mapping and added the type information to each field. I tried to apply this mapping to the existing index, but ES does not allow the mapping of an existing field to be changed, because that would change the interpretation of data already indexed. The solution is to create a new index with the desired mapping and reindex the old index into it. This worked, and I was now able to visualize the data much more easily.
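A sketch of that sequence, with hypothetical field names (build_date in epoch seconds, duration as a number) and index names of my own choosing:

# Create a new index whose mapping types the fields explicitly
curl -s -XPUT 'http://localhost:9200/buildstats-v2' \
     -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "_doc": {
      "properties": {
        "build_date": { "type": "date", "format": "epoch_second" },
        "duration":   { "type": "integer" }
      }
    }
  }
}'

# Copy everything from the old index into the newly typed one
curl -s -XPOST 'http://localhost:9200/_reindex' \
     -H 'Content-Type: application/json' -d '
{
  "source": { "index": "buildstats" },
  "dest":   { "index": "buildstats-v2" }
}'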
I now had one index and it was growing quickly. I remembered from previous research that Logstash uses indexes with a date suffix. This allows old data to be cleaned up regularly and also allows a new mapping to be applied to new indexes. Creating and deleting indexes can be handled by the Curator tool.
I created two scripts: one for deleting indexes older than 45 days and another for creating tomorrow's index with the specified mapping. Running these daily from cron automates the creation and cleanup. The last piece is to have the JSON sent to the index that matches the current day, as sketched below.
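A minimal sketch of those daily pieces, assuming GNU date, a date-suffixed index prefix of buildstats-, and a mapping.json file holding the typed mapping (all names are placeholders of my own):

# Create tomorrow's index with the typed mapping (run daily from cron)
curl -s -XPUT "http://localhost:9200/buildstats-$(date -d tomorrow +%Y.%m.%d)" \
     -H 'Content-Type: application/json' -d @mapping.json

# When importing, post each document to the index matching today's date
curl -s -XPOST "http://localhost:9200/buildstats-$(date +%Y.%m.%d)/_doc" \
     -H 'Content-Type: application/json' -d "$JSON"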
Every new piece of software has its learning curve. So far the curve has been quite reasonable for ElasticSearch. I look forward to working with it more.