Query types
match vs term/terms :
match performs full-text analysis and does not look for an exact match
VS
term (or terms, for several values) performs an exact term search
In query :
To perform a full-text search on foo_field, we refer to foo_field
To perform a keyword search or aggregations on foo_field, we refer to foo_field.keyword
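As an illustration, here is a minimal sketch contrasting the two, assuming a hypothetical FOO_INDEX whose foo_field uses the default text + keyword mapping :
# full-text (analyzed) search on the text field
curl -H "Content-Type: application/json" -d '{
  "query": { "match": { "foo_field": "foo value" } }
}' "http://elk:9200/FOO_INDEX/_search?pretty"
# exact (non-analyzed) search on the keyword sub-field
curl -H "Content-Type: application/json" -d '{
  "query": { "term": { "foo_field.keyword": "foo value" } }
}' "http://elk:9200/FOO_INDEX/_search?pretty"
The first request may match documents containing only « foo » or only « value »; the second matches only the exact string « foo value ».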
Field types
– the string type was removed in version 5 (it was confusing)
– strings are mapped as both text and keyword by default since version 5
– set a field as « keyword » in the mapping to perform efficient term queries (see the mapping sketch below).
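A minimal sketch of such a mapping, assuming a hypothetical FOO_INDEX and foo_field (this mirrors the default dynamic mapping for strings) :
curl -X PUT -H "Content-Type: application/json" -d '{
  "mappings": {
    "properties": {
      "foo_field": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}' "http://elk:9200/FOO_INDEX"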
Search vs Aggregation/sort
Most fields are indexed by default, which makes them searchable.
However, sorting, aggregations, and accessing field values in scripts require a different access pattern from search.
Most fields can use index-time, on-disk doc_values for this data access pattern, but text fields do not support doc_values.
Indeed, fielddata is disabled on text fields by default.
While we can enable fielddata on a text field via the API, it is not efficient in terms of heap usage, so we should favor keyword fields for sorting and aggregations (see the sketch below).
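For example, a sketch of an aggregation on the keyword sub-field (hypothetical FOO_INDEX and foo_field); a sort would target foo_field.keyword the same way :
curl -H "Content-Type: application/json" -d '{
  "size": 0,
  "aggs": {
    "top_foo_values": {
      "terms": { "field": "foo_field.keyword", "size": 5 }
    }
  }
}' "http://elk:9200/FOO_INDEX/_search?pretty"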
Indices Query
List all indexes :
http://elk:9200/_cat/indices?v
Helpful parameters (see the combined example after this list) :
?h=
: headers/columns to render.
Ex: ?h=index,creation.date
: output only the index name and its creation date.
?help
: list available headers we may specify.
?format=output-format
: format of the output.
Default : text. Other allowed values : json – smile – yaml – cbor.
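For example, a sketch combining those parameters (the chosen columns are just an illustration) :
curl "http://elk:9200/_cat/indices?h=index,docs.count,store.size&format=json"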
Aliases, Nodes and Shards Query
List all aliases :
http://elk:9200/_aliases?pretty
Return shards information for all indexes of all nodes :
http://elk:9200/_cat/shards?v
By default, it returns the following columns :
index shard prirep state docs store ip node
You can request specific attributes by passing a comma-separated list of columns to display via the h parameter (see the example after the column list below).
Valid columns are:
index, i, idx
(Default) Name of the index.
shard, s, sh
(Default) Name of the shard.
prirep, p, pr, primaryOrReplica
(Default) Shard type. Returned values are primary or replica.
state, st
(Default) State of the shard. Returned values are:
INITIALIZING: The shard is recovering from a peer shard or gateway.
RELOCATING: The shard is relocating.
STARTED: The shard has started.
UNASSIGNED: The shard is not assigned to any node.
docs, d, dc
(Default) Number of documents in shard, such as 25.
store, sto
(Default) Disk space used by the shard, such as 5kb.
ip
(Default) IP address of the node, such as 127.0.1.1.
id
(Default) ID of the node, such as k0zy.
node, n
(Default) Node name, such as I8hydUG.
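For example, a sketch requesting only some of the columns above :
curl "http://elk:9200/_cat/shards?v&h=index,shard,prirep,state,docs,store,node"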
Return global allocation (memory used and free) for each node :
http://elk:9200/_cat/allocation?v
Here is an example to understand how to read the result :
shards  disk.indices  disk.used  disk.avail  disk.total  disk.percent  host             ip               node
24      7.9gb         32.9gb     6.1gb       39.1gb      84            192.170.244.150  192.170.244.150  es02
24      11gb          32.9gb     6.1gb       39.1gb      84            192.170.244.151  192.170.244.151  es03
24      3.4gb         32.9gb     6.1gb       39.1gb      84            192.170.244.149  192.170.244.149  es01
For each node, it gives :
shards
: number of shards
disk.indices
: the size actually used by the node (the content of its indices)
disk.used
: total disk space used, not only the part used by Elasticsearch
disk.avail
: disk space still available
disk.total
: total disk size
disk.percent
: percentage of disk used, again not only the part used by Elasticsearch
Helpful parameters :
– specify columns to display : h=columnOne,columnTwo
– format of the output : json, yaml, … (default is a terminal-like rendering) : format=fooFormat
get documentation of attributes (key and meaning) we can request :
http://elk:9200/_cat/shards?help
get information about a cluster’s nodes :
http://elk:9200/_cat/nodes?v
By default, it returns the following columns :
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
As with /_cat/shards, you can request specific attributes by passing a comma-separated list of columns to display via the h parameter.
Node role abbreviations and their meaning :
c (cold node), d (data node), f (frozen node), h (hot node), i (ingest node), l (machine learning node), m (master-eligible node), r (remote cluster client node), s (content node), t (transform node), v (voting-only node), w (warm node), and - (coordinating node only)
get documentation of attributes (key and meaning) we can request :
http://elk:9200/_cat/nodes?help
Document add/count Query
– count of documents in an index :
http://elk:9200/FOO_INDEX/_count
– count of documents in an index matching a query (a JSON-body variant is sketched after this list) :
http://elk:9200/FOO_INDEX/_count?q=QUERY
– add a document to the FOO_INDEX :
curl -X POST -H "Content-Type: application/json" -d '{"dog":"snoop", "cat":"snap"}' "http://elk:9200/FOO_INDEX/_doc/"
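Back to the count query, here is a sketch of the JSON-body variant (hypothetical FOO_INDEX; the type field is just an illustration) :
curl -H "Content-Type: application/json" -d '{
  "query": { "term": { "type": "footype" } }
}' "http://elk:9200/FOO_INDEX/_count?pretty"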
Document delete Query with _delete_by_query
scope
– In all indexes:
http://elk:9200/_all/_delete_by_query
– In a specific index:
http://elk:9200/FOO_INDEX/_delete_by_query
– In a set of indexes:
http://elk:9200/FOO_INDEX,BAR_INDEX/_delete_by_query
– In indexes matching a wildcard:
http://elk:9200/FOO_INDEX*/_delete_by_query
syntax and examples
– delete documents matching a query.
POST _delete_by_query
The query may be given either as a JSON « query » object in the POST body, or as a « q » query string parameter, just like for the search query.
Example with a text field equal to a value :
{ "query": { "match": { "fooField": "foo value" } } } |
Example with a date range (@timestamp
field older than or equal to 10 days ago) :
{ "query": { "range": { "@timestamp": { "lte": "now-10d" } } } } |
Example of response to _delete_by_query
request :
{ "took" : 147, "timed_out": false, "deleted": 119, "batches": 1, "version_conflicts": 0, "noops": 0, "retries": { "bulk": 0, "search": 0 }, "throttled_millis": 0, "requests_per_second": -1.0, "throttled_until_millis": 0, "total": 119, "failures" : [ ] } |
Issues with deleting
It is slow.
Cause : if many indexes and documents match, the query may take a long time to process.
Solution : wait, and abort only if it is extremely slow, because aborting and deleting again may raise version conflicts.
It returns an error response whose cause is a version conflict.
Cause : a document changed between the time the snapshot was taken and the time the delete request was processed.
May be caused by : an aborted delete, a failed delete, or a document updated during the deletion.
_delete_by_query
performs a version check during its process :
Elasticsearch takes a snapshot of the index when it starts, deletes what it finds using internal versioning, and checks those versions at the end.
Solution : set a parameter to tell Elasticsearch to proceed with the deletion despite conflicts (see the sketch below).
In JSON query : "conflicts": "proceed"
In query string : conflicts=proceed
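For example, a sketch deleting documents older than 10 days while ignoring version conflicts (hypothetical FOO_INDEX) :
curl -X POST -H "Content-Type: application/json" -d '{
  "query": { "range": { "@timestamp": { "lte": "now-10d" } } }
}' "http://elk:9200/FOO_INDEX/_delete_by_query?conflicts=proceed&pretty"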
Search Query
scope of queries
– In all indexes:
http://elk:9200/_search
– In a specific index:
http://elk:9200/FOO_INDEX/_search
– In a set of indexes:
http://elk:9200/FOO_INDEX,BAR_INDEX/_search
– In indexes matching a wildcard:
http://elk:9200/FOO_INDEX*/_search
Search Query : FORM requests example
– Without url encoding :
http://elk:9200/FOO_INDEX/_search?pretty&q=fields.foobar:123 AND type:footype
– With url encoding :
http://elk:9200/FOO_INDEX/_search?pretty&q=fields.foobar:"123"%20AND%20type:footype
Warning :
– Don’t try to find an expression containing whitespace in a text field this way.
Indeed, when searching in a text field, an OR is applied between words by default.
So q=message:build%20successful
will not search "build successful"
but "build" or "successful"
– You can specify a wildcard search with * (zero or more characters) or ? (exactly one character).
Note that uppercase characters inside a wildcard search may give weird results. To be investigated…
Useful params (see the combined example after this list) :
sort
: (Optional, string) A comma-separated list of <field>:<direction> pairs.
Example : &sort=event_timestamp:asc
size=number
: number of documents to retrieve.
By default, ‘size’ is 10. It also has a max value (10,000); if the param exceeds it, the API uses the default (10).
– You can select which fields to retrieve with the _source parameter.
For example : &_source=message
to get the message field only.
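A sketch combining those parameters in a single URI search (the field names are just an illustration) :
curl "http://elk:9200/FOO_INDEX/_search?pretty&q=type:footype&sort=event_timestamp:asc&size=100&_source=message"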
Search Query : JSON request examples
The general syntax is : POST FOO_INDEX/_search with the query as a JSON body.
– Inside a specific index, search a foo_field
term and sort results by @timestamp
field ascending:
Here the json input:
{ "size": 10000, "query" : { "term" : { "fields.foo_field.keyword" : "41585" } }, "sort": [ { "@timestamp": { "order": "asc" } } ] } |
curl -H "Content-Type: application/json" --data "@my-query.json" "http://elk:9200/FOO_INDEX/_search"
– search documents matching three term conditions:
Here the json input:
"query": { "bool": { "must": [ { "term": { "fields.foo.keyword" : "123" } }, { "term": { "type" : "footype" } }, { "term": { "fields.bar" : "10" } } ] } } |
And we could invoke it as previously.
– search documents matching both term and match conditions (for the message match, we want to find the whole expression):
Here the json input:
{ "size": 10, "query": { "bool": { "must": [ { "term": { "fields.foo": "1" } }, { "match": { "message": { "query": "[ERROR] File", "operator": "and" } } } ] } } } |
And we could invoke it as previously.
– search documents matching classic conditions AND a negative condition:
Here the json input:
{ "size": 1000, "query": { "bool": { "must": [ { "term": { "environment": "foo" } }, { "term": { "fields.type": "aws-app" } } ], "must_not": [ { "term" : { "fields.document_type": "internal-cloud-app" } } ] } } } |
– variant : retrieve specific fields of the documents instead of the whole documents (see the combined sketch below):
Here the json input:
{ "fields": ["message", "@timestamp"], "_source": false } |
Paginated search Query with scroll
The Scroll API is recommended for efficient deep scrolling, but scroll contexts are costly, so it is not recommended for real-time user requests.
To use it, we set a ‘scroll’ param indicating how long the scroll session stays open to query the next documents.
When we send that param, the response returns the matching documents along with a ‘_scroll_id’ field in the response body.
We use it to fetch the next results of the query.
For example :
curl -H "http://elk:9200/FOO_INDEX/_search?pretty&scroll=5m&size=10000"
to retrieve the 10.000 first documents and by keeping the scrolling session for 5 minutes.
Then we get the next documents by executing a second time the request but by setting the scroll_id
param with the information retrieved in the first response : curl -H "content-type: application/json" -d '{"scroll":"5m", "scroll_id":"FOO_ID"}' "http://elk:9200/_search/scroll?pretty"
to get next results.
Note :
Scroll requests have optimizations that make them faster when the sort order is _doc.
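A sketch of a scroll request sorted by _doc (hypothetical FOO_INDEX) :
curl -H "Content-Type: application/json" -d '{
  "size": 10000,
  "sort": ["_doc"]
}' "http://elk:9200/FOO_INDEX/_search?scroll=5m&pretty"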
Paginated search Query with search_after
The search_after parameter circumvents the scroll problem (costly in memory and processing) by providing a live cursor. The idea is to use the results from the previous page to help the retrieval of the next page.
Here is the first query :
GET twitter/_search { "size": 10, "query": { "match" : { "title" : "elasticsearch" } }, "sort": [ {"date": "asc"}, {"tie_breaker_id": "asc"} ] } |
The result of the above request includes an array of sort values for each document. These sort values can be used in conjunction with the search_after parameter to start returning results « after » any document in the result list. Generally, we use the sort values of the last document and pass them to search_after to retrieve the next page of results, such as :
GET twitter/_search { "size": 10, "query": { "match" : { "title" : "elasticsearch" } }, "search_after": [1463538857, "654323"], "sort": [ {"date": "asc"}, {"tie_breaker_id": "asc"} ] } |
Note :
A field with one unique value per document should be used as the tiebreaker of the sort specification. Otherwise the sort order for documents that have the same sort values would be undefined and could lead to missing or duplicate results.
The _id field has a unique value per document but it is not recommended to use it as a tiebreaker directly.
Beware that search_after looks for the first document which fully or partially matches tiebreaker’s provided value. Therefore if a document has a tiebreaker value of « 654323 » and you search_after for « 654 » it would still match that document and return results found after it.
Mapping Query
– describe the mapping for the specified index : GET /INDEX/_mapping
Cache Query
– clear cache for an index :
POST "http://elk:9200/FOO_INDEX/_cache/clear"
– clear cache for all indexes :
POST "http://elk:9200/_cache/clear"
Beware : the ELK cache stores « only » part of what is computed for a query. Indeed, Lucene (behind ELK) relies heavily on the filesystem cache for the segments involved in the query.
So to perform a full cache clearing (for example for performance tests), clearing the FS cache is also needed.
We can clear the whole FS cache with : sysctl vm.drop_caches=3
Note that it may affect the performance of your system for a relatively short time, since the system can no longer rely on what it had previously cached.
While not essential for that use case, we can run that command after a sync to make sure in-progress (in-memory) writes are flushed before clearing (see the sequence below).
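A sketch of the full sequence (the sync and sysctl commands have to run as root on each Elasticsearch data node) :
# clear the Elasticsearch caches for all indexes
curl -X POST "http://elk:9200/_cache/clear"
# flush in-memory writes, then drop the OS filesystem cache
sync
sysctl vm.drop_caches=3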
Cluster Settings Query
– get non-default settings only :
GET /_cluster/settings
– get all settings :
GET /_cluster/settings?include_defaults
– set settings :
PUT /_cluster/settings
+ JSON message
Example to change the watermark rules :
{ "transient": { "cluster.routing.allocation.disk.watermark.low": "20gb", "cluster.routing.allocation.disk.watermark.high": "20gb", "cluster.routing.allocation.disk.watermark.flood_stage": "5gb", "cluster.info.update.interval": "1m" } } |
Delete document or set of documents
Delete an index (and thus all its documents) :
curl -X DELETE "http://elk/myIndex"
Delete all indexes matching an index pattern :
curl -X DELETE "http://elk/myIndex*"
Possible issues :
A timeout occurs during the delete operation, such as :
"process_cluster_event_timeout_exception","reason":"failed to process cluster event..."
To fix that we could increase the master timeout in the query, such as
?master_timeout=120s
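For example, a sketch of the delete with an increased master timeout :
curl -X DELETE "http://elk/myIndex*?master_timeout=120s"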
Update Query
– update a document of an index by adding/updating a field :
POST /<index>/_update/<_id>
That endpoint allows scripted document updates. The script can update, delete, or skip modifying the document.
For example, to add a new property to the document with id 12345, located in fooIndex :
curl "http://elk:9200/fooIndex/_update/12345" \ -H "Content-Type: application/json" \ -d "{\"script\": \"ctx._source.myNewField= ctx._source.fields.existingFieldOne + '-' + ctx._source.existingFieldTwo + '-' + ctx._source.fields.existingFieldTwo \"}" |
Stats Query
– returns detailed statistics for all nodes : http://elk:9200/_nodes/stats?pretty
– returns detailed statistics for a specific node : http://elk:9200/_nodes/FOO_NODE/stats?pretty
Useful fields returned by stats (a filtered request is sketched after this list) :
* about searches :
open_contexts
(integer) Number of open search contexts.
query_total
(integer) Total number of query operations.
query_time
(time value) Time spent performing query operations.
query_time_in_millis
(integer) Time in milliseconds spent performing query operations.
query_current
(integer) Number of query operations currently running.
* about merges:
current
(integer) Number of merge operations currently running.
current_docs
(integer) Number of document merges currently running.
current_size
(byte value) Memory used performing current document merges.
current_size_in_bytes
(integer) Memory, in bytes, used performing current document merges.
total
(integer) Total number of merge operations.
total_time_in_millis
(integer) Total time in milliseconds spent performing merge operations.
total_docs
(integer) Total number of merged documents.
total_size
(byte value) Total size of document merges.
total_size_in_bytes
(integer) Total size of document merges in bytes.
total_stopped_time_in_millis
(integer) Total time in milliseconds spent stopping merge operations.
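The stats response is large; to keep only the sections above, one option is the generic filter_path parameter (a sketch, the response paths being assumptions based on the usual nodes stats layout) :
curl "http://elk:9200/_nodes/stats?pretty&filter_path=nodes.*.indices.search,nodes.*.indices.merges"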
Task management API
Returns information about the tasks currently executing in the cluster.
Task get requests
Get all tasks :
GET /_tasks
Helpful params:
detailed (boolean)
By default, no details about tasks are returned. With that param, it returns them.
Get details for a task :
GET /_tasks/<task_id>
task_id
: param to set with the node:task_number value returned by GET /_tasks
Task update requests
Cancel a task :
POST _tasks/node:task_number/_cancel
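A sketch of both calls (FOO_NODE_ID and FOO_TASK_NUMBER are placeholders for values returned by GET /_tasks) :
# list tasks with details
curl "http://elk:9200/_tasks?detailed=true&pretty"
# cancel a given task
curl -X POST "http://elk:9200/_tasks/FOO_NODE_ID:FOO_TASK_NUMBER/_cancel"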