Application specific tab - Cassandra #52

Closed
tzach opened this issue Sep 4, 2014 · 20 comments

Comments

@tzach
Member

tzach commented Sep 4, 2014

The Cassandra (C*) tab should be available only if C* is running.
It will present C*-related information in charts and text boxes.

Cluster - Text info (mostly static)

  • org.apache.cassandra.service.StorageService.Attributes.LiveNodes
    A set of the nodes which are visible and live, from the perspective of this node
  • org.apache.cassandra.service.StorageService.Attributes.LoadMap
    A map of which nodes have what level of load (present as a table)

Operation charts

reads, writes, gossip

  • org.apache.cassandra.request.ReadStage.ActiveCount / CompletedTasks
  • org.apache.cassandra.request.MutationStage.ActiveCount / CompletedTasks
  • org.apache.cassandra.internal/type=GossipStage

Latency (charts)

  • org.apache.cassandra.service.StorageProxy.Attributes.RecentRangeLatencyMicros
    The latency of range operations since the last time this attribute was read.
  • org.apache.cassandra.service.StorageProxy.Attributes.RecentReadLatencyMicros
    The latency of read operations since the last time this attribute was read.
  • org.apache.cassandra.service.StorageProxy.Attributes.RecentWriteLatencyMicros
    The latency of write operations since the last time this attribute was read.

Compaction Manager (charts)

  • org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted
  • org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress

DB (charts)

  • org.apache.cassandra.db.CommitLog.Attributes.ActiveCount
    The number of tasks which are currently executing.
  • org.apache.cassandra.db.CommitLog.Attributes.CompletedTasks
    The number of completed tasks.

source: http://wiki.apache.org/cassandra/JmxInterface

@slivne

slivne commented Sep 4, 2014

I searched for blogs on the subject and found an informative one:

http://www.tomas.cat/blog/en/monitoring-cassandra-relevant-data-should-be-watched-and-how-send-it-graphite

Below is an extract from this blog. Some of it is covered above, but other
parts may be interesting as well.

  1. ReadStage, MutationStage, GossipStage tasks

    With these metrics we can measure the activity in each server by counting
    the number of operations. The three types are read, write and
    "gossip" (inter-node communication,
    http://www.datastax.com/docs/1.1/cluster_architecture/gossip). We will
    gather the total CompletedTasks, where we will see how many operations per
    minute are being executed; the ActiveTasks, where we will see how many
    concurrent tasks are in each node; and the PendingTasks, where we will see
    the "pending" queue length. With this data we can see a lot of things: for
    instance, if the number of PendingTasks grows consistently, our node may be
    receiving more queries than it can handle, or maybe we ran out of disk
    space and, failing to write to the commitlog
    (http://wiki.apache.org/cassandra/ArchitectureCommitLog), they are
    piling up (anyway, if this metric grows, something is going wrong). If
    we see the load on our server grow, but CompletedTasks also increases at
    the same time, this may be "normal".
    We can find these values at:
    http://$host:8081/mbean?objectname=org.apache.cassandra.request%3Atype%3DReadStage
    http://$host:8081/mbean?objectname=org.apache.cassandra.request%3Atype%3DMutationStage
    http://$host:8081/mbean?objectname=org.apache.cassandra.internal%3Atype%3DGossipStage

    So I think that Completed and Pending are the more interesting ones, and
    that we can split them up by operation type, which may also be
    interesting. They can be charted over time as a stacked area chart to show
    the total number of operations. Please note that, according to the
    description above, the granularity is one minute (need to check in the
    DataStax documentation).

    ActiveTasks is interesting as a number, so if we already have graphs
    like Eldan suggested, with numbers on the side, we can use that.

  2. Compaction tasks

    Normally they are related to activity in cluster. If there are lots of
    writes, usually there will be compactions. We will gather how many
    compactions are pending (PendingTasks) and completed (CompletedTasks), so
    we know how many there are, and if they're piling up. For instance, if we
    find a loaded server with a long compaction queue, we should think about
    putting down compaction priority (nodetool setcompactionthroughput 1),
    or if we see our queue grows consistently, we should think about disabling
    thrift (nodetool disablethrift) to stop receiving new queries, and
    giving max priority to compactions, to get rid of them the sooner the
    better (nodetool setcompactionthroughput 999). These metrics will also
    help us to know when a repair, or scrub/rebuild, or upgradesstables,
    etc. ended (although there is now a progress indicator for repairs,
    since v1.1.9 and 1.2.2). Anyway, if these values are usually not zero, we
    will have worries. The link:
    http://$host:8081/mbean?objectname=org.apache.cassandra.db%3Atype%3DCompactionManager

    This is listed above as well. We can also graph pending/completed over
    time (check granularity).

  3. Latency

    Here we will get the latency in operations. We want this value to be the
    lowest possible, and if it grows without reason we should find out why. We
    have 3 latency types, one for each operation: Range
    (RecentRangeLatencyMicros), Read (RecentReadLatencyMicros) and Write
    (RecentWriteLatencyMicros).
    http://$host:8081/mbean?objectname=org.apache.cassandra.db%3Atype%3DStorageProxy

    A latency graph for each operation type

  4. Heap and NonHeap memory usage

    Here we will find how much memory is available for Java, and how much of
    it is in use. We will get HeapMemoryUsage and NonHeapMemoryUsage.
    http://$host:8081/mbean?objectname=java.lang%3Atype%3DMemory -s

    We may have that already from jvm info - but we may want to replicate
    this into the cassandra page

  5. Number of garbage collections

    Here we will gather GarbageCollections
    http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29
    in the system. This is related to the former metric (JavaHeap), because
    each GarbageCollection will free some memory. This will help us when the
    java process is GarbageCollecting too often and ends up wasting more time
    doing so than in its main task (read and write data!). We should check the
    GC frequency (ConcurrentMarkSweep). If it's too often, we may need to add
    some more memory to the java process. Anyway, we want this value to be the
    lowest possible.
    http://$host:8081/mbean?objectname=java.lang%3Atype%3DGarbageCollector%2Cname%3DConcurrentMarkSweep

    Outside JMX there are also interesting things

We may have that already from jvm info - but we may want to replicate this into
the cassandra page

  1. Number of connections

    We want to know how many concurrent connections Cassandra is serving.
    This way, if Cassandra load increases, we can correlate it to an increase
    in users. If the number of users in our application doesn't grow but
    Cassandra connections do, something is wrong (the queries are slower, for
    instance). If the number of Cassandra connections increases, and so does the
    number of users in our application, then this is "normal" and we should
    improve Cassandra (assigning more resources, or tuning the configuration)
    to fix it. This is a very interesting metric. It could be better, though.
    It would be great if we could see what transactions are active in Cassandra
    (as MySQL's show processlist does,
    http://dev.mysql.com/doc/refman/5.1/en/show-processlist.html) so we
    could see if there is any badly constructed query or any that can be improved.
    But given Cassandra's architecture, this doesn't seem feasible, so we will
    settle for the number of connections. I asked on the cassandra-users
    mailing list (http://mail-archives.apache.org/mod_mbox/cassandra-user/)
    if there is any way to get this number
    (http://mail-archives.apache.org/mod_mbox/cassandra-user/201212.mbox/browser)
    and they answered that there is no such thing, but they find it interesting
    because it is frequently asked, so a developer ticket was created:
    https://issues.apache.org/jira/browse/CASSANDRA-5084. Some day it will
    be implemented, I hope, and we will get this value from JMX. Meanwhile the
    only way is netstat:
    connections=$(netstat -tn | grep ESTABLISHED | awk '{print $4}' | grep 9160 | wc -l)

    Let's skip this; it is not available via JMX.

    • Data for each ColumnFamily

    To further squeeze Cassandra, it's also interesting to analyze each
    ColumnFamily's data. This way we can see size, activity, cache success rate,
    secondary indexes, etc. But these are lots of queries to mx4j (about 21 for
    each ColumnFamily, about 2000 HTTP queries in my case!), and this
    information doesn't change so often, so I won't gather it at the moment,
    and when I do, I'll get it at 5-minute or 15-minute intervals, to avoid
    overloading the server, so I'll put that in a separate script.

    No info on how to do that

There is also an image:
http://www.tomas.cat/blog/sites/default/files/xgraphite-dash.png.pagespeed.ic.AMqjgIo17k.png

AppDynamics provides a plugin for Cassandra:
https://www.appdynamics.com/database/cassandra/

The interesting part here is the per-transaction breakdown, yet I suspect this
is done via their agent and not via Cassandra MBeans.

ManageEngine also has one:
http://www.manageengine.com/products/applications_manager/cassandra-monitoring.html

Aside from JVM information that should be replicated into this page as well:

  • we may consider replicating "interesting"
    tracepoints to this page, so that the Cassandra user has a single
    page holding all the info. IO stats for disk and IO stats for network
    are informative, and CPU on the node is also informative. What do you
    think? The downside of replicating information is that it will be
    duplicated across multiple pages, yet it may be easier to correlate
    issues, for example CPU with compaction or GC with latency.

On Thu, Sep 4, 2014 at 10:57 AM, Tzach Livyatan wrote:

Cassandra (C*) tab should be available only if C* is running.
It will present C* related information in charts and text box.
Cluster Text info (mostly static)

  • org.apache.cassandra.service.StorageService.Attributes.LiveNodes A
    set of the nodes which are visible and live, from the perspective of this
    node
  • org.apache.cassandra.service.StorageService.Attributes.LoadMap A map
    of which nodes have what level of load (present as a table)

Compaction Manager text info (mostly static)

  • org.apache.cassandra.db.CompactionManager.Attributes.MaximumCompactionThreshold
    The maximum number of SSTables in the compaction queue before compaction
    kicks off.
  • org.apache.cassandra.db.CompactionManager.Attributes.MinimumCompactionThreshold
    The minimum number of SSTables in the compaction queue before compaction
    kicks off.
  • org.apache.cassandra.db.CompactionManager.Attributes.PendingTasks
    The number of tasks waiting in the queue to be executed.
  • org.apache.cassandra.service.StorageService.Attributes.Token A
    string describing the start of the range of keys this node is responsible
    for on the ring.

Charts

  • org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted

    org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress

DB charts

  • org.apache.cassandra.db.CommitLog.Attributes.ActiveCount The number
    of tasks which are currently executing.
  • org.apache.cassandra.db.CommitLog.Attributes.CompletedTasks The
    number of completed tasks.

source:
http://wiki.apache.org/cassandra/JmxInterface



@tzach
Member Author

tzach commented Sep 4, 2014

@slivne, we both looked at the same blog post :)
Much of your additions (compaction, read, write, gossip..) are already included.
Can you please clean the long text to identify what metric you suggest to add?

For JVM info, we have #36
Please review and comment there.

@dorlaor

dorlaor commented Sep 4, 2014

These beans and the rest will be highly valuable to add as an application
tab. Cheers!


@dzautner
Contributor

dzautner commented Sep 4, 2014

I have built and run the Cassandra image, but I don't seem to have the following MBeans in the Jolokia API for some reason:
org.apache.cassandra.interna/type=ReadStage
org.apache.cassandra.interna/type=MutationStage

Also, the GossipStage returns the following JSON:

{
 "CompletedTasks":0,
 "PendingTasks":0,
 "TotalBlockedTasks":0,
 "ActiveCount":0,
 "MaximumThreads":1,
 "CoreThreads":1,
 "CurrentlyBlockedTasks":0
}

Which information is relevant to the chart?
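As a rough illustration of narrowing the stage JSON down to chartable values, here is a minimal Python sketch. The choice of fields is an assumption based on the attributes named elsewhere in this thread (ActiveCount and CompletedTasks in the original post, PendingTasks in the blog extract), and `chart_fields` is a hypothetical helper, not part of any existing API.

```python
import json

# Attributes named in this thread; treating exactly these three as the
# "relevant" ones is an assumption, not a decision recorded in the issue.
CHART_FIELDS = ("CompletedTasks", "PendingTasks", "ActiveCount")


def chart_fields(stage_json: str) -> dict:
    """Return only the thread-pool attributes the charts would use."""
    stage = json.loads(stage_json)
    return {name: stage[name] for name in CHART_FIELDS if name in stage}


# The GossipStage response quoted above.
sample = """{
 "CompletedTasks":0,
 "PendingTasks":0,
 "TotalBlockedTasks":0,
 "ActiveCount":0,
 "MaximumThreads":1,
 "CoreThreads":1,
 "CurrentlyBlockedTasks":0
}"""

print(chart_fields(sample))
```

The remaining attributes (thread limits and blocked-task counts) are left out here only to keep the sketch small; they could be added to `CHART_FIELDS` if they turn out to be useful.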

@slivne

slivne commented Sep 4, 2014

OK, here is my take on what information we should extract and display.

Cassandra (C*) tab should be available only if C* is running.
It will present C* related information in charts and text box.
Cluster - Text info (mostly static) - I don't think this is useful at a
node level - we can remove

  • org.apache.cassandra.service.StorageService.Attributes.LiveNodes A set
    of the nodes which are visible and live, from the perspective of this node
  • org.apache.cassandra.service.StorageService.Attributes.LoadMap A map
    of which nodes have what level of load (present as a table)

Operation Completed chart - a single stacked area chart (stacking the values
of all 3 - the sum is the total of all operations)

reads, writes, gossip

On the side of this chart we can add the active numbers - a single number,
not a delta

Operation Pending chart - single chart - 3 lines

reads, writes, gossip

Total Latency Chart Over Time - single chart - 3 lines

(I am not sure this relates to operations above if at all)

  • delta of
    org.apache.cassandra.service.StorageProxy.Attributes.TotalRangeLatencyMicros
    (The latency of all range operations since executor start.)
  • delta of
    org.apache.cassandra.service.StorageProxy.Attributes.TotalReadLatencyMicros
    (The latency of all read operations since executor start.)
  • delta of
    org.apache.cassandra.service.StorageProxy.Attributes.TotalWriteLatencyMicros
    (The latency of all write operations since executor start.)

Avg Latency Chart Over Time - (delta)/(delta) - single chart - 3 lines

(I am not sure this relates to operations above if at all)

  • delta of
    org.apache.cassandra.service.StorageProxy.Attributes.TotalRangeLatencyMicros
    (The latency of all range operations since executor start.) divided by
    delta of
    org.apache.cassandra.service.StorageProxy.Attributes.RangeOperations (The
    number of range operations since executor start.)
  • delta of
    org.apache.cassandra.service.StorageProxy.Attributes.TotalReadLatencyMicros
    (The latency of all read operations since executor start.) divided by
    delta of
    org.apache.cassandra.service.StorageProxy.Attributes.ReadOperations (The
    number of read operations since executor start.)
  • delta of
    org.apache.cassandra.service.StorageProxy.Attributes.TotalWriteLatencyMicros
    (The latency of all write operations since executor start.) divided by
    delta of
    org.apache.cassandra.service.StorageProxy.Attributes.WriteOperations (The
    number of write operations since executor start.)
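The (delta)/(delta) calculation above could be sketched in Python as follows. This is only an illustration: the `Sample` type and `average_latency_micros` function are hypothetical names, and the two fields stand in for a Total*LatencyMicros attribute and its matching *Operations counter, read at the same moment.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    """One reading of a cumulative latency total and its operation counter."""
    total_latency_micros: int  # e.g. TotalReadLatencyMicros
    operations: int            # e.g. ReadOperations


def average_latency_micros(prev: Sample, curr: Sample) -> Optional[float]:
    """Average latency per operation between two samples, in microseconds.

    Returns None when no operations completed in the interval, or when the
    counters went backwards (for example after a restart).
    """
    d_latency = curr.total_latency_micros - prev.total_latency_micros
    d_ops = curr.operations - prev.operations
    if d_ops <= 0 or d_latency < 0:
        return None
    return d_latency / d_ops


# 15 operations took 3000 microseconds in total between the two samples.
print(average_latency_micros(Sample(1000, 10), Sample(4000, 25)))  # 200.0
```

Returning None for an idle interval avoids a division by zero and lets the chart show a gap instead of a misleading zero.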

Compaction Manager (two charts)

  • org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted

    org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress

DB (two charts)

  • org.apache.cassandra.db.CommitLog.Attributes.ActiveCount The number of
    tasks which are currently executing.
  • org.apache.cassandra.db.CommitLog.Attributes.CompletedTasks The number
    of completed tasks.

JVM (one/two charts)

Two charts or one - copy-paste from the JVM tab

Heap / GC

OS (two charts)

Two charts: one for disk IO and one for network IO

Disk IO (based on tracepoint counters) / Network IO (based on tracepoint
counters)

source: http://wiki.apache.org/cassandra/JmxInterface


@slivne

slivne commented Sep 4, 2014

I sent my take on the info. I did not find them either in the JMX link
tzach provided. I did find others that provide the information, but we may
need to aggregate their values.


@dzautner
Contributor

dzautner commented Sep 4, 2014

What would be the best way to put some load on Cassandra to see changes in the latency data?

@tzach
Member Author

tzach commented Sep 4, 2014

I have built and ran the Cassandra image but I don't seem to have the following MBeans in the Jolokia API for some reason:
org.apache.cassandra.interna/type=ReadStage
org.apache.cassandra.interna/type=MutationStage

Typo in my original post (now fixed).
The 4 MBeans are:

  • org.apache.cassandra.request.ReadStage.ActiveCount / CompletedTasks
  • org.apache.cassandra.request.MutationStage.ActiveCount / CompletedTasks

I used JConsole to connect to Cassandra (port 7199) and verified the above.
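For reference, the dotted names above correspond to a JMX domain plus a `type=` key (compare the `org.apache.cassandra.request%3Atype%3DReadStage` URLs in the blog extract). A minimal Python sketch of building a Jolokia read URL for one of these attributes follows; the base URL `http://localhost:8000/jolokia` and the helper name `jolokia_read_url` are assumptions for illustration, so adjust them to wherever the agent is actually exposed.

```python
from urllib.parse import quote


def jolokia_read_url(base: str, domain: str, bean_type: str, attribute: str) -> str:
    """Build a Jolokia GET URL of the form <base>/read/<mbean>/<attribute>."""
    mbean = f"{domain}:type={bean_type}"
    # Keep ':', '=' and ',' readable; percent-encode anything else unusual.
    return f"{base.rstrip('/')}/read/{quote(mbean, safe=':=,')}/{quote(attribute)}"


# Hypothetical base URL; the MBean and attribute names are the ones listed above.
BASE = "http://localhost:8000/jolokia"

for stage in ("ReadStage", "MutationStage"):
    for attribute in ("ActiveCount", "CompletedTasks"):
        print(jolokia_read_url(BASE, "org.apache.cassandra.request", stage, attribute))
```

The sketch only builds the URLs and does not fetch them, so it can be run without a Cassandra instance.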

@dzautner
Contributor

dzautner commented Sep 4, 2014

Should I show the ActiveCount/CompletedTasks with the GossipStage as well?

@dzautner
Contributor

dzautner commented Sep 4, 2014

some progress:
[image: cassandratab]

@dzautner
Contributor

dzautner commented Sep 4, 2014

I can not find the following MBeans either:
org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted
org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress

EDIT:
Also this one:
org.apache.cassandra.db.CommitLog.Attributes.ActiveCount

@tzach
Member Author

tzach commented Sep 4, 2014

I can not find the following MBeans either:
org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted
org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress

There seem to be differences between the C* project and the DataStax version.
Here are the MBeans, found with JConsole:

  • org.apache.cassandra.metrics.Compaction.BytesCompacted attribute = Count
  • org.apache.cassandra.metrics.Compaction.TotalCompactionsCompleted attribute = Count

@dzautner
Contributor

dzautner commented Sep 4, 2014

Thanks, I was able to find them now

@tzach
Member Author

tzach commented Sep 4, 2014

Should I show the ActiveCount/CompletedTasks with the GossipStage as well?

Yes. For the completed counters, you should present the derivative in the chart, and the absolute value in text format. There is no point in charting a monotonically increasing function.

@dzautner
Contributor

dzautner commented Sep 4, 2014

Is TotalCompactionsCompleted also a counter?

@dzautner
Contributor

dzautner commented Sep 4, 2014

Note, derivative shows the difference between two data points and our sampling rate is not constant so it might confuse users to think that the graph is displaying a "per time interval" (e.g. writes/s) data.

@tzach
Member Author

tzach commented Sep 4, 2014

Note, derivative shows the difference between two data points and our sampling rate is not constant so it might confuse users to think that the graph is displaying a "per time interval" (e.g. writes/s) data.

Good point. This is why we added timestamp to our API.
I see two alternatives:

  • Show the value, not the derivative, in text format
  • Show the derivative, but calculate it over the last 10 iterations
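The second alternative could look roughly like the Python sketch below. It assumes each API sample carries a timestamp in seconds alongside the counter value; the `WindowedRate` class and its default window of 10 samples are illustrative choices, not existing code.

```python
from collections import deque
from typing import Optional


class WindowedRate:
    """Rate of change of a counter over the most recent `window` samples."""

    def __init__(self, window: int = 10) -> None:
        # Old samples fall off the left end automatically.
        self.samples = deque(maxlen=window)

    def add(self, timestamp_s: float, value: float) -> None:
        self.samples.append((timestamp_s, value))

    def rate(self) -> Optional[float]:
        """(newest value - oldest value) / (newest time - oldest time)."""
        if len(self.samples) < 2:
            return None
        t0, v0 = self.samples[0]
        t1, v1 = self.samples[-1]
        if t1 <= t0:
            return None
        return (v1 - v0) / (t1 - t0)


# Unevenly spaced samples: the timestamps, not the sample count, set the scale.
completed = WindowedRate()
for t, v in [(0.0, 0), (1.0, 10), (4.0, 70)]:
    completed.add(t, v)
print(completed.rate())  # 17.5 completed tasks per second
```

Because the difference is divided by the elapsed time taken from the timestamps, an irregular sampling rate does not distort the result.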

@dzautner dzautner mentioned this issue Sep 4, 2014
@nyh

nyh commented Sep 4, 2014

On Thu, Sep 4, 2014 at 5:00 PM, Tzach Livyatan wrote:

Note, derivative shows the difference between two data points and our
sampling rate is not constant so it might confuse users to think that the
graph is displaying a "per time interval" (e.g. writes/s) data.

You should divide the difference of the two values by the interval's
length.

If you do that, you will get the proper "derivative". If you remember from
a calculus course, the derivative is actually the limit of the above ratio
as the time interval approaches zero. Moreover, according to the mean value
theorem (see http://en.wikipedia.org/wiki/Mean_value_theorem), when the
interval has a finite (not approaching zero) size, the ratio is still equal
to the derivative at some unknown point inside the interval.
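Written out, with f(t) standing for the counter value at time t and t_1 < t_2 for two sample timestamps (notation introduced here for illustration), the argument is roughly the following. It assumes f is continuous on [t_1, t_2] and differentiable inside it, which is an idealization for a sampled counter.

```latex
% Average rate over a sampling interval (the ratio described above)
\[
  \frac{\Delta f}{\Delta t} = \frac{f(t_2) - f(t_1)}{t_2 - t_1}
\]

% The derivative is the limit of that ratio as the interval shrinks
\[
  f'(t) = \lim_{\Delta t \to 0} \frac{f(t + \Delta t) - f(t)}{\Delta t}
\]

% Mean value theorem: for a finite interval, the ratio equals the
% derivative at some point c strictly inside the interval
\[
  \exists\, c \in (t_1, t_2) : \quad f'(c) = \frac{f(t_2) - f(t_1)}{t_2 - t_1}
\]
```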

Good point. This is why we added timestamp to our API.
I see two alternatives:

  • Show the value, not the derivative, in text format
  • Show the derivative, but calculate it over the last 10 iterations

Why not a third option, of showing \delta f / \delta t?

Nadav Har'El

@dzautner
Contributor

dzautner commented Sep 4, 2014

You should divide the difference of the two values by the interval's
length.

Wouldn't that give us the "value per time interval" (writes/s)?

@dzautner
Contributor

Should this issue be closed at this point?

@tzach tzach closed this as completed Sep 15, 2014