├── power_trace_documentation.pdf
├── PowerData2019.md
├── ClusterData2019.md
├── TraceVersion1.md
├── README.md
├── ETAExplorationTraces.md
├── ClusterData2011_2.md
├── clusterdata_trace_format_v3.proto
├── clusterdata_analysis_colab.ipynb
├── power_trace_analysis_colab.ipynb
└── bibliography.bib
/power_trace_documentation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/cluster-data/HEAD/power_trace_documentation.pdf
--------------------------------------------------------------------------------
/PowerData2019.md:
--------------------------------------------------------------------------------
1 | # PowerData 2019 traces
2 |
3 | The `powerdata-2019` trace dataset provides power utilization information for 57
4 | power domains in Google data centers. Two of these power domains are from cells
5 | in data centers with the new medium voltage power plane design. The remainder
6 | belong to the eight cells featured in the
7 | [2019 Cluster Data trace](ClusterData2019.md).
8 |
9 | Please see [the documentation](power_trace_documentation.pdf) for details on
10 | what's in this dataset and how to access it. For additional background, please
11 | refer to the paper [Data Center Power Oversubscription with a Medium Voltage
12 | Power Plane and Priority-Aware
13 | Capping](https://research.google/pubs/data-center-power-oversubscription-with-a-medium-voltage-power-plane-and-priority-aware-capping/).
14 |
15 | Also included is a [colab](power_trace_analysis_colab.ipynb) recreating the
16 | figures as an example of how to query the data.
17 |
18 | ## Notes
19 |
20 | If you use this data, we'd appreciate it if you cite the paper and let us know
21 | about your work! The best way to do so is through the mailing list.
22 |
23 | * If you haven't already joined our
24 | [mailing list](https://groups.google.com/forum/#!forum/googleclusterdata-discuss),
25 | please do so now. *Important: to avoid spammers, you MUST fill out the
26 | "reason" field, or your application will be rejected.*
27 |
28 | 
29 | The data and trace documentation are made available under the
30 | [CC-BY](https://creativecommons.org/licenses/by/4.0/) license. By downloading
31 | or using them, you agree to the terms of this license.
32 |
33 | **Questions?**
34 |
35 | You can send email to googleclusterdata-discuss@googlegroups.com. The more
36 | detailed the request the greater the chance that somebody can help you: screen
37 | shots, concrete examples, error messages, and a list of what you already tried
38 | are all useful.
39 |
--------------------------------------------------------------------------------
/ClusterData2019.md:
--------------------------------------------------------------------------------
1 | # ClusterData 2019 traces
2 |
3 | _John Wilkes._
4 |
5 | The `clusterdata-2019` trace dataset provides information about eight different Borg cells for the month of May 2019. It includes the following new information:
6 |
7 | * CPU usage information histograms for each 5 minute period, not just a point sample;
8 | * information about alloc sets (shared resource reservations used by jobs);
9 | * job-parent information for master/worker relationships such as MapReduce jobs.
10 |
11 | The 2019 traces focus on resource requests and usage, and contain no information about end users, their data, or access patterns to storage systems and other services.
12 |
13 | Because of its size (about 2.4TiB compressed), we are making the trace data available only via [Google BigQuery](https://cloud.google.com/bigquery), so that sophisticated analyses can be performed without requiring local resources.
14 |
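As a sketch of what a BigQuery analysis can look like, the helper below builds a simple SQL query string. Note the project/dataset/table path shown is an assumption made for illustration; take the authoritative names from the v3 trace document.

```python
# Sketch of a BigQuery SQL query over one cell of the 2019 trace.
# NOTE: the project/dataset/table path below is an ASSUMPTION for
# illustration; the authoritative names are in the v3 trace document.

def events_per_type_query(cell: str) -> str:
    """Return SQL that counts instance events by event type for one cell."""
    table = f"`google.com:google-cluster-data.clusterdata_2019_{cell}.instance_events`"
    return (
        "SELECT type, COUNT(*) AS num_events\n"
        f"FROM {table}\n"
        "GROUP BY type\n"
        "ORDER BY num_events DESC"
    )

# Paste the result into the BigQuery console, or run it with the
# google-cloud-bigquery client library.
print(events_per_type_query("a"))
```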
15 | **The `clusterdata-2019` traces are described in this document:
16 | [Google cluster-usage traces v3](https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view).** You can find the download and access instructions there, as well as many more details about what is in the traces, and how to interpret them. For additional background information, please refer to the 2015 Borg paper, [Large-scale cluster management at Google with Borg](https://ai.google/research/pubs/pub43438).
17 |
18 | * If you haven't already joined our
19 | [mailing list](https://groups.google.com/forum/#!forum/googleclusterdata-discuss),
20 | please do so now.
21 | *Important: to avoid spammers, you MUST fill out the "reason" field, or your application will be rejected.*
22 |
23 | 
24 | The data and trace documentation are made available under the
25 | [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.
26 | By downloading or using them, you agree to the terms of this license.
27 |
28 | **Questions?**
29 |
30 | You can send email to googleclusterdata-discuss@googlegroups.com. The more detailed the request the greater the chance that somebody can help you: screen shots, concrete examples, error messages, and a list of what you already tried are all useful.
31 |
32 | **Acknowledgements**
33 |
34 | This trace is the result of a collaboration involving Muhammad Tirmazi, Nan Deng, Md Ehtesam Haque, Zhijing Gene Qin, Steve Hand and Adam Barker.
35 |
--------------------------------------------------------------------------------
/TraceVersion1.md:
--------------------------------------------------------------------------------
1 |
2 | |*Note: for new work, we **strongly** recommend using the [version 2](ClusterData2011_2.md) or [version 3](ClusterData2019.md) traces, which are more recent, more comprehensive, and provide much more data.* |
3 | |:--------- |
4 |
5 |
6 | _This trace was first announced in [this January 2010 blog post](https://ai.googleblog.com/2010/01/google-cluster-data.html)._
7 |
8 | The first dataset provides traces from a Borg cell that were taken over a 7 hour
9 | period. The workload consists of a set of tasks, where each task runs on a
10 | single machine. Tasks consume memory and one or more cores (in fractional
11 | units). Each task belongs to a single job; a job may have multiple tasks (e.g.,
12 | mappers and reducers).
13 |
14 | The trace data is available
15 | [here](http://commondatastorage.googleapis.com/clusterdata-misc/google-cluster-data-1.csv.gz).
16 | ([SHA1 checksum](http://en.wikipedia.org/wiki/SHA-1#Data_Integrity):
17 | 98c87f059aa1cc37f1e9523ac691ee0fd5629188.)
18 |
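To verify the download, you can recompute the file's SHA-1 and compare it against the checksum above. A minimal stdlib-only sketch:

```python
# Recompute a file's SHA-1 so it can be compared against the published
# checksum (98c87f05... for google-cluster-data-1.csv.gz).
import hashlib

def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large downloads need not fit in RAM."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Example (after downloading):
# sha1_of_file("google-cluster-data-1.csv.gz")
```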
19 | The data have been anonymized in several ways: there are no task or job names,
20 | just numeric identifiers; timestamps are relative to the start of data
21 | collection; the consumption of CPU and memory is obscured using a linear
22 | transformation. However, even with these transformations of the data,
23 | researchers will be able to do workload characterizations (up to a linear
24 | transformation of the true workload) and workload generation.
25 |
26 | The data are structured as blank-separated columns. Each row reports on the
27 | execution of a single task during a five minute period.
28 |
29 | * `Time` (int) - time in seconds since the start of data collection
30 | * `JobID` (int) - Unique identifier of the job to which this task belongs (**may be called ParentID**)
31 | * `TaskID` (int) - Unique identifier of the executing task
32 | * `Job Type` (0, 1, 2, 3) - class of job (a categorization of work)
33 | * `Normalized Task Cores` (float) - normalized value of the average number of cores used by the task
34 | * `Normalized Task Memory` (float) - normalized value of the average memory consumed by the task
35 |
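A minimal parser for rows in this format might look like the following; the sample line is invented for illustration, since real rows carry normalized (linearly transformed) values.

```python
# Parse one blank-separated row of the version-1 trace into typed fields.
from typing import NamedTuple

class TaskSample(NamedTuple):
    time: int          # seconds since start of data collection
    job_id: int        # may be called ParentID in some descriptions
    task_id: int
    job_type: int      # 0, 1, 2, or 3
    norm_cores: float  # normalized average cores used
    norm_memory: float # normalized average memory consumed

def parse_row(line: str) -> TaskSample:
    t, job, task, jtype, cores, mem = line.split()
    return TaskSample(int(t), int(job), int(task), int(jtype),
                      float(cores), float(mem))

# Invented example row:
print(parse_row("300 517 2 1 0.25 0.0125"))
```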
36 | 
37 | The data and trace documentation are made available under the
38 | [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.
39 | By downloading or using them, you agree to the terms of this license.
40 |
41 | Questions? Send [email](mailto:googleclusterdata-discuss@googlegroups.com)
42 | or peruse the
43 | [discussion group](http://groups.google.com/group/googleclusterdata-discuss).
44 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Overview
2 |
3 | This repository describes various traces from parts of the Google cluster
4 | management software and systems.
5 |
6 | * Please join our (low volume)
7 | [discussion group](http://groups.google.com/group/googleclusterdata-discuss),
8 | so we can send you announcements, and you can let us know about any issues,
9 | insights, or papers you publish using these traces. **Important: to avoid
10 | spammers, you MUST fill out the "reason" field, or your application will be
11 | rejected**. Once you are a member, you can send email to
12 | [googleclusterdata-discuss@googlegroups.com](mailto:googleclusterdata-discuss@googlegroups.com)
13 | to:
14 |
15 | * Announce tools and techniques that can help others analyze or decode the
16 | trace data.
17 | * Share insights and surprises.
18 | * Ask questions (the group has a few hundred members) and get help. If you
19 | ask for help, please include concrete examples of issues you run into;
20 | screen shots; error codes; and a list of what you have already tried.
21 | Don't just say "I can't download the data"!
22 |
23 | * We provide a **[trace bibliography](bibliography.bib)** of papers that have
24 | used and/or analyzed the traces, and encourage anybody who publishes one to
25 | add it to the bibliography using a github pull request [preferred], or by
26 | emailing the bibtex entry to
27 | [googleclusterdata-discuss@googlegroups.com](mailto:googleclusterdata-discuss@googlegroups.com).
28 | In either case, please mimic the existing format **exactly**.
29 |
30 | # Borg cluster workload traces
31 |
32 | These are traces of workloads running on Google compute cells that are managed
33 | by the cluster management software internally known as Borg.
34 |
35 | * **[version 3](ClusterData2019.md)** (aka `ClusterData2019`) provides data
36 | from eight Borg cells over the month of May 2019.
37 | * [version 2](ClusterData2011_2.md) (aka `ClusterData2011`) provides data from
38 | a single 12.5k-machine Borg cell from May 2011.
39 | * [version 1](TraceVersion1.md) is an older, short trace that describes a 7
40 | hour period from one cell from 2009. *Deprecated. We strongly recommend
41 | using the version 2 or version 3 traces instead.*
42 |
43 | ## ETA traces
44 |
45 | In addition, this site hosts a set of
46 | [execution traces from ETA](ETAExplorationTraces.md) (Exploratory Testing
47 | Architecture) - a testing framework that explores interactions between
48 | distributed, concurrently-executing components, with an eye towards improving
49 | how they are tested.
50 |
51 | ## Power traces
52 |
53 | This site also hosts [power traces](PowerData2019.md) for 57 power domains
54 | during the month of May 2019. This trace is synergistic with the
55 | `ClusterData2019` dataset.
56 |
57 | ## License
58 |
59 | 
60 | The data and trace documentation are made available under the
61 | [CC-BY](https://creativecommons.org/licenses/by/4.0/) license. By downloading
62 | or using them, you agree to the terms of this license.
63 |
--------------------------------------------------------------------------------
/ETAExplorationTraces.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 |
3 | [ETA](http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-11-113.pdf) (Exploratory
4 | Testing Architecture) is a testing framework that explores the execution of a
5 | distributed application, looking for bugs that are provoked by particular
6 | sequences of events caused by non-determinism such as timing and asynchrony.
7 | ETA was developed for [Omega](http://research.google.com/pubs/pub41684.html), a
8 | cluster management system developed at Google.
9 |
10 | As part of its functionality, ETA provides estimates for when its exploratory
11 | testing will finish. Achieving accurate runtime estimations is a significant
12 | research challenge, and so in order to stimulate interest, and foster research
13 | in improving these estimates, we have made available traces of a number of ETA’s
14 | real-world exploratory test runs.
15 |
16 | You can find the traces
17 | [here](http://commondatastorage.googleapis.com/clusterdata-misc/ETA-traces.tar.gz).
18 | ([SHA1 checksum](http://en.wikipedia.org/wiki/SHA-1#Data_Integrity):
19 | 6664e43caa1bf1f4c0af959fe93d266ead24d234.)
20 |
21 | # Format
22 |
23 | These traces describe the execution tree structure explored by ETA. In short,
24 | the execution tree represents, at an abstract level, the different sequences in
25 | which concurrent events can happen during a test execution. Further, ETA uses
26 | [state space reduction](http://dl.acm.org/citation.cfm?id=1040315) to avoid the
27 | need to explore equivalent sequences. In other words, certain parts of the
28 | execution tree may never be explored. For details on the execution tree
29 | structure and the application of state space reduction in ETA, read our
30 | [technical report](http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-11-113.pdf).
31 |
32 | An exploration trace contains a sequence of events that detail the
33 | exploration. Each event can be one of the following:
34 |
35 | * `AddNode x y` -- a node `x` with parent `y` has been added (the parent of the root is -1)
36 | * `Explore x` -- the node `x` has been marked for further exploration
37 | * `Transition x` -- the exploration transitioned from the current node to node `x`
38 | * `Start` -- new test execution (starting from node 0) has been initiated
39 | * `End t` -- current test execution finished after `t` time units.
40 |
41 | A well-formed trace contains a number of executions, each starting with `Start`
42 | and ending with `End t`. Each execution explores a branch of the execution tree,
43 | transitioning from the root all the way to a leaf (`Transition x`), and
44 | optionally adding newly encountered nodes to the tree (`AddNode x y`), and
45 | identifying which unvisited nodes should be explored in future (`Explore x`).
46 |
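The event grammar above is simple enough to replay directly. The sketch below rebuilds the explored tree and accumulates total exploration time from a trace; the short input trace is invented for illustration.

```python
# Replay an ETA exploration trace: rebuild the execution tree and
# accumulate per-execution times, following the event format above.
def replay(lines):
    parent = {}         # node -> parent node (-1 for the root's parent)
    pending = set()     # nodes marked for future exploration
    executions = 0
    total_time = 0
    for line in lines:
        parts = line.split()
        op = parts[0]
        if op == "AddNode":
            parent[int(parts[1])] = int(parts[2])
        elif op == "Explore":
            pending.add(int(parts[1]))
        elif op == "Transition":
            pending.discard(int(parts[1]))  # node is being visited now
        elif op == "Start":
            executions += 1
        elif op == "End":
            total_time += int(parts[1])
    return {"executions": executions, "total_time": total_time,
            "nodes": len(parent), "pending": len(pending)}

trace = ["AddNode 0 -1",
         "Start", "AddNode 1 0", "Explore 1", "Transition 1", "End 12",
         "Start", "AddNode 2 0", "Explore 2", "Transition 2", "End 8"]
print(replay(trace))
```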
47 | # Traces
48 | These are all provided in a single compressed tar file (see above for the link).
49 |
50 | * `resource_X.trace`: The `resource_X` test is representative of a class of
51 | Omega tests that evaluate interactions of `X` different users that acquire
52 | and release resources from a pool of `X` resources.
53 | * `store_X_Y_Z.trace`: The `store_X_Y_Z` test is representative of a class of
54 | Omega tests that evaluate interactions of `X` users of a distributed key-value
55 | store with `Y` front-end nodes and `Z` back-end nodes.
56 | * `scheduling_X.trace`: The `scheduling_X` test is representative of a class of
57 | Omega tests that evaluate interactions of `X` users issuing concurrent
58 | scheduling requests.
59 | * `tlp.trace`: The `tlp` test is representative of a class of Omega tests that
60 | do scheduling work.
61 |
62 | # Notes
63 |
64 | The data may be freely used for any purpose, although acknowledgement of Google
65 | as the source of the data would be appreciated, and we’d love to be sent copies
66 | of any papers you publish that use it.
67 |
68 | Questions? Send us email!
69 |
70 | Jiri Simsa jsimsa@google.com, John Wilkes johnwilkes@google.com
71 |
72 | --------------------------------------
73 |
74 | _Version of: 2012-09-26; revised 2015-07-29_
75 |
--------------------------------------------------------------------------------
/ClusterData2011_2.md:
--------------------------------------------------------------------------------
1 | # ClusterData 2011 traces
2 |
3 | _John Wilkes and Charles Reiss._
4 |
5 | The `clusterdata-2011-2` trace represents 29 days' worth of Borg cell information
6 | from May 2011, on a cluster of about 12.5k machines. (The `-2` refers to the fact that we added some additional data after the initial release, to create trace version 2.1.)
7 |
8 | * If you haven't already joined our
9 | [mailing list](https://groups.google.com/forum/#!forum/googleclusterdata-discuss),
10 | please do so now. **Important**: please fill out the "reason" field, or your application will be rejected.
11 |
12 | ## Trace data
13 |
14 | The `clusterdata-2011-2` trace starts at 19:00 EDT on Sunday May 1, 2011, and
15 | the datacenter is in that timezone (US Eastern). This corresponds to a trace
16 | timestamp of 600s; see the data schema documentation for why.
17 |
18 | The trace is described in the trace-data
19 | [v2.1 format + schema document](https://drive.google.com/file/d/0B5g07T_gRDg9Z0lsSTEtTWtpOW8/view?usp=sharing&resourcekey=0-cozD56gA4fUDdrkHnLJSrQ).
20 |
21 | Priorities in this trace range from 0 to 11 inclusive; bigger numbers mean "more
22 | important". 0 and 1 are “free” priorities; 9, 10, and 11 are “production”
23 | priorities; and 10 is a “monitoring” priority.
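For analyses that bucket tasks by priority, a tiny helper reflecting the bands described here (free: 0–1, production: 9–11) can be useful; everything in between is lumped as "other" in this sketch, and the v2.1 format document remains the authoritative breakdown.

```python
# Classify a 2011-trace priority into the coarse bands described above.
# Bands: free (0-1), production (9-11), everything else "other".
def priority_band(priority: int) -> str:
    if not 0 <= priority <= 11:
        raise ValueError("trace priorities range from 0 to 11 inclusive")
    if priority <= 1:
        return "free"
    if priority >= 9:
        return "production"
    return "other"

print(priority_band(0), priority_band(5), priority_band(11))
```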
24 |
25 | The `clusterdata-2011-2` trace is identical to the one called
26 | `clusterdata-2011-1`, except for the addition of a single new column of data in
27 | the `task_usage` tables. This new data is a randomly-picked 1 second sample of
28 | CPU usage from within the associated 5-minute usage-reporting period for that
29 | task. Using this data, it is possible to build up a stochastic model of task
30 | utilization over time for long-running tasks.
31 |
32 | 
33 | The data and trace documentation are made available under the
34 | [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.
35 | By downloading or using them, you agree to the terms of this license.
36 |
37 | ## Downloading the trace
38 |
39 | Download instructions for the trace are in the
40 | [v2.1 format + schema document](https://drive.google.com/file/d/0B5g07T_gRDg9Z0lsSTEtTWtpOW8/view?usp=sharing&resourcekey=0-cozD56gA4fUDdrkHnLJSrQ).
41 |
42 | The trace is stored in
43 | [Google Storage for Developers](https://developers.google.com/storage/) in the
44 | bucket called `clusterdata-2011-2`. The total size of the compressed trace is
45 | approximately 41GB.
46 |
47 | Most users should use the
48 | [gsutil](https://developers.google.com/storage/docs/gsutil) command-line tool to
49 | download the trace data.
50 |
51 |
52 | ## Known anomalies in the trace
53 |
54 | Disk-time-fraction data is only included in about the first 14 days, because of
55 | a change in our monitoring system.
56 |
57 | Some jobs are deliberately omitted because they ran primarily on machines not
58 | included in this trace. The portion that ran on included machines amounts to
59 | approximately 0.003% of the machines’ task-seconds of usage.
60 |
61 | We are aware of only one example of a job that retains its job ID after being
62 | stopped, reconfigured, and restarted (job number 6253771429).
63 |
64 | Approximately 70 jobs (for example, job number 6377830001) have job event
65 | records but no task event records. We believe that this is legitimate in a
66 | majority of cases: typically because the job is started but its tasks are
67 | disabled for its entire duration.
68 |
69 | Approximately 0.013% of task events and 0.0008% of job events in this trace have
70 | a non-empty missing info field.
71 |
72 | We estimate that less than 0.05% of job and task scheduling event records are
73 | missing and less than 1% of resource usage measurements are missing.
74 |
75 | Some cycles per instruction (CPI) and memory accesses per instruction (MAI)
76 | measurements are clearly inaccurate (for example, they are above or below the
77 | range possible on the underlying micro-architectures). We believe these
78 | measurements are caused by bugs in the data-capture system used, such as the
79 | cycle counter and instruction counter not being read at the same time. To obtain
80 | useful data from these measurements, we suggest filtering out measurements
81 | representing a very small amount of CPU time and measurements with unreasonable
82 | CPI and MAI values.
83 |
84 | # Questions?
85 |
86 | Please send email to googleclusterdata-discuss@googlegroups.com.
87 |
--------------------------------------------------------------------------------
/clusterdata_trace_format_v3.proto:
--------------------------------------------------------------------------------
1 | // This file defines the format of the 3rd version of cluster trace data
2 | // published by Google. Please refer to the associated 'Google cluster-usage
3 | // traces v3' document.
4 | // More information at https://github.com/google/cluster-data
5 |
6 | syntax = "proto2";
7 |
8 | package google.cluster_data;
9 |
10 | // Values used to indicate "not present" for special cases.
11 | enum Constants {
12 | option allow_alias = true; // OK for multiple names to have the same value.
13 |
14 | NO_MACHINE = 0; // The thing is not bound to a machine.
15 | DEDICATED_MACHINE = -1; // The thing is bound to a dedicated machine.
16 | NO_ALLOC_COLLECTION = 0; // The thing is not running in an alloc set.
17 | NO_ALLOC_INDEX = -1; // The thing does not have an alloc instance index.
18 | }
19 |
20 | // A common structure for CPU and memory resource units.
21 | // All resource measurements are normalized and scaled.
22 | message Resources {
23 | optional float cpus = 1; // Normalized GCUs (NCUs).
24 | optional float memory = 2; // Normalized RAM bytes.
25 | }
26 |
27 | // Collections are either jobs (which have tasks) or alloc sets (which have
28 | // alloc instances).
29 | enum CollectionType {
30 | JOB = 0;
31 | ALLOC_SET = 1;
32 | }
33 |
34 | // This enum is used in the 'type' field of the CollectionEvent and
35 | // InstanceEvent tables.
36 | enum EventType {
37 | // The collection or instance was submitted to the scheduler for scheduling.
38 | SUBMIT = 0;
39 | // The collection or instance was marked not eligible for scheduling by the
40 | // batch scheduler.
41 | QUEUE = 1;
42 | // The collection or instance became eligible for scheduling.
43 | ENABLE = 2;
44 | // The collection or instance started running.
45 | SCHEDULE = 3;
46 | // The collection or instance was descheduled because of a higher priority
47 | // collection or instance, or because the scheduler overcommitted resources.
48 | EVICT = 4;
49 | // The collection or instance was descheduled due to a failure.
50 | FAIL = 5;
51 | // The collection or instance completed normally.
52 | FINISH = 6;
53 | // The collection or instance was cancelled by the user or because a
54 | // depended-upon collection died.
55 | KILL = 7;
56 | // The collection or instance was presumably terminated, but due to missing
57 | // data there is insufficient information to identify when or how.
58 | LOST = 8;
59 | // The collection or instance was updated (scheduling class or resource
60 | // requirements) while it was waiting to be scheduled.
61 | UPDATE_PENDING = 9;
62 | // The collection or instance was updated while it was scheduled somewhere.
63 | UPDATE_RUNNING = 10;
64 | }
65 | // Represents reasons why we synthesized a scheduler event to replace
66 | // apparently missing data.
67 | enum MissingType {
68 | MISSING_TYPE_NONE = 0; // No data was missing.
69 | SNAPSHOT_BUT_NO_TRANSITION = 1;
70 | NO_SNAPSHOT_OR_TRANSITION = 2;
71 | EXISTS_BUT_NO_CREATION = 3;
72 | TRANSITION_MISSING_STEP = 4;
73 | TOO_MANY_EVENTS = 5;
74 | }
75 | // How latency-sensitive a thing is to CPU scheduling delays when running
76 | // on a machine, in increasing-sensitivity order.
77 | // Note that this is _not_ the same as the thing's cluster-scheduling
78 | // priority although latency-sensitive things do tend to have higher priorities.
79 | enum LatencySensitivity {
80 | MOST_INSENSITIVE = 0; // Also known as "best effort".
81 | INSENSITIVE = 1; // Often used for batch jobs.
82 | SENSITIVE = 2; // Used for latency-sensitive jobs.
83 | MOST_SENSITIVE = 3; // Used for the most latency-sensitive jobs.
84 | }
85 |
86 | // Represents the type of scheduler that is handling a job.
87 | enum Scheduler {
88 | // Handled by the default cluster scheduler.
89 | SCHEDULER_DEFAULT = 0;
90 | // Handled by a secondary scheduler, optimized for batch loads.
91 | SCHEDULER_BATCH = 1;
92 | }
93 |
94 | // How the collection is vertically auto-scaled.
95 | enum VerticalScalingSetting {
96 | // We were unable to determine the setting.
97 | VERTICAL_SCALING_SETTING_UNKNOWN = 0;
98 | // Vertical scaling was disabled, e.g., in the collection
99 | // creation request.
100 | VERTICAL_SCALING_OFF = 1;
101 | // Vertical scaling was enabled, with user-supplied lower
102 | // and/or upper bounds for GCU and/or RAM.
103 | VERTICAL_SCALING_CONSTRAINED = 2;
104 | // Vertical scaling was enabled, with no user-provided bounds.
105 | VERTICAL_SCALING_FULLY_AUTOMATED = 3;
106 | }
107 |
108 | // A constraint represents a request for a thing to be placed on a machine
109 | // (or machines) with particular attributes.
110 | message MachineConstraint {
111 | // Comparison operation between the supplied value and the machine's value.
112 | // For EQUAL and NOT_EQUAL relationships, the comparison is a string
113 | // comparison; for LESS_THAN, GREATER_THAN, etc., the values are converted to
114 | // floating point numbers first; for PRESENT and NOT_PRESENT, the test is
115 | // merely whether the supplied attribute exists for the machine in question,
116 | // and the value field of the constraint is ignored.
117 | enum Relation {
118 | EQUAL = 0;
119 | NOT_EQUAL = 1;
120 | LESS_THAN = 2;
121 | GREATER_THAN = 3;
122 | LESS_THAN_EQUAL = 4;
123 | GREATER_THAN_EQUAL = 5;
124 | PRESENT = 6;
125 | NOT_PRESENT = 7;
126 | }
127 |
128 | // Obfuscated name of the constraint.
129 | optional string name = 1;
130 | // Target value for the constraint (e.g., a minimum or equality).
131 | optional string value = 2;
132 | // Comparison operator.
133 | optional Relation relation = 3;
134 | }
135 |
136 | // Instance and collection events both share a common prefix, followed by
137 | // specific fields. Information about an instance event (task or alloc
138 | // instance).
139 | message InstanceEvent {
140 | // Common fields shared between instances and collections.
141 |
142 | // Timestamp, in microseconds since the start of the trace.
143 | optional int64 time = 1;
144 | // What type of event is this?
145 | optional EventType type = 2;
146 | // The identity of the collection that this instance is part of.
147 | optional int64 collection_id = 3;
148 | // How latency-sensitive is the instance?
149 | optional LatencySensitivity scheduling_class = 4;
150 | // Was there any missing data? If so, why?
151 | optional MissingType missing_type = 5;
152 | // What type of collection this instance belongs to.
153 | optional CollectionType collection_type = 6;
154 | // Cluster-level scheduling priority for the instance.
155 | optional int32 priority = 7;
156 | // (Tasks only) The ID of the alloc set that this task is running in, or
157 | // NO_ALLOC_COLLECTION if it is not running in an alloc.
158 | optional int64 alloc_collection_id = 8;
159 |
160 | // Begin: fields specific to instances
161 | // The index of the instance in its collection (starts at 0).
162 | optional int32 instance_index = 9;
163 | // The ID of the machine on which this instance is placed (or NO_MACHINE if
164 | // not placed on one, or DEDICATED_MACHINE if it's on a dedicated machine).
165 | optional int64 machine_id = 10;
166 | // (Tasks only) The index of the alloc instance that this task is running in,
167 | // or NO_ALLOC_INDEX if it is not running in an alloc.
168 | optional int32 alloc_instance_index = 11;
169 | // The resources requested when the instance was submitted or last updated.
170 | optional Resources resource_request = 12;
171 | // Currently active scheduling constraints.
172 | repeated MachineConstraint constraint = 13;
173 | }
174 |
175 | // Collection events apply to the collection as a whole.
176 | message CollectionEvent {
177 | // Common fields shared between instances and collections.
178 |
179 | // Timestamp, in microseconds since the start of the trace.
180 | optional int64 time = 1;
181 | // What type of event is this?
182 | optional EventType type = 2;
183 | // The identity of the collection.
184 | optional int64 collection_id = 3;
185 | // How latency-sensitive is the collection?
186 | optional LatencySensitivity scheduling_class = 4;
187 | // Was there any missing data? If so, why?
188 | optional MissingType missing_type = 5;
189 | // What type of collection is this?
190 | optional CollectionType collection_type = 6;
191 | // Cluster-level scheduling priority for the collection.
192 | optional int32 priority = 7;
193 | // The ID of the alloc set that this job is to run in, or NO_ALLOC_COLLECTION
194 | // (only for jobs).
195 | optional int64 alloc_collection_id = 8;
196 |
197 | // Fields specific to a collection.
198 |
199 | // The user who runs the collection
200 | optional string user = 9;
201 | // Obfuscated name of the collection.
202 | optional string collection_name = 10;
203 | // Obfuscated logical name of the collection.
204 | optional string collection_logical_name = 11;
205 | // ID of the collection that this is a child of.
206 | // (Used for stopping a collection when the parent terminates.)
207 | optional int64 parent_collection_id = 12;
208 | // IDs of collections that must finish before this collection may start.
209 | repeated int64 start_after_collection_ids = 13;
210 | // Maximum number of instances of this collection that may be placed on
211 | // one machine (or 0 if unlimited).
212 | optional int32 max_per_machine = 14;
213 | // Maximum number of instances of this collection that may be placed on
214 | // machines connected to a single Top of Rack switch (or 0 if unlimited).
215 | optional int32 max_per_switch = 15;
216 | // How/whether vertical scaling should be done for this collection.
217 | optional VerticalScalingSetting vertical_scaling = 16;
218 | // The preferred cluster scheduler to use.
219 | optional Scheduler scheduler = 17;
220 | }
221 |
222 | // Machine events describe the addition, removal, or update (change) of a
223 | // machine in the cluster at a particular time.
224 | message MachineEvent {
225 | enum EventType {
226 | // Should never happen :-).
227 | EVENT_TYPE_UNKNOWN = 0;
228 | // Machine added to the cluster.
229 | ADD = 1;
230 | // Machine removed from cluster (usually due to failure or repairs).
231 | REMOVE = 2;
232 | // Machine capacity updated (while not removed).
233 | UPDATE = 3;
234 | }
235 |
236 | // If we detect that data is missing, why do we know this?
237 | enum MissingDataReason {
238 | // No data is missing.
239 | MISSING_DATA_REASON_NONE = 0;
240 | // We observed that a change to the state of a machine must have
241 | // occurred from an internal state snapshot, but did not see a
242 | // corresponding transition event during the trace.
243 | SNAPSHOT_BUT_NO_TRANSITION = 1;
244 | }
245 |
246 | // Timestamp, in microseconds since the start of the trace. [key]
247 | optional int64 time = 1;
248 | // Unique ID of the machine within the cluster. [key]
249 | optional int64 machine_id = 2;
250 | // Specifies the type of event.
251 | optional EventType type = 3;
252 | // Obfuscated name of the Top of Rack switch that this machine is attached to.
253 | optional string switch_id = 4;
254 | // Available resources that the machine supplies. (Note: may be smaller
255 | // than the physical machine's raw capacity.)
256 | optional Resources capacity = 5;
257 | // An obfuscated form of the machine platform (microarchitecture + motherboard
258 | // design).
259 | optional string platform_id = 6;
260 | // Did we detect possibly-missing data?
261 | optional MissingDataReason missing_data_reason = 7;
262 | }
263 |
264 | // A machine attribute update or (if time = 0) its initial value.
265 | message MachineAttribute {
266 | // Timestamp, in microseconds since the start of the trace. [key]
267 | optional int64 time = 1;
268 | // Unique ID of the machine within the cluster. [key]
269 | optional int64 machine_id = 2;
270 | // Obfuscated unique name of the attribute (unique across all clusters). [key]
271 | optional string name = 3;
272 | // Value of the attribute. If this is unset, then 'deleted' must be true.
273 | optional string value = 4;
274 | // True if the attribute is being deleted at this time.
275 | optional bool deleted = 5;
276 | }
277 |
278 | // Information about resource consumption (usage) during a sample window
279 | // (which is typically 300s, but may be shorter if the instance started
280 | // and/or ended during a measurement window).
281 | message InstanceUsage {
282 | // Sample window end points, in microseconds since the start of the trace.
283 | optional int64 start_time = 1;
284 | optional int64 end_time = 2;
285 | // ID of collection that this instance belongs to.
286 | optional int64 collection_id = 3;
287 | // Index of this instance's position in that collection (starts at 0).
288 | optional int32 instance_index = 4;
289 | // Unique ID of the machine on which the instance has been placed.
290 | optional int64 machine_id = 5;
291 |
292 | // ID and index of the alloc collection + instance in which this instance
293 | // is running, or NO_ALLOC_COLLECTION / NO_ALLOC_INDEX if it is not
294 | // running inside an alloc.
295 | optional int64 alloc_collection_id = 6;
296 | optional int64 alloc_instance_index = 7;
297 | // Type of the collection that this instance belongs to.
298 | optional CollectionType collection_type = 8;
299 | // Average (mean) usage over the measurement period.
300 | optional Resources average_usage = 9;
301 | // Observed maximum usage over the measurement period.
302 | // This measurement may be fully or partially missing in some cases.
303 | optional Resources maximum_usage = 10;
304 | // Observed CPU usage during a randomly-sampled second within the measurement
305 | // window. (No memory data is provided here.)
306 | optional Resources random_sample_usage = 11;
307 |
308 | // The memory limit imposed on this instance; normally, it will not be
309 | // allowed to exceed this amount of memory.
310 | optional float assigned_memory = 12;
311 | // Amount of memory that is used for the instance's file page cache in the OS
312 | // kernel.
313 | optional float page_cache_memory = 13;
314 |
315 | // Average (mean) cycles per instruction, and memory accesses per instruction.
316 | optional float cycles_per_instruction = 14;
317 | optional float memory_accesses_per_instruction = 15;
318 | // The average (mean) number of data samples collected per second
319 | // (e.g., sample_rate=0.5 means a sample every 2 seconds on average).
320 | optional float sample_rate = 16;
321 |
322 | // CPU usage percentile data.
323 | // The cpu_usage_distribution vector contains 11 elements, representing
324 | // 0%ile (aka min), 10%ile, 20%ile, ... 90%ile, 100%ile (aka max) of the
325 | // normalized CPU usage in NCUs.
326 | // Note that the 100%ile may not exactly match the maximum_usage
327 | // value because of interpolation effects.
328 | repeated float cpu_usage_distribution = 17;
329 | // The tail_cpu_usage_distribution vector contains 9 elements, representing
330 | // 91%ile, 92%ile, 93%ile, ... 98%ile, 99%ile of the normalized CPU resource
331 | // usage in NCUs.
332 | repeated float tail_cpu_usage_distribution = 18;
333 | }
334 |
--------------------------------------------------------------------------------
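A note on the two percentile vectors defined above: `cpu_usage_distribution` carries the 0, 10, ..., 100 percentiles and `tail_cpu_usage_distribution` the 91..99 percentiles, so the two can be merged into a single approximate CDF and queried at any percentile. A minimal Python sketch (the helper names and the NCU values are illustrative, not part of the trace format):

```python
import bisect

def build_cdf(cpu_usage_distribution, tail_cpu_usage_distribution):
    # 0,10,...,100 percentiles from cpu_usage_distribution, plus the
    # 91..99 percentiles from tail_cpu_usage_distribution, merged into
    # sorted (percentile, value) pairs.
    pts = list(zip(range(0, 101, 10), cpu_usage_distribution))
    pts += list(zip(range(91, 100), tail_cpu_usage_distribution))
    return sorted(pts)

def percentile(cdf_points, p):
    # Linear interpolation between the two nearest known percentiles.
    ps = [q for q, _ in cdf_points]
    vs = [v for _, v in cdf_points]
    i = bisect.bisect_left(ps, p)
    if ps[i] == p:
        return vs[i]
    frac = (p - ps[i - 1]) / (ps[i] - ps[i - 1])
    return vs[i - 1] + frac * (vs[i] - vs[i - 1])

# Illustrative (made-up) NCU values:
dist = [0.0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.7, 1.0]
tail = [0.72, 0.74, 0.76, 0.78, 0.80, 0.83, 0.86, 0.90, 0.95]
cdf = build_cdf(dist, tail)
print(percentile(cdf, 50))  # 0.3 (exact grid point)
print(percentile(cdf, 75))  # about 0.45, interpolated between the 70%ile and 80%ile
```

Linear interpolation between grid points is the simplest choice; as the proto comment warns, an interpolated 100%ile need not match `maximum_usage` exactly.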
/clusterdata_analysis_colab.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "clusterdata_analysis_colab.ipynb",
7 | "provenance": [],
8 | "collapsed_sections": [],
9 | "authorship_tag": "ABX9TyN3LZGXsYYQ/HRBwFP0rm4Q",
10 | "include_colab_link": true
11 | },
12 | "kernelspec": {
13 | "name": "python3",
14 | "display_name": "Python 3"
15 | }
16 | },
17 | "cells": [
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "id": "view-in-github",
22 | "colab_type": "text"
23 | },
24 | "source": [
25 |         ""
26 |       ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {
31 | "id": "-qcDkhOIjj8a",
32 | "colab_type": "text"
33 | },
34 | "source": [
35 | "# Google trace analysis colab\n",
36 | "\n",
37 | "This colab provides several example queries and graphs using [Altair](https://altair-viz.github.io/) for the 2019 Google cluster trace. Further examples will be added over time.\n",
38 | "\n",
39 | "**Important:** in order to be able to run the queries you will need to:\n",
40 | "\n",
41 |         "1. Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to create a Cloud Platform project if you do not already have one.\n",
42 | "2. [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.\n",
43 | "3. [Enable BigQuery](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) APIs for the project.\n"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "metadata": {
49 | "id": "Vcjo13Kejgij",
50 | "colab_type": "code",
51 | "colab": {}
52 | },
53 | "source": [
54 | "#@title Please input your project id\n",
55 | "import pandas as pd\n",
56 | "import numpy as np\n",
57 | "import altair as alt\n",
58 | "from google.cloud import bigquery\n",
59 | "# Provide credentials to the runtime\n",
60 | "from google.colab import auth\n",
61 | "from google.cloud.bigquery import magics\n",
62 | "\n",
63 | "auth.authenticate_user()\n",
64 | "print('Authenticated')\n",
65 | "project_id = '' #@param {type: \"string\"}\n",
66 | "# Set the default project id for %bigquery magic\n",
67 | "magics.context.project = project_id\n",
68 | "\n",
69 | "# Use the client to run queries constructed from a more complicated function.\n",
70 | "client = bigquery.Client(project=project_id)"
71 | ],
72 | "execution_count": null,
73 | "outputs": []
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {
78 | "id": "NFUPuLiajrC8",
79 | "colab_type": "text"
80 | },
81 | "source": [
82 | "# Basic queries\n",
83 | "\n",
84 |         "This section shows the most basic way of querying the trace using the [bigquery magic](https://googleapis.dev/python/bigquery/latest/magics.html)."
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "metadata": {
90 | "id": "3xyBH9oQjr1w",
91 | "colab_type": "code",
92 | "colab": {}
93 | },
94 | "source": [
95 | "%%bigquery\n",
96 | "SELECT capacity.cpus AS cpu_cap, \n",
97 | "capacity.memory AS memory_cap, \n",
98 | "COUNT(DISTINCT machine_id) AS num_machines\n",
99 | "FROM `google.com:google-cluster-data`.clusterdata_2019_a.machine_events\n",
100 | "GROUP BY 1,2"
101 | ],
102 | "execution_count": null,
103 | "outputs": []
104 | },
105 | {
106 | "cell_type": "code",
107 | "metadata": {
108 | "id": "SzIHMxy2jvsM",
109 | "colab_type": "code",
110 | "colab": {}
111 | },
112 | "source": [
113 | "%%bigquery\n",
114 | "SELECT COUNT(DISTINCT collection_id) AS collections FROM \n",
115 | "`google.com:google-cluster-data`.clusterdata_2019_a.collection_events;"
116 | ],
117 | "execution_count": null,
118 | "outputs": []
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {
123 | "id": "gm-gaRS-jwZj",
124 | "colab_type": "text"
125 | },
126 | "source": [
127 | "# Cell level resource usage time series\n",
128 | "\n",
129 | "This query takes a cell as input and plots a resource usage time-series for every hour of the trace broken down by tier."
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "metadata": {
135 | "id": "ekhDGYOAjy54",
136 | "colab_type": "code",
137 | "colab": {}
138 | },
139 | "source": [
140 | "#@title Select a cell and a resource to plot the cell level usage series\n",
141 | "\n",
142 | "def query_cell_capacity(cell):\n",
143 | " return '''\n",
144 | "SELECT SUM(cpu_cap) AS cpu_capacity,\n",
145 | " SUM(memory_cap) AS memory_capacity\n",
146 | "FROM (\n",
147 | " SELECT machine_id, MAX(capacity.cpus) AS cpu_cap,\n",
148 | " MAX(capacity.memory) AS memory_cap\n",
149 | " FROM `google.com:google-cluster-data`.clusterdata_2019_{cell}.machine_events\n",
150 | " GROUP BY 1\n",
151 | ")\n",
152 | " '''.format(cell=cell)\n",
153 | "\n",
154 | "def query_per_instance_usage_priority(cell):\n",
155 | " return '''\n",
156 | "SELECT u.time AS time,\n",
157 | " u.collection_id AS collection_id,\n",
158 | " u.instance_index AS instance_index,\n",
159 | " e.priority AS priority,\n",
160 | " CASE\n",
161 | " WHEN e.priority BETWEEN 0 AND 99 THEN '1_free'\n",
162 | " WHEN e.priority BETWEEN 100 AND 115 THEN '2_beb'\n",
163 | " WHEN e.priority BETWEEN 116 AND 119 THEN '3_mid'\n",
164 | " ELSE '4_prod'\n",
165 | " END AS tier,\n",
166 | " u.cpu_usage AS cpu_usage,\n",
167 | " u.memory_usage AS memory_usage\n",
168 | "FROM (\n",
169 | " SELECT start_time AS time,\n",
170 | " collection_id,\n",
171 | " instance_index,\n",
172 | " machine_id,\n",
173 | " average_usage.cpus AS cpu_usage,\n",
174 | " average_usage.memory AS memory_usage\n",
175 | " FROM `google.com:google-cluster-data`.clusterdata_2019_{cell}.instance_usage\n",
176 | " WHERE (alloc_collection_id IS NULL OR alloc_collection_id = 0)\n",
177 | " AND (end_time - start_time) >= (5 * 60 * 1e6)\n",
178 | ") AS u JOIN (\n",
179 | " SELECT collection_id, instance_index, machine_id,\n",
180 | " MAX(priority) AS priority\n",
181 | " FROM `google.com:google-cluster-data`.clusterdata_2019_{cell}.instance_events\n",
182 | " WHERE (alloc_collection_id IS NULL OR alloc_collection_id = 0)\n",
183 | " GROUP BY 1, 2, 3\n",
184 | ") AS e ON u.collection_id = e.collection_id\n",
185 | " AND u.instance_index = e.instance_index\n",
186 | " AND u.machine_id = e.machine_id\n",
187 | " '''.format(cell=cell)\n",
188 | "\n",
189 | "def query_per_tier_utilization_time_series(cell, cpu_capacity, memory_capacity):\n",
190 | " return '''\n",
191 | "SELECT CAST(FLOOR(time/(1e6 * 60 * 60)) AS INT64) AS hour_index,\n",
192 | " tier,\n",
193 | " SUM(cpu_usage) / (12 * {cpu_capacity}) AS avg_cpu_usage,\n",
194 | " SUM(memory_usage) / (12 * {memory_capacity}) AS avg_memory_usage\n",
195 | "FROM ({table})\n",
196 | "GROUP BY 1, 2 ORDER BY hour_index\n",
197 | " '''.format(table=query_per_instance_usage_priority(cell),\n",
198 | " cpu_capacity=cpu_capacity, memory_capacity=memory_capacity)\n",
199 | " \n",
200 | "def run_query_utilization_per_time_time_series(cell):\n",
201 | " cell_cap = client.query(query_cell_capacity(cell)).to_dataframe()\n",
202 | " query = query_per_tier_utilization_time_series(\n",
203 | " cell,\n",
204 | " cell_cap['cpu_capacity'][0],\n",
205 | " cell_cap['memory_capacity'][0])\n",
206 | " time_series = client.query(query).to_dataframe()\n",
207 | " return time_series\n",
208 | "\n",
209 | "cell = 'c' #@param ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']\n",
210 | "hourly_usage = run_query_utilization_per_time_time_series(cell)\n",
211 | "\n",
212 | "# CPU graph\n",
213 | "cpu = alt.Chart(hourly_usage).mark_area().encode(\n",
214 | " alt.X('hour_index:N'),\n",
215 | " alt.Y('avg_cpu_usage:Q'),\n",
216 | " color=alt.Color('tier', legend=alt.Legend(orient=\"left\")),\n",
217 | " order=alt.Order('tier', sort='descending'),\n",
218 | " tooltip=['hour_index', 'tier', 'avg_cpu_usage']\n",
219 | " )\n",
220 | "cpu.encoding.x.title = \"Hour\"\n",
221 | "cpu.encoding.y.title = \"Average CPU usage\"\n",
222 | "cpu.display()\n",
223 | "\n",
224 | "# Memory graph\n",
225 | "memory = alt.Chart(hourly_usage).mark_area().encode(\n",
226 | " alt.X('hour_index:N'),\n",
227 | " alt.Y('avg_memory_usage:Q'),\n",
228 | " color=alt.Color('tier', legend=alt.Legend(orient=\"left\")),\n",
229 | " order=alt.Order('tier', sort='descending'),\n",
230 | " tooltip=['hour_index', 'tier', 'avg_memory_usage']\n",
231 | " )\n",
232 | "memory.encoding.x.title = \"Hour\"\n",
233 | "memory.encoding.y.title = \"Average memory usage\"\n",
234 | "memory.display()"
235 | ],
236 | "execution_count": null,
237 | "outputs": []
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {
242 | "id": "qz9m4P5hj2bv",
243 | "colab_type": "text"
244 | },
245 | "source": [
246 |         "# Per-machine resource usage distribution\n",
247 | "\n",
248 | "This query takes a cell as input and plots a per-machine resource utilization CDF."
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "metadata": {
254 | "id": "fgSP3kvyj25-",
255 | "colab_type": "code",
256 | "colab": {}
257 | },
258 | "source": [
259 | "#@title Select a cell and plot its per-machine resource utilization CDFs\n",
260 | "\n",
261 | "# Functions to plot CDFs using Altair\n",
262 | "def pick_quantiles_from_tall_dataframe(data, qcol, name=\"\"):\n",
263 | " quantiles = pd.DataFrame([x for x in data[qcol]]).transpose()\n",
264 | " if name != \"\":\n",
265 | " quantiles.columns = data[name]\n",
266 | " return quantiles\n",
267 | "\n",
268 | "# - data: a dataframe with one row and one or more columns of quantiles (results\n",
269 | "# returned from APPROX_QUANTILES)\n",
270 | "# - qcols: a list of names of the quantiles\n",
271 | "# - names: the names of each returned quantiles' columns.\n",
272 | "def pick_quantiles_from_wide_dataframe(data, qcols, names=[]):\n",
273 | " quantiles = {}\n",
274 | " i = 0\n",
275 | " for qcol in qcols:\n",
276 | " col_name = qcol\n",
277 | " if i < len(names):\n",
278 | " col_name = names[i]\n",
279 | " quantiles[col_name] = data[qcol][0]\n",
280 | " i+=1\n",
281 | " return pd.DataFrame(quantiles)\n",
282 | "\n",
283 | "# - quantiles: a dataframe where each column contains the quantiles of one\n",
284 | "# data set. The index (i.e. row names) of the dataframe is the quantile. The\n",
285 | "# column names are the names of the data set.\n",
286 | "def plot_cdfs(quantiles, xlab=\"Value\", ylab=\"CDF\",\n",
287 | " legend_title=\"dataset\", labels=[],\n",
288 | " interactive=False,\n",
289 | " title=''):\n",
290 | " dfs = []\n",
291 | " label = legend_title\n",
292 | " yval = range(quantiles.shape[0])\n",
293 | " esp = 1.0/(len(quantiles)-1)\n",
294 | " yval = [y * esp for y in yval]\n",
295 | " while label == xlab or label == ylab:\n",
296 | " label += '_'\n",
297 | " for col_idx, col in enumerate(quantiles.columns):\n",
298 | " col_label = col\n",
299 | " if col_idx < len(labels):\n",
300 | " col_label = labels[col_idx]\n",
301 | " dfs.append(pd.DataFrame({\n",
302 | " label: col_label,\n",
303 | " xlab: quantiles[col],\n",
304 | " ylab: yval\n",
305 | " }))\n",
306 | " cdfs = pd.concat(dfs)\n",
307 | " lines = alt.Chart(cdfs).mark_line().encode(\n",
308 | " # If you can draw a CDF, it has to be continuous real-valued\n",
309 | " x=xlab+\":Q\",\n",
310 | " y=ylab+\":Q\",\n",
311 | " color=label+\":N\"\n",
312 | " ).properties(\n",
313 | " title=title\n",
314 | " )\n",
315 | " if not interactive:\n",
316 | " return lines\n",
317 | " # Create a selection that chooses the nearest point & selects based on x-value\n",
318 | " nearest = alt.selection(type='single', nearest=True, on='mouseover',\n",
319 | " fields=[ylab], empty='none')\n",
320 | " # Transparent selectors across the chart. This is what tells us\n",
321 | " # the y-value of the cursor\n",
322 | " selectors = alt.Chart(cdfs).mark_point().encode(\n",
323 | " y=ylab+\":Q\",\n",
324 | " opacity=alt.value(0),\n",
325 | " ).properties(\n",
326 | " selection=nearest\n",
327 | " )\n",
328 | "\n",
329 | " # Draw text labels near the points, and highlight based on selection\n",
330 | " text = lines.mark_text(align='left', dx=5, dy=-5).encode(\n",
331 | " text=alt.condition(nearest,\n",
332 | " alt.Text(xlab+\":Q\", format=\".2f\"),\n",
333 | " alt.value(' '))\n",
334 | " )\n",
335 | "\n",
336 | " # Draw a rule at the location of the selection\n",
337 | " rules = alt.Chart(cdfs).mark_rule(color='gray').encode(\n",
338 | " y=ylab+\":Q\",\n",
339 | " ).transform_filter(\n",
340 | " nearest.ref()\n",
341 | " )\n",
342 | " # Draw points on the line, and highlight based on selection\n",
343 | " points = lines.mark_point().encode(\n",
344 | " opacity=alt.condition(nearest, alt.value(1), alt.value(0))\n",
345 | " )\n",
346 | " # Put the five layers into a chart and bind the data\n",
347 | " return alt.layer(lines, selectors, rules, text, points).interactive(\n",
348 | " bind_y=False)\n",
349 | " \n",
350 | "# Functions to create the query\n",
351 | "\n",
352 | "def query_machine_capacity(cell):\n",
353 | " return '''\n",
354 | "SELECT machine_id, MAX(capacity.cpus) AS cpu_cap,\n",
355 | " MAX(capacity.memory) AS memory_cap\n",
356 | "FROM `google.com:google-cluster-data`.clusterdata_2019_{cell}.machine_events\n",
357 | "GROUP BY 1\n",
358 | " '''.format(cell=cell)\n",
359 | "\n",
360 | "def query_top_level_instance_usage(cell):\n",
361 | " return '''\n",
362 | "SELECT CAST(FLOOR(start_time/(1e6 * 300)) * (1000000 * 300) AS INT64) AS time,\n",
363 | " collection_id,\n",
364 | " instance_index,\n",
365 | " machine_id,\n",
366 | " average_usage.cpus AS cpu_usage,\n",
367 | " average_usage.memory AS memory_usage\n",
368 | "FROM `google.com:google-cluster-data`.clusterdata_2019_{cell}.instance_usage\n",
369 | "WHERE (alloc_collection_id IS NULL OR alloc_collection_id = 0)\n",
370 | " AND (end_time - start_time) >= (5 * 60 * 1e6)\n",
371 | " '''.format(cell=cell)\n",
372 | "\n",
373 | "def query_machine_usage(cell):\n",
374 | " return '''\n",
375 | "SELECT u.time AS time,\n",
376 | " u.machine_id AS machine_id,\n",
377 | " SUM(u.cpu_usage) AS cpu_usage,\n",
378 | " SUM(u.memory_usage) AS memory_usage,\n",
379 | " MAX(m.cpu_cap) AS cpu_capacity,\n",
380 | " MAX(m.memory_cap) AS memory_capacity\n",
381 | "FROM ({instance_usage}) AS u JOIN\n",
382 | " ({machine_capacity}) AS m\n",
383 | "ON u.machine_id = m.machine_id\n",
384 | "GROUP BY 1, 2\n",
385 | " '''.format(instance_usage = query_top_level_instance_usage(cell),\n",
386 | " machine_capacity = query_machine_capacity(cell))\n",
387 | " \n",
388 | "def query_machine_utilization_distribution(cell):\n",
389 | " return '''\n",
390 | "SELECT APPROX_QUANTILES(IF(cpu_usage > cpu_capacity, 1.0, cpu_usage / cpu_capacity), 100) AS cpu_util_dist,\n",
391 | " APPROX_QUANTILES(IF(memory_usage > memory_capacity, 1.0, memory_usage / memory_capacity), 100) AS memory_util_dist\n",
392 | "FROM ({table})\n",
393 | " '''.format(table = query_machine_usage(cell))\n",
394 | "\n",
395 | "cell = 'd' #@param ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']\n",
396 | "query = query_machine_utilization_distribution(cell)\n",
397 | "machine_util_dist = client.query(query).to_dataframe()\n",
398 | "plot_cdfs(pick_quantiles_from_wide_dataframe(machine_util_dist, ['cpu_util_dist', 'memory_util_dist'], ['CPU', 'Memory']), xlab='x - resource utilization (%)', ylab=\"Probability (resource utilization < x)\", interactive=True)"
399 | ],
400 | "execution_count": null,
401 | "outputs": []
402 | }
403 | ]
404 | }
--------------------------------------------------------------------------------
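The colab above buckets instances into tiers with a SQL CASE over scheduling priority (0-99 free, 100-115 best-effort batch, 116-119 mid, everything else production), and the power trace colab that follows reuses the same buckets. If you pull raw priorities into pandas instead, the mapping can be mirrored client-side; a small sketch (the function name is ours, the labels come from the queries):

```python
def priority_tier(priority: int) -> str:
    # Mirrors the CASE expression used in the colab queries.
    if 0 <= priority <= 99:
        return '1_free'
    if 100 <= priority <= 115:
        return '2_beb'  # best-effort batch
    if 116 <= priority <= 119:
        return '3_mid'
    return '4_prod'  # everything else, matching the SQL ELSE branch

print([priority_tier(p) for p in (0, 103, 118, 200)])
# ['1_free', '2_beb', '3_mid', '4_prod']
```

With a dataframe in hand, the equivalent of the SQL bucketing is e.g. `df['tier'] = df['priority'].map(priority_tier)`.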
/power_trace_analysis_colab.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": []
7 | },
8 | "kernelspec": {
9 | "name": "python3",
10 | "display_name": "Python 3"
11 | },
12 | "language_info": {
13 | "name": "python"
14 | }
15 | },
16 | "cells": [
17 | {
18 | "cell_type": "markdown",
19 | "source": [
20 | "# Google Data Center Power Trace Analysis\n",
21 | "\n",
22 | "This colab demonstrates querying the Google data center power traces with bigquery, visualizing them with [Altair](https://altair-viz.github.io/), and analyzing them in conjunction with the 2019 Google cluster data.\n",
23 | "\n",
24 | "**Important:** in order to be able to run the queries you will need to:\n",
25 | "\n",
26 |         "1. Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to create a Cloud Platform project if you do not already have one.\n",
27 | "1. [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.\n",
28 | "1. [Enable BigQuery](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) APIs for the project."
29 | ],
30 | "metadata": {
31 | "id": "n86D0dGs4Lh0"
32 | }
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "source": [
37 | "To begin with, we'll authenticate with GCP and import the python libraries necessary to execute this colab."
38 | ],
39 | "metadata": {
40 | "id": "c-ipGY9-arep"
41 | }
42 | },
43 | {
44 | "cell_type": "code",
45 | "source": [
46 | "#@title Please input your project id\n",
47 | "import altair as alt\n",
48 | "import numpy as np\n",
49 | "import pandas as pd\n",
50 | "from google.cloud import bigquery\n",
51 | "# Provide credentials to the runtime\n",
52 | "from google.colab import auth\n",
53 | "from google.cloud.bigquery import magics\n",
54 | "\n",
55 | "auth.authenticate_user()\n",
56 | "print('Authenticated')\n",
57 | "project_id = 'google.com:google-cluster-data' #@param {type: \"string\"}\n",
58 | "# Set the default project id for %bigquery magic\n",
59 | "magics.context.project = project_id\n",
60 | "\n",
61 | "# Use the client to run queries constructed from a more complicated function.\n",
62 | "client = bigquery.Client(project=project_id)\n"
63 | ],
64 | "metadata": {
65 | "id": "CEhNsC1OPajn"
66 | },
67 | "execution_count": null,
68 | "outputs": []
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "source": [
73 | "## Basic Queries"
74 | ],
75 | "metadata": {
76 | "id": "tLOi9ZDM46oa"
77 | }
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "source": [
82 | "Here are some examples of using the [bigquery magic](https://cloud.google.com/python/docs/reference/bigquery/latest/index.html) to query the power traces.\n",
83 | "\n",
84 | "First we'll calculate the average production utilization for a single power domain.\n"
85 | ],
86 | "metadata": {
87 | "id": "nqX1P6QOOUpt"
88 | }
89 | },
90 | {
91 | "cell_type": "code",
92 | "source": [
93 | "%%bigquery\n",
94 | "SELECT\n",
95 | " AVG(production_power_util) AS average_production_power_util\n",
96 | "FROM `google.com:google-cluster-data`.powerdata_2019.cella_pdu10"
97 | ],
98 | "metadata": {
99 | "id": "63jcfAIdWfqN"
100 | },
101 | "execution_count": null,
102 | "outputs": []
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "source": [
107 | "\n",
108 | "Now let's find the minimum and maximum measured power utilization for each cell. We use bigquery [wildcard tables](https://cloud.google.com/bigquery/docs/querying-wildcard-tables) in order to conveniently query all trace tables at once."
109 | ],
110 | "metadata": {
111 | "id": "FxNECyAlVJR3"
112 | }
113 | },
114 | {
115 | "cell_type": "code",
116 | "source": [
117 | "%%bigquery\n",
118 | "SELECT\n",
119 | " cell,\n",
120 | " MIN(measured_power_util) AS minimum_measured_power_util,\n",
121 | " MAX(measured_power_util) AS maximum_measured_power_util\n",
122 | "FROM `google.com:google-cluster-data.powerdata_2019.cell*`\n",
123 | "GROUP BY cell"
124 | ],
125 | "metadata": {
126 | "id": "YqpYZTniOq-t"
127 | },
128 | "execution_count": null,
129 | "outputs": []
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "source": [
134 | "Modifying the previous query to also group by `pdu` gives us the maximum and minimum measured power utilization per power domain."
135 | ],
136 | "metadata": {
137 | "id": "z4PECIQ0OGYI"
138 | }
139 | },
140 | {
141 | "cell_type": "code",
142 | "source": [
143 | "%%bigquery\n",
144 | "SELECT\n",
145 | " cell,\n",
146 | " pdu,\n",
147 | " MIN(measured_power_util) AS minimum_measured_power_util,\n",
148 | " MAX(measured_power_util) AS maximum_measured_power_util\n",
149 | "FROM `google.com:google-cluster-data.powerdata_2019.cell*`\n",
150 | "GROUP BY cell, pdu\n",
151 | "ORDER BY maximum_measured_power_util"
152 | ],
153 | "metadata": {
154 | "id": "23wOHXUpNvNS"
155 | },
156 | "execution_count": null,
157 | "outputs": []
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "source": [
162 | "## Measured and Production Power over Time"
163 | ],
164 | "metadata": {
165 | "id": "NvZLmc5TYB76"
166 | }
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "source": [
171 | "We provide traces for two clusters of Google's new Medium Voltage Power Plane (MVPP) data center design, described in [the paper](https://research.google/pubs/pub49032/). Let's plot the measured and estimated production power utilization of one of these MVPPs: mvpp1. We'll limit this visualization to the first 15 days of the trace period (the first 4320 datapoints of the trace)."
172 | ],
173 | "metadata": {
174 | "id": "i89hzgdGbrrI"
175 | }
176 | },
177 | {
178 | "cell_type": "code",
179 | "source": [
180 | "%%bigquery mvpp1_df\n",
181 | "SELECT\n",
182 | " time,\n",
183 | " measured_power_util,\n",
184 | " production_power_util\n",
185 | "FROM `google.com:google-cluster-data`.powerdata_2019.celli_mvpp1\n",
186 | "ORDER BY time\n",
187 | "LIMIT 4320"
188 | ],
189 | "metadata": {
190 | "id": "WhMI0Y8sYBMU"
191 | },
192 | "execution_count": null,
193 | "outputs": []
194 | },
195 | {
196 | "cell_type": "code",
197 | "source": [
198 | "alt.Chart(mvpp1_df).mark_line().transform_fold(\n",
199 | " [\"measured_power_util\", \"production_power_util\"]).encode(\n",
200 | " x=\"time:Q\",\n",
201 |     y=alt.Y(\"value:Q\", scale=alt.Scale(zero=False)),\n",
202 | " color=\"key:N\"\n",
203 | ").properties(width=700, height=75)"
204 | ],
205 | "metadata": {
206 | "id": "K07qANn2Zz_o"
207 | },
208 | "execution_count": null,
209 | "outputs": []
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "source": [
214 | "## CPU and Power over Time"
215 | ],
216 | "metadata": {
217 | "id": "bdzTheN-XdHv"
218 | }
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "source": [
223 | "As an example of joining the power traces with the cluster traces, we'll plot the average CPU utilization and power utilization per hour.\n",
224 | "\n",
225 |         "You may remember this query from the [cluster analysis colab](https://github.com/google/cluster-data/blob/master/clusterdata_analysis_colab.ipynb). It's been modified slightly; see `query_per_tier_utilization_time_series` in particular."
226 | ],
227 | "metadata": {
228 | "id": "TL7Dh7-TFsK7"
229 | }
230 | },
231 | {
232 | "cell_type": "code",
233 | "source": [
234 | "def machines_in_pdu(pdu_number):\n",
235 | " return '''\n",
236 | "SELECT machine_id\n",
237 | "FROM `google.com:google-cluster-data`.powerdata_2019.machine_to_pdu_mapping\n",
238 | "WHERE pdu = 'pdu{pdu_number}'\n",
239 | " '''.format(pdu_number=pdu_number)\n",
240 | "\n",
241 | "def query_cell_capacity(cell, pdu_number):\n",
242 | " return '''\n",
243 | "SELECT SUM(cpu_cap) AS cpu_capacity\n",
244 | "FROM (\n",
245 | " SELECT machine_id, MAX(capacity.cpus) AS cpu_cap,\n",
246 | " FROM `google.com:google-cluster-data`.clusterdata_2019_{cell}.machine_events\n",
247 | " WHERE machine_id IN ({machine_query})\n",
248 | " GROUP BY 1\n",
249 | ")\n",
250 | " '''.format(cell=cell, machine_query=machines_in_pdu(pdu_number))\n",
251 | "\n",
252 | "def query_per_instance_usage_priority(cell, pdu_num):\n",
253 | " return '''\n",
254 | "SELECT u.time AS time,\n",
255 | " u.collection_id AS collection_id,\n",
256 | " u.instance_index AS instance_index,\n",
257 | " e.priority AS priority,\n",
258 | " CASE\n",
259 | " WHEN e.priority BETWEEN 0 AND 99 THEN '1_free'\n",
260 | " WHEN e.priority BETWEEN 100 AND 115 THEN '2_beb'\n",
261 | " WHEN e.priority BETWEEN 116 AND 119 THEN '3_mid'\n",
262 | " ELSE '4_prod'\n",
263 | " END AS tier,\n",
264 | " u.cpu_usage AS cpu_usage\n",
265 | "FROM (\n",
266 | " SELECT start_time AS time,\n",
267 | " collection_id,\n",
268 | " instance_index,\n",
269 | " machine_id,\n",
270 | " average_usage.cpus AS cpu_usage\n",
271 | " FROM `google.com:google-cluster-data`.clusterdata_2019_{cell}.instance_usage\n",
272 | " WHERE (alloc_collection_id IS NULL OR alloc_collection_id = 0)\n",
273 | " AND (end_time - start_time) >= (5 * 60 * 1e6)\n",
274 | ") AS u JOIN (\n",
275 | " SELECT collection_id, instance_index, machine_id,\n",
276 | " MAX(priority) AS priority\n",
277 | " FROM `google.com:google-cluster-data`.clusterdata_2019_{cell}.instance_events\n",
278 | " WHERE (alloc_collection_id IS NULL OR alloc_collection_id = 0)\n",
279 | " AND machine_id IN ({machine_query})\n",
280 | " GROUP BY 1, 2, 3\n",
281 | ") AS e ON u.collection_id = e.collection_id\n",
282 | " AND u.instance_index = e.instance_index\n",
283 | " AND u.machine_id = e.machine_id\n",
284 | " '''.format(cell=cell, machine_query=machines_in_pdu(pdu_num))\n",
285 | "\n",
286 | "def query_per_tier_utilization_time_series(cell, pdu_num, cpu_capacity):\n",
287 | " return '''\n",
288 | "SELECT * FROM (\n",
289 | " SELECT CAST(FLOOR(time/(1e6 * 60 * 60)) AS INT64) AS hour_index,\n",
290 | " tier,\n",
291 | " SUM(cpu_usage) / (12 * {cpu_capacity}) AS avg_cpu_usage,\n",
292 | " FROM ({table})\n",
293 | " GROUP BY 1, 2)\n",
294 | "JOIN (\n",
295 | " SELECT CAST(FLOOR(time/(1e6 * 60 * 60)) AS INT64) AS hour_index,\n",
296 | " pdu,\n",
297 | " AVG(measured_power_util) as avg_measured_power_util,\n",
298 | " AVG(production_power_util) AS avg_production_power_util\n",
299 | " FROM `google.com:google-cluster-data`.`powerdata_2019.cell{cell}_pdu{pdu_num}`\n",
300 | " GROUP BY hour_index, pdu\n",
301 | ") USING (hour_index)\n",
302 | " '''.format(table=query_per_instance_usage_priority(cell, pdu_num),\n",
303 | " cpu_capacity=cpu_capacity, cell=cell, pdu_num=pdu_num)\n",
304 | "\n",
305 | "def run_query_utilization_per_time_time_series(cell, pdu_num):\n",
306 | " cell_cap = client.query(query_cell_capacity(cell, pdu_num)).to_dataframe()\n",
307 | " query = query_per_tier_utilization_time_series(\n",
308 | " cell,\n",
309 | " pdu_num,\n",
310 | " cell_cap['cpu_capacity'][0])\n",
311 | " time_series = client.query(query).to_dataframe()\n",
312 | " return time_series\n",
313 | "\n",
314 | "CELL='f'\n",
315 | "PDU_NUM='17'\n",
316 | "hourly_usage = run_query_utilization_per_time_time_series(CELL, PDU_NUM)"
317 | ],
318 | "metadata": {
319 | "id": "Bmp8Z2diaCPC"
320 | },
321 | "execution_count": null,
322 | "outputs": []
323 | },
324 | {
325 | "cell_type": "markdown",
326 | "source": [
327 | "Plot power utilization on top of the CPU utilization graph."
328 | ],
329 | "metadata": {
330 | "id": "E47jgIjzGnDP"
331 | }
332 | },
333 | {
334 | "cell_type": "code",
335 | "source": [
336 | "# CPU graph\n",
337 | "cpu = alt.Chart().mark_area().encode(\n",
338 | " alt.X('hour_index:N'),\n",
339 | " alt.Y('avg_cpu_usage:Q'),\n",
340 | " color=alt.Color('tier', legend=alt.Legend(orient=\"left\", title=None)),\n",
341 | " order=alt.Order('tier', sort='descending'),\n",
342 | " tooltip=['tier:N', 'avg_cpu_usage:Q']\n",
343 | " )\n",
344 | "cpu.encoding.x.title = \"Hour\"\n",
345 | "cpu.encoding.y.title = \"Average Utilization\"\n",
346 | "\n",
347 | "\n",
348 | "# Power Utilization graph\n",
349 | "pu = (\n",
350 | " alt.Chart()\n",
351 | " .transform_fold(['avg_measured_power_util', 'avg_production_power_util'])\n",
352 | " .encode(\n",
353 | " alt.X(\n",
354 | " 'hour_index:N',\n",
355 | " axis=alt.Axis(labels=False, domain=False, ticks=False),\n",
356 | " ),\n",
357 | " alt.Y('value:Q'),\n",
358 | " color=alt.Color('key:N', legend=None),\n",
359 | " strokeDash=alt.StrokeDash('key:N', legend=None),\n",
360 | " tooltip=['hour_index:N', 'key:N', 'value:Q']\n",
361 | " )\n",
362 |     .mark_line()\n",
363 | ")\n",
364 | "\n",
365 | "\n",
366 | "alt.layer(cpu, pu, data=hourly_usage).properties(\n",
367 | " width=1200,\n",
368 | " height=300,\n",
369 | " title=\"Average CPU and Power Utilization\").configure_axis(grid=False)"
370 | ],
371 | "metadata": {
372 | "id": "bUCQxV7-iDPm"
373 | },
374 | "execution_count": null,
375 | "outputs": []
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "source": [
380 | "We can adapt the previous queries to calculate the average CPU and power utilizations per day by tier (i.e. cappable workloads or production) for each PDU in cell `b`."
381 | ],
382 | "metadata": {
383 | "id": "G0qxiNg96pq3"
384 | }
385 | },
386 | {
387 | "cell_type": "code",
388 | "source": [
389 | "%%bigquery cluster_and_power_data_df\n",
390 | "WITH\n",
391 | " machines_in_cell AS (\n",
392 | " SELECT machine_id, pdu, cell\n",
393 | " FROM `google.com:google-cluster-data`.powerdata_2019.machine_to_pdu_mapping\n",
394 | " WHERE cell = 'b'\n",
395 | " ),\n",
396 | " cpu_capacities AS (\n",
397 | " SELECT pdu, cell, SUM(cpu_cap) AS cpu_capacity\n",
398 | " FROM\n",
399 | " (\n",
400 | " SELECT machine_id, MAX(capacity.cpus) AS cpu_cap,\n",
401 | " FROM `google.com:google-cluster-data`.clusterdata_2019_b.machine_events\n",
402 | " GROUP BY 1\n",
403 | " )\n",
404 | " JOIN machines_in_cell\n",
405 | " USING (machine_id)\n",
406 | " GROUP BY 1, 2\n",
407 | " ),\n",
408 | " per_instance_usage_priority AS (\n",
409 | " SELECT\n",
410 | " u.time AS time,\n",
411 | " u.collection_id AS collection_id,\n",
412 | " u.instance_index AS instance_index,\n",
413 | " e.priority AS priority,\n",
414 | " IF(e.priority < 120, 'cappable', 'production') AS tier,\n",
415 | " u.cpu_usage AS cpu_usage,\n",
416 | " m.pdu\n",
417 | " FROM\n",
418 | " (\n",
419 | " SELECT\n",
420 | " start_time AS time,\n",
421 | " collection_id,\n",
422 | " instance_index,\n",
423 | " machine_id,\n",
424 | " average_usage.cpus AS cpu_usage\n",
425 | " FROM `google.com:google-cluster-data`.clusterdata_2019_b.instance_usage\n",
426 | " WHERE\n",
427 | " (alloc_collection_id IS NULL OR alloc_collection_id = 0)\n",
428 | " AND (end_time - start_time) >= (5 * 60 * 1e6)\n",
429 | " ) AS u\n",
430 | " JOIN\n",
431 | " (\n",
432 | " SELECT collection_id, instance_index, machine_id, MAX(priority) AS priority\n",
433 | " FROM `google.com:google-cluster-data`.clusterdata_2019_b.instance_events\n",
434 | " WHERE (alloc_collection_id IS NULL OR alloc_collection_id = 0)\n",
435 | " GROUP BY 1, 2, 3\n",
436 | " ) AS e\n",
437 | " ON\n",
438 | " u.collection_id = e.collection_id\n",
439 | " AND u.instance_index = e.instance_index\n",
440 | " AND u.machine_id = e.machine_id\n",
441 | " JOIN machines_in_cell AS m\n",
442 | " ON m.machine_id = u.machine_id\n",
443 | " )\n",
444 | "SELECT *\n",
445 | "FROM\n",
446 | " (\n",
447 | " SELECT\n",
448 | " CAST(FLOOR(time / (1e6 * 60 * 60 * 24)) AS INT64) AS day_index,\n",
449 | " pdu,\n",
450 | " tier,\n",
451 | " SUM(cpu_usage) / (12 * 24 * ANY_VALUE(cpu_capacity)) AS avg_cpu_usage,\n",
452 | " FROM per_instance_usage_priority\n",
453 | " JOIN cpu_capacities\n",
454 | " USING (pdu)\n",
455 | " GROUP BY 1, 2, 3\n",
456 | " )\n",
457 | "JOIN\n",
458 | " (\n",
459 | " SELECT\n",
460 | " CAST(FLOOR((time - 6e8 + 3e8) / (1e6 * 60 * 60 * 24)) AS INT64) AS day_index,\n",
461 | " pdu,\n",
462 | " AVG(measured_power_util) AS avg_measured_power_util,\n",
463 | " AVG(production_power_util) AS avg_production_power_util\n",
464 | " FROM `google.com:google-cluster-data`.`powerdata_2019.cellb_pdu*`\n",
465 | " GROUP BY 1, 2\n",
466 | " )\n",
467 | " USING (pdu, day_index)"
468 | ],
469 | "metadata": {
470 | "id": "Id_W7pdF6xK9"
471 | },
472 | "execution_count": null,
473 | "outputs": []
474 | },
475 | {
476 | "cell_type": "code",
477 | "source": [
478 | "cluster_and_power_data_df.describe()"
479 | ],
480 | "metadata": {
481 | "id": "-DFupvCbKniC"
482 | },
483 | "execution_count": null,
484 | "outputs": []
485 | },
486 | {
487 | "cell_type": "markdown",
488 | "source": [
489 | "### Be careful joining on `time` directly!\n",
490 |         "Note that in the queries above we converted `time` from the cluster and power datasets to an `hour_index`, which we then use to join the two datasets. We don't want to join on `time` directly because of how the datasets are structured.\n",
491 | "\n",
492 | "The power trace `time` values are each aligned to the 5-minute mark. For example, there's a `time` at 600s, 900s, 1200s but never 700s or 601s. The cluster trace `time` values have no specific alignment. If we were to join the two datasets on `time`, we'd end up dropping a lot of data!"
493 | ],
494 | "metadata": {
495 | "id": "ok8vgKFvCCgt"
496 | }
497 | },
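The hazard can be sketched in a few lines of plain Python (a toy illustration with made-up timestamps and utilization values, not trace data):

```python
# Toy data: power samples sit exactly on 5-minute marks, while cluster
# samples land at arbitrary microsecond offsets.
US_PER_HOUR = 1_000_000 * 60 * 60

power   = [(600 * 1_000_000, 0.55), (900 * 1_000_000, 0.57)]  # (time_us, power_util)
cluster = [(601 * 1_000_000, 0.30), (905 * 1_000_000, 0.32)]  # (time_us, cpu_usage)

# Equi-join on the raw timestamps: nothing coincides, so every row is dropped.
raw_join = [(p, c) for p in power for c in cluster if p[0] == c[0]]
assert raw_join == []

# Bucket both sides into an hour_index first, then join on that instead.
hour_index = lambda t: t // US_PER_HOUR
bucketed_join = [(p, c) for p in power for c in cluster
                 if hour_index(p[0]) == hour_index(c[0])]
print(len(bucketed_join))  # all four samples share hour_index 0
```

The queries above do the equivalent bucketing in SQL by flooring `time` into a fixed-width window before joining.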
498 | {
499 | "cell_type": "markdown",
500 | "source": [
501 |         "You may have also noticed that `day_index` has 32 unique values in the query above, despite May having 31 days. This is because timestamps in the datasets are represented as microseconds since 600 seconds before the start of the trace period, May 01 2019 at 00:00 PT."
502 | ],
503 | "metadata": {
504 | "id": "ifMNifTPODkO"
505 | }
506 | },
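A quick sketch of the arithmetic, using only the 600-second offset and the 31-day trace window stated above:

```python
# Why day_index spans 32 values for a 31-day trace: the trace window starts
# 600 s after the timestamp epoch, so it straddles one extra day bucket.
US_PER_DAY = 1_000_000 * 60 * 60 * 24
OFFSET_US  = 600 * 1_000_000                 # trace begins 600 s after its epoch

first_ts = OFFSET_US                         # earliest possible timestamp
last_ts  = OFFSET_US + 31 * US_PER_DAY - 1   # final microsecond of the window

day_index = lambda t: t // US_PER_DAY
print(day_index(first_ts), day_index(last_ts))  # 0 31, i.e. 32 distinct buckets
```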
507 | {
508 | "cell_type": "markdown",
509 | "source": [
510 | "## Recreating Graphs from the Paper"
511 | ],
512 | "metadata": {
513 | "id": "wbXuUBygOu5-"
514 | }
515 | },
516 | {
517 | "cell_type": "markdown",
518 | "source": [
519 | "Below, the power traces are used to re-create figures from [the paper](https://research.google/pubs/pub49032/)."
520 | ],
521 | "metadata": {
522 | "id": "aARdyheiRgVp"
523 | }
524 | },
525 | {
526 | "cell_type": "code",
527 | "source": [
528 | "def histogram_data(filter='pdu%', agg_by='pdu', util_type='measured_power_util'):\n",
529 | " query = \"\"\"\n",
530 | " SELECT bin as bins, SUM(count) as counts\n",
531 | " FROM (\n",
532 | " SELECT {agg_by},\n",
533 | " ROUND(CAST({util_type} / 0.0001 as INT64) * 0.0001, 3) as bin,\n",
534 | " COUNT(*) as count\n",
535 | " FROM (\n",
536 | " SELECT {agg_by},\n",
537 | " time,\n",
538 | " {util_type},\n",
539 | " FROM `google.com:google-cluster-data`.`powerdata_2019.cell*`\n",
540 | " WHERE NOT bad_measurement_data\n",
541 | " AND NOT bad_production_power_data\n",
542 | " AND {agg_by} LIKE '{filter}'\n",
543 | " AND NOT cell in ('i', 'j')\n",
544 | " )\n",
545 | " GROUP BY 1, 2\n",
546 | " ) GROUP BY 1 ORDER BY 1;\n",
547 | " \"\"\".format(**{'filter': filter, 'agg_by': agg_by, 'util_type': util_type})\n",
548 | " return client.query(query).to_dataframe()"
549 | ],
550 | "metadata": {
551 | "id": "LFuWscTAOyE3"
552 | },
553 | "execution_count": null,
554 | "outputs": []
555 | },
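As a rough Python mirror of the `bin` expression in `histogram_data` (an illustrative sketch, assuming BigQuery's `CAST(x AS INT64)` rounds to the nearest integer rather than truncating):

```python
# Sketch of the SQL binning above: snap each utilization sample into a
# 0.0001-wide bin, then round the bin label to 3 decimal places.
# round() is used here to mirror BigQuery's round-to-nearest CAST semantics.
def to_bin(util):
    return round(round(util / 0.0001) * 0.0001, 3)

samples = [0.42317, 0.42329, 0.9]
print([to_bin(u) for u in samples])  # [0.423, 0.423, 0.9]
```

The `SUM(count) ... GROUP BY bin` layer then merges the per-PDU (or per-cell) counts for each bin into one histogram.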
556 | {
557 | "cell_type": "code",
558 | "source": [
559 | "pdu_df = histogram_data()\n",
560 | "cluster_df = histogram_data('%', 'cell')\n",
561 | "prod_pdu_df = histogram_data(util_type='production_power_util')\n",
562 | "prod_cluster_df = histogram_data('%', 'cell', util_type='production_power_util')"
563 | ],
564 | "metadata": {
565 | "id": "62vMXuSYUIG8"
566 | },
567 | "execution_count": null,
568 | "outputs": []
569 | },
570 | {
571 | "cell_type": "code",
572 | "source": [
573 | "def make_cdf(p_df, c_df, title):\n",
574 | " pdu_counts, pdu_bins = p_df.counts, p_df.bins\n",
575 | " cluster_counts, cluster_bins = c_df.counts, c_df.bins\n",
576 | "\n",
577 | " pdu_cdf = np.cumsum (list(pdu_counts))\n",
578 | " pdu_cdf = (1.0 * pdu_cdf) / pdu_cdf[-1]\n",
579 | " cluster_cdf = np.cumsum (list(cluster_counts))\n",
580 | " cluster_cdf = (1.0 * cluster_cdf) / cluster_cdf[-1]\n",
581 | "\n",
582 | " pdu_cdf_graph = alt.Chart(pd.DataFrame(\n",
583 | " {'bins': pdu_bins, 'cdf': pdu_cdf})).mark_point(size=.1).encode(\n",
584 | " x=alt.X('bins', scale=alt.Scale(domain=[0.4, 0.90])),\n",
585 | " y=alt.Y('cdf'),\n",
586 | " color=alt.value('steelblue')\n",
587 | " )\n",
588 | "\n",
589 | " cluster_cdf_graph = alt.Chart(pd.DataFrame(\n",
590 | " {'bins': cluster_bins, 'cdf': cluster_cdf})).mark_point(size=.1).encode(\n",
591 | " x=alt.X('bins', scale=alt.Scale(domain=[0.4, 0.90])),\n",
592 | " y=alt.Y('cdf'),\n",
593 | " color=alt.value('forestgreen')\n",
594 | " )\n",
595 | "\n",
596 | " return (pdu_cdf_graph + cluster_cdf_graph).properties(title=title)\n",
597 | "\n",
598 | "\n",
599 | "alt.hconcat(make_cdf(pdu_df, cluster_df, \"Measured Power Util\"), make_cdf(\n",
600 | " prod_pdu_df, prod_cluster_df, \"Production Power Util\"))"
601 | ],
602 | "metadata": {
603 | "id": "eEze9O1CUlSU"
604 | },
605 | "execution_count": null,
606 | "outputs": []
607 | }
608 | ]
609 | }
--------------------------------------------------------------------------------
/bibliography.bib:
--------------------------------------------------------------------------------
1 | ################################################################
2 | # Introduction
3 | ################################################################
4 |
5 | This bibliography is a resource for people writing papers that refer
6 | to the Google cluster traces. It covers papers that analyze the
7 | traces, as well as ones that use them as inputs to other studies.
8 |
9 | * I recommend using \usepackage{url}.
10 | * Entries are in publication-date order, with the most recent at the top.
11 | * Bibtex ignores stuff that is outside the entries, so text like this is safe.
12 |
13 | The following are the RECOMMENDED CITATIONS if you just need the basics:
14 |
15 | * Borg:
16 | * \cite{clusterdata:Verma2015, clusterdata:Tirmazi2020} for Borg itself
17 | * 2019 traces:
18 | * \cite{clusterdata:Wilkes2020, clusterdata:Wilkes2020a, clusterdata:Tirmazi2020} for
19 |     the complete set of info about the trace itself.
20 | * \cite{clusterdata:Wilkes2020} for the 2019 trace announcement
21 | * \cite{clusterdata:Wilkes2020a} for the details about the 2019 trace contents
22 | * \cite{clusterdata:Tirmazi2020} for the EuroSys paper about the 2019 and 2011 traces
23 | * 2011 trace:
24 | * \cite{clusterdata:Wilkes2011, clusterdata:Reiss2011} for the trace itself
25 | * \cite{clusterdata:Reiss2012b} for the first thorough analysis of it.
26 |
27 | If you use the traces, please send a bibtex entry that looks *exactly* like one
28 | of these to johnwilkes@google.com, so your paper can be added - and cited! A
29 | GitHub pull request is the best format.
30 |
31 | ################################################################
32 | # Trace-announcements
33 | ################################################################
34 |
35 | These entries can be used to cite the traces themselves.
36 |
37 | # The May 2019 traces.
38 | # Use clusterdata:Tirmazi2020 for the first paper to analyze them.
39 |
40 | # This is the formal announcement of the trace:
41 | @Misc{clusterdata:Wilkes2020,
42 | author = {John Wilkes},
43 | title = {Yet more {Google} compute cluster trace data},
44 | howpublished = {Google research blog},
45 | month = Apr,
46 | year = 2020,
47 | address = {Mountain View, CA, USA},
48 | note = {Posted at \url{https://ai.googleblog.com/2020/04/yet-more-google-compute-cluster-trace.html}.},
49 | }
50 |
51 | # If you want to cite details about the trace itself:
52 | @TechReport{clusterdata:Wilkes2020a,
53 | author = {John Wilkes},
54 | title = {{Google} cluster-usage traces v3},
55 | institution = {Google Inc.},
56 | year = 2020,
57 | month = Apr,
58 | type = {Technical Report},
59 | address = {Mountain View, CA, USA},
60 | note = {Posted at \url{https://github.com/google/cluster-data/blob/master/ClusterData2019.md}},
61 | abstract = {
62 | This document describes the semantics, data format, and
63 | schema of usage traces of a few Google compute cells.
64 | This document describes version 3 of the trace format.},
65 | }
66 |
67 | #----------------
68 | The next couple are for the May 2011 "full" trace.
69 | @Misc{clusterdata:Wilkes2011,
70 | author = {John Wilkes},
71 | title = {More {Google} cluster data},
72 | howpublished = {Google research blog},
73 | month = Nov,
74 | year = 2011,
75 | address = {Mountain View, CA, USA},
76 | note = {Posted at \url{http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html}.},
77 | }
78 |
79 | @TechReport{clusterdata:Reiss2011,
80 | author = {Charles Reiss and John Wilkes and Joseph L. Hellerstein},
81 | title = {{Google} cluster-usage traces: format + schema},
82 | institution = {Google Inc.},
83 | year = 2011,
84 | month = Nov,
85 | type = {Technical Report},
86 | address = {Mountain View, CA, USA},
87 | note = {Revised 2014-11-17 for version 2.1. Posted at
88 | \url{https://github.com/google/cluster-data}},
89 | }
90 |
91 |
92 | #----------------
93 | # The next one is for the earlier "small" 7-hour trace.
94 | # (Most people should not be using this.)
95 |
96 | @Misc{clusterdata:Hellerstein2010,
97 | author = {Joseph L. Hellerstein},
98 | title = {{Google} cluster data},
99 | howpublished = {Google research blog},
100 | month = Jan,
101 | year = 2010,
102 | note = {Posted at \url{http://googleresearch.blogspot.com/2010/01/google-cluster-data.html}.},
103 | }
104 |
105 | #----------------
106 | The canonical Borg paper.
107 | @inproceedings{clusterdata:Verma2015,
108 | title = {Large-scale cluster management at {Google} with {Borg}},
109 | author = {Abhishek Verma and Luis Pedrosa and Madhukar R. Korupolu and David Oppenheimer and Eric Tune and John Wilkes},
110 | year = {2015},
111 | booktitle = {Proceedings of the European Conference on Computer Systems (EuroSys'15)},
112 | address = {Bordeaux, France},
113 | articleno = {18},
114 | numpages = {17},
115 | abstract = {
116 | Google's Borg system is a cluster manager that runs hundreds of thousands of jobs,
117 | from many thousands of different applications, across a number of clusters each with
118 | up to tens of thousands of machines.
119 |
120 | It achieves high utilization by combining admission control, efficient task-packing,
121 | over-commitment, and machine sharing with process-level performance isolation.
122 | It supports high-availability applications with runtime features that minimize
123 | fault-recovery time, and scheduling policies that reduce the probability of correlated
124 | failures. Borg simplifies life for its users by offering a declarative job specification
125 | language, name service integration, real-time job monitoring, and tools to analyze and
126 | simulate system behavior.
127 |
128 | We present a summary of the Borg system architecture and features, important design
129 | decisions, a quantitative analysis of some of its policy decisions, and a qualitative
130 | examination of lessons learned from a decade of operational experience with it.},
131 | url = {https://dl.acm.org/doi/10.1145/2741948.2741964},
132 | doi = {10.1145/2741948.2741964},
133 | }
134 |
135 |
136 | #----------------
137 | The next paper describes the policy choices and technologies used to
138 | make the traces safe to release.
139 |
140 | @InProceedings{clusterdata:Reiss2012,
141 | author = {Charles Reiss and John Wilkes and Joseph L. Hellerstein},
142 | title = {Obfuscatory obscanturism: making workload traces of
143 | commercially-sensitive systems safe to release},
144 | year = 2012,
145 | booktitle = {3rd International Workshop on Cloud Management (CLOUDMAN)},
146 | month = Apr,
147 | publisher = {IEEE},
148 | pages = {1279--1286},
149 | address = {Maui, HI, USA},
150 | abstract = {Cloud providers such as Google are interested in fostering
151 | research on the daunting technical challenges they face in
152 | supporting planetary-scale distributed systems, but no
153 | academic organizations have similar scale systems on which to
154 | experiment. Fortunately, good research can still be done using
155 | traces of real-life production workloads, but there are risks
156 | in releasing such data, including inadvertently disclosing
157 | confidential or proprietary information, as happened with the
158 | Netflix Prize data. This paper discusses these risks, and our
159 | approach to them, which we call systematic obfuscation. It
160 | protects proprietary and personal data while leaving it
161 | possible to answer interesting research questions. We explain
162 | and motivate some of the risks and concerns and propose how
163 | they can best be mitigated, using as an example our recent
164 | publication of a month-long trace of a production system
165 | workload on a 11k-machine cluster.},
166 | url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6212064},
167 | }
168 |
169 | ################################################################
170 | # Trace-analysis papers
171 | ################################################################
172 |
173 | These papers are primarily about analyzing the traces.
174 | Order: most recent first.
175 |
176 | If you just want one citation about the Cluster2011 trace, then
177 | use \cite{clusterdata:Reiss2012b}.
178 |
179 |
180 | ################ 2025
181 | @InProceedings{clusterdata:Sliwko2025,
182 | author = {Sliwko, Leszek and Mizera-Pietraszko, Jolanta},
183 | title = {Enhancing Cluster Scheduling in {HPC}: A Continuous Transfer Learning for Real-Time Optimization},
184 | year = 2025,
185 | booktitle = {2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)},
186 | month = Jun,
187 | pages = {316--325},
188 | doi = {10.1109/IPDPSW66978.2025.00056},
189 | issn = {2995-066X},
190 | url = {https://ieeexplore.ieee.org/document/11105897},
191 | keywords = {Cloud computing;machine learning;load balancing and task assignment;transfer learning},
192 | abstract = {This study presents a machine learning-assisted approach to optimize task scheduling in cluster systems,
193 | focusing on node-affinity constraints. Traditional schedulers like Kubernetes struggle with real-time adaptability,
194 | whereas the proposed continuous transfer learning model evolves dynamically during operations, minimizing retraining
195 | needs. Evaluated on Google Cluster Data, the model achieves over 99\% accuracy, reducing computational overhead and
196 | improving scheduling latency for constrained tasks. This scalable solution enables real-time optimization, advancing
197 | machine learning integration in cluster management and paving the way for future adaptive scheduling strategies.},
198 | }
199 |
200 |
201 | ################ 2024
202 | @article{clusterdata:Sliwko2024,
203 | author = {Sliwko, Leszek},
204 | title = {Cluster Workload Allocation: A Predictive Approach Leveraging Machine Learning Efficiency},
205 | year = 2024,
206 | month = Dec,
207 | journal = {IEEE Access},
208 | volume = 12,
209 | pages = {194091--194107},
210 | doi = {10.1109/ACCESS.2024.3520422},
211 | issn = {2169-3536},
212 | url = {https://ieeexplore.ieee.org/document/10807210},
213 | keywords = {Machine learning;classification algorithms;load balancing and task assignment;Google Cluster Data},
214 | abstract = {This research investigates how Machine Learning (ML) algorithms can assist in workload allocation
215 | strategies by detecting tasks with node affinity operators (referred to as constraint operators), which constrain
216 | their execution to a limited number of nodes. Using real-world Google Cluster Data (GCD) workload traces and
217 | the AGOCS framework, the study extracts node attributes and task constraints, then analyses them to identify suitable
218 | node-task pairings. It focuses on tasks that can be executed on either a single node or fewer than a thousand out of
219 | 12.5k nodes in the analysed GCD cluster. Task constraint operators are compacted, pre-processed with one-hot
220 | encoding, and used as features in a training dataset. Various ML classifiers, including Artificial Neural Networks,
221 | K-Nearest Neighbours, Decision Trees, Naive Bayes, Ridge Regression, Adaptive Boosting, and Bagging, are fine-tuned
222 | and assessed for accuracy and F1-scores. The final ensemble voting classifier model achieved 98\% accuracy and a
223 | 1.5-1.8\% misclassification rate for tasks with a single suitable node.}
224 | }
225 |
226 |
227 | ################ 2023
228 | @INPROCEEDINGS{clusterdata:Tuns2023,
229 | author = {Tuns, Adrian-Ioan and Spătaru, Adrian},
230 | booktitle = {2023 25th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)},
231 | title = {Cloud Service Failure Prediction on {Google’s Borg} Cluster Traces Using Traditional Machine Learning},
232 | year = 2023,
233 | month = Sep,
234 | ISSN = {2470-881X},
235 | doi = {10.1109/SYNASC61333.2023.00029},
236 | url = {https://doi.org/10.1109/SYNASC61333.2023.00029},
237 | pages = {162--169},
238 | keywords = {Cloud computing;Machine learning algorithms;Scientific computing;Clustering algorithms;
239 | Prediction algorithms;Boosting;Classification algorithms;failure prediction;big data;machine learning;
240 | classification algorithms;Google Borg},
241 | abstract = {The ability to predict failures in complex systems is crucial for maintaining their
242 | optimal performance, opening the possibility of reducing downtime and minimizing costs.
243 | In the context of cloud computing, cloud failure represents one of the most relevant problems,
244 | which not only leads to substantial financial losses but also negatively impacts the productivity
245 | of both industrial and end users. This paper presents a comprehensive study on the application of
246 | failure prediction techniques, by exploring four machine learning algorithms, namely Decision Tree,
247 | Random Forest, Gradient Boosting, and Logistic Regression. The research focuses on analyzing the
248 | workload of an industrial set of clusters, provided as traces in Google’s Borg cluster workload traces.
249 | The aim was to develop highly accurate predictive models for both job and task failures, a goal which
250 | was achieved. A job classifier having a performance of 83.97\% accuracy (Gradient Boosting) and a task
251 | classifier of 98.79\% accuracy performance (Decision Tree) were obtained.},
252 | }
253 |
254 |
255 | ################ 2022
256 | @inproceedings {clusterdata:jajooSLearn2022,
257 | author = {Akshay Jajoo and Y. Charlie Hu and Xiaojun Lin and Nan Deng},
258 | title = {A Case for Task Sampling based Learning for Cluster Job Scheduling},
259 | booktitle = {19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)},
260 | year = {2022},
261 | address = {Renton, WA, USA},
262 | url = {https://www.usenix.org/conference/nsdi22/presentation/jajoo},
263 | publisher = {USENIX Association},
264 | keywords = {data centers, big data, job scheduling, learning, online learning},
265 | abstract = {The ability to accurately estimate job runtime properties allows a
266 | scheduler to effectively schedule jobs. State-of-the-art online cluster job
267 | schedulers use history-based learning, which uses past job execution information
268 | to estimate the runtime properties of newly arrived jobs. However, with fast-paced
269 | development in cluster technology (in both hardware and software) and changing user
270 | inputs, job runtime properties can change over time, which lead to inaccurate predictions.
271 | In this paper, we explore the potential and limitation of real-time learning of job
272 | runtime properties, by proactively sampling and scheduling a small fraction of the
273 | tasks of each job. Such a task-sampling-based approach exploits the similarity among
274 | runtime properties of the tasks of the same job and is inherently immune to changing
275 | job behavior. Our study focuses on two key questions in comparing task-sampling-based
276 | learning (learning in space) and history-based learning (learning in time): (1) Can
277 | learning in space be more accurate than learning in time? (2) If so, can delaying
278 | scheduling the remaining tasks of a job till the completion of sampled tasks be more
279 | than compensated by the improved accuracy and result in improved job performance? Our
280 | analytical and experimental analysis of 3 production traces with different skew and job
281 | distribution shows that learning in space can be substantially more accurate. Our
282 | simulation and testbed evaluation on Azure of the two learning approaches anchored in a
283 | generic job scheduler using 3 production cluster job traces shows that despite its online
284 | overhead, learning in space reduces the average Job Completion Time (JCT) by 1.28x, 1.56x,
285 | and 1.32x compared to the prior-art history-based predictor.},
286 | }
287 |
288 |
289 | ################ 2021
290 | @article{clusterdata:jajooSLearnTechReport2021,
291 | author = {Akshay Jajoo and Y. Charlie Hu and Xiaojun Lin and Nan Deng},
292 | title = {The Case for Task Sampling based Learning for Cluster Job Scheduling},
293 | journal = {Computing Research Repository},
294 | volume = {abs/2108.10464},
295 | year = {2021},
296 | url = {https://arxiv.org/abs/2108.10464},
297 | eprinttype = {arXiv},
298 | eprint = {2108.10464},
299 | timestamp = {Fri, 27 Aug 2021 15:02:29 +0200},
300 | biburl = {https://dblp.org/rec/journals/corr/abs-2108-10464.bib},
301 | bibsource = {dblp computer science bibliography, https://dblp.org},
302 | keywords = {data centers, big data, job scheduling, learning, online learning},
303 | abstract = {The ability to accurately estimate job runtime properties allows a
304 | scheduler to effectively schedule jobs. State-of-the-art online cluster job
305 | schedulers use history-based learning, which uses past job execution information
306 | to estimate the runtime properties of newly arrived jobs. However, with fast-paced
307 | development in cluster technology (in both hardware and software) and changing user
308 | inputs, job runtime properties can change over time, which lead to inaccurate predictions.
309 | In this paper, we explore the potential and limitation of real-time learning of job
310 | runtime properties, by proactively sampling and scheduling a small fraction of the
311 | tasks of each job. Such a task-sampling-based approach exploits the similarity among
312 | runtime properties of the tasks of the same job and is inherently immune to changing
313 | job behavior. Our study focuses on two key questions in comparing task-sampling-based
314 | learning (learning in space) and history-based learning (learning in time): (1) Can
315 | learning in space be more accurate than learning in time? (2) If so, can delaying
316 | scheduling the remaining tasks of a job till the completion of sampled tasks be more
317 | than compensated by the improved accuracy and result in improved job performance? Our
318 | analytical and experimental analysis of 3 production traces with different skew and job
319 | distribution shows that learning in space can be substantially more accurate. Our
320 | simulation and testbed evaluation on Azure of the two learning approaches anchored in a
321 | generic job scheduler using 3 production cluster job traces shows that despite its online
322 | overhead, learning in space reduces the average Job Completion Time (JCT) by 1.28x, 1.56x,
323 | and 1.32x compared to the prior-art history-based predictor.},
324 | }
325 |
326 | ################ 2020
327 |
328 | @inproceedings{clusterdata:Tirmazi2020,
329 | author = {Tirmazi, Muhammad and Barker, Adam and Deng, Nan and Haque, Md E. and Qin, Zhijing Gene and Hand, Steven and Harchol-Balter, Mor and Wilkes, John},
330 | title = {{Borg: the Next Generation}},
331 | year = {2020},
332 | isbn = {9781450368827},
333 | publisher = {ACM},
334 | address = {Heraklion, Greece},
335 | url = {https://doi.org/10.1145/3342195.3387517},
336 | doi = {10.1145/3342195.3387517},
337 | booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys'20)},
338 | articleno = {30},
339 | numpages = {14},
340 | keywords = {data centers, cloud computing},
341 | abstract = {
342 | This paper analyzes a newly-published trace that covers 8
343 | different Borg clusters for the month of May 2019. The
344 | trace enables researchers to explore how scheduling works in
345 | large-scale production compute clusters. We highlight how
346 | Borg has evolved and perform a longitudinal comparison of
347 | the newly-published 2019 trace against the 2011 trace, which
348 | has been highly cited within the research community.
349 | Our findings show that Borg features such as alloc sets
350 | are used for resource-heavy workloads; automatic vertical
351 | scaling is effective; job-dependencies account for much of
352 | the high failure rates reported by prior studies; the workload
353 | arrival rate has increased, as has the use of resource
354 | over-commitment; the workload mix has changed, jobs have
355 | migrated from the free tier into the best-effort batch tier;
356 | the workload exhibits an extremely heavy-tailed distribution
357 | where the top 1\% of jobs consume over 99\% of resources; and
358 | there is a great deal of variation between different clusters.},
359 | }
360 |
361 |
362 | ################ 2018
363 |
364 | @article{clusterdata:Sebastio2018,
365 | title = {Characterizing machines lifecycle in Google data centers},
366 | journal = {Performance Evaluation},
367 | volume = 126,
368 | pages = {39 -- 63},
369 | year = 2018,
370 | issn = {0166-5316},
371 | doi = {https://doi.org/10.1016/j.peva.2018.08.001},
372 | url = {http://www.sciencedirect.com/science/article/pii/S016653161830004X},
373 | author = {Stefano Sebastio and Kishor S. Trivedi and Javier Alonso},
374 | keywords = {Statistical analysis, Distributed architectures, Cloud computing, System reliability, Large-scale systems, Empirical studies},
375 | abstract = {Due to the increasing need for computational power, the market has
376 | shifted towards big centralized data centers. Understanding the nature
377 | of the dynamics of these data centers from machine and job/task
378 | perspective is critical to design efficient data center management
379 | policies like optimal resource/power utilization, capacity planning and
380 | optimal (reactive and proactive) maintenance scheduling. Whereas
381 | jobs/tasks dynamics have received a lot of attention, the study of the
382 | dynamics of the underlying machines supporting the jobs/tasks execution
383 | has received much less attention, even when these dynamics would
384 | substantially affect the performance of the jobs/tasks execution. Given
385 | the limited data available from large computing installations, only a
386 | few previous studies have inspected data centers and only concerning
387 | failures and their root causes. In this paper, we study the 2011 Google
388 | data center traces from the machine dynamics perspective. First, we
389 | characterize the machine events and their underlying distributions in
390 | order to have a better understanding of the entire machine lifecycle.
391 | Second, we propose a data-driven model to enable the estimate of the
392 | expected number of available machines at any instant of time. The model
393 | is parameterized and validated using the empirical data collected by
394 | Google during a one month period.}
395 | }
396 |
397 | ################ 2017
398 |
399 | @Inbook{clusterdata:Ray2017,
400 |   author = {Ray, Biplob R. and Chowdhury, Morshed and Atif, Usman},
401 |   editor = {Doss, Robin and Piramuthu, Selwyn and Zhou, Wei},
402 | title = {Is {High Performance Computing (HPC)} Ready to Handle Big Data?},
403 | bookTitle = {Future Network Systems and Security},
404 | year = 2017,
405 | month = Aug,
406 | publisher = {Springer},
407 | address = {Cham, Switzerland},
408 | pages = {97--112},
409 | abstract={In recent years big data has emerged as a universal term and its
410 | management has become a crucial research topic. The phrase `big data'
411 | refers to data sets so large and complex that the processing of them
412 | requires collaborative High Performance Computing (HPC). How to
413 | effectively allocate resources is one of the prime challenges in
414 | HPC. This leads us to the question: are the existing HPC resource
415 | allocation techniques effective enough to support future big data
416 | challenges? In this context, we have investigated the effectiveness of
417 | HPC resource allocation using the Google cluster dataset and a number of
418 | data mining tools to determine the correlational coefficient between
419 | resource allocation, resource usages and priority. Our analysis
420 | initially focused on correlation between resource allocation and
421 | resource uses. The finding shows that a high volume of resources that
422 | are allocated by the system for a job are not being used by that same
423 | job. To investigate further, we analyzed the correlation between
424 | resource allocation, resource usages and priority. Our clustering,
425 | classification and prediction techniques identified that the allocation
426 | and uses of resources are very loosely correlated with priority of the
427 | jobs. This research shows that our current HPC scheduling needs
428 | improvement in order to accommodate the big data challenge
429 | efficiently.},
430 | keywords = {Big data; HPC; Data mining; QoS; Correlation},
431 | isbn = {978-3-319-65548-2},
432 | doi = {10.1007/978-3-319-65548-2_8},
433 | url = {https://doi.org/10.1007/978-3-319-65548-2_8},
434 | }
435 |
436 | @INPROCEEDINGS{clusterdata:Elsayed2017,
437 | author = {Nosayba El-Sayed and Hongyu Zhu and Bianca Schroeder},
438 | title = {Learning from Failure Across Multiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations},
439 | booktitle={International Conference on Distributed Computing Systems (ICDCS)},
440 | year=2017,
441 | month=Jun,
442 | pages={1333--1344},
443 | abstract={In large-scale computing platforms, jobs are prone to interruptions
444 | and premature terminations, limiting their usability and leading to
445 | significant waste in cluster resources. In this paper, we tackle this
446 | problem in three steps. First, we provide a comprehensive study based on
447 | log data from multiple large-scale production systems to identify
448 | patterns in the behaviour of unsuccessful jobs across different clusters
449 | and investigate possible root causes behind job termination. Our results
450 | reveal several interesting properties that distinguish unsuccessful jobs
451 | from others, particularly w.r.t. resource consumption patterns and job
452 | configuration settings. Secondly, we design a machine learning-based
453 | framework for predicting job and task terminations. We show that job
454 | failures can be predicted relatively early with high precision and
455 | recall, and also identify attributes that have strong predictive power
456 | of job failure. Finally, we demonstrate in a concrete use case how our
457 | prediction framework can be used to mitigate the effect of unsuccessful
458 | execution using an effective task-cloning policy that we propose.},
459 | keywords={learning (artificial intelligence);parallel
460 | processing;resource allocation;software fault tolerance; job
461 | configuration settings;job failures prediction;job
462 | terminations mitigation;job terminations prediction;
463 | large-scale computing platforms;machine learning-based
464 | framework;resource consumption patterns; task-cloning
465 | policy;trace-driven approach;Computer crashes;Electric
466 | breakdown;Google;Large-scale systems; Linear systems;Parallel
467 | processing;Program processors;Failure Mitigation;Failure
468 | Prediction;Job Failure; Large-Scale Systems;Reliability;Trace
469 | Analysis}, doi = {10.1109/ICDCS.2017.317}, issn = {1063-6927}, }
470 |
471 | ################ 2014
472 |
473 | @INPROCEEDINGS{clusterdata:Abdul-Rahman2014,
474 | author = {Abdul-Rahman, Omar Arif and Aida, Kento},
475 | title = {Towards understanding the usage behavior of {Google} cloud
476 | users: the mice and elephants phenomenon},
477 | booktitle = {IEEE International Conference on Cloud Computing
478 | Technology and Science (CloudCom)},
479 | year = 2014,
480 | month = dec,
481 | address = {Singapore},
482 | pages = {272--277},
483 | keywords = {Google trace; Workload trace analysis; User session view;
484 | Application composition; Mass-Count disparity; Exploratory statistical
485 | analysis; Visual analysis; Color-schemed graphs; Coarse grain
486 | classification; Heavy-tailed distributions; Long-tailed lognormal
487 | distributions; Exponential distribution; Normal distribution; Discrete
488 | modes; Large web services; Batch processing; MapReduce computation;
489 | Human users; },
490 | abstract = {In the era of cloud computing, users encounter the challenging
491 | task of effectively composing and running their applications on the
492 | cloud. In an attempt to understand user behavior in constructing
493 | applications and interacting with typical cloud infrastructures, we
494 | analyzed a large utilization dataset of Google cluster. In the present
495 | paper, we consider user behavior in composing applications from the
496 | perspective of topology, maximum requested computational resources, and
497 | workload type. We model user dynamic behavior around the user's session
498 | view. Mass-Count disparity metrics are used to investigate the
499 | characteristics of underlying statistical models and to characterize
500 | users into distinct groups according to their composition and behavioral
501 | classes and patterns. The present study reveals interesting insight into
502 | the heterogeneous structure of the Google cloud workload.},
503 | doi = {10.1109/CloudCom.2014.75},
504 | }
505 |
506 | ################ 2013
507 |
508 | @inproceedings{clusterdata:Di2013,
509 | title = {Characterizing cloud applications on a {Google} data center},
510 | author = {Di, Sheng and Kondo, Derrick and Franck, Cappello},
511 | booktitle = {42nd International Conference on Parallel Processing (ICPP)},
512 | year = 2013,
513 | month = Oct,
514 | address = {Lyon, France},
515 | abstract = {In this paper, we characterize Google applications,
516 | based on a one-month Google trace with over 650k jobs running
517 | across over 12000 heterogeneous hosts from a Google data
518 | center. On one hand, we carefully compute the valuable
519 | statistics about task events and resource utilization for
520 | Google applications, based on various types of resources (such
521 | as CPU, memory) and execution types (e.g., whether they can
522 | run batch tasks or not). Resource utilization per application
523 | is observed with an extremely typical Pareto principle. On the
524 | other hand, we classify applications via a K-means clustering
525 | algorithm with optimized number of sets, based on task events
526 | and resource usage. The number of applications in the K-means
527 | clustering sets follows a Pareto-similar distribution. We
528 | believe our work is very interesting and valuable for the
529 | further investigation of Cloud environment.},
530 | }
531 |
532 | ################ 2012
533 |
534 | @INPROCEEDINGS{clusterdata:Reiss2012b,
535 | title = {Heterogeneity and dynamicity of clouds at scale: {Google}
536 | trace analysis},
537 | author = {Charles Reiss and Alexey Tumanov and Gregory R. Ganger and
538 | Randy H. Katz and Michael A. Kozuch},
539 | booktitle = {ACM Symposium on Cloud Computing (SoCC)},
540 | year = 2012,
541 | month = Oct,
542 | address = {San Jose, CA, USA},
543 | abstract = {To better understand the challenges in developing effective
544 | cloud-based resource schedulers, we analyze the first publicly available
545 | trace data from a sizable multi-purpose cluster. The most notable
546 | workload characteristic is heterogeneity: in resource types (e.g.,
547 | cores:RAM per machine) and their usage (e.g., duration and resources
548 | needed). Such heterogeneity reduces the effectiveness of traditional
549 | slot- and core-based scheduling. Furthermore, some tasks are
550 | constrained as to the kind of machine types they can use, increasing the
551 | complexity of resource assignment and complicating task migration. The
552 | workload is also highly dynamic, varying over time and most workload
553 | features, and is driven by many short jobs that demand quick scheduling
554 | decisions. While few simplifying assumptions apply, we find that many
555 | longer-running jobs have relatively stable resource utilizations, which
556 | can help adaptive resource schedulers.},
557 | url = {http://www.pdl.cmu.edu/PDL-FTP/CloudComputing/googletrace-socc2012.pdf},
558 | privatenote = {An earlier version of this was posted at
559 | \url{http://www.istc-cc.cmu.edu/publications/papers/2012/ISTC-CC-TR-12-101.pdf},
560 | and included here as clusterdata:Reiss2012a. Please use this
561 | version instead of that.},
562 | }
563 |
564 | @INPROCEEDINGS{clusterdata:Liu2012,
565 | author = {Zitao Liu and Sangyeun Cho},
566 | title = {Characterizing machines and workloads on a {Google} cluster},
567 | booktitle = {8th International Workshop on Scheduling and Resource
568 | Management for Parallel and Distributed Systems (SRMPDS)},
569 | year = 2012,
570 | month = Sep,
571 | address = {Pittsburgh, PA, USA},
572 | abstract = {Cloud computing offers high scalability, flexibility and
573 | cost-effectiveness to meet emerging computing
574 | requirements. Understanding the characteristics of real workloads on a
575 | large production cloud cluster benefits not only cloud service providers
576 | but also researchers and daily users. This paper studies a large-scale
577 | Google cluster usage trace dataset and characterizes how the machines in
578 | the cluster are managed and the workloads submitted during a 29-day
579 | period behave. We focus on the frequency and pattern of machine
580 | maintenance events, job- and task-level workload behavior, and how the
581 | overall cluster resources are utilized.},
582 | url = {http://www.cs.pitt.edu/cast/abstract/liu-srmpds12.html},
583 | }
584 |
585 | @INPROCEEDINGS{clusterdata:Di2012a,
586 | author = {Sheng Di and Derrick Kondo and Walfredo Cirne},
587 | title = {Characterization and comparison of cloud versus {Grid} workloads},
588 | booktitle = {International Conference on Cluster Computing (IEEE CLUSTER)},
589 | year = 2012,
590 | month = Sep,
591 | pages = {230--238},
592 | address = {Beijing, China},
593 | abstract = {A new era of Cloud Computing has emerged, but the characteristics
594 | of Cloud load in data centers are not perfectly clear. Yet this
595 | characterization is critical for the design of novel Cloud job and
596 | resource management systems. In this paper, we comprehensively
597 | characterize the job/task load and host load in a real-world production
598 | data center at Google Inc. We use a detailed trace of over 25 million
599 | tasks across over 12,500 hosts. We study the differences between a
600 | Google data center and other Grid/HPC systems, from the perspective of
601 | both work load (w.r.t. jobs and tasks) and host load
602 | (w.r.t. machines). In particular, we study the job length, job
603 | submission frequency, and the resource utilization of jobs in the
604 | different systems, and also investigate valuable statistics of machine's
605 | maximum load, queue state and relative usage levels, with different job
606 | priorities and resource attributes. We find that the Google data center
607 | exhibits finer resource allocation with respect to CPU and memory than
608 | that of Grid/HPC systems. Google jobs are always submitted with much
609 | higher frequency and they are much shorter than Grid jobs. As such,
610 | Google host load exhibits higher variance and noise.},
611 | keywords = {cloud computing;computer centres;grid computing;queueing
612 | theory;resource allocation;search engines;CPU;Google data
613 | center;cloud computing;cloud job;cloud load;data centers;grid
614 | workloads;grid-HPC systems;host load;job length;job submission
615 | frequency;jobs resource utilization;machine maximum load;queue
616 | state;real-world production data center;relative usage
617 | levels;resource allocation;resource attributes;resource
618 | management systems;task load;Capacity
619 | planning;Google;Joints;Load modeling;Measurement;Memory
620 | management;Resource management;Cloud Computing;Grid
621 | Computing;Load Characterization},
622 | doi = {10.1109/CLUSTER.2012.35},
623 | privatenote = {An earlier version is available at
624 | \url{http://hal.archives-ouvertes.fr/hal-00705858}. It used
625 | to be included here as clusterdata:Di2012.},
626 | }
627 |
628 | ################ 2010
629 |
630 | @Article{clusterdata:Mishra2010,
631 | author = {Mishra, Asit K. and Hellerstein, Joseph L. and Cirne,
632 | Walfredo and Das, Chita R.},
633 | title = {Towards characterizing cloud backend workloads: insights
634 | from {Google} compute clusters},
635 | journal = {SIGMETRICS Perform. Eval. Rev.},
636 | volume = {37},
637 | number = {4},
638 | month = Mar,
639 | year = 2010,
640 | issn = {0163-5999},
641 | pages = {34--41},
642 | numpages = {8},
643 | url = {http://doi.acm.org/10.1145/1773394.1773400},
644 | doi = {10.1145/1773394.1773400},
645 | publisher = {ACM},
646 | abstract = {The advent of cloud computing promises highly available,
647 | efficient, and flexible computing services for applications such as web
648 | search, email, voice over IP, and web search alerts. Our experience at
649 | Google is that realizing the promises of cloud computing requires an
650 | extremely scalable backend consisting of many large compute clusters
651 | that are shared by application tasks with diverse service level
652 | requirements for throughput, latency, and jitter. These considerations
653 | impact (a) capacity planning to determine which machine resources must
654 | grow and by how much and (b) task scheduling to achieve high machine
655 | utilization and to meet service level objectives.
656 |
657 | Both capacity planning and task scheduling require a good understanding
658 | of task resource consumption (e.g., CPU and memory usage). This in turn
659 | demands simple and accurate approaches to workload
660 | classification-determining how to form groups of tasks (workloads) with
661 | similar resource demands. One approach to workload classification is to
662 | make each task its own workload. However, this approach scales poorly
663 | since tens of thousands of tasks execute daily on Google compute
664 | clusters. Another approach to workload classification is to view all
665 | tasks as belonging to a single workload. Unfortunately, applying such a
666 | coarse-grain workload classification to the diversity of tasks running
667 | on Google compute clusters results in large variances in predicted
668 | resource consumptions.
669 |
670 | This paper describes an approach to workload classification and its
671 | application to the Google Cloud Backend, arguably the largest cloud
672 | backend on the planet. Our methodology for workload classification
673 | consists of: (1) identifying the workload dimensions; (2) constructing
674 | task classes using an off-the-shelf algorithm such as k-means; (3)
675 | determining the break points for qualitative coordinates within the
676 | workload dimensions; and (4) merging adjacent task classes to reduce the
677 | number of workloads. We use the foregoing, especially the notion of
678 | qualitative coordinates, to glean several insights about the Google
679 | Cloud Backend: (a) the duration of task executions is bimodal in that
680 | tasks either have a short duration or a long duration; (b) most tasks
681 | have short durations; and (c) most resources are consumed by a few tasks
682 | with long duration that have large demands for CPU and memory.},
683 | }
684 |
685 |
686 | ################################################################
687 | # Trace-usage papers
688 | ################################################################
689 |
690 | These entries are for papers that primarily focus on some other topic, but
691 | use the traces as inputs, e.g., in simulations or load predictions.
692 | Order: most recent first.
693 |
694 | ################ 2023
695 | @ARTICLE{clusterdata:jajooSLearnTCC2023,
696 | author={Jajoo, Akshay and Hu, Y. Charlie and Lin, Xiaojun and Deng, Nan},
697 | journal={IEEE Transactions on Cloud Computing},
698 | title={SLearn: A Case for Task Sampling Based Learning for Cluster Job Scheduling},
699 | year={2023},
700 | volume={11},
701 | number={3},
702 | pages = {2664--2680},
703 | publisher = {IEEE},
704 | keywords = {data centers, big data, job scheduling, learning, online learning},
705 | abstract = {The ability to accurately estimate job runtime properties allows a
706 | scheduler to effectively schedule jobs. State-of-the-art online cluster
707 | job schedulers use history-based learning, which uses past job execution
708 | information to estimate the runtime properties of newly arrived jobs.
709 | However, with fast-paced development in cluster technology (in both hardware
710 | and software) and changing user inputs, job runtime properties can change over
711 | time, which leads to inaccurate predictions. In this article, we explore the
712 | potential and limitation of real-time learning of job runtime properties,
713 | by proactively sampling and scheduling a small fraction of the tasks of
714 | each job. Such a task-sampling-based approach exploits the similarity among
715 | runtime properties of the tasks of the same job and is inherently immune to
716 | changing job behavior. Our analytical and experimental analysis of 3 production
717 | traces with different skew and job distribution shows that learning in space can
718 | be substantially more accurate. Our simulation and testbed evaluation on Azure of
719 | the two learning approaches anchored in a generic job scheduler using 3 production
720 | cluster job traces shows that despite its online overhead, learning in space reduces
721 | the average Job Completion Time (JCT) by 1.28×, 1.56×, and 1.32× compared to the
722 | prior-art history-based predictor. We further analyze the experimental results to
723 | give intuitive explanations to why learning in space outperforms learning in time
724 | in these experiments. Finally, we show how sampling-based learning can be extended
725 | to schedule DAG jobs and achieve similar speedups over the prior-art history-based
726 | predictor.},
727 | doi = {10.1109/TCC.2022.3222649}}
728 |
729 |
730 | ################ 2022
731 | @inproceedings{clusterdata:jajooSLearnNSDI2022,
732 | author = {Akshay Jajoo and Y. Charlie Hu and Xiaojun Lin and Nan Deng},
733 | title = {A Case for Task Sampling based Learning for Cluster Job Scheduling},
734 | booktitle = {19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)},
735 | year = {2022},
736 | address = {Renton, WA, USA},
737 | url = {https://www.usenix.org/conference/nsdi22/presentation/jajoo},
738 | publisher = {USENIX Association},
739 | keywords = {data centers, big data, job scheduling, learning, online learning},
740 | abstract = {The ability to accurately estimate job runtime properties allows a
741 | scheduler to effectively schedule jobs. State-of-the-art online cluster job
742 | schedulers use history-based learning, which uses past job execution information
743 | to estimate the runtime properties of newly arrived jobs. However, with fast-paced
744 | development in cluster technology (in both hardware and software) and changing user
745 | inputs, job runtime properties can change over time, which leads to inaccurate predictions.
746 | In this paper, we explore the potential and limitation of real-time learning of job
747 | runtime properties, by proactively sampling and scheduling a small fraction of the
748 | tasks of each job. Such a task-sampling-based approach exploits the similarity among
749 | runtime properties of the tasks of the same job and is inherently immune to changing
750 | job behavior. Our study focuses on two key questions in comparing task-sampling-based
751 | learning (learning in space) and history-based learning (learning in time): (1) Can
752 | learning in space be more accurate than learning in time? (2) If so, can delaying
753 | scheduling the remaining tasks of a job till the completion of sampled tasks be more
754 | than compensated by the improved accuracy and result in improved job performance? Our
755 | analytical and experimental analysis of 3 production traces with different skew and job
756 | distribution shows that learning in space can be substantially more accurate. Our
757 | simulation and testbed evaluation on Azure of the two learning approaches anchored in a
758 | generic job scheduler using 3 production cluster job traces shows that despite its online
759 | overhead, learning in space reduces the average Job Completion Time (JCT) by 1.28x, 1.56x,
760 | and 1.32x compared to the prior-art history-based predictor.},
761 | }
762 |
763 |
764 | ################ 2021
765 | @article{clusterdata:jajooSLearnTechReport2021,
766 | author = {Akshay Jajoo and Y. Charlie Hu and Xiaojun Lin and Nan Deng},
767 | title = {The Case for Task Sampling based Learning for Cluster Job Scheduling},
768 | journal = {Computing Research Repository},
769 | volume = {abs/2108.10464},
770 | year = {2021},
771 | url = {https://arxiv.org/abs/2108.10464},
772 | eprinttype = {arXiv},
773 | eprint = {2108.10464},
774 | timestamp = {Fri, 27 Aug 2021 15:02:29 +0200},
775 | biburl = {https://dblp.org/rec/journals/corr/abs-2108-10464.bib},
776 | bibsource = {dblp computer science bibliography, https://dblp.org},
777 | keywords = {data centers, big data, job scheduling, learning, online learning},
778 | abstract = {The ability to accurately estimate job runtime properties allows a
779 | scheduler to effectively schedule jobs. State-of-the-art online cluster job
780 | schedulers use history-based learning, which uses past job execution information
781 | to estimate the runtime properties of newly arrived jobs. However, with fast-paced
782 | development in cluster technology (in both hardware and software) and changing user
783 | inputs, job runtime properties can change over time, which leads to inaccurate predictions.
784 | In this paper, we explore the potential and limitation of real-time learning of job
785 | runtime properties, by proactively sampling and scheduling a small fraction of the
786 | tasks of each job. Such a task-sampling-based approach exploits the similarity among
787 | runtime properties of the tasks of the same job and is inherently immune to changing
788 | job behavior. Our study focuses on two key questions in comparing task-sampling-based
789 | learning (learning in space) and history-based learning (learning in time): (1) Can
790 | learning in space be more accurate than learning in time? (2) If so, can delaying
791 | scheduling the remaining tasks of a job till the completion of sampled tasks be more
792 | than compensated by the improved accuracy and result in improved job performance? Our
793 | analytical and experimental analysis of 3 production traces with different skew and job
794 | distribution shows that learning in space can be substantially more accurate. Our
795 | simulation and testbed evaluation on Azure of the two learning approaches anchored in a
796 | generic job scheduler using 3 production cluster job traces shows that despite its online
797 | overhead, learning in space reduces the average Job Completion Time (JCT) by 1.28x, 1.56x,
798 | and 1.32x compared to the prior-art history-based predictor.},
799 | }
800 |
801 | ################ 2020
802 |
803 | @INPROCEEDINGS{clusterdata:Lin2020,
804 | title = {Using {GANs} for Sharing Networked Time Series Data: Challenges,
805 | Initial Promise, and Open Questions},
806 | author = {Lin, Zinan and Jain, Alankar and Wang, Chen and Fanti,
807 | Giulia and Sekar, Vyas},
808 | year = {2020},
809 | isbn = {9781450381383},
810 | publisher = {Association for Computing Machinery},
811 | url = {https://doi.org/10.1145/3419394.3423643},
812 | doi = {10.1145/3419394.3423643},
813 | abstract = {Limited data access is a longstanding barrier to data-driven
814 | research and development in the networked systems community. In this work,
815 | we explore if and how generative adversarial networks (GANs) can be used to
816 | incentivize data sharing by enabling a generic framework for sharing
817 | synthetic datasets with minimal expert knowledge. As a specific target, our
818 | focus in this paper is on time series datasets with metadata (e.g., packet
819 | loss rate measurements with corresponding ISPs). We identify key challenges
820 | of existing GAN approaches for such workloads with respect to fidelity
821 | (e.g., long-term dependencies, complex multidimensional relationships, mode
822 | collapse) and privacy (i.e., existing guarantees are poorly understood and
823 | can sacrifice fidelity). To improve fidelity, we design a custom workflow
824 | called DoppelGANger (DG) and demonstrate that across diverse real-world
825 | datasets (e.g., bandwidth measurements, cluster requests, web sessions) and
826 | use cases (e.g., structural characterization, predictive modeling, algorithm
827 | comparison), DG achieves up to 43% better fidelity than baseline models.
828 | Although we do not resolve the privacy problem in this work, we identify
829 | fundamental challenges with both classical notions of privacy and recent
830 | advances to improve the privacy properties of GANs, and suggest a potential
831 | roadmap for addressing these challenges. By shedding light on the promise
832 | and challenges, we hope our work can rekindle the conversation on workflows
833 | for data sharing.},
834 | booktitle = {Proceedings of the ACM Internet Measurement Conference (IMC
835 | 2020)},
836 | pages = {464--483},
837 | numpages = {20},
838 | keywords = {privacy, synthetic data generation, time series,
839 | generative adversarial networks},
840 | }
841 |
842 | @article{clusterdata:Aydin2020,
843 | title = {Multi-objective temporal bin packing problem: an application in cloud computing},
844 | journal = {Computers \& Operations Research},
845 | volume = 121,
846 | pages = {104959},
847 | year = 2020,
848 | month = Sep,
849 | issn = {0305-0548},
850 | doi = {10.1016/j.cor.2020.104959},
851 | url = {http://www.sciencedirect.com/science/article/pii/S0305054820300769},
852 | author = {Nurşen Aydin and Ibrahim Muter and Ş. Ilker Birbil},
853 | keywords = {Bin packing, Cloud computing, Heuristics, Exact methods, Column generation},
854 | abstract = {Improving energy efficiency and lowering operational
855 | costs are the main challenges faced in systems with multiple
856 | servers. One prevalent objective in such systems is to
857 | minimize the number of servers required to process a given set
858 | of tasks under server capacity constraints. This objective
859 | leads to the well-known bin packing problem. In this study, we
860 | consider a generalization of this problem with a time
861 | dimension, where the tasks are to be performed with predefined
862 | start and end times. This new dimension brings about new
863 | performance considerations, one of which is the uninterrupted
864 | utilization of servers. This study is motivated by the problem
865 | of energy efficient assignment of virtual machines to physical
866 | servers in a cloud computing service. We address the virtual
867 | machine placement problem and present a binary integer
868 | programming model to develop different assignment policies. By
869 | analyzing the structural properties of the problem, we propose
870 | an efficient heuristic method based on solving smaller
871 | versions of the original problem iteratively. Moreover, we
872 | design a column generation algorithm that yields a lower bound
873 | on the objective value, which can be utilized to evaluate the
874 | performance of the heuristic algorithm. Our numerical study
875 | indicates that the proposed heuristic is capable of solving
876 | large-scale instances in a short time with small optimality
877 | gaps.},
878 | }
879 |
880 | @article{clusterdata:Milocco2020,
881 | title = {Evaluating the Upper Bound of Energy Cost Saving by Proactive Data Center Management},
882 | journal = {IEEE Transactions on Network and Service Management},
883 | year = 2020,
884 | issn = {1932-4537},
885 | doi = {10.1109/TNSM.2020.2988346},
886 | url = {https://ieeexplore.ieee.org/abstract/document/9069318},
887 | author = {Ruben Milocco and Pascale Minet and Éric Renault and Selma Boumerdassi},
888 | keywords = {Data center management, Proactive management, Machine Learning, Prediction, Energy cost},
889 | abstract = {
890 | Data Centers (DCs) need to periodically configure their servers in order to meet user demands.
891 | Since appropriate proactive management to meet demands reduces the cost, either by improving Quality of
892 | Service (QoS) or saving energy, there is a great interest in studying different proactive strategies
893 | based on predictions of the energy used to serve CPU and memory requests. The amount of savings that can
894 | be achieved depends not only on the selected proactive strategy but also on user-demand statistics and the
895 | predictors used. Despite its importance, it is difficult to find theoretical studies that quantify the
896 | savings that can be made, due to the problem complexity. A proactive DC management strategy is presented
897 | together with its upper bound of energy cost savings obtained with respect to a purely reactive management.
898 | Using this method together with records of the recent past, it is possible to quantify the efficiency of
899 | different predictors. Both linear and nonlinear predictors are studied, using a Google data set collected
900 | over 29 days, to evaluate the benefits that can be obtained with these two predictors.},
901 | }
902 |
903 |
904 | ################ 2018
905 |
906 | @article{clusterdata:Sliwko2018,
907 | author = {Sliwko, Leszek},
908 | title = {A Scalable Service Allocation Negotiation For Cloud Computing},
909 | journal = {Journal of Theoretical and Applied Information Technology},
910 | volume = 96,
911 | number = 20,
912 | month = Oct,
913 | year = 2018,
914 | issn = {1817-3195},
915 | pages = {6751--6782},
916 | numpages = {32},
917 | keywords = {distributed scheduling, agents; load balancing, MASB},
918 | abstract={This paper presents a detailed design of a decentralised agent-based
919 | scheduler, which can be used to manage workloads within the computing cells
920 | of a Cloud system. This scheme is based on the concept of service allocation
921 | negotiation, whereby all system nodes communicate between themselves and
922 | scheduling logic is decentralised. The architecture presented has been
923 | implemented, with multiple simulations run using real-world workload traces from
924 | the Google Cluster Data project. The results were then compared to the
925 | scheduling patterns of Google’s Borg system.}
926 | }
927 |
928 | @INPROCEEDINGS{clusterdata:Liu2018gh,
929 | author = {Liu, Jinwei and Shen, Haiying and Sarker, Ankur and Chung, Wingyan},
930 | title = {Leveraging Dependency in Scheduling and Preemption for High Throughput in Data-Parallel Clusters},
931 | booktitle = {2018 IEEE International Conference on Cluster Computing (CLUSTER)},
932 | year = {2018},
933 | month = Sep,
934 | pages = {359--369},
935 | publisher = {IEEE},
936 | abstract = {Task scheduling and preemption are two important functions in
937 | data-parallel clusters. Though directed acyclic graph task dependencies
938 | are common in data-parallel clusters, previous task scheduling and
939 | preemption methods do not fully utilize such task dependency to increase
940 | throughput since they simply schedule precedent tasks prior to their
941 | dependent tasks or neglect the dependency. We notice that in both
942 | scheduling and preemption, choosing a task with more dependent tasks to
943 | run allows more tasks to be runnable next, which facilitates to select a
944 | task that can more increase throughput. Accordingly, in this paper, we
945 | propose a Dependency-aware Scheduling and Preemption system (DSP) to
946 | achieve high throughput. First, we build an integer linear programming
947 | model to minimize the makespan (i.e., the time when all jobs finish
948 | execution) with the consideration of task dependency and deadline, and
949 | derive the target server and start time for each task, which can
950 | minimize the makespan. Second, we utilize task dependency to determine
951 | tasks' priorities for preemption. Finally, we propose a method to reduce
952 | the number of unnecessary preemptions that cause more overhead than the
953 | throughput gain. Extensive experimental results based on a real cluster
954 | and Amazon EC2 cloud service show that DSP achieves much higher
955 | throughput compared to existing strategies.},
956 | doi = {10.1109/CLUSTER.2018.00054},
957 | }
958 |
959 | @inproceedings{clusterdata:Minet2018j,
960 | author = {Pascale Minet and Éric Renault and Ines Khoufi and Selma Boumerdassi},
961 | title = {Analyzing Traces from a {Google} Data Center},
962 | booktitle = {14th International Wireless Communications and Mobile Computing Conference (IWCMC 2018)},
963 | year = 2018,
964 | month = Jun,
965 | publisher = {IEEE},
966 | address = {Limassol, Cyprus},
967 | pages = {1167--1172},
968 | url = {https://doi.org/10.1109/IWCMC.2018.8450304},
969 | doi = {10.1109/IWCMC.2018.8450304},
970 | abstract = {
971 | Traces collected from an operational Google data center over 29 days represent a very rich
972 | and useful source of information for understanding the main features of a data center. In this
973 | paper, we characterize the strong heterogeneity of jobs and the medium heterogeneity of machine
974 | configurations. We analyze the off-periods of machines. We study the distribution of jobs per
975 | category, per scheduling class, per priority and per number of tasks. The distribution of job
976 | execution durations shows a high disparity, as does the job waiting time before being scheduled.
977 | The resource requests in terms of CPU and memory are also analyzed. The distribution of these
978 | parameter values is very useful to develop accurate models and algorithms for resource allocation
979 | in data centers.},
980 | keywords = {Data analysis, data center, big data application, resource allocation, scheduling},
981 | }
982 |
983 | @inproceedings{clusterdata:Minet2018m,
984 | author = {Pascale Minet and Éric Renault and Ines Khoufi and Selma Boumerdassi},
985 | title = {Data Analysis of a {Google} Data Center},
986 | booktitle = {18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2018)},
987 | year = 2018,
988 | month = May,
989 | publisher = {IEEE},
990 | address = {Washington DC, USA},
991 | pages = {342--343},
992 | url = {https://doi.org/10.1109/CCGRID.2018.00049},
993 | doi = {10.1109/CCGRID.2018.00049},
994 | abstract = {
995 | Data collected from an operational Google data center during 29 days represent a very rich
996 | and very useful source of information for understanding the main features of a data center.
997 | In this paper, we highlight the strong heterogeneity of jobs. The distribution of job execution
998 | duration shows a high disparity, as well as the job waiting time before being scheduled. The
999 | resource requests in terms of CPU and memory are also analyzed. The knowledge of all these features
1000 | is needed to design models of jobs, machines and resource requests that are representative of a
1001 | real data center.},
1002 | }
1003 |
1004 | @ARTICLE{clusterdata:Sebastio2018b,
1005 | author = {Stefano Sebastio and Rahul Ghosh and Tridib Mukherjee},
1006 | journal = {IEEE Transactions on Services Computing},
1007 | title = {An availability analysis approach for deployment configurations of containers},
1008 | year = {2018},
1009 | month = Jan,
1010 | abstract = {Operating system (OS) containers enabling the microservice-oriented
1011 | architecture are becoming popular in the context of Cloud
1012 | services. Containers provide the ability to create lightweight and
1013 | portable runtime environments decoupling the application requirements
1014 | from the characteristics of the underlying system. Services built on
1015 | containers have a small resource footprint in terms of processing,
1016 | storage, memory and network, allowing a denser deployment
1017 | environment. While the performance of such containers is addressed in
1018 | few previous studies, understanding the failure-repair behavior of the
1019 | containers remains unexplored. In this paper, from an availability point
1020 | of view, we propose and compare different configuration models for
1021 | deploying a containerized software system. Inspired by Google
1022 | Kubernetes, a container management system, these configurations are
1023 | characterized with a failure response and migration service. We develop
1024 | novel non-state-space and state-space analytic models for container
1025 | availability analysis. Analytical as well as simulative solutions are
1026 | obtained for the developed models. Our analysis provides insights on k
1027 | out-of N availability and sensitivity of system availability for key
1028 | system parameters. Finally, we build an open-source software tool
1029 | powered by these models. The tool helps Cloud administrators to assess
1030 | the availability of containerized systems and to conduct a what-if
1031 | analysis based on user-provided parameters and configurations.},
1032 | keywords = {Containers;Analytical models;Cloud computing;Stochastic
1033 | processes;Tools;Computer architecture;Google;container;system
1034 | availability;virtual machine;cloud computing;analytic model;stochastic
1035 | reward net},
1036 | doi = {10.1109/TSC.2017.2788442},
1037 | ISSN = {1939-1374},
1038 | }
1039 |
1040 |
1041 | @article{clusterdata:Sebastio2018c,
1042 | author = {Sebastio, Stefano and Amoretti, Michele and Lafuente, Alberto Lluch and Scala, Antonio},
1043 | title = {A Holistic Approach for Collaborative Workload Execution in Volunteer Clouds},
1044 | journal = {ACM Transactions on Modeling and Computer Simulation (TOMACS)},
1045 | volume = 28,
1046 | number = 2,
1047 | month = Mar,
1048 | year = 2018,
1049 | issn = {1049-3301},
1050 | pages = {14:1--14:27},
1051 | articleno = {14},
1052 | numpages = {27},
1053 | url = {http://doi.acm.org/10.1145/3155336},
1054 | doi = {10.1145/3155336},
1055 | acmid = {3155336},
1056 | publisher = {ACM},
1057 | keywords = {Collective adaptive systems, ant colony optimization (ACO),
1058 | autonomic computing, cloud computing, collaborative computing,
1059 | computational fields, multiagent optimization, peer-to-peer (P2P), task
1060 | scheduling},
1061 | abstract={The demand for provisioning, using, and maintaining distributed
1062 | computational resources is growing hand in hand with the quest for
1063 | ubiquitous services. Centralized infrastructures such as cloud computing
1064 | systems provide suitable solutions for many applications, but their
1065 | scalability could be limited in some scenarios, such as in the case of
1066 | latency-dependent applications. The volunteer cloud paradigm aims at
1067 | overcoming this limitation by encouraging clients to offer their own
1068 | spare, perhaps unused, computational resources. Volunteer clouds are
1069 | thus complex, large-scale, dynamic systems that demand for self-adaptive
1070 | capabilities to offer effective services, as well as modeling and
1071 | analysis techniques to predict their behavior. In this article, we
1072 | propose a novel holistic approach for volunteer clouds supporting
1073 | collaborative task execution services able to improve the quality of
1074 | service of compute-intensive workloads. We instantiate our approach by
1075 | extending a recently proposed ant colony optimization algorithm for
1076 | distributed task execution with a workload-based partitioning of the
1077 | overlay network of the volunteer cloud. Finally, we evaluate our
1078 | approach using simulation-based statistical analysis techniques on a
1079 | workload benchmark provided by Google. Our results show that the
1080 | proposed approach outperforms some traditional distributed task
1081 | scheduling algorithms in the presence of compute-intensive workloads.}
1082 | }
1083 |
1084 | @Article{clusterdata:Sebastio2018d,
1085 | author = {Stefano Sebastio and Giorgio Gnecco},
1086 | title = {A green policy to schedule tasks in a distributed cloud},
1087 | journal = {Optimization Letters},
1088 | year = 2018,
1089 | month = Oct,
1090 | day = 01,
1091 | volume = 12,
1092 | number = 7,
1093 | pages = {1535--1551},
1094 | abstract = {In the last years, demand and availability of computational
1095 | capabilities experienced radical changes. Desktops and laptops increased
1096 | their processing resources, exceeding users' demand for large part of
1097 | the day. On the other hand, computational methods are more and more
1098 | frequently adopted by scientific communities, which often experience
1099 | difficulties in obtaining access to the required
1100 | resources. Consequently, data centers for outsourcing use, relying on
1101 | the cloud computing paradigm, are proliferating. Notwithstanding the
1102 | effort to build energy-efficient data centers, their energy footprint is
1103 | still considerable, since cooling a large number of machines situated in
1104 | the same room or container requires a significant amount of power. The
1105 | volunteer cloud, exploiting the users' willingness to share a quota of
1106 | their underused machine resources, can constitute an effective solution
1107 | to have the required computational resources when needed. In this paper,
1108 | we foster the adoption of the volunteer cloud computing as a green
1109 | (i.e., energy efficient) solution even able to outperform existing data
1110 | centers in specific tasks. To manage the complexity of such a large
1111 | scale heterogeneous system, we propose a distributed optimization policy
1112 | to task scheduling with the aim of reducing the overall energy
1113 | consumption executing a given workload. To this end, we consider an
1114 | integer programming problem relying on the Alternating Direction Method
1115 | of Multipliers (ADMM) for its solution. Our approach is compared with a
1116 | centralized one and other non-green targeting solutions. Results show
1117 | that the distributed solution found by the ADMM constitutes a good
1118 | suboptimal solution, worth applying in a real environment.},
1119 | issn = {1862-4480},
1120 | doi = {10.1007/s11590-017-1208-8},
1121 | url = {https://doi.org/10.1007/s11590-017-1208-8}
1122 | }
1123 | ################ 2017
1124 |
1125 | @Article{clusterdata:Carvalho2017b,
1126 | author = {Marcus Carvalho and Daniel A. Menasc\'{e} and Francisco Brasileiro},
1127 | title = {Capacity planning for {IaaS} cloud providers offering multiple
1128 | service classes},
1129 | journal = {Future Generation Computer Systems},
1130 | volume = {77},
1131 | pages = {97--111},
1132 | month = Dec,
1133 | year = 2017,
1134 | abstract = {Infrastructure as a Service (IaaS) cloud providers typically offer
1135 | multiple service classes to satisfy users with different requirements
1136 | and budgets. Cloud providers are faced with the challenge of estimating
1137 | the minimum resource capacity required to meet Service Level Objectives
1138 | (SLOs) defined for all service classes. This paper proposes a capacity
1139 | planning method that is combined with an admission control mechanism to
1140 | address this challenge. The capacity planning method uses analytical
1141 | models to estimate the output of a quota-based admission control
1142 | mechanism and find the minimum capacity required to meet availability
1143 | SLOs and admission rate targets for all classes. An evaluation using
1144 | trace-driven simulations shows that our method estimates the best cloud
1145 | capacity with a mean relative error of 2.5\% with respect to the
1146 | simulation, compared to a 36\% relative error achieved by a single-class
1147 | baseline method that does not consider admission control
1148 | mechanisms. Moreover, our method exhibited a high SLO fulfillment for
1149 | both availability and admission rates, and obtained mean CPU utilization
1150 | over 91\%, while the single-class baseline method had values not greater
1151 | than 78\%.},
1152 | url = {http://www.sciencedirect.com/science/article/pii/S0167739X16308561},
1153 | doi = {10.1016/j.future.2017.07.019},
1154 | issn = {0167-739X},
1155 | }
1156 |
1157 | @inproceedings{clusterdata:Janus2017,
1158 | author = {Pawel Janus and Krzysztof Rzadca},
1159 | title = {{SLO}-aware Colocation of Data Center Tasks Based on Instantaneous Processor Requirements},
1160 | booktitle = {ACM Symposium on Cloud Computing (SoCC)},
1161 | year = 2017,
1162 | month = Sep,
1163 | pages = {256--268},
1164 | address = {Santa Clara, CA, USA},
1165 | publisher = {ACM},
1166 | abstract = {In a cloud data center, a single physical machine simultaneously
1167 | executes dozens of highly heterogeneous tasks. Such colocation results
1168 | in more efficient utilization of machines, but, when tasks' requirements
1169 | exceed available resources, some of the tasks might be throttled down or
1170 | preempted. We analyze version 2.1 of the Google cluster trace that
1171 | shows short-term (1 second) task CPU usage. Contrary to the assumptions
1172 | taken by many theoretical studies, we demonstrate that the empirical
1173 | distributions do not follow any single distribution. However, high
1174 | percentiles of the total processor usage (summed over at least 10 tasks)
1175 | can be reasonably estimated by the Gaussian distribution. We use this
1176 | result for a probabilistic fit test, called the Gaussian Percentile
1177 | Approximation (GPA), for standard bin-packing algorithms. To check
1178 | whether a new task will fit into a machine, GPA checks whether the
1179 | resulting distribution's percentile corresponding to the requested
1180 | service level objective (SLO) is still below the machine's capacity. In
1181 | our simulation experiments, GPA resulted in colocations exceeding the
1182 | machines' capacity with a frequency similar to the requested SLO.},
1183 | doi = {10.1145/3127479.3132244},
1184 | url = {http://arxiv.org/abs/1709.01384},
1185 | }
1186 |
1187 |
1188 | @InProceedings{clusterdata:Carvalho2017,
1189 | title = {Multi-dimensional admission control and capacity planning for {IaaS} clouds with multiple service classes},
1190 | author = {Carvalho, Marcus and Brasileiro, Francisco and Lopes, Raquel and Farias, Giovanni and Fook, Alessandro and Mafra, Jo\~{a}o and Turull, Daniel},
1191 | booktitle = {IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)},
1192 | year = 2017,
1193 | month = May,
1194 | pages = {160--169},
1195 | address = {Madrid, Spain},
1196 | keywords = {admission control, capacity planning, cloud computing, performance models, simulation},
1197 | abstract = {Infrastructure as a Service (IaaS) providers typically offer
1198 | multiple service classes to deal with the wide variety of users adopting
1199 | this cloud computing model. In this scenario, IaaS providers need to
1200 | perform efficient admission control and capacity planning in order to
1201 | minimize infrastructure costs, while fulfilling the different Service
1202 | Level Objectives (SLOs) defined for all service classes
1203 | offered. However, most of the previous work on this field consider a
1204 | single resource dimension -- typically CPU -- when making such
1205 | management decisions. We show that this approach will either increase
1206 | infrastructure costs due to over-provisioning, or violate SLOs due to
1207 | lack of capacity for the resource dimensions being ignored. To fill this
1208 | gap, we propose admission control and capacity planning methods that
1209 | consider multiple service classes and multiple resource dimensions. Our
1210 | results show that our admission control method can guarantee a high
1211 | availability SLO fulfillment in scenarios where both CPU and memory can
1212 | become the bottleneck resource. Moreover, we show that our capacity
1213 | planning method can find the minimum capacity required for both CPU and
1214 | memory to meet SLOs with good accuracy. We also analyze how the load
1215 | variation on one resource dimension can affect another, highlighting the
1216 | need to manage resources for multiple dimensions simultaneously.},
1217 | url = {https://doi.org/10.1109/CCGRID.2017.14},
1218 | doi = {10.1109/CCGRID.2017.14},
1219 | }
1220 |
1221 | @article{clusterdata:Sebastio2017,
1222 | title = {Optimal distributed task scheduling in volunteer clouds},
1223 | journal = {Computers and Operations Research},
1224 | volume = 81,
1225 | pages = {231--246},
1226 | year = 2017,
1227 | month = May,
1228 | issn = {0305-0548},
1229 | doi = {10.1016/j.cor.2016.11.004},
1230 | url = {http://www.sciencedirect.com/science/article/pii/S0305054816302660},
1231 | author = {Stefano Sebastio and Giorgio Gnecco and Alberto Bemporad},
1232 | keywords = {Cloud computing, Distributed optimization, Integer programming,
1233 | Combinatorial optimization, ADMM},
1234 | abstract = {The ever-increasing demand for computational resources has shifted
1235 | the computing paradigm towards solutions where less computation is
1236 | performed locally. The most widely adopted approach nowadays is
1237 | represented by cloud computing. With the cloud, users can transparently
1238 | access to virtually infinite resources with the same aptitude of using
1239 | any other utility. Next to the cloud, the volunteer computing paradigm
1240 | has gained attention in the last decade, where the spared resources on
1241 | each personal machine are shared thanks to the users� willingness to
1242 | cooperate. Cloud and volunteer paradigms have been recently seen as
1243 | companion technologies to better exploit the use of local
1244 | resources. Conversely, this scenario places complex challenges in
1245 | managing such a large-scale environment, as the resources available on
1246 | each node and the presence of the nodes online are not known
1247 | a-priori. The complexity further increases in presence of tasks that
1248 | have an associated Service Level Agreement specified, e.g., through a
1249 | deadline. Distributed management solutions have thus been advocated as the
1250 | only approaches that are realistically applicable. In this paper, we
1251 | propose a framework to allocate tasks according to different policies,
1252 | defined by suitable optimization problems. Then, we provide a
1253 | distributed optimization approach relying on the Alternating Direction
1254 | Method of Multipliers (ADMM) for one of these policies, and we compare
1255 | it with a centralized approach. Results show that, when a centralized
1256 | approach cannot be adopted in a real environment, it could be possible
1257 | to rely on the good suboptimal solutions found by the ADMM.}
1258 | }
1259 | ################ 2016
1260 |
1261 | @InProceedings{clusterdata:Zakarya2016,
1262 | title = {An energy aware cost recovery approach for virtual machine migration},
1263 | author = {Muhammad Zakarya and Lee Gillam},
1264 | year = 2016,
1265 | booktitle = {13th International Conference on Economics of Grids, Clouds, Systems and Services (GECON2016)},
1266 | month = Sep,
1267 | address = {Athens, Greece},
1268 | abstract = {Datacenters provide an IT backbone for today's business and
1269 | economy, and are the principal electricity consumers for Cloud
1270 | computing. Various studies suggest that approximately 30\% of the
1271 | running servers in US datacenters are idle and the others are
1272 | under-utilized, making it possible to save energy and money by using
1273 | Virtual Machine (VM) consolidation to reduce the number of hosts in
1274 | use. However, consolidation involves migrations that can be expensive in
1275 | terms of energy consumption, and sometimes it will be more energy
1276 | efficient not to consolidate. This paper investigates how migration
1277 | decisions can be made such that the energy costs involved with the
1278 | migration are recovered, as only when costs of migration have been
1279 | recovered will energy start to be saved. We demonstrate through a number
1280 | of experiments, using the Google workload traces for 12,583 hosts and
1281 | 1,083,309 tasks, how different VM allocation heuristics, combined with
1282 | different approaches to migration, will impact on energy efficiency. We
1283 | suggest, using reasonable assumptions for datacenter setup, that a
1284 | combination of energy-aware fill-up VM allocation and energy-aware
1285 | migration, and migration only for relatively long running VMs, provides
1286 | for optimal energy efficiency.},
1287 | url = {http://epubs.surrey.ac.uk/id/eprint/813810},
1288 | }
1289 |
1290 | @INPROCEEDINGS{clusterdata:Sliwko2016,
1291 | title = {{AGOCS} - Accurate {Google} Cloud Simulator Framework},
1292 | author = {Leszek Sliwko and Vladimir Getov},
1293 | booktitle = {16th IEEE International Conference on Scalable Computing and Communications (ScalCom 2016)},
1294 | year = 2016,
1295 | month = Jul,
1296 | pages={550--558},
1297 | address = {Toulouse, France},
1298 | keywords = {cloud system; workload traces; workload simulation framework; google cluster data},
1299 | abstract = {This paper presents the Accurate Google Cloud Simulator (AGOCS) -
1300 | a novel high-fidelity Cloud workload simulator based on parsing real
1301 | workload traces, which can be conveniently used on a desktop machine for
1302 | day-to-day research. Our simulation is based on real-world workload
1303 | traces from a Google Cluster with 12.5K nodes, over a period of a
1304 | calendar month. The framework is able to reveal very precise and
1305 | detailed parameters of the executed jobs, tasks and nodes as well as to
1306 | provide actual resource usage statistics. The system has been
1307 | implemented in Scala language with focus on parallel execution and an
1308 | easy-to-extend design concept. The paper presents the detailed
1309 | structural framework for AGOCS and discusses our main design decisions,
1310 | whilst also suggesting alternative and possibly performance enhancing
1311 | future approaches. The framework is available via the Open Source GitHub
1312 | repository.},
1313 | url = {http://dx.doi.org/10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.10},
1314 | doi={10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.10},
1315 | }
1316 |
1317 | ################ 2015
1318 |
1319 | @INPROCEEDINGS{clusterdata:Carvalho2015,
1320 | title = {Prediction-Based Admission Control for {IaaS} Clouds with Multiple Service Classes},
1321 | author = {Marcus Carvalho and Daniel A. Menasc\'{e} and Francisco Brasileiro},
1322 | booktitle = {IEEE International Conference on Cloud Computing Technology and Science (CloudCom)},
1323 | year = 2015,
1324 | month = Nov,
1325 | pages={82--90},
1326 | address = {Vancouver, BC, Canada},
1327 | keywords = {admission control;cloud computing;infrastructure-as-a-service;
1328 | performance prediction;quality of service;resource management},
1329 | abstract = {There is a growing adoption of cloud computing services,
1330 | attracting users with different requirements and budgets to run
1331 | their applications in cloud infrastructures. In order to match
1332 | users' needs, cloud providers can offer multiple service
1333 | classes with different pricing and Service Level Objective (SLO)
1334 | guarantees. Admission control mechanisms can help providers to
1335 | meet target SLOs by limiting the demand at peak periods. This
1336 | paper proposes a prediction-based admission control model for
1337 | IaaS clouds with multiple service classes, aiming to maximize
1338 | request admission rates while fulfilling availability SLOs
1339 | defined for each class. We evaluate our approach with trace-driven
1340 | simulations fed with data from production systems. Our results
1341 | show that admission control can reduce SLO violations
1342 | significantly, especially in underprovisioned scenarios. Moreover,
1343 | our predictive heuristics are less sensitive to different capacity
1344 | planning and SLO decisions, as they fulfill availability SLOs for
1345 | more than 91\% of requests even in the worst case scenario, for
1346 | which only 56\% of SLOs are fulfilled by a simpler greedy heuristic
1347 | and as little as 0.2\% when admission control is not used.},
1348 | url = {http://dx.doi.org/10.1109/CloudCom.2015.16},
1349 | doi={10.1109/CloudCom.2015.16},
1350 | }
1351 |
1352 | @INPROCEEDINGS{clusterdata:Ismaeel2015,
1353 | author = {Salam Ismaeel and Ali Miri},
1354 | title = {Using {ELM} Techniques to Predict Data Centre {VM} Requests},
1355 | year = 2015,
1356 | booktitle = {IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)},
1357 | month = Nov,
1358 | publisher = {IEEE},
1359 | address = {New York, NY, USA},
1360 | abstract = {Data centre prediction models can be used to forecast future loads
1361 | for a given centre in terms of CPU, memory, VM requests, and other
1362 | parameters. An effective and efficient model can not only be used to
1363 | optimize resource allocation, but can also be used as part of a strategy
1364 | to conserve energy, improve performance and increase profits for both
1365 | clients and service providers. In this paper, we have developed a
1366 | prediction model, which combines k-means clustering techniques and
1367 | Extreme Learning Machines (ELMs). We have shown the effectiveness of our
1368 | proposed model by using it to estimate future VM requests in a data
1369 | centre based on its historical usage. We have tested our model on real
1370 | Google traces that feature over 25 million tasks collected over a 29-day
1371 | time period. Experimental results presented show that our proposed
1372 | system outperforms other models reported in the literature.},
1373 | }
1374 |
1375 | @INPROCEEDINGS{clusterdata:Sebastio2015b,
1376 | author = {Stefano Sebastio and Antonio Scala},
1377 | booktitle = {2015 IEEE Conference on Collaboration and Internet Computing (CIC)},
1378 | title = {A workload-based approach to partition the volunteer cloud},
1379 | year = 2015,
1380 | month = Oct,
1381 | pages = {210--218},
1382 | abstract = {The growing demand for computational resources has shifted users
1383 | towards the adoption of cloud computing technologies. Cloud allows users
1384 | to transparently access remote computing capabilities as a
1385 | utility. The volunteer computing paradigm, another ICT trend of the last
1386 | years, can be considered a companion force to enhance the cloud in
1387 | fulfilling specific domain requirements, such as computational intensive
1388 | requests. By combining the spare resources provided by volunteer nodes
1389 | with a few data centers, it is possible to obtain a robust and scalable
1390 | cloud platform. The price for such benefits lies in the increased challenge
1391 | of designing and managing a dynamic complex system composed of heterogeneous
1392 | nodes. Task execution requests submitted in the volunteer cloud are
1393 | usually associated with Quality of Service requirements, e.g., specified
1394 | through an execution deadline. In this paper, we present a preliminary
1395 | evaluation of a cloud partitioning approach to distribute task execution
1396 | requests in the volunteer cloud, which has been validated through a
1397 | simulation-based statistical analysis using the Google workload data
1398 | trace.},
1399 | keywords = {cloud computing;computer centres;digital simulation;quality of
1400 | service;statistical analysis;volunteer computing;workload-based
1401 | approach;volunteer cloud partitioning;computational resources;cloud
1402 | computing technologies;remote computing capabilities;volunteer computing
1403 | paradigm;volunteer nodes;data centers;cloud platform;dynamic complex
1404 | system;heterogeneous nodes;task execution request;quality of service
1405 | requirements;simulation-based statistical analysis;Google workload data
1406 | trace;Cloud computing;Measurement;Peer-to-peer computing;Quality of
1407 | service;Google;Overlay networks;Computer applications;cloud
1408 | computing;autonomic clouds;autonomous systems;volunteer
1409 | computing;distributed tasks execution},
1410 | doi = {10.1109/CIC.2015.27},
1411 | }
1412 |
1413 | @INPROCEEDINGS{clusterdata:Sirbu2015,
1414 | title = {Towards Data-Driven Autonomics in Data Centers},
1415 | author = {Alina S{\^\i}rbu and Ozalp Babaoglu},
1416 | booktitle = {International Conference on Cloud and Autonomic Computing (ICCAC)},
1417 | month = Sep,
1418 | year = 2015,
1419 | address = {Cambridge, MA, USA},
1420 | publisher = {IEEE Computer Society},
1421 | keywords = {Data science; predictive analytics; Google cluster
1422 | trace; log data analysis; failure prediction; machine learning
1423 | classification; ensemble classifier; random forest; BigQuery},
1424 | abstract = {Continued reliance on human operators for managing data centers is
1425 | a major impediment for them from ever reaching extreme dimensions.
1426 | Large computer systems in general, and data centers in particular, will
1427 | ultimately be managed using predictive computational and executable
1428 | models obtained through data-science tools, and at that point, the
1429 | intervention of humans will be limited to setting high-level goals and
1430 | policies rather than performing low-level operations. Data-driven
1431 | autonomics, where management and control are based on holistic
1432 | predictive models that are built and updated using generated data, opens
1433 | one possible path towards limiting the role of operators in data
1434 | centers. In this paper, we present a data-science study of a public
1435 | Google dataset collected in a 12K-node cluster with the goal of building
1436 | and evaluating a predictive model for node failures. We use BigQuery,
1437 | the big data SQL platform from the Google Cloud suite, to process
1438 | massive amounts of data and generate a rich feature set characterizing
1439 | machine state over time. We describe how an ensemble classifier can be
1440 | built out of many Random Forest classifiers each trained on these
1441 | features, to predict if machines will fail in a future 24-hour
1442 | window. Our evaluation reveals that if we limit false positive rates to
1443 | 5\%, we can achieve true positive rates between 27\% and 88\% with
1444 | precision varying between 50\% and 72\%. We discuss the practicality of
1445 | including our predictive model as the central component of a data-driven
1446 | autonomic manager and operating it on-line with live data streams
1447 | (rather than off-line on data logs). All of the scripts used for
1448 | BigQuery and classification analyses are publicly available from the
1449 | authors' website.},
1450 | url = {http://www.cs.unibo.it/babaoglu/papers/pdf/CAC2015.pdf},
1451 | }
1452 |
1453 | @inproceedings{clusterdata:Delgado2015hawk,
1454 | author = {Pamela Delgado and Florin Dinu and Anne-Marie Kermarrec and Willy Zwaenepoel},
1455 | title = {{Hawk}: hybrid datacenter scheduling},
1456 | year = {2015},
1457 | booktitle = {USENIX Annual Technical Conference (USENIX ATC)},
1458 | month = Jul,
1459 | publisher = {USENIX Association},
1460 | pages = {499--510},
1461 | address = {Santa Clara, CA, USA},
1462 | isbn = {978-1-931971-225},
1463 | url = {https://www.usenix.org/conference/atc15/technical-session/presentation/delgado},
1464 | abstract = {This paper addresses the problem of efficient scheduling of large
1465 | clusters under high load and heterogeneous workloads. A heterogeneous
1466 | workload typically consists of many short jobs and a small number of
1467 | large jobs that consume the bulk of the cluster's resources.
1468 |
1469 | Recent work advocates distributed scheduling to overcome the limitations
1470 | of centralized schedulers for large clusters with many competing
1471 | jobs. Such distributed schedulers are inherently scalable, but may make
1472 | poor scheduling decisions because of limited visibility into the overall
1473 | resource usage in the cluster. In particular, we demonstrate that under
1474 | high load, short jobs can fare poorly with such a distributed scheduler.
1475 |
1476 | We propose instead a new hybrid centralized/distributed scheduler,
1477 | called Hawk. In Hawk, long jobs are scheduled using a centralized
1478 | scheduler, while short ones are scheduled in a fully distributed
1479 | way. Moreover, a small portion of the cluster is reserved for the use of
1480 | short jobs. In order to compensate for the occasional poor decisions
1481 | made by the distributed scheduler, we propose a novel and efficient
1482 | randomized work-stealing algorithm.
1483 |
1484 | We evaluate Hawk using a trace-driven simulation and a prototype
1485 | implementation in Spark. In particular, using a Google trace, we show
1486 | that under high load, compared to the purely distributed Sparrow
1487 | scheduler, Hawk improves the 50th and 90th percentile runtimes by 80\%
1488 | and 90\% for short jobs and by 35\% and 10\% for long jobs,
1489 | respectively. Measurements of a prototype implementation using Spark on
1490 | a 100-node cluster confirm the results of the simulation.},
1491 | }
1492 |
1493 | @article{clusterdata:Sebastio2015,
1494 | author = {Sebastio, Stefano and Amoretti, Michele and Lluch-Lafuente, Alberto},
1495 | title = {{AVOCLOUDY}: a simulator of volunteer clouds},
1496 | journal = {Software: Practice and Experience},
1497 | volume = {46},
1498 | number = {1},
1499 | pages = {3--30},
1500 | year = 2015,
1501 | month = Jan,
1502 | keywords = {cloud computing, volunteer computing, autonomic computing, distributed computing, discrete event simulation},
1503 | doi = {10.1002/spe.2345},
1504 | url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.2345},
1505 | eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/spe.2345},
1506 | abstract = {The increasing demand of computational and storage resources is
1507 | shifting users toward the adoption of cloud technologies. Cloud
1508 | computing is based on the vision of computing as utility, where users no
1509 | more need to buy machines but simply access remote resources made
1510 | available on-demand by cloud providers. The relationship between users
1511 | and providers is defined by a service-level agreement, where the
1512 | non-fulfillment of its terms is regulated by the associated penalty
1513 | fees. Therefore, it is important that the providers adopt proper
1514 | monitoring and managing strategies. Despite their reduced application,
1515 | intelligent agents constitute a feasible technology to add autonomic
1516 | features to cloud operations. Furthermore, the volunteer computing
1517 | paradigm, one of the Information and Communications Technology (ICT)
1518 | trends of the last decade, can be pulled alongside traditional cloud
1519 | approaches, with the purpose to ``green'' them. Indeed, the
1520 | combination of data center and volunteer resources, managed by agents,
1521 | allows one to obtain a more robust and scalable cloud computing
1522 | platform. The increased challenges in designing such a complex system
1523 | can benefit from a simulation-based approach, to test autonomic
1524 | management solutions before their deployment in the production
1525 | environment. However, currently available simulators of cloud platforms
1526 | are not suitable to model and analyze such heterogeneous, large-scale,
1527 | and highly dynamic systems. We propose the AVOCLOUDY simulator to fill
1528 | this gap. This paper presents the internal architecture of the
1529 | simulator, provides implementation details, summarizes several notable
1530 | applications, and provides experimental results that measure the
1531 | simulator performance and its accuracy. The latter experiments are based
1532 | on real-world worldwide distributed computations on top of the PlanetLab
1533 | platform.}
1534 | }
1535 |
1536 | ################ 2014
1537 |
1538 | @InProceedings{clusterdata:Iglesias2014:task-estimation,
1539 | author = {Jesus Omana Iglesias and Liam Murphy and Milan De
1540 | Cauwer and Deepak Mehta and Barry O'Sullivan},
1541 | title = {A methodology for online consolidation of tasks through
1542 | more accurate resource estimations},
1543 | year = 2014,
1544 | month = Dec,
1545 | booktitle = {IEEE/ACM Intl. Conf. on Utility and Cloud Computing (UCC)},
1546 | address = {London, UK},
1547 | abstract = {Cloud providers aim to provide computing services for a wide range
1548 | of applications, such as web applications, emails, web searches, and map
1549 | reduce jobs. These applications are commonly scheduled to run on
1550 | multi-purpose clusters that nowadays are becoming larger and more
1551 | heterogeneous. A major challenge is to efficiently utilize the cluster's
1552 | available resources, in particular to maximize overall machine
1553 | utilization levels while minimizing application waiting time. We studied
1554 | a publicly available trace from a large Google cluster ($\sim$12,000
1555 | machines) and observed that users generally request more resources than
1556 | required for running their tasks, leading to low levels of utilization.
1557 | In this paper, we propose a methodology for achieving an efficient
1558 | utilization of the cluster's resources while providing the users with
1559 | fast and reliable computing services. The methodology consists of three
1560 | main modules: i) a prediction module that forecasts the maximum resource
1561 | requirement of a task; ii) a scalable scheduling module that efficiently
1562 | allocates tasks to machines; and iii) a monitoring module that tracks
1563 | the levels of utilization of the machines and tasks. We present results
1564 | that show that the impact of more accurate resource estimations for the
1565 | scheduling of tasks can lead to an increase in the average utilization
1566 | of the cluster, a reduction in the number of tasks being evicted, and a
1567 | reduction in task waiting time.},
1568 | keywords = {online scheduling, Cloud computing, forecasting, resource provisioning,
1569 | constraint programming},
1570 | }
1571 |
1572 | @InProceedings{clusterdata:Balliu2014,
1573 | author = {Alkida Balliu and Dennis Olivetti and Ozalp Babaoglu and
1574 | Moreno Marzolla and Alina Sirbu},
1575 | title = {{BiDAl: Big Data Analyzer} for cluster traces},
1576 | year = 2014,
1577 | booktitle = {Informatik Workshop on System Software Support for Big Data (BigSys)},
1578 | month = Sep,
1579 | publisher = {GI-Edition Lecture Notes in Informatics},
1580 | abstract = {Modern data centers that provide Internet-scale services are
1581 | stadium-size structures housing tens of thousands of heterogeneous
1582 | devices (server clusters, networking equipment, power and cooling
1583 | infrastructures) that must operate continuously and reliably. As part
1584 | of their operation, these devices produce large amounts of data in the
1585 | form of event and error logs that are essential not only for identifying
1586 | problems but also for improving data center efficiency and
1587 | management. These activities employ data analytics and often exploit
1588 | hidden statistical patterns and correlations among different factors
1589 | present in the data. Uncovering these patterns and correlations is
1590 | challenging due to the sheer volume of data to be analyzed. This paper
1591 | presents BiDAl, a prototype ``log-data analysis framework'' that
1592 | incorporates various Big Data technologies to simplify the analysis of
1593 | data traces from large clusters. BiDAl is written in Java with a modular
1594 | and extensible architecture so that different storage backends
1595 | (currently, HDFS and SQLite are supported), as well as different
1596 | analysis languages (current implementation supports SQL, R and Hadoop
1597 | MapReduce) can be easily selected as appropriate. We present the design
1598 | of BiDAl and describe our experience using it to analyze several public
1599 | traces of Google data clusters for building a simulation model capable
1600 | of reproducing observed behavior.},
1601 | }
1602 |
1603 | @inproceedings{clusterdata:Caglar2014,
1604 | title = {{iOverbook}: intelligent resource-overbooking to support
1605 | soft real-time applications in the cloud},
1606 | author = {Faruk Caglar and Aniruddha Gokhale},
1607 | booktitle = {7th IEEE International Conference on Cloud Computing (IEEE CLOUD)},
1608 | year = 2014,
1609 | month = {Jun--Jul},
1610 | address = {Anchorage, AK, USA},
1611 | abstract = {Cloud service providers (CSPs) often overbook their resources
1612 | with user applications despite having to maintain service-level
1613 | agreements with their customers. Overbooking is attractive to CSPs
1614 | because it helps to reduce power consumption in the data center by
1615 | packing more user jobs onto fewer resources while improving their
1616 | profits. Overbooking becomes feasible because user applications tend to
1617 | overestimate their resource requirements, utilizing only a fraction of
1618 | the allocated resources. Arbitrary resource overbooking ratios, however,
1619 | may be detrimental to soft real-time applications, such as airline
1620 | reservations or Netflix video streaming, which are increasingly hosted
1621 | in the cloud. The changing dynamics of the cloud preclude an offline
1622 | determination of overbooking ratios. To address these concerns, this
1623 | paper presents iOverbook, which uses a machine learning approach to make
1624 | systematic and online determination of overbooking ratios such that the
1625 | quality of service needs of soft real-time systems can be met while
1626 | still benefiting from overbooking. Specifically, iOverbook utilizes
1627 | historic data of tasks and host machines in the cloud to extract their
1628 | resource usage patterns and predict future resource usage along with the
1629 | expected mean performance of host machines. To evaluate our approach, we
1630 | have used a large usage trace made available by Google of one of its
1631 | production data centers. In the context of the traces, our experiments
1632 | show that iOverbook can help CSPs improve their resource utilization by
1633 | an average of 12.5\% and save 32\% power in the data center.},
1634 | url = {http://www.dre.vanderbilt.edu/~gokhale/WWW/papers/CLOUD-2014.pdf},
1635 | }
1636 |
1637 | @inproceedings{clusterdata:Sebastio2014,
1638 | author = {Sebastio, Stefano and Amoretti, Michele and Lluch Lafuente, Alberto},
1639 | title = {A computational field framework for collaborative task
1640 | execution in volunteer clouds},
1641 | booktitle = {International Symposium on Software Engineering for
1642 | Adaptive and Self-Managing Systems (SEAMS)},
1643 | year = 2014,
1644 | month = Jun,
1645 | isbn = {978-1-4503-2864-7},
1646 | address = {Hyderabad, India},
1647 | pages = {105--114},
1648 | url = {http://doi.acm.org/10.1145/2593929.2593943},
1649 | doi = {10.1145/2593929.2593943},
1650 | publisher = {ACM},
1651 | keywords = {ant colony optimization, bio-inspired algorithms, cloud computing,
1652 | distributed tasks execution, peer-to-peer, self-* systems, spatial
1653 | computing, volunteer computing},
1654 | abstract = {The increasing diffusion of cloud technologies offers new
1655 | opportunities for distributed and collaborative computing. Volunteer
1656 | clouds are a prominent example, where participants join and leave the
1657 | platform and collaborate by sharing computational resources. The high
1658 | complexity, dynamism and unpredictability of such scenarios call for
1659 | decentralized self-* approaches. We present in this paper a framework
1660 | for the design and evaluation of self-adaptive collaborative task
1661 | execution strategies in volunteer clouds. As a byproduct, we propose a
1662 | novel strategy based on the Ant Colony Optimization paradigm, that we
1663 | validate through simulation-based statistical analysis over Google
1664 | cluster data.},
1665 | }
1666 |
1667 | @inproceedings{clusterdata:Breitgand2014-adaptive,
1668 | title = {An adaptive utilization accelerator for virtualized environments},
1669 | author = {Breitgand, David and Dubitzky, Zvi and Epstein, Amir and
1670 | Feder, Oshrit and Glikson, Alex and Shapira, Inbar and
1671 | Toffetti, Giovanni},
1672 | booktitle = {International Conference on Cloud Engineering (IC2E)},
1673 | pages = {165--174},
1674 | year = 2014,
1675 | month = Mar,
1676 | publisher = {IEEE},
1677 | address = {Boston, MA, USA},
1678 | abstract = {One of the key enablers of a cloud provider's competitiveness is
1679 | the ability to over-commit shared infrastructure at ratios that are higher
1680 | than those of other competitors, without compromising non-functional
1681 | requirements, such as performance. A widely recognized impediment to
1682 | achieving this goal is so called ``Virtual Machines sprawl'', a
1683 | phenomenon referring to the situation when customers order Virtual
1684 | Machines (VM) on the cloud, use them extensively and then leave them
1685 | inactive for prolonged periods of time. Since a typical cloud
1686 | provisioning system treats new VM provision requests according to the
1687 | nominal virtual hardware specification, an often occurring situation is
1688 | that the nominal resources of a cloud/pool become exhausted fast while
1689 | the physical hosts utilization remains low. We present IBM adaPtive
1690 | UtiLiSation AcceleratoR (IBM PULSAR), a cloud resources scheduler that
1691 | extends OpenStack Nova Filter Scheduler. IBM PULSAR recognises that
1692 | the effective, safely attainable over-commit ratio varies with time due to
1693 | workloads' variability and dynamically adapts the effective over-commit
1694 | ratio to these changes.},
1695 | }
1696 |
1697 | @ARTICLE{clusterdata:Zhang2014-Harmony,
1698 | author = {Qi Zhang and Mohamed Faten Zhani and Raouf Boutaba and
1699 | Joseph L Hellerstein},
1700 | title = {Dynamic heterogeneity-aware resource provisioning in the cloud},
1701 | journal = {IEEE Transactions on Cloud Computing (TCC)},
1702 | year = 2014,
1703 | month = Mar,
1704 | volume = 2,
1705 | number = 1,
1706 | abstract = {Data centers consume tremendous amounts of energy in terms of
1707 | power distribution and cooling. Dynamic capacity provisioning is a
1708 | promising approach for reducing energy consumption by dynamically
1709 | adjusting the number of active machines to match resource
1710 | demands. However, despite extensive studies of the problem, existing
1711 | solutions have not fully considered the heterogeneity of both workload
1712 | and machine hardware found in production environments. In particular,
1713 | production data centers often comprise heterogeneous machines with
1714 | different capacities and energy consumption characteristics. Meanwhile,
1715 | the production cloud workloads typically consist of diverse applications
1716 | with different priorities, performance and resource
1717 | requirements. Failure to consider the heterogeneity of both machines and
1718 | workloads will lead to both sub-optimal energy-savings and long
1719 | scheduling delays, due to incompatibility between workload requirements
1720 | and the resources offered by the provisioned machines. To address this
1721 | limitation, we present Harmony, a Heterogeneity-Aware dynamic capacity
1722 | provisioning scheme for cloud data centers. Specifically, we first use
1723 | the K-means clustering algorithm to divide workload into distinct task
1724 | classes with similar characteristics in terms of resource and
1725 | performance requirements. Then we present a technique that dynamically
1726 | adjusts the number of machines to minimize total energy consumption
1727 | and scheduling delay. Simulations using traces from a Google compute
1728 | cluster demonstrate that Harmony can reduce energy by 28 percent compared to
1729 | heterogeneity-oblivious solutions.},
1730 | }
1731 |
1732 | ################ 2013
1733 |
1734 | @INPROCEEDINGS{clusterdata:Di2013a,
1735 | title = {Optimization of cloud task processing with checkpoint-restart mechanism},
1736 | author = {Di, Sheng and Robert, Yves and Vivien, Fr\'ed\'eric and
1737 | Kondo, Derrick and Wang, Cho-Li and Cappello, Franck},
1738 | booktitle = {25th International Conference on High Performance
1739 | Computing, Networking, Storage and Analysis (SC)},
1740 | year = 2013,
1741 | month = Nov,
1742 | address = {Denver, CO, USA},
1743 | abstract = {In this paper, we aim at optimizing fault-tolerance techniques
1744 | based on a checkpointing/restart mechanism, in the context of cloud
1745 | computing. Our contribution is three-fold. (1) We derive a fresh formula
1746 | to compute the optimal number of checkpoints for cloud jobs with varied
1747 | distributions of failure events. Our analysis is not only generic with
1748 | no assumption on failure probability distribution, but also attractively
1749 | simple to apply in practice. (2) We design an adaptive algorithm to
1750 | optimize the impact of checkpointing regarding various costs like
1751 | checkpointing/restart overhead. (3) We evaluate our optimized solution
1752 | in a real cluster environment with hundreds of virtual machines and
1753 | Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated
1754 | via a production trace produced on a large-scale Google data
1755 | center. Experiments confirm that our solution is fairly suitable for
1756 | Google systems. Our optimized formula outperforms Young's formula by
1757 | 3--10 percent, reducing wallclock lengths by 50--100 seconds per job on
1758 | average.},
1759 | }
1760 |
1761 | @inproceedings{clusterdata:Qiang2013-anomaly,
1762 | author = {Qiang Guan and Song Fu},
1763 | title = {Adaptive Anomaly Identification by Exploring Metric
1764 | Subspace in Cloud Computing Infrastructures},
1765 | booktitle = {32nd IEEE Symposium on Reliable Distributed Systems (SRDS)},
1766 | year = 2013,
1767 | month = Sep,
1768 | pages = {205--214},
1769 | address = {Braga, Portugal},
1770 | abstract = {Cloud computing has become increasingly popular by obviating the
1771 | need for users to own and maintain complex computing
1772 | infrastructures. However, due to their inherent complexity and large
1773 | scale, production cloud computing systems are prone to various runtime
1774 | problems caused by hardware and software faults and environmental
1775 | factors. Autonomic anomaly detection is a crucial technique for
1776 | understanding emergent, cloud-wide phenomena and self-managing cloud
1777 | resources for system-level dependability assurance. To detect anomalous
1778 | cloud behaviors, we need to monitor the cloud execution and collect
1779 | runtime cloud performance data. These data consist of values of
1780 | performance metrics for different types of failures, which display
1781 | different correlations with the performance metrics. In this paper, we
1782 | present an adaptive anomaly identification mechanism that explores the
1783 | most relevant principal components of different failure types in cloud
1784 | computing infrastructures. It integrates the cloud performance metric
1785 | analysis with filtering techniques to achieve automated, efficient, and
1786 | accurate anomaly identification. The proposed mechanism adapts itself by
1787 | recursively learning from the newly verified detection results to refine
1788 | future detections. We have implemented a prototype of the anomaly
1789 | identification system and conducted experiments in an on-campus cloud
1790 | computing environment and by using the Google data center traces. Our
1791 | experimental results show that our mechanism can achieve more efficient
1792 | and accurate anomaly detection than other existing schemes.},
1793 | }
1794 |
1795 | @INPROCEEDINGS{clusterdata:Zhani2013-HARMONY,
1796 | title = {{HARMONY}: dynamic heterogeneity-aware resource provisioning in the cloud},
1797 | author = {Qi Zhang and Mohamed Faten Zhani and Raouf Boutaba and
1798 | Joseph L. Hellerstein},
1799 | booktitle = {33rd International Conference on Distributed Computing Systems (ICDCS)},
1800 | year = 2013,
1801 | pages = {510--519},
1802 | month = Jul,
1803 | address = {Philadelphia, PA, USA},
1804 | abstract = {Data centers today consume tremendous amounts of energy in terms
1805 | of power distribution and cooling. Dynamic capacity provisioning is a
1806 | promising approach for reducing energy consumption by dynamically
1807 | adjusting the number of active machines to match resource
1808 | demands. However, despite extensive studies of the problem, existing
1809 | solutions for dynamic capacity provisioning have not fully considered
1810 | the heterogeneity of both workload and machine hardware found in
1811 | production environments. In particular, production data centers often
1812 | comprise several generations of machines with different capacities,
1813 | capabilities and energy consumption characteristics. Meanwhile, the
1814 | workloads running in these data centers typically consist of a wide
1815 | variety of applications with different priorities, performance
1816 | objectives and resource requirements. Failure to consider heterogeneous
1817 | characteristics will lead to both sub-optimal energy-savings and long
1818 | scheduling delays, due to incompatibility between workload requirements
1819 | and the resources offered by the provisioned machines. To address this
1820 | limitation, in this paper we present HARMONY, a Heterogeneity-Aware
1821 | Resource Management System for dynamic capacity provisioning in cloud
1822 | computing environments. Specifically, we first use the K-means
1823 | clustering algorithm to divide the workload into distinct task classes
1824 | with similar characteristics in terms of resource and performance
1825 | requirements. Then we present a novel technique for dynamically
1826 | adjusting the number of machines of each type to minimize total energy
1827 | consumption and performance penalty in terms of scheduling
1828 | delay. Through simulations using real traces from Google's compute
1829 | clusters, we found that our approach can improve data center energy
1830 | efficiency by up to 28\% compared to heterogeneity-oblivious
1831 | solutions.},
1832 | }
1833 |
1834 | @INPROCEEDINGS{clusterdata:Amoretti2013,
1835 | title = {A cooperative approach for distributed task execution in autonomic clouds},
1836 | author = {Amoretti, M. and Lafuente, A.L. and Sebastio, S.},
1837 | booktitle = {21st Euromicro International Conference on Parallel,
1838 | Distributed and Network-Based Processing (PDP)},
1839 | publisher = {IEEE},
1840 | year = 2013,
1841 | month = Feb,
1842 | pages = {274--281},
1843 | abstract = {Virtualization and distributed computing are two key pillars that
1844 | guarantee scalability of applications deployed in the Cloud. In
1845 | Autonomous Cooperative Cloud-based Platforms, autonomous computing nodes
1846 | cooperate to offer a PaaS Cloud for the deployment of user
1847 | applications. Each node must allocate the necessary resources for
1848 | applications to be executed with certain QoS guarantees. If the QoS of
1849 | an application cannot be guaranteed a node has mainly two options: to
1850 | allocate more resources (if it is possible) or to rely on the
1851 | collaboration of other nodes. Making a decision is not trivial since it
1852 | involves many factors (e.g. the cost of setting up virtual machines,
1853 | migrating applications, discovering collaborators). In this paper we
1854 | present a model of such scenarios and experimental results validating
1855 | the convenience of cooperative strategies over selfish ones, where nodes
1856 | do not help each other. We describe the architecture of the platform of
1857 | autonomous clouds and the main features of the model, which has been
1858 | implemented and evaluated in the DEUS discrete-event simulator. From the
1859 | experimental evaluation, based on workload data from the Google Cloud
1860 | Backend, we can conclude that (modulo our assumptions and
1861 | simplifications) the performance of a volunteer cloud can be compared to
1862 | that of a Google Cluster.},
1863 | doi = {10.1109/PDP.2013.47},
1864 | ISSN = {1066-6192},
1865 | address = {Belfast, UK},
1866 | url = {http://doi.ieeecomputersociety.org/10.1109/PDP.2013.47},
1867 | }
1868 |
1869 | ################ 2012
1870 |
1871 | @INPROCEEDINGS{clusterdata:Di2012b,
1872 | title = {Host load prediction in a {Google} compute cloud with a {Bayesian} model},
1873 | author = {Di, Sheng and Kondo, Derrick and Cirne, Walfredo},
1874 | booktitle = {International Conference on High Performance Computing,
1875 | Networking, Storage and Analysis (SC)},
1876 | year = 2012,
1877 | month = Nov,
1878 | isbn = {978-1-4673-0804-5},
1879 | address = {Salt Lake City, UT, USA},
1880 | pages = {21:1--21:11},
1881 | abstract = {Prediction of host load in Cloud systems is critical for achieving
1882 | service-level agreements. However, accurate prediction of host load in
1883 | Clouds is extremely challenging because it fluctuates drastically at
1884 | small timescales. We design a prediction method based on Bayes model to
1885 | predict the mean load over a long-term time interval, as well as the
1886 | mean load in consecutive future time intervals. We identify novel
1887 | predictive features of host load that capture the expectation,
1888 | predictability, trends and patterns of host load. We also determine the
1889 | most effective combinations of these features for prediction. We
1890 | evaluate our method using a detailed one-month trace of a Google data
1891 | center with thousands of machines. Experiments show that the Bayes
1892 | method achieves high accuracy with a mean squared error of
1893 | 0.0014. Moreover, the Bayes method improves the load prediction accuracy
1894 | by 5.6--50\% compared to other state-of-the-art methods based on moving
1895 | averages, auto-regression, and/or noise filters.},
1896 | url = {http://dl.acm.org/citation.cfm?id=2388996.2389025},
1897 | publisher = {IEEE Computer Society Press},
1898 | }
1899 |
1900 | @INPROCEEDINGS{clusterdata:Zhang2012,
1901 | title = {Dynamic energy-aware capacity provisioning for cloud computing environments},
1902 | author = {Zhang, Qi and Zhani, Mohamed Faten and Zhang, Shuo and
1903 | Zhu, Quanyan and Boutaba, Raouf and Hellerstein, Joseph L.},
1904 | booktitle = {9th ACM International Conference on Autonomic Computing (ICAC)},
1905 | year = 2012,
1906 | month = Sep,
1907 | isbn = {978-1-4503-1520-3},
1908 | address = {San Jose, CA, USA},
1909 | pages = {145--154},
1910 | acmid = {2371562},
1911 | publisher = {ACM},
1912 | doi = {10.1145/2371536.2371562},
1913 | keywords = {cloud computing, energy management, model predictive
1914 | control, resource management},
1915 | abstract = {Data centers have recently gained significant popularity as a
1916 | cost-effective platform for hosting large-scale service
1917 | applications. While large data centers enjoy economies of scale by
1918 | amortizing initial capital investment over large number of machines,
1919 | they also incur tremendous energy cost in terms of power distribution
1920 | and cooling. An effective approach for saving energy in data centers is
1921 | to adjust dynamically the data center capacity by turning off unused
1922 | machines. However, this dynamic capacity provisioning problem is known
1923 | to be challenging as it requires a careful understanding of the resource
1924 | demand characteristics as well as considerations to various cost
1925 | factors, including task scheduling delay, machine reconfiguration cost
1926 | and electricity price fluctuation. In this paper, we provide a
1927 | control-theoretic solution to the dynamic capacity provisioning problem
1928 | that minimizes the total energy cost while meeting the performance
1929 | objective in terms of task scheduling delay. Specifically, we model this
1930 | problem as a constrained discrete-time optimal control problem, and use
1931 | Model Predictive Control (MPC) to find the optimal control
1932 | policy. Through extensive analysis and simulation using real workload
1933 | traces from Google's compute clusters, we show that our proposed
1934 | framework can achieve significant reduction in energy cost, while
1935 | maintaining an acceptable average scheduling delay for individual
1936 | tasks.},
1937 | }
1938 |
1939 | @INPROCEEDINGS{clusterdata:Ali-Eldin2012,
1940 | title = {Efficient provisioning of bursty scientific workloads on the
1941 | cloud using adaptive elasticity control},
1942 | author = {Ahmed Ali-Eldin and Maria Kihl and Johan Tordsson and Erik Elmroth},
1943 | booktitle = {3rd Workshop on Scientific Cloud Computing (ScienceCloud)},
1944 | year = 2012,
1945 | month = Jun,
1946 | address = {Delft, The Netherlands},
1947 | isbn = {978-1-4503-1340-7},
1948 | pages = {31--40},
1949 | url = {http://dl.acm.org/citation.cfm?id=2287044},
1950 | doi = {10.1145/2287036.2287044},
1951 | publisher = {ACM},
1952 | abstract = {Elasticity is the ability of a cloud infrastructure to dynamically
1953 | change the amount of resources allocated to a running service as load
1954 | changes. We build an autonomous elasticity controller that changes the
1955 | number of virtual machines allocated to a service based on both
1956 | monitored load changes and predictions of future load. The cloud
1957 | infrastructure is modeled as a G/G/N queue. This model is used to
1958 | construct a hybrid reactive-adaptive controller that quickly reacts to
1959 | sudden load changes, prevents premature release of resources, takes into
1960 | account the heterogeneity of the workload, and avoids
1961 | oscillations. Using simulations with Web and cluster workload traces, we
1962 | show that our proposed controller lowers the number of delayed requests
1963 | by a factor of 70 for the Web traces and 3 for the cluster traces when
1964 | compared to a reactive controller. Our controller also decreases the
1965 | average number of queued requests by a factor of 3 for both traces, and
1966 | reduces oscillations by a factor of 7 for the Web traces and 3 for the
1967 | cluster traces. This comes at the expense of between 20\% and 30\%
1968 | over-provisioning, as compared to a few percent for the reactive
1969 | controller.},
1970 | }
1971 |
1972 | ################ 2011
1973 |
1974 | @INPROCEEDINGS{clusterdata:Sharma2011,
1975 | title = {Modeling and synthesizing task placement constraints in
1976 | {Google} compute clusters},
1977 | author = {Sharma, Bikash and Chudnovsky, Victor and Hellerstein,
1978 | Joseph L. and Rifaat, Rasekh and Das, Chita R.},
1979 | booktitle = {2nd ACM Symposium on Cloud Computing (SoCC)},
1980 | year = 2011,
1981 | month = Oct,
1982 | isbn = {978-1-4503-0976-9},
1983 | address = {Cascais, Portugal},
1984 | pages = {3:1--3:14},
1985 | url = {http://doi.acm.org/10.1145/2038916.2038919},
1986 | doi = {10.1145/2038916.2038919},
1987 | publisher = {ACM},
1988 | keywords = {benchmarking, benchmarks, metrics, modeling, performance
1989 | evaluation, workload characterization},
1990 | abstract = {Evaluating the performance of large compute clusters requires
1991 | benchmarks with representative workloads. At Google, performance
1992 | benchmarks are used to obtain performance metrics such as task
1993 | scheduling delays and machine resource utilizations to assess changes in
1994 | application codes, machine configurations, and scheduling
1995 | algorithms. Existing approaches to workload characterization for high
1996 | performance computing and grids focus on task resource requirements for
1997 | CPU, memory, disk, I/O, network, etc. Such resource requirements address
1998 | how much resource is consumed by a task. However, in addition to
1999 | resource requirements, Google workloads commonly include task placement
2000 | constraints that determine which machine resources are consumed by
2001 | tasks. Task placement constraints arise because of task dependencies
2002 | such as those related to hardware architecture and kernel version.
2003 |
2004 | This paper develops methodologies for incorporating task placement
2005 | constraints and machine properties into performance benchmarks of large
2006 | compute clusters. Our studies of Google compute clusters show that
2007 | constraints increase average task scheduling delays by a factor of 2 to
2008 | 6, which often results in tens of minutes of additional task wait
2009 | time. To understand why, we extend the concept of resource utilization
2010 | to include constraints by introducing a new metric, the Utilization
2011 | Multiplier (UM). UM is the ratio of the resource utilization seen by
2012 | tasks with a constraint to the average utilization of the resource. UM
2013 | provides a simple model of the performance impact of constraints in that
2014 | task scheduling delays increase with UM. Last, we describe how to
2015 | synthesize representative task constraints and machine properties, and
2016 | how to incorporate this synthesis into existing performance
2017 | benchmarks. Using synthetic task constraints and machine properties
2018 | generated by our methodology, we accurately reproduce performance
2019 | metrics for benchmarks of Google compute clusters with a discrepancy of
2020 | only 13\% in task scheduling delay and 5\% in resource utilization.},
2021 | }
2022 |
2023 | @INPROCEEDINGS{clusterdata:Wang2011,
2024 | title = {Towards synthesizing realistic workload traces for studying
2025 | the {Hadoop} ecosystem},
2026 | author = {Wang, Guanying and Butt, Ali R. and Monti, Henry and Gupta, Karan},
2027 | booktitle = {19th IEEE Annual International Symposium on Modelling,
2028 | Analysis, and Simulation of Computer and Telecommunication
2029 | Systems (MASCOTS)},
2030 | year = 2011,
2031 |   month = jul,
2032 | isbn = {978-0-7695-4430-4},
2033 | pages = {400--408},
2034 | url = {http://people.cs.vt.edu/~butta/docs/mascots11-hadooptrace.pdf},
2035 | doi = {10.1109/MASCOTS.2011.59},
2036 | publisher = {IEEE Computer Society},
2037 | address = {Raffles Hotel, Singapore},
2038 | keywords = {Cloud computing, Performance analysis, Design
2039 | optimization, Software performance modeling},
2040 | abstract = {Designing cloud computing setups is a challenging task. It
2041 | involves understanding the impact of a plethora of parameters ranging
2042 | from cluster configuration, partitioning, networking characteristics,
2043 | and the targeted applications' behavior. The design space, and the scale
2044 | of the clusters, make it cumbersome and error-prone to test different
2045 | cluster configurations using real setups. Thus, the community is
2046 | increasingly relying on simulations and models of cloud setups to infer
2047 | system behavior and the impact of design choices. The accuracy of the
2048 | results from such approaches depends on the accuracy and realistic
2049 | nature of the workload traces employed. Unfortunately, few cloud
2050 | workload traces are available (in the public domain). In this paper, we
2051 | present the key steps towards analyzing the traces that have been made
2052 | public, e.g., from Google, and inferring lessons that can be used to
2053 | design realistic cloud workloads as well as enable thorough quantitative
2054 | studies of Hadoop design. Moreover, we leverage the lessons learned from
2055 | the traces to undertake two case studies: (i) Evaluating Hadoop job
2056 | schedulers, and (ii) Quantifying the impact of shared storage on Hadoop
2057 |   system performance.},
2058 | }
2059 |
--------------------------------------------------------------------------------