├── LICENSE
├── README-pt.md
├── README.md
└── images
├── Auth.png
├── ChangeCode.png
├── Jupyter.png
├── Login.png
├── Pwd.png
├── Run.png
├── Screen.png
├── Select.png
├── Spark.png
├── Start.png
├── Submit.png
├── Table.png
├── Trial.png
├── Upload.png
├── clear.png
├── clear2.png
├── highlight.png
├── out1.png
├── out2.png
├── out3.png
├── out4.png
├── out5.png
├── senario.png
└── spark_zOS.png
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "{}"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright {yyyy} {name of copyright owner}
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README-pt.md:
--------------------------------------------------------------------------------
1 | # Análise de Dados usando Spark on z/OS e Jupyter Notebooks
2 |
3 | [Apache Spark](https://spark.apache.org/) é uma estrutura de computação em cluster de software livre. O Spark pode ser executado de modo independente, no Hadoop YARN, em orquestradores de contêiner como o Apache Mesos ou na Cloud, e pode acessar diversas origens de dados, incluindo HDFS, Cassandra, HBase e S3. Além disso, pode ser usado de forma interativa a partir dos shells Scala, Python e R.
4 |
5 | [z/OS](https://en.wikipedia.org/wiki/Z/OS) é um sistema operacional extremamente escalável, seguro e de alto desempenho baseado na z/Architecture de 64 bits. Ele é altamente confiável para a execução de aplicativos essenciais; o sistema operacional oferece suporte a aplicativos baseados na web e em Java.
6 |
7 | Nesta jornada, demonstramos a execução de um aplicativo de analytics usando o Spark on z/OS. O [Apache Spark on z/OS](https://www-03.ibm.com/systems/z/os/zos/apache-spark.html) oferece análise local (in-place), otimizada e em tempo real de dados corporativos estruturados e não estruturados, com a tecnologia da [z Systems Community Cloud](https://zcloud.marist.edu).
8 |
9 | A z/OS Platform for Apache Spark inclui uma versão com suporte dos recursos de software livre do Apache Spark, que consiste no núcleo do Apache Spark, Spark SQL, Spark Streaming, Machine Learning Library (MLlib) e GraphX. Também inclui acesso otimizado, por meio de APIs do Spark, a um amplo conjunto de origens de dados estruturados e não estruturados. Com esse recurso, origens de dados tradicionais do z/OS, como dados de IMS™, VSAM, IBM DB2® for z/OS, PDSE ou SMF, podem ser acessadas de uma maneira otimizada para o desempenho com o Spark.
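Como ilustração, esse tipo de acesso aparece por meio da API padrão de origem de dados JDBC do Spark. Abaixo está um esboço mínimo para uma sessão de spark-shell ou notebook em que `spark` já esteja definido; a URL e o nome da tabela são espaços reservados, que dependem da configuração do serviço de dados do seu ambiente, e não valores desta jornada:

```scala
// Esboço mínimo: carregar uma tabela do z/OS em um dataframe do Spark via JDBC.
// A URL e o nome da tabela abaixo são espaços reservados, não detalhes reais.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://<host-zos>:<porta>/<location>") // URL de exemplo
  .option("dbtable", "SPARKDB.CLIENT_INFO")                  // tabela hipotética
  .option("user", "<usuario-spark>")
  .option("password", "<senha-spark>")
  .load()

df.printSchema() // confirme o esquema mapeado antes de transformar os dados
```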
10 |
11 | Esse exemplo de analytics usa dados armazenados em tabelas do DB2 e do VSAM, bem como um aplicativo de aprendizado de máquina escrito em [Scala](https://www.scala-lang.org/documentation). O código também usa o software livre [Jupyter Notebook](http://jupyter.org) para escrever e enviar código do Scala para sua instância do Spark, além de visualizar a saída dentro de uma GUI da web. O Jupyter Notebook é muito usado no espaço de analytics de dados para limpeza e transformação de dados, simulação numérica, modelagem estatística, aprendizado de máquina e muito mais.
12 |
13 |
14 |
15 | ## Componentes inclusos
16 | Os cenários são realizados com:
17 |
18 | - [IBM z/OS Platform for Apache Spark](http://www-03.ibm.com/systems/z/os/zos/apache-spark.html)
19 | - [Jupyter Notebook](http://jupyter-notebook.readthedocs.io/en/latest/)
20 | - [Scala](https://www.scala-lang.org/documentation)
21 | - [IBM DB2 for z/OS](https://www.ibm.com/analytics/us/en/technology/db2/db2-for-zos.html)
22 | - [VSAM](https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_169.htm)
23 |
24 | ## Pré-requisitos
25 |
26 | Inscreva-se na [z Systems Community Cloud](https://zcloud.marist.edu/#/register) para obter uma conta para teste. Você receberá um e-mail contendo as credenciais para acessar o portal de autoatendimento. É aqui que você pode começar a explorar todos os serviços disponíveis.
27 |
28 | ## Etapas
29 |
30 | ### Parte A: Usar o Painel de Autoatendimento
31 | 1. [Iniciar seu Cluster do Spark](#1-start-your-spark-cluster)
32 | 2. [Fazer upload dos dados do DB2 e do VSAM](#2-upload-the-db2-and-vsam-data)
33 | 3. [Enviar um programa do Scala para analisar os dados](#3-submit-a-scala-program-to-analyze-the-data)
34 | 4. [Acionar a GUI do Spark para visualizar a tarefa enviada](#4-launch-spark-gui-to-view-the-submitted-job)
35 |
36 | ### Parte B: Trabalhar com o Jupyter Notebook
37 | 5. [Acionar o Jupyter Notebook e se conectar ao Spark](#5-launch-jupyter-notebook-and-connect-to-spark)
38 | 6. [Executar células do Jupyter Notebook para carregar dados e executar análise](#6-run-jupyter-notebook-cells-to-load-data-and-perform-analysis)
39 | - 6.1 [Carregar dados do VSAM e do DB2 no Spark e realizar uma transformação de dados](#6.1-load-vsam-and-db2-data-into-spark-and-perform-a-data-transformation)
40 | - 6.2 [Unir os dados do VSAM e do DB2 em um dataframe no Spark](#6.2-join-the-vsam-and-db2-data-into-dataframe-in-spark)
41 | - 6.3 [Criar um dataframe de regressão logística e transformá-lo em gráfico](#6.3-create-a-logistic-regression-dataframe-and-plot-it)
42 | - 6.4 [Obter dados estatísticos](#6.4-get-statistical-data)
43 |
44 | ## Parte A: Usar o Painel de Autoatendimento
45 |
46 | ### 1. Iniciar seu Cluster do Spark
47 |
48 | 1. Abra um navegador da web e insira a URL para acessar o portal de autoatendimento da [z Systems Community Cloud](https://zcloud.marist.edu).
49 |
50 |
51 | 2. Insira seu ID do Usuário do Portal e a Senha do Portal e clique em ‘Sign In’.
52 | 3. Você verá a página inicial do portal de autoatendimento da z Systems Community Cloud.
53 | * **Clique em ‘Try Analytics Service’**
54 |
55 | 4. Agora, você verá um painel que mostra o status da sua instância do Apache Spark on z/OS.
56 |
57 | Na parte superior da tela, observe o indicador ‘z/OS Status’, que deve mostrar o status da sua instância como ‘OK’.
58 |
59 | No meio da tela, serão exibidas as seções ‘Spark Instance’, ‘Status’, ‘Data management’ e ‘Operations’. A seção ‘Spark Instance’ contém seu nome do usuário individual do Spark e o endereço IP.
60 |
61 | Abaixo dos títulos dos campos, você verá botões para funções que podem ser aplicadas à sua instância.
62 |
63 | A tabela a seguir mostra a operação para cada função:
64 |
65 |
66 | 5. Caso esteja usando o Serviço de Analytics no z/OS pela primeira vez, você deve definir uma nova senha para o Spark.
67 | * **Clique em ‘Change Password’**
68 |
69 |
70 |
71 | 6. Confirme se sua instância está ‘Active’. Se estiver ‘Stopped’, clique em ‘Start’ para iniciá-la.
72 |
73 |
74 |
75 | ### 2. Fazer upload dos dados do DB2 e do VSAM
76 |
77 | 1. Acesse https://github.com/cloud4z/spark e faça download de todos os arquivos de amostra.
78 |
79 | 2. Carregue o arquivo de dados do DB2:
80 | * **Clique em ‘Upload Data’**
81 | * **Selecione e carregue o arquivo DDL do DB2**
82 | * **Selecione e carregue o arquivo de dados do DB2**
83 | * **Clique em ‘Upload’**
84 |
85 |
86 |
87 | A mensagem “Upload Success” será exibida no painel quando o carregamento de dados for concluído. Os dados do VSAM para este exercício já foram carregados para você. No entanto, essa etapa poderá ser repetida carregando o copybook do VSAM e o arquivo de dados do VSAM (que você transferiu por download) do seu sistema local.
88 |
89 | ### 3. Enviar um programa do Scala para analisar os dados
90 |
91 | Envie um programa preparado do Scala para analisar os dados.
92 | * **Clique em ‘Spark Submit’**
93 | * **Selecione seu arquivo JAR de demonstração do Spark**
94 | * **Especifique o nome da classe principal ‘com.ibm.scalademo.ClientJoinVSAM’**
95 | * **Insira os argumentos: ‘Spark Instance Username’ ‘Spark Instance Password’**
96 | * **Clique em ‘Submit’**
97 | Observação: esses argumentos são as credenciais que você usará para efetuar login na GUI e visualizar os resultados da tarefa.
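Para referência, o botão ‘Spark Submit’ do painel corresponde a uma chamada comum de `spark-submit`. Um esboço, com o caminho do JAR e a URL do master como espaços reservados:

```bash
# Envio equivalente pela linha de comando (caminho do JAR e master são exemplos).
spark-submit \
  --class com.ibm.scalademo.ClientJoinVSAM \
  --master spark://<ip-do-spark>:7077 \
  <caminho>/scalademo.jar "<usuario-spark>" "<senha-spark>"
```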
98 |
99 |
100 |
101 | A mensagem “JOB Submitted” será exibida no painel quando o programa for concluído. Esse programa do Scala acessará os dados do DB2 e do VSAM, realizará transformações nos dados, unirá as duas tabelas em um dataframe do Spark e armazenará o resultado no DB2.
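O código-fonte de `com.ibm.scalademo.ClientJoinVSAM` não está incluído nesta jornada, mas o comportamento descrito acima corresponde a um pipeline curto de dataframes. Um esboço mínimo, escrito com a API SparkSession do Spark 2.x, supondo detalhes de conexão de exemplo e uma chave de junção hipotética `CONT_ID`:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ClientJoinSketch {
  def main(args: Array[String]): Unit = {
    val Array(user, password) = args // credenciais da instância do Spark
    val spark = SparkSession.builder().appName("ClientJoinSketch").getOrCreate()

    // Lê uma tabela via JDBC; a URL e os nomes das tabelas são exemplos.
    def readTable(table: String): DataFrame =
      spark.read.format("jdbc")
        .option("url", "jdbc:db2://<host-zos>:<porta>/<location>")
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .load()

    val clientInfo_df  = readTable("SPARKDB.CLIENT_INFO")  // dados de clientes (VSAM)
    val clientTrans_df = readTable("SPARKDB.CLIENT_TRANS") // transações (DB2)

    // Une as duas entradas e grava o resultado de volta no DB2.
    val clientJoin = clientInfo_df.join(clientTrans_df, "CONT_ID")
    clientJoin.write.format("jdbc")
      .option("url", "jdbc:db2://<host-zos>:<porta>/<location>")
      .option("dbtable", "SPARKDB.CLIENT_JOIN")
      .option("user", user)
      .option("password", password)
      .save()

    spark.stop()
  }
}
```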
102 |
103 | ### 4. Acionar a GUI do Spark para visualizar a tarefa enviada
104 |
105 | Acione a GUI de saída do trabalhador individual do Spark para visualizar a tarefa que acabou de enviar.
106 | * **Clique em ‘Spark UI’**
107 | * **Clique no ‘Worker ID’ do seu programa na seção ‘Completed Drivers’.**
108 | * **Efetue login com o nome de usuário e a senha do Spark mencionados na etapa 6.**
109 | * **Clique em ‘stdout’ referente ao seu programa, na seção ‘Finished Drivers’, para visualizar seus resultados.**
110 |
112 |
113 | ## Parte B: Trabalhar com o Jupyter Notebook
114 |
115 | ### 5. Acionar o Jupyter Notebook e se conectar ao Spark
116 |
117 | A ferramenta Jupyter Notebook está instalada no painel. Com ela, é possível escrever e enviar código do Scala para sua instância do Spark, além de visualizar a saída dentro de uma GUI da web.
118 |
119 | 1. **Acione o serviço Jupyter Notebook no seu navegador, a partir do painel.** * **Clique em ‘Jupyter’.** Você verá a página inicial do Jupyter.
120 |
121 | 
122 |
123 | O programa preparado do Scala nesse nível acessará os dados do DB2 e do VSAM, realizará transformações nos dados, unirá as duas tabelas em um dataframe do Spark e armazenará o resultado no DB2. Também realizará uma análise de regressão logística e transformará a saída em gráfico.
124 |
125 | 2. **Clique duas vezes no arquivo Demo.ipynb.**
126 |
127 | 
128 |
129 | O Jupyter Notebook vai se conectar à sua instância do Spark on z/OS automaticamente e estará no estado pronto quando o indicador Apache Toree – Scala, localizado no canto superior direito da tela, estiver transparente.
130 | ### 6. Executar células do Jupyter Notebook para carregar dados e executar análise
131 | O ambiente do Jupyter Notebook divide-se em células de entrada identificadas com ‘In [#]:’.
132 | #### 6.1 Carregar dados do VSAM e do DB2 no Spark e realizar uma transformação de dados
133 |
134 | Executar célula nº 1 - O código do Scala na primeira célula carrega os dados do VSAM (informações do cliente) no Spark e realiza uma transformação de dados.
135 | * **Clique no primeiro ‘In [ ]:’** A borda esquerda mudará para azul quando uma célula estiver no modo de comando, como mostrado abaixo.
136 |
137 |
138 |
139 | Antes de executar o código, faça estas alterações:
140 | * **Altere o valor de zOS_IP para seu endereço IP do Spark.**
141 | * **Altere o valor de zOS_USERNAME para o nome de usuário do Spark e o valor de zOS_PASSWORD para a senha do Spark.** * **Clique no botão de execução da célula, indicado pela caixa vermelha, como mostrado abaixo.**
142 |
143 |
144 |
145 | A conexão do Jupyter Notebook com sua instância do Spark está no estado ocupado quando o indicador Apache Toree – Scala, localizado no canto superior direito da tela, está cinza.
146 |
147 |
148 |
149 | Quando esse indicador fica transparente, significa que a execução da célula foi concluída e retornou ao estado de pronto. A saída deve ser semelhante ao exemplo a seguir:
150 |
151 |
152 |
153 | Executar célula nº 2 - O código do Scala na segunda célula carrega os dados do DB2 (dados da transação) no Spark e realiza uma transformação de dados.
154 | * **Clique no próximo ‘In [ ]:’ para selecionar a próxima célula**
155 | * **Clique no botão de execução da célula**
156 | A saída deve ser semelhante ao exemplo a seguir:
157 |
158 |
159 |
160 | #### 6.2 Unir os dados do VSAM e do DB2 em um dataframe no Spark
161 | Executar célula nº 3 - O código do Scala na terceira célula une os dados do VSAM e do DB2 em um novo dataframe ‘client_join’ no Spark.
162 | * **Clique no próximo ‘In [ ]:’ para selecionar a próxima célula**
163 | * **Clique no botão de execução da célula**
164 | A saída deve ser semelhante ao exemplo a seguir:
165 |
166 |
167 | #### 6.3 Criar um dataframe de regressão logística e transformá-lo em gráfico
168 | Executar célula nº 4 - O código do Scala na quarta célula realiza uma regressão logística para avaliar a probabilidade de rotatividade de clientes como uma função do nível de atividade dos clientes. O dataframe ‘result_df’ também é criado; ele é usado para transformar os resultados em um gráfico de linha.
169 | * **Clique no próximo ‘In [ ]:’ para selecionar a próxima célula**
170 | * **Clique no botão de execução da célula**
171 | A saída deve ser semelhante ao exemplo a seguir:
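(A saída real da célula é uma tabela e um gráfico.) Para orientação, a lógica da célula nº 4 lembra este esboço mínimo e hipotético com a API `ml` do Spark, supondo que ‘client_join’ tenha uma coluna numérica de nível de atividade e um rótulo 0/1 de rotatividade; os nomes reais das colunas e o conjunto de recursos do notebook podem ser diferentes:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Monta a coluna (hipotética) de nível de atividade em um vetor de recursos.
val assembler = new VectorAssembler()
  .setInputCols(Array("ACTIVITY_LEVEL"))
  .setOutputCol("features")

val training = assembler.transform(client_join)
  .withColumnRenamed("CHURN", "label") // coluna hipotética 0/1 de rotatividade

// Ajusta a regressão e guarda a probabilidade de rotatividade por nível de atividade.
val model = new LogisticRegression().setMaxIter(10).fit(training)
val result_df = model.transform(training)
  .select("ACTIVITY_LEVEL", "probability")
```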
172 |
173 | Executar célula nº 5 - O código do Scala na quinta célula transforma em gráfico o dataframe ‘plot_df’. * **Clique no próximo ‘In [ ]:’ para selecionar a próxima célula** * **Clique no botão de execução da célula** A saída deve ser semelhante ao exemplo a seguir:
174 |
175 |
176 | #### 6.4 Obter dados estatísticos
177 | a. O número de linhas no conjunto de dados do VSAM de entrada
178 | * **println(clientInfo_df.count())** O resultado deve ser 6001.
179 | b. O número de linhas no conjunto de dados do DB2 de entrada
180 | * **println(clientTrans_df.count())** O resultado deve ser 20000.
181 | c. O número de linhas no conjunto de dados unido
182 | * **println(client_df.count())** O resultado deve ser 112.
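Essas três verificações também podem ser executadas juntas em uma única célula do notebook; os valores esperados são os listados acima:

```scala
// Contagens de linhas dos conjuntos de entrada e do conjunto unido.
println(clientInfo_df.count())  // conjunto de dados do VSAM de entrada: 6001
println(clientTrans_df.count()) // conjunto de dados do DB2 de entrada: 20000
println(client_df.count())      // conjunto de dados unido: 112
```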
183 |
184 | ## Referência
185 | IBM z/OS Platform for Apache Spark - http://www-03.ibm.com/systems/z/os/zos/apache-spark.html
186 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Analysis using Spark on z/OS and Jupyter Notebooks
2 |
3 | [Apache Spark](https://spark.apache.org/) is an open-source cluster-computing framework. Spark can run standalone, on Hadoop YARN, on container orchestrators such as Apache Mesos, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can also use it interactively from the Scala, Python, and R shells.
4 |
5 | [z/OS](https://en.wikipedia.org/wiki/Z/OS) is an extremely scalable and secure high-performance operating system based on the 64-bit z/Architecture. z/OS is highly reliable for running mission-critical applications, and the operating system supports Web- and Java-based applications.
6 |
7 | In this journey we demonstrate running an analytics application using Spark on z/OS. [Apache Spark on z/OS](https://www-03.ibm.com/systems/z/os/zos/apache-spark.html) provides in-place, optimized, real-time analysis of structured and unstructured enterprise data, powered by the [z Systems Community Cloud](https://zcloud.marist.edu).
8 |
9 | The z/OS Platform for Apache Spark includes a supported version of the Apache Spark open-source capabilities, consisting of the Apache Spark core, Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX. It also includes optimized data access to a broad set of structured and unstructured data sources through Spark APIs. With this capability, traditional z/OS data sources, such as IMS™, VSAM, IBM DB2® for z/OS, PDSE, or SMF data, can be accessed in a performance-optimized manner with Spark.
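As an illustration, this kind of access surfaces through Spark's standard JDBC data source API. Below is a minimal sketch for a spark-shell or notebook session where `spark` is already defined; the URL and table name are placeholders that depend on your site's data service configuration, not values from this journey:

```scala
// Minimal sketch: load a z/OS table into a Spark dataframe over JDBC.
// The URL and table name below are placeholders, not real connection details.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://<zos-host>:<port>/<location>") // placeholder URL
  .option("dbtable", "SPARKDB.CLIENT_INFO")                 // hypothetical table
  .option("user", "<spark-username>")
  .option("password", "<spark-password>")
  .load()

df.printSchema() // verify the mapped schema before transforming the data
```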
10 |
11 | This analytics example uses data stored in DB2 and VSAM tables, and a machine learning application written in [Scala](https://www.scala-lang.org/documentation). The code also uses the open-source [Jupyter Notebook](http://jupyter.org) to write and submit Scala code to your Spark instance and to view the output within a web GUI. The Jupyter Notebook is commonly used in the data analytics space for data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more.
12 |
13 |
14 |
15 |
16 | ## Included Components
17 | The scenarios are accomplished by using:
18 |
19 | - [IBM z/OS Platform for Apache Spark](http://www-03.ibm.com/systems/z/os/zos/apache-spark.html)
20 | - [Jupyter Notebook](http://jupyter-notebook.readthedocs.io/en/latest/)
21 | - [Scala](https://www.scala-lang.org/documentation)
22 | - [IBM DB2 for z/OS](https://www.ibm.com/analytics/us/en/technology/db2/db2-for-zos.html)
23 | - [VSAM](https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_169.htm)
24 |
25 | ## Prerequisites
26 |
27 | Register at [z Systems Community Cloud](https://zcloud.marist.edu/#/register) for a trial account. You will receive an email containing credentials to access the self-service portal. This is where you can start exploring all the available services.
28 |
29 | ## Steps
30 |
31 | ### Part A: Use Self-service Dashboard
32 |
33 | 1. [Start your Spark Cluster](#1-start-your-spark-cluster)
34 | 2. [Upload the DB2 and VSAM data](#2-upload-the-db2-and-vsam-data)
35 | 3. [Submit a Scala program to analyze the data](#3-submit-a-scala-program-to-analyze-the-data)
36 | 4. [Launch Spark GUI to view the submitted job](#4-launch-spark-gui-to-view-the-submitted-job)
37 |
38 | ### Part B: Work with Jupyter Notebook
39 |
40 | 5. [Launch Jupyter Notebook and connect to Spark](#5-launch-jupyter-notebook-and-connect-to-spark)
41 | 6. [Run Jupyter Notebook cells to load data and perform analysis](#6-run-jupyter-notebook-cells-to-load-data-and-perform-analysis)
42 | - 6.1 [Load VSAM and DB2 data into Spark and perform a data transformation](#6.1-load-vsam-and-db2-data-into-spark-and-perform-a-data-transformation)
43 | - 6.2 [Join the VSAM and DB2 data into a dataframe in Spark](#6.2-join-the-vsam-and-db2-data-into-a-dataframe-in-spark)
44 | - 6.3 [Create a logistic regression dataframe and plot it](#6.3-create-a-logistic-regression-dataframe-and-plot-it)
45 | - 6.4 [Get statistical data](#6.4-get-statistical-data)
46 |
47 | ## Part A: Use Self-service Dashboard
48 |
49 | ### 1. Start your Spark Cluster
50 |
51 | 1. Open a web browser and enter the URL to access the [z Systems Community Cloud](https://zcloud.marist.edu) self-service portal.
52 |
53 |
54 |
55 |
56 |
57 | 2. Enter your Portal User ID and Portal Password, and click ‘Sign In’.
58 |
59 | 3. You will see the home page for the z Systems Community Cloud self-service portal.
60 | * **Click on ‘Try Analytics Service’**
61 |
62 |
63 |
64 |
65 |
66 | 4. You will now see a dashboard, which shows the status of your Apache Spark on z/OS instance.
67 |
68 | At the top of the screen, notice the ‘z/OS Status’ indicator, which should show the status of your instance
69 | as ‘OK’.
70 |
71 | In the middle of the screen, the ‘Spark Instance’, ‘Status’, ‘Data management’, and ‘Operations’ sections
72 | will be displayed. The ‘Spark Instance’ section contains your individual Spark username and IP address.
73 |
74 | Below the field headings, you will see buttons for functions that can be applied to your instance.
75 | 
76 |
77 | The following table lists the operation for each function:
78 |
79 |
80 |
81 |
82 | 5. If this is your first time using the Analytics Service on z/OS, you must set a new Spark password.
83 | * **Click ‘Change Password’**
84 |
85 |
86 |
87 |
88 |
89 | 6. Confirm your instance is ‘Active’. If it is ‘Stopped’, click ‘Start’ to start it.
90 |
91 |
92 |
93 | ### 2. Upload the DB2 and VSAM data
94 |
95 | 1. Go to https://github.com/cloud4z/spark and download all the sample files.
96 |
97 | 2. Load the DB2 data file:
98 | * **Click ‘Upload Data’**
99 | * **Select and load the DB2 DDL file**
100 | * **Select and load the DB2 data file**
101 | * **Click ‘Upload’**
102 |
103 |
104 |
105 |
106 | “Upload Success” will appear in the dashboard when the data load is complete. The VSAM data for this exercise has already been loaded for you. However, you may repeat this step by uploading the VSAM copybook and VSAM data file you downloaded from your local system.
107 |
108 | ### 3. Submit a Scala program to analyze the data
109 |
110 | Submit a prepared Scala program to analyze the data.
111 | * **Click ‘Spark Submit’**
112 | * **Select your Spark Demo JAR file**
113 | * **Specify Main class name ‘com.ibm.scalademo.ClientJoinVSAM’**
114 | * **Enter the arguments: ‘Spark Instance Username’ ‘Spark Instance Password’**
115 | * **Click ‘Submit’**
116 | Note: These arguments are the credentials you will need to log in to the GUI and view the job results.
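For reference, the dashboard's ‘Spark Submit’ button corresponds to an ordinary `spark-submit` invocation. A sketch, with the JAR path and master URL as placeholders:

```bash
# Equivalent command-line submission (JAR path and master URL are placeholders).
spark-submit \
  --class com.ibm.scalademo.ClientJoinVSAM \
  --master spark://<spark-ip>:7077 \
  <path-to>/scalademo.jar "<spark-username>" "<spark-password>"
```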
117 |
118 |
119 |
120 |
121 | “JOB Submitted” will appear in the dashboard when the program is complete. This Scala program will access DB2 and VSAM data, perform transformations on the data, join these two tables in a Spark dataframe, and store the result back to DB2.
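The source of `com.ibm.scalademo.ClientJoinVSAM` is not included in this journey, but the behavior described above maps onto a short dataframe pipeline. A minimal sketch, written against the Spark 2.x SparkSession API, with placeholder connection details and a hypothetical `CONT_ID` join key:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ClientJoinSketch {
  def main(args: Array[String]): Unit = {
    val Array(user, password) = args // Spark instance credentials, as submitted
    val spark = SparkSession.builder().appName("ClientJoinSketch").getOrCreate()

    // Read one table over JDBC; the URL and table names are placeholders.
    def readTable(table: String): DataFrame =
      spark.read.format("jdbc")
        .option("url", "jdbc:db2://<zos-host>:<port>/<location>")
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .load()

    val clientInfo_df  = readTable("SPARKDB.CLIENT_INFO")  // customer data (VSAM)
    val clientTrans_df = readTable("SPARKDB.CLIENT_TRANS") // transaction data (DB2)

    // Join the two inputs and store the result back to DB2.
    val clientJoin = clientInfo_df.join(clientTrans_df, "CONT_ID")
    clientJoin.write.format("jdbc")
      .option("url", "jdbc:db2://<zos-host>:<port>/<location>")
      .option("dbtable", "SPARKDB.CLIENT_JOIN")
      .option("user", user)
      .option("password", password)
      .save()

    spark.stop()
  }
}
```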
122 |
123 | ### 4. Launch Spark GUI to view the submitted job
124 |
125 | Launch your individual Spark worker output GUI to view the job you just submitted.
126 | * **Click ‘Spark UI’**
127 | * **Click on the ‘Worker ID’ for your program in the ‘Completed Drivers’ section.**
128 | * **Log in with your Spark username and Spark password (the ones mentioned in step 6).**
129 | * **Click on ‘stdout’ for your program in the ‘Finished Drivers’ section to view your results.**
130 |
132 |
133 |
134 |
135 | ## Part B: Work with Jupyter Notebook
136 |
137 | ### 5. Launch Jupyter Notebook and connect to Spark
138 |
139 | The Jupyter Notebook tool is installed in the dashboard. It allows you to write and submit Scala code to your Spark instance, and view the output within a web GUI.
140 |
141 | 1. **Launch the Jupyter Notebook service in your browser from your dashboard.**
142 | * **Click on ‘Jupyter’.**
143 | You will see the Jupyter home page.
144 |
145 | 
146 |
147 | The prepared Scala program in this level will access DB2 and VSAM data, perform transformations on the data, join these two tables in a Spark dataframe, and store the result back to DB2. It will also perform a logistic regression analysis and plot the output.
148 |
149 | 2. **Double-click the Demo.ipynb file.**
150 |
151 | 
152 |
153 | The Jupyter Notebook will connect to your Spark on z/OS instance automatically and will be in the ready state when the Apache Toree – Scala indicator in the top right-hand corner of the screen is clear.
154 |
155 |
156 | ### 6. Run Jupyter Notebook cells to load data and perform analysis
157 | The Jupyter Notebook environment is divided into input cells labelled with ‘In [#]:’.
158 | #### 6.1 Load VSAM and DB2 data into Spark and perform a data transformation
159 | Run cell #1 - The Scala code in the first cell loads the VSAM data (customer information) into Spark and performs a data transformation.
160 | * **Click on the first ‘In [ ]:’**
161 | The left border will change to blue when a cell is in command mode, as shown below.
162 |
163 |
164 |
165 |
166 | Before running the code, make the following changes:
167 | * **Change the value of zOS_IP to your Spark IP address.**
168 | * **Change the value of zOS_USERNAME to your Spark username and the value of zOS_PASSWORD to your Spark password.**
169 |
170 | 
171 |
172 | * **Click the run cell button indicated by the red box as shown below**
173 |
174 |
175 |
176 |
177 | The Jupyter Notebook connection to your Spark instance is in the busy state when the Apache Toree – Scala indicator in the top right-hand corner of the screen is grey.
178 |
179 |
180 |
181 |
182 | When this indicator turns clear, the cell run has completed and returned to the ready state.
183 | The output should be similar to the following:
184 |
185 |
186 |
187 |
188 | Run cell #2 - The Scala code in the second cell loads the DB2 data (transaction data) into Spark and performs a data transformation.
189 | * **Click on the next ‘In [ ]:’ to select the next cell**
190 | * **Click the run cell button**
191 | The output should be similar to the following:
192 |
193 |
194 |
195 |
196 | #### 6.2 Join the VSAM and DB2 data into a dataframe in Spark
197 | Run cell #3 - The Scala code in the third cell joins the VSAM and DB2 data into a new ‘client_join’ dataframe in Spark.
198 | * **Click on the next ‘In [ ]:’ to select the next cell**
199 | * **Click the run cell button**
200 | The output should be similar to the following:
201 |
202 |
203 |
204 | #### 6.3 Create a logistic regression dataframe and plot it
205 | Run cell #4 - The Scala code in the fourth cell performs a logistic regression to evaluate the probability of customer churn as a function of customer activity level. The ‘result_df’ dataframe is also created, which is used to plot the results on a line graph.
206 | * **Click on the next ‘In [ ]:’ to select the next cell**
207 | * **Click the run cell button**
208 | The output should be similar to the following:
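(The cell's actual output is a table and plot.) For orientation, the logic of cell #4 resembles this minimal, hypothetical sketch using Spark's `ml` API, assuming ‘client_join’ carries a numeric activity-level column and a 0/1 churn label; the notebook's real column names and feature set may differ:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the (hypothetical) activity-level column into a feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("ACTIVITY_LEVEL"))
  .setOutputCol("features")

val training = assembler.transform(client_join)
  .withColumnRenamed("CHURN", "label") // hypothetical 0/1 churn column

// Fit the regression, then keep churn probability per activity level for plotting.
val model = new LogisticRegression().setMaxIter(10).fit(training)
val result_df = model.transform(training)
  .select("ACTIVITY_LEVEL", "probability")
```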
209 |
210 |
211 |
212 |
213 | Run cell #5 - The Scala code in the fifth cell plots the ‘plot_df’ dataframe.
214 | * **Click on the next ‘In [ ]:’ to select the next cell**
215 | * **Click the run cell button**
216 | The output should be similar to the following:
217 |
218 |
219 |
220 | #### 6.4 Get statistical data
221 | a. The number of rows in the input VSAM dataset
222 | * **println(clientInfo_df.count())**
223 | Result should be 6001.
224 | b. The number of rows in the input DB2 dataset
225 | * **println(clientTrans_df.count())**
226 | Result should be 20000.
227 | c. The number of rows in the joined dataset
228 | * **println(client_df.count())**
229 | Result should be 112.
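These three checks can also run together in one notebook cell; the expected counts are the ones listed above:

```scala
// Row counts for the input and joined datasets (expected values from this demo).
println(clientInfo_df.count())  // input VSAM dataset: 6001
println(clientTrans_df.count()) // input DB2 dataset: 20000
println(client_df.count())      // joined dataset: 112
```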
230 |
231 |
232 | ## Reference
233 | IBM z/OS Platform for Apache Spark - http://www-03.ibm.com/systems/z/os/zos/apache-spark.html
234 |
--------------------------------------------------------------------------------
/images/Auth.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Auth.png
--------------------------------------------------------------------------------
/images/ChangeCode.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/ChangeCode.png
--------------------------------------------------------------------------------
/images/Jupyter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Jupyter.png
--------------------------------------------------------------------------------
/images/Login.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Login.png
--------------------------------------------------------------------------------
/images/Pwd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Pwd.png
--------------------------------------------------------------------------------
/images/Run.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Run.png
--------------------------------------------------------------------------------
/images/Screen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Screen.png
--------------------------------------------------------------------------------
/images/Select.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Select.png
--------------------------------------------------------------------------------
/images/Spark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Spark.png
--------------------------------------------------------------------------------
/images/Start.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Start.png
--------------------------------------------------------------------------------
/images/Submit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Submit.png
--------------------------------------------------------------------------------
/images/Table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Table.png
--------------------------------------------------------------------------------
/images/Trial.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Trial.png
--------------------------------------------------------------------------------
/images/Upload.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Upload.png
--------------------------------------------------------------------------------
/images/clear.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/clear.png
--------------------------------------------------------------------------------
/images/clear2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/clear2.png
--------------------------------------------------------------------------------
/images/highlight.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/highlight.png
--------------------------------------------------------------------------------
/images/out1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out1.png
--------------------------------------------------------------------------------
/images/out2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out2.png
--------------------------------------------------------------------------------
/images/out3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out3.png
--------------------------------------------------------------------------------
/images/out4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out4.png
--------------------------------------------------------------------------------
/images/out5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out5.png
--------------------------------------------------------------------------------
/images/senario.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/senario.png
--------------------------------------------------------------------------------
/images/spark_zOS.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/spark_zOS.png
--------------------------------------------------------------------------------