├── LICENSE ├── README-pt.md ├── README.md └── images ├── Auth.png ├── ChangeCode.png ├── Jupyter.png ├── Login.png ├── Pwd.png ├── Run.png ├── Screen.png ├── Select.png ├── Spark.png ├── Start.png ├── Submit.png ├── Table.png ├── Trial.png ├── Upload.png ├── clear.png ├── clear2.png ├── highlight.png ├── out1.png ├── out2.png ├── out3.png ├── out4.png ├── out5.png ├── senario.png └── spark_zOS.png /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 
29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README-pt.md: -------------------------------------------------------------------------------- 1 | # Análise de Dados usando Spark on z/OS e Jupyter Notebooks 2 | 3 | [Apache Spark](https://spark.apache.org/) é uma estrutura de computação em cluster de software livre. O Spark é executado em Hadoop, Mesos, de modo independente ou na Cloud. Pode acessar diversas origens de dados, incluindo HDFS, Cassandra, HBase e S3. Além disso, pode ser usado de forma interativa a partir dos shells Scala, Python e R. É possível executar o Spark usando o modo de cluster independente, em uma IaaS, no Hadoop YARN ou em orquestradores de contêiner, como Apache Mesos. 4 | 5 | O [z/OS](https://en.wikipedia.org/wiki/Z/OS) é um sistema operacional extremamente escalável, seguro e de alto desempenho baseado na z/Architecture de 64 bits. Ele é altamente confiável para a execução de aplicativos essenciais; o sistema operacional oferece suporte a aplicativos baseados na web e em Java.
6 | 7 | Nesta jornada, demonstramos a execução de um aplicativo de analytics usando o Spark on z/OS. O [Apache Spark on z/OS](https://www-03.ibm.com/systems/z/os/zos/apache-spark.html) é uma análise local, de abstração otimizada e em tempo real de dados corporativos estruturados e não estruturados, desenvolvida com a [z Systems Community Cloud](https://zcloud.marist.edu). 8 | 9 | A z/OS Platform for Apache Spark inclui uma versão com suporte dos recursos de software livre do Apache Spark, que consiste no núcleo do Apache Spark, Spark SQL, Spark Streaming, Machine Learning Library (MLlib) e GraphX. Também inclui acesso otimizado a um amplo conjunto de origens de dados estruturados e não estruturados por meio de APIs do Spark. Com esse recurso, origens de dados tradicionais do z/OS, como dados de IMS™, VSAM, IBM DB2®, z/OS, PDSE ou SMF, podem ser acessadas de uma maneira otimizada para o desempenho com o Spark. 10 | 11 | Esse exemplo de analytics usa dados armazenados em tabelas de DB2 e VSAM, bem como um aplicativo de aprendizado de máquina escrito em [Scala](). O código também usa o software livre [Jupyter Notebook](http://jupyter.org) para escrever e enviar código do Scala para sua instância do Spark, além de visualizar a saída dentro de uma GUI da web. O Jupyter Notebook é muito usado no espaço de analytics de dados para limpeza e transformação de dados, simulação numérica, modelagem estatística, aprendizado de máquina e muito mais. 12 | 13 |
14 | 15 | ## Componentes inclusos 16 | Os cenários são realizados com: 17 | 18 | - [IBM z/OS Platform for Apache Spark](http://www-03.ibm.com/systems/z/os/zos/apache-spark.html) 19 | - [Jupyter Notebook](http://jupyter-notebook.readthedocs.io/en/latest/) 20 | - [Scala](https://www.scala-lang.org/documentation) 21 | - [IBM DB2 for z/OS](https://www.ibm.com/analytics/us/en/technology/db2/db2-for-zos.html) 22 | - [VSAM](https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_169.htm) 23 | 24 | ## Pré-requisitos 25 | 26 | Inscreva-se na [z Systems Community Cloud](https://zcloud.marist.edu/#/register) para obter uma conta para teste. Você receberá um e-mail contendo as credenciais para acessar o portal de autoatendimento. É aqui que você pode começar a explorar todos os serviços disponíveis. 27 | 28 | ## Etapas 29 | 30 | ### Parte A: Usar o Painel de Autoatendimento 31 | 1. [Iniciar seu Cluster do Spark](#1-start-your-spark-cluster) 32 | 2. [Fazer upload dos dados do DB2 e do VSAM](#2-upload-the-db2-and-vsam-data) 33 | 3. [Enviar um programa do Scala para analisar os dados](#3-submit-a-scala-program-to-analyze-the-data) 34 | 4. [Acionar a GUI do Spark para visualizar a tarefa enviada](#4-launch-spark-gui-to-view-the-submitted-job) 35 | 36 | ### Parte B: Trabalhar com o Jupyter Notebook 37 | 5. [Acionar o Jupyter Notebook e se conectar ao Spark](#5-launch-jupyter-notebook-and-connect-to-spark) 38 | 6. 
[Executar células do Jupyter Notebook para carregar dados e executar análise](#6-run-jupyter-notebook-cells-to-load-data-and-perform-analysis) 39 | - 6.1 [Carregar dados do VSAM e do DB2 no Spark e realizar uma transformação de dados](#61-load-vsam-and-db2-data-into-spark-and-perform-a-data-transformation) 40 | - 6.2 [Unir os dados do VSAM e do DB2 no dataframe no Spark](#62-join-the-vsam-and-db2-data-into-dataframe-in-spark) 41 | - 6.3 [Criar um dataframe de regressão logística e transformá-lo em gráfico](#63-create-a-logistic-regression-dataframe-and-plot-it) 42 | - 6.4 [Obter dados estatísticos](#64-get-statistical-data) 43 | 44 | ## Parte A: Usar o Painel de Autoatendimento 45 | 46 | ### 1. Iniciar seu Cluster do Spark 47 | 48 | 1. Abra um navegador da web e insira a URL para acessar o portal de autoatendimento da [z Systems Community Cloud](https://zcloud.marist.edu). 49 | 50 |

51 | 2. Insira seu ID do Usuário do Portal e a Senha do Portal e clique em ‘Sign In’. 52 | 3. Você verá a página inicial do portal de autoatendimento da z Systems Community Cloud. 53 | * **Clique em ‘Try Analytics Service’** 54 |

55 | 4. Agora, você verá um painel que mostra o status da sua instância do Apache Spark on z/OS. 56 | 57 | Na parte superior da tela, observe o indicador ‘z/OS Status’, que deve mostrar o status da sua instância como ‘OK’. 58 | 59 | No meio da tela, serão exibidas as seções ‘Spark Instance’, ‘Status’, ‘Data management’ e ‘Operations’. A seção ‘Spark Instance’ contém seu nome do usuário individual do Spark e o endereço IP. 60 | 61 | Abaixo dos títulos dos arquivos, você verá botões para funções que podem ser aplicadas à sua instância. ![GUI](images/Screen.png) 62 | 63 | A tabela a seguir mostra a operação para cada função: 64 |

65 | 66 | 5. Caso esteja usando o Serviço de Analytics no zOS pela primeira vez, você deve definir uma nova senha para o Spark. 67 | * **Clique em ‘Change Password’** 68 | 69 |

70 | 71 | 6. Confirme se sua instância está ‘Active’. Se estiver ‘Stopped’, clique em ‘Start’ para iniciá-la. 72 | 73 |
74 | 75 | ### 2. Fazer upload dos dados do DB2 e do VSAM 76 | 77 | 1. Acesse https://github.com/cloud4z/spark e faça download de todos os arquivos de amostra. 78 | 79 | 2. Carregue o arquivo de dados do DB2: 80 | * **Clique em ‘Upload Data’** 81 | * **Selecione e carregue o arquivo DDL do DB2** 82 | * **Selecione e carregue o arquivo de dados do DB2** 83 | * **Clique em ‘Upload’** 84 | 85 |
86 | 87 | A mensagem “Upload Success” será exibida no painel quando o carregamento de dados for concluído. Os dados do VSAM para este exercício já foram carregados para você. No entanto, essa etapa poderá ser repetida carregando o copybook do VSAM e o arquivo de dados do VSAM (que você transferiu por download) do seu sistema local. 88 | 89 | ### 3. Enviar um programa do Scala para analisar os dados 90 | 91 | Envie um programa preparado do Scala para analisar os dados. 92 | * **Clique em ‘Spark Submit’** 93 | * **Selecione seu arquivo JAR de demonstração do Spark** 94 | * **Especifique o nome da classe principal ‘com.ibm.scalademo.ClientJoinVSAM’** 95 | * **Insira os argumentos: ‘Spark Instance Username’ ‘Spark Instance Password’** 96 | * **Clique em ‘Submit’** 97 | Observação: Os argumentos sugerem que você precisa efetuar login na GUI para visualizar os resultados da tarefa. 98 | 99 |
100 | 101 | A mensagem “JOB Submitted” será exibida no painel quando o programa for concluído. Esse programa do Scala acessará os dados do DB2 e do VSAM, realizará transformações nos dados, unirá as duas tabelas em um dataframe do Spark e armazenará o resultado no DB2. 102 | 103 | ### 4. Acionar a GUI do Spark para visualizar a tarefa enviada 104 | 105 | Acione a GUI de saída do trabalhador individual do Spark para visualizar a tarefa que acabou de enviar. 106 | * **Clique em ‘Spark UI’** 107 | * **Clique no ‘Worker ID’ do seu programa na seção ‘Completed Drivers’.** 108 | * **Efetue login com o nome do usuário e a senha do Spark mencionados na etapa 6.** 109 | * **Clique em ‘stdout’ referente ao seu programa, na seção ‘Finished Drivers’, para visualizar seus resultados.** 110 | 111 |
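O que esse programa faz pode ser esboçado em Scala puro — um esboço hipotético, com coleções simples no lugar de dataframes do Spark e nomes de campos inventados apenas para ilustrar a junção:

```scala
// Esboço hipotético: simula, com coleções Scala comuns, a junção que o
// programa faz entre informações de clientes (VSAM) e transações (DB2).
// Os nomes de campos são ilustrativos, não o esquema real dos dados.
case class Cliente(id: Int, nome: String)
case class Transacao(clienteId: Int, valor: Double)

object JoinDemo {
  // Junção interna: cada cliente com a soma de suas transações.
  def juntar(clientes: Seq[Cliente], transacoes: Seq[Transacao]): Seq[(String, Double)] = {
    val totais = transacoes.groupBy(_.clienteId).map { case (id, ts) => id -> ts.map(_.valor).sum }
    clientes.flatMap(c => totais.get(c.id).map(total => (c.nome, total)))
  }

  def main(args: Array[String]): Unit = {
    val clientes = Seq(Cliente(1, "Ana"), Cliente(2, "Bruno"))
    val transacoes = Seq(Transacao(1, 10.0), Transacao(1, 5.0), Transacao(2, 7.5))
    juntar(clientes, transacoes).foreach(println)  // (Ana,15.0) e (Bruno,7.5)
  }
}
```

Em um cluster real, a mesma lógica usaria a API de dataframes do Spark; o esboço serve apenas para visualizar o resultado da junção.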
112 | 113 | ## Parte B: Trabalhar com o Jupyter Notebook 114 | 115 | ### 5. Acionar o Jupyter Notebook e se conectar ao Spark 116 | 117 | A ferramenta Jupyter Notebook está instalada no painel. Com ela, é possível escrever e enviar código do Scala para sua instância do Spark, além de visualizar a saída dentro de uma GUI da web. 118 | 119 | 1. **Acione o serviço Jupyter Notebook no seu navegador, a partir do painel.** * **Clique em ‘Jupyter’.** Você verá a página inicial do Jupyter. 120 | 121 | ![Jupyter](images/Jupyter.png) 122 | 123 | O programa preparado do Scala nesse nível acessará os dados do DB2 e do VSAM, realizará transformações nos dados, unirá as duas tabelas em um dataframe do Spark e armazenará o resultado no DB2. Também realizará uma análise de regressão logística e transformará a saída em gráfico. 124 | 125 | 2. **Clique duas vezes no arquivo Demo.ipynb.** 126 | 127 | ![Jupyter File Select](images/Select.png) 128 | 129 | O Jupyter Notebook vai se conectar à sua instância do Spark on z/OS automaticamente e estará no estado pronto quando o indicador Apache Toree – Scala, localizado no canto superior direito da tela, estiver transparente.
130 | ### 6. Executar células do Jupyter Notebook para carregar dados e executar análise 131 | O ambiente do Jupyter Notebook divide-se em células de entrada identificadas com ‘In [#]:’. 132 | #### 6.1 Carregar dados do VSAM e do DB2 no Spark e realizar uma transformação de dados 133 | 134 | Executar célula nº 1 - O código do Scala na primeira célula carrega os dados do VSAM (informações do cliente) no Spark e realiza uma transformação de dados. 135 | * **Clique no primeiro ‘In [ ]:’** A borda esquerda mudará para azul quando uma célula estiver no modo de comando, como mostrado abaixo. 136 | 137 |
138 | 139 |   Antes de executar o código, faça estas alterações: 140 | * **Altere o valor de zOS_IP para seu endereço IP do Spark.** 141 | * **Altere o valor de zOS_USERNAME para o nome do usuário do Spark e o valor de zOS_PASSWORD para a senha do Spark.** ![Change code](images/ChangeCode.png) * **Clique no botão de execução da célula, indicado pela caixa vermelha, como mostrado abaixo** 142 | 143 |
144 | 145 | A conexão do Jupyter Notebook com sua instância do Spark está no estado ocupado quando o indicador Apache Toree –Scala, localizado no canto superior direito da tela, está cinza. 146 | 147 |
148 | 149 | Quando esse indicador fica transparente, significa que a execução da célula foi concluída e retornou ao estado de pronto. A saída deve ser semelhante ao exemplo a seguir: 150 | 151 |
152 | 153 | Executar célula #2 - O código do Scala na segunda célula carrega os dados do DB2 (dados da transação) no Spark e realiza uma transformação de dados. 154 | * **Clique no próximo ‘In [ ]:’ para selecionar a próxima célula** 155 | * **Clique no botão de execução da célula** 156 | A saída deve ser semelhante ao exemplo a seguir: 157 | 158 |
159 | 160 | #### 6.2 Unir os dados do VSAM e do DB2 no dataframe no Spark 161 | Executar célula nº 3 - O código do Scala na terceira célula une os dados do VSAM e do DB2 em um novo dataframe ‘client_join’ no Spark. 162 | * **Clique no próximo ‘In [ ]:’ para selecionar a próxima célula** 163 | * **Clique no botão de execução da célula** 164 | A saída deve ser semelhante ao exemplo a seguir: 165 |
166 | 167 | #### 6.3 Criar um dataframe de regressão logística e transformá-lo em gráfico 168 | Executar a célula #4 - O código do Scala na quarta célula realiza uma regressão logística para avaliar a probabilidade de rotatividade de clientes como uma função do nível de atividade dos clientes. O dataframe ‘result_df’ também é criado; ele é usado para transformar os resultados em um gráfico de linha. 169 | * **Clique no próximo ‘In [ ]:’ para selecionar a próxima célula** 170 | * **Clique no botão de execução da célula** 171 | A saída deve ser semelhante ao exemplo a seguir: 172 |

173 | Executar célula nº 5 - O código do Scala na quinta célula transforma em gráfico o dataframe ‘plot_df’. * **Clique no próximo ‘In [ ]:’ para selecionar a próxima célula** * **Clique no botão de execução da célula** A saída deve ser semelhante ao exemplo a seguir: 174 |
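A forma funcional por trás dessa regressão logística pode ser esboçada em Scala puro. Os coeficientes abaixo são fictícios, apenas para ilustrar — o notebook ajusta os coeficientes reais a partir dos dados:

```scala
// Esboço hipotético: probabilidade de rotatividade (churn) como função do
// nível de atividade do cliente, via função logística (sigmoide).
// b0 e b1 são coeficientes inventados; os reais vêm do ajuste aos dados.
object LogisticaDemo {
  def sigmoide(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // p(churn | atividade) = sigmoide(b0 + b1 * atividade)
  def probChurn(atividade: Double, b0: Double = 2.0, b1: Double = -0.5): Double =
    sigmoide(b0 + b1 * atividade)

  def main(args: Array[String]): Unit =
    (0 to 10 by 2).foreach { a =>
      println(f"atividade=$a%2d  p(churn)=${probChurn(a)}%.3f")
    }
}
```

Com b1 negativo, quanto maior o nível de atividade, menor a probabilidade estimada de rotatividade.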
175 | 176 | #### 6.4 Obter dados estatísticos 177 | a. O número de linhas no conjunto de dados do VSAM de entrada 178 | * **println(clientInfo_df.count())** O resultado deve ser 6001. 179 | b. O número de linhas no conjunto de dados do DB2 de entrada 180 | * **println(clientTrans_df.count())** O resultado deve ser 20000. 181 | c. O número de linhas no conjunto de dados unido 182 | * **println(client_df.count())** O resultado deve ser 112. 183 | 184 | ## Referência 185 | IBM z/OS Platform for Apache Spark - http://www-03.ibm.com/systems/z/os/zos/apache-spark.html 186 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Analysis using Spark on z/OS and Jupyter Notebooks 2 | 3 | [Apache Spark](https://spark.apache.org/) is an open-source cluster-computing framework. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3, and you can use it interactively from the Scala, Python, and R shells. You can run Spark using its standalone cluster mode, on an IaaS, on Hadoop YARN, or on container orchestrators like Apache Mesos. 4 | 5 | [z/OS](https://en.wikipedia.org/wiki/Z/OS) is an extremely scalable and secure high-performance operating system based on the 64-bit z/Architecture. z/OS is highly reliable for running mission-critical applications, and the operating system supports Web- and Java-based applications. 6 | 7 | In this journey we demonstrate running an analytics application using Spark on z/OS. [Apache Spark on z/OS](https://www-03.ibm.com/systems/z/os/zos/apache-spark.html) offers in-place, optimized abstraction and real-time analysis of structured and unstructured enterprise data, and is powered by [z Systems Community Cloud](https://zcloud.marist.edu).
8 | 9 | z/OS Platform for Apache Spark includes a supported version of the Apache Spark open-source capabilities, consisting of the Apache Spark core, Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX. It also includes optimized data access to a broad set of structured and unstructured data sources through Spark APIs. With this capability, traditional z/OS data sources, such as IMS™, VSAM, IBM DB2®, z/OS, PDSE, or SMF data, can be accessed in a performance-optimized manner with Spark. 10 | 11 | This analytics example uses data stored in DB2 and VSAM tables, and a machine learning application written in [Scala](). The code also uses the open-source [Jupyter Notebook](http://jupyter.org) to write and submit Scala code to your Spark instance, and to view the output within a web GUI. The Jupyter Notebook is commonly used in the data analytics space for data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more. 12 | 13 | 14 |
15 | 16 | ## Included Components 17 | The scenarios are accomplished by using: 18 | 19 | - [IBM z/OS Platform for Apache Spark](http://www-03.ibm.com/systems/z/os/zos/apache-spark.html) 20 | - [Jupyter Notebook](http://jupyter-notebook.readthedocs.io/en/latest/) 21 | - [Scala](https://www.scala-lang.org/documentation) 22 | - [IBM DB2 for z/OS](https://www.ibm.com/analytics/us/en/technology/db2/db2-for-zos.html) 23 | - [VSAM](https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_169.htm) 24 | 25 | ## Prerequisites 26 | 27 | Register at [z Systems Community Cloud](https://zcloud.marist.edu/#/register) for a trial account. You will receive an email containing credentials to access the self-service portal. This is where you can start exploring all the available services. 28 | 29 | ## Steps 30 | 31 | ### Part A: Use Self-service Dashboard 32 | 33 | 1. [Start your Spark Cluster](#1-start-your-spark-cluster) 34 | 2. [Upload the DB2 and VSAM data](#2-upload-the-db2-and-vsam-data) 35 | 3. [Submit a Scala program to analyze the data](#3-submit-a-scala-program-to-analyze-the-data) 36 | 4. [Launch Spark GUI to view the submitted job](#4-launch-spark-gui-to-view-the-submitted-job) 37 | 38 | ### Part B: Work with Jupyter Notebook 39 | 40 | 5. [Launch Jupyter Notebook and connect to Spark](#5-launch-jupyter-notebook-and-connect-to-spark) 41 | 6. 
[Run Jupyter Notebook cells to load data and perform analysis](#6-run-jupyter-notebook-cells-to-load-data-and-perform-analysis) 42 | - 6.1 [Load VSAM and DB2 data into Spark and perform a data transformation](#61-load-vsam-and-db2-data-into-spark-and-perform-a-data-transformation) 43 | - 6.2 [Join the VSAM and DB2 data into dataframe in Spark](#62-join-the-vsam-and-db2-data-into-dataframe-in-spark) 44 | - 6.3 [Create a logistic regression dataframe and plot it](#63-create-a-logistic-regression-dataframe-and-plot-it) 45 | - 6.4 [Get statistical data](#64-get-statistical-data) 46 | 47 | ## Part A: Use Self-service Dashboard 48 | 49 | ### 1. Start your Spark Cluster 50 | 51 | 1. Open a web browser and enter the URL to access the [z Systems Community Cloud](https://zcloud.marist.edu) self-service portal. 52 | 53 | 54 |
55 |
56 | 57 | 2. Enter your Portal User ID and Portal Password, and click ‘Sign In’. 58 | 59 | 3. You will see the home page for the z Systems Community Cloud self-service portal. 60 | * **Click on ‘Try Analytics Service’** 61 | 62 | 63 |
64 |
65 | 66 | 4. You will now see a dashboard, which shows the status of your Apache Spark on z/OS instance. 67 | 68 | At the top of the screen, notice the ‘z/OS Status’ indicator, which should show the status of your instance 69 | as ‘OK’. 70 | 71 | In the middle of the screen, the ‘Spark Instance’, ‘Status’, ‘Data management’, and ‘Operations’ sections 72 | will be displayed. The ‘Spark Instance’ section contains your individual Spark username and IP address. 73 | 74 | Below the field headings, you will see buttons for functions that can be applied to your instance. 75 | ![GUI](images/Screen.png) 76 | 77 | The following table lists the operation for each function: 78 | 79 |
80 |
81 | 82 | 5. If this is your first time using the Analytics Service on z/OS, you must set a new Spark password. 83 | * **Click ‘Change Password’** 84 | 85 | 86 |
87 |
88 | 89 | 6. Confirm your instance is ‘Active’. If it is ‘Stopped’, click ‘Start’ to start it. 90 | 91 |
92 | 93 | ### 2. Upload the DB2 and VSAM data 94 | 95 | 1. Go to https://github.com/cloud4z/spark and download all the sample files. 96 | 97 | 2. Load the DB2 data file: 98 | * **Click ‘Upload Data’** 99 | * **Select and load the DB2 DDL file** 100 | * **Select and load the DB2 data file** 101 | * **Click ‘Upload’** 102 | 103 | 104 |
105 | 106 | “Upload Success” will appear in the dashboard when the data load is complete. The VSAM data for this exercise has already been loaded for you. However, this step may be repeated by loading the VSAM copybook and VSAM data file you downloaded from your local system. 107 | 108 | ### 3. Submit a Scala program to analyze the data 109 | 110 | Submit a prepared Scala program to analyze the data. 111 | * **Click ‘Spark Submit’** 112 | * **Select your Spark Demo JAR file** 113 | * **Specify Main class name ‘com.ibm.scalademo.ClientJoinVSAM’** 114 | * **Enter the arguments: ‘Spark Instance Username’ ‘Spark Instance Password’** 115 | * **Click ‘Submit’** 116 | Note: these arguments are the credentials you will use to log in to the Spark GUI and view the job results. 117 | 118 | 119 |
120 | 121 | “JOB Submitted” will appear in the dashboard when the program is complete. This Scala program will access DB2 and VSAM data, perform transformations on the data, join these two tables in a Spark dataframe, and store the result back to DB2. 122 | 123 | ### 4. Launch Spark GUI to view the submitted job 124 | 125 | Launch your individual Spark worker output GUI to view the job you just submitted. 126 | * **Click ‘Spark UI’** 127 | * **Click on the ‘Worker ID’ for your program in the ‘Completed Drivers’ section.** 128 | * **Log in with your Spark username and Spark password (the ones mentioned in step 6).** 129 | * **Click on ‘stdout’ for your program in the ‘Finished Drivers’ section to view your results.** 130 | 131 | 132 |
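Conceptually, the join this program performs can be sketched in plain Scala — a hypothetical sketch with ordinary collections standing in for Spark dataframes, and made-up field names (the real schema lives in the sample files):

```scala
// Hypothetical sketch: emulates, with plain Scala collections, the join the
// program performs between client records (VSAM) and transactions (DB2).
// Field names are illustrative, not the real dataset's schema.
case class Client(id: Int, name: String)
case class Transaction(clientId: Int, amount: Double)

object JoinSketch {
  // Inner join: each client paired with the sum of its transaction amounts.
  def join(clients: Seq[Client], txns: Seq[Transaction]): Seq[(String, Double)] = {
    val totals = txns.groupBy(_.clientId).map { case (id, ts) => id -> ts.map(_.amount).sum }
    clients.flatMap(c => totals.get(c.id).map(total => (c.name, total)))
  }

  def main(args: Array[String]): Unit = {
    val clients = Seq(Client(1, "Ann"), Client(2, "Bob"), Client(3, "Eve"))
    val txns = Seq(Transaction(1, 10.0), Transaction(1, 5.0), Transaction(2, 7.5))
    join(clients, txns).foreach(println)  // Eve has no transactions, so she is dropped
  }
}
```

On the cluster, the equivalent logic would use Spark's dataframe API; the sketch only illustrates what the joined result looks like.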
## Part B: Work with Jupyter Notebook

### 5. Launch Jupyter Notebook and connect to Spark

The Jupyter Notebook tool is installed in the dashboard. It allows you to write and submit Scala code to your Spark instance and view the output within a web GUI.

1. **Launch the Jupyter Notebook service in your browser from your dashboard.**
* **Click on ‘Jupyter’.**
You will see the Jupyter home page.

![Jupyter](images/Jupyter.png)

The prepared Scala program in this level will access DB2 and VSAM data, perform transformations on the data, join these two tables in a Spark dataframe, and store the result back to DB2. It will also perform a logistic regression analysis and plot the output.

2. **Double-click the Demo.jpynb file.**

![Jupyter File Select](images/Select.png)

The Jupyter Notebook will connect to your Spark on z/OS instance automatically and will be in the ready state when the Apache Toree - Scala indicator in the top right-hand corner of the screen is clear.
### 6. Run Jupyter Notebook cells to load data and perform analysis

The Jupyter Notebook environment is divided into input cells labelled ‘In [#]:’.

#### 6.1 Load VSAM and DB2 data into Spark and perform a data transformation

Run cell #1 - The Scala code in the first cell loads the VSAM data (customer information) into Spark and performs a data transformation.
* **Click on the first ‘In [ ]:’**

The left border will change to blue when a cell is in command mode, as shown below.
Before running the code, make the following changes:
* **Change the value of zOS_IP to your Spark IP address.**
* **Change the value of zOS_USERNAME to your Spark username and the value of zOS_PASSWORD to your Spark password.**

![Change code](images/ChangeCode.png)

* **Click the run cell button, indicated by the red box shown below**
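These values are ordinary Scala bindings near the top of the cell. A minimal sketch of the shape of that setup, with placeholder values: only the names `zOS_IP`, `zOS_USERNAME`, and `zOS_PASSWORD` come from the notebook; everything else here is invented for illustration.

```scala
// Variable names as used in the notebook cell; the values are placeholders.
val zOS_IP       = "xx.xx.xx.xx"   // replace with your Spark IP address
val zOS_USERNAME = "SPARKID"       // replace with your Spark username
val zOS_PASSWORD = "secret"        // replace with your Spark password

// The cell interpolates such values into its connection settings, e.g.:
val connectionInfo = s"$zOS_USERNAME@$zOS_IP"
```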
The Jupyter Notebook connection to your Spark instance is in the busy state when the Apache Toree - Scala indicator in the top right-hand corner of the screen is grey.
When this indicator turns clear, the cell run has completed and the connection has returned to the ready state.
The output should be similar to the following:
Run cell #2 - The Scala code in the second cell loads the DB2 data (transaction data) into Spark and performs a data transformation.
* **Click on the next ‘In [ ]:’ to select the next cell**
* **Click the run cell button**

The output should be similar to the following:
#### 6.2 Join the VSAM and DB2 data into a dataframe in Spark

Run cell #3 - The Scala code in the third cell joins the VSAM and DB2 data into a new ‘client_join’ dataframe in Spark.
* **Click on the next ‘In [ ]:’ to select the next cell**
* **Click the run cell button**

The output should be similar to the following:
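The effect of this cell can be pictured with plain Scala collections. The record fields and sample values below are invented for illustration; the notebook performs the equivalent inner join on Spark dataframes.

```scala
// Plain-collection sketch of cell #3's inner join; field names are invented.
case class ClientInfo(clientId: Int, activityLevel: Double)
case class Transaction(clientId: Int, amount: Double)

val clientInfo  = Seq(ClientInfo(1, 3.5), ClientInfo(2, 7.0))
val clientTrans = Seq(Transaction(1, 10.0), Transaction(1, 25.0), Transaction(3, 5.0))

// Keep only (client, transaction) pairs that share a clientId,
// which is what an inner dataframe join does:
val clientJoin = for {
  c <- clientInfo
  t <- clientTrans
  if c.clientId == t.clientId
} yield (c.clientId, c.activityLevel, t.amount)
```

Because the join is inner, clients with no transactions (and transactions with no matching client) drop out, which is why the joined row count in section 6.4 can be much smaller than either input.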
#### 6.3 Create a logistic regression dataframe and plot it

Run cell #4 - The Scala code in the fourth cell performs a logistic regression to evaluate the probability of customer churn as a function of customer activity level. It also creates the ‘result_df’ dataframe, which is used to plot the results on a line graph.
* **Click on the next ‘In [ ]:’ to select the next cell**
* **Click the run cell button**

The output should be similar to the following:
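The fitted model has a simple functional form: churn probability is the logistic function of a linear score in activity level. A plain-Scala sketch with made-up coefficients follows; the notebook estimates the real coefficients from the joined data with Spark.

```scala
// Logistic-regression functional form. The coefficients here are made up;
// the notebook fits the real ones from the joined customer data.
def churnProbability(activityLevel: Double,
                     intercept: Double = 2.0,
                     slope: Double = -0.8): Double =
  1.0 / (1.0 + math.exp(-(intercept + slope * activityLevel)))

// With the (assumed) negative slope, more active customers
// are predicted to churn less:
val lowActivity  = churnProbability(1.0)
val highActivity = churnProbability(9.0)
```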
Run cell #5 - The Scala code in the fifth cell plots the ‘plot_df’ dataframe.
* **Click on the next ‘In [ ]:’ to select the next cell**
* **Click the run cell button**

The output should be similar to the following:
219 | 220 | #### 6.4 Get statistical data 221 | a. The number of rows in the input VSAM dataset 222 | * **println(clientInfo_df.count())** 223 | Result should be 6001. 224 | b. The number of rows in the input DB2 dataset 225 | * **println(clientTrans_df.count())** 226 | Result should be 20000. 227 | c. The number of rows in the joined dataset 228 | * **println(client_df.count())** 229 | Result should be 112. 230 | 231 | 232 | ## Reference 233 | IBM z/OS Platform for Apache Spark - http://www-03.ibm.com/systems/z/os/zos/apache-spark.html 234 | -------------------------------------------------------------------------------- /images/Auth.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Auth.png -------------------------------------------------------------------------------- /images/ChangeCode.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/ChangeCode.png -------------------------------------------------------------------------------- /images/Jupyter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Jupyter.png -------------------------------------------------------------------------------- /images/Login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Login.png -------------------------------------------------------------------------------- /images/Pwd.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Pwd.png -------------------------------------------------------------------------------- /images/Run.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Run.png -------------------------------------------------------------------------------- /images/Screen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Screen.png -------------------------------------------------------------------------------- /images/Select.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Select.png -------------------------------------------------------------------------------- /images/Spark.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Spark.png -------------------------------------------------------------------------------- /images/Start.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Start.png -------------------------------------------------------------------------------- /images/Submit.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Submit.png -------------------------------------------------------------------------------- /images/Table.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Table.png -------------------------------------------------------------------------------- /images/Trial.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Trial.png -------------------------------------------------------------------------------- /images/Upload.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/Upload.png -------------------------------------------------------------------------------- /images/clear.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/clear.png -------------------------------------------------------------------------------- /images/clear2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/clear2.png -------------------------------------------------------------------------------- /images/highlight.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/highlight.png -------------------------------------------------------------------------------- /images/out1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out1.png 
-------------------------------------------------------------------------------- /images/out2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out2.png -------------------------------------------------------------------------------- /images/out3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out3.png -------------------------------------------------------------------------------- /images/out4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out4.png -------------------------------------------------------------------------------- /images/out5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/out5.png -------------------------------------------------------------------------------- /images/senario.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/senario.png -------------------------------------------------------------------------------- /images/spark_zOS.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IBM/Spark-on-zOS/d4ac1581f406330b55d87d6e7baef7a73cc91ada/images/spark_zOS.png --------------------------------------------------------------------------------