├── .gitignore ├── CONTRIBUTING.md ├── LICENSE.txt ├── readme.md ├── sample_data ├── access_log_20151221-101535.log.gz ├── access_log_20151221-101548.log.gz ├── access_log_20151221-101555.log.gz ├── access_log_20151221-101601.log.gz ├── access_log_20151221-101606.log.gz └── ccsample ├── sdk-tutorials ├── find-methods-fields │ ├── README.md │ └── images │ │ └── pipeline_introspection.jpeg └── sch │ ├── tutorial-getting-started │ ├── README.md │ └── images │ │ ├── Hello_World_job.jpeg │ │ ├── Hello_World_job_running.jpeg │ │ ├── Hello_World_pipeline.jpeg │ │ └── Hello_World_pipeline_canvaas.jpeg │ ├── tutorial-jobs │ ├── README.md │ ├── data-collector-logs │ │ └── README.md │ ├── generate-a-report │ │ └── README.md │ ├── images │ │ ├── hello_world_job_details.jpeg │ │ ├── hello_world_job_history.jpeg │ │ ├── hello_world_job_monitoring.jpeg │ │ ├── hello_world_job_report_definition.jpeg │ │ ├── hello_world_job_show_reports.jpeg │ │ ├── list_of_jobs_by_new_data_collector_label.jpeg │ │ └── list_of_jobs_by_old_data_collector_label.jpeg │ ├── preparation-for-tutorial │ │ └── README.md │ ├── start-monitor-a-specific-job │ │ └── README.md │ ├── update-data-collector-labels │ │ └── README.md │ └── ways-to-fetch-jobs │ │ └── README.md │ └── tutorial-pipelines │ ├── README.md │ ├── common-pipeline-methods │ └── README.md │ ├── create-ci-cd-demo-pipeline │ └── README.md │ ├── edit-pipelines-and-stages │ └── README.md │ ├── images │ ├── 3_duplicated_pipelines.jpeg │ ├── duplicated_pipeline.jpeg │ ├── duplicated_pipeline_label.jpeg │ ├── duplicated_pipeline_label_updated.jpeg │ ├── duplicated_pipeline_stage_updated.jpeg │ ├── imported_pipeline.jpeg │ ├── pipeline_parameters.jpeg │ ├── sch_ci_cd_pipeline.jpeg │ ├── sch_ci_cd_pipeline_destination_stage_config.jpeg │ ├── sch_ci_cd_pipeline_field_remover_config.jpeg │ ├── sch_ci_cd_pipeline_field_splitter_config.jpeg │ ├── sch_ci_cd_pipeline_jython_evaluator_config.jpeg │ ├── sch_ci_cd_pipeline_origin_stage_config.jpeg │ ├── sch_ci_cd_pipeline_overview.jpeg │ ├── sch_ci_cd_pipeline_runtime_parameters.jpeg │ ├── sch_hello_world_created.jpeg │ ├── stage_configs.jpeg │ └── stage_label_in_UI.jpeg │ └── preparation-for-tutorial │ └── README.md ├── tutorial-1 ├── elasticsearch │ └── logs.json ├── img │ ├── add_jython.png │ ├── directory_config.png │ ├── directory_config_log.png │ ├── directory_config_postproc.png │ ├── discard_errors.png │ ├── elastic_config.png │ ├── expression_eval.png │ ├── field_converter.png │ ├── field_converter_timestamp.png │ ├── field_remover.png │ ├── geo_ip.png │ ├── geoip_errors.png │ ├── idle_alert.png │ ├── import_pipeline.png │ ├── metric_alerts.png │ ├── part1_kibana_dashboard.png │ ├── part2_kibana_dashboard.png │ ├── running_pipeline.png │ └── vimeo-thumbnail.png ├── kibana │ └── ApacheWebLog.json ├── log_shipping_to_elasticsearch_part1.md ├── log_shipping_to_elasticsearch_part2.md ├── pipelines │ └── Directory_to_ElasticSearch_Tutorial_Part_1.json └── readme.md ├── tutorial-2 ├── directory_to_kafkaproducer.md ├── img │ ├── data_conversions.png │ ├── directory_config_dataformat.png │ ├── directory_config_postproc.png │ ├── directory_setup.png │ ├── elastic_config.png │ ├── field_converter_config.png │ ├── field_masker_config.png │ ├── kafka_consumer_config.png │ ├── kafka_consumer_dataformat.png │ ├── kafka_consumer_pipeline.png │ ├── kafka_producer_config.png │ ├── kafka_producer_dataformat.png │ ├── our_setup.png │ ├── s3_config1.png │ ├── s3_config2.png │ └── vimeo-thumbnail.png ├── kafkaconsumer_to_multipledestinations.md 
├── pipelines │ ├── Directory_to_KafkaProducer.json │ └── KafkaConsumer_to_MultipleDestinations.json └── readme.md ├── tutorial-3 ├── image_0.png ├── image_1.png ├── image_10.png ├── image_11.png ├── image_2.png ├── image_3.png ├── image_4.png ├── image_5.png ├── image_6.png ├── image_7.png ├── image_8.png ├── image_9.png └── readme.md ├── tutorial-adls-destination ├── image_0.png ├── image_1.png ├── image_10.png ├── image_11.png ├── image_12.png ├── image_13.png ├── image_14.png ├── image_15.png ├── image_16.png ├── image_17.png ├── image_18.png ├── image_19.png ├── image_2.png ├── image_20.png ├── image_21.png ├── image_22.png ├── image_23.png ├── image_24.png ├── image_3.png ├── image_4.png ├── image_5.png ├── image_6.png ├── image_7.png ├── image_8.png ├── image_9.png └── readme.md ├── tutorial-crud-microservice ├── add-expression-evaluator.png ├── add-jdbc-lookup.png ├── add-jdbc-tee.png ├── add-record-found.png ├── add-request-id-test.png ├── added-jdbc-lookup.png ├── expression-evaluator-fields.png ├── http-request-router-conditions.png ├── new_pipeline.png ├── paste-jdbc-tee.png ├── readme.md ├── ready-for-delete.png ├── ready-for-update.png ├── snapshot.png └── template_microservice.png ├── tutorial-custom-dataprotector-procedure ├── README.md └── images │ ├── finder1.png │ ├── finder2.png │ ├── intellij1.png │ ├── intellij2.png │ ├── intellij3.png │ ├── intellij4.png │ ├── sch1.png │ ├── sch2.png │ ├── sch3.png │ ├── sch4.png │ ├── sch5.png │ ├── sch6.png │ ├── sch7.png │ ├── sch8.png │ ├── sch9.png │ └── sdc1.png ├── tutorial-destination ├── image_0.png ├── image_1.png ├── image_10.png ├── image_11.png ├── image_2.png ├── image_3.png ├── image_4.png ├── image_5.png ├── image_6.png ├── image_7.png ├── image_8.png ├── image_9.png └── readme.md ├── tutorial-hivedrift ├── RDBMS-SDC-Hive.png ├── drifttutorial1.png ├── drifttutorial2.png ├── drifttutorial3.png ├── drifttutorial4.png ├── image_0.png ├── image_1.png ├── image_2.png ├── image_6.png ├── image_7.png ├── image_8.png ├── image_9.png └── readme.md ├── tutorial-kubernetes-deployment ├── 1-java-opts │ ├── README.md │ └── sdc.yaml ├── 2-custom-docker-image │ ├── README.md │ ├── sdc-docker-custom-config │ │ ├── Dockerfile │ │ ├── build.sh │ │ └── sdc-conf │ │ │ └── sdc.properties │ └── sdc.yaml ├── 3-volumes │ ├── README.md │ ├── get-stage-libs.sh │ ├── images │ │ └── azure-file-share.png │ └── sdc.yaml ├── 4-persistent-volumes │ ├── README.md │ ├── images │ │ ├── port-forward.png │ │ └── sdcs.png │ ├── sdc-stage-libs-configmap.yaml │ ├── sdc-stage-libs-job.yaml │ ├── sdc-stage-libs-pvc.yaml │ ├── sdc-stage-libs-sc.yaml │ └── sdc.yaml ├── 5-sdc-properties-configmap-1 │ ├── README.md │ ├── sdc.properties │ └── sdc.yaml ├── 6-sdc-properties-configmap-2 │ ├── README.md │ ├── images │ │ └── sdc-config.png │ ├── sdc-dynamic-properties.yaml │ ├── sdc.properties │ └── sdc.yaml ├── 7-credential-stores │ ├── README.md │ ├── credential-stores.properties │ └── sdc.yaml ├── 8-ingress │ ├── README.md │ ├── host-based-routing │ │ ├── sdc1.yaml │ │ ├── sdc2.yaml │ │ └── sdc3.yaml │ ├── images │ │ └── path-based-routing.png │ ├── path-based-routing │ │ ├── sdc1.yaml │ │ ├── sdc2.yaml │ │ └── sdc3.yaml │ └── sdc.yaml ├── NoteOnEnvVars.md └── README.md ├── tutorial-multithreaded-origin └── README.md ├── tutorial-origin ├── image_0.png ├── image_1.png ├── image_10.png ├── image_11.png ├── image_2.png ├── image_3.png ├── image_4.png ├── image_5.png ├── image_6.png ├── image_7.png ├── image_8.png ├── image_9.png └── readme.md ├── tutorial-processor 
├── .DS_Store ├── image_0.png ├── image_1.png ├── image_10.png ├── image_11.png ├── image_2.png ├── image_3.png ├── image_4.png ├── image_5.png ├── image_6.png ├── image_7.png ├── image_8.png ├── image_9.png ├── readme.md └── sampleprocessor │ ├── pom.xml │ └── src │ ├── main │ ├── assemblies │ │ └── stage-lib.xml │ ├── java │ │ └── com │ │ │ └── example │ │ │ └── stage │ │ │ ├── lib │ │ │ └── sample │ │ │ │ └── Errors.java │ │ │ └── processor │ │ │ └── sample │ │ │ ├── Groups.java │ │ │ ├── SampleDProcessor.java │ │ │ └── SampleProcessor.java │ └── resources │ │ ├── data-collector-library-bundle.properties │ │ └── default.png │ └── test │ └── java │ └── com │ └── example │ └── stage │ └── processor │ └── sample │ └── TestSampleProcessor.java ├── tutorial-spark-transformer-scala ├── image_0.png ├── image_1.png ├── image_10.png ├── image_11.png ├── image_12.png ├── image_2.png ├── image_3.png ├── image_4.png ├── image_5.png ├── image_6.png ├── image_7.png ├── image_8.png ├── image_9.png ├── readme.md └── upload.png ├── tutorial-spark-transformer ├── image_0.png ├── image_1.png ├── image_2.png ├── image_3.png ├── image_4.png ├── image_5.png ├── image_6.png ├── image_7.png ├── readme.md └── upload.png └── working-with-azure ├── blobstorage_to_hdinsightkafka.md ├── hdinsightkafka_to_sqldw_and_blobstorage.md ├── img ├── Ambari_Kafka.png ├── BlobToKafka │ ├── HadoopFS_DataFormat.png │ ├── HadoopFS_Files.png │ ├── HadoopFS_HadoopFS.png │ ├── Kafka_DataFormat.png │ ├── Kafka_Kafka.png │ ├── PlayIcon.png │ ├── Preview.png │ ├── PreviewIcon.png │ ├── SelectSource_Hadoop.png │ ├── StartandMonitor.png │ ├── ValidateIcon.png │ ├── image_4.png │ ├── image_44.png │ └── image_5.png ├── KafkaToSqlDWandHive │ ├── ExpressionEvaluator.png │ ├── FieldConvertor.png │ ├── HadoopFS_DataFormat.png │ ├── HadoopFS_OutputFiles.png │ ├── HiveMetadata_General.png │ ├── HiveMetadata_Hive.png │ ├── HiveMetadata_Table.png │ ├── HiveMetastore_Hive.png │ ├── JDBCProducer.png │ ├── KafkaConnection.png │ ├── KafkaDataFormatJson.png │ └── Preview1.png └── sdc_ssh_login.png ├── pipelines ├── BlobStorage to HDInsightKafka.json └── Kafka_to_Blob_and_SQL_DW.json └── readme.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | .idea 3 | *.iml 4 | target/ -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | 14 | 15 | # Contributing 16 | 17 | Thank you for your interest in contributing to the StreamSets Tutorials Library. 18 | 19 | You can help in several different ways : 20 | - [Open an issue](http://issues.streamsets.com) and submit your suggestions for improvements. 21 | - Fork this repo and submit a pull request. 22 | 23 | To begin you first need to sign our [Contributor License Agreement](http://streamsets.com/contributing/). 24 | 25 | - To submit a pull request, fork [this repository](http://github.com/streamsets/tutorials) and clone your fork: 26 | 27 | `git clone git@github.com:<>/tutorials.git` 28 | 29 | - Make your suggested changes, do a `git push` and submit a pull request. 
30 | -------------------------------------------------------------------------------- /sample_data/access_log_20151221-101535.log.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sample_data/access_log_20151221-101535.log.gz -------------------------------------------------------------------------------- /sample_data/access_log_20151221-101548.log.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sample_data/access_log_20151221-101548.log.gz -------------------------------------------------------------------------------- /sample_data/access_log_20151221-101555.log.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sample_data/access_log_20151221-101555.log.gz -------------------------------------------------------------------------------- /sample_data/access_log_20151221-101601.log.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sample_data/access_log_20151221-101601.log.gz -------------------------------------------------------------------------------- /sample_data/access_log_20151221-101606.log.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sample_data/access_log_20151221-101606.log.gz -------------------------------------------------------------------------------- /sample_data/ccsample: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sample_data/ccsample -------------------------------------------------------------------------------- /sdk-tutorials/find-methods-fields/images/pipeline_introspection.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/find-methods-fields/images/pipeline_introspection.jpeg -------------------------------------------------------------------------------- /sdk-tutorials/sch/tutorial-getting-started/README.md: -------------------------------------------------------------------------------- 1 | StreamSets Control Hub: Getting Started with SDK for Python 2 | =========================================================== 3 | 4 | This tutorial covers the most basic yet very powerful workflow for [StreamSets Control Hub](https://streamsets.com/products/dataops-platform/control-hub/). The workflow shows how to design and 5 | publish a pipeline followed by how to create, start, and stop a job using [SDK for Python](https://docs.streamsets.com/sdk/latest/index.html). 
### Prerequisites
* [Python 3.4+](https://docs.python.org/3/using/index.html) and pip3 installed
* StreamSets for SDK [Installed and activated](https://docs.streamsets.com/sdk/latest/installation.html)
* [Access to StreamSets Control Hub](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/OrganizationSecurity/OrgSecurity_Overview.html#concept_q5z_jkl_wy) with a user account in your organization
* At least one [StreamSets Data Collector](https://streamsets.com/products/dataops-platform/data-collector/) instance registered with the above StreamSets Control Hub instance

**Note**: Make sure that the user account has the access required for the tasks this tutorial covers. The easiest way to verify this is to do those tasks using the Web UI of StreamSets Control Hub first and fix any access problems before embarking on the path below.

### Workflow
On a terminal, type the following command to open a Python 3 interpreter.

```bash
$ python3
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

### Step 1 — Connect to StreamSets Control Hub instance

Let’s assume the StreamSets Control Hub is running at http://sch.streamsets.com
Create an object called control_hub which is connected to the above.

```python
from streamsets.sdk import ControlHub

# Replace the argument values according to your setup
control_hub = ControlHub(server_url='http://sch.streamsets.com',
                         username='user@organization1',
                         password='password')
```

To connect to an HTTPS-enabled Control Hub using a cert file, set the attribute streamsets.sdk.ControlHub.VERIFY_SSL_CERTIFICATES:

```python
from streamsets.sdk import ControlHub

# Replace with the path to your cert file according to your setup
ControlHub.VERIFY_SSL_CERTIFICATES = ''

# To skip verifying the SSL certificate, use the following
# ControlHub.VERIFY_SSL_CERTIFICATES = False

# Replace the argument values according to your setup
control_hub = ControlHub(server_url='https://sch.streamsets.com',
                         username='user@organization1',
                         password='password')
```

### Step 2 — Build and publish a pipeline
A pipeline describes the flow of data from an origin system to destination systems and defines how to transform the data along the way. The following is a very simple pipeline; one can create much more complex pipelines using the constructs available in the SDK for Python.

After building, publish the pipeline to indicate that its design is complete and the pipeline is ready to be added to a job and run.

```python
# Create a pipeline builder
builder = control_hub.get_pipeline_builder()

# Create 2 stages and wire them together
dev_raw_data_source = builder.add_stage('Dev Raw Data Source')
trash = builder.add_stage('Trash')
dev_raw_data_source >> trash

# Now build the pipeline with a name and publish it
pipeline = builder.build('SCH Hello World pipeline')
control_hub.publish_pipeline(pipeline, commit_message='First Commit')
```
If all goes well, the StreamSets Control Hub UI in the browser shows the following:

![image alt text](images/Hello_World_pipeline_canvaas.jpeg)
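As an optional sanity check, you can fetch the pipeline back from the repository. This is a sketch that assumes the SDK's `pipelines` collection supports lookup by the `name` attribute, mirroring the `jobs` getters used in the jobs tutorials:

```python
# Optional: confirm the pipeline landed in the repository.
# Assumption: the pipelines collection can be filtered by the `name` attribute.
pipeline = control_hub.pipelines.get(name='SCH Hello World pipeline')
print(pipeline)
```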
### Step 3 — Create and start a job
When adding a job, specify the published pipeline to run. You can also select Data Collector labels for the job. The labels indicate which group of Data Collectors should run the pipeline. Here we do not specify any label, and hence the job uses the default label `all`.

```python
# Create a job
job_builder = control_hub.get_job_builder()
job = job_builder.build('Hello_World_job', pipeline=pipeline)
control_hub.add_job(job)
```
![image alt text](images/Hello_World_job.jpeg)

Now let’s start the job.

```python
# Start the job
control_hub.start_job(job)
```

Give it some time, and then the StreamSets Control Hub UI in the browser shows the following:

![image alt text](images/Hello_World_job_running.jpeg)

Voila! We have successfully created a job, and it is running, with just a few lines of code. Pretty simple, right?

### Step 4 — Monitor the job using UI
As the screenshot above shows, one can monitor a job in the UI.
To achieve the same in the SDK, check out [the set of tutorials for the jobs](../tutorial-jobs/README.md); a small taste is sketched below.
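For instance, here is a first peek at the job's status and throughput from the interpreter, reusing the calls that the jobs tutorials cover in depth:

```python
# Minimal monitoring sketch, using calls from the jobs tutorials.
job.refresh()      # refresh cached state so the status is current
print(job.status)

job_metrics = job.metrics(metric_type='RECORD_THROUGHPUT', include_error_count=True)
print(job_metrics.output_count)
```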
### Step 5 — Stop and delete the job
This is the last step in our workflow.

```python
control_hub.stop_job(job)

control_hub.delete_job(job)
```

### Conclusion
The workflow showed the basics of pipelines and jobs using the [StreamSets SDK for Python](https://docs.streamsets.com/sdk/latest/index.html).
Now you are ready to start the journey to create more sophisticated workflows.

If you encounter any problems with this tutorial, please [file an issue in the tutorials project](https://github.com/streamsets/tutorials/issues/new).
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-getting-started/images/Hello_World_job.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-getting-started/images/Hello_World_job.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-getting-started/images/Hello_World_job_running.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-getting-started/images/Hello_World_job_running.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-getting-started/images/Hello_World_pipeline.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-getting-started/images/Hello_World_pipeline.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-getting-started/images/Hello_World_pipeline_canvaas.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-getting-started/images/Hello_World_pipeline_canvaas.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/README.md:
--------------------------------------------------------------------------------
Interaction with StreamSets Control Hub jobs
============================================

This set contains tutorials for [StreamSets Control Hub jobs](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/Jobs/Jobs_title.html).

A job defines the pipeline to run and the execution engine that runs the pipeline: Data Collector, Data Collector Edge, or Transformer. Jobs are the execution of the dataflow.

### Prerequisites
Before starting on any of the following tutorials, make sure to complete [Prerequisites for the jobs tutorial](preparation-for-tutorial/README.md).

### Tutorials for Jobs

1. [Sample ways to fetch one or more jobs](ways-to-fetch-jobs/README.md) - Sample ways to fetch one or more jobs.

1. [Start a job and monitor that specific job](start-monitor-a-specific-job/README.md) - Start a job and monitor that specific job using metrics and time series metrics.

1. [Move jobs from dev to prod using data_collector_labels](update-data-collector-labels/README.md) - Move jobs from dev to prod by updating the data_collector label.

1. [Generate a report for a specific job](generate-a-report/README.md) - Generate a report for a specific job, then fetch and download it.

1. [See logs for a Data Collector where a job is running](data-collector-logs/README.md) - Get the Data Collector where a job is running and then see its logs.

### Conclusion

To get to know more details about the SDK for Python, check the [SDK documentation](https://docs.streamsets.com/sdk/latest/index.html).

If you don't have access to SCH, sign up for a 30-day free trial by visiting https://streamsets.com/products/sch/control-hub-trial.
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/generate-a-report/README.md:
--------------------------------------------------------------------------------
Generate a Data Delivery Report
===============================

This tutorial covers how to generate a [StreamSets Control Hub Data Delivery Report](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/Reports/DeliveryReports_title.html#concept_xkf_v34_ndb) for a specific job and then fetch and download it.

Data delivery reports provide data processing metrics for a given job or topology. For example, you can use reports to view the number of records that were processed by a job or topology the previous day.

### Prerequisites
Make sure to complete [Prerequisites for the jobs tutorial](../preparation-for-tutorial).

### Tutorial environment details
The following was used while creating this tutorial:
* Python 3.6
* StreamSets for SDK 3.8.0
* StreamSets Data Collector version 3.17.0

### Outline
In [Prerequisites for the jobs tutorial](../preparation-for-tutorial/README.md), one job was created with the name 'Job for Kirti-HelloWorld'.
This tutorial shows the following:
1. Create a report definition
1. Generate a report
1. Download that report

### Workflow
On a terminal, type the following command to open a Python 3 interpreter.
```bash
$ python3
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

### Step 1 — Connect to StreamSets Control Hub instance

Let’s assume the StreamSets Control Hub is running at http://sch.streamsets.com
Create an object called control_hub which is connected to the above.

```python
from streamsets.sdk import ControlHub

# Replace the argument values according to your setup
control_hub = ControlHub(server_url='http://sch.streamsets.com',
                         username='user@organization1',
                         password='password')
```

### Step 2 — Create a report definition

The following code shows how to create a report definition for the job named 'Job for Kirti-HelloWorld' using the [StreamSets SDK for Python](https://docs.streamsets.com/sdk/latest/index.html).
Optionally, you can create the same using the UI in the browser.

```python
# Get the specific job using the job name
job = control_hub.jobs.get(job_name='Job for Kirti-HelloWorld')

# Create Report Definition
report_definition_builder = control_hub.get_report_definition_builder()
report_definition_builder.set_data_retrieval_period(start_time='${time:now() - 30 * MINUTES}',
                                                    end_time='${time:now()}')
# Specify the selected job as the report resource
report_definition_builder.add_report_resource(job)
report_definition = report_definition_builder.build(name='Kirti-HelloWorld-Report')

control_hub.add_report_definition(report_definition)
```
The above code produces a report definition like the following:

![image alt text](../images/hello_world_job_report_definition.jpeg)

### Step 3 — Generate a report
```python
import time

report_definition = control_hub.report_definitions.get(name='Kirti-HelloWorld-Report')
# Generate Report
report_command = report_definition.generate_report()
report_id = report_command.response['id']

# Poll until the report is ready
report = report_definition.reports.get(id=report_id)
while report.report_status != 'REPORT_SUCCESS':
    time.sleep(5)
    report = report_definition.reports.get(id=report_id)

print(f'Fetched report = {report}')
```

The above code generates a report like the following:

![image alt text](../images/hello_world_job_show_reports.jpeg)

### Step 4 — Download the report

```python
# Another way to get the report
report = report_definition.reports[0]
# Download the report PDF and write it to a file called report.pdf in the current directory
report_content = report.download()
with open('report.pdf', 'wb') as report_file:
    report_file.write(report_content)
```

### Follow-up
To get to know more details about the SDK for Python, check the [SDK documentation](https://docs.streamsets.com/sdk/latest/index.html).

If you encounter any problems with this tutorial, please [file an issue in the tutorials project](https://github.com/streamsets/tutorials/issues/new).
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_details.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_details.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_history.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_history.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_monitoring.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_monitoring.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_report_definition.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_report_definition.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_show_reports.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-jobs/images/hello_world_job_show_reports.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/images/list_of_jobs_by_new_data_collector_label.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-jobs/images/list_of_jobs_by_new_data_collector_label.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/images/list_of_jobs_by_old_data_collector_label.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-jobs/images/list_of_jobs_by_old_data_collector_label.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/preparation-for-tutorial/README.md:
--------------------------------------------------------------------------------
Prerequisites - for SCH Jobs related tutorials
==============================================

This covers the steps you need to complete before starting on any of the other [StreamSets Control Hub job](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/Jobs/Jobs_title.html)-related tutorials in this set.
### Prerequisites
* [Python 3.4+](https://docs.python.org/3/using/index.html) and pip3 installed
* StreamSets for SDK [Installed and activated](https://docs.streamsets.com/sdk/latest/installation.html)
* [Access to StreamSets Control Hub](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/OrganizationSecurity/OrgSecurity_Overview.html#concept_q5z_jkl_wy) with a user account in your organization
* At least one [StreamSets Data Collector](https://streamsets.com/products/dataops-platform/data-collector/) instance registered with the above StreamSets Control Hub instance

**Note**: Make sure that the user account has the access required for the tasks this tutorial covers. The easiest way to verify this is to do those tasks using the Web UI of StreamSets Control Hub first and fix any access problems before embarking on the path below.

### Tutorial environment details
The following was used while creating this tutorial:
* Python 3.6
* StreamSets for SDK 3.8.0
* StreamSets Data Collector version 3.17.0

### Outline
In this preparation, 2 jobs are created with the following names:
1. Job for Kirti-HelloWorld
1. Job for Kirti-DevRawDataSource

This page details how to create them using the SDK for Python.
Optionally, you can create them using the UI in the browser too; just follow all the details specified for the jobs.

### Workflow

On a terminal, type the following command to open a Python 3 interpreter.

```bash
$ python3
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

### Step 1 — Connect to StreamSets Control Hub instance

Let’s assume the StreamSets Control Hub is running at http://sch.streamsets.com
Create an object called control_hub which is connected to the above.

```python
from streamsets.sdk import ControlHub

# Replace the argument values according to your setup
control_hub = ControlHub(server_url='http://sch.streamsets.com',
                         username='user@organization1',
                         password='password')
```

### Step 2 — Create first job
Create a job either using the UI or the SDK for Python.

Here is a sample job created using the SDK for Python. For this tutorial's purpose, create the job with

1. tags e.g. tags=['kirti-job-dev-tag']
1. datacollector-labels e.g. data_collector_labels = ['kirti-dev']
1. Time series analysis enabled

```python
# Create a pipeline
builder = control_hub.get_pipeline_builder()
dev_data_generator = builder.add_stage('Dev Data Generator')
trash = builder.add_stage('Trash')
dev_data_generator >> trash  # connect the Dev Data Generator origin to the Trash destination
pipeline = builder.build('Kirti-HelloWorld')
control_hub.publish_pipeline(pipeline)

# Create a job for the above
job_builder = control_hub.get_job_builder()
job = job_builder.build('Job for Kirti-HelloWorld', pipeline=pipeline, tags=['kirti-job-dev-tag'])
job.data_collector_labels = ['kirti-dev']
job.enable_time_series_analysis = True
control_hub.add_job(job)
```

After the above code is executed, one can see the job in the UI as follows. Note the Data Collector label here.

![image alt text](../images/hello_world_job_details.jpeg)
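As an optional check from the SDK side, you can fetch the job back and confirm the label and tag. This sketch uses the same getters as the other jobs tutorials:

```python
# Optional: confirm the label stuck, using the jobs getter from the other tutorials.
job = control_hub.jobs.get(job_name='Job for Kirti-HelloWorld')
print(job.data_collector_labels)  # expected: ['kirti-dev']
```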
### Step 3 — Create second job

Create another job either using the UI or the SDK for Python.

Here is a sample job created using the SDK for Python. For this tutorial's purpose, create the job with

1. tags e.g. tags=['kirti-job-dev-RawDS-tag']
1. datacollector-labels e.g. data_collector_labels = ['kirti-dev']
1. Time series analysis enabled

```python
# Create second pipeline
builder = control_hub.get_pipeline_builder()
dev_raw_data_source = builder.add_stage('Dev Raw Data Source')
trash = builder.add_stage('Trash')
dev_raw_data_source >> trash  # connect the Dev Raw Data Source origin to the Trash destination
pipeline = builder.build('Kirti-DevRawDataSource')
control_hub.publish_pipeline(pipeline)

# Create a job for the above
job_builder = control_hub.get_job_builder()
job = job_builder.build('Job for Kirti-DevRawDataSource', pipeline=pipeline, tags=['kirti-job-dev-RawDS-tag'])
job.data_collector_labels = ['kirti-dev']
job.enable_time_series_analysis = True
control_hub.add_job(job)
```

### Conclusion
With this preparation done, you are ready to start on the other tutorials in this set.
To get to know more details about the SDK for Python, check the [SDK documentation](https://streamsets.com/documentation/sdk/latest/index.html).

If you encounter any problems with this tutorial, please [file an issue in the tutorials project](https://github.com/streamsets/tutorials/issues/new).
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/start-monitor-a-specific-job/README.md:
--------------------------------------------------------------------------------
Start and monitor a job
=======================

This tutorial covers how to start a [StreamSets Control Hub job](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/Jobs/Jobs_title.html) and monitor that specific job.

A job defines the pipeline to run and the execution engine that runs the pipeline: Data Collector, Data Collector Edge, or Transformer. Jobs are the execution of the dataflow.

### Prerequisites
Make sure to complete [Prerequisites for the jobs tutorial](../preparation-for-tutorial).

### Tutorial environment details
The following was used while creating this tutorial:
* Python 3.6
* StreamSets for SDK 3.8.0
* StreamSets Data Collector version 3.17.0

### Outline
In [Prerequisites for the jobs tutorial](../preparation-for-tutorial), one job was created with the name 'Job for Kirti-HelloWorld'.
This tutorial shows how to start that job and then monitor it using metrics, status, and time series metrics.
It also shows how to find the Data Collector where the job was started.

### Workflow
On a terminal, type the following command to open a Python 3 interpreter.

```bash
$ python3
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

### Step 1 — Connect to StreamSets Control Hub instance

Let’s assume the StreamSets Control Hub is running at http://sch.streamsets.com
Create an object called control_hub which is connected to the above.

```python
from streamsets.sdk import ControlHub

# Replace the argument values according to your setup
control_hub = ControlHub(server_url='http://sch.streamsets.com',
                         username='user@organization1',
                         password='password')
```

### Step 2 — Start the job
Now let’s start the job.

```python
import time
# Select the job using its name
job = control_hub.jobs.get(job_name='Job for Kirti-HelloWorld')
control_hub.start_job(job)
time.sleep(60)  # Let it run for a minute
```

### Step 3 — Monitor the started job with Metrics
Let's monitor it using the SDK for Python:

**Job Metrics**
```python
# Fetch the job metrics
job_metrics = job.metrics(metric_type='RECORD_THROUGHPUT', include_error_count=True)
print('Job Metrics = ', job_metrics)
print('Output count = ', job_metrics.output_count)
print('Error count = ', job_metrics.error_count)
```
The above code produces sample output like the following (object reprs shortened to `<...>`):

```bash
Job Metrics =  <...>
Output count =  {'DevDataGenerator_01:DevDataGenerator_01OutputLane7b50203d_5de5_4ee8_9f9c_d42c2614ba74': 1000.0,
                 'PIPELINE': 1000.0,
                 'DevDataGenerator_01': 1000.0,
                 'Trash_01': 1000.0}
Error count =  {'DevDataGenerator_01:DevDataGenerator_01OutputLane7b50203d_5de5_4ee8_9f9c_d42c2614ba74': 0.0,
                'PIPELINE': 0.0,
                'DevDataGenerator_01': 0.0,
                'Trash_01': 0.0}
```

### Step 4 — Monitor the started job with Job Status
Let's monitor it with the job status:

```python
job.refresh()  # Make sure to run this so that the job status is updated
job_status = job.status
# The ids of the Data Collectors that were used to execute the job
sdc_ids = job_status.sdc_ids
print('sdc_ids = ', sdc_ids)
```
The above code produces sample output like the following:
```bash
sdc_ids = ['380e3f59-d74e-11ea-b07f-adf940e256e9']
```
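The `sdc_ids` can be used to look up the executing Data Collector itself. Here is a minimal sketch, assuming the SDK's `data_collectors` collection supports lookup by the `id` attribute; fetching the actual logs is covered in the [data-collector-logs tutorial](../data-collector-logs/README.md):

```python
# Assumption: control_hub.data_collectors can be filtered by the `id` attribute.
data_collector = control_hub.data_collectors.get(id=sdc_ids[0])
print(data_collector)
```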
### Step 5 — Monitor the started job with Time Series Metrics
Let's monitor it with time series metrics.

**Job Time Series Metrics**
Since we enabled time series analysis while creating the jobs, the time series metrics are available now.
```python
# Fetch the time series metrics
job_time_series_metrics = job.time_series_metrics(metric_type='Record Throughput Time Series')
print('job_time_series_metrics = ', job_time_series_metrics)
```
The above code produces sample output like the following (object reprs shortened to `<...>`):
```bash
job_time_series_metrics = <... (input_records=<...>,
                                output_records=<...>,
                                error_records=<...>)>
```

One can observe the same using the StreamSets Control Hub UI in the browser:

![image alt text](../images/hello_world_job_monitoring.jpeg)

### Follow-up
To get to know more details about the SDK for Python, check the [SDK documentation](https://docs.streamsets.com/sdk/latest/index.html).

If you encounter any problems with this tutorial, please [file an issue in the tutorials project](https://github.com/streamsets/tutorials/issues/new).
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/update-data-collector-labels/README.md:
--------------------------------------------------------------------------------
Promote jobs to production by updating data-collector-labels for jobs
=====================================================================

This tutorial shows how to move [StreamSets Control Hub jobs](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/Jobs/Jobs_title.html) from dev to production by updating from one label to another.

A job defines the pipeline to run and the execution engine that runs the pipeline: Data Collector, Data Collector Edge, or Transformer. Jobs are the execution of the dataflow.

When there are many jobs that need this update, the SDK for Python makes it easy to update them with just a few lines of code.

### Prerequisites
Make sure to complete [Prerequisites for the jobs tutorial](../preparation-for-tutorial/README.md).

### Tutorial environment details
The following was used while creating this tutorial:
* Python 3.6
* StreamSets for SDK 3.8.0
* StreamSets Data Collector version 3.17.0

### Outline
At this point, let's say we are satisfied with both of the jobs that we created while preparing for the tutorials.
These jobs have data-collector-label = 'kirti-dev'.
Now let's promote them to production Data Collectors by changing their data_collector_labels to 'kirti-prod'.

### Workflow
On a terminal, type the following command to open a Python 3 interpreter.

```bash
$ python3
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

### Step 1 — Connect to StreamSets Control Hub instance

Let’s assume the StreamSets Control Hub is running at http://sch.streamsets.com
Create an object called control_hub which is connected to the above.
```python
from streamsets.sdk import ControlHub

# Replace the argument values according to your setup
control_hub = ControlHub(server_url='http://sch.streamsets.com',
                         username='user@organization1',
                         password='password')
```

### Step 2 — Fetch jobs with dev data_collector_labels

In the browser, one can see the jobs with the existing data_collector_label = ['kirti-dev'] as follows:

![image alt text](../images/list_of_jobs_by_old_data_collector_label.jpeg)

```python
jobs_with_existing_label = control_hub.jobs.get_all(data_collector_labels=['kirti-dev'])
print(f'jobs_with_existing_label = \n {jobs_with_existing_label}')
```
The above code produces output like the following (object reprs shortened to `<Job ...>`):
```bash
jobs_with_existing_label =
 [<Job ...>,
  <Job ...>]
```

### Step 3 — Update data_collector_labels

```python
for job in jobs_with_existing_label:
    job.data_collector_labels = ['kirti-prod']
    control_hub.update_job(job)
jobs_with_updated_label = control_hub.jobs.get_all(data_collector_labels=['kirti-prod'])
print(f'jobs_with_updated_label = \n {jobs_with_updated_label}')
jobs_with_old_label = control_hub.jobs.get_all(data_collector_labels=['kirti-dev'])
print(f'jobs_with_old_label = {jobs_with_old_label}')
```
The above code produces output like the following:
```bash
jobs_with_updated_label =
 [<Job ...>,
  <Job ...>]
jobs_with_old_label = []
```

At this point, in the browser, one can see the jobs with the updated data_collector_label = ['kirti-prod'] as follows:

![image alt text](../images/list_of_jobs_by_new_data_collector_label.jpeg)

### Follow-up
To get to know more details about the SDK for Python, check the [SDK documentation](https://streamsets.com/documentation/sdk/latest/index.html).

If you encounter any problems with this tutorial, please [file an issue in the tutorials project](https://github.com/streamsets/tutorials/issues/new).
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-jobs/ways-to-fetch-jobs/README.md:
--------------------------------------------------------------------------------
Some sample ways to fetch one or more jobs
==========================================

This tutorial shows a few ways to fetch one or more [StreamSets Control Hub jobs](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/Jobs/Jobs_title.html).

A job defines the pipeline to run and the execution engine that runs the pipeline: Data Collector, Data Collector Edge, or Transformer. Jobs are the execution of the dataflow.

### Prerequisites
Make sure to complete [Prerequisites for the jobs tutorial](../preparation-for-tutorial).

### Tutorial environment details
The following was used while creating this tutorial:
* Python 3.6
* StreamSets for SDK 3.8.0
* StreamSets Data Collector version 3.17.0

### Workflow
On a terminal, type the following command to open a Python 3 interpreter.

```bash
$ python3
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

### Connect to StreamSets Control Hub instance

Let’s assume the StreamSets Control Hub is running at http://sch.streamsets.com
Create an object called control_hub which is connected to the above.

```python
from streamsets.sdk import ControlHub

# Replace the argument values according to your setup
control_hub = ControlHub(server_url='http://sch.streamsets.com',
                         username='user@organization1',
                         password='password')
```

### 1 — Get all jobs

If you have a lot of jobs, this can put a heavy load on the system, so do not use this method to choose specific jobs.

```python
print('All jobs')
all_jobs = control_hub.jobs
print(all_jobs)
```
Sample output looks like the following (object reprs shortened to `<Job ...>`):
```
[<Job ...>,
 <Job ...>,
 <Job ...>,
 <Job ...>]
```

### 2 — Get a limited number of jobs and also specify sort order

```python
limited_num_of_jobs = control_hub.jobs.get_all(len=2, order='ASC')
print('\nlimited_num_of_jobs are as follows: ', limited_num_of_jobs)
```

### 3 — Get a specific job using name

```python
first_job = control_hub.jobs.get(job_name='Job for Kirti-HelloWorld')
```

### 4 — Get a specific job using id

```python
job = control_hub.jobs.get(id=first_job.job_id)
print(job)
```

### 5 — Get a list of jobs specified by pipeline_commit_label

```python
jobs = control_hub.jobs.get_all(pipeline_commit_label='v1')
print(jobs)
```

### 6 — Get a list of jobs specified by job_tag

```python
first_job_tags = first_job.job_tags
first_job_first_tag_id = first_job_tags[0]['id'] if first_job_tags else ''
if first_job_first_tag_id:
    jobs = control_hub.jobs.get_all(job_tag=first_job_first_tag_id)
    print(jobs)
```

### Follow-up
To get to know more details about the SDK for Python, check the [SDK documentation](https://docs.streamsets.com/sdk/latest/index.html).

If you encounter any problems with this tutorial, please [file an issue in the tutorials project](https://github.com/streamsets/tutorials/issues/new).
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/README.md:
--------------------------------------------------------------------------------
Interaction with StreamSets Control Hub pipelines
=================================================

This set contains tutorials for [StreamSets Control Hub pipelines](https://streamsets.com/documentation/controlhub/latest/help/datacollector/UserGuide/Pipeline_Design/What_isa_Pipeline.html).

A pipeline describes the flow of data from the origin system to destination systems and defines how to transform the data along the way.

### Prerequisites
Before starting on any of the tutorials in this set, make sure to complete [Prerequisites for the pipelines tutorial](preparation-for-tutorial/README.md).

### Tutorials for Pipelines

1. [Common pipeline methods](common-pipeline-methods/README.md) - Common operations for [StreamSets Control Hub pipelines](https://streamsets.com/documentation/controlhub/latest/help/datacollector/UserGuide/Pipeline_Design/What_isa_Pipeline.html) like update, duplicate, import, export.
1. [Loop over pipelines and stages and make an edit to stages](edit-pipelines-and-stages/README.md) - When there are many pipelines and stages that need an update, the SDK for Python makes it easy to update them with just a few lines of code.

1. [Create CI CD pipeline used in demo](create-ci-cd-demo-pipeline/README.md) - This covers the steps to create the CI/CD pipeline used in the [SCH CI CD demo](https://github.com/dimaspivak/sch_ci_cd_poc). The steps include how to add stages like JDBC, some processors, and Kineticsearch, and how to set stage configurations. It also shows the use of runtime parameters.

### Conclusion

To get to know more details about the SDK for Python, check the [SDK documentation](https://docs.streamsets.com/sdk/latest/index.html).

If you don't have access to SCH, sign up for a 30-day free trial by visiting https://streamsets.com/products/sch/control-hub-trial.
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/common-pipeline-methods/README.md:
--------------------------------------------------------------------------------
Common operations on SCH pipelines
==================================

This tutorial covers some common operations for [StreamSets Control Hub pipelines](https://streamsets.com/documentation/controlhub/latest/help/datacollector/UserGuide/Pipeline_Design/What_isa_Pipeline.html),
like update, duplicate, import, and export.

A pipeline describes the flow of data from the origin system to destination systems and defines how to transform the data along the way.

### Prerequisites
Make sure to complete [Prerequisites for the pipelines tutorial](../preparation-for-tutorial).

### Tutorial environment details
The following was used while creating this tutorial:
* Python 3.6
* StreamSets for SDK 3.8.0
* StreamSets Data Collector version 3.17.1

### Outline
In [Prerequisites for the pipelines tutorial](../preparation-for-tutorial/README.md), one pipeline was created with the name 'SCH Hello World pipeline'.
On that very pipeline, this tutorial shows operations like:
1. updating the pipeline by adding new pipeline parameters
1. adding a label to it
1. duplicating the pipeline
1. exporting and importing it

### Workflow
On a terminal, type the following command to open a Python 3 interpreter.

```bash
$ python3
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

### Step 1 — Connect to StreamSets Control Hub instance

Let’s assume the StreamSets Control Hub is running at http://sch.streamsets.com
Create an object called control_hub which is connected to the above.

```python
from streamsets.sdk import ControlHub

# Replace the argument values according to your setup
control_hub = ControlHub(server_url='http://sch.streamsets.com',
                         username='user@organization1',
                         password='password')
```

### Step 2 — Update the pipeline by adding runtime parameters
Let's update the pipeline by adding new [runtime parameters](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/Jobs/RuntimeParameters.html#concept_dwq_33w_vz).
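First, fetch the 'SCH Hello World pipeline' created in the preparation tutorial. This is a sketch assuming the SDK's `pipelines` collection supports lookup by the `name` attribute, mirroring the `jobs` getters used in the jobs tutorials:

```python
# Fetch the pipeline created in the preparation tutorial.
# Assumption: the pipelines collection can be filtered by the `name` attribute.
pipeline = control_hub.pipelines.get(name='SCH Hello World pipeline')
```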
```python
pipeline.parameters = {'USERNAME': 'admin_user', 'PASSWORD': 'admin_password'}
control_hub.publish_pipeline(pipeline)
```
At this point, in the browser, one can see the pipeline with the updated runtime parameters as follows:
![image alt text](../images/pipeline_parameters.jpeg)

### Step 3 — Add a label
[A pipeline label](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/Pipelines/PipelineLabels.html?hl=pipeline%2Clabel)
identifies similar pipelines or pipeline fragments. Use pipeline labels to easily search and filter pipelines and fragments when viewing them in the pipeline repository.

```python
pipeline.add_label('updated_label')
control_hub.publish_pipeline(pipeline)
```

### Step 4 — Duplicate the pipeline

```python
# Duplicate the pipeline
duplicated_pipeline = control_hub.duplicate_pipeline(pipeline)
```

At this point, in the browser, one can see the duplicated pipeline with the name 'SCH Hello World pipeline copy':

![image alt text](../images/duplicated_pipeline.jpeg)

### Step 5 — Export and import the pipeline

```python
# Export the pipeline
pipeline_zip_data = control_hub.export_pipelines(duplicated_pipeline)

# Write the exported pipeline to an archive
with open('/tmp/sample_imported_pipeline.zip', 'wb') as pipelines_zip_file:
    pipelines_zip_file.write(pipeline_zip_data)

# Import the pipeline from the above archive
with open('/tmp/sample_imported_pipeline.zip', 'rb') as input_file:
    pipelines_imported = control_hub.import_pipelines_from_archive(input_file, 'Import Pipeline using SDK')
```

At this point, in the browser, one can see the imported pipeline:

![image alt text](../images/imported_pipeline.jpeg)

### Step 6 — Delete pipelines
Since we are now done with the duplicated and imported pipelines, let's delete them.

**Note: Be careful** with the delete operation, as it cannot be undone.
```python
control_hub.delete_pipeline(duplicated_pipeline[0])
control_hub.delete_pipeline(pipelines_imported[0])
```

### Follow-up
To get to know more details about the SDK for Python, check the [SDK documentation](https://docs.streamsets.com/sdk/latest/index.html).

If you encounter any problems with this tutorial, please [file an issue in the tutorials project](https://github.com/streamsets/tutorials/issues/new).
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/3_duplicated_pipelines.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/3_duplicated_pipelines.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/duplicated_pipeline.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/duplicated_pipeline.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/duplicated_pipeline_label.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/duplicated_pipeline_label.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/duplicated_pipeline_label_updated.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/duplicated_pipeline_label_updated.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/duplicated_pipeline_stage_updated.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/duplicated_pipeline_stage_updated.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/imported_pipeline.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/imported_pipeline.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/pipeline_parameters.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/pipeline_parameters.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_destination_stage_config.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_destination_stage_config.jpeg -------------------------------------------------------------------------------- /sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_field_remover_config.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_field_remover_config.jpeg -------------------------------------------------------------------------------- /sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_field_splitter_config.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_field_splitter_config.jpeg -------------------------------------------------------------------------------- /sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_jython_evaluator_config.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_jython_evaluator_config.jpeg -------------------------------------------------------------------------------- /sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_origin_stage_config.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_origin_stage_config.jpeg -------------------------------------------------------------------------------- /sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_overview.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_overview.jpeg -------------------------------------------------------------------------------- /sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_runtime_parameters.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/sch_ci_cd_pipeline_runtime_parameters.jpeg -------------------------------------------------------------------------------- /sdk-tutorials/sch/tutorial-pipelines/images/sch_hello_world_created.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/sch_hello_world_created.jpeg -------------------------------------------------------------------------------- /sdk-tutorials/sch/tutorial-pipelines/images/stage_configs.jpeg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/stage_configs.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/images/stage_label_in_UI.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/sdk-tutorials/sch/tutorial-pipelines/images/stage_label_in_UI.jpeg
--------------------------------------------------------------------------------
/sdk-tutorials/sch/tutorial-pipelines/preparation-for-tutorial/README.md:
--------------------------------------------------------------------------------
1 | Prerequisites - for SCH Pipelines related tutorials
2 | ===================================================
3 | 
4 | This covers the steps you need to complete before starting any of the other [StreamSets Control Hub pipelines](https://streamsets.com/documentation/controlhub/latest/help/datacollector/UserGuide/Pipeline_Design/What_isa_Pipeline.html) related tutorials in this set.
5 | 
6 | ### Prerequisites
7 | * [Python 3.4+](https://docs.python.org/3/using/index.html) and pip3 installed
8 | * StreamSets SDK for Python [installed and activated](https://docs.streamsets.com/sdk/latest/installation.html)
9 | * [Access to StreamSets Control Hub](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/OrganizationSecurity/OrgSecurity_Overview.html#concept_q5z_jkl_wy) with a user account in your organization
10 | * At least one [StreamSets Data Collector](https://streamsets.com/products/dataops-platform/data-collector/) instance registered with the above StreamSets Control Hub instance
11 | 
12 | 
13 | **Note**: Make sure that the user account has the access required to perform the tasks this tutorial covers. The easiest way to verify this is to perform those tasks using the web UI of StreamSets Control Hub first and fix any access problems before embarking on the path below.
14 | 
15 | ### Tutorial environment details
16 | The following was used while creating this tutorial:
17 | * Python 3.6
18 | * StreamSets SDK for Python 3.8.0
19 | * StreamSets Data Collector instances with version 3.17.1
20 | 
21 | ### Outline
22 | In this preparation, a pipeline is created with the name `SCH Hello World pipeline`.
23 | 
24 | This page details how to create it using the SDK for Python.
25 | Optionally, you can create it using the UI in the browser instead; just follow all the details given for the pipeline.
26 | 
27 | ### Workflow
28 | 
29 | On a terminal, type the following command to open a Python 3 interpreter.
30 | 
31 | ```bash
32 | $ python3
33 | Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
34 | [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
35 | Type "help", "copyright", "credits" or "license" for more information.
36 | >>>
37 | ```
38 | 
39 | ### Step 1 — Connect to StreamSets Control Hub instance
40 | 
41 | Let's assume the StreamSets Control Hub instance is running at http://sch.streamsets.com.
42 | Create an object called `control_hub` that is connected to it.
43 | 
44 | ```python
45 | from streamsets.sdk import ControlHub
46 | 
47 | # Replace the argument values according to your setup
48 | control_hub = ControlHub(server_url='http://sch.streamsets.com',
49 |                          username='user@organization1',
50 |                          password='password')
51 | ```
52 | 
53 | ### Step 2 — Create a pipeline
54 | Create a pipeline using either the UI or the SDK for Python.
55 | 
56 | Here is a sample pipeline created using the SDK for Python. For the purposes of this tutorial, create the pipeline with
57 | 1. Name = 'SCH Hello World pipeline'
58 | 1. Origin Stage = 'Dev Data Generator'
59 | 1. Destination Stage = 'Trash'
60 | 
61 | ```python
62 | # Create a pipeline
63 | builder = control_hub.get_pipeline_builder()
64 | dev_data_generator = builder.add_stage('Dev Data Generator')
65 | trash = builder.add_stage('Trash')
66 | 
67 | dev_data_generator >> trash  # connect the Dev Data Generator origin to the Trash destination
68 | 
69 | pipeline = builder.build('SCH Hello World pipeline')
70 | 
71 | # Add the pipeline to Control Hub
72 | control_hub.publish_pipeline(pipeline)
73 | ```
74 | 
75 | After the above code is executed, you can see the pipeline in the UI as follows.
76 | 
77 | ![image alt text](../images/sch_hello_world_created.jpeg)
78 | 
79 | ### How to find the add_stage method parameter
80 | The easiest way to find this out is to use the UI.
81 | Create a pipeline in the UI with the desired stage and look at its label.
82 | 
83 | e.g. Let's say we wish to create a pipeline using the SDK with a JDBC stage.
84 | In a browser, create a pipeline with the desired stage; it is shown as follows:
85 | 
86 | ![image alt text](../images/stage_label_in_UI.jpeg)
87 | 
88 | So the UI shows the Name as `JDBC Query Consumer 1`.
89 | To achieve the same in the SDK, we need to specify `JDBC Query Consumer` (drop the last digit from the Name seen in the UI).
90 | 
91 | ```python
92 | # Add the origin stage to a pipeline builder
93 | builder = control_hub.get_pipeline_builder()
94 | jdbc_query_consumer = builder.add_stage('JDBC Query Consumer')
95 | ```
96 | 
97 | ### Conclusion
98 | With this preparation done, you are ready to start on the other tutorials in this set.
99 | To learn more about the SDK for Python, check the [SDK documentation](https://docs.streamsets.com/sdk/latest/index.html).
100 | 
101 | If you encounter any problems with this tutorial, please [file an issue in the tutorials project](https://github.com/streamsets/tutorials/issues/new).
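Before moving on, a quick way to verify the pipeline was published is to fetch it back by name. This is a minimal sketch; it assumes the `control_hub.pipelines` collection exposes a `get` method, as described in the SDK documentation:

```python
# Optional sanity check: fetch the published pipeline by name.
pipeline = control_hub.pipelines.get(name='SCH Hello World pipeline')
print(pipeline.name)
```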
102 | -------------------------------------------------------------------------------- /tutorial-1/elasticsearch/logs.json: -------------------------------------------------------------------------------- 1 | { 2 | "mappings": { 3 | "logs" : { 4 | "properties" : { 5 | "timestamp": { 6 | "type": "date" 7 | }, 8 | "geo": { 9 | "type": "geo_point" 10 | } 11 | } 12 | } 13 | } 14 | } -------------------------------------------------------------------------------- /tutorial-1/img/add_jython.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/add_jython.png -------------------------------------------------------------------------------- /tutorial-1/img/directory_config.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/directory_config.png -------------------------------------------------------------------------------- /tutorial-1/img/directory_config_log.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/directory_config_log.png -------------------------------------------------------------------------------- /tutorial-1/img/directory_config_postproc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/directory_config_postproc.png -------------------------------------------------------------------------------- /tutorial-1/img/discard_errors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/discard_errors.png -------------------------------------------------------------------------------- /tutorial-1/img/elastic_config.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/elastic_config.png -------------------------------------------------------------------------------- /tutorial-1/img/expression_eval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/expression_eval.png -------------------------------------------------------------------------------- /tutorial-1/img/field_converter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/field_converter.png -------------------------------------------------------------------------------- /tutorial-1/img/field_converter_timestamp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/field_converter_timestamp.png -------------------------------------------------------------------------------- /tutorial-1/img/field_remover.png: 
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/field_remover.png
--------------------------------------------------------------------------------
/tutorial-1/img/geo_ip.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/geo_ip.png
--------------------------------------------------------------------------------
/tutorial-1/img/geoip_errors.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/geoip_errors.png
--------------------------------------------------------------------------------
/tutorial-1/img/idle_alert.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/idle_alert.png
--------------------------------------------------------------------------------
/tutorial-1/img/import_pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/import_pipeline.png
--------------------------------------------------------------------------------
/tutorial-1/img/metric_alerts.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/metric_alerts.png
--------------------------------------------------------------------------------
/tutorial-1/img/part1_kibana_dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/part1_kibana_dashboard.png
--------------------------------------------------------------------------------
/tutorial-1/img/part2_kibana_dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/part2_kibana_dashboard.png
--------------------------------------------------------------------------------
/tutorial-1/img/running_pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/running_pipeline.png
--------------------------------------------------------------------------------
/tutorial-1/img/vimeo-thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-1/img/vimeo-thumbnail.png
--------------------------------------------------------------------------------
/tutorial-1/log_shipping_to_elasticsearch_part2.md:
--------------------------------------------------------------------------------
1 | ## Part 2 - Enhancing Log Data
2 | 
3 | Now that we've examined the basics of how to use Data Collector, let's see how to clean up and/or decorate the log data before posting it into Elasticsearch. We'll also look at some nifty features (metric alerts and data rules) within Data Collector that set up alerts for when the pipeline needs attention.
4 | 
5 | ### Before We Begin
6 | * Clean up Elasticsearch - *delete any previous test data by running the following command.*
7 | 
8 | ```bash
9 | $ curl -XDELETE 'http://localhost:9200/logs'
10 | ```
11 | 
12 | * Recreate the Elasticsearch index:
13 | 
14 | ```bash
15 | $ curl -X PUT -H "Content-Type: application/json" 'http://localhost:9200/logs' -d '{
16 |   "mappings": {
17 |     "logs" : {
18 |       "properties" : {
19 |         "timestamp": {
20 |           "type": "date"
21 |         },
22 |         "geo": {
23 |           "type": "geo_point"
24 |         }
25 |       }
26 |     }
27 |   }
28 | }'
29 | ```
30 | 
31 | ### Add a Jython Evaluator
32 | The log files contain a User Agent string that holds a lot of information about the browser. For the sake of this exercise we want to parse the UA string and extract only the name of the browser. Let's use the Python [user-agents](https://pypi.python.org/pypi/user-agents/0.2.0) package for this.
33 | * Install user-agents on your computer.
34 | ```bash
35 | pip install pyyaml ua-parser user-agents
36 | ```
37 | 
38 | * In your existing pipeline, click the connector between the GeoIP and Elasticsearch processors and select Jython Evaluator from the Add Processor list.
39 | * If the help bar doesn't display, you can turn it on: Click the Help icon in the upper right corner > Settings, clear the Hide Pipeline Creation Help Bar option.
40 | * If the Jython Evaluator does not appear in the list, then you will need to [install it](https://streamsets.com/documentation/datacollector/latest/help/index.html#datacollector/UserGuide/Installation/AddtionalStageLibs.html#concept_fb2_qmn_bz).
41 | 
42 | 
43 | * Replace the existing code in the Jython Evaluator's **Script** box with the following code snippet:
44 | 
45 | ```python
46 | 
47 | import sys
48 | sys.path.append('/Library/Python/2.7/site-packages')
49 | from user_agents import parse
50 | 
51 | for record in records:
52 |     try:
53 |         user_agent = parse(record.value['agent'])
54 |         record.value['browser'] = user_agent.browser.family
55 |         # Write record to processor output
56 |         output.write(record)
57 | 
58 |     except Exception as e:
59 |         # Send record to error
60 |         error.write(record, str(e))
61 | ```
62 | 
63 | This piece of Python code parses the User Agent field denoted by ```record.value['agent']``` and uses the user_agent parser to figure out the browser family.
64 | 
65 | *Note: The location of your pip packages may differ from this example; use*
66 | 
67 | ```python
68 | >>> import site; site.getsitepackages()
69 | ```
70 | 
71 | *to find the location on your computer.*
72 | 
73 | ### Removing fields with the Field Remover
74 | Now that we've identified the browser, we don't have any use for the user-agent string in our dataset. Let's remove that field and save space on our Elasticsearch index.
75 | 
76 | * Add a Field Remover processor to the pipeline.
77 | 
78 | * In its configuration properties, click the *Remove* tab.
79 | 
80 | 
81 | 
82 | * In the Fields property, select the `/agent` field, and set Action to "Remove Listed Fields". You can add more fields to the list to remove other fields you don't need.
83 | 
84 | ### Setting up for production
85 | At this point, you can hit Start and get data flowing into Elasticsearch. However, for long-running pipelines, you may want to configure a few alerts to let you know when the status of the pipeline changes.
86 | 
87 | #### Setting up metric alerts
88 | Metric alerts are a powerful mechanism for notifying users when the pipeline needs attention. To configure these alerts, click on a blank spot on the canvas and go to the *Rules* tab.
89 | 
90 | 
91 | For this exercise, let's pick a preconfigured alert.
92 | Let's say we know that we are expecting a steady flow of data from our web server logs, and if we go two minutes without receiving any data, something might be wrong upstream and an operator will need to look into it.
93 | 
94 | * Click Edit for the Pipeline is Idle alert and set the value to
95 | 
96 | ```${time:now() - value() > 120000}```
97 | 
98 | where 120000 is the idle time in milliseconds.
99 | 
100 | * Enable the **Active** checkbox for the Pipeline is Idle alert.
101 | 
102 | * Data Collector triggers alerts on the Data Collector console, and if you select the Send Email option, sends an alert email to the addresses specified in the Email IDs tab.
103 | 
104 | #### Reprocess the Data
105 | 
106 | * You can hit Preview to see and debug a subset of the data flowing through the pipeline. Or reset the origin, then hit Run to start the pipeline and get data flowing into Elasticsearch.
107 | 
108 | * Refresh the Dashboard in Kibana and you will see the Browser Type graph correctly rendered:
109 | 
110 | 
111 | 
112 | * Leave the pipeline for a couple of minutes, and, since there is no new data being processed, you will see the Pipeline is Idle alert:
113 | 
114 | 
115 | 
116 | ## Where to go from here
117 | 
118 | * [Explore the full set of StreamSets tutorials](https://streamsets.com/tutorials/).
--------------------------------------------------------------------------------
/tutorial-1/readme.md:
--------------------------------------------------------------------------------
1 | # Log Shipping into Elasticsearch
2 | 
3 | In this two-part tutorial, we will learn how to read Apache web server logs and send them to Elasticsearch. Along the way we will transform the data and set up alerts and data rules to let us know if any bad data is encountered. And finally, we'll learn how to adapt the pipeline when data suddenly changes.
4 | 
5 | Data Collector can read from and write to a large number of origins and destinations, but for this tutorial we will limit our scope to a Directory origin and Elasticsearch destination.
6 | 
7 | [![Log shipping into Elasticsearch](img/vimeo-thumbnail.png)](https://vimeo.com/152097120 "Log shipping into Elasticsearch")
8 | 
9 | ## Goals
10 | The goal of this tutorial is to gather Apache log files and send them to Elasticsearch.
11 | 
12 | ## Pre-requisites
13 | * A working instance of StreamSets Data Collector.
14 | * Access to Elasticsearch and Kibana.
15 | * A copy of this tutorials directory containing the [sample data](../sample_data) and [pipeline](pipelines/Directory_to_ElasticSearch_Tutorial_Part_1.json).
16 | * A copy of the MaxMind GeoLite2 free IP geolocation database. *Either get and unzip the binary file or use the csv file* [GeoLite2 City](https://dev.maxmind.com/geoip/geoip2/geolite2/).
17 | 
18 | ## Our Setup
19 | The tutorial's [sample data directory](../sample_data) contains a set of Apache web server log files. Data Collector can read many file formats, but for this example we will use compressed logs (.log.gz) that simulate a system that generates log-rotated files.
20 | 
21 | The log files contain standard Apache Combined Log Format Data.
22 | 
23 | ` host rfc931 username date:time request statuscode bytes referrer user_agent `
24 | 
25 | *If you'd like to generate a larger volume of log files, you can use the [Fake Apache Log Generator](http://github.com/kiritbasu/Fake-Apache-Log-Generator) script*.
26 | 
27 | ### Setting up an index on Elasticsearch
28 | We will need to set up an index with the right mapping before we can use [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html); here's how:
29 | 
30 | ```bash
31 | $ curl -X PUT -H "Content-Type: application/json" 'http://localhost:9200/logs' -d '{
32 |   "mappings": {
33 |     "logs" : {
34 |       "properties" : {
35 |         "timestamp": {
36 |           "type": "date"
37 |         },
38 |         "geo": {
39 |           "type": "geo_point"
40 |         }
41 |       }
42 |     }
43 |   }
44 | }'
45 | ```
46 | This piece of code creates an index called "logs" and defines a few field types:
47 | 
48 | * `timestamp` - this is a date field
49 | * `geo` - this is a geo_point field that has lat/lon attributes
50 | 
51 | *You can use the [Kibana Dev Tools Console](https://www.elastic.co/guide/en/kibana/current/console-kibana.html) or the [Postman API Tool](http://www.getpostman.com/) to interact with Elasticsearch via API*.
52 | 
53 | ### Installing StreamSets
54 | * Download and install the latest [StreamSets Data Collector](https://streamsets.com/opensource) binaries.
55 | 
56 | ## Let's Get Started
57 | * [Part 1 - Basic Log Preparation](log_shipping_to_elasticsearch_part1.md)
58 | * [Part 2 - Enhancing Log Data & Preparing for Production](log_shipping_to_elasticsearch_part2.md)
59 | 
--------------------------------------------------------------------------------
/tutorial-2/directory_to_kafkaproducer.md:
--------------------------------------------------------------------------------
1 | ## Part 1 - Publishing to a Kafka Producer
2 | 
3 | 
4 | ### Creating a Pipeline
5 | * Launch the Data Collector console and create a new pipeline.
6 | 
7 | 
8 | #### Defining the Source
9 | * Drag the Directory origin stage into your canvas.
10 | 
11 | * In the Configuration settings below, select the *Files* tab.
12 | 
13 | 
14 | 
15 | * Enter the following settings:
16 | 
17 |   * **Files Directory** - The absolute file path to the directory containing the sample .avro files.
18 |   * **File Name Pattern** - `cc*` -
19 |     *The ccdata file in the samples directory is a bzip2-compressed Avro file. Data Collector will automatically detect and decompress it on the fly.*
20 | 
21 | * In the *Post Processing* tab make sure **File Post Processing** is set to None.
22 | 
23 |   *Note: This property also lets you delete source files after they have been processed. You may want to use this in your production systems once you have verified your pipelines are configured correctly.*
24 | 
25 | 
26 | * In the *Data Format* tab, select **Avro**.
27 | 
28 | 
29 | #### Defining the Kafka Producer
30 | * Drag a Kafka Producer destination to the canvas.
31 | 
32 | * In the Configuration settings, click the General tab. For Stage Library, select the version of Kafka that matches your environment.
33 | 
34 | * Go to the Kafka tab and set the Broker URI property to point to your Kafka broker, e.g. `<host>:<port>`. Set Topic to the name of your Kafka topic.
35 | 
36 | 
37 | You can use the Kafka Configuration section of this tab to enter any specific Kafka settings you want to use. In a future tutorial we'll see how to configure TLS, SASL or Kerberos with Kafka.
38 | 
39 | * On the Data Format tab, select SDC Record.
40 | 
41 | 
42 | *SDC Record is the internal data format that is highly optimized for use within StreamSets Data Collector (SDC). Since we are going to be using another Data Collector pipeline to read from this Kafka topic, we can use SDC Record to optimize performance. If you have a custom Kafka Consumer on the other side, you may want to use one of the other data formats and decode it accordingly.*
43 | 
44 | You may choose to transform data using any of the Data Collector processor stages before you write it to Kafka; however, for this tutorial we will do the transformations on the other end.
45 | 
46 | That's it! Your pipeline is now ready to feed messages into Kafka.
47 | 
48 | #### Preview the Data
49 | * Feel free to hit the Preview icon to examine the data before executing the pipeline.
50 | 
51 | #### Execute the Pipeline
52 | * Hit the Start icon. If your Kafka server is up and running, the pipeline should start sending data to Kafka.
53 | 
54 | #### What's Next?
55 | * Part 2 - [Reading from a Kafka Consumer](kafkaconsumer_to_multipledestinations.md)
56 | 
--------------------------------------------------------------------------------
/tutorial-2/img/data_conversions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/data_conversions.png
--------------------------------------------------------------------------------
/tutorial-2/img/directory_config_dataformat.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/directory_config_dataformat.png
--------------------------------------------------------------------------------
/tutorial-2/img/directory_config_postproc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/directory_config_postproc.png
--------------------------------------------------------------------------------
/tutorial-2/img/directory_setup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/directory_setup.png
--------------------------------------------------------------------------------
/tutorial-2/img/elastic_config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/elastic_config.png
--------------------------------------------------------------------------------
/tutorial-2/img/field_converter_config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/field_converter_config.png
--------------------------------------------------------------------------------
/tutorial-2/img/field_masker_config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/field_masker_config.png
--------------------------------------------------------------------------------
/tutorial-2/img/kafka_consumer_config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/kafka_consumer_config.png
--------------------------------------------------------------------------------
/tutorial-2/img/kafka_consumer_dataformat.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/kafka_consumer_dataformat.png
--------------------------------------------------------------------------------
/tutorial-2/img/kafka_consumer_pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/kafka_consumer_pipeline.png
--------------------------------------------------------------------------------
/tutorial-2/img/kafka_producer_config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/kafka_producer_config.png
--------------------------------------------------------------------------------
/tutorial-2/img/kafka_producer_dataformat.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/kafka_producer_dataformat.png
--------------------------------------------------------------------------------
/tutorial-2/img/our_setup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/our_setup.png
--------------------------------------------------------------------------------
/tutorial-2/img/s3_config1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/s3_config1.png
--------------------------------------------------------------------------------
/tutorial-2/img/s3_config2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/s3_config2.png
--------------------------------------------------------------------------------
/tutorial-2/img/vimeo-thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-2/img/vimeo-thumbnail.png
--------------------------------------------------------------------------------
/tutorial-2/kafkaconsumer_to_multipledestinations.md:
--------------------------------------------------------------------------------
1 | ## Part 2 - Reading from a Kafka Consumer
2 | 
3 | In this part of the tutorial we will set up a pipeline that drains data from a Kafka Consumer, performs a couple of transformations, and writes to multiple destinations.
4 | 
5 | 
6 | 
7 | You may remember the data we are reading simulates credit card information and contains the card number:
8 | ```json
9 | {
10 |   "transaction_date":"dd/mm/YYYY",
11 |   "card_number":"0000-0000-0000-0000",
12 |   "card_expiry_date":"mm/YYYY",
13 |   "card_security_code":"0000",
14 |   "purchase_amount":"$00.00",
15 |   "description":"transaction description of the purchase"
16 | }
17 | ```
18 | We don't want to store credit card information in any of our data stores, so this is a perfect opportunity to sanitize the data before it gets there. We'll use a few built-in transformation stages to mask the card numbers so that only the last 4 digits make it through.
19 | 
20 | #### Defining the source
21 | * Drag the 'Kafka Consumer' origin stage into your canvas.
22 | 
23 | * Go to the 'General' tab in its configuration and select the version of Kafka that matches your environment in the 'Stage Library' dropdown.
24 | 
25 | * On the 'Kafka' tab, set the Broker URI, Zookeeper URI and topic name to match the settings in your environment.
26 | 
27 | 
28 | 
29 | * On the 'Data Format' tab, select SDC Record. (You may remember from Part 1 of this tutorial that we sent data through Kafka in this format, so we want to make sure we decode the incoming data appropriately.)
30 | 
31 | 
32 | #### Field Converter
33 | * It so happens that the card number field is defined as an integer in Avro. We will want to convert this to a string value. So type '/card_number' in the 'Fields to Convert' text box and set 'Convert to Type' to String. Leave the rest at their default values.
34 | 
35 | 
36 | #### Jython Evaluator
37 | * In this stage we'll use a small piece of Python code to look at the first few digits of the card number and figure out what type of card it is. We'll add that card type to a new field called 'credit_card_type'. (A standalone sanity check of this logic appears just before the destination setup below.)
38 | 
39 | Go to the 'Jython' tab of the Jython Evaluator and enter the following piece of code.
40 | 
41 | ```python
42 | 
43 | for record in records:
44 |     try:
45 |         cc = record.value['card_number']
46 |         if cc == '':
47 |             error.write(record, "Credit Card Number was null")
48 |             continue
49 | 
50 |         cc_type = ''
51 |         if cc.startswith('4'):
52 |             cc_type = 'Visa'
53 |         elif cc.startswith(('51','52','53','54','55')):
54 |             cc_type = 'MasterCard'
55 |         elif cc.startswith(('34','37')):
56 |             cc_type = 'AMEX'
57 |         elif cc.startswith(('300','301','302','303','304','305','36','38')):
58 |             cc_type = 'Diners Club'
59 |         elif cc.startswith(('6011','65')):
60 |             cc_type = 'Discover'
61 |         elif cc.startswith(('2131','1800','35')):
62 |             cc_type = 'JCB'
63 |         else:
64 |             cc_type = 'Other'
65 | 
66 |         record.value['credit_card_type'] = cc_type
67 |         output.write(record)
68 | 
69 |     except Exception as e:
70 |         # Send record to error
71 |         error.write(record, str(e))
72 | 
73 | ```
74 | 
75 | #### Field Masker
76 | * The last step of the process is to mask the card number so that only the last 4 digits of the card make it to the data stores.
77 | 
78 | 
79 | 
80 | * In the 'Field Masker' stage configuration, type '/card_number' and set the mask type to custom. In this mode you can use '#' to show characters and any other character to use as a mask. For example, to show only the last 4 digits of a credit card number:
81 | 
82 | '0123 4567 8911 0123' masked with
83 | 
84 | '---- ---- ---- ####' will change the value to
85 | 
86 | '---- ---- ---- 0123'
87 | 
88 | #### Destinations
89 | In this particular example we will write the results to two destinations: Elasticsearch and an Amazon S3 bucket.
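Before setting up those destinations, you may want to sanity-check the card-type logic from the Jython Evaluator above outside of Data Collector. Here is a minimal sketch in plain Python; the sample card numbers are made up purely for illustration:

```python
# Standalone check of the prefix rules used in the Jython Evaluator above.
def card_type(cc):
    rules = [
        ('Visa', ('4',)),
        ('MasterCard', ('51', '52', '53', '54', '55')),
        ('AMEX', ('34', '37')),
        ('Diners Club', ('300', '301', '302', '303', '304', '305', '36', '38')),
        ('Discover', ('6011', '65')),
        ('JCB', ('2131', '1800', '35')),
    ]
    for name, prefixes in rules:
        if cc.startswith(prefixes):
            return name
    return 'Other'

# Made-up numbers, for illustration only.
assert card_type('4111-1111-1111-1111') == 'Visa'
assert card_type('5500-0000-0000-0004') == 'MasterCard'
assert card_type('9999-0000-0000-0000') == 'Other'
```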
90 | 
91 | ##### Setting up Elasticsearch
92 | 
93 | * Drag and drop an 'Elasticsearch' stage onto the canvas.
94 | 
95 | * Go to its Configuration and select the 'General' tab. In the 'Stage Library' dropdown, select the version of Elasticsearch you are running.
96 | 
97 | * Go to the 'Elasticsearch' tab and, in the 'Cluster URI' field, specify the host:port where your Elasticsearch service is running.
98 | 
99 | * In the 'Index' and 'Mapping' text boxes, specify the names of your index and mapping.
100 | 
101 | 
102 | 
103 | ##### Writing to an Amazon S3 bucket
104 | A common use case is to back up data to S3; in this example we'll convert the data back to Avro format and store it there.
105 | 
106 | * Drag and drop the 'Amazon S3' stage to the canvas.
107 | 
108 | * In its configuration, enter your 'Access Key ID' and 'Secret Access Key', select the 'Region' and enter the 'Bucket' name you want to store the files in.
109 | 
110 | 
111 | 
112 | * On the 'Data Format' tab, select 'Avro' and 'In Pipeline Configuration' for Avro Schema Location. Then specify the following schema for Avro Schema:
113 | 
114 | ```json
115 | {"namespace" : "cctest.avro",
116 |  "type": "record",
117 |  "name": "CCTest",
118 |  "doc": "Test Credit Card Transactions",
119 |  "fields": [
120 |      {"name": "transaction_date", "type": "string"},
121 |      {"name": "card_number", "type": "string"},
122 |      {"name": "card_expiry_date", "type": "string"},
123 |      {"name": "card_security_code", "type": "string"},
124 |      {"name": "purchase_amount", "type": "string"},
125 |      {"name": "description", "type": "string"}
126 |  ]
127 | }
128 | ```
129 | 
130 | * To save space on the S3 bucket, let's compress the data as it's written. Select BZip2 as the Avro Compression Codec.
131 | 
132 | 
133 | 
134 | #### Execute the Pipeline
135 | * Hit Run and the pipeline should start draining Kafka messages and writing them to Elasticsearch and Amazon S3.
136 | 
--------------------------------------------------------------------------------
/tutorial-2/readme.md:
--------------------------------------------------------------------------------
1 | # Simple Kafka Enablement using StreamSets Data Collector
2 | 
3 | Creating custom Kafka producers and consumers is often a tedious process that requires manual coding. In this tutorial, we'll see how to use StreamSets Data Collector to create data ingest pipelines to write to Kafka using a Kafka Producer, and read from Kafka with a Kafka Consumer, with no handwritten code.
4 | 
5 | 
6 | 
7 | [![Simple Kafka Enablement](img/vimeo-thumbnail.png)](https://vimeo.com/153061876 "Simple Kafka Enablement")
8 | 
9 | ## Goals
10 | The goal of this tutorial is to read Avro files from a file system directory and write them to a Kafka topic using the StreamSets Kafka Producer. We'll then use a second pipeline configured with a Kafka Consumer to drain that topic, perform a set of transformations and send the data to two different destinations.
11 | 
12 | ## Prerequisites
13 | * A working instance of StreamSets Data Collector
14 | * A working Kafka instance (see the [Quickstart](http://kafka.apache.org/quickstart) for easy local setup. Last tested on version 1.1.0, but older and newer versions should work too.)
15 | * A copy of this tutorials directory containing the [sample data](../sample_data)
16 | 
17 | ## Our Setup
18 | The tutorial's [sample data directory](../sample_data) contains a set of compressed Avro files that contain simulated credit card transactions in the following JSON format:
19 | 
20 | ```json
21 | {
22 |   "transaction_date":"dd/mm/YYYY",
23 |   "card_number":"0000-0000-0000-0000",
24 |   "card_expiry_date":"mm/YYYY",
25 |   "card_security_code":"0000",
26 |   "purchase_amount":"$00.00",
27 |   "description":"transaction description of the purchase"
28 | }
29 | ```
30 | ## Data Conversions
31 | We will read Avro files from our source directory and write to Kafka using the Data Collector SDC Record data format. Then we'll use another pipeline to read the SDC Record data from Kafka, write it to Elasticsearch, and convert the data to Avro for S3.
32 | 
33 | 
34 | 
35 | 
36 | ## Let's Get Started
37 | * Part 1 - [Publishing to a Kafka Producer](directory_to_kafkaproducer.md)
38 | * Part 2 - [Reading from a Kafka Consumer](kafkaconsumer_to_multipledestinations.md)
39 | 
--------------------------------------------------------------------------------
/tutorial-3/image_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_0.png
--------------------------------------------------------------------------------
/tutorial-3/image_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_1.png
--------------------------------------------------------------------------------
/tutorial-3/image_10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_10.png
--------------------------------------------------------------------------------
/tutorial-3/image_11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_11.png
--------------------------------------------------------------------------------
/tutorial-3/image_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_2.png
--------------------------------------------------------------------------------
/tutorial-3/image_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_3.png
--------------------------------------------------------------------------------
/tutorial-3/image_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_4.png
--------------------------------------------------------------------------------
/tutorial-3/image_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_5.png
-------------------------------------------------------------------------------- /tutorial-3/image_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_6.png -------------------------------------------------------------------------------- /tutorial-3/image_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_7.png -------------------------------------------------------------------------------- /tutorial-3/image_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_8.png -------------------------------------------------------------------------------- /tutorial-3/image_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-3/image_9.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_0.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_1.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_10.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_11.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_12.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_12.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_13.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_13.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_14.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_14.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_15.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_15.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_16.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_17.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_17.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_18.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_18.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_19.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_19.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_2.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_20.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_20.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_21.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_21.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_22.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_22.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_23.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_23.png 
-------------------------------------------------------------------------------- /tutorial-adls-destination/image_24.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_24.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_3.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_4.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_5.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_6.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_7.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_8.png -------------------------------------------------------------------------------- /tutorial-adls-destination/image_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-adls-destination/image_9.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/add-expression-evaluator.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/add-expression-evaluator.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/add-jdbc-lookup.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/add-jdbc-lookup.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/add-jdbc-tee.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/add-jdbc-tee.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/add-record-found.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/add-record-found.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/add-request-id-test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/add-request-id-test.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/added-jdbc-lookup.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/added-jdbc-lookup.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/expression-evaluator-fields.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/expression-evaluator-fields.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/http-request-router-conditions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/http-request-router-conditions.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/new_pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/new_pipeline.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/paste-jdbc-tee.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/paste-jdbc-tee.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/ready-for-delete.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/ready-for-delete.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/ready-for-update.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/ready-for-update.png -------------------------------------------------------------------------------- 
/tutorial-crud-microservice/snapshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/snapshot.png -------------------------------------------------------------------------------- /tutorial-crud-microservice/template_microservice.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-crud-microservice/template_microservice.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/finder1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/finder1.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/finder2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/finder2.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/intellij1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/intellij1.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/intellij2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/intellij2.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/intellij3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/intellij3.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/intellij4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/intellij4.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sch1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sch1.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sch2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sch2.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sch3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sch3.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sch4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sch4.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sch5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sch5.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sch6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sch6.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sch7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sch7.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sch8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sch8.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sch9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sch9.png -------------------------------------------------------------------------------- /tutorial-custom-dataprotector-procedure/images/sdc1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-custom-dataprotector-procedure/images/sdc1.png -------------------------------------------------------------------------------- /tutorial-destination/image_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_0.png -------------------------------------------------------------------------------- 
/tutorial-destination/image_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_1.png -------------------------------------------------------------------------------- /tutorial-destination/image_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_10.png -------------------------------------------------------------------------------- /tutorial-destination/image_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_11.png -------------------------------------------------------------------------------- /tutorial-destination/image_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_2.png -------------------------------------------------------------------------------- /tutorial-destination/image_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_3.png -------------------------------------------------------------------------------- /tutorial-destination/image_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_4.png -------------------------------------------------------------------------------- /tutorial-destination/image_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_5.png -------------------------------------------------------------------------------- /tutorial-destination/image_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_6.png -------------------------------------------------------------------------------- /tutorial-destination/image_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_7.png -------------------------------------------------------------------------------- /tutorial-destination/image_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_8.png -------------------------------------------------------------------------------- /tutorial-destination/image_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-destination/image_9.png 
-------------------------------------------------------------------------------- /tutorial-hivedrift/RDBMS-SDC-Hive.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/RDBMS-SDC-Hive.png -------------------------------------------------------------------------------- /tutorial-hivedrift/drifttutorial1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/drifttutorial1.png -------------------------------------------------------------------------------- /tutorial-hivedrift/drifttutorial2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/drifttutorial2.png -------------------------------------------------------------------------------- /tutorial-hivedrift/drifttutorial3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/drifttutorial3.png -------------------------------------------------------------------------------- /tutorial-hivedrift/drifttutorial4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/drifttutorial4.png -------------------------------------------------------------------------------- /tutorial-hivedrift/image_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/image_0.png -------------------------------------------------------------------------------- /tutorial-hivedrift/image_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/image_1.png -------------------------------------------------------------------------------- /tutorial-hivedrift/image_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/image_2.png -------------------------------------------------------------------------------- /tutorial-hivedrift/image_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/image_6.png -------------------------------------------------------------------------------- /tutorial-hivedrift/image_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/image_7.png -------------------------------------------------------------------------------- /tutorial-hivedrift/image_8.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/image_8.png -------------------------------------------------------------------------------- /tutorial-hivedrift/image_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-hivedrift/image_9.png -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/1-java-opts/README.md: -------------------------------------------------------------------------------- 1 | ### How to set Java Heap size and other Java Options 2 | 3 | Java options, like heap size, can be set at deployment time using the SDC_JAVA_OPTS environment variable in the deployment manifest like this: 4 | 5 | env: 6 | - name: SDC_JAVA_OPTS 7 | value: "-Xmx4g -Xms4g" 8 | 9 | See [sdc.yaml](sdc.yaml) for an example SDC manifest with Java Opts settings. 10 | 11 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/1-java-opts/sdc.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: datacollector-deployment 5 | spec: 6 | replicas: 1 7 | selector: 8 | matchLabels: 9 | app: datacollector-deployment 10 | template: 11 | metadata: 12 | labels: 13 | app: datacollector-deployment 14 | spec: 15 | containers: 16 | - name: datacollector 17 | image: streamsets/datacollector:latest 18 | ports: 19 | - containerPort: 18630 20 | env: 21 | - name: SDC_JAVA_OPTS 22 | value: "-Xmx4g -Xms4g" -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/2-custom-docker-image/README.md: -------------------------------------------------------------------------------- 1 | ### Baked-in Stage Libraries and Configuration 2 | 3 | This example packages a custom sdc.properties file within an SDC image, along with a set of SDC stage libs (including Enterprise stage libs), at the time the image is built. 4 | 5 | This approach is suitable for [execution SDCs](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/DataCollectors/DataCollectors.html#concept_mwp_fcf_gw) whose configuration and stage libs do not need to be dynamically set. The sdc.properties file "baked in" to the custom SDC image may include custom settings for properties like production.maxBatchSize and email configuration if these properties are consistent across deployments. 6 | 7 | Set http.realm.file.permission.check=false in your sdc.properties file to avoid permission issues. 8 | 9 | See the Dockerfile and build.sh in the [sdc-docker-custom-config](sdc-docker-custom-config) directory. 
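As a quick sketch of the overall flow (the image name below is a placeholder you would substitute, not something defined in this repo):

    $ cd sdc-docker-custom-config
    $ # edit build.sh and set IMAGE_NAME, e.g. myregistry.example.com/sdc-custom:3.16.1
    $ ./build.sh
    $ # then set that same image in sdc.yaml and deploy:
    $ kubectl apply -f ../sdc.yaml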
10 | 11 | 12 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/2-custom-docker-image/sdc-docker-custom-config/Dockerfile: -------------------------------------------------------------------------------- 1 | ARG SDC_VERSION=latest 2 | FROM streamsets/datacollector:${SDC_VERSION} 3 | 4 | # Copy the stage libs and enterprise stage libs 5 | COPY --chown=sdc:sdc streamsets-libs ${SDC_DIST}/streamsets-libs 6 | 7 | # Copy the custom sdc.properties file into the image 8 | COPY --chown=sdc:sdc sdc-conf/ ${SDC_CONF}/ -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/2-custom-docker-image/sdc-docker-custom-config/build.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # This script builds a custom Docker image that extends the base SDC image 4 | # It downloads a set of SDC stage libs and enterprise stage libs to a local directory. 5 | # The Dockerfile copies those libs into the SDC image as well as a custom sdc.properties file 6 | 7 | # Your custom image 8 | IMAGE_NAME= 9 | 10 | # SDC Version 11 | SDC_VERSION=3.16.1 12 | 13 | # A space separated list of stage libs to download 14 | SDC_STAGE_LIBS="streamsets-datacollector-aws-lib streamsets-datacollector-bigtable-lib streamsets-datacollector-google-cloud-lib streamsets-datacollector-groovy_2_4-lib streamsets-datacollector-jdbc-lib streamsets-datacollector-jms-lib streamsets-datacollector-jython_2_7-lib" 15 | 16 | # A space separated list of enterprise stage libs to download 17 | SDC_ENTERPRISE_STAGE_LIBS="streamsets-datacollector-databricks-lib-1.0.0 streamsets-datacollector-snowflake-lib-1.4.0 streamsets-datacollector-oracle-lib-1.2.0 streamsets-datacollector-sql-server-bdc-lib-1.0.1" 18 | 19 | # Base URL to download SDC Stage Libs 20 | BASE_URL=https://archives.streamsets.com/datacollector 21 | 22 | # Use a tmp directory to unpack the downloaded stage libs 23 | mkdir -p tmp-stage-libs 24 | cd tmp-stage-libs 25 | 26 | # Download and extract stage libs 27 | for s in $SDC_STAGE_LIBS; 28 | do 29 | wget ${BASE_URL}/${SDC_VERSION}/tarball/${s}-${SDC_VERSION}.tgz; 30 | tar -xvf ${s}-${SDC_VERSION}.tgz; 31 | rm ${s}-${SDC_VERSION}.tgz; 32 | done 33 | 34 | # Download and extract enterprise stage libs 35 | cd streamsets-datacollector-${SDC_VERSION} 36 | for s in $SDC_ENTERPRISE_STAGE_LIBS; 37 | do 38 | wget ${BASE_URL}/latest/tarball/enterprise/${s}.tgz; 39 | tar -xvf ${s}.tgz; 40 | rm -rf ${s}.tgz; 41 | done 42 | 43 | cd ../.. 44 | 45 | # move all the stage libs to the ./streamsets-libs dir 46 | mv tmp-stage-libs/streamsets-datacollector-${SDC_VERSION}/streamsets-libs . 47 | 48 | # remove tmp dir 49 | rm -rf tmp-stage-libs 50 | 51 | # Build the image 52 | docker build -t $IMAGE_NAME . 
53 | 54 | # clean up 55 | rm -rf streamsets-libs 56 | 57 | # Push the image 58 | docker push $IMAGE_NAME -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/2-custom-docker-image/sdc.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: datacollector-deployment 5 | spec: 6 | replicas: 1 7 | selector: 8 | matchLabels: 9 | app: datacollector-deployment 10 | template: 11 | metadata: 12 | labels: 13 | app: datacollector-deployment 14 | spec: 15 | containers: 16 | - name: datacollector 17 | image: 18 | ports: 19 | - containerPort: 18630 20 | env: 21 | - name: SDC_JAVA_OPTS 22 | value: "-Xmx4g -Xms4g" -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/3-volumes/README.md: -------------------------------------------------------------------------------- 1 | ### Loading stage libs from a pre-populated Volume 2 | 3 | This example shows how to load resources from a pre-populated Kubernetes [Volume](https://kubernetes.io/docs/concepts/storage/volumes/). (The next example covers how to dynamically populate and use a [Persistent Volume](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)). 4 | 5 | There are many different Volume [types](https://kubernetes.io/docs/concepts/storage/volumes/#types-of-volumes); this example uses an [Azure File Volume](https://docs.microsoft.com/en-us/azure/aks/azure-files-volume). 6 | 7 | A pre-populated Volume can provide resources to multiple SDC Pods at deployment time, including stage libs, hadoop config files, lookup files, JDBC drivers, etc... 8 | 9 | In this example, the Volume has already been populated with a set of SDC stage libs. The [get-stage-libs.sh](get-stage-libs.sh) script provides an example of how to download stage libs. 10 | 11 | Here is a view of an Azure File Share populated with a set of stage libs within a streamsets-libs directory: 12 | 13 | azure-file-share 14 | 15 | To access the Azure File share, we need to store the connection properties in a Secret named azure-secret: 16 | 17 | $ kubectl create secret generic azure-secret \ 18 | --from-literal=azurestorageaccountname= \ 19 | --from-literal=azurestorageaccountkey=$ 20 | 21 | Mount the Volume into an SDC deployment and the stage libs will be accessible to SDC Pods. 22 | 23 | Here is an example manifest for a Control Hub-based deployment that mounts the contents of the Azure File Share's /streamsets-libs directory to SDC's /streamsets-libs directory: 24 | 25 | apiVersion: apps/v1 26 | kind: Deployment 27 | metadata: 28 | name: sdc 29 | spec: 30 | replicas: 1 31 | selector: 32 | matchLabels: 33 | app: sdc 34 | template: 35 | metadata: 36 | labels: 37 | app: sdc 38 | spec: 39 | containers: 40 | - name: sdc 41 | image: streamsets/datacollector:latest 42 | ports: 43 | - containerPort: 18630 44 | env: 45 | volumeMounts: 46 | - name: streamsets-libs 47 | mountPath: /opt/streamsets-datacollector-3.16.1/streamsets-libs 48 | volumes: 49 | - name: streamsets-libs 50 | azureFile: 51 | secretName: azure-secret 52 | shareName: markaksshare/streamsets-libs 53 | readOnly: true 54 | 55 | 56 | After starting a deployment, one can inspect the installed stage libs in the deployed SDCs.
For example, here are Pods for two instances of SDC: 57 | 58 | $ kubectl get pods | grep sdc 59 | sdc-69bd8dcc78-x2bs2 1/1 Running 0 17m 60 | sdc-69bd8dcc78-z9wtj 1/1 Running 0 17m 61 | 62 | Run an exec command in one of the SDC's containers to see the installed stage-libs: 63 | 64 | $ kubectl exec -it sdc-69bd8dcc78-x2bs2 -- bash -c "ls /opt/streamsets-*/streamsets-libs" 65 | streamsets-datacollector-aws-lib 66 | streamsets-datacollector-basic-lib 67 | streamsets-datacollector-bigtable-lib 68 | streamsets-datacollector-databricks-lib 69 | streamsets-datacollector-dataformats-lib 70 | streamsets-datacollector-dev-lib 71 | streamsets-datacollector-google-cloud-lib 72 | streamsets-datacollector-groovy_2_4-lib 73 | streamsets-datacollector-jdbc-lib 74 | streamsets-datacollector-jms-lib 75 | streamsets-datacollector-jython_2_7-lib 76 | streamsets-datacollector-oracle-lib 77 | streamsets-datacollector-snowflake-lib 78 | streamsets-datacollector-sql-server-bdc-lib 79 | streamsets-datacollector-stats-lib 80 | streamsets-datacollector-windows-lib 81 | 82 | This technique will also work for any other resources that need to be shared with SDCs including hadoop-config files, keystores and truststores, etc... 83 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/3-volumes/get-stage-libs.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # This script will download SDC stage libs and enterprise stage libs 4 | # and write them to a directory named streamsets-libs 5 | 6 | SDC_VERSION=3.16.1 7 | 8 | BASE_URL=https://archives.streamsets.com/datacollector 9 | 10 | # A space separated list of stage libs to download 11 | SDC_STAGE_LIBS="streamsets-datacollector-aws-lib streamsets-datacollector-basic-lib streamsets-datacollector-bigtable-lib streamsets-datacollector-dataformats-lib streamsets-datacollector-dev-lib streamsets-datacollector-google-cloud-lib streamsets-datacollector-groovy_2_4-lib streamsets-datacollector-jdbc-lib streamsets-datacollector-jms-lib streamsets-datacollector-jython_2_7-lib streamsets-datacollector-stats-lib streamsets-datacollector-windows-lib" 12 | 13 | # A space separated list of enterprise stage libs to download 14 | SDC_ENTERPRISE_STAGE_LIBS="streamsets-datacollector-databricks-lib-1.0.0 streamsets-datacollector-snowflake-lib-1.4.0 streamsets-datacollector-oracle-lib-1.2.0 streamsets-datacollector-sql-server-bdc-lib-1.0.1" 15 | 16 | # Use a tmp directory to unpack the downloaded stage libs 17 | mkdir -p tmp-stage-libs 18 | cd tmp-stage-libs 19 | 20 | # Download and extract stage libs 21 | for s in $SDC_STAGE_LIBS; 22 | do 23 | wget ${BASE_URL}/${SDC_VERSION}/tarball/${s}-${SDC_VERSION}.tgz; 24 | tar -xvf ${s}-${SDC_VERSION}.tgz; 25 | rm ${s}-${SDC_VERSION}.tgz; 26 | done 27 | 28 | # Download and extract enterprise stage libs 29 | cd streamsets-datacollector-${SDC_VERSION} 30 | for s in $SDC_ENTERPRISE_STAGE_LIBS; 31 | do 32 | wget ${BASE_URL}/latest/tarball/enterprise/${s}.tgz; 33 | tar -xvf ${s}.tgz; 34 | rm -rf ${s}.tgz; 35 | done 36 | 37 | cd ../.. 38 | 39 | # move all the stage libs to the ./streamsets-libs dir 40 | mv tmp-stage-libs/streamsets-datacollector-${SDC_VERSION}/streamsets-libs . 
41 | 42 | # remove tmp dir 43 | rm -rf tmp-stage-libs -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/3-volumes/images/azure-file-share.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-kubernetes-deployment/3-volumes/images/azure-file-share.png -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/3-volumes/sdc.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: sdc 5 | spec: 6 | replicas: 1 7 | selector: 8 | matchLabels: 9 | app: sdc 10 | template: 11 | metadata: 12 | labels: 13 | app: sdc 14 | spec: 15 | containers: 16 | - name: sdc 17 | image: streamsets/datacollector:latest 18 | ports: 19 | - containerPort: 18630 20 | env: 21 | volumeMounts: 22 | - name: streamsets-libs 23 | mountPath: /opt/streamsets-datacollector-3.16.1/streamsets-libs 24 | volumes: 25 | - name: streamsets-libs 26 | azureFile: 27 | secretName: azure-secret 28 | shareName: markaksshare/streamsets-libs 29 | readOnly: true -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/4-persistent-volumes/images/port-forward.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-kubernetes-deployment/4-persistent-volumes/images/port-forward.png -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/4-persistent-volumes/images/sdcs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-kubernetes-deployment/4-persistent-volumes/images/sdcs.png -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/4-persistent-volumes/sdc-stage-libs-configmap.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: ConfigMap 3 | metadata: 4 | name: sdc-stage-libs-list 5 | data: 6 | sdc-stage-libs: | 7 | streamsets-datacollector-aws-lib 8 | streamsets-datacollector-basic-lib 9 | streamsets-datacollector-bigtable-lib 10 | streamsets-datacollector-dataformats-lib 11 | streamsets-datacollector-dev-lib 12 | streamsets-datacollector-google-cloud-lib 13 | streamsets-datacollector-groovy_2_4-lib 14 | streamsets-datacollector-jdbc-lib 15 | streamsets-datacollector-jms-lib 16 | streamsets-datacollector-jython_2_7-lib 17 | streamsets-datacollector-stats-lib 18 | streamsets-datacollector-windows-lib 19 | sdc-enterprise-stage-libs: | 20 | streamsets-datacollector-databricks-lib-1.0.0 21 | streamsets-datacollector-snowflake-lib-1.4.0 22 | streamsets-datacollector-oracle-lib-1.2.0 23 | streamsets-datacollector-sql-server-bdc-lib-1.0.1 -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/4-persistent-volumes/sdc-stage-libs-job.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: batch/v1 2 | kind: Job 3 | metadata: 4 | name: sdc-stage-libs-job 5 | spec: 6 | template: 7 | metadata: 8 | name: 
sdc-stage-libs-job 9 | spec: 10 | containers: 11 | - name: sdc-stage-libs-job 12 | image: busybox:latest 13 | command: ["/bin/sh", "-c"] 14 | args: 15 | - > 16 | BASE_URL=https://archives.streamsets.com/datacollector 17 | && SDC_VERSION=3.16.1 18 | && for s in `cat /tmp/sdc-stage-libs`; 19 | do 20 | wget ${BASE_URL}/${SDC_VERSION}/tarball/${s}-${SDC_VERSION}.tgz; 21 | tar -xvf ${s}-${SDC_VERSION}.tgz -C /tmp; 22 | done 23 | && cp -R /tmp/streamsets-datacollector-${SDC_VERSION}/streamsets-libs/streamsets-* /streamsets-libs 24 | && for s in `cat /tmp/sdc-enterprise-stage-libs`; 25 | do 26 | wget ${BASE_URL}/latest/tarball/enterprise/${s}.tgz; 27 | tar -xvf ${s}.tgz -C /tmp; 28 | done 29 | && cp -R /tmp/streamsets-libs/streamsets-* /streamsets-libs 30 | && rm -rf /streamsets-libs/lost+found 31 | && echo 'Here is a listing of the /streamsets-libs dir' 32 | && ls -l /streamsets-libs 33 | volumeMounts: 34 | - name: sdc-stage-libs-list 35 | mountPath: /tmp/sdc-stage-libs 36 | subPath: sdc-stage-libs 37 | - name: sdc-stage-libs-list 38 | mountPath: /tmp/sdc-enterprise-stage-libs 39 | subPath: sdc-enterprise-stage-libs 40 | - name: sdc-stage-libs-pvc 41 | mountPath: /streamsets-libs 42 | restartPolicy: Never 43 | volumes: 44 | - name: sdc-stage-libs-list 45 | configMap: 46 | name: sdc-stage-libs-list 47 | items: 48 | - key: sdc-stage-libs 49 | path: sdc-stage-libs 50 | - key: sdc-enterprise-stage-libs 51 | path: sdc-enterprise-stage-libs 52 | - name: sdc-stage-libs-pvc 53 | persistentVolumeClaim: 54 | claimName: sdc-stage-libs-pvc -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/4-persistent-volumes/sdc-stage-libs-pvc.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: PersistentVolumeClaim 3 | metadata: 4 | name: sdc-stage-libs-pvc 5 | spec: 6 | accessModes: 7 | - ReadWriteOnce 8 | storageClassName: sdc-stage-libs-sc 9 | resources: 10 | requests: 11 | storage: 5Gi 12 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/4-persistent-volumes/sdc-stage-libs-sc.yaml: -------------------------------------------------------------------------------- 1 | kind: StorageClass 2 | apiVersion: storage.k8s.io/v1 3 | metadata: 4 | name: sdc-stage-libs-sc 5 | provisioner: kubernetes.io/azure-file 6 | mountOptions: 7 | - dir_mode=0777 8 | - file_mode=0777 9 | - uid=0 10 | - gid=0 11 | - mfsymlinks 12 | - cache=strict 13 | parameters: 14 | skuName: Standard_LRS 15 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/4-persistent-volumes/sdc.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: sdc 5 | spec: 6 | selector: 7 | matchLabels: 8 | app: sdc 9 | template: 10 | metadata: 11 | labels: 12 | app: sdc 13 | spec: 14 | containers: 15 | - name: sdc 16 | image: streamsets/datacollector:3.16.1 17 | ports: 18 | - containerPort: 18630 19 | env: 20 | - name: SDC_JAVA_OPTS 21 | value: "-Xmx2g -Xms2g" 22 | volumeMounts: 23 | - name: sdc-stage-libs 24 | mountPath: /opt/streamsets-datacollector-3.16.1/streamsets-libs 25 | readOnly: true 26 | volumes: 27 | - name: sdc-stage-libs 28 | persistentVolumeClaim: 29 | claimName: sdc-stage-libs-pvc 30 | --------------------------------------------------------------------------------
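The manifests above are meant to be applied in order: the StorageClass and PVC first, then the configMap and the Job that populates the volume, and finally the SDC Deployment that mounts it. A minimal sketch of that sequence (assuming the filenames above, and waiting for the Job to finish before starting SDC):

    $ kubectl apply -f sdc-stage-libs-sc.yaml
    $ kubectl apply -f sdc-stage-libs-pvc.yaml
    $ kubectl apply -f sdc-stage-libs-configmap.yaml
    $ kubectl apply -f sdc-stage-libs-job.yaml
    $ kubectl wait --for=condition=complete job/sdc-stage-libs-job --timeout=600s
    $ kubectl logs job/sdc-stage-libs-job    # prints the listing of /streamsets-libs
    $ kubectl apply -f sdc.yaml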
/tutorial-kubernetes-deployment/5-sdc-properties-configmap-1/README.md: -------------------------------------------------------------------------------- 1 | ### Loading sdc.properties from a ConfigMap 2 | 3 | An approach that offers greater flexibility than "baking-in" the sdc.properties file (as in the [Custom Docker Image example](../2-custom-docker-image)) is to dynamically mount an sdc.properties file at deployment time. One way to do that is to store an sdc.properties file in a configMap and to Volume Mount the configMap into the SDC container, overwriting the default sdc.properties file packaged with the image. 4 | 5 | The configMap's representation of sdc.properties will be read-only, so one can't use any SDC_CONF_ prefixed environment variables in the SDC deployment (see [this note](../NoteOnEnvVars.md)); all custom property values for properties defined in sdc.properties need to be set in the configMap (though one can still set SDC_JAVA_OPTS in the environment as that is a "pure" environment variable used by SDC). 6 | 7 | This example uses one monolithic sdc.properties file stored in a single configMap (see the example [here](../6-sdc-properties-configmap-2) for a more modular approach). 8 | 9 | Start by copying a clean sdc.properties file to a local working directory. Edit the property values you want for a given deployment. For example, I will edit these properties within the file (setting http.realm.file.permission.check=false avoids permission issues; note that Java properties files do not support trailing comments, so keep any comments on their own lines): 10 | 11 | sdc.base.http.url=https://sequoia.onefoursix.com 12 | http.enable.forwarded.requests=true 13 | http.realm.file.permission.check=false 14 | production.maxBatchSize=20000 15 | 16 | Save the edited sdc.properties file in a configMap named sdc-properties by executing the command: 17 | 18 | $ kubectl create configmap sdc-properties --from-file=sdc.properties 19 | 20 | Create the configMap prior to starting the SDC deployment. 21 | 22 | Add the configMap as a Volume in your SDC deployment manifest like this: 23 | 24 | volumes: 25 | - name: sdc-properties 26 | configMap: 27 | name: sdc-properties 28 | 29 | And add a Volume Mount to the SDC container, to overwrite the sdc.properties file: 30 | 31 | volumeMounts: 32 | - name: sdc-properties 33 | mountPath: /etc/sdc/sdc.properties 34 | subPath: sdc.properties 35 | 36 | Create and start a Control Hub deployment using those settings and confirm the expected values are set in sdc.properties. 37 | 38 | For example: 39 | 40 | $ kubectl get pods | grep sdc 41 | sdc-59d44698b9-kvd7n 1/1 Running 0 2m54s 42 | 43 | $ kubectl exec -it sdc-59d44698b9-kvd7n -- sh -c 'grep 'production.maxBatchSize' /etc/sdc/sdc.properties' 44 | production.maxBatchSize=20000 45 | 46 | See [sdc.yaml](sdc.yaml) for an example manifest.
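If you later edit sdc.properties, one way to roll the change out (assuming the Deployment is named sdc, as in the example manifest, and a reasonably recent kubectl) is to regenerate the configMap in place and restart the Pods so they pick up the new file:

    $ kubectl create configmap sdc-properties --from-file=sdc.properties \
          --dry-run=client -o yaml | kubectl apply -f -
    $ kubectl rollout restart deployment/sdc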
47 | 48 | 49 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/5-sdc-properties-configmap-1/sdc.yaml: -------------------------------------------------------------------------------- 1 | 2 | apiVersion: apps/v1 3 | kind: Deployment 4 | metadata: 5 | name: sdc 6 | labels: 7 | app: sdc 8 | spec: 9 | replicas: 1 10 | selector: 11 | matchLabels: 12 | app: sdc 13 | template: 14 | metadata: 15 | labels: 16 | app: sdc 17 | spec: 18 | containers: 19 | - name: sdc 20 | image: streamsets/datacollector:latest 21 | ports: 22 | - containerPort: 18630 23 | volumeMounts: 24 | - name: sdc-properties 25 | mountPath: /etc/sdc/sdc.properties 26 | subPath: sdc.properties 27 | volumes: 28 | - name: sdc-properties 29 | configMap: 30 | name: sdc-properties 31 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/6-sdc-properties-configmap-2/README.md: -------------------------------------------------------------------------------- 1 | ### Loading static and dynamic sdc.properties from separate ConfigMaps 2 | 3 | This example splits the monolithic sdc.properties file used in the [previous example](../5-sdc-properties-configmap-1) into two configMaps: one for properties that rarely if ever change (and that can be reused across multiple deployments), and one for dynamic properties targeted at a specific deployment. 4 | 5 | Similar to the previous example, start by copying a clean sdc.properties file to a local working directory. 6 | 7 | Within that file, edit properties with values that will rarely change, and comment out properties that will need to be set for specific deployments. For example, I'll set these two property values within the file: 8 | 9 | http.realm.file.permission.check=false 10 | http.enable.forwarded.requests=true 11 | 12 | And I will comment out these two properties which I want to set specifically for a given deployment: 13 | 14 | # sdc.base.http.url=http://: 15 | # production.maxBatchSize=1000 16 | 17 | One final setting: append the filename sdc-dynamic.properties to the config.includes property in the sdc.properties file, like this: 18 | 19 | config.includes=dpm.properties,vault.properties,credential-stores.properties,sdc-dynamic.properties 20 | 21 | That setting will load the dynamic properties described below. 22 | 23 | Save the sdc.properties file in a configMap named sdc-static-properties by executing the command: 24 | 25 | $ kubectl create configmap sdc-static-properties --from-file=sdc.properties 26 | 27 | Once again, the configMap sdc-static-properties can be reused across multiple deployments.
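To double-check what landed in the configMap before deploying, you can read the key back out (the backslash escapes the dots in the sdc.properties key name for kubectl's jsonpath):

    $ kubectl get configmap sdc-static-properties \
          -o jsonpath='{.data.sdc\.properties}' | grep config.includes
    config.includes=dpm.properties,vault.properties,credential-stores.properties,sdc-dynamic.properties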
28 | 29 | Next, create a manifest named sdc-dynamic-properties.yaml that will contain only properties specific to a given deployment. For example, my sdc-dynamic-properties.yaml contains these two properties: 30 | 31 | apiVersion: v1 32 | kind: ConfigMap 33 | metadata: 34 | name: sdc-dynamic-properties 35 | data: 36 | sdc-dynamic.properties: | 37 | sdc.base.http.url=https://sequoia.onefoursix.com 38 | production.maxBatchSize=50000 39 | 40 | Create the configMap by executing the command: 41 | 42 | $ kubectl apply -f sdc-dynamic-properties.yaml 43 | 44 | Add two Volumes to your SDC deployment manifest like this: 45 | 46 | volumes: 47 | - name: sdc-static-properties 48 | configMap: 49 | name: sdc-static-properties 50 | items: 51 | - key: sdc.properties 52 | path: sdc.properties 53 | - name: sdc-dynamic-properties 54 | configMap: 55 | name: sdc-dynamic-properties 56 | items: 57 | - key: sdc-dynamic.properties 58 | path: sdc-dynamic.properties 59 | 60 | And add two Volume Mounts to the SDC container, the first to overwrite the sdc.properties file and the second to add the referenced sdc-dynamic.properties file: 61 | 62 | volumeMounts: 63 | - name: sdc-static-properties 64 | mountPath: /etc/sdc/sdc.properties 65 | subPath: sdc.properties 66 | - name: sdc-dynamic-properties 67 | mountPath: /etc/sdc/sdc-dynamic.properties 68 | subPath: sdc-dynamic.properties 69 | 70 | See [sdc.yaml](sdc.yaml) for an example manifest. 71 | 72 | Use [kubectl port-forward](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/) to confirm that the dynamically set properties are picked up by SDC: 73 | 74 | sdc-config.png 75 | 76 | 77 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/6-sdc-properties-configmap-2/images/sdc-config.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-kubernetes-deployment/6-sdc-properties-configmap-2/images/sdc-config.png -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/6-sdc-properties-configmap-2/sdc-dynamic-properties.yaml: -------------------------------------------------------------------------------- 1 | 2 | apiVersion: v1 3 | kind: ConfigMap 4 | metadata: 5 | name: sdc-dynamic-properties 6 | data: 7 | sdc-dynamic.properties: | 8 | sdc.base.http.url=https:[:] 9 | production.maxBatchSize=50000 -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/6-sdc-properties-configmap-2/sdc.yaml: -------------------------------------------------------------------------------- 1 | 2 | apiVersion: apps/v1 3 | kind: Deployment 4 | metadata: 5 | name: sdc 6 | labels: 7 | app: sdc 8 | spec: 9 | replicas: 1 10 | selector: 11 | matchLabels: 12 | app: sdc 13 | template: 14 | metadata: 15 | labels: 16 | app: sdc 17 | spec: 18 | containers: 19 | - name: sdc 20 | image: streamsets/datacollector:latest 21 | ports: 22 | - containerPort: 18630 23 | env: 24 | - name: SDC_JAVA_OPTS 25 | value: "-Xmx4g -Xms4g" 26 | volumeMounts: 27 | - name: sdc-static-properties 28 | mountPath: /etc/sdc/sdc.properties 29 | subPath: sdc.properties 30 | - name: sdc-dynamic-properties 31 | mountPath: /etc/sdc/sdc-dynamic.properties 32 | subPath: sdc-dynamic.properties 33 | volumes: 34 | - name: sdc-static-properties 35 | configMap: 36 | name:
sdc-static-properties 37 | items: 38 | - key: sdc.properties 39 | path: sdc.properties 40 | - name: sdc-dynamic-properties 41 | configMap: 42 | name: sdc-dynamic-properties 43 | items: 44 | - key: sdc-dynamic.properties 45 | path: sdc-dynamic.properties 46 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/7-credential-stores/README.md: -------------------------------------------------------------------------------- 1 | ### Loading credential-stores.properties from a Secret 2 | 3 | This example shows how to load a credential-stores.properties file from a Secret. This technique is useful if you have different credential stores in different environments (for example, Dev, QA, Prod) and want each environment's SDCs to automatically load the appropriate settings. 4 | 5 | Start by creating a credential-stores.properties file. For example, a credential-stores.properties file used for Azure Key Vault might look like this: 6 | 7 | credentialStores=azure 8 | credentialStore.azure.def=streamsets-datacollector-azure-keyvault-credentialstore-lib::com_streamsets_datacollector_credential_azure_keyvault_AzureKeyVaultCredentialStore 9 | credentialStore.azure.config.credential.refresh.millis=30000 10 | credentialStore.azure.config.credential.retry.millis=15000 11 | credentialStore.azure.config.vault.url=https://mykeyvault.vault.azure.net/ 12 | credentialStore.azure.config.client.id=[redacted] 13 | credentialStore.azure.config.client.key=[redacted] 14 | 15 | Store the credential-stores.properties file in a Secret; I'll name my secret azure-key-vault-credential-store: 16 | 17 | $ kubectl create secret generic azure-key-vault-credential-store --from-file=credential-stores.properties 18 | 19 | In your SDC deployment manifest, create a Volume for the Secret: 20 | 21 | volumes: 22 | - name: azure-key-vault-credential-store 23 | secret: 24 | secretName: azure-key-vault-credential-store 25 | 26 | And then create a Volume Mount that overwrites the default credential-stores.properties file: 27 | 28 | volumeMounts: 29 | - name: azure-key-vault-credential-store 30 | mountPath: /etc/sdc/credential-stores.properties 31 | subPath: credential-stores.properties 32 | 33 | See [sdc.yaml](sdc.yaml) for an example manifest. 34 | 35 | Make sure to load the Azure Key Vault Credentials Store stage library in your deployment in order to run this example.
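One way to use this across environments is to keep a differently-configured copy of credential-stores.properties per environment and create the same-named Secret in each environment's namespace (the dev and qa namespace names here are just illustrative), so a single deployment manifest works everywhere:

    $ kubectl create secret generic azure-key-vault-credential-store \
          --from-file=credential-stores.properties --namespace=dev
    $ kubectl create secret generic azure-key-vault-credential-store \
          --from-file=credential-stores.properties --namespace=qa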
36 | 37 | 38 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/7-credential-stores/credential-stores.properties: -------------------------------------------------------------------------------- 1 | credentialStores=azure 2 | credentialStore.azure.def=streamsets-datacollector-azure-keyvault-credentialstore-lib::com_streamsets_datacollector_credential_azure_keyvault_AzureKeyVaultCredentialStore 3 | credentialStore.azure.config.credential.refresh.millis=30000 4 | credentialStore.azure.config.credential.retry.millis=15000 5 | credentialStore.azure.config.vault.url=https://mykeyvault.vault.azure.net/ 6 | credentialStore.azure.config.client.id=[redacted] 7 | credentialStore.azure.config.client.key=[redacted] -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/7-credential-stores/sdc.yaml: -------------------------------------------------------------------------------- 1 | 2 | apiVersion: apps/v1 3 | kind: Deployment 4 | metadata: 5 | name: sdc 6 | labels: 7 | app: sdc 8 | spec: 9 | replicas: 1 10 | selector: 11 | matchLabels: 12 | app: sdc 13 | template: 14 | metadata: 15 | labels: 16 | app: sdc 17 | spec: 18 | containers: 19 | - name: sdc 20 | image: 21 | ports: 22 | - containerPort: 18630 23 | env: 24 | - name: SDC_JAVA_OPTS 25 | value: "-Xmx4g -Xms4g" 26 | volumeMounts: 27 | - name: azure-key-vault-credential-store 28 | mountPath: /etc/sdc/credential-stores.properties 29 | subPath: credential-stores.properties 30 | volumes: 31 | - name: azure-key-vault-credential-store 32 | secret: 33 | secretName: azure-key-vault-credential-store 34 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/8-ingress/README.md: -------------------------------------------------------------------------------- 1 | ### Ingress 2 | 3 | If an [Ingress Controller](https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/) has been deployed, one can include [Service](https://kubernetes.io/docs/concepts/services-networking/service/) and [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) resources within a Control Hub-based deployment. This allows end users to reach the SDC UI over HTTP or HTTPS. An SDC reachable over HTTPS can serve as an [Authoring Data Collector](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/DataCollectors/PDesigner_AuthoringSDC.html?hl=authoring%2Cdata%2Ccollectors). 4 | 5 | If the SDC's sdc.properties file is packaged within the SDC image, or is mounted with read/write permissions on an appropriate Volume, one can set these two environment variables within the SDC deployment manifest's container env section, the first of which specifies the URL SDC will be reachable at: 6 | 7 | - name: SDC_CONF_SDC_BASE_HTTP_URL 8 | value: https://[:] 9 | 10 | - name: SDC_CONF_HTTP_ENABLE_FORWARDED_REQUESTS 11 | value: "true" 12 | 13 | If sdc.properties is mounted with read-only permissions, these two properties may be set in a configMap as shown in the previous examples. 14 | 15 | See [sdc.yaml](sdc.yaml) for an example manifest that includes an SDC Deployment, Service and Ingress that will allow SDC to be reached at the base URL of the Ingress Controller.
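The Ingress examples that follow all reference a TLS Secret named streamsets-tls. Assuming you already have a certificate and key on hand (the file names here are placeholders), that Secret can be created with:

    $ kubectl create secret tls streamsets-tls --cert=tls.crt --key=tls.key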
16 | 17 | One can also use a single Ingress Controller to route traffic to multiple SDCs, using either [host-based](https://kubernetes.github.io/ingress-nginx/user-guide/basic-usage/) or [path-based](https://kubernetes.github.io/ingress-nginx/user-guide/ingress-path-matching/) ingress routing rules. 18 | 19 | The following two examples were tested using the [ingress-nginx](https://kubernetes.github.io/ingress-nginx/) Ingress Controller. 20 | 21 | #### Host-based Routing 22 | 23 | In this example, three SDC deployments each use a different hostname as their base URL, with all three hostnames mapped in DNS to the same address (the front end of the Ingress Controller), and host-based routing rules map requests to the appropriate SDC. This approach requires permissions to add alias records to the domain's DNS (the next example does not need any additional permissions). 24 | 25 | In the deployment manifests for these three SDCs, sdc1 has this value: 26 | 27 | - name: SDC_CONF_SDC_BASE_HTTP_URL 28 | value: https://sdc1.onefoursix.com 29 | 30 | sdc2 has this value: 31 | 32 | - name: SDC_CONF_SDC_BASE_HTTP_URL 33 | value: https://sdc2.onefoursix.com 34 | 35 | sdc3 has this value: 36 | 37 | - name: SDC_CONF_SDC_BASE_HTTP_URL 38 | value: https://sdc3.onefoursix.com 39 | 40 | These three host names must be added as DNS Aliases that all point to the external IP of the Load Balancer (by a DNS admin for the domain). 41 | 42 | Each SDC has its own Service that specifies a unique NodePort and an Ingress with a host rule. Here is the Ingress for sdc1 with a rule that ensures that requests with the hostname sdc1.onefoursix.com are routed to the sdc1 Service: 43 | 44 | 45 | apiVersion: extensions/v1beta1 46 | kind: Ingress 47 | metadata: 48 | name: sdc1 49 | annotations: 50 | kubernetes.io/ingress.class: nginx 51 | spec: 52 | tls: 53 | - hosts: 54 | - sdc1.onefoursix.com 55 | secretName: streamsets-tls 56 | rules: 57 | - host: sdc1.onefoursix.com 58 | http: 59 | paths: 60 | - path: / 61 | backend: 62 | serviceName: sdc1 63 | servicePort: 18635 64 | 65 | 66 | Example manifests for three SDCs that use Host-based routing are in the directory [here](host-based-routing). 67 | 68 | 69 | #### Path-based Routing 70 | 71 | Path-based routing relies on a single DNS name for the Ingress, and each SDC will have a unique path appended to the Ingress Controller's base URL. In the deployment manifests for three SDCs using path-based routing: 72 | 73 | sdc1 has this value: 74 | 75 | - name: SDC_CONF_SDC_BASE_HTTP_URL 76 | value: https://saturn.onefoursix.com/sdc1/ 77 | 78 | sdc2 has this value: 79 | 80 | - name: SDC_CONF_SDC_BASE_HTTP_URL 81 | value: https://saturn.onefoursix.com/sdc2/ 82 | 83 | sdc3 has this value: 84 | 85 | - name: SDC_CONF_SDC_BASE_HTTP_URL 86 | value: https://saturn.onefoursix.com/sdc3/ 87 | 88 | 89 | Ingress is defined using a regular expression to match the request path along with a [rewrite-target](https://kubernetes.github.io/ingress-nginx/examples/rewrite/#rewrite-target) annotation.
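In the rule shown in the example below, a request for /sdc1/some/page matches the path /sdc1(/|$)(.*), the second capture group ($2) holds some/page, and the rewrite-target annotation forwards it to the sdc1 Service as /some/page. A quick sanity check from outside the cluster might look like this (hostname taken from the examples above; -k skips certificate verification for a self-signed cert):

    $ curl -k https://saturn.onefoursix.com/sdc1/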
90 | 91 | Here is an example of Ingress for sdc1 using path-based routing: 92 | 93 | - apiVersion: extensions/v1beta1 94 | kind: Ingress 95 | metadata: 96 | name: sdc1 97 | namespace: ns1 98 | annotations: 99 | kubernetes.io/ingress.class: nginx 100 | nginx.ingress.kubernetes.io/ssl-redirect: "false" 101 | nginx.ingress.kubernetes.io/rewrite-target: /$2 102 | spec: 103 | tls: 104 | - hosts: 105 | - saturn.onefoursix.com 106 | secretName: streamsets-tls 107 | rules: 108 | - host: saturn.onefoursix.com 109 | http: 110 | paths: 111 | - path: /sdc1(/|$)(.*) 112 | backend: 113 | serviceName: sdc1 114 | servicePort: 18635 115 | 116 | 117 | Here is an example of sdc1's UI reached using path-based routing: 118 | 119 | path-based-routing 120 | 121 | Example manifests for three SDCs that use path-based routing are in the directory [here](path-based-routing). 122 | -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/8-ingress/host-based-routing/sdc1.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: List 3 | items: 4 | - apiVersion: apps/v1 5 | kind: Deployment 6 | metadata: 7 | name: sdc1 8 | labels: 9 | app: sdc1 10 | spec: 11 | replicas: 1 12 | selector: 13 | matchLabels: 14 | app: sdc1 15 | template: 16 | metadata: 17 | labels: 18 | app: sdc1 19 | spec: 20 | containers: 21 | - name: sdc1 22 | image: streamsets/datacollector:latest 23 | ports: 24 | - containerPort: 18630 25 | env: 26 | - name: SDC_JAVA_OPTS 27 | value: "-Xmx4g -Xms4g" 28 | - name: SDC_CONF_SDC_BASE_HTTP_URL 29 | value: https://sdc1.onefoursix.com 30 | - name: SDC_CONF_HTTP_ENABLE_FORWARDED_REQUESTS 31 | value: "true" 32 | - apiVersion: v1 33 | kind: Service 34 | metadata: 35 | name: sdc1 36 | labels: 37 | app: sdc1 38 | spec: 39 | type: NodePort 40 | ports: 41 | - name: http 42 | port: 18635 43 | targetPort: 18630 44 | protocol: TCP 45 | selector: 46 | app: sdc1 47 | - apiVersion: extensions/v1beta1 48 | kind: Ingress 49 | metadata: 50 | name: sdc1 51 | annotations: 52 | kubernetes.io/ingress.class: nginx 53 | spec: 54 | tls: 55 | - hosts: 56 | - sdc1.onefoursix.com 57 | secretName: streamsets-tls 58 | rules: 59 | - host: sdc1.onefoursix.com 60 | http: 61 | paths: 62 | - path: / 63 | backend: 64 | serviceName: sdc1 65 | servicePort: 18635 -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/8-ingress/host-based-routing/sdc2.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: List 3 | items: 4 | - apiVersion: apps/v1 5 | kind: Deployment 6 | metadata: 7 | name: sdc2 8 | labels: 9 | app: sdc2 10 | spec: 11 | replicas: 1 12 | selector: 13 | matchLabels: 14 | app: sdc2 15 | template: 16 | metadata: 17 | labels: 18 | app: sdc2 19 | spec: 20 | containers: 21 | - name: sdc2 22 | image: streamsets/datacollector:latest 23 | ports: 24 | - containerPort: 18630 25 | env: 26 | - name: SDC_JAVA_OPTS 27 | value: "-Xmx4g -Xms4g" 28 | - name: SDC_CONF_SDC_BASE_HTTP_URL 29 | value: https://sdc2.onefoursix.com 30 | - name: SDC_CONF_HTTP_ENABLE_FORWARDED_REQUESTS 31 | value: "true" 32 | - apiVersion: v1 33 | kind: Service 34 | metadata: 35 | name: sdc2 36 | labels: 37 | app: sdc2 38 | spec: 39 | type: NodePort 40 | ports: 41 | - name: http 42 | port: 18636 43 | targetPort: 18630 44 | protocol: TCP 45 | selector: 46 | app: sdc2 47 | - apiVersion: extensions/v1beta1 48 | kind: Ingress 49 | metadata: 50 | name:
sdc2 51 | annotations: 52 | kubernetes.io/ingress.class: nginx 53 | spec: 54 | tls: 55 | - hosts: 56 | - sdc2.onefoursix.com 57 | secretName: streamsets-tls 58 | rules: 59 | - host: sdc2.onefoursix.com 60 | http: 61 | paths: 62 | - path: / 63 | backend: 64 | serviceName: sdc2 65 | servicePort: 18636 -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/8-ingress/host-based-routing/sdc3.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: List 3 | items: 4 | - apiVersion: apps/v1 5 | kind: Deployment 6 | metadata: 7 | name: sdc3 8 | labels: 9 | app: sdc3 10 | spec: 11 | replicas: 1 12 | selector: 13 | matchLabels: 14 | app: sdc3 15 | template: 16 | metadata: 17 | labels: 18 | app: sdc3 19 | spec: 20 | containers: 21 | - name: sdc3 22 | image: streamsets/datacollector:latest 23 | ports: 24 | - containerPort: 18630 25 | env: 26 | - name: SDC_JAVA_OPTS 27 | value: "-Xmx4g -Xms4g" 28 | - name: SDC_CONF_SDC_BASE_HTTP_URL 29 | value: https://sdc3.onefoursix.com 30 | - name: SDC_CONF_HTTP_ENABLE_FORWARDED_REQUESTS 31 | value: "true" 32 | - apiVersion: v1 33 | kind: Service 34 | metadata: 35 | name: sdc3 36 | labels: 37 | app: sdc3 38 | spec: 39 | type: NodePort 40 | ports: 41 | - name: http 42 | port: 18637 43 | targetPort: 18630 44 | protocol: TCP 45 | selector: 46 | app: sdc3 47 | - apiVersion: extensions/v1beta1 48 | kind: Ingress 49 | metadata: 50 | name: sdc3 51 | annotations: 52 | kubernetes.io/ingress.class: nginx 53 | spec: 54 | tls: 55 | - hosts: 56 | - sdc3.onefoursix.com 57 | secretName: streamsets-tls 58 | rules: 59 | - host: sdc3.onefoursix.com 60 | http: 61 | paths: 62 | - path: / 63 | backend: 64 | serviceName: sdc3 65 | servicePort: 18637 -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/8-ingress/images/path-based-routing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-kubernetes-deployment/8-ingress/images/path-based-routing.png -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/8-ingress/path-based-routing/sdc1.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: List 3 | items: 4 | - apiVersion: apps/v1 5 | kind: Deployment 6 | metadata: 7 | name: sdc1 8 | namespace: ns1 9 | labels: 10 | app: sdc1 11 | spec: 12 | replicas: 1 13 | selector: 14 | matchLabels: 15 | app: sdc1 16 | template: 17 | metadata: 18 | labels: 19 | app: sdc1 20 | spec: 21 | containers: 22 | - name: sdc1 23 | image: streamsets/datacollector:latest 24 | ports: 25 | - containerPort: 18630 26 | env: 27 | - name: SDC_JAVA_OPTS 28 | value: "-Xmx4g -Xms4g" 29 | - name: SDC_CONF_SDC_BASE_HTTP_URL 30 | value: https://saturn.onefoursix.com/sdc1/ 31 | - name: SDC_CONF_HTTP_ENABLE_FORWARDED_REQUESTS 32 | value: "true" 33 | - apiVersion: v1 34 | kind: Service 35 | metadata: 36 | name: sdc1 37 | namespace: ns1 38 | labels: 39 | app: sdc1 40 | spec: 41 | type: NodePort 42 | ports: 43 | - name: http 44 | port: 18635 45 | targetPort: 18630 46 | protocol: TCP 47 | selector: 48 | app: sdc1 49 | - apiVersion: extensions/v1beta1 50 | kind: Ingress 51 | metadata: 52 | name: sdc1 53 | namespace: ns1 54 | annotations: 55 | kubernetes.io/ingress.class: nginx 56 |
nginx.ingress.kubernetes.io/ssl-redirect: "false" 57 | nginx.ingress.kubernetes.io/rewrite-target: /$2 58 | spec: 59 | tls: 60 | - hosts: 61 | - saturn.onefoursix.com 62 | secretName: streamsets-tls 63 | rules: 64 | - host: saturn.onefoursix.com 65 | http: 66 | paths: 67 | - path: /sdc1(/|$)(.*) 68 | backend: 69 | serviceName: sdc1 70 | servicePort: 18635 -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/8-ingress/path-based-routing/sdc2.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: List 3 | items: 4 | - apiVersion: apps/v1 5 | kind: Deployment 6 | metadata: 7 | name: sdc2 8 | namespace: ns1 9 | labels: 10 | app: sdc2 11 | spec: 12 | replicas: 1 13 | selector: 14 | matchLabels: 15 | app: sdc2 16 | template: 17 | metadata: 18 | labels: 19 | app: sdc2 20 | spec: 21 | containers: 22 | - name: sdc2 23 | image: streamsets/datacollector:latest 24 | ports: 25 | - containerPort: 18630 26 | env: 27 | - name: SDC_JAVA_OPTS 28 | value: "-Xmx4g -Xms4g" 29 | - name: SDC_CONF_SDC_BASE_HTTP_URL 30 | value: https://saturn.onefoursix.com/sdc2/ 31 | - name: SDC_CONF_HTTP_ENABLE_FORWARDED_REQUESTS 32 | value: "true" 33 | - apiVersion: v1 34 | kind: Service 35 | metadata: 36 | name: sdc2 37 | namespace: ns1 38 | labels: 39 | app: sdc2 40 | spec: 41 | type: NodePort 42 | ports: 43 | - name: http 44 | port: 18636 45 | targetPort: 18630 46 | protocol: TCP 47 | selector: 48 | app: sdc2 49 | - apiVersion: extensions/v1beta1 50 | kind: Ingress 51 | metadata: 52 | name: sdc2 53 | namespace: ns1 54 | annotations: 55 | kubernetes.io/ingress.class: nginx 56 | nginx.ingress.kubernetes.io/ssl-redirect: "false" 57 | nginx.ingress.kubernetes.io/rewrite-target: /$2 58 | spec: 59 | tls: 60 | - hosts: 61 | - saturn.onefoursix.com 62 | secretName: streamsets-tls 63 | rules: 64 | - host: saturn.onefoursix.com 65 | http: 66 | paths: 67 | - path: /sdc2(/|$)(.*) 68 | backend: 69 | serviceName: sdc2 70 | servicePort: 18636 -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/8-ingress/path-based-routing/sdc3.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: List 3 | items: 4 | - apiVersion: apps/v1 5 | kind: Deployment 6 | metadata: 7 | name: sdc3 8 | namespace: ns1 9 | labels: 10 | app: sdc3 11 | spec: 12 | replicas: 1 13 | selector: 14 | matchLabels: 15 | app: sdc3 16 | template: 17 | metadata: 18 | labels: 19 | app: sdc3 20 | spec: 21 | containers: 22 | - name: sdc3 23 | image: streamsets/datacollector:latest 24 | ports: 25 | - containerPort: 18630 26 | env: 27 | - name: SDC_JAVA_OPTS 28 | value: "-Xmx4g -Xms4g" 29 | - name: SDC_CONF_SDC_BASE_HTTP_URL 30 | value: "https://saturn.onefoursix.com/sdc3/" 31 | - name: SDC_CONF_HTTP_ENABLE_FORWARDED_REQUESTS 32 | value: "true" 33 | - apiVersion: v1 34 | kind: Service 35 | metadata: 36 | name: sdc3 37 | namespace: ns1 38 | labels: 39 | app: sdc3 40 | spec: 41 | type: NodePort 42 | ports: 43 | - name: http 44 | port: 18637 45 | targetPort: 18630 46 | protocol: TCP 47 | selector: 48 | app: sdc3 49 | - apiVersion: extensions/v1beta1 50 | kind: Ingress 51 | metadata: 52 | name: sdc3 53 | namespace: ns1 54 | annotations: 55 | kubernetes.io/ingress.class: nginx 56 | nginx.ingress.kubernetes.io/ssl-redirect: "false" 57 | nginx.ingress.kubernetes.io/rewrite-target: /$2 58 | spec: 59 | tls: 60 | - hosts: 61 | - saturn.onefoursix.com 62 |
secretName: streamsets-tls 63 | rules: 64 | - host: saturn.onefoursix.com 65 | http: 66 | paths: 67 | - path: /sdc3(/|$)(.*) 68 | backend: 69 | serviceName: sdc3 70 | servicePort: 18637 -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/8-ingress/sdc.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: List 3 | items: 4 | - apiVersion: apps/v1 5 | kind: Deployment 6 | metadata: 7 | name: sdc 8 | labels: 9 | app: sdc 10 | spec: 11 | replicas: 1 12 | selector: 13 | matchLabels: 14 | app: sdc 15 | template: 16 | metadata: 17 | labels: 18 | app: sdc 19 | spec: 20 | containers: 21 | - name: sdc 22 | image: streamsets/datacollector:latest 23 | ports: 24 | - containerPort: 18630 25 | env: 26 | - name: SDC_JAVA_OPTS 27 | value: "-Xmx4g -Xms4g" 28 | - name: SDC_CONF_SDC_BASE_HTTP_URL 29 | value: https://saturn.onefoursix.com 30 | - name: SDC_CONF_HTTP_ENABLE_FORWARDED_REQUESTS 31 | value: "true" 32 | - apiVersion: v1 33 | kind: Service 34 | metadata: 35 | name: sdc 36 | labels: 37 | app: sdc 38 | spec: 39 | type: NodePort 40 | ports: 41 | - name: http 42 | port: 18635 43 | targetPort: 18630 44 | protocol: TCP 45 | selector: 46 | app: sdc 47 | - apiVersion: extensions/v1beta1 48 | kind: Ingress 49 | metadata: 50 | name: sdc 51 | annotations: 52 | kubernetes.io/ingress.class: nginx 53 | spec: 54 | tls: 55 | - hosts: 56 | - saturn.onefoursix.com 57 | secretName: streamsets-tls 58 | rules: 59 | - host: 60 | http: 61 | paths: 62 | - path: / 63 | backend: 64 | serviceName: sdc 65 | servicePort: 18635 -------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/NoteOnEnvVars.md: -------------------------------------------------------------------------------- 1 | ### A note about Environment Variables in SDC deployment manifests 2 | Environment variables specified in SDC Deployment manifests that have the prefix SDC_CONF_, like SDC_CONF_SDC_BASE_HTTP_URL, can be used to dynamically set properties in the deployed SDC's sdc.properties file. 3 | 4 | These environment variables are mapped to SDC properties by trimming the SDC_CONF_ prefix, lowercasing the name and replacing "_" with ".". For example, the value set in the environment variable SDC_CONF_SDC_BASE_HTTP_URL will be set in sdc.properties as the property sdc.base.http.url. 5 | 6 | However, due to the current behavior of the SDC image's docker-entrypoint.sh script, this technique is not able to set mixed-case properties in sdc.properties like production.maxBatchSize. This limitation may be lifted in the near future. 7 | 8 | In the meantime, if mixed-case SDC properties need to be set, they can either be set in an sdc.properties file packaged in a custom SDC image, as in the [Custom Docker Image example](2-custom-docker-image), or loaded from a configmap as shown in the examples [here](5-sdc-properties-configmap-1) and [here](6-sdc-properties-configmap-2). 9 | 10 | It's also worth noting that values for environment variables with the prefix SDC_CONF_ are written to the sdc.properties file by the SDC container's docker-entrypoint.sh script; this requires the SDC container to have read/write access to the sdc.properties file, which it may not have if sdc.properties is mounted read-only. 11 | 12 | Best practice for now is to mount sdc.properties from a configmap and to avoid using SDC_CONF_-prefixed environment variables.
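
To make the documented mapping concrete, here is a minimal shell sketch of the transformation (the variable name is just an illustration):

```
# Derive the sdc.properties key for an SDC_CONF_-prefixed environment variable:
# trim the SDC_CONF_ prefix, lowercase the rest, and replace "_" with ".".
env_var="SDC_CONF_SDC_BASE_HTTP_URL"
prop_key=$(echo "${env_var#SDC_CONF_}" | tr '[:upper:]' '[:lower:]' | tr '_' '.')
echo "$prop_key"   # prints: sdc.base.http.url
```

Note that this is also why the mapping cannot produce mixed-case property names like production.maxBatchSize: the lowercasing step is unconditional.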
-------------------------------------------------------------------------------- /tutorial-kubernetes-deployment/README.md: -------------------------------------------------------------------------------- 1 | ## Kubernetes Deployment 2 | 3 | # Note - this tutorial is primarily for use with legacy Control Hub 3.x. Users of StreamSets Platform now enjoy [top-level Kubernetes deployment support](https://docs.streamsets.com/portal/platform-controlhub/controlhub/UserGuide/Deployments/Kubernetes.html#concept_ec3_cqg_hvb) and no longer need to manually configure most of the aspects covered by this tutorial. 4 | 5 | This tutorial provides examples and guidance for deploying custom configurations of [StreamSets Data Collector](https://streamsets.com/products/dataops-platform/data-collector) on Kubernetes using [Control Hub](https://streamsets.com/products/dataops-platform/control-hub). Please see [the note about Environment Variables](NoteOnEnvVars.md). 6 | 7 | ### Prerequisites 8 | 9 | * A Kubernetes Cluster 10 | 11 | * A deployed [Provisioning Agent](https://streamsets.com/documentation/controlhub/latest/help/controlhub/UserGuide/DataCollectorsProvisioned/ProvisionSteps.html#concept_hjy_tft_1gb) 12 | 13 | * A deployed [Ingress Controller](https://kubernetes.cn/docs/concepts/services-networking/ingress-controllers/) for the Ingress example 14 | 15 | 16 | ### Examples 17 | 18 | 1. [How to set Java Heap Size and other Java Options](1-java-opts) 19 | 20 | 1. [Baked-in Stage Libs and Configuration](2-custom-docker-image) 21 | 22 | 1. [Loading Stage Libs from a pre-populated Volume](3-volumes) 23 | 24 | 1. [Loading Stage Libs from a Persistent Volume](4-persistent-volumes) 25 | 26 | 1. [Loading sdc.properties from a ConfigMap](5-sdc-properties-configmap-1) 27 | 28 | 1. [Loading static and dynamic sdc.properties from separate ConfigMaps](6-sdc-properties-configmap-2) 29 | 30 | 1. [Loading credential-stores.properties from a Secret](7-credential-stores) 31 | 32 | 1.
[Ingress](8-ingress) 33 | -------------------------------------------------------------------------------- /tutorial-origin/image_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_0.png -------------------------------------------------------------------------------- /tutorial-origin/image_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_1.png -------------------------------------------------------------------------------- /tutorial-origin/image_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_10.png -------------------------------------------------------------------------------- /tutorial-origin/image_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_11.png -------------------------------------------------------------------------------- /tutorial-origin/image_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_2.png -------------------------------------------------------------------------------- /tutorial-origin/image_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_3.png -------------------------------------------------------------------------------- /tutorial-origin/image_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_4.png -------------------------------------------------------------------------------- /tutorial-origin/image_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_5.png -------------------------------------------------------------------------------- /tutorial-origin/image_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_6.png -------------------------------------------------------------------------------- /tutorial-origin/image_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_7.png -------------------------------------------------------------------------------- /tutorial-origin/image_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_8.png 
-------------------------------------------------------------------------------- /tutorial-origin/image_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-origin/image_9.png -------------------------------------------------------------------------------- /tutorial-processor/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/.DS_Store -------------------------------------------------------------------------------- /tutorial-processor/image_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_0.png -------------------------------------------------------------------------------- /tutorial-processor/image_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_1.png -------------------------------------------------------------------------------- /tutorial-processor/image_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_10.png -------------------------------------------------------------------------------- /tutorial-processor/image_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_11.png -------------------------------------------------------------------------------- /tutorial-processor/image_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_2.png -------------------------------------------------------------------------------- /tutorial-processor/image_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_3.png -------------------------------------------------------------------------------- /tutorial-processor/image_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_4.png -------------------------------------------------------------------------------- /tutorial-processor/image_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_5.png -------------------------------------------------------------------------------- /tutorial-processor/image_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_6.png 
-------------------------------------------------------------------------------- /tutorial-processor/image_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_7.png -------------------------------------------------------------------------------- /tutorial-processor/image_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_8.png -------------------------------------------------------------------------------- /tutorial-processor/image_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/image_9.png -------------------------------------------------------------------------------- /tutorial-processor/sampleprocessor/pom.xml: -------------------------------------------------------------------------------- 1 | <?xml version="1.0" encoding="UTF-8"?> 2 | <!-- Copyright 2015 StreamSets Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 17 | --> 18 | <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> 19 | <modelVersion>4.0.0</modelVersion> 20 | <groupId>com.example</groupId> 21 | <artifactId>sampleprocessor</artifactId> 22 | <version>1.0-SNAPSHOT</version> 23 | 24 | <properties> 25 | <streamsets.version>2.1.0.0</streamsets.version> 26 | <slf4j.version>1.7.7</slf4j.version> 27 | <junit.version>4.12</junit.version> 28 | <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> 29 | </properties> 30 | 31 | <dependencies> 32 | 33 | 34 | <dependency> 35 | <groupId>com.streamsets</groupId> 36 | <artifactId>streamsets-datacollector-api</artifactId> 37 | <version>${streamsets.version}</version> 38 | <scope>provided</scope> 39 | </dependency> 40 | <dependency> 41 | <groupId>com.drewnoakes</groupId> 42 | <artifactId>metadata-extractor</artifactId> 43 | <version>2.9.1</version> 44 | </dependency> 45 | <dependency> 46 | <groupId>org.slf4j</groupId> 47 | <artifactId>slf4j-api</artifactId> 48 | <version>${slf4j.version}</version> 49 | <scope>provided</scope> 50 | </dependency> 51 | <dependency> 52 | <groupId>org.slf4j</groupId> 53 | <artifactId>slf4j-log4j12</artifactId> 54 | <version>${slf4j.version}</version> 55 | <scope>provided</scope> 56 | </dependency> 57 | 58 | 59 | 60 | <dependency> 61 | <groupId>junit</groupId> 62 | <artifactId>junit</artifactId> 63 | <version>${junit.version}</version> 64 | <scope>test</scope> 65 | </dependency> 66 | <dependency> 67 | <groupId>com.streamsets</groupId> 68 | <artifactId>streamsets-datacollector-sdk</artifactId> 69 | <version>${streamsets.version}</version> 70 | <scope>test</scope> 71 | </dependency> 72 | </dependencies> 73 | 74 | <build> 75 | <plugins> 76 | <plugin> 77 | <groupId>org.apache.maven.plugins</groupId> 78 | <artifactId>maven-assembly-plugin</artifactId> 79 | <executions> 80 | <execution> 81 | <id>dist</id> 82 | <phase>package</phase> 83 | <goals> 84 | <goal>single</goal> 85 | </goals> 86 | <configuration> 87 | <appendAssemblyId>false</appendAssemblyId> 88 | <attach>false</attach> 89 | <finalName>${project.artifactId}-${project.version}</finalName> 90 | <descriptors> 91 | <descriptor>src/main/assemblies/stage-lib.xml</descriptor> 92 | </descriptors> 93 | </configuration> 94 | </execution> 95 | </executions> 96 | </plugin> 97 | <plugin> 98 | <groupId>org.apache.maven.plugins</groupId> 99 | <artifactId>maven-compiler-plugin</artifactId> 100 | <version>3.3</version> 101 | <configuration> 102 | <source>1.7</source> 103 | <target>1.7</target> 104 | </configuration> 105 | </plugin> 106 | </plugins> 107 | </build> 108 | 109 | </project> -------------------------------------------------------------------------------- /tutorial-processor/sampleprocessor/src/main/assemblies/stage-lib.xml: -------------------------------------------------------------------------------- 1 | <?xml version="1.0" encoding="UTF-8"?> 2 | <!-- Copyright 2015 StreamSets Inc. Licensed under the Apache License, Version 2.0; see the pom.xml header for the full license text. 17 | --> 18 | <assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3 http://maven.apache.org/xsd/assembly-1.1.3.xsd"> 21 | <id>stage-lib</id> 22 | <formats> 23 | <format>tar.gz</format> 24 | </formats> 25 | <includeBaseDirectory>false</includeBaseDirectory> 26 | 27 | <dependencySets> 28 | 29 | 30 | 31 | <dependencySet> 32 | <useProjectArtifact>true</useProjectArtifact> 33 | <useTransitiveDependencies>true</useTransitiveDependencies> 34 | <useTransitiveFiltering>true</useTransitiveFiltering> 35 | <outputDirectory>/${project.artifactId}/lib</outputDirectory> 36 | <useStrictFiltering>false</useStrictFiltering> 37 | <scope>runtime</scope> 38 | <excludes> 39 | 40 | <exclude>org.python:jython-standalone</exclude> 41 | 42 | <!-- Excluded because it is provided by the Data Collector runtime --> 46 | <exclude>org.xerial.snappy:snappy-java</exclude> 47 | </excludes> 48 | 49 | </dependencySet> 50 | 51 | </dependencySets> 52 | </assembly> -------------------------------------------------------------------------------- /tutorial-processor/sampleprocessor/src/main/java/com/example/stage/lib/sample/Errors.java: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright 2015 StreamSets Inc. 3 | * 4 | * Licensed under the Apache Software Foundation (ASF) under one 5 | * or more contributor license agreements. See the NOTICE file 6 | * distributed with this work for additional information 7 | * regarding copyright ownership. The ASF licenses this file 8 | * to you under the Apache License, Version 2.0 (the 9 | * "License"); you may not use this file except in compliance 10 | * with the License.
You may obtain a copy of the License at 11 | * 12 | * http://www.apache.org/licenses/LICENSE-2.0 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | */ 20 | package com.example.stage.lib.sample; 21 | 22 | import com.streamsets.pipeline.api.ErrorCode; 23 | import com.streamsets.pipeline.api.GenerateResourceBundle; 24 | 25 | @GenerateResourceBundle 26 | public enum Errors implements ErrorCode { 27 | 28 | SAMPLE_00("A configuration is invalid because: {}"), 29 | SAMPLE_01("Specific reason writing record failed: {}"), 30 | SAMPLE_02("Retrieving metadata failed: {}"), 31 | ; 32 | private final String msg; 33 | 34 | Errors(String msg) { 35 | this.msg = msg; 36 | } 37 | 38 | /** {@inheritDoc} */ 39 | @Override 40 | public String getCode() { 41 | return name(); 42 | } 43 | 44 | /** {@inheritDoc} */ 45 | @Override 46 | public String getMessage() { 47 | return msg; 48 | } 49 | } 50 | -------------------------------------------------------------------------------- /tutorial-processor/sampleprocessor/src/main/java/com/example/stage/processor/sample/Groups.java: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright 2015 StreamSets Inc. 3 | * 4 | * Licensed under the Apache Software Foundation (ASF) under one 5 | * or more contributor license agreements. See the NOTICE file 6 | * distributed with this work for additional information 7 | * regarding copyright ownership. The ASF licenses this file 8 | * to you under the Apache License, Version 2.0 (the 9 | * "License"); you may not use this file except in compliance 10 | * with the License. You may obtain a copy of the License at 11 | * 12 | * http://www.apache.org/licenses/LICENSE-2.0 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | */ 20 | package com.example.stage.processor.sample; 21 | 22 | import com.streamsets.pipeline.api.GenerateResourceBundle; 23 | import com.streamsets.pipeline.api.Label; 24 | 25 | @GenerateResourceBundle 26 | public enum Groups implements Label { 27 | SAMPLE("Sample"), 28 | ; 29 | 30 | private final String label; 31 | 32 | private Groups(String label) { 33 | this.label = label; 34 | } 35 | 36 | /** {@inheritDoc} */ 37 | @Override 38 | public String getLabel() { 39 | return this.label; 40 | } 41 | } -------------------------------------------------------------------------------- /tutorial-processor/sampleprocessor/src/main/java/com/example/stage/processor/sample/SampleDProcessor.java: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright 2015 StreamSets Inc. 3 | * 4 | * Licensed under the Apache Software Foundation (ASF) under one 5 | * or more contributor license agreements. See the NOTICE file 6 | * distributed with this work for additional information 7 | * regarding copyright ownership. 
The ASF licenses this file 8 | * to you under the Apache License, Version 2.0 (the 9 | * "License"); you may not use this file except in compliance 10 | * with the License. You may obtain a copy of the License at 11 | * 12 | * http://www.apache.org/licenses/LICENSE-2.0 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | */ 20 | package com.example.stage.processor.sample; 21 | 22 | import com.streamsets.pipeline.api.ConfigDef; 23 | import com.streamsets.pipeline.api.ConfigGroups; 24 | import com.streamsets.pipeline.api.GenerateResourceBundle; 25 | import com.streamsets.pipeline.api.StageDef; 26 | 27 | @StageDef( 28 | version = 1, 29 | label = "Sample Processor", 30 | description = "", 31 | icon = "default.png", 32 | onlineHelpRefUrl = "" 33 | ) 34 | @ConfigGroups(Groups.class) 35 | @GenerateResourceBundle 36 | public class SampleDProcessor extends SampleProcessor { 37 | 38 | @ConfigDef( 39 | required = true, 40 | type = ConfigDef.Type.STRING, 41 | defaultValue = "default", 42 | label = "Sample Config", 43 | displayPosition = 10, 44 | group = "SAMPLE" 45 | ) 46 | public String config; 47 | 48 | /** {@inheritDoc} */ 49 | @Override 50 | public String getConfig() { 51 | return config; 52 | } 53 | 54 | } -------------------------------------------------------------------------------- /tutorial-processor/sampleprocessor/src/main/java/com/example/stage/processor/sample/SampleProcessor.java: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright 2015 StreamSets Inc. 3 | * 4 | * Licensed under the Apache Software Foundation (ASF) under one 5 | * or more contributor license agreements. See the NOTICE file 6 | * distributed with this work for additional information 7 | * regarding copyright ownership. The ASF licenses this file 8 | * to you under the Apache License, Version 2.0 (the 9 | * "License"); you may not use this file except in compliance 10 | * with the License. You may obtain a copy of the License at 11 | * 12 | * http://www.apache.org/licenses/LICENSE-2.0 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 
19 | */ 20 | package com.example.stage.processor.sample; 21 | 22 | import com.drew.imaging.ImageMetadataReader; 23 | import com.drew.imaging.ImageProcessingException; 24 | import com.drew.metadata.Directory; 25 | import com.drew.metadata.Metadata; 26 | import com.drew.metadata.Tag; 27 | import com.example.stage.lib.sample.Errors; 28 | 29 | import com.streamsets.pipeline.api.Field; 30 | import com.streamsets.pipeline.api.FileRef; 31 | import com.streamsets.pipeline.api.Record; 32 | import com.streamsets.pipeline.api.StageException; 33 | import com.streamsets.pipeline.api.base.OnRecordErrorException; 34 | import com.streamsets.pipeline.api.base.SingleLaneRecordProcessor; 35 | import org.slf4j.Logger; 36 | import org.slf4j.LoggerFactory; 37 | 38 | import java.io.IOException; 39 | import java.io.InputStream; 40 | import java.util.LinkedHashMap; 41 | import java.util.List; 42 | 43 | public abstract class SampleProcessor extends SingleLaneRecordProcessor { 44 | private static final Logger LOG = LoggerFactory.getLogger(SampleProcessor.class); 45 | /** 46 | * Gives access to the UI configuration of the stage provided by the {@link SampleDProcessor} class. 47 | */ 48 | public abstract String getConfig(); 49 | 50 | /** {@inheritDoc} */ 51 | @Override 52 | protected List<ConfigIssue> init() { 53 | // Validate configuration values and open any required resources. 54 | List<ConfigIssue> issues = super.init(); 55 | 56 | if (getConfig().equals("invalidValue")) { 57 | issues.add( 58 | getContext().createConfigIssue( 59 | Groups.SAMPLE.name(), "config", Errors.SAMPLE_00, "Here's what's wrong..." 60 | ) 61 | ); 62 | } 63 | 64 | // If issues is not empty, the UI will inform the user of each configuration issue in the list. 65 | return issues; 66 | } 67 | 68 | /** {@inheritDoc} */ 69 | @Override 70 | public void destroy() { 71 | // Clean up any open resources. 72 | super.destroy(); 73 | } 74 | 75 | /** {@inheritDoc} */ 76 | @Override 77 | protected void process(Record record, SingleLaneBatchMaker batchMaker) throws StageException { 78 | LOG.info("Input record: {}", record); 79 | 80 | FileRef fileRef = record.get("/fileRef").getValueAsFileRef(); 81 | Metadata metadata; 82 | try { 83 | metadata = ImageMetadataReader.readMetadata(fileRef.createInputStream(getContext(), InputStream.class)); 84 | } catch (ImageProcessingException | IOException e) { 85 | String filename = record.get("/fileInfo/filename").getValueAsString(); 86 | LOG.info("Exception getting metadata from {}", filename, e); 87 | throw new OnRecordErrorException(record, Errors.SAMPLE_02, e); 88 | } 89 | 90 | for (Directory directory : metadata.getDirectories()) { 91 | LinkedHashMap<String, Field> listMap = new LinkedHashMap<>(); 92 | 93 | for (Tag tag : directory.getTags()) { 94 | listMap.put(tag.getTagName(), Field.create(tag.getDescription())); 95 | } 96 | 97 | if (directory.hasErrors()) { 98 | for (String error : directory.getErrors()) { 99 | LOG.info("ERROR: {}", error); 100 | } 101 | } 102 | 103 | record.set("/" + directory.getName(), Field.createListMap(listMap)); 104 | } 105 | 106 | LOG.info("Output record: {}", record); 107 | 108 | batchMaker.addRecord(record); 109 | } 110 | 111 | } -------------------------------------------------------------------------------- /tutorial-processor/sampleprocessor/src/main/resources/data-collector-library-bundle.properties: -------------------------------------------------------------------------------- 1 | # 2 | # 3 | # Copyright 2015 StreamSets Inc.
4 | # 5 | # Licensed under the Apache Software Foundation (ASF) under one 6 | # or more contributor license agreements. See the NOTICE file 7 | # distributed with this work for additional information 8 | # regarding copyright ownership. The ASF licenses this file 9 | # to you under the Apache License, Version 2.0 (the 10 | # "License"); you may not use this file except in compliance 11 | # with the License. You may obtain a copy of the License at 12 | # 13 | # http://www.apache.org/licenses/LICENSE-2.0 14 | # 15 | # Unless required by applicable law or agreed to in writing, software 16 | # distributed under the License is distributed on an "AS IS" BASIS, 17 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 18 | # See the License for the specific language governing permissions and 19 | # limitations under the License. 20 | # 21 | # 22 | library.name=Sample Library 1.0.0 23 | -------------------------------------------------------------------------------- /tutorial-processor/sampleprocessor/src/main/resources/default.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-processor/sampleprocessor/src/main/resources/default.png -------------------------------------------------------------------------------- /tutorial-processor/sampleprocessor/src/test/java/com/example/stage/processor/sample/TestSampleProcessor.java: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright 2015 StreamSets Inc. 3 | * 4 | * Licensed under the Apache Software Foundation (ASF) under one 5 | * or more contributor license agreements. See the NOTICE file 6 | * distributed with this work for additional information 7 | * regarding copyright ownership. The ASF licenses this file 8 | * to you under the Apache License, Version 2.0 (the 9 | * "License"); you may not use this file except in compliance 10 | * with the License. You may obtain a copy of the License at 11 | * 12 | * http://www.apache.org/licenses/LICENSE-2.0 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 
19 | */ 20 | package com.example.stage.processor.sample; 21 | 22 | import com.streamsets.pipeline.api.Field; 23 | import com.streamsets.pipeline.api.Record; 24 | import com.streamsets.pipeline.api.StageException; 25 | import com.streamsets.pipeline.sdk.ProcessorRunner; 26 | import com.streamsets.pipeline.sdk.RecordCreator; 27 | import com.streamsets.pipeline.sdk.StageRunner; 28 | import org.junit.Assert; 29 | import org.junit.Test; 30 | 31 | import java.util.Arrays; 32 | 33 | public class TestSampleProcessor { 34 | @Test 35 | @SuppressWarnings("unchecked") 36 | public void testProcessor() throws StageException { 37 | ProcessorRunner runner = new ProcessorRunner.Builder(SampleDProcessor.class) 38 | .addConfiguration("config", "value") 39 | .addOutputLane("output") 40 | .build(); 41 | 42 | runner.runInit(); 43 | 44 | try { 45 | Record record = RecordCreator.create(); 46 | record.set(Field.create(true)); 47 | StageRunner.Output output = runner.runProcess(Arrays.asList(record)); 48 | Assert.assertEquals(1, output.getRecords().get("output").size()); 49 | Assert.assertEquals(true, output.getRecords().get("output").get(0).get().getValueAsBoolean()); 50 | } finally { 51 | runner.runDestroy(); 52 | } 53 | } 54 | } 55 | -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_0.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_1.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_10.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_11.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_12.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_12.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_2.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_3.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_4.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_5.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_6.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_7.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_8.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/image_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/image_9.png -------------------------------------------------------------------------------- /tutorial-spark-transformer-scala/upload.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer-scala/upload.png -------------------------------------------------------------------------------- /tutorial-spark-transformer/image_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer/image_0.png -------------------------------------------------------------------------------- /tutorial-spark-transformer/image_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer/image_1.png -------------------------------------------------------------------------------- /tutorial-spark-transformer/image_2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer/image_2.png -------------------------------------------------------------------------------- /tutorial-spark-transformer/image_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer/image_3.png -------------------------------------------------------------------------------- /tutorial-spark-transformer/image_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer/image_4.png -------------------------------------------------------------------------------- /tutorial-spark-transformer/image_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer/image_5.png -------------------------------------------------------------------------------- /tutorial-spark-transformer/image_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer/image_6.png -------------------------------------------------------------------------------- /tutorial-spark-transformer/image_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer/image_7.png -------------------------------------------------------------------------------- /tutorial-spark-transformer/upload.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/tutorial-spark-transformer/upload.png -------------------------------------------------------------------------------- /working-with-azure/blobstorage_to_hdinsightkafka.md: -------------------------------------------------------------------------------- 1 | # Ingesting Data from Blob Storage into Apache Kafka on HDInsight 2 | 3 | [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/ "Azure Blob Storage") can handle all your unstructured data, scaling up or down as required. You no longer have to manage it, you pay only for what you use, and you save money over on-premises storage options. 4 | 5 | [Apache Kafka](https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction) is an open-source, distributed streaming platform. It's often used as a message broker, as it provides functionality similar to a publish-subscribe message queue. [Apache Kafka on HDInsight](https://azure.microsoft.com/en-us/services/hdinsight/apache-kafka/) is a managed service that provides a simplified configuration for Apache Kafka. 6 | 7 | ## Goal 8 | 9 | In this tutorial, you will learn how to leverage Azure services with StreamSets to read data from Blob Storage and send it to an Apache Kafka on HDInsight cluster. 10 | 11 | ## Prerequisites 12 | 13 | [Install and configure StreamSets Data Collector and the necessary Azure services](readme.md).
14 | 15 | You can download sample data for this tutorial from the following location: https://www.streamsets.com/documentation/datacollector/sample_data/tutorial/nyc_taxi_data.csv and copy it into your Blob Storage. 16 | 17 | ## Creating the Kafka Topic 18 | 19 | When working with HDInsight Kafka, we need to pre-create the Kafka topic. SSH into the terminal for StreamSets Data Collector on HDInsight and run the following command to create a Kafka topic. Remember to replace `<zookeeper_url>` with your own URI for ZooKeeper. 20 | 21 | /usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 3 --partitions 3 --topic nyctaxi --zookeeper <zookeeper_url> 22 | 23 | ## Creating a Pipeline 24 | 25 | Now let's get some data flowing! In your browser, log in to StreamSets Data Collector (SDC) and create a new pipeline. 26 | 27 | Select the origin from the drop-down list: `Hadoop FS Standalone - HDP 2.6.2.1-1` 28 | ![image alt text](img/BlobToKafka/SelectSource_Hadoop.png) 29 | **Hadoop FS tab** 30 | 31 | * **Hadoop FS URI**: this has the form `wasb[s]://<container>@<account>.blob.core.windows.net/` 32 | 33 | * **Hadoop FS Configuration** 34 | * **fs.azure.account.key.<account>.blob.core.windows.net** = `<account key>` 35 | * **fs.azure.account.keyprovider.<account>.blob.core.windows.net** = `org.apache.hadoop.fs.azure.SimpleKeyProvider` 36 | 37 | ![image alt text](img/BlobToKafka/HadoopFS_HadoopFS.png) 38 | 39 | **Files** 40 | 41 | * **Files Directory**: `<files directory>` 42 | 43 | * **File Name Pattern**: `*` 44 | 45 | * **Read Order**: `Last Modified Timestamp` 46 | 47 | ![image alt text](img/BlobToKafka/HadoopFS_Files.png) 48 | 49 | **Data Format tab** 50 | 51 | * **Data Format**: `Delimited` 52 | 53 | * **Header Line**: `With Header Line` 54 | 55 | ![image alt text](img/BlobToKafka/HadoopFS_DataFormat.png) 56 | 57 | Now, let’s send this data to Kafka. 58 | In the Select Destination to connect dropdown, select `Kafka Producer` with the HDP version matching your HDInsight Kafka cluster. Configure Kafka as follows: 59 | 60 | **Kafka tab** 61 | 62 | * **Broker URI**: `<broker_uri>` 63 | 64 | * **Topic**: `nyctaxi` 65 | 66 | ![image alt text](img/BlobToKafka/Kafka_Kafka.png) 67 | 68 | **Kafka Data Format tab** 69 | 70 | * **Data Format**: `JSON` 71 | 72 | ![image alt text](img/BlobToKafka/Kafka_DataFormat.png) 73 | 74 | Configure the pipeline's **Error Records** property according to your preference. Since this is a tutorial, you could discard error records, but in a production system you would write them to a file or queue for analysis later. 75 | 76 | Now your pipeline is fully configured and ready for action! Hit the validate button ![image alt text](img/BlobToKafka/ValidateIcon.png) to check the connections. If successful, hit the preview button ![image alt text](img/BlobToKafka/PreviewIcon.png) to check that you can read records from the Blob Store file. Click the Hadoop FS stage and you should see ten records listed in the preview panel. You can click into them to see the individual fields and their values: 77 | 78 | ![image alt text](img/BlobToKafka/Preview.png) 79 | 80 | If your pipeline reports an error at validation or preview, check your configuration properties. If it’s still not working, contact us via the [sdc-user Google Group](https://groups.google.com/a/streamsets.com/forum/#!forum/sdc-user) or the [StreamSets Community Slack channel](https://streamsetters-slack.herokuapp.com/) - details are on the [StreamSets Community Page](https://streamsets.com/community/). 81 | 82 | ## Running the Pipeline 83 | 84 | If all is well, it’s time to run the pipeline!
Hit the run button ![image alt text](img/BlobToKafka/PlayIcon.png) and you should see 5386 input records and 5386 output records in the monitoring panel. 85 | 86 | ![image alt text](img/BlobToKafka/StartandMonitor.png) 87 | 88 | Use Kafka tools provided by Azure to verify data was written to the topic. More details can be found here: [Manage Kafka Topics](https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-get-started#manage-kafka-topics) 89 | 90 | ## Conclusion 91 | 92 | This tutorial shows how simple it is to stream data from any directory on Blob Storage, structured or unstructured, into a Kafka topic that can feed multiple consumers downstream. 93 | 94 | [Follow the next tutorial](hdinsightkafka_to_sqldw_and_blobstorage.md) to see how you can use StreamSets Data Collector to read data from the topic and feed it into Azure SQL Data Warehouse. 95 | -------------------------------------------------------------------------------- /working-with-azure/img/Ambari_Kafka.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/Ambari_Kafka.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/HadoopFS_DataFormat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/HadoopFS_DataFormat.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/HadoopFS_Files.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/HadoopFS_Files.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/HadoopFS_HadoopFS.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/HadoopFS_HadoopFS.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/Kafka_DataFormat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/Kafka_DataFormat.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/Kafka_Kafka.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/Kafka_Kafka.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/PlayIcon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/PlayIcon.png --------------------------------------------------------------------------------
/working-with-azure/img/BlobToKafka/Preview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/Preview.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/PreviewIcon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/PreviewIcon.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/SelectSource_Hadoop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/SelectSource_Hadoop.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/StartandMonitor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/StartandMonitor.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/ValidateIcon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/ValidateIcon.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/image_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/image_4.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/image_44.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/image_44.png -------------------------------------------------------------------------------- /working-with-azure/img/BlobToKafka/image_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/BlobToKafka/image_5.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/ExpressionEvaluator.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/ExpressionEvaluator.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/FieldConvertor.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/FieldConvertor.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/HadoopFS_DataFormat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/HadoopFS_DataFormat.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/HadoopFS_OutputFiles.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/HadoopFS_OutputFiles.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/HiveMetadata_General.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/HiveMetadata_General.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/HiveMetadata_Hive.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/HiveMetadata_Hive.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/HiveMetadata_Table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/HiveMetadata_Table.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/HiveMetastore_Hive.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/HiveMetastore_Hive.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/JDBCProducer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/JDBCProducer.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/KafkaConnection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/KafkaConnection.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/KafkaDataFormatJson.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/KafkaDataFormatJson.png -------------------------------------------------------------------------------- /working-with-azure/img/KafkaToSqlDWandHive/Preview1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/KafkaToSqlDWandHive/Preview1.png -------------------------------------------------------------------------------- /working-with-azure/img/sdc_ssh_login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/streamsets/tutorials/9d4510a0770cb7109de0c78574d41b921aefc6b7/working-with-azure/img/sdc_ssh_login.png -------------------------------------------------------------------------------- /working-with-azure/readme.md: -------------------------------------------------------------------------------- 1 | # Working with StreamSets Data Collector and Microsoft Azure 2 | 3 | These tutorials explain how to use [StreamSets Data Collector](https://streamsets.com/products/sdc/) to integrate [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/), [Apache Kafka on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/), [Azure SQL Data Warehouse](https://azure.microsoft.com/en-us/services/sql-data-warehouse/) and Apache Hive backed by Azure Blob Storage: 4 | 5 | * [Ingesting Data from Blob Storage into Apache Kafka on HDInsight](blobstorage_to_hdinsightkafka.md) 6 | * [Ingesting Data from Apache Kafka on HDInsight into Azure SQL Data Warehouse and Apache Hive backed by Azure Blob Storage](hdinsightkafka_to_sqldw_and_blobstorage.md) 7 | 8 | ### Pre-requisites 9 | 10 | In order to work through these tutorials, ensure you already have the following setup ready; otherwise, follow the instructions described below: 11 | 12 | 1. Log in to the [Azure Portal](https://portal.azure.com) 13 | 14 | 2. Create a [Virtual Network](https://docs.microsoft.com/en-us/azure/virtual-network/quick-create-portal#create-a-virtual-network) (vNet): This vNet allows the clusters created in the next steps to communicate privately with each other. 15 | 16 | 3. Data storage: Create a [Blob Container](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage#create-blob-containers) 17 | 18 | 4. Install [StreamSets Data Collector for HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apps-install-streamsets) 19 | Also check [StreamSets Documentation](https://streamsets.com/documentation/datacollector/latest/help/index.html#datacollector/UserGuide/Installation/CloudInstall.html#task_vnj_rl2_wdb) for more details on the installation process. 20 | 21 | 5. Create an [Apache Kafka on HDInsight cluster](https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-get-started) in the same vNet as above. Use Azure Storage for the Kafka cluster. 22 | 23 | 24 | 6. Configure StreamSets Data Collector to connect to the HDInsight cluster 25 | 26 | Configure the connection to the HDInsight cluster by creating symlinks to the configuration files.
27 | 28 | - SSH into the SDC node using the SSH endpoint of your cluster: 29 | 30 | ``` 31 | ssh sshuser@<ssh_endpoint> 32 | ``` 33 | 34 | - Navigate to the StreamSets resources directory and create a directory to hold cluster configuration symlinks 35 | 36 | ``` 37 | cd /var/lib/sdc-resources 38 | sudo mkdir hadoop-conf 39 | cd hadoop-conf 40 | ``` 41 | 42 | - Symlink all *.xml files from /etc/hadoop/conf and hive-site.xml from /etc/hive/conf: 43 | ``` 44 | sudo ln -s /etc/hadoop/conf/*.xml . 45 | sudo ln -s /etc/hive/conf/hive-site.xml . 46 | ``` 47 | 48 | ![image alt text](img/sdc_ssh_login.png) 49 | 50 | 7. Download and install the SQL Server JDBC driver for StreamSets Data Collector 51 | - Download: https://docs.microsoft.com/en-us/sql/connect/jdbc/microsoft-jdbc-driver-for-sql-server?view=sql-server-2017 52 | - Install: https://streamsets.com/documentation/datacollector/latest/help/index.html#datacollector/UserGuide/Configuration/ExternalLibs.html#concept_amy_pzs_gz 53 | 54 | 8. Capture and note down the Kafka Broker URI and ZooKeeper configuration from Ambari: 55 | 56 | ![image alt text](img/Ambari_Kafka.png) 57 |
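If you prefer the command line to the Ambari UI, the broker and ZooKeeper hosts can also be pulled from Ambari's REST API. This is a minimal sketch, not part of the original setup; the cluster name, the `admin` user, and the use of `jq` are assumptions to adapt to your environment:

```
# Assumed HDInsight cluster name; replace with your own.
CLUSTERNAME=mykafkacluster

# Hosts running Kafka brokers (append :9092 to each for the Broker URI).
curl -sS -u admin -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/services/KAFKA/components/KAFKA_BROKER" \
  | jq -r '.host_components[].HostRoles.host_name'

# Hosts running ZooKeeper servers (append :2181 to each for the ZooKeeper URI).
curl -sS -u admin -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/services/ZOOKEEPER/components/ZOOKEEPER_SERVER" \
  | jq -r '.host_components[].HostRoles.host_name'
```
--------------------------------------------------------------------------------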