├── .gitignore ├── README.md ├── content ├── airflow.md ├── avro.md ├── aws.md ├── azure.md ├── bigquery.md ├── bigtable.md ├── cassandra.md ├── data-structure.md ├── delta.md ├── dwha.md ├── dynamodb.md ├── flink.md ├── flume.md ├── full.md ├── gcp.md ├── greenplum.md ├── hadoop.md ├── hbase.md ├── hive.md ├── impala.md ├── kafka.md ├── kubernetes.md ├── looker.md ├── mongo.md ├── nifi.md ├── parquet.md ├── redshift.md ├── spark.md ├── sql.md ├── superset.md └── tableau.md └── img ├── flink ├── 1.png ├── 2.png ├── apche-flink-architecture.png ├── bounded-unbounded-stream.png ├── flink-job-exe-architecture.png ├── flink3.jpg └── flinkVsHadoopVsSpark.JPG └── icon ├── airflow.ico ├── avro.ico ├── aws.ico ├── awstime.ico ├── azure.ico ├── bigquery.ico ├── bigtable.ico ├── cassandra.ico ├── cosmosdb.ico ├── datastruct.ico ├── deltalake.ico ├── dwha.ico ├── dynamodb.ico ├── fire.ico ├── flink.ico ├── flume.ico ├── gcp.ico ├── gcpsql.ico ├── github.ico ├── greenplum.ico ├── hadoop.ico ├── hbase.ico ├── hive.ico ├── impala.ico ├── kafka.ico ├── kuber.ico ├── looker.ico ├── mongo.ico ├── neptune.ico ├── nifi.ico ├── parquet.ico ├── rds.ico ├── redshift.ico ├── spanner.ico ├── spark.ico ├── sql.ico ├── superset.ico └── tableau.ico /.gitignore: -------------------------------------------------------------------------------- 1 | # IntelliJ 2 | *.iml 3 | out/ 4 | .idea/ 5 | 6 | # System specific 7 | .DS_Store 8 | 9 | # Maven 10 | target/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

More than 2,000 questions for preparing for a Data Engineer interview.


Full list of questions


Interview questions for Data Engineer

Databases and Data Warehouses
GitHub RepoOfficial pageQuestionsDescriptionUseful links
CassandraCassandraApache CassandraCassandra is a distributed, wide-column store, NoSQL database management system.Awesome Cassandra
GreenplumGreenplumGreenplumGreenplum is a big data technology based on MPP architecture and the Postgres open source database technology.Awesome Greenplum
MongoDBMongoDBMongoDBMongoDB is a document-oriented database.Awesome MongoDB
HbaseHbaseApache HbaseHBase is an open-source non-relational distributed database.Awesome HBase
HiveHiveApache HiveApache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.Awesome Hive
Amazon DynamoDBAmazon DynamoDBAmazon DynamoDB is a fully managed proprietary NoSQL database service.Awesome DynamoDB Awesome AWS
Amazon RedshiftAmazon RedshiftAmazon Redshift is a data warehouse product.Amazon Redshift Utilities Awesome AWS
BigQueryBigQuery GCPBigQuery is a fully-managed, serverless data warehouse.Awesome BigQuery
BigtableBigtable GCPBigtable is a fully managed wide-column and key-value NoSQL database service.Awesome Bigtable
Data Formats
AvroAvroApache AvroAvro is a row-oriented remote procedure call and data serialization framework.Awesome Avro
ParquetParquetApache ParquetApache Parquet is a column-oriented data file format designed for efficient data storage and retrieval.TODO
DeltaDeltaDeltaDelta Lake is a storage framework that enables building a Lakehouse architecture with compute engines.Delta examples
Big Data Frameworks
AirflowAirflowApache AirflowApache Airflow is a workflow management platform for data engineering pipelines.Awesome Airflow
FlumeFlumeApache FlumeApache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data.TODO
HadoopHadoopApache HadoopApache Hadoop is a collection of software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.Awesome Hadoop
ImpalaImpalaApache ImpalaApache Impala is a parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop.TODO
KafkaKafkaApache KafkaApache Kafka is a distributed event store and stream-processing platform.Awesome Kafka
NiFiNiFiApache NiFiApache NiFi is a software project designed to automate the flow of data between software systems.Awesome NiFi
SparkSparkApache SparkApache Spark is a unified analytics engine for large-scale data processing.Awesome Spark
FlinkFlinkApache FlinkApache Flink is a unified stream-processing and batch-processing framework.Awesome Flink
KubernetesKubernetesKubernetes Kubernetes is a system for managing containerized applications across multiple hosts.Awesome Kubernetes
Cloud providers
AWSAWSAmazon Web ServicesAmazon web service is an online platform that provides scalable and cost-effective cloud computing solutions.Awesome AWS
AzureAzureMicrosoft AzureMicrosoft Azure is Microsoft's public cloud computing platform.Awesome Azure
GCPGCPGoogle Cloud PlatformGoogle Cloud Platform is a suite of cloud computing services.Awesome GCP
Theory
DWHADWH ArchitecturesA data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise.Awesome databases
Data StructuresData StructuresA data structure is a specialized format for organizing, processing, retrieving and storing data. TODO
SQLSQLSQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS).Awesome SQL
Data visualization tools/BI
TableauTableauTableau is a powerful data visualization tool used in the Business Intelligence.TODO
LookerLookerLooker is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time.TODO
SupersetApache SupersetApache SupersetSuperset is a modern data exploration and data visualization platform.TODO

Contribution


Please contribute to this repository to help make it better. Any change, such as a new question, a code improvement, or a documentation improvement, is very welcome.

-------------------------------------------------------------------------------- /content/airflow.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | # Apache Airflow 4 | + [What is Airflow?](#What-is-Airflow) 5 | + [What issues does Airflow resolve?](#What-issues-does-Airflow-resolve) 6 | + [Explain how workflow is designed in Airflow?](#Explain-how-workflow-is-designed-in-Airflow) 7 | + [Explain Airflow Architecture and its components?](#Explain-Airflow-Architecture-and-its-components) 8 | + [What are the types of Executors in Airflow?](#What-are-the-types-of-Executors-in-Airflow) 9 | + [What are the pros and cons of SequentialExecutor?](#What-are-the-pros-and-cons-of-SequentialExecutor) 10 | + [What are the pros and cons of LocalExecutor?](#What-are-the-pros-and-cons-of-LocalExecutor) 11 | + [What are the pros and cons of CeleryExecutor?](#What-are-the-pros-and-cons-of-CeleryExecutor) 12 | + [What are the pros and cons of KubernetesExecutor?](#What-are-the-pros-and-cons-of-KubernetesExecutor) 13 | + [How to define a workflow in Airflow?](#How-to-define-a-workflow-in-Airflow) 14 | + [How do you make the module available to airflow if you're using Docker Compose?](#How-do-you-make-the-module-available-to-airflow-if-you're-using-Docker-Compose) 15 | + [How to schedule DAG in Airflow?](#How-to-schedule-DAG-in-Airflow) 16 | + [What is XComs In Airflow?](#What-is-XComs-In-Airflow) 17 | + [What is xcom_pull in XCom Airflow?](#What-is-xcom_pull-in-XCom-Airflow) 18 | + [What is Jinja templates?](#What-is-Jinja-templates) 19 | + [How to use Airflow XComs in Jinja templates?](#How-to-use-Airflow-XComs-in-Jinja-templates) 20 | 21 | ## What is Airflow? 22 | Apache Airflow is an open-source workflow management platform. It began in October 2014 at Airbnb as a solution for managing the company's increasingly complex workflows. Airbnb's creation of Airflow enabled them to programmatically author, schedule, and monitor their workflows via the built-in Airflow user interface. Airflow is a data transformation pipeline ETL (Extract, Transform, Load) workflow orchestration tool. 23 | 24 | [Table of Contents](#Apache-Airflow) 25 | 26 | ## What issues does Airflow resolve? 27 | Crons are an old technique of task scheduling. 28 | Scalable 29 | Cron requires external assistance to log, track, and manage tasks. The Airflow UI is used to track and monitor the workflow's execution. 30 | Creating and maintaining a relationship between tasks in cron is a challenge, whereas it is as simple as writing Python code in Airflow. 31 | Cron jobs are not reproducible until they are configured externally. Airflow maintains an audit trail of all tasks completed. 32 | 33 | [Table of Contents](#Apache-Airflow) 34 | 35 | ## Explain how workflow is designed in Airflow? 36 | A directed acyclic graph (DAG) is used to design an Airflow workflow. That is to say, when creating a workflow, consider how it can be divided into tasks that can be completed independently. The tasks can then be combined into a graph to form a logical whole. 37 | The overall logic of your workflow is based on the shape of the graph. An Airflow DAG can have multiple branches, and you can choose which ones to follow and which to skip during workflow execution. 38 | Airflow Pipeline DAG 39 | Airflow could be completely stopped, and able to run workflows would then resume through restarting the last unfinished task. 
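For illustration, a minimal sketch of such a workflow in Airflow 2.x with the classic `PythonOperator` API; the DAG id, task ids, and callables below are invented for this example and are not part of the original text:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw data from a source system.
    return {"rows": 42}


def transform():
    # Placeholder: clean and enrich the extracted data.
    pass


def load():
    # Placeholder: write the result to a target store.
    pass


# Hypothetical DAG: three independent, idempotent tasks combined into a graph.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The >> operator defines the edges of the graph: extract -> transform -> load.
    extract_task >> transform_task >> load_task
```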
40 | It is important to remember, when designing Airflow operators, that they can be run more than once. Each task should be idempotent, i.e. capable of being performed multiple times without causing unintended consequences. 41 | 42 | [Table of Contents](#Apache-Airflow) 43 | 44 | ## Explain Airflow Architecture and its components? 45 | There are four major components to Airflow. 46 | 47 | [Architecture : -> ](https://medium.com/@bageshwar.kumar/airflow-architecture-a-deep-dive-into-data-pipeline-orchestration-217dd2dbc1c3) 48 | 49 | + Webserver 50 | + This is the Airflow UI, built on Flask, which provides an overview of the overall health of various DAGs and helps visualise the components and states of every DAG. The Web Server also allows you to manage users, roles, and different configurations for the Airflow setup. 51 | + Scheduler 52 | + Every n seconds, the scheduler walks over the DAGs and schedules the tasks to be executed. + Executor 53 | + The executor is another internal component of the scheduler. 54 | + The executors are the components that actually execute the tasks, while the Scheduler orchestrates them. Airflow has different types of executors, including SequentialExecutor, LocalExecutor, CeleryExecutor and KubernetesExecutor. People generally choose the executor that is best for their use case. 55 | + Worker 56 | + Workers are responsible for running the tasks that the executor has given them. 57 | + Metadata Database 58 | Airflow supports a wide range of metadata storage databases. This database contains information about DAGs, their runs, and other Airflow configurations such as users, roles, and connections. 59 | The Web Server shows the DAGs' states and runs from this database, and the Scheduler also updates this information in the metadata database. 60 | 61 | [Table of Contents](#Apache-Airflow) 62 |
63 | ## What are the types of Executors in Airflow? 64 | The executors are the components that actually execute the tasks, while the Scheduler orchestrates them. Airflow has different types of executors, including SequentialExecutor, LocalExecutor, CeleryExecutor and KubernetesExecutor. People generally choose the executor that is best for their use case. 65 | Types of Executor: 66 | + SequentialExecutor 67 | + Only one task is executed at a time by SequentialExecutor. The scheduler and the workers both use the same machine. 68 | + LocalExecutor 69 | + LocalExecutor is the same as the SequentialExecutor, except it can run multiple tasks at a time. 70 | + CeleryExecutor 71 | + Celery is a Python framework for running distributed asynchronous tasks. 72 | As a result, CeleryExecutor has long been a part of Airflow, even before Kubernetes. 73 | CeleryExecutor has a fixed number of workers on standby to take on tasks when they become available. 74 | + KubernetesExecutor 75 | + Each task is run by KubernetesExecutor in its own Kubernetes pod. Unlike Celery, it spins up worker pods on demand, allowing for the most efficient use of resources. 76 | 77 | [Table of Contents](#Apache-Airflow) 78 |
79 | ## What are the pros and cons of SequentialExecutor? 80 | Pros: 81 | + It's simple and straightforward to set up. 82 | + It's a good way to test DAGs while they're being developed. 83 | Cons: 84 | + It isn't scalable. 85 | + It is not possible to perform many tasks at the same time. 86 | + Unsuitable for use in production. 87 | 88 | [Table of Contents](#Apache-Airflow) 89 | 90 | ## What are the pros and cons of LocalExecutor? 91 | Pros: 92 | + Able to perform multiple tasks.
93 | + Can be used to run DAGs during development. 94 | Cons: 95 | + It isn't scalable. 96 | + There is a single point of failure. 97 | + Unsuitable for use in production. 98 | 99 | [Table of Contents](#Apache-Airflow) 100 |
101 | ## What are the pros and cons of CeleryExecutor? 102 | Pros: 103 | + It allows for scalability. 104 | + Celery is responsible for managing the workers. In the case of a failure, Celery creates a new one. 105 | Cons: 106 | + Celery requires RabbitMQ/Redis for task queuing, which is redundant with what Airflow already supports. 107 | + The setup is also complicated due to the above-mentioned dependencies. 108 | 109 | [Table of Contents](#Apache-Airflow) 110 |
111 | ## What are the pros and cons of KubernetesExecutor? 112 | Pros: 113 | + It combines the benefits of CeleryExecutor and LocalExecutor in terms of scalability and simplicity. 114 | + Fine-grained control over task-allocation resources: at the task level, the amount of CPU/memory needed can be configured. 115 | Cons: 116 | + Kubernetes support in Airflow is newer, and the documentation is complicated. 117 | 118 | [Table of Contents](#Apache-Airflow) 119 |
120 | ## How to define a workflow in Airflow? 121 | Python files are used to define workflows. 122 | DAG (Directed Acyclic Graph) 123 | The DAG Python class in Airflow allows you to generate a Directed Acyclic Graph, which is a representation of the workflow. 124 | from airflow.models import DAG 125 | from airflow.utils.dates import days_ago 126 | 127 | args = { 128 | 'start_date': days_ago(0), 129 | } 130 | 131 | dag = DAG( 132 | dag_id='bash_operator_example', 133 | default_args=args, 134 | schedule_interval='* * * * *', 135 | ) 136 | You can use the start date to launch a task on a specific date. 137 | The schedule interval specifies how often each workflow is scheduled to run. '* * * * *' indicates that the tasks must run every minute. 138 | 139 | [Table of Contents](#Apache-Airflow) 140 |
141 | ## Understanding Cron Expression in Airflow 142 | 143 | The expression `schedule_interval='30 8 * * 1-5'` is a **cron expression** used in Airflow (and Unix-like systems) to define a specific schedule for running tasks. Here's a detailed breakdown: 144 | 145 | ## Cron Expression Structure 146 | 147 | A cron expression is composed of 5 fields separated by spaces: 148 | 149 | | Field | Position | Allowed Values | Description | 150 | |---------------|----------|-------------------------|----------------------------------| 151 | | **Minute** | 1 | `0-59` | The minute of the hour | 152 | | **Hour** | 2 | `0-23` | The hour of the day | 153 | | **Day of Month** | 3 | `1-31` | The day of the month | 154 | | **Month** | 4 | `1-12` or `JAN-DEC` | The month | 155 | | **Day of Week** | 5 | `0-6` or `SUN-SAT` | The day of the week (0 = Sunday)| 156 | 157 | ## Detailed Explanation of `30 8 * * 1-5` 158 | 159 | 1. **`30` (Minute)**: 160 | - The task will run at the **30th minute** of the hour. 161 | - Example: If the hour is `8`, the task will execute at `08:30`. 162 | 163 | 2. **`8` (Hour)**: 164 | - The task will run during the **8th hour of the day**. 165 | - Example: It will execute at `08:30 AM`. 166 | 167 | 3. **`*` (Day of Month)**: 168 | - The asterisk (`*`) means "every day of the month." 169 | - Example: It doesn't matter whether it's the 1st, 15th, or 30th. 170 | 171 | 4. **`*` (Month)**: 172 | - The asterisk (`*`) means "every month." 173 | - Example: It will run in January, February, and so on. 174 | 175 | 5. 
**`1-5` (Day of Week)**: 176 | - The range `1-5` means the task will run on **Monday to Friday**. 177 | - Example: It skips weekends (Saturday and Sunday). 178 | 179 | ## When Will This Schedule Trigger? 180 | 181 | This cron expression means: 182 | - **Time**: 8:30 AM. 183 | - **Days**: Monday through Friday. 184 | - **Frequency**: Daily (only on weekdays). 185 | 186 | ## Examples of Trigger Dates 187 | Assuming the current date is January 2025: 188 | - Monday, January 6, 2025, at 08:30 AM. 189 | - Tuesday, January 7, 2025, at 08:30 AM. 190 | - Wednesday, January 8, 2025, at 08:30 AM. 191 | - (And so on for all weekdays...) 192 | 193 | ## Real-World Use Case 194 | 195 | You might use this schedule for tasks that should only run during business hours on workdays, such as: 196 | - Sending daily reports to a team. 197 | - Updating a database with data from the previous day. 198 | - Running data pipelines during non-peak times. 199 | 200 | ## How do you make the module available to airflow if you're using Docker Compose? 201 | If we are using Docker Compose, then we will need to use a custom image with our own additional dependencies in order to make the module available to Airflow. Refer to the following Airflow Documentation for reasons why we need it and how to do it. 202 | 203 | [Table of Contents](#Apache-Airflow) 204 | 205 | ## How to schedule DAG in Airflow? 206 | DAGs could be scheduled by passing a timedelta or a cron expression (or one of the @ presets), which works well enough for DAGs that need to run on a regular basis, but there are many more use cases that are presently difficult to express "natively" in Airflow, or that require some complicated workarounds. You can refer Airflow Improvements Proposals (AIP). 207 | Simply use the following command to start a scheduler: 208 | + airflow scheduler 209 | 210 | [Table of Contents](#Apache-Airflow) 211 | 212 | ## What is XComs In Airflow? 213 | XCom (short for cross-communication) are messages that allow data to be sent between tasks. The key, value, timestamp, and task/DAG id are all defined. 214 | 215 | [Table of Contents](#Apache-Airflow) 216 | 217 | ## What is xcom_pull in XCom Airflow? 218 | The xcom push and xcom pull methods on Task Instances are used to explicitly "push" and "pull" XComs to and from their storage. Whereas if do xcom push parameter is set to True (as it is by default), many operators and @task functions will auto-push their results into an XCom key named return value. 219 | If no key is supplied to xcom pull, it will use this key by default, allowing you to write code like this: 220 | Pulls the return_value XCOM from "pushing_task" 221 | value = task_instance.xcom_pull(task_ids='pushing_task') 222 | 223 | [Table of Contents](#Apache-Airflow) 224 | 225 | ## What is Jinja templates? 226 | Jinja is a templating engine that is quick, expressive, and extendable. The template has special placeholders that allow you to write code that looks like Python syntax. After that, data is passed to the template in order to render the final document. 227 | 228 | [Table of Contents](#Apache-Airflow) 229 | 230 | ## How to use Airflow XComs in Jinja templates? 
231 | We can use XComs in Jinja templates as given below: 232 | + SELECT * FROM {{ task_instance.xcom_pull(task_ids='foo', key='table_name') }} 233 | 234 | [Table of Contents](#Apache-Airflow) 235 | -------------------------------------------------------------------------------- /content/avro.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | # Apache Avro 4 | + [What is Apache Avro?](#What-is-Apache-Avro) 5 | + [State some Key Points about Apache Avro?](#State-some-Key-Points-about-Apache-Avro) 6 | + [What Avro Offers?](#What-Avro-Offers) 7 | + [Who is intended audience to Learn Avro?](#Who-is-intended-audience-to-Learn-Avro) 8 | + [What are prerequisites to learn Avro?](#What-are-prerequisites-to-learn-Avro) 9 | + [Explain Avro schemas?](#Explain-Avro-schemas) 10 | + [Explain Thrift and Protocol Buffers and Avro?](#Explain-Thrift-and-Protocol-Buffers-and-Avro) 11 | + [Why Avro?](#Why-Avro) 12 | + [How to use Avro?](#How-to-use-Avro) 13 | + [Name some primitive types of Data Types which Avro Supports.](#Name-some-primitive-types-of-Data-Types-which-Avro-Supports) 14 | + [Name some complex types of Data Types which Avro Supports.](#Name-some-complex-types-of-Data-Types-which-Avro-Supports) 15 | + [What are best features of Apache Avro?](#What-are-best-features-of-Apache-Avro) 16 | + [Explain some advantages of Avro.](#Explain-some-advantages-of-Avro) 17 | + [Explain some disadvantages of Avro.](#Explain-some-disadvantages-of-Avro) 18 | + [What do you mean by schema declaration?](#What-do-you-mean-by-schema-declaration) 19 | + [Explain the term Serialization?](#Explain-the-term-Serialization) 20 | + [What do you mean by Schema Resolution?](#What-do-you-mean-by-Schema-Resolution) 21 | + [Explain the Avro Sasl profile?](#Explain-the-Avro-Sasl-profile) 22 | + [What is the way of creating Avro Schemas?](#What-is-the-way-of-creating-Avro-Schemas) 23 | + [Name some Avro Reference Apis?](#Name-some-Avro-Reference-Apis) 24 | + [When to use Avro?](#When-to-use-Avro) 25 | + [Explain sort order in brief?](#Explain-sort-order-in-brief) 26 | + [What is the advantage of Hadoop Over Java Serialization?](#What-is-the-advantage-of-Hadoop-Over-Java-Serialization) 27 | + [What are the disadvantages of Hadoop Serialization?](#What-are-the-disadvantages-of-Hadoop-Serialization) 28 | 29 | ## What is Apache Avro? 30 | An open source project which offers data serialization as well as data exchange services for Apache Hadoop is what we call Apache Avro. It is possible to use these services together or independently both. However, programs can efficiently serialize data into files or into messages, with the serialization service. In addition, data storage is very compact and efficient in Avo because here data definition is in JSON, so, data itself is stored in the binary format making it compact and efficient. 31 | 32 | [Table of Contents](#Apache-Avro) 33 | 34 | ## State some Key Points about Apache Avro? 35 | Some key points are: 36 | + Avro is a Data serialization system 37 | + It uses JSON based schemas 38 | + Moreover, to send data, it uses RPC calls. 39 | + And, during data exchange, Schema’s sent. 40 | 41 | [Table of Contents](#Apache-Avro) 42 | 43 | ## What Avro Offers? 44 | Avro offers: 45 | + Avro offers Rich data structures. 46 | + And, a compact, fast, binary data format. 47 | + Further, it offers a container file, to store persistent data. 48 | + Remote procedure calls (RPC). 
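To make the serialization and container-file items above concrete, here is a minimal sketch in Python, assuming the `avro` package (1.10+) is installed; the `User` schema and the `users.avro` file name are invented for illustration:

```python
import json

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# The data definition (schema) is plain JSON; the data itself is stored in compact binary form.
schema = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}))

# Serialize a couple of records into an Avro container file.
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alice", "age": 30})
writer.append({"name": "Bob", "age": 25})
writer.close()

# Deserialize: the schema travels inside the container file,
# so the reader needs no prior knowledge of it.
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
```

Because the writer embeds the JSON schema in the container file, any reader can deserialize the records later without being supplied the schema separately.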
49 | 50 | [Table of Contents](#Apache-Avro) 51 | 52 | ## Who is intended audience to Learn Avro? 53 | Those people who want to learn the basics of Big Data Analytics by using Hadoop Framework and also those who aspire to become a successful Hadoop developer can go for Avro. Further, those aspirants who want to use Avro for data serialization and deserialization can also learn Avro. 54 | 55 | [Table of Contents](#Apache-Avro) 56 | 57 | ## What are prerequisites to learn Avro? 58 | Those who want to learn Avro must know Hadoop’s architecture and APIs, before learning Avro. Also, must know Java with experience in writing basic applications before going for Avro. 59 | 60 | [Table of Contents](#Apache-Avro) 61 | 62 | ## Explain Avro schemas? 63 | Mainly, Avro heavily depends on its schema. Basically, it permits every data to be written with no prior knowledge of the schema. We can say Avro serialized fast and the data resulting after serialization is least in size with schemas. 64 | 65 | [Table of Contents](#Apache-Avro) 66 | 67 | ## Explain Thrift and Protocol Buffers and Avro? 68 | The most competent libraries with Avro are Thrift and Protocol Buffers. 69 | The difference between them is: − As per the need, Avro supports both dynamic and static types. Basically, to specify schemas and their types, Protocol Buffers and Thrift uses Interface Definition Languages (IDLs). 70 | As Avro is built in the Hadoop ecosystem but Thrift and Protocol Buffers are not. 71 | 72 | [Table of Contents](#Apache-Avro) 73 | 74 | ## Why Avro? 75 | Some features where Avro differs from other systems are: 76 | + Dynamic typing. 77 | + Untagged data. 78 | + No manually-assigned field IDs. 79 | 80 | [Table of Contents](#Apache-Avro) 81 | 82 | ## How to use Avro? 83 | The workflow to use Avro is:− We need to create schemas at first to read the schemas into our program that is possible in two ways. 84 | Generating a Class Corresponding to Schema Using Parsers Library 85 | Then perform the serialization by using serialization API provided for Avro. And then perform deserialization by using deserialization API provided for Avro. 86 | 87 | [Table of Contents](#Apache-Avro) 88 | 89 | ## Name some primitive types of Data Types which Avro Supports. 90 | Avro supports a wide range of Primitive datatypes: 91 | + Null: no value 92 | + Boolean: a binary value 93 | + Int: 32-bit signed integer 94 | + Long: 64-bit signed integer 95 | + Float: single precision (32-bit) IEEE 754 floating-point number 96 | + Double: double precision (64-bit) IEEE 754 floating-point number 97 | + Bytes: the sequence of 8-bit unsigned bytes 98 | + String: Unicode character sequence 99 | 100 | [Table of Contents](#Apache-Avro) 101 | 102 | ## Name some complex types of Data Types which Avro Supports. 103 | There are six kinds of complex types which Avro supports: 104 | + Records 105 | + Enums 106 | + Arrays 107 | + Maps 108 | + Unions 109 | + Fixed 110 | 111 | [Table of Contents](#Apache-Avro) 112 | 113 | ## What are best features of Apache Avro? 114 | Some of the best features of Avro are: 115 | + Schema evolution 116 | + Untagged data 117 | + Language support 118 | + Transparent compression 119 | + Dynamic typing 120 | + Native support in MapReduce 121 | + Rich data structures 122 | 123 | [Table of Contents](#Apache-Avro) 124 | 125 | ## Explain some advantages of Avro. 126 | Pros of Avro are: 127 | + The Smallest Size. 128 | + It Compresses block at a time; split table. 129 | + Maintained Object structure. 
130 | + Also, supports reading old data w/ new schema. 131 | 132 | [Table of Contents](#Apache-Avro) 133 | 134 | ## Explain some disadvantages of Avro. 135 | Cons of Avro are: 136 | + It must use .NET 4.5, in the case of C# Avro, to make the best use of it. 137 | + Potentially slower serialization. 138 | + In order to read/write data, need a schema. 139 | 140 | [Table of Contents](#Apache-Avro) 141 | 142 | ## What do you mean by schema declaration? 143 | In JSON, a Schema is represented by one of: 144 | + A JSON string 145 | + A JSON object 146 | + {“type”: “typename” …attributes…} 147 | + A JSON array 148 | 149 | [Table of Contents](#Apache-Avro) 150 | 151 | ## Explain the term Serialization? 152 | To transport the data over the network or to store on some persistent storage, the process of translating data structures or objects state into binary or textual form is what we call Serialization. In other words, serialization is also known as as marshaling and deserialization is known as unmarshalling. 153 | 154 | [Table of Contents](#Apache-Avro) 155 | 156 | ## What do you mean by Schema Resolution? 157 | Whether from an RPC or a file, a reader of Avro data, can always parse that data since its schema is offered. Yet it is possible that schema may not be exactly the schema what we expect so for that purpose we use Schema Resolution. 158 | 159 | [Table of Contents](#Apache-Avro) 160 | 161 | ## Explain the Avro Sasl profile? 162 | Basically, SASL offers a framework for authentication and security of network protocols. In Avro also we use SASL Profile for authentication and security purpose. 163 | 164 | [Table of Contents](#Apache-Avro) 165 | 166 | ## What is the way of creating Avro Schemas? 167 | In the format “lightweight text-based data interchange”, JavaScript Object Notation (JSON), the Avro schema gets created. 168 | We can make it in various ways: 169 | + A JSON string 170 | + JSON object 171 | + A JSON array 172 | 173 | [Table of Contents](#Apache-Avro) 174 | 175 | ## Name some Avro Reference Apis? 176 | The classes and methods which we use in the serialization, as well as deserialization of Avro schemas, are: 177 | + Specific Datum Writer Class 178 | + Specific Datum Reader Class 179 | + Data File Writer 180 | + Data File Reader 181 | + Class Schema. Parser 182 | + Interface Generic Record 183 | + Class Generic Data. Record 184 | 185 | [Table of Contents](#Apache-Avro) 186 | 187 | ## When to use Avro? 188 | Mainly, for two purposes, we use Avro, like: 189 | + Data serialization 190 | + RPC (Remote procedure call) protocol 191 | Although, some key points are: 192 | We are able to read the data from disk with applications, by using Avro even which are written in other languages besides java or the JVM. 193 | Also, Avro allows us to transfer data across a remote system without any overhead of java serialization. 194 | We use Avro when we need to store the large set of data on disk, as it conserves space. 195 | Further, by using Avro for RPC, we get a better remote data transfer. 196 | 197 | [Table of Contents](#Apache-Avro) 198 | 199 | ## Explain sort order in brief? 200 | There is a standard sort order for data in Avro which allows data written by one system to be efficiently sorted by another system. As sort order comparisons are sometimes the most frequent per-object operation, it can be an important optimization. 201 | 202 | [Table of Contents](#Apache-Avro) 203 | 204 | ## What is the advantage of Hadoop Over Java Serialization? 
205 | As with the help of the Writable objects, Hadoop Writable-based serialization is able to reduce object-creation overhead, which is not possible with the Java native serialization framework that’s why using Hadoop one is an advantage. 206 | 207 | [Table of Contents](#Apache-Avro) 208 | 209 | ## What are the disadvantages of Hadoop Serialization? 210 | The only disadvantage of Hadoop Serialization is that the Writable and Sequence Files have only a Java API. Hence, to solve this issue, Avro comes in picture. 211 | 212 | [Table of Contents](#Apache-Avro) -------------------------------------------------------------------------------- /content/azure.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | # Azure 4 | + [What are the three Main Components of Windows Azure Platform?](#What-are-the-three-Main-Components-of-Windows-Azure-Platform) 5 | + [What are the Service Model in Cloud Computing?](#What-are-the-Service-Model-in-Cloud-Computing) 6 | + [How many Types of Deployment Models are used in Cloud?](#How-many-Types-of-Deployment-Models-are-used-in-Cloud) 7 | + [What is Windows Azure Platform?](#What-is-Windows-Azure-Platform) 8 | + [What are the Roles Available in Windows Azure?](#What-are-the-Roles-Available-in-Windows-Azure) 9 | + [What is difference between Windows Azure Platform and Windows Azure?](#What-is-difference-between-Windows-Azure-Platform-and-Windows-Azure) 10 | + [What are the three Types of Roles in Compute Component in Windows Azure?](#What-are-the-three-Types-of-Roles-in-Compute-Component-in-Windows-Azure) 11 | + [What is Windows Azure Compute Emulator?](#What-is-Windows-Azure-Compute-Emulator) 12 | + [What is Fabric?](#What-is-Fabric) 13 | + [How many instances of a Role should be deployed to Satisfy Azure Sla?](#How-many-instances-of-a-Role-should-be-deployed-to-Satisfy-Azure-Sla) 14 | + [What are the options to manage Session State in Windows Azure?](#What-are-the-options-to-manage-Session-State-in-Windows-Azure) 15 | + [What is Cspack?](#What-is-Cspack) 16 | + [What is Csrun?](#What-is-Csrun) 17 | + [What is Guest Os?](#What-is-Guest-Os) 18 | + [How to programmatically Scale Out Azure Worker Role Instances?](#How-to-programmatically-Scale-Out-Azure-Worker-Role-Instances) 19 | + [What is Web Role in Windows Azure?](#What-is-Web-Role-in-Windows-Azure) 20 | + [What is the difference between Public Cloud and Private Cloud?](#What-is-the-difference-between-Public-Cloud-and-Private-Cloud) 21 | + [What is Windows Azure Diagnostics?](#What-is-Windows-Azure-Diagnostics) 22 | + [What is Blob?](#What-is-Blob) 23 | + [What is the difference between Block Blob Vs Page Blob?](#What-is-the-difference-between-Block-Blob-Vs-Page-Blob) 24 | + [What is the difference between Windows Azure Queues and Windows Azure Service Bus Queues?](#What-is-the-difference-between-Windows-Azure-Queues-and-Windows-Azure-Service-Bus-Queues) 25 | + [What is Deadletter Queue?](#What-is-Deadletter-Queue) 26 | + [What are Instance Sizes of Azure?](#What-are-Instance-Sizes-of-Azure) 27 | + [ What is Table Storage in Windows Azure?](#What-is-Table-Storage-in-Windows-Azure) 28 | + [Difference between Web and Worker Roles in Windows Azure?](#Difference-between-Web-and-Worker-Roles-in-Windows-Azure) 29 | + [What is Azure Fabric Controller?](#What-is-Azure-Fabric-Controller) 30 | + [What is Autoscaling?](#What-is-Autoscaling) 31 | + [What is Vm Role in Windows Azure?](#What-is-Vm-Role-in-Windows-Azure) 
32 | + [Apart from dotnet Framework please name other three language framework that can be used to Develop Windows Azure Applications?](#Apart-from-dotnet-Framework-please-name-other-three-language-framework-that-can-be-used-to-Develop-Windows-Azure-Applications) 33 | + [How would you categorize Windows Azure?](#How-would-you-categorize-Windows-Azure) 34 | + [What is Azure Cloud Service?](#What-is-Azure-Cloud-Service) 35 | + [What is Cloud Service Role?](#What-is-Cloud-Service-Role) 36 | + [What is Link Resource?](#What-is-Link-Resource) 37 | + [What is Scale Cloud Service?](#What-is-Scale-Cloud-Service) 38 | + [What is Web Role ?](#What-is-Web-Role) 39 | + [What is Worker Role ?](#What-is-Worker-Role) 40 | + [ What is Role Instance ?](#What-is-Role-Instance) 41 | + [What is Guest Operating System ?](#What-is-Guest-Operating-System) 42 | + [What is Cloud Service Components?](#What-is-Cloud-Service-Components) 43 | + [What is Deployment Environments?](#What-is-Deployment-Environments) 44 | + [What is Swap Deployments?](#What-is-Swap-Deployments) 45 | + [What is Minimal Vs Verbose Monitoring?](#What-is-Minimal-Vs-Verbose-Monitoring) 46 | + [What is Service Definition File?](#What-is-Service-Definition-File) 47 | + [What is Service Configuration File?](#What-is-Service-Configuration-File) 48 | + [What is Service Package ?](#What-is-Service-Package) 49 | + [What is Cloud Service Deployment ?](#What-is-Cloud-Service-Deployment) 50 | + [What is Azure Diagnostics ?](#What-is-Azure-Diagnostics) 51 | + [What is Azure Service Level Agreement?](#What-is-Azure-Service-Level-Agreement) 52 | ## What are the three Main Components of Windows Azure Platform? 53 | + Compute 54 | + Storage 55 | + AppFabric 56 | 57 | [Table of Contents](#Azure) 58 | 59 | ## What are the Service Model in Cloud Computing? 60 | Cloud computing providers offer their services according to three fundamental models: Infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS) where IaaS is the most basic and each higher model abstracts from the details of the lower models. 61 | Examples of IaaS include: Amazon CloudFormation (and underlying services such as Amazon EC2), Rackspace Cloud, Terremark, Windows Azure Virtual Machines, Google Compute Engine. and Joyent. 62 | Examples of PaaS include: Amazon Elastic Beanstalk, Cloud Foundry, Heroku, Force.com, EngineYard, Mendix, Google App Engine, Windows Azure Compute and OrangeScape. 63 | Examples of SaaS include: Google Apps, Microsoft Office 365, and Onlive. 64 | 65 | [Table of Contents](#Azure) 66 | 67 | ## How many Types of Deployment Models are used in Cloud? 68 | There are 4 types of deployment models used in cloud: 69 | + Public cloud 70 | + Private cloud 71 | + Community cloud 72 | + Hybrid cloud 73 | 74 | [Table of Contents](#Azure) 75 | 76 | ## What is Windows Azure Platform? 77 | A collective name of Microsoft’s Platform as a Service (PaaS) offering which provides a programming platform, a deployment vehicle, and a runtime environment of cloud computing hosted in Microsoft datacenters. 78 | 79 | [Table of Contents](#Azure) 80 | 81 | ## What are the Roles Available in Windows Azure? 82 | All three roles (web, worker, VM) are essentially Windows Server 2008. Web and Worker roles are nearly identical: With Web and Worker roles, the OS and related patches are taken care for you; you build your app’s components without having to manage a VM. 
83 | 84 | [Table of Contents](#Azure) 85 | 86 | ## What is difference between Windows Azure Platform and Windows Azure? 87 | The former is Microsoft’s PaaS offering including Windows Azure, SQL Azure, and Appfabric; while the latter is part of the offering and the Microsoft’s cloud OS. 88 | 89 | [Table of Contents](#Azure) 90 | 91 | ## What are the three Types of Roles in Compute Component in Windows Azure? 92 | + WEB 93 | + Worker 94 | + VM 95 | 96 | [Table of Contents](#Azure) 97 | 98 | ## What is Windows Azure Compute Emulator? 99 | The compute emulator is a local emulator of Windows Azure that you can use to build and test your application before deploying it to Windows Azure. 100 | 101 | [Table of Contents](#Azure) 102 | 103 | ## What is Fabric? 104 | In the Windows Azure cloud fabric is nothing but a combination of many virtualized instances which run client application. 105 | 106 | [Table of Contents](#Azure) 107 | 108 | ## How many instances of a Role should be deployed to Satisfy Azure Sla? 109 | TWO. And if we do so, the role would have external connectivity at least 99.95% of the time. 110 | 111 | [Table of Contents](#Azure) 112 | 113 | ## What are the options to manage Session State in Windows Azure? 114 | ► Windows Azure Caching 115 | ► SQL Azure 116 | ► Azure Table 117 | 118 | [Table of Contents](#Azure) 119 | 120 | ## What is Cspack? 121 | It is a command-line tool that generates a service package file (.cspkg) and prepares an application for deployment, either to Windows Azure or to the compute emulator. 122 | 123 | [Table of Contents](#Azure) 124 | 125 | ## What is Csrun? 126 | It is a command-line tool that deploys a packaged application to the Windows Azure compute emulator and manages the running service. 127 | 128 | [Table of Contents](#Azure) 129 | 130 | ## What is Guest Os? 131 | It is the operating system that runs on the virtual machine that hosts an instance of a role. 132 | 133 | [Table of Contents](#Azure) 134 | 135 | ## How to programmatically Scale Out Azure Worker Role Instances? 136 | Using AutoScaling Application Block. 137 | 138 | [Table of Contents](#Azure) 139 | 140 | ## What is Web Role in Windows Azure? 141 | Web roles in Windows Azure are special purpose, and provide a dedicated Internet Information Services (IIS) web-server used for hosting front-end web applications. You can quickly and easily deploy web applications to Web Roles and then scale your Compute capabilities up or down to meet demand. 142 | 143 | [Table of Contents](#Azure) 144 | 145 | ## What is the difference between Public Cloud and Private Cloud? 146 | Public cloud is used as a service via Internet by the users, whereas a private cloud, as the name conveys is deployed within certain boundaries like firewall settings and is completely managed and monitored by the users working on it in an organization. 147 | 148 | [Table of Contents](#Azure) 149 | 150 | ## What is Windows Azure Diagnostics? 151 | Windows Azure Diagnostics enables you to collect diagnostic data from an application running in Windows Azure. You can use diagnostic data for debugging and troubleshooting, measuring performance, monitoring resource usage, traffic analysis and capacity planning, and auditing. 152 | 153 | [Table of Contents](#Azure) 154 | 155 | ## What is Blob? 156 | BLOB stands for Binary Large Object. Blob is file of any type and size. 157 | The Azure Blob Storage offers two types of blobs: 158 | 1. Block Blob 159 | 2. 
Page Blob 160 | URL format: Blobs are addressable using the following URL format: 161 | http://.blob.aaa.windows.net// 162 | 163 | [Table of Contents](#Azure) 164 | 165 | ## What is the difference between Block Blob Vs Page Blob? 166 | Block blobs are comprised of blocks, each of which is identified by a block ID. 167 | You create or modify a block blob by uploading a set of blocks and committing them by their block IDs. 168 | If you are uploading a block blob that is no more than 64 MB in size, you can also upload it in its entirety with a single Put Blob operation. -Each block can be a maximum of 4 MB in size. The maximum size for a block blob in version 2009-09-19 is 200 GB, or up to 50,000 blocks. 169 | Page blobs are a collection of pages. A page is a range of data that is identified by its offset from the start of the blob. To create a page blob, you initialize the page blob by calling Put Blob and specifying its maximum size. 170 | -The maximum size for a page blob is 1 TB. A page written to a page blob may be up to 1 TB in size. 171 | what to use block blobs for: streaming video. “The application must provide random read/write access” which is supported by Page Blobs 172 | 173 | [Table of Contents](#Azure) 174 | 175 | ## What is the difference between Windows Azure Queues and Windows Azure Service Bus Queues? 176 | Windows Azure supports two types of queue mechanisms: 177 | Windows Azure Queues and Service Bus Queues . 178 | Windows Azure Queues: which are part of the Windows Azure storage infrastructure, feature a simple REST-based Get/Put/Peek interface, providing reliable, persistent messaging within and between services. 179 | Service Bus Queues: are part of a broader Windows Azure messaging infrastructure that supports queuing as well as publish/subscribe, Web service remoting, and integration patterns. 180 | 181 | [Table of Contents](#Azure) 182 | 183 | ## What is Deadletter Queue? 184 | Messages are placed on the deadletter sub-queue by the messaging system in the following scenarios. 185 | ► When a message expires and deadlettering for expired messages is set to true in a queue or subscription. 186 | ► When the max delivery count for a message is exceeded on a queue or subscription. 187 | ► When a filter evaluation exception occurs in a subscription and deadlettering is enabled on filter evaluation exceptions. 188 | 189 | [Table of Contents](#Azure) 190 | 191 | ## What are Instance Sizes of Azure? 192 | Windows Azure will handle the load balancing for all the instances that are created. The VM sizes are as follows: 193 | Compute Instance Size CPU Memory Instance Storage I/O Performance 194 | ► Extra Small 1.0 Ghz 768 MB 20 GB Low 195 | ► Small 1.6 GHz 1.75 GB 225 GB Moderate 196 | ► Medium 2 x 1.6 GHz 3.5 GB 490 GB High 197 | ► Large 4 x 1.6 GHz 7 GB 1,000 GB High 198 | ► Extra large 8 x 1.6 GHz 14 GB 2,040 GB High 199 | 200 | [Table of Contents](#Azure) 201 | 202 | ## What is Table Storage in Windows Azure? 203 | The Windows Azure Table storage service stores large amounts of structured data. 204 | The service is a NoSQL datastore which accepts authenticated calls from inside and outside the Windows Azure cloud. 205 | Windows Azure tables are ideal for storing structured, non-relational data 206 | Table: A table is a collection of entities. Tables don’t enforce a schema on entities, which means a single table can contain entities that have different sets of properties. 
An account can contain many tables 207 | Entity: An entity is a set of properties, similar to a database row. An entity can be up to 1MB in size. 208 | Properties: A property is a name-value pair. Each entity can include up to 252 properties to store data. Each entity also has 3 system properties that specify a partition key, a row key, and a timestamp. 209 | Entities with the same partition key can be queried more quickly, and inserted/updated in atomic operations. An entity’s row key is its unique identifier within a partition. 210 | 211 | [Table of Contents](#Azure) 212 | 213 | ## Difference between Web and Worker Roles in Windows Azure? 214 | The main difference between the two is that an instance of a web role runs IIS, while an instance of a worker role does not. Both are managed in the same way, however, and it’s common for an application to use both.For example, a web role instance might accept requests from users, then pass them to a worker role instance for processing. 215 | 216 | [Table of Contents](#Azure) 217 | 218 | ## What is Azure Fabric Controller? 219 | The Windows Azure Fabric Controller is a resource provisioning and management layer that manages the hardware, and provides resource allocation, deployment/upgrade, and management for cloud services on the Windows Azure platform. 220 | 221 | [Table of Contents](#Azure) 222 | 223 | ## What is Autoscaling? 224 | Scaling by adding additional instances is often referred to as scaling out. Windows Azure also supports scaling up by using larger role instances instead of more role instances. 225 | By adding and removing role instances to your Windows Azure application while it is running, you can balance the performance of the application against its running costs. 226 | An autoscaling solution reduces the amount of manual work involved in dynamically scaling an application. 227 | 228 | [Table of Contents](#Azure) 229 | 230 | ## What is Vm Role in Windows Azure? 231 | Virtual Machine (VM) roles, now in Beta, enable you to deploy a custom Windows Server 2008 R2 (Enterprise or Standard) image to Windows Azure. You can use the VM role when your application requires a large number of server OS customizations and cannot be automated. The VM Role gives you full control over your application environment and lets you migrate existing applications to the cloud. 232 | 233 | [Table of Contents](#Azure) 234 | 235 | ## Apart from dotnet Framework please name other three language framework that can be used to Develop Windows Azure Applications? 236 | php, node.js, java 237 | 238 | [Table of Contents](#Azure) 239 | 240 | ## How would you categorize Windows Azure? 241 | PaaS (Platform as a Service) 242 | 243 | [Table of Contents](#Azure) 244 | 245 | ## What is Azure Cloud Service? 246 | By creating a cloud service, you can deploy a multi-tier web application in Azure, defining multiple roles to distribute processing and allow flexible scaling of your application. A cloud service consists of one or more web roles and/or worker roles, each with its own application files and configuration. Azure Websites and Virtual Machines also enable web applications on Azure. The main advantage of cloud services is the ability to support more complex multi-tier architectures 247 | 248 | [Table of Contents](#Azure) 249 | 250 | ## What is Cloud Service Role? 251 | A cloud service role is comprised of application files and a configuration. A cloud service can have two types of role. 252 | 253 | [Table of Contents](#Azure) 254 | 255 | ## What is Link Resource? 
256 | To show your cloud service’s dependencies on other resources, such as an Azure SQL Database instance, you can “link” the resource to the cloud service. In the Preview Management Portal, you can view linked resources on the Linked Resources page, view their status on the dashboard, and scale a linked SQL Database instance along with the service roles on the Scale page. Linking a resource in this sense does not connect the resource to the application; you must configure the connections in the application code. 257 | 258 | [Table of Contents](#Azure) 259 | 260 | ## What is Scale Cloud Service? 261 | A cloud service is scaled out by increasing the number of role instances (virtual machines) deployed for a role. A cloud service is scaled in by decreasing role instances. In the Preview Management Portal, you can also scale a linked SQL Database instance, by changing the SQL Database edition and the maximum database size, when you scale your service roles. 262 | 263 | [Table of Contents](#Azure) 264 | 265 | ## What is Web Role ? 266 | A web role provides a dedicated Internet Information Services (IIS) web-server used for hosting front-end web applications. 267 | 268 | [Table of Contents](#Azure) 269 | 270 | ## What is Worker Role ? 271 | Applications hosted within worker roles can run asynchronous, long-running or perpetual tasks independent of user interaction or input. 272 | 273 | [Table of Contents](#Azure) 274 | 275 | ## What is Role Instance ? 276 | A role instance is a virtual machine on which the application code and role configuration run. A role can have multiple instances, defined in the service configuration file. 277 | 278 | [Table of Contents](#Azure) 279 | 280 | ## What is Guest Operating System ? 281 | The guest operating system for a cloud service is the operating system installed on the role instances (virtual machines) on which your application code runs. 282 | 283 | [Table of Contents](#Azure) 284 | 285 | ## What is Cloud Service Components? 286 | Three components are required in order to deploy an application as a cloud service in Azure. 287 | 288 | [Table of Contents](#Azure) 289 | 290 | ## What is Deployment Environments? 291 | Azure offers two deployment environments for cloud services: a staging environment in which you can test your deployment before you promote it to the production environment. The two environments are distinguished only by the virtual IP addresses (VIPs) by which the cloud service is accessed. In the staging environment, the cloud service’s globally unique identifier (GUID) identifies it in URLs (GUID.cloudapp.net). In the production environment, the URL is based on the friendlier DNS prefix assigned to the cloud service (for example, myservice.cloudapp.net). 292 | 293 | [Table of Contents](#Azure) 294 | 295 | ## What is Swap Deployments? 296 | To promote a deployment in the Azure staging environment to the production environment, you can “swap” the deployments by switching the VIPs by which the two deployments are accessed. After the deployment, the DNS name for the cloud service points to the deployment that had been in the staging environment. 297 | 298 | [Table of Contents](#Azure) 299 | 300 | ## What is Minimal Vs Verbose Monitoring? 301 | Minimal monitoring, which is configured by default for a cloud service, uses performance counters gathered from the host operating systems for role instances (virtual machines). 
Verbose monitoring gathers additional metrics based on performance data within the role instances to enable closer analysis of issues that occur during application processing. 302 | 303 | [Table of Contents](#Azure) 304 | 305 | ## What is Service Definition File? 306 | The cloud service definition file (.csdef) defines the service model, including the number of roles. 307 | 308 | [Table of Contents](#Azure) 309 | 310 | ## What is Service Configuration File? 311 | The cloud service configuration file (.cscfg) provides configuration settings for the cloud service and individual roles, including the number of role instances. 312 | 313 | [Table of Contents](#Azure) 314 | 315 | ## What is Service Package ? 316 | The service package (.cspkg) contains the application code and the service definition file. 317 | 318 | [Table of Contents](#Azure) 319 | 320 | ## What is Cloud Service Deployment ? 321 | A cloud service deployment is an instance of a cloud service deployed to the Azure staging or production environment. You can maintain deployments in both staging and production. 322 | 323 | [Table of Contents](#Azure) 324 | 325 | ## What is Azure Diagnostics ? 326 | Azure Diagnostics is the API that enables you to collect diagnostic data from applications running in Azure. Azure Diagnostics must be enabled for cloud service roles in order for verbose monitoring to be turned on. For more information. 327 | 328 | [Table of Contents](#Azure) 329 | 330 | ## What is Azure Service Level Agreement? 331 | The Azure Compute SLA guarantees that, when you deploy two or more role instances for every role, access to your cloud service will be maintained at least 99.95 percent of the time. Also, detection and corrective action will be initiated 99.9 percent of the time when a role instance’s process is not running. 
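As a rough back-of-the-envelope reading of that 99.95 percent figure (an illustrative calculation only, not an official Azure number):

```python
# Approximate downtime budget implied by a 99.95% monthly availability target.
minutes_per_month = 30 * 24 * 60              # 43,200 minutes in a 30-day month
allowed_downtime = (1 - 0.9995) * minutes_per_month
print(f"about {allowed_downtime:.1f} minutes of downtime per month")  # ~21.6 minutes
```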
332 | 333 | [Table of Contents](#Azure) -------------------------------------------------------------------------------- /content/delta.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | 4 | ### Will be available soon -------------------------------------------------------------------------------- /content/dynamodb.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | 4 | ### Will be available soon -------------------------------------------------------------------------------- /content/flume.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | 4 | # FLUME 5 | + [What is Flume?](#What-is-Flume) 6 | + [What is Apache Flume?](#What-is-Apache-Flume) 7 | + [Which is the reliable channel in Flume to ensure that there is no Data Loss?](#Which-is-the-reliable-channel-in-Flume-to-ensure-that-there-is-no-Data-Loss) 8 | + [How can Flume be used with Hbase?](#How-can-Flume-be-used-with-Hbase) 9 | + [What is an Agent?](#What-is-an-Agent) 10 | + [Is it possible to Leverage Real Time Analysis on the Big Data collected by Flume directly?](#Is-it-possible-to-Leverage-Real-Time-Analysis-on-the-Big-Data-collected-by-Flume-directly) 11 | + [What is a Channel?](#What-is-a-Channel) 12 | + [Explain about the different channel types in Flume and which channel type is faster?](#Explain-about-the-different-channel-types-in-Flume-and-which-channel-type-is-faster) 13 | + [Explain about the replication and multiplexing selectors in Flume?](#Explain-about-the-replication-and-multiplexing-selectors-in-Flume) 14 | + [Does Apache Flume provide support for third party Plugins?](#Does-Apache-Flume-provide-support-for-third-party-Plugins) 15 | + [Differentiate between FileSink and FileRollSink?](#Differentiate-between-FileSink-and-FileRollSink) 16 | + [Why we are using Flume?](#Why-we-are-using-Flume) 17 | + [What is Flumeng?](#What-is-Flumeng) 18 | + [What are the complicated steps in Flume configurations?](#What-are-the-complicated-steps-in-Flume-configurations) 19 | + [What are Flume core components?](#What-are-Flume-core-components) 20 | + [What are the Data Extraction Tools in Hadoop?](#What-are-the-Data-Extraction-Tools-in-Hadoop) 21 | + [Does Flume provide 100% reliability to the Data Flow?](#Does-Flume-provide-100-percents-reliability-to-the-Data-Flow) 22 | + [Tell any two Features of Flume?](#Tell-any-two-Features-of-Flume) 23 | + [What are Interceptors?](#What-are-Interceptors) 24 | + [Why Flume?](#Why-Flume) 25 | + [What is Flume Event?](#What-is-Flume-Event) 26 | + [How Multi hop agent can be setup in Flume?](#How-Multi-hop-agent-can-be-setup-in-Flume) 27 | + [Can Flume can distribute data to multiple destinations?](#Can-Flume-can-distribute-data-to-multiple-destinations) 28 | + [Can you explain about configuration files?](#Can-you-explain-about-configuration-files) 29 | + [What are the similarities and differences between Apache Flume and Apache Kafka?](#What-are-the-similarities-and-differences-between-Apache-Flume-and-Apache-Kafka) 30 | + [Explain Reliability and Failure Handling in Apache Flume?](#Explain-Reliability-and-Failure-Handling-in-Apache-Flume) 31 | 32 | ## What is Flume? 
33 | Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications. 34 | 35 | [Table of Contents](#FLUME) 36 | 37 | ## What is Apache Flume? 38 | Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. A well-known use case is how Mozilla collects and analyzes logs using Flume and Hive. 39 | Flume is a framework for populating Hadoop with data. Agents are placed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop. 40 | [Table of Contents](#FLUME) 41 | ## Which is the reliable channel in Flume to ensure that there is no Data Loss? 42 | FILE Channel is the most reliable of the three channel types: JDBC, FILE and MEMORY. 43 | 44 | [Table of Contents](#FLUME) 45 | 46 | ## How can Flume be used with Hbase? 47 | Apache Flume can be used with HBase using one of the two HBase sinks: 48 | + HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters as well as the newer HBase IPC introduced in HBase 0.96. 49 | + AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink because it makes non-blocking calls to HBase. 50 | 51 | Working of the HBaseSink: 52 | + In HBaseSink, a Flume event is converted into HBase Increments or Puts. The serializer implements HBaseEventSerializer, which is instantiated when the sink starts. For every event, the sink calls the initialize method on the serializer, which then translates the Flume event into HBase increments and puts to be sent to the HBase cluster. 53 | Working of the AsyncHBaseSink: 54 | + AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink, when it starts. The sink invokes the setEvent method and then calls the getIncrements and getActions methods, similar to the HBase sink. When the sink stops, the cleanUp method of the serializer is called. 55 | 56 | [Table of Contents](#FLUME) 57 | 58 | ## What is an Agent? 59 | A process that hosts Flume components such as sources, channels, and sinks, and thus has the ability to receive, store, and forward events to their destination. 60 | 61 | [Table of Contents](#FLUME) 62 | 63 | ## Is it possible to Leverage Real Time Analysis on the Big Data collected by Flume directly? 64 | Data from Flume can be extracted, transformed, and loaded in real time into Apache Solr servers using MorphlineSolrSink. 65 | 66 | [Table of Contents](#FLUME) 67 | 68 | ## What is a Channel? 69 | A channel stores events. Events are delivered to the channel via sources operating within the agent, and an event stays in the channel until a sink removes it for further transport. 70 | 71 | [Table of Contents](#FLUME) 72 | 73 | ## Explain about the different channel types in Flume and which channel type is faster? 74 | The three built-in channel types available in Flume are: 75 | 76 | + MEMORY Channel – Events are read from the source into memory and passed to the sink. 77 | + JDBC Channel – JDBC Channel stores the events in an embedded Derby database.
78 | + FILE Channel – File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink. 79 | 80 | MEMORY Channel is the fastest channel of the three, but it carries the risk of data loss. The channel you choose depends on the nature of the big data application and the value of each event. 81 | 82 | [Table of Contents](#FLUME) 83 | 84 | ## Explain about the replication and multiplexing selectors in Flume? 85 | Channel selectors are used to handle multiple channels. Based on the Flume header value, an event can be written to a single channel or to multiple channels. If a channel selector is not specified for the source, the Replicating selector is used by default. With the replicating selector, the same event is written to all the channels in the source’s channel list. The multiplexing channel selector is used when the application has to send different events to different channels. 86 | 87 | [Table of Contents](#FLUME) 88 | 89 | ## Does Apache Flume provide support for third party Plugins? 90 | Yes. Apache Flume has a plugin-based architecture, which lets it load data from external sources and transfer it to external destinations through third-party plugins. 91 | 92 | [Table of Contents](#FLUME) 93 | 94 | ## Differentiate between FileSink and FileRollSink? 95 | The major difference between HDFS FileSink and FileRollSink is that HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS), whereas File Roll Sink stores the events on the local file system. 96 | 97 | [Table of Contents](#FLUME) 98 | 99 | ## Why we are using Flume? 100 | Hadoop developers most often use Flume to get data from sources such as social media sites. It was developed by Cloudera for aggregating and moving very large amounts of data. The primary use is to gather log files from different sources and asynchronously persist them in the Hadoop cluster. 101 | 102 | [Table of Contents](#FLUME) 103 | 104 | ## What is Flumeng? 105 | Flume NG (Next Generation) is a real-time loader for streaming data into Hadoop. It stores data in HDFS and HBase. You’ll want to get started with Flume NG, which improves on the original Flume. 106 | 107 | [Table of Contents](#FLUME) 108 | 109 | ## What are the complicated steps in Flume configurations? 110 | Flume processes streaming data, so once it is started there is no fixed end to the process: it asynchronously flows data from the sources to HDFS via the agent. The complicated part is the configuration, which describes the individual components and how they are connected; the agent must know this before it can load data, so the configuration is what triggers the loading of streaming data. For example, a consumer key, consumer secret, access token, and access token secret are the key factors needed to download data from Twitter (a minimal agent configuration is sketched after the Data Extraction Tools question below). 111 | 112 | [Table of Contents](#FLUME) 113 | 114 | ## What are Flume core components? 115 | Source, channel, and sink are the core components of Apache Flume. When a Flume source receives an event from an external source, it stores the event in one or more channels. The Flume channel temporarily stores the event until it is consumed by the Flume sink; it acts as the Flume repository. The Flume sink removes the event from the channel and puts it into an external repository such as HDFS, or forwards it to the next Flume agent. 116 | 117 | [Table of Contents](#FLUME) 118 | 119 | ## What are the Data Extraction Tools in Hadoop? 120 | Sqoop can be used to transfer data between an RDBMS and HDFS. Flume can be used to extract streaming data from sources such as social media and web logs and store it on HDFS.
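Tying together the configuration and core-components questions above, here is a minimal sketch of an agent configuration file that wires a source, a channel, and a sink (the agent name, paths, and host are hypothetical placeholders):

    # agent1.conf – one source, one FILE channel, one HDFS sink
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Source: tail a web-server log via the exec source
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app/access.log
    agent1.sources.src1.channels = ch1

    # Channel: durable FILE channel (the most reliable of MEMORY/JDBC/FILE)
    agent1.channels.ch1.type = file
    agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
    agent1.channels.ch1.dataDirs = /var/flume/data

    # Sink: deliver events to HDFS
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel = ch1

The agent would then be started with something like flume-ng agent --conf-file agent1.conf --name agent1.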
121 | 122 | [Table of Contents](#FLUME) 123 | 124 | ## Does Flume provide 100 percents reliability to the Data Flow? 125 | Yes, Apache Flume provides end-to-end reliability because of its transactional approach to data flow. 126 | 127 | [Table of Contents](#FLUME) 128 | 129 | ## Tell any two Features of Flume? 130 | Flume efficiently collects, aggregates, and moves large amounts of log data from many different sources to a centralized data store. 131 | Flume is not restricted to log data aggregation; it can transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages, and pretty much any other kind of data. 132 | 133 | [Table of Contents](#FLUME) 134 | 135 | ## What are Interceptors? 136 | Interceptors are used to filter events between the source and channel, and between the channel and sink. They can drop unnecessary events or select only targeted ones. Depending on requirements, you can chain any number of interceptors. 137 | 138 | [Table of Contents](#FLUME) 139 | 140 | ## Why Flume? 141 | Flume is not limited to collecting logs from distributed systems; it is also capable of serving other use cases, such as 142 | + Collecting readings from an array of sensors 143 | + Collecting impressions from custom apps for an ad network 144 | + Collecting readings from network devices in order to monitor their performance. 145 | Flume aims to preserve reliability, scalability, manageability, and extensibility while serving the maximum number of clients with a high QoS. 146 | 147 | [Table of Contents](#FLUME) 148 | 149 | ## What is Flume Event? 150 | A Flume event is a unit of data with a set of string attributes. An external source, such as a web server, sends events to the Flume source. Internally, Flume has built-in functionality to understand the source format. 151 | Each log record is considered an event. Each event has a header section and a value (body) section; the headers carry metadata as key/value pairs and the body carries the payload. 152 | 153 | [Table of Contents](#FLUME) 154 | 155 | ## How Multi hop agent can be setup in Flume? 156 | The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume. 157 | 158 | [Table of Contents](#FLUME) 159 | 160 | ## Can Flume can distribute data to multiple destinations? 161 | Yes. Flume supports multiplexing flows: an event can flow from one source to multiple channels and multiple destinations. This is achieved by defining a flow multiplexer. 162 | 163 | [Table of Contents](#FLUME) 164 | 165 | ## Can you explain about configuration files? 166 | The agent configuration is stored in a local configuration file. It comprises each agent's source, sink, and channel information. 167 | 168 | [Table of Contents](#FLUME) 169 | 170 | ## What are the similarities and differences between Apache Flume and Apache Kafka? 171 | Both are used to move large volumes of event data, but Flume pushes messages to their destination via its sinks, whereas with Kafka you need to consume messages from the Kafka broker using a Kafka consumer API. 172 | 173 | [Table of Contents](#FLUME) 174 | 175 | ## Explain Reliability and Failure Handling in Apache Flume? 176 | Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started: one on the agent that delivers the event and the other on the agent that receives the event. In order for the sending agent to commit its transaction, it must receive a success indication from the receiving agent.
177 | The receiving agent only returns a success indication if it’s own transaction commits properly first. This ensures guaranteed delivery semantics between the hops that the flow makes. Figure below shows a sequence diagram that illustrates the relative scope and duration of the transactions operating within the two interacting agents. 178 | 179 | [Table of Contents](#FLUME) -------------------------------------------------------------------------------- /content/gcp.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | # GCP 4 | + [Explain the concept of dataflow in GCP](#Explain-the-concept-of-dataflow-in-GCP) 5 | + [Explain the use of Cloud Machine Learning Engine in GCP](#Explain-the-use-of-Cloud-Machine-Learning-Engine-in-GCP) 6 | + [What is Cloud Composer in GCP](#What-is-Cloud-Composer-in-GCP) 7 | + [Explain the use of Cloud Datalab in GCP](#Explain-the-use-of-Cloud-Datalab-in-GCP) 8 | + [Explain the use of Cloud Dataflow in GCP](#Explain-the-use-of-Cloud-Dataflow-in-GCP) 9 | + [How can you manage data access and permissions in GCP](#How-can-you-manage-data-access-and-permissions-in-GCP) 10 | + [Explain the use of Cloud Composer in GCP](#Explain-the-use-of-Cloud-Composer-in-GCP) 11 | + [Explain the concept of VPC (Virtual Private Cloud) in GCP](#Explain-the-concept-of-VPC-(Virtual-Private-Cloud)-in-GCP) 12 | + [How can you monitor and troubleshoot performance issues in GCP](#How-can-you-monitor-and-troubleshoot-performance-issues-in-GCP) 13 | + [Explain the use of Cloud IoT Core in GCP](#Explain-the-use-of-Cloud-IoT-Core-in-GCP) 14 | + [What is Google BigQuery](#What-is-Google-BigQuery) 15 | + [How can you optimize data processing in GCP](#How-can-you-optimize-data-processing-in-GCP) 16 | + [What is the purpose of Cloud DNS in GCP](#What-is-the-purpose-of-Cloud-DNS-in-GCP) 17 | + [How does GCP ensure compliance and data privacy](#How-does-GCP-ensure-compliance-and-data-privacy) 18 | + [How can you monitor and analyze GCP resources and services](#How-can-you-monitor-and-analyze-GCP-resources-and-services) 19 | + [How does GCP handle data encryption](#How-does-GCP-handle-data-encryption) 20 | + [What are the advantages of using GCP for data engineering](#What-are-the-advantages-of-using-GCP-for-data-engineering) 21 | + [How can you optimize data ingestion in GCP](#How-can-you-optimize-data-ingestion-in-GCP) 22 | + [What is the purpose of Cloud SQL Proxy in GCP](#What-is-the-purpose-of-Cloud-SQL-Proxy-in-GCP) 23 | + [Explain the use of Cloud Security Command Center in GCP](#Explain-the-use-of-Cloud-Security-Command-Center-in-GCP) 24 | + [How does GCP handle data replication and synchronization](#How-does-GCP-handle-data-replication-and-synchronization) 25 | + [Explain the role of Cloud Storage in GCP](#Explain-the-role-of-Cloud-Storage-in-GCP) 26 | + [What is the purpose of Cloud Identity and Access Management (IAM) in GCP](#What-is-the-purpose-of-Cloud-Identity-and-Access-Management-(IAM)-in-GCP) 27 | + [How does GCP store data](#How-does-GCP-store-data) 28 | + [What is Memorystore in GCP](#What-is-Memorystore-in-GCP) 29 | + [Explain the use of Cloud Run in GCP](#Explain-the-use-of-Cloud-Run-in-GCP) 30 | + [How does GCP handle data archiving and long-term storage](#How-does-GCP-handle-data-archiving-and-long-term-storage) 31 | + [What is the purpose of Cloud Load Balancing in GCP](#What-is-the-purpose-of-Cloud-Load-Balancing-in-GCP) 32 | + [How can you monitor and analyze GCP 
costs](#How-can-you-monitor-and-analyze-GCP-costs) 33 | + [What are the key components of GCP](#What-are-the-key-components-of-GCP) 34 | + [Explain the use of Cloud Datastore in GCP](#Explain-the-use-of-Cloud-Datastore-in-GCP) 35 | + [What is Cloud Dataproc](#What-is-Cloud-Dataproc) 36 | + [What is the purpose of Cloud Deployment Manager in GCP](#What-is-the-purpose-of-Cloud-Deployment-Manager-in-GCP) 37 | + [What is Cloud Memorystore for Redis in GCP](#What-is-Cloud-Memorystore-for-Redis-in-GCP) 38 | + [What is the purpose of Cloud Monitoring in GCP](#What-is-the-purpose-of-Cloud-Monitoring-in-GCP) 39 | + [What is the purpose of Cloud Machine Learning Engine in GCP](#What-is-the-purpose-of-Cloud-Machine-Learning-Engine-in-GCP) 40 | + [What is the purpose of Cloud NAT in GCP](#What-is-the-purpose-of-Cloud-NAT-in-GCP) 41 | + [What is the purpose of Cloud Functions in GCP](#What-is-the-purpose-of-Cloud-Functions-in-GCP) 42 | + [What is Cloud Pub/Sub](#What-is-Cloud-Pub/Sub) 43 | + [Explain the use of Cloud VPN in GCP](#Explain-the-use-of-Cloud-VPN-in-GCP) 44 | + [What is Google Cloud Platform (GCP)](#What-is-Google-Cloud-Platform-(GCP)) 45 | + [How can you ensure high availability and fault tolerance in GCP](#How-can-you-ensure-high-availability-and-fault-tolerance-in-GCP) 46 | + [What is Cloud Spanner in GCP](#What-is-Cloud-Spanner-in-GCP) 47 | + [What is Cloud SQL and how is it used in GCP](#What-is-Cloud-SQL-and-how-is-it-used-in-GCP) 48 | + [How does GCP handle data redundancy and backup](#How-does-GCP-handle-data-redundancy-and-backup) 49 | + [How does GCP ensure data security](#How-does-GCP-ensure-data-security) 50 | + [How can you move data into GCP for analysis](#How-can-you-move-data-into-GCP-for-analysis) 51 | + [How can you securely transfer data to and from GCP](#How-can-you-securely-transfer-data-to-and-from-GCP) 52 | + [How does GCP handle disaster recovery](#How-does-GCP-handle-disaster-recovery) 53 | + [Explain the use of Cloud CDN (Content Delivery Network) in GCP](#Explain-the-use-of-Cloud-CDN-(Content-Delivery-Network)-in-GCP) 54 | + [How can you ensure data integrity in GCP](#How-can-you-ensure-data-integrity-in-GCP) 55 | + [Explain the use of Cloud Key Management Service (KMS) in GCP](#Explain-the-use-of-Cloud-Key-Management-Service-(KMS)-in-GCP) 56 | + [What is the purpose of Cloud Composer in GCP](#What-is-the-purpose-of-Cloud-Composer-in-GCP) 57 | + [What is the purpose of Cloud Security Scanner in GCP](#What-is-the-purpose-of-Cloud-Security-Scanner-in-GCP) 58 | + [Explain the use of Cloud AutoML in GCP](#Explain-the-use-of-Cloud-AutoML-in-GCP) 59 | 60 | ## What is Google Cloud Platform (GCP)? 61 | Google Cloud Platform is a suite of cloud computing services provided by Google, offering a wide range of infrastructure and platform services for building, deploying, and managing applications and data. 62 | 63 | [Table of Contents](#GCP) 64 | 65 | ## What are the key components of GCP? 66 | GCP comprises various services such as compute, storage, networking, databases, big data, machine learning, and management tools. 67 | 68 | [Table of Contents](#GCP) 69 | 70 | ## What are the advantages of using GCP for data engineering? 71 | GCP provides scalability, flexibility, high-performance computing, data storage options, managed services, security, and integration with other Google services. 72 | 73 | [Table of Contents](#GCP) 74 | 75 | ## How does GCP store data? 
76 | GCP offers various storage options, including Cloud Storage for object storage, Cloud SQL for managed relational databases, Bigtable for NoSQL wide-column store, and Cloud Firestore for document-based NoSQL data. 77 | 78 | [Table of Contents](#GCP) 79 | 80 | ## What is Google BigQuery? 81 | BigQuery is a fully-managed, serverless data warehouse provided by GCP. It enables you to analyze massive datasets using SQL queries with high speed and scalability. 82 | 83 | [Table of Contents](#GCP) 84 | 85 | ## How can you move data into GCP for analysis? 86 | You can use services like Cloud Storage, Cloud Pub/Sub, Data Transfer Service, or third-party tools to import data into GCP for analysis. 87 | 88 | [Table of Contents](#GCP) 89 | 90 | ## Explain the concept of dataflow in GCP. 91 | Dataflow is a serverless data processing service that allows you to build, deploy, and execute data processing pipelines in GCP. It supports both batch and stream processing. 92 | 93 | [Table of Contents](#GCP) 94 | 95 | ## What is Cloud Dataproc? 96 | Cloud Dataproc is a fully-managed service in GCP for running Apache Spark and Apache Hadoop clusters. It provides a scalable and cost-effective way to process big data workloads. 97 | 98 | [Table of Contents](#GCP) 99 | 100 | ## How does GCP ensure data security? 101 | GCP employs various security measures, including encryption at rest and in transit, identity and access management, network security, and DDoS protection. 102 | 103 | [Table of Contents](#GCP) 104 | 105 | ## What is Cloud Pub/Sub? 106 | Cloud Pub/Sub is a messaging service in GCP that enables asynchronous communication between independent applications. It allows you to build scalable event-driven architectures. 107 | 108 | [Table of Contents](#GCP) 109 | 110 | ## Explain the role of Cloud Storage in GCP. 111 | Cloud Storage is an object storage service provided by GCP. It offers scalable, durable, and highly available storage for objects of any size. It can be used for storing files, backups, and serving static content. 112 | 113 | [Table of Contents](#GCP) 114 | 115 | ## What is Cloud Composer in GCP? 116 | Cloud Composer is a managed workflow orchestration service based on Apache Airflow. It allows you to create, schedule, and monitor workflows composed of tasks, such as data processing or ETL pipelines. 117 | 118 | [Table of Contents](#GCP) 119 | 120 | ## How can you monitor and analyze GCP resources and services? 121 | GCP provides various monitoring and logging tools, such as Cloud Monitoring, Cloud Logging, and Stackdriver, which allow you to collect, analyze, and visualize metrics, logs, and traces. 122 | 123 | [Table of Contents](#GCP) 124 | 125 | ## What is Cloud Spanner in GCP? 126 | Cloud Spanner is a horizontally scalable, globally distributed relational database service provided by GCP. It combines the benefits of relational databases with the scalability of NoSQL. 127 | 128 | [Table of Contents](#GCP) 129 | 130 | ## How can you ensure high availability and fault tolerance in GCP? 131 | GCP offers regional and multi-regional options for deploying services across multiple zones and regions. Load balancing, auto-scaling, and managed instance groups help ensure high availability and fault tolerance. 132 | 133 | [Table of Contents](#GCP) 134 | 135 | ## Explain the use of Cloud Dataflow in GCP. 136 | Cloud Dataflow is used for building data pipelines that transform and process data in parallel. 
It supports both batch and stream processing and can be integrated with other GCP services like BigQuery. 137 | 138 | [Table of Contents](#GCP) 139 | 140 | ## What is Cloud SQL and how is it used in GCP? 141 | Cloud SQL is a managed database service in GCP that provides MySQL, PostgreSQL, and SQL Server databases. It simplifies database management tasks, offers high availability, and scales seamlessly. 142 | 143 | [Table of Contents](#GCP) 144 | 145 | ## Explain the use of Cloud Datastore in GCP. 146 | Cloud Datastore is a NoSQL document database provided by GCP. It is schemaless, automatically scales, and is suitable for applications that require low-latency reads and writes. 147 | 148 | [Table of Contents](#GCP) 149 | 150 | ## How can you securely transfer data to and from GCP? 151 | GCP supports secure data transfer over encrypted connections using protocols like HTTPS and SSL/TLS. It also provides Cloud VPN and Dedicated Interconnect for private network connections. 152 | 153 | [Table of Contents](#GCP) 154 | 155 | ## What is the purpose of Cloud Machine Learning Engine in GCP? 156 | Cloud Machine Learning Engine is a managed service in GCP that allows you to build, train, and deploy machine learning models at scale. It supports popular frameworks like TensorFlow and scikit-learn. 157 | 158 | [Table of Contents](#GCP) 159 | 160 | ## How does GCP handle disaster recovery? 161 | GCP provides disaster recovery solutions through features like regional and multi-regional deployments, automated backups, and replication across zones and regions. 162 | 163 | [Table of Contents](#GCP) 164 | 165 | ## Explain the concept of VPC (Virtual Private Cloud) in GCP. 166 | VPC allows you to create a virtual private network within GCP, providing isolation and control over network resources. It enables you to define IP ranges, subnets, firewall rules, and routing tables. 167 | 168 | [Table of Contents](#GCP) 169 | 170 | ## What is the purpose of Cloud Identity and Access Management (IAM) in GCP? 171 | IAM allows you to manage access control and permissions for GCP resources. It helps you define who has access to which resources and what actions they can perform. 172 | 173 | [Table of Contents](#GCP) 174 | 175 | ## How does GCP handle data redundancy and backup? 176 | GCP offers redundancy and backup options such as regional and multi-regional storage, snapshot-based backups, and data replication across multiple zones and regions. 177 | 178 | [Table of Contents](#GCP) 179 | 180 | ## Explain the use of Cloud Composer in GCP. 181 | Cloud Composer is used for orchestrating workflows and managing dependencies between tasks. It allows you to create, schedule, and monitor complex data pipelines and ETL workflows. 182 | 183 | [Table of Contents](#GCP) 184 | 185 | ## What is the purpose of Cloud Load Balancing in GCP? 186 | Cloud Load Balancing distributes incoming traffic across multiple instances or backend services to ensure high availability, scalability, and fault tolerance. 187 | 188 | [Table of Contents](#GCP) 189 | 190 | ## How does GCP handle data encryption? 191 | GCP supports encryption at rest and in transit. It provides key management services like Cloud Key Management Service (KMS) for managing encryption keys and Cloud HSM for hardware security modules. 192 | 193 | [Table of Contents](#GCP) 194 | 195 | ## What is Memorystore in GCP? 196 | Memorystore is a fully managed in-memory data store service provided by GCP. It supports Redis, offering high-performance caching for applications. 
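As a small illustration of the caching use case, here is a minimal sketch of using a Memorystore for Redis instance from Python; it assumes the standard redis-py client, a placeholder private IP for the instance, and a hypothetical load_profile_from_database helper:

    import redis

    # Connect to the Memorystore instance's private IP (placeholder value)
    cache = redis.Redis(host="10.0.0.3", port=6379)

    def get_user_profile(user_id: str) -> bytes:
        # Try the cache first; fall back to the primary datastore on a miss
        cached = cache.get(f"user:{user_id}")
        if cached is not None:
            return cached
        profile = load_profile_from_database(user_id)  # hypothetical helper
        cache.setex(f"user:{user_id}", 300, profile)   # keep for 5 minutes
        return profile

Because Memorystore instances are reachable only over the VPC's private network, code like this would run from a VM, GKE pod, or serverless connector inside that network.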
197 | 198 | [Table of Contents](#GCP) 199 | 200 | ## Explain the use of Cloud CDN (Content Delivery Network) in GCP. 201 | Cloud CDN is a global content delivery network that caches and delivers content from GCP to users with low latency and high bandwidth. It improves the performance of web applications and reduces serving costs. 202 | 203 | [Table of Contents](#GCP) 204 | 205 | ## How can you monitor and analyze GCP costs? 206 | GCP provides tools like Cloud Billing, Cost Management, and budgets to monitor, analyze, and optimize costs associated with your GCP resources and services. 207 | 208 | [Table of Contents](#GCP) 209 | 210 | ## Explain the use of Cloud AutoML in GCP. 211 | Cloud AutoML is a suite of machine learning products in GCP that allows you to train custom machine learning models with minimal coding. It supports vision, natural language, and tabular data. 212 | 213 | [Table of Contents](#GCP) 214 | 215 | ## What is the purpose of Cloud Deployment Manager in GCP? 216 | Cloud Deployment Manager is an infrastructure deployment service in GCP. It allows you to define and manage resources as code, making it easier to create and maintain infrastructure configurations. 217 | 218 | [Table of Contents](#GCP) 219 | 220 | ## How does GCP ensure compliance and data privacy? 221 | GCP complies with various industry standards and regulations. It offers features like data location controls, access transparency, and compliance certifications to meet data privacy and compliance requirements. 222 | 223 | [Table of Contents](#GCP) 224 | 225 | ## Explain the use of Cloud Machine Learning Engine in GCP. 226 | Cloud Machine Learning Engine is a managed service that allows you to train, deploy, and serve machine learning models at scale. It integrates with other GCP services like BigQuery and Cloud Storage. 227 | 228 | [Table of Contents](#GCP) 229 | 230 | ## What is the purpose of Cloud Monitoring in GCP? 231 | Cloud Monitoring is a monitoring and observability service provided by GCP. It allows you to collect and analyze metrics, create dashboards, set up alerts, and visualize the health and performance of your resources. 232 | 233 | ## How can you optimize data ingestion in GCP? 234 | GCP provides services like Cloud Pub/Sub and Cloud Dataflow for efficient and scalable data ingestion. You can also optimize ingestion by using partitioning, batching, and compression techniques. 235 | 236 | [Table of Contents](#GCP) 237 | 238 | ## What is Cloud Memorystore for Redis in GCP? 239 | Cloud Memorystore for Redis is a fully-managed in-memory data store service provided by GCP. It offers high-performance, scalable Redis instances for caching and data storage. 240 | 241 | [Table of Contents](#GCP) 242 | 243 | ## Explain the use of Cloud Datalab in GCP. 244 | Cloud Datalab is an interactive data exploration and visualization tool in GCP. It provides a Jupyter notebook interface for analyzing and visualizing data using Python, SQL, and BigQuery. 245 | 246 | [Table of Contents](#GCP) 247 | 248 | ## What is the purpose of Cloud DNS in GCP? 249 | Cloud DNS is a scalable and reliable domain name system (DNS) service provided by GCP. It allows you to manage and resolve domain names to IP addresses. 250 | 251 | [Table of Contents](#GCP) 252 | 253 | ## How can you ensure data integrity in GCP? 254 | GCP provides various mechanisms for ensuring data integrity, including checksums, encryption, redundancy, and regular backups. 
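As one concrete example of the checksum mechanism mentioned above, here is a minimal sketch that verifies a Cloud Storage upload by comparing the blob's server-side MD5 hash with one computed locally; it assumes the google-cloud-storage Python client and placeholder bucket/object names:

    import base64
    import hashlib

    from google.cloud import storage

    def upload_and_verify(bucket_name: str, blob_name: str, local_path: str) -> bool:
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(blob_name)
        blob.upload_from_filename(local_path)

        # Refresh metadata to obtain the server-side, base64-encoded MD5 hash
        blob.reload()

        with open(local_path, "rb") as f:
            local_md5 = base64.b64encode(hashlib.md5(f.read()).digest()).decode()

        return blob.md5_hash == local_md5

    # Usage with placeholder names:
    # assert upload_and_verify("my-bucket", "reports/2024.csv", "/tmp/2024.csv")

For composite objects, which carry only a CRC32C checksum rather than an MD5 hash, the same idea applies using the blob's crc32c field.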
255 | 256 | [Table of Contents](#GCP) 257 | 258 | ## Explain the use of Cloud Security Command Center in GCP. 259 | Cloud Security Command Center is a security and risk management service provided by GCP. It helps you discover, monitor, and manage security vulnerabilities and threats across your GCP resources. 260 | 261 | [Table of Contents](#GCP) 262 | 263 | ## What is the purpose of Cloud SQL Proxy in GCP? 264 | Cloud SQL Proxy is a secure client-side proxy for connecting to Cloud SQL instances from external applications or local development environments without exposing the database to the internet. 265 | 266 | [Table of Contents](#GCP) 267 | 268 | ## Explain the use of Cloud Run in GCP. 269 | Cloud Run is a fully managed serverless execution environment in GCP. It allows you to run containers without worrying about infrastructure provisioning or scaling. 270 | 271 | [Table of Contents](#GCP) 272 | 273 | ## What is the purpose of Cloud Security Scanner in GCP? 274 | Cloud Security Scanner is a web application vulnerability scanning service provided by GCP. It helps you identify security vulnerabilities in your web applications by crawling your website. 275 | 276 | [Table of Contents](#GCP) 277 | 278 | ## What is the purpose of Cloud Composer in GCP? 279 | Cloud Composer is a fully managed workflow orchestration service based on Apache Airflow. It allows you to create, schedule, and monitor complex data pipelines and ETL workflows. 280 | 281 | [Table of Contents](#GCP) 282 | 283 | ## How does GCP handle data archiving and long-term storage? 284 | GCP offers services like Cloud Storage Coldline and Archive for long-term data storage, providing cost-effective options for archiving data with high durability and availability. 285 | 286 | [Table of Contents](#GCP) 287 | 288 | ## How can you manage data access and permissions in GCP? 289 | GCP provides Cloud Identity and Access Management (IAM) for managing access control and permissions to GCP resources. IAM allows you to define fine-grained access policies and grant access to specific users or groups. 290 | 291 | [Table of Contents](#GCP) 292 | 293 | ## How can you optimize data processing in GCP? 294 | GCP offers optimization techniques like data partitioning, distributed processing, caching, and using appropriate data storage and processing services to achieve efficient and scalable data processing. 295 | 296 | [Table of Contents](#GCP) 297 | 298 | ## What is the purpose of Cloud NAT in GCP? 299 | Cloud NAT is a service in GCP that allows your virtual machine instances to send outbound traffic to the internet without exposing their IP addresses. It provides network address translation capabilities. 300 | 301 | [Table of Contents](#GCP) 302 | 303 | ## Explain the use of Cloud IoT Core in GCP. 304 | Cloud IoT Core is a fully managed service in GCP for securely connecting, managing, and ingesting data from IoT devices at scale. It provides device management, data ingestion, and integration with other GCP services. 305 | 306 | [Table of Contents](#GCP) 307 | 308 | ## How does GCP handle data replication and synchronization? 309 | GCP offers data replication and synchronization capabilities through services like Cloud Storage, Cloud Datastore, Cloud Spanner, and database-specific replication features. 310 | 311 | [Table of Contents](#GCP) 312 | 313 | ## Explain the use of Cloud VPN in GCP. 
314 | Cloud VPN is a service in GCP that provides a secure and encrypted connection between your on-premises network and GCP Virtual Private Cloud (VPC) network. 315 | 316 | [Table of Contents](#GCP) 317 | 318 | ## What is the purpose of Cloud Functions in GCP? 319 | Cloud Functions is a serverless compute service in GCP that allows you to run event-driven code in response to events from various GCP services or HTTP requests. 320 | 321 | [Table of Contents](#GCP) 322 | 323 | ## How can you monitor and troubleshoot performance issues in GCP? 324 | GCP provides monitoring tools like Cloud Monitoring and logging tools like Cloud Logging and Stackdriver, which allow you to monitor and troubleshoot performance issues by collecting and analyzing metrics, logs, and traces. 325 | 326 | [Table of Contents](#GCP) 327 | 328 | ## Explain the use of Cloud Key Management Service (KMS) in GCP. 329 | Cloud KMS is a managed service in GCP for generating, using, and managing encryption keys. It helps you encrypt data and control access to sensitive information. 330 | 331 | 332 | [Table of Contents](#GCP) 333 | -------------------------------------------------------------------------------- /content/kafka.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | 4 | # Apache Kafka 5 | + [What is Apache Kafka?](#What-is-Apache-Kafka) 6 | + [What is the traditional method of transferring messages?](#What-is-the-traditional-method-of-transfering-messages) 7 | + [What are the benefits of Apache Kafka over the traditional technique?](#What-is-the-benefits-of-Apache-Kafka-over-the-traditional-technique) 8 | + [What is the meaning of broker in Apache Kafka?](#What-is-the-meaning-of-broker-in-Apache-Kafka) 9 | + [What is the maximum size of a message that Kafka can receive?](#What-is-the-maximum-size-of-a-message-that-kafka-can-receive) 10 | + [What is Zookeeper's role in Kafka's ecosystem and can we use Kafka without Zookeeper?](#What-is-the-Zookeepers-role-in-Kafkas-ecosystem-and-can-we-use-Kafka-without-Zookeeper) 11 | + [How are messages consumed by a consumer in Apache Kafka?](#How-are-messages-consumed-by-a-consumer-in-apache-Kafka) 12 | + [How can you improve the throughput of a remote consumer?](#How-can-you-improve-the-throughput-of-a-remote-consumer) 13 | + [How can you get Exactly-Once Messaging from Kafka during data production?](#How-can-get-Exactly-Once-Messaging-from-Kafka-during-data-production) 14 | + [What is In-Sync Replicas (ISR) in Apache Kafka?](#What-is-In-Sync-MessagesISR-in-Apache-Kafka) 15 | + [How can we reduce churn (frequent changes) in ISR?](#How-can-we-redcue-chrunfrequent-changes-in-ISR) 16 | + [When does a broker leave ISR?](#When-does-a-broker-leave-ISR) 17 | + [What does it indicate if a replica stays out of ISR for a long time?](#What-does-it-indicate-if-replica-stays-out-of-Isr-for-a-long-time) 18 | + [What happens if the preferred replica is not in the ISR list?](#What-happens-if-the-preferred-replica-is-not-in-the-ISR-list) 19 | + [What is the purpose of replication in Apache Kafka?](#What-is-the-purpose-of-replication-in-apache-kafka) 20 | + [Is it possible to get the message offset after producing to a topic?](#Is-it-possible-to-get-the-message-offset-after-producing-to-a-topic) 21 | + [Mention what is the difference between Apache Kafka, Apache Storm, and Apache Flink?](#Mention-what-is-the-difference-between-Apache-Kafka-and-Apache-Storm-and-apache-flink) 22 | + [List the 
various components in Kafka?](#List-the-various-components-in-Kafka) 23 | + [What is the role of the offset in Kafka?](#What-is-the-role-of-the-offset-in-kafka) 24 | + [Can you explain the concept of `leader` and `follower` in the Kafka ecosystem?](#Can-you-explain-the-concept-of-leader-and-follower-in-kafka-ecosystem) 25 | + [How do you define a Partitioning Key?](#How-do-you-define-a-Partitioning-Key) 26 | + [In the Producer, when does QueueFullException occur?](#In-the-Producer-when-does-Queuefullexception-occur) 27 | + [Can you please explain the role of the Kafka Producer API?](#Can-you-please-explain-the-role-of-the-Kafka-Producer-API) 28 | 29 | ## What is Apache Kafka? 30 | Apache Kafka is a publish-subscribe messaging system developed by Apache written in Java and Scala. It is a distributed, partitioned and replicated log service. 31 | 32 | [Table of Contents](#Apache-Kafka) 33 | 34 | ## What is the traditional method of transfering messages? 35 | The traditional method of transfering messages includes two methods 36 | + Queuing: In a queuing, a pool of consumers may read message from the server and each message goes to one of them 37 | + Publish-Subscribe: In this model, messages are broadcasted to all consumers 38 | 39 | Kafka caters single consumer abstraction that generalized both of the above- the consumer group. 40 | 41 | [Table of Contents](#Apache-Kafka) 42 | 43 | ## What is the benefits of Apache Kafka over the traditional technique? 44 | Apache Kafka has following benefits above traditional messaging technique 45 | + Scalability: Kafka is designed for horizontal scalability. It can scale out by adding more brokers (servers) to the Kafka cluster to handle more partitions and thereby increase throughput. This scalability is seamless and can handle petabytes of data without downtime. 46 | 47 | + Performance: Kafka provides high throughput for both publishing and subscribing to messages, even with very large volumes of data. It uses a disk structure that optimizes for batched writes and reads, significantly outperforming traditional databases in scenarios that involve high-volume, high-velocity data. 48 | 49 | + Durability and Reliability: Kafka replicates data across multiple nodes, ensuring that data is not lost even if some brokers fail. This replication is configurable, allowing users to balance between redundancy and performance based on their requirements. 50 | 51 | + Fault Tolerance: Kafka is designed to be fault-tolerant. The distributed nature of Kafka, combined with its replication mechanisms, ensures that the system continues to operate even when individual components . 52 | 53 | + Real-time Processing: Kafka enables real-time data processing by allowing producers to write data into Kafka topics and consumers to read data from these topics with minimal latency. This capability is critical for applications that require real-time analytics, monitoring, and response. 54 | 55 | + Decoupling of Data Streams: Kafka allows producers and consumers to operate independently. Producers can write data to Kafka topics without being concerned about how the data will be processed. Similarly, consumers can read data from topics without needing to coordinate with producers. This decoupling simplifies system architecture and enhances flexibility. 56 | 57 | + Replayability: Kafka stores data for a configurable period, enabling applications to replay historical data. 
This is valuable for new applications that need access to historical data or for recovering from errors by reprocessing data. 58 | 59 | + High Availability: Kafka's distributed nature and replication model ensure high availability. Even if some brokers or partitions become unavailable, the system can continue to function, ensuring continuous operation of critical applications. 60 | 61 | [Table of Contents](#Apache-Kafka) 62 | 63 | ## What is the meaning of broker in Apache Kafka? 64 | a broker refers to a server in the Kafka cluster that stores and manages the data. Each broker holds a set of topic partitions, allowing Kafka to efficiently handle large volumes of data by distributing the load across multiple brokers in the cluster. Brokers handle all read and write requests from Kafka producers and consumers and ensure data replication and fault tolerance to prevent data loss. 65 | 66 | [Table of Contents](#Apache-Kafka) 67 | 68 | ## What is the maximum size of a message that kafka can receive? 69 | The maximum size of a message that Kafka can receive is determined by the message.max.bytes configuration parameter for the broker and the max.message.bytes parameter for the topic. By default, Kafka allows messages up to 1 MB (1,048,576 bytes) in size, but both parameters can be adjusted to allow larger messages if needed. 70 | 71 | [Table of Contents](#Apache-Kafka) 72 | 73 | ## What is the Zookeeper's role in Kafka's ecosystem and can we use Kafka without Zookeeper? 74 | Zookeeper in Kafka is used for managing and coordinating Kafka brokers. It helps in leader election for partitions, cluster membership, and configuration management among other tasks. Historically, Kafka required Zookeeper to function. 75 | 76 | However, with the introduction of KRaft mode (Kafka Raft Metadata mode), it's possible to use Kafka without Zookeeper. KRaft mode replaces Zookeeper by using a built-in consensus mechanism for managing cluster metadata, simplifying the architecture and potentially improving performance and scalability. 77 | 78 | [Table of Contents](#Apache-Kafka) 79 | 80 | ## How are messages consumed by a consumer in apache Kafka? 81 | In Apache Kafka, messages are consumed by a consumer through a pull-based model. The consumer subscribes to one or more topics and polls the Kafka broker at regular intervals to fetch new messages. Messages are consumed in the order they are stored in the topic's partitions. Each consumer keeps track of its offset in each partition, which is the position of the next message to be consumed, allowing it to pick up where it left off across restarts or failures. 82 | 83 | [Table of Contents](#Apache-Kafka) 84 | 85 | ## How can you improve the throughput of a remote consumer? 86 | + Increase Bandwidth: Ensure the network connection has sufficient bandwidth to handle the data being consumed. 87 | + Optimize Data Serialization: Use efficient data serialization formats to reduce the size of the data being transmitted. 88 | + Concurrency: Implement concurrency in the consumer to process data in parallel, if possible. 89 | + Batch Processing: Where applicable, batch data together to reduce the number of round-trip times needed. 90 | + Caching: Cache frequently accessed data on the consumer side to reduce data retrieval times. 91 | + Compression: Compress data before transmission to reduce the amount of data being sent over the network. 92 | + Optimize Network Routes: Use optimized network paths or CDN services to reduce latency. 
93 | + Adjust Timeouts and Buffer Sizes: Fine-tune network settings, including timeouts and buffer sizes, for optimal data transfer rates. 94 | 95 | [Table of Contents](#Apache-Kafka) 96 | 97 | ## How can get Exactly-Once Messaging from Kafka during data production? 98 | 99 | 1. **Enable Idempotence**: Configure the producer for idempotence by setting `enable.idempotence` to `true`. This ensures that messages are not duplicated during network errors. 100 | 101 | 2. **Transactional API**: Use Kafka’s Transactional API by initiating transactions on the producer. This involves setting the `transactional.id` configuration and managing transactions with `beginTransaction()`, `commitTransaction()`, and `abortTransaction()` methods. It ensures that either all messages in a transaction are successfully published, or none are in case of failure, thereby achieving exactly-once semantics. 102 | 103 | 3. **Proper Configuration**: Alongside enabling idempotence, adjust `acks` to `all` (or `-1`) to ensure all replicas acknowledge the messages, and set an appropriate `retries` and `max.in.flight.requests.per.connection` (should be `1` when transactions are used) to handle retries without message duplication. 104 | 105 | 4. **Consistent Partitioning**: Ensure that messages are partitioned consistently if the order matters. This might involve custom partitioning strategies to avoid shuffling messages among partitions upon retries. 106 | 107 | [Table of Contents](#Apache-Kafka) 108 | 109 | ## What is In-Sync Messages(ISR) in Apache Kafka? 110 | In Apache Kafka, ISR stands for In-Sync Replicas. It's a concept related to Kafka's high availability and fault tolerance mechanisms. 111 | 112 | For each partition, Kafka maintains a list of replicas that are considered "in-sync" with the leader replica. The leader replica is the one that handles all read and write requests for a specific partition, while the follower replicas replicate the leader's log. Followers that have fully caught up with the leader log are considered in-sync. This means they have replicated all messages up to the last message acknowledged by the leader. 113 | 114 | The ISR ensures data durability and availability. If the leader fails, Kafka can elect a new leader from the in-sync replicas, minimizing data loss and downtime. 115 | 116 | [Table of Contents](#Apache-Kafka) 117 | 118 | ## How can we redcue chrun(frequent changes) in ISR? 119 | + Optimize Network Configuration: Ensure that the network connections between brokers are stable and have sufficient bandwidth. Network issues can cause followers to fall behind and drop out of the ISR. 120 | 121 | + Adjust Replica Lag Configuration: Kafka allows configuration of parameters like replica.lag.time.max.ms which defines how long a replica can be behind the leader before it is considered out of sync. Adjusting this value can help manage ISR churn by allowing replicas more or less time to catch up. 122 | 123 | + Monitor and Scale Resources Appropriately: Ensure that all brokers have sufficient resources (CPU, memory, disk I/O) to handle their workload. Overloaded brokers may struggle to keep up, leading to replicas falling out of the ISR. 124 | 125 | + Use Dedicated Networks for Replication Traffic: If possible, use a dedicated network for replication traffic. This can help prevent replication traffic from being impacted by other network loads. 126 | 127 | [Table of Contents](#Apache-Kafka) 128 | 129 | ## When does a broker leave ISR? 
130 | A broker may leave the ISR for a few reasons: 131 | 132 | + Falling Behind: If a replica falls behind the leader by more than the configured thresholds (replica.lag.time.max.ms or replica.lag.max.messages), it is removed from the ISR. 133 | 134 | + Broker Failure: If a broker crashes or is otherwise disconnected from the cluster, its replicas are removed from the ISR. 135 | 136 | + Manual Intervention: An administrator can manually remove a replica from the ISR, although this is not common practice and should be done with caution. 137 | 138 | [Table of Contents](#Apache-Kafka) 139 | 140 | ## What does it indicate if replica stays out of Isr for a long time? 141 | If a replica stays out of the ISR (In-Sync Replicas) for a long time, it indicates that the replica is not able to keep up with the leader's log updates. This can be due to network issues, hardware failure, or high load on the broker. As a result, the replica might become a bottleneck for partition availability and durability, since it cannot participate in acknowledging writes or be elected as a leader if the current leader fails. 142 | 143 | [Table of Contents](#Apache-Kafka) 144 | 145 | ## What happens if the preferred replica is not in the ISR list? 146 | If the preferred replica is not in the In-Sync Replicas (ISR) for a Kafka topic, the producer will either wait for the preferred replica to become available (if configured with certain ack settings) or send messages to another available broker that is part of the ISR. This ensures data integrity by only using replicas that are fully up-to-date with the leader. Consumers might experience a delay in data availability if they are set to consume only from the preferred replica and it is not available. 147 | [Table of Contents](#Apache-Kafka) 148 | 149 | ## What is the purpose of replication in apache kafka? 150 | Replication in Kafka serves to increase data availability and durability. By replicating data across multiple brokers, Kafka ensures that even if a broker fails, the data is not lost and can still be accessed from other brokers. It is a fundamental feature for fault tolerance and high availability, making it essential for production environments where data reliability is critical. 151 | 152 | [Table of Contents](#Apache-Kafka) 153 | 154 | ## Is it possible to get the message offset after producing to a topic? 155 | Yes, it is possible to get the message offset after producing a message in Kafka. When you send a message to a Kafka topic, the producer API can return metadata about the message, including the offset of the message in the topic partition. 156 | 157 | [Table of Contents](#Apache-Kafka) 158 | 159 | ## Mention what is the difference between Apache Kafka and Apache Storm and apache flink? 160 | Apache Kafka, Apache Storm, and Apache Flink are all distributed systems designed for processing large volumes of data, but they serve different purposes and operate differently: 161 | 162 | + Apache Kafka is primarily a distributed messaging system or streaming platform aimed at high-throughput, fault-tolerant storage and processing of streaming data. It is often used as a buffer or storage system between data producers and consumers. 163 | 164 | + Apache Storm is a real-time computation system for processing streaming data. It excels at high-speed data ingestion and processing tasks that require immediate response, such as real-time analytics. Storm processes data in a record-by-record fashion. 
165 | 166 | + Apache Flink is a stream processing framework that can also handle batch processing. It is known for its ability to perform complex computations on streaming data, including event time processing and state management. Flink is designed to run in all common cluster environments and perform computations at in-memory speed. 167 | 168 | [Table of Contents](#Apache-Kafka) 169 | 170 | ## List the various components in Kafka? 171 | The four major components of Kafka are: 172 | + Topic – a stream of messages belonging to the same type 173 | + Producer – that can publish messages to a topic 174 | + Brokers – a set of servers where the publishes messages are stored 175 | + Consumer – that subscribes to various topics and pulls data from the brokers. 176 | 177 | [Table of Contents](#Apache-Kafka) 178 | 179 | ## What is the role of the offset in kafka? 180 | In Kafka, the offset is a unique identifier for each record within a Kafka topic's partition. It denotes the position of a record within the partition. The offset is used by consumers to track which records have been read and which haven't, allowing for fault-tolerant and scalable message consumption. Essentially, it enables consumers to pick up reading from the exact point they left off, even in the event of a failure or restart, thereby ensuring that no messages are lost or read multiple times. 181 | 182 | [Table of Contents](#Apache-Kafka) 183 | 184 | ## Can you explain the concept of `leader` and `follower` in kafka ecosystem? 185 | In Apache Kafka, the concepts of "leader" and "follower" refer to roles that brokers play within a Kafka cluster to manage partitions of a topic. 186 | 187 | + Leader: For each partition of a topic, there is one broker that acts as the leader. The leader is responsible for handling all read and write requests for that partition. When messages are produced to a partition, they are sent to the leader broker, which then writes the messages to its local storage. The leader broker ensures that messages are stored in the order they are received. 188 | 189 | + Follower: Followers are other brokers in the cluster that replicate the data of the leader for fault tolerance. Each follower continuously pulls messages from the leader to stay up-to-date, ensuring that it has an exact copy of the leader's data. In case the leader broker fails, one of the followers can be elected as the new leader, ensuring high availability. 190 | 191 | [Table of Contents](#Apache-Kafka) 192 | 193 | ## How do you define a Partitioning Key? 194 | Within the Producer, the role of a Partitioning Key is to indicate the destination partition of the message. By default, a hashing-based Partitioner is used to determine the partition ID given the key. Alternatively, users can also use customized Partitions. 195 | 196 | [Table of Contents](#Apache-Kafka) 197 | 198 | ## In the Producer when does Queuefullexception occur? 199 | QueueFullException typically occurs when the Producer attempts to send messages at a pace that the Broker cannot handle. Since the Producer doesn’t block, users will need to add enough brokers to collaboratively handle the increased load. 200 | 201 | [Table of Contents](#Apache-Kafka) 202 | 203 | ## Can you please explain the role of the Kafka Producer API? 204 | The Kafka Producer API allows applications to send streams of data to topics in the Kafka cluster. 
Essentially, it enables the production of message data to one or more Kafka topics, facilitating the reliable and scalable distribution of messages across the Kafka ecosystem. 205 | 206 | we usually have two producer types – SyncProducer and AsyncProducer. The goal is to expose all the producer functionality through a single API to the client. 207 | 208 | [Table of Contents](#Apache-Kafka) 209 | -------------------------------------------------------------------------------- /content/kubernetes.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | 4 | # Kubernetes 5 | + [How to do maintenance activity on the K8 node?](#How-to-do-maintenance-activity-on-the-K8-node) 6 | + [How do we control the resource usage of POD?](#How-do-we-control-the-resource-usage-of-POD) 7 | + [What are the various K8's services running on nodes and describe the role of each service?](#What-are-the-various-K8's-services-running-on-nodes-and-describe-the-role-of-each-service) 8 | + [What is PDB (Pod Disruption Budget)?](#What-is-PDB-(Pod-Disruption-Budget)) 9 | + [What’s the init container and when it can be used?](#What’s-the-init-container-and-when-it-can-be-used) 10 | + [What is the role of Load Balance in Kubernetes?](#What-is-the-role-of-Load-Balance-in-Kubernetes) 11 | + [What are the various things that can be done to increase Kubernetes security?](#What-are-the-various-things-that-can-be-done-to-increase-Kubernetes-security) 12 | + [How to monitor the Kubernetes cluster?](#How-to-monitor-the-Kubernetes-cluster) 13 | + [How to get the central logs from POD?](#How-to-get-the-central-logs-from-POD) 14 | + [How to turn the service defined below in the spec into an external one?](#How-to-turn-the-service-defined-below-in-the-spec-into-an-external-one) 15 | + [How to configure TLS with Ingress?](#How-to-configure-TLS-with-Ingress) 16 | + [Why use namespaces? 
What is the problem with using the default namespace?](#Why-use-namespaces?-What-is-the-problem-with-using-the-default-namespace) 17 | + [In the following file which service and in which namespace is referred?](#In-the-following-file-which-service-and-in-which-namespace-is-referred) 18 | + [What is an Operator?](#What-is-an-Operator) 19 | + [Why do we need Operators?](#Why-do-we-need-Operators) 20 | + [What is GKE?](#What-is-GKE) 21 | + [What is Ingress Default Backend?](#What-is-Ingress-Default-Backend) 22 | + [How to run Kubernetes locally?](#How-to-run-Kubernetes-locally) 23 | + [What is Kubernetes Load Balancing?](#What-is-Kubernetes-Load-Balancing) 24 | + [What the following in the Deployment configuration file mean?](#What-the-following-in-the-Deployment-configuration-file-mean) 25 | + [What is the difference between Docker Swarm and Kubernetes?](#What-is-the-difference-between-Docker-Swarm-and-Kubernetes) 26 | + [How to troubleshoot if the POD is not getting scheduled?](#How-to-troubleshoot-if-the-POD-is-not-getting-scheduled) 27 | + [How to run a POD on a particular node?](#How-to-run-a-POD-on-a-particular-node) 28 | + [What are the different ways to provide external network connectivity to K8?](#What-are-the-different-ways-to-provide-external-network-connectivity-to-K8) 29 | + [How can we forward the port 8080 container -> 8080 service -> 8080 ingress -> 80 browser and how it can be done?](#How-can-we-forward-the-port-8080-container-to-8080-service-to-8080-ingress-to-80-browser-and-how-it-can-be-done) 30 | [Table of Contents](#Kubernetes) 31 | 32 | ## How to do maintenance activity on the K8 node? 33 | Whenever security patches become available, the Kubernetes administrator has to perform maintenance to apply them to the running nodes and prevent vulnerabilities, which is often an unavoidable part of administration. The following two commands are useful to safely drain a K8s node. 34 | 35 | kubectl cordon NODE_NAME 36 | kubectl drain NODE_NAME --ignore-daemonsets 37 | The first command marks the node unschedulable (maintenance mode); kubectl drain then evicts the pods from the node. After the drain command succeeds you can perform the maintenance. 38 | 39 | Note: If you wish to perform maintenance on a single node, the following two commands can be issued in order: 40 | 41 | kubectl get nodes: to list all the nodes 42 | kubectl drain NODE_NAME: drain a particular node 43 | 44 | [Table of Contents](#Kubernetes) 45 | 46 | ## How do we control the resource usage of POD? 47 | Resource usage of a POD can be controlled with requests and limits. 48 | Request: The amount of resources requested for a container. If a container exceeds its request for resources, it can be throttled back down to its request. 49 | 50 | Limit: An upper cap on the resources a single container can use. If it tries to exceed this predefined limit, it can be terminated if Kubernetes decides that another container needs these resources. If you are sensitive to pod restarts, it makes sense to have the sum of all container resource limits equal to or less than the total resource capacity of your cluster.
51 | Example:
52 | 
53 | apiVersion: v1
54 | kind: Pod
55 | metadata:
56 |   name: demo
57 | spec:
58 |   containers:
59 |   - name: example1
60 |     image: example/example1
61 |     resources:
62 |       requests:
63 |         memory: "64Mi"
64 |         cpu: "250m"
65 |       limits:
66 |         memory: "128Mi"
67 |         cpu: "500m"
68 | 
69 | [Table of Contents](#Kubernetes)
70 | 
71 | ## What are the various K8's services running on nodes and describe the role of each service?
72 | A K8s cluster mainly consists of two types of nodes: executor (worker) nodes and master nodes.
73 | Executor (worker) node services (these run on every worker node):
74 | Kube-proxy: This service is responsible for the communication of pods within the cluster and with the outside network, and it runs on every node. It maintains the network rules that are applied when your pod establishes network communication.
75 | kubelet: Each node runs a kubelet service that keeps the node in line with the desired configuration (YAML or JSON) file. NOTE: the kubelet only manages containers created by Kubernetes.
76 | Master services:
77 | 
78 | Kube-apiserver: The master API service that acts as the entry point to the K8s cluster.
79 | Kube-scheduler: Schedules PODs according to the resources available on the executor nodes.
80 | Kube-controller-manager: A control loop that watches the shared state of the cluster through the apiserver and makes changes attempting to move the current state towards the desired stable state.
81 | 
82 | [Table of Contents](#Kubernetes)
83 | 
84 | ## What is PDB (Pod Disruption Budget)?
85 | A Kubernetes administrator can create an object of kind: PodDisruptionBudget for high availability of the application; it ensures that the minimum number of running pods specified by the minAvailable attribute in the spec file is respected. This is useful while performing a drain, where the drain will halt until the PDB is satisfied, ensuring the High Availability (HA) of the application. The following spec file sets minAvailable to 2, which means at least two pods must remain available (even after an eviction).
86 | Example: YAML Config using minAvailable =>
87 | 
88 | apiVersion: policy/v1
89 | kind: PodDisruptionBudget
90 | metadata:
91 |   name: zk-pdb
92 | spec:
93 |   minAvailable: 2
94 |   selector:
95 |     matchLabels:
96 |       app: zookeeper
97 | 
98 | [Table of Contents](#Kubernetes)
99 | 
100 | ## What’s the init container and when it can be used?
101 | Init containers set the stage before the actual containers in the POD run. Typical uses:
102 | Wait for some time before starting the app container, with a command like sleep 60.
103 | Clone a git repository into a volume.
104 | 
105 | [Table of Contents](#Kubernetes)
106 | 
107 | ## What is the role of Load Balance in Kubernetes?
108 | Load balancing is a way to distribute incoming traffic across multiple backend servers, which helps keep the application available to its users.
109 | (figure: Load Balancer)
110 | In Kubernetes, as shown in the figure above, all the incoming traffic lands on a single IP address on the load balancer, which is a way to expose your service to the outside internet; the load balancer routes the incoming traffic to a particular pod (via a service) using an algorithm such as round-robin. If a pod goes down, the load balancer is notified so that traffic is not routed to that unavailable pod. Thus load balancers in Kubernetes are responsible for distributing a set of tasks (incoming traffic) across the pods.
111 | 
112 | [Table of Contents](#Kubernetes)
113 | 
114 | ## What are the various things that can be done to increase Kubernetes security?
115 | By default, POD can communicate with any other POD, we can set up network policies to limit this communication between the PODs. 116 | RBAC (Role-based access control) to narrow down the permissions. 117 | Use namespaces to establish security boundaries. 118 | Set the admission control policies to avoid running the privileged containers. 119 | Turn on audit logging. 120 | 121 | [Table of Contents](#Kubernetes) 122 | 123 | ## How to monitor the Kubernetes cluster? 124 | Prometheus is used for Kubernetes monitoring. The Prometheus ecosystem consists of multiple components. 125 | Mainly Prometheus server which scrapes and stores time-series data. 126 | Client libraries for instrumenting application code. 127 | Push gateway for supporting short-lived jobs. 128 | Special-purpose exporters for services like StatsD, HAProxy, Graphite, etc. 129 | An alert manager to handle alerts on various support tools. 130 | 131 | [Table of Contents](#Kubernetes) 132 | 133 | ## How to get the central logs from POD? 134 | This architecture depends upon the application and many other factors. Following are the common logging patterns 135 | 136 | Node level logging agent. 137 | Streaming sidecar container. 138 | Sidecar container with the logging agent. 139 | Export logs directly from the application. 140 | In the setup, journalbeat and filebeat are running as daemonset. Logs collected by these are dumped to the kafka topic which is eventually dumped to the ELK stack. 141 | The same can be achieved using EFK stack and fluentd-bit. 142 | 143 | [Table of Contents](#Kubernetes) 144 | 145 | ## How to turn the service defined below in the spec into an external one? 146 | 147 | spec: 148 | selector: 149 | app: some-app 150 | ports: 151 | - protocol: UDP 152 | port: 8080 153 | targetPort: 8080 154 | Explanation - 155 | 156 | Adding type: LoadBalancer and nodePort as follows: 157 | 158 | spec: 159 | selector: 160 | app: some-app 161 | type: LoadBalancer 162 | ports: 163 | - protocol: UDP 164 | port: 8080 165 | targetPort: 8080 166 | nodePort: 32412 167 | 168 | 169 | Complete the following configurationspec file to make it Ingress 170 | metadata: 171 | name: someapp-ingress 172 | spec: 173 | Explanation - 174 | 175 | One of the several ways to answer this question. 176 | 177 | apiVersion: networking.k8s.io/v1 178 | kind: Ingress 179 | metadata: 180 | name: someapp-ingress 181 | spec: 182 | rules: 183 | - host: my.host 184 | http: 185 | paths: 186 | - backend: 187 | serviceName: someapp-internal-service 188 | servicePort: 8080 189 | 190 | [Table of Contents](#Kubernetes) 191 | 192 | ## How to configure TLS with Ingress? 193 | Add tls and secretName entries. 194 | 195 | spec: 196 | tls: 197 | - hosts: 198 | - some_app.com 199 | secretName: someapp-secret-tls 200 | 201 | [Table of Contents](#Kubernetes) 202 | 203 | ## Why use namespaces? What is the problem with using the default namespace? 204 | While using the default namespace alone, it becomes hard over time to get an overview of all the applications you can manage in your cluster. Namespaces make it easier to organize the applications into groups that make sense, like a namespace of all the monitoring applications and a namespace for all the security applications, etc. 205 | Namespaces can also be useful for managing Blue/Green environments where each namespace can include a different version of an app and also share resources that are in other namespaces (namespaces like logging, monitoring, etc.). 
206 | Another use case for namespaces is one cluster with multiple teams. When multiple teams use the same cluster, they might end up stepping on each other's toes. For example, if they end up creating an app with the same name it means one of the teams overrides the app of the other team because there can't be two apps in Kubernetes with the same name (in the same namespace). 207 | 208 | [Table of Contents](#Kubernetes) 209 | 210 | ## In the following file which service and in which namespace is referred? 211 | apiVersion: v1 212 | kind: ConfigMap 213 | metadata: 214 | name: some-configmap 215 | data: 216 | some_url: silicon.chip 217 | Answer - It's referencing the service "silicon" in the namespace called "chip". 218 | 219 | [Table of Contents](#Kubernetes) 220 | 221 | ## What is an Operator? 222 | "Operators are software extensions to K8s which make use of custom resources to manage applications and their components. Operators follow Kubernetes principles, notably the control loop." 223 | 224 | [Table of Contents](#Kubernetes) 225 | 226 | ## Why do we need Operators? 227 | The process of managing applications in Kubernetes isn't as straightforward as managing stateless applications, where reaching the desired status and upgrades are both handled the same way for every replica. In stateful applications, upgrading each replica might require different handling due to the stateful nature of the app, each replica might be in a different status. As a result, we often need a human operator to manage stateful applications. Kubernetes Operator is supposed to assist with this. 228 | This will also help with automating a standard process on multiple Kubernetes clusters 229 | 230 | [Table of Contents](#Kubernetes) 231 | 232 | ## What is GKE? 233 | GKE is Google Kubernetes Engine that is used for managing and orchestrating systems for Docker containers. With the help of Google Public Cloud, we can also orchestrate the container cluster. 234 | 235 | [Table of Contents](#Kubernetes) 236 | 237 | ## What is Ingress Default Backend? 238 | It specifies what to do with an incoming request to the Kubernetes cluster that isn't mapped to any backend i.e what to do when no rules being defined for the incoming HTTP request If the default backend service is not defined, it's recommended to define it so that users still see some kind of message instead of an unclear error. 239 | 240 | [Table of Contents](#Kubernetes) 241 | 242 | ## How to run Kubernetes locally? 243 | Kubernetes can be set up locally using the Minikube tool. It runs a single-node bunch in a VM on the computer. Therefore, it offers the perfect way for users who have just ongoing learning Kubernetes. 244 | 245 | [Table of Contents](#Kubernetes) 246 | 247 | ## What is Kubernetes Load Balancing? 248 | Load Balancing is one of the most common and standard ways of exposing the services. There are two types of load balancing in K8s and they are: 249 | Internal load balancer – This type of balancer automatically balances loads and allocates the pods with the required incoming load. 250 | External Load Balancer – This type of balancer directs the traffic from the external loads to backend pods. 251 | 252 | [Table of Contents](#Kubernetes) 253 | 254 | ## What the following in the Deployment configuration file mean? 
255 | spec:
256 |   containers:
257 |   - env:
258 |     - name: USER_PASSWORD
259 |       valueFrom:
260 |         secretKeyRef:
261 |           name: some-secret
262 |           key: password
263 | Explanation -
264 | The USER_PASSWORD environment variable will store the value of the password key in the secret called "some-secret". In other words, you reference a value from a Kubernetes Secret.
265 | 
266 | [Table of Contents](#Kubernetes)
267 | 
268 | ## What is the difference between Docker Swarm and Kubernetes?
269 | Below are the main differences between Kubernetes and Docker Swarm:
270 | 
271 | The installation procedure of K8s is quite involved, but once it is installed the cluster is robust. On the other hand, the Docker Swarm installation process is very simple, but the cluster is not nearly as robust.
272 | Kubernetes can auto-scale pods based on incoming load, but Docker Swarm cannot.
273 | Kubernetes is a full-fledged framework. Since it maintains cluster state more consistently, its autoscaling is not as fast as Docker Swarm's.
274 | 
275 | [Table of Contents](#Kubernetes)
276 | 
277 | ## How to troubleshoot if the POD is not getting scheduled?
278 | In K8s, the scheduler is responsible for placing pods onto nodes. There are many factors that can lead to an unschedulable POD. The most common one is running out of resources; use a command like kubectl describe pod <pod_name> -n <namespace> to see the reason why the POD is not started. Also, keep an eye on kubectl get events to see all events coming from the cluster.
279 | 
280 | [Table of Contents](#Kubernetes)
281 | 
282 | ## How to run a POD on a particular node?
283 | Various methods are available to achieve it.
284 | nodeName: specify the name of a node in the POD spec configuration; Kubernetes will try to run the POD on that specific node.
285 | nodeSelector: assign a specific label to the node which has special resources and use the same label in the POD spec so that the POD will run only on that node.
286 | nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution are hard and soft requirements for running the POD on specific nodes. This is intended to replace nodeSelector in the future. It depends on the node labels.
287 | 
288 | [Table of Contents](#Kubernetes)
289 | 
290 | ## What are the different ways to provide external network connectivity to K8?
291 | By default, a POD should be able to reach the external network, but for traffic in the other direction we need to make some changes. The following options are available to reach a POD from the outside world.
292 | NodePort (it will expose one port on each node to communicate with it)
293 | Load balancers (L4 layer of the TCP/IP protocol)
294 | Ingress (L7 layer of the TCP/IP protocol)
295 | Another method is to use kubectl proxy, which can expose a service that has only a cluster IP on a local port.
296 | 
297 | $ kubectl proxy --port=8080 $ http://localhost:8080/api/v1/proxy/namespaces/<namespace>/services/<service-name>:<port>/
298 | 
299 | [Table of Contents](#Kubernetes)
300 | 
301 | ## How can we forward the port 8080 container to 8080 service to 8080 ingress to 80 browser and how it can be done?
302 | The ingress exposes port 80 externally for the browser to access and connects to a service that listens on 8080. The ingress listens on port 80 by default. An "ingress controller" is a pod that receives external traffic and handles the ingress, and it is configured by an ingress resource. For this you need to configure the ingress class/selector; if no ingress controller is selected, no ingress controller will manage the ingress.
303 | 304 | Simple ingress Config will look like 305 | host: abc.org 306 | http: 307 | paths: 308 | backend: 309 | serviceName: abc-service 310 | servicePort: 8080 311 | Then the service will look like 312 | kind: Service 313 | apiVersion: v1 314 | metadata: 315 | name: abc-service 316 | spec: 317 | ports: 318 | protocol: TCP 319 | port: 8080 # port to which the service listens to 320 | targetPort: 8080 321 | 322 | [Table of Contents](#Kubernetes) -------------------------------------------------------------------------------- /content/mongo.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | 4 | # MongoDB 5 | ## Table of Contents 6 | 7 | + [What is MongoDB?](#What-is-MongoDB) 8 | + [Why use MongoDB?](#Why-use-MongoDB) 9 | + [What are the Advantages of MongoDB?](#What-are-the-Advantages-of-MongoDB) 10 | + [When Should You Use MongoDB?](#When-Should-You-Use-MongoDB) 11 | + [How does MongoDB exactly store the data?](#How-does-MongoDB-exactly-stores-the-data) 12 | + [MongoDB vs. RDBMS: What are the differences?](#MongoDB-vs-RDBMS-What-are-the-differences) 13 | + [How does MongoDB scale?](#How-does-mongodb-scale) 14 | + [How does MongoDB scale horizontally?](#How-does-mongodb-scale-horizontally) 15 | + [What are the advantages of sharding?](#What-are-the-advantages-of-sharding) 16 | + [What methods can we use for sharding in MongoDB?](#What-methods-can-we-use-for-sharding-in-mongodb) 17 | + [How does atomicity and transaction work in MongoDB?](#How-does-atomicity-and-transaction-word-in-mongodb) 18 | + [What is an Aggregation Pipeline in MongoDB?](#What-is-an-Aggregation-Pipeline-in-mongodb) 19 | + [Where should I use an index in MongoDB?](#Where-should-I-use-an-index-in-mongodb) 20 | + [How would you choose an indexing strategy in MongoDB and what are some common considerations we need to care about?](#How-would-you-choose-a-indexing-strategy-in-mongo-and-what-are-some-common-considerations-we-need-to-care-about) 21 | + [How does MongoDB ensure data consistency and reliability?](#How-does-MongoDB-ensure-data-consistency-and-reliability) 22 | + [What are MongoDB's backup and restore options?](#What-are-MongoDB's-backup-and-restore-options) 23 | + [How does MongoDB handle concurrency?](#How-does-MongoDB-handle-concurrency) 24 | + [What security features does MongoDB offer?](#What-security-features-does-MongoDB-offer) 25 | + [How does MongoDB manage schema design and changes?](#How-does-MongoDB-manage-schema-design-and-changes) 26 | + [What are the limitations or disadvantages of MongoDB?](#What-are-the-limitations-or-disadvantages-of-MongoDB) 27 | + [How do you perform migrations from RDBMS to MongoDB?](#How-do-you-perform-migrations-from-RDBMS-to-MongoDB) 28 | + [What is MongoDB Atlas and what benefits does it provide?](#What-is-MongoDB-Atlas-and-what-benefits-does-it-provide) 29 | + [How can you monitor and optimize MongoDB performance?](#How-can-you-monitor-and-optimize-MongoDB-performance) 30 | + [What is a replica set in MongoDB and how does it work?](#What-is-a-replica-set-in-MongoDB-and-how-does-it-work) 31 | + [How do you handle data integrity in MongoDB?](#How-do-you-handle-data-integrity-in-MongoDB) 32 | + [How does MongoDB fit into CAP theorem?](#How-does-MongoDB-fit-into-CAP-theorem) 33 | + [How do you handle large data volumes in MongoDB?](#How-do-you-handle-large-data-volumes-in-MongoDB) 34 | + [What are the best practices for designing a MongoDB 
schema?](#What-are-the-best-practices-for-designing-a-MongoDB-schema) 35 | + [How does MongoDB handle geographic data and geospatial queries?](#How-does-MongoDB-handle-geographic-data-and-geospatial-queries) 36 | 37 | 38 | 39 | ## What is MongoDB? 40 | MongoDB is a document database built on a horizontal scale-out architecture that uses a flexible schema for storing data. Founded in 2007, MongoDB has a worldwide following in the developer community. 41 | 42 | [Table of Contents](#mongo) 43 | 44 | ## Why use MongoDB? 45 | As a document database, MongoDB makes it easy for developers to store structured or unstructured data. It uses a JSON-like format to store documents. This format directly maps to native objects in most modern programming languages, making it a natural choice for developers, as they don’t need to think about normalizing data. MongoDB can also handle high volume and can scale both vertically or horizontally to accommodate large data loads. 46 | 47 | [Table of Contents](#mongo) 48 | 49 | ## What are the Advantages of MongoDB? 50 | - A Powerful Document-Oriented Database 51 | - Developer User Experience 52 | - Scalability and Transactionality 53 | - Platform and Ecosystem Maturity 54 | 55 | [Table of Contents](#mongo) 56 | 57 | ## When Should You Use MongoDB? 58 | - Integrating large amounts of diverse data 59 | - Describing complex data structures that evolve 60 | - Delivering data in high-performance applications 61 | - Supporting hybrid and multi-cloud applications 62 | - Supporting agile development and collaboration 63 | 64 | [Table of Contents](#mongo) 65 | 66 | ## How does MongoDB exactly stores the data? 67 | 68 | In MongoDB, records are stored as documents in compressed [BSON]([url](https://www.mongodb.com/json-and-bson)) files. The documents can be retrieved directly in JSON format, which has many benefits: 69 | 70 | - It is a natural form to store data. 71 | - It is human-readable. 72 | - Structured and unstructured information can be stored in the same document. 73 | - You can nest JSON to store complex data objects. 74 | - JSON has a flexible and dynamic schema, so adding fields or leaving a field out is not a problem. 75 | - Documents map to objects in most popular programming languages. 76 | 77 | Most developers find it easy to work with JSON because it is a simple and powerful way to describe and store data. 78 | 79 | [Table of Contents](#mongo) 80 | 81 | 82 | ## MongoDB vs. RDBMS: What are the differences? 83 | 84 | One of the main differences between MongoDB and RDBMS is that RDBMS is a relational database while MongoDB is nonrelational. Likewise, while most RDBMS systems use SQL to manage stored data, MongoDB uses BSON for data storage. 85 | 86 | While RDBMS uses tables and rows, MongoDB uses documents and collections. In RDBMS a table -- the equivalent to a MongoDB collection -- stores data as columns and rows. Likewise, a row in RDBMS is the equivalent of a MongoDB document but stores data as structured data items in a table. A column denotes sets of data values, which is the equivalent to a field in MongoDB. 87 | 88 | [Table of Contents](#mongo) 89 | 90 | 91 | 92 | ## How does mongodb scale? 93 | 94 | mongo scales the same way as any distributed application would scale, throw more resources at it(`Vertical Scaling`) or distribute the data on multiple physical servers or nodes (`Horizontal Scaling` ) . 
95 | 96 | or in other words: 97 | 98 | *Vertical Scaling* involves increasing the capacity of a single server, such as using a more powerful CPU, adding more RAM, or increasing the amount of storage space. 99 | 100 | *Horizontal Scaling* involves dividing the system dataset and load over multiple servers, adding additional servers to increase capacity as required. 101 | 102 | While the overall speed or capacity of a single machine may not be high, each machine handles a subset of the overall workload, potentially providing better efficiency than a single high-speed high-capacity server. 103 | 104 | [Table of Contents](#mongo) 105 | 106 | ## How does mongodb scale horizontally? 107 | 108 | MongoDB supports *horizontal scaling* through [sharding.](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-sharding) 109 | 110 | Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations. 111 | 112 | A MongoDB [sharded cluster](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-sharded-cluster) consists of the following components: 113 | 114 | - [shard](https://www.mongodb.com/docs/manual/core/sharded-cluster-shards/#std-label-shards-concepts): Each shard contains a subset of the sharded data. Each shard must be deployed as a [replica set.](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-replica-set) 115 | - [mongos](https://www.mongodb.com/docs/manual/core/sharded-cluster-query-router/): The `mongos` acts as a query router, providing an interface between client applications and the sharded cluster. [`mongos`](https://www.mongodb.com/docs/manual/reference/program/mongos/#mongodb-binary-bin.mongos) can support [hedged reads](https://www.mongodb.com/docs/manual/core/sharded-cluster-query-router/#std-label-mongos-hedged-reads) to minimize latencies. 116 | - [config servers](https://www.mongodb.com/docs/manual/core/sharded-cluster-config-servers/#std-label-sharding-config-server): Config servers store metadata and configuration settings for the cluster. As of MongoDB 3.4, config servers must be deployed as a replica set (CSRS). 117 | 118 | The following graphic describes the interaction of components within a sharded cluster: 119 | 120 | ![Diagram of a sample sharded cluster for production purposes. Contains exactly 3 config servers, 2 or more ``mongos`` query routers, and at least 2 shards. The shards are replica sets.](https://www.mongodb.com/docs/manual/images/sharded-cluster-production-architecture.bakedsvg.svg) 121 | 122 | MongoDB shards data at the [collection](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-collection) level, distributing the collection data across the shards in the cluster. 123 | 124 | [Table of Contents](#mongo) 125 | 126 | 127 | 128 | ## What are the advantages of sharding? 129 | 130 | ##### Reads / Writes[![img](https://www.mongodb.com/docs/manual/assets/link.svg)](https://www.mongodb.com/docs/manual/sharding/#reads---writes) 131 | 132 | MongoDB distributes the read and write workload across the [shards](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-shard) in the [sharded cluster](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-sharded-cluster), allowing each shard to process a subset of cluster operations. Both read and write workloads can be scaled horizontally across the cluster by adding more shards. 
133 | 134 | For queries that include the shard key or the prefix of a [compound](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-compound-index) shard key, [`mongos`](https://www.mongodb.com/docs/manual/reference/program/mongos/#mongodb-binary-bin.mongos) can target the query at a specific shard or set of shards. These [targeted operations](https://www.mongodb.com/docs/manual/core/sharded-cluster-query-router/#std-label-sharding-mongos-targeted) are generally more efficient than [broadcasting](https://www.mongodb.com/docs/manual/core/sharded-cluster-query-router/#std-label-sharding-mongos-broadcast) to every shard in the cluster. 135 | 136 | [`mongos`](https://www.mongodb.com/docs/manual/reference/program/mongos/#mongodb-binary-bin.mongos) can support [hedged reads](https://www.mongodb.com/docs/manual/core/sharded-cluster-query-router/#std-label-mongos-hedged-reads) to minimize latencies. 137 | 138 | ##### Storage Capacity[![img](https://www.mongodb.com/docs/manual/assets/link.svg)](https://www.mongodb.com/docs/manual/sharding/#storage-capacity) 139 | 140 | [Sharding](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-sharding) distributes data across the [shards](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-shard) in the cluster, allowing each shard to contain a subset of the total cluster data. As the data set grows, additional shards increase the storage capacity of the cluster. 141 | 142 | 143 | 144 | ##### High Availability[![img](https://www.mongodb.com/docs/manual/assets/link.svg)](https://www.mongodb.com/docs/manual/sharding/#high-availability) 145 | 146 | The deployment of config servers and shards as replica sets provide increased availability. 147 | 148 | Even if one or more shard replica sets become completely unavailable, the sharded cluster can continue to perform partial reads and writes. That is, while data on the unavailable shard(s) cannot be accessed, reads or writes directed at the available shards can still succeed. 149 | 150 | ## What methods can we use for sharding in mongodb? 151 | 152 | MongoDB supports two sharding strategies for distributing data across [sharded clusters.](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-sharded-cluster) 153 | 154 | ##### Hashed Sharding[![img](https://www.mongodb.com/docs/manual/assets/link.svg)](https://www.mongodb.com/docs/manual/sharding/#hashed-sharding) 155 | 156 | Hashed Sharding involves computing a hash of the shard key field's value. Each [chunk](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-chunk) is then assigned a range based on the hashed shard key values. 157 | 158 | ##### Ranged Sharding[![img](https://www.mongodb.com/docs/manual/assets/link.svg)](https://www.mongodb.com/docs/manual/sharding/#ranged-sharding) 159 | 160 | Ranged sharding involves dividing data into ranges based on the shard key values. Each [chunk](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-chunk) is then assigned a range based on the shard key values. 161 | 162 | [Table of Contents](#mongo) 163 | 164 | ## How does atomicity and transaction word in mongodb? 165 | 166 | In MongoDB, a write operation is [atomic](https://www.mongodb.com/docs/manual/reference/glossary/#std-term-atomic-operation) on the level of a single document, even if the operation modifies multiple embedded documents *within* a single document. 167 | 168 | When a single write operation (e.g. 
[`db.collection.updateMany()`](https://www.mongodb.com/docs/manual/reference/method/db.collection.updateMany/#mongodb-method-db.collection.updateMany)) modifies multiple documents, the modification of each document is atomic, but the operation as a whole is not atomic. 169 | 170 | For situations that require atomicity of reads and writes to multiple documents (in a single or multiple collections), MongoDB supports distributed transactions, including transactions on replica sets and sharded clusters. 171 | 172 | [Table of Contents](#mongo) 173 | 174 | 175 | 176 | ## What is an Aggregation Pipeline in mongodb? 177 | 178 | An aggregation pipeline consists of one or more [stages](https://www.mongodb.com/docs/manual/reference/operator/aggregation-pipeline/#std-label-aggregation-pipeline-operator-reference) that process documents: 179 | 180 | - Each stage performs an operation on the input documents. For example, a stage can filter documents, group documents, and calculate values. 181 | - The documents that are output from a stage are passed to the next stage. 182 | - An aggregation pipeline can return results for groups of documents. For example, return the total, average, maximum, and minimum values. 183 | 184 | [Table of Contents](#mongo) 185 | 186 | ## Where should I use an index in mongodb? 187 | 188 | If your application is repeatedly running queries on the same fields, you can create an index on those fields to improve performance. 189 | 190 | Although indexes improve query performance, adding an index has negative performance impact for write operations. For collections with a high write-to-read ratio, indexes are expensive because each insert must also update any indexes. 191 | 192 | [Table of Contents](#mongo) 193 | 194 | ## How would you choose a indexing strategy in mongo and what are some common considerations we need to care about? 195 | 196 | The best indexes for your application must take a number of factors into account, including the kinds of queries you expect, the ratio of reads to writes, and the amount of free memory on your system. 197 | 198 | The best overall strategy for designing indexes is to profile a variety of index configurations with data sets similar to the ones you'll be running in production to see which configurations perform best. Inspect the current indexes created for your collections to ensure they are supporting your current and planned queries. If an index is no longer used, drop the index. 199 | 200 | These are some considerations you need to take for your indexing strategy: 201 | 202 | - [Use the ESR (Equality, Sort, Range) Rule](https://www.mongodb.com/docs/manual/tutorial/equality-sort-range-rule/#std-label-esr-indexing-rule) 203 | 204 | The ESR (Equality, Sort, Range) Rule is a guide to creating indexes that support your queries efficiently. 205 | 206 | - [Create Indexes to Support Your Queries](https://www.mongodb.com/docs/manual/tutorial/create-indexes-to-support-queries/#std-label-create-indexes-to-support-queries) 207 | 208 | An index supports a query when the index contains all the fields scanned by the query. Creating indexes that support queries results in greatly increased query performance. 209 | 210 | - [Use Indexes to Sort Query Results](https://www.mongodb.com/docs/manual/tutorial/sort-results-with-indexes/#std-label-sorting-with-indexes) 211 | 212 | To support efficient queries, use the strategies here when you specify the sequential order and sort order of index fields. 
213 | 214 | - [Ensure Indexes Fit in RAM](https://www.mongodb.com/docs/manual/tutorial/ensure-indexes-fit-ram/#std-label-indexes-ensure-indexes-fit-ram) 215 | 216 | When your index fits in RAM, the system can avoid reading the index from disk and you get the fastest processing. 217 | 218 | - [Create Indexes to Ensure Query Selectivity](https://www.mongodb.com/docs/manual/tutorial/create-queries-that-ensure-selectivity/#std-label-index-selectivity) 219 | 220 | Selectivity is the ability of a query to narrow results using the index. Selectivity allows MongoDB to use the index for a larger portion of the work associated with fulfilling the query. 221 | 222 | [Table of Contents](#mongo) 223 | 224 | ## How does MongoDB ensure data consistency and reliability? 225 | 226 | MongoDB ensures data consistency and reliability through several mechanisms: 227 | 228 | - Write Concern: Allows you to specify the level of acknowledgment required from MongoDB for write operations. 229 | - Replica Sets: Provides data redundancy and high availability. 230 | - Journaling: Ensures data durability by writing operations to a journal before applying them to the data files. 231 | - Data Validation: Allows you to enforce document structure. 232 | 233 | [Table of Contents](#mongo) 234 | 235 | ## What are MongoDB's backup and restore options? 236 | 237 | MongoDB offers several backup and restore options: 238 | 239 | - mongodump and mongorestore: Command-line tools for creating binary exports of database contents and restoring them. 240 | - Filesystem snapshots: For creating point-in-time snapshots of data. 241 | - MongoDB Atlas Backup: Continuous backups and point-in-time recovery for cloud-hosted databases. 242 | - MongoDB Ops Manager: For on-premises deployments, offering backup automation and point-in-time recovery. 243 | 244 | ## How does MongoDB handle concurrency? 245 | 246 | MongoDB handles concurrency through: 247 | 248 | - Document-level locking: Allows multiple clients to read and write different documents simultaneously. 249 | - WiredTiger storage engine: Provides document-level concurrency control and compression. 250 | - Multi-version Concurrency Control (MVCC): Allows readers to see a consistent view of data without blocking writers. 251 | 252 | [Table of Contents](#mongo) 253 | 254 | ## What security features does MongoDB offer? 255 | 256 | Once again we have many options for security as well, like: 257 | 258 | - Authentication: Supports various authentication mechanisms (SCRAM, x.509 certificates, LDAP, Kerberos). 259 | - Authorization: Role-Based Access Control (RBAC) for fine-grained access control. 260 | - Encryption: Supports encryption at rest and in transit (TLS/SSL). 261 | - Auditing: Allows tracking of system and user activities. 262 | - Network isolation: Supports IP whitelisting and VPC peering. 263 | 264 | [Table of Contents](#mongo) 265 | 266 | ## How does MongoDB manage schema design and changes? 267 | 268 | MongoDB uses a flexible, document-based model for schema design: 269 | 270 | - Schemaless: Collections don't enforce document structure by default. 271 | - Dynamic Schema: Documents in a collection can have different fields. 272 | - Schema Validation: Optional rules can be set to enforce document structure. 273 | - Indexing: Supports various index types to optimize query performance. 274 | 275 | For schema changes: 276 | 277 | - Adding fields: Simply update documents with new fields. 278 | - Removing fields: Delete fields from documents or use $unset in updates. 
279 | - Changing field types: Update documents with new data types. 280 | - Large-scale changes: Can be done programmatically or using tools like Mongoose for Node.js. 281 | 282 | [Table of Contents](#mongo) 283 | 284 | ## What are the limitations or disadvantages of MongoDB? 285 | 286 | Some limitations and disadvantages of MongoDB include: 287 | 288 | - Limited JOIN functionality compared to relational databases. 289 | - Document size limit of 16MB. 290 | - Lack of built-in data integrity constraints (like foreign key constraints). 291 | - Higher storage space requirements due to data duplication in denormalized models. 292 | - Steeper learning curve for those accustomed to relational databases. 293 | - Less mature ecosystem compared to some traditional RDBMSs. 294 | 295 | [Table of Contents](#mongo) 296 | 297 | ## How do you perform migrations from RDBMS to MongoDB? 298 | 299 | Migrating from RDBMS to MongoDB typically involves these steps: 300 | 301 | 1. Analyze the relational schema and design a document model. 302 | 2. Map relational tables to MongoDB collections. 303 | 3. Convert relational data to JSON format. 304 | 4. Use tools like mongoimport or write custom scripts to import data. 305 | 5. Verify data integrity after migration. 306 | 6. Update application code to work with MongoDB instead of SQL. 307 | 7. Test thoroughly to ensure functionality and performance. 308 | 309 | Tools that can assist in this process include: 310 | 311 | - MongoDB Compass for visualizing and manipulating data. 312 | - Official MongoDB connectors for various programming languages. 313 | - Third-party ETL (Extract, Transform, Load) tools. 314 | 315 | [Table of Contents](#mongo) 316 | 317 | ## What is MongoDB Atlas and what benefits does it provide? 318 | 319 | MongoDB Atlas is the cloud-hosted database-as-a-service (DBaaS) platform for MongoDB. Benefits include: 320 | 321 | - Automated deployment and scaling of MongoDB clusters. 322 | - Multi-cloud and multi-region deployment options. 323 | - Built-in security features (network isolation, encryption, access controls). 324 | - Automated backups and point-in-time recovery. 325 | - Performance monitoring and optimization tools. 326 | - Easy integration with other cloud services. 327 | - Reduced operational overhead for database management. 328 | - Automatic updates and patches for the database software. 329 | 330 | [Table of Contents](#mongo) 331 | 332 | ## How can you monitor and optimize MongoDB performance? 333 | 334 | Monitoring and optimizing MongoDB performance involves several strategies: 335 | 336 | ##### Monitoring: 337 | 338 | - Use MongoDB's built-in tools like mongotop and mongostat 339 | - Leverage MongoDB Compass for visual performance insights 340 | - Implement MongoDB Atlas monitoring for cloud deployments 341 | - Use third-party monitoring tools like Prometheus with MongoDB exporter 342 | 343 | ##### Optimization: 344 | 345 | - Analyze and optimize slow queries using explain() method 346 | - Create appropriate indexes based on query patterns 347 | - Use proper schema design to minimize data duplication 348 | - Configure appropriate write concern and read preferences 349 | - Optimize server and storage configurations 350 | - Use appropriate shard keys for distributed systems 351 | 352 | [Table of Contents](#mongo) 353 | 354 | ## What is a replica set in MongoDB and how does it work? 355 | 356 | A replica set in MongoDB is a group of mongod processes that maintain the same data set. 
It provides: 357 | 358 | - High Availability: If the primary node fails, an election occurs to choose a new primary from the secondary nodes. 359 | - Data Redundancy: Data is replicated across multiple nodes. 360 | - Read Scaling: Secondary nodes can handle read operations. 361 | 362 | How it works: 363 | 364 | - One primary node accepts all write operations 365 | - Multiple secondary nodes replicate data from the primary 366 | - Optionally, arbiter nodes can participate in elections but don't hold data 367 | - Automatic failover occurs if the primary becomes unavailable 368 | 369 | [Table of Contents](#mongo) 370 | 371 | ## How do you handle data integrity in MongoDB? 372 | 373 | While MongoDB doesn't have built-in referential integrity like relational databases, you can maintain data integrity through: 374 | 375 | - Schema Validation: Define JSON Schema to enforce document structure 376 | - Application-Level Checks: Implement integrity checks in your application code 377 | - Atomic Operations: Use multi-document transactions for operations that must succeed or fail as a unit 378 | - Unique Indexes: Ensure uniqueness of certain fields 379 | - Data Validation: Use $jsonSchema operator to define validation rules 380 | - Consistent Data Modeling: Design schemas to minimize the need for complex integrity constraints 381 | 382 | [Table of Contents](#mongo) 383 | 384 | ## how does MongoDB fit into CAP theorm it? 385 | 386 | The CAP theorem states that a distributed system can only provide two of three guarantees: Consistency, Availability, and Partition tolerance. MongoDB's position in the CAP theorem can be described as: 387 | 388 | **Partition Tolerance**: MongoDB is designed to handle network partitions in distributed deployments. 389 | **Consistency vs. Availability**: MongoDB allows you to tune the balance between consistency and availability: 390 | 391 | With default settings, MongoDB leans towards **CP** (Consistency and Partition Tolerance) 392 | By adjusting write concern and read preferences, you can shift towards **AP** (Availability and Partition Tolerance) 393 | 394 | In practice: 395 | 396 | - For strong consistency, use majority write concern and read preference 397 | - For higher availability at the cost of potential inconsistency, use lower write concerns or read from secondaries 398 | 399 | [Table of Contents](#mongo) 400 | 401 | ## How do you handle large data volumes in MongoDB? 402 | 403 | MongoDB offers several strategies for handling large data volumes: 404 | 405 | - Sharding: Distribute data across multiple machines to handle large datasets and high throughput operations. 406 | - Indexing: Create appropriate indexes to improve query performance on large collections. 407 | - Aggregation Pipeline: Use for efficient data processing and analysis on large datasets. 408 | - GridFS: Store and retrieve large files efficiently. 409 | - Data Compression: WiredTiger storage engine provides data compression to reduce storage requirements. 410 | - Capped Collections: Use for high-volume data that doesn't require long-term storage. 411 | - Time Series Collections: Optimize storage and querying for time series data. 412 | 413 | 414 | [Table of Contents](#mongo) 415 | 416 | ## What are the best practices for designing a MongoDB schema? 417 | 418 | Key best practices for MongoDB schema design include: 419 | 420 | - Embed related data in a single document when possible for faster reads. 421 | - Use references when data is used in multiple places or for very large datasets. 
422 | - Design schema based on application query patterns. 423 | - Avoid deeply nested documents (keep nesting to 2-3 levels for better performance). 424 | - Use appropriate data types for fields. 425 | - Create indexes to support common queries. 426 | - Consider document growth when designing schemas. 427 | - Use schema versioning for easier updates and migrations. 428 | - Normalize data only when necessary (e.g., for data consistency across multiple collections). 429 | - Use array fields for one-to-many relationships within reasonable limits. 430 | 431 | [Table of Contents](#mongo) 432 | 433 | 434 | ## How does MongoDB handle geographic data and geospatial queries? 435 | 436 | MongoDB provides robust support for geographic data and geospatial queries: 437 | 438 | - Geospatial Indexes: Support for 2d and 2dsphere indexes for efficient geospatial queries. 439 | - GeoJSON Objects: Store location data using standard GeoJSON format. 440 | - Geospatial Queries: Support for various types of geospatial queries: 441 | - Proximity: Find documents near a point. 442 | - Intersection: Find geometries that intersect with a specified geometry. 443 | - Within: Find geometries contained within a specified geometry. 444 | - Aggregation Pipeline: Geospatial stages like $geoNear for complex geo-based analytics. 445 | - Coordinate Systems: Support for both legacy coordinate pairs and GeoJSON objects. 446 | - Geospatial Operators: $near, $geoWithin, $geoIntersects, etc., for flexible querying. 447 | - Spherical Geometry: 2dsphere index supports queries on an Earth-like sphere. 448 | 449 | [Table of Contents](#mongo) 450 | -------------------------------------------------------------------------------- /content/nifi.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | 4 | # NiFi 5 | + [What is Apache Nifi?](#What-is-Apache-Nifi) 6 | + [What is Nifi FlowFile?](#What-is-Nifi-FlowFile) 7 | + [What is Relationship in Nifi DataFlow?](#What-is-Relationship-in-Nifi-DataFlow) 8 | + [What is Reporting Task?](#What-is-Reporting-Task) 9 | + [What is a Nifi Processor?](#What-is-a-Nifi-Processor) 10 | + [Is there a programming language that Apache Nifi supports?](#Is-there-a-programming-language-that-Apache-Nifi-supports) 11 | + [How do you define Nifi content Repository?](#How-do-you-define-Nifi-content-Repository) 12 | + [What is the Backpressure in Nifi system?](#What-is-the-Backpressure-in-Nifi-system) 13 | + [What is the template in Nifi?](#What-is-the-template-in-Nifi) 14 | + [What is the bulleting and how it helps in Nifi?](#What-is-the-bulleting-and-how-it-helps-in-Nifi) 15 | + [Do the attributes get added to content (actual Data) when data is pulled by Nifi?](#Do-the-attributes-get-added-to-content-when-data-is-pulled-by-Nifi) 16 | + [What happens if you have stored a password in a dataflow and create a template out of it?](#What-happens-if-you-have-stored-a-password-in-a-dataflow-and-create-a-template-out-of-it) 17 | + [How does Nifi support huge volume of Payload in a Dataflow?](#How-does-Nifi-support-huge-volume-of-Payload-in-a-Dataflow) 18 | + [What is a Nifi custom properties registry?](#What-is-a-Nifi-custom-properties-registry) 19 | + [Does Nifi works as a Master Slave architecture?](#Does-Nifi-works-as-a-Master-Slave-architecture) 20 | 21 | ## What is Apache Nifi? 22 | NiFi is helpful in creating DataFlow. 
It means you can transfer data from one system to another system as well as process the data in between. 23 | 24 | [Table of Contents](#NiFi) 25 | 26 | ## What is Nifi FlowFile? 27 | A FlowFile is a message or event data or user data, which is pushed or created in the NiFi. A FlowFile has mainly two things attached with it. Its content (Actual payload: Stream of bytes) and attributes. Attributes are key value pairs attached to the content (You can say metadata for the content). 28 | 29 | [Table of Contents](#NiFi) 30 | 31 | ## What is Relationship in Nifi DataFlow? 32 | When a processor finishes with processing of FlowFile. It can result in Failure or Success or any other relationship. And based on this relationship you can send data to the Downstream or next processor or mediated accordingly. 33 | 34 | [Table of Contents](#NiFi) 35 | 36 | ## What is Reporting Task? 37 | A Reporting Task is a NiFi extension point that is capable of reporting and analyzing NiFi's internal metrics in order to provide the information to external resources or report status information as bulletins that appear directly in the NiFi User Interface. 38 | 39 | [Table of Contents](#NiFi) 40 | 41 | ## What is a Nifi Processor? 42 | Processor is a main component in the NiFi, which will really work on the FlowFile content and helps in creating, sending, receiving, transforming routing, splitting, merging, and processing FlowFile. 43 | 44 | [Table of Contents](#NiFi) 45 | 46 | ## Is there a programming language that Apache Nifi supports? 47 | NiFi is implemented in the Java programming language and allows extensions (processors, controller services, and reporting tasks) to be implemented in Java. In addition, NiFi supports processors that execute scripts written in Groovy, Jython, and several other popular scripting languages. 48 | 49 | [Table of Contents](#NiFi) 50 | 51 | ## How do you define Nifi content Repository? 52 | As we mentioned previously, contents are not stored in the FlowFile. They are stored in the content repository and referenced by the FlowFile. This allows the contents of FlowFiles to be stored independently and efficiently based on the underlying storage mechanism. 53 | 54 | [Table of Contents](#NiFi) 55 | 56 | ## What is the Backpressure in Nifi system? 57 | Sometime what happens that Producer system is faster than consumer system. Hence, the messages which are consumed is slower. Hence, all the messages (FlowFiles) which are not being processed will remain in the connection buffer. However, you can limit the connection backpressure size either based on number of FlowFiles or number of data size. If it reaches to defined limit, connection will give back pressure to producer processor not run. Hence, no more FlowFiles generated, until backpressure is reduced. 58 | 59 | [Table of Contents](#NiFi) 60 | 61 | ## What is the template in Nifi? 62 | Template is a re-usable workflow. Which you can import and export in the same or different NiFi instances. It can save lot of time rather than creating Flow again and again each time. Template is created as an xml file. 63 | 64 | [Table of Contents](#NiFi) 65 | 66 | ## What is the bulleting and how it helps in Nifi? 67 | If you want to know if any problems occur in a dataflow. You can check in the logs for anything interesting, it is much more convenient to have notifications pop up on the screen. If a Processor logs anything as a WARNING or ERROR, we will see a "Bulletin Indicator" show up in the top-right-hand corner of the Processor. 
68 | This indicator looks like a sticky note and will be shown for five minutes after the event occurs. Hovering over the bulletin provides information about what happened so that the user does not have to sift through log messages to find it. If in a cluster, the bulletin will also indicate which node in the cluster emitted the bulletin. We can also change the log level at which bulletins will occur in the Settings tab of the Configure dialog for a Processor. 69 | 70 | [Table of Contents](#NiFi) 71 | 72 | ## Do the attributes get added to content when data is pulled by Nifi? 73 | You can certainly add attributes to your FlowFiles at anytime, that’s the whole point of separating metadata from the actual data. Essentially, one FlowFile represents an object or a message moving through NiFi. Each FlowFile contains a piece of content, which is the actual bytes. You can then extract attributes from the content, and store them in memory. You can then operate against those attributes in memory, without touching your content. By doing so you can save a lot of IO overhead, making the whole flow management process extremely efficient. 74 | 75 | [Table of Contents](#NiFi) 76 | 77 | ## What happens if you have stored a password in a dataflow and create a template out of it? 78 | Password is a sensitive property. Hence, while exporting the DataFlow as a template password will be dropped. As soon as you import the template in the same or different NiFi system. 79 | 80 | [Table of Contents](#NiFi) 81 | 82 | ## How does Nifi support huge volume of Payload in a Dataflow? 83 | Huge volume of data can transit from DataFlow. As data moves through NiFi, a pointer to the data is being passed around, referred to as a FlowFile. The content of the FlowFile is only accessed as needed. 84 | 85 | [Table of Contents](#NiFi) 86 | 87 | ## What is a Nifi custom properties registry? 88 | You can use to load custom key, value pair you can use custom properties registry, which can be configure as (in nifi.properties file) 89 | nifi.variable.registry.properties=/conf/nifi_registry 90 | And you can put key value pairs in that file and you can use that properties in you NiFi processor using expression language e.g. ${OS} , if you have configured that property in registry file. 91 | 92 | [Table of Contents](#NiFi) 93 | 94 | ## Does Nifi works as a Master Slave architecture? 95 | No, from NiFi 1.0 there is 0-master philosophy is considered. And each node in the NiFi cluster is the same. NiFi cluster is managed by the Zookeeper. Apache ZooKeeper elects a single node as the Cluster Coordinator, and failover is handled automatically by ZooKeeper. All cluster nodes report heartbeat and status information to the Cluster Coordinator. The Cluster Coordinator is responsible for disconnecting and connecting nodes. Additionally, every cluster has one Primary Node, also elected by ZooKeeper. 
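For reference, a minimal sketch of the per-node cluster settings in nifi.properties (the host names, ports, and ZooKeeper connect string below are placeholder values, not defaults from this document):
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node1.example.com
nifi.cluster.node.protocol.port=11443
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
nifi.cluster.flow.election.max.wait.time=5 mins
With settings like these, every node joins the cluster as an equal peer, and ZooKeeper elects the Cluster Coordinator and Primary Node as described above.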
-------------------------------------------------------------------------------- /content/parquet.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | 4 | ### Will be available soon -------------------------------------------------------------------------------- /content/tableau.md: -------------------------------------------------------------------------------- 1 | ## [Main title](../README.md) 2 | ### [Interview questions](full.md) 3 | 4 | # Tableau 5 | + [What is data visualization in Tableau?](#What-is-data-visualization-in-Tableau) 6 | + [What is the difference between various BI tools and Tableau?](#What-is-the-difference-between-various-BI-tools-and-Tableau) 7 | + [What are different Tableau products?](#What-are-different-Tableau-products) 8 | + [What is a parameter in Tableau?](#What-is-a-parameter-in-Tableau) 9 | + [Tell me something about measures and dimensions?](#Tell-me-something-about-measures-and-dimensions) 10 | + [What are continuous and discrete field types?](#What-are-continuous-and-discrete-field-types) 11 | + [What is aggregation and disaggregation of data?](#What-is-aggregation-and-disaggregation-of-data) 12 | + [What are the different types of joins in Tableau?](#What-are-the-different-types-of-joins-in-Tableau) 13 | + [Tell me the different connections to make with a dataset?](#Tell-me-the-different-connections-to-make-with-a-dataset) 14 | + [What are the supported file extensions in Tableau?](#What-are-the-supported-file-extensions-in-Tableau) 15 | + [What are the supported data types in Tableau?](#What-are-the-supported-data-types-in-Tableau) 16 | + [What are sets?](#What-are-sets) 17 | + [What are groups in Tableau?](#What-are-groups-in-Tableau) 18 | + [What are shelves?](#What-are-shelves) 19 | + [Tell me something about Data blending in Tableau?](#Tell-me-something-about-Data-blending-in-Tableau) 20 | + [How do you generally perform load testing in Tableau?](#How-do-you-generally-perform-load-testing-in-Tableau) 21 | + [Why would someone not use Tableau?](#Why-would-someone-not-use-Tableau) 22 | + [What is Tableau data engine?](#What-is-Tableau-data-engine) 23 | + [What are the various types of filters in Tableau?](#What-are-the-various-types-of-filters-in-Tableau) 24 | + [What are dual axes?](#What-are-dual-axes) 25 | + [What is the difference between a tree and heat map?](#What-is-the-difference-between-a-tree-and-heat-map) 26 | + [What are extracts and schedules in Tableau server?](#What-are-extracts-and-schedules-in-Tableau-server) 27 | + [What are the components in a dashboard?](#What-are-the-components-in-a-dashboard) 28 | + [What is a TDE file?](#What-is-a-TDE-file) 29 | + [What is the story in Tableau?](#What-is-the-story-in-Tableau) 30 | + [What are different Tableau files?](#What-are-different-Tableau-files) 31 | + [How do you embed views into webpages?](#How-do-you-embed-views-into-webpages) 32 | + [What is the maximum num of rows Tableau can utilize at one time?](#What-is-the-maximum-num-of-rows-Tableau-can-utilize-at-one-time) 33 | + [Mention what is the difference between published data sources and embedded data sources in Tableau?](#Mention-what-is-the-difference-between-published-data-sources-and-embedded-data-sources-in-Tableau) 34 | + [What is the DRIVE Program Methodology?](#What-is-the-DRIVE-Program-Methodology) 35 | + [How to use groups in a calculated field?](#How-to-use-groups-in-a-calculated-field) 36 | + [Explain when would you 
use Joins vs Blending in Tableau?](#Explain-when-would-you-use-Joins-vs-Blending-in-Tableau) 37 | + [What is Assume referential integrity?](#What-is-Assume-referential-integrity) 38 | + [What is a Calculated Field and How Will You Create One?](#What-is-a-Calculated-Field-and-How-Will-You-Create-One) 39 | + [How Can You Display the Top Five and Bottom Five Sales in the Same View?](#How-Can-You-Display-the-Top-Five-and-Bottom-Five-Sales-in-the-Same-View) 40 | + [What is the Rank Function in Tableau?](#What-is-the-Rank-Function-in-Tableau) 41 | + [What is the difference between Tableau and other similar tools like QlikView or IBM Cognos?](#What-is-the-difference-between-Tableau-and-other-similar-tools-like-QlikView-or-IBM-Cognos) 42 | 43 | [Table of Contents](#Tableau) 44 | 45 | ## What is data visualization in Tableau? 46 | Data visualization is a way to represent data that is visually appealing and interactive. With advancements in technology, the number of business intelligence tools has increased which helps users understand data, data sets, data points, charts, graphs, and focus on its impact rather than understanding the tool itself. 47 | 48 | [Table of Contents](#Tableau) 49 | 50 | ## What is the difference between various BI tools and Tableau? 51 | The basic difference between the traditional BI tools and Tableau lies in the efficiency and speed. 52 | The architecture of Traditional BI tools has hardware limitations. While Tableau does not have any sort of dependencies 53 | The traditional BI tools work on complex technologies while Tableau uses simple associative search to make it dynamic. 54 | Traditional BI tools do not support multi-thread, in-memory, or multi-core computing while Tableau supports all these features after integrating complex technologies. 55 | Traditional BI tools have a pre-defined data view while Tableau does a predictive analysis for business operations. 56 | 57 | [Table of Contents](#Tableau) 58 | 59 | ## What are different Tableau products? 60 | Tableau like other BI tools has a range of products: 61 | Tableau Desktop: Desktop product is used to create optimized queries out from pictures of data. Once the queries are ready, you can perform those queries without the need to code. Tableau desktop encompasses data from various sources into its data engine and creates an interactive dashboard. 62 | Tableau Server: When you have published dashboards using Tableau Desktop, Tableau servers help in sharing them throughout the organization. It is an enterprise-level feature that is installed on a Windows or Linux server. 63 | Tableau Reader: Tableau Reader is a free feature available on Desktop that lets you open and views data visualizations. You can filter or drill down the data but restricts editing any formulas or performing any kind of actions on it. It is also used to extract connection files. 64 | Tableau Online: Tableau online is also a paid feature but doesn’t need exclusive installation. It comes with the software and is used to share the published dashboards anywhere and everywhere. 65 | Tableau Public: Tableau public is yet another free feature to view your data visualizations by saving them as worksheets or workbooks on Tableau Server. 66 | 67 | [Table of Contents](#Tableau) 68 | 69 | ## What is a parameter in Tableau? 70 | The parameter is a variable (numbers, strings, or date) created to replace a constant value in calculations, filters, or reference lines. 
For example, you create a field that returns true if sales are greater than 30,000 and false otherwise. A parameter lets you replace that hard-coded constant (30,000 in this case) so the threshold can be changed dynamically at run time. Parameters can accept values in the following ways: 71 | All: a simple text field 72 | List: a list of possible values to select from 73 | Range: values selected from a specified range 74 | 75 | [Table of Contents](#Tableau) 76 | 77 | ## Tell me something about measures and dimensions? 78 | In Tableau, when we connect to a new data source, each field is mapped as either a measure or a dimension. These fields are the columns defined in the data source, and each one is assigned a data type (integer, string, etc.) and a role (discrete dimension or continuous measure). 79 | Measures contain numeric values that are analyzed against dimensions. In a typical schema they are stored in a fact table that holds many records and carries foreign keys referring to the associated dimension tables. 80 | Dimensions contain qualitative values (names, dates, geographical data) that describe the data and are used to categorize, segment, and reveal its details. 81 | 82 | [Table of Contents](#Tableau) 83 | 84 | ## What are continuous and discrete field types? 85 | Tableau displays data as either continuous or discrete. Both are mathematical terms: continuous means without interruption, while discrete means individually separate and distinct. 86 | Blue indicates discrete behavior and green indicates continuous behavior. A discrete field creates headers and can easily be sorted, while a continuous field defines an axis in the view and cannot be sorted. 87 | (Image: discrete view in Tableau) 88 | 89 | [Table of Contents](#Tableau) 90 | 91 | ## What is aggregation and disaggregation of data? 92 | Aggregation of data means displaying measures and dimensions in an aggregated form. The aggregate functions available in Tableau include: 93 | SUM (expression): adds up all the values in the expression. Numeric values only. 94 | AVG (expression): calculates the average of the values in the expression. Numeric values only. 95 | MEDIAN (expression): calculates the median of the values across all records in the expression. Numeric values only. 96 | COUNT (expression): returns the number of values in the expression, excluding nulls. 97 | COUNTD (expression): returns the number of distinct values in the expression. 98 | Tableau also lets you change the aggregation type used in a view. 99 | 100 | Disaggregation of data means displaying every underlying row of a data field separately. 101 | 102 | [Table of Contents](#Tableau) 103 | 104 | ## What are the different types of joins in Tableau? 105 | Joins in Tableau work much as they do in SQL, so the join types are the same (a small sketch follows this list): 106 | Left Outer Join: returns all records from the left table and the matching rows from the right table. 107 | Right Outer Join: returns all records from the right table and the matching rows from the left table. 108 | Full Outer Join: returns records from both the left and right tables; unmatched rows are filled with NULLs. 109 | Inner Join: returns only the records that match in both tables. 
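To make the four join types concrete, here is a minimal pandas sketch; the `customers` and `orders` frames are made-up illustration data (not from the original text), and each `how=` value mirrors one of the joins listed above:

```python
import pandas as pd

# Hypothetical illustration data: one unmatched row on each side.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cyd"]})
orders = pd.DataFrame({"customer_id": [2, 3, 4], "amount": [250, 90, 400]})

# Left outer join: all customers, matching orders (Ann gets NaN for amount).
left = customers.merge(orders, on="customer_id", how="left")

# Right outer join: all orders, matching customers (order for id 4 has no name).
right = customers.merge(orders, on="customer_id", how="right")

# Full outer join: every row from both sides, NaN where there is no match.
full = customers.merge(orders, on="customer_id", how="outer")

# Inner join: only the customer_ids present in both tables (2 and 3).
inner = customers.merge(orders, on="customer_id", how="inner")

print(inner)
```

Running it shows, for example, that the inner join keeps only customer_ids 2 and 3, while the full outer join keeps every row and fills the gaps with NaN, which is exactly how unmatched rows surface as NULLs in Tableau.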
110 | 111 | [Table of Contents](#Tableau) 112 | 113 | ## Tell me the different connections to make with a dataset? 114 | There are two types of data connections in Tableau: 115 | LIVE: a live connection queries the data source directly, so the workbook always reflects real-time data. Tableau issues queries against the database and retrieves the results into the workbook. 116 | EXTRACT: a snapshot of the data. The extract file (.tde or .hyper) contains data copied from a source such as a relational database or an Excel spreadsheet. Snapshot refreshes can be scheduled on Tableau Server, and querying an extract does not require a connection to the original database. 117 | 118 | [Table of Contents](#Tableau) 119 | 120 | ## What are the supported file extensions in Tableau? 121 | The file extensions used in Tableau Desktop are: 122 | Tableau Workbook (TWB): contains all worksheets, story points, dashboards, etc. 123 | Tableau Data Source (TDS): contains connection information and metadata about your data source. 124 | Tableau Data Extract (TDE): contains data extracted from other data sources. 125 | Tableau Packaged Workbook (TWBX): bundles the workbook, connection data, metadata, and the data itself as an extract; it can be zipped and shared. 126 | Tableau Packaged Data Source (TDSX): bundles a data source together with its supporting files. 127 | Tableau Bookmark (TBM): bookmarks a specific worksheet. 128 | 129 | [Table of Contents](#Tableau) 130 | 131 | ## What are the supported data types in Tableau? 132 | The following data types are supported in Tableau: 133 | Boolean: True/False 134 | Date: date values (e.g., December 28, 2016) 135 | Date & Time: date and timestamp values (e.g., December 28, 2016 06:00:00 PM) 136 | Geographical values: values used for geographic mapping (e.g., Beijing, Mumbai) 137 | Text/String: text values 138 | Number: decimal values (e.g., 8.00) 139 | Number: whole numbers (e.g., 5) 140 | 141 | 142 | 143 | [Table of Contents](#Tableau) 144 | 145 | ## What are sets? 146 | Sets are custom fields that define a subset of your data in Tableau Desktop. Sets can be computed from conditions or created manually from the dimensions of the data source. 147 | For example, a set of customers whose revenue exceeds a certain value. Such a set can update dynamically as the underlying data and conditions change. 148 | 149 | [Table of Contents](#Tableau) 150 | 151 | ## What are groups in Tableau? 152 | A group combines related members of a dimension into higher-level categories, which makes larger memberships easier to visualize. A group becomes its own field that you can use to categorize values in that dimension. 153 | 154 | [Table of Contents](#Tableau) 155 | 156 | ## What are shelves? 157 | Tableau worksheets contain named areas such as Columns, Rows, Marks, Filters, and Pages, which are called shelves. You place fields on shelves to build visualizations, increase the level of detail, or add context to them. 158 | 159 | [Table of Contents](#Tableau) 160 | 161 | ## Tell me something about Data blending in Tableau? 162 | Data blending is viewing and analyzing data from multiple sources in a single view. The data sources involved in a blend are designated as primary and secondary. 163 | 164 | [Table of Contents](#Tableau) 165 | 166 | ## How do you generally perform load testing in Tableau? 167 | Load testing in Tableau is done to understand the server's capacity with respect to its environment, data, workload, and use. 
It is advisable to run load tests three to four times a year, because with every new user, upgrade, or piece of authored content, the usage, data, and workload change. 168 | Tabjolt was created by Tableau to conduct point-and-run load and performance testing specifically for Tableau Server. Tabjolt: 169 | automates the generation of user-specified loads; 170 | eliminates the dependency on script development and maintenance; 171 | scales linearly with increasing load by adding more nodes to the cluster. 172 | 173 | [Table of Contents](#Tableau) 174 | 175 | ## Why would someone not use Tableau? 176 | The limitations of Tableau are: 177 | Not cost-effective: Tableau is comparatively expensive next to other data visualization tools, and the full cost also covers software upgrades, deployment, maintenance, and training people to use the tool. 178 | Not so secure: when it comes to data, everyone is extra cautious. Tableau addresses security but does not provide centralized data-level security; it relies on row-level security and requires an account for every user, which leaves more room for security gaps. 179 | BI capabilities are limited: Tableau lacks basic BI capabilities such as large-scale reporting, building data tables, and creating static layouts. Result-sharing options are limited, email notification configuration is restricted to admins, and the vendor does not support trigger-based notifications. 180 | 181 | [Table of Contents](#Tableau) 182 | 183 | ## What is Tableau data engine? 184 | The Tableau data engine is an analytical database built into Tableau that delivers fast query responses over extracted and integrated data. The data engine is used whenever you create, refresh, or query extracts, and it can be used for cross-database joins as well. 185 | 186 | [Table of Contents](#Tableau) 187 | 188 | ## What are the various types of filters in Tableau? 189 | Tableau has six different types of filters: 190 | Extract Filter: retrieves a subset of data from the data source into the extract. 191 | Dimension Filter: applies to non-aggregated (discrete) data. 192 | Data Source Filter: keeps users from viewing sensitive information and reduces the amount of data fed into the workbook. 193 | Context Filter: creates a temporary dataset (context) on which other filters are evaluated. 194 | Measure Filter: filters on aggregated measures using operations such as sum, median, and average. 195 | Table Calculation Filter: is applied after the view has been created. 196 | 197 | [Table of Contents](#Tableau) 198 | 199 | ## What are dual axes? 200 | Dual axes are used to analyze two different measures on two different scales in the same graph. This lets you compare multiple attributes in one chart with two independent axes layered over each other. 201 | To add a measure as a dual axis, drag the field to the right side of the view and drop it when a black dashed line appears. You can also right-click (Control-click on Mac) the measure on the Columns or Rows shelf and select Dual Axis. 202 | 203 | [Table of Contents](#Tableau) 204 | 205 | ## What is the difference between a tree and heat map? 206 | Both maps help in analyzing data. A heat map visualizes and compares different categories of data, while a treemap displays a hierarchical structure of data as rectangles. A heat map plots measures against dimensions and encodes the values in color, similar to a text table whose values are shown in different colors. 
207 | (Image: heatmap in Tableau) 208 | A treemap visualizes the hierarchy of data as nested rectangles, with higher levels of the hierarchy shown as larger rectangles and lower levels as smaller ones. 209 | For example, a treemap can show aggregated sales totals across a range of product categories. 210 | (Image: treemap in Tableau) 211 | 212 | [Table of Contents](#Tableau) 213 | 214 | ## What are extracts and schedules in Tableau server? 215 | Data extracts are subsets of data created from data sources. Schedules are refresh jobs that run on published extracts to keep the data up to date. Schedules are managed by the server administrators. 216 | 217 | [Table of Contents](#Tableau) 218 | 219 | ## What are the components in a dashboard? 220 | The components displayed in a dashboard are: 221 | Horizontal: a horizontal layout container lets users combine worksheets and dashboard elements from left to right and edit the height of the elements. 222 | Vertical: a vertical layout container lets users combine worksheets and dashboard elements from top to bottom and edit the width of the elements. 223 | Text: all textual fields. 224 | Image: adds an image to the dashboard; Tableau stores the image reference in the workbook's XML. 225 | Web URL: a hyperlink that points to a web page, file, or other web resource outside Tableau. 226 | 227 | [Table of Contents](#Tableau) 228 | 229 | ## What is a TDE file? 230 | TDE stands for Tableau Data Extract, with the file extension .tde. A TDE file contains data pulled from external sources such as MS Excel, MS Access, or CSV files, stored in a form that makes it faster to analyze and explore. 231 | 232 | [Table of Contents](#Tableau) 233 | 234 | ## What is the story in Tableau? 235 | A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey a narrative to viewers. To create a story: 236 | Click the New Story tab. 237 | Choose a story size from the bottom-left corner or set a custom size. 238 | Build the story by double-clicking a sheet to add it to a story point. 239 | Add a caption by clicking Add a caption. 240 | You can update the highlights by clicking Update in the toolbar, and you can also adjust layout options, format the story, or fit it to your dashboard. 241 | 242 | [Table of Contents](#Tableau) 243 | 244 | ## What are different Tableau files? 245 | Workbooks: contain one or more worksheets and dashboards. 246 | Bookmarks: contain a single worksheet that is easy to share. 247 | Packaged Workbooks: contain a workbook along with supporting local file data and background images. 248 | Data Extraction Files: extract files that contain a local subset of the data. 249 | Data Connection Files: small XML files with various connection details. 250 | 251 | [Table of Contents](#Tableau) 252 | 253 | ## How do you embed views into webpages? 254 | You can integrate interactive views from Tableau Server or Tableau Online into webpages, blogs, web applications, or intranet portals. To see an embedded view, the viewer needs an account on the Tableau Server with permission to access it. To embed a view, click the Share button at the top of the view and copy the embed code into the web page. 255 | You can also customize the embed code or use the Tableau JavaScript API to embed views. 
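The embed code itself comes from the Share button or the JavaScript API, but views can also be discovered programmatically through Tableau's REST API, for which the `tableauserverclient` Python package is a common wrapper. Below is a minimal sketch that only signs in and lists views with their content URLs (the server address, credentials, and site name are placeholders, not values from this document):

```python
import tableauserverclient as TSC  # pip install tableauserverclient

# Placeholder credentials and server address; replace with real values.
tableau_auth = TSC.TableauAuth("some_user", "some_password", site_id="some_site")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(tableau_auth):
    all_views, pagination = server.views.get()
    for view in all_views:
        # content_url is the path fragment you would reference when embedding the view.
        print(view.name, view.content_url)
```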
256 | 257 | [Table of Contents](#Tableau) 258 | 259 | ## What is the maximum num of rows Tableau can utilize at one time? 260 | There is no fixed maximum number of rows or columns: even when the underlying source holds petabytes of data, Tableau intelligently uses only the rows and columns you actually need for your analysis. 261 | 262 | [Table of Contents](#Tableau) 263 | 264 | ## Mention what is the difference between published data sources and embedded data sources in Tableau? 265 | Connection information describes the data you want to bring into Tableau; before publishing, you can create an extract of it. 266 | Published Data Source: contains connection information that is independent of any workbook. 267 | Embedded Data Source: contains connection information that is tied to a particular workbook. 268 | 269 | [Table of Contents](#Tableau) 270 | 271 | ## What is the DRIVE Program Methodology? 272 | The DRIVE program methodology creates a structure around data analytics, derived from enterprise deployments. The methodology is iterative and incorporates agile methods that are fast and effective. 273 | 274 | [Table of Contents](#Tableau) 275 | 276 | ## How to use groups in a calculated field? 277 | Add a GROUP BY clause to custom SQL, or create a calculated field in the data window to group fields. 278 | Using groups in a calculation: you cannot reference ad-hoc groups directly in a calculation. 279 | Blending data using groups created in the secondary data source: only calculated groups can be used in a blend when the group was created in the secondary data source. 280 | Using a group in another workbook: you can replicate a group in another workbook by copying and pasting the calculation. 281 | 282 | [Table of Contents](#Tableau) 283 | 284 | ## Explain when would you use Joins vs Blending in Tableau? 285 | Although the two terms sound similar, they mean different things in Tableau: 286 | Join is used to combine two or more tables within the same data source. 287 | Blending is used to combine data from multiple data sources, such as Oracle, Excel, and SQL Server. 288 | 289 | [Table of Contents](#Tableau) 290 | 291 | ## What is Assume referential integrity? 292 | In some cases, you can improve query performance by selecting Assume Referential Integrity from the Data menu. With this option enabled, Tableau includes a joined table in the query only if fields from it are specifically referenced in the view. 293 | 294 | [Table of Contents](#Tableau) 295 | 296 | ## What is a Calculated Field and How Will You Create One? 297 | Calculated fields are built from formulas that reference other fields. They do not exist in the source data; you create them yourself. 298 | You can create calculated fields to: 299 | segment data 300 | convert the data type of a field, such as converting a string to a date 301 | aggregate data 302 | filter results 303 | calculate ratios 304 | 305 | There are three main types of calculations: 306 | Basic Calculations: transform values of the data fields at the data source's level of detail. 307 | Level of Detail (LOD) Expressions: like basic calculations, but with finer control over the granularity at which the calculation is computed. 308 | Table Calculations: transform values only at the visualization level. 309 | To create a calculated field: 310 | in Tableau, go to Analysis > Create Calculated Field and enter the formula in the calculation editor. 311 | That is all it takes. 
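Tableau's own calculation syntax is not shown in this document, but the idea behind calculated fields can be sketched with a hypothetical pandas DataFrame (the column names and numbers below are invented for illustration): a row-level derived column plays the role of a basic calculation, the threshold flag echoes the parameter example from earlier, and a ratio computed over grouped totals plays the role of an aggregate-level calculation.

```python
import pandas as pd

# Hypothetical sales data standing in for a Tableau data source.
df = pd.DataFrame({
    "region": ["East", "West", "East", "South"],
    "sales": [42000.0, 18000.0, 31000.0, 27000.0],
    "profit": [8400.0, 2700.0, 6200.0, 5400.0],
})

# Row-level calculated field: a ratio derived from existing fields.
df["profit_ratio"] = df["profit"] / df["sales"]

# Parameter-style threshold: change it in one place to re-flag every row.
sales_threshold = 30_000
df["high_sales"] = df["sales"] > sales_threshold

# Aggregate-level calculation: the same ratio over grouped totals.
summary = df.groupby("region").agg(total_sales=("sales", "sum"),
                                   total_profit=("profit", "sum"))
summary["profit_ratio"] = summary["total_profit"] / summary["total_sales"]
print(summary)
```

In Tableau the same distinction shows up as a row-level formula like [Profit] / [Sales] versus an aggregate formula like SUM([Profit]) / SUM([Sales]).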
312 | 313 | [Table of Contents](#Tableau) 314 | 315 | ## How Can You Display the Top Five and Bottom Five Sales in the Same View? 316 | You can show the top five and bottom five sales with the following steps: 317 | Drag Customer Name to Rows and Sales to Columns. 318 | Sort SUM(Sales) in descending order. 319 | Create a calculated field Rank of Sales and use it as a filter to keep only the top five and bottom five ranks. 320 | 321 | [Table of Contents](#Tableau) 322 | 323 | ## What is the Rank Function in Tableau? 324 | The rank functions assign positions (ranks) to the values of a measure in the data set. Tableau can rank measures in the following ways (a short pandas sketch comparing these behaviors appears at the end of this page): 325 | Rank: the RANK function accepts two arguments, an aggregated measure and an optional ranking order that defaults to descending. Identical values share the lowest rank in the tie. 326 | Rank_dense: RANK_DENSE also accepts an aggregated measure and a ranking order. It assigns the same rank to identical values and does not skip the next rank. For instance, for the values 10, 20, 20, 30 the ranks will be 1, 2, 2, 3. 327 | Rank_modified: RANK_MODIFIED assigns identical values the same rank, using the highest position in the tie; for 10, 20, 20, 30 the ranks would be 1, 3, 3, 4. 328 | Rank_unique: RANK_UNIQUE assigns a distinct rank to every value. For the values 10, 20, 20, 30 the assigned ranks are 1, 2, 3, 4. 329 | 330 | [Table of Contents](#Tableau) 331 | 332 | ## What is the difference between Tableau and other similar tools like QlikView or IBM Cognos? 333 | Tableau differs from QlikView and IBM Cognos in several ways: 334 | Tableau is an intuitive data visualization tool that simplifies story creation through simple drag-and-drop techniques, whereas BI tools like QlikView or Cognos turn data into metadata so users can explore data relationships. If your goal is to present data in polished visualizations, choose Tableau; if you need a full BI platform, go with Cognos or QlikView. 335 | Extracting details from the data is easier in Tableau than in heavier BI tools like Cognos. With Tableau, team members from any function, for example sales, can read the data and draw insights, whereas Cognos generally requires deeper tool expertise to be productive. 
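As referenced in the Rank Function answer above, the four tie-handling behaviors map onto pandas' rank() methods. This is only an analogy (Tableau's functions also take an ordering argument that defaults to descending), reusing the 10, 20, 20, 30 example from the text:

```python
import pandas as pd

values = pd.Series([10, 20, 20, 30])

ranks = pd.DataFrame({
    "value": values,
    # RANK analogue: ties share the lowest position in the tie -> 1, 2, 2, 4
    "rank": values.rank(method="min").astype(int),
    # RANK_MODIFIED analogue: ties share the highest position -> 1, 3, 3, 4
    "rank_modified": values.rank(method="max").astype(int),
    # RANK_DENSE analogue: ties share a rank, next value continues at +1 -> 1, 2, 2, 3
    "rank_dense": values.rank(method="dense").astype(int),
    # RANK_UNIQUE analogue: every row gets its own rank -> 1, 2, 3, 4
    "rank_unique": values.rank(method="first").astype(int),
})
print(ranks)
```

Printing the frame reproduces the ranks quoted in the answer, which makes the difference between the four functions easy to verify.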
336 | [Table of Contents](#Tableau) -------------------------------------------------------------------------------- /img/flink/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/flink/1.png -------------------------------------------------------------------------------- /img/flink/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/flink/2.png -------------------------------------------------------------------------------- /img/flink/apche-flink-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/flink/apche-flink-architecture.png -------------------------------------------------------------------------------- /img/flink/bounded-unbounded-stream.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/flink/bounded-unbounded-stream.png -------------------------------------------------------------------------------- /img/flink/flink-job-exe-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/flink/flink-job-exe-architecture.png -------------------------------------------------------------------------------- /img/flink/flink3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/flink/flink3.jpg -------------------------------------------------------------------------------- /img/flink/flinkVsHadoopVsSpark.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/flink/flinkVsHadoopVsSpark.JPG -------------------------------------------------------------------------------- /img/icon/airflow.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/airflow.ico -------------------------------------------------------------------------------- /img/icon/avro.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/avro.ico -------------------------------------------------------------------------------- /img/icon/aws.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/aws.ico -------------------------------------------------------------------------------- /img/icon/awstime.ico: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/awstime.ico -------------------------------------------------------------------------------- /img/icon/azure.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/azure.ico -------------------------------------------------------------------------------- /img/icon/bigquery.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/bigquery.ico -------------------------------------------------------------------------------- /img/icon/bigtable.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/bigtable.ico -------------------------------------------------------------------------------- /img/icon/cassandra.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/cassandra.ico -------------------------------------------------------------------------------- /img/icon/cosmosdb.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/cosmosdb.ico -------------------------------------------------------------------------------- /img/icon/datastruct.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/datastruct.ico -------------------------------------------------------------------------------- /img/icon/deltalake.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/deltalake.ico -------------------------------------------------------------------------------- /img/icon/dwha.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/dwha.ico -------------------------------------------------------------------------------- /img/icon/dynamodb.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/dynamodb.ico -------------------------------------------------------------------------------- /img/icon/fire.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/fire.ico 
-------------------------------------------------------------------------------- /img/icon/flink.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/flink.ico -------------------------------------------------------------------------------- /img/icon/flume.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/flume.ico -------------------------------------------------------------------------------- /img/icon/gcp.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/gcp.ico -------------------------------------------------------------------------------- /img/icon/gcpsql.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/gcpsql.ico -------------------------------------------------------------------------------- /img/icon/github.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/github.ico -------------------------------------------------------------------------------- /img/icon/greenplum.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/greenplum.ico -------------------------------------------------------------------------------- /img/icon/hadoop.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/hadoop.ico -------------------------------------------------------------------------------- /img/icon/hbase.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/hbase.ico -------------------------------------------------------------------------------- /img/icon/hive.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/hive.ico -------------------------------------------------------------------------------- /img/icon/impala.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/impala.ico -------------------------------------------------------------------------------- /img/icon/kafka.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/kafka.ico 
-------------------------------------------------------------------------------- /img/icon/kuber.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/kuber.ico -------------------------------------------------------------------------------- /img/icon/looker.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/looker.ico -------------------------------------------------------------------------------- /img/icon/mongo.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/mongo.ico -------------------------------------------------------------------------------- /img/icon/neptune.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/neptune.ico -------------------------------------------------------------------------------- /img/icon/nifi.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/nifi.ico -------------------------------------------------------------------------------- /img/icon/parquet.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/parquet.ico -------------------------------------------------------------------------------- /img/icon/rds.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/rds.ico -------------------------------------------------------------------------------- /img/icon/redshift.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/redshift.ico -------------------------------------------------------------------------------- /img/icon/spanner.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/spanner.ico -------------------------------------------------------------------------------- /img/icon/spark.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/spark.ico -------------------------------------------------------------------------------- /img/icon/sql.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/sql.ico 
-------------------------------------------------------------------------------- /img/icon/superset.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/superset.ico -------------------------------------------------------------------------------- /img/icon/tableau.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OBenner/data-engineering-interview-questions/b29624875edcc439616a59bf294cececeac3f46c/img/icon/tableau.ico --------------------------------------------------------------------------------