├── .gitignore
├── README.md
├── images
│   ├── 1.1.1.png
│   ├── 1.1.2.png
│   ├── 2.1.1.png
│   ├── 2.1.2.png
│   ├── 2.3.1.png
│   ├── 2.3.2.png
│   ├── 2.5.1.png
│   ├── 3.1.1.png
│   ├── 3.1.2.png
│   ├── 3.2.1.png
│   ├── 3.3.1.png
│   ├── 4.0.1.png
│   └── 4.4.1.png
├── 2_Relational_database
│   ├── 2.0_relational_data_concepts.md
│   ├── 2.1_shared_responsibility_model.md
│   ├── 2.2_Relational_database_oferings_in_Azure.md
│   ├── 2.3_Querying_relational_data.md
│   ├── 2.4_Relational_data_management_task.md
│   └── 2.5_working_with_azure_database.md
├── 0.0_Refernce.md
├── 1_Core_Data_Concepts
│   ├── 1.0_intro_core_data_concepts.md
│   ├── 1.1_data_processing.md
│   └── 1.3_data_analytics.md
├── 3_Non-Relational_Data
│   ├── 3.0_non-relational-data-concepts.md
│   ├── 3.1_Non-Relational_database_offerings_in_Azure.md
│   ├── 3.2_CosmosDB.md
│   └── 3.3_Azure_Storage_Services.md
└── 4_DataWareHousing_in_Azure
    ├── 4.0_analytics_workloads.md
    ├── 4.1_modern_data_warehousing.md
    ├── 4.2_data_ingestion_components.md
    ├── 4.3_data_analytics_tools.md
    └── 4.4_MicroSoft_PowerBI.md

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

.idea

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# AZURE-DP900
Azure DP 900 notes

--------------------------------------------------------------------------------
/images/1.1.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/1.1.1.png
--------------------------------------------------------------------------------
/images/1.1.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/1.1.2.png
--------------------------------------------------------------------------------
/images/2.1.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/2.1.1.png
--------------------------------------------------------------------------------
/images/2.1.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/2.1.2.png
--------------------------------------------------------------------------------
/images/2.3.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/2.3.1.png
--------------------------------------------------------------------------------
/images/2.3.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/2.3.2.png
--------------------------------------------------------------------------------
/images/2.5.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/2.5.1.png
--------------------------------------------------------------------------------
/images/3.1.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/3.1.1.png
--------------------------------------------------------------------------------
/images/3.1.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/3.1.2.png
--------------------------------------------------------------------------------
/images/3.2.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/3.2.1.png
--------------------------------------------------------------------------------
/images/3.3.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/3.3.1.png
--------------------------------------------------------------------------------
/images/4.0.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/4.0.1.png
--------------------------------------------------------------------------------
/images/4.4.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eandbsoftware/AZURE-DP900/HEAD/images/4.4.1.png
--------------------------------------------------------------------------------
/2_Relational_database/2.5_working_with_azure_database.md:
--------------------------------------------------------------------------------

# Working with Azure Database

- Go to the Azure portal -> search for "Azure SQL" -> three options to create:
    - SQL database
    - SQL managed instance
    - SQL virtual machine, i.e., your own configuration of OS and SQL Server version
- ![img.png](../images/2.5.1.png)
- Choose the connectivity method
- Configure the firewall rules
- Click Next and create

- Using the Query editor, we can run SQL queries
- We can also use Azure Data Studio to run queries; it is one of the most popular tools to manage SQL data on Azure

--------------------------------------------------------------------------------
/0.0_Refernce.md:
--------------------------------------------------------------------------------

# Reference links

1. [How to prepare](https://medium.com/bb-tutorials-and-thoughts/how-to-pass-microsoft-azure-dp-900-data-fundamentals-exam-180aebdc27b2)
2. [ExamTopics for practice](https://www.examtopics.com/exams/microsoft/dp-900/view/)
3. [Exam details](https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4wsKZ)
4. [200 practice questions](https://medium.com/bb-tutorials-and-thoughts/200-practice-questions-for-azure-data-dp-900-fundamentals-exam-ea2446ee3a0)
5. [Read this after course completion](https://docs.microsoft.com/en-us/learn/paths/azure-data-fundamentals-explore-data-warehouse-analytics/)

## Exam topics breakup:

https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4wsKZ

1. Describe core data concepts (15–20%)
2. Describe how to work with relational data on Azure (25–30%)
3. Describe how to work with non-relational data on Azure (25–30%)
4. Describe an analytics workload on Azure (25–30%)

**Focus on the following topics:**
- Relational and non-relational databases
- Modern data warehouse
- Reporting (Power BI)
- Data ingestion and processing
- Databricks
- HDInsight

--------------------------------------------------------------------------------
/1_Core_Data_Concepts/1.0_intro_core_data_concepts.md:
--------------------------------------------------------------------------------

# Core data concepts

## What is data?

- Data is the new oil.
- Data is a collection of facts, such as numbers, descriptions, and observations, used in decision-making.

## How can we organize data?

- Structured
- Semi-structured
- Unstructured

### Structured data
- Tabular data represented by rows and columns in a relational database
- Examples: SQL Server, Oracle, DB2, MySQL, etc.
- We have upfront information on the structure of the data, called a schema

### Semi-structured data

- Not classified by the rigid schema structure of a relational database, but it still holds some structure.
- Examples: XML (Extensible Markup Language), JSON (JavaScript Object Notation), key-value pairs, graphs, etc.
- Technologies that work with semi-structured data formats: MongoDB, Cosmos DB, Cassandra

### Unstructured data

- Doesn't have any structure or searchable fields
- Includes binary files, documents, images, audio files, etc.
- Data hosting can be done using a file server, SharePoint, Azure Files, Azure Data Lake, or Azure Blob (Binary Large Object) Storage.

--------------------------------------------------------------------------------
/3_Non-Relational_Data/3.0_non-relational-data-concepts.md:
--------------------------------------------------------------------------------

# Non-Relational Data Concepts

- When data is coming from multiple sources, viz. mobile phones, social networks, etc., a relational database is not a good fit: its schema may not be compatible with the incoming data, and the cost will be very high for large volumes of data.

- In that case, data is first fed into a staging layer, which can hold large volumes of non-relational data
- This data is normally unstructured, like audio and video files
- **Azure Files or Blob Storage is used for this type of data repository**
- A few good use cases for non-relational data are IoT and telematics, gaming, web and mobile, and social network applications.
- We also have semi-structured data, which contains fields (not the same as a relational DB); it can be stored in Azure Files for further processing
- **The main non-relational file format is JSON, which is probably the most popular; it has quickly replaced XML as the de facto standard for non-relational data.**
- Another quite popular one is **Parquet**, which uses a columnar format. Parquet was developed by **Cloudera and Twitter**, and its efficient levels of compression and encoding make it an excellent format for data ingestion.
- Another common columnar format is **ORC, which stands for Optimized Row Columnar format**. It was **developed by Hortonworks** for optimizing read and write operations on Apache Hive.
- **Avro** uses a row-based format. It was also created by **Apache**. The sketch below makes the row vs. columnar distinction concrete.
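
- A minimal Python sketch of the same small dataset written in a row-oriented format (JSON Lines, conceptually like Avro) and a columnar one (Parquet). Assumes pandas and pyarrow are installed; file names are arbitrary:

```python
# pip install pandas pyarrow
import pandas as pd

df = pd.DataFrame({
    "device": ["sensor-1", "sensor-2", "sensor-3"],
    "reading": [21.5, 19.8, 22.1],
})

# Row-oriented: one self-describing record per line
df.to_json("readings.json", orient="records", lines=True)

# Columnar: values stored column by column, compressed and encoded,
# which makes analytical scans over a few columns cheap
df.to_parquet("readings.parquet")
```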
--------------------------------------------------------------------------------
/1_Core_Data_Concepts/1.3_data_analytics.md:
--------------------------------------------------------------------------------

# Data Analytics

- Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
- Examples: getting recommendations on websites for which product to buy or which movie to watch, predicting whether stock prices will go up or down, etc.

- 5 types:

### Descriptive analytics

- Focuses on discovering what happened in the past, based on historical data
- Helps summarize large data sets into meaningful outcomes
- Example: total sales last year

### Diagnostic analytics
- Focuses on why it happened
- Example: you could identify that a substantial number of your customers left you in the month when your competitor launched a new product

### Predictive analytics
- Goal is to discover what will happen
- Examples: what will the Microsoft stock price be by the end of next year? Will this customer default on his loan?

### Prescriptive analytics
- What you should do next
- While predictive analytics might tell you that an equipment component is likely to fail if it gets above a certain temperature, prescriptive analytics could recommend slowing down the machine to prevent the failure from happening. This could give a maintenance team enough time to arrive and replace the component before it breaks.

### Cognitive analytics
- Attempts to draw inferences from existing data and patterns
- It's the realm of artificial intelligence: it can do things such as transcribing audio to text (or vice versa), finding objects in images, detecting anomalies, and using natural language processing (NLP) to understand and translate language, and much more. Microsoft has an entire set of APIs related to this called Cognitive Services.

--------------------------------------------------------------------------------
/4_DataWareHousing_in_Azure/4.1_modern_data_warehousing.md:
--------------------------------------------------------------------------------

# Modern data warehousing (MDW)

- Data warehouses are no longer just about reorganizing the data from your OLTP systems into a more read-intensive format. Instead, data warehouses nowadays gather data from multiple data stores, including IoT, social networks, web APIs, files, and multiple corporate systems such as your CRM, HR, and ERP applications.

- **Advantages**:
    - Cross-referencing this data instead of keeping it in silos

- Modern data warehouses must be able to read data in various formats
    - XML, Parquet, ORC, JSON, and much, much more
    - You can even use cognitive services to extract text from a recorded phone call or obtain metadata from pictures that your company has.
- Modern data warehouse solutions should be able to handle big data
- **Phases of MDW:**
    - **Data ingestion**
        - Capturing the data from different sources
        - **Solutions from Azure: Azure Data Factory, Stream Analytics, or Event Hubs**
    - **Data staging**
        - Holds the data temporarily
        - ELT operations keep the data in the staging layer and have the analytical system grab the data on the fly for further analysis
        - **Azure Data Lake** can be used if we want to keep data in raw format
    - **Data transformation:**
        - Transform and process the data, and model it into a format that is more convenient for reporting
        - Includes data cleansing, filtering, normalization or denormalization, format conversions, and so on
        - **Solutions from Azure: Azure Data Factory and Databricks**
    - **Data modeling:**
        - Model and serve your data so that business intelligence analysts can generate reports and conclusions about it
        - Azure Analysis Services and Azure Synapse Analytics
        - **Solution: Power BI**

--------------------------------------------------------------------------------
/3_Non-Relational_Data/3.1_Non-Relational_database_offerings_in_Azure.md:
--------------------------------------------------------------------------------

# Non-Relational Database Offerings in Azure

- Also called NoSQL databases
- **Types**:
    - **Key-value stores**
        - Simplest and fastest
        - Each row can have any number of columns
        - Each item is a key-value pair
        - Read and write data very quickly
        - **Limitations**:
            - Search only based on the key
            - Write operations are restricted to insert and delete
            - To update an item: retrieve it, modify it in memory, and write it back to the DB, overwriting the original
        - **Azure Tables, Cosmos DB (Table API)**

    - **Document databases**
        - Richer than key-value stores
        - Store data in JSON format
        - Other formats: XML, YAML, etc.
        - Flexible schema: any number of columns
        - More flexibility than a relational database
        - Can search on values as well as keys
        - Support indexing for fast retrieval
        - A single document has all the needed info, while for the same info in a relational DB we may have to query multiple tables (see the sketch at the end of this file)
        - **Limitations**:
            - Data repetition may occur
            - More storage
        - Not as fast as key-value stores, but much better search
        - Cosmos DB (SQL API)
    - **Column-family databases**
        - Similar to relational, but here we can group logically related columns into column families
        - ![img.png](../images/3.1.1.png)
        - Retrieval of related info is much faster
        - As JSON is a good example of a document structure, Parquet is a good example of a column-family structure.
        - Apache Cassandra, Cosmos DB (Cassandra API)
    - **Graph databases**
        - Used to model complex relationships
        - Consist of nodes (info about objects) and edges (info about relationships)
        - ![img.png](../images/3.1.2.png)
        - Edges can also have a direction
        - The goal of a graph DB is to perform queries that efficiently traverse this network of nodes and edges
        - Fast analysis without nested joins and subqueries
        - Example: Cosmos DB (Gremlin API)
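
- A small Python sketch of the document-database idea referenced above: the same customer data as normalized relational rows vs. a single self-contained document (all names and values here are made up for illustration):

```python
import json

# Relational shape: the same information split across two "tables"
customers = [{"customer_id": 1, "name": "Ada"}]
orders = [
    {"order_id": 100, "customer_id": 1, "total": 25.0},
    {"order_id": 101, "customer_id": 1, "total": 40.0},
]

# Document shape: one document with the orders embedded, so a single
# read returns everything (at the cost of some data repetition)
customer_doc = {
    "id": "1",
    "name": "Ada",
    "orders": [
        {"order_id": 100, "total": 25.0},
        {"order_id": 101, "total": 40.0},
    ],
}
print(json.dumps(customer_doc, indent=2))
```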
--------------------------------------------------------------------------------
/2_Relational_database/2.0_relational_data_concepts.md:
--------------------------------------------------------------------------------

# Relational Data Concepts:

- Everything is **hosted in tables**.
- One of the most important and widespread methods for storing and retrieving data
- **Provides a simple and well-understood model for holding data**
- We use a relational database when we need strong consistency for data
- Example applications: banking applications, e-commerce and online retail systems, and flight and hotel reservation sites

## Understanding Relational Databases

- Everything is stored in tables
- A table is just a database object that **holds data in a row and column format.**
- Each table must have a column, or a combination of columns, that uniquely identifies each row in that table. This is called a primary key. No two rows can have the same primary key.
- As you commonly have several tables in your database, you generally use primary and foreign keys for the relationships between these tables
- A foreign key is just a reference to a primary key on a related table.
- Cardinality of the relationship:
    - one-to-one
    - one-to-many
    - many-to-many
- Views:
    - A view is just a virtual table, based on the results of a query, that allows you to filter the data

- **Indexes**:
    - Indexes help you search data substantially faster
    - They occupy extra space in the database, and
    - each index must be maintained by the database server
    - Two types:
        - **Clustered index**:
            - Physically organizes the data in your table, based on a column or key that you choose
            - Most important index on the table
            - We can have only one clustered index
        - **Nonclustered index**:
            - Less efficient than clustered ones
            - We can have as many as required
    - The overall rule: create a clustered index based on your most-searched column, and one or more nonclustered indexes for columns that are also searched relatively often.
- **Stored procedures and functions**:
    - Implement repeatable portions of code
    - Can even accept input and output parameters, making them quite flexible
- A minimal end-to-end sketch of tables, keys, and indexes follows.
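
A small runnable Python sketch of these concepts, using the standard-library sqlite3 module (table and column names are made up; SQLite has no clustered indexes, so only a secondary index is shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# The primary key uniquely identifies each row; the foreign key in Orders
# references the primary key of Customers (a one-to-many relationship).
conn.execute("""CREATE TABLE Customers (
    CustomerId INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL)""")
conn.execute("""CREATE TABLE Orders (
    OrderId    INTEGER PRIMARY KEY,
    CustomerId INTEGER NOT NULL REFERENCES Customers(CustomerId),
    Total      REAL)""")

# A secondary (nonclustered-style) index on a frequently searched column
conn.execute("CREATE INDEX idx_orders_customer ON Orders(CustomerId)")

conn.execute("INSERT INTO Customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO Orders VALUES (100, 1, 25.0)")
row = conn.execute("""SELECT c.Name, o.Total
                      FROM Orders o
                      JOIN Customers c ON c.CustomerId = o.CustomerId""").fetchone()
print(row)  # ('Ada', 25.0)
```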
--------------------------------------------------------------------------------
/4_DataWareHousing_in_Azure/4.0_analytics_workloads.md:
--------------------------------------------------------------------------------

# Analytics workloads

- Analytical workloads are about transforming data into various insights

## Data processing solutions:

- OLTP
    - Online transaction processing
    - For the day-to-day operations of a business
    - These kinds of systems record transactions, which are small, discrete units of work that need to be executed or rolled back as a whole; example: a deposit into a bank account
    - Follow the ACID rules, which stand for atomicity, consistency, isolation, and durability
    - Often associated with relational databases, such as SQL Server, Oracle, DB2, and so on
    - Optimized for CRUD transactions: create, read, update, and delete

- OLAP
    - Online analytical processing
    - Provides support for business intelligence (BI), which is a set of technologies, applications, and practices to support business decision-making
    - Analytical systems must be optimized for read operations
    - Examples: data warehousing solutions such as SQL Server Analysis Services or Azure Synapse Analytics

## Data Modeling:

- Based on the data processing solution, OLTP or OLAP, we decide how we will model our data
- OLTP:
    - Works on normalized data
    - Normalization consists of distributing the data across several related tables to ensure data integrity and prevent data redundancy
    - Favours CRUD operations

- OLAP:
    - De-normalized model, which decreases the number of tables even if that incurs some data redundancy
    - Because analytics requires many tables to be joined, a de-normalized model is best suited

## Modeling Standards:

- **Star schema:**
    - Main modeling standard for business intelligence solutions
    - ![img.png](../images/4.0.1.png)
    - In a star schema, you have a central fact table with data about something that happened: the sale of a product, an ATM withdrawal, or the prescription of a medical treatment
    - Then you have dimension tables that describe what happened
    - One-to-many relationships between facts and dimensions
- **Snowflake schema**
    - Your dimensions are more normalized
    - This decreases data repetition, but makes your models more complex

--------------------------------------------------------------------------------
/4_DataWareHousing_in_Azure/4.4_MicroSoft_PowerBI.md:
--------------------------------------------------------------------------------

# Microsoft Power BI

- Power BI wraps up all the data work in beautifully designed reports and dashboards.

- **Definition**: Power BI is a collection of software services, apps, and connectors that work together to create visually immersive data visualization experiences.

## Data Visualization

- Graphical representation of data and information
- Uses visual elements such as charts, maps, graphs, and tables to help you understand trends, anomalies, and patterns in data
- The most popular data visualization tool is Power BI
- A Power BI dashboard may include elements such as:
    - bar and column charts
    - line charts
    - pie and donut charts
    - treemaps
    - maps

## Products in Power BI

- Power BI Desktop
    - Create the reports
- Power BI Pro service
    - Cloud-based service originally designed for viewing and sharing reports and dashboards
- Power BI Mobile
    - View your reports on a mobile phone

- It can get data from over 130 different data sources of various types, such as files, online services, databases, data warehouses, and SaaS applications
- This data can be combined into a single data model for easier reporting
- Extremely flexible

## Definitions to know in Power BI

- Datasets
    - Collection of data that Power BI will use to create the report
    - Power BI Desktop has a very powerful tool for obtaining, transforming, and cleansing the data called Power Query, which also allows you to combine several data sources into a single data model. This functionality is also available in the Power BI service, but there it's called dataflows.

- Visualizations:
    - Visual representations of the datasets, such as line, bar, or pie charts, tables, matrices, and maps

- Reports
    - Collection of visuals grouped together into one or more pages.
    - The report is the unit of work that you publish to the Power BI service, to be viewed and shared by others.
- Dashboards
    - Aggregations of one or more reports into a single page

- Apps
    - Can combine dashboards and reports into a single package
    - Can be distributed internally or externally

![img.png](../images/4.4.1.png)

--------------------------------------------------------------------------------
/2_Relational_database/2.1_shared_responsibility_model.md:
--------------------------------------------------------------------------------

# Cloud Service Models: On-Premises, IaaS, PaaS, and SaaS

![img.png](../images/2.1.1.png)

### On-premises:
- This was the option traditionally chosen before public clouds became widely available
- Option for highly regulated industries that forbid cloud hosting
- You are responsible for everything: hardware, software, cabling, patching, backups, VMs, storage, and so on
- Examples of on-premises technologies are SQL Server or a physical file server running Windows Server 2019.

### IaaS: Infrastructure-as-a-Service

- Refers to actual servers provided by Azure
- Scaling can be done on an as-needed basis
- IaaS provides servers, storage, and networking as a service
- Maintenance of the server hardware is done by Azure
- You purchase and configure your own software (OS, middleware, and applications), install it on the host, and maintain it
- IaaS includes VMs, networking, storage, firewalls, and the physical hardware everything runs on

### PaaS: Platform-as-a-Service
- A superset of IaaS
- On top of the IaaS offering, **it also provides middleware and development tools, BI services, database management systems, and more**
- PaaS is designed to support the complete web application lifecycle: building, testing, deploying, managing, and updating.
- You just manage the application that you develop, and the cloud service provider manages everything else, including security features, data warehouse services, VM provisioning, networking, etc.
- Example: Cosmos DB
- Azure SQL Managed Instance gives the highest compatibility with on-premises SQL Server.

### SaaS: Software-as-a-Service
- A superset of both PaaS and IaaS
- **You don't own the software; you just pay for usage**
- No maintenance to be done by you; it's taken care of by the respective service provider
- Examples: Microsoft 365, Gmail for email, Azure AD, etc.

![img.png](../images/2.1.2.png)

### Serverless

- **You don't have to manage any servers**
- Azure Functions is probably the best-known example of serverless on Azure (see the sketch below).
- Serverless architecture takes PaaS to the extreme by fully abstracting away the server, in such a way that a single function of code can be hosted, deployed, run, and managed without even having to maintain a full application.
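
A minimal sketch of an HTTP-triggered Azure Function handler in Python. A deployed function app would also need a `function.json` binding configuration (omitted here); you write only the handler, and Azure manages the servers:

```python
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Read an optional "name" query parameter and respond
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!", status_code=200)
```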
--------------------------------------------------------------------------------
/2_Relational_database/2.3_Querying_relational_data.md:
--------------------------------------------------------------------------------

# Querying Relational Data

- SQL, or Structured Query Language, was originally created in the 1970s as a way to query relational databases
- By 1987, it had been made a standard by both ANSI and ISO, and it has been the main language used by most relational database systems ever since.
- Vendors create their own dialects of the SQL language, extending it with additional features. For example, SQL Server uses Transact-SQL, Oracle uses PL/SQL, PostgreSQL uses pgSQL, and so on.
- Vendor-specific dialects of SQL have additional features.

## SQL Language Command Types:

### DML: Data Manipulation Language

- Allows you to perform **CRUD operations on the data.**
- CRUD stands for create, read, update, and delete.
- The most commonly used command is SELECT, to retrieve data from a table
- **DML focuses on data**
- Common commands:
    - SELECT: read (query) data
    - INSERT
    - UPDATE
    - DELETE
    - MERGE: merges (syncs) data

- Add a WHERE clause to filter data
- Add a JOIN to relate tables: INNER JOIN, CROSS JOIN, FULL JOIN, etc.
- Add functions for additional logic; example with MAX below (a client-side Python version of this query appears at the end of this file):

example:
```sql
SELECT MAX(ListPrice)
FROM Production.Product P
FULL JOIN Production.ProductModel PM ON PM.ProductModelID = P.ProductModelID
WHERE Color = 'Blue'
```

### DDL: Data Definition Language

- **Used to create, modify, and delete objects in the database**, such as tables, views, stored procedures, and functions.
- Common DDL commands:
    - CREATE: create a new object
    - ALTER: modify a property of an existing one
    - RENAME
    - DROP

example:
![img.png](../images/2.3.1.png)

- As we can see in the above example, we need to provide a name and data type for each column, and also define whether the value can be NULL

**- DDL focuses on objects, compared to DML, which focuses on data.**

### DCL: Data Control Language
- Used to set permissions on database objects
- Commands used:
    - GRANT
    - REVOKE
    - DENY

### TCL: Transaction Control Language
- Used to manage transactions
- Commands used:
    - BEGIN TRAN
    - COMMIT TRAN
    - ROLLBACK

### Which tools can we use to query an Azure database?

![img.png](../images/2.3.2.png)
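
A minimal sketch of running the DML query above from application code with pyodbc. The server, database, and credentials are hypothetical placeholders, and the local machine needs the Microsoft ODBC driver installed:

```python
# pip install pyodbc
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<db>;"
    "UID=<user>;PWD=<password>"
)
cursor = conn.cursor()
# Parameterized query: the ? placeholder avoids SQL injection
cursor.execute("SELECT MAX(ListPrice) FROM Production.Product WHERE Color = ?", "Blue")
print(cursor.fetchone()[0])
conn.close()
```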
--------------------------------------------------------------------------------
/3_Non-Relational_Data/3.3_Azure_Storage_Services.md:
--------------------------------------------------------------------------------

# Azure Storage Services

- An Azure storage account can contain Azure Blob storage, Azure Files, Azure Tables, and Azure Queues.

## Azure Table Storage
- Microsoft's implementation of key-value stores
- Doesn't have the concepts of schema, relationships, stored procedures, secondary indexes, or foreign keys that are present in relational databases
- Can only search based on the key, not on the values
- Data insertion and retrieval, as expected with key-value stores, is quite fast regardless of the database size.
- Like Cosmos DB, it divides the data into partitions. You can add the partition key to your queries for even faster results, which makes the choice of the partition key an important design decision.
- Azure Tables is quite robust, being able to store hundreds of terabytes of data, and is ideal when you need extremely fast data ingestion, such as IoT and telematics scenarios.
- Like Cosmos DB, it supports multiple read replicas.
- Unlike Cosmos DB, it does not support multiple write regions.

## Azure Blob Storage

- Microsoft's main solution for unstructured data, or blobs (see the upload sketch at the end of this file).
- Available under an Azure storage account.
- Supports massive amounts of data along with metadata.
    - Example: if you're storing X-ray and MRI images of patients in Azure Blob storage, you could store them along with metadata such as name, age, and patient ID
- Supports encryption, including the possibility of bringing your own encryption keys, which is called BYOK.
- The **hot tier** uses high-performance media and is therefore more expensive, so it's ideal for more frequently accessed data.
- The **cool tier** is more in the middle ground: cheaper, but not as fast as the hot tier.
- The **archive tier** is the cheapest, but it might take a few hours to retrieve data from it.
    - **Re-hydration** is the process of moving archived data back to an online tier. You could use the archive tier for files that you're unlikely to need again but still need to keep for compliance reasons, such as a three-year-old backup.
- Some uses of Azure Blob storage are: serving images and documents for a website, streaming audio and video, storing backup or archived data, and storing data for analytics.

## Azure Files
- Azure Files enables you to create file shares in the cloud, accessible through the internet.
- Supports SMB 3.0, which is the protocol used by Windows; support for NFS, which is used by Linux, is currently in preview.
- Supports encryption, including BYOK, and Azure AD
- Two main performance tiers: standard, which uses HDD disks, and premium, which uses solid-state drives (SSDs)
- The main use of Azure Files is the migration of file shares from on-premises Windows servers. You can use a command-line utility called AzCopy to perform the copy to the cloud.

## Queue Storage

- Stores large numbers of messages that can be retrieved and processed asynchronously by application components.

![img.png](../images/3.3.1.png)
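
A minimal sketch of uploading a blob with metadata using the azure-storage-blob SDK; the connection string, container, and file names below are hypothetical placeholders:

```python
# pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("scans")

# Upload an image together with searchable metadata, as in the
# patient X-ray example above
with open("xray_001.png", "rb") as data:
    container.upload_blob(
        name="patient-123/xray_001.png",
        data=data,
        metadata={"patient_id": "123", "age": "42"},
    )
```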
--------------------------------------------------------------------------------
/2_Relational_database/2.2_Relational_database_oferings_in_Azure.md:
--------------------------------------------------------------------------------

# Relational Database Offerings in Azure

## SQL Server hosted on Azure

- Same as SQL Server on-premises, but hosted on Azure
- IaaS level
- No hardware or cabling
- You still manage patching, upgrades, backups, licenses, etc.
- Required in lift-and-shift scenarios where there are still compatibility differences with Azure SQL Database.

## Azure SQL Database:

- PaaS-level service where backups, patching, etc. are all done by Azure, i.e., management of the database is done by Azure
- 99.99% SLA
- No upfront costs
- **Available in the Business Critical tier**:
    - High speed, availability, and low latency
    - Read-only copy for reporting

- **Disadvantages**:
    - You cannot install custom software, as it would compromise the security of the Azure environment
    - You can shut down an Azure SQL VM, but you cannot shut down an Azure SQL Database

- Microsoft has created the Data Migration Assistant, which scans your database and tells you the best migration option for you, based on any compatibility issues

## Azure SQL is actually available in three different options

### Single Database
- Low cost and minimal administration
- Preferred for new projects

- Limitations:
    - Azure SQL Single Database can only see one database at a time, which means that you cannot create cross-database queries.

### Elastic Pool
- Very similar to Single Database, except that it allows multiple databases to share the same pool of resources, such as processor and memory.

### Azure SQL Managed Instance

- Close to 100% compatibility with SQL Server on-premises
- It came in as a response to some limitations of Azure SQL Single Database that were preventing companies from migrating to the cloud, such as linked servers, database mail, and cross-database queries and transactions.
- On SQL MI, you manage at the server level, not the database level, but you still enjoy the same PaaS-level benefits, such as automated backups, patching, and advanced security.

## Other Azure database offerings for open-source databases:

### Azure Database for MySQL
- PaaS implementation of MySQL Community Edition

### Azure Database for MariaDB
- A newer database management system created by the original developers of MySQL
- The engine has been rewritten and optimized for better performance
- It also has some interesting new features, such as support for versioning, which allows you to query your tables as they were at different points in time.
- Also built on top of the Community Edition

### Azure Database for PostgreSQL
- Hybrid relational-object database system
- PostgreSQL is extensible, and has good support for geometric data such as lines, circles, and polygons.
- It has two deployment options: Single Server for smaller workloads, and Hyperscale, which uses multiple nodes for faster query performance.
--------------------------------------------------------------------------------
/1_Core_Data_Concepts/1.1_data_processing.md:
--------------------------------------------------------------------------------

# Data Processing:

- Data processing is the conversion of raw data into meaningful information, through a specific method, after it has been ingested and collected.

2 types:
1. Batch processing
2. Stream processing

## Batch Processing:

- In this processing mode, newly arriving data elements are collected into a group. The whole group is then processed at a future time, as a batch, when a certain condition is met
- **Conditions may include**:
    - Scheduled time intervals: salary processing, credit card or utility bills, etc.
    - Event-based processing: e.g., CPU utilization, to optimize the utilization of servers
    - A specific size or volume of data has arrived: e.g., higher volumes on Black Friday, etc.

- Advantages:
    - Can process large amounts of data
    - Can process data at convenient times

- Disadvantages:
    - High latency

- Example: complex analytics, such as moving the data to a data warehouse for business intelligence operations

## Stream Processing:

- Processing data in real time, as it arrives
- Useful for time-critical operations requiring immediate responses.
- Examples:
    - Sending telemetry data from a device at the edge; the device could be an IoT device, a mobile phone, etc.

- Most organizations require a combination of batch and stream processing for their day-to-day operations
- Stream processing is used for simpler, more reactive situations, or small calculations.

### Batch Processing vs. Stream Processing:

![img.png](../images/1.1.1.png)

## Order of Data Processing:

- Data processing generally extracts the data from a source, transforms it into a format that is more suitable to work with, and loads it into a destination
- Microsoft has ETL and ELT tools available both on-premises, called **SQL Server Integration Services** or **SSIS**, and in the cloud, called Azure Data Factory
- Two approaches (a minimal ETL sketch follows at the end of this file):
    - **ETL**:
        - Extract, Transform, Load
        - Traditional business intelligence processes used ETL, which means extracting data from a source (usually a database), transforming it through operations such as filters, sorts, and lookups, and loading this data into a destination, generally a data warehouse.
        - Data is fully processed before it is loaded into the destination.
        - Requires high upfront work to create the data warehouse
        - Once data is processed, we have higher confidence that it is compliant, well-structured, and easily queried.

    - **ELT**:
        - Extract, Load, Transform
        - Performs the transformations after the load, on the destination system itself
        - Provides more agility for your development team to change queries on the fly, in case business needs change often
        - Fulfils the need to experiment with several different possibilities, a common occurrence in advanced analytics workloads
        - ELT only became feasible in more recent years, as storage became cheaper, and after the development of technologies such as **PolyBase**, **data lakes**, and **massively parallel processing (MPP) systems, like Azure Synapse Analytics**.

![img.png](../images/1.1.2.png)
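
A minimal ETL sketch in Python using pandas and the standard-library sqlite3 module; the file names, columns, and destination table are hypothetical:

```python
# pip install pandas
import sqlite3
import pandas as pd

# Extract: read raw data from a source file
raw = pd.read_csv("sales.csv")

# Transform: clean, filter, and aggregate before loading
sales = raw.dropna(subset=["amount"])
sales = sales[sales["amount"] > 0]
daily = sales.groupby("date", as_index=False)["amount"].sum()

# Load: write the transformed result into a destination table
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)
```

In an ELT variant, the raw rows would be loaded into the destination first, and the filtering and aggregation would run there as queries.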
--------------------------------------------------------------------------------
/4_DataWareHousing_in_Azure/4.3_data_analytics_tools.md:
--------------------------------------------------------------------------------

# Data analytics tools:

- Analytical tools to obtain valuable insights.

## Databricks

- One of the most important analytics tools in Azure.
- Advanced analytics and machine learning platform **based on Apache Spark**, which is a parallel processing engine for large-scale analytics
- Spark is designed to handle massive amounts of data by distributing the work across a cluster of computers, which considerably reduces the time needed to complete the analysis.
- Databricks uses a collaborative workspace that allows data engineers, data scientists, and business analysts to work together.
- It uses the concept of notebooks, which are a web-based mix of runnable code, visualizations, and text.
- The code runs in a series of steps called cells. Cells can support several languages, such as Python, R, Scala, Java, and SQL. (A minimal notebook-style cell is sketched at the end of this file.)
- Databricks supports stream processing and can connect to several other Azure tools, including Data Lake Storage, SQL databases, data warehouses, and Cosmos DB.
- Azure Data Factory also has Databricks activities, so you can call Databricks from an ADF pipeline.

## HDInsight
- Managed analytics service **based on Apache Hadoop**, a collection of open-source tools that can process large amounts of data through a set of clusters.
- HDInsight supports several analytics frameworks, such as Hadoop MapReduce, Apache Spark, Apache Hive, Apache Kafka, Storm, R, and more
- It stores the data using Azure Data Lake Storage, and it's easily integrated with other Azure tools and services.

## Data Modeling:

- **Azure Analysis Services:**
    - Enables you to build tabular models to support your OLAP queries.
    - The focus here is on analytics, not transactional workloads.
    - The tool provides a graphical designer that helps you define the queries by combining, filtering, and aggregating the data.
    - You can also use a Microsoft BI language called DAX for the query building
    - **Best suited for smaller databases and less computationally heavy workloads**
    - Easier development experience, and it's more easily integrated with Power BI

- **Azure Synapse Analytics**
    - Advanced analytics and machine learning engine based on Spark and notebooks, similar to Databricks.
    - Supports a wide variety of languages, such as PolyBase, C#, Python, Scala, and Spark SQL, and also several file formats, such as CSV, JSON, XML, Parquet, ORC, and Avro
    - Azure Synapse Link for Azure Cosmos DB allows for hybrid transactional/analytical processing, or HTAP. HTAP is a mix of OLTP and OLAP.
    - Azure Synapse Studio is a web interface used to manage Synapse Analytics, as well as to create, edit, and debug both SQL and Spark code
    - One of the main advantages of Azure Synapse Analytics is its massive **scalability**.
    - **Better for high volumes of data (several terabytes to a few petabytes), very complex calculations, and complex ELT operations, because the MPP engine of the Spark and SQL clusters allows for higher scalability.**

- You can use Azure Synapse Analytics for the heavy lifting of dealing with large amounts of data and calculations, and Azure Analysis Services to better serve your business users.
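
A minimal sketch of the kind of PySpark code a notebook cell might run. In Databricks or Synapse, the `spark` session is provided by the cluster; it's built locally here so the sketch is self-contained, and the sales data is made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# A small DataFrame; in practice this would be distributed across the cluster
df = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 45.0)],
    ["region", "amount"],
)

# The aggregation work is split across worker nodes and combined at the end
df.groupBy("region").agg(F.sum("amount").alias("total")).show()
```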
--------------------------------------------------------------------------------
/2_Relational_database/2.4_Relational_data_management_task.md:
--------------------------------------------------------------------------------

# Relational Data Management Tasks:

- Here we discuss tools to manage database operations such as creating, deploying, configuring, searching, permissions and security, and so on

- The most common option is the **Azure portal**, at https://portal.azure.com
    - Management via GUI
- Azure CLI
    - For automation
- Azure PowerShell
    - For automation

- ARM templates
    - ARM stands for Azure Resource Manager
    - These are JSON files that describe the settings that you want to configure for the resource, which you can later provision using either the Azure CLI or Azure PowerShell

## Database security

- All Azure databases are protected by server-level firewalls
- The firewall prevents access to database resources from other networks
- All access to your databases is blocked by the firewall by default, so remember that you need to configure the firewall rules before your client applications will be able to connect to them
- You might also need to enable outgoing traffic on your company's firewall
- Azure PostgreSQL uses port **5432**; Azure MySQL and MariaDB both use port **3306**

### Database encryption:

- Azure databases also support encryption, which guarantees that the data doesn't appear as plain text
- Encryption occurs both in transit, with SSL connections enabled automatically, and at rest, by encrypting the database itself.
- **The technology used to encrypt an Azure SQL database is a proprietary Microsoft one called Transparent Data Encryption, or TDE**

### Database threat protection:
- We also have Advanced Threat Protection (ATP), which detects unusual activities and accesses to your databases, alerting you of suspicious events.
- ATP is available for Azure SQL, Azure SQL MI, and Synapse Analytics, and is currently in preview for MariaDB, MySQL, and PostgreSQL.

### Database authentication and permissions:
- Authentication defines how we validate your identity on the database.
- For authentication, **all four databases support both native SQL authentication and Azure AD**, with the exception of Azure MariaDB, which does not yet have Azure AD integration.

- **Azure AD integration is the preferred option**, as it has several advantages over SQL authentication, such as centralized management of identities across several Microsoft and Azure resources, so that you don't need one account for each server.
- And Multi-Factor Authentication (MFA), which allows for more than one form of identification, considerably increasing security.

- Once you're authenticated, you need to make sure that you have the proper permissions to access the database resources.
- **2 permission levels available:**
    1. Permissions on the database resource itself, which allow you to configure CPU, memory, etc.
        - **This includes Azure Role-Based Access Control (RBAC)**, with a set of **built-in roles and well-defined permissions**
        - You just need to put your administrators in the proper roles, and all the relevant permissions are automatically assigned.
    2. Permissions that **refer to the data and objects inside the database**
        - For example, to be able to access a table in an Azure SQL database, you need to have at least SELECT permission on that table. Azure SQL has several built-in database roles available to simplify this management, such as **db_owner, which has full access; db_ddladmin, which can execute DDL commands; and db_datareader, which can read all the tables.**
--------------------------------------------------------------------------------
/3_Non-Relational_Data/3.2_CosmosDB.md:
--------------------------------------------------------------------------------

# Cosmos DB

- Azure Cosmos DB is Microsoft's main NoSQL database management system, storing data as JSON documents.
- It works at a Platform-as-a-Service level, so several administrative tasks are managed for you.
- Cosmos DB is **multi-model**, which means that it supports documents, key-value pairs, graph, and column-family data, depending on the API that you choose.
- It's also very fast, guaranteeing **less than 10 milliseconds latency** for both reads and writes 99% of the time. That's because the data is spread across partitions on several nodes.
- This makes Cosmos DB an excellent choice for IoT and telematics, gaming, and highly responsive mobile and web applications on a global scale
- Microsoft itself uses Cosmos DB in several of its mission-critical applications, including Skype, Xbox, Office 365, and Azure
- Capable of supporting multiple read replicas and write regions.

## APIs supported by Cosmos DB

### SQL API
- Default, native API
- Supports SQL-like commands

### Gremlin API
- Enables you to implement a graph database on Cosmos DB
- You can also query graph data as JSON documents using a SQL-like language

The next three APIs are less focused on new projects; they are recommended instead when you're migrating from Azure Tables, MongoDB, or Cassandra.

### Table API

- Allows you to migrate your key-value pairs from Azure Tables
- Customers start with Azure Tables because of its low price, simplicity, and high throughput

### MongoDB API
- Allows you to migrate from MongoDB
- MongoDB is another long-established document database. Customers may decide to migrate to Cosmos, either because it fits their IT strategy or to leverage the PaaS-level capabilities of the product, including automated backups and indexing.

### Cassandra API
- Recommended for migrations from Apache Cassandra, which is another famous on-premises column-family database management system

## How Cosmos DB manages the data

![img.png](../images/3.2.1.png)

- The top level in Cosmos DB is a Cosmos DB account.
- We can have 50 accounts under an Azure subscription
- Next, we can have one or more databases under each of those accounts
- Under each database, we have containers.
- Containers are the units of scalability for both throughput and storage, which means that it's at the container level that you configure Cosmos DB performance
- Depending on which API you have configured for Cosmos DB, a container will mean different things
    - For the Gremlin API, the container resource type would be a graph
- Containers host not only data items but also other database elements, such as triggers, stored procedures, and functions
- Items are the ones holding the data.
    - An item can even hold small binary files, up to two megabytes in size. If you need more than that, however, you can always create a reference to an external Azure Blob Storage
    - The data type of the item will depend on which API you have configured for Cosmos
    - If you have configured the Gremlin API, for example, the items will be nodes and edges
- The sketch below walks this account -> database -> container -> item hierarchy in code.
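
A minimal sketch using the azure-cosmos Python SDK (SQL API); the endpoint, key, and all names are hypothetical placeholders:

```python
# pip install azure-cosmos
from azure.cosmos import CosmosClient, PartitionKey

# Account level: connect with the account endpoint and key
client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<key>")

# Database and container levels; throughput (RUs) is set on the container here
db = client.create_database_if_not_exists("appdb")
container = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    offer_throughput=400,  # the 400 RU/s minimum mentioned below
)

# Item level: a JSON document; "id" is required, and the partition
# key value decides which partition stores the item
container.upsert_item({
    "id": "order-001",
    "customerId": "cust-42",
    "total": 99.90,
})
```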
## Cosmos DB Management Tasks

- **Provisioning**
    - Creation of the resource on Azure
    - Azure Portal, Azure CLI, Azure PowerShell, ARM templates
    - When provisioning a Cosmos DB resource, you need to define the amount of resources allocated to it in **Request Units per second (RU/s)**. RUs are also the billing unit of Cosmos DB
    - The minimum throughput is 400 RUs per second
    - You can configure RUs at both the database and the container levels

- **Replication**
    - As Cosmos DB is a PaaS-level solution, both replication and failover (in case of a failure) happen automatically within a single region, giving Cosmos DB a guaranteed high availability of 99.99%.
    - You can always configure multi-master replication, which means that every node can write

- If you want to perform data migrations, use Data Explorer or the Cosmos DB Data Migration Tool
    - Data Explorer is available in the Azure portal under your Cosmos DB resource.
    - The Data Migration Tool is a downloadable tool available on GitHub.

--------------------------------------------------------------------------------
/4_DataWareHousing_in_Azure/4.2_data_ingestion_components.md:
--------------------------------------------------------------------------------

# Data ingestion components:

## Azure Data Factory

- Azure Data Factory is a managed data ingestion, transformation, and orchestration service in the cloud for data engineers, perfect for data integration workflows.
- It can ingest large amounts of raw data from several sources, both on-premises and in the cloud.
- It has connectors for most Azure services, and even services from cloud competitors such as Google and AWS; dozens of relational and non-relational databases and data warehouses, such as SQL Server, Oracle, Cassandra, MongoDB and SAP HANA; and several SaaS applications, such as Dynamics, Jira and Salesforce
- Supports several file formats, including JSON, XML, Parquet, Avro and ORC.
- Can also clean, transform and restructure data, as well as filter out data that might be corrupt or duplicated, supporting therefore both ETL and ELT processes.
- These transformation tasks can be done by Azure Data Factory itself through a feature called **mapping data flows**. However, these transformations can also be done by other Azure services, such as **Databricks and HDInsight.**

### Main components of Data Factory

A conceptual sketch of how these components relate follows this list.

- **Pipeline**
    - Logical group of activities that performs a unit of work
    - It's the actual work that we do in Data Factory
    - Can be created via the GUI or through code
    - **Activities in a pipeline:**
        - **Data movement activities** move data between a source and a destination, which is also called a **sink**. For example, a copy from SQL Server to Cosmos DB.
        - **Data transformation activities** perform some change on the data.
            - This data change can be executed by Data Factory itself, which is called a mapping data flow, or by calling an external compute resource, such as Databricks or Hive
        - **Control flow activities**, which are a way to implement coding logic in your pipeline, such as assigning a variable or executing a loop

- **Integration runtimes**
    - These are the compute infrastructure of Data Factory, which is needed to execute your activities.
- **Linked services**
    - A linked service provides the information that is needed to connect to a source, a destination or a compute resource.
    - They basically tell Data Factory where to find your external data or service.

- **Datasets**
    - Representation of the data that you're working with.
    - While linked services tell you where to find the data, datasets tell you its details, structure, and format, such as JSON or XML.

- **Triggers**:
    - The Data Factory component that initiates the execution of a pipeline
    - Example: event-based triggers
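
A conceptual Python sketch only, to show how these components relate; this is not the actual Data Factory SDK, and every name here is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class LinkedService:      # WHERE the data lives (connection info)
    name: str
    connection_string: str

@dataclass
class Dataset:            # WHAT the data looks like (structure/format)
    name: str
    linked_service: LinkedService
    fmt: str              # e.g. "JSON", "Parquet"

@dataclass
class CopyActivity:       # data movement: source -> sink
    source: Dataset
    sink: Dataset

@dataclass
class Pipeline:           # logical group of activities
    name: str
    activities: list = field(default_factory=list)

# A trigger would initiate the pipeline run, and an integration runtime
# would supply the compute on which the activities execute.
src = Dataset("sales_raw", LinkedService("sqlsrv", "<conn>"), "Table")
dst = Dataset("sales_docs", LinkedService("cosmos", "<conn>"), "JSON")
pipe = Pipeline("copy_sales", [CopyActivity(source=src, sink=dst)])
```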
## SSIS, or SQL Server Integration Services

- SSIS is the on-premises counterpart of Data Factory, and it's part of SQL Server.
- You can add Azure support to SSIS by installing the Azure Feature Pack for SSIS.
- You can also run your existing SSIS packages on Azure Data Factory, which is useful for migration scenarios
- Scalability is limited by the performance of the server where SSIS is installed.

## PolyBase

- A feature of both SQL Server and Azure Synapse Analytics that enables you to run Transact-SQL commands on external data sources, such as Azure Data Lake, Blob Storage, Hadoop or Spark, just as if they were SQL tables.

## Data lakes

- Repository for large amounts of raw data.
- Semi-structured or unstructured.
- Used as a staging layer for your ingested data before this data is structured and loaded into a final destination, which is generally a data warehouse solution.
- **2 main services:**
    - **Azure Data Lake Storage**
        - Azure Data Lake Storage provides a file repository that can store near-unlimited amounts of data.
        - Compatible with the Hadoop Distributed File System (HDFS), and can be accessed directly by Azure Data Factory, Databricks, HDInsight, Data Lake Analytics and Stream Analytics
        - Ideally, you should place your data lake in the same Azure data center as your analytics tools; otherwise, you incur bandwidth costs as the data traverses regions

    - **Azure Data Lake Analytics**
        - Azure Data Lake Analytics is an on-demand analytics job service that you can use to process big data.
        - Has a set of tools that allow you to create jobs that can transform data and extract insights.
        - You write those jobs using U-SQL, which is a hybrid programming language that mixes SQL and C#.
--------------------------------------------------------------------------------