├── Coronavirus.pbix ├── neo4j_recommender ├── users_minimal_test.dat ├── ratings_minimal_test.dat └── movies_minimal_test.dat ├── Arxiv_graph ├── arxiv_df.pkl ├── arxiv_math_20220301.xlsx ├── graph-dbms-neo4j-Mar-18-2022-17-13-25.dump ├── README.md ├── display_graph.html └── Scrap Arxiv and insert to Neo4j v1.ipynb ├── online_retail ├── GDP.xlsx ├── customer_segments_RFM_country.pickle ├── customer_segments_buying_categories.pickle └── Online_retail_Combine_Segmentations.ipynb ├── topological_data_analysis └── README.md ├── clustering_example └── Outlier_example.R ├── README.md └── Markov_text_generation.ipynb /Coronavirus.pbix: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dpanagop/data_analytics_examples/master/Coronavirus.pbix -------------------------------------------------------------------------------- /neo4j_recommender/users_minimal_test.dat: -------------------------------------------------------------------------------- 1 | 1::F::1::10::48067 2 | 2::M::56::16::70072 3 | 3::M::25::15::55117 -------------------------------------------------------------------------------- /Arxiv_graph/arxiv_df.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dpanagop/data_analytics_examples/master/Arxiv_graph/arxiv_df.pkl -------------------------------------------------------------------------------- /online_retail/GDP.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dpanagop/data_analytics_examples/master/online_retail/GDP.xlsx -------------------------------------------------------------------------------- /Arxiv_graph/arxiv_math_20220301.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dpanagop/data_analytics_examples/master/Arxiv_graph/arxiv_math_20220301.xlsx 
-------------------------------------------------------------------------------- /topological_data_analysis/README.md: -------------------------------------------------------------------------------- 1 | # Topological Data Analysis 2 | 3 | This folder contains code with examples of topological data analysis. 4 | -------------------------------------------------------------------------------- /online_retail/customer_segments_RFM_country.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dpanagop/data_analytics_examples/master/online_retail/customer_segments_RFM_country.pickle -------------------------------------------------------------------------------- /Arxiv_graph/graph-dbms-neo4j-Mar-18-2022-17-13-25.dump: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dpanagop/data_analytics_examples/master/Arxiv_graph/graph-dbms-neo4j-Mar-18-2022-17-13-25.dump -------------------------------------------------------------------------------- /online_retail/customer_segments_buying_categories.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dpanagop/data_analytics_examples/master/online_retail/customer_segments_buying_categories.pickle -------------------------------------------------------------------------------- /neo4j_recommender/ratings_minimal_test.dat: -------------------------------------------------------------------------------- 1 | 1::1::5::978300760 2 | 1::2::5::978302109 3 | 1::4::3::978301968 4 | 2::1::5::978300760 5 | 2::2::5::978302109 6 | 3::1::4::978302109 7 | 3::10::5::978302109 -------------------------------------------------------------------------------- /neo4j_recommender/movies_minimal_test.dat: -------------------------------------------------------------------------------- 1 | 1::Toy Story (1995)::Animation|Children's|Comedy 2 | 2::Jumanji 
(1995)::Adventure|Children's|Fantasy 3 | 3::Grumpier Old Men (1995)::Comedy|Romance 4 | 4::Waiting to Exhale (1995)::Comedy|Drama 5 | 5::Father of the Bride Part II (1995)::Comedy 6 | 6::Heat (1995)::Action|Crime|Thriller 7 | 7::Sabrina (1995)::Comedy|Romance 8 | 8::Tom and Huck (1995)::Adventure|Children's 9 | 9::Sudden Death (1995)::Action 10 | 10::GoldenEye (1995)::Action|Adventure|Thriller -------------------------------------------------------------------------------- /clustering_example/Outlier_example.R: -------------------------------------------------------------------------------- 1 | ### Preamble #### 2 | # Loading libraries 3 | library(ggplot2) 4 | library(cowplot) 5 | 6 | # Setting random generator seed 7 | set.seed(42) 8 | 9 | ### No Outlier #### 10 | # First cluster - cluster 0 11 | x1 = rnorm(20,0,0.5) 12 | y1 = rnorm(20,0,0.5) 13 | 14 | # Second cluster - cluster 1 15 | x2 =rnorm(20,2,0.5) 16 | y2 =rnorm(20,2,0.5) 17 | 18 | # Dataset creation (one label per point; each cluster has 20 points) 19 | d1=data.frame(x=x1,y=y1,c=rep(0,20)) 20 | d2=data.frame(x=x2,y=y2,c=rep(1,20)) 21 | d=rbind(d1,d2) 22 | 23 | # k-means 24 | d_clst=kmeans(d[,-3],2) 25 | 26 | g1 = ggplot(d)+ 27 | aes(x=x,y=y,color=c,shape=as.factor(d_clst$cluster))+geom_point()+ 28 | theme(legend.position="none")+ scale_color_gradient(low="blue", high="red") 29 | g1 30 | 31 | ### Outlier #### 32 | 33 | # Data 34 | out=data.frame(x=c(4),y=c(4),c=c(1)) 35 | d_out=rbind(d,out) 36 | #k-means 37 | d_out_clst=kmeans(d_out[,-3],2) 38 | # Plot 39 | g2 = ggplot(d_out)+ 40 | aes(x=x,y=y,color=c,shape=as.factor(d_out_clst$cluster))+geom_point()+ 41 | theme(legend.position="none")+ scale_color_gradient(low="blue", high="red") 42 | g2 43 | 44 | # Combining two graphs 45 | plot_grid(g1,g2, labels=c("No outlier", "Outlier"), ncol = 2, nrow = 1) 46 | -------------------------------------------------------------------------------- /Arxiv_graph/README.md: -------------------------------------------------------------------------------- 1 | # arXiv metadata
analysis with neo4j 2 | 3 | This folder contains Python code for: 4 | - extracting (scraping) metadata from arXiv, 5 | - importing them into a neo4j graph database, 6 | - performing analysis on them. 7 | 8 | In particular, 9 | * [Scrap Arxiv and insert to Neo4j v1.ipynb](https://github.com/dpanagop/data_analytics_examples/blob/master/Arxiv_graph/Scrap%20Arxiv%20and%20insert%20to%20Neo4j%20v1.ipynb) is a Jupyter notebook that extracts (scrapes) data from arXiv and imports them into a neo4j graph database. The extracted data are also stored in an Excel file [arxiv_math_20220301.xlsx](https://github.com/dpanagop/data_analytics_examples/blob/master/Arxiv_graph/arxiv_math_20220301.xlsx) and in a pickle file [arxiv_df.pkl](https://github.com/dpanagop/data_analytics_examples/blob/master/Arxiv_graph/arxiv_df.pkl). 10 | * [Embedding v2.ipynb](https://github.com/dpanagop/data_analytics_examples/blob/master/Arxiv_graph/Embedding%20v2.ipynb) does the analysis. The main highlights are creating a node embedding with node2vec and using the result with k-means and UMAP. In [Embedding v2-Stress.ipynb](https://github.com/dpanagop/data_analytics_examples/blob/master/Arxiv_graph/Embedding%20v2-Stress.ipynb) you can see the results of the same analysis for a dataset of around 10k articles. 11 | * [display_graph.html](https://github.com/dpanagop/data_analytics_examples/blob/master/Arxiv_graph/display_graph.html) uses neovis.js to produce a graph that depicts the relationship between different Mathematics subject classifications. 12 | * [graph-dbms-neo4j-Mar-18-2022-17-13-25.dump](graph-dbms-neo4j-Mar-18-2022-17-13-25.dump) is a dump of the neo4j database. 13 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # data_analytics_examples 2 | 3 | This repository contains code that is used for various data analysis/data science projects.
In most cases, the code is related to a Medium post. 4 | 5 | - [online_retail](https://github.com/dpanagop/data_analytics_examples/tree/master/online_retail) has code related to 6 | an example of segmentation of wholesale customers. It is split into two parts. The [first part](https://towardsdatascience.com/customer-segmentation-part-i-2c5e2145e719) uses natural language processing 7 | to cluster clients based on their transactions. The [second part](https://towardsdatascience.com/customer-segmentation-part-ii-1c94bdc03de5) creates an RFM segmentation which is combined with the results of the first part. 8 | 9 | - [Coronavirus.pbix](https://github.com/dpanagop/data_analytics_examples/blob/master/Coronavirus.pbix) is a Power BI dashboard about COVID-19. You can read more in the related 10 | [Medium post](https://dpanagop-53386.medium.com/covid-19-dashboard-with-power-bi-78caf8d16856?source=your_stories_page-------------------------------------). 11 | 12 | - [Markov_text_generation.ipynb](https://github.com/dpanagop/data_analytics_examples/blob/master/Markov_text_generation.ipynb) is a Jupyter notebook that uses Markov chains to produce text. You can read more in 13 | the related [Medium article](https://towardsdatascience.com/using-a-transition-matrix-to-generate-text-in-python-c5e78495b09b?source=your_stories_page-------------------------------------).
14 | 15 | - [UMAP](https://github.com/dpanagop/data_analytics_examples/blob/master/UMAP.ipynb) is an example of Uniform Manifold Approximation and Projection (UMAP) (for details, see the [UMAP](https://umap-learn.readthedocs.io/en/latest/) documentation). 16 | 17 | - [Arxiv_graph](https://github.com/dpanagop/data_analytics_examples/tree/master/Arxiv_graph) contains an arXiv metadata analysis with neo4j. 18 | 19 | - [neo4j_recommender](https://github.com/dpanagop/data_analytics_examples/tree/master/neo4j_recommender) covers the creation of movie recommenders with Neo4j. 20 | -------------------------------------------------------------------------------- /Arxiv_graph/display_graph.html: -------------------------------------------------------------------------------- 1 | 2 |
3 |