├── .gitignore ├── LICENSE.txt ├── PyDataLens.egg-info ├── PKG-INFO ├── SOURCES.txt ├── dependency_links.txt ├── requires.txt └── top_level.txt ├── README.md ├── build └── lib │ ├── pydatalens │ ├── __init__.py │ ├── cleaning.py │ ├── eda.py │ ├── utils.py │ └── visualizations.py │ └── tests │ ├── __init__.py │ ├── test_cleaning.py │ ├── test_eda.py │ └── test_visualizations.py ├── dist ├── pydatalens-0.0.8-py3-none-any.whl └── pydatalens-0.0.8.tar.gz ├── docs ├── INSTALL.md ├── README.md └── USAGE.md ├── examples └── example_usage.py ├── pydatalens ├── Templates │ └── report_template.html ├── __init__.py ├── cleaning.py ├── eda.py ├── utils.py └── visualizations.py ├── requirements.txt ├── setup.py └── tests ├── __init__.py ├── test_cleaning.py ├── test_eda.py └── test_visualizations.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | __pycache__/ 3 | *.log 4 | *.csv 5 | *.xlsx 6 | .env 7 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Gopalakrishnan Arjunan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /PyDataLens.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.2 2 | Name: pydatalens 3 | Version: 0.0.8 4 | Summary: A Python package for automatic EDA, data cleaning, and visualization. 5 | Home-page: https://github.com/gopalakrishnanarjun/pydatalens 6 | Author: Gopalakrishnan Arjunan 7 | Author-email: gopalakrishnana02@gmail.com 8 | Classifier: Programming Language :: Python :: 3 9 | Classifier: License :: OSI Approved :: MIT License 10 | Classifier: Operating System :: OS Independent 11 | Requires-Python: >=3.6 12 | Description-Content-Type: text/markdown 13 | License-File: LICENSE.txt 14 | Requires-Dist: pandas 15 | Requires-Dist: numpy 16 | Requires-Dist: matplotlib 17 | Requires-Dist: seaborn 18 | Dynamic: author 19 | Dynamic: author-email 20 | Dynamic: classifier 21 | Dynamic: description 22 | Dynamic: description-content-type 23 | Dynamic: home-page 24 | Dynamic: requires-dist 25 | Dynamic: requires-python 26 | Dynamic: summary 27 | 28 | 29 | # pydatalens 30 | 31 | pydatalens is a Python package designed to streamline the process of **Exploratory Data Analysis (EDA)**, **data cleaning**, and **visualization**. 32 | It enables data scientists and analysts to quickly prepare, explore, and gain insights from datasets with minimal effort. 33 | 34 | --- 35 | 36 | ## Features 37 | 38 | ### 1. 
**Smart Summarization** 39 | - Automatically generates a summary of the dataset, including: 40 | - Data types 41 | - Missing values 42 | - Descriptive statistics 43 | - Unique value counts 44 | 45 | ### 2. **Data Cleaning** 46 | - Detects and handles missing values using various strategies (mean, median, mode). 47 | - Identifies and removes duplicate rows. 48 | - Supports basic outlier detection (planned for future updates). 49 | 50 | ### 3. **Correlation Analysis** 51 | - Generates a correlation matrix to identify relationships between features. 52 | - Provides heatmaps for better visualization. 53 | 54 | ### 4. **Automatic Visualizations** 55 | - Supports generating: 56 | - Histograms 57 | - Box plots 58 | - Correlation heatmaps 59 | - Scatter plots (planned for future updates). 60 | 61 | ### 5. **Report Generation** 62 | - Exports EDA results and visualizations into a detailed **HTML report** for easy sharing. 63 | 64 | --- 65 | 66 | ## Installation 67 | 68 | ### Using pip (from source) 69 | 1. Clone the repository: 70 | ```bash 71 | git clone https://github.com/gopalakrishnanarjun/pydatalens.git 72 | cd pydatalens 73 | ``` 74 | 2. Install the package: 75 | ```bash 76 | pip install -e . 77 | ``` 78 | 79 | ### Dependencies 80 | - Python >= 3.6 81 | - pandas >= 1.0 82 | - numpy >= 1.18 83 | - matplotlib >= 3.1 84 | - seaborn >= 0.11 85 | 86 | Install dependencies manually: 87 | ```bash 88 | pip install pandas numpy matplotlib seaborn 89 | ``` 90 | 91 | --- 92 | 93 | ## Quick Start 94 | 95 | ### 1. Import the package 96 | ```python 97 | from pydatalens import eda, cleaning, visualizations 98 | ``` 99 | 100 | ### 2. Load a dataset 101 | ```python 102 | import pandas as pd 103 | df = pd.read_csv("your_dataset.csv") 104 | ``` 105 | 106 | ### 3. Summarize the dataset 107 | ```python 108 | print(eda.summarize(df)) 109 | ``` 110 | 111 | ### 4. 
Handle missing values 112 | ```python 113 | df_cleaned = cleaning.handle_missing(df, strategy="mean") 114 | ``` 115 | 116 | ### 5. Visualize the data 117 | ```python 118 | visualizations.plot_histogram(df_cleaned, column="age") 119 | visualizations.correlation_heatmap(df_cleaned) 120 | ``` 121 | 122 | --- 123 | 124 | ## Examples 125 | 126 | ### Summarizing the Data 127 | ```python 128 | from pydatalens import eda 129 | summary = eda.summarize(df) 130 | print(summary) 131 | ``` 132 | 133 | ### Cleaning the Data 134 | ```python 135 | from pydatalens import cleaning 136 | df = cleaning.handle_missing(df, strategy="median") 137 | df = cleaning.drop_duplicates(df) 138 | ``` 139 | 140 | ### Visualizing the Data 141 | ```python 142 | from pydatalens import visualizations 143 | visualizations.plot_histogram(df, "column_name") 144 | visualizations.correlation_heatmap(df) 145 | ``` 146 | 147 | --- 148 | 149 | ## Future Enhancements 150 | - Advanced anomaly detection. 151 | - Support for time series analysis. 152 | - Enhanced visualization options (e.g., scatter plots, pair plots). 153 | - Integration with machine learning pipelines. 154 | 155 | --- 156 | 157 | ## Contributing 158 | Contributions are welcome! If you'd like to contribute, please fork the repository and submit a pull request. 159 | 160 | --- 161 | 162 | ## License 163 | pydatalens is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details. 
164 | 165 | --- 166 | -------------------------------------------------------------------------------- /PyDataLens.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | LICENSE.txt 2 | README.md 3 | setup.py 4 | PyDataLens.egg-info/PKG-INFO 5 | PyDataLens.egg-info/SOURCES.txt 6 | PyDataLens.egg-info/dependency_links.txt 7 | PyDataLens.egg-info/requires.txt 8 | PyDataLens.egg-info/top_level.txt 9 | pydatalens/__init__.py 10 | pydatalens/cleaning.py 11 | pydatalens/eda.py 12 | pydatalens/utils.py 13 | pydatalens/visualizations.py 14 | pydatalens.egg-info/PKG-INFO 15 | pydatalens.egg-info/SOURCES.txt 16 | pydatalens.egg-info/dependency_links.txt 17 | pydatalens.egg-info/requires.txt 18 | pydatalens.egg-info/top_level.txt 19 | tests/__init__.py 20 | tests/test_cleaning.py 21 | tests/test_eda.py 22 | tests/test_visualizations.py -------------------------------------------------------------------------------- /PyDataLens.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /PyDataLens.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | numpy 3 | matplotlib 4 | seaborn 5 | -------------------------------------------------------------------------------- /PyDataLens.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | pydatalens 2 | tests 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # pydatalens 3 | 4 | pydatalens is a Python package designed to streamline the process of **Exploratory Data Analysis (EDA)**, **data cleaning**, and **visualization**. 
5 | It enables data scientists and analysts to quickly prepare, explore, and gain insights from datasets with minimal effort. 6 | 7 | --- 8 | 9 | ## Features 10 | 11 | ### 1. **Smart Summarization** 12 | - Automatically generates a summary of the dataset, including: 13 | - Data types 14 | - Missing values 15 | - Descriptive statistics 16 | - Unique value counts 17 | 18 | ### 2. **Data Cleaning** 19 | - Detects and handles missing values using various strategies (mean, median, mode). 20 | - Identifies and removes duplicate rows. 21 | - Supports basic outlier detection (planned for future updates). 22 | 23 | ### 3. **Correlation Analysis** 24 | - Generates a correlation matrix to identify relationships between features. 25 | - Provides heatmaps for better visualization. 26 | 27 | ### 4. **Automatic Visualizations** 28 | - Supports generating: 29 | - Histograms 30 | - Box plots 31 | - Correlation heatmaps 32 | - Scatter plots (planned for future updates). 33 | 34 | ### 5. **Report Generation** 35 | - Exports EDA results and visualizations into a detailed **HTML report** for easy sharing. 36 | 37 | --- 38 | 39 | ## Installation 40 | 41 | ### Using pip (from source) 42 | 1. Clone the repository: 43 | ```bash 44 | git clone https://github.com/gopalakrishnanarjun/pydatalens.git 45 | cd pydatalens 46 | ``` 47 | 2. Install the package: 48 | ```bash 49 | pip install -e . 50 | ``` 51 | 52 | ### Dependencies 53 | - Python >= 3.6 54 | - pandas >= 1.0 55 | - numpy >= 1.18 56 | - matplotlib >= 3.1 57 | - seaborn >= 0.11 58 | 59 | Install dependencies manually: 60 | ```bash 61 | pip install pandas numpy matplotlib seaborn 62 | ``` 63 | 64 | --- 65 | 66 | ## Quick Start 67 | 68 | ### 1. Import the package 69 | ```python 70 | from pydatalens import eda, cleaning, visualizations 71 | ``` 72 | 73 | ### 2. Load a dataset 74 | ```python 75 | import pandas as pd 76 | df = pd.read_csv("your_dataset.csv") 77 | ``` 78 | 79 | ### 3. 
Summarize the dataset 80 | ```python 81 | print(eda.summarize(df)) 82 | ``` 83 | 84 | ### 4. Handle missing values 85 | ```python 86 | df_cleaned = cleaning.handle_missing(df, strategy="mean") 87 | ``` 88 | 89 | ### 5. Visualize the data 90 | ```python 91 | visualizations.plot_histogram(df_cleaned, column="age") 92 | visualizations.correlation_heatmap(df_cleaned) 93 | ``` 94 | 95 | --- 96 | 97 | ## Examples 98 | 99 | ### Summarizing the Data 100 | ```python 101 | from pydatalens import eda 102 | summary = eda.summarize(df) 103 | print(summary) 104 | ``` 105 | 106 | ### Cleaning the Data 107 | ```python 108 | from pydatalens import cleaning 109 | df = cleaning.handle_missing(df, strategy="median") 110 | df = cleaning.drop_duplicates(df) 111 | ``` 112 | 113 | ### Visualizing the Data 114 | ```python 115 | from pydatalens import visualizations 116 | visualizations.plot_histogram(df, "column_name") 117 | visualizations.correlation_heatmap(df) 118 | ``` 119 | 120 | --- 121 | 122 | ## Future Enhancements 123 | - Advanced anomaly detection. 124 | - Support for time series analysis. 125 | - Enhanced visualization options (e.g., scatter plots, pair plots). 126 | - Integration with machine learning pipelines. 127 | 128 | --- 129 | 130 | ## Contributing 131 | Contributions are welcome! If you'd like to contribute, please fork the repository and submit a pull request. 132 | 133 | --- 134 | 135 | ## License 136 | pydatalens is licensed under the MIT License. See the [LICENSE](LICENSE.txt) file for more details. 137 | 138 | --- 139 | -------------------------------------------------------------------------------- /build/lib/pydatalens/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | PyDataLens: A Python package for automatic EDA, data cleaning, and visualization. 
3 | """ 4 | 5 | from .eda import summarize, correlation 6 | from .cleaning import handle_missing, drop_duplicates 7 | from .visualizations import plot_histogram, correlation_heatmap 8 | -------------------------------------------------------------------------------- /build/lib/pydatalens/cleaning.py: -------------------------------------------------------------------------------- 1 | def handle_missing(df, strategy="mean"): 2 | """ 3 | Fills missing values in the DataFrame. 4 | Args: 5 | strategy: mean, median, or mode. 6 | """ 7 | print(f"Handling missing values using {strategy} strategy...") 8 | for column in df.select_dtypes(include=["float", "int"]).columns: 9 | if strategy == "mean": 10 | df[column] = df[column].fillna(df[column].mean()) 11 | elif strategy == "median": 12 | df[column] = df[column].fillna(df[column].median()) 13 | elif strategy == "mode": 14 | df[column] = df[column].fillna(df[column].mode()[0]) 15 | return df 16 | 17 | def drop_duplicates(df): 18 | """ 19 | Drops duplicate rows. 20 | """ 21 | print("Dropping duplicate rows...") 22 | return df.drop_duplicates() 23 | -------------------------------------------------------------------------------- /build/lib/pydatalens/eda.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def summarize(df): 4 | """ 5 | Summarizes the given DataFrame. 6 | """ 7 | print("Generating data summary...") 8 | summary = { 9 | "Columns": df.columns.tolist(), 10 | "Data Types": df.dtypes.tolist(), 11 | "Missing Values": df.isnull().sum().tolist(), 12 | "Unique Values": df.nunique().tolist(), 13 | } 14 | return pd.DataFrame(summary) 15 | 16 | def correlation(df): 17 | """ 18 | Generates a correlation matrix. 
19 | """ 20 | print("Calculating correlation matrix...") 21 | return df.corr() 22 | -------------------------------------------------------------------------------- /build/lib/pydatalens/utils.py: -------------------------------------------------------------------------------- 1 | def save_plot(filename): 2 | """ 3 | Utility function to save a plot to a file. 4 | """ 5 | import matplotlib.pyplot as plt 6 | plt.savefig(filename) 7 | print(f"Plot saved as {filename}.") 8 | -------------------------------------------------------------------------------- /build/lib/pydatalens/visualizations.py: -------------------------------------------------------------------------------- 1 | import seaborn as sns 2 | import matplotlib.pyplot as plt 3 | 4 | def plot_histogram(df, column): 5 | """ 6 | Plots a histogram for a column. 7 | """ 8 | print(f"Generating histogram for {column}...") 9 | sns.histplot(df[column], kde=True) 10 | plt.title(f"Histogram of {column}") 11 | plt.show() 12 | 13 | def correlation_heatmap(df): 14 | """ 15 | Plots a heatmap of the correlation matrix. 16 | """ 17 | print("Generating correlation heatmap...") 18 | sns.heatmap(df.corr(), annot=True, cmap="coolwarm") 19 | plt.title("Correlation Heatmap") 20 | plt.show() 21 | -------------------------------------------------------------------------------- /build/lib/tests/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | PyDataLens: A Python package for automatic EDA, data cleaning, and visualization. 
3 | """ 4 | 5 | from .eda import summarize, correlation 6 | from .cleaning import handle_missing, drop_duplicates 7 | from .visualizations import plot_histogram, correlation_heatmap 8 | -------------------------------------------------------------------------------- /build/lib/tests/test_cleaning.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from pydatalens import cleaning 3 | 4 | def test_handle_missing(): 5 | data = {"A": [1, None, 3]} 6 | df = pd.DataFrame(data) 7 | cleaned_df = cleaning.handle_missing(df, strategy="mean") 8 | assert cleaned_df.isnull().sum().sum() == 0 9 | print("Handle missing test passed.") 10 | 11 | def test_drop_duplicates(): 12 | data = {"A": [1, 1, 2]} 13 | df = pd.DataFrame(data) 14 | cleaned_df = cleaning.drop_duplicates(df) 15 | assert cleaned_df.shape[0] == 2 16 | print("Drop duplicates test passed.") 17 | -------------------------------------------------------------------------------- /build/lib/tests/test_eda.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from pydatalens import eda 3 | 4 | def test_summarize(): 5 | data = {"A": [1, 2, None], "B": [4, None, 6]} 6 | df = pd.DataFrame(data) 7 | summary = eda.summarize(df) 8 | assert "Columns" in summary.columns 9 | print("Summarize test passed.") 10 | 11 | def test_correlation(): 12 | data = {"A": [1, 2, 3], "B": [4, 5, 6]} 13 | df = pd.DataFrame(data) 14 | corr = eda.correlation(df) 15 | assert corr.shape[0] == corr.shape[1] 16 | print("Correlation test passed.") 17 | -------------------------------------------------------------------------------- /build/lib/tests/test_visualizations.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from pydatalens import visualizations 3 | 4 | def test_plot_histogram(): 5 | data = {"A": [1, 2, 3, 4, 5]} 6 | df = pd.DataFrame(data) 7 | 
visualizations.plot_histogram(df, column="A") 8 | print("Histogram test passed.") 9 | 10 | def test_correlation_heatmap(): 11 | data = {"A": [1, 2, 3], "B": [4, 5, 6]} 12 | df = pd.DataFrame(data) 13 | visualizations.correlation_heatmap(df) 14 | print("Correlation heatmap test passed.") 15 | -------------------------------------------------------------------------------- /dist/pydatalens-0.0.8-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gopalakrishnanarjun/pydatalens/2aa676bd5ab5c0708de1b996c7d9b24e785f7330/dist/pydatalens-0.0.8-py3-none-any.whl -------------------------------------------------------------------------------- /dist/pydatalens-0.0.8.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gopalakrishnanarjun/pydatalens/2aa676bd5ab5c0708de1b996c7d9b24e785f7330/dist/pydatalens-0.0.8.tar.gz -------------------------------------------------------------------------------- /docs/INSTALL.md: -------------------------------------------------------------------------------- 1 | 2 | # **pydatalens Installation Guide** 3 | 4 | This guide explains how to install and set up the **pydatalens** package for automatic exploratory data analysis (EDA), data cleaning, and visualization. 5 | 6 | --- 7 | 8 | ## **Prerequisites** 9 | 10 | ### 1. **Python Version** 11 | - Ensure you have **Python 3.6 or later** installed on your system. 12 | - You can check your Python version by running: 13 | ```bash 14 | python --version 15 | ``` 16 | 17 | ### 2. 
**Pip** 18 | - Make sure you have `pip` (Python's package manager) installed and updated: 19 | ```bash 20 | pip install --upgrade pip 21 | ``` 22 | 23 | --- 24 | 25 | ## **Installation Steps** 26 | 27 | ### **Step 1: Clone the Repository** 28 | Download the pydatalens repository from GitHub: 29 | ```bash 30 | git clone https://github.com/gopalakrishnanarjun/pydatalens.git 31 | cd pydatalens 32 | ``` 33 | 34 | ### **Step 2: Create a Virtual Environment (Optional but Recommended)** 35 | Set up a virtual environment to isolate your Python dependencies: 36 | ```bash 37 | python -m venv pydatalens_env 38 | source pydatalens_env/bin/activate # On Windows: pydatalens_env\Scripts\activate 39 | ``` 40 | 41 | ### **Step 3: Install the Package** 42 | Run the following command to install the pydatalens package and its dependencies: 43 | ```bash 44 | pip install -e . 45 | ``` 46 | 47 | --- 48 | 49 | ## **Dependencies** 50 | 51 | pydatalens requires the following Python packages: 52 | 53 | - **pandas**: For data manipulation 54 | - **numpy**: For numerical operations 55 | - **matplotlib**: For data visualizations 56 | - **seaborn**: For advanced visualizations 57 | 58 | All required dependencies will be installed automatically during the installation process. If you encounter issues, you can manually install them: 59 | ```bash 60 | pip install pandas numpy matplotlib seaborn 61 | ``` 62 | 63 | --- 64 | 65 | ## **Testing the Installation** 66 | 67 | After installation, you can verify the setup by running the following commands: 68 | 69 | ### **1. Check the Installation** 70 | Run this Python command to check if the package is installed: 71 | ```bash 72 | python -c "import pydatalens; print('pydatalens is installed successfully!')" 73 | ``` 74 | 75 | ### **2. 
Run Tests** 76 | Navigate to the `tests` folder and run the test scripts: 77 | ```bash 78 | cd tests 79 | python test_eda.py 80 | python test_cleaning.py 81 | python test_visualizations.py 82 | ``` 83 | 84 | --- 85 | 86 | ## **Examples** 87 | 88 | Once installed, you can try the package using the example script provided: 89 | 90 | ### **Run the Example** 91 | Navigate to the `examples` folder and execute the script: 92 | ```bash 93 | cd examples 94 | python example_usage.py 95 | ``` 96 | 97 | --- 98 | 99 | ## **Common Installation Issues** 100 | 101 | ### **1. Missing Dependencies** 102 | If some dependencies fail to install, try installing them manually: 103 | ```bash 104 | pip install -r requirements.txt 105 | ``` 106 | 107 | ### **2. Permission Issues** 108 | If you encounter permission-related errors, try: 109 | ```bash 110 | pip install --user -e . 111 | ``` 112 | 113 | --- 114 | 115 | ## **Uninstalling pydatalens** 116 | 117 | To remove the package from your system: 118 | ```bash 119 | pip uninstall pydatalens 120 | ``` 121 | 122 | --- 123 | 124 | ## **Support** 125 | 126 | If you encounter any issues or need help, please open an issue in the GitHub repository: 127 | [pydatalens GitHub Issues](https://github.com/gopalakrishnanarjun/pydatalens/issues) 128 | 129 | Enjoy using **pydatalens** for your data analysis needs! 🚀 130 | -------------------------------------------------------------------------------- /docs/README.md: -------------------------------------------------------------------------------- 1 | # PyDataLens -------------------------------------------------------------------------------- /docs/USAGE.md: -------------------------------------------------------------------------------- 1 | 2 | # pydatalens Usage Guide 3 | 4 | pydatalens is a Python package designed to simplify exploratory data analysis (EDA), data cleaning, and visualization. This guide provides examples of how to use the package effectively. 
5 | 6 | ## Installation 7 | 8 | Ensure you have the required dependencies installed. You can install pydatalens using the following steps: 9 | 10 | 1. Clone the repository: 11 | 12 | ```bash 13 | git clone https://github.com/gopalakrishnanarjun/pydatalens.git 14 | cd pydatalens 15 | ``` 16 | 17 | 2. Install the package: 18 | 19 | ```bash 20 | pip install -e . 21 | ``` 22 | 23 | ## Getting Started 24 | 25 | ### Importing the Package 26 | 27 | ```python 28 | from pydatalens import eda, cleaning, visualizations 29 | ``` 30 | 31 | ### Loading Data 32 | 33 | You can load your dataset using Pandas: 34 | 35 | ```python 36 | import pandas as pd 37 | 38 | # Load your dataset 39 | df = pd.read_csv("data.csv") 40 | ``` 41 | 42 | ## Features 43 | 44 | ### 1. Summarizing Data 45 | 46 | Get an overview of your dataset using the `summarize` function. 47 | 48 | ```python 49 | summary = eda.summarize(df) 50 | print(summary) 51 | ``` 52 | 53 | ### 2. Correlation Analysis 54 | 55 | Calculate and display a correlation matrix. 56 | 57 | ```python 58 | correlation_matrix = eda.correlation(df) 59 | print(correlation_matrix) 60 | ``` 61 | 62 | ### 3. Handling Missing Values 63 | 64 | Handle missing values in the dataset using `mean`, `median`, or `mode` strategies. 65 | 66 | ```python 67 | df_cleaned = cleaning.handle_missing(df, strategy="mean") 68 | ``` 69 | 70 | ### 4. Dropping Duplicates 71 | 72 | Remove duplicate rows from the dataset. 73 | 74 | ```python 75 | df_unique = cleaning.drop_duplicates(df) 76 | ``` 77 | 78 | ### 5. Visualizations 79 | 80 | #### Histogram 81 | 82 | Generate a histogram for a specific column. 83 | 84 | ```python 85 | visualizations.plot_histogram(df, column="age") 86 | ``` 87 | 88 | #### Correlation Heatmap 89 | 90 | Generate a heatmap to visualize correlations. 
91 | 92 | ```python 93 | visualizations.correlation_heatmap(df) 94 | ``` 95 | 96 | ## Advanced Features 97 | 98 | ### Generating an EDA Report 99 | 100 | Future versions will include a feature to generate an HTML or PDF EDA report. 101 | 102 | ## Example Usage 103 | 104 | Here’s a complete example of using pydatalens: 105 | 106 | ```python 107 | import pandas as pd 108 | from pydatalens import eda, cleaning, visualizations 109 | 110 | # Load your dataset 111 | data = {"ColumnA": [1, None, 3, 4], "ColumnB": [10, 20, 30, 40]} 112 | df = pd.DataFrame(data) 113 | 114 | # Summarize the dataset 115 | summary = eda.summarize(df) 116 | print(summary) 117 | 118 | # Handle missing values 119 | df_cleaned = cleaning.handle_missing(df, strategy="mean") 120 | 121 | # Drop duplicates 122 | df_unique = cleaning.drop_duplicates(df_cleaned) 123 | 124 | # Visualize data 125 | visualizations.plot_histogram(df_unique, "ColumnA") 126 | visualizations.correlation_heatmap(df_unique) 127 | ``` 128 | 129 | ## Support 130 | 131 | If you encounter any issues, feel free to open an issue on GitHub or contact the package maintainer. 132 | 133 | --- 134 | 135 | Happy analyzing with pydatalens! 
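The `handle_missing` function in this repository assigns the filled columns back onto the DataFrame it receives, so the caller's original frame is modified as well. The following is a minimal sketch, assuming plain pandas, of the same mean/median/mode idea applied to a copy. The name `fill_missing_copy` is hypothetical and not part of pydatalens.

```python
import pandas as pd


def fill_missing_copy(df, strategy="mean"):
    """Hypothetical helper: fill numeric NaNs on a copy, leaving `df` untouched."""
    if strategy not in ("mean", "median", "mode"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()
    for column in out.select_dtypes(include="number").columns:
        series = out[column]
        if strategy == "mean":
            fill_value = series.mean()
        elif strategy == "median":
            fill_value = series.median()
        else:
            modes = series.mode()
            if modes.empty:  # every value is missing, so there is no mode to use
                continue
            fill_value = modes.iloc[0]
        out[column] = series.fillna(fill_value)
    return out


# Same sample data as the usage example in this guide.
data = {"ColumnA": [1, None, 3, 4], "ColumnB": [10, 20, 30, 40]}
df = pd.DataFrame(data)
filled = fill_missing_copy(df, strategy="mean")
```

Here `df` still contains its missing value after the call, while `filled` holds the mean of the observed `ColumnA` values in its place. Whether to copy or mutate is a design choice; copying costs memory on large frames but keeps the raw data available for comparison.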
136 | -------------------------------------------------------------------------------- /examples/example_usage.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from pydatalens import eda, cleaning, visualizations 3 | 4 | # Example dataset 5 | data = {"ColumnA": [1, None, 3, 4], "ColumnB": [10, 20, 30, 40]} 6 | df = pd.DataFrame(data) 7 | 8 | # Summarize 9 | print(eda.summarize(df)) 10 | 11 | # Clean 12 | df = cleaning.handle_missing(df, strategy="mean") 13 | df = cleaning.drop_duplicates(df) 14 | 15 | # Visualize 16 | visualizations.plot_histogram(df, "ColumnA") 17 | visualizations.correlation_heatmap(df) 18 | -------------------------------------------------------------------------------- /pydatalens/Templates/report_template.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | EDA Report 5 | 6 | 7 |

Exploratory Data Analysis Report

8 |

Summary

9 | {{ summary_table }} 10 |

Correlation Heatmap

11 | Correlation Heatmap 12 | 13 | 14 | -------------------------------------------------------------------------------- /pydatalens/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | PyDataLens: A Python package for automatic EDA, data cleaning, and visualization. 3 | """ 4 | 5 | from .eda import summarize, correlation 6 | from .cleaning import handle_missing, drop_duplicates 7 | from .visualizations import plot_histogram, correlation_heatmap 8 | -------------------------------------------------------------------------------- /pydatalens/cleaning.py: -------------------------------------------------------------------------------- 1 | def handle_missing(df, strategy="mean"): 2 | """ 3 | Fills missing values in the DataFrame. 4 | Args: 5 | strategy: mean, median, or mode. 6 | """ 7 | print(f"Handling missing values using {strategy} strategy...") 8 | for column in df.select_dtypes(include=["float", "int"]).columns: 9 | if strategy == "mean": 10 | df[column] = df[column].fillna(df[column].mean()) 11 | elif strategy == "median": 12 | df[column] = df[column].fillna(df[column].median()) 13 | elif strategy == "mode": 14 | df[column] = df[column].fillna(df[column].mode()[0]) 15 | return df 16 | 17 | def drop_duplicates(df): 18 | """ 19 | Drops duplicate rows. 20 | """ 21 | print("Dropping duplicate rows...") 22 | return df.drop_duplicates() 23 | -------------------------------------------------------------------------------- /pydatalens/eda.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def summarize(df): 4 | """ 5 | Summarizes the given DataFrame. 
6 | """ 7 | print("Generating data summary...") 8 | summary = { 9 | "Columns": df.columns.tolist(), 10 | "Data Types": df.dtypes.tolist(), 11 | "Missing Values": df.isnull().sum().tolist(), 12 | "Unique Values": df.nunique().tolist(), 13 | } 14 | return pd.DataFrame(summary) 15 | 16 | def correlation(df): 17 | """ 18 | Generates a correlation matrix over the numeric columns. 19 | """ 20 | print("Calculating correlation matrix...") 21 | return df.select_dtypes(include="number").corr() 22 | -------------------------------------------------------------------------------- /pydatalens/utils.py: -------------------------------------------------------------------------------- 1 | def save_plot(filename): 2 | """ 3 | Utility function to save a plot to a file. 4 | """ 5 | import matplotlib.pyplot as plt 6 | plt.savefig(filename) 7 | print(f"Plot saved as {filename}.") 8 | -------------------------------------------------------------------------------- /pydatalens/visualizations.py: -------------------------------------------------------------------------------- 1 | import seaborn as sns 2 | import matplotlib.pyplot as plt 3 | 4 | def plot_histogram(df, column): 5 | """ 6 | Plots a histogram for a column. 7 | """ 8 | print(f"Generating histogram for {column}...") 9 | sns.histplot(df[column], kde=True) 10 | plt.title(f"Histogram of {column}") 11 | plt.show() 12 | 13 | def correlation_heatmap(df): 14 | """ 15 | Plots a heatmap of the correlation matrix. 
16 | """ 17 | print("Generating correlation heatmap...") 18 | sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm") 19 | plt.title("Correlation Heatmap") 20 | plt.show() 21 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas>=1.0 2 | numpy>=1.18 3 | matplotlib>=3.1 4 | seaborn>=0.11 5 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name="pydatalens", 5 | version="1.0.0", 6 | description="A Python package for automatic EDA, data cleaning, and visualization.", 7 | author='Gopalakrishnan Arjunan', 8 | author_email='gopalakrishnana02@gmail.com', 9 | long_description=open('README.md', encoding='utf-8').read(), 10 | long_description_content_type='text/markdown', 11 | packages=find_packages(), 12 | install_requires=[ 13 | "pandas", 14 | "numpy", 15 | "matplotlib", 16 | "seaborn", 17 | ], 18 | url='https://github.com/gopalakrishnanarjun/pydatalens', # Update with your GitHub repository URL 19 | classifiers=[ 20 | 'Programming Language :: Python :: 3', 21 | 'License :: OSI Approved :: MIT License', 22 | 'Operating System :: OS Independent', 23 | ], 24 | python_requires=">=3.6", 25 | ) 26 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | PyDataLens: A Python package for automatic EDA, data cleaning, and visualization. 
3 | """ 4 | 5 | from pydatalens.eda import summarize, correlation 6 | from pydatalens.cleaning import handle_missing, drop_duplicates 7 | from pydatalens.visualizations import plot_histogram, correlation_heatmap 8 | -------------------------------------------------------------------------------- /tests/test_cleaning.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from pydatalens import cleaning 3 | 4 | def test_handle_missing(): 5 | data = {"A": [1, None, 3]} 6 | df = pd.DataFrame(data) 7 | cleaned_df = cleaning.handle_missing(df, strategy="mean") 8 | assert cleaned_df.isnull().sum().sum() == 0 9 | print("Handle missing test passed.") 10 | 11 | def test_drop_duplicates(): 12 | data = {"A": [1, 1, 2]} 13 | df = pd.DataFrame(data) 14 | cleaned_df = cleaning.drop_duplicates(df) 15 | assert cleaned_df.shape[0] == 2 16 | print("Drop duplicates test passed.") 17 | -------------------------------------------------------------------------------- /tests/test_eda.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from pydatalens import eda 3 | 4 | def test_summarize(): 5 | data = {"A": [1, 2, None], "B": [4, None, 6]} 6 | df = pd.DataFrame(data) 7 | summary = eda.summarize(df) 8 | assert "Columns" in summary.columns 9 | print("Summarize test passed.") 10 | 11 | def test_correlation(): 12 | data = {"A": [1, 2, 3], "B": [4, 5, 6]} 13 | df = pd.DataFrame(data) 14 | corr = eda.correlation(df) 15 | assert corr.shape[0] == corr.shape[1] 16 | print("Correlation test passed.") 17 | -------------------------------------------------------------------------------- /tests/test_visualizations.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from pydatalens import visualizations 3 | 4 | def test_plot_histogram(): 5 | data = {"A": [1, 2, 3, 4, 5]} 6 | df = pd.DataFrame(data) 7 | visualizations.plot_histogram(df, column="A") 8 | 
print("Histogram test passed.") 9 | 10 | def test_correlation_heatmap(): 11 | data = {"A": [1, 2, 3], "B": [4, 5, 6]} 12 | df = pd.DataFrame(data) 13 | visualizations.correlation_heatmap(df) 14 | print("Correlation heatmap test passed.") 15 | --------------------------------------------------------------------------------
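The correlation tests above only use all-numeric frames. In recent pandas versions (2.0 and later, where `numeric_only` defaults to `False`), a text column can make `DataFrame.corr` raise instead of being skipped. The following is a minimal sketch, assuming plain pandas, of restricting the frame to numeric columns first. `numeric_correlation` is a hypothetical helper name, not a pydatalens function.

```python
import pandas as pd


def numeric_correlation(df):
    """Hypothetical helper: correlation matrix over the numeric columns only."""
    return df.select_dtypes(include="number").corr()


# A mixed-type frame: two numeric columns and one text column.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "label": ["x", "y", "z"]})
corr = numeric_correlation(df)
```

The result is a square matrix over `A` and `B` only; the `label` column is dropped before the correlation is computed. Selecting by dtype rather than passing `numeric_only=True` keeps the sketch usable on older pandas releases that lack that keyword on `corr`.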