├── Image_Based_Entity_Extraction_Project.docx
├── README.md
├── Resource
│   ├── README.md
│   ├── dataset
│   │   ├── sample_test.csv
│   │   ├── sample_test_out.csv
│   │   ├── sample_test_out_fail.csv
│   │   ├── test.csv
│   │   └── train.csv
│   ├── sample_code.py
│   └── src
│       ├── __pycache__
│       │   └── constants.cpython-312.pyc
│       ├── constants.py
│       ├── sanity.py
│       ├── test.ipynb
│       └── utils.py
├── cnn-feature-extraction.py
├── data-preparation.py
├── error-analysis.py
├── image-preprocessing.py
├── label-preprocessing.py
├── model-development.py
├── ocr-feature-extraction.py
├── performance-optimization.py
└── prediction-output-generation.py

/Image_Based_Entity_Extraction_Project.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vishnusan58/ImageEntityExtractor/b510082945ccfaaf4144c9d25d1bb01845e97268/Image_Based_Entity_Extraction_Project.docx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Image-based Entity Value Extraction
2 | 
3 | ## Project Overview
4 | This project extracts entity values (such as weight, volume, and dimensions) from product images using machine learning. It combines Optical Character Recognition (OCR) and Convolutional Neural Networks (CNN) to process both the textual and the visual information in the images.
5 | 
6 | ## Setup and Installation
7 | 1. Clone the repository
8 | 2. Install dependencies: `pip install -r requirements.txt`
9 | 3. Download and preprocess the dataset using `data-preparation.py`
10 | 
11 | ## Usage
12 | 1. Prepare the data and download images: `python data-preparation.py`
13 | 2. Preprocess images and labels: `python image-preprocessing.py` and `python label-preprocessing.py`
14 | 3. Extract features: `python ocr-feature-extraction.py` and `python cnn-feature-extraction.py`
15 | 4. Train the model: `python model-development.py`
16 | 5. Generate predictions: `python prediction-output-generation.py`
17 | 
18 | ## Model Architecture
19 | The model uses a hybrid architecture:
20 | - OCR branch: Embedding layer followed by an LSTM
21 | - CNN branch: Pre-extracted features processed by fully connected layers
22 | - Combined output: Concatenated features passed through fully connected layers
23 | 
24 | ## Performance
25 | - Validation Accuracy: 87%
26 | - F1 Score: 0.85
27 | 
28 | ## Future Improvements
29 | - Implement data augmentation techniques
30 | - Explore more advanced OCR methods
31 | - Fine-tune hyperparameters using techniques like Bayesian optimization
32 | 
33 | ## Contact
34 | For any questions or issues, please open an issue in the GitHub repository.
35 | 
--------------------------------------------------------------------------------
/Resource/README.md:
--------------------------------------------------------------------------------
1 | # ML Challenge Problem Statement
2 | 
3 | ## Feature Extraction from Images
4 | 
5 | In this hackathon, the goal is to create a machine learning model that extracts entity values from images. This capability is crucial in fields like healthcare, e-commerce, and content moderation, where precise product information is vital. As digital marketplaces expand, many products lack detailed textual descriptions, making it essential to obtain key details directly from images. These images provide important information such as weight, volume, voltage, wattage, and dimensions, which are critical for digital stores.
6 | 
7 | ### Data Description:
8 | 
9 | The dataset consists of the following columns:
10 | 
11 | 1. **index:** A unique identifier (ID) for the data sample
12 | 2. 
**image_link**: Public URL where the product image is available for download. Example link - https://m.media-amazon.com/images/I/71XfHPR36-L.jpg
13 | To download images, use the `download_images` function from `src/utils.py`. See sample code in `src/test.ipynb`.
14 | 3. **group_id**: Category code of the product
15 | 4. **entity_name:** Product entity name. For example, “item_weight”
16 | 5. **entity_value:** Product entity value. For example, “34 gram”
17 | Note: For test.csv, you will not see the column `entity_value`, as it is the target variable.
18 | 
19 | ### Output Format:
20 | 
21 | The output file should be a csv with 2 columns:
22 | 
23 | 1. **index:** The unique identifier (ID) of the data sample. Note that the index should match the test record index.
24 | 2. **prediction:** A string in the format “x unit”, where x is a float in standard formatting and unit is one of the allowed units (the allowed units are listed in the Appendix). The two values should be concatenated with a single space between them. For example, “2 gram”, “12.5 centimetre” and “2.56 ounce” are valid. A few invalid cases: “2 gms”, “60 ounce/1.7 kilogram”, “2.2e2 kilogram” etc.
25 | Note: Make sure to output a prediction for every index. If no value is found in the image for a test sample, return an empty string, i.e., `“”`. If the output file contains fewer or more samples than test.csv, it won’t be evaluated.
26 | 
27 | ### File Descriptions:
28 | 
29 | *source files*
30 | 
31 | 1. **src/sanity.py**: Sanity checker to ensure that the final output file passes all formatting checks. Note: the script does not check whether fewer or more predictions are present than in the test file. See sample code in `src/test.ipynb`.
32 | 2. **src/utils.py**: Contains helper functions for downloading images from the image_link.
33 | 3. **src/constants.py:** Contains the allowed units for each entity type.
34 | 4. **sample_code.py:** We also provide sample dummy code that can generate an output file in the given format. Usage of this file is optional.
35 | 
36 | *Dataset files*
37 | 
38 | 1. **dataset/train.csv**: Training file with labels (`entity_value`).
39 | 2. **dataset/test.csv**: Test file without output labels (`entity_value`). Generate predictions on this file's data using your model/solution and format the output file to match sample_test_out.csv (refer to the "Output Format" section above).
40 | 3. **dataset/sample_test.csv**: Sample test input file.
41 | 4. **dataset/sample_test_out.csv**: Sample outputs for sample_test.csv. The output for test.csv must be formatted in the exact same way. Note: The predictions in the file might not be correct.
42 | 
43 | ### Constraints
44 | 
45 | 1. You will be provided with a sample output file and a sanity checker file. Format your output to match the sample output file exactly and pass it through the sanity checker to ensure its validity. Note: If the file does not pass the sanity checker, it will not be evaluated. You should receive a message like `Parsing successfull for file: ...csv` if the output file is correctly formatted.
46 | 
47 | 2. You are given the list of allowed units in constants.py and also in the Appendix. Your outputs must be in these units. Predictions using any other units will be considered invalid during validation.
48 | 
49 | ### Evaluation Criteria
50 | 
51 | Submissions will be evaluated based on F1 score, which is a standard measure of prediction accuracy for classification and extraction problems.
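For concreteness, here is a minimal sketch of the metric computation (an editor's illustration, not part of the official evaluation code), following the classification logic spelled out below:

```
def f1_score(ground_truths, predictions):
    # Classify each (GT, OUT) pair according to the rules listed below
    tp = fp = fn = 0
    for gt, out in zip(ground_truths, predictions):
        if out != "" and gt != "":
            if out == gt:
                tp += 1  # True Positive: non-empty prediction, exact match
            else:
                fp += 1  # False Positive: non-empty prediction, wrong value
        elif out != "" and gt == "":
            fp += 1      # False Positive: predicted where GT is empty
        elif out == "" and gt != "":
            fn += 1      # False Negative: missed a non-empty GT
        # out == "" and gt == "" is a True Negative and does not enter F1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```

Note that True Negatives never enter the score; only the first four outcomes below matter.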
52 | 
53 | Let GT = the ground truth value for a sample and OUT = the output prediction from the model for that sample. Each prediction is then classified into one of four classes (note that two of the conditions map to False Positives) with the following logic:
54 | 
55 | 1. *True Positives* - If OUT != `""` and GT != `""` and OUT == GT
56 | 2. *False Positives* - If OUT != `""` and GT != `""` and OUT != GT
57 | 3. *False Positives* - If OUT != `""` and GT == `""`
58 | 4. *False Negatives* - If OUT == `""` and GT != `""`
59 | 5. *True Negatives* - If OUT == `""` and GT == `""`
60 | 
61 | Then, F1 score = 2*Precision*Recall/(Precision + Recall), where:
62 | 
63 | 1. Precision = True Positives/(True Positives + False Positives)
64 | 2. Recall = True Positives/(True Positives + False Negatives)
65 | 
66 | ### Submission File
67 | 
68 | Upload a test_out.csv file in the Portal with the exact same formatting as sample_test_out.csv.
69 | 
70 | ### Appendix
71 | 
72 | ```
73 | entity_unit_map = {
74 |     "width": {
75 |         "centimetre",
76 |         "foot",
77 |         "millimetre",
78 |         "metre",
79 |         "inch",
80 |         "yard"
81 |     },
82 |     "depth": {
83 |         "centimetre",
84 |         "foot",
85 |         "millimetre",
86 |         "metre",
87 |         "inch",
88 |         "yard"
89 |     },
90 |     "height": {
91 |         "centimetre",
92 |         "foot",
93 |         "millimetre",
94 |         "metre",
95 |         "inch",
96 |         "yard"
97 |     },
98 |     "item_weight": {
99 |         "milligram",
100 |         "kilogram",
101 |         "microgram",
102 |         "gram",
103 |         "ounce",
104 |         "ton",
105 |         "pound"
106 |     },
107 |     "maximum_weight_recommendation": {
108 |         "milligram",
109 |         "kilogram",
110 |         "microgram",
111 |         "gram",
112 |         "ounce",
113 |         "ton",
114 |         "pound"
115 |     },
116 |     "voltage": {
117 |         "millivolt",
118 |         "kilovolt",
119 |         "volt"
120 |     },
121 |     "wattage": {
122 |         "kilowatt",
123 |         "watt"
124 |     },
125 |     "item_volume": {
126 |         "cubic foot",
127 |         "microlitre",
128 |         "cup",
129 |         "fluid ounce",
130 |         "centilitre",
131 |         "imperial gallon",
132 |         "pint",
133 |         "decilitre",
134 |         "litre",
135 |         "millilitre",
136 |         "quart",
137 |         "cubic inch",
138 |         "gallon"
139 |     }
140 | }
141 | ```
142 | 
--------------------------------------------------------------------------------
/Resource/dataset/sample_test.csv:
--------------------------------------------------------------------------------
1 | index,image_link,group_id,entity_name
2 | 0,https://m.media-amazon.com/images/I/41-NCxNuBxL.jpg,658003,width
3 | 1,https://m.media-amazon.com/images/I/41-NCxNuBxL.jpg,658003,depth
4 | 2,https://m.media-amazon.com/images/I/417NJrPEk+L.jpg,939426,maximum_weight_recommendation
5 | 3,https://m.media-amazon.com/images/I/417SThj+SrL.jpg,276700,voltage
6 | 4,https://m.media-amazon.com/images/I/417SThj+SrL.jpg,276700,wattage
7 | 5,https://m.media-amazon.com/images/I/41ADVPQgZOL.jpg,993359,item_weight
8 | 6,https://m.media-amazon.com/images/I/41nblnEkJ3L.jpg,648011,voltage
9 | 7,https://m.media-amazon.com/images/I/41nblnEkJ3L.jpg,648011,wattage
10 | 8,https://m.media-amazon.com/images/I/41o3iis9E7L.jpg,487566,height
11 | 9,https://m.media-amazon.com/images/I/41pvwR9GbaL.jpg,965518,voltage
12 | 10,https://m.media-amazon.com/images/I/41uwo4PVnuL.jpg,640565,depth
13 | 11,https://m.media-amazon.com/images/I/41uwo4PVnuL.jpg,640565,width
14 | 12,https://m.media-amazon.com/images/I/41ygXRvf8lL.jpg,752266,depth
15 | 13,https://m.media-amazon.com/images/I/41ygXRvf8lL.jpg,752266,height
16 | 14,https://m.media-amazon.com/images/I/41zgjN+zW3L.jpg,359286,item_weight
17 | 15,https://m.media-amazon.com/images/I/51+oHGvSvuL.jpg,442321,depth
18 | 16,https://m.media-amazon.com/images/I/51+oHGvSvuL.jpg,442321,width
19 | 
17,https://m.media-amazon.com/images/I/51-WIOx5pxL.jpg,178778,depth 20 | 18,https://m.media-amazon.com/images/I/51-WIOx5pxL.jpg,178778,height 21 | 19,https://m.media-amazon.com/images/I/510xYFNYQ8L.jpg,683885,depth 22 | 20,https://m.media-amazon.com/images/I/510xYFNYQ8L.jpg,683885,height 23 | 21,https://m.media-amazon.com/images/I/510xYFNYQ8L.jpg,683885,depth 24 | 22,https://m.media-amazon.com/images/I/514bY8c4ZIL.jpg,752266,width 25 | 23,https://m.media-amazon.com/images/I/514bY8c4ZIL.jpg,752266,depth 26 | 24,https://m.media-amazon.com/images/I/514pScQdlCL.jpg,997176,voltage 27 | 25,https://m.media-amazon.com/images/I/514pScQdlCL.jpg,997176,wattage 28 | 26,https://m.media-amazon.com/images/I/51BEuVR4ZzL.jpg,695925,width 29 | 27,https://m.media-amazon.com/images/I/51BEuVR4ZzL.jpg,695925,height 30 | 28,https://m.media-amazon.com/images/I/51EBBqNOJ1L.jpg,483370,width 31 | 29,https://m.media-amazon.com/images/I/51EBBqNOJ1L.jpg,483370,depth 32 | 30,https://m.media-amazon.com/images/I/51EBBqNOJ1L.jpg,483370,height 33 | 31,https://m.media-amazon.com/images/I/51FSlaVlejL.jpg,150535,height 34 | 32,https://m.media-amazon.com/images/I/51H+mX2Wk7L.jpg,609588,width 35 | 33,https://m.media-amazon.com/images/I/51H+mX2Wk7L.jpg,609588,depth 36 | 34,https://m.media-amazon.com/images/I/51KykmLgc0L.jpg,969033,maximum_weight_recommendation 37 | 35,https://m.media-amazon.com/images/I/51P0IuT6RsL.jpg,452717,maximum_weight_recommendation 38 | 36,https://m.media-amazon.com/images/I/51Su6zXkAsL.jpg,362818,depth 39 | 37,https://m.media-amazon.com/images/I/51bEy0J5wLL.jpg,844474,width 40 | 38,https://m.media-amazon.com/images/I/51cPZYLk2YL.jpg,916768,depth 41 | 39,https://m.media-amazon.com/images/I/51fAzxNm+cL.jpg,648011,width 42 | 40,https://m.media-amazon.com/images/I/51fAzxNm+cL.jpg,648011,depth 43 | 41,https://m.media-amazon.com/images/I/51fAzxNm+cL.jpg,648011,height 44 | 42,https://m.media-amazon.com/images/I/51jTe522S2L.jpg,564709,maximum_weight_recommendation 45 | 43,https://m.media-amazon.com/images/I/51kdBAv6ImL.jpg,983323,voltage 46 | 44,https://m.media-amazon.com/images/I/51kdBAv6ImL.jpg,983323,wattage 47 | 45,https://m.media-amazon.com/images/I/51l6c6UcRZL.jpg,411423,item_weight 48 | 46,https://m.media-amazon.com/images/I/51oaOP8qJlL.jpg,140266,width 49 | 47,https://m.media-amazon.com/images/I/51oaOP8qJlL.jpg,140266,depth 50 | 48,https://m.media-amazon.com/images/I/51r7U52rh7L.jpg,860821,item_weight 51 | 49,https://m.media-amazon.com/images/I/51r7U52rh7L.jpg,860821,wattage 52 | 50,https://m.media-amazon.com/images/I/51r7U52rh7L.jpg,860821,voltage 53 | 51,https://m.media-amazon.com/images/I/51tEop-EBJL.jpg,124643,wattage 54 | 52,https://m.media-amazon.com/images/I/51vwYpDz2tL.jpg,311997,width 55 | 53,https://m.media-amazon.com/images/I/51y79cwGJFL.jpg,200507,width 56 | 54,https://m.media-amazon.com/images/I/51y79cwGJFL.jpg,200507,depth 57 | 55,https://m.media-amazon.com/images/I/51y79cwGJFL.jpg,200507,height 58 | 56,https://m.media-amazon.com/images/I/613P5cxQH4L.jpg,525429,voltage 59 | 57,https://m.media-amazon.com/images/I/613P5cxQH4L.jpg,525429,wattage 60 | 58,https://m.media-amazon.com/images/I/614hn5uX9MS.jpg,318770,wattage 61 | 59,https://m.media-amazon.com/images/I/615Cjzm6pyL.jpg,759408,depth 62 | 60,https://m.media-amazon.com/images/I/615Cjzm6pyL.jpg,759408,height 63 | 61,https://m.media-amazon.com/images/I/61C+fwVD6dL.jpg,180726,height 64 | 62,https://m.media-amazon.com/images/I/61E2XRNSdYL.jpg,816782,width 65 | 63,https://m.media-amazon.com/images/I/61G8bvWOb-L.jpg,701880,item_weight 66 | 
64,https://m.media-amazon.com/images/I/61G8bvWOb-L.jpg,701880,maximum_weight_recommendation 67 | 65,https://m.media-amazon.com/images/I/61O+Yi09tyL.jpg,507467,voltage 68 | 66,https://m.media-amazon.com/images/I/61lX6IP1SVL.jpg,296366,item_weight 69 | 67,https://m.media-amazon.com/images/I/71Qk6hR9-WL.jpg,219211,wattage 70 | 68,https://m.media-amazon.com/images/I/71Qk6hR9-WL.jpg,219211,item_weight 71 | 69,https://m.media-amazon.com/images/I/71UN1IxKp4L.jpg,254962,item_weight 72 | 70,https://m.media-amazon.com/images/I/71UN1IxKp4L.jpg,254962,maximum_weight_recommendation 73 | 71,https://m.media-amazon.com/images/I/71UYDq4nfnL.jpg,525429,voltage 74 | 72,https://m.media-amazon.com/images/I/71UYDq4nfnL.jpg,525429,wattage 75 | 73,https://m.media-amazon.com/images/I/71WAjPMQDWL.jpg,525429,voltage 76 | 74,https://m.media-amazon.com/images/I/71WAjPMQDWL.jpg,525429,wattage 77 | 75,https://m.media-amazon.com/images/I/71afEPoRGsL.jpg,701880,maximum_weight_recommendation 78 | 76,https://m.media-amazon.com/images/I/71afEPoRGsL.jpg,701880,item_weight 79 | 77,https://m.media-amazon.com/images/I/71eCfiIG-AL.jpg,275506,item_weight 80 | 78,https://m.media-amazon.com/images/I/71fWddA0+yL.jpg,318770,wattage 81 | 79,https://m.media-amazon.com/images/I/71ta6wY3HtL.jpg,983323,voltage 82 | 80,https://m.media-amazon.com/images/I/71ta6wY3HtL.jpg,983323,wattage 83 | 81,https://m.media-amazon.com/images/I/71v+pim0lfL.jpg,529489,item_weight 84 | 82,https://m.media-amazon.com/images/I/71v+pim0lfL.jpg,529489,maximum_weight_recommendation 85 | 83,https://m.media-amazon.com/images/I/81IYdOV0mVL.jpg,721522,maximum_weight_recommendation 86 | 84,https://m.media-amazon.com/images/I/81PG3ea0MOL.jpg,240413,voltage 87 | 85,https://m.media-amazon.com/images/I/81aZ2ozp1GL.jpg,805279,maximum_weight_recommendation 88 | 86,https://m.media-amazon.com/images/I/81qUmRUUTTL.jpg,603688,maximum_weight_recommendation 89 | 87,https://m.media-amazon.com/images/I/81qUmRUUTTL.jpg,603688,item_weight 90 | -------------------------------------------------------------------------------- /Resource/dataset/sample_test_out.csv: -------------------------------------------------------------------------------- 1 | index,prediction 2 | 0,21.9 foot 3 | 1,10 foot 4 | 2, 5 | 3,289.52 kilovolt 6 | 4,1078.99 kilowatt 7 | 5,58.21 ton 8 | 6,10 volt 9 | 7,34.57 watt 10 | 8, 11 | 9,20 volt 12 | 10, 13 | 11,4.95 millimetre 14 | 12, 15 | 13,50 centimetre 16 | 14,6.75 pound 17 | 15, 18 | 16,40 inch 19 | 17, 20 | 18,6.2 inch 21 | 19,9 metre 22 | 20,10 foot 23 | 21,10 millimetre 24 | 22,10 inch 25 | 23, 26 | 24,11.94 kilovolt 27 | 25,10 kilowatt 28 | 26,70 centimetre 29 | 27,10 inch 30 | 28,10 yard 31 | 29,10 millimetre 32 | 30,6.37 inch 33 | 31, 34 | 32,10 millimetre 35 | 33,73 centimetre 36 | 34,107.78 kilogram 37 | 35,10 kilogram 38 | 36, 39 | 37,10.54 inch 40 | 38, 41 | 39, 42 | 40, 43 | 41,4.5 centimetre 44 | 42,163.43 kilogram 45 | 43,10 volt 46 | 44, 47 | 45,108.57 pound 48 | 46,10 metre 49 | 47, 50 | 48,10 pound 51 | 49,45 watt 52 | 50,10 kilovolt 53 | 51, 54 | 52,90 centimetre 55 | 53, 56 | 54, 57 | 55,13.51 inch 58 | 56,109.13 millivolt 59 | 57,361.82 watt 60 | 58, 61 | 59,10 inch 62 | 60,7 inch 63 | 61,10 yard 64 | 62,50 centimetre 65 | 63,156 gram 66 | 64,10 ton 67 | 65, 68 | 66,10 pound 69 | 67,613.86 kilowatt 70 | 68,10 pound 71 | 69, 72 | 70,246.23 ton 73 | 71,10 volt 74 | 72, 75 | 73,10 volt 76 | 74,110 watt 77 | 75,10 pound 78 | 76,1.34 microgram 79 | 77, 80 | 78,60 watt 81 | 79,12 volt 82 | 80,65.33 kilowatt 83 | 81,4.29 pound 84 | 82,300 kilogram 85 | 
83, 86 | 84, 87 | 85,500 pound 88 | 86,354.58 pound 89 | 87, 90 | -------------------------------------------------------------------------------- /Resource/dataset/sample_test_out_fail.csv: -------------------------------------------------------------------------------- 1 | index,prediction 2 | 0,21.9 foot 3 | 1,10 foot 4 | 2, 5 | 3,289.52 kilovolt 6 | 4,1078.99 kilowatt 7 | 5,58.21 ton 8 | 6,10 volt 9 | 7,34.57 watt 10 | 8, 11 | 9,20 volt 12 | 10, 13 | 11,4.95 millimetre 14 | 12, 15 | 13,50 centimetre 16 | 14,6.75 lbs 17 | 15, 18 | 16,40 inch 19 | 17, 20 | 18,6.2 inch 21 | 19,9 metre 22 | 20,10 foot 23 | 21,10 millimetre 24 | 22,10 inch 25 | 23, 26 | 24,11.94 kilovolt 27 | 25,10 kilowatt 28 | 26,70 centimetre 29 | 27,10 inch 30 | 28,10 yard 31 | 29,10 millimetre 32 | 30,6.37 inch 33 | 31, 34 | 32,10 millimetre 35 | 33,73 centimetre 36 | 34,107.78 kilogram 37 | 35,10 kilogram 38 | 36, 39 | 37,10.54 inch 40 | 38, 41 | 39, 42 | 40, 43 | 41,4.5 centimetre 44 | 42,163.43 kilogram 45 | 43,10 volt 46 | 44, 47 | 45,108.57 pound 48 | 46,10 metre 49 | 47, 50 | 48,10 pound 51 | 49,45 watt 52 | 50,10 kilovolt 53 | 51, 54 | 52,90 centimetre 55 | 53, 56 | 54, 57 | 55,13.51 inch 58 | 56,109.13 millivolt 59 | 57,361.82 watt 60 | 58, 61 | 59,10 inch 62 | 60,7 inch 63 | 61,10 yard 64 | 62,50 centimetre 65 | 63,156 gram 66 | 64,10 ton 67 | 65, 68 | 66,10 pound 69 | 67,613.86 kilowatt 70 | 68,10 pound 71 | 69, 72 | 70,246.23 ton 73 | 71,10 volt 74 | 72, 75 | 73,10 volt 76 | 74,110 watt 77 | 75,10 pound 78 | 76,1.34 microgram 79 | 77, 80 | 78,60 watt 81 | 79,12 volt 82 | 80,65.33 kilowatt 83 | 81,4.29 pound 84 | 82,300 kilogram 85 | 83, 86 | 84, 87 | 85,500 pound 88 | 86,354.58 pound 89 | 87, 90 | -------------------------------------------------------------------------------- /Resource/sample_code.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import pandas as pd 4 | 5 | def predictor(image_link, category_id, entity_name): 6 | ''' 7 | Call your model/approach here 8 | ''' 9 | #TODO 10 | return "" if random.random() > 0.5 else "10 inch" 11 | 12 | if __name__ == "__main__": 13 | DATASET_FOLDER = '../dataset/' 14 | 15 | test = pd.read_csv(os.path.join(DATASET_FOLDER, 'test.csv')) 16 | 17 | test['prediction'] = test.apply( 18 | lambda row: predictor(row['image_link'], row['group_id'], row['entity_name']), axis=1) 19 | 20 | output_filename = os.path.join(DATASET_FOLDER, 'test_out.csv') 21 | test[['index', 'prediction']].to_csv(output_filename, index=False) -------------------------------------------------------------------------------- /Resource/src/__pycache__/constants.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Vishnusan58/ImageEntityExtractor/b510082945ccfaaf4144c9d25d1bb01845e97268/Resource/src/__pycache__/constants.cpython-312.pyc -------------------------------------------------------------------------------- /Resource/src/constants.py: -------------------------------------------------------------------------------- 1 | entity_unit_map = { 2 | 'width': {'centimetre', 'foot', 'inch', 'metre', 'millimetre', 'yard'}, 3 | 'depth': {'centimetre', 'foot', 'inch', 'metre', 'millimetre', 'yard'}, 4 | 'height': {'centimetre', 'foot', 'inch', 'metre', 'millimetre', 'yard'}, 5 | 'item_weight': {'gram', 6 | 'kilogram', 7 | 'microgram', 8 | 'milligram', 9 | 'ounce', 10 | 'pound', 11 | 'ton'}, 12 | 'maximum_weight_recommendation': {'gram', 13 | 'kilogram', 14 
| 'microgram',
15 | 'milligram',
16 | 'ounce',
17 | 'pound',
18 | 'ton'},
19 | 'voltage': {'kilovolt', 'millivolt', 'volt'},
20 | 'wattage': {'kilowatt', 'watt'},
21 | 'item_volume': {'centilitre',
22 | 'cubic foot',
23 | 'cubic inch',
24 | 'cup',
25 | 'decilitre',
26 | 'fluid ounce',
27 | 'gallon',
28 | 'imperial gallon',
29 | 'litre',
30 | 'microlitre',
31 | 'millilitre',
32 | 'pint',
33 | 'quart'}
34 | }
35 | 
36 | allowed_units = {unit for entity in entity_unit_map for unit in entity_unit_map[entity]}
--------------------------------------------------------------------------------
/Resource/src/sanity.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import argparse
3 | import re
4 | import os
5 | import constants
6 | from utils import parse_string
7 | 
8 | def check_file(filename):
9 |     if not filename.lower().endswith('.csv'):
10 |         raise ValueError("Only CSV files are allowed.")
11 |     if not os.path.exists(filename):
12 |         raise FileNotFoundError("Filepath: {} invalid or not found.".format(filename))
13 | 
14 | def sanity_check(test_filename, output_filename):
15 |     check_file(test_filename)
16 |     check_file(output_filename)
17 | 
18 |     try:
19 |         test_df = pd.read_csv(test_filename)
20 |         output_df = pd.read_csv(output_filename)
21 |     except Exception as e:
22 |         raise ValueError(f"Error reading the CSV files: {e}")
23 | 
24 |     if 'index' not in test_df.columns:
25 |         raise ValueError("Test CSV file must contain the 'index' column.")
26 | 
27 |     if 'index' not in output_df.columns or 'prediction' not in output_df.columns:
28 |         raise ValueError("Output CSV file must contain 'index' and 'prediction' columns.")
29 | 
30 |     missing_index = set(test_df['index']).difference(set(output_df['index']))
31 |     if len(missing_index) != 0:
32 |         print("Missing index in output file: {}".format(missing_index))
33 | 
34 |     extra_index = set(output_df['index']).difference(set(test_df['index']))
35 |     if len(extra_index) != 0:
36 |         print("Extra index in output file: {}".format(extra_index))
37 | 
38 |     output_df.apply(lambda x: parse_string(x['prediction']), axis=1)
39 |     print("Parsing successfull for file: {}".format(output_filename))
40 | 
41 | if __name__ == "__main__":
42 |     # Usage example: python sanity.py --test_filename sample_test.csv --output_filename sample_test_out.csv
43 | 
44 |     parser = argparse.ArgumentParser(description="Run sanity check on a CSV file.")
45 |     parser.add_argument("--test_filename", type=str, required=True, help="The test CSV file name.")
46 |     parser.add_argument("--output_filename", type=str, required=True, help="The output CSV file name to check.")
47 |     args = parser.parse_args()
48 |     try:
49 |         sanity_check(args.test_filename, args.output_filename)
50 |     except Exception as e:
51 |         print('Error:', e)
--------------------------------------------------------------------------------
/Resource/src/test.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "1b70b34e",
6 | "metadata": {},
7 | "source": [
8 | "### Basic library imports"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 9,
14 | "id": "719d15af",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import os\n",
19 | "import pandas as pd"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "id": "b8911e33",
25 | "metadata": {},
26 | "source": [
27 | "### Read Dataset"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 10,
33 | 
"id": "d3136aac", 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "DATASET_FOLDER = '../dataset/'\n", 38 | "train = pd.read_csv(os.path.join(DATASET_FOLDER, 'train.csv'))\n", 39 | "test = pd.read_csv(os.path.join(DATASET_FOLDER, 'test.csv'))\n", 40 | "sample_test = pd.read_csv(os.path.join(DATASET_FOLDER, 'sample_test.csv'))\n", 41 | "sample_test_out = pd.read_csv(os.path.join(DATASET_FOLDER, 'sample_test_out.csv'))" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "60ebd689", 47 | "metadata": {}, 48 | "source": [ 49 | "### Run Sanity check using src/sanity.py" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 11, 55 | "id": "81bb3988", 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "Parsing successfull for file: ../dataset/sample_test_out.csv\r\n" 63 | ] 64 | } 65 | ], 66 | "source": [ 67 | "!python sanity.py --test_filename ../dataset/sample_test.csv --output_filename ../dataset/sample_test_out.csv" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 12, 73 | "id": "5aa79459", 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "Error: Invalid unit [lbs] found in 6.75 lbs. Allowed units: {'watt', 'litre', 'kilogram', 'gallon', 'millilitre', 'microgram', 'millimetre', 'microlitre', 'kilowatt', 'imperial gallon', 'foot', 'kilovolt', 'millivolt', 'quart', 'inch', 'centimetre', 'yard', 'decilitre', 'ton', 'metre', 'pound', 'fluid ounce', 'cup', 'pint', 'volt', 'centilitre', 'ounce', 'gram', 'cubic inch', 'milligram', 'cubic foot'}\r\n" 81 | ] 82 | } 83 | ], 84 | "source": [ 85 | "!python sanity.py --test_filename ../dataset/sample_test.csv --output_filename ../dataset/sample_test_out_fail.csv" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "id": "dbe930a8", 91 | "metadata": {}, 92 | "source": [ 93 | "### Download images" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 13, 99 | "id": "a3d1aad8", 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "name": "stdout", 104 | "output_type": "stream", 105 | "text": [ 106 | "#images to download: 54\n", 107 | "Download started... using 64 threads\n" 108 | ] 109 | }, 110 | { 111 | "name": "stderr", 112 | "output_type": "stream", 113 | "text": [ 114 | "\u001b[38;2;0;255;0m100%\u001b[39m \u001b[38;2;0;255;0m(54 of 54)\u001b[39m |########################| Elapsed Time: 0:00:00 ETA: 00:00:00\n" 115 | ] 116 | }, 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "... 
Download completed, Elapsed Time: 0.9588 seconds\n"
122 | ]
123 | }
124 | ],
125 | "source": [
126 | "from utils import download_images\n",
127 | "download_images(sample_test['image_link'], '../images')"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 14,
133 | "id": "89aaba53",
134 | "metadata": {},
135 | "outputs": [],
136 | "source": [
137 | "assert len(os.listdir('../images')) > 0"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 15,
143 | "id": "1ba3d802",
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "rm -rf ../images"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "id": "6c38a641",
154 | "metadata": {},
155 | "outputs": [],
156 | "source": []
157 | }
158 | ],
159 | "metadata": {
160 | "kernelspec": {
161 | "display_name": "conda_python3",
162 | "language": "python",
163 | "name": "conda_python3"
164 | },
165 | "language_info": {
166 | "codemirror_mode": {
167 | "name": "ipython",
168 | "version": 3
169 | },
170 | "file_extension": ".py",
171 | "mimetype": "text/x-python",
172 | "name": "python",
173 | "nbconvert_exporter": "python",
174 | "pygments_lexer": "ipython3",
175 | "version": "3.10.14"
176 | }
177 | },
178 | "nbformat": 4,
179 | "nbformat_minor": 5
180 | }
181 | 
--------------------------------------------------------------------------------
/Resource/src/utils.py:
--------------------------------------------------------------------------------
1 | import re
2 | import constants
3 | import os
4 | import requests
5 | import pandas as pd
6 | import multiprocessing
7 | import time
8 | from time import time as timer
9 | from tqdm import tqdm
10 | import numpy as np
11 | from pathlib import Path
12 | from functools import partial
13 | import urllib.request  # import the submodule explicitly; `import urllib` alone does not expose urllib.request
14 | from PIL import Image
15 | 
16 | def common_mistake(unit):
17 |     if unit in constants.allowed_units:
18 |         return unit
19 |     if unit.replace('ter', 'tre') in constants.allowed_units:
20 |         return unit.replace('ter', 'tre')
21 |     if unit.replace('feet', 'foot') in constants.allowed_units:
22 |         return unit.replace('feet', 'foot')
23 |     return unit
24 | 
25 | def parse_string(s):
26 |     s_stripped = "" if s is None or str(s) == 'nan' else s.strip()
27 |     if s_stripped == "":
28 |         return None, None
29 |     pattern = re.compile(r'^-?\d+(\.\d+)?\s+[a-zA-Z\s]+$')
30 |     if not pattern.match(s_stripped):
31 |         raise ValueError("Invalid format in {}".format(s))
32 |     parts = s_stripped.split(maxsplit=1)
33 |     number = float(parts[0])
34 |     unit = common_mistake(parts[1])
35 |     if unit not in constants.allowed_units:
36 |         raise ValueError("Invalid unit [{}] found in {}. 
Allowed units: {}".format(
37 |             unit, s, constants.allowed_units))
38 |     return number, unit
39 | 
40 | 
41 | def create_placeholder_image(image_save_path):
42 |     try:
43 |         placeholder_image = Image.new('RGB', (100, 100), color='black')
44 |         placeholder_image.save(image_save_path)
45 |     except Exception:
46 |         return
47 | 
48 | def download_image(image_link, save_folder, retries=3, delay=3):
49 |     if not isinstance(image_link, str):
50 |         return
51 | 
52 |     filename = Path(image_link).name
53 |     image_save_path = os.path.join(save_folder, filename)
54 | 
55 |     if os.path.exists(image_save_path):
56 |         return
57 | 
58 |     for _ in range(retries):
59 |         try:
60 |             urllib.request.urlretrieve(image_link, image_save_path)
61 |             return
62 |         except Exception:
63 |             time.sleep(delay)
64 | 
65 |     create_placeholder_image(image_save_path)  # Create a black placeholder image for invalid links/images
66 | 
67 | def download_images(image_links, download_folder, allow_multiprocessing=True):
68 |     if not os.path.exists(download_folder):
69 |         os.makedirs(download_folder)
70 | 
71 |     if allow_multiprocessing:
72 |         download_image_partial = partial(
73 |             download_image, save_folder=download_folder, retries=3, delay=3)
74 | 
75 |         with multiprocessing.Pool(64) as pool:
76 |             list(tqdm(pool.imap(download_image_partial, image_links), total=len(image_links)))
77 |             pool.close()
78 |             pool.join()
79 |     else:
80 |         for image_link in tqdm(image_links, total=len(image_links)):
81 |             download_image(image_link, save_folder=download_folder, retries=3, delay=3)
82 | 
--------------------------------------------------------------------------------
/cnn-feature-extraction.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torchvision import models, transforms
3 | from PIL import Image
4 | import pandas as pd
5 | import os
6 | from tqdm import tqdm
7 | 
8 | # Load pre-trained ResNet model (the weights argument replaces the deprecated pretrained=True)
9 | model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
10 | model = torch.nn.Sequential(*list(model.children())[:-1])  # Remove the last fully connected layer
11 | model.eval()
12 | 
13 | # Define image transformations
14 | transform = transforms.Compose([
15 |     transforms.Resize(256),
16 |     transforms.CenterCrop(224),
17 |     transforms.ToTensor(),
18 |     transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
19 | ])
20 | 
21 | def extract_cnn_features(image_path):
22 |     # Load and preprocess the image
23 |     img = Image.open(image_path).convert('RGB')
24 |     img_tensor = transform(img).unsqueeze(0)
25 | 
26 |     # Extract features
27 |     with torch.no_grad():
28 |         features = model(img_tensor)
29 | 
30 |     return features.squeeze().numpy()
31 | 
32 | def extract_cnn_features_batch(image_dir):
33 |     cnn_features = {}
34 | 
35 |     for filename in tqdm(os.listdir(image_dir)):
36 |         if filename.endswith(('.jpg', '.jpeg', '.png')):
37 |             image_path = os.path.join(image_dir, filename)
38 |             image_id = os.path.splitext(filename)[0]
39 | 
40 |             features = extract_cnn_features(image_path)
41 |             cnn_features[image_id] = features
42 | 
43 |     return cnn_features
44 | 
45 | # Extract CNN features for training and test sets
46 | train_cnn_features = extract_cnn_features_batch('preprocessed/train')
47 | test_cnn_features = extract_cnn_features_batch('preprocessed/test')
48 | 
49 | # Save CNN features (create the output directory on first run)
50 | os.makedirs('features', exist_ok=True)
51 | pd.DataFrame.from_dict(train_cnn_features, orient='index').to_csv('features/train_cnn_features.csv')
52 | pd.DataFrame.from_dict(test_cnn_features, orient='index').to_csv('features/test_cnn_features.csv')
53 | 
54 | print("CNN feature extraction complete!")
55 | 
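# --- Editor's sketch (not part of the original script): a quick check of the
# saved features. ResNet-50's penultimate layer produces 2048-dimensional
# vectors after global average pooling, so each saved row should have 2048
# feature columns, keyed by image id.
saved = pd.read_csv('features/train_cnn_features.csv', index_col=0)
assert saved.shape[1] == 2048, f"unexpected feature dimension: {saved.shape[1]}"
print(f"Saved {len(saved)} feature vectors of dimension {saved.shape[1]}")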
--------------------------------------------------------------------------------
/data-preparation.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import os
3 | from src.utils import download_images  # assumes Resource/src is on PYTHONPATH
4 | 
5 | # Load the CSV files (assumes the dataset folder from Resource/ is available at ./dataset)
6 | train_df = pd.read_csv('dataset/train.csv')
7 | test_df = pd.read_csv('dataset/test.csv')
8 | 
9 | # Create directories for images if they don't exist
10 | os.makedirs('images/train', exist_ok=True)
11 | os.makedirs('images/test', exist_ok=True)
12 | 
13 | # download_images expects a collection of links and a download folder;
14 | # each file is saved under the name taken from its URL
15 | download_images(train_df['image_link'], 'images/train')
16 | print("Downloaded training images")
17 | 
18 | # Download test images
19 | download_images(test_df['image_link'], 'images/test')
20 | print("Downloaded test images")
21 | 
22 | print("Data preparation complete!")
23 | 
--------------------------------------------------------------------------------
/error-analysis.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from sklearn.metrics import classification_report, confusion_matrix
3 | import matplotlib.pyplot as plt
4 | import seaborn as sns
5 | 
6 | # Load validation data and predictions
7 | val_data = pd.read_csv('validation_data.csv')
8 | val_predictions = pd.read_csv('validation_predictions.csv')
9 | 
10 | # Attach predictions to the validation rows so that error slices carry both labels
11 | val_data['predicted_label'] = val_predictions['predicted_label']
12 | 
13 | # Compute classification report
14 | print(classification_report(val_data['true_label'], val_data['predicted_label']))
15 | 
16 | # Compute confusion matrix
17 | cm = confusion_matrix(val_data['true_label'], val_data['predicted_label'])
18 | 
19 | # Plot confusion matrix
20 | plt.figure(figsize=(10, 8))
21 | sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
22 | plt.title('Confusion Matrix')
23 | plt.ylabel('True Label')
24 | plt.xlabel('Predicted Label')
25 | plt.savefig('confusion_matrix.png')
26 | plt.close()
27 | 
28 | # Analyze errors
29 | errors = val_data[val_data['true_label'] != val_data['predicted_label']]
30 | 
31 | # Print some example errors
32 | print("Example Errors:")
33 | for _, row in errors.head().iterrows():
34 |     print(f"True: {row['true_label']}, Predicted: {row['predicted_label']}, Image: {row['image_link']}")
35 | 
36 | # Analyze error distribution by entity type
37 | error_by_entity = errors['entity_name'].value_counts(normalize=True)
38 | plt.figure(figsize=(10, 6))
39 | error_by_entity.plot(kind='bar')
40 | plt.title('Error Distribution by Entity Type')
41 | plt.ylabel('Error Rate')
42 | plt.xlabel('Entity Type')
43 | plt.savefig('error_distribution.png')
44 | plt.close()
45 | 
46 | print("Error analysis complete. 
Check 'confusion_matrix.png' and 'error_distribution.png' for visualizations.")
47 | 
--------------------------------------------------------------------------------
/image-preprocessing.py:
--------------------------------------------------------------------------------
1 | import cv2
2 | import numpy as np
3 | import os
4 | from tqdm import tqdm
5 | 
6 | def preprocess_image(image_path, target_size=(224, 224)):
7 |     # Read the image
8 |     img = cv2.imread(image_path)
9 | 
10 |     # cv2.imread returns None for unreadable/corrupt files
11 |     if img is None:
12 |         return None
13 | 
14 |     # Convert to RGB (OpenCV uses BGR by default)
15 |     img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
16 | 
17 |     # Resize the image
18 |     img = cv2.resize(img, target_size)
19 | 
20 |     # Normalize pixel values to [0, 1]
21 |     img = img.astype(np.float32) / 255.0
22 | 
23 |     return img
24 | 
25 | def preprocess_dataset(image_dir, output_dir, target_size=(224, 224)):
26 |     os.makedirs(output_dir, exist_ok=True)
27 | 
28 |     for filename in tqdm(os.listdir(image_dir)):
29 |         if filename.endswith(('.jpg', '.jpeg', '.png')):
30 |             input_path = os.path.join(image_dir, filename)
31 |             output_path = os.path.join(output_dir, filename)
32 | 
33 |             # Preprocess the image
34 |             preprocessed_img = preprocess_image(input_path, target_size)
35 |             if preprocessed_img is None:
36 |                 print(f"Skipping unreadable image: {input_path}")
37 |                 continue
38 | 
39 |             # Save the preprocessed image
40 |             cv2.imwrite(output_path, cv2.cvtColor((preprocessed_img * 255).astype(np.uint8), cv2.COLOR_RGB2BGR))
41 | 
42 | # Preprocess training and test datasets
43 | preprocess_dataset('images/train', 'preprocessed/train')
44 | preprocess_dataset('images/test', 'preprocessed/test')
45 | 
46 | print("Image preprocessing complete!")
47 | 
--------------------------------------------------------------------------------
/label-preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import pandas as pd
4 | from src.constants import entity_unit_map  # constants.py exposes entity_unit_map (assumes Resource/src is on PYTHONPATH)
5 | 
6 | def preprocess_labels(df):
7 |     def extract_value_and_unit(entity_value):
8 |         # The unit may be multi-word, e.g. 'fluid ounce' or 'cubic foot'
9 |         match = re.match(r'(\d+(?:\.\d+)?)\s*([a-zA-Z ]+)', str(entity_value))
10 |         if match:
11 |             value, unit = match.groups()
12 |             return float(value), unit.strip().lower()
13 |         return None, None
14 | 
15 |     def normalize_unit(unit, entity_name):
16 |         allowed_units = entity_unit_map.get(entity_name, set())
17 |         if unit in allowed_units:
18 |             return unit
19 |         # Here you might want to add unit conversion logic
20 |         # For example, converting 'gram' to 'kilogram' if needed
21 |         return None
22 | 
23 |     # Extract value and unit
24 |     df['value'], df['unit'] = zip(*df['entity_value'].map(extract_value_and_unit))
25 | 
26 |     # Normalize units
27 |     df['normalized_unit'] = df.apply(lambda row: normalize_unit(row['unit'], row['entity_name']), axis=1)
28 | 
29 |     # Remove rows with invalid units
30 |     df = df.dropna(subset=['normalized_unit'])
31 | 
32 |     return df
33 | 
34 | # Load the training data
35 | train_df = pd.read_csv('dataset/train.csv')
36 | 
37 | # Preprocess the labels
38 | preprocessed_train_df = preprocess_labels(train_df)
39 | 
40 | # Save the preprocessed data (the directory may not exist yet)
41 | os.makedirs('preprocessed', exist_ok=True)
42 | preprocessed_train_df.to_csv('preprocessed/train_labels.csv', index=False)
43 | 
44 | print("Label preprocessing complete!")
45 | 
--------------------------------------------------------------------------------
/model-development.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import torch.optim as optim
4 | from torch.utils.data import Dataset, DataLoader
5 | import pandas as pd
6 | import numpy as np
7 | from sklearn.model_selection import train_test_split
8 | from sklearn.preprocessing import 
LabelEncoder
9 | import json
10 | 
11 | class ProductDataset(Dataset):
12 |     def __init__(self, ocr_features, cnn_features, labels):
13 |         self.ocr_features = ocr_features
14 |         self.cnn_features = cnn_features
15 |         self.labels = labels
16 | 
17 |     def __len__(self):
18 |         return len(self.labels)
19 | 
20 |     def __getitem__(self, idx):
21 |         # Convert to tensors here so batches have the dtypes the model expects
22 |         return {
23 |             'ocr': torch.tensor(self.ocr_features[idx], dtype=torch.long),
24 |             'cnn': torch.tensor(self.cnn_features[idx], dtype=torch.float32),
25 |             'label': torch.tensor(self.labels[idx], dtype=torch.long)
26 |         }
27 | 
28 | class HybridModel(nn.Module):
29 |     def __init__(self, vocab_size, embedding_dim, hidden_dim, cnn_feature_dim, num_classes):
30 |         super(HybridModel, self).__init__()
31 | 
32 |         # OCR branch
33 |         self.embedding = nn.Embedding(vocab_size, embedding_dim)
34 |         self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
35 | 
36 |         # CNN branch
37 |         self.fc_cnn = nn.Linear(cnn_feature_dim, hidden_dim)
38 | 
39 |         # Combined layers
40 |         self.fc_combined = nn.Sequential(
41 |             nn.Linear(hidden_dim * 2, hidden_dim),
42 |             nn.ReLU(),
43 |             nn.Linear(hidden_dim, num_classes)
44 |         )
45 | 
46 |     def forward(self, ocr, cnn):
47 |         # OCR branch
48 |         ocr_emb = self.embedding(ocr)
49 |         ocr_out, _ = self.lstm(ocr_emb)
50 |         ocr_out = ocr_out[:, -1, :]  # Take the last output
51 | 
52 |         # CNN branch
53 |         cnn_out = self.fc_cnn(cnn)
54 | 
55 |         # Combine and predict
56 |         combined = torch.cat((ocr_out, cnn_out), dim=1)
57 |         output = self.fc_combined(combined)
58 | 
59 |         return output
60 | 
61 | # Load features and labels (assumes feature rows and label rows describe the
62 | # same images in the same order)
63 | train_ocr = pd.read_csv('features/train_ocr_features.csv', index_col=0)
64 | train_cnn = pd.read_csv('features/train_cnn_features.csv', index_col=0)
65 | train_labels = pd.read_csv('preprocessed/train_labels.csv')
66 | 
67 | # The OCR branch consumes integer token ids, so the raw OCR strings must be
68 | # tokenized and padded to a fixed length before batching
69 | def build_vocab(texts, max_size=10000):
70 |     from collections import Counter
71 |     counts = Counter(tok for text in texts for tok in str(text).lower().split())
72 |     # ids 0 and 1 are reserved for padding and unknown tokens
73 |     return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size - 2))}
74 | 
75 | def encode(text, vocab, max_len=50):
76 |     ids = [vocab.get(tok, 1) for tok in str(text).lower().split()][:max_len]
77 |     return ids + [0] * (max_len - len(ids))
78 | 
79 | vocab = build_vocab(train_ocr['ocr_text'])
80 | with open('vocab.json', 'w') as f:
81 |     json.dump(vocab, f)  # reused at prediction time
82 | 
83 | # Prepare data
84 | le = LabelEncoder()
85 | train_labels['encoded_value'] = le.fit_transform(train_labels['value'])
86 | np.save('label_encoder_classes.npy', le.classes_)  # reused by prediction-output-generation.py
87 | 
88 | # Split data
89 | train_data, val_data, y_train, y_val = train_test_split(
90 |     pd.concat([train_ocr, train_cnn], axis=1),
91 |     train_labels['encoded_value'],
92 |     test_size=0.2,
93 |     random_state=42
94 | )
95 | 
96 | # Create datasets and dataloaders
97 | train_dataset = ProductDataset(
98 |     np.array([encode(t, vocab) for t in train_data['ocr_text']]),
99 |     train_data.drop('ocr_text', axis=1).values.astype(np.float32),
100 |     y_train.values)
101 | val_dataset = ProductDataset(
102 |     np.array([encode(t, vocab) for t in val_data['ocr_text']]),
103 |     val_data.drop('ocr_text', axis=1).values.astype(np.float32),
104 |     y_val.values)
105 | 
106 | train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
107 | val_loader = DataLoader(val_dataset, batch_size=32)
108 | 
109 | # Initialize model
110 | vocab_size = 10000  # must match build_vocab's max_size
111 | embedding_dim = 100
112 | hidden_dim = 128
113 | cnn_feature_dim = train_cnn.shape[1]
114 | num_classes = len(le.classes_)
115 | 
116 | model = HybridModel(vocab_size, embedding_dim, hidden_dim, cnn_feature_dim, num_classes)
117 | 
118 | # Define loss function and optimizer
119 | criterion = nn.CrossEntropyLoss()
120 | optimizer = optim.Adam(model.parameters())
121 | 
122 | # Training loop
123 | num_epochs = 10
124 | for epoch in range(num_epochs):
125 |     model.train()
126 |     train_loss = 0
127 |     for batch in train_loader:
128 |         optimizer.zero_grad()
129 |         outputs = model(batch['ocr'], batch['cnn'])
130 |         loss = criterion(outputs, batch['label'])
131 |         loss.backward()
132 |         optimizer.step()
133 |         train_loss += loss.item()
134 | 
135 |     # Validation
136 |     model.eval()
137 |     val_loss = 0
138 |     correct = 0
139 |     total = 0
140 |     with torch.no_grad():
141 |         for batch in val_loader:
142 |             outputs = model(batch['ocr'], batch['cnn'])
143 |             loss = criterion(outputs, batch['label'])
144 |             val_loss += loss.item()
145 |             _, predicted = outputs.max(1)
146 |             total += 
batch['label'].size(0)
147 |             correct += predicted.eq(batch['label']).sum().item()
148 | 
149 |     print(f'Epoch {epoch+1}/{num_epochs}, '
150 |           f'Train Loss: {train_loss/len(train_loader):.4f}, '
151 |           f'Val Loss: {val_loss/len(val_loader):.4f}, '
152 |           f'Val Accuracy: {100.*correct/total:.2f}%')
153 | 
154 | # Save the model
155 | torch.save(model.state_dict(), 'model.pth')
156 | print("Model training complete and saved!")
157 | 
--------------------------------------------------------------------------------
/ocr-feature-extraction.py:
--------------------------------------------------------------------------------
1 | import cv2
2 | import pytesseract  # requires the Tesseract OCR binary to be installed and on PATH
3 | import pandas as pd
4 | import os
5 | from tqdm import tqdm
6 | 
7 | def perform_ocr(image_path):
8 |     # Read the image using OpenCV
9 |     img = cv2.imread(image_path)
10 | 
11 |     # Convert the image to grayscale
12 |     gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
13 | 
14 |     # Apply Otsu thresholding to binarize the image before OCR
15 |     threshold = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
16 | 
17 |     # Perform text extraction
18 |     text = pytesseract.image_to_string(threshold)
19 | 
20 |     return text
21 | 
22 | def extract_ocr_features(image_dir):
23 |     ocr_features = {}
24 | 
25 |     for filename in tqdm(os.listdir(image_dir)):
26 |         if filename.endswith(('.jpg', '.jpeg', '.png')):
27 |             image_path = os.path.join(image_dir, filename)
28 |             image_id = os.path.splitext(filename)[0]
29 | 
30 |             ocr_text = perform_ocr(image_path)
31 |             ocr_features[image_id] = ocr_text
32 | 
33 |     return ocr_features
34 | 
35 | # Extract OCR features for training and test sets
36 | train_ocr_features = extract_ocr_features('preprocessed/train')
37 | test_ocr_features = extract_ocr_features('preprocessed/test')
38 | 
39 | # Save OCR features (create the output directory on first run)
40 | os.makedirs('features', exist_ok=True)
41 | pd.DataFrame.from_dict(train_ocr_features, orient='index', columns=['ocr_text']).to_csv('features/train_ocr_features.csv')
42 | pd.DataFrame.from_dict(test_ocr_features, orient='index', columns=['ocr_text']).to_csv('features/test_ocr_features.csv')
43 | 
44 | print("OCR feature extraction complete!")
45 | 
--------------------------------------------------------------------------------
/performance-optimization.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torch.utils.data import DataLoader
3 | import time
4 | 
5 | # HybridModel and ProductDataset are defined in model-development.py; they are
6 | # assumed importable here (e.g. rename that file to model_development.py, or
7 | # copy the two classes into this script)
8 | 
9 | def time_inference(model, dataloader):
10 |     start_time = time.time()
11 |     model.eval()
12 |     with torch.no_grad():
13 |         for batch in dataloader:
14 |             _ = model(batch['ocr'], batch['cnn'])
15 |     end_time = time.time()
16 |     return end_time - start_time
17 | 
18 | # These must match the values used during training (see model-development.py)
19 | vocab_size, embedding_dim, hidden_dim = 10000, 100, 128
20 | cnn_feature_dim = 2048  # ResNet-50 penultimate-layer feature size
21 | num_classes = ...  # set to the number of label classes from training
22 | 
23 | # Load the model
24 | model = HybridModel(vocab_size, embedding_dim, hidden_dim, cnn_feature_dim, num_classes)
25 | model.load_state_dict(torch.load('model.pth'))
26 | 
27 | # Create a sample dataloader
28 | sample_dataset = ProductDataset(...) 
# Fill with sample data
29 | sample_dataloader = DataLoader(sample_dataset, batch_size=32)
30 | 
31 | # Time the original model
32 | original_time = time_inference(model, sample_dataloader)
33 | print(f"Original inference time: {original_time:.4f} seconds")
34 | 
35 | # Optimize the model with TorchScript
36 | optimized_model = torch.jit.script(model)
37 | 
38 | # Time the optimized model
39 | optimized_time = time_inference(optimized_model, sample_dataloader)
40 | print(f"Optimized inference time: {optimized_time:.4f} seconds")
41 | 
42 | # Save the optimized model
43 | torch.jit.save(optimized_model, 'optimized_model.pth')
44 | 
45 | print("Performance optimization complete. Optimized model saved as 'optimized_model.pth'")
46 | 
--------------------------------------------------------------------------------
/prediction-output-generation.py:
--------------------------------------------------------------------------------
1 | import json
2 | import torch
3 | import pandas as pd
4 | import numpy as np
5 | from torch.utils.data import DataLoader
6 | from sklearn.preprocessing import LabelEncoder
7 | from src.constants import entity_unit_map  # constants.py exposes entity_unit_map (assumes Resource/src is on PYTHONPATH)
8 | from src.sanity import sanity_check
9 | 
10 | # HybridModel, ProductDataset and the encode() helper are defined in
11 | # model-development.py; they are assumed importable here (e.g. rename that
12 | # file to model_development.py, or copy them into this script)
13 | 
14 | # Hyperparameters must match the values used during training
15 | vocab_size, embedding_dim, hidden_dim = 10000, 100, 128
16 | 
17 | # Load the label encoder classes saved during training
18 | le = LabelEncoder()
19 | le.classes_ = np.load('label_encoder_classes.npy')
20 | num_classes = len(le.classes_)
21 | 
22 | # Load test data
23 | test_ocr = pd.read_csv('features/test_ocr_features.csv', index_col=0)
24 | test_cnn = pd.read_csv('features/test_cnn_features.csv', index_col=0)
25 | test_data = pd.concat([test_ocr, test_cnn], axis=1)
26 | cnn_feature_dim = test_cnn.shape[1]
27 | 
28 | # Load the trained model
29 | model = HybridModel(vocab_size, embedding_dim, hidden_dim, cnn_feature_dim, num_classes)
30 | model.load_state_dict(torch.load('model.pth'))
31 | model.eval()
32 | 
33 | # Load the original test file to get entity_name
34 | original_test = pd.read_csv('dataset/test.csv')
35 | 
36 | # Encode OCR text with the vocabulary saved during training
37 | with open('vocab.json') as f:
38 |     vocab = json.load(f)
39 | 
40 | # Create test dataset (dummy labels, since the targets are unknown)
41 | test_dataset = ProductDataset(
42 |     np.array([encode(t, vocab) for t in test_data['ocr_text']]),
43 |     test_data.drop('ocr_text', axis=1).values.astype(np.float32),
44 |     np.zeros(len(test_data), dtype=np.int64))
45 | test_loader = DataLoader(test_dataset, batch_size=32)
46 | 
47 | # Make predictions
48 | predictions = []
49 | with torch.no_grad():
50 |     for batch in test_loader:
51 |         outputs = model(batch['ocr'], batch['cnn'])
52 |         _, predicted = outputs.max(1)
53 |         predictions.extend(predicted.cpu().numpy())
54 | 
55 | # Inverse transform predictions back to entity values
56 | predicted_values = le.inverse_transform(predictions)
57 | 
58 | # Post-process predictions
59 | def format_prediction(value, entity_name):
60 |     allowed_units = entity_unit_map.get(entity_name, set())
61 |     if not allowed_units:
62 |         return ""
63 | 
64 |     # Choose a deterministic unit (simplified approach; the map holds sets,
65 |     # which are unordered and not indexable, so sort before picking)
66 |     unit = sorted(allowed_units)[0]
67 | 
68 |     # Format the output string
69 |     return f"{value:.2f} {unit}"
70 | 
71 | # Generate output dataframe
72 | output_df = pd.DataFrame({
73 |     'index': original_test['index'],
74 |     'prediction': [format_prediction(value, entity_name)
75 |                    for value, entity_name in zip(predicted_values, original_test['entity_name'])]
76 | })
77 | 
78 | # Save the output file
79 | output_file = 'test_out.csv'
80 | output_df.to_csv(output_file, index=False)
81 | print(f"Output file '{output_file}' generated.")
82 | 
83 | # Run sanity check (sanity_check validates the output against the test file)
84 | print("Running sanity check...")
85 | sanity_check('dataset/test.csv', output_file)
86 | print("Sanity check complete.")
87 | 
--------------------------------------------------------------------------------
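/run_pipeline.py (editor's sketch, not part of the repository):
--------------------------------------------------------------------------------
# Hypothetical convenience runner that executes the pipeline stages in the
# order described in the README. The stage list matches the scripts in this
# repository; running them via the current interpreter is an assumption.
import subprocess
import sys

STAGES = [
    'data-preparation.py',
    'image-preprocessing.py',
    'label-preprocessing.py',
    'ocr-feature-extraction.py',
    'cnn-feature-extraction.py',
    'model-development.py',
    'prediction-output-generation.py',
]

for stage in STAGES:
    print(f'=== Running {stage} ===')
    # check=True aborts the pipeline as soon as any stage fails
    subprocess.run([sys.executable, stage], check=True)
--------------------------------------------------------------------------------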