├── README.md
└── images
    └── Preview-of-RegEx.png


/README.md:
--------------------------------------------------------------------------------
# Price Parsing Tutorial for Python

This article explains how to parse prices in Python. By the end of it, you will be able to extract the price and currency from raw text, which is usually collected from external data sources such as web pages or files.

## Table of Contents

- [Introduction](#introduction)
- [Parsing Prices With Python Standard Library](#parsing-prices-with-python-standard-library)
  - [Building Regular Expression](#building-regular-expression)
- [Price Parser Package](#price-parser-package)
  - [`Price` Class](#price-class)
  - [Parsing Price - Simple Cases](#parsing-price---simple-cases)
  - [Parsing Price - Complex Cases](#parsing-price---complex-cases)

## Introduction

This article covers a specific case: converting unprocessed text that contains a price, or other money-like data, into a price amount and a currency. This problem comes up often in web scraping, data cleaning, and similar projects where currency data needs cleaning.

Currency data can appear in many formats, which vary with the currency and the locale.
Here are a few examples:

```python
'29,99 €'  # Comma as decimal separator
'59,90 EUR'
'$ 9.99'  # Period as decimal separator
'9.99 USD'
'Price: $12.99'
'¥ 170 000'
'12,000원'
'₹898.00'
'Price From £ 39.95'
```

As is evident, every format needs a different way of extracting the currency and the price value.

This article first examines how this can be done using the Python Standard Library. The second part examines Price Parser, a library specialized for this purpose.

## Parsing Prices with Python Standard Library

A common way to extract numeric data from a string is to use regular expressions. Regular expression operations can be performed using Python's `re` module.

This is a two-part process. The first part is building a regular expression, and the second part is using it.

### Building Regular Expression

Regular expressions can be difficult for beginners. Websites like [Regexr](https://regexr.com/) can help build and test them.

Let's start with an example.

- Digits can be matched with `\d`.
- The price can also contain a comma and a period. These can be matched using `.` and `,`. Inside a set, `.` matches a literal period.
- The price can also contain spaces. Whitespace characters can be matched using `\s`.
- All of the above need to be put in a set using `[]`.
- There can be zero or more instances of the above. We need the `*` quantifier for this.

Putting all of these together gives the following regular expression:

```python
expression = r'[\d\s.,]*'
```

This expression also matches the whitespace after the number, and it can even match an empty string. This can be fixed by appending `\d`, which ensures that the match ends with a digit even if spaces are matched.

```python
expression = r'[\d\s.,]*\d'
```

In this expression, the set in square brackets matches any of the listed characters, the asterisk allows zero or more of them, and the trailing `\d` requires the match to end with a digit.

If you are building this regular expression at [Regexr](https://regexr.com/), you can paste in all the strings and see what is matched.

![Preview of Regular Expression](images/Preview-of-RegEx.png)

Now we can put together the Python code required to extract the price.

```python
import re

# Raw string, so that the backslashes reach the regex engine unchanged
expression = r'[\d\s.,]*\d'
price = '9.99 USD'

match = re.search(expression, price)

if match:
    print(match.group(0))  # Prints 9.99
```

This regular expression can extract the price in most of the examples above.

There are some problems, though. The first is the decimal separator and the thousands separator. In some locales, a comma is used as the decimal separator, and a period or a space is used as the thousands separator.

For example, ten thousand euros and forty-five cents can be written in these two ways:

```
10 000,45
10.000,45
```

Other locales use a period as the decimal separator and a comma as the thousands separator. In that case, the same price would be written like this:

```
10,000.45
```

Handling this requires more code.

So far, we have not covered extracting the currency or cleaning up the price.

As we add more cases, the code will become more and more complex.

Fortunately, a package that handles these cases is already published on the Python Package Index: [Price Parser](https://pypi.org/project/price-parser/).
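
To give a sense of the extra code that the separator problem calls for, here is a minimal standard-library sketch. The helper name `normalize_amount` and its heuristic (the last period or comma is treated as the decimal separator only when exactly one or two digits follow it) are assumptions made for illustration, not part of any library.

```python
import re
from decimal import Decimal

# A number starts and ends with a digit and may contain spaces, periods, and commas.
NUMBER = re.compile(r'\d[\d\s.,]*\d|\d')


def normalize_amount(text):
    """Hypothetical helper: return the first number in `text` as a Decimal, or None."""
    match = NUMBER.search(text)
    if match is None:
        return None
    # Drop whitespace used as a thousands separator.
    raw = re.sub(r'\s', '', match.group(0))
    # Assumed heuristic: a final '.' or ',' followed by exactly one or two
    # digits is the decimal separator; every other '.' or ',' groups digits.
    decimal_match = re.search(r'[.,](\d{1,2})$', raw)
    if decimal_match:
        integer_part = raw[:decimal_match.start()]
        fraction = decimal_match.group(1)
    else:
        integer_part, fraction = raw, ''
    integer_part = re.sub(r'[.,]', '', integer_part)
    return Decimal(f'{integer_part}.{fraction}' if fraction else integer_part)


for sample in ['10 000,45', '10.000,45', '10,000.45', '12,000원']:
    print(sample, normalize_amount(sample))
```

Under this heuristic, an ambiguous input such as `140,600` is read as a whole number, and amounts with three decimal places are misread. The currency is still not extracted, which shows how quickly the hand-written approach grows.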

## Price Parser Package

Price Parser is a package that aims to make extracting the price and currency from a raw string much easier, without you having to handle all the complex cases yourself.

Even though Price Parser is written for parsing scraped price strings, it works with files or any other data source that contains currency information in raw, unprocessed strings.

The Price Parser package can be installed from the terminal.

```shell
pip install price-parser
```

If you are using Anaconda, note that at the time of writing the package was not available on any conda channel, so you will have to use the `pip install` command there as well.

```shell
conda install price-parser  # Did not work at the time of writing. Use pip install instead.
pip install price-parser  # works
```

Before we explore how this package can be used, we must first understand its `Price` class.

### `Price` Class

The Price Parser package uses a special class to represent a parsed price. This class is aptly named `Price`. The `Price` class has two important attributes:

- `amount`: Price value as `Decimal`
- `currency`: Currency as `str`

It also has a property, `amount_float`, that returns the amount as a `float` instead of a `Decimal`.

With this in mind, let's see how we can parse a price.

### Parsing Price - Simple Cases

Price Parser offers two entry points for parsing. They are aliases and do the same thing.

These two functions are:

- `price_parser.parse_price`
- `Price.fromstring`

The signature of these functions is as follows:

```python
fromstring(price, currency_hint=None, decimal_separator=None)
```

The only mandatory parameter is the price string. The other two parameters are optional.
This function returns an instance of `Price`.

Here is the first example. It shows how `amount`, `amount_float`, and `currency` can be used.

```python
>>> from price_parser import Price
>>> price = Price.fromstring('29,99 €')
>>> print(price)
Price(amount=Decimal('29.99'), currency='€')
>>> price.amount
Decimal('29.99')
>>> price.amount_float
29.99
>>> price.currency
'€'
```

For simple cases, supplying only the price string is enough for the method to work. Here are a few more examples:

```python
>>> Price.fromstring('59,90 EUR')
Price(amount=Decimal('59.90'), currency='EUR')
>>> Price.fromstring('12,000원')
Price(amount=Decimal('12000'), currency='원')
```

### Parsing Price - Complex Cases

The `fromstring` function can handle uncleaned strings and extract the price amount and currency in most cases.

```python
>>> Price.fromstring('Price From £ 39.95')
Price(amount=Decimal('39.95'), currency='£')
```

Sometimes the € sign is used as the decimal separator. Price Parser handles this without any additional tweaking.

```python
>>> Price.fromstring('29€ 99 ')
Price(amount=Decimal('29.99'), currency='€')
```

The previously discussed difficult cases, where a space or a period is used as the thousands separator, are supported out of the box.

```python
>>> Price.fromstring('10.000,45€')
Price(amount=Decimal('10000.45'), currency='€')
>>> Price.fromstring('10 000,45€')
Price(amount=Decimal('10000.45'), currency='€')
```

Both `0` and `free` are interpreted as `0`. Further, if Price Parser cannot determine the amount or the currency, it returns a `Price` object with `amount` and/or `currency` set to `None`.

```python
>>> Price.fromstring('free')
Price(amount=Decimal('0'), currency=None)
>>> Price.fromstring('$')
Price(amount=None, currency='$')
>>> Price.fromstring('0')
Price(amount=Decimal('0'), currency=None)
>>> Price.fromstring('not available')
Price(amount=None, currency=None)
```

A currency hint can be used if Price Parser is not able to detect the currency. For example, the United Arab Emirates Dirham is written as both AED and Dhs. Price Parser can handle AED but cannot handle Dhs. In this case, `currency_hint` is useful.

```python
>>> Price.fromstring('29 AED')
Price(amount=Decimal('29'), currency='AED')  # correct currency
>>> Price.fromstring('29 Dhs')
Price(amount=Decimal('29'), currency=None)  # currency is None
>>> Price.fromstring('29 Dhs', currency_hint="AED")  # send currency_hint
Price(amount=Decimal('29'), currency='AED')  # currency is correct
```

Finally, if the decimal separator is known, it can be set using the `decimal_separator` argument. This removes the need for Price Parser to guess the decimal separator. Here is an example:

```python
>>> Price.fromstring("Price: $140,600")  # It's actually 140.600
Price(amount=Decimal('140600'), currency='$')  # Wrong output
>>> Price.fromstring("Price: $140,600", decimal_separator=",")
Price(amount=Decimal('140.600'), currency='$')  # Correct output
```

--------------------------------------------------------------------------------
/images/Preview-of-RegEx.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/Price-Parsing-Tutorial/3d296b940dd05730aaa270797a9eb43b31d45fab/images/Preview-of-RegEx.png
--------------------------------------------------------------------------------