├── .gitignore ├── LICENSE ├── README.md ├── pyproject.toml ├── requirements.txt ├── tests ├── 20200524_ドラゴンフライト.zip ├── 20200524_ドラゴンボール.zip ├── 20200524_フラット.zip ├── 20200524_フラットpwd.zip ├── 202301-03_hokkaido_jukyu.zip ├── test_zipu.py └── ミックス.zip └── zip_unicode ├── __init__.py └── main.py /.gitignore: -------------------------------------------------------------------------------- 1 | /venv/ 2 | /build/ 3 | /dist/ 4 | /ZipUnicode.egg-info/ 5 | /.idea/ 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) [2020] [Nguyen Ba Duc Tin] 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ZipUnicode 2 | Make the unreadable-filename problem of extracted zip files go away. 3 | 4 | [![Downloads](https://pepy.tech/badge/zipunicode)](https://pepy.tech/project/zipunicode) 5 | [![PyPI version](https://badge.fury.io/py/zipunicode.svg)](https://pypi.org/project/zipunicode/) 6 | [![GitHub license](https://img.shields.io/github/license/Dragon2fly/zipunicode)](https://github.com/Dragon2fly/zipunicode/blob/master/LICENSE) 7 | 8 | ## Install: 9 | Using pip: `pip install ZipUnicode` 10 | 11 | Besides installing the `zip_unicode` package, 12 | this also creates an executable `zipu` on your system path 13 | so that you can work with `zip` files directly from the console. 14 | 15 | ## Filename encoding inside a zip file 16 | Everyone agrees on what a zip file is and how to make one: 17 | you turn a collection of files into a sequence of bytes 18 | and put `.zip` at the end of the name of the newly created file. 19 | In practice, however, tools do not agree on how filenames should be encoded. 20 | So it is up to the extracting program to interpret the stored bytes as a filename. 21 | 22 | Most operating systems use UTF-8 for filename encoding and set a flag bit in the zip file to indicate that. 23 | Windows, however, is an exception. For different languages, Windows uses different `code page`s 24 | to encode filenames. So, if you create a zip file containing a file named `ê.txt` on Linux and 25 | extract it on Windows, you may get something like `├¬.txt` or `テェ.txt`. 26 | 27 | The exact filename depends on the `code page` or `language` that Windows is using. 28 | The same thing happens when a zip file that contains non-ASCII filenames is created on Windows 29 | and then extracted on Linux or on a Windows machine that uses a different `code page`. 
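
The mismatch described above can be reproduced in a few lines of Python. This is a minimal sketch and not part of `zipu`: it assumes the extracting program falls back to `cp437` for names that are not flagged as UTF-8 (which is what Python's own `zipfile` module does), and the helper names `garble` and `recover` are made up for illustration.

```python
def garble(name: str, real_encoding: str, assumed_encoding: str) -> str:
    """Encode a name the way the archiver did, then decode it the way
    a mismatched extractor would."""
    return name.encode(real_encoding).decode(assumed_encoding)


def recover(garbled: str, assumed_encoding: str, real_encoding: str) -> str:
    """Undo the wrong decoding to get the raw bytes back, then decode
    them with the encoding that was actually used."""
    raw = garbled.encode(assumed_encoding)
    return raw.decode(real_encoding, errors="replace")


if __name__ == "__main__":
    broken = garble("ê.txt", "utf-8", "cp437")
    print(broken)                             # ├¬.txt
    print(recover(broken, "cp437", "utf-8"))  # ê.txt
```

The round trip only works because `cp437` assigns a character to every byte value, so no information is lost by the wrong decoding. The hard part is knowing which `real_encoding` to pass, which is why the encoding has to be guessed or supplied by the user.
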
30 | 31 | All of this means that if a filename wasn't encoded with `UTF-8`, 32 | there is no easy way to know which `encoding` (or `code page`) was used when the file is extracted. 33 | 34 | ## Overview 35 | You will use `zipu` to interact with zip files. 36 | 37 | ```bash 38 | $ zipu -h 39 | ``` 40 | 41 | ```bash 42 | usage: zipu [-h] [--extract] [--fix] [--encoding ENCODING] 43 | [--password PASSWORD] 44 | zipfile [destination] 45 | 46 | Fix filename encoding error inside a zip file. 47 | 48 | positional arguments: 49 | zipfile path to zip file 50 | destination folder path to extract zip file 51 | 52 | optional arguments: 53 | -h, --help show this help message and exit 54 | --extract, -x extract the zipfile to specified destination 55 | --fix, -f create a new zip file with UTF-8 file names 56 | --encoding ENCODING, -enc ENCODING 57 | zip file used encoding: shift-jis, cp932... 58 | --password PASSWORD, -pwd PASSWORD 59 | password to extract zip file 60 | ``` 61 | 62 | Extracting a zip file is as simple as `zipu -x file.zip`. 63 | Files are extracted into a folder that has the same name as `file.zip` without `.zip` 64 | and sits in the same directory as `file.zip`. Filename `encoding` is handled automatically. 65 | 66 | You can also make sure your zip file opens correctly on all computers with `zipu -f file.zip`. 67 | This creates a new `file_fixed.zip` in which all file names are encoded with `UTF-8`. 68 | 69 | ## Usage: 70 | 1. 
View content of the zip file: 71 | 72 | You simply point `zipu` to your zip file's path as follows: 73 | 74 | ```bash 75 | zipu path/to/file.zip 76 | ``` 77 | 78 | This makes `zipu` do the following: 79 | * automatically guess the encoding that was used to encode file names 80 | * check whether the file is password protected 81 | * give you a default extract destination if you don't provide any 82 | 83 | Then, it will show a summary of the contents of that zip file, 84 | something similar to the following: 85 | 86 | D:\tmp>zipu 20200524_ドラゴンフライト.zip 87 | 88 | ```bash 89 | * Detected encoding : SHIFT_JIS | Language:Japanese | Confidence:99% 90 | * Default destination: D:\tmp 91 | * Password protected : False 92 | --------------------------- try encoding: SHIFT_JIS --------------------------- 93 | 20200524_ドラゴンフライト/ 94 | 20200524_ドラゴンフライト/テストレポート_リナックスノード.txt 95 | 20200524_ドラゴンフライト/太陽バッテリーver5.txt 96 | 20200524_ドラゴンフライト/経営報告_桜ちゃん.txt 97 | ------------------------------------------------------------------------------- 98 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...) 99 | Add '-x' flag to extract all files to default destination 100 | ``` 101 | 102 | If there is a root folder inside and it has the same name as the zip file, as in the example above, 103 | `default destination` will be the parent folder of the zip file. 
104 | Otherwise, `default destination` will point to a subdirectory 105 | named after the zip file, as in the following case: 106 | 107 | D:\tmp>zipu 20200524_ドラゴンボール.zip 108 | 109 | ```bash 110 | * Detected encoding : SHIFT_JIS | Language:Japanese | Confidence:99% 111 | * Default destination: D:\tmp\20200524_ドラゴンボール 112 | * Password protected : False 113 | --------------------------- try encoding: SHIFT_JIS --------------------------- 114 | テストレポート_リナックスノード.txt 115 | 太陽バッテリーver5.txt 116 | 経営報告_桜ちゃん.txt 117 | ------------------------------------------------------------------------------- 118 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...) 119 | Add '-x' flag to extract all files to default destination 120 | ``` 121 | 122 | 2. View content with a specific encoding: 123 | 124 | Encoding auto-detection is not always correct. When the sample is too small 125 | and the bytes produced by encoding `A` are also valid in encoding `B`, `B` may be wrongly detected 126 | instead of `A`. In such cases, you can specify the encoding which you believe 127 | is the correct one with the `-enc ENCODING` switch. 128 | 129 | D:\tmp>zipu 20200524_ドラゴンボール.zip -enc cp932 130 | 131 | ```bash 132 | * Default destination: D:\tmp\20200524_ドラゴンボール 133 | * Password protected : False 134 | --------------------------- try encoding: cp932 --------------------------- 135 | テストレポート_リナックスノード.txt 136 | 太陽バッテリーver5.txt 137 | 経営報告_桜ちゃん.txt 138 | --------------------------------------------------------------------------- 139 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...) 140 | Add '-x' flag to extract all files to default destination 141 | ``` 142 | 143 | If your specified `ENCODING` is wrong and cannot decode some bytes, 144 | the undecodable bytes are replaced by `�`. 
145 | 146 | D:\tmp>zipu 20200524_ドラゴンボール.zip -enc ascii 147 | 148 | ```bash 149 | * Default destination: D:\tmp\20200524_ドラゴンボール 150 | * Password protected : False 151 | --------------------------- try encoding: ascii --------------------------- 152 | �e�X�g���|�[�g�Q���i�b�N�X�m�[�h.txt 153 | ���z�o�b�e���[ver5.txt 154 | �o�c��_�������.txt 155 | --------------------------------------------------------------------------- 156 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...) 157 | Add '-x' flag to extract all files to default destination 158 | ``` 159 | 160 | Or those bytes are mapped into completely different characters: 161 | 162 | D:\tmp>zipu 20200524_ドラゴンボール.zip -enc utf16 163 | 164 | ```bash 165 | * Default destination: D:\tmp\20200524_ドラゴンボール 166 | * Password protected : False 167 | --------------------------- try encoding: utf16 --------------------------- 168 | 斃境枃貃粃宁枃冁誃榃抃亃境涃宁梃琮瑸 169 | 뺑窗澃抃斃誃宁敶㕲琮瑸 170 | 澌掉邍赟苷芿苡⻱硴� 171 | --------------------------------------------------------------------------- 172 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...) 173 | Add '-x' flag to extract all files to default destination 174 | ``` 175 | 176 | Only when auto-detection failed, it is your responsibility to decide which `ENCODING` is the correct one. 177 | 178 | **Warning**: If your console uses non-full `UTF-8` font as in the case of Windows, 179 | some `UTF-8` characters are shown as a dot `・`. 180 | This is not a result of wrong encoding but rather unsupported characters by the font. 181 | 182 | 3. Extract the zip file: 183 | 184 | Usually, encoding auto-detection works just fine so you can jump right to extraction with
185 | `zipu -x path/to/file.zip`. The `-x` argument can be placed either **before or after** the path to the zip file. 186 | 187 | D:\tmp>zipu 20200524_ドラゴンフライト.zip -x 188 | 189 | ```bash 190 | * Detected encoding : SHIFT_JIS | Language:Japanese | Confidence:99% 191 | Extracting: 20200524_ドラゴンフライト/テストレポート_リナックスノード.txt 192 | Extracting: 20200524_ドラゴンフライト/太陽バッテリーver5.txt 193 | Extracting: 20200524_ドラゴンフライト/経営報告_桜ちゃん.txt 194 | Finished 195 | ``` 196 | 197 | As mentioned before, when no `destination` is specified, the zip file is extracted to 198 | a directory that sits next to the zip file and is named after it.
199 | In the above example, that would be `D:\tmp\20200524_ドラゴンフライト`. 200 | 201 | To specify an extract `destination`, add it right after the zip file's path: 202 | 203 | zipu -x path/to/file.zip path/to/extract 204 | 205 | If the output file names are unreadable, 206 | you have to guess the `ENCODING` with the `-enc` switch as described in **2. View content with a specific encoding**. 207 | Then you can use that `ENCODING` to extract the zip file: 208 | 209 | zipu path/to/file.zip -x -enc ENCODING 210 | 211 | 4. A password-protected zip file: 212 | 213 | If a zip file is encrypted, ` * Password protected : True` will show up when viewing its content. 214 | When extracting the zip file, you will be asked for the `password` if you haven't provided one. 215 | You can also specify the password directly in the command as follows: 216 | 217 | zipu path/to/file.zip -x -pwd PASSWORD 218 | 219 | 5. Mixed contents: 220 | 221 | Some zip files are very tricky. They contain file names in different encodings: some `UTF-8`, some not. 222 | `zipu` leaves the names marked as `UTF-8` as they are and applies the detected or specified `ENCODING` only to the other files. 223 | A `UTF-8` encoded filename is prefixed with the string `(UTF-8) ` in the content view: 224 | 225 | D:\tmp>zipu ミックス.zip 226 | 227 | ```bash 228 | * Detected encoding : SHIFT_JIS | Language:Japanese | Confidence:63% 229 | * Default destination: D:\tmp\ミックス 230 | * Password protected : False 231 | --------------------------- try encoding: SHIFT_JIS --------------------------- 232 | (UTF-8) Vùng Trời Bình Yên.txt 233 | бореиская.txt 234 | テストレポート_リナックスノード.txt 235 | 太陽バッテリーver5.txt 236 | 経営報告_桜ちゃん.txt 237 | ------------------------------------------------------------------------------- 238 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...) 
239 | Add '-x' flag to extract all files to default destination 240 | ``` 241 | 242 | When extracting, `UTF-8` encoded filenames are not decoded with the detected `ENCODING`, 243 | so they stay readable as they are. 244 | 245 | **Warning**: `zipu` cannot handle a zip file that contains three or more encodings, or two encodings 246 | of which neither is `UTF-8`. In such cases, you have to extract the zip file once for each encoding. 247 | 248 | 6. Fixing a zip file: 249 | 250 | If you make a zip file that contains file names which are neither `UTF-8` nor `ASCII` encoded, 251 | you can ensure that your colleagues who use computers set to a different language can 252 | open the zip just fine as follows: 253 | 254 | ```bash 255 | zipu -f path/to/file.zip 256 | ``` 257 | 258 | This first extracts your zip file (and converts all file names to `UTF-8`). 259 | Then it compresses the extracted contents and adds a `_fixed` suffix to the zip filename. 260 | The fixed zip file is placed in the same directory as the original one. 261 | 262 | **Warning**: `zipu` cannot create a password-encrypted zip file. 263 | With such files you have to first extract them with `zipu` and then re-zip them 264 | with your conventional tool. 265 | 266 | ## Changelog 267 | ### 1.1.0 268 | * Handle malformed zip file: Some zip files contain folders that are registered as file entries. 269 | These file entries have a size of zero and are extracted as zero-byte files. 270 | Since the OS doesn't allow creating a file and a folder of the same name 271 | within the same directory, `zipu` could not go on to create the folder and extract the files inside. 272 | Now `zipu` checks for those malformed entries and skips them. 273 | * Fixing a zip file from the command line with `zipu -f` now works normally. 
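
The malformed-entry check from the 1.1.0 changelog can be sketched roughly as follows. This is an illustrative sketch rather than `zipu`'s actual code: the function name `folder_entries_registered_as_files` and the in-memory sample archive are assumptions, and the prefix test here requires a `/` after the entry name, which is a stricter rule than a plain prefix match.

```python
import io
import zipfile


def folder_entries_registered_as_files(zf: zipfile.ZipFile) -> list:
    """Return names of zero-byte entries that other entries treat as a folder."""
    names = zf.namelist()
    suspects = []
    for info in zf.infolist():
        if info.file_size != 0 or info.filename.endswith("/"):
            continue  # has content, or is a proper folder entry
        prefix = info.filename + "/"
        if any(other.startswith(prefix) for other in names):
            suspects.append(info.filename)
    return suspects


def build_sample_archive() -> zipfile.ZipFile:
    """Build a small malformed archive in memory (hypothetical sample data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("202301", b"")                # folder stored as an empty file
        zf.writestr("202301/data.csv", b"a,b\n")  # file inside that "folder"
        zf.writestr("empty.txt", b"")             # genuine empty file
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


if __name__ == "__main__":
    with build_sample_archive() as archive:
        print(folder_entries_registered_as_files(archive))  # ['202301']
```

Skipping the reported entries during extraction leaves the path free, so the real folder can be created when the first file inside it is written.
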
274 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["setuptools>=61.0", "chardet"] 3 | build-backend = "setuptools.build_meta" 4 | 5 | [project] 6 | name = "ZipUnicode" 7 | authors = [ 8 | {name = "Nguyen Ba Duc Tin", email = "nguyenbaduc.tin@gmail.com"}, 9 | ] 10 | description = "Fix unreadable file names when extracting zip file" 11 | readme = "README.md" 12 | requires-python = ">=3.6" 13 | license = {file = "LICENSE"} 14 | dependencies = [ 15 | 'chardet>=3.0.0', 16 | ] 17 | classifiers=[ 18 | 'Development Status :: 5 - Production/Stable', 19 | 'Intended Audience :: End Users/Desktop', 20 | 'Intended Audience :: Developers', 21 | 'Intended Audience :: Information Technology', 22 | 'License :: OSI Approved :: MIT License', 23 | 'Programming Language :: Python', 24 | 'Programming Language :: Python :: 3.6', 25 | 'Programming Language :: Python :: 3.7', 26 | 'Programming Language :: Python :: 3.8', 27 | 'Programming Language :: Python :: 3.9', 28 | 'Programming Language :: Python :: 3.10', 29 | 'Programming Language :: Python :: 3.11', 30 | 'Operating System :: OS Independent', 31 | 'Topic :: Software Development :: Debuggers', 32 | 'Topic :: Home Automation', 33 | 'Topic :: Office/Business', 34 | 'Topic :: Scientific/Engineering :: Artificial Intelligence', 35 | 'Topic :: Scientific/Engineering :: Information Analysis', 36 | 'Topic :: Utilities' 37 | ] 38 | dynamic = ["version"] 39 | 40 | [project.scripts] 41 | zipu = "zip_unicode.main:entry_point" 42 | 43 | [project.urls] 44 | homepage = "https://github.com/Dragon2fly/ZipUnicode" 45 | 46 | [tool.setuptools] 47 | packages = ["zip_unicode"] 48 | 49 | [tool.setuptools.dynamic] 50 | version = {attr = "zip_unicode.__version__"} 51 | -------------------------------------------------------------------------------- /requirements.txt: 
-------------------------------------------------------------------------------- 1 | chardet >= 3.0.0 2 | 3 | pytest>=5.4.2 4 | setuptools>=40.8.0 -------------------------------------------------------------------------------- /tests/20200524_ドラゴンフライト.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/20200524_ドラゴンフライト.zip -------------------------------------------------------------------------------- /tests/20200524_ドラゴンボール.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/20200524_ドラゴンボール.zip -------------------------------------------------------------------------------- /tests/20200524_フラット.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/20200524_フラット.zip -------------------------------------------------------------------------------- /tests/20200524_フラットpwd.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/20200524_フラットpwd.zip -------------------------------------------------------------------------------- /tests/202301-03_hokkaido_jukyu.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/202301-03_hokkaido_jukyu.zip -------------------------------------------------------------------------------- /tests/test_zipu.py: -------------------------------------------------------------------------------- 1 | __author__ = "Duc Tin" 2 | 3 | from pathlib import Path 4 | 5 | import pytest 
6 | 7 | from zip_unicode import ZipHandler 8 | 9 | root_folder = ZipHandler('20200524_ドラゴンフライト.zip') 10 | root_folder2 = ZipHandler('20200524_ドラゴンボール.zip') 11 | flat = ZipHandler('20200524_フラット.zip') 12 | flat_pwd = ZipHandler('20200524_フラットpwd.zip') 13 | mixed = ZipHandler('ミックス.zip') 14 | nested_subfolder = ZipHandler('202301-03_hokkaido_jukyu.zip') # with folder entry registered as a file 15 | 16 | 17 | def clean_up(path: Path): 18 | if path.is_dir(): 19 | for f in path.iterdir(): 20 | clean_up(f) if f.is_dir() else f.unlink() 21 | else: 22 | path.rmdir() 23 | else: 24 | path.unlink() 25 | 26 | 27 | def test_byte_name(): 28 | res = {b'V\xc3\xb9ng Tr\xe1\xbb\x9di B\xc3\xacnh Y\xc3\xaan.txt': True, 29 | b'\x84q\x84\x80\x84\x82\x84u\x84y\x84\x83\x84{\x84p\x84\x91.txt': False, 30 | b'\x83e\x83X\x83g\x83\x8c\x83|\x81[\x83g\x81Q\x83\x8a\x83i\x83b\x83N\x83X\x83m\x81[\x83h.txt': False, 31 | b'\x91\xbe\x97z\x83o\x83b\x83e\x83\x8a\x81[ver5.txt': False, 32 | b'\x8co\x89c\x95\xf1\x8d\x90_\x8d\xf7\x82\xbf\x82\xe1\x82\xf1.txt': False, 33 | } 34 | 35 | for file_info in mixed.zip_ref.infolist(): 36 | is_utf8, name = mixed.byte_name(file_info) 37 | assert res[name] == is_utf8 38 | 39 | 40 | def test_guess_encoding(): 41 | assert flat.original_encoding == 'SHIFT_JIS' 42 | 43 | 44 | def test_get_filename_map(): 45 | names = ['Vùng Trời Bình Yên.txt', 'бореиская.txt', 'テストレポート_リナックスノード.txt', 46 | '太陽バッテリーver5.txt', '経営報告_桜ちゃん.txt'] 47 | encodings = ['utf8', 'cp932', 'cp932', 'cp932', 'cp932'] 48 | 49 | wrong_encoded = [x.encode(enc) for x, enc in zip(names[1:], encodings[1:])] 50 | wrong_decoded = [x.decode('cp437') for x in wrong_encoded] 51 | wrong_decoded.insert(0, 'Vùng Trời Bình Yên.txt') # utf8 is left intact 52 | 53 | assert mixed.name_map == dict(zip(wrong_decoded, names)) 54 | 55 | 56 | def test_duplicated_root_name(): 57 | assert root_folder._duplicated_root_name() 58 | assert not root_folder2._duplicated_root_name() 59 | assert not flat._duplicated_root_name() 60 | 
assert not flat_pwd._duplicated_root_name() 61 | assert not mixed._duplicated_root_name() 62 | 63 | 64 | def test_is_encrypted(): 65 | assert flat_pwd.is_encrypted() 66 | assert not flat.is_encrypted() 67 | assert not mixed.is_encrypted() 68 | 69 | 70 | def test_extract_individual(): 71 | name = "テストレポート_リナックスノード.txt".encode('cp932').decode('cp437') 72 | out = Path('test_extract_individual.txt') 73 | flat._extract_individual(name, out) 74 | assert out.read_text(encoding='cp932') == '何もない' 75 | out.unlink() 76 | 77 | 78 | def test_extract_all(): 79 | # all filenames have the same encoding 80 | expect = {'テストレポート_リナックスノード.txt', '経営報告_桜ちゃん.txt', '太陽バッテリーver5.txt'} 81 | out = Path('test_extract_all_one_enc') 82 | flat.extract_all(out) 83 | assert set(x.name for x in out.iterdir()) == expect 84 | clean_up(out) 85 | 86 | # some files are UTF8 encoded, some are not 87 | expect = {'Vùng Trời Bình Yên.txt', 'бореиская.txt', 'テストレポート_リナックスノード.txt', 88 | '太陽バッテリーver5.txt', '経営報告_桜ちゃん.txt'} 89 | out = Path('test_extract_all_mixed_enc') 90 | mixed.extract_all(out) 91 | assert set(x.name for x in out.iterdir()) == expect 92 | clean_up(out) 93 | 94 | # multiple sub-folder with subfolders entry as a file 95 | # aka malformed zipfile 96 | expect = {'202301', '202302', '202303'} 97 | out = Path('test_extract_all_multiple_sub_folder') 98 | nested_subfolder.extract_all(out) 99 | assert set(x.name for x in out.iterdir()) == expect 100 | clean_up(out) 101 | 102 | 103 | def test_extract_all_with_pwd(caplog): 104 | expect = {'テストレポート_リナックスノード.txt', '経営報告_桜ちゃん.txt', '太陽バッテリーver5.txt'} 105 | out = Path('test_extract_all') 106 | 107 | with pytest.raises(OSError) as e: 108 | # password input required 109 | flat_pwd.extract_all(out) 110 | 111 | flat_pwd.password = b'WrongPassword' 112 | flat_pwd.extract_all(out) 113 | 114 | capture = caplog.text 115 | assert 'Wrong password!' 
in capture 116 | 117 | flat_pwd.password = b'password' 118 | flat_pwd.extract_all(out) 119 | assert set(x.name for x in out.iterdir()) == expect 120 | 121 | file_1 = Path('test_extract_all/テストレポート_リナックスノード.txt') 122 | assert file_1.read_text(encoding='cp932') == '何もない' 123 | 124 | # clean up 125 | flat_pwd.password = None 126 | clean_up(out) 127 | 128 | 129 | def test_extract_all_with_root_folder(): 130 | out1 = Path(root_folder.zip_ref.filename.replace('.zip', '')) 131 | root_folder.extract_all() 132 | assert not any(x.is_dir() for x in out1.iterdir()) 133 | clean_up(out1) 134 | 135 | out2 = Path('specified_path') 136 | root_folder.extract_all(out2) 137 | assert (out2 / out1.name).exists() 138 | clean_up(out2) 139 | 140 | out3 = Path('ミックス') 141 | mixed.extract_all() 142 | assert out3.exists() and len(list(out3.iterdir())) == 5 143 | clean_up(out3) 144 | 145 | 146 | @pytest.mark.parametrize('my_zip', [root_folder, flat, mixed]) 147 | def test_fix_it(my_zip): 148 | my_zip.fix_it() 149 | 150 | name = my_zip.zip_path.stem 151 | fixed = my_zip.zip_path.parent / (name + '_fixed.zip') 152 | fixed_zip = ZipHandler(fixed) 153 | assert fixed_zip.all_utf8 154 | 155 | fixed_zip.zip_ref.close() 156 | clean_up(fixed) 157 | -------------------------------------------------------------------------------- /tests/ミックス.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/ミックス.zip -------------------------------------------------------------------------------- /zip_unicode/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = "Duc Tin" 2 | from .main import ZipHandler, __version__ 3 | -------------------------------------------------------------------------------- /zip_unicode/main.py: -------------------------------------------------------------------------------- 1 | __author__ = "Duc Tin" 2 
| __version__ = "1.1.1" 3 | 4 | import getpass 5 | import logging 6 | import shutil 7 | import sys 8 | import zipfile 9 | import tempfile 10 | from pathlib import Path 11 | from argparse import ArgumentParser 12 | 13 | import chardet 14 | 15 | 16 | # Disable chardet logger 17 | logging.getLogger('chardet').level = logging.ERROR 18 | 19 | # Config our logger 20 | logging.basicConfig(format='%(message)s', stream=sys.stdout, level=logging.INFO) 21 | logger = logging.getLogger('zip_unicode') 22 | 23 | 24 | def zip_it(base_name, root_dir): 25 | logger.info("Creating archive:") 26 | shutil.make_archive(base_name, 'zip', root_dir, logger=logger) 27 | 28 | 29 | class ZipHandler: 30 | def __init__(self, path: str, encoding: str = None, 31 | password: bytes = None, extract_path: str = ""): 32 | 33 | self.zip_path = Path(path) 34 | self.zip_ref = zipfile.ZipFile(self.zip_path) 35 | 36 | self.all_utf8 = None 37 | self.original_encoding = encoding or self.guess_encoding() 38 | self.password = password 39 | self.name_map = self._get_filename_map() 40 | 41 | if self._duplicated_root_name(): 42 | self.default_destination = self.zip_path.parent.absolute() 43 | else: 44 | self.default_destination = self.zip_path.parent.absolute() / self.zip_path.stem 45 | self.destination = Path(extract_path) if extract_path else self.default_destination 46 | 47 | @staticmethod 48 | def byte_name(file_info: zipfile.ZipInfo) -> (bool, bytes): 49 | """return path of a zip element in bytes, 50 | and a flag is True if it is UTF-8 encoded 51 | """ 52 | is_utf8 = file_info.flag_bits & 0x800 53 | if not is_utf8: 54 | # filename is not encoded with utf-8 55 | return False, file_info.orig_filename.encode("cp437") 56 | else: 57 | return True, file_info.orig_filename.encode("utf-8") 58 | 59 | def guess_encoding(self): 60 | namelist = [] 61 | 62 | self.all_utf8 = True 63 | for file_info in self.zip_ref.infolist(): 64 | utf8, byte_name = self.byte_name(file_info) 65 | if not utf8: 66 | namelist.append(byte_name) 
67 | self.all_utf8 = False 68 | 69 | if not self.all_utf8: 70 | enc = chardet.detect(b' '.join(namelist)) 71 | logger.info(f' * Detected encoding : {enc["encoding"]} | ' 72 | f'Language:{enc["language"]} | ' 73 | f'Confidence:{enc["confidence"]:.0%} ') 74 | return enc["encoding"] 75 | else: 76 | logger.info(' * All file names are properly in UTF8 encoding') 77 | return 'UTF_8' 78 | 79 | def _is_folder_entry_as_file(self, entry_name): 80 | for entry in self.zip_ref.namelist(): 81 | if entry.startswith(entry_name) and len(entry) > len(entry_name): 82 | return True 83 | else: 84 | return False 85 | 86 | def _get_filename_map(self) -> dict: 87 | """ Map unreadable filename to correctly decoded one """ 88 | encoding = self.original_encoding 89 | name_map = {} 90 | for file_info in self.zip_ref.infolist(): 91 | if not (file_info.flag_bits & 0x800): 92 | # filename is not encoded with utf-8 93 | name_as_bytes: bytes = file_info.orig_filename.encode("cp437") 94 | name_as_str = name_as_bytes.decode(encoding, errors='replace') 95 | else: 96 | name_as_str = file_info.filename 97 | 98 | if file_info.file_size == 0 and not name_as_str.endswith('/'): 99 | if self._is_folder_entry_as_file(name_as_str): 100 | logger.warning(f'Malformed zipfile: Entry "{file_info.filename}" ' 101 | f'is a directory but is registered as a file.') 102 | continue 103 | name_map[file_info.filename] = name_as_str 104 | 105 | return name_map 106 | 107 | def _duplicated_root_name(self) -> bool: 108 | """Inside zip file is one folder whose name is zip filename""" 109 | paths = sorted(self.name_map.values()) # make sure the shortest name is listed first 110 | root = paths[0] 111 | has_root = all(x.startswith(root) for x in paths) 112 | if not has_root: 113 | return False 114 | 115 | zipname = self.zip_ref.filename.replace('.zip', '/') 116 | # always return a bool, as the annotation promises 117 | return zipname.endswith(root) 118 | 119 | def is_encrypted(self) -> bool: 120 | """Check if zipfile is password protected""" 121 | encrypted = False 122 | for 
file_info in self.zip_ref.infolist(): 123 | encrypted = bool(file_info.flag_bits & 0x1) 124 | if encrypted: 125 | break 126 | return encrypted 127 | 128 | def fix_it(self): 129 | """convert filename from nonUTF-8 to UTF-8""" 130 | with tempfile.TemporaryDirectory() as tmp_folder: 131 | tmp_folder = Path(tmp_folder) 132 | self.extract_all(tmp_folder) 133 | new_name = self.zip_path.parent / (self.zip_path.stem + '_fixed') 134 | folder_to_zip = tmp_folder # /self.zip_path.stem 135 | zip_it(new_name, folder_to_zip) 136 | 137 | if self.is_encrypted(): 138 | logger.warning(f" !!! Fixed zipfile is NOT password protected!") 139 | 140 | def _extract_individual(self, filename: str, output_path: Path, 141 | password: bytes = None) -> bool: 142 | """Extract 'filename' in zipfile to path 'output_path' with password 'password' """ 143 | 144 | try: 145 | with output_path.open("wb+") as output_file: 146 | stream = self.zip_ref.open(filename, pwd=password) 147 | shutil.copyfileobj(fsrc=stream, fdst=output_file) 148 | return True 149 | except RuntimeError as e: 150 | if 'Bad password' in str(e): 151 | logger.error(f"RuntimeError: Wrong password!") 152 | else: 153 | logger.error(e) 154 | return False 155 | except Exception as e: 156 | logger.error(e) 157 | return False 158 | 159 | def extract_all(self, destination: Path = None): 160 | """Extract content of zipfile with readable filename""" 161 | password = self.password 162 | destination = destination or self.destination 163 | 164 | if self.is_encrypted() and not password: 165 | password = getpass.getpass().encode() 166 | 167 | for original_name, decoded_name in self.name_map.items(): 168 | if decoded_name.endswith("/"): 169 | # skip subdirectory 170 | continue 171 | 172 | logger.info(f"Extracting: {decoded_name}") 173 | fo = destination / decoded_name 174 | fo.parent.mkdir(parents=True, exist_ok=True) 175 | extract_ok = self._extract_individual(original_name, fo, password) 176 | if not extract_ok: 177 | break 178 | else: 179 | 
logger.info("Finished") 180 | 181 | def __repr__(self): 182 | basic = f" * Default destination: {self.default_destination}\n" \ 183 | f" * Password protected : {self.is_encrypted()}" 184 | 185 | try_enc = (not self.all_utf8) and f' try encoding: {self.original_encoding} ' or '' 186 | txt = [basic, try_enc.center(79, '-')] 187 | for file_info in self.zip_ref.infolist(): 188 | if not (file_info.flag_bits & 0x800): 189 | name_as_bytes = file_info.orig_filename.encode("cp437") 190 | name_as_str = name_as_bytes.decode(self.original_encoding, "replace") 191 | else: 192 | name_as_str = "(UTF-8) " + file_info.filename 193 | txt.append(name_as_str) 194 | txt.append('-' * len(txt[1])) 195 | 196 | txt.append("Add '-enc ENCODING' to see filename shown in encoding " 197 | "ENCODING (mbcs, cp932, shift-jis,...)") 198 | txt.append("Add '-x' flag to extract all files to " 199 | "default destination") 200 | return '\n'.join(txt) 201 | 202 | 203 | def entry_point(): 204 | parser = ArgumentParser(description='Fix filename encoding error ' 205 | 'inside a zip file.') 206 | parser.add_argument('zipfile', help='path to zip file') 207 | parser.add_argument('destination', nargs='?', default="", 208 | help='folder path to extract zip file') 209 | parser.add_argument('--extract', '-x', action='store_true', 210 | help='extract the zipfile to specified destination') 211 | parser.add_argument('--fix', '-f', action='store_true', 212 | help='create a new zip file with UTF-8 file names') 213 | parser.add_argument('--encoding', '-enc', 214 | help='zip file used encoding: shift-jis, cp932...') 215 | parser.add_argument('--password', '-pwd', default='', 216 | help='password to extract zip file') 217 | 218 | args = parser.parse_args() 219 | try: 220 | if args.extract: 221 | zhdl = ZipHandler(path=args.zipfile, encoding=args.encoding, 222 | password=args.password.encode('utf8'), 223 | extract_path=args.destination) 224 | zhdl.extract_all() 225 | elif args.fix: 226 | zhdl = 
ZipHandler(path=args.zipfile, encoding=args.encoding, 227 | password=args.password.encode('utf8'), 228 | extract_path=args.destination) 229 | zhdl.fix_it() 230 | else: 231 | zhdl = ZipHandler(path=args.zipfile, encoding=args.encoding) 232 | print(zhdl) 233 | # except Exception as e: 234 | # logger.error(e) 235 | finally: 236 | pass 237 | 238 | 239 | if __name__ == '__main__': 240 | entry_point() --------------------------------------------------------------------------------