├── .gitignore
├── LICENSE
├── README.md
├── pyproject.toml
├── requirements.txt
├── tests
│   ├── 20200524_ドラゴンフライト.zip
│   ├── 20200524_ドラゴンボール.zip
│   ├── 20200524_フラット.zip
│   ├── 20200524_フラットpwd.zip
│   ├── 202301-03_hokkaido_jukyu.zip
│   ├── test_zipu.py
│   └── ミックス.zip
└── zip_unicode
    ├── __init__.py
    └── main.py
/.gitignore:
--------------------------------------------------------------------------------
1 | /venv/
2 | /build/
3 | /dist/
4 | /ZipUnicode.egg-info/
5 | /.idea/
6 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) [2020] [Nguyen Ba Duc Tin]
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ZipUnicode
2 | Make the unreadable-filename problem go away when extracting zip files.
3 |
4 | [](https://pepy.tech/project/zipunicode)
5 | [](https://pypi.org/project/zipunicode/)
6 | [](https://github.com/Dragon2fly/zipunicode/blob/master/LICENSE)
7 |
8 | ## Install:
9 | Using pip: `pip install ZipUnicode`
10 |
11 | Besides installing the `zip_unicode` package,
12 | this also creates a `zipu` executable on your system `PATH`
13 | so that you can work with zip files directly from the console.
14 |
15 | ## Filename encoding inside a zip file
16 | Everyone agrees on what a zip file is and how to make one:
17 | you turn a collection of files into a sequence of bytes
18 | and put `.zip` at the end of the name of the newly created file.
19 | But nobody agreed on how filenames should be encoded.
20 | So it is up to the extracting program to interpret the stored bytes as a filename.
21 |
22 | Most OSes use UTF-8 for filename encoding and set a flag bit in the zip file to indicate that.
23 | Windows, however, is the exception. Depending on the language, Windows uses a different `code page`
24 | to encode filenames. So, if you create a zip file containing a file named `ê.txt` on Linux and
25 | extract it on Windows, you may get something like `├¬.txt` or `テェ.txt`.
26 |
27 | The exact filename depends on the `code page` or `language` that Windows is using.
28 | The same thing happens when a zip file containing non-ASCII filenames is created on Windows
29 | and then extracted on Linux, or on a Windows machine that uses a different `code page`.
30 |
31 | In short, if a filename wasn't encoded with `UTF-8`,
32 | there is no easy way to know at extraction time which `encoding` (or `code page`) was used.
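
The effect is easy to reproduce in Python. The following is a minimal sketch, not part of `zipu`: it encodes `ê.txt` as UTF-8 and then misreads the same bytes with two legacy code pages. The choice of `cp437` and `cp932` is only an illustration of "a different code page".

```python
# UTF-8 bytes of the name, as a Linux zip tool would typically store them
raw = "ê.txt".encode("utf-8")

# Misinterpret the very same bytes with two legacy code pages
as_cp437 = raw.decode("cp437")  # DOS / Western code page
as_cp932 = raw.decode("cp932")  # Japanese Windows code page

# Both results differ from the original name although the bytes never changed
garbled = {"cp437": as_cp437, "cp932": as_cp932}
```

With `cp437` the result is `├¬.txt`; with `cp932` the two bytes become half-width katakana.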
33 |
34 | ## Overview
35 | You will use `zipu` to interact with zip files.
36 |
37 | ```bash
38 | $ zipu -h
39 | ```
40 |
41 | ```bash
42 | usage: zipu [-h] [--extract] [--fix] [--encoding ENCODING]
43 |             [--password PASSWORD]
44 |             zipfile [destination]
45 |
46 | Fix filename encoding error inside a zip file.
47 |
48 | positional arguments:
49 |   zipfile               path to zip file
50 |   destination           folder path to extract zip file
51 |
52 | optional arguments:
53 |   -h, --help            show this help message and exit
54 |   --extract, -x         extract the zipfile to specified destination
55 |   --fix, -f             create a new zip file with UTF-8 file names
56 |   --encoding ENCODING, -enc ENCODING
57 |                         zip file used encoding: shift-jis, cp932...
58 |   --password PASSWORD, -pwd PASSWORD
59 |                         password to extract zip file
60 | ```
61 |
62 | Extracting a zip file is as simple as `zipu -x file.zip`.
63 | Files are extracted into a folder that has the same name as `file.zip` without the `.zip` extension
64 | and sits in the same directory as `file.zip`. Filename `encoding` is handled automatically.
65 |
66 | You can also make sure your zip file opens correctly on all computers with `zipu -f file.zip`.
67 | This creates a new `file_fixed.zip` in which all file names are encoded with `UTF-8`.
68 |
69 | ## Usage:
70 | 1. View content of the zip file:
71 |
72 | Simply point `zipu` to your zip file's path as follows:
73 |
74 | ```bash
75 | zipu path/to/file.zip
76 | ```
77 |
78 | This makes `zipu` do the following:
79 | * automatically guess the encoding that was used to encode the file names
80 | * check whether the file is password protected
81 | * give you a default extract destination if you don't provide one
82 |
83 | Then it shows a summary of the contents of the zip file,
84 | similar to the following:
85 |
86 | D:\tmp>zipu 20200524_ドラゴンフライト.zip
87 |
88 | ```bash
89 | * Detected encoding : SHIFT_JIS | Language:Japanese | Confidence:99%
90 | * Default destination: D:\tmp
91 | * Password protected : False
92 | --------------------------- try encoding: SHIFT_JIS ---------------------------
93 | 20200524_ドラゴンフライト/
94 | 20200524_ドラゴンフライト/テストレポート_リナックスノード.txt
95 | 20200524_ドラゴンフライト/太陽バッテリーver5.txt
96 | 20200524_ドラゴンフライト/経営報告_桜ちゃん.txt
97 | -------------------------------------------------------------------------------
98 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...)
99 | Add '-x' flag to extract all files to default destination
100 | ```
101 |
102 | If the archive contains a single root folder with the same name as the zip file, as in the example above,
103 | the `default destination` is the parent folder of the zip file.
104 | Otherwise, the `default destination` is a subdirectory
105 | named after the zip file, as in the following case:
106 |
107 | D:\tmp>zipu 20200524_ドラゴンボール.zip
108 |
109 | ```bash
110 | * Detected encoding : SHIFT_JIS | Language:Japanese | Confidence:99%
111 | * Default destination: D:\tmp\20200524_ドラゴンボール
112 | * Password protected : False
113 | --------------------------- try encoding: SHIFT_JIS ---------------------------
114 | テストレポート_リナックスノード.txt
115 | 太陽バッテリーver5.txt
116 | 経営報告_桜ちゃん.txt
117 | -------------------------------------------------------------------------------
118 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...)
119 | Add '-x' flag to extract all files to default destination
120 | ```
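
The rule for the default destination can be sketched in a few lines of Python. This is a simplified illustration: the function name `default_destination` is hypothetical, and the actual check in `zipu` (`_duplicated_root_name`) differs in detail. `PurePosixPath` is used only so that the example behaves the same on every OS.

```python
from pathlib import PurePosixPath


def default_destination(zip_path, names):
    """Return the folder that extraction would default to.

    `names` are the decoded entry names inside the archive.
    """
    zip_path = PurePosixPath(zip_path)
    root = zip_path.stem + "/"
    # A single root folder named after the archive: extract next to the zip
    if names and all(name.startswith(root) for name in names):
        return zip_path.parent
    # Otherwise extract into a subdirectory named after the archive
    return zip_path.parent / zip_path.stem


with_root = default_destination("/tmp/report.zip", ["report/", "report/a.txt"])
flat = default_destination("/tmp/report.zip", ["a.txt", "b.txt"])
```

Here `with_root` is `/tmp`, because the archive already brings its own `report/` folder, while `flat` is `/tmp/report`.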
121 |
122 | 2. View content with a specific encoding:
123 |
124 | Encoding auto-detection is not always correct. When the sample is too small
125 | and some byte sequences valid in encoding `A` are also valid in encoding `B`, `B` may be wrongly detected
126 | instead of `A`. In such cases, you can specify the encoding that you believe
127 | is the correct one with the `-enc ENCODING` switch.
128 |
129 | D:\tmp>zipu 20200524_ドラゴンボール.zip -enc cp932
130 |
131 | ```bash
132 | * Default destination: D:\tmp\20200524_ドラゴンボール
133 | * Password protected : False
134 | --------------------------- try encoding: cp932 ---------------------------
135 | テストレポート_リナックスノード.txt
136 | 太陽バッテリーver5.txt
137 | 経営報告_桜ちゃん.txt
138 | ---------------------------------------------------------------------------
139 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...)
140 | Add '-x' flag to extract all files to default destination
141 | ```
142 |
143 | If the `ENCODING` you specify is wrong and cannot decode some bytes,
144 | those bytes are replaced by `�`.
145 |
146 | D:\tmp>zipu 20200524_ドラゴンボール.zip -enc ascii
147 |
148 | ```bash
149 | * Default destination: D:\tmp\20200524_ドラゴンボール
150 | * Password protected : False
151 | --------------------------- try encoding: ascii ---------------------------
152 | �e�X�g���|�[�g�Q���i�b�N�X�m�[�h.txt
153 | ���z�o�b�e���[ver5.txt
154 | �o�c��_�������.txt
155 | ---------------------------------------------------------------------------
156 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...)
157 | Add '-x' flag to extract all files to default destination
158 | ```
159 |
160 | Or the bytes may be mapped to completely different characters:
161 |
162 | D:\tmp>zipu 20200524_ドラゴンボール.zip -enc utf16
163 |
164 | ```bash
165 | * Default destination: D:\tmp\20200524_ドラゴンボール
166 | * Password protected : False
167 | --------------------------- try encoding: utf16 ---------------------------
168 | 斃境枃貃粃宁枃冁誃榃抃亃境涃宁梃琮瑸
169 | 뺑窗澃抃斃誃宁敶㕲琮瑸
170 | 澌掉邍赟苷芿苡⻱硴�
171 | ---------------------------------------------------------------------------
172 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...)
173 | Add '-x' flag to extract all files to default destination
174 | ```
175 |
176 | When auto-detection fails, it is up to you to decide which `ENCODING` is the correct one.
177 |
178 | **Warning**: If your console font does not cover the full Unicode range, as is often the case on Windows,
179 | some characters are shown as a dot `・`.
180 | This is not the result of a wrong encoding; the font simply lacks those characters.
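
Trying an encoding by hand amounts to decoding the stored bytes with each candidate and looking at the result. The snippet below is a minimal sketch of that idea using only built-in codecs; the candidate list is an assumption chosen for illustration, not what `zipu` iterates over.

```python
# Bytes of a Japanese filename as a cp932 (Shift-JIS) zip tool would store them
stored = "太陽バッテリーver5.txt".encode("cp932")

candidates = ["cp932", "ascii"]
# errors="replace" turns undecodable bytes into U+FFFD (�) instead of raising
trials = {enc: stored.decode(enc, errors="replace") for enc in candidates}

# A candidate that produced no replacement character is at least plausible
plausible = [enc for enc, text in trials.items() if "\ufffd" not in text]
```

Only `cp932` survives in `plausible`. Note that a clean decode is necessary but not sufficient: as the `utf16` example shows, a wrong encoding can also decode without any `�` and still produce nonsense.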
181 |
182 | 3. Extract the zip file:
183 |
184 | Usually, encoding auto-detection works just fine, so you can jump right to extraction with
185 | `zipu -x path/to/file.zip`. The `-x` argument can be placed either **before or after** the path to the zip file.
186 |
187 | D:\tmp>zipu 20200524_ドラゴンフライト.zip -x
188 |
189 | ```bash
190 | * Detected encoding : SHIFT_JIS | Language:Japanese | Confidence:99%
191 | Extracting: 20200524_ドラゴンフライト/テストレポート_リナックスノード.txt
192 | Extracting: 20200524_ドラゴンフライト/太陽バッテリーver5.txt
193 | Extracting: 20200524_ドラゴンフライト/経営報告_桜ちゃん.txt
194 | Finished
195 | ```
196 |
197 | As mentioned before, when no `destination` is specified, the zip file is extracted to
198 | a directory that sits next to the zip file and is named after it.
199 | In the above example, that would be `D:\tmp\20200524_ドラゴンフライト`.
200 |
201 | To specify the extract `destination`, add it right after the zip file's path:
202 |
203 | zipu -x path/to/file.zip path/to/extract
204 |
205 | If the output file names are unreadable,
206 | you have to find the right `ENCODING` with the `-enc` switch as described in **2. View content with a specific encoding**.
207 | Then you can use that `ENCODING` to extract the zip file:
208 |
209 | zipu path/to/file.zip -x -enc ENCODING
210 |
211 | 4. A password-protected zip file:
212 |
213 | If a zip file is encrypted, ` * Password protected : True` shows up when viewing its content.
214 | When extracting the zip file, you are asked for the `password` if you haven't provided one.
215 | You can also specify the password directly in the command as follows:
216 |
217 | zipu path/to/file.zip -x -pwd PASSWORD
218 |
219 | 5. Mixed contents:
220 |
221 | Some zip files are tricky: they contain file names in different encodings, some `UTF-8`, some not.
222 | `zipu` leaves files marked as `UTF-8` as they are and applies the detected or specified `ENCODING` to the others.
223 | A `UTF-8` encoded filename is prefixed with `(UTF-8) ` in the content view:
224 |
225 | D:\tmp>zipu ミックス.zip
226 |
227 | ```bash
228 | * Detected encoding : SHIFT_JIS | Language:Japanese | Confidence:63%
229 | * Default destination: D:\tmp\ミックス
230 | * Password protected : False
231 | --------------------------- try encoding: SHIFT_JIS ---------------------------
232 | (UTF-8) Vùng Trời Bình Yên.txt
233 | бореиская.txt
234 | テストレポート_リナックスノード.txt
235 | 太陽バッテリーver5.txt
236 | 経営報告_桜ちゃん.txt
237 | -------------------------------------------------------------------------------
238 | Add '-enc ENCODING' to see filename shown in encoding ENCODING (mbcs, cp932, shift-jis,...)
239 | Add '-x' flag to extract all files to default destination
240 | ```
241 |
242 | When extracting, a `UTF-8` encoded filename is not decoded again with the detected `ENCODING`,
243 | so you can read it as is.
244 |
245 | **Warning**: `zipu` cannot handle a zip file that contains three or more encodings, or two encodings
246 | neither of which is `UTF-8`. In such cases, you have to extract the zip file once for each encoding.
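
Whether an entry is marked as `UTF-8` is recorded per file in bit 11 (`0x800`) of its flag field, which Python's standard `zipfile` module exposes as `ZipInfo.flag_bits`. The following sketch builds a small archive in memory and reads the flag back. It relies on the fact that `zipfile` itself sets this bit when it writes a non-ASCII name; the file names are just examples.

```python
import io
import zipfile

# Build a small archive in memory with one ASCII and one non-ASCII name
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plain.txt", b"")
    archive.writestr("Vùng Trời Bình Yên.txt", b"")

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag

# Read the archive back and check the flag of every entry
with zipfile.ZipFile(buffer) as archive:
    marked = {info.filename: bool(info.flag_bits & UTF8_FLAG)
              for info in archive.infolist()}
```

Entries whose flag is not set are the ones whose names have to be re-decoded with a guessed or user-supplied encoding.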
247 |
248 | 6. Fixing a zip file:
249 |
250 | If you make a zip file containing file names that are neither `UTF-8` nor `ASCII` encoded,
251 | you can make sure that colleagues whose computers use a different language can
252 | open the zip just fine as follows:
253 |
254 | ```bash
255 | zipu -f path/to/file.zip
256 | ```
257 |
258 | This first extracts your zip file (converting all file names to `UTF-8`).
259 | Then it compresses the extracted contents and adds the `_fixed` suffix to the zip filename.
260 | The fixed zip file is placed in the same directory as the original one.
261 |
262 | **Warning**: `zipu` cannot create a password-encrypted zip file.
263 | For such files, you have to first extract them with `zipu` and then re-zip them
264 | with your usual tool.
265 |
266 | ## Changelog
267 | ### 1.1.0
268 | * Handle malformed zip files: some zip files contain folders that are registered as file entries.
269 | These entries have a size of zero and are extracted as zero-byte files.
270 | Since the OS doesn't allow a file and a folder of the same name
271 | within the same directory, `zipu` could not go on to create the folder and extract the files inside.
272 | Now `zipu` checks for those malformed entries and skips them.
273 | * Fixing a zip file from the command line with `zipu -f` now works normally.
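
The malformed case described in the first item can be reproduced and detected with the standard `zipfile` module alone. This is a sketch under stated assumptions: the helper name `folder_entries_stored_as_files` and the file names are hypothetical, and the detection rule is a simplified take on the check described above.

```python
import io
import zipfile

# Build an archive where the folder "202301" is wrongly stored as an empty file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("202301", b"")
    archive.writestr("202301/data.csv", b"a,b\n")
    archive.writestr("empty.txt", b"")


def folder_entries_stored_as_files(archive):
    """Return zero-byte file entries that other entries use as a folder."""
    names = archive.namelist()
    suspects = []
    for info in archive.infolist():
        if info.file_size == 0 and not info.filename.endswith("/"):
            prefix = info.filename + "/"
            # Another entry lives "inside" this name, so it must be a folder
            if any(name.startswith(prefix) for name in names):
                suspects.append(info.filename)
    return suspects


with zipfile.ZipFile(buffer) as archive:
    malformed = folder_entries_stored_as_files(archive)
```

Only `202301` is reported. The genuinely empty file `empty.txt` is left alone because no other entry uses it as a path prefix.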
274 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["setuptools>=61.0", "chardet"]
3 | build-backend = "setuptools.build_meta"
4 |
5 | [project]
6 | name = "ZipUnicode"
7 | authors = [
8 | {name = "Nguyen Ba Duc Tin", email = "nguyenbaduc.tin@gmail.com"},
9 | ]
10 | description = "Fix unreadable file names when extracting zip file"
11 | readme = "README.md"
12 | requires-python = ">=3.6"
13 | license = {file = "LICENSE"}
14 | dependencies = [
15 | 'chardet>=3.0.0',
16 | ]
17 | classifiers=[
18 | 'Development Status :: 5 - Production/Stable',
19 | 'Intended Audience :: End Users/Desktop',
20 | 'Intended Audience :: Developers',
21 | 'Intended Audience :: Information Technology',
22 | 'License :: OSI Approved :: MIT License',
23 | 'Programming Language :: Python',
24 | 'Programming Language :: Python :: 3.6',
25 | 'Programming Language :: Python :: 3.7',
26 | 'Programming Language :: Python :: 3.8',
27 | 'Programming Language :: Python :: 3.9',
28 | 'Programming Language :: Python :: 3.10',
29 | 'Programming Language :: Python :: 3.11',
30 | 'Operating System :: OS Independent',
31 | 'Topic :: Software Development :: Debuggers',
32 | 'Topic :: Home Automation',
33 | 'Topic :: Office/Business',
34 | 'Topic :: Scientific/Engineering :: Artificial Intelligence',
35 | 'Topic :: Scientific/Engineering :: Information Analysis',
36 | 'Topic :: Utilities'
37 | ]
38 | dynamic = ["version"]
39 |
40 | [project.scripts]
41 | zipu = "zip_unicode.main:entry_point"
42 |
43 | [project.urls]
44 | homepage = "https://github.com/Dragon2fly/ZipUnicode"
45 |
46 | [tool.setuptools]
47 | packages = ["zip_unicode"]
48 |
49 | [tool.setuptools.dynamic]
50 | version = {attr = "zip_unicode.__version__"}
51 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | chardet >= 3.0.0
2 |
3 | pytest>=5.4.2
4 | setuptools>=40.8.0
--------------------------------------------------------------------------------
/tests/20200524_ドラゴンフライト.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/20200524_ドラゴンフライト.zip
--------------------------------------------------------------------------------
/tests/20200524_ドラゴンボール.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/20200524_ドラゴンボール.zip
--------------------------------------------------------------------------------
/tests/20200524_フラット.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/20200524_フラット.zip
--------------------------------------------------------------------------------
/tests/20200524_フラットpwd.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/20200524_フラットpwd.zip
--------------------------------------------------------------------------------
/tests/202301-03_hokkaido_jukyu.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/202301-03_hokkaido_jukyu.zip
--------------------------------------------------------------------------------
/tests/test_zipu.py:
--------------------------------------------------------------------------------
1 | __author__ = "Duc Tin"
2 |
3 | from pathlib import Path
4 |
5 | import pytest
6 |
7 | from zip_unicode import ZipHandler
8 |
9 | root_folder = ZipHandler('20200524_ドラゴンフライト.zip')
10 | root_folder2 = ZipHandler('20200524_ドラゴンボール.zip')
11 | flat = ZipHandler('20200524_フラット.zip')
12 | flat_pwd = ZipHandler('20200524_フラットpwd.zip')
13 | mixed = ZipHandler('ミックス.zip')
14 | nested_subfolder = ZipHandler('202301-03_hokkaido_jukyu.zip') # with folder entry registered as a file
15 |
16 |
17 | def clean_up(path: Path):
18 |     if path.is_dir():
19 |         for f in path.iterdir():
20 |             clean_up(f) if f.is_dir() else f.unlink()
21 |         else:
22 |             path.rmdir()
23 |     else:
24 |         path.unlink()
25 |
26 |
27 | def test_byte_name():
28 |     res = {b'V\xc3\xb9ng Tr\xe1\xbb\x9di B\xc3\xacnh Y\xc3\xaan.txt': True,
29 |            b'\x84q\x84\x80\x84\x82\x84u\x84y\x84\x83\x84{\x84p\x84\x91.txt': False,
30 |            b'\x83e\x83X\x83g\x83\x8c\x83|\x81[\x83g\x81Q\x83\x8a\x83i\x83b\x83N\x83X\x83m\x81[\x83h.txt': False,
31 |            b'\x91\xbe\x97z\x83o\x83b\x83e\x83\x8a\x81[ver5.txt': False,
32 |            b'\x8co\x89c\x95\xf1\x8d\x90_\x8d\xf7\x82\xbf\x82\xe1\x82\xf1.txt': False,
33 |            }
34 |
35 |     for file_info in mixed.zip_ref.infolist():
36 |         is_utf8, name = mixed.byte_name(file_info)
37 |         assert res[name] == is_utf8
38 |
39 |
40 | def test_guess_encoding():
41 |     assert flat.original_encoding == 'SHIFT_JIS'
42 |
43 |
44 | def test_get_filename_map():
45 |     names = ['Vùng Trời Bình Yên.txt', 'бореиская.txt', 'テストレポート_リナックスノード.txt',
46 |              '太陽バッテリーver5.txt', '経営報告_桜ちゃん.txt']
47 |     encodings = ['utf8', 'cp932', 'cp932', 'cp932', 'cp932']
48 |
49 |     wrong_encoded = [x.encode(enc) for x, enc in zip(names[1:], encodings[1:])]
50 |     wrong_decoded = [x.decode('cp437') for x in wrong_encoded]
51 |     wrong_decoded.insert(0, 'Vùng Trời Bình Yên.txt')  # utf8 is left intact
52 |
53 |     assert mixed.name_map == dict(zip(wrong_decoded, names))
54 |
55 |
56 | def test_duplicated_root_name():
57 |     assert root_folder._duplicated_root_name()
58 |     assert not root_folder2._duplicated_root_name()
59 |     assert not flat._duplicated_root_name()
60 |     assert not flat_pwd._duplicated_root_name()
61 |     assert not mixed._duplicated_root_name()
62 |
63 |
64 | def test_is_encrypted():
65 |     assert flat_pwd.is_encrypted()
66 |     assert not flat.is_encrypted()
67 |     assert not mixed.is_encrypted()
68 |
69 |
70 | def test_extract_individual():
71 |     name = "テストレポート_リナックスノード.txt".encode('cp932').decode('cp437')
72 |     out = Path('test_extract_individual.txt')
73 |     flat._extract_individual(name, out)
74 |     assert out.read_text(encoding='cp932') == '何もない'
75 |     out.unlink()
76 |
77 |
78 | def test_extract_all():
79 |     # all filenames have the same encoding
80 |     expect = {'テストレポート_リナックスノード.txt', '経営報告_桜ちゃん.txt', '太陽バッテリーver5.txt'}
81 |     out = Path('test_extract_all_one_enc')
82 |     flat.extract_all(out)
83 |     assert set(x.name for x in out.iterdir()) == expect
84 |     clean_up(out)
85 |
86 |     # some files are UTF8 encoded, some are not
87 |     expect = {'Vùng Trời Bình Yên.txt', 'бореиская.txt', 'テストレポート_リナックスノード.txt',
88 |               '太陽バッテリーver5.txt', '経営報告_桜ちゃん.txt'}
89 |     out = Path('test_extract_all_mixed_enc')
90 |     mixed.extract_all(out)
91 |     assert set(x.name for x in out.iterdir()) == expect
92 |     clean_up(out)
93 |
94 |     # multiple sub-folder with subfolders entry as a file
95 |     # aka malformed zipfile
96 |     expect = {'202301', '202302', '202303'}
97 |     out = Path('test_extract_all_multiple_sub_folder')
98 |     nested_subfolder.extract_all(out)
99 |     assert set(x.name for x in out.iterdir()) == expect
100 |     clean_up(out)
101 |
102 |
103 | def test_extract_all_with_pwd(caplog):
104 |     expect = {'テストレポート_リナックスノード.txt', '経営報告_桜ちゃん.txt', '太陽バッテリーver5.txt'}
105 |     out = Path('test_extract_all')
106 |
107 |     with pytest.raises(OSError) as e:
108 |         # password input required
109 |         flat_pwd.extract_all(out)
110 |
111 |     flat_pwd.password = b'WrongPassword'
112 |     flat_pwd.extract_all(out)
113 |
114 |     capture = caplog.text
115 |     assert 'Wrong password!' in capture
116 |
117 |     flat_pwd.password = b'password'
118 |     flat_pwd.extract_all(out)
119 |     assert set(x.name for x in out.iterdir()) == expect
120 |
121 |     file_1 = Path('test_extract_all/テストレポート_リナックスノード.txt')
122 |     assert file_1.read_text(encoding='cp932') == '何もない'
123 |
124 |     # clean up
125 |     flat_pwd.password = None
126 |     clean_up(out)
127 |
128 |
129 | def test_extract_all_with_root_folder():
130 |     out1 = Path(root_folder.zip_ref.filename.replace('.zip', ''))
131 |     root_folder.extract_all()
132 |     assert not any(x.is_dir() for x in out1.iterdir())
133 |     clean_up(out1)
134 |
135 |     out2 = Path('specified_path')
136 |     root_folder.extract_all(out2)
137 |     assert (out2 / out1.name).exists()
138 |     clean_up(out2)
139 |
140 |     out3 = Path('ミックス')
141 |     mixed.extract_all()
142 |     assert out3.exists() and len(list(out3.iterdir())) == 5
143 |     clean_up(out3)
144 |
145 |
146 | @pytest.mark.parametrize('my_zip', [root_folder, flat, mixed])
147 | def test_fix_it(my_zip):
148 |     my_zip.fix_it()
149 |
150 |     name = my_zip.zip_path.stem
151 |     fixed = my_zip.zip_path.parent / (name + '_fixed.zip')
152 |     fixed_zip = ZipHandler(fixed)
153 |     assert fixed_zip.all_utf8
154 |
155 |     fixed_zip.zip_ref.close()
156 |     clean_up(fixed)
157 |
--------------------------------------------------------------------------------
/tests/ミックス.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Dragon2fly/ZipUnicode/412b9422469069fe580c219ef683639a4192e088/tests/ミックス.zip
--------------------------------------------------------------------------------
/zip_unicode/__init__.py:
--------------------------------------------------------------------------------
1 | __author__ = "Duc Tin"
2 | from .main import ZipHandler, __version__
3 |
--------------------------------------------------------------------------------
/zip_unicode/main.py:
--------------------------------------------------------------------------------
1 | __author__ = "Duc Tin"
2 | __version__ = "1.1.1"
3 |
4 | import getpass
5 | import logging
6 | import shutil
7 | import sys
8 | import zipfile
9 | import tempfile
10 | from pathlib import Path
11 | from argparse import ArgumentParser
12 |
13 | import chardet
14 |
15 |
16 | # Disable chardet logger
17 | logging.getLogger('chardet').level = logging.ERROR
18 |
19 | # Config our logger
20 | logging.basicConfig(format='%(message)s', stream=sys.stdout, level=logging.INFO)
21 | logger = logging.getLogger('zip_unicode')
22 |
23 |
24 | def zip_it(base_name, root_dir):
25 |     logger.info("Creating archive:")
26 |     shutil.make_archive(base_name, 'zip', root_dir, logger=logger)
27 |
28 |
29 | class ZipHandler:
30 |     def __init__(self, path: str, encoding: str = None,
31 |                  password: bytes = None, extract_path: str = ""):
32 |
33 |         self.zip_path = Path(path)
34 |         self.zip_ref = zipfile.ZipFile(self.zip_path)
35 |
36 |         self.all_utf8 = None
37 |         self.original_encoding = encoding or self.guess_encoding()
38 |         self.password = password
39 |         self.name_map = self._get_filename_map()
40 |
41 |         if self._duplicated_root_name():
42 |             self.default_destination = self.zip_path.parent.absolute()
43 |         else:
44 |             self.default_destination = self.zip_path.parent.absolute() / self.zip_path.stem
45 |         self.destination = Path(extract_path) if extract_path else self.default_destination
46 |
47 |     @staticmethod
48 |     def byte_name(file_info: zipfile.ZipInfo) -> (bool, bytes):
49 |         """return path of a zip element in bytes,
50 |         and a flag is True if it is UTF-8 encoded
51 |         """
52 |         is_utf8 = file_info.flag_bits & 0x800
53 |         if not is_utf8:
54 |             # filename is not encoded with utf-8
55 |             return False, file_info.orig_filename.encode("cp437")
56 |         else:
57 |             return True, file_info.orig_filename.encode("utf-8")
58 |
59 |     def guess_encoding(self):
60 |         namelist = []
61 |
62 |         self.all_utf8 = True
63 |         for file_info in self.zip_ref.infolist():
64 |             utf8, byte_name = self.byte_name(file_info)
65 |             if not utf8:
66 |                 namelist.append(byte_name)
67 |                 self.all_utf8 = False
68 |
69 |         if not self.all_utf8:
70 |             enc = chardet.detect(b' '.join(namelist))
71 |             logger.info(f' * Detected encoding : {enc["encoding"]} | '
72 |                         f'Language:{enc["language"]} | '
73 |                         f'Confidence:{enc["confidence"]:.0%} ')
74 |             return enc["encoding"]
75 |         else:
76 |             logger.info(' * All file names are properly in UTF8 encoding')
77 |             return 'UTF_8'
78 |
79 |     def _is_folder_entry_as_file(self, entry_name):
80 |         prefix = entry_name + '/'  # match whole path components only
81 |         for entry in self.zip_ref.namelist():
82 |             if entry.startswith(prefix):
83 |                 return True
84 |         return False
85 |
86 |     def _get_filename_map(self) -> dict:
87 |         """ Map unreadable filename to correctly decoded one """
88 |         encoding = self.original_encoding
89 |         name_map = {}
90 |         for file_info in self.zip_ref.infolist():
91 |             if not (file_info.flag_bits & 0x800):
92 |                 # filename is not encoded with utf-8
93 |                 name_as_bytes: bytes = file_info.orig_filename.encode("cp437")
94 |                 name_as_str = name_as_bytes.decode(encoding, errors='replace')
95 |             else:
96 |                 name_as_str = file_info.filename
97 |
98 |             if file_info.file_size == 0 and not name_as_str.endswith('/'):
99 |                 if self._is_folder_entry_as_file(file_info.filename):  # compare raw names with raw names
100 |                     logger.warning(f'Malformed zipfile: Entry "{file_info.filename}" '
101 |                                    f'is a directory but is registered as a file.')
102 |                     continue
103 |             name_map[file_info.filename] = name_as_str
104 |
105 |         return name_map
106 |
107 |     def _duplicated_root_name(self) -> bool:
108 |         """Inside zip file is one folder whose name is zip filename"""
109 |         paths = sorted(self.name_map.values())  # make sure the shortest name is listed first
110 |         root = paths[0]
111 |         has_root = all(x.startswith(root) for x in paths)
112 |         if not has_root:
113 |             return False
114 |
115 |         zipname = self.zip_ref.filename.replace('.zip', '/')
116 |         # always return a bool instead of an implicit None
117 |         return zipname.endswith(root)
118 |
119 |     def is_encrypted(self) -> bool:
120 |         """Check if zipfile is password protected"""
121 |         encrypted = False
122 |         for file_info in self.zip_ref.infolist():
123 |             encrypted = bool(file_info.flag_bits & 0x1)
124 |             if encrypted:
125 |                 break
126 |         return encrypted
127 |
128 |     def fix_it(self):
129 |         """convert filename from nonUTF-8 to UTF-8"""
130 |         with tempfile.TemporaryDirectory() as tmp_folder:
131 |             tmp_folder = Path(tmp_folder)
132 |             self.extract_all(tmp_folder)
133 |             new_name = self.zip_path.parent / (self.zip_path.stem + '_fixed')
134 |             folder_to_zip = tmp_folder  # /self.zip_path.stem
135 |             zip_it(new_name, folder_to_zip)
136 |
137 |         if self.is_encrypted():
138 |             logger.warning(f" !!! Fixed zipfile is NOT password protected!")
139 |
140 |     def _extract_individual(self, filename: str, output_path: Path,
141 |                             password: bytes = None) -> bool:
142 |         """Extract 'filename' in zipfile to path 'output_path' with password 'password' """
143 |
144 |         try:
145 |             with output_path.open("wb+") as output_file:
146 |                 stream = self.zip_ref.open(filename, pwd=password)
147 |                 shutil.copyfileobj(fsrc=stream, fdst=output_file)
148 |             return True
149 |         except RuntimeError as e:
150 |             if 'Bad password' in str(e):
151 |                 logger.error(f"RuntimeError: Wrong password!")
152 |             else:
153 |                 logger.error(e)
154 |             return False
155 |         except Exception as e:
156 |             logger.error(e)
157 |             return False
158 |
159 |     def extract_all(self, destination: Path = None):
160 |         """Extract content of zipfile with readable filename"""
161 |         password = self.password
162 |         destination = destination or self.destination
163 |
164 |         if self.is_encrypted() and not password:
165 |             password = getpass.getpass().encode()
166 |
167 |         for original_name, decoded_name in self.name_map.items():
168 |             if decoded_name.endswith("/"):
169 |                 # skip subdirectory
170 |                 continue
171 |
172 |             logger.info(f"Extracting: {decoded_name}")
173 |             fo = destination / decoded_name
174 |             fo.parent.mkdir(parents=True, exist_ok=True)
175 |             extract_ok = self._extract_individual(original_name, fo, password)
176 |             if not extract_ok:
177 |                 break
178 |         else:
179 |             logger.info("Finished")
180 |
181 |     def __repr__(self):
182 |         basic = f" * Default destination: {self.default_destination}\n" \
183 |                 f" * Password protected : {self.is_encrypted()}"
184 |
185 |         try_enc = (not self.all_utf8) and f' try encoding: {self.original_encoding} ' or ''
186 |         txt = [basic, try_enc.center(79, '-')]
187 |         for file_info in self.zip_ref.infolist():
188 |             if not (file_info.flag_bits & 0x800):
189 |                 name_as_bytes = file_info.orig_filename.encode("cp437")
190 |                 name_as_str = name_as_bytes.decode(self.original_encoding, "replace")
191 |             else:
192 |                 name_as_str = "(UTF-8) " + file_info.filename
193 |             txt.append(name_as_str)
194 |         txt.append('-' * len(txt[1]))
195 |
196 |         txt.append("Add '-enc ENCODING' to see filename shown in encoding "
197 |                    "ENCODING (mbcs, cp932, shift-jis,...)")
198 |         txt.append("Add '-x' flag to extract all files to "
199 |                    "default destination")
200 |         return '\n'.join(txt)
201 |
202 |
203 | def entry_point():
204 |     parser = ArgumentParser(description='Fix filename encoding error '
205 |                                         'inside a zip file.')
206 |     parser.add_argument('zipfile', help='path to zip file')
207 |     parser.add_argument('destination', nargs='?', default="",
208 |                         help='folder path to extract zip file')
209 |     parser.add_argument('--extract', '-x', action='store_true',
210 |                         help='extract the zipfile to specified destination')
211 |     parser.add_argument('--fix', '-f', action='store_true',
212 |                         help='create a new zip file with UTF-8 file names')
213 |     parser.add_argument('--encoding', '-enc',
214 |                         help='zip file used encoding: shift-jis, cp932...')
215 |     parser.add_argument('--password', '-pwd', default='',
216 |                         help='password to extract zip file')
217 |
218 |     args = parser.parse_args()
219 |     try:
220 |         if args.extract:
221 |             zhdl = ZipHandler(path=args.zipfile, encoding=args.encoding,
222 |                               password=args.password.encode('utf8'),
223 |                               extract_path=args.destination)
224 |             zhdl.extract_all()
225 |         elif args.fix:
226 |             zhdl = ZipHandler(path=args.zipfile, encoding=args.encoding,
227 |                               password=args.password.encode('utf8'),
228 |                               extract_path=args.destination)
229 |             zhdl.fix_it()
230 |         else:
231 |             zhdl = ZipHandler(path=args.zipfile, encoding=args.encoding)
232 |             print(zhdl)
233 |     # except Exception as e:
234 |     #     logger.error(e)
235 |     finally:
236 |         pass
237 |
238 |
239 | if __name__ == '__main__':
240 |     entry_point()
--------------------------------------------------------------------------------