├── README.md
├── _config.yml
├── sieve.png
└── sift3.cpp
/README.md:
--------------------------------------------------------------------------------
1 | # sift3
2 | Repository sifter and hardlinker - GPLv3
3 |
4 | 
5 |
6 | ## General
7 |
8 | sift3 is a utility that allows you to sift source folders for items (files or folders) and selectively move or hard link these items to subfolders in a destination folder.
9 |
10 | The idea is that as you download or create new files/media you make sure they have a good descriptive name and drop them into a repository subfolder. The repository can be flat or organized as a folder hierarchy.
11 |
12 | Then you create empty destination subfolders, with names that work as searches into the repository.
13 |
14 | When you run sift3 it will do a brute force attempt to match all items in the repo against all target folders in dest. When there is a match, the repo item is hardlinked into the matching target dest folder.
15 |
16 | This can be seen as a form of "procedural curating". Instead of manually copying/linking every file and folder you specify how it should be done. And then let sift3 do it.
17 |
18 | Since files are hardlinked and not copied, there is very little space penalty for having the same item appear in many dest folders.
19 |
20 | Typical use is to add items, with descriptive names, to the repo. Then update dest by running "sift3 --missing repo dest". Afterwards the file dest/missing.txt can be examined to see which items in repo were not matched in dest. New dest folders can then be added, if needed.
21 |
22 | If repo or dest is restructured you can clear dest before repopulating it by running "sift3 --clear repo dest".
23 |
24 | ## What was wrong with sift?
25 |
26 | The old sift version worked OK but it could not use variable depth repo and dest folder structures. Also the matching was very relaxed, resulting in too many false positive matches. In sift3 the "..." suffix is used to create "parent" subfolders to add depth to both repo and dest. The matching is made more strict with exact word matching in strict order. But this can be relaxed, if needed, by selectively turning off matching of word start or stop and allowing alternative matches in (non-parent) dest target folders.
27 |
28 | ## Usage
29 |
30 | sift3 Copyright Anders Larsen 2020 gislagard@gmail.com
31 |
32 | This program is free software under the terms of GPLv3.
33 |
34 | Hardlink items in repo to matching folders at dest.
35 |
36 | Matching rules:
37 |
38 | Items match only once and in strict order.
39 | Matches are not case sensitive for ASCII.
40 | Suffix '...' in folder names creates a parent folder.
41 | Lower case words in parent and dest folders are not matched.
42 | Use any of ',(); for alternative match in not parent dest.
43 | Full words must match. CamelCase is detected if used.
44 | Use _underscore_ in dest to disable full word start/stop.
45 |
46 | Usage: sift3 [options] repo dest
47 |
48 | Options:
49 |
50 | -c or --clear Clear existing items in dest before sift.
51 | -m or --missing Log not matching in repo to dest/missing.txt.
52 | -v or --verbose Increase verbosity. Max is 3: -vvv
53 |
54 | ## Compile and install
55 |
56 | sift3 is a very small utility, consisting of a single C++ source file, written using only Standard C++ 17 and without any dependencies, so no make or similar is needed. But a recent compiler that supports C++ 17 is needed. The normal gcc 8.4.0-3 that comes with Ubuntu 20.04 works fine.
57 |
58 | This command creates an executable, sift3, ready to use or move into some folder in the path:
59 |
60 | g++ -O3 -std=c++17 sift3.cpp -o sift3
61 |
62 | ## Details and examples
63 |
64 | This documentation has only just been started. Much more is needed and will be added over time.
65 |
66 | ### Repository - repo
67 |
68 | A repo is simply a folder holding a number of items to be sifted, subfolders or normal files. This can be downloaded files or photos or whatever. Subfolders with the suffix "..." are used as "parent repo subfolders" that also can hold items, in any depth.
69 |
70 | The filenames of the parent folders and the items themselves are combined to provide a description of each item.
71 |
72 | Example:
73 |
74 | 1. /repo/old/Movies .../2001 - A Space Odyssey - Science Fiction - Stanley Kubrik (1968)/2001.mkv
75 |
76 | Here we have a parent folder "Movies" and an item consisting of the subfolder "2001 - A Space Odyssey - Science Fiction - Stanley Kubrik (1968)". The description of the item is the combination of the parent folder name and the folder name. However, words in parent subfolders that are not capitalized (Title Case) are ignored, as are any of "(),;". So the description used by sift3 consists of the following tokens:
77 |
78 | [Movies] [2001] [-] [A] [Space] [Odyssey] [-] [Science] [Fiction] [-] [Stanley] [Kubrik] [1968]
79 |
80 | Sometimes CamelCase is used to separate words, instead of space. This is detected and used by sift3. So the filename:
81 |
82 | 2. /repo/Movies/Bladerunner ScienceFiction.mkv
83 |
84 | Would be used as the following tokens:
85 |
86 | [Bladerunner] [Science] [Fiction]
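
The word splitting described above can be sketched roughly as follows. This is a minimal illustration, not code from sift3.cpp: the helper name `split_words` is hypothetical, punctuation such as "-" is treated purely as a separator rather than kept as a token, and file extensions get no special handling.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of sift3.cpp): split a description into
// words. A word ends at any non-alphanumeric character and at a
// lower-case-to-upper-case transition (CamelCase).
std::vector<std::string> split_words(const std::string& text)
{
    std::vector<std::string> words;
    std::string current;
    unsigned char prev = '\0';

    for (unsigned char c : text)
    {
        const bool camel_break = std::islower(prev) && std::isupper(c);

        // Close the current word at a separator or a CamelCase break.
        if ((!std::isalnum(c) || camel_break) && !current.empty())
        {
            words.push_back(current);
            current.clear();
        }

        if (std::isalnum(c))
            current.push_back(static_cast<char>(c));

        prev = c;
    }

    if (!current.empty())
        words.push_back(current);

    return words;
}
```

Called as `split_words("Bladerunner ScienceFiction")`, this sketch yields the three words Bladerunner, Science and Fiction.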
87 |
88 | ### Destination - dest
89 |
90 | A destination is simply a folder with a number of empty "dest" subfolders. Subfolders with the suffix "..." are used as "parent dest subfolders" that also can hold dest subfolders, in any depth.
91 |
92 | Examples:
93 |
94 | 3. /dest/Movies .../by director .../Stanley Kubrik/
95 | 4. /dest/Movies .../by genre .../Science Fiction/
96 | 5. /dest/Movies .../by genre .../Comedy
97 | 6. /dest/Movies .../by year .../1968/
98 | 7. /dest/Movies Sci_ Fi_/
99 | 8. /dest/Sci_ Fi_ Movies/
100 | 9. /dest/Sci_ Fi_ movies/
101 |
102 | The parent dest folder names and the dest folder names are combined by sift3 to create a series of "search" tokens. However, words in parent dest folders that are not capitalized (Title Case) are ignored, as are any of "(),;".
103 |
104 | Several variants can be specified in the dest folder (not in a parent dest folder) by separating the variants with any of "(),;". In addition the underscore character \_ can be used to specify "free" search without having to match exactly the start or end of a word.
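
As a rough illustration of how one dest folder name could be turned into variants, here is a small sketch. The helper `dest_variants` and the alias `Tokens` are hypothetical names, and the sketch looks at a single folder name only, without any parent dest folders. Underscores are left in the tokens so a later matching step can interpret them.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical helper (not part of sift3.cpp): split a dest folder name
// into variants at any of ",();" and keep, per variant, only the words
// that do not start with a lower case letter.
std::vector<Tokens> dest_variants(const std::string& name)
{
    std::vector<Tokens> variants;
    std::size_t start = 0;

    while (start <= name.size())
    {
        std::size_t end = name.find_first_of(",();", start);
        if (end == std::string::npos)
            end = name.size();

        // Split this variant on whitespace.
        std::istringstream words(name.substr(start, end - start));
        Tokens tokens;
        for (std::string word; words >> word;)
        {
            const unsigned char first = static_cast<unsigned char>(word.front());
            if (!std::islower(first))
                tokens.push_back(word);
        }

        if (!tokens.empty())
            variants.push_back(tokens);

        start = end + 1;
    }

    return variants;
}
```

For example, `dest_variants("Science Fiction (SciFi; Sci_ Fi_)")` gives three variants in this sketch: [Science] [Fiction], [SciFi] and [Sci_] [Fi_].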
105 |
106 | ### The sift - hardlinking from repo to dest
107 |
108 | sift3 reads all folders in dest into memory and for each variant creates a set of search tokens. Then sift3 walks through the repo, and for each item tries to find a matching set of search tokens from dest.
109 |
110 | The search tokens from dest need to match the repo description once, in strict order. If all search tokens from dest match, then the item from repo is hardlinked into the matching subfolder in dest.
111 |
112 | * Examples 3, 4, 6, 7 and 9 will match example 1.
113 | * Examples 4, 7 and 9 will match example 2.
114 | * Example 5 will not match 1 because "Comedy" doesn't match example 1.
115 | * Example 7 will match 1 because the underscores mean a complete word match is not needed.
116 | * Example 8 will not match 1 because the order is wrong.
117 | * Example 9 will match 1 because the lower case "movies" is ignored.
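
The matching rules illustrated by the examples above can be sketched as a small standalone function. This is a simplified illustration and not the matcher from sift3.cpp: the name `matches_in_order` and the helpers are hypothetical. Like build_start_stop in sift3.cpp, the sketch only treats letters as word characters, so it is shown here with letter-only tokens.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

namespace {
bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
bool is_upper(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }
bool is_lower(char c) { return std::islower(static_cast<unsigned char>(c)) != 0; }

// A word may start after a non-letter or at a lower-to-upper transition.
bool word_start(const std::string& s, std::size_t i)
{
    const char prev = (i == 0) ? '\0' : s[i - 1];
    return (is_alpha(s[i]) && !is_alpha(prev)) || (is_upper(s[i]) && is_lower(prev));
}

// A word may stop before a non-letter or at a lower-to-upper transition.
bool word_stop(const std::string& s, std::size_t i)
{
    const char next = (i + 1 == s.size()) ? '\0' : s[i + 1];
    return (is_alpha(s[i]) && !is_alpha(next)) || (is_lower(s[i]) && is_upper(next));
}
} // namespace

// Hypothetical sketch: true if every token occurs in description, in
// order, ignoring ASCII case. A leading or trailing '_' on a token turns
// off the word start or word stop check for that token.
bool matches_in_order(const std::string& description, const std::vector<std::string>& tokens)
{
    std::size_t from = 0;

    for (std::string token : tokens)
    {
        bool need_start = true, need_stop = true;
        if (!token.empty() && token.front() == '_') { need_start = false; token.erase(0, 1); }
        if (!token.empty() && token.back() == '_') { need_stop = false; token.pop_back(); }
        if (token.empty())
            return false;

        for (bool found = false; !found;)
        {
            const auto it = std::search(description.begin() + from, description.end(),
                                        token.begin(), token.end(),
                                        [](unsigned char a, unsigned char b)
                                        { return std::toupper(a) == std::toupper(b); });
            if (it == description.end())
                return false;

            const std::size_t pos = static_cast<std::size_t>(it - description.begin());
            const std::size_t last = pos + token.size() - 1;

            if ((!need_start || word_start(description, pos)) &&
                (!need_stop || word_stop(description, last)))
            {
                from = last + 1; // continue after this token
                found = true;
            }
            else
                from = pos + 1;  // try the next candidate position
        }
    }

    return true;
}
```

With the description from example 1, the tokens [Movies] [Sci_] [Fi_] match in this sketch, while [Sci_] [Fi_] [Movies] do not, because "Movies" only occurs before "Science Fiction".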
118 |
119 | #### Warning: sift3 is intentionally designed to delete/clear whole folder trees without asking for any confirmation. Be careful. Backup your data.
120 |
121 | #### Warning: You should understand what hardlinking means and how it works. The files in dest are not copies, they are the actual same files as the files in repo. Don't edit the contents, unless you want it propagated to all the same hardlinked files.
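
To make the hardlink behaviour concrete, here is a minimal sketch using std::filesystem. It is not taken from sift3.cpp. The function name `demo_hard_link_count` and the scratch folder name are hypothetical, and it assumes the temporary directory is on a filesystem that supports hard links.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Hypothetical demo: create a file and a hard link to it in a scratch
// folder, then report how many names refer to the same file data.
std::uintmax_t demo_hard_link_count()
{
    const fs::path dir = fs::temp_directory_path() / "sift3_hardlink_demo";
    fs::remove_all(dir);
    fs::create_directories(dir);

    const fs::path original = dir / "original.txt";
    const fs::path linked = dir / "linked.txt";

    {
        std::ofstream out(original);
        out << "same data\n";
    } // stream is closed here

    // Both names now refer to the same data; nothing is copied.
    fs::create_hard_link(original, linked);

    const std::uintmax_t count = fs::hard_link_count(original);

    fs::remove_all(dir); // clean up the scratch folder
    return count;
}
```

On a filesystem with hard link support the function returns 2: one count for each name. Removing one of the names only lowers the count. The data itself is freed when the last name is removed.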
122 |
123 | ### Clear
124 |
125 | If you rename or restructure things you may want to remove old hardlinked items in dest. Run sift3 with the commandline option --clear and old hardlinks will be removed.
126 |
127 | Parent dest folders and empty dest folders will remain.
128 |
129 | #### Warning: sift3 will NOT ask for any confirmation before it clears out dest.
130 |
131 | Tip: To get an empty dest just run a sift3 with --clear using an empty folder as repo. This is very useful if you want to backup dest. If you backup repo and an empty dest, you have a full backup.
132 |
133 | ### Missing - missing.txt
134 |
135 | You may want to find out what items in the repo are not matched in dest, so you can add new dest folders to handle that. Run sift3 with the commandline option --missing. Then a file dest/missing.txt will be created with all items not matched from the repo.
136 |
137 | ### Performance and no progress indicator - verbosity
138 |
139 | Some general attempts have been made to make sift3 reasonably fast. Compiled C++ is fast. Hardlinking is fast. All dest search sets are held in RAM for fast access. But there is always room for more optimizations.
140 |
141 | In essence sift3 is a brute force nested linear search: every single dest folder is tested against every single item in repo. This is very bad for performance, especially when repo and dest grow. But for a moderate size repo and a moderate size dest, on a local fast filesystem, sift3 runtimes are typically measured in seconds or single minutes. Typically it is the filesystem speed that is the bottleneck, not the processing power.
142 |
143 | If you use sift3 on a remote networked filesystem on a NAS, over WiFi or slow Ethernet, and both repo and dest are large, sift3 can take a very long time to finish. I did some "worst case" testing over WiFi to a NAS running on a RPi4 with mergerfs with very big repo (~20TB) and big dest and runtimes could be close to an hour. Almost half of that time was spent just deleting old hardlinks using --clear.
144 |
145 | If you need to improve performance, split your data into several smaller repo and dest. Run on a local filesystem, not remotely. In other words, run sift3 on the NAS itself, not on a client using a slow remote networked shared filesystem.
146 |
147 | If you want some indication that sift3 is working, while it is running, and hasn't hung, you can increase the verbosity using the --verbose commandline option. Each time you use the option the verbosity level is increased one step.
148 |
149 | Verbosity 0: sift3 is silent. This is the default.
150 |
151 | Verbosity 1: sift3 tells when it shifts between reading (and clearing) dest and reading repo.
152 |
153 | Verbosity 2: sift3 tells you about what it does with every item in dest and repo.
154 |
155 | Verbosity 3: for debugging, not recommended...
156 |
157 | Verbosity 2 gives a good indication of the pace of sift3.
158 |
159 | Instead of using --verbose multiple times you can use -v multiple times or -vv or -vvv.
160 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-time-machine
--------------------------------------------------------------------------------
/sieve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/WikiBox/sift3/d186fe926ee664f4d621093909f1f12a34f1c0b0/sieve.png
--------------------------------------------------------------------------------
/sift3.cpp:
--------------------------------------------------------------------------------
1 | /*
2 | sift3 - Copyright Anders Larsen 2020 Gislagard@gmail.com
3 |
4 | For docs, see: https://github.com/WikiBox/sift3
5 |
6 | This program is free software: you can redistribute it and/or modify
7 | it under the terms of the GNU General Public License as published by
8 | the Free Software Foundation, either version 3 of the License, or
9 | (at your option) any later version.
10 |
11 | This program is distributed in the hope that it will be useful,
12 | but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | GNU General Public License for more details.
15 |
16 | You should have received a copy of the GNU General Public License
17 | along with this program. If not, see <https://www.gnu.org/licenses/>.
18 | */
19 |
20 | #include <algorithm>
21 | #include <cctype>
22 | #include <filesystem>
23 | #include <fstream>
24 | #include <iostream>
25 | #include <string>
26 | #include <string_view>
27 | #include <vector>
28 |
29 | namespace fs = std::filesystem;
30 | using namespace std;
31 |
32 | int verbose;
33 | bool clear, missing;
34 | ofstream ostrm_missing;
35 |
36 | vector<vector<string> > repo_items; // unused in this file; element type is an assumption
37 | vector<pair<fs::path, vector<string> > > dest_items;
38 |
39 | void tokenize(string str, const string& delimiter, vector<string> &tokens)
40 | {
41 | size_t prev = 0, pos = 0, len = str.length();
42 |
43 | do
44 | {
45 | pos = str.find_first_of(delimiter, prev);
46 |
47 | if (pos == string::npos)
48 | pos = len;
49 |
50 | string tok = str.substr(prev, pos - prev);
51 |
52 | if (!tok.empty())
53 | tokens.push_back(tok);
54 |
55 | prev = pos + 1;
56 |
57 | } while (pos < len && prev < len);
58 | }
59 |
60 | /*
61 | * Recursive hardlink (copy file w. hardlink&recursion) in the stdlib seems to be broken
62 | */
63 |
64 | void recursive_hardlink(fs::path src, fs::path dst)
65 | {
66 | std::error_code ec;
67 |
68 | if (fs::is_directory(src))
69 | {
70 | if(verbose > 2) cout << "create directory: " << dst << endl;
71 |
72 | fs::create_directory(dst, ec); // mkdir
73 | if (ec && (verbose > 2))
74 | cerr << "Create directory warning: " << ec.message() << " " << dst.string() << endl;
75 |
76 | for(auto& new_src : fs::directory_iterator(src))
77 | recursive_hardlink(new_src, dst / new_src.path().filename());
78 | }
79 | else if (fs::is_regular_file(src))
80 | {
81 | if(verbose > 2) cout << "hardlink file: " << src << endl;
82 |
83 | // Skip if dst exists
84 | if(fs::exists(dst))
85 | return;
86 |
87 | fs::create_hard_link(src, dst, ec);
88 | if (ec && (verbose > 2))
89 | cerr << "Create hardlink warning: " << ec.message() << " " << dst.string() << endl;
90 | }
91 | }
92 |
93 | /*
94 | * Find word breaks to use for valid starts and stops of word matches
95 | */
96 |
97 | void build_start_stop(string& s, vector<bool>& start, vector<bool>& stop)
98 | {
99 | size_t len = s.length();
100 |
101 | unsigned char pre = '\0', cur = '\0', suc = '\0'; // unsigned: isalpha() etc. need values in the unsigned char range
102 |
103 | for(int i = 0; i < len; i++)
104 | {
105 | if((i + 1) == len)
106 | suc = '\0';
107 | else
108 | suc = s[i + 1];
109 |
110 | if(i == 0)
111 | pre = '\0';
112 | else
113 | pre = cur;
114 |
115 | cur = s[i];
116 |
117 | // Is cur a valid start of a word?
118 | start.push_back(((isalpha(cur) && !isalpha(pre)) || // _[a]
119 | (isupper(cur) && islower(pre)))); // a[A]
120 |
121 | // Is cur a valid stop of a word?
122 | stop.push_back(((isalpha(cur) && !isalpha(suc)) || // [a]_
123 | (islower(cur) && isupper(suc)))); // [a]A
124 | }
125 | }
126 |
127 | size_t find_word_case_insensitive(std::string_view s, size_t spos, std::string_view p)
128 | {
129 | auto it = std::search
130 | (
131 | s.begin() + spos, s.end(),
132 | p.begin(), p.end(),
133 | [](unsigned char cs, unsigned char cp) { return std::toupper(cs) == std::toupper(cp); }
134 | );
135 |
136 | if (it == s.end())
137 | return string::npos;
138 |
139 | return it - s.begin();
140 | }
141 |
142 | /*
143 | Search s for the strings in p, in order. If all are found, return true.
144 | Recursive backtracking.
145 |
146 | s : string to search.
147 | spos : position in s to search from.
148 | p : strings to match.
149 | ppos : position in p to search from.
150 | wstart : flags match valid start of word.
151 | wstop : flags match valid stop of word.
152 |
153 | _ at start/end of string to match turns off checks for start/stop word breaks.
154 | */
155 |
156 | bool match(string& s, size_t spos, const vector<string>& p, size_t ppos, const vector<bool>& wstart, const vector<bool>& wstop)
157 | {
158 | if (spos == s.length() && (ppos != p.size())) // Exhausted s (searched string) but not p (patterns to search for in s)
159 | return false;
160 | else if (ppos == p.size()) // Exhausted p
161 | return true;
162 |
163 | std::string_view q = p[ppos];
164 |
165 | // Check if first char is UPPER CASE - Means Word match required - otherwise more relaxed string match
166 | // bool test = isupper(q.front());
167 | // bool test_start = test, test_stop = test;
168 |
169 | bool test_start = true, test_stop = true;
170 |
171 | if(q.front() == '_')
172 | {
173 | test_start = false;
174 | q.remove_prefix(1);
175 | }
176 |
177 | if(q.back() == '_')
178 | {
179 | test_stop = false;
180 | q.remove_suffix(1);
181 | }
182 |
183 | size_t qlen = q.length();
184 | size_t pos = 0;
185 |
186 | while(string::npos != (pos = find_word_case_insensitive(s, spos, q)))
187 | if((!test_start || wstart[pos]) && (!test_stop || wstop[pos + qlen - 1]))
188 | return match(s, pos + qlen, p, ppos + 1, wstart, wstop);
189 | else
190 | spos = pos + qlen;
191 |
192 | return false;
193 | }
194 |
195 | inline bool has_suffix(std::string const & s, std::string const & suffix)
196 | {
197 | if (suffix.length() > s.length())
198 | return false;
199 |
200 | return std::equal(suffix.rbegin(), suffix.rend(), s.rbegin());
201 | }
202 |
203 | /*
204 | * Iterate over everything in repo and sift one item after another.
205 | * repo_info holds all words from previous recursion to use in a match
206 | */
207 |
208 | void sift_repo(const fs::path repo_folder, const string repo_info = "")
209 | {
210 | for(auto& repo_folder_item : fs::directory_iterator(repo_folder))
211 | {
212 | string filename = repo_folder_item.path().filename().string();
213 |
214 | if(!has_suffix(filename, "...") || // has not suffix ...
215 | !fs::is_directory(repo_folder_item)) // or is not a directory
216 | {
217 | string new_repo_info = repo_info;
218 |
219 | if(!new_repo_info.empty())
220 | new_repo_info.append(" ");
221 |
222 | new_repo_info.append(filename);
223 |
224 | vector<bool> stop, start;
225 |
226 | build_start_stop(new_repo_info, start, stop);
227 |
228 | bool have_match = false;
229 |
230 | for(const auto& [dest, words] : dest_items)
231 | {
232 | size_t next = 0, pos;
233 |
234 | if (match(new_repo_info, 0, words, 0, start, stop))
235 | {
236 | const fs::path dst = dest / filename;
237 |
238 | if(verbose > 0)
239 | cout << " Linking: " << repo_folder_item.path() << "\n"
240 | << " with: " << dst << endl;
241 |
242 | recursive_hardlink(repo_folder_item, dst);
243 | have_match = true;
244 | }
245 | }
246 |
247 | if(!have_match)
248 | {
249 | if(missing)
250 | ostrm_missing << repo_folder_item.path() << endl;
251 |
252 | if(verbose > 0)
253 | cout << "No match: " << repo_folder_item.path() << endl;
254 | }
255 | }
256 | else // has suffix ... and is a folder
257 | {
258 | vector<string> words;
259 | string new_repo_info = repo_info;
260 |
261 | // Erase "..." suffix
262 | filename.erase(filename.length() - 3, string::npos);
263 |
264 | tokenize(filename, " ", words);
265 |
266 | // Erase all words starting with alpha but not capitalized
267 | words.erase(std::remove_if(words.begin(), words.end(),
268 | [](string s){return isalpha(s.front()) && !std::isupper(s.front());}), words.end());
269 |
270 | // Reconstruct filename into new_repo_info
271 | for(auto& word : words)
272 | {
273 | if(!new_repo_info.empty()) // pedantic
274 | new_repo_info.append(" ");
275 |
276 | new_repo_info.append(word);
277 | }
278 |
279 | if(verbose > 2) cout << "new_repo_info used: " << new_repo_info << endl;
280 |
281 | sift_repo(repo_folder_item.path(), new_repo_info);
282 | }
283 | }
284 | }
285 |
286 | /*
287 | * Read all destination folders, store search tokens only if Capitalized first letter - recursive
288 | * dest_info holds all search terms
289 | */
290 |
291 | void read_dest(const fs::path dest_folder, const string dest_info = "")
292 | {
293 | for (auto& dest_folder_item : fs::directory_iterator(dest_folder))
294 | {
295 | string filename = dest_folder_item.path().filename().string();
296 |
297 | if(fs::is_directory(dest_folder_item) && !has_suffix(filename, "...")) // This is a destination folder!
298 | {
299 | string new_dest_info = dest_info;
300 |
301 | if(!new_dest_info.empty())
302 | new_dest_info.append(" ");
303 |
304 | new_dest_info.append(filename);
305 |
306 | if(std::error_code ec; clear)
307 | {
308 | for(auto& item: fs::directory_iterator(dest_folder_item))
309 | {
310 | if(verbose > 1)
311 | cout << "Deleting: " << item.path().string() << endl;
312 |
313 | fs::remove_all(item.path(), ec);
314 | }
315 | }
316 |
317 | vector<string> variants;
318 |
319 | tokenize(new_dest_info, ",();", variants);
320 |
321 | for (auto& var : variants)
322 | {
323 | vector<string> words;
324 |
325 | tokenize(var, " ", words);
326 |
327 | // Remove all words not capitalized
328 | words.erase(std::remove_if(words.begin(), words.end(),
329 | [](string s){return isalpha(s.front()) && !std::isupper(s.front());}), words.end());
330 |
331 | if (!words.empty())
332 | {
333 | dest_items.push_back(make_pair(dest_folder_item, words));
334 |
335 | if(verbose > 2)
336 | {
337 | cout << "Dest tokens used for " << dest_folder_item.path().string() << endl;
338 | for(auto& s : words)
339 | cout << "[" << s << "] ";
340 |
341 | cout << endl;
342 | }
343 | }
344 | }
345 | }
346 | else if (fs::is_directory(dest_folder_item) && has_suffix(filename, "..."))
347 | {
348 | string new_dest_info = dest_info;
349 |
350 | if(!new_dest_info.empty())
351 | new_dest_info.append(" ");
352 |
353 | filename = filename.substr(0, filename.length() - 3); // erase ...
354 |
355 | string space_these = ",();"; // Variants not allowed in parent folder
356 |
357 | for(auto& c : filename)
358 | if(space_these.find(c) != string::npos)
359 | c = ' ';
360 |
361 | new_dest_info.append(filename);
362 |
363 | read_dest(dest_folder_item.path(), new_dest_info);
364 | }
365 | }
366 | }
367 |
368 | int main(int argc, char** argv)
369 | {
370 | string repo, dest;
371 |
372 | bool syntax_error = false;
373 |
374 | clear = missing = false;
375 | verbose = 0;
376 |
377 | for (int i = 1; i < argc; i++)
378 | {
379 | string arg = string(argv[i]);
380 |
381 | if ((arg == "-c") || (arg == "--clear"))
382 | clear = true;
383 | else if ((arg == "-m") || (arg == "--missing"))
384 | missing = true;
385 | else if ((arg == "-v") || (arg == "--verbose"))
386 | verbose++;
387 | else if ((arg == "-vv"))
388 | verbose+=2;
389 | else if ((arg == "-vvv"))
390 | verbose+=3;
391 | else if (i == (argc - 2))
392 | repo = arg;
393 | else if (i == (argc - 1))
394 | dest = arg;
395 | else
396 | {
397 | syntax_error = true;
398 | break;
399 | }
400 | }
401 |
402 | if(syntax_error || !(fs::exists(repo) && fs::exists(dest)))
403 | {
404 | cerr << "sift3 Copyright Anders Larsen 2020 gislagard@gmail.com\n\n"
405 | << "This program is free software under the terms of GPLv3.\n\n"
406 | << "Hardlink items in repo folder to matching subfolders in dest.\n\n"
407 | << "Matching rules: \n"
408 | << " Items match only once and in strict order.\n"
409 | << " Matches are not case sensitive for ASCII.\n"
410 | << " Suffix '...' in folder names creates a parent folder.\n"
411 | << " Lower case words in parent folders are not matched.\n"
412 | << " Use any of ',(); for alternative match in not parent dest.\n"
413 | << " Full words must match. CamelCase is detected if used.\n"
414 | << " Use _underscore_ in dest to disable full word start/stop.\n\n"
415 | << "Usage: sift3 [options] repo dest\n\n"
416 | << "Options:\n\n"
417 | << " -c or --clear Clear existing items in dest before sift.\n"
418 | << " -m or --missing Log not matching in repo to dest/missing.txt.\n"
419 | << " -v or --verbose Increase verbosity. Max is 3: -vvv\n"
420 | << endl;
421 |
422 | return -1;
423 | }
424 |
425 | if(verbose > 0 && clear)
426 | cout << "Reading and clearing " << dest << endl;
427 | else if(verbose > 0)
428 | cout << "Reading " << dest << endl;
429 |
430 | read_dest(dest);
431 |
432 | if(missing)
433 | ostrm_missing.open(fs::path(dest) / "missing.txt", std::ios::binary);
434 |
435 | if(verbose > 0)
436 | cout << "Sifting " << repo << endl;
437 |
438 | sift_repo(repo);
439 |
440 | if(missing)
441 | {
442 | ostrm_missing.close();
443 |
444 | if(verbose > 0)
445 | cout << dest << "/missing.txt closed" << endl;
446 | }
447 |
448 | return 0;
449 | }
450 |
--------------------------------------------------------------------------------