├── .DS_Store ├── 5pages.pdf ├── Be_Good.pdf ├── SAMPLE-OF-ENV-FILE.txt ├── Street_Tree_List.csv ├── _100 AI Startups__ 100 LLM Apps that have earned $500,000 before their first year of existence.html ├── be-good-and-how-not-to-die.txt ├── be-good.txt ├── como_podemos_ayudarte.pdf ├── good.txt ├── state_of_the_union.txt ├── street_tree_db.sqlite ├── thefuzz-master ├── .DS_Store ├── .editorconfig ├── .github │ └── workflows │ │ └── ci.yml ├── .gitignore ├── CHANGES.rst ├── LICENSE.txt ├── MANIFEST.in ├── README.rst ├── benchmarks.py ├── data │ └── titledata.csv ├── release ├── setup.py ├── test_thefuzz.py ├── test_thefuzz_hypothesis.py ├── test_thefuzz_pytest.py ├── thefuzz │ ├── __init__.py │ ├── fuzz.py │ ├── fuzz.pyi │ ├── process.py │ ├── process.pyi │ ├── py.typed │ ├── utils.py │ └── utils.pyi └── tox.ini ├── thefuzz ├── __init__.py ├── fuzz.py ├── fuzz.pyi ├── process.py ├── process.pyi ├── py.typed ├── utils.py └── utils.pyi └── youtube └── LLM Apps: Professional Opportunities for LLM App Developers..m4a /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-LLM-Bootcamp/data/f776b47268f2d7152ac1accf3ce47472ec1a59e7/.DS_Store -------------------------------------------------------------------------------- /5pages.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-LLM-Bootcamp/data/f776b47268f2d7152ac1accf3ce47472ec1a59e7/5pages.pdf -------------------------------------------------------------------------------- /Be_Good.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-LLM-Bootcamp/data/f776b47268f2d7152ac1accf3ce47472ec1a59e7/Be_Good.pdf -------------------------------------------------------------------------------- /SAMPLE-OF-ENV-FILE.txt: -------------------------------------------------------------------------------- 1 | (remember: the 
name of this file should be .env) 2 | (this file should not be in the data folder, but in the root folder) 3 | (replace … with your confidential key) 4 | (remove the keys you are not using) 5 | 6 | 7 | OPENAI_API_KEY=… 8 | 9 | 10 | LANGCHAIN_TRACING_V2=true 11 | LANGCHAIN_ENDPOINT=https://api.smith.langchain.com 12 | LANGCHAIN_API_KEY=… 13 | 14 | 15 | SERPAPI_API_KEY=… 16 | HUGGINGFACEHUB_API_TOKEN=… 17 | COHERE_API_KEY=.. 18 | DEEPLAKE_API_KEY=… 19 | GOOGLE_API_KEY=… 20 | GOOGLE_CSE_ID=… 21 | ACTIVELOOP_ORG_ID=… 22 | REPLICATE_API_TOKEN=… 23 | PALM_API_KEY=.. 24 | PALM_REGION=.. -------------------------------------------------------------------------------- /be-good-and-how-not-to-die.txt: -------------------------------------------------------------------------------- 1 | Be good 2 | 3 | April 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we started Y Combinator we came up with the 4 | phrase that became our motto: Make something people want. We've 5 | learned a lot since then, but if I were choosing now that's still 6 | the one I'd pick.Another thing we tell founders is not to worry too much about the 7 | business model, at least at first. Not because making money is 8 | unimportant, but because it's so much easier than building something 9 | great.A couple weeks ago I realized that if you put those two ideas 10 | together, you get something surprising. Make something people want. 11 | Don't worry too much about making money. What you've got is a 12 | description of a charity.When you get an unexpected result like this, it could either be a 13 | bug or a new discovery. Either businesses aren't supposed to be 14 | like charities, and we've proven by reductio ad absurdum that one 15 | or both of the principles we began with is false. Or we have a new 16 | idea.I suspect it's the latter, because as soon as this thought occurred 17 | to me, a whole bunch of other things fell into place.ExamplesFor example, Craigslist. 
It's not a charity, but they run it like 18 | one. And they're astoundingly successful. When you scan down the 19 | list of most popular web sites, the number of employees at Craigslist 20 | looks like a misprint. Their revenues aren't as high as they could 21 | be, but most startups would be happy to trade places with them.In Patrick O'Brian's novels, his captains always try to get upwind 22 | of their opponents. If you're upwind, you decide when and if to 23 | engage the other ship. Craigslist is effectively upwind of enormous 24 | revenues. They'd face some challenges if they wanted to make more, 25 | but not the sort you face when you're tacking upwind, trying to 26 | force a crappy product on ambivalent users by spending ten times 27 | as much on sales as on development. [1]I'm not saying startups should aim to end up like Craigslist. 28 | They're a product of unusual circumstances. But they're a good 29 | model for the early phases.Google looked a lot like a charity in the beginning. They didn't 30 | have ads for over a year. At year 1, Google was indistinguishable 31 | from a nonprofit. If a nonprofit or government organization had 32 | started a project to index the web, Google at year 1 is the limit 33 | of what they'd have produced.Back when I was working on spam filters I thought it would be a 34 | good idea to have a web-based email service with good spam filtering. 35 | I wasn't thinking of it as a company. I just wanted to keep people 36 | from getting spammed. But as I thought more about this project, I 37 | realized it would probably have to be a company. It would cost 38 | something to run, and it would be a pain to fund with grants and 39 | donations.That was a surprising realization. Companies often claim to be 40 | benevolent, but it was surprising to realize there were purely 41 | benevolent projects that had to be embodied as companies to work.I didn't want to start another company, so I didn't do it. 
But if 42 | someone had, they'd probably be quite rich now. There was a window 43 | of about two years when spam was increasing rapidly but all the big 44 | email services had terrible filters. If someone had launched a 45 | new, spam-free mail service, users would have flocked to it.Notice the pattern here? From either direction we get to the same 46 | spot. If you start from successful startups, you find they often 47 | behaved like nonprofits. And if you start from ideas for nonprofits, 48 | you find they'd often make good startups.PowerHow wide is this territory? Would all good nonprofits be good 49 | companies? Possibly not. What makes Google so valuable is that 50 | their users have money. If you make people with money love you, 51 | you can probably get some of it. But could you also base a successful 52 | startup on behaving like a nonprofit to people who don't have money? 53 | Could you, for example, grow a successful startup out of curing an 54 | unfashionable but deadly disease like malaria?I'm not sure, but I suspect that if you pushed this idea, you'd be 55 | surprised how far it would go. For example, people who apply to Y 56 | Combinator don't generally have much money, and yet we can profit 57 | by helping them, because with our help they could make money. Maybe 58 | the situation is similar with malaria. Maybe an organization that 59 | helped lift its weight off a country could benefit from the resulting 60 | growth.I'm not proposing this is a serious idea. I don't know anything 61 | about malaria. But I've been kicking ideas around long enough to 62 | know when I come across a powerful one.One way to guess how far an idea extends is to ask yourself at what 63 | point you'd bet against it. The thought of betting against benevolence 64 | is alarming in the same way as saying that something is technically 65 | impossible. You're just asking to be made a fool of, because these 66 | are such powerful forces. 
[2]For example, initially I thought maybe this principle only applied 67 | to Internet startups. Obviously it worked for Google, but what 68 | about Microsoft? Surely Microsoft isn't benevolent? But when I 69 | think back to the beginning, they were. Compared to IBM they were 70 | like Robin Hood. When IBM introduced the PC, they thought they 71 | were going to make money selling hardware at high prices. But by 72 | gaining control of the PC standard, Microsoft opened up the market 73 | to any manufacturer. Hardware prices plummeted, and lots of people 74 | got to have computers who couldn't otherwise have afforded them. 75 | It's the sort of thing you'd expect Google to do.Microsoft isn't so benevolent now. Now when one thinks of what 76 | Microsoft does to users, all the verbs that come to mind begin with 77 | F. [3] And yet it doesn't seem to pay. 78 | Their stock price has been flat for years. Back when they were 79 | Robin Hood, their stock price rose like Google's. Could there be 80 | a connection?You can see how there would be. When you're small, you can't bully 81 | customers, so you have to charm them. Whereas when you're big you 82 | can maltreat them at will, and you tend to, because it's easier 83 | than satisfying them. You grow big by being nice, but you can stay 84 | big by being mean.You get away with it till the underlying conditions change, and 85 | then all your victims escape. So "Don't be evil" may be the most 86 | valuable thing Paul Buchheit made for Google, because it may turn 87 | out to be an elixir of corporate youth. I'm sure they find it 88 | constraining, but think how valuable it will be if it saves them 89 | from lapsing into the fatal laziness that afflicted Microsoft and 90 | IBM.The curious thing is, this elixir is freely available to any other 91 | company. Anyone can adopt "Don't be evil." The catch is that 92 | people will hold you to it. 
So I don't think you're going to see 93 | record labels or tobacco companies using this discovery.MoraleThere's a lot of external evidence that benevolence works. But how 94 | does it work? One advantage of investing in a large number of 95 | startups is that you get a lot of data about how they work. From 96 | what we've seen, being good seems to help startups in three ways: 97 | it improves their morale, it makes other people want to help them, 98 | and above all, it helps them be decisive.Morale is tremendously important to a startup—so important 99 | that morale alone is almost enough to determine success. Startups 100 | are often described as emotional roller-coasters. One minute you're 101 | going to take over the world, and the next you're doomed. The 102 | problem with feeling you're doomed is not just that it makes you 103 | unhappy, but that it makes you stop working. So the downhills 104 | of the roller-coaster are more of a self fulfilling prophecy than 105 | the uphills. If feeling you're going to succeed makes you work 106 | harder, that probably improves your chances of succeeding, but if 107 | feeling you're going to fail makes you stop working, that practically 108 | guarantees you'll fail.Here's where benevolence comes in. If you feel you're really helping 109 | people, you'll keep working even when it seems like your startup 110 | is doomed. Most of us have some amount of natural benevolence. 111 | The mere fact that someone needs you makes you want to help them. 112 | So if you start the kind of startup where users come back each day, 113 | you've basically built yourself a giant tamagotchi. You've made 114 | something you need to take care of.Blogger is a famous example of a startup that went through really 115 | low lows and survived. At one point they ran out of money and 116 | everyone left. Evan Williams came in to work the next day, and there 117 | was no one but him. What kept him going? Partly that users needed 118 | him. 
He was hosting thousands of people's blogs. He couldn't just 119 | let the site die.There are many advantages of launching quickly, but the most important 120 | may be that once you have users, the tamagotchi effect kicks in. 121 | Once you have users to take care of, you're forced to figure out 122 | what will make them happy, and that's actually very valuable 123 | information.The added confidence that comes from trying to help people can 124 | also help you with investors. One of the founders of 125 | Chatterous told 126 | me recently that he and his cofounder had decided that this service 127 | was something the world needed, so they were going to keep working 128 | on it no matter what, even if they had to move back to Canada and live 129 | in their parents' basements.Once they realized this, they stopped caring so much what investors thought 130 | about them. They still met with them, but they weren't going to 131 | die if they didn't get their money. And you know what? The investors 132 | got a lot more interested. They could sense that the Chatterouses 133 | were going to do this startup with or without them.If you're really committed and your startup is cheap to run, you 134 | become very hard to kill. And practically all startups, even the 135 | most successful, come close to death at some point. So if doing 136 | good for people gives you a sense of mission that makes you harder 137 | to kill, that alone more than compensates for whatever you lose by 138 | not choosing a more selfish project.HelpAnother advantage of being good is that it makes other people want 139 | to help you. This too seems to be an inborn trait in humans.One of the startups we've funded, Octopart, is currently locked in 140 | a classic battle of good versus evil. They're a search site for 141 | industrial components. A lot of people need to search for components, 142 | and before Octopart there was no good way to do it. 
That, it turned 143 | out, was no coincidence.Octopart built the right way to search for components. Users like 144 | it and they've been growing rapidly. And yet for most of Octopart's 145 | life, the biggest distributor, Digi-Key, has been trying to force 146 | them to take their prices off the site. Octopart is sending them 147 | customers for free, and yet Digi-Key is trying to make that traffic 148 | stop. Why? Because their current business model depends on 149 | overcharging people who have incomplete information about prices. 150 | They don't want search to work.The Octoparts are the nicest guys in the world. They dropped out 151 | of the PhD program in physics at Berkeley to do this. They just 152 | wanted to fix a problem they encountered in their research. Imagine 153 | how much time you could save the world's engineers if they could 154 | do searches online. So when I hear that a big, evil company is 155 | trying to stop them in order to keep search broken, it makes me 156 | really want to help them. It makes me spend more time on the Octoparts 157 | than I do with most of the other startups we've funded. It just 158 | made me spend several minutes telling you how great they are. Why? 159 | Because they're good guys and they're trying to help the world.If you're benevolent, people will rally around you: investors, 160 | customers, other companies, and potential employees. In the long 161 | term the most important may be the potential employees. I think 162 | everyone knows now that 163 | good hackers are much better than mediocre 164 | ones. If you can attract the best hackers to work for you, as 165 | Google has, you have a big advantage. And the very best hackers 166 | tend to be idealistic. They're not desperate for a job. They can 167 | work wherever they want. So most want to work on things that will 168 | make the world better.CompassBut the most important advantage of being good is that it acts as 169 | a compass. 
One of the hardest parts of doing a startup is that you 170 | have so many choices. There are just two or three of you, and a 171 | thousand things you could do. How do you decide?Here's the answer: Do whatever's best for your users. You can hold 172 | onto this like a rope in a hurricane, and it will save you if 173 | anything can. Follow it and it will take you through everything 174 | you need to do.It's even the answer to questions that seem unrelated, like how to 175 | convince investors to give you money. If you're a good salesman, 176 | you could try to just talk them into it. But the more reliable 177 | route is to convince them through your users: if you make something 178 | users love enough to tell their friends, you grow exponentially, 179 | and that will convince any investor.Being good is a particularly useful strategy for making decisions 180 | in complex situations because it's stateless. It's like telling 181 | the truth. The trouble with lying is that you have to remember 182 | everything you've said in the past to make sure you don't contradict 183 | yourself. If you tell the truth you don't have to remember anything, 184 | and that's a really useful property in domains where things happen 185 | fast.For example, Y Combinator has now invested in 80 startups, 57 of 186 | which are still alive. (The rest have died or merged or been 187 | acquired.) When you're trying to advise 57 startups, it turns out 188 | you have to have a stateless algorithm. You can't have ulterior 189 | motives when you have 57 things going on at once, because you can't 190 | remember them. So our rule is just to do whatever's best for the 191 | founders. Not because we're particularly benevolent, but because 192 | it's the only algorithm that works on that scale.When you write something telling people to be good, you seem to be 193 | claiming to be good yourself. So I want to say explicitly that I 194 | am not a particularly good person. 
When I was a kid I was firmly 195 | in the camp of bad. The way adults used the word good, it seemed 196 | to be synonymous with quiet, so I grew up very suspicious of it.You know how there are some people whose names come up in conversation 197 | and everyone says "He's such a great guy?" People never say 198 | that about me. The best I get is "he means well." I am not claiming 199 | to be good. At best I speak good as a second language.So I'm not suggesting you be good in the usual sanctimonious way. 200 | I'm suggesting it because it works. It will work not just as a 201 | statement of "values," but as a guide to strategy, 202 | and even a design spec for software. Don't just not be evil. Be 203 | good.Notes[1] Fifty years ago 204 | it would have seemed shocking for a public company not to pay 205 | dividends. Now many tech companies don't. The markets seem to 206 | have figured out how to value potential dividends. Maybe that isn't 207 | the last step in this evolution. Maybe markets will eventually get 208 | comfortable with potential earnings. (VCs already are, and at least 209 | some of them consistently make money.)I realize this sounds like the stuff one used to hear about the 210 | "new economy" during the Bubble. Believe me, I was not drinking 211 | that kool-aid at the time. But I'm convinced there were some 212 | good 213 | ideas buried in Bubble thinking. For example, it's ok to focus on 214 | growth instead of profits—but only if the growth is genuine. 215 | You can't be buying users; that's a pyramid scheme. But a company 216 | with rapid, genuine growth is valuable, and eventually markets learn 217 | how to value valuable things.[2] The idea of starting 218 | a company with benevolent aims is currently undervalued, because 219 | the kind of people who currently make that their explicit goal don't 220 | usually do a very good job.It's one of the standard career paths of trustafarians to start 221 | some vaguely benevolent business. 
The problem with most of them 222 | is that they either have a bogus political agenda or are feebly 223 | executed. The trustafarians' ancestors didn't get rich by preserving 224 | their traditional culture; maybe people in Bolivia don't want to 225 | either. And starting an organic farm, though it's at least 226 | straightforwardly benevolent, doesn't help people on the scale that 227 | Google does.Most explicitly benevolent projects don't hold themselves sufficiently 228 | accountable. They act as if having good intentions were enough to 229 | guarantee good effects.[3] Users dislike their 230 | new operating system so much that they're starting petitions to 231 | save the old one. And the old one was nothing special. The hackers 232 | within Microsoft must know in their hearts that if the company 233 | really cared about users they'd just advise them to switch to OSX.Thanks to Trevor Blackwell, Paul Buchheit, Jessica Livingston, 234 | and Robert Morris for reading drafts of this. 235 | 236 | How not to die 237 | 238 | August 2007 239 | 240 | (This is a talk I gave at the last Y Combinator dinner of the summer. Usually we don't have a speaker at the last dinner; it's more of a party. But it seemed worth spoiling the atmosphere if I could save some of the startups from preventable deaths. So at the last minute I cooked up this rather grim talk. I didn't mean this as an essay; I wrote it down because I only had two hours before dinner and think fastest while writing.) 241 | 242 | A couple days ago I told a reporter that we expected about a third of the companies we funded to succeed. Actually I was being conservative. I'm hoping it might be as much as a half. Wouldn't it be amazing if we could achieve a 50% success rate? 243 | 244 | Another way of saying that is that half of you are going to die. Phrased that way, it doesn't sound good at all. In fact, it's kind of weird when you think about it, because our definition of success is that the founders get rich. 
If half the startups we fund succeed, then half of you are going to get rich and the other half are going to get nothing. 245 | 246 | If you can just avoid dying, you get rich. That sounds like a joke, but it's actually a pretty good description of what happens in a typical startup. It certainly describes what happened in Viaweb. We avoided dying till we got rich. 247 | 248 | It was really close, too. When we were visiting Yahoo to talk about being acquired, we had to interrupt everything and borrow one of their conference rooms to talk down an investor who was about to back out of a new funding round we needed to stay alive. So even in the middle of getting rich we were fighting off the grim reaper. 249 | 250 | You may have heard that quote about luck consisting of opportunity meeting preparation. You've now done the preparation. The work you've done so far has, in effect, put you in a position to get lucky: you can now get rich by not letting your company die. That's more than most people have. So let's talk about how not to die. 251 | 252 | We've done this five times now, and we've seen a bunch of startups die. About 10 of them so far. We don't know exactly what happens when they die, because they generally don't die loudly and heroically. Mostly they crawl off somewhere and die. 253 | 254 | For us the main indication of impending doom is when we don't hear from you. When we haven't heard from, or about, a startup for a couple months, that's a bad sign. If we send them an email asking what's up, and they don't reply, that's a really bad sign. So far that is a 100% accurate predictor of death. 255 | 256 | Whereas if a startup regularly does new deals and releases and either sends us mail or shows up at YC events, they're probably going to live. 257 | 258 | I realize this will sound naive, but maybe the linkage works in both directions. Maybe if you can arrange that we keep hearing from you, you won't die. 259 | 260 | That may not be so naive as it sounds. 
You've probably noticed that having dinners every Tuesday with us and the other founders causes you to get more done than you would otherwise, because every dinner is a mini Demo Day. Every dinner is a kind of a deadline. So the mere constraint of staying in regular contact with us will push you to make things happen, because otherwise you'll be embarrassed to tell us that you haven't done anything new since the last time we talked. 261 | 262 | If this works, it would be an amazing hack. It would be pretty cool if merely by staying in regular contact with us you could get rich. It sounds crazy, but there's a good chance that would work. 263 | 264 | A variant is to stay in touch with other YC-funded startups. There is now a whole neighborhood of them in San Francisco. If you move there, the peer pressure that made you work harder all summer will continue to operate. 265 | 266 | When startups die, the official cause of death is always either running out of money or a critical founder bailing. Often the two occur simultaneously. But I think the underlying cause is usually that they've become demoralized. You rarely hear of a startup that's working around the clock doing deals and pumping out new features, and dies because they can't pay their bills and their ISP unplugs their server. 267 | 268 | Startups rarely die in mid keystroke. So keep typing! 269 | 270 | If so many startups get demoralized and fail when merely by hanging on they could get rich, you have to assume that running a startup can be demoralizing. That is certainly true. I've been there, and that's why I've never done another startup. The low points in a startup are just unbelievably low. I bet even Google had moments where things seemed hopeless. 271 | 272 | Knowing that should help. If you know it's going to feel terrible sometimes, then when it feels terrible you won't think "ouch, this feels terrible, I give up." It feels that way for everyone. 
And if you just hang on, things will probably get better. The metaphor people use to describe the way a startup feels is at least a roller coaster and not drowning. You don't just sink and sink; there are ups after the downs. 273 | 274 | Another feeling that seems alarming but is in fact normal in a startup is the feeling that what you're doing isn't working. The reason you can expect to feel this is that what you do probably won't work. Startups almost never get it right the first time. Much more commonly you launch something, and no one cares. Don't assume when this happens that you've failed. That's normal for startups. But don't sit around doing nothing. Iterate. 275 | 276 | I like Paul Buchheit's suggestion of trying to make something that at least someone really loves. As long as you've made something that a few users are ecstatic about, you're on the right track. It will be good for your morale to have even a handful of users who really love you, and startups run on morale. But also it will tell you what to focus on. What is it about you that they love? Can you do more of that? Where can you find more people who love that sort of thing? As long as you have some core of users who love you, all you have to do is expand it. It may take a while, but as long as you keep plugging away, you'll win in the end. Both Blogger and Delicious did that. Both took years to succeed. But both began with a core of fanatically devoted users, and all Evan and Joshua had to do was grow that core incrementally. Wufoo is on the same trajectory now. 277 | 278 | So when you release something and it seems like no one cares, look more closely. Are there zero users who really love you, or is there at least some little group that does? It's quite possible there will be zero. In that case, tweak your product and try again. Every one of you is working on a space that contains at least one winning permutation somewhere in it. If you just keep trying, you'll find it. 
279 | 280 | Let me mention some things not to do. The number one thing not to do is other things. If you find yourself saying a sentence that ends with "but we're going to keep working on the startup," you are in big trouble. Bob's going to grad school, but we're going to keep working on the startup. We're moving back to Minnesota, but we're going to keep working on the startup. We're taking on some consulting projects, but we're going to keep working on the startup. You may as well just translate these to "we're giving up on the startup, but we're not willing to admit that to ourselves," because that's what it means most of the time. A startup is so hard that working on it can't be preceded by "but." 281 | 282 | In particular, don't go to graduate school, and don't start other projects. Distraction is fatal to startups. Going to (or back to) school is a huge predictor of death because in addition to the distraction it gives you something to say you're doing. If you're only doing a startup, then if the startup fails, you fail. If you're in grad school and your startup fails, you can say later "Oh yeah, we had this startup on the side when I was in grad school, but it didn't go anywhere." 283 | 284 | You can't use euphemisms like "didn't go anywhere" for something that's your only occupation. People won't let you. 285 | 286 | One of the most interesting things we've discovered from working on Y Combinator is that founders are more motivated by the fear of looking bad than by the hope of getting millions of dollars. So if you want to get millions of dollars, put yourself in a position where failure will be public and humiliating. 287 | 288 | When we first met the founders of Octopart, they seemed very smart, but not a great bet to succeed, because they didn't seem especially committed. One of the two founders was still in grad school. It was the usual story: he'd drop out if it looked like the startup was taking off. 
Since then he has not only dropped out of grad school, but appeared full length in Newsweek with the word "Billionaire" printed across his chest. He just cannot fail now. Everyone he knows has seen that picture. Girls who dissed him in high school have seen it. His mom probably has it on the fridge. It would be unthinkably humiliating to fail now. At this point he is committed to fight to the death. 289 | 290 | I wish every startup we funded could appear in a Newsweek article describing them as the next generation of billionaires, because then none of them would be able to give up. The success rate would be 90%. I'm not kidding. 291 | 292 | When we first knew the Octoparts they were lighthearted, cheery guys. Now when we talk to them they seem grimly determined. The electronic parts distributors are trying to squash them to keep their monopoly pricing. (If it strikes you as odd that people still order electronic parts out of thick paper catalogs in 2007, there's a reason for that. The distributors want to prevent the transparency that comes from having prices online.) I feel kind of bad that we've transformed these guys from lighthearted to grimly determined. But that comes with the territory. If a startup succeeds, you get millions of dollars, and you don't get that kind of money just by asking for it. You have to assume it takes some amount of pain. 293 | 294 | And however tough things get for the Octoparts, I predict they'll succeed. They may have to morph themselves into something totally different, but they won't just crawl off and die. They're smart; they're working in a promising field; and they just cannot give up. 295 | 296 | All of you guys already have the first two. You're all smart and working on promising ideas. Whether you end up among the living or the dead comes down to the third ingredient, not giving up. 297 | 298 | So I'll tell you now: bad shit is coming. It always is in a startup. 
The odds of getting from launch to liquidity without some kind of disaster happening are one in a thousand. So don't get demoralized. When the disaster strikes, just say to yourself, ok, this was what Paul was talking about. What did he say to do? Oh, yeah. Don't give up. -------------------------------------------------------------------------------- /be-good.txt: -------------------------------------------------------------------------------- 1 | Be good 2 | 3 | April 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we started Y Combinator we came up with the 4 | phrase that became our motto: Make something people want. We've 5 | learned a lot since then, but if I were choosing now that's still 6 | the one I'd pick.Another thing we tell founders is not to worry too much about the 7 | business model, at least at first. Not because making money is 8 | unimportant, but because it's so much easier than building something 9 | great.A couple weeks ago I realized that if you put those two ideas 10 | together, you get something surprising. Make something people want. 11 | Don't worry too much about making money. What you've got is a 12 | description of a charity.When you get an unexpected result like this, it could either be a 13 | bug or a new discovery. Either businesses aren't supposed to be 14 | like charities, and we've proven by reductio ad absurdum that one 15 | or both of the principles we began with is false. Or we have a new 16 | idea.I suspect it's the latter, because as soon as this thought occurred 17 | to me, a whole bunch of other things fell into place.ExamplesFor example, Craigslist. It's not a charity, but they run it like 18 | one. And they're astoundingly successful. When you scan down the 19 | list of most popular web sites, the number of employees at Craigslist 20 | looks like a misprint. 
Their revenues aren't as high as they could 21 | be, but most startups would be happy to trade places with them.In Patrick O'Brian's novels, his captains always try to get upwind 22 | of their opponents. If you're upwind, you decide when and if to 23 | engage the other ship. Craigslist is effectively upwind of enormous 24 | revenues. They'd face some challenges if they wanted to make more, 25 | but not the sort you face when you're tacking upwind, trying to 26 | force a crappy product on ambivalent users by spending ten times 27 | as much on sales as on development. [1]I'm not saying startups should aim to end up like Craigslist. 28 | They're a product of unusual circumstances. But they're a good 29 | model for the early phases.Google looked a lot like a charity in the beginning. They didn't 30 | have ads for over a year. At year 1, Google was indistinguishable 31 | from a nonprofit. If a nonprofit or government organization had 32 | started a project to index the web, Google at year 1 is the limit 33 | of what they'd have produced.Back when I was working on spam filters I thought it would be a 34 | good idea to have a web-based email service with good spam filtering. 35 | I wasn't thinking of it as a company. I just wanted to keep people 36 | from getting spammed. But as I thought more about this project, I 37 | realized it would probably have to be a company. It would cost 38 | something to run, and it would be a pain to fund with grants and 39 | donations.That was a surprising realization. Companies often claim to be 40 | benevolent, but it was surprising to realize there were purely 41 | benevolent projects that had to be embodied as companies to work.I didn't want to start another company, so I didn't do it. But if 42 | someone had, they'd probably be quite rich now. There was a window 43 | of about two years when spam was increasing rapidly but all the big 44 | email services had terrible filters. 
If someone had launched a 45 | new, spam-free mail service, users would have flocked to it.Notice the pattern here? From either direction we get to the same 46 | spot. If you start from successful startups, you find they often 47 | behaved like nonprofits. And if you start from ideas for nonprofits, 48 | you find they'd often make good startups.PowerHow wide is this territory? Would all good nonprofits be good 49 | companies? Possibly not. What makes Google so valuable is that 50 | their users have money. If you make people with money love you, 51 | you can probably get some of it. But could you also base a successful 52 | startup on behaving like a nonprofit to people who don't have money? 53 | Could you, for example, grow a successful startup out of curing an 54 | unfashionable but deadly disease like malaria?I'm not sure, but I suspect that if you pushed this idea, you'd be 55 | surprised how far it would go. For example, people who apply to Y 56 | Combinator don't generally have much money, and yet we can profit 57 | by helping them, because with our help they could make money. Maybe 58 | the situation is similar with malaria. Maybe an organization that 59 | helped lift its weight off a country could benefit from the resulting 60 | growth.I'm not proposing this is a serious idea. I don't know anything 61 | about malaria. But I've been kicking ideas around long enough to 62 | know when I come across a powerful one.One way to guess how far an idea extends is to ask yourself at what 63 | point you'd bet against it. The thought of betting against benevolence 64 | is alarming in the same way as saying that something is technically 65 | impossible. You're just asking to be made a fool of, because these 66 | are such powerful forces. [2]For example, initially I thought maybe this principle only applied 67 | to Internet startups. Obviously it worked for Google, but what 68 | about Microsoft? Surely Microsoft isn't benevolent? 
But when I 69 | think back to the beginning, they were. Compared to IBM they were 70 | like Robin Hood. When IBM introduced the PC, they thought they 71 | were going to make money selling hardware at high prices. But by 72 | gaining control of the PC standard, Microsoft opened up the market 73 | to any manufacturer. Hardware prices plummeted, and lots of people 74 | got to have computers who couldn't otherwise have afforded them. 75 | It's the sort of thing you'd expect Google to do.Microsoft isn't so benevolent now. Now when one thinks of what 76 | Microsoft does to users, all the verbs that come to mind begin with 77 | F. [3] And yet it doesn't seem to pay. 78 | Their stock price has been flat for years. Back when they were 79 | Robin Hood, their stock price rose like Google's. Could there be 80 | a connection?You can see how there would be. When you're small, you can't bully 81 | customers, so you have to charm them. Whereas when you're big you 82 | can maltreat them at will, and you tend to, because it's easier 83 | than satisfying them. You grow big by being nice, but you can stay 84 | big by being mean.You get away with it till the underlying conditions change, and 85 | then all your victims escape. So "Don't be evil" may be the most 86 | valuable thing Paul Buchheit made for Google, because it may turn 87 | out to be an elixir of corporate youth. I'm sure they find it 88 | constraining, but think how valuable it will be if it saves them 89 | from lapsing into the fatal laziness that afflicted Microsoft and 90 | IBM.The curious thing is, this elixir is freely available to any other 91 | company. Anyone can adopt "Don't be evil." The catch is that 92 | people will hold you to it. So I don't think you're going to see 93 | record labels or tobacco companies using this discovery.MoraleThere's a lot of external evidence that benevolence works. But how 94 | does it work? 
One advantage of investing in a large number of 95 | startups is that you get a lot of data about how they work. From 96 | what we've seen, being good seems to help startups in three ways: 97 | it improves their morale, it makes other people want to help them, 98 | and above all, it helps them be decisive.Morale is tremendously important to a startup—so important 99 | that morale alone is almost enough to determine success. Startups 100 | are often described as emotional roller-coasters. One minute you're 101 | going to take over the world, and the next you're doomed. The 102 | problem with feeling you're doomed is not just that it makes you 103 | unhappy, but that it makes you stop working. So the downhills 104 | of the roller-coaster are more of a self fulfilling prophecy than 105 | the uphills. If feeling you're going to succeed makes you work 106 | harder, that probably improves your chances of succeeding, but if 107 | feeling you're going to fail makes you stop working, that practically 108 | guarantees you'll fail.Here's where benevolence comes in. If you feel you're really helping 109 | people, you'll keep working even when it seems like your startup 110 | is doomed. Most of us have some amount of natural benevolence. 111 | The mere fact that someone needs you makes you want to help them. 112 | So if you start the kind of startup where users come back each day, 113 | you've basically built yourself a giant tamagotchi. You've made 114 | something you need to take care of.Blogger is a famous example of a startup that went through really 115 | low lows and survived. At one point they ran out of money and 116 | everyone left. Evan Williams came in to work the next day, and there 117 | was no one but him. What kept him going? Partly that users needed 118 | him. He was hosting thousands of people's blogs. 
He couldn't just 119 | let the site die.There are many advantages of launching quickly, but the most important 120 | may be that once you have users, the tamagotchi effect kicks in. 121 | Once you have users to take care of, you're forced to figure out 122 | what will make them happy, and that's actually very valuable 123 | information.The added confidence that comes from trying to help people can 124 | also help you with investors. One of the founders of 125 | Chatterous told 126 | me recently that he and his cofounder had decided that this service 127 | was something the world needed, so they were going to keep working 128 | on it no matter what, even if they had to move back to Canada and live 129 | in their parents' basements.Once they realized this, they stopped caring so much what investors thought 130 | about them. They still met with them, but they weren't going to 131 | die if they didn't get their money. And you know what? The investors 132 | got a lot more interested. They could sense that the Chatterouses 133 | were going to do this startup with or without them.If you're really committed and your startup is cheap to run, you 134 | become very hard to kill. And practically all startups, even the 135 | most successful, come close to death at some point. So if doing 136 | good for people gives you a sense of mission that makes you harder 137 | to kill, that alone more than compensates for whatever you lose by 138 | not choosing a more selfish project.HelpAnother advantage of being good is that it makes other people want 139 | to help you. This too seems to be an inborn trait in humans.One of the startups we've funded, Octopart, is currently locked in 140 | a classic battle of good versus evil. They're a search site for 141 | industrial components. A lot of people need to search for components, 142 | and before Octopart there was no good way to do it. That, it turned 143 | out, was no coincidence.Octopart built the right way to search for components. 
Users like 144 | it and they've been growing rapidly. And yet for most of Octopart's 145 | life, the biggest distributor, Digi-Key, has been trying to force 146 | them to take their prices off the site. Octopart is sending them 147 | customers for free, and yet Digi-Key is trying to make that traffic 148 | stop. Why? Because their current business model depends on 149 | overcharging people who have incomplete information about prices. 150 | They don't want search to work.The Octoparts are the nicest guys in the world. They dropped out 151 | of the PhD program in physics at Berkeley to do this. They just 152 | wanted to fix a problem they encountered in their research. Imagine 153 | how much time you could save the world's engineers if they could 154 | do searches online. So when I hear that a big, evil company is 155 | trying to stop them in order to keep search broken, it makes me 156 | really want to help them. It makes me spend more time on the Octoparts 157 | than I do with most of the other startups we've funded. It just 158 | made me spend several minutes telling you how great they are. Why? 159 | Because they're good guys and they're trying to help the world.If you're benevolent, people will rally around you: investors, 160 | customers, other companies, and potential employees. In the long 161 | term the most important may be the potential employees. I think 162 | everyone knows now that 163 | good hackers are much better than mediocre 164 | ones. If you can attract the best hackers to work for you, as 165 | Google has, you have a big advantage. And the very best hackers 166 | tend to be idealistic. They're not desperate for a job. They can 167 | work wherever they want. So most want to work on things that will 168 | make the world better.CompassBut the most important advantage of being good is that it acts as 169 | a compass. One of the hardest parts of doing a startup is that you 170 | have so many choices.
There are just two or three of you, and a 171 | thousand things you could do. How do you decide?Here's the answer: Do whatever's best for your users. You can hold 172 | onto this like a rope in a hurricane, and it will save you if 173 | anything can. Follow it and it will take you through everything 174 | you need to do.It's even the answer to questions that seem unrelated, like how to 175 | convince investors to give you money. If you're a good salesman, 176 | you could try to just talk them into it. But the more reliable 177 | route is to convince them through your users: if you make something 178 | users love enough to tell their friends, you grow exponentially, 179 | and that will convince any investor.Being good is a particularly useful strategy for making decisions 180 | in complex situations because it's stateless. It's like telling 181 | the truth. The trouble with lying is that you have to remember 182 | everything you've said in the past to make sure you don't contradict 183 | yourself. If you tell the truth you don't have to remember anything, 184 | and that's a really useful property in domains where things happen 185 | fast.For example, Y Combinator has now invested in 80 startups, 57 of 186 | which are still alive. (The rest have died or merged or been 187 | acquired.) When you're trying to advise 57 startups, it turns out 188 | you have to have a stateless algorithm. You can't have ulterior 189 | motives when you have 57 things going on at once, because you can't 190 | remember them. So our rule is just to do whatever's best for the 191 | founders. Not because we're particularly benevolent, but because 192 | it's the only algorithm that works on that scale.When you write something telling people to be good, you seem to be 193 | claiming to be good yourself. So I want to say explicitly that I 194 | am not a particularly good person. When I was a kid I was firmly 195 | in the camp of bad. 
The way adults used the word good, it seemed 196 | to be synonymous with quiet, so I grew up very suspicious of it.You know how there are some people whose names come up in conversation 197 | and everyone says "He's such a great guy?" People never say 198 | that about me. The best I get is "he means well." I am not claiming 199 | to be good. At best I speak good as a second language.So I'm not suggesting you be good in the usual sanctimonious way. 200 | I'm suggesting it because it works. It will work not just as a 201 | statement of "values," but as a guide to strategy, 202 | and even a design spec for software. Don't just not be evil. Be 203 | good.Notes[1] Fifty years ago 204 | it would have seemed shocking for a public company not to pay 205 | dividends. Now many tech companies don't. The markets seem to 206 | have figured out how to value potential dividends. Maybe that isn't 207 | the last step in this evolution. Maybe markets will eventually get 208 | comfortable with potential earnings. (VCs already are, and at least 209 | some of them consistently make money.)I realize this sounds like the stuff one used to hear about the 210 | "new economy" during the Bubble. Believe me, I was not drinking 211 | that kool-aid at the time. But I'm convinced there were some 212 | good 213 | ideas buried in Bubble thinking. For example, it's ok to focus on 214 | growth instead of profits—but only if the growth is genuine. 215 | You can't be buying users; that's a pyramid scheme. But a company 216 | with rapid, genuine growth is valuable, and eventually markets learn 217 | how to value valuable things.[2] The idea of starting 218 | a company with benevolent aims is currently undervalued, because 219 | the kind of people who currently make that their explicit goal don't 220 | usually do a very good job.It's one of the standard career paths of trustafarians to start 221 | some vaguely benevolent business. 
The problem with most of them 222 | is that they either have a bogus political agenda or are feebly 223 | executed. The trustafarians' ancestors didn't get rich by preserving 224 | their traditional culture; maybe people in Bolivia don't want to 225 | either. And starting an organic farm, though it's at least 226 | straightforwardly benevolent, doesn't help people on the scale that 227 | Google does.Most explicitly benevolent projects don't hold themselves sufficiently 228 | accountable. They act as if having good intentions were enough to 229 | guarantee good effects.[3] Users dislike their 230 | new operating system so much that they're starting petitions to 231 | save the old one. And the old one was nothing special. The hackers 232 | within Microsoft must know in their hearts that if the company 233 | really cared about users they'd just advise them to switch to OSX.Thanks to Trevor Blackwell, Paul Buchheit, Jessica Livingston, 234 | and Robert Morris for reading drafts of this. -------------------------------------------------------------------------------- /como_podemos_ayudarte.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-LLM-Bootcamp/data/f776b47268f2d7152ac1accf3ce47472ec1a59e7/como_podemos_ayudarte.pdf -------------------------------------------------------------------------------- /good.txt: -------------------------------------------------------------------------------- 1 | April 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we started Y Combinator we came up with the 2 | phrase that became our motto: Make something people want. We've 3 | learned a lot since then, but if I were choosing now that's still 4 | the one I'd pick.Another thing we tell founders is not to worry too much about the 5 | business model, at least at first. 
Not because making money is 6 | unimportant, but because it's so much easier than building something 7 | great.A couple weeks ago I realized that if you put those two ideas 8 | together, you get something surprising. Make something people want. 9 | Don't worry too much about making money. What you've got is a 10 | description of a charity.When you get an unexpected result like this, it could either be a 11 | bug or a new discovery. Either businesses aren't supposed to be 12 | like charities, and we've proven by reductio ad absurdum that one 13 | or both of the principles we began with is false. Or we have a new 14 | idea.I suspect it's the latter, because as soon as this thought occurred 15 | to me, a whole bunch of other things fell into place.ExamplesFor example, Craigslist. It's not a charity, but they run it like 16 | one. And they're astoundingly successful. When you scan down the 17 | list of most popular web sites, the number of employees at Craigslist 18 | looks like a misprint. Their revenues aren't as high as they could 19 | be, but most startups would be happy to trade places with them.In Patrick O'Brian's novels, his captains always try to get upwind 20 | of their opponents. If you're upwind, you decide when and if to 21 | engage the other ship. Craigslist is effectively upwind of enormous 22 | revenues. They'd face some challenges if they wanted to make more, 23 | but not the sort you face when you're tacking upwind, trying to 24 | force a crappy product on ambivalent users by spending ten times 25 | as much on sales as on development. [1]I'm not saying startups should aim to end up like Craigslist. 26 | They're a product of unusual circumstances. But they're a good 27 | model for the early phases.Google looked a lot like a charity in the beginning. They didn't 28 | have ads for over a year. At year 1, Google was indistinguishable 29 | from a nonprofit. 
If a nonprofit or government organization had 30 | started a project to index the web, Google at year 1 is the limit 31 | of what they'd have produced.Back when I was working on spam filters I thought it would be a 32 | good idea to have a web-based email service with good spam filtering. 33 | I wasn't thinking of it as a company. I just wanted to keep people 34 | from getting spammed. But as I thought more about this project, I 35 | realized it would probably have to be a company. It would cost 36 | something to run, and it would be a pain to fund with grants and 37 | donations.That was a surprising realization. Companies often claim to be 38 | benevolent, but it was surprising to realize there were purely 39 | benevolent projects that had to be embodied as companies to work.I didn't want to start another company, so I didn't do it. But if 40 | someone had, they'd probably be quite rich now. There was a window 41 | of about two years when spam was increasing rapidly but all the big 42 | email services had terrible filters. If someone had launched a 43 | new, spam-free mail service, users would have flocked to it.Notice the pattern here? From either direction we get to the same 44 | spot. If you start from successful startups, you find they often 45 | behaved like nonprofits. And if you start from ideas for nonprofits, 46 | you find they'd often make good startups.PowerHow wide is this territory? Would all good nonprofits be good 47 | companies? Possibly not. What makes Google so valuable is that 48 | their users have money. If you make people with money love you, 49 | you can probably get some of it. But could you also base a successful 50 | startup on behaving like a nonprofit to people who don't have money? 51 | Could you, for example, grow a successful startup out of curing an 52 | unfashionable but deadly disease like malaria?I'm not sure, but I suspect that if you pushed this idea, you'd be 53 | surprised how far it would go. 
For example, people who apply to Y 54 | Combinator don't generally have much money, and yet we can profit 55 | by helping them, because with our help they could make money. Maybe 56 | the situation is similar with malaria. Maybe an organization that 57 | helped lift its weight off a country could benefit from the resulting 58 | growth.I'm not proposing this is a serious idea. I don't know anything 59 | about malaria. But I've been kicking ideas around long enough to 60 | know when I come across a powerful one.One way to guess how far an idea extends is to ask yourself at what 61 | point you'd bet against it. The thought of betting against benevolence 62 | is alarming in the same way as saying that something is technically 63 | impossible. You're just asking to be made a fool of, because these 64 | are such powerful forces. [2]For example, initially I thought maybe this principle only applied 65 | to Internet startups. Obviously it worked for Google, but what 66 | about Microsoft? Surely Microsoft isn't benevolent? But when I 67 | think back to the beginning, they were. Compared to IBM they were 68 | like Robin Hood. When IBM introduced the PC, they thought they 69 | were going to make money selling hardware at high prices. But by 70 | gaining control of the PC standard, Microsoft opened up the market 71 | to any manufacturer. Hardware prices plummeted, and lots of people 72 | got to have computers who couldn't otherwise have afforded them. 73 | It's the sort of thing you'd expect Google to do.Microsoft isn't so benevolent now. Now when one thinks of what 74 | Microsoft does to users, all the verbs that come to mind begin with 75 | F. [3] And yet it doesn't seem to pay. 76 | Their stock price has been flat for years. Back when they were 77 | Robin Hood, their stock price rose like Google's. Could there be 78 | a connection?You can see how there would be. When you're small, you can't bully 79 | customers, so you have to charm them. 
Whereas when you're big you 80 | can maltreat them at will, and you tend to, because it's easier 81 | than satisfying them. You grow big by being nice, but you can stay 82 | big by being mean.You get away with it till the underlying conditions change, and 83 | then all your victims escape. So "Don't be evil" may be the most 84 | valuable thing Paul Buchheit made for Google, because it may turn 85 | out to be an elixir of corporate youth. I'm sure they find it 86 | constraining, but think how valuable it will be if it saves them 87 | from lapsing into the fatal laziness that afflicted Microsoft and 88 | IBM.The curious thing is, this elixir is freely available to any other 89 | company. Anyone can adopt "Don't be evil." The catch is that 90 | people will hold you to it. So I don't think you're going to see 91 | record labels or tobacco companies using this discovery.MoraleThere's a lot of external evidence that benevolence works. But how 92 | does it work? One advantage of investing in a large number of 93 | startups is that you get a lot of data about how they work. From 94 | what we've seen, being good seems to help startups in three ways: 95 | it improves their morale, it makes other people want to help them, 96 | and above all, it helps them be decisive.Morale is tremendously important to a startup—so important 97 | that morale alone is almost enough to determine success. Startups 98 | are often described as emotional roller-coasters. One minute you're 99 | going to take over the world, and the next you're doomed. The 100 | problem with feeling you're doomed is not just that it makes you 101 | unhappy, but that it makes you stop working. So the downhills 102 | of the roller-coaster are more of a self fulfilling prophecy than 103 | the uphills. 
If feeling you're going to succeed makes you work 104 | harder, that probably improves your chances of succeeding, but if 105 | feeling you're going to fail makes you stop working, that practically 106 | guarantees you'll fail.Here's where benevolence comes in. If you feel you're really helping 107 | people, you'll keep working even when it seems like your startup 108 | is doomed. Most of us have some amount of natural benevolence. 109 | The mere fact that someone needs you makes you want to help them. 110 | So if you start the kind of startup where users come back each day, 111 | you've basically built yourself a giant tamagotchi. You've made 112 | something you need to take care of.Blogger is a famous example of a startup that went through really 113 | low lows and survived. At one point they ran out of money and 114 | everyone left. Evan Williams came in to work the next day, and there 115 | was no one but him. What kept him going? Partly that users needed 116 | him. He was hosting thousands of people's blogs. He couldn't just 117 | let the site die.There are many advantages of launching quickly, but the most important 118 | may be that once you have users, the tamagotchi effect kicks in. 119 | Once you have users to take care of, you're forced to figure out 120 | what will make them happy, and that's actually very valuable 121 | information.The added confidence that comes from trying to help people can 122 | also help you with investors. One of the founders of 123 | Chatterous told 124 | me recently that he and his cofounder had decided that this service 125 | was something the world needed, so they were going to keep working 126 | on it no matter what, even if they had to move back to Canada and live 127 | in their parents' basements.Once they realized this, they stopped caring so much what investors thought 128 | about them. They still met with them, but they weren't going to 129 | die if they didn't get their money. And you know what? 
The investors 130 | got a lot more interested. They could sense that the Chatterouses 131 | were going to do this startup with or without them.If you're really committed and your startup is cheap to run, you 132 | become very hard to kill. And practically all startups, even the 133 | most successful, come close to death at some point. So if doing 134 | good for people gives you a sense of mission that makes you harder 135 | to kill, that alone more than compensates for whatever you lose by 136 | not choosing a more selfish project.HelpAnother advantage of being good is that it makes other people want 137 | to help you. This too seems to be an inborn trait in humans.One of the startups we've funded, Octopart, is currently locked in 138 | a classic battle of good versus evil. They're a search site for 139 | industrial components. A lot of people need to search for components, 140 | and before Octopart there was no good way to do it. That, it turned 141 | out, was no coincidence.Octopart built the right way to search for components. Users like 142 | it and they've been growing rapidly. And yet for most of Octopart's 143 | life, the biggest distributor, Digi-Key, has been trying to force 144 | them to take their prices off the site. Octopart is sending them 145 | customers for free, and yet Digi-Key is trying to make that traffic 146 | stop. Why? Because their current business model depends on 147 | overcharging people who have incomplete information about prices. 148 | They don't want search to work.The Octoparts are the nicest guys in the world. They dropped out 149 | of the PhD program in physics at Berkeley to do this. They just 150 | wanted to fix a problem they encountered in their research. Imagine 151 | how much time you could save the world's engineers if they could 152 | do searches online. So when I hear that a big, evil company is 153 | trying to stop them in order to keep search broken, it makes me 154 | really want to help them.
It makes me spend more time on the Octoparts 155 | than I do with most of the other startups we've funded. It just 156 | made me spend several minutes telling you how great they are. Why? 157 | Because they're good guys and they're trying to help the world.If you're benevolent, people will rally around you: investors, 158 | customers, other companies, and potential employees. In the long 159 | term the most important may be the potential employees. I think 160 | everyone knows now that 161 | good hackers are much better than mediocre 162 | ones. If you can attract the best hackers to work for you, as 163 | Google has, you have a big advantage. And the very best hackers 164 | tend to be idealistic. They're not desperate for a job. They can 165 | work wherever they want. So most want to work on things that will 166 | make the world better.CompassBut the most important advantage of being good is that it acts as 167 | a compass. One of the hardest parts of doing a startup is that you 168 | have so many choices. There are just two or three of you, and a 169 | thousand things you could do. How do you decide?Here's the answer: Do whatever's best for your users. You can hold 170 | onto this like a rope in a hurricane, and it will save you if 171 | anything can. Follow it and it will take you through everything 172 | you need to do.It's even the answer to questions that seem unrelated, like how to 173 | convince investors to give you money. If you're a good salesman, 174 | you could try to just talk them into it. But the more reliable 175 | route is to convince them through your users: if you make something 176 | users love enough to tell their friends, you grow exponentially, 177 | and that will convince any investor.Being good is a particularly useful strategy for making decisions 178 | in complex situations because it's stateless. It's like telling 179 | the truth. 
The trouble with lying is that you have to remember 180 | everything you've said in the past to make sure you don't contradict 181 | yourself. If you tell the truth you don't have to remember anything, 182 | and that's a really useful property in domains where things happen 183 | fast.For example, Y Combinator has now invested in 80 startups, 57 of 184 | which are still alive. (The rest have died or merged or been 185 | acquired.) When you're trying to advise 57 startups, it turns out 186 | you have to have a stateless algorithm. You can't have ulterior 187 | motives when you have 57 things going on at once, because you can't 188 | remember them. So our rule is just to do whatever's best for the 189 | founders. Not because we're particularly benevolent, but because 190 | it's the only algorithm that works on that scale.When you write something telling people to be good, you seem to be 191 | claiming to be good yourself. So I want to say explicitly that I 192 | am not a particularly good person. When I was a kid I was firmly 193 | in the camp of bad. The way adults used the word good, it seemed 194 | to be synonymous with quiet, so I grew up very suspicious of it.You know how there are some people whose names come up in conversation 195 | and everyone says "He's such a great guy?" People never say 196 | that about me. The best I get is "he means well." I am not claiming 197 | to be good. At best I speak good as a second language.So I'm not suggesting you be good in the usual sanctimonious way. 198 | I'm suggesting it because it works. It will work not just as a 199 | statement of "values," but as a guide to strategy, 200 | and even a design spec for software. Don't just not be evil. Be 201 | good.Notes[1] Fifty years ago 202 | it would have seemed shocking for a public company not to pay 203 | dividends. Now many tech companies don't. The markets seem to 204 | have figured out how to value potential dividends. Maybe that isn't 205 | the last step in this evolution. 
Maybe markets will eventually get 206 | comfortable with potential earnings. (VCs already are, and at least 207 | some of them consistently make money.) I realize this sounds like the stuff one used to hear about the 208 | "new economy" during the Bubble. Believe me, I was not drinking 209 | that kool-aid at the time. But I'm convinced there were some 210 | good 211 | ideas buried in Bubble thinking. For example, it's ok to focus on 212 | growth instead of profits—but only if the growth is genuine. 213 | You can't be buying users; that's a pyramid scheme. But a company 214 | with rapid, genuine growth is valuable, and eventually markets learn 215 | how to value valuable things. [2] The idea of starting 216 | a company with benevolent aims is currently undervalued, because 217 | the kind of people who currently make that their explicit goal don't 218 | usually do a very good job. It's one of the standard career paths of trustafarians to start 219 | some vaguely benevolent business. The problem with most of them 220 | is that they either have a bogus political agenda or are feebly 221 | executed. The trustafarians' ancestors didn't get rich by preserving 222 | their traditional culture; maybe people in Bolivia don't want to 223 | either. And starting an organic farm, though it's at least 224 | straightforwardly benevolent, doesn't help people on the scale that 225 | Google does. Most explicitly benevolent projects don't hold themselves sufficiently 226 | accountable. They act as if having good intentions were enough to 227 | guarantee good effects. [3] Users dislike their 228 | new operating system so much that they're starting petitions to 229 | save the old one. And the old one was nothing special. The hackers 230 | within Microsoft must know in their hearts that if the company 231 | really cared about users they'd just advise them to switch to OSX. Thanks to Trevor Blackwell, Paul Buchheit, Jessica Livingston, 232 | and Robert Morris for reading drafts of this.
-------------------------------------------------------------------------------- /street_tree_db.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-LLM-Bootcamp/data/f776b47268f2d7152ac1accf3ce47472ec1a59e7/street_tree_db.sqlite -------------------------------------------------------------------------------- /thefuzz-master/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-LLM-Bootcamp/data/f776b47268f2d7152ac1accf3ce47472ec1a59e7/thefuzz-master/.DS_Store -------------------------------------------------------------------------------- /thefuzz-master/.editorconfig: -------------------------------------------------------------------------------- 1 | # .editorconfig 2 | # http://editorconfig.org/ 3 | root = true 4 | 5 | [*] 6 | charset = utf-8 7 | end_of_line = lf 8 | indent_size = 2 9 | indent_style = space 10 | insert_final_newline = true 11 | trim_trailing_whitespace = true 12 | 13 | [*.bat] 14 | end_of_line = crlf 15 | 16 | [*.go] 17 | indent_size = 4 18 | indent_style = tab 19 | 20 | [*.html] 21 | indent_size = 4 22 | 23 | [*Makefile] 24 | indent_size = 4 25 | indent_style = tab 26 | 27 | [*.php] 28 | indent_size = 4 29 | 30 | [*.py] 31 | indent_size = 4 32 | 33 | [*.xml] 34 | indent_size = 4 35 | -------------------------------------------------------------------------------- /thefuzz-master/.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | name: The Fuzz 2 | 3 | on: [push, pull_request, workflow_dispatch] 4 | 5 | jobs: 6 | build: 7 | runs-on: ubuntu-latest 8 | strategy: 9 | fail-fast: false 10 | matrix: 11 | python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"] 12 | test-cmd: [pytest] 13 | include: 14 | - python-version: "3.8" 15 | test-cmd: python setup.py check --restructuredtext --strict --metadata 16 | - python-version: "3.11" 17 | test-cmd: 
python setup.py check --restructuredtext --strict --metadata 18 | steps: 19 | - uses: actions/checkout@v4 20 | - name: Set up Python ${{ matrix.python-version }} 21 | uses: actions/setup-python@v4 22 | with: 23 | python-version: ${{ matrix.python-version }} 24 | allow-prereleases: true 25 | - name: Install dependencies 26 | run: | 27 | python -m pip install --upgrade pip setuptools wheel 28 | pip install pytest pycodestyle docutils Pygments hypothesis 29 | 30 | - name: Install project 31 | run: pip install . 32 | 33 | - name: Test with pytest 34 | run: ${{ matrix.test-cmd }} 35 | -------------------------------------------------------------------------------- /thefuzz-master/.gitignore: -------------------------------------------------------------------------------- 1 | *.py[oc] 2 | 3 | # Temp files 4 | *~ 5 | ~* 6 | .*~ 7 | \#* 8 | .#* 9 | *# 10 | 11 | # Build files 12 | build 13 | dist 14 | pkg 15 | *.egg 16 | *.egg-info 17 | 18 | # Debian Files 19 | debian/files 20 | debian/python-beaver* 21 | 22 | # Sphinx build 23 | doc/_build 24 | 25 | # Generated man page 26 | doc/aws_hostname.1 27 | 28 | # tox 29 | .tox 30 | 31 | # Hypothesis - keep the examples database 32 | .hypothesis/tmp 33 | .hypothesis/unicodedata 34 | .hypothesis 35 | 36 | # pytest 37 | .cache/ 38 | .pytest_cache 39 | __pycache__ 40 | 41 | # Pycharm 42 | .idea/ 43 | 44 | # vscode 45 | .vscode/ 46 | -------------------------------------------------------------------------------- /thefuzz-master/CHANGES.rst: -------------------------------------------------------------------------------- 1 | Changelog 2 | ========= 3 | 4 | 0.17.0 (2018-08-20) 5 | ------------------- 6 | 7 | - Make benchmarks script Py3 compatible. [Stefan Behnel] 8 | 9 | - Add Go lang port. [iddober] 10 | 11 | - Add reference to C# port. [ericcoleman] 12 | 13 | - Chore: remove license header from files. [Jose Diaz-Gonzalez] 14 | 15 | The files should all inherit the projects license. 16 | 17 | 18 | - Fix README title style. 
[Thomas Grainger] 19 | 20 | - Add readme check. [Thomas Grainger] 21 | 22 | install docutils and Pygments 23 | 24 | 25 | - Cache pip. [Thomas Grainger] 26 | 27 | - Upgrade pip/setuptools for hypothesis. [Thomas Grainger] 28 | 29 | - Feat: drop py26 and py33 support from tox. [Jose Diaz-Gonzalez] 30 | 31 | - Feat: drop support for 2.6 in test_thefuzz.py. [Jose Diaz-Gonzalez] 32 | 33 | - Feat: drop reference to 2.4 from readme. [Jose Diaz-Gonzalez] 34 | 35 | - Feat: drop py2.6 and py3.3 classifiers. [Jose Diaz-Gonzalez] 36 | 37 | - Feat: drop 2.6 and 3.3 support. [Jose Diaz-Gonzalez] 38 | 39 | These are no longer supported. Please upgrade your python version if you are using either version. 40 | 41 | - Fuzz: _token_sort: check for equivalence. [Ralf Ramsauer] 42 | 43 | If we don't have to full_process the strings, we can safely assume to 44 | return 100 in case both candidates equal. 45 | 46 | Signed-off-by: Ralf Ramsauer 47 | 48 | 49 | - Test: add more test cases. [Ralf Ramsauer] 50 | 51 | Signed-off-by: Ralf Ramsauer 52 | 53 | 54 | - Utils: add and use check_for_equivalence decorator. [Ralf Ramsauer] 55 | 56 | And decorate basic scoring functions. 57 | 58 | The check_for_equivalence decorator MUST be used after the 59 | check_for_none decorator, as otherwise ratio(None, None) will get a 60 | score of 100. 61 | 62 | This fixes the first part of the recently introduced changes in the test 63 | set. 64 | 65 | Signed-off-by: Ralf Ramsauer 66 | 67 | 68 | - Tests: add some corner cases. [Ralf Ramsauer] 69 | 70 | '' and '' are equal, so are '{' and '{'. Test if thefuzz gives them a 71 | score of 100. 72 | 73 | For the moment, this patch breaks tests, fixes in thefuzz follow. 74 | 75 | Signed-off-by: Ralf Ramsauer 76 | 77 | 78 | - Utils: remove superfluous check. [Ralf Ramsauer] 79 | 80 | Decorators make sure that only non None-values are passed. We can safely 81 | assume that None will never get here. 
82 | 83 | Other than that, None's shouldn't simply be ignored and erroneously 84 | changed to empty strings. Better let users fail. 85 | 86 | This commit doesn't break any tests. 87 | 88 | Signed-off-by: Ralf Ramsauer 89 | 90 | 91 | - README: add missing requirements. [Ralf Ramsauer] 92 | 93 | pycodestyle and hypothesis are required for automatic testing. Add them 94 | to README's requirement section. 95 | 96 | Signed-off-by: Ralf Ramsauer 97 | 98 | 99 | - Remove empty document. [Ralf Ramsauer] 100 | 101 | Signed-off-by: Ralf Ramsauer 102 | 103 | 104 | 0.16.0 (2017-12-18) 105 | ------------------- 106 | 107 | - Add punctuation characters back in so process does something. 108 | [davidcellis] 109 | 110 | - Simpler alphabet and even fewer examples. [davidcellis] 111 | 112 | - Fewer examples and larger deadlines for Hypothesis. [davidcellis] 113 | 114 | - Slightly more examples. [davidcellis] 115 | 116 | - Attempt to fix the failing 2.7 and 3.6 python tests. [davidcellis] 117 | 118 | - Readme: add link to C++ port. [Lizard] 119 | 120 | - Fix tests on Python 3.3. [Jon Banafato] 121 | 122 | Modify tox.ini and .travis.yml to install enum34 when running with 123 | Python 3.3 to allow hypothesis tests to pass. 124 | 125 | 126 | - Normalize Python versions. [Jon Banafato] 127 | 128 | - Enable Travis-CI tests for Python 3.6 129 | - Enable tests for all supported Python versions in tox.ini 130 | - Add Trove classifiers for Python 3.4 - 3.6 to setup.py 131 | 132 | --- 133 | 134 | Note: Python 2.6 and 3.3 are no longer supported by the Python core 135 | team. Support for these can likely be dropped, but that's out of scope 136 | for this change set. 137 | 138 | 139 | - Fix typos. [Sven-Hendrik Haase] 140 | 141 | 0.15.1 (2017-07-19) 142 | ------------------- 143 | 144 | - Fix setup.py (addresses #155) [Paul O'Leary McCann] 145 | 146 | - Merge remote-tracking branch 'upstream/master' into 147 | extract_optimizations. 
[nolan] 148 | 149 | - Seed random before generating benchmark strings. [nolan] 150 | 151 | - Cleaner implementation of same idea without new param, but adding 152 | existing full_process param to Q,W,UQ,UW. [nolan] 153 | 154 | - Fix benchmark only generate list once. [nolan] 155 | 156 | - Only run util.full_process once on query when using extract functions, 157 | add new benchmarks. [nolan] 158 | 159 | 0.15.0 (2017-02-20) 160 | ------------------- 161 | 162 | - Add extras require to install python-levenshtein optionally. [Rolando 163 | Espinoza] 164 | 165 | This allows to install python-levenshtein as dependency. 166 | 167 | 168 | - Fix link formatting in the README. [Alex Chan] 169 | 170 | - Add fuzzball.js JavaScript port link. [nolan] 171 | 172 | - Added Rust Port link. [Logan Collins] 173 | 174 | - Validate_string docstring. [davidcellis] 175 | 176 | - For full comparisons test that ONLY exact matches (after processing) 177 | are added. [davidcellis] 178 | 179 | - Add detailed docstrings to WRatio and QRatio comparisons. 180 | [davidcellis] 181 | 182 | 0.14.0 (2016-11-04) 183 | ------------------- 184 | 185 | - Possible PEP-8 fix + make pep-8 warnings appear in test. [davidcellis] 186 | 187 | - Possible PEP-8 fix. [davidcellis] 188 | 189 | - Possible PEP-8 fix. [davidcellis] 190 | 191 | - Test for stderr log instead of warning. [davidcellis] 192 | 193 | - Convert warning.warn to logging.warning. [davidcellis] 194 | 195 | - Additional details for empty string warning from process. 196 | [davidcellis] 197 | 198 | String formatting fix for python 2.6 199 | 200 | 201 | - Enclose warnings.simplefilter() inside a with statement. [samkennerly] 202 | 203 | 0.13.0 (2016-11-01) 204 | ------------------- 205 | 206 | - Support alternate git status output. [Jose Diaz-Gonzalez] 207 | 208 | - Split warning test into new test file, added to travis execution on 209 | 2.6 / pypy3. [davidcellis] 210 | 211 | - Remove hypothesis examples database from gitignore. 
[davidcellis] 212 | 213 | - Add check for warning to tests. [davidcellis] 214 | 215 | Reordered test imports 216 | 217 | 218 | - Check processor and warn before scorer may remove processor. 219 | [davidcellis] 220 | 221 | - Renamed test - tidied docstring. [davidcellis] 222 | 223 | - Add token ratios to the list of scorers that skip running full_process 224 | as a processor. [davidcellis] 225 | 226 | - Added token_sort, token_set to test. [davidcellis] 227 | 228 | - Test docstrings/comments. [davidcellis] 229 | 230 | Removed redundant check from test. 231 | 232 | 233 | - Added py.test .cache/ removed duplicated build from gitignore. 234 | [davidcellis] 235 | 236 | - Added default_scorer, default_processor parameters to make it easier 237 | to change in the future. [davidcellis] 238 | 239 | Added warning if the processor reduces the input query to an empty string. 240 | 241 | 242 | - Rewrote extracts to explicitly use default values for processor and 243 | scorer. [davidcellis] 244 | 245 | - Changed Hypothesis tests to use pytest parameters. [davidcellis] 246 | 247 | - Added Hypothesis based tests for identical strings. [Ducksual] 248 | 249 | Added support for hypothesis to travis config. 250 | Hypothesis based tests are skipped on Python 2.6 and pypy3. 251 | 252 | Added .hypothesis/ folder to gitignore 253 | 254 | 255 | - Added test for simple 'a, b' string on process.extractOne. [Ducksual] 256 | 257 | - Process the query in process.extractWithoutOrder when using a scorer 258 | which does not do so. [Ducksual] 259 | 260 | Closes 139 261 | 262 | 263 | - Mention that difflib and levenshtein results may differ. [Jose Diaz- 264 | Gonzalez] 265 | 266 | Closes #128 267 | 268 | 0.12.0 (2016-09-14) 269 | ------------------- 270 | 271 | - Declare support for universal wheels. [Thomas Grainger] 272 | 273 | - Clarify that license is GPLv2. [Gareth Tan] 274 | 275 | 0.11.1 (2016-07-27) 276 | ------------------- 277 | 278 | - Add editorconfig.
[Jose Diaz-Gonzalez] 279 | 280 | - Added tox.ini config file for easy local multi-environment testing; 281 | changed travis config to use py.test like tox; updated use of pep8 282 | module to pycodestyle. [Pedro Rodrigues] 283 | 284 | 0.11.0 (2016-06-30) 285 | ------------------- 286 | 287 | - Clean-up. [desmaisons_david] 288 | 289 | - Improving performance. [desmaisons_david] 290 | 291 | - Performance Improvement. [desmaisons_david] 292 | 293 | - Fix link to Levenshtein. [Brian J. McGuirk] 294 | 295 | - Fix readme links. [Brian J. McGuirk] 296 | 297 | - Add license to StringMatcher.py. [Jose Diaz-Gonzalez] 298 | 299 | Closes #113 300 | 301 | 0.10.0 (2016-03-14) 302 | ------------------- 303 | 304 | - Handle None inputs same as empty string (Issue #94) [Nick Miller] 305 | 306 | 0.9.0 (2016-03-07) 307 | ------------------ 308 | 309 | - Pull down all keys when updating local copy. [Jose Diaz-Gonzalez] 310 | 311 | 0.8.2 (2016-02-26) 312 | ------------------ 313 | 314 | - Remove the warning for "slow" sequence matcher on PyPy. [Julian 315 | Berman] 316 | 317 | where it's preferable to use the pure-python implementation. 318 | 319 | 0.8.1 (2016-01-25) 320 | ------------------ 321 | 322 | - Minor release changes. [Jose Diaz-Gonzalez] 323 | 324 | - Clean up wiki link in readme. [Ewan Oglethorpe] 325 | 326 | 0.8.0 (2015-11-16) 327 | ------------------ 328 | 329 | - Refer to Levenshtein distance in readme. Closes #88. [Jose Diaz- 330 | Gonzalez] 331 | 332 | - Added install step for travis to have pep8 available. [Pedro 333 | Rodrigues] 334 | 335 | - Added a pep8 test. The way I add the error 501 to the ignore tuple is 336 | probably wrong but from the docs and source code of pep8 I could not 337 | find any other way. [Pedro Rodrigues] 338 | 339 | I also went ahead and removed the pep8 call from the release file. 340 | 341 | 342 | - Added python 3.5, pypy, and pypy3 to the travis config file.
[Pedro 343 | Rodrigues] 344 | 345 | - Added another step to the release file to run the tests before 346 | releasing. [Pedro Rodrigues] 347 | 348 | - Fixed a few pep8 errors Added a verification step in the release 349 | automation file. This step should probably be somewhere at git level. 350 | [Pedro Rodrigues] 351 | 352 | - Pep8. [Pedro Rodrigues] 353 | 354 | - Leaving TODOs in the code was never a good idea. [Pedro Rodrigues] 355 | 356 | - Changed return values to be rounded integers. [Pedro Rodrigues] 357 | 358 | - Added a test with the recovered data file. [Pedro Rodrigues] 359 | 360 | - Recovered titledata.csv. [Pedro Rodrigues] 361 | 362 | - Move extract test methods into the process test. [Shale Craig] 363 | 364 | Somehow, they ended up in the `RatioTest`, despite asserting that the 365 | `ProcessTest` works. 366 | 367 | 368 | 0.7.0 (2015-10-02) 369 | ------------------ 370 | 371 | - Use portable syntax for catching exception on tests. [Luis Madrigal] 372 | 373 | - [Fix] test against correct variable. [Luis Madrigal] 374 | 375 | - Add unit tests for validator decorators. [Luis Madrigal] 376 | 377 | - Move validators to decorator functions. [Luis Madrigal] 378 | 379 | This allows easier composition and IMO makes the functions more readable 380 | 381 | 382 | - Fix typo: dictionery -> dictionary. [shale] 383 | 384 | - FizzyWuzzy -> TheFuzz typo correction. [shale] 385 | 386 | - Add check for gitchangelog. [Jose Diaz-Gonzalez] 387 | 388 | 0.6.2 (2015-09-03) 389 | ------------------ 390 | 391 | - Ensure the rst-lint binary is available. [Jose Diaz-Gonzalez] 392 | 393 | 0.6.1 (2015-08-07) 394 | ------------------ 395 | 396 | - Minor whitespace changes for PEP8. [Jose Diaz-Gonzalez] 397 | 398 | 0.6.0 (2015-07-20) 399 | ------------------ 400 | 401 | - Added link to a java port. [Andriy Burkov] 402 | 403 | - Patched "name 'unicode' is not defined" python3. 
[Carlos Garay] 404 | 405 | https://github.com/seatgeek/thefuzz/issues/80 406 | 407 | - Make process.extract accept {dict, list}-like choices. [Nathan 408 | Typanski] 409 | 410 | Previously, process.extract expected lists or dictionaries, and tested 411 | this with isinstance() calls. In keeping with the spirit of Python (duck 412 | typing and all that), this change enables one to use extract() on any 413 | dict-like object for dict-like results, or any list-like object for 414 | list-like results. 415 | 416 | So now we can (and, indeed, I've added tests for these uses) call 417 | extract() on things like: 418 | 419 | - a generator of strings ("any iterable") 420 | - a UserDict 421 | - custom user-made classes that "look like" dicts 422 | (or, really, anything with a .items() method that behaves like a dict) 423 | - plain old lists and dicts 424 | 425 | The behavior is exactly the same for previous use cases of 426 | lists-and-dicts. 427 | 428 | This change goes along nicely with PR #68, since those docs suggest 429 | dict-like behavior is valid, and this change makes that true. 430 | 431 | 432 | - Merge conflict. [Adam Cohen] 433 | 434 | - Improve docs for thefuzz.process. [Nathan Typanski] 435 | 436 | The documentation for this module was dated and sometimes inaccurate. 437 | This overhauls the docs to accurately describe the current module, 438 | including detailing optional arguments that were not previously 439 | explained - e.g., limit argument to extract(). 440 | 441 | This change follows the Google Python Style Guide, which may be found 442 | at: 443 | 444 | 445 | 446 | 447 | 0.5.0 (2015-02-04) 448 | ------------------ 449 | 450 | - FIX: 0.4.0 is released, no need to specify 0.3.1 in README. [Josh 451 | Warner (Mac)] 452 | 453 | - Fixed a small typo. [Rostislav Semenov] 454 | 455 | - Reset `processor` and `scorer` defaults to None with argument 456 | checking. [foxxyz] 457 | 458 | - Catch generators without lengths. 
[Jeremiah Lowin] 459 | 460 | - Fixed python3 issue and deprecated assertion method. [foxxyz] 461 | 462 | - Fixed some docstrings, typos, python3 string method compatibility, 463 | some errors that crept in during rebase. [foxxyz] 464 | 465 | - [mod] The lambda in extract is not needed. [Olivier Le Thanh Duong] 466 | 467 | [mod] Pass directly the defaults functions in the args 468 | 469 | [mod] itertools.takewhile() can handle empty list just fine no need to test for it 470 | 471 | [mod] Shorten extractOne by removing double if 472 | 473 | [mod] Use a list comprehension in extract() 474 | 475 | [mod] Autopep8 on process.py 476 | 477 | [doc] Document make_type_consistent 478 | 479 | [mod] bad_chars shortened 480 | 481 | [enh] Move regex compilation outside the method, otherwise we don't get the benefit from it 482 | 483 | [mod] Don't need all the blah just to redefine method from string module 484 | 485 | [mod] Remove unused import 486 | 487 | [mod] Autopep8 on string_processing.py 488 | 489 | [mod] Rewrote asciidammit without recursion to make it more readable 490 | 491 | [mod] Autopep8 on utils.py 492 | 493 | [mod] Remove unused import 494 | 495 | [doc] Add some doc to fuzz.py 496 | 497 | [mod] Move the code to sort string in a separate function 498 | 499 | [doc] Docstrings for WRatio, UWRatio 500 | 501 | 502 | - Add note on which package to install. Closes #67. [Jose Diaz-Gonzalez] 503 | 504 | 0.4.0 (2014-10-31) 505 | ------------------ 506 | 507 | - In extractBests() and extractOne() use '>=' instead of '>' [Юрий 508 | Пайков] 509 | 510 | - Fixed python3 issue with SequenceMatcher import. [Юрий Пайков] 511 | 512 | 0.3.3 (2014-10-22) 513 | ------------------ 514 | 515 | - Fixed issue #59 - "partial" parameter for `_token_set()` is now 516 | honored. [Юрий Пайков] 517 | 518 | - Catch generators without lengths. [Jeremiah Lowin] 519 | 520 | - Remove explicit check for lists.
[Jeremiah Lowin] 521 | 522 | The logic in `process.extract()` should support any Python sequence/iterable. The explicit check for lists is unnecessary and limiting (for example, it forces conversion of generators and other iterable classes to lists). 523 | 524 | 0.3.2 (2014-09-12) 525 | ------------------ 526 | 527 | - Make release command an executable. [Jose Diaz-Gonzalez] 528 | 529 | - Simplify MANIFEST.in. [Jose Diaz-Gonzalez] 530 | 531 | - Add a release script. [Jose Diaz-Gonzalez] 532 | 533 | - Fix readme codeblock. [Jose Diaz-Gonzalez] 534 | 535 | - Minor formatting. [Jose Diaz-Gonzalez] 536 | 537 | - Use __version__ from thefuzz package. [Jose Diaz-Gonzalez] 538 | 539 | - Set __version__ constant in __init__.py. [Jose Diaz-Gonzalez] 540 | 541 | - Rename LICENSE to LICENSE.txt. [Jose Diaz-Gonzalez] 542 | 543 | 0.3.0 (2014-08-24) 544 | ------------------ 545 | 546 | - Test dict input to extractOne() [jamesnunn] 547 | 548 | - Remove whitespace. [jamesnunn] 549 | 550 | - Choices parameter for extract() accepts both dict and list objects. 551 | [jamesnunn] 552 | 553 | - Enable automated testing with Python 3.4. [Corey Farwell] 554 | 555 | - Fixed typo: lettters -> letters. [Tal Einat] 556 | 557 | - Fixing LICENSE and README's license info. [Dallas Gutauckis] 558 | 559 | - Proper ordered list. [Jeff Paine] 560 | 561 | - Convert README to rst. [Jeff Paine] 562 | 563 | - Add requirements.txt per discussion in #44. [Jeff Paine] 564 | 565 | - Add LICENSE TO MANIFEST.in. [Jeff Paine] 566 | 567 | - Rename tests.py to more common test_thefuzz.py. [Jeff Paine] 568 | 569 | - Add proper MANIFEST template. [Jeff Paine] 570 | 571 | - Remove MANIFEST file Not meant to be kept in version control. [Jeff 572 | Paine] 573 | 574 | - Remove unused file. [Jeff Paine] 575 | 576 | - Pep8. [Jeff Paine] 577 | 578 | - Pep8 formatting. [Jeff Paine] 579 | 580 | - Pep8 formatting. [Jeff Paine] 581 | 582 | - Pep8 indentations. [Jeff Paine] 583 | 584 | - Pep8 cleanup. 
[Jeff Paine] 585 | 586 | - Pep8. [Jeff Paine] 587 | 588 | - Pep8 cleanup. [Jeff Paine] 589 | 590 | - Pep8 cleanup. [Jeff Paine] 591 | 592 | - Pep8 import style. [Jeff Paine] 593 | 594 | - Pep8 import ordering. [Jeff Paine] 595 | 596 | - Pep8 import ordering. [Jeff Paine] 597 | 598 | - Remove unused module. [Jeff Paine] 599 | 600 | - Pep8 import ordering. [Jeff Paine] 601 | 602 | - Remove unused module. [Jeff Paine] 603 | 604 | - Pep8 import ordering. [Jeff Paine] 605 | 606 | - Remove unused imports. [Jeff Paine] 607 | 608 | - Remove unused module. [Jeff Paine] 609 | 610 | - Remove import * where present. [Jeff Paine] 611 | 612 | - Avoid import * [Jeff Paine] 613 | 614 | - Add Travis CI badge. [Jeff Paine] 615 | 616 | - Remove python 2.4, 2.5 from Travis (not supported) [Jeff Paine] 617 | 618 | - Add python 2.4 and 2.5 to Travis. [Jeff Paine] 619 | 620 | - Add all supported python versions to travis. [Jeff Paine] 621 | 622 | - Bump minor version number. [Jeff Paine] 623 | 624 | - Add classifiers for python versions. [Jeff Paine] 625 | 626 | - Added note about python-Levenshtein speedup. Closes #34. [Jose Diaz- 627 | Gonzalez] 628 | 629 | - Fixed tests on 2.6. [Grigi] 630 | 631 | - Fixed py2.6. [Grigi] 632 | 633 | - Force bad_chars to ascii. [Grigi] 634 | 635 | - Since importing unicode_literals, u decorator not required on strings 636 | from py2.6 and up. [Grigi] 637 | 638 | - Py3 support without 2to3. [Grigi] 639 | 640 | - Created: Added .travis.yml. [futoase] 641 | 642 | - [enh] Add docstrings to process.py. [Olivier Le Thanh Duong] 643 | 644 | Turn the existing comments into docstrings so they can be seen via introspection 645 | 646 | 647 | - Don't condense multiple punctuation characters to a single whitespace. 648 | this is a behavioral change. [Adam Cohen] 649 | 650 | - UQRatio and UWRatio shorthands. [Adam Cohen] 651 | 652 | - Version 0.2. [Adam Cohen] 653 | 654 | - Unicode/string comparison bug.
[Adam Cohen] 655 | 656 | - To maintain backwards compatibility, default is to force_ascii as 657 | before. [Adam Cohen] 658 | 659 | - Fix merge conflict. [Adam Cohen] 660 | 661 | - New process function: extractBests. [Flávio Juvenal] 662 | 663 | - More readable reverse sorting. [Flávio Juvenal] 664 | 665 | - Further honoring of force_ascii. [Adam Cohen] 666 | 667 | - Indentation fix. [Adam Cohen] 668 | 669 | - Handle force_ascii in fuzz methods. [Adam Cohen] 670 | 671 | - Add back relevant tests. [Adam Cohen] 672 | 673 | - Utility method to make things consistent. [Adam Cohen] 674 | 675 | - Re-commit asciidammit and add a parameter to full_process to determine 676 | behavior. [Adam Cohen] 677 | 678 | - Added a test for non letters/digits replacements. [Tristan Launay] 679 | 680 | - ENG-741 fixed benchmark line length. [Laurent Erignoux] 681 | 682 | - Fixed Unicode flag for tests. [Tristan Launay] 683 | 684 | - ENG-741 commented code removed not erased for review from creator. 685 | [Laurent Erignoux] 686 | 687 | - ENG-741 cut long lines in fuzzy wizzy benchmark. [Laurent Erignoux] 688 | 689 | - Re-upped the limit on benchmark, now that performance is not an issue 690 | anymore. [Tristan Launay] 691 | 692 | - Fixed comment. [Tristan Launay] 693 | 694 | - Simplified processing of strings with built-in regex code in python. 695 | Also fixed empty string detection in token_sort_ratio. [Tristan 696 | Launay] 697 | 698 | - Proper benchmark display. Introduce methods to explicitly do all the 699 | unicode preprocessing *before* using fuzz lib. [Tristan Launay] 700 | 701 | - ENG-741: having a true benchmark, to see when we improve stuff. 702 | [Benjamin Combourieu] 703 | 704 | - Unicode support in benchmark.py. [Benjamin Combourieu] 705 | 706 | - Added file for processing strings. [Tristan Launay] 707 | 708 | - Uniform treatment of strings in Unicode. Non-ASCII chars are now 709 | considered in strings, which allows for matches in Cyrillic, Chinese, 710 | Greek, etc. 
[Tristan Launay] 711 | 712 | - Fixed bug in _token_set. [Michael Edward] 713 | 714 | - Removed reference to PR. [Jose Diaz-Gonzalez] 715 | 716 | - Sadist build and virtualenv dirs are not part of the project. [Pedro 717 | Rodrigues] 718 | 719 | - Fixes https://github.com/seatgeek/thefuzz/issues/10 and correctly 720 | points to README.textile. [Pedro Rodrigues] 721 | 722 | - Info on the pull request. [Pedro Rodrigues] 723 | 724 | - Pullstat.us button. [Pedro Rodrigues] 725 | 726 | - Fuzzywuzzy really needs better benchmarks. [Pedro Rodrigues] 727 | 728 | - Moved tests and benchmarks out of the package. [Pedro Rodrigues] 729 | 730 | - Report better ratio()s redundant import try. [Pedro Rodrigues] 731 | 732 | - AssertGreater did not exist in python 2.4. [Pedro Rodrigues] 733 | 734 | - Remove debug output. [Adam Cohen] 735 | 736 | - Looks for python-Levenshtein package, and if present, uses that 737 | instead of difflib. 10x speedup if present. add benchmarks. [Adam 738 | Cohen] 739 | 740 | - Add gitignore. [Adam Cohen] 741 | 742 | - Fix a bug in WRatio, as well as an issue in full_process, which was 743 | failing on strings with all unicode characters. [Adam Cohen] 744 | 745 | - Error in partial_ratio. closes #7. [Adam Cohen] 746 | 747 | - Adding some real-life event data for benchmarking. [Adam Cohen] 748 | 749 | - Cleaned up utils.py. [Pedro Rodrigues] 750 | 751 | - Optimized speed for full_process() [Pedro Rodrigues] 752 | 753 | - Speed improvements to asciidammit. [Pedro Rodrigues] 754 | 755 | - Removed old versions of validate_string() and remove_ponctuation() 756 | kept from previous commits. [Pedro Rodrigues] 757 | 758 | - Issue #6 from github updated license headers to match MIT license. 759 | [Pedro Rodrigues] 760 | 761 | - Clean up. [Pedro Rodrigues] 762 | 763 | - Changes to utils.validate_string() and benchmarks. [Pedro Rodrigues] 764 | 765 | - Some benchmarks to test the changes made to remove_punctuation. 
[Pedro 766 | Rodrigues] 767 | 768 | - Faster remove_punctuation. [Pedro Rodrigues] 769 | 770 | - AssertIsNone did not exist in Python 2.4. [Pedro Rodrigues] 771 | 772 | - Just adding some simple install instructions for pip. [Chris Dary] 773 | 774 | - Check for null/empty strings in QRatio and WRatio. Add tests. Closes 775 | #3. [Adam Cohen] 776 | 777 | - More README. [Adam Cohen] 778 | 779 | - README. [Adam Cohen] 780 | 781 | - README. [Adam Cohen] 782 | 783 | - Slight change to README. [Adam Cohen] 784 | 785 | - Some readme. [Adam Cohen] 786 | 787 | - Distutils. [Adam Cohen] 788 | 789 | - Change directory structure. [Adam Cohen] 790 | 791 | - Initial commit. [Adam Cohen] 792 | 793 | 794 | -------------------------------------------------------------------------------- /thefuzz-master/LICENSE.txt: -------------------------------------------------------------------------------- 1 | 2 | GNU GENERAL PUBLIC LICENSE 3 | Version 2, June 1991 4 | 5 | Copyright (C) 1989, 1991 Free Software Foundation, Inc. 6 | 59 Temple Place, Suite 330, Boston, MA 02111 USA 7 | Everyone is permitted to copy and distribute verbatim copies 8 | of this license document, but changing it is not allowed. 9 | 10 | Preamble 11 | 12 | The licenses for most software are designed to take away your 13 | freedom to share and change it. By contrast, the GNU General Public 14 | License is intended to guarantee your freedom to share and change free 15 | software--to make sure the software is free for all its users. This 16 | General Public License applies to most of the Free Software 17 | Foundation's software and to any other program whose authors commit to 18 | using it. (Some other Free Software Foundation software is covered by 19 | the GNU Library General Public License instead.) You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. 
Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | this service if you wish), that you receive source code or can get it 26 | if you want it, that you can change the software or use pieces of it 27 | in new free programs; and that you know you can do these things. 28 | 29 | To protect your rights, we need to make restrictions that forbid 30 | anyone to deny you these rights or to ask you to surrender the rights. 31 | These restrictions translate to certain responsibilities for you if you 32 | distribute copies of the software, or if you modify it. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must give the recipients all the rights that 36 | you have. You must make sure that they, too, receive or can get the 37 | source code. And you must show them these terms so they know their 38 | rights. 39 | 40 | We protect your rights with two steps: (1) copyright the software, and 41 | (2) offer you this license which gives you legal permission to copy, 42 | distribute and/or modify the software. 43 | 44 | Also, for each author's protection and ours, we want to make certain 45 | that everyone understands that there is no warranty for this free 46 | software. If the software is modified by someone else and passed on, we 47 | want its recipients to know that what they have is not the original, so 48 | that any problems introduced by others will not reflect on the original 49 | authors' reputations. 50 | 51 | Finally, any free program is threatened constantly by software 52 | patents. We wish to avoid the danger that redistributors of a free 53 | program will individually obtain patent licenses, in effect making the 54 | program proprietary. To prevent this, we have made it clear that any 55 | patent must be licensed for everyone's free use or not licensed at all. 
56 | 57 | The precise terms and conditions for copying, distribution and 58 | modification follow. 59 | 60 | GNU GENERAL PUBLIC LICENSE 61 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 62 | 63 | 0. This License applies to any program or other work which contains 64 | a notice placed by the copyright holder saying it may be distributed 65 | under the terms of this General Public License. The "Program", below, 66 | refers to any such program or work, and a "work based on the Program" 67 | means either the Program or any derivative work under copyright law: 68 | that is to say, a work containing the Program or a portion of it, 69 | either verbatim or with modifications and/or translated into another 70 | language. (Hereinafter, translation is included without limitation in 71 | the term "modification".) Each licensee is addressed as "you". 72 | 73 | Activities other than copying, distribution and modification are not 74 | covered by this License; they are outside its scope. The act of 75 | running the Program is not restricted, and the output from the Program 76 | is covered only if its contents constitute a work based on the 77 | Program (independent of having been made by running the Program). 78 | Whether that is true depends on what the Program does. 79 | 80 | 1. You may copy and distribute verbatim copies of the Program's 81 | source code as you receive it, in any medium, provided that you 82 | conspicuously and appropriately publish on each copy an appropriate 83 | copyright notice and disclaimer of warranty; keep intact all the 84 | notices that refer to this License and to the absence of any warranty; 85 | and give any other recipients of the Program a copy of this License 86 | along with the Program. 87 | 88 | You may charge a fee for the physical act of transferring a copy, and 89 | you may at your option offer warranty protection in exchange for a fee. 90 | 91 | 2. 
You may modify your copy or copies of the Program or any portion 92 | of it, thus forming a work based on the Program, and copy and 93 | distribute such modifications or work under the terms of Section 1 94 | above, provided that you also meet all of these conditions: 95 | 96 | a) You must cause the modified files to carry prominent notices 97 | stating that you changed the files and the date of any change. 98 | 99 | b) You must cause any work that you distribute or publish, that in 100 | whole or in part contains or is derived from the Program or any 101 | part thereof, to be licensed as a whole at no charge to all third 102 | parties under the terms of this License. 103 | 104 | c) If the modified program normally reads commands interactively 105 | when run, you must cause it, when started running for such 106 | interactive use in the most ordinary way, to print or display an 107 | announcement including an appropriate copyright notice and a 108 | notice that there is no warranty (or else, saying that you provide 109 | a warranty) and that users may redistribute the program under 110 | these conditions, and telling the user how to view a copy of this 111 | License. (Exception: if the Program itself is interactive but 112 | does not normally print such an announcement, your work based on 113 | the Program is not required to print an announcement.) 114 | 115 | These requirements apply to the modified work as a whole. If 116 | identifiable sections of that work are not derived from the Program, 117 | and can be reasonably considered independent and separate works in 118 | themselves, then this License, and its terms, do not apply to those 119 | sections when you distribute them as separate works. 
But when you 120 | distribute the same sections as part of a whole which is a work based 121 | on the Program, the distribution of the whole must be on the terms of 122 | this License, whose permissions for other licensees extend to the 123 | entire whole, and thus to each and every part regardless of who wrote it. 124 | 125 | Thus, it is not the intent of this section to claim rights or contest 126 | your rights to work written entirely by you; rather, the intent is to 127 | exercise the right to control the distribution of derivative or 128 | collective works based on the Program. 129 | 130 | In addition, mere aggregation of another work not based on the Program 131 | with the Program (or with a work based on the Program) on a volume of 132 | a storage or distribution medium does not bring the other work under 133 | the scope of this License. 134 | 135 | 3. You may copy and distribute the Program (or a work based on it, 136 | under Section 2) in object code or executable form under the terms of 137 | Sections 1 and 2 above provided that you also do one of the following: 138 | 139 | a) Accompany it with the complete corresponding machine-readable 140 | source code, which must be distributed under the terms of Sections 141 | 1 and 2 above on a medium customarily used for software interchange; or, 142 | 143 | b) Accompany it with a written offer, valid for at least three 144 | years, to give any third party, for a charge no more than your 145 | cost of physically performing source distribution, a complete 146 | machine-readable copy of the corresponding source code, to be 147 | distributed under the terms of Sections 1 and 2 above on a medium 148 | customarily used for software interchange; or, 149 | 150 | c) Accompany it with the information you received as to the offer 151 | to distribute corresponding source code. 
(This alternative is 152 | allowed only for noncommercial distribution and only if you 153 | received the program in object code or executable form with such 154 | an offer, in accord with Subsection b above.) 155 | 156 | The source code for a work means the preferred form of the work for 157 | making modifications to it. For an executable work, complete source 158 | code means all the source code for all modules it contains, plus any 159 | associated interface definition files, plus the scripts used to 160 | control compilation and installation of the executable. However, as a 161 | special exception, the source code distributed need not include 162 | anything that is normally distributed (in either source or binary 163 | form) with the major components (compiler, kernel, and so on) of the 164 | operating system on which the executable runs, unless that component 165 | itself accompanies the executable. 166 | 167 | If distribution of executable or object code is made by offering 168 | access to copy from a designated place, then offering equivalent 169 | access to copy the source code from the same place counts as 170 | distribution of the source code, even though third parties are not 171 | compelled to copy the source along with the object code. 172 | 173 | 4. You may not copy, modify, sublicense, or distribute the Program 174 | except as expressly provided under this License. Any attempt 175 | otherwise to copy, modify, sublicense or distribute the Program is 176 | void, and will automatically terminate your rights under this License. 177 | However, parties who have received copies, or rights, from you under 178 | this License will not have their licenses terminated so long as such 179 | parties remain in full compliance. 180 | 181 | 5. You are not required to accept this License, since you have not 182 | signed it. However, nothing else grants you permission to modify or 183 | distribute the Program or its derivative works. 
These actions are 184 | prohibited by law if you do not accept this License. Therefore, by 185 | modifying or distributing the Program (or any work based on the 186 | Program), you indicate your acceptance of this License to do so, and 187 | all its terms and conditions for copying, distributing or modifying 188 | the Program or works based on it. 189 | 190 | 6. Each time you redistribute the Program (or any work based on the 191 | Program), the recipient automatically receives a license from the 192 | original licensor to copy, distribute or modify the Program subject to 193 | these terms and conditions. You may not impose any further 194 | restrictions on the recipients' exercise of the rights granted herein. 195 | You are not responsible for enforcing compliance by third parties to 196 | this License. 197 | 198 | 7. If, as a consequence of a court judgment or allegation of patent 199 | infringement or for any other reason (not limited to patent issues), 200 | conditions are imposed on you (whether by court order, agreement or 201 | otherwise) that contradict the conditions of this License, they do not 202 | excuse you from the conditions of this License. If you cannot 203 | distribute so as to satisfy simultaneously your obligations under this 204 | License and any other pertinent obligations, then as a consequence you 205 | may not distribute the Program at all. For example, if a patent 206 | license would not permit royalty-free redistribution of the Program by 207 | all those who receive copies directly or indirectly through you, then 208 | the only way you could satisfy both it and this License would be to 209 | refrain entirely from distribution of the Program. 210 | 211 | If any portion of this section is held invalid or unenforceable under 212 | any particular circumstance, the balance of the section is intended to 213 | apply and the section as a whole is intended to apply in other 214 | circumstances. 
215 | 216 | It is not the purpose of this section to induce you to infringe any 217 | patents or other property right claims or to contest validity of any 218 | such claims; this section has the sole purpose of protecting the 219 | integrity of the free software distribution system, which is 220 | implemented by public license practices. Many people have made 221 | generous contributions to the wide range of software distributed 222 | through that system in reliance on consistent application of that 223 | system; it is up to the author/donor to decide if he or she is willing 224 | to distribute software through any other system and a licensee cannot 225 | impose that choice. 226 | 227 | This section is intended to make thoroughly clear what is believed to 228 | be a consequence of the rest of this License. 229 | 230 | 8. If the distribution and/or use of the Program is restricted in 231 | certain countries either by patents or by copyrighted interfaces, the 232 | original copyright holder who places the Program under this License 233 | may add an explicit geographical distribution limitation excluding 234 | those countries, so that distribution is permitted only in or among 235 | countries not thus excluded. In such case, this License incorporates 236 | the limitation as if written in the body of this License. 237 | 238 | 9. The Free Software Foundation may publish revised and/or new versions 239 | of the General Public License from time to time. Such new versions will 240 | be similar in spirit to the present version, but may differ in detail to 241 | address new problems or concerns. 242 | 243 | Each version is given a distinguishing version number. If the Program 244 | specifies a version number of this License which applies to it and "any 245 | later version", you have the option of following the terms and conditions 246 | either of that version or of any later version published by the Free 247 | Software Foundation. 
If the Program does not specify a version number of 248 | this License, you may choose any version ever published by the Free Software 249 | Foundation. 250 | 251 | 10. If you wish to incorporate parts of the Program into other free 252 | programs whose distribution conditions are different, write to the author 253 | to ask for permission. For software which is copyrighted by the Free 254 | Software Foundation, write to the Free Software Foundation; we sometimes 255 | make exceptions for this. Our decision will be guided by the two goals 256 | of preserving the free status of all derivatives of our free software and 257 | of promoting the sharing and reuse of software generally. 258 | 259 | NO WARRANTY 260 | 261 | 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 262 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN 263 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 264 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED 265 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 266 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS 267 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE 268 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, 269 | REPAIR OR CORRECTION. 270 | 271 | 12. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 272 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR 273 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 274 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING 275 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED 276 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY 277 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER 278 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 279 | POSSIBILITY OF SUCH DAMAGES. 280 | 281 | END OF TERMS AND CONDITIONS 282 | 283 | Appendix: How to Apply These Terms to Your New Programs 284 | 285 | If you develop a new program, and you want it to be of the greatest 286 | possible use to the public, the best way to achieve this is to make it 287 | free software which everyone can redistribute and change under these terms. 288 | 289 | To do so, attach the following notices to the program. It is safest 290 | to attach them to the start of each source file to most effectively 291 | convey the exclusion of warranty; and each file should have at least 292 | the "copyright" line and a pointer to where the full notice is found. 293 | 294 | 295 | Copyright (C) 19yy 296 | 297 | This program is free software; you can redistribute it and/or modify 298 | it under the terms of the GNU General Public License as published by 299 | the Free Software Foundation; either version 2 of the License, or 300 | (at your option) any later version. 301 | 302 | This program is distributed in the hope that it will be useful, 303 | but WITHOUT ANY WARRANTY; without even the implied warranty of 304 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 305 | GNU General Public License for more details. 
306 | 307 | You should have received a copy of the GNU General Public License 308 | along with this program; if not, write to the Free Software 309 | Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111 USA 310 | 311 | Also add information on how to contact you by electronic and paper mail. 312 | 313 | If the program is interactive, make it output a short notice like this 314 | when it starts in an interactive mode: 315 | 316 | Gnomovision version 69, Copyright (C) 19yy name of author 317 | Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 318 | This is free software, and you are welcome to redistribute it 319 | under certain conditions; type `show c' for details. 320 | 321 | The hypothetical commands `show w' and `show c' should show the appropriate 322 | parts of the General Public License. Of course, the commands you use may 323 | be called something other than `show w' and `show c'; they could even be 324 | mouse-clicks or menu items--whatever suits your program. 325 | 326 | You should also get your employer (if you work as a programmer) or your 327 | school, if any, to sign a "copyright disclaimer" for the program, if 328 | necessary. Here is a sample; alter the names: 329 | 330 | Yoyodyne, Inc., hereby disclaims all copyright interest in the program 331 | `Gnomovision' (which makes passes at compilers) written by James Hacker. 332 | 333 | , 1 April 1989 334 | Ty Coon, President of Vice 335 | 336 | This General Public License does not permit incorporating your program into 337 | proprietary programs. If your program is a subroutine library, you may 338 | consider it more useful to permit linking proprietary applications with the 339 | library. If this is what you want to do, use the GNU Library General 340 | Public License instead of this License. 
-------------------------------------------------------------------------------- /thefuzz-master/MANIFEST.in: -------------------------------------------------------------------------------- 1 | include *.txt 2 | include *.rst 3 | include test_thefuzz.py 4 | -------------------------------------------------------------------------------- /thefuzz-master/README.rst: -------------------------------------------------------------------------------- 1 | .. image:: https://github.com/seatgeek/thefuzz/actions/workflows/ci.yml/badge.svg 2 | :target: https://github.com/seatgeek/thefuzz 3 | 4 | TheFuzz 5 | ======= 6 | 7 | Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package. 8 | 9 | Requirements 10 | ============ 11 | 12 | - Python 3.8 or higher 13 | - rapidfuzz 14 | 15 | For testing 16 | ~~~~~~~~~~~ 17 | - pycodestyle 18 | - hypothesis 19 | - pytest 20 | 21 | Installation 22 | ============ 23 | 24 | Using pip via PyPI 25 | 26 | .. code:: bash 27 | 28 | pip install thefuzz 29 | 30 | 31 | Using pip via GitHub 32 | 33 | .. code:: bash 34 | 35 | pip install git+https://github.com/seatgeek/thefuzz.git@0.19.0#egg=thefuzz 36 | 37 | Adding to your ``requirements.txt`` file (run ``pip install -r requirements.txt`` afterwards) 38 | 39 | .. code:: bash 40 | 41 | git+ssh://git@github.com/seatgeek/thefuzz.git@0.19.0#egg=thefuzz 42 | 43 | Manually via Git 44 | 45 | .. code:: bash 46 | 47 | git clone https://github.com/seatgeek/thefuzz.git thefuzz 48 | cd thefuzz 49 | python setup.py install 50 | 51 | 52 | Usage 53 | ===== 54 | 55 | .. code:: python 56 | 57 | >>> from thefuzz import fuzz 58 | >>> from thefuzz import process 59 | 60 | Simple Ratio 61 | ~~~~~~~~~~~~ 62 | 63 | .. code:: python 64 | 65 | >>> fuzz.ratio("this is a test", "this is a test!") 66 | 97 67 | 68 | Partial Ratio 69 | ~~~~~~~~~~~~~ 70 | 71 | .. 
code:: python 72 | 73 | >>> fuzz.partial_ratio("this is a test", "this is a test!") 74 | 100 75 | 76 | Token Sort Ratio 77 | ~~~~~~~~~~~~~~~~ 78 | 79 | .. code:: python 80 | 81 | >>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear") 82 | 91 83 | >>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear") 84 | 100 85 | 86 | Token Set Ratio 87 | ~~~~~~~~~~~~~~~ 88 | 89 | .. code:: python 90 | 91 | >>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear") 92 | 84 93 | >>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear") 94 | 100 95 | 96 | Partial Token Sort Ratio 97 | ~~~~~~~~~~~~~~~~~~~~~~~~ 98 | 99 | .. code:: python 100 | 101 | >>> fuzz.token_sort_ratio("fuzzy was a bear", "wuzzy fuzzy was a bear") 102 | 84 103 | >>> fuzz.partial_token_sort_ratio("fuzzy was a bear", "wuzzy fuzzy was a bear") 104 | 100 105 | 106 | Process 107 | ~~~~~~~ 108 | 109 | .. code:: python 110 | 111 | >>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"] 112 | >>> process.extract("new york jets", choices, limit=2) 113 | [('New York Jets', 100), ('New York Giants', 78)] 114 | >>> process.extractOne("cowboys", choices) 115 | ("Dallas Cowboys", 90) 116 | 117 | You can also pass additional parameters to ``extractOne`` method to make it use a specific scorer. A typical use case is to match file paths: 118 | 119 | .. code:: python 120 | 121 | >>> process.extractOne("System of a down - Hypnotize - Heroin", songs) 122 | ('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86) 123 | >>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio) 124 | ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61) 125 | 126 | .. 
|Build Status| image:: https://github.com/seatgeek/thefuzz/actions/workflows/ci.yml/badge.svg 127 | :target: https://github.com/seatgeek/thefuzz 128 | -------------------------------------------------------------------------------- /thefuzz-master/benchmarks.py: -------------------------------------------------------------------------------- 1 | from timeit import timeit 2 | import math 3 | import csv 4 | 5 | iterations = 100000 6 | 7 | 8 | reader = csv.DictReader(open('data/titledata.csv'), delimiter='|') 9 | titles = [i['custom_title'] for i in reader] 10 | title_blob = '\n'.join(titles) 11 | 12 | 13 | cirque_strings = [ 14 | "cirque du soleil - zarkana - las vegas", 15 | "cirque du soleil ", 16 | "cirque du soleil las vegas", 17 | "zarkana las vegas", 18 | "las vegas cirque du soleil at the bellagio", 19 | "zarakana - cirque du soleil - bellagio" 20 | ] 21 | 22 | choices = [ 23 | "", 24 | "new york yankees vs boston red sox", 25 | "", 26 | "zarakana - cirque du soleil - bellagio", 27 | None, 28 | "cirque du soleil las vegas", 29 | None 30 | ] 31 | 32 | mixed_strings = [ 33 | "Lorem Ipsum is simply dummy text of the printing and typesetting industry.", 34 | "C\\'est la vie", 35 | "Ça va?", 36 | "Cães danados", 37 | "\xacCamarões assados", 38 | "a\xac\u1234\u20ac\U00008000" 39 | ] 40 | 41 | common_setup = "from thefuzz import fuzz, utils; " 42 | 43 | 44 | def print_result_from_timeit(stmt='pass', setup='pass', number=1000000): 45 | """ 46 | Clean function to know how much time took the execution of one statement 47 | """ 48 | units = ["s", "ms", "us", "ns"] 49 | duration = timeit(stmt, setup, number=int(number)) 50 | avg_duration = duration / float(number) 51 | thousands = int(math.floor(math.log(avg_duration, 1000))) 52 | 53 | print("Total time: {:f}s. 
Average run: {:.3f}{}.".format( 54 | duration, avg_duration * (1000 ** -thousands), units[-thousands])) 55 | 56 | 57 | for s in mixed_strings + cirque_strings + choices: 58 | print('Test full_process for: "%s"' % s) 59 | print_result_from_timeit('utils.full_process(u\'%s\')' % s, 60 | common_setup, number=iterations) 61 | 62 | # benchmarking the core matching methods... 63 | 64 | for s in cirque_strings: 65 | print('Test fuzz.ratio for string: "%s"' % s) 66 | print('-------------------------------') 67 | print_result_from_timeit('fuzz.ratio(u\'cirque du soleil\', u\'%s\')' % s, 68 | common_setup, number=iterations / 100) 69 | 70 | for s in cirque_strings: 71 | print('Test fuzz.partial_ratio for string: "%s"' % s) 72 | print('-------------------------------') 73 | print_result_from_timeit('fuzz.partial_ratio(u\'cirque du soleil\', u\'%s\')' 74 | % s, common_setup, number=iterations / 100) 75 | 76 | for s in cirque_strings: 77 | print('Test fuzz.WRatio for string: "%s"' % s) 78 | print('-------------------------------') 79 | print_result_from_timeit('fuzz.WRatio(u\'cirque du soleil\', u\'%s\')' % s, 80 | common_setup, number=iterations / 100) 81 | 82 | print('Test process.extract(scorer = fuzz.QRatio) for string: "%s"' % s) 83 | print('-------------------------------') 84 | print_result_from_timeit('process.extract(u\'cirque du soleil\', choices, scorer = fuzz.QRatio)', 85 | common_setup + " from thefuzz import process; import string,random; random.seed(18);" 86 | " choices = [\'\'.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(30)) for s in range(5000)]", 87 | number=10) 88 | 89 | print('Test process.extract(scorer = fuzz.WRatio) for string: "%s"' % s) 90 | print('-------------------------------') 91 | print_result_from_timeit('process.extract(u\'cirque du soleil\', choices, scorer = fuzz.WRatio)', 92 | common_setup + " from thefuzz import process; import string,random; random.seed(18);" 93 | " choices = 
[\'\'.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(30)) for s in range(5000)]", 94 | number=10) 95 | 96 | 97 | # let me show you something 98 | 99 | s = 'New York Yankees' 100 | 101 | test = 'import functools\n' 102 | test += 'title_blob = """%s"""\n' % title_blob 103 | test += 'title_blob = title_blob.strip()\n' 104 | test += 'titles = title_blob.split("\\n")\n' 105 | 106 | print('Real world ratio(): "%s"' % s) 107 | print('-------------------------------') 108 | test += 'prepared_ratio = functools.partial(fuzz.ratio, "%s")\n' % s 109 | test += 'titles.sort(key=prepared_ratio)\n' 110 | print_result_from_timeit(test, common_setup, number=100) 111 | -------------------------------------------------------------------------------- /thefuzz-master/release: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | set -eo pipefail; [[ $RELEASE_TRACE ]] && set -x 3 | 4 | PACKAGE_NAME='thefuzz' 5 | INIT_PACKAGE_NAME='thefuzz' 6 | PUBLIC="true" 7 | 8 | # Colors 9 | COLOR_OFF="\033[0m" # unsets color to term fg color 10 | RED="\033[0;31m" # red 11 | GREEN="\033[0;32m" # green 12 | YELLOW="\033[0;33m" # yellow 13 | MAGENTA="\033[0;35m" # magenta 14 | CYAN="\033[0;36m" # cyan 15 | 16 | # ensure wheel is available 17 | pip install wheel > /dev/null 18 | 19 | # ensure Pygment is available 20 | pip install Pygments > /dev/null 21 | 22 | command -v gitchangelog >/dev/null 2>&1 || { 23 | echo -e "${RED}WARNING: Missing gitchangelog binary, please run: pip install gitchangelog==2.2.0${COLOR_OFF}\n" 24 | exit 1 25 | } 26 | 27 | command -v rst-lint > /dev/null || { 28 | echo -e "${RED}WARNING: Missing rst-lint binary, please run: pip install restructuredtext_lint${COLOR_OFF}\n" 29 | exit 1 30 | } 31 | 32 | set +e; 33 | python test_thefuzz.py &> /dev/null # run the tests 34 | if [ ! $? 
-eq 0 ]; then 35 | echo -e "${RED}WARNING: The tests are failing.${COLOR_OFF}" 36 | exit 1 37 | fi 38 | set -e; 39 | 40 | if [[ "$@" != "major" ]] && [[ "$@" != "minor" ]] && [[ "$@" != "patch" ]]; then 41 | echo -e "${RED}WARNING: Invalid release type, must specify 'major', 'minor', or 'patch'${COLOR_OFF}\n" 42 | exit 1 43 | fi 44 | 45 | echo -e "\n${GREEN}STARTING RELEASE PROCESS${COLOR_OFF}\n" 46 | 47 | set +e; 48 | git status | grep -Eo "working (directory|tree) clean" &> /dev/null 49 | if [ ! $? -eq 0 ]; then # working directory is NOT clean 50 | echo -e "${RED}WARNING: You have uncommitted changes, you may have forgotten something${COLOR_OFF}\n" 51 | exit 1 52 | fi 53 | set -e; 54 | 55 | echo -e "${YELLOW}--->${COLOR_OFF} Updating local copy" 56 | git pull -q origin master 57 | git fetch --tags > /dev/null 58 | 59 | echo -e "${YELLOW}--->${COLOR_OFF} Retrieving release versions" 60 | 61 | current_version=$(cat ${INIT_PACKAGE_NAME}/__init__.py |grep '__version__ ='|sed 's/[^0-9.]//g') 62 | major=$(echo $current_version | awk '{split($0,a,"."); print a[1]}') 63 | minor=$(echo $current_version | awk '{split($0,a,"."); print a[2]}') 64 | patch=$(echo $current_version | awk '{split($0,a,"."); print a[3]}') 65 | 66 | if [[ "$@" == "major" ]]; then 67 | major=$(($major + 1)); 68 | minor="0" 69 | patch="0" 70 | elif [[ "$@" == "minor" ]]; then 71 | minor=$(($minor + 1)); 72 | patch="0" 73 | elif [[ "$@" == "patch" ]]; then 74 | patch=$(($patch + 1)); 75 | fi 76 | 77 | next_version="${major}.${minor}.${patch}" 78 | 79 | echo -e "${YELLOW} >${COLOR_OFF} ${MAGENTA}${current_version}${COLOR_OFF} -> ${MAGENTA}${next_version}${COLOR_OFF}" 80 | 81 | echo -e "${YELLOW}--->${COLOR_OFF} Ensuring readme passes lint checks (if this fails, run rst-lint)" 82 | rst-lint README.rst > /dev/null 83 | 84 | echo -e "${YELLOW}--->${COLOR_OFF} Creating necessary temp file" 85 | tempfoo=$(basename $0) 86 | TMPFILE=$(mktemp /tmp/${tempfoo}.XXXXXX) || { 87 | echo -e "${RED}WARNING: Cannot 
create temp file using mktemp in /tmp dir ${COLOR_OFF}\n" 88 | exit 1 89 | } 90 | 91 | find_this="__version__ = '$current_version'" 92 | replace_with="__version__ = '$next_version'" 93 | 94 | echo -e "${YELLOW}--->${COLOR_OFF} Updating ${INIT_PACKAGE_NAME}/__init__.py" 95 | sed "s/$find_this/$replace_with/" ${INIT_PACKAGE_NAME}/__init__.py > $TMPFILE && mv $TMPFILE ${INIT_PACKAGE_NAME}/__init__.py 96 | 97 | echo -e "${YELLOW}--->${COLOR_OFF} Updating README.rst" 98 | find_this="${PACKAGE_NAME}.git@$current_version" 99 | replace_with="${PACKAGE_NAME}.git@$next_version" 100 | sed "s/$find_this/$replace_with/" README.rst > $TMPFILE && mv $TMPFILE README.rst 101 | find_this="${PACKAGE_NAME}==$current_version" 102 | replace_with="${PACKAGE_NAME}==$next_version" 103 | sed "s/$find_this/$replace_with/" README.rst > $TMPFILE && mv $TMPFILE README.rst 104 | 105 | if [ -f docs/conf.py ]; then 106 | echo -e "${YELLOW}--->${COLOR_OFF} Updating docs" 107 | find_this="version = '${current_version}'" 108 | replace_with="version = '${next_version}'" 109 | sed "s/$find_this/$replace_with/" docs/conf.py > $TMPFILE && mv $TMPFILE docs/conf.py 110 | 111 | find_this="release = '${current_version}'" 112 | replace_with="release = '${next_version}'" 113 | sed "s/$find_this/$replace_with/" docs/conf.py > $TMPFILE && mv $TMPFILE docs/conf.py 114 | fi 115 | 116 | echo -e "${YELLOW}--->${COLOR_OFF} Updating CHANGES.rst for new release" 117 | version_header="$next_version ($(date +%F))" 118 | set +e; dashes=$(yes '-'|head -n ${#version_header}|tr -d '\n') ; set -e 119 | gitchangelog |sed "4s/.*/$version_header/"|sed "5s/.*/$dashes/" > $TMPFILE && mv $TMPFILE CHANGES.rst 120 | 121 | echo -e "${YELLOW}--->${COLOR_OFF} Adding changed files to git" 122 | git add CHANGES.rst README.rst ${INIT_PACKAGE_NAME}/__init__.py 123 | if [ -f docs/conf.py ]; then git add docs/conf.py; fi 124 | 125 | echo -e "${YELLOW}--->${COLOR_OFF} Creating release" 126 | git commit -q -m "Release version $next_version" 127 
| 128 | echo -e "${YELLOW}--->${COLOR_OFF} Tagging release" 129 | git tag -a $next_version -m "Release version $next_version" 130 | 131 | echo -e "${YELLOW}--->${COLOR_OFF} Pushing release and tags to github" 132 | git push -q origin master && git push -q --tags 133 | 134 | if [[ "$PUBLIC" == "true" ]]; then 135 | echo -e "${YELLOW}--->${COLOR_OFF} Creating python release" 136 | cp README.rst README 137 | python setup.py sdist bdist_wheel upload > /dev/null 138 | rm README 139 | else 140 | echo -e "${YELLOW}--->${COLOR_OFF} Creating local python dist and wheel for manual release" 141 | python setup.py sdist bdist_wheel > /dev/null 142 | fi 143 | 144 | echo -e "\n${CYAN}RELEASED VERSION ${next_version}${COLOR_OFF}\n" 145 | -------------------------------------------------------------------------------- /thefuzz-master/setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright (c) 2014 SeatGeek 4 | 5 | # This file is part of thefuzz. 
6 | 7 | from thefuzz import __version__ 8 | from setuptools import setup 9 | 10 | with open('README.rst') as f: 11 | long_description = f.read() 12 | 13 | setup( 14 | name='thefuzz', 15 | version=__version__, 16 | author='Adam Cohen', 17 | author_email='adam@seatgeek.com', 18 | packages=['thefuzz'], 19 | # keep for backwards compatibility of projects depending on `thefuzz[speedup]` 20 | extras_require={'speedup': []}, 21 | install_requires=['rapidfuzz>=3.0.0, < 4.0.0'], 22 | url='https://github.com/seatgeek/thefuzz', 23 | license="GPLv2", 24 | classifiers=[ 25 | 'Intended Audience :: Developers', 26 | 'License :: OSI Approved :: GNU General Public License v2 (GPLv2)', 27 | 'Programming Language :: Python', 28 | 'Programming Language :: Python :: 3', 29 | 'Programming Language :: Python :: 3.8', 30 | 'Programming Language :: Python :: 3.9', 31 | 'Programming Language :: Python :: 3.10', 32 | 'Programming Language :: Python :: 3.11', 33 | 'Programming Language :: Python :: 3.12', 34 | 'Programming Language :: Python :: 3 :: Only', 35 | ], 36 | description='Fuzzy string matching in python', 37 | long_description=long_description, 38 | zip_safe=True, 39 | python_requires='>=3.8' 40 | ) 41 | -------------------------------------------------------------------------------- /thefuzz-master/test_thefuzz.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import re 3 | import pycodestyle 4 | 5 | from thefuzz import fuzz 6 | from thefuzz import process 7 | from thefuzz import utils 8 | 9 | scorers = [ 10 | fuzz.ratio, 11 | fuzz.partial_ratio, 12 | fuzz.token_sort_ratio, 13 | fuzz.token_set_ratio, 14 | fuzz.partial_token_sort_ratio, 15 | fuzz.partial_token_set_ratio, 16 | fuzz.QRatio, 17 | fuzz.UQRatio, 18 | fuzz.WRatio, 19 | fuzz.UWRatio, 20 | ] 21 | 22 | class StringProcessingTest(unittest.TestCase): 23 | def test_replace_non_letters_non_numbers_with_whitespace(self): 24 | strings = ["new york mets - atlanta braves", 
"Cães danados", 25 | "New York //// Mets $$$", "Ça va?"] 26 | for string in strings: 27 | proc_string = utils.full_process(string) 28 | regex = re.compile(r"(?ui)[\W]") 29 | for expr in regex.finditer(proc_string): 30 | self.assertEqual(expr.group(), " ") 31 | 32 | def test_dont_condense_whitespace(self): 33 | s1 = "new york mets - atlanta braves" 34 | s2 = "new york mets atlanta braves" 35 | s3 = "new york mets atlanta braves" 36 | p1 = utils.full_process(s1) 37 | p2 = utils.full_process(s2) 38 | p3 = utils.full_process(s3) 39 | self.assertEqual(p1, s3) 40 | self.assertEqual(p2, s2) 41 | self.assertEqual(p3, s3) 42 | 43 | 44 | class UtilsTest(unittest.TestCase): 45 | def setUp(self): 46 | self.s1 = "new york mets" 47 | self.s1a = "new york mets" 48 | self.s2 = "new YORK mets" 49 | self.s3 = "the wonderful new york mets" 50 | self.s4 = "new york mets vs atlanta braves" 51 | self.s5 = "atlanta braves vs new york mets" 52 | self.s6 = "new york mets - atlanta braves" 53 | self.mixed_strings = [ 54 | "Lorem Ipsum is simply dummy text of the printing and typesetting industry.", 55 | "C'est la vie", 56 | "Ça va?", 57 | "Cães danados", 58 | "\xacCamarões assados", 59 | "a\xac\u1234\u20ac\U00008000", 60 | "\u00C1" 61 | ] 62 | 63 | def tearDown(self): 64 | pass 65 | 66 | def test_ascii_only(self): 67 | for s in self.mixed_strings: 68 | utils.ascii_only(s) 69 | 70 | def test_fullProcess(self): 71 | for s in self.mixed_strings: 72 | utils.full_process(s) 73 | 74 | def test_fullProcessForceAscii(self): 75 | for s in self.mixed_strings: 76 | utils.full_process(s, force_ascii=True) 77 | 78 | 79 | class RatioTest(unittest.TestCase): 80 | 81 | def setUp(self): 82 | self.s1 = "new york mets" 83 | self.s1a = "new york mets" 84 | self.s2 = "new YORK mets" 85 | self.s3 = "the wonderful new york mets" 86 | self.s4 = "new york mets vs atlanta braves" 87 | self.s5 = "atlanta braves vs new york mets" 88 | self.s6 = "new york mets - atlanta braves" 89 | self.s7 = 'new york city mets - 
atlanta braves' 90 | # test silly corner cases 91 | self.s8 = '{' 92 | self.s8a = '{' 93 | self.s9 = '{a' 94 | self.s9a = '{a' 95 | self.s10 = 'a{' 96 | self.s10a = '{b' 97 | 98 | self.cirque_strings = [ 99 | "cirque du soleil - zarkana - las vegas", 100 | "cirque du soleil ", 101 | "cirque du soleil las vegas", 102 | "zarkana las vegas", 103 | "las vegas cirque du soleil at the bellagio", 104 | "zarakana - cirque du soleil - bellagio" 105 | ] 106 | 107 | self.baseball_strings = [ 108 | "new york mets vs chicago cubs", 109 | "chicago cubs vs chicago white sox", 110 | "philladelphia phillies vs atlanta braves", 111 | "braves vs mets", 112 | ] 113 | 114 | def tearDown(self): 115 | pass 116 | 117 | def testEqual(self): 118 | self.assertEqual(fuzz.ratio(self.s1, self.s1a), 100) 119 | self.assertEqual(fuzz.ratio(self.s8, self.s8a), 100) 120 | self.assertEqual(fuzz.ratio(self.s9, self.s9a), 100) 121 | 122 | def testCaseInsensitive(self): 123 | self.assertNotEqual(fuzz.ratio(self.s1, self.s2), 100) 124 | self.assertEqual(fuzz.ratio(utils.full_process(self.s1), utils.full_process(self.s2)), 100) 125 | 126 | def testPartialRatio(self): 127 | self.assertEqual(fuzz.partial_ratio(self.s1, self.s3), 100) 128 | 129 | def testTokenSortRatio(self): 130 | self.assertEqual(fuzz.token_sort_ratio(self.s1, self.s1a), 100) 131 | 132 | def testPartialTokenSortRatio(self): 133 | self.assertEqual(fuzz.partial_token_sort_ratio(self.s1, self.s1a), 100) 134 | self.assertEqual(fuzz.partial_token_sort_ratio(self.s4, self.s5), 100) 135 | self.assertEqual(fuzz.partial_token_sort_ratio(self.s8, self.s8a, full_process=False), 100) 136 | self.assertEqual(fuzz.partial_token_sort_ratio(self.s9, self.s9a, full_process=True), 100) 137 | self.assertEqual(fuzz.partial_token_sort_ratio(self.s9, self.s9a, full_process=False), 100) 138 | self.assertEqual(fuzz.partial_token_sort_ratio(self.s10, self.s10a, full_process=False), 67) 139 | self.assertEqual(fuzz.partial_token_sort_ratio(self.s10a, self.s10, 
full_process=False), 67) 140 | 141 | def testTokenSetRatio(self): 142 | self.assertEqual(fuzz.token_set_ratio(self.s4, self.s5), 100) 143 | self.assertEqual(fuzz.token_set_ratio(self.s8, self.s8a, full_process=False), 100) 144 | self.assertEqual(fuzz.token_set_ratio(self.s9, self.s9a, full_process=True), 100) 145 | self.assertEqual(fuzz.token_set_ratio(self.s9, self.s9a, full_process=False), 100) 146 | self.assertEqual(fuzz.token_set_ratio(self.s10, self.s10a, full_process=False), 50) 147 | 148 | def testPartialTokenSetRatio(self): 149 | self.assertEqual(fuzz.partial_token_set_ratio(self.s4, self.s7), 100) 150 | 151 | def testQuickRatioEqual(self): 152 | self.assertEqual(fuzz.QRatio(self.s1, self.s1a), 100) 153 | 154 | def testQuickRatioCaseInsensitive(self): 155 | self.assertEqual(fuzz.QRatio(self.s1, self.s2), 100) 156 | 157 | def testQuickRatioNotEqual(self): 158 | self.assertNotEqual(fuzz.QRatio(self.s1, self.s3), 100) 159 | 160 | def testWRatioEqual(self): 161 | self.assertEqual(fuzz.WRatio(self.s1, self.s1a), 100) 162 | 163 | def testWRatioCaseInsensitive(self): 164 | self.assertEqual(fuzz.WRatio(self.s1, self.s2), 100) 165 | 166 | def testWRatioPartialMatch(self): 167 | # a partial match is scaled by .9 168 | self.assertEqual(fuzz.WRatio(self.s1, self.s3), 90) 169 | 170 | def testWRatioMisorderedMatch(self): 171 | # misordered full matches are scaled by .95 172 | self.assertEqual(fuzz.WRatio(self.s4, self.s5), 95) 173 | 174 | def testWRatioStr(self): 175 | self.assertEqual(fuzz.WRatio(str(self.s1), str(self.s1a)), 100) 176 | 177 | def testQRatioStr(self): 178 | self.assertEqual(fuzz.QRatio(str(self.s1), str(self.s1a)), 100) 179 | 180 | def testEmptyStringsScore100(self): 181 | self.assertEqual(fuzz.ratio("", ""), 100) 182 | self.assertEqual(fuzz.partial_ratio("", ""), 100) 183 | 184 | def testIssueSeven(self): 185 | s1 = "HSINCHUANG" 186 | s2 = "SINJHUAN" 187 | s3 = "LSINJHUANG DISTRIC" 188 | s4 = "SINJHUANG DISTRICT" 189 | 190 | 
self.assertGreater(fuzz.partial_ratio(s1, s2), 75) 191 | self.assertGreater(fuzz.partial_ratio(s1, s3), 75) 192 | self.assertGreater(fuzz.partial_ratio(s1, s4), 75) 193 | 194 | def testRatioUnicodeString(self): 195 | s1 = "\u00C1" 196 | s2 = "ABCD" 197 | score = fuzz.ratio(s1, s2) 198 | self.assertEqual(0, score) 199 | 200 | def testPartialRatioUnicodeString(self): 201 | s1 = "\u00C1" 202 | s2 = "ABCD" 203 | score = fuzz.partial_ratio(s1, s2) 204 | self.assertEqual(0, score) 205 | 206 | def testWRatioUnicodeString(self): 207 | s1 = "\u00C1" 208 | s2 = "ABCD" 209 | score = fuzz.WRatio(s1, s2) 210 | self.assertEqual(0, score) 211 | 212 | # Cyrillic. 213 | s1 = "\u043f\u0441\u0438\u0445\u043e\u043b\u043e\u0433" 214 | s2 = "\u043f\u0441\u0438\u0445\u043e\u0442\u0435\u0440\u0430\u043f\u0435\u0432\u0442" 215 | score = fuzz.WRatio(s1, s2, force_ascii=False) 216 | self.assertNotEqual(0, score) 217 | 218 | # Chinese. 219 | s1 = "\u6211\u4e86\u89e3\u6570\u5b66" 220 | s2 = "\u6211\u5b66\u6570\u5b66" 221 | score = fuzz.WRatio(s1, s2, force_ascii=False) 222 | self.assertNotEqual(0, score) 223 | 224 | def testQRatioUnicodeString(self): 225 | s1 = "\u00C1" 226 | s2 = "ABCD" 227 | score = fuzz.QRatio(s1, s2) 228 | self.assertEqual(0, score) 229 | 230 | # Cyrillic. 231 | s1 = "\u043f\u0441\u0438\u0445\u043e\u043b\u043e\u0433" 232 | s2 = "\u043f\u0441\u0438\u0445\u043e\u0442\u0435\u0440\u0430\u043f\u0435\u0432\u0442" 233 | score = fuzz.QRatio(s1, s2, force_ascii=False) 234 | self.assertNotEqual(0, score) 235 | 236 | # Chinese. 
237 | s1 = "\u6211\u4e86\u89e3\u6570\u5b66" 238 | s2 = "\u6211\u5b66\u6570\u5b66" 239 | score = fuzz.QRatio(s1, s2, force_ascii=False) 240 | self.assertNotEqual(0, score) 241 | 242 | def testQRatioForceAscii(self): 243 | s1 = "ABCD\u00C1" 244 | s2 = "ABCD" 245 | 246 | score = fuzz.QRatio(s1, s2, force_ascii=True) 247 | self.assertEqual(score, 100) 248 | 249 | score = fuzz.QRatio(s1, s2, force_ascii=False) 250 | self.assertLess(score, 100) 251 | 252 | def testWRatioForceAscii(self): 253 | s1 = "ABCD\u00C1" 254 | s2 = "ABCD" 255 | 256 | score = fuzz.WRatio(s1, s2, force_ascii=True) 257 | self.assertEqual(score, 100) 258 | 259 | score = fuzz.WRatio(s1, s2, force_ascii=False) 260 | self.assertLess(score, 100) 261 | 262 | def testPartialTokenSetRatioForceAscii(self): 263 | s1 = "ABCD\u00C1 HELP\u00C1" 264 | s2 = "ABCD HELP" 265 | 266 | score = fuzz.partial_token_set_ratio(s1, s2, force_ascii=True) 267 | self.assertEqual(score, 100) 268 | 269 | score = fuzz.partial_token_set_ratio(s1, s2, force_ascii=False) 270 | self.assertLess(score, 100) 271 | 272 | def testPartialTokenSortRatioForceAscii(self): 273 | s1 = "ABCD\u00C1 HELP\u00C1" 274 | s2 = "ABCD HELP" 275 | 276 | score = fuzz.partial_token_sort_ratio(s1, s2, force_ascii=True) 277 | self.assertEqual(score, 100) 278 | 279 | score = fuzz.partial_token_sort_ratio(s1, s2, force_ascii=False) 280 | self.assertLess(score, 100) 281 | 282 | def testCheckForNone(self): 283 | for scorer in scorers: 284 | self.assertEqual(scorer(None, None), 0) 285 | self.assertEqual(scorer('Some', None), 0) 286 | self.assertEqual(scorer(None, 'Some'), 0) 287 | 288 | self.assertNotEqual(scorer('Some', 'Some'), 0) 289 | 290 | def testCheckEmptyString(self): 291 | for scorer in scorers: 292 | if scorer in {fuzz.token_set_ratio, fuzz.partial_token_set_ratio, fuzz.WRatio, fuzz.UWRatio, fuzz.QRatio, fuzz.UQRatio}: 293 | self.assertEqual(scorer('', ''), 0) 294 | else: 295 | self.assertEqual(scorer('', ''), 100) 296 | 297 | 
self.assertEqual(scorer('Some', ''), 0) 298 | self.assertEqual(scorer('', 'Some'), 0) 299 | self.assertNotEqual(scorer('Some', 'Some'), 0) 300 | 301 | 302 | class ProcessTest(unittest.TestCase): 303 | 304 | def setUp(self): 305 | self.s1 = "new york mets" 306 | self.s1a = "new york mets" 307 | self.s2 = "new YORK mets" 308 | self.s3 = "the wonderful new york mets" 309 | self.s4 = "new york mets vs atlanta braves" 310 | self.s5 = "atlanta braves vs new york mets" 311 | self.s6 = "new york mets - atlanta braves" 312 | 313 | self.cirque_strings = [ 314 | "cirque du soleil - zarkana - las vegas", 315 | "cirque du soleil ", 316 | "cirque du soleil las vegas", 317 | "zarkana las vegas", 318 | "las vegas cirque du soleil at the bellagio", 319 | "zarakana - cirque du soleil - bellagio" 320 | ] 321 | 322 | self.baseball_strings = [ 323 | "new york mets vs chicago cubs", 324 | "chicago cubs vs chicago white sox", 325 | "philladelphia phillies vs atlanta braves", 326 | "braves vs mets", 327 | ] 328 | 329 | def testGetBestChoice1(self): 330 | query = "new york mets at atlanta braves" 331 | best = process.extractOne(query, self.baseball_strings) 332 | self.assertEqual(best[0], "braves vs mets") 333 | 334 | def testGetBestChoice2(self): 335 | query = "philadelphia phillies at atlanta braves" 336 | best = process.extractOne(query, self.baseball_strings) 337 | self.assertEqual(best[0], self.baseball_strings[2]) 338 | 339 | def testGetBestChoice3(self): 340 | query = "atlanta braves at philadelphia phillies" 341 | best = process.extractOne(query, self.baseball_strings) 342 | self.assertEqual(best[0], self.baseball_strings[2]) 343 | 344 | def testGetBestChoice4(self): 345 | query = "chicago cubs vs new york mets" 346 | best = process.extractOne(query, self.baseball_strings) 347 | self.assertEqual(best[0], self.baseball_strings[0]) 348 | 349 | def testWithProcessor(self): 350 | events = [ 351 | ["chicago cubs vs new york mets", "CitiField", "2011-05-11", "8pm"], 352 | ["new york 
yankees vs boston red sox", "Fenway Park", "2011-05-11", "8pm"], 353 | ["atlanta braves vs pittsburgh pirates", "PNC Park", "2011-05-11", "8pm"], 354 | ] 355 | query = ["new york mets vs chicago cubs", "CitiField", "2017-03-19", "8pm"], 356 | 357 | best = process.extractOne(query, events, processor=lambda event: event[0]) 358 | self.assertEqual(best[0], events[0]) 359 | 360 | def testIssue57(self): 361 | """ 362 | account for force_ascii 363 | """ 364 | query = str(("test", "test")) 365 | choices = [("test", "test")] 366 | assert process.extract(query, choices)[0][1] == 100 367 | 368 | def testWithScorer(self): 369 | choices = [ 370 | "new york mets vs chicago cubs", 371 | "chicago cubs at new york mets", 372 | "atlanta braves vs pittsbugh pirates", 373 | "new york yankees vs boston red sox" 374 | ] 375 | 376 | choices_dict = { 377 | 1: "new york mets vs chicago cubs", 378 | 2: "chicago cubs vs chicago white sox", 379 | 3: "philladelphia phillies vs atlanta braves", 380 | 4: "braves vs mets" 381 | } 382 | 383 | # in this hypothetical example we care about ordering, so we use quick ratio 384 | query = "new york mets at chicago cubs" 385 | scorer = fuzz.QRatio 386 | 387 | # first, as an example, the normal way would select the "more 388 | # 'complete' match of choices[1]" 389 | 390 | best = process.extractOne(query, choices) 391 | self.assertEqual(best[0], choices[1]) 392 | 393 | # now, use the custom scorer 394 | 395 | best = process.extractOne(query, choices, scorer=scorer) 396 | self.assertEqual(best[0], choices[0]) 397 | 398 | best = process.extractOne(query, choices_dict) 399 | self.assertEqual(best[0], choices_dict[1]) 400 | 401 | def testWithCutoff(self): 402 | choices = [ 403 | "new york mets vs chicago cubs", 404 | "chicago cubs at new york mets", 405 | "atlanta braves vs pittsbugh pirates", 406 | "new york yankees vs boston red sox" 407 | ] 408 | 409 | query = "los angeles dodgers vs san francisco giants" 410 | 411 | # in this situation, this is an event 
that does not exist in the list 412 | # we don't want to randomly match to something, so we use a reasonable cutoff 413 | 414 | best = process.extractOne(query, choices, score_cutoff=50) 415 | self.assertIsNone(best) 416 | 417 | # however if we had no cutoff, something would get returned 418 | 419 | # best = process.extractOne(query, choices) 420 | # self.assertIsNotNone(best) 421 | 422 | def testWithCutoff2(self): 423 | choices = [ 424 | "new york mets vs chicago cubs", 425 | "chicago cubs at new york mets", 426 | "atlanta braves vs pittsbugh pirates", 427 | "new york yankees vs boston red sox" 428 | ] 429 | 430 | query = "new york mets vs chicago cubs" 431 | # Only find 100-score cases 432 | res = process.extractOne(query, choices, score_cutoff=100) 433 | self.assertIsNotNone(res) 434 | best_match, score = res 435 | self.assertIs(best_match, choices[0]) 436 | 437 | def testEmptyStrings(self): 438 | choices = [ 439 | "", 440 | "new york mets vs chicago cubs", 441 | "new york yankees vs boston red sox", 442 | "", 443 | "" 444 | ] 445 | 446 | query = "new york mets at chicago cubs" 447 | 448 | best = process.extractOne(query, choices) 449 | self.assertEqual(best[0], choices[1]) 450 | 451 | def testNullStrings(self): 452 | choices = [ 453 | None, 454 | "new york mets vs chicago cubs", 455 | "new york yankees vs boston red sox", 456 | None, 457 | None 458 | ] 459 | 460 | query = "new york mets at chicago cubs" 461 | 462 | best = process.extractOne(query, choices) 463 | self.assertEqual(best[0], choices[1]) 464 | 465 | def test_list_like_extract(self): 466 | """We should be able to use a list-like object for choices.""" 467 | def generate_choices(): 468 | choices = ['a', 'Bb', 'CcC'] 469 | yield from choices 470 | search = 'aaa' 471 | result = [(value, confidence) for value, confidence in 472 | process.extract(search, generate_choices())] 473 | self.assertGreater(len(result), 0) 474 | 475 | def test_dict_like_extract(self): 476 | """We should be able to use a dict-like 
object for choices, not only a 477 | dict, and still get dict-like output. 478 | """ 479 | try: 480 | from UserDict import UserDict 481 | except ImportError: 482 | from collections import UserDict 483 | choices = UserDict({'aa': 'bb', 'a1': None}) 484 | search = 'aaa' 485 | result = process.extract(search, choices) 486 | self.assertGreater(len(result), 0) 487 | for value, confidence, key in result: 488 | self.assertIn(value, choices.values()) 489 | 490 | def test_dedupe(self): 491 | """We should be able to use a list-like object for contains_dupes 492 | """ 493 | # Test 1 494 | contains_dupes = ['Frodo Baggins', 'Tom Sawyer', 'Bilbo Baggin', 'Samuel L. Jackson', 'F. Baggins', 'Frody Baggins', 'Bilbo Baggins'] 495 | 496 | result = process.dedupe(contains_dupes) 497 | self.assertLess(len(result), len(contains_dupes)) 498 | 499 | # Test 2 500 | contains_dupes = ['Tom', 'Dick', 'Harry'] 501 | 502 | # we should end up with the same list since no duplicates are contained in the list (e.g. original list is returned) 503 | deduped_list = ['Tom', 'Dick', 'Harry'] 504 | 505 | result = process.dedupe(contains_dupes) 506 | self.assertEqual(result, deduped_list) 507 | 508 | def test_simplematch(self): 509 | basic_string = 'a, b' 510 | match_strings = ['a, b'] 511 | 512 | result = process.extractOne(basic_string, match_strings, scorer=fuzz.ratio) 513 | part_result = process.extractOne(basic_string, match_strings, scorer=fuzz.partial_ratio) 514 | 515 | self.assertEqual(result, ('a, b', 100)) 516 | self.assertEqual(part_result, ('a, b', 100)) 517 | 518 | 519 | class TestCodeFormat(unittest.TestCase): 520 | def test_pep8_conformance(self): 521 | pep8style = pycodestyle.StyleGuide(quiet=False) 522 | pep8style.options.ignore = pep8style.options.ignore + tuple(['E501']) 523 | pep8style.input_dir('thefuzz') 524 | result = pep8style.check_files() 525 | self.assertEqual(result.total_errors, 0, "PEP8 POLICE - WOOOOOWOOOOOOOOOO") 526 | 527 | if __name__ == '__main__': 528 | unittest.main() 
# run all tests 529 | -------------------------------------------------------------------------------- /thefuzz-master/test_thefuzz_hypothesis.py: -------------------------------------------------------------------------------- 1 | from itertools import product 2 | from functools import partial 3 | from string import ascii_letters, digits, punctuation 4 | 5 | from hypothesis import given, assume, settings, HealthCheck 6 | import hypothesis.strategies as st 7 | import pytest 8 | 9 | from thefuzz import fuzz, process, utils 10 | 11 | 12 | HYPOTHESIS_ALPHABET = ascii_letters + digits + punctuation 13 | 14 | 15 | def scorers_processors(): 16 | """ 17 | Generate a list of (scorer, processor) pairs for testing 18 | 19 | :return: [(scorer, processor), ...] 20 | """ 21 | scorers = [fuzz.ratio, 22 | fuzz.partial_ratio] 23 | processors = [lambda x: x, 24 | partial(utils.full_process, force_ascii=False), 25 | partial(utils.full_process, force_ascii=True)] 26 | splist = list(product(scorers, processors)) 27 | splist.extend( 28 | [(fuzz.WRatio, partial(utils.full_process, force_ascii=True)), 29 | (fuzz.QRatio, partial(utils.full_process, force_ascii=True)), 30 | (fuzz.UWRatio, partial(utils.full_process, force_ascii=False)), 31 | (fuzz.UQRatio, partial(utils.full_process, force_ascii=False)), 32 | (fuzz.token_set_ratio, partial(utils.full_process, force_ascii=True)), 33 | (fuzz.token_sort_ratio, partial(utils.full_process, force_ascii=True)), 34 | (fuzz.partial_token_set_ratio, partial(utils.full_process, force_ascii=True)), 35 | (fuzz.partial_token_sort_ratio, partial(utils.full_process, force_ascii=True))] 36 | ) 37 | 38 | return splist 39 | 40 | 41 | def full_scorers_processors(): 42 | """ 43 | Generate a list of (scorer, processor) pairs for testing for scorers that use the full string only 44 | 45 | :return: [(scorer, processor), ...] 
46 | """ 47 | scorers = [fuzz.ratio] 48 | processors = [lambda x: x, 49 | partial(utils.full_process, force_ascii=False), 50 | partial(utils.full_process, force_ascii=True)] 51 | splist = list(product(scorers, processors)) 52 | splist.extend( 53 | [(fuzz.WRatio, partial(utils.full_process, force_ascii=True)), 54 | (fuzz.QRatio, partial(utils.full_process, force_ascii=True)), 55 | (fuzz.UWRatio, partial(utils.full_process, force_ascii=False)), 56 | (fuzz.UQRatio, partial(utils.full_process, force_ascii=False))] 57 | ) 58 | 59 | return splist 60 | 61 | 62 | @pytest.mark.parametrize('scorer,processor', 63 | scorers_processors()) 64 | @given(data=st.data()) 65 | @settings(max_examples=20, deadline=5000, suppress_health_check=[HealthCheck.data_too_large]) 66 | def test_identical_strings_extracted(scorer, processor, data): 67 | """ 68 | Test that identical strings will always return a perfect match. 69 | 70 | :param scorer: 71 | :param processor: 72 | :param data: 73 | :return: 74 | """ 75 | # Draw a list of random strings 76 | strings = data.draw( 77 | st.lists( 78 | st.text(min_size=10, max_size=100, alphabet=HYPOTHESIS_ALPHABET), 79 | min_size=1, 80 | max_size=10 81 | ) 82 | ) 83 | # Draw a random integer for the index in that list 84 | choiceidx = data.draw(st.integers(min_value=0, max_value=(len(strings) - 1))) 85 | 86 | # Extract our choice from the list 87 | choice = strings[choiceidx] 88 | 89 | # Check process doesn't make our choice the empty string 90 | assume(processor(choice) != '') 91 | 92 | # Extract all perfect matches 93 | result = process.extractBests(choice, 94 | strings, 95 | scorer=scorer, 96 | processor=processor, 97 | score_cutoff=100, 98 | limit=None) 99 | 100 | # Check we get a result 101 | assert result != [] 102 | 103 | # Check the original is in the list 104 | assert (choice, 100) in result 105 | 106 | 107 | @pytest.mark.parametrize('scorer,processor', 108 | full_scorers_processors()) 109 | @given(data=st.data()) 110 | 
@settings(max_examples=20, deadline=5000) 111 | def test_only_identical_strings_extracted(scorer, processor, data): 112 | """ 113 | Test that only identical (post processing) strings score 100 on the test. 114 | 115 | If two strings are not identical then using full comparison methods they should 116 | not be a perfect (100) match. 117 | 118 | :param scorer: 119 | :param processor: 120 | :param data: 121 | :return: 122 | """ 123 | # Draw a list of random strings 124 | strings = data.draw( 125 | st.lists( 126 | st.text(min_size=10, max_size=100, alphabet=HYPOTHESIS_ALPHABET), 127 | min_size=1, 128 | max_size=10) 129 | ) 130 | # Draw a random integer for the index in that list 131 | choiceidx = data.draw(st.integers(min_value=0, max_value=(len(strings) - 1))) 132 | 133 | # Extract our choice from the list 134 | choice = strings[choiceidx] 135 | 136 | # Check process doesn't make our choice the empty string 137 | assume(processor(choice) != '') 138 | 139 | # Extract all perfect matches 140 | result = process.extractBests(choice, 141 | strings, 142 | scorer=scorer, 143 | processor=processor, 144 | score_cutoff=100, 145 | limit=None) 146 | 147 | # Check we get a result 148 | assert result != [] 149 | 150 | # Check THE ONLY result(s) we get are a perfect match for the (processed) original data 151 | pchoice = processor(choice) 152 | for r in result: 153 | assert pchoice == processor(r[0]) 154 | -------------------------------------------------------------------------------- /thefuzz-master/test_thefuzz_pytest.py: -------------------------------------------------------------------------------- 1 | from thefuzz import process 2 | 3 | 4 | def test_process_warning(caplog): 5 | """Check that a string reduced to 0 by processor logs a warning to stderr""" 6 | 7 | query = ':::::::' 8 | choices = [':::::::'] 9 | 10 | _ = process.extractOne(query, choices) 11 | 12 | logstr = ("Applied processor reduces " 13 | "input query to empty string, " 14 | "all comparisons will have score 0. 
" 15 | "[Query: ':::::::']") 16 | 17 | assert 1 == len(caplog.records) 18 | log = caplog.records[0] 19 | 20 | assert log.levelname == "WARNING" 21 | assert log.name == "thefuzz.process" 22 | assert logstr == log.message 23 | -------------------------------------------------------------------------------- /thefuzz-master/thefuzz/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.21.0' 2 | -------------------------------------------------------------------------------- /thefuzz-master/thefuzz/fuzz.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from rapidfuzz.fuzz import ( 4 | ratio as _ratio, 5 | partial_ratio as _partial_ratio, 6 | token_set_ratio as _token_set_ratio, 7 | token_sort_ratio as _token_sort_ratio, 8 | partial_token_set_ratio as _partial_token_set_ratio, 9 | partial_token_sort_ratio as _partial_token_sort_ratio, 10 | WRatio as _WRatio, 11 | QRatio as _QRatio, 12 | ) 13 | 14 | from . import utils 15 | 16 | ########################### 17 | # Basic Scoring Functions # 18 | ########################### 19 | 20 | 21 | def _rapidfuzz_scorer(scorer, s1, s2, force_ascii, full_process): 22 | """ 23 | wrapper around rapidfuzz function to be compatible with the API of thefuzz 24 | """ 25 | if full_process: 26 | if s1 is None or s2 is None: 27 | return 0 28 | 29 | s1 = utils.full_process(s1, force_ascii=force_ascii) 30 | s2 = utils.full_process(s2, force_ascii=force_ascii) 31 | 32 | return int(round(scorer(s1, s2))) 33 | 34 | 35 | def ratio(s1, s2): 36 | return _rapidfuzz_scorer(_ratio, s1, s2, False, False) 37 | 38 | 39 | def partial_ratio(s1, s2): 40 | """ 41 | Return the ratio of the most similar substring 42 | as a number between 0 and 100. 
43 | """ 44 | return _rapidfuzz_scorer(_partial_ratio, s1, s2, False, False) 45 | 46 | 47 | ############################## 48 | # Advanced Scoring Functions # 49 | ############################## 50 | 51 | # Sorted Token 52 | # find all alphanumeric tokens in the string 53 | # sort those tokens and take ratio of resulting joined strings 54 | # controls for unordered string elements 55 | def token_sort_ratio(s1, s2, force_ascii=True, full_process=True): 56 | """ 57 | Return a measure of the sequences' similarity between 0 and 100 58 | but sorting the token before comparing. 59 | """ 60 | return _rapidfuzz_scorer(_token_sort_ratio, s1, s2, force_ascii, full_process) 61 | 62 | 63 | def partial_token_sort_ratio(s1, s2, force_ascii=True, full_process=True): 64 | """ 65 | Return the ratio of the most similar substring as a number between 66 | 0 and 100 but sorting the token before comparing. 67 | """ 68 | return _rapidfuzz_scorer( 69 | _partial_token_sort_ratio, s1, s2, force_ascii, full_process 70 | ) 71 | 72 | 73 | def token_set_ratio(s1, s2, force_ascii=True, full_process=True): 74 | return _rapidfuzz_scorer(_token_set_ratio, s1, s2, force_ascii, full_process) 75 | 76 | 77 | def partial_token_set_ratio(s1, s2, force_ascii=True, full_process=True): 78 | return _rapidfuzz_scorer( 79 | _partial_token_set_ratio, s1, s2, force_ascii, full_process 80 | ) 81 | 82 | 83 | ################### 84 | # Combination API # 85 | ################### 86 | 87 | # q is for quick 88 | def QRatio(s1, s2, force_ascii=True, full_process=True): 89 | """ 90 | Quick ratio comparison between two strings. 91 | 92 | Runs full_process from utils on both strings 93 | Short circuits if either of the strings is empty after processing. 
94 | 95 | :param s1: 96 | :param s2: 97 | :param force_ascii: Allow only ASCII characters (Default: True) 98 | :param full_process: Process inputs, used here to avoid double processing in extract functions (Default: True) 99 | :return: similarity ratio 100 | """ 101 | return _rapidfuzz_scorer(_QRatio, s1, s2, force_ascii, full_process) 102 | 103 | 104 | def UQRatio(s1, s2, full_process=True): 105 | """ 106 | Unicode quick ratio 107 | 108 | Calls QRatio with force_ascii set to False 109 | 110 | :param s1: 111 | :param s2: 112 | :return: similarity ratio 113 | """ 114 | return QRatio(s1, s2, force_ascii=False, full_process=full_process) 115 | 116 | 117 | # w is for weighted 118 | def WRatio(s1, s2, force_ascii=True, full_process=True): 119 | """ 120 | Return a measure of the sequences' similarity between 0 and 100, using different algorithms. 121 | 122 | **Steps in the order they occur** 123 | 124 | #. Run full_process from utils on both strings 125 | #. Short circuit if this makes either string empty 126 | #. Take the ratio of the two processed strings (fuzz.ratio) 127 | #. Run checks to compare the length of the strings 128 | * If one of the strings is more than 1.5 times as long as the other 129 | use partial_ratio comparisons - scale partial results by 0.9 130 | (this makes sure only full results can return 100) 131 | * If one of the strings is over 8 times as long as the other 132 | instead scale by 0.6 133 | 134 | #. Run the other ratio functions 135 | * if using partial ratio functions call partial_ratio, 136 | partial_token_sort_ratio and partial_token_set_ratio 137 | scale all of these by the ratio based on length 138 | * otherwise call token_sort_ratio and token_set_ratio 139 | * all token based comparisons are scaled by 0.95 140 | (on top of any partial scalars) 141 | 142 | #. Take the highest value from these results 143 | round it and return it as an integer. 
144 | 145 | :param s1: 146 | :param s2: 147 | :param force_ascii: Allow only ASCII characters 148 | :type force_ascii: bool 149 | :param full_process: Process inputs, used here to avoid double processing in extract functions (Default: True) 150 | :return: 151 | """ 152 | return _rapidfuzz_scorer(_WRatio, s1, s2, force_ascii, full_process) 153 | 154 | 155 | def UWRatio(s1, s2, full_process=True): 156 | """ 157 | Return a measure of the sequences' similarity between 0 and 100, 158 | using different algorithms. Same as WRatio but preserving unicode. 159 | """ 160 | return WRatio(s1, s2, force_ascii=False, full_process=full_process) 161 | -------------------------------------------------------------------------------- /thefuzz-master/thefuzz/fuzz.pyi: -------------------------------------------------------------------------------- 1 | def ratio(s1: str, s2: str) -> int: ... 2 | def partial_ratio(s1: str, s2: str) -> int: ... 3 | def token_sort_ratio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 4 | def partial_token_sort_ratio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 5 | def token_set_ratio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 6 | def partial_token_set_ratio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 7 | def QRatio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 8 | def UQRatio(s1: str, s2: str, full_process: bool = ...) -> int: ... 9 | def WRatio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 10 | def UWRatio(s1: str, s2: str, full_process: bool = ...) -> int: ... 11 | -------------------------------------------------------------------------------- /thefuzz-master/thefuzz/process.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from . import fuzz 3 | from . 
import utils 4 | import logging 5 | from rapidfuzz import fuzz as rfuzz 6 | from rapidfuzz import process as rprocess 7 | from functools import partial 8 | 9 | _logger = logging.getLogger(__name__) 10 | 11 | default_scorer = fuzz.WRatio 12 | default_processor = utils.full_process 13 | 14 | 15 | def _get_processor(processor, scorer): 16 | """ 17 | thefuzz runs both the default preprocessing of the function and the preprocessing 18 | function passed into process.* while rapidfuzz only runs the one passed into 19 | process.*. This function wraps the processor to mimic this behavior 20 | """ 21 | if scorer not in (fuzz.WRatio, fuzz.QRatio, 22 | fuzz.token_set_ratio, fuzz.token_sort_ratio, 23 | fuzz.partial_token_set_ratio, fuzz.partial_token_sort_ratio, 24 | fuzz.UWRatio, fuzz.UQRatio): 25 | return processor 26 | 27 | force_ascii = scorer not in [fuzz.UWRatio, fuzz.UQRatio] 28 | pre_processor = partial(utils.full_process, force_ascii=force_ascii) 29 | 30 | if not processor or processor == utils.full_process: 31 | return pre_processor 32 | 33 | def wrapper(s): 34 | return pre_processor(processor(s)) 35 | 36 | return wrapper 37 | 38 | 39 | # this allows lowering the scorers back to the scorers used in rapidfuzz 40 | # this allows rapidfuzz to perform more optimizations behind the scenes. 
41 | # These mapped scorers are the same with two exceptions 42 | # - default processor 43 | # - result is not rounded 44 | # these two exceptions need to be taken into account in the implementation 45 | _scorer_lowering = { 46 | fuzz.ratio: rfuzz.ratio, 47 | fuzz.partial_ratio: rfuzz.partial_ratio, 48 | fuzz.token_set_ratio: rfuzz.token_set_ratio, 49 | fuzz.token_sort_ratio: rfuzz.token_sort_ratio, 50 | fuzz.partial_token_set_ratio: rfuzz.partial_token_set_ratio, 51 | fuzz.partial_token_sort_ratio: rfuzz.partial_token_sort_ratio, 52 | fuzz.WRatio: rfuzz.WRatio, 53 | fuzz.QRatio: rfuzz.QRatio, 54 | fuzz.UWRatio: rfuzz.WRatio, 55 | fuzz.UQRatio: rfuzz.QRatio, 56 | } 57 | 58 | 59 | def _get_scorer(scorer): 60 | """ 61 | rapidfuzz scorers require the score_cutoff argument to be available 62 | This generates a compatible wrapper function 63 | """ 64 | def wrapper(s1, s2, score_cutoff=0): 65 | return scorer(s1, s2) 66 | 67 | return _scorer_lowering.get(scorer, wrapper) 68 | 69 | 70 | def _preprocess_query(query, processor): 71 | processed_query = processor(query) if processor else query 72 | if len(processed_query) == 0: 73 | _logger.warning("Applied processor reduces input query to empty string, " 74 | "all comparisons will have score 0. " 75 | f"[Query: \'{query}\']") 76 | 77 | return processed_query 78 | 79 | 80 | def extractWithoutOrder(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0): 81 | """ 82 | Select the best match in a list or dictionary of choices. 83 | 84 | Find best matches in a list or dictionary of choices, return a 85 | generator of tuples containing the match and its score. If a dictionary 86 | is used, also returns the key for each match. 87 | 88 | Arguments: 89 | query: An object representing the thing we want to find. 90 | choices: An iterable or dictionary-like object containing choices 91 | to be matched against the query.
Dictionary arguments of 92 | {key: value} pairs will attempt to match the query against 93 | each value. 94 | processor: Optional function of the form f(a) -> b, where a is the query or 95 | individual choice and b is the choice to be used in matching. 96 | 97 | This can be used to match against, say, the first element of 98 | a list: 99 | 100 | lambda x: x[0] 101 | 102 | Defaults to thefuzz.utils.full_process(). 103 | scorer: Optional function for scoring matches between the query and 104 | an individual processed choice. This should be a function 105 | of the form f(query, choice) -> int. 106 | 107 | By default, fuzz.WRatio() is used and expects both query and 108 | choice to be strings. 109 | score_cutoff: Optional argument for score threshold. No matches with 110 | a score less than this number will be returned. Defaults to 0. 111 | 112 | Returns: 113 | Generator of tuples containing the match and its score. 114 | 115 | If a list is used for choices, then the result will be 2-tuples. 116 | If a dictionary is used, then the result will be 3-tuples containing 117 | the key for each match. 118 | 119 | For example, searching for 'bird' in the dictionary 120 | 121 | {'bard': 'train', 'dog': 'man'} 122 | 123 | may return 124 | 125 | ('train', 22, 'bard'), ('man', 0, 'dog') 126 | """ 127 | is_mapping = hasattr(choices, "items") 128 | is_lowered = scorer in _scorer_lowering 129 | 130 | query = _preprocess_query(query, processor) 131 | it = rprocess.extract_iter( 132 | query, choices, 133 | processor=_get_processor(processor, scorer), 134 | scorer=_get_scorer(scorer), 135 | score_cutoff=score_cutoff 136 | ) 137 | 138 | for choice, score, key in it: 139 | if is_lowered: 140 | score = int(round(score)) 141 | 142 | yield (choice, score, key) if is_mapping else (choice, score) 143 | 144 | 145 | def extract(query, choices, processor=default_processor, scorer=default_scorer, limit=5): 146 | """ 147 | Select the best match in a list or dictionary of choices. 
148 | 149 | Find best matches in a list or dictionary of choices, return a 150 | list of tuples containing the match and its score. If a dictionary 151 | is used, also returns the key for each match. 152 | 153 | Arguments: 154 | query: An object representing the thing we want to find. 155 | choices: An iterable or dictionary-like object containing choices 156 | to be matched against the query. Dictionary arguments of 157 | {key: value} pairs will attempt to match the query against 158 | each value. 159 | processor: Optional function of the form f(a) -> b, where a is the query or 160 | individual choice and b is the choice to be used in matching. 161 | 162 | This can be used to match against, say, the first element of 163 | a list: 164 | 165 | lambda x: x[0] 166 | 167 | Defaults to thefuzz.utils.full_process(). 168 | scorer: Optional function for scoring matches between the query and 169 | an individual processed choice. This should be a function 170 | of the form f(query, choice) -> int. 171 | By default, fuzz.WRatio() is used and expects both query and 172 | choice to be strings. 173 | limit: Optional maximum for the number of elements returned. Defaults 174 | to 5. 175 | 176 | Returns: 177 | List of tuples containing the match and its score. 178 | 179 | If a list is used for choices, then the result will be 2-tuples. 180 | If a dictionary is used, then the result will be 3-tuples containing 181 | the key for each match. 182 | 183 | For example, searching for 'bird' in the dictionary 184 | 185 | {'bard': 'train', 'dog': 'man'} 186 | 187 | may return 188 | 189 | [('train', 22, 'bard'), ('man', 0, 'dog')] 190 | """ 191 | return extractBests(query, choices, processor=processor, scorer=scorer, limit=limit) 192 | 193 | 194 | def extractBests(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0, limit=5): 195 | """ 196 | Get a list of the best matches to a collection of choices. 
197 | 198 | Convenience function for getting the choices with best scores. 199 | 200 | Args: 201 | query: A string to match against 202 | choices: A list or dictionary of choices, suitable for use with 203 | extract(). 204 | processor: Optional function for transforming choices before matching. 205 | See extract(). 206 | scorer: Scoring function for extract(). 207 | score_cutoff: Optional argument for score threshold. No matches with 208 | a score less than this number will be returned. Defaults to 0. 209 | limit: Optional maximum for the number of elements returned. Defaults 210 | to 5. 211 | 212 | Returns: A list of (match, score) tuples. 213 | """ 214 | is_mapping = hasattr(choices, "items") 215 | is_lowered = scorer in _scorer_lowering 216 | 217 | query = _preprocess_query(query, processor) 218 | results = rprocess.extract( 219 | query, choices, 220 | processor=_get_processor(processor, scorer), 221 | scorer=_get_scorer(scorer), 222 | score_cutoff=score_cutoff, 223 | limit=limit 224 | ) 225 | 226 | for i, (choice, score, key) in enumerate(results): 227 | if is_lowered: 228 | score = int(round(score)) 229 | 230 | results[i] = (choice, score, key) if is_mapping else (choice, score) 231 | 232 | return results 233 | 234 | 235 | def extractOne(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0): 236 | """ 237 | Find the single best match above a score in a list of choices. 238 | 239 | This is a convenience method which returns the single best choice. 240 | See extract() for the full arguments list. 241 | 242 | Args: 243 | query: A string to match against 244 | choices: A list or dictionary of choices, suitable for use with 245 | extract(). 246 | processor: Optional function for transforming choices before matching. 247 | See extract(). 248 | scorer: Scoring function for extract(). 249 | score_cutoff: Optional argument for score threshold.
If the best 250 | match is found, but it is not greater than this number, then 251 | return None anyway ("not a good enough match"). Defaults to 0. 252 | 253 | Returns: 254 | A tuple containing a single match and its score, if a match 255 | was found that was above score_cutoff. Otherwise, returns None. 256 | """ 257 | is_mapping = hasattr(choices, "items") 258 | is_lowered = scorer in _scorer_lowering 259 | 260 | query = _preprocess_query(query, processor) 261 | res = rprocess.extractOne( 262 | query, choices, 263 | processor=_get_processor(processor, scorer), 264 | scorer=_get_scorer(scorer), 265 | score_cutoff=score_cutoff 266 | ) 267 | 268 | if res is None: 269 | return res 270 | 271 | choice, score, key = res 272 | 273 | if is_lowered: 274 | score = int(round(score)) 275 | 276 | return (choice, score, key) if is_mapping else (choice, score) 277 | 278 | 279 | def dedupe(contains_dupes, threshold=70, scorer=fuzz.token_set_ratio): 280 | """ 281 | This convenience function takes a list of strings containing duplicates and uses fuzzy matching to identify 282 | and remove duplicates. Specifically, it uses process.extract to identify duplicates that 283 | score greater than a user defined threshold. Then, it looks for the longest item in the duplicate list 284 | since we assume this item contains the most entity information and returns that. It breaks string 285 | length ties on an alphabetical sort. 286 | 287 | Note: as the threshold DECREASES the number of duplicates that are found INCREASES. This means that the 288 | returned deduplicated list will likely be shorter. Raise the threshold for dedupe to be less 289 | sensitive. 290 | 291 | Args: 292 | contains_dupes: A list of strings that we would like to dedupe. 293 | threshold: the numerical value (0,100) point at which we expect to find duplicates. 294 | Defaults to 70 out of 100 295 | scorer: Optional function for scoring matches between the query and 296 | an individual processed choice. 
This should be a function 297 | of the form f(query, choice) -> int. 298 | By default, fuzz.token_set_ratio() is used and expects both query and 299 | choice to be strings. 300 | 301 | Returns: 302 | A deduplicated list. For example: 303 | 304 | In: contains_dupes = ['Frodo Baggin', 'Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins'] 305 | In: dedupe(contains_dupes) 306 | Out: ['Frodo Baggins', 'Samwise G.', 'Bilbo Baggins', 'Gandalf'] 307 | """ 308 | deduped = set() 309 | for item in contains_dupes: 310 | matches = extractBests(item, contains_dupes, scorer=scorer, score_cutoff=threshold, limit=None) 311 | deduped.add(max(matches, key=lambda x: (len(x[0]), x[0]))[0]) 312 | 313 | return list(deduped) if len(deduped) != len(contains_dupes) else contains_dupes 314 | -------------------------------------------------------------------------------- /thefuzz-master/thefuzz/process.pyi: -------------------------------------------------------------------------------- 1 | from collections.abc import Mapping 2 | import typing 3 | from typing import Any, Callable, Union, Tuple, Generator, TypeVar, Sequence 4 | 5 | 6 | ChoicesT = Union[Mapping[str, str], Sequence[str]] 7 | T = TypeVar('T') 8 | ProcessorT = Union[Callable[[str, bool], str], Callable[[Any], Any]] 9 | ScorerT = Callable[[str, str, bool, bool], int] 10 | 11 | 12 | @typing.overload 13 | def extractWithoutOrder(query: str, choices: Mapping[str, str], processor: ProcessorT, scorer: ScorerT, score_cutoff: int = ...) -> Generator[Tuple[str, int, str], None, None]: ... 14 | 15 | 16 | @typing.overload 17 | def extractWithoutOrder(query: str, choices: Sequence[str], processor: ProcessorT, scorer: ScorerT, score_cutoff: int = ...) -> Generator[Tuple[str, int], None, None]: ... 
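
The selection rule inside dedupe() in process.py can be sketched with the standard library alone. This is a minimal sketch, not the library's code: `simple_score` and `dedupe_sketch` are hypothetical names, and difflib's SequenceMatcher is swapped in for fuzz.token_set_ratio, so the scores will differ from what thefuzz reports.

```python
from difflib import SequenceMatcher


def simple_score(a, b):
    """Hypothetical stand-in scorer (0-100); thefuzz defaults to token_set_ratio."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def dedupe_sketch(items, threshold=70, scorer=simple_score):
    """Mirror the shape of dedupe(): keep the longest member of each match group."""
    deduped = set()
    for item in items:
        # Every candidate scoring at or above the threshold counts as a duplicate.
        matches = [c for c in items if scorer(item, c) >= threshold]
        # Longest string wins; ties on length are broken alphabetically.
        deduped.add(max(matches, key=lambda s: (len(s), s)))
    # As in the original, the input is returned untouched if nothing collapsed.
    return list(deduped) if len(deduped) != len(items) else items
```

Because the result is built from a set, the order of the returned list is not guaranteed whenever duplicates were removed; sort it if a stable order matters.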
18 | -------------------------------------------------------------------------------- /thefuzz-master/thefuzz/py.typed: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /thefuzz-master/thefuzz/utils.py: -------------------------------------------------------------------------------- 1 | from rapidfuzz.utils import default_process as _default_process 2 | 3 | translation_table = {i: None for i in range(128, 256)} # ascii dammit! 4 | 5 | 6 | def ascii_only(s): 7 | return s.translate(translation_table) 8 | 9 | 10 | def full_process(s, force_ascii=False): 11 | """ 12 | Process string by 13 | -- removing all but letters and numbers 14 | -- trim whitespace 15 | -- force to lower case 16 | if force_ascii == True, force convert to ascii 17 | """ 18 | 19 | if force_ascii: 20 | s = ascii_only(str(s)) 21 | 22 | return _default_process(s) 23 | -------------------------------------------------------------------------------- /thefuzz-master/thefuzz/utils.pyi: -------------------------------------------------------------------------------- 1 | 2 | def ascii_only(s: str) -> str: ... 3 | def full_process(s: str, force_ascii: bool = ...) -> str: ... 
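
The behaviour of ascii_only() and full_process() in utils.py can be approximated without rapidfuzz installed. The following is a sketch under stated assumptions: `simple_process` is a hypothetical stand-in for rapidfuzz.utils.default_process (it replaces non-alphanumeric characters with spaces, lowercases, and trims), and `full_process_sketch` is likewise an illustrative name. Only the translation table is taken directly from utils.py.

```python
# Same table as utils.py: code points 128-255 are deleted, nothing else.
TRANSLATION_TABLE = {i: None for i in range(128, 256)}


def ascii_only(s):
    return s.translate(TRANSLATION_TABLE)


def simple_process(s):
    """Hypothetical approximation of rapidfuzz.utils.default_process."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in s)
    return cleaned.lower().strip()


def full_process_sketch(s, force_ascii=False):
    if force_ascii:
        s = ascii_only(str(s))
    return simple_process(s)
```

One consequence of the table covering only the range 128-255 is that characters above U+00FF pass through ascii_only() unchanged: ascii_only("café") drops the "é", but ascii_only("Łódź") drops only the "ó".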
4 | -------------------------------------------------------------------------------- /thefuzz-master/tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = py{38, 39, 310, 311, 312, py3} 3 | skip_missing_interpreters = True 4 | 5 | [testenv] 6 | deps = pytest 7 | pycodestyle 8 | hypothesis 9 | commands = pytest 10 | -------------------------------------------------------------------------------- /thefuzz/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.21.0' 2 | -------------------------------------------------------------------------------- /thefuzz/fuzz.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from rapidfuzz.fuzz import ( 4 | ratio as _ratio, 5 | partial_ratio as _partial_ratio, 6 | token_set_ratio as _token_set_ratio, 7 | token_sort_ratio as _token_sort_ratio, 8 | partial_token_set_ratio as _partial_token_set_ratio, 9 | partial_token_sort_ratio as _partial_token_sort_ratio, 10 | WRatio as _WRatio, 11 | QRatio as _QRatio, 12 | ) 13 | 14 | from . import utils 15 | 16 | ########################### 17 | # Basic Scoring Functions # 18 | ########################### 19 | 20 | 21 | def _rapidfuzz_scorer(scorer, s1, s2, force_ascii, full_process): 22 | """ 23 | wrapper around rapidfuzz function to be compatible with the API of thefuzz 24 | """ 25 | if full_process: 26 | if s1 is None or s2 is None: 27 | return 0 28 | 29 | s1 = utils.full_process(s1, force_ascii=force_ascii) 30 | s2 = utils.full_process(s2, force_ascii=force_ascii) 31 | 32 | return int(round(scorer(s1, s2))) 33 | 34 | 35 | def ratio(s1, s2): 36 | return _rapidfuzz_scorer(_ratio, s1, s2, False, False) 37 | 38 | 39 | def partial_ratio(s1, s2): 40 | """ 41 | Return the ratio of the most similar substring 42 | as a number between 0 and 100. 
43 | """ 44 | return _rapidfuzz_scorer(_partial_ratio, s1, s2, False, False) 45 | 46 | 47 | ############################## 48 | # Advanced Scoring Functions # 49 | ############################## 50 | 51 | # Sorted Token 52 | # find all alphanumeric tokens in the string 53 | # sort those tokens and take ratio of resulting joined strings 54 | # controls for unordered string elements 55 | def token_sort_ratio(s1, s2, force_ascii=True, full_process=True): 56 | """ 57 | Return a measure of the sequences' similarity between 0 and 100 58 | but sorting the token before comparing. 59 | """ 60 | return _rapidfuzz_scorer(_token_sort_ratio, s1, s2, force_ascii, full_process) 61 | 62 | 63 | def partial_token_sort_ratio(s1, s2, force_ascii=True, full_process=True): 64 | """ 65 | Return the ratio of the most similar substring as a number between 66 | 0 and 100 but sorting the token before comparing. 67 | """ 68 | return _rapidfuzz_scorer( 69 | _partial_token_sort_ratio, s1, s2, force_ascii, full_process 70 | ) 71 | 72 | 73 | def token_set_ratio(s1, s2, force_ascii=True, full_process=True): 74 | return _rapidfuzz_scorer(_token_set_ratio, s1, s2, force_ascii, full_process) 75 | 76 | 77 | def partial_token_set_ratio(s1, s2, force_ascii=True, full_process=True): 78 | return _rapidfuzz_scorer( 79 | _partial_token_set_ratio, s1, s2, force_ascii, full_process 80 | ) 81 | 82 | 83 | ################### 84 | # Combination API # 85 | ################### 86 | 87 | # q is for quick 88 | def QRatio(s1, s2, force_ascii=True, full_process=True): 89 | """ 90 | Quick ratio comparison between two strings. 91 | 92 | Runs full_process from utils on both strings 93 | Short circuits if either of the strings is empty after processing. 
94 | 95 | :param s1: 96 | :param s2: 97 | :param force_ascii: Allow only ASCII characters (Default: True) 98 | :full_process: Process inputs, used here to avoid double processing in extract functions (Default: True) 99 | :return: similarity ratio 100 | """ 101 | return _rapidfuzz_scorer(_QRatio, s1, s2, force_ascii, full_process) 102 | 103 | 104 | def UQRatio(s1, s2, full_process=True): 105 | """ 106 | Unicode quick ratio 107 | 108 | Calls QRatio with force_ascii set to False 109 | 110 | :param s1: 111 | :param s2: 112 | :return: similarity ratio 113 | """ 114 | return QRatio(s1, s2, force_ascii=False, full_process=full_process) 115 | 116 | 117 | # w is for weighted 118 | def WRatio(s1, s2, force_ascii=True, full_process=True): 119 | """ 120 | Return a measure of the sequences' similarity between 0 and 100, using different algorithms. 121 | 122 | **Steps in the order they occur** 123 | 124 | #. Run full_process from utils on both strings 125 | #. Short circuit if this makes either string empty 126 | #. Take the ratio of the two processed strings (fuzz.ratio) 127 | #. Run checks to compare the length of the strings 128 | * If one of the strings is more than 1.5 times as long as the other 129 | use partial_ratio comparisons - scale partial results by 0.9 130 | (this makes sure only full results can return 100) 131 | * If one of the strings is over 8 times as long as the other 132 | instead scale by 0.6 133 | 134 | #. Run the other ratio functions 135 | * if using partial ratio functions call partial_ratio, 136 | partial_token_sort_ratio and partial_token_set_ratio 137 | scale all of these by the ratio based on length 138 | * otherwise call token_sort_ratio and token_set_ratio 139 | * all token based comparisons are scaled by 0.95 140 | (on top of any partial scalars) 141 | 142 | #. Take the highest value from these results 143 | round it and return it as an integer. 
144 | 145 | :param s1: 146 | :param s2: 147 | :param force_ascii: Allow only ascii characters 148 | :type force_ascii: bool 149 | :full_process: Process inputs, used here to avoid double processing in extract functions (Default: True) 150 | :return: 151 | """ 152 | return _rapidfuzz_scorer(_WRatio, s1, s2, force_ascii, full_process) 153 | 154 | 155 | def UWRatio(s1, s2, full_process=True): 156 | """ 157 | Return a measure of the sequences' similarity between 0 and 100, 158 | using different algorithms. Same as WRatio but preserving unicode. 159 | """ 160 | return WRatio(s1, s2, force_ascii=False, full_process=full_process) 161 | -------------------------------------------------------------------------------- /thefuzz/fuzz.pyi: -------------------------------------------------------------------------------- 1 | def ratio(s1: str, s2: str) -> int: ... 2 | def partial_ratio(s1: str, s2: str) -> int: ... 3 | def token_sort_ratio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 4 | def partial_token_sort_ratio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 5 | def token_set_ratio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 6 | def partial_token_set_ratio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 7 | def QRatio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 8 | def UQRatio(s1: str, s2: str, full_process: bool = ...) -> int: ... 9 | def WRatio(s1: str, s2: str, force_ascii: bool = ..., full_process: bool = ...) -> int: ... 10 | def UWRatio(s1: str, s2: str, full_process: bool = ...) -> int: ... 11 | -------------------------------------------------------------------------------- /thefuzz/process.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from . import fuzz 3 | from . 
import utils 4 | import logging 5 | from rapidfuzz import fuzz as rfuzz 6 | from rapidfuzz import process as rprocess 7 | from functools import partial 8 | 9 | _logger = logging.getLogger(__name__) 10 | 11 | default_scorer = fuzz.WRatio 12 | default_processor = utils.full_process 13 | 14 | 15 | def _get_processor(processor, scorer): 16 | """ 17 | thefuzz runs both the default preprocessing of the function and the preprocessing 18 | function passed into process.* while rapidfuzz only runs the one passed into 19 | process.*. This function wraps the processor to mimic this behavior 20 | """ 21 | if scorer not in (fuzz.WRatio, fuzz.QRatio, 22 | fuzz.token_set_ratio, fuzz.token_sort_ratio, 23 | fuzz.partial_token_set_ratio, fuzz.partial_token_sort_ratio, 24 | fuzz.UWRatio, fuzz.UQRatio): 25 | return processor 26 | 27 | force_ascii = scorer not in [fuzz.UWRatio, fuzz.UQRatio] 28 | pre_processor = partial(utils.full_process, force_ascii=force_ascii) 29 | 30 | if not processor or processor == utils.full_process: 31 | return pre_processor 32 | 33 | def wrapper(s): 34 | return pre_processor(processor(s)) 35 | 36 | return wrapper 37 | 38 | 39 | # this allows lowering the scorers back to the scorers used in rapidfuzz 40 | # this allows rapidfuzz to perform more optimizations behind the scenes. 
41 | # These mapped scorers are the same with two expceptions 42 | # - default processor 43 | # - result is not rounded 44 | # these two exceptions need to be taken into account in the implementation 45 | _scorer_lowering = { 46 | fuzz.ratio: rfuzz.ratio, 47 | fuzz.partial_ratio: rfuzz.partial_ratio, 48 | fuzz.token_set_ratio: rfuzz.token_set_ratio, 49 | fuzz.token_sort_ratio: rfuzz.token_sort_ratio, 50 | fuzz.partial_token_set_ratio: rfuzz.partial_token_set_ratio, 51 | fuzz.partial_token_sort_ratio: rfuzz.partial_token_sort_ratio, 52 | fuzz.WRatio: rfuzz.WRatio, 53 | fuzz.QRatio: rfuzz.QRatio, 54 | fuzz.UWRatio: rfuzz.WRatio, 55 | fuzz.UQRatio: rfuzz.QRatio, 56 | } 57 | 58 | 59 | def _get_scorer(scorer): 60 | """ 61 | rapidfuzz scorers require the score_cutoff argument to be available 62 | This generates a compatible wrapper function 63 | """ 64 | def wrapper(s1, s2, score_cutoff=0): 65 | return scorer(s1, s2) 66 | 67 | return _scorer_lowering.get(scorer, wrapper) 68 | 69 | 70 | def _preprocess_query(query, processor): 71 | processed_query = processor(query) if processor else query 72 | if len(processed_query) == 0: 73 | _logger.warning("Applied processor reduces input query to empty string, " 74 | "all comparisons will have score 0. " 75 | f"[Query: \'{query}\']") 76 | 77 | return processed_query 78 | 79 | 80 | def extractWithoutOrder(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0): 81 | """ 82 | Select the best match in a list or dictionary of choices. 83 | 84 | Find best matches in a list or dictionary of choices, return a 85 | generator of tuples containing the match and its score. If a dictionary 86 | is used, also returns the key for each match. 87 | 88 | Arguments: 89 | query: An object representing the thing we want to find. 90 | choices: An iterable or dictionary-like object containing choices 91 | to be matched against the query. 
Dictionary arguments of 92 | {key: value} pairs will attempt to match the query against 93 | each value. 94 | processor: Optional function of the form f(a) -> b, where a is the query or 95 | individual choice and b is the choice to be used in matching. 96 | 97 | This can be used to match against, say, the first element of 98 | a list: 99 | 100 | lambda x: x[0] 101 | 102 | Defaults to thefuzz.utils.full_process(). 103 | scorer: Optional function for scoring matches between the query and 104 | an individual processed choice. This should be a function 105 | of the form f(query, choice) -> int. 106 | 107 | By default, fuzz.WRatio() is used and expects both query and 108 | choice to be strings. 109 | score_cutoff: Optional argument for score threshold. No matches with 110 | a score less than this number will be returned. Defaults to 0. 111 | 112 | Returns: 113 | Generator of tuples containing the match and its score. 114 | 115 | If a list is used for choices, then the result will be 2-tuples. 116 | If a dictionary is used, then the result will be 3-tuples containing 117 | the key for each match. 118 | 119 | For example, searching for 'bird' in the dictionary 120 | 121 | {'bard': 'train', 'dog': 'man'} 122 | 123 | may return 124 | 125 | ('train', 22, 'bard'), ('man', 0, 'dog') 126 | """ 127 | is_mapping = hasattr(choices, "items") 128 | is_lowered = scorer in _scorer_lowering 129 | 130 | query = _preprocess_query(query, processor) 131 | it = rprocess.extract_iter( 132 | query, choices, 133 | processor=_get_processor(processor, scorer), 134 | scorer=_get_scorer(scorer), 135 | score_cutoff=score_cutoff 136 | ) 137 | 138 | for choice, score, key in it: 139 | if is_lowered: 140 | score = int(round(score)) 141 | 142 | yield (choice, score, key) if is_mapping else (choice, score) 143 | 144 | 145 | def extract(query, choices, processor=default_processor, scorer=default_scorer, limit=5): 146 | """ 147 | Select the best match in a list or dictionary of choices. 
148 | 149 | Find best matches in a list or dictionary of choices, return a 150 | list of tuples containing the match and its score. If a dictionary 151 | is used, also returns the key for each match. 152 | 153 | Arguments: 154 | query: An object representing the thing we want to find. 155 | choices: An iterable or dictionary-like object containing choices 156 | to be matched against the query. Dictionary arguments of 157 | {key: value} pairs will attempt to match the query against 158 | each value. 159 | processor: Optional function of the form f(a) -> b, where a is the query or 160 | individual choice and b is the choice to be used in matching. 161 | 162 | This can be used to match against, say, the first element of 163 | a list: 164 | 165 | lambda x: x[0] 166 | 167 | Defaults to thefuzz.utils.full_process(). 168 | scorer: Optional function for scoring matches between the query and 169 | an individual processed choice. This should be a function 170 | of the form f(query, choice) -> int. 171 | By default, fuzz.WRatio() is used and expects both query and 172 | choice to be strings. 173 | limit: Optional maximum for the number of elements returned. Defaults 174 | to 5. 175 | 176 | Returns: 177 | List of tuples containing the match and its score. 178 | 179 | If a list is used for choices, then the result will be 2-tuples. 180 | If a dictionary is used, then the result will be 3-tuples containing 181 | the key for each match. 182 | 183 | For example, searching for 'bird' in the dictionary 184 | 185 | {'bard': 'train', 'dog': 'man'} 186 | 187 | may return 188 | 189 | [('train', 22, 'bard'), ('man', 0, 'dog')] 190 | """ 191 | return extractBests(query, choices, processor=processor, scorer=scorer, limit=limit) 192 | 193 | 194 | def extractBests(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0, limit=5): 195 | """ 196 | Get a list of the best matches to a collection of choices. 
197 | 198 | Convenience function for getting the choices with best scores. 199 | 200 | Args: 201 | query: A string to match against 202 | choices: A list or dictionary of choices, suitable for use with 203 | extract(). 204 | processor: Optional function for transforming choices before matching. 205 | See extract(). 206 | scorer: Scoring function for extract(). 207 | score_cutoff: Optional argument for score threshold. No matches with 208 | a score less than this number will be returned. Defaults to 0. 209 | limit: Optional maximum for the number of elements returned. Defaults 210 | to 5. 211 | 212 | Returns: A a list of (match, score) tuples. 213 | """ 214 | is_mapping = hasattr(choices, "items") 215 | is_lowered = scorer in _scorer_lowering 216 | 217 | query = _preprocess_query(query, processor) 218 | results = rprocess.extract( 219 | query, choices, 220 | processor=_get_processor(processor, scorer), 221 | scorer=_get_scorer(scorer), 222 | score_cutoff=score_cutoff, 223 | limit=limit 224 | ) 225 | 226 | for i, (choice, score, key) in enumerate(results): 227 | if is_lowered: 228 | score = int(round(score)) 229 | 230 | results[i] = (choice, score, key) if is_mapping else (choice, score) 231 | 232 | return results 233 | 234 | 235 | def extractOne(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0): 236 | """ 237 | Find the single best match above a score in a list of choices. 238 | 239 | This is a convenience method which returns the single best choice. 240 | See extract() for the full arguments list. 241 | 242 | Args: 243 | query: A string to match against 244 | choices: A list or dictionary of choices, suitable for use with 245 | extract(). 246 | processor: Optional function for transforming choices before matching. 247 | See extract(). 248 | scorer: Scoring function for extract(). 249 | score_cutoff: Optional argument for score threshold. 
If the best 250 | match is found, but it is not greater than this number, then 251 | return None anyway ("not a good enough match"). Defaults to 0. 252 | 253 | Returns: 254 | A tuple containing a single match and its score, if a match 255 | was found that was above score_cutoff. Otherwise, returns None. 256 | """ 257 | is_mapping = hasattr(choices, "items") 258 | is_lowered = scorer in _scorer_lowering 259 | 260 | query = _preprocess_query(query, processor) 261 | res = rprocess.extractOne( 262 | query, choices, 263 | processor=_get_processor(processor, scorer), 264 | scorer=_get_scorer(scorer), 265 | score_cutoff=score_cutoff 266 | ) 267 | 268 | if res is None: 269 | return res 270 | 271 | choice, score, key = res 272 | 273 | if is_lowered: 274 | score = int(round(score)) 275 | 276 | return (choice, score, key) if is_mapping else (choice, score) 277 | 278 | 279 | def dedupe(contains_dupes, threshold=70, scorer=fuzz.token_set_ratio): 280 | """ 281 | This convenience function takes a list of strings containing duplicates and uses fuzzy matching to identify 282 | and remove duplicates. Specifically, it uses process.extract to identify duplicates that 283 | score greater than a user defined threshold. Then, it looks for the longest item in the duplicate list 284 | since we assume this item contains the most entity information and returns that. It breaks string 285 | length ties on an alphabetical sort. 286 | 287 | Note: as the threshold DECREASES the number of duplicates that are found INCREASES. This means that the 288 | returned deduplicated list will likely be shorter. Raise the threshold for dedupe to be less 289 | sensitive. 290 | 291 | Args: 292 | contains_dupes: A list of strings that we would like to dedupe. 293 | threshold: the numerical value (0,100) point at which we expect to find duplicates. 294 | Defaults to 70 out of 100 295 | scorer: Optional function for scoring matches between the query and 296 | an individual processed choice. 
This should be a function 297 | of the form f(query, choice) -> int. 298 | By default, fuzz.token_set_ratio() is used and expects both query and 299 | choice to be strings. 300 | 301 | Returns: 302 | A deduplicated list. For example: 303 | 304 | In: contains_dupes = ['Frodo Baggin', 'Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins'] 305 | In: dedupe(contains_dupes) 306 | Out: ['Frodo Baggins', 'Samwise G.', 'Bilbo Baggins', 'Gandalf'] 307 | """ 308 | deduped = set() 309 | for item in contains_dupes: 310 | matches = extractBests(item, contains_dupes, scorer=scorer, score_cutoff=threshold, limit=None) 311 | deduped.add(max(matches, key=lambda x: (len(x[0]), x[0]))[0]) 312 | 313 | return list(deduped) if len(deduped) != len(contains_dupes) else contains_dupes 314 | -------------------------------------------------------------------------------- /thefuzz/process.pyi: -------------------------------------------------------------------------------- 1 | from collections.abc import Mapping 2 | import typing 3 | from typing import Any, Callable, Union, Tuple, Generator, TypeVar, Sequence 4 | 5 | 6 | ChoicesT = Union[Mapping[str, str], Sequence[str]] 7 | T = TypeVar('T') 8 | ProcessorT = Union[Callable[[str, bool], str], Callable[[Any], Any]] 9 | ScorerT = Callable[[str, str, bool, bool], int] 10 | 11 | 12 | @typing.overload 13 | def extractWithoutOrder(query: str, choices: Mapping[str, str], processor: ProcessorT, scorer: ScorerT, score_cutoff: int = ...) -> Generator[Tuple[str, int, str], None, None]: ... 14 | 15 | 16 | @typing.overload 17 | def extractWithoutOrder(query: str, choices: Sequence[str], processor: ProcessorT, scorer: ScorerT, score_cutoff: int = ...) -> Generator[Tuple[str, int], None, None]: ... 
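
The extract functions in process.py return 2-tuples for list-like choices and 3-tuples (with the key) for mappings, decided by the `hasattr(choices, "items")` check. A minimal pure-Python sketch of that shape follows; `extract_sketch` and `exact_scorer` are hypothetical names, and the loop stands in for rapidfuzz's `process.extract_iter` rather than reproducing it.

```python
def exact_scorer(query, choice):
    """Hypothetical toy scorer: 100 on an exact match, otherwise 0."""
    return 100 if query == choice else 0


def extract_sketch(query, choices, scorer=exact_scorer, score_cutoff=0):
    """Yield (choice, score) for sequences and (choice, score, key) for mappings."""
    is_mapping = hasattr(choices, "items")
    pairs = choices.items() if is_mapping else enumerate(choices)
    for key, choice in pairs:
        score = scorer(query, choice)
        # Results scoring below the cutoff are dropped.
        if score >= score_cutoff:
            yield (choice, score, key) if is_mapping else (choice, score)
```

For a mapping, the query is compared against each value and the key is only carried along in the result, which matches the docstring's description of dictionary arguments.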
18 | -------------------------------------------------------------------------------- /thefuzz/py.typed: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /thefuzz/utils.py: -------------------------------------------------------------------------------- 1 | from rapidfuzz.utils import default_process as _default_process 2 | 3 | translation_table = {i: None for i in range(128, 256)} # ascii dammit! 4 | 5 | 6 | def ascii_only(s): 7 | return s.translate(translation_table) 8 | 9 | 10 | def full_process(s, force_ascii=False): 11 | """ 12 | Process string by 13 | -- removing all but letters and numbers 14 | -- trim whitespace 15 | -- force to lower case 16 | if force_ascii == True, force convert to ascii 17 | """ 18 | 19 | if force_ascii: 20 | s = ascii_only(str(s)) 21 | 22 | return _default_process(s) 23 | -------------------------------------------------------------------------------- /thefuzz/utils.pyi: -------------------------------------------------------------------------------- 1 | 2 | def ascii_only(s: str) -> str: ... 3 | def full_process(s: str, force_ascii: bool = ...) -> str: ... 4 | -------------------------------------------------------------------------------- /youtube/LLM Apps: Professional Opportunities for LLM App Developers..m4a: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-LLM-Bootcamp/data/f776b47268f2d7152ac1accf3ce47472ec1a59e7/youtube/LLM Apps: Professional Opportunities for LLM App Developers..m4a --------------------------------------------------------------------------------
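
The length checks described in the WRatio docstring in thefuzz/fuzz.py can be written out as a small helper. This is a sketch that follows the docstring's wording only; `wratio_scaling` is a hypothetical name, the real logic lives in rapidfuzz, and the handling of lengths at exactly 1.5x and 8x is an assumption here.

```python
def wratio_scaling(len1, len2):
    """Return which comparison family applies and the scale factors to use."""
    if min(len1, len2) == 0:
        # The docstring says WRatio short-circuits when a string is empty.
        raise ValueError("both strings must be non-empty after processing")

    length_ratio = max(len1, len2) / min(len1, len2)

    if length_ratio <= 1.5:
        # Similar lengths: full comparisons, token-based results scaled by 0.95.
        return {"use_partial": False, "partial_scale": 1.0, "token_scale": 0.95}

    # Very different lengths: partial comparisons, scaled more heavily past 8x.
    partial_scale = 0.6 if length_ratio > 8 else 0.9
    return {
        "use_partial": True,
        "partial_scale": partial_scale,
        # Token-based comparisons get 0.95 on top of the partial scale.
        "token_scale": 0.95 * partial_scale,
    }
```

WRatio then takes the highest of the scaled results, so the scale factors ensure that only a full, unscaled comparison can reach 100.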