├── README.md ├── sampledocs ├── stopwords ├── article4 ├── article1 ├── article2 ├── article3 └── article5 └── LocalitySensitiveHashing.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Locality Sensitive Hashing Tutorial 2 | 3 | As the name suggests, this is a tutorial on locality sensitive hashing. All of the information is contained in the notebook. 4 | 5 | The sampledocs folder contains some artificial data for performing the document similarity task. It consists of news articles pulled from CNN, with one document consisting of partial concatenations of the others. This is to create artificially similar documents, which our algorithms are trying to find. 6 | 7 | The similarity task for vectors can easily generate synthetic data by just creating random matrices, so we do that in the notebook. 8 | -------------------------------------------------------------------------------- /sampledocs/stopwords: -------------------------------------------------------------------------------- 1 | i 2 | me 3 | my 4 | myself 5 | we 6 | our 7 | ours 8 | ourselves 9 | you 10 | your 11 | yours 12 | yourself 13 | yourselves 14 | he 15 | him 16 | his 17 | himself 18 | she 19 | her 20 | hers 21 | herself 22 | it 23 | its 24 | itself 25 | they 26 | them 27 | their 28 | theirs 29 | themselves 30 | what 31 | which 32 | who 33 | whom 34 | this 35 | that 36 | these 37 | those 38 | am 39 | is 40 | are 41 | was 42 | were 43 | be 44 | been 45 | being 46 | have 47 | has 48 | had 49 | having 50 | do 51 | does 52 | did 53 | doing 54 | a 55 | an 56 | the 57 | and 58 | but 59 | if 60 | or 61 | because 62 | as 63 | until 64 | while 65 | of 66 | at 67 | by 68 | for 69 | with 70 | about 71 | against 72 | between 73 | into 74 | through 75 | during 76 | before 77 | after 78 | above 79 | below 80 | to 81 | from 82 | up 83 | down 84 | in 85 | out 86 | on 87 | off 88 | over 89 | under 90 | again 91 | further 92 | then 93 | once 94 | here 95 | there 96 
| when 97 | where 98 | why 99 | how 100 | all 101 | any 102 | both 103 | each 104 | few 105 | more 106 | most 107 | other 108 | some 109 | such 110 | no 111 | nor 112 | not 113 | only 114 | own 115 | same 116 | so 117 | than 118 | too 119 | very 120 | s 121 | t 122 | can 123 | will 124 | just 125 | don 126 | should 127 | now 128 | -------------------------------------------------------------------------------- /sampledocs/article4: -------------------------------------------------------------------------------- 1 | ew quarantine hobbies have unearthed new passions, some bringing with them a literal silver lining. 2 | This year, backyard archaeologists in the United Kingdom have recorded discoveries of more than 47,000 objects, the British Museum announced this week. 3 | Regular people found the vast majority of the historical artifacts by traversing the countryside with metal detectors, before adding or updating records through the museum's Portable Antiquities Scheme. 4 | The British Museum said the program also saw an uptick in people updating digital records of antiquities while the country was under a full lockdown between March 22 and May 13. 5 | That database contains records of more than 1.5 million objects discovered since 1998 by the general public rather than by professional archaeologists. 6 | "It is brilliant to see the scheme growing from strength to strength during lockdown thanks to garden discoveries and digital reporting," said UK Culture Minister Caroline Dinenage, in a news release. 7 | This list of garden treasures dug up this year includes a 13th century medieval seal, bearing a Latin inscription reading "David, God's messenger, bishop of St. Andrews." 8 | And shining out among the new discoveries are two hoards of coins. 9 | One of those troves, which contained 50 South African solid gold coins, was unearthed in Milton Keynes, a town about 50 miles northwest of London. 
10 | It's a mystery how those coins, minted in the 1970s during South Africa's apartheid era, wound up buried in a British backyard after a half-century. 11 | The other major coin hoard, which held 63 gold coins and one silver coin featuring monarchs Edward IV and Henry VIII, was likely buried in the 16th century. It included coins bearing the initials of several of Henry VIII's wives, including Catherine of Aragon, Anne Boleyn and Jane Seymour. 12 | Nearly 500 years later during the Covid-19 pandemic, residents rediscovered them while weeding in their garden. 13 | Another amateur find during the pandemic was an ancient Roman furniture fitting made of a copper alloy, and clearly featuring the face of the god Oceanus. 14 | That artifact, found in Old Basing about 50 miles southeast of London, dates as far back as the 1st century. 15 | A new report from PAS shows some 81,602 objects added to the scheme in 2019, before the recent spate of lockdown treasure hunting, each of those items now coming under public ownership. 16 | The UK's Treasure Act of 1996 requires that finders report each discovery more than 300 years old to the local coroner in the area in which they found it. 17 | If the local authority defines the object as treasure, then it takes the find to the British Museum to be valued. The government then pays a fair market price to the discoverer. 18 | The law is intended to allow national or local museums to ultimately acquire the historic treasures so that the overall public can benefit. 19 | Even during the pandemic the Portable Antiquities Scheme's liaison officers have been able to reach out to finders and obtain relics of significance, Michael Lewis, who heads the program, said in the news release. 20 | The mission continues to "ensure finds, important for understanding Britain's past, are not lost but instead recorded for posterity," he said. 
21 | -------------------------------------------------------------------------------- /sampledocs/article1: -------------------------------------------------------------------------------- 1 | Washington (CNN)White House chief of staff Mark Meadows told Food and Drug Administration Commissioner Dr. Stephen Hahn he needed to grant an emergency use authorization for Pfizer/BioNTech's coronavirus vaccine by the end of Friday, and if not, he needs to resign, an administration official and a source familiar with the situation tell CNN. 2 | 3 | Another person familiar with matter, who also confirmed the demand that the vaccine be authorized by the end of Friday, said President Donald Trump has been venting about the FDA chief since the vaccine was rolled out in the UK earlier this week. 4 | The two men had a call Friday morning. A White House official said they do not comment on private conversations but the chief "regularly requests updates on the progress toward a vaccine." 5 | Hahn quickly disputed the description of the conversation, which was first reported by The Washington Post, but the news is likely to raise additional questions about the extent to which Trump administration political interests are involved with the vaccine authorization process, and could undermine public confidence in the effort.Vaccine advisers to the FDA voted Thursday to recommend the agency grant emergency use authorization to the vaccine, and it's expected to be authorized imminently."This is an untrue representation of the phone call with the Chief of Staff. The FDA was encouraged to continue working expeditiously on Pfizer-BioNTech's (emergency use authorization) request," Hahn said in a statement Friday afternoon. "FDA is committed to issuing this authorization quickly, as we noted in our statement this morning." 6 | A White House official familiar with the conversation between Meadows and Hahn said it's doubtful the FDA commissioner will actually be fired. 
The blunt warning from the chief of staff that he might as well resign if Pfizer's emergency use authorization isn't granted by Friday was a larger sign of the President's frustration, this person said. 7 | Public health experts have been fearful all along that White House officials would put undue pressure on the authorization process and, in turn, compromise public confidence in the vaccine, a source close to the White House coronavirus task force told CNN. 8 | This source said it's unclear why Meadows would make such a threat this late in the process and that authorization of the vaccine is expected at any moment. This person added that authorization could come by the end of Friday. 9 | Dr. Moncef Slaoui, the head of the US government's effort to develop a vaccine against Covid-19, told CNN's Jake Tapper on "The Lead" Friday he was concerned about the reports' potential effect on undermining confidence in the vaccine's safety. 10 | "Yes, I think there is an opportunity there for people to see undue pressure if the story is right," he said while also defending the FDA's authorization process, which he described as "an effective, transparent, thorough, in-depth review." 11 | Trump has grown impatient with the authorization process in recent weeks, with one person familiar with the President's thinking telling CNN that Trump wants to rush out as many vaccines as possible before he leaves office. 12 | On Friday, he called the FDA "a big, old, slow turtle." 13 | "Get the dam vaccines out NOW, Dr. Hahn @SteveFDA," Trump tweeted. "Stop playing games and start saving lives!!!" 14 | President-elect Joe Biden, in remarks Friday afternoon at an event announcing additional top administration picks, did not address the news of the Meadows demand to Hahn but urged the public to have faith in the vaccine and expressed gratitude "to the scientists and the public experts who evaluated its safety and efficiency, free from political influence." 
15 | "I want to make it clear to the public, you should have confidence in this -- there is no political influence," Biden said. "These are first-rate scientists taking their time looking at all of the elements that need to be looked at. Scientific integrity led us to this point." 16 | -------------------------------------------------------------------------------- /sampledocs/article2: -------------------------------------------------------------------------------- 1 | Investigators with the Manhattan district attorney's office have interviewed several employees at President Donald Trump's lender and insurer in recent weeks as part of a wide-ranging investigation into the Trump Organization, according to multiple people familiar with the investigation. 2 | 3 | Two employees of Deutsche Bank, which has loaned more than $300 million to the Trump Organization, were interviewed by prosecutors, according to sources familiar with the matter. 4 | The interviews took place after the November presidential election, the people said, and focused on general questions about how bankers assess loans and underwriting criteria. 5 | The questioning was not specific to the bank's dealing with the Trump Organization or the President, the people said, with one person adding that it was the beginning of the process. Additional interviews are expected in the near future, they said. 6 | Prosecutors also interviewed at least one employee at Aon, an insurance broker who has done work with the President's company, according to one source familiar with the matter. 7 | A spokeswoman for Aon confirmed the company received a subpoena and said it is cooperating with the investigation. The spokeswoman declined to comment on any employee interviews. Representatives for Deutsche Bank and the district attorney's office, led by Cyrus Vance, also declined to comment. Deutsche Bank was subpoenaed as part of the investigation last year and has said it cooperates with authorized investigations. 
8 | The New York Times first reported on the interviews with Deutsche Bank and Aon employees. 9 | The interviews with Trump counterparties comes as prosecutors wait for a decision by the US Supreme Court over a grand jury subpoena for the President's tax returns. The President has lost several legal challenges in an attempt to block the subpoena to Mazars USA, his long-time accounting firm, for eight years of his personal and business records and tax returns. 10 | Last month, the 2nd US Circuit Court of Appeals denied Trump's latest effort to block the subpoena paving the way for it to be enforced. The President's lawyers have asked the Supreme Court to stay, or halt, the ruling and a decision is expected any day. 11 | The Mazars records are critical to the investigation, prosecutors have said. The Manhattan district attorney investigation is the only criminal inquiry facing Trump, his business and his family and will continue after he leaves office. Trump has had discussions about issuing pardons to his family members and possibly himself, CNN has reported, but those pardons would not insulate him from a state criminal indictment. 12 | In court filings the district attorney's office has suggested the inquiry could involve tax fraud, insurance fraud and schemes to defraud its lenders. They also recently subpoenaed the Trump Organization for records relating to fees it has paid to consultants, including a payment made to a company controlled by the President's daughter, Ivanka Trump, according to people familiar with the matter. 13 | The Trump Organization has denied any wrongdoing and said applicable taxes were paid. 14 | Last year prosecutors with the district attorney's office interviewed Michael Cohen, the President's former personal attorney, at least three times about his knowledge of the Trump Organization's business dealings. 
15 | Cohen testified before Congress in February 2019 that the Trump Organization allegedly manipulated its financial statements to suit its desired outcomes. Cohen said Trump "deflated his assets to reduce his real estate taxes." And he alleged company officials would play with the financial numbers when dealing with insurance companies and Deutsche Bank. 16 | Specifically, Cohen alleged the President inflated the value of his assets at times, including in 2014 when Trump submitted documents to Deutsche Bank as part of an attempt to bid for the Buffalo Bills football team. Trump never did the loan. 17 | Cohen pleaded guilty to federal crimes, including campaign finance charges for facilitating hush-money payments to silence two woman's allegations of affairs with Trump. Trump has denied the affairs. Cohen is serving a three year prison sentence and was released to home confinement earlier this year due to the coronavirus pandemic. 18 | 19 | -------------------------------------------------------------------------------- /sampledocs/article3: -------------------------------------------------------------------------------- 1 | In the final days of his desperate, dishonest campaign to upend last month's election, President Donald Trump tossed off a particularly audacious and offensive challenge aimed at those he somehow thinks can change the outcome. 2 | 3 | "Let's see if they have the courage to do what everybody in this country knows is right," he said. 4 | With time running out, his blatant hope was to intimidate and bait state legislatures or the Supreme Court into overturning a vote of the people, by legislative or judicial fiat. 5 | Trump's perverse definition of "courage" and "right," of course, amounts to a willingness to bend truth to his will and prize his continuation in office over American democracy. 6 | Yet against this madness, we have witnessed many acts of genuine courage. 
Of people of both parties bravely doing right.Secretaries of state and election authorities in the contested states, Republicans and Democrats, have weathered death threats and vows of political retribution simply for doing their jobs and counting, recounting and certifying the vote. They have shown enormous courage and deserve our gratitude and respect. 7 | Thousands of everyday Americans worked around the clock -- in the midst of a pandemic and sometimes with the din of howling mobs in the background -- to count and recount the votes. These patriotic Americans showed inspiring courage. 8 | So, too, did the governors who loyally campaigned for Trump but refused to bow to his threats and intimidation after the election. 9 | Yes, they were simply doing what the law required and democracy demands by certifying the vote in their states. But these officials acted knowing that in the hothouse of Trump's Republican Party, doing their duty now could cost them their jobs in primaries later. 10 | Dozens of judges -- some appointed by Trump -- have summarily dismissed his ferocious, groundless assault on the election results. They have shown gratifying fidelity to the law. 11 | Trump has made clear from the very start of his presidency that he believed that every branch, every person in government, should be beholden to him before the law and their oaths of office.He told us before the election, in rushing through the Supreme Court confirmation of Amy Coney Barrett, that he wanted nine justices to rule on election disputes. His implication was clear: My justices. My way. 12 | But this very conservative Supreme Court would not (so far, at least) be enlisted in the profane and unconstitutional mission of overturning an election on his behalf. 13 | As they and other federal judges have lifetime appointments, perhaps it took less courage than it did for those elected officials who have risked their careers -- and even their lives -- to stand by the rule of law. 
But the justices deserve credit, too, for firmly and unanimously rejecting the absurd lawsuits Trump and his loyalists have pushed their way. 14 | Courageous, too, have been those handful of elected Republicans who have acknowledged the results, congratulated President-elect Joe Biden and urged a peaceful, orderly transition of power. 15 | But if we have seen acts of courage, we also have seen cowardice. 16 | Rudy Giuliani and the Trump legal team have debased themselves, our legal system and democracy by filing one frivolous lawsuit after another, crying fraud on TV -- but not in the courtroom, where evidence is required.And too many Republican officials, fearful of getting sideways with Trump's base, have dutifully echoed his dishonest charges of vote fraud. 17 | A group of Republican state attorneys general lined up in support of a preposterous eleventh-hour filing from the attorney general of Texas, asking the Supreme Court to overturn the results in four states Trump lost. 18 | The suit was filed without standing, supporting evidence or a colorable argument. Nonetheless, Rep. Mike Johnson of Louisiana circulated an email among his Republican House colleagues urging them to sign an amicus brief in support of this outrageous folly. Trump is "anxiously awaiting the final list" to see who signs on to the amicus brief, Johnson wrote, implying dire consequences for anyone who failed join. 19 | Within 24 hours, more than 100 House Republicans complied. 20 | Supporters of Trump have gone so far as to argue that Republicans may stage a battle on the House floor to reject the Biden electors from contested states, a symbolic gesture that would fail but set a chilling precedent.Almost as egregious have been the many Republican members of the House and Senate who have refused to acknowledge Biden's victory and countenanced weeks of delay in the transition. 
21 | Through their silence, they have given credence to Trump's blatant lies, which have now gained traction among a large majority of Republicans nationally. 22 | It's crazy, authoritarian stuff. If it were occurring anywhere else, Americans would condemn it as an appalling attempt to undermine democracy. 23 | History will scorn the cowards who meekly complied with Trump's scheme to tarnish and overturn the election -- and honor the many who showed courage and fidelity to the rule of law during this time of trial. 24 | -------------------------------------------------------------------------------- /sampledocs/article5: -------------------------------------------------------------------------------- 1 | ew quarantine hobbies have unearthed new passions, some bringing with them a literal silver lining. 2 | This year, backyard archaeologists in the United Kingdom have recorded discoveries of more than 47,000 objects, the British Museum announced this week. 3 | Regular people found the vast majority of the historical artifacts by traversing the countryside with metal detectors, before adding or updating records through the museum's Portable Antiquities Scheme. 4 | The British Museum said the program also saw an uptick in people updating digital records of antiquities while the country was under a full lockdown between March 22 and May 13. 5 | That database contains records of more than 1.5 million objects discovered since 1998 by the general public rather than by professional archaeologists. 6 | "It is brilliant to see the scheme growing from strength to strength during lockdown thanks to garden discoveries and digital reporting," said UK Culture Minister Caroline Dinenage, in a news release. 7 | This list of garden treasures dug up this year includes a 13th century medieval seal, bearing a Latin inscription reading "David, God's messenger, bishop of St. Andrews." 
8 | The other major coin hoard, which held 63 gold coins and one silver coin featuring monarchs Edward IV and Henry VIII, was likely buried in the 16th century. It included coins bearing the initials of several of Henry VIII's wives, including Catherine of Aragon, Anne Boleyn and Jane Seymour. 9 | Nearly 500 years later during the Covid-19 pandemic, residents rediscovered them while weeding in their garden. 10 | Another amateur find during the pandemic was an ancient Roman furniture fitting made of a copper alloy, and clearly featuring the face of the god Oceanus. 11 | That artifact, found in Old Basing about 50 miles southeast of London, dates as far back as the 1st century. 12 | A new report from PAS shows some 81,602 objects added to the scheme in 2019, before the recent spate of lockdown treasure hunting, each of those items now coming under public ownership. 13 | If the local authority defines the object as treasure, then it takes the find to the British Museum to be valued. The government then pays a fair market price to the discoverer. 14 | The law is intended to allow national or local museums to ultimately acquire the historic treasures so that the overall public can benefit. 15 | Even during the pandemic the Portable Antiquities Scheme's liaison officers have been able to reach out to finders and obtain relics of significance, Michael Lewis, who heads the program, said in the news release. 16 | The mission continues to "ensure finds, important for understanding Britain's past, are not lost but instead recorded for posterity," he said. 17 | 18 | Another person familiar with matter, who also confirmed the demand that the vaccine be authorized by the end of Friday, said President Donald Trump has been venting about the FDA chief since the vaccine was rolled out in the UK earlier this week. 19 | The two men had a call Friday morning. 
A White House official said they do not comment on private conversations but the chief "regularly requests updates on the progress toward a vaccine." 20 | Hahn quickly disputed the description of the conversation, which was first reported by The Washington Post, but the news is likely to raise additional questions about the extent to which Trump administration political interests are involved with the vaccine authorization process, and could undermine public confidence in the effort.Vaccine advisers to the FDA voted Thursday to recommend the agency grant emergency use authorization to the vaccine, and it's expected to be authorized imminently."This is an untrue representation of the phone call with the Chief of Staff. The FDA was encouraged to continue working expeditiously on Pfizer-BioNTech's (emergency use authorization) request," Hahn said in a statement Friday afternoon. "FDA is committed to issuing this authorization quickly, as we noted in our statement this morning." 21 | Public health experts have been fearful all along that White House officials would put undue pressure on the authorization process and, in turn, compromise public confidence in the vaccine, a source close to the White House coronavirus task force told CNN. 22 | This source said it's unclear why Meadows would make such a threat this late in the process and that authorization of the vaccine is expected at any moment. This person added that authorization could come by the end of Friday. 23 | Dr. Moncef Slaoui, the head of the US government's effort to develop a vaccine against Covid-19, told CNN's Jake Tapper on "The Lead" Friday he was concerned about the reports' potential effect on undermining confidence in the vaccine's safety. 
24 | President-elect Joe Biden, in remarks Friday afternoon at an event announcing additional top administration picks, did not address the news of the Meadows demand to Hahn but urged the public to have faith in the vaccine and expressed gratitude "to the scientists and the public experts who evaluated its safety and efficiency, free from political influence." 25 | "I want to make it clear to the public, you should have confidence in this -- there is no political influence," Biden said. "These are first-rate scientists taking their time looking at all of the elements that need to be looked at. Scientific integrity led us to this point." 26 | Investigators with the Manhattan district attorney's office have interviewed several employees at President Donald Trump's lender and insurer in recent weeks as part of a wide-ranging investigation into the Trump Organization, according to multiple people familiar with the investigation. 27 | Two employees of Deutsche Bank, which has loaned more than $300 million to the Trump Organization, were interviewed by prosecutors, according to sources familiar with the matter. 28 | The questioning was not specific to the bank's dealing with the Trump Organization or the President, the people said, with one person adding that it was the beginning of the process. Additional interviews are expected in the near future, they said. 29 | Prosecutors also interviewed at least one employee at Aon, an insurance broker who has done work with the President's company, according to one source familiar with the matter. 30 | A spokeswoman for Aon confirmed the company received a subpoena and said it is cooperating with the investigation. The spokeswoman declined to comment on any employee interviews. Representatives for Deutsche Bank and the district attorney's office, led by Cyrus Vance, also declined to comment. Deutsche Bank was subpoenaed as part of the investigation last year and has said it cooperates with authorized investigations. 
31 | The New York Times first reported on the interviews with Deutsche Bank and Aon employees. 32 | In court filings the district attorney's office has suggested the inquiry could involve tax fraud, insurance fraud and schemes to defraud its lenders. They also recently subpoenaed the Trump Organization for records relating to fees it has paid to consultants, including a payment made to a company controlled by the President's daughter, Ivanka Trump, according to people familiar with the matter. 33 | Specifically, Cohen alleged the President inflated the value of his assets at times, including in 2014 when Trump submitted documents to Deutsche Bank as part of an attempt to bid for the Buffalo Bills football team. Trump never did the loan. 34 | Cohen pleaded guilty to federal crimes, including campaign finance charges for facilitating hush-money payments to silence two woman's allegations of affairs with Trump. Trump has denied the affairs. Cohen is serving a three year prison sentence and was released to home confinement earlier this year due to the coronavirus pandemic. 35 | 36 | -------------------------------------------------------------------------------- /LocalitySensitiveHashing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Locality Sensitive Hashing\n", 8 | "\n", 9 | "This notebook is a tutorial and Python implementation of locality sensitive hashing.\n", 10 | "\n", 11 | "We demonstrate the tools using a very small dataset, consisting of 4 articles pulled from CNN, and a 5th that is a partial concatenation of 3 of those, to artificially produce some high similarity scores.\n", 12 | "\n", 13 | "Since this is a small dataset, we can easily compare our approximations to the true similarity scores amongst text files. We do so throughout. 
The layout of this notebook is as follows:\n", 14 | "\n", 15 | "Part I: computing similarity amongst text documents\n", 16 | "1. Introduce a shingle function. Clean and split each text file into a set of K-shingles\n", 17 | "2. Compute the exact Jaccard similarity (intersection over union) between all pairs\n", 18 | "3. Create and apply a MinHashing class:\n", 19 | "    1. Initialize with a dictionary of key-value pairs for the shingles\n", 20 | "    2. Apply \"universal hashing\" to perform minhashing on a shingle set\n", 21 | "    3. Can be called like a function to compute a **signature matrix**\n", 22 | "4. Evaluate MinHashing effectiveness by computing scores of all pairs\n", 23 | "5. Introduce LSH for finding **candidate pairs**, i.e. use a banded signature matrix to find all pairs whose similarity is likely above a threshold\n", 24 | "6. Make this efficient by using a hash table of band and column ids, allowing O(n) comparison\n", 25 | "\n", 26 | "Part II: computing similarity amongst vectors\n", 27 | "\n", 28 | "* Afterwards, we provide an additional LSH family for vectors in Euclidean space, based on cosine similarity.\n", 29 | "* This is used to ascertain the similarity of vectors in a D-dimensional space.\n", 30 | "* It can be implemented using the *Random Hyperplanes* hashing method." 
31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 26, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "import os\n", 40 | "import time\n", 41 | "import itertools\n", 42 | "import collections\n", 43 | "import numpy as np\n", 44 | "import matplotlib.pyplot as plt" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "# Part I: document similarity" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "## MinHashing without Locality Sensitive Hashing" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "### Shingling\n", 66 | "\n", 67 | "We illustrate how to convert a text document into a set of shingles.\n", 68 | "We then compare the shingle sets of the documents to see how they reveal similarity.\n", 69 | "The sampledocs directory holds 4 articles copied from CNN; the 5th is a concatenation of 3 of them, with some lines randomly deleted." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 4, 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "name": "stdout", 79 | "output_type": "stream", 80 | "text": [ 81 | "Average char-length: 3651.6\n", 82 | "Min char-length: 2412\n", 83 | "Max char-length: 5873\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "import os\n", 89 | "HOME = os.getcwd()\n", 90 | "TARGET = os.path.join(HOME, 'sampledocs/')\n", 91 | "\n", 92 | "documents = []\n", 93 | "for article in os.listdir(TARGET):  # listdir order is arbitrary, so document indices may vary\n", 94 | "    if article == 'stopwords':\n", 95 | "        continue\n", 96 | "    path = os.path.join(TARGET, article)\n", 97 | "    with open(path, 'r') as file:\n", 98 | "        documents.append(file.read())\n", 99 | " \n", 100 | "stopwords = []\n", 101 | "with open(os.path.join(TARGET, 'stopwords'), 'r') as file:\n", 102 | "    for line in file:\n", 103 | "        stopwords.append(line.strip())\n", 104 | " \n", 105 | "for i, doc in enumerate(documents):\n", 106 | "    doc = doc.strip().replace('\\n', ' ').lower()\n", 107 | "    for 
word in stopwords:\n", 108 | " doc = doc.replace(' '+word+' ', ' ')\n", 109 | " documents[i] = doc\n", 110 | "\n", 111 | "print(f\"Average char-length: \\\n", 112 | "{np.mean(np.array([len(x) for x in documents]))}\")\n", 113 | "print(f\"Min char-length: {min(len(x) for x in documents)}\")\n", 114 | "print(f\"Max char-length: {max(len(x) for x in documents)}\")" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 5, 120 | "metadata": {}, 121 | "outputs": [ 122 | { 123 | "name": "stdout", 124 | "output_type": "stream", 125 | "text": [ 126 | "Found 2060 unique shingles, out of 3009 possible.\n", 127 | "Found 1918 unique shingles, out of 2412 possible.\n", 128 | "Found 2782 unique shingles, out of 3673 possible.\n", 129 | "Found 2091 unique shingles, out of 3291 possible.\n", 130 | "Found 3953 unique shingles, out of 5873 possible.\n" 131 | ] 132 | } 133 | ], 134 | "source": [ 135 | "# create K-shingles by sliding window approach\n", 136 | "def getShingles(str1, K=5):\n", 137 | " d1 = set()\n", 138 | " for i in range(len(str1)-K):\n", 139 | " d1.add(str1[i:i+K])\n", 140 | " print(f\"Found {len(d1)} unique shingles, out of {len(str1)} possible.\")\n", 141 | " return d1\n", 142 | "doc_shingles = [getShingles(s, 5) for s in documents]" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "### Define the Jaccard similarity (intersection over union)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 7, 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "name": "stdout", 159 | "output_type": "stream", 160 | "text": [ 161 | "**~~~~~~ True similarity scores ~~~~~~**\n", 162 | "Pair\tScore\n", 163 | "--------------\n", 164 | "(0, 1)\t0.050\n", 165 | "(0, 2)\t0.069\n", 166 | "(0, 3)\t0.093\n", 167 | "(0, 4)\t0.336\n", 168 | "(1, 2)\t0.052\n", 169 | "(1, 3)\t0.051\n", 170 | "(1, 4)\t0.400\n", 171 | "(2, 3)\t0.081\n", 172 | "(2, 4)\t0.083\n", 173 | "(3, 4)\t0.294\n" 174 | ] 
175 | } 176 | ], 177 | "source": [ 178 | "def jaccardSim(d1,d2):\n", 179 | " return len(d1.intersection(d2))/len(d1.union(d2))\n", 180 | "\n", 181 | "# itertools.combinations finds all (,n) n-pairs\n", 182 | "# then we use a map op on the tuples with jaccardSim\n", 183 | "pairs = itertools.combinations(documents, 2)\n", 184 | "pair_labels = []\n", 185 | "pair_sims = []\n", 186 | "for x1, x2 in itertools.combinations(zip(range(len(doc_shingles)),doc_shingles), 2):\n", 187 | " pair_labels.append((x1[0],x2[0]))\n", 188 | " pair_sims.append(jaccardSim(x1[1],x2[1]))\n", 189 | " \n", 190 | "print(f\"**~~~~~~ True similarity scores ~~~~~~**\")\n", 191 | "print(\"Pair\\tScore\")\n", 192 | "print(\"-\"*14)\n", 193 | "for pair, score in zip(pair_labels, pair_sims):\n", 194 | " print(f\"{pair}\\t{score:.3f}\")" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 8, 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "There are 7534 shingles\n" 207 | ] 208 | } 209 | ], 210 | "source": [ 211 | "# Take union of all sets. Convert to an array and assign\n", 212 | "# each element an integer based on position in array\n", 213 | "fullset = set.union(*doc_shingles)\n", 214 | "shingle_dict = dict(zip(list(fullset),range(len(fullset))))\n", 215 | "print(f\"There are {len(shingle_dict)} shingles\")" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 10, 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "# shingle_dict" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "### 3. 
Define the MinHash class, capable of creating a signature matrix\n", 232 | "\n", 233 | "Note: this takes only sets as input (not matrices), allowing us to deal efficiently with sparse matrices" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 11, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "Initialization test: passed\n", 246 | "Set parameters to right size: passed\n", 247 | "Permuting a row integer returns array: passed\n", 248 | "Compute minhashed signature matrix: passed\n" 249 | ] 250 | } 251 | ], 252 | "source": [ 253 | "# Create a hash function\n", 254 | "# define as a callable class, so that we only\n", 255 | "# initialize random functions once\n", 256 | "class HashManager():\n", 257 | " def __init__(self, shingle_dict):\n", 258 | " self.shingle_dict = shingle_dict\n", 259 | " self.N = len(shingle_dict)\n", 260 | " self.params = None\n", 261 | " \n", 262 | " def _initParams(self, n_sig):\n", 263 | " self.params = np.random.randint(self.N, size=[n_sig,2])\n", 264 | " \n", 265 | " def _permuteRow(self, row):\n", 266 | " return (self.params@np.array([1,row]))%self.N\n", 267 | " \n", 268 | " def __call__(self, docs, n_sig, init=True):\n", 269 | " # Initialize if we change signature matrix length\n", 270 | " # or if we request to re-initialize\n", 271 | " if self.params is None or len(self.params) != n_sig or init:\n", 272 | " self._initParams(n_sig)\n", 273 | " \n", 274 | " #initialize signature matrix\n", 275 | " sig = np.full((n_sig, len(docs)), np.inf)\n", 276 | " \n", 277 | " # each doc in docs is assumed to be an iterable object\n", 278 | " for j, doc in enumerate(docs):\n", 279 | " for shingle in doc:\n", 280 | " orig_row = self.shingle_dict[shingle]\n", 281 | " curr_col = self._permuteRow(orig_row)\n", 282 | " sig[:,j] = np.minimum(sig[:,j],curr_col)\n", 283 | " return sig.astype(int)\n", 284 | " \n", 285 | "# run some tests:\n", 286 | "try:\n", 287 | " 
print(\"Initialization test: \", end=\"\")\n", 288 | " hm = HashManager(shingle_dict)\n", 289 | " print(\"passed\")\n", 290 | "\n", 291 | " print(\"Set parameters to right size: \", end=\"\")\n", 292 | " hm._initParams(n_sig=4)\n", 293 | " assert(hm.params.shape == (4,2))\n", 294 | " print(\"passed\")\n", 295 | "\n", 296 | " print(\"Permuting a row integer returns array: \", end=\"\")\n", 297 | " curr_col = hm._permuteRow(3)\n", 298 | " assert(curr_col.shape == (4,))\n", 299 | " print(\"passed\")\n", 300 | "\n", 301 | " print(\"Compute minhashed signature matrix: \", end=\"\")\n", 302 | " hm(doc_shingles, 4)\n", 303 | " print(\"passed\")\n", 304 | "except Exception as e:\n", 305 | " print(\"failure\")\n", 306 | " print(e.args)" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 12, 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [ 315 | "hm = HashManager(shingle_dict)" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "### 4. 
Use MinHashing to compute similarity scores, and see how well it does" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 17, 328 | "metadata": {}, 329 | "outputs": [ 330 | { 331 | "name": "stdout", 332 | "output_type": "stream", 333 | "text": [ 334 | "**~~~~~~ Similarity score comparison ~~~~~~**\n", 335 | "Pair\t\tApprox\t\tTrue\t\t%Error\n", 336 | "(0, 1)\t\t0.000\t\t0.050\t\t100.00\n", 337 | "(0, 2)\t\t0.100\t\t0.069\t\t45.19\n", 338 | "(0, 3)\t\t0.300\t\t0.093\t\t221.78\n", 339 | "(0, 4)\t\t0.300\t\t0.336\t\t10.77\n", 340 | "(1, 2)\t\t0.100\t\t0.052\t\t93.46\n", 341 | "(1, 3)\t\t0.000\t\t0.051\t\t100.00\n", 342 | "(1, 4)\t\t0.200\t\t0.400\t\t50.02\n", 343 | "(2, 3)\t\t0.200\t\t0.081\t\t147.01\n", 344 | "(2, 4)\t\t0.200\t\t0.083\t\t141.05\n", 345 | "(3, 4)\t\t0.700\t\t0.294\t\t137.69\n", 346 | "True pairs: {(0, 4), (3, 4), (1, 4)}\n", 347 | "Candidate pairs: {(0, 3), (0, 4), (3, 4)}\n", 348 | "False negatives: 1\n", 349 | "Potential false positives: 1\n" 350 | ] 351 | } 352 | ], 353 | "source": [ 354 | "def trueSimScores(doc_shingles):\n", 355 | " pair_labels = []\n", 356 | " pair_sims = []\n", 357 | " idxs = range(len(doc_shingles))\n", 358 | " for x1, x2 in itertools.combinations(zip(idxs,doc_shingles), 2):\n", 359 | " pair_labels.append((x1[0], x2[0]))\n", 360 | " pair_sims.append(jaccardSim(x1[1], x2[1]))\n", 361 | " return dict(zip(pair_labels, pair_sims))\n", 362 | " \n", 363 | "def sigSimScores(sig_mat):\n", 364 | "# cols = [sig_mat[:,i] for i in range(sig_mat.shape[1])]\n", 365 | " cols = sig_mat.T\n", 366 | " idxs = range(sig_mat.shape[1])\n", 367 | " \n", 368 | " pair_labels = []\n", 369 | " pair_sims = []\n", 370 | " for (i,col1), (j,col2) in itertools.combinations(zip(idxs, cols),2):\n", 371 | " pair_labels.append((i,j))\n", 372 | " pair_sims.append(np.mean(col1==col2))\n", 373 | " \n", 374 | " return dict(zip(pair_labels, pair_sims))\n", 375 | "\n", 376 | "def printScoreComparison(true_dict, approx_dict):\n", 377 | " 
print(f\"**~~~~~~ Similarity score comparison ~~~~~~**\")\n", 378 | " print(\"Pair\\t\\tApprox\\t\\tTrue\\t\\t%Error\")\n", 379 | " for pair, true_value in true_dict.items():\n", 380 | " approx_value = approx_dict[pair]\n", 381 | " err = 100*abs(true_value-approx_value)/true_value\n", 382 | " print(f\"{pair}\\t\\t{approx_value:.3f}\\t\\t{true_value:.3f}\\t\\t{err:.2f}\")\n", 383 | "\n", 384 | "def candidatePairs(score_dict, threshold):\n", 385 | " return set(pair for pair, scr in score_dict.items() if scr>=threshold)\n", 386 | "\n", 387 | "def accMatrix(true_dict, approx_dict, threshold):\n", 388 | " true_pairs = candidatePairs(true_dict, threshold)\n", 389 | " approx_pairs = candidatePairs(approx_dict, threshold)\n", 390 | " false_negatives = len(true_pairs - approx_pairs)\n", 391 | " false_positives = len(approx_pairs - true_pairs)\n", 392 | " print(f\"False negatives: {false_negatives}\")\n", 393 | " print(f\"Potential false positives: {false_positives}\")\n", 394 | "\n", 395 | "sig_mat = hm(doc_shingles, 10)\n", 396 | "true_score_dict = trueSimScores(doc_shingles)\n", 397 | "approx_score_dict = sigSimScores(sig_mat)\n", 398 | "printScoreComparison(true_score_dict, approx_score_dict)\n", 399 | "\n", 400 | "print(\"True pairs:\",candidatePairs(true_score_dict, 0.25))\n", 401 | "print(\"Candidate pairs:\",candidatePairs(approx_score_dict, 0.25))\n", 402 | "accMatrix(true_score_dict, approx_score_dict, 0.4)\n", 403 | "\n", 404 | "# print(f\"**~~~~~~ Approximate similarity scores ~~~~~~**\")\n", 405 | "# print(\"Pair\\t\\tApproximate Score\\t\\tTrue Score\")\n", 406 | "# print(\"-\"*14)\n", 407 | "# for pair, score in sigSimScores(sig_mat):\n", 408 | "# print(f\"{pair}\\t{score:.3f}\")\n", 409 | " \n", 410 | "# print(f\"**~~~~~~ True similarity scores ~~~~~~**\")\n", 411 | "# print(\"Pair\\tScore\")\n", 412 | "# print(\"-\"*14)\n", 413 | "# for pair, score in zip(pair_labels, pair_sims):\n", 414 | "# print(f\"{pair}\\t{score:.3f}\")" 415 | ] 416 | }, 417 | { 418 | 
"cell_type": "markdown", 419 | "metadata": {}, 420 | "source": [ 421 | "## Adding Locality Sensitive Hashing: preliminary band-structure theory, how to choose band size" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "### Effects of changing b,r at fixed n" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "Now implement Locality-sensitive hashing. We use a band structure on the signature matrix. If the matrix has $n$ rows, then we divide it into $b$ bands each of width $r$, such that\n", 436 | "$$n = b*r$$\n", 437 | "\n", 438 | "Let $p$ be the true similarity score (match percent) between a pair. The probability of matching every integer in a band is\n", 439 | "\n", 440 | "$$\\text{prob. one band doesn't match } = 1-p^r$$\n", 441 | "\n", 442 | "Now the probability that NONE of the $b$ bands match is given by\n", 443 | "\n", 444 | "$$\\text{prob. no bands match } = (1-p^r)^b$$\n", 445 | "\n", 446 | "Therefore, the probability that at least one band matches, is\n", 447 | "\n", 448 | "$$P(\\geq 1\\text{ match}) = 1-(1-p^r)^b$$\n", 449 | "\n", 450 | "Now, we want this to be our criteria for a candidate pair. That is, two signatures columns are a candidate pair if $P \\geq 1/2$. We will see in the plot below, that as the signature matrix increases, we can tune $r$ and $b$ to approximately be a step function around the true value of $p$. The goal is to find as few candidate pairs as possible, while also making sure we find them all.\n", 451 | "\n", 452 | "Draw a vertical line at the true p value\n", 453 | "The area UNDER the curve $P(x
"
470 | ]
471 | },
472 | "metadata": {
473 | "needs_background": "light"
474 | },
475 | "output_type": "display_data"
476 | }
477 | ],
478 | "source": [
479 | "import matplotlib.pyplot as plt\n",
480 | "n = 200\n",
481 | "ops = [(2,100),(4,50),(10,20),(20,10),(50,4),(100,2)]\n",
482 | "yval = lambda p,r,b: 1-(1-p**r)**b\n",
483 | "pts = np.linspace(0,1,100)\n",
484 | "# each curve below is one (r, b) pair with r*b = n\n",
485 | "for op in ops:\n",
486 | " plt.plot(pts, yval(pts,op[0],op[1]), label=op)\n",
487 | "plt.plot(pts,0*pts+0.5,'k--', label=\"P=1/2\")\n",
488 | "plt.plot([0.4,0.4],[0,1], 'k-.')\n",
489 | "plt.legend()\n",
490 | "plt.xlabel('p')\n",
491 | "plt.ylabel('Probability')\n",
492 | "plt.title(\"legend: (r,b). p_true=0.4 (vertical line)\")\n",
493 | "plt.show()"
494 | ]
495 | },
496 | {
497 | "cell_type": "markdown",
498 | "metadata": {},
499 | "source": [
500 | "### Effects of changing n while adjusting b,r to keep the P=1/2 crossover fixed"
501 | ]
502 | },
503 | {
504 | "cell_type": "markdown",
505 | "metadata": {},
506 | "source": [
507 | "Now let's solve for the optimal values r,b. \n",
508 | "\n",
509 | "Suppose we *fix* $n,r,b$. Given these, we determine the approximate value of $p$ at which the curve crosses $P=1/2$; this gives $p=p(b,r)$. We can then approximately solve for the $r,b$ that place the crossover at a desired $p$.\n",
510 | "\n",
511 | "We begin by setting $P=1/2$.\n",
512 | "$$1/2 = 1-(1-p^r)^b$$\n",
513 | "$$ 1-p^r = 2^{-1/b}$$\n",
514 | "$$p = (1-2^{-1/b})^{1/r} = (1-e^{-(1/b)\\ln2})^{1/r} \\approx (\\ln 2/b)^{1/r} = (1/b)^{1/r}*\\text{const}$$\n",
515 | "\n",
516 | "Finally, if we fix $r$ and $p$, the required number of bands is about\n",
517 | "$$ b \\approx 1/p^r $$\n",
518 | "\n",
519 | "Takeaways:\n",
520 | "\n",
521 | "increasing r -> \n",
522 | "* moves the curve right.\n",
523 | "* more false negatives\n",
524 | "* means lower chance to match\n",
525 | " \n",
526 | "increasing b -> \n",
527 | "* moves the curve left.\n",
528 | "* more false positives\n",
529 | "* means higher chance to match\n",
530 | " \n",
531 | "increasing n -> \n",
532 | "* the curve approaches a step function.\n",
533 | "* fewer false positives and false negatives\n",
534 | "* always good!\n",
535 | "\n",
536 | "Now watch what happens as we pick the optimal r,b and then increase n.\n",
537 | "\n",
538 | "We will try to keep the crossover centered around $p=0.5$"
539 | ]
540 | },
541 | {
542 | "cell_type": "code",
543 | "execution_count": 19,
544 | "metadata": {},
545 | "outputs": [
546 | {
547 | "data": {
548 | "image/png": "\n",
549 | "text/plain": [
550 | "