├── requirements.txt
├── data
│   ├── mean_shift1.png
│   ├── sherlock_noise1.txt
│   ├── wiki_noise4.txt
│   ├── wiki_noise14.txt
│   ├── wiki_noise9.txt
│   ├── wiki_noise11.txt
│   ├── wiki_noise3.txt
│   ├── wiki_noise7.txt
│   ├── wiki_noise1.txt
│   ├── sherlock_noise3.txt
│   ├── sherlock_noise2.txt
│   ├── wiki_noise10.txt
│   ├── wiki_noise12.txt
│   ├── wiki_noise5.txt
│   ├── wiki_noise8.txt
│   ├── wiki_noise6.txt
│   ├── wiki_noise2.txt
│   ├── sherlock_noise5.txt
│   ├── sherlock_noise.txt
│   ├── wiki_noise0.txt
│   ├── wiki_noise13.txt
│   └── sherlock_noise4.txt
├── ads.py
├── README.md
├── data.py
├── baselines.py
├── cifar_corruptor.py
├── words.py
├── pixel.py
├── part_utils.py
└── utils.py

/requirements.txt:
--------------------------------------------------------------------------------
1 | 
2 | numpy
3 | matplotlib
4 | torch
5 | scipy
6 | scikit-learn
--------------------------------------------------------------------------------
/data/mean_shift1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/twistedcubic/que-outlier-detection/HEAD/data/mean_shift1.png
--------------------------------------------------------------------------------
/ads.py:
--------------------------------------------------------------------------------
1 | 
2 | '''
3 | Process ads data
4 | '''
5 | import torch
6 | import utils
7 | 
8 | import pdb
9 | 
10 | '''
11 | File is in arff format.
12 | '''
13 | def get_data(path):
14 |     with open(path) as file:
15 |         lines = file.readlines()
16 | 
17 |     data_l = []
18 |     #bool_l = []
19 |     noise_idx_l = []
20 |     id_l = []
21 |     counter = 0
22 |     for line in lines:
23 |         if line[0] == '%' or line[0] == '@' or len(line) < 5:
24 |             continue
25 |         line_ar = line.split(',')
26 | 
27 |         #second-to-last field is the id, some integer like 175, "@ATTRIBUTE 'id' real\n"
28 |         data_l.append([float(i) for i in line_ar[:-2]])
29 |         id_l.append(float(line_ar[-2]))
30 |         if line_ar[-1] == "'yes'\n":
31 |             noise_idx_l.append(counter)
32 |             #bool_l.append(1)
33 |         #else:
34 |         #    bool_l.append(0)
35 |         counter += 1
36 | 
37 |     data = torch.FloatTensor(data_l).to(utils.device)
38 |     #is_ad = torch.IntTensor(bool_l).to(utils.device)
39 |     noise_idx = torch.LongTensor(noise_idx_l).to(utils.device)
40 | 
41 |     return data, noise_idx
42 | 
43 | if __name__ == '__main__':
44 |     data, noise_idx = get_data('data/internet_ads.arff')
45 |     pdb.set_trace()
46 | 
--------------------------------------------------------------------------------
/data/sherlock_noise1.txt:
--------------------------------------------------------------------------------
1 | In a large bowl, combine the eggs, schmaltz, stock, matzo meal, nutmeg, ginger and parsley. Season with 1 teaspoon salt and a few grinds of pepper. Gently mix with a whisk or spoon. Cover and refrigerate until chilled, about 3 hours or overnight.
2 | To shape and cook the matzo balls, fill a wide, deep pan with lightly salted water and bring to a boil. With wet hands, take some of the mix and mold it into the size and shape of a Ping-Pong ball. Gently drop it into the boiling water, repeating until all the mix is used.
3 | Cover the pan, reduce heat to a lively simmer and cook matzo balls about 30 to 40 minutes for al dente, longer for light. If desired, the cooked matzo balls can be transferred to chicken or vegetable soup and served immediately. Alternatively, they may be placed on a baking sheet and frozen, then transferred to a freezer bag and kept frozen until a few hours before serving; reheat in chicken or vegetable soup or broth.
4 | 
5 | In a small bowl, combine salt, sugar, celery seed, garlic, turmeric, cayenne and black pepper. Mix well.
6 | Place a 14-inch or larger skillet over medium-high heat, and add oil. Heat oil to 350 degrees. Set aside a baking sheet or plate lined with paper towels.
7 | Using tongs, place a whole matzo into the oil, pressing down gently until well submerged. Fry for 20 to 30 seconds, then transfer matzo from the oil to paper towels to drain. The matzo will crisp and change to light golden brown after it is removed from the oil; adjust cooking time as needed.
8 | Sprinkle the top of each warm matzo with about a teaspoon of spice mix. Serve immediately, or cover with a kitchen towel and set aside in a warm place for up to several hours.
9 | 
10 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Fast mean estimation and outlier detection
2 | 
3 | This repo contains code for our paper [**Quantum Entropy Scoring for Fast Robust Mean Estimation and Improved Outlier Detection**](https://arxiv.org/abs/1906.11366).
4 | 
5 | [Yihe Dong](http://yihedong.me/), [Sam Hopkins](http://www.samuelbhopkins.com/), [Jerry Li](https://jerryzli.github.io/)
6 | _________________
7 | 
8 | ### Install
9 | To install dependencies, run:
10 | ```
11 | pip install -r requirements.txt
12 | ```
13 | 
14 | ### Description of select scripts:
15 | * [`mean.py`](mean.py) contains the backbone of the experimental setup and evaluation.
16 | * [`utils.py`](utils.py) contains various utility methods, such as fast JL computation.
17 | * Auxiliary scripts specific to certain experiments: [`pixel.py`](pixel.py) is used for the hot pixel experiments on CIFAR data, and [`words.py`](words.py) is used for the word embedding experiments.
18 | 
19 | The [data](data) directory contains select data for running the experiments. Additional data should be downloaded into the [data](data) directory: [GloVe embeddings are available here](https://nlp.stanford.edu/projects/glove/), and [CIFAR images are available here](https://www.cs.toronto.edu/~kriz/cifar.html).
20 | 
21 | 
22 | Run the main script [mean.py](mean.py) with appropriate arguments. For instance, to run experiments on synthetic data with varying alpha:
23 | ```
24 | python mean.py --experiment_type syn_lamb
25 | ```
26 | 
27 | And the same on word embedding data:
28 | ```
29 | python mean.py --experiment_type text_lamb
30 | ```
31 | 
32 | To run experiments on CIFAR images:
33 | ```
34 | python pixel.py --experiment_type image_lamb
35 | ```
36 | 
37 | For more available runtime options, see:
38 | ```
39 | python mean.py -h
40 | ```
41 | 
42 | ### Reference
43 | 
44 | If you find our paper and repo useful, please cite as:
45 | 
46 | ```
47 | @inproceedings{que2019,
48 |   title={Quantum Entropy Scoring for Fast Robust Mean Estimation and Improved Outlier Detection},
49 |   author={Dong, Yihe and Hopkins, Samuel and Li, Jerry},
50 |   booktitle={Advances in Neural Information Processing Systems},
51 |   year={2019}
52 | }
53 | ```
54 | 

55 | 56 |

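57 | ### Method sketch
58 | 
59 | For intuition, the QUE score of a (centered) point x is x^T M x, where M = exp(alpha * Sigma / ||Sigma||) / tr(exp(alpha * Sigma / ||Sigma||)) and Sigma is the empirical covariance. The snippet below is only an illustrative sketch of this scoring rule, not the repo's implementation (see [mean.py](mean.py) for that); the helper name `que_scores` and the default `alpha` are chosen here purely for illustration:
60 | 
61 | ```python
62 | import torch
63 | 
64 | def que_scores(X, alpha=4.0):
65 |     Xc = X - X.mean(0, keepdim=True)            #center the data
66 |     cov = Xc.t() @ Xc / Xc.size(0)              #empirical covariance Sigma
67 |     evals, evecs = torch.linalg.eigh(cov)       #spectral decomposition of Sigma
68 |     #exponential weights on principal directions, scaled by the top eigenvalue
69 |     w = torch.exp(alpha * evals / evals.max().clamp(min=1e-12))
70 |     w = w / w.sum()                             #normalize: M = V diag(w) V^T, tr(M) = 1
71 |     proj = Xc @ evecs                           #coordinates in the eigenbasis
72 |     return (proj**2 * w).sum(-1)                #score_i = x_i^T M x_i; higher = more outlying
73 | ```
74 | 
75 | As alpha -> 0 the scores reduce to (scaled) l2 norms of the centered points, and as alpha grows they approach squared projections onto the top principal direction, interpolating between naive l2 and naive spectral scoring.
76 | 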
--------------------------------------------------------------------------------
/data/wiki_noise4.txt:
--------------------------------------------------------------------------------
1 | The Battle of Rovine took place on 17 May 1395.[5] The Wallachian army led by Voivod Mircea cel Bătrân (Mircea the Elder) opposed the Ottoman invasion personally led by Sultan Bayezid I the Lightning. The Turkish force heavily outnumbered the Wallachian troops. The legend says that on the eve of the battle, dressed as a peace emissary, Mircea cel Bătrân talked to Bayezid asking him to leave Wallachia and promising him safe passage back. The Sultan proudly insisted on fighting.
2 | 
3 | Battle
4 | The battle took place probably near the Argeș River,[6] but the exact location is disputed. The Wallachian victory is confirmed by numerous sources and historians.[1][2][3][4]
5 | 
6 | During the battle, a key tactical role was played by the Wallachian archers who severely depleted the Ottoman ranks during their initial attack.[7] Bayezid's vassals, the Serbian lords Stefan Lazarević and Marko Mrnjavčević, two of the greatest knights of the time, participated and fought bravely; Stefan showed great courage, Marko was killed in action.
7 | 
8 | An alternative historical view is that the dramatic confrontation lasted not just a single day, but an entire week, being in the first stage a war of positions. The fierce battle ended with heavy casualties for both sides, eventually each army withdrawing from the battlefield. Although Wallachians pushed back the enemy, the Ottomans were able to defend their resulting position relying on the personal guard of the Sultan composed of Janissaries. This was the impregnable position of the Ottoman defense a year later, in the famous Battle of Nicopolis. This tactical innovation became a fundamental element of the Ottoman war strategies until the 18th century. The army of Mircea, sustaining heavy casualties, and unable to break the defense of the Sultan's camp, was finally obliged to withdraw. Because the Ottoman Empire was not able to conquer Wallachia at this time, Rovine remains one of the most important battles in Romanian history.[6]
9 | 
10 | An epic description of the confrontation is presented in the poem "Scrisoarea a III-a" (The Third Letter) written by the Romanian national poet, Mihai Eminescu. The Dečani chronicle describes the battle and reports that Prince Marko and Constantine Dragaš died fighting.[8] The same source mentions that Marko's brother, Andreja Mrnjavčević, also perished during the fight
11 | 
--------------------------------------------------------------------------------
/data.py:
--------------------------------------------------------------------------------
1 | 
2 | '''
3 | Processes data
4 | '''
5 | import torch
6 | import numpy as np
7 | import utils
8 | import os.path as osp
9 | import pdb
10 | 
11 | '''
12 | Load subsampled genetics data.
13 | '''
14 | def clean_genetics_data():
15 |     data = np.loadtxt(osp.join(utils.data_dir, 'ALL.20k.data'), delimiter=' ')
16 |     col_sums = np.sum((data==0).astype(int), axis=0)
17 |     #remove any column containing 0, which indicates missing
18 |     data = data[:, col_sums==0]
19 |     #convert 2->0, 1->1
20 |     data = 2 - data
21 |     return data
22 | 
23 | def load_genetics_data():
24 |     X = np.load(osp.join(utils.data_dir, 'sampled_data.npy'))
25 |     X = torch.from_numpy(X).to(dtype=torch.float32, device=utils.device)
26 |     return X
27 | 
28 | def load_glove_data():
29 |     X = np.load('/home/yihdong/partition/data/glove_dataset.npy')
30 |     X = torch.from_numpy(X).to(dtype=torch.float32, device=utils.device)
31 |     return X
32 | 
33 | def process_glove_data(dim=100):
34 |     path = osp.join(utils.data_dir, 'glove_embs.pt')
35 |     if osp.exists(path):
36 |         d = torch.load(path)
37 |         #d = {'vocab': words_ar, 'word_emb': word_emb}
38 |         return d['vocab'], d['word_emb'].to(utils.device)
39 |     else:
40 |         return load_process_glove_data(dim)
41 | 
42 | '''
43 | Process GloVe vectors from the raw txt file into torch tensors.
44 | '''
45 | def load_process_glove_data(dim=100):
46 |     path = osp.join(utils.data_dir, 'glove.6B.{}d.txt'.format(dim))
47 |     lines = load_lines(path)
48 |     lines_len = len(lines)
49 |     words_ar = [0]*lines_len
50 |     word_emb = torch.zeros(lines_len, dim)
51 |     for i, line in enumerate(lines):
52 |         line_ar = line.split()
53 |         words_ar[i] = line_ar[0]
54 |         word_emb[i] = torch.FloatTensor([float(t) for t in line_ar[1:]])
55 | 
56 |     word_emb = word_emb.to(utils.device)
57 |     return words_ar, word_emb
58 | 
59 | def load_lines(path):
60 |     with open(path, 'r') as file:
61 |         lines = file.read().splitlines()
62 |     return lines
63 | 
64 | def write_lines(lines1, path):
65 |     lines = []
66 |     for line in lines1:
67 |         lines.append(str(line) + '\n')
68 |     with open(path, 'w') as file:
69 |         file.writelines(lines)
70 | 
71 | if __name__ == '__main__':
72 |     data = load_genetics_data()
73 |     pdb.set_trace()
74 | 
--------------------------------------------------------------------------------
/data/wiki_noise14.txt:
--------------------------------------------------------------------------------
1 | Australia, officially the Commonwealth of Australia,[12] is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania and numerous smaller islands. It is the largest country in Oceania and the world's sixth-largest country by total area. The neighbouring countries are Papua New Guinea, Indonesia and East Timor to the north; the Solomon Islands and Vanuatu to the north-east; and New Zealand to the south-east. The population of 25 million[6] is highly urbanised and heavily concentrated on the eastern seaboard.[13] Australia's capital is Canberra, and its largest city is Sydney. The country's other major metropolitan areas are Melbourne, Brisbane, Perth and Adelaide.
2 | 
3 | Indigenous Australians inhabited the continent for about 60,000 years prior to European discovery with the arrival of Dutch explorers in the early 17th century, who named it New Holland. In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became Australia's national day. The population grew steadily in subsequent decades, and by the time of an 1850s gold rush, most of the continent had been explored and an additional five self-governing crown colonies established.
On 1 January 1901, the six colonies federated, forming the Commonwealth of Australia. Australia has since maintained a stable liberal democratic political system that functions as a federal parliamentary constitutional monarchy, comprising six states and ten territories. 4 | 5 | Being the oldest,[14] flattest[15] and driest inhabited continent,[16][17] with the least fertile soils,[18][19] Australia has a landmass of 7,617,930 square kilometres (2,941,300 sq mi).[20] A megadiverse country, its size gives it a wide variety of landscapes, with deserts in the centre, tropical rainforests in the north-east and mountain ranges in the south-east. Its population density, 2.8 inhabitants per square kilometre, remains among the lowest in the world.[21] Australia generates its income from various sources including mining-related exports, telecommunications, banking and manufacturing.[22][23][24] 6 | 7 | Australia is a highly developed country, with the world's 14th-largest economy. It has a high-income economy, with the world's tenth-highest per capita income.[25] It is a regional power, and has the world's 13th-highest military expenditure.[26] Australia has the world's ninth-largest immigrant population, with immigrants accounting for 29% of the population.[27][28] Having the third-highest human development index and the eighth-highest ranked democracy globally, the country ranks highly in quality of life, health, education, economic freedom, civil liberties and political rights,[29] with all its major cities faring well in global comparative livability surveys.[30] Australia is a member of the United Nations, G20, Commonwealth of Nations, ANZUS, Organisation for Economic Co-operation and Development (OECD), World Trade Organization, Asia-Pacific Economic Cooperation, Pacific Islands Forum and the ASEAN Plus Six mechanism 8 | 9 | -------------------------------------------------------------------------------- /data/wiki_noise9.txt: -------------------------------------------------------------------------------- 1 | The koala (Phascolarctos cinereus, or, inaccurately, koala bear[a]) is an arboreal herbivorous marsupial native to Australia. It is the only extant representative of the family Phascolarctidae and its closest living relatives are the wombats, which comprise the family Vombatidae.[4]. The koala is found in coastal areas of the mainland's eastern and southern regions, inhabiting Queensland, New South Wales, Victoria, and South Australia. It is easily recognisable by its stout, tailless body and large head with round, fluffy ears and large, spoon-shaped nose. The koala has a body length of 60–85 cm (24–33 in) and weighs 4–15 kg (9–33 lb). Pelage colour ranges from silver grey to chocolate brown. Koalas from the northern populations are typically smaller and lighter in colour than their counterparts further south. These populations possibly are separate subspecies, but this is disputed. 2 | 3 | Koalas typically inhabit open eucalypt woodlands, and the leaves of these trees make up most of their diet. Because this eucalypt diet has limited nutritional and caloric content, koalas are largely sedentary and sleep up to 20 hours a day. They are asocial animals, and bonding exists only between mothers and dependent offspring. Adult males communicate with loud bellows that intimidate rivals and attract mates. Males mark their presence with secretions from scent glands located on their chests. 
Being marsupials, koalas give birth to underdeveloped young that crawl into their mothers' pouches, where they stay for the first six to seven months of their lives. These young koalas, known as joeys, are fully weaned around a year old. Koalas have few natural predators and parasites, but are threatened by various pathogens, such as Chlamydiaceae bacteria and the koala retrovirus, as well as by bushfires and droughts. 4 | 5 | Koalas were hunted by Indigenous Australians and depicted in myths and cave art for millennia. The first recorded encounter between a European and a koala was in 1798, and an image of the animal was published in 1810 by naturalist George Perry. Botanist Robert Brown wrote the first detailed scientific description of the koala in 1814, although his work remained unpublished for 180 years. Popular artist John Gould illustrated and described the koala, introducing the species to the general British public. Further details about the animal's biology were revealed in the 19th century by several English scientists. Because of its distinctive appearance, the koala is recognised worldwide as a symbol of Australia. Koalas are listed as Vulnerable by the International Union for Conservation of Nature.[1] The Australian government similarly lists specific populations in Queensland and New South Wales as Vulnerable.[5] The animal was hunted heavily in the early 20th century for its fur, and large-scale cullings in Queensland resulted in a public outcry that initiated a movement to protect the species. Sanctuaries were established, and translocation efforts moved to new regions koalas whose habitat had become fragmented or reduced. The biggest threat to their existence is habitat destruction caused by agriculture and urbanisation 6 | -------------------------------------------------------------------------------- /data/wiki_noise11.txt: -------------------------------------------------------------------------------- 1 | The domestic dog (Canis lupus familiaris when considered a subspecies of the wolf or Canis familiaris when considered a distinct species)[5] is a member of the genus Canis (canines), which forms part of the wolf-like canids,[6] and is the most widely abundant terrestrial carnivore.[7][8][9][10][11] The dog and the extant gray wolf are sister taxa[12][13][14] as modern wolves are not closely related to the wolves that were first domesticated,[13][14] which implies that the direct ancestor of the dog is extinct.[15] The dog was the first species to be domesticated[14][16] and has been selectively bred over millennia for various behaviors, sensory capabilities, and physical attributes.[17] 2 | 3 | Their long association with humans has led dogs to be uniquely attuned to human behavior[18] and they are able to thrive on a starch-rich diet that would be inadequate for other canid species.[19] Dogs vary widely in shape, size and colors.[20] They perform many roles for humans, such as hunting, herding, pulling loads, protection, assisting police and military, companionship and, more recently, aiding disabled people and therapeutic roles. This influence on human society has given them the sobriquet of "man's best friend" 4 | 5 | The origin of the domestic dog includes the dog's evolutionary divergence from the wolf, its domestication, and its development into dog types and dog breeds. 
The dog is a member of the genus Canis, which forms part of the wolf-like canids, and was the first species and the only large carnivore to have been domesticated.[14][24] The dog and the extant gray wolf are sister taxa, as modern wolves are not closely related to the population of wolves that was first domesticated.[14] 6 | 7 | The genetic divergence between dogs and wolves occurred between 40,000–20,000 years ago, just before or during the Last Glacial Maximum.[25][2] This timespan represents the upper time-limit for the commencement of domestication because it is the time of divergence and not the time of domestication, which occurred later.[25][26] The domestication of animals commenced over 15,000 years ago, beginning with the grey wolf (Canis lupus) by nomadic hunter-gatherers.[25] The archaeological record and genetic analysis show the remains of the Bonn–Oberkassel dog buried beside humans 14,200 years ago to be the first undisputed dog, with disputed remains occurring 36,000 years ago.[2] It was not until 11,000 years ago that people living in the Near East entered into relationships with wild populations of aurochs, boar, sheep, and goats.[25] 8 | 9 | Where the domestication of the dog took place remains debated, with the most plausible proposals spanning Western Europe,[9][26] Central Asia[26][27] and East Asia.[26][28] This has been made more complicated by the recent proposal that an initial wolf population split into East and West Eurasian groups. These two groups, before going extinct, were domesticated independently into two distinct dog populations between 14,000 and 6,400 years ago. The Western Eurasian dog population was gradually and partially replaced by East Asian dogs introduced by humans at least 6,400 years ago.[26][2] This proposal is also debated 10 | -------------------------------------------------------------------------------- /data/wiki_noise3.txt: -------------------------------------------------------------------------------- 1 | Society, in general, addresses the fact that an individual has rather limited means as an autonomous unit. The great apes have always been more (Bonobo, Homo, Pan) or less (Gorilla, Pongo) social animals, so Robinson Crusoe-like situations are either fictions or unusual corner cases to the ubiquity of social context for humans, who fall between presocial and eusocial in the spectrum of animal ethology. 2 | 3 | Cultural relativism as a widespread approach or ethic has largely replaced notions of "primitive", better/worse, or "progress" in relation to cultures (including their material culture/technology and social organization). 4 | 5 | According to anthropologist Maurice Godelier, one critical novelty in society, in contrast to humanity's closest biological relatives (chimpanzees and bonobos), is the parental role assumed by the males, which supposedly would be absent in our nearest relatives for whom paternity is not generally determinable.[2][3] 6 | 7 | In political science 8 | Societies may also be structured politically. In order of increasing size and complexity, there are bands, tribes, chiefdoms, and state societies. These structures may have varying degrees of political power, depending on the cultural, geographical, and historical environments that these societies must contend with. Thus, a more isolated society with the same level of technology and culture as other societies is more likely to survive than one in close proximity to others that may encroach on their resources. 
A society that is unable to offer an effective response to other societies it competes with will usually be subsumed into the culture of the competing society. 9 | 10 | In sociology 11 | 12 | The social group enables its members to benefit in ways that would not otherwise be possible on an individual basis. Both individual and social (common) goals can thus be distinguished and considered. Ant (formicidae) social ethology. 13 | Sociologist Peter L. Berger defines society as "...a human product, and nothing but a human product, that yet continuously acts upon its producers." According to him, society was created by humans but this creation turns back and creates or molds humans every day.[4] 14 | 15 | 16 | Canis lupus social ethology 17 | Sociologist Gerhard Lenski differentiates societies based on their level of technology, communication, and economy: (1) hunters and gatherers, (2) simple agricultural, (3) advanced agricultural, (4) industrial, and (5) special (e.g. fishing societies or maritime societies).[5] This is similar to the system earlier developed by anthropologists Morton H. Fried, a conflict theorist, and Elman Service, an integration theorist, who have produced a system of classification for societies in all human cultures based on the evolution of social inequality and the role of the state. This system of classification contains four categories: 18 | 19 | Hunter-gatherer bands (categorization of duties and responsibilities). 20 | Tribal societies in which there are some limited instances of social rank and prestige. 21 | Stratified structures led by chieftains. 22 | Civilizations, with complex social hierarchies and organized, institutional governments 23 | -------------------------------------------------------------------------------- /data/wiki_noise7.txt: -------------------------------------------------------------------------------- 1 | Petrified Forest National Park is an American national park in Navajo and Apache counties in northeastern Arizona. Named for its large deposits of petrified wood, the fee area of the park covers about 230 square miles (600 square kilometers), encompassing semi-desert shrub steppe as well as highly eroded and colorful badlands. The park's headquarters is about 26 miles (42 km) east of Holbrook along Interstate 40 (I-40), which parallels the BNSF Railway's Southern Transcon, the Puerco River, and historic U.S. Route 66, all crossing the park roughly east–west. The site, the northern part of which extends into the Painted Desert, was declared a national monument in 1906 and a national park in 1962. The park received 644,922 recreational visitors in 2018. Typical visitor activities include sightseeing, photography, hiking, and backpacking. 2 | 3 | Averaging about 5,400 feet (1,600 m) in elevation, the park has a dry windy climate with temperatures that vary from summer highs of about 100 °F (38 °C) to winter lows well below freezing. More than 400 species of plants, dominated by grasses such as bunchgrass, blue grama, and sacaton, are found in the park. Fauna include larger animals such as pronghorns, coyotes, and bobcats, many smaller animals, such as deer mice, snakes, lizards, seven kinds of amphibians, and more than 200 species of birds, some of which are permanent residents and many of which are migratory. 
About one third of the park is designated wilderness—50,260 acres (79 sq mi; 203 km2).[8] 4 | 5 | The Petrified Forest is known for its fossils, especially fallen trees that lived in the Late Triassic Epoch, about 225 million years ago. The sediments containing the fossil logs are part of the widespread and colorful Chinle Formation, from which the Painted Desert gets its name. Beginning about 60 million years ago, the Colorado Plateau, of which the park is part, was pushed upward by tectonic forces and exposed to increased erosion. All of the park's rock layers above the Chinle, except geologically recent ones found in parts of the park, have been removed by wind and water. In addition to petrified logs, fossils found in the park have included Late Triassic ferns, cycads, ginkgoes, and many other plants as well as fauna including giant reptiles called phytosaurs, large amphibians, and early dinosaurs. Paleontologists have been unearthing and studying the park's fossils since the early 20th century. 6 | 7 | The park's earliest human inhabitants arrived at least 8,000 years ago. By about 2,000 years ago, they were growing corn in the area and shortly thereafter building pit houses in what would become the park. Later inhabitants built above-ground dwellings called pueblos. Although a changing climate caused the last of the park's pueblos to be abandoned by about 1400 CE, more than 600 archeological sites, including petroglyphs, have been discovered in the park. In the 16th century, Spanish explorers visited the area, and by the mid-19th century a U.S. team had surveyed an east–west route through the area where the park is now located and noted the petrified wood. Later, roads and a railway followed similar routes and gave rise to tourism and, before the park was protected, to large-scale removal of fossils. Theft of petrified wood remains a problem in the 21st century. 8 | -------------------------------------------------------------------------------- /data/wiki_noise1.txt: -------------------------------------------------------------------------------- 1 | The Arlington Museum of Art traces its history to the foundation of the Arlington Art Association by Howard and Arista Joyner in 1952. Howard Joyner established the Art Department at the University of Texas at Arlington, and Arista Joyner was the first art teacher at Arlington High School. The Arlington Art Association promoted art in the city by sponsoring juried art exhibitions, shows featuring local artists, and art auctions benefiting scholarships for local high school students, while also creating a savings fund to eventually purchase a building to serve as its permanent home.[1] In 1986, the Arlington Art Association bought the former J. C. Penney store on Main Street in downtown Arlington, which it remodeled extensively and moved into in 1989 after incorporating as the Arlington Museum of Art.[1][5] The first exhibition at the museum opened in May 1990 and featured contemporary art.[1] 2 | 3 | In 1991, former Dallas Museum of Art assistant curator for contemporary art and KERA radio art critic Joan Davidow was hired as the full-time director of the Arlington Museum of Art.[1][5] Under her tenure, which lasted until September 2000, she focused the museum's curated exhibitions on Texas contemporary art.[1][6] In her first three years as director, she tripled the museum's budget to $225,000 while securing corporate sponsorships from Lockheed Martin, Target, and U.S. 
Trust.[5] Writing for Texas Monthly in 1998, Michael Ennis referred to Davidow as "arguably the most imaginative and irrepressibly adventurous museum director working in Texas" and a "champion of the latest and often most contentious Texas art".[5] She also ran an art summer camp for children at the museum and a Saturday-afternoon family component for each of the museum's exhibitions.[5] 4 | 5 | In February 2001, Anne Allen was hired as the new director of the Arlington Museum of Art, having previously served in the same capacity at The Old Jail Art Center in Albany, Texas.[1] During the six years of her tenure, she added new programs such as artist lectures and gallery talks to the museum's calendar of exhibitions.[1] The museum was reorganized in 2012 due to its financial needs and the impact of a weak economy, and former board member Chris Hightower was selected as its new director.[1] Under his tenure, the museum has broadened its scope beyond contemporary art and now features "historically significant and culturally important exhibitions".[1] The museum has also begun supporting its exhibitions with accompanying programming, funding them through grants, and renting its facilities for outside events.[1] 6 | 7 | In 2015, local philanthropist Sam Mahrouq donated $550,000 to the Arlington Museum of Art, which allowed it to retire its building mortgage.[7][8] In 2016, the museum gained notoriety when it removed a satirical poster depicting Donald Trump from an exhibition due to the objection of a board member 8 | 9 | The Arlington Museum of Art has hosted numerous traveling exhibitions, including those featuring photography by Ansel Adams,[10] art by Salvador Dalí,[11] Milton H. Greene's photographs of Marilyn Monroe,[12] Harlem Renaissance artwork (including works by Richmond Barthé, Aaron Douglas, Jacob Lawrence, and Charles White),[13] Utagawa Hiroshige's woodblock prints,[14] Vivian Maier's street photography,[15] and Pablo Picasso's ceramics.[16] It has also featured exhibitions of edible art sculptures[17] and film costumes, including those of Johnny Depp from Pirates of the Caribbean and Emmy Rossum from The Phantom of The Opera 10 | -------------------------------------------------------------------------------- /data/sherlock_noise3.txt: -------------------------------------------------------------------------------- 1 | After the fruits are picked and washed, the juice is extracted by one of two automated methods. In the first method, two metal cups with sharp metal tubes on the bottom cup come together, removing the peel and forcing the flesh of the fruit through the metal tube. The juice of the fruit, then escapes through small holes in the tube. The peels can then be used further, and are washed to remove oils, which are reclaimed later for usage. The second method requires the fruits to be cut in half before being subjected to reamers, which extract the juice.[7] 2 | 3 | After the juice is filtered, it may be concentrated in evaporators, which reduce the size of juice by a factor of 5, making it easier to transport and increasing its expiration date. Juices are concentrated by heating under a vacuum to remove water, and then cooling to around 13 degrees Celsius. About two thirds of the water in a juice is removed.[6] The juice is then later reconstituted, in which the concentrate is mixed with water and other factors to return any lost flavor from the concentrating process. 
Juices can also be sold in a concentrated state, in which the consumer adds water to the concentrated juice as preparation.[7] 4 | 5 | Juices are then pasteurized and filled into containers, often while still hot. If the juice is poured into a container while hot, it is cooled as quickly as possible. Packages that cannot stand heat require sterile conditions for filling. Chemicals such as hydrogen peroxide can be used to sterilize containers.[7] Plants can make anywhere from 1 to 20 tonnes a day.[6] 6 | 7 | Processing 8 | 9 | A variety of packaged juices in a supermarket 10 | High intensity pulsed electric fields are being used as an alternative to heat pasteurization in fruit juices. Heat treatments sometimes fail to make a quality, microbiological stable products.[8] However, it was found that processing with high intensity pulsed electric fields (PEF) can be applied to fruit juices to provide a shelf stable and safe product.[8] In addition, it was found that pulsed electric fields provide a fresh-like and high nutrition value product.[8] Pulsed electric field processing is a type of nonthermal method for food preservation.[9] 11 | 12 | Pulsed electric fields use short pulses of electricity to inactivate microbes. In addition, the use of PEF results in minimal detrimental effects on the quality of the food.[10] Pulse electric fields kill microorganisms and provide better maintenance of the original colour, flavour, and nutritional value of the food as compared to heat treatments.[10] This method of preservation works by placing two electrodes between liquid juices then applying high voltage pulses for microseconds to milliseconds.[10] The high voltage pulses are of intensity in the range of 10 to 80 kV/cm.[10] 13 | 14 | Processing time of the juice is calculated by multiplying the number of pulses with the effective pulse duration.[10] The high voltage of the pulses produce an electric field that results in microbial inactivation that may be present in the juice.[10] The PEF temperatures are below that of the temperatures used in thermal processing.[10] After the high voltage treatment, the juice is aseptically packaged and refrigerated.[10] Juice is also able to transfer electricity due to the presence of several ions from the processing.[10] When the electric field is applied to the juice, electric currents are then able to flow into the liquid juice and transferred around due to the charged molecules in the juice.[10] Therefore, pulsed electric fields are able to inactivate microorganisms, extend shelf life, and reduce enzymatic activity of the juice while maintaining similar quality as the original, fresh pressed juice.[10] 15 | -------------------------------------------------------------------------------- /data/sherlock_noise2.txt: -------------------------------------------------------------------------------- 1 | The apple is a deciduous tree, generally standing 6 to 15 ft (1.8 to 4.6 m) tall in cultivation and up to 30 ft (9.1 m) in the wild. When cultivated, the size, shape and branch density are determined by rootstock selection and trimming method. The leaves are alternately arranged dark green-colored simple ovals with serrated margins and slightly downy undersides.[4] 2 | 3 | 4 | Apple blossom 5 | Blossoms are produced in spring simultaneously with the budding of the leaves and are produced on spurs and some long shoots. 
The 3 to 4 cm (1.2 to 1.6 in) flowers are white with a pink tinge that gradually fades, five petaled, with an inflorescence consisting of a cyme with 4–6 flowers. The central flower of the inflorescence is called the "king bloom"; it opens first and can develop a larger fruit.[4][5] 6 | 7 | The fruit matures in late summer or autumn, and cultivars exist in a wide range of sizes. Commercial growers aim to produce an apple that is 2 3⁄4 to 3 1⁄4 in (7.0 to 8.3 cm) in diameter, due to market preference. Some consumers, especially those in Japan, prefer a larger apple, while apples below 2 1⁄4 in (5.7 cm) are generally used for making juice and have little fresh market value. The skin of ripe apples is generally red, yellow, green, pink, or russetted, though many bi- or tri-colored cultivars may be found.[6] The skin may also be wholly or partly russeted i.e. rough and brown. The skin is covered in a protective layer of epicuticular wax.[7] The exocarp (flesh) is generally pale yellowish-white,[6] though pink or yellow exocarps also occur. 8 | 9 | Wild ancestors 10 | Main article: Malus sieversii 11 | The original wild ancestor of Malus pumila was Malus sieversii, found growing wild in the mountains of Central Asia in southern Kazakhstan, Kyrgyzstan, Tajikistan, and Xinjiang, China.[4][8] Cultivation of the species, most likely beginning on the forested flanks of the Tian Shan mountains, progressed over a long period of time and permitted secondary introgression of genes from other species into the open-pollinated seeds. Significant exchange with Malus sylvestris, the crabapple, resulted in current populations of apples being more related to crabapples than to the more morphologically similar progenitor Malus sieversii. In strains without recent admixture the contribution of the latter predominates 12 | 13 | The origin of the domestic dog includes the dog's evolutionary divergence from the wolf, its domestication, and its development into dog types and dog breeds. The dog is a member of the genus Canis, which forms part of the wolf-like canids, and was the first species and the only large carnivore to have been domesticated.[14][24] The dog and the extant gray wolf are sister taxa, as modern wolves are not closely related to the population of wolves that was first domesticated.[14] 14 | 15 | The genetic divergence between dogs and wolves occurred between 40,000–20,000 years ago, just before or during the Last Glacial Maximum.[25][2] This timespan represents the upper time-limit for the commencement of domestication because it is the time of divergence and not the time of domestication, which occurred later.[25][26] The domestication of animals commenced over 15,000 years ago, beginning with the grey wolf (Canis lupus) by nomadic hunter-gatherers.[25] The archaeological record and genetic analysis show the remains of the Bonn–Oberkassel dog buried beside humans 14,200 years ago to be the first undisputed dog, with disputed remains occurring 36,000 years ago.[2] It was not until 11,000 years ago that people living in the Near East entered into relationships with wild populations of aurochs, boar, sheep, and goats.[25] 16 | 17 | Where the domestication of the dog took place remains debated, with the most plausible proposals spanning Western Europe,[9][26] Central Asia[26][27] and East Asia.[26][28] This has been made more complicated by the recent proposal that an initial wolf population split into East and West Eurasian groups. 
These two groups, before going extinct, were domesticated independently into two distinct dog populations between 14,000 and 6,400 years ago. The Western Eurasian dog population was gradually and partially replaced by East Asian dogs introduced by humans at least 6,400 years ago.[26][2] This proposal is also debated.[2] 18 | 19 | -------------------------------------------------------------------------------- /data/wiki_noise10.txt: -------------------------------------------------------------------------------- 1 | An apple is a sweet, edible fruit produced by an apple tree (Malus pumila). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. The tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found today. Apples have been grown for thousands of years in Asia and Europe and were brought to North America by European colonists. Apples have religious and mythological significance in many cultures, including Norse, Greek and European Christian traditions. 2 | 3 | Apple trees are large if grown from seed. Generally, apple cultivars are propagated by grafting onto rootstocks, which control the size of the resulting tree. There are more than 7,500 known cultivars of apples, resulting in a range of desired characteristics. Different cultivars are bred for various tastes and use, including cooking, eating raw and cider production. Trees and fruit are prone to a number of fungal, bacterial and pest problems, which can be controlled by a number of organic and non-organic means. In 2010, the fruit's genome was sequenced as part of research on disease control and selective breeding in apple production. 4 | 5 | Worldwide production of apples in 2017 was 83.1 million tonnes, with China accounting for half of the total 6 | 7 | The apple is a deciduous tree, generally standing 6 to 15 ft (1.8 to 4.6 m) tall in cultivation and up to 30 ft (9.1 m) in the wild. When cultivated, the size, shape and branch density are determined by rootstock selection and trimming method. The leaves are alternately arranged dark green-colored simple ovals with serrated margins and slightly downy undersides.[4] 8 | 9 | 10 | Apple blossom 11 | Blossoms are produced in spring simultaneously with the budding of the leaves and are produced on spurs and some long shoots. The 3 to 4 cm (1.2 to 1.6 in) flowers are white with a pink tinge that gradually fades, five petaled, with an inflorescence consisting of a cyme with 4–6 flowers. The central flower of the inflorescence is called the "king bloom"; it opens first and can develop a larger fruit.[4][5] 12 | 13 | The fruit matures in late summer or autumn, and cultivars exist in a wide range of sizes. Commercial growers aim to produce an apple that is 2 3⁄4 to 3 1⁄4 in (7.0 to 8.3 cm) in diameter, due to market preference. Some consumers, especially those in Japan, prefer a larger apple, while apples below 2 1⁄4 in (5.7 cm) are generally used for making juice and have little fresh market value. The skin of ripe apples is generally red, yellow, green, pink, or russetted, though many bi- or tri-colored cultivars may be found.[6] The skin may also be wholly or partly russeted i.e. rough and brown. The skin is covered in a protective layer of epicuticular wax.[7] The exocarp (flesh) is generally pale yellowish-white,[6] though pink or yellow exocarps also occur. 
14 | 15 | Wild ancestors 16 | Main article: Malus sieversii 17 | The original wild ancestor of Malus pumila was Malus sieversii, found growing wild in the mountains of Central Asia in southern Kazakhstan, Kyrgyzstan, Tajikistan, and Xinjiang, China.[4][8] Cultivation of the species, most likely beginning on the forested flanks of the Tian Shan mountains, progressed over a long period of time and permitted secondary introgression of genes from other species into the open-pollinated seeds. Significant exchange with Malus sylvestris, the crabapple, resulted in current populations of apples being more related to crabapples than to the more morphologically similar progenitor Malus sieversii. In strains without recent admixture the contribution of the latter predominates.[9][10][11] 18 | 19 | Genome 20 | In 2010, an Italian-led consortium announced they had sequenced the complete genome of the apple in collaboration with horticultural genomicists at Washington State University,[12] using 'Golden Delicious'.[13] It had about 57,000 genes, the highest number of any plant genome studied to date[14] and more genes than the human genome (about 30,000).[15] This new understanding of the apple genome will help scientists identify genes and gene variants that contribute to resistance to disease and drought, and other desirable characteristics. Understanding the genes behind these characteristics will help scientists perform more knowledgeable selective breeding. The genome sequence also provided proof that Malus sieversii was the wild ancestor of the domestic apple—an issue that had been long-debated in the scientific community 21 | -------------------------------------------------------------------------------- /data/wiki_noise12.txt: -------------------------------------------------------------------------------- 1 | The Progressive Era was a period of widespread social activism and political reform across the United States that spanned the 1890s to the 1920s.[1] The main objectives of the Progressive movement was eliminating problems caused by industrialization, urbanization, immigration, and political corruption. The movement primarily targeted political machines and their bosses. By taking down these corrupt representatives in office, a further means of direct democracy would be established. They also sought regulation of monopolies (trust busting) and corporations through antitrust laws, which were seen as a way to promote equal competition for the advantage of legitimate competitors. 2 | 3 | Many progressives supported prohibition of alcoholic beverages, ostensibly to destroy the political power of local bosses based in saloons, but others out of a religious motivation.[2] At the same time, women's suffrage was promoted to bring a "purer" female vote into the arena.[3] A third theme was building an Efficiency Movement in every sector that could identify old ways that needed modernizing, and bring to bear scientific, medical and engineering solutions; a key part of the efficiency movement was scientific management, or "Taylorism". The middle class was in charge for helping reform the Progressive Era, and they got stuck with all of the burdens of this reformation. 
In Michael McGerr's book A Fierce Discontent, Jane Addams stated that she believed in the necessity of "association" of stepping across the social boundaries of industrial America.[4] 4 | 5 | Many activists joined efforts to reform local government, public education, medicine, finance, insurance, industry, railroads, churches, and many other areas. Progressives transformed, professionalized and made "scientific" the social sciences, especially history,[5] economics,[6] and political science.[7] In academic fields the day of the amateur author gave way to the research professor who published in the new scholarly journals and presses. The national political leaders included Republicans Theodore Roosevelt, Robert M. La Follette Sr., and Charles Evans Hughes and Democrats William Jennings Bryan, Woodrow Wilson and Al Smith. Leaders of the movement also existed far from presidential politics: Jane Addams, Grace Abbott, Edith Abbott and Sophonisba Breckinridge were among the most influential non-governmental Progressive Era reformers. 6 | 7 | Initially the movement operated chiefly at local level, but later it expanded to state and national levels. Progressives drew support from the middle class, and supporters included many lawyers, teachers, physicians, ministers, and business people.[8] Some Progressives strongly supported scientific methods as applied to economics, government, industry, finance, medicine, schooling, theology, education, and even the family. They closely followed advances underway at the time in Western Europe[9] and adopted numerous policies, such as a major transformation of the banking system by creating the Federal Reserve System in 1913[10] and the arrival of cooperative banking in the US with the founding of the first credit union in 1908.[11] Reformers felt that old-fashioned ways meant waste and inefficiency, and eagerly sought out the "one best system" 8 | 9 | Disturbed by the waste, inefficiency, stubbornness, corruption, and injustices of the Gilded Age, the Progressives were committed to changing and reforming every aspect of the state, society and economy. Significant changes enacted at the national levels included the imposition of an income tax with the Sixteenth Amendment, direct election of Senators with the Seventeenth Amendment, Prohibition with the Eighteenth Amendment, election reforms to stop corruption and fraud, and women's suffrage through the Nineteenth Amendment to the U.S. Constitution.[14] 10 | 11 | A main objective of the Progressive Era movement was to eliminate corruption within the government. They made it a point to also focus on family, education, and many other important aspects that still are enforced today. The most important political leaders during this time were Theodore Roosevelt, Robert M. La Follette Sr., Charles Evans Hughes, and Herbert Hoover. Some democratic leaders included William Jennings Bryan, Woodrow Wilson, and Al Smith.[15] 12 | 13 | This movement targeted the regulations of huge monopolies and corporations. This was done through antitrust laws to promote equal competition amongst every business. This was done through the Sherman Act of 1890, the Clayton Act of 1914, and the Federal Trade Commission Act of 1914 14 | -------------------------------------------------------------------------------- /data/wiki_noise5.txt: -------------------------------------------------------------------------------- 1 | Brown v. Board of Education of Topeka, 347 U.S. 483 (1954),[1] was a landmark decision of the U.S. 
Supreme Court in which the Court ruled that American state laws establishing racial segregation in public schools are unconstitutional, even if the segregated schools are otherwise equal in quality. Handed down on May 17, 1954, the Court's unanimous (9–0) decision stated that "separate educational facilities are inherently unequal," and therefore violate the Equal Protection Clause of the Fourteenth Amendment of the U.S. Constitution. However, the decision's 14 pages did not spell out any sort of method for ending racial segregation in schools, and the Court's second decision in Brown II (349 U.S. 294 (1955)) only ordered states to desegregate "with all deliberate speed". 2 | 3 | The case originated with a lawsuit filed by the Brown family, a family of black Americans in Topeka, Kansas, after their local public school district refused to enroll their daughter in the school closest to their home, instead requiring her to ride a bus to a blacks-only school further away. A number of other black families joined the lawsuit, and the Supreme Court later combined their case with several other similar lawsuits from other areas of the United States. At trial, the district court ruled in favor of the school board based on the Supreme Court's precedent in the 1896 case Plessy v. Ferguson, in which the Court had ruled that racial segregation was not in itself a violation of the Fourteenth Amendment's Equal Protection Clause if the facilities in question were otherwise equal, a doctrine that had come to be known as "separate but equal". The Browns, represented by NAACP chief counsel Thurgood Marshall, appealed to the Supreme Court, which agreed to hear the case. 4 | 5 | The Court's decision in Brown partially overruled Plessy v. Ferguson by declaring that the "separate but equal" notion was unconstitutional for American public schools and educational facilities.[note 1] It paved the way for integration and was a major victory of the Civil Rights Movement,[3] and a model for many future impact litigation cases.[4] In the American South, especially the "Deep South", where racial segregation was deeply entrenched, the reaction to Brown among most white people was "noisy and stubborn".[5] Many Southern governmental and political leaders embraced a plan known as "Massive Resistance", created by Virginia Senator Harry F. Byrd, in order to frustrate attempts to force them to de-segregate their school systems. Four years later, in the case of Cooper v. Aaron, the Court reaffirmed its ruling in Brown, and explicitly stated that state officials and legislators had no power to nullify its ruling. 6 | 7 | For much of the sixty years preceding the Brown case, race relations in the United States had been dominated by racial segregation. This policy had been endorsed in 1896 by the United States Supreme Court case of Plessy v. Ferguson, which held that as long as the separate facilities for the separate races were equal, segregation did not violate the Fourteenth Amendment ("no State shall ... deny to any person ... the equal protection of the laws").[6] 8 | 9 | The plaintiffs in Brown asserted that this system of racial separation, while masquerading as providing separate but equal treatment of both white and black Americans, instead perpetuated inferior accommodations, services, and treatment for black Americans. Racial segregation in education varied widely from the 17 states that required racial segregation to the 16 in which it was prohibited. 
Brown was influenced by UNESCO's 1950 Statement, signed by a wide variety of internationally renowned scholars, titled The Race Question.[7] This declaration denounced previous attempts at scientifically justifying racism as well as morally condemning racism. Another work that the Supreme Court cited was Gunnar Myrdal's An American Dilemma: The Negro Problem and Modern Democracy (1944).[8] Myrdal had been a signatory of the UNESCO declaration. 10 | 11 | The United States and the Soviet Union were both at the height of the Cold War during this time, and U.S. officials, including Supreme Court Justices, were highly aware of the harm that segregation and racism played on America's international image. When Justice William O. Douglas traveled to India in 1950, the first question he was asked was, "Why does America tolerate the lynching of Negroes?" Douglas later wrote that he had learned from his travels that "the attitude of the United States toward its colored minorities is a powerful factor in our relations with India." Chief Justice Earl Warren, nominated to the Supreme Court by President Eisenhower, echoed Douglas's concerns in a 1954 speech to the American Bar Association, proclaiming that "Our American system like all others is on trial both at home and abroad, ... the extent to which we maintain the spirit of our constitution with its Bill of Rights, will in the long run do more to make it both secure and the object of adulation than the number of hydrogen bombs we stockpile 12 | -------------------------------------------------------------------------------- /data/wiki_noise8.txt: -------------------------------------------------------------------------------- 1 | The Netherlands (Dutch: Nederland, [ˈneːdərlɑnt] (About this soundlisten)), also commonly known as Holland, is a country located mainly in Northwestern Europe. The European portion of the Netherlands consists of twelve separate provinces that border Germany to the east, Belgium to the south, and the North Sea to the northwest, with maritime borders in the North Sea with Belgium, Germany and the United Kingdom.[12] Together with three island territories in the Caribbean Sea—Bonaire, Sint Eustatius and Saba—it forms a constituent country of the Kingdom of the Netherlands. The official language is Dutch, but a secondary official language in the province of Friesland is West Frisian. 2 | 3 | The six largest cities in the Netherlands are Amsterdam, Rotterdam, The Hague, Utrecht, Eindhoven and Tilburg. Amsterdam is the country's capital,[13] while The Hague holds the seat of the States General, Cabinet and Supreme Court.[14] The Port of Rotterdam is the largest port in Europe, and the largest in any country outside Asia.[15] The country is a founding member of the EU, Eurozone, G10, NATO, OECD and WTO, as well as a part of the Schengen Area and the trilateral Benelux Union. It hosts several intergovernmental organisations and international courts, many of which are centered in The Hague, which is consequently dubbed 'the world's legal capital'.[16] 4 | 5 | Netherlands literally means 'lower countries' in reference to its low elevation and flat topography, with only about 50% of its land exceeding 1 metre (3 ft 3 in) above sea level, and nearly 17% falling below sea level.[17] Most of the areas below sea level, known as polders, are the result of land reclamation that began in the 16th century. 
With a population of 17.30 million people, all living within a total area of roughly 41,500 square kilometres (16,000 sq mi)—of which the land area is 33,700 square kilometres (13,000 sq mi)—the Netherlands is one of the most densely populated countries in the world. Nevertheless, it is the world's second-largest exporter of food and agricultural products (after the United States), owing to its fertile soil, mild climate, and intensive agriculture.[18][19] 6 | 7 | The Netherlands was, historically, the third country in the world to have representative government, and it has been a parliamentary constitutional monarchy with a unitary structure since 1848. The country has a tradition of pillarisation and a long record of social tolerance, having legalised abortion, prostitution and human euthanasia, along with maintaining a progressive drug policy. The Netherlands abolished the death penalty in 1870, allowed women's suffrage in 1917, and became the world's first country to legalise same-sex marriage in 2001. Its mixed-market advanced economy had the thirteenth-highest per capita income globally. The Netherlands ranks among the highest in international indexes of press freedom,[20] economic freedom,[21] human development, and quality of life, as well as happiness 8 | 9 | The Netherlands has been a constitutional monarchy since 1815, and due to the efforts of Johan Rudolph Thorbecke,[97] became a parliamentary democracy since 1848. The Netherlands is described as a consociational state. Dutch politics and governance are characterised by an effort to achieve broad consensus on important issues, within both the political community and society as a whole. In 2017, The Economist ranked the Netherlands as the 11th most democratic country in the world. 10 | 11 | The monarch is the head of state, at present King Willem-Alexander of the Netherlands. Constitutionally, the position is equipped with limited powers. By law, the King has the right to be periodically briefed and consulted on government affairs. Depending on the personalities and relationships of the King and the ministers, the monarch might have influence beyond the power granted by the Constitution of the Netherlands. 12 | 13 | The executive power is formed by the Council of Ministers, the deliberative organ of the Dutch cabinet. The cabinet usually consists of 13 to 16 ministers and a varying number of state secretaries. One to three ministers are ministers without portfolio. The head of government is the Prime Minister of the Netherlands, who often is the leader of the largest party of the coalition. The Prime Minister is a primus inter pares, with no explicit powers beyond those of the other ministers. Mark Rutte has been Prime Minister since October 2010; the Prime Minister had been the leader of the largest party continuously since 1973. 14 | 15 | The cabinet is responsible to the bicameral parliament, the States General, which also has legislative powers. The 150 members of the House of Representatives, the lower house, are elected in direct elections on the basis of party-list proportional representation. These are held every four years, or sooner in case the cabinet falls (for example: when one of the chambers carries a motion of no confidence, the cabinet offers its resignation to the monarch). The States-Provincial are directly elected every four years as well. The members of the provincial assemblies elect the 75 members of the Senate, the upper house, which has the power to reject laws, but not propose or amend them. 
Both houses send members to the Benelux Parliament, a consultative council. 16 |
-------------------------------------------------------------------------------- /baselines.py: --------------------------------------------------------------------------------
1 | 
2 | '''
3 | Baseline methods.
4 | -various kNN/LOF-based methods
5 | -isolation forest
6 | -dbscan
7 | -l2
8 | -elliptic envelope
9 | -naive spectral
10 | '''
11 | import torch
12 | import numpy as np
13 | import sklearn
14 | import sklearn.ensemble
15 | import sklearn.covariance
16 | import sklearn.cluster
17 | import sklearn.neighbors
18 | import random
19 | import utils
20 | import pdb
21 | 
22 | '''
23 | kNN method that uses distances to the k nearest neighbors
24 | as scores. (Global method.)
25 | Input:
26 | -X: data, 2D tensor.
27 | '''
28 | def knn_dist(X, k=10, sum_dist=False):
29 | 
30 |     min_dist, idx = utils.dist_rank(X, k=k, largest=False)
31 | 
32 |     if sum_dist:
33 |         dist_score = min_dist.sum(-1)
34 |     else:
35 |         dist_score = min_dist.mean(-1)
36 | 
37 |     return dist_score
38 | 
39 | '''
40 | LOF-style method using the reachability criterion to estimate density.
41 | (Local method.)
42 | '''
43 | def knn_dist_lof(X, k=10):
44 |     X_len = len(X)
45 | 
46 |     min_dist, min_idx = utils.dist_rank(X, k=k, largest=False)
47 |     kth_dist = min_dist[:, -1]
48 |     #reachability distance: max(kth dist of o, dist(p, o)) for each neighbor o of p
49 |     kth_dist_exp = kth_dist.expand(X.size(0), -1) #n x n
50 |     kth_dist = torch.gather(input=kth_dist_exp, dim=1, index=min_idx)
51 | 
52 |     min_dist[kth_dist > min_dist] = kth_dist[kth_dist > min_dist]
53 |     #inverse of lrd scores
54 |     dist_avg = min_dist.mean(-1).clamp(min=0.0001)
55 | 
56 |     compare_density = False
57 |     if compare_density:
58 |         #compare each point's density with that of its neighbors:
59 |         #LOF[i] = mean over neighbors j of dist_avg[i] / dist_avg[j],
60 |         #so higher scores mean locally sparser points, i.e. likelier outliers
61 |         dist_avg_exp = dist_avg.unsqueeze(-1) / dist_avg.unsqueeze(0).expand(X_len, -1)
62 |         lof = torch.gather(input=dist_avg_exp, dim=-1, index=min_idx).mean(-1)
63 |         return lof
64 | 
65 |     return dist_avg
66 | 
67 | '''
68 | LoOP: kNN based method using the quadratic mean distance to estimate density.
69 | LoOP (Local Outlier Probabilities) (Kriegel et al. 2009a)
70 | '''
71 | def knn_dist_loop(X, k=10):
72 |     #dist() returns squared distances, so the quadratic mean of the k
73 |     #nearest distances is the square root of the mean of these values
74 |     min_dist, idx = torch.topk(dist(X, X), dim=-1, k=k, largest=False)
75 |     dist_avg = min_dist.clamp(min=0).mean(-1).sqrt()
76 | 
77 |     return dist_avg
78 | 
79 | '''
80 | Isolation forest to compute outlier scores.
81 | Returns: The higher the score, the more likely to be an outlier.
82 | '''
83 | def isolation_forest(X):
84 |     X = X.cpu().numpy()
85 |     model = sklearn.ensemble.IsolationForest(contamination='auto', behaviour='new')
86 |     model.fit(X)
87 |     scores = -model.decision_function(X)
88 | 
89 |     return torch.from_numpy(scores).to(utils.device)
90 | 
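91 | '''
92 | Minimal usage sketch (illustration only, not from the original experiments):
93 | plant mean-shifted outliers in Gaussian data and check that the baseline
94 | scores rank them highest. Uses only functions defined in this file; the
95 | isolation_forest call assumes the sklearn version this repo targets.
96 | '''
97 | def _score_demo():
98 |     torch.manual_seed(0)
99 |     inliers = torch.randn(200, 10)
100 |     outliers = torch.randn(20, 10) + 6. #planted outliers, indices 200..219
101 |     X = torch.cat([inliers, outliers]).to(utils.device)
102 |     for score_f in [l2, knn_dist_loop, isolation_forest]:
103 |         #each baseline returns higher scores for likelier outliers
104 |         _, top = torch.topk(score_f(X), k=20)
105 |         hits = int((top >= 200).sum())
106 |         print('%s recovers %d / 20 planted outliers' % (score_f.__name__, hits))
107 | 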
108 | '''
109 | Elliptic envelope, based on a robust covariance estimate.
110 | Returns: The higher the score, the more likely to be an outlier.
111 | '''
112 | def ellenv(X):
113 |     X = X.cpu().numpy()
114 |     model = sklearn.covariance.EllipticEnvelope(contamination=0.2)
115 |     model.fit(X)
116 |     scores = -model.decision_function(X)
117 | 
118 |     return torch.from_numpy(scores).to(utils.device)
119 | 
120 | '''
121 | Local outlier factor (sklearn implementation).
122 | '''
123 | def lof(X):
124 |     #precompute distances to accelerate LOF; dist() returns squared
125 |     #distances, so take square roots to get Euclidean distances
126 |     dist_mx = dist(X, X).clamp(min=0).sqrt()
127 |     dist_mx = dist_mx.cpu().numpy()
128 |     #metric by default is minkowski with p=2
129 |     model = sklearn.neighbors.LocalOutlierFactor(n_neighbors=20, metric='precomputed', contamination='auto')
130 |     labels = model.fit_predict(dist_mx)
131 |     labels = torch.from_numpy(labels).to(utils.device)
132 |     scores = torch.zeros_like(labels)
133 |     scores[labels==-1] = 1
134 |     return scores
135 | 
136 | 
137 | '''
138 | DBSCAN, density based; marks points as inliers if they
139 | either have many neighbors or have inliers as their neighbors.
140 | -X are points, not pairwise distances.
141 | Returns:
142 | -scores, 1 means outlier.
143 | '''
144 | def dbscan(X):
145 |     X = X.cpu().numpy()
146 |     model = sklearn.cluster.DBSCAN(min_samples=10)
147 |     model.fit(X)
148 |     #-1 means "outlier"
149 |     labels = model.labels_
150 |     labels = torch.from_numpy(labels).to(utils.device)
151 |     scores = torch.zeros_like(labels)
152 |     scores[labels==-1] = 1
153 |     return scores
154 | 
155 | '''
156 | Compute scores using the squared l2 distance to the mean.
157 | Higher scores mean more likely outliers.
158 | '''
159 | def l2(X):
160 |     scores = ((X - X.mean(0))**2).sum(-1)
161 |     return scores
162 | 
163 | '''
164 | Pairwise squared Euclidean distances, ||x||^2 + ||y||^2 - 2<x, y>.
165 | Input:
166 | -X, Y: 2D tensors
167 | '''
168 | def dist(X, Y):
169 | 
170 |     X_norms = torch.sum(X**2, dim=1).view(-1, 1)
171 |     Y_norms = torch.sum(Y**2, dim=1).view(1, -1)
172 |     cur_distances = X_norms + Y_norms - 2*torch.mm(X, Y.t())
173 | 
174 |     return cur_distances
175 | 
176 | 
-------------------------------------------------------------------------------- /data/wiki_noise6.txt: --------------------------------------------------------------------------------
1 | The Dublin and Monaghan bombings of 17 May 1974 were a series of co-ordinated bombings in Dublin and Monaghan, Ireland. Three bombs exploded in Dublin during the evening rush hour and a fourth exploded in Monaghan almost ninety minutes later. They killed 33 civilians and a full-term unborn child, and injured almost 300. The bombings were the deadliest attack of the conflict known as the Troubles,[2] and the deadliest attack in the Republic's history.[3] Most of the victims were young women, although the ages of the dead ranged from an unborn child to 80 years. 2 | 3 | The Ulster Volunteer Force (UVF), a loyalist paramilitary group from Northern Ireland, claimed responsibility for the bombings in 1993. It had launched a number of attacks in the Republic since 1969. There are allegations taken seriously by inquiries that elements of the British state security forces helped the UVF carry out the bombings, including members of the Glenanne gang. Some of these allegations have come from former members of the security forces. The Irish parliament's Joint Committee on Justice called the attacks an act of international terrorism involving British state forces.[1] The month before the bombings, the British government had lifted the UVF's status as a proscribed organisation.
4 | 5 | The bombings happened during the Ulster Workers' Council strike. This was a general strike called by hardline loyalists and unionists in Northern Ireland who opposed the Sunningdale Agreement. Specifically, they opposed the sharing of political power with Irish nationalists, and the proposed role for the Republic in the governance of Northern Ireland. The Republic's government had helped bring about the Agreement. The strike brought down the Agreement and the Northern Ireland Assembly on 28 May. 6 | 7 | No-one has ever been charged with the bombings. A campaign by the victims' families led to an Irish government inquiry under Justice Henry Barron. His 2003 report criticised the Garda Síochána's investigation and said the investigators stopped their work prematurely.[4] It also criticised the Fine Gael/Labour government of the time for its inaction and lack of interest in the bombings.[4] The report said it was likely that British security force personnel or MI5 intelligence was involved but had insufficient evidence of higher-level involvement. However, the inquiry was hindered by the British government's refusal to release key documents.[5] The victims' families and others are continuing to campaign to this day for the British government to release these documents 8 | 9 | At about 17:30 on Friday 17 May 1974, without warning, three car bombs exploded in Dublin city centre at Parnell Street, Talbot Street and South Leinster Street during rush hour. The streets all ran east-west from busy thoroughfares to railway stations.[7] There was a bus strike in Dublin at the time, which meant there were more people on the streets than usual.[8] According to one of the Irish Army's top bomb disposal officers, Commandant Patrick Trears, the bombs were constructed so well that 100% of each bomb exploded upon detonation.[9] Twenty-three people died in these explosions and three others died from their injuries over the following few days and weeks. Many of the dead were young women originally from rural towns employed in the civil service. An entire family from central Dublin was killed. Two of the victims were foreign: an Italian man, and a French Jewish woman whose family had survived the Holocaust. 10 | 11 | First bomb 12 | The first of the three Dublin car bombs went off at about 17:28 on Parnell Street, near the intersection with Marlborough Street.[10] It was in a parking bay outside the Welcome Inn pub and Barry's supermarket at 93 and 91 Parnell Street respectively, and near petrol pumps. Shop fronts were blown out, cars were destroyed, and people were thrown in all directions. A brown Mini that had been parked behind the bomb was hurled onto the pavement at a right angle. One survivor described "a big ball of flame coming straight towards us, like a great nuclear mushroom cloud whooshing up everything in its path".[11] The bomb car was a metallic green 1970 model Hillman Avenger, registration number DIA 4063. It had been facing toward O'Connell Street, Dublin's main thoroughfare. This car, like the other two bomb cars, had its original registration plates. It had been hijacked in Belfast that morning.[12] 13 | 14 | Ten people were killed in this explosion, including two infant girls and their parents, and a World War I veteran.[13] Many others, including a teenaged petrol-pump attendant, were severely injured. 15 | 16 | Second bomb 17 | The second of the Dublin car bombs went off at about 17:30 on Talbot Street, near the intersection with Lower Gardiner Street. 
Talbot Street was the main route from the city centre to Connolly station, one of Dublin's primary railway stations. It was parked at 18 Talbot Street, on the north side, opposite Guineys department store. The bomb car was a metallic blue mink Ford Escort, registration number 1385 WZ. It had been stolen that morning in the Docks area of Belfast.[12] The blast damaged buildings and vehicles on both sides of the street. People suffered severe burns and were struck by shrapnel, flying glass and debris; some were hurled through the windows of shops.[10] 18 | 19 | Twelve people were killed outright, and another two died over the following days and weeks. Thirteen of the fourteen victims were women, including one who was nine months pregnant. One young woman who had been beside the bomb car was decapitated; the only clue to her sex was the pair of brown platform boots she was wearing.[14] Several others lost limbs and a man was impaled through the abdomen by an iron bar.[10] Several bodies lay in the street for half an hour as ambulances struggled to get through traffic jams.[15] At least four bodies were found on the pavement outside Guineys.[16] The bodies of the victims were covered by newspapers until they were removed from the scene 20 | -------------------------------------------------------------------------------- /data/wiki_noise2.txt: -------------------------------------------------------------------------------- 1 | "At Seventeen" is a song by American singer-songwriter Janis Ian from her seventh studio album Between the Lines. Columbia released it in August 1975 as the album's second single. Ian wrote the lyrics based on a The New York Times article with a samba instrumental, and Brooks Arthur produced the final version. A soft rock ballad, the song is about a social outcast in high school. Critics have regarded "At Seventeen" as a type of anthem. Despite her initial reluctance to perform the single live, Ian promoted it at various appearances and it has been included on compilation and live albums. 2 | 3 | Critics praised "At Seventeen", which earned Ian the Grammy Award for Best Female Pop Vocal Performance, and Grammy nominations for Record and Song of the Year. The single reached number three on the Billboard Hot 100 chart, and has sold over a million copies as of August 2004. Internationally, "At Seventeen" charted in Australia, Canada, and New Zealand. One of Ian's most commercially successful songs, critics consider it her signature song. "At Seventeen" has been used frequently in television and films, like The Simpsons and Mean Girls; it has also been referenced in literature. Various recording artists and musicians, including Anita Kerr, Jann Arden, and Celine Dion, have covered "At Seventeen". The Hong Kong all-female band at17 named themselves after the song. 
4 | 5 | "At Seventeen" was written by Janis Ian at the age of twenty-four and produced by Brooks Arthur.[1][2] She was inspired to write the single after reading a The New York Times article about a young woman who believed her life would improve after a debutante ball and her subsequent disappointment when it did not.[3][4] In the article the girl was eighteen, but Ian changed it to seventeen to fit with her samba guitar instrumental.[4] She recalled feeling uncomfortable while writing "At Seventeen" as it predated the confessional song trend of the mid-1970s.[3] She was also uncertain about writing about high school when she had never experienced a homecoming or a prom.[4] She said she purposefully took her time with the song to insure it did not lose its "intensity";[4] she repeatedly stopped and started work on it over the course of three months.[3][5] At the time, she was living with her mother.[4] 6 | 7 | During the recording process, which Ian described as "very tense", she worried she had accidentally stolen the melody from a different song and consulted with three friends about it. Arthur described the song as "just honest and straight from her heart", and felt it was different from folk or pop music. He said Ian was easy to work with as she had prepared by bringing lyric sheets and arrangements to the studio sessions.[3] Arthur and Ian had worked together on her 1966 single "Society's Child", during which they formed a close friendship.[6] "At Seventeen" was completed in roughly two or three days at 914 Sound Studios;[3][6] it was recorded on September 17, 1974.[7] The final version contains two combined takes, as the initial ending was deemed too weak compared to its start. Allen Klein listened in during a session and responded positively to the song.[3] Brooks Arthur, Larry Alexander, and Russ Payne were the audio engineers for "At Seventeen".[2] 8 | 9 | Composition and lyrics 10 | 11 | Critics cited "At Seventeen" as bossa nova,[6][8] pop rock, jazz, and blues.[9] Ian originally wrote the song while playing a samba instrumental on her guitar.[4] 12 | 13 | "At Seventeen" is composed in the key of C major using common time and a moderate tempo of 126 beats per minute. Instrumentation is provided by a piano and a guitar. During the track, Ian's vocal range spans from the low note of G3 to the high note of Ab4.[10] Some commentators connected the song to bossa nova.[6][8] Mix magazine's Gary Eskow cited Ian's style as the opposite of Antônio Carlos Jobim's because she "explore[d] the belly of the bossa, the flip side of Ipanema".[6] John Lissner of The New York Times referred to the instrumental as having a "laid‐back bossa nova beat" and ostinato.[8] On the other hand, AllMusic's Lindsay Planer referred to "At Seventeen" as a mixture of pop rock, jazz, and blues,[9] and music scholar James E.
Perone associated it more with jazz and a "coffeehouse folksinger" approach.[11] Perone described the song's style as more restrained compared to Ian's contemporaries.[11] A writer for Rolling Stone magazine associated "At Seventeen" with "sulk-pop".[12] 14 | 15 | "At Seventeen" is a soft rock ballad about being a social outcast in high school,[13][14] particularly with respect to adolescent cruelty and rejection.[15][16] The lyrics focus on the conflict between cliques as represented by the contrast of "ravaged faces" and "clear-skinned smiles".[17] The song opens with the line "I learned the truth at seventeen, that love was meant for beauty queens".[10] The narrator reveals in the third verse that she finds herself unattractive ("Those of us with ravaged faces"), but later provides a more hopeful outlook through an "Ugly Duckling" allusion ("Ugly duckling girls like me.").[3] Ian said "The Ugly Duckling" lyric was partially inspired by Billie Holiday, who described her music as always containing a sense of hope. Ian had written the last verse ("To those of us who knew the pain / of valentines that never came") to connect with the listener.[4] Other lyrics include: "…remained at home / Inventing lovers on the phone."[18] and "The valentines I never knew / the Friday night charades of youth."[19] 16 | 17 | Some commentators viewed "At Seventeen" as a type of anthem.[20][21][22] Melissa Etheridge and Billboard's Patrick Crowley interpreted the song as a gay anthem.[20][21] Crowley equated the awkwardness described in the lyrics to the confusion over one's sexual orientation.[20] Etheridge felt the line ("I learned the truth at seventeen") as discovering one's homosexuality. Ian said she was surprised at the LGBT support given to the song.[21] NPR included "At Seventeen" in its 2018 series on American anthems 18 | -------------------------------------------------------------------------------- /data/sherlock_noise5.txt: -------------------------------------------------------------------------------- 1 | was United States Secretary of State from 1861 to 1869, and earlier served as Governor of New York and United States Senator. A determined opponent of the spread of slavery in the years leading up to the American Civil War he was a dominant figure in the Republican Party in its formative years, and was praised for his work on behalf of the Union as Secretary of State during the Civil War. 2 | 3 | Seward was born in Florida, Orange County, New York, where his father was a farmer and owned slaves. He was educated as a lawyer and moved to the Central New York town of Auburn. Seward was elected to the New York State Senate in 1830 as an Anti-Mason. Four years later, he became the gubernatorial nominee of the Whig Party. Though he was not successful in that race, Seward was elected governor in 1838 and won a second two-year term in 1840. During this period, he signed several laws that advanced the rights of and opportunities for black residents, as well as guaranteeing fugitive slaves jury trials in the state. The legislation protected abolitionists, and he used his position to intervene in cases of freed black people who were enslaved in the South. 4 | 5 | After many years of practicing law in Auburn, he was elected by the state legislature to the U.S. Senate in 1849. Seward's strong stances and provocative words against slavery brought him hatred in the South. He was re-elected to the Senate in 1855, and soon joined the nascent Republican Party, becoming one of its leading figures. 
As the 1860 presidential election approached, he was regarded as the leading candidate for the Republican nomination. Several factors, including attitudes to his vocal opposition to slavery, his support for immigrants and Catholics, and his association with editor and political boss Thurlow Weed, worked against him and Abraham Lincoln secured the presidential nomination. Although devastated by his loss, he campaigned for Lincoln, who was elected and appointed him Secretary of State. 6 | 7 | The Peninsular War[c] (1807–1814) was a military conflict between Napoleon's empire and Bourbon Spain (assisted by the United Kingdom of Great Britain and Ireland and its ally Kingdom of Portugal), for control of the Iberian Peninsula during the Napoleonic Wars. The war began when the French and Spanish armies invaded and occupied Portugal in 1807, and escalated in 1808 when France turned on Spain, previously its ally. The war on the peninsula lasted until the Sixth Coalition defeated Napoleon in 1814, and is regarded as one of the first wars of national liberation, significant for the emergence of large-scale guerrilla warfare. 8 | 9 | The Peninsular War overlaps with what the Spanish-speaking world calls the Guerra de la Independencia Española (Spanish War of Independence), which began with the Dos de Mayo Uprising on 2 May 1808 and ended on 17 April 1814. The French occupation destroyed the Spanish administration, which fragmented into quarrelling provincial juntas. The episode remains as the bloodiest event in Spain's modern history, doubling in relative terms the Spanish Civil War. 10 | 11 | Doris Day (born Doris Mary Kappelhoff; April 3, 1922 – May 13, 2019) was an American actress, singer, and animal welfare activist. She began her career as a big band singer in 1939, achieving commercial success in 1945 with two No. 1 recordings, "Sentimental Journey" and "My Dreams Are Getting Better All the Time" with Les Brown & His Band of Renown. She left Brown to embark on a solo career and recorded more than 650 songs from 1947 to 1967. 12 | 13 | Day's film career began during the latter part of the classical Hollywood era with the film Romance on the High Seas (1948), leading to a 20-year career as a motion picture actress. She starred in films of many genres, including musicals, comedies, and dramas. She played the title role in Calamity Jane (1953) and starred in Alfred Hitchcock's The Man Who Knew Too Much (1956) with James Stewart. Her best-known films are those in which she co-starred with Rock Hudson, chief among them 1959's Pillow Talk, for which she was nominated for the Academy Award for Best Actress. She also worked with James Garner on both Move Over, Darling (1963) and The Thrill of It All (1963), and also starred with Clark Gable, Cary Grant, James Cagney, David Niven, Jack Lemmon, Frank Sinatra, Richard Widmark, Kirk Douglas, Lauren Bacall and Rod Taylor in various movies. 
After ending her film career in 1968, only briefly removed from the height of her popularity, she starred in the sitcom The Doris Day Show 14 | 15 | The hotel is located in the Dejvice quarter of the Prague 6 municipal district, and was recognized on the list of Czech cultural monuments on 4 July 2000.[1] Construction of the hotel took place from 1952 to 1956, with interior decorations finished in 1957.[2] The hotel was the idea of Alexej Čepička, the Czechoslovakian Minister of Defence, who envisioned a monument to the newly formed Fourth Czechoslovak Republic that would reinforce ties with the Soviet Union.[2][3] 16 | 17 | The original plans were commissioned from the college of architects at the Military Project Institute in 1951, and called the site Hotel Družba, the Russian word for friendship. The original function was military accommodations in a rectangular floor plan to house out-of-town officers. This draft was never sent to the public archives, rather it was kept secret.[3] The final construction site for the new hotel was chosen in 1951, and architect František Jeřábek worked with the military on a new set of plans, which were more complicated and included a luxury hotel.[3] Plans were revised in the late construction stage to add an extra two steps on the already finished central staircase, to accommodate one step for each of the forty-four Czechoslovak generals at that time.[3] 18 | 19 | When it was completed in 1957, the hotel had the largest capacity in Czechoslovakia.[3] The Hotel Družba was opened up to public use and its name was switched to the Hotel Čedok in 1957, sharing the name of the local travel agency for tourism in the Czech Republic.[2][3] Later in 1957, a public competition was held to rename the building, and Hotel International was chosen.[2][3] Other suggested names included Podbaba, Juliska, Máj, Mír, Slovan, Experiment, Eldorádo, Stůlka prostři se, and Den a noc 20 | -------------------------------------------------------------------------------- /data/sherlock_noise.txt: -------------------------------------------------------------------------------- 1 | But here is an artist. He desires to paint you the dreamiest, shadiest, quietest, most enchanting bit of romantic landscape in all the valley of the Saco. What is the chief element he employs? There stand his trees, each with a hollow trunk, as if a hermit and a crucifix were within; and here sleeps his meadow, and there sleep his cattle; and up from yonder cottage goes a sleepy smoke. Deep into distant woodlands winds a mazy way, reaching to overlapping spurs of mountains bathed in their hill-side blue. But though the picture lies thus tranced, and though this pine-tree shakes down its sighs like leaves upon this shepherd’s head, yet all were vain, unless the shepherd’s eye were fixed upon the magic stream before him. Go visit the Prairies in June, when for scores on scores of miles you wade knee-deep among Tiger-lilies—what is the one charm wanting?—Water—there is not a drop of water there! Were Niagara but a cataract of sand, would you travel your thousand miles to see it? Why did the poor poet of Tennessee, upon suddenly receiving two handfuls of silver, deliberate whether to buy him a coat, which he sadly needed, or invest his money in a pedestrian trip to Rockaway Beach? Why is almost every robust healthy boy with a robust healthy soul in him, at some time or other crazy to go to sea? 
Why upon your first voyage as a passenger, did you yourself feel such a mystical vibration, when first told that you and your ship were now out of sight of land? Why did the old Persians hold the sea holy? Why did the Greeks give it a separate deity, and own brother of Jove? Surely all this is not without meaning. And still deeper the meaning of that story of Narcissus, who because he could not grasp the tormenting, mild image he saw in the fountain, plunged into it and was drowned. But that same image, we ourselves see in all rivers and oceans. It is the image of the ungraspable phantom of life; and this is the key to it all. 2 | 3 | Now, when I say that I am in the habit of going to sea whenever I begin to grow hazy about the eyes, and begin to be over conscious of my lungs, I do not mean to have it inferred that I ever go to sea as a passenger. For to go as a passenger you must needs have a purse, and a purse is but a rag unless you have something in it. Besides, passengers get sea-sick—grow quarrelsome—don’t sleep of nights—do not enjoy themselves much, as a general thing;—no, I never go as a passenger; nor, though I am something of a salt, do I ever go to sea as a Commodore, or a Captain, or a Cook. I abandon the glory and distinction of such offices to those who like them. For my part, I abominate all honorable respectable toils, trials, and tribulations of every kind whatsoever. It is quite as much as I can do to take care of myself, without taking care of ships, barques, brigs, schooners, and what not. And as for going as cook,—though I confess there is considerable glory in that, a cook being a sort of officer on ship-board—yet, somehow, I never fancied broiling fowls;—though once broiled, judiciously buttered, and judgmatically salted and peppered, there is no one who will speak more respectfully, not to say reverentially, of a broiled fowl than I will. It is out of the idolatrous dotings of the old Egyptians upon broiled ibis and roasted river horse, that you see the mummies of those creatures in their huge bake-houses the pyramids. 4 | 5 | No, when I go to sea, I go as a simple sailor, right before the mast, plumb down into the forecastle, aloft there to the royal mast-head. True, they rather order me about some, and make me jump from spar to spar, like a grasshopper in a May meadow. And at first, this sort of thing is unpleasant enough. It touches one’s sense of honor, particularly if you come of an old established family in the land, the Van Rensselaers, or Randolphs, or Hardicanutes. And more than all, if just previous to putting your hand into the tar-pot, you have been lording it as a country schoolmaster, making the tallest boys stand in awe of you. The transition is a keen one, I assure you, from a schoolmaster to a sailor, and requires a strong decoction of Seneca and the Stoics to enable you to grin and bear it. But even this wears off in time. 6 | 7 | What of it, if some old hunks of a sea-captain orders me to get a broom and sweep down the decks? What does that indignity amount to, weighed, I mean, in the scales of the New Testament? Do you think the archangel Gabriel thinks anything the less of me, because I promptly and respectfully obey that old hunks in that particular instance? Who ain’t a slave? Tell me that. 
Well, then, however the old sea-captains may order me about—however they may thump and punch me about, I have the satisfaction of knowing that it is all right; that everybody else is one way or other served in much the same way—either in a physical or metaphysical point of view, that is; and so the universal thump is passed round, and all hands should rub each other’s shoulder-blades, and be content. 8 | Again, I always go to sea as a sailor, because they make a point of paying me for my trouble, whereas they never pay passengers a single penny that I ever heard of. On the contrary, passengers themselves must pay. And there is all the difference in the world between paying and being paid. The act of paying is perhaps the most uncomfortable infliction that the two orchard thieves entailed upon us. But being paid,—what will compare with it? The urbane activity with which a man receives money is really marvellous, considering that we so earnestly believe money to be the root of all earthly ills, and that on no account can a monied man enter heaven. Ah! how cheerfully we consign ourselves to perdition! 9 | 10 | Finally, I always go to sea as a sailor, because of the wholesome exercise and pure air of the fore-castle deck. For as in this world, head winds are far more prevalent than winds from astern (that is, if you never violate the Pythagorean maxim), so for the most part the Commodore on the quarter-deck gets his atmosphere at second hand from the sailors on the forecastle. He thinks he breathes it first; but not so. In much the same way do the commonalty lead their leaders in many other things, at the same time that the leaders little suspect it. But wherefore it was that after having repeatedly smelt the sea as a merchant sailor, I should now take it into my head to go on a whaling voyage; this the invisible police officer of the Fates, who has the constant surveillance of me, and secretly dogs me, and influences me in some unaccountable way—he can better answer than any one else. And, doubtless, my going on this whaling voyage, formed part of the grand programme of Providence that was drawn up a long time ago. It came in as a sort of brief interlude and solo between more extensive performances. I take it that this part of the bill must have run something like this: 11 | -------------------------------------------------------------------------------- /data/wiki_noise0.txt: -------------------------------------------------------------------------------- 1 | 2 | Edward II (25 April 1284 – 21 September 1327), also called Edward of Carnarvon, was king of England from 1307 until he was deposed in January 1327. The fourth son of Edward I, Edward became the heir apparent to the throne following the death of his elder brother Alphonso. Beginning in 1300, Edward accompanied his father on campaigns to pacify Scotland, and in 1306 was knighted in a grand ceremony at Westminster Abbey. Following his father's death, Edward succeeded to the throne in 1307. He married Isabella, the daughter of the powerful King Philip IV of France, in 1308, as part of a long-running effort to resolve tensions between the English and French crowns. 3 | 4 | Edward had a close and controversial relationship with Piers Gaveston, who had joined his household in 1300. The precise nature of his and Gaveston's relationship is uncertain; they may have been friends, lovers or sworn brothers. Edward's relationship with Gaveston inspired Christopher Marlowe's 1592 play Edward II, along with other plays, films, novels and media. 
Many of these have focused on the possible sexual relationship between the two men. Gaveston's power as Edward's favourite provoked discontent among both the barons and the French royal family, and Edward was forced to exile him. On Gaveston's return, the barons pressured the king into agreeing to wide-ranging reforms, called the Ordinances of 1311. The newly empowered barons banished Gaveston, to which Edward responded by revoking the reforms and recalling his favourite. Led by Edward's cousin, the Earl of Lancaster, a group of the barons seized and executed Gaveston in 1312, beginning several years of armed confrontation. English forces were pushed back in Scotland, where Edward was decisively defeated by Robert the Bruce at the Battle of Bannockburn in 1314. Widespread famine followed, and criticism of the king's reign mounted. 5 | 6 | The Despenser family, in particular Hugh Despenser the Younger, became close friends and advisers to Edward, but Lancaster and many of the barons seized the Despensers' lands in 1321, and forced the king to exile them. In response, Edward led a short military campaign, capturing and executing Lancaster. Edward and the Despensers strengthened their grip on power, formally revoking the 1311 reforms, executing their enemies and confiscating estates. Unable to make progress in Scotland, Edward finally signed a truce with Robert. Opposition to the regime grew, and when Isabella was sent to France to negotiate a peace treaty in 1325, she turned against Edward and refused to return. Instead, she allied herself with the exiled Roger Mortimer, and invaded England with a small army in 1326. Edward's regime collapsed and he fled to Wales, where he was captured in November. The king was forced to relinquish his crown in January 1327 in favour of his 14-year-old son, Edward III, and he died in Berkeley Castle on 21 September, probably murdered on the orders of the new regime. 7 | 8 | Edward's contemporaries criticised his performance as king, noting his failures in Scotland and the oppressive regime of his later years, although 19th-century academics later argued that the growth of parliamentary institutions during his reign was a positive development for England over the longer term. Debate over his perceived failures has continued into the 21st century 9 | 10 | Edward II was the fourth son of Edward I and his first wife, Eleanor of Castile.[1] His father was the king of England and had also inherited Gascony in south-western France, which he held as the feudal vassal of the king of France, and the Lordship of Ireland.[2] His mother was from the Castilian royal family, and held the County of Ponthieu in northern France. 
Edward I proved a successful military leader, leading the suppression of the baronial revolts in the 1260s and joining the Ninth Crusade.[3] During the 1280s he conquered North Wales, removing the native Welsh princes from power and, in the 1290s, he intervened in Scotland's civil war, claiming suzerainty over the country.[4] He was considered an extremely successful ruler by his contemporaries, largely able to control the powerful earls that formed the senior ranks of the English nobility.[5] The historian Michael Prestwich describes Edward I as "a king to inspire fear and respect", while John Gillingham characterises him as an efficient bully.[6] 11 | 12 | Despite Edward I's successes, when he died in 1307 he left a range of challenges for his son to resolve.[7] One of the most critical was the problem of English rule in Scotland, where Edward's long but ultimately inconclusive military campaign was ongoing when he died.[8] Edward's control of Gascony created tension with the French kings.[9] They insisted that the English kings give homage to them for the lands; the English kings saw this demand as insulting to their honour, and the issue remained unresolved.[9] Edward I also faced increasing opposition from his barons over the taxation and requisitions required to resource his wars, and left his son debts of around £200,000 on his death.[10][nb 1] 13 | 14 | Early life (1284–1307) 15 | Birth 16 | Photograph of Caernarfon castle 17 | Caernarfon Castle, Edward's birthplace 18 | Edward II was born in Caernarfon Castle in north Wales on 25 April 1284, less than a year after Edward I had conquered the region, and as a result is sometimes called Edward of Caernarfon.[12] The king probably chose the castle deliberately as the location for Edward's birth as it was an important symbolic location for the native Welsh, associated with Roman imperial history, and it formed the centre of the new royal administration of North Wales.[13] Edward's birth brought predictions of greatness from contemporary prophets, who believed that the Last Days of the world were imminent, declaring him a new King Arthur, who would lead England to glory.[14] David Powel, a 16th-century clergyman, suggested that the baby was offered to the Welsh as a prince "that was borne in Wales and could speake never a word of English", but there is no evidence to support this account.[15] 19 | 20 | Edward's name was English in origin, linking him to the Anglo-Saxon saint Edward the Confessor, and was chosen by his father instead of the more traditional Norman and Castilian names selected for Edward's brothers:[16] Edward had three elder brothers: John and Henry, who had died before Edward was born, and Alphonso, who died in August 1284, leaving Edward as the heir to the throne.[17] Although Edward was a relatively healthy child, there were enduring concerns throughout his early years that he too might die and leave his father without a male heir.[17] After his birth, Edward was looked after by a wet nurse called Mariota or Mary Maunsel for a few months until she fell ill, when Alice de Leygrave became his foster mother.[18] He would have barely known his natural mother Eleanor, who was in Gascony with his father during his earliest years.[18] An official household, complete with staff, was created for the new baby, under the direction of a clerk, Giles of Oudenarde. 
21 | -------------------------------------------------------------------------------- /data/wiki_noise13.txt: -------------------------------------------------------------------------------- 1 | The Gilded Age in United States history is the late 19th century, from the 1870s to about 1900. The term for this period came into use in the 1920s and 1930s and was derived from writer Mark Twain's and Charles Dudley Warner's 1873 novel The Gilded Age: A Tale of Today, which satirized an era of serious social problems masked by a thin gold gilding. The early half of the Gilded Age roughly coincided with the middle portion of the Victorian era in Britain and the Belle Époque in France. Its beginning in the years after the American Civil War overlaps the Reconstruction Era (which ended in 1877).[1] It was followed in the 1890s by the Progressive Era. 2 | 3 | The Gilded Age was an era of rapid economic growth, especially in the North and West. As American wages were much higher than those in Europe, especially for skilled workers, the period saw an influx of millions of European immigrants. The rapid expansion of industrialization led to real wage growth of 60% between 1860 and 1890, spread across the ever-increasing labor force. The average annual wage per industrial worker (including men, women, and children) rose from $380 in 1880 to $564 in 1890, a gain of 48%. However, the Gilded Age was also an era of abject poverty and inequality as millions of immigrants—many from impoverished regions—poured into the United States, and the high concentration of wealth became more visible and contentious.[2] 4 | 5 | Railroads were the major growth industry, with the factory system, mining, and finance increasing in importance. Immigration from Europe and the eastern states led to the rapid growth of the West, based on farming, ranching, and mining. Labor unions became important in the very rapidly growing industrial cities. Two major nationwide depressions—the Panic of 1873 and the Panic of 1893—interrupted growth and caused social and political upheavals. The South after the Civil War remained economically devastated; its economy became increasingly tied to commodities, cotton and tobacco production, which suffered from low prices. With the end of the Reconstruction era in 1877, African-American people in the South were stripped of political power and voting rights and were left economically disadvantaged. 6 | 7 | The political landscape was notable in that despite some corruption, turnout was very high and national elections saw two evenly matched parties. The dominant issues were cultural (especially regarding prohibition, education, and ethnic or racial groups) and economic (tariffs and money supply). With the rapid growth of cities, political machines increasingly took control of urban politics. In business, powerful nationwide trusts formed in some industries. Unions crusaded for the 8-hour working day and the abolition of child labor; middle class reformers demanded civil service reform, prohibition of liquor and beer, and women's suffrage. Local governments across the North and West built public schools chiefly at the elementary level; public high schools started to emerge. The numerous religious denominations were growing in membership and wealth, with Catholicism becoming the largest denomination. They all expanded their missionary activity to the world arena. 
Catholics, Lutherans and Episcopalians set up religious schools and the larger denominations set up numerous colleges, hospitals, and charities. Many of the problems faced by society, especially the poor, during the Gilded Age gave rise to attempted reforms in the subsequent Progressive Era 8 | 9 | The Gilded Age was a period of economic growth as the United States jumped to the lead in industrialization ahead of Britain. The nation was rapidly expanding its economy into new areas, especially heavy industry like factories, railroads, and coal mining. In 1869, the First Transcontinental Railroad opened up the far-west mining and ranching regions. Travel from New York to San Francisco now took six days instead of six months.[13] Railroad track mileage tripled between 1860 and 1880, and then doubled again by 1920. The new track linked formerly isolated areas with larger markets and allowed for the rise of commercial farming, ranching, and mining, creating a truly national marketplace. American steel production rose to surpass the combined totals of Britain, Germany, and France.[14] 10 | 11 | Investors in London and Paris poured money into the railroads through the American financial market centered in Wall Street. By 1900, the process of economic concentration had extended into most branches of industry—a few large corporations, called "trusts", dominated in steel, oil, sugar, meat, and farm machinery. Through vertical integration these trusts were able to control each aspect of the production of a specific good, ensuring that the profits made on the finished product were maximized and prices minimized, and by controlling access to the raw materials, prevented other companies from being able to compete in the marketplace.[15] Several monopolies --most famously Standard Oil--came to dominate their markets by keeping prices low when competitors appeared; they grew at a rate four times faster than that of the competitive sectors.[16] 12 | 13 | 14 | A Los Angeles oil district, circa 1900 15 | Increased mechanization of industry is a major mark of the Gilded Age's search for cheaper ways to create more product. Frederick Winslow Taylor observed that worker efficiency in steel could be improved through the use of very close observations with a stop watch to eliminate wasted effort. Mechanization made some factories an assemblage of unskilled laborers performing simple and repetitive tasks under the direction of skilled foremen and engineers. Machine shops grew rapidly, and they comprised highly skilled workers and engineers. Both the number of unskilled and skilled workers increased, as their wage rates grew.[17] 16 | 17 | Engineering colleges were established to feed the enormous demand for expertise. Railroads invented modern management, with clear chains of command, statistical reporting, and complex bureaucratic systems.[18] They systematized the roles of middle managers and set up explicit career tracks. They hired young men ages 18–21 and promoted them internally until a man reached the status of locomotive engineer, conductor, or station agent at age 40 or so. Career tracks were invented for skilled blue-collar jobs and for white-collar managers, starting in railroads and expanding into finance, manufacturing, and trade. Together with rapid growth of small business, a new middle class was rapidly growing, especially in northern cities.[19] 18 | 19 | The United States became a world leader in applied technology. 
From 1860 to 1890, 500,000 patents were issued for new inventions—over ten times the number issued in the previous seventy years. George Westinghouse invented air brakes for trains (making them both safer and faster). Theodore Vail established the American Telephone & Telegraph Company and built a great communications network.[20] Thomas Edison, in addition to inventing hundreds of devices, established the first electrical lighting utility, basing it on direct current and an efficient incandescent lamp. Electric power delivery spread rapidly across Gilded Age cities. The streets were lighted at night, and electric streetcars allowed for faster commuting to work and easier shopping.[21] 20 | 21 | Petroleum launched a new industry beginning with the Pennsylvania oil fields in the 1860s. The United States dominated the global industry into the 1950s. Kerosene replaced whale oil and candles for lighting homes. John D. Rockefeller founded Standard Oil Company and monopolized the oil industry, which mostly produced kerosene before the automobile created a demand for gasoline in the 20th century 22 | -------------------------------------------------------------------------------- /data/sherlock_noise4.txt: -------------------------------------------------------------------------------- 1 | The Hawaiian islands were formed by volcanic activity initiated at an undersea magma source called the Hawaii hotspot. The process is continuing to build islands; the tectonic plate beneath much of the Pacific Ocean continually moves northwest and the hot spot remains stationary, slowly creating new volcanoes. Because of the hotspot's location, all currently active land volcanoes are located on the southern half of Hawaii Island. The newest volcano, Lōʻihi Seamount, is located south of the coast of Hawaii Island. 2 | 3 | The last volcanic eruption outside Hawaii Island occurred at Haleakalā on Maui before the late 18th century, possibly hundreds of years earlier.[33] In 1790, Kīlauea exploded; it was the deadliest eruption known to have occurred in the modern era in what is now the United States.[34] Up to 5,405 warriors and their families marching on Kīlauea were killed by the eruption.[35] Volcanic activity and subsequent erosion have created impressive geological features. Hawaii Island has the second-highest point among the world's islands.[36] 4 | 5 | On the flanks of the volcanoes, slope instability has generated damaging earthquakes and related tsunamis, particularly in 1868 and 1975.[37] Steep cliffs have been created by catastrophic debris avalanches on the submerged flanks of ocean island volcanoes.[38][39] 6 | 7 | The Kīlauea erupted in May 2018, opening 22 fissure vents on its East Rift Zone. The Leilani Estates and Lanipuna Gardens are situated within this territory. The destruction affected at least 36 buildings and this coupled with the lava flows and the sulfur dioxide fumes, necessitated the evacuation of more than 2,000 local inhabitants from the neighborhoods.[40] 8 | 9 | Flora and fauna 10 | See also: Endemism in the Hawaiian Islands and List of invasive plant species in Hawaii 11 | A Hawaiian monk seal rests at French Frigate Shoals. 12 | French Frigate Shoals, located in the Northwestern Hawaiian Islands, is protected as part of the Papahānaumokuākea Marine National Monument. 13 | Because the islands of Hawaii are distant from other land habitats, life is thought to have arrived there by wind, waves (i.e. by ocean currents) and wings (i.e. 
birds, insects, and any seeds they may have carried on their feathers). This isolation, in combination with the diverse environment (including extreme altitudes, tropical climates, and arid shorelines), allowed for the evolution of new endemic flora and fauna. Hawaii has more endangered species and has lost a higher percentage of its endemic species than any other U.S. state.[41] One endemic plant, Brighamia, now requires hand-pollination because its natural pollinator is presumed to be extinct.[42] The two species of Brighamia—B. rockii and B. insignis—are represented in the wild by around 120 individual plants. To ensure these plants set seed, biologists rappel down 3,000-foot (910 m) cliffs to brush pollen onto their stigmas.[43] 14 | 15 | The extant main islands of the archipelago have been above the surface of the ocean for fewer than 10 million years; a fraction of the time biological colonization and evolution have occurred there. The islands are well known for the environmental diversity that occurs on high mountains within a trade winds field. On a single island, the climate around the coasts can range from dry tropical (less than 20 inches or 510 millimeters annual rainfall) to wet tropical; on the slopes, environments range from tropical rainforest (more than 200 inches or 5,100 millimeters per year), through a temperate climate, to alpine conditions with a cold, dry climate. The rainy climate impacts soil development, which largely determines ground permeability, affecting the distribution of streams and wetlands.[citation needed] 16 | 17 | Protected areas 18 | 19 | Nā Pali Coast State Park, Kauaʻi 20 | Several areas in Hawaii are under the protection of the National Park Service.[44] Hawaii has two national parks: Haleakalā National Park located near Kula on the island of Maui, which features the dormant volcano Haleakalā that formed east Maui, and Hawaii Volcanoes National Park in the southeast region of the Hawaiʻi Island, which includes the active volcano Kīlauea and its rift zones. 21 | 22 | There are three national historical parks; Kalaupapa National Historical Park in Kalaupapa, Molokaʻi, the site of a former leper colony; Kaloko-Honokōhau National Historical Park in Kailua-Kona on Hawaiʻi Island; and Puʻuhonua o Hōnaunau National Historical Park, an ancient place of refuge on Hawaiʻi Island's west coast. Other areas under the control of the National Park Service include Ala Kahakai National Historic Trail on Hawaiʻi Island and the USS Arizona Memorial at Pearl Harbor on Oʻahu. 23 | 24 | The Papahānaumokuākea Marine National Monument was proclaimed by President George W. Bush on June 15, 2006. The monument covers roughly 140,000 square miles (360,000 km2) of reefs, atolls, and shallow and deep sea out to 50 miles (80 km) offshore in the Pacific Ocean—an area larger than all of the national parks in the U.S. combined.[45] 25 | 26 | Climate 27 | See also: List of Hawaii tornadoes, List of Hawaii hurricanes, and Climate of Hawaii 28 | 29 | A true-color satellite view of Hawaii shows that most of the vegetation on the islands grows on their northeast sides, which face the wind. The silver glow around the southwest of the islands is the result of calmer waters.[46] 30 | Hawaii's climate is typical for the tropics, although temperatures and humidity tend to be less extreme because of near-constant trade winds from the east. Summer highs usually reach around 88 °F (31 °C) during the day, with the temperature reaching a low of 75 °F (24 °C) at night. 
Winter day temperatures are usually around 83 °F (28 °C); at low elevation they seldom dip below 65 °F (18 °C) at night. Snow, not usually associated with the tropics, falls at 13,800 feet (4,200 m) on Mauna Kea and Mauna Loa on Hawaii Island in some winter months. Snow rarely falls on Haleakalā. Mount Waiʻaleʻale on Kauaʻi has the second-highest average annual rainfall on Earth, about 460 inches (12,000 mm) per year. Most of Hawaii experiences only two seasons; the dry season runs from May to October and the wet season is from October to April.[47] 31 | 32 | The warmest temperature recorded in the state, in Pahala on April 27, 1931, is 100 °F (38 °C), making it tied with Alaska as the lowest record high temperature observed in a U.S. state.[48] Hawaii's record low temperature is 12 °F (−11 °C) observed in May 1979, on the summit of Mauna Kea. Hawaii is the only state to have never recorded sub-zero Fahrenheit temperatures.[48] 33 | 34 | Climates vary considerably on each island; they can be divided into windward and leeward (koʻolau and kona, respectively) areas based upon location relative to the higher mountains. Windward sides face cloud cover 35 | 36 | The Hawaiian archipelago is located 2,000 mi (3,200 km) southwest of the contiguous United States.[29] Hawaii is the southernmost U.S. state and the second westernmost after Alaska. Hawaii, like Alaska, does not border any other U.S. state. It is the only U.S. state that is not geographically located in North America, the only state completely surrounded by water and that is entirely an archipelago, and the only state in which coffee is commercially cultivable. 37 | 38 | In addition to the eight main islands, the state has many smaller islands and islets. Kaʻula is a small island near Niʻihau. The Northwest Hawaiian Islands is a group of nine small, older islands to the northwest of Kauaʻi that extend from Nihoa to Kure Atoll; these are remnants of once much larger volcanic mountains. Across the archipelago are around 130 small rocks and islets, such as Molokini, which are either volcanic, marine sedimentary or erosional in origin.[30] 39 | 40 | Hawaii's tallest mountain Mauna Kea is 13,796 ft (4,205 m) above mean sea level;[31] it is taller than Mount Everest if measured from the base of the mountain, which lies on the floor of the Pacific Ocean and rises about 33,500 feet (10,200 m). 41 | -------------------------------------------------------------------------------- /cifar_corruptor.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import sklearn as sk 4 | import scipy as sp 5 | import sklearn.decomposition as decom 6 | import pdb 7 | 8 | ''' 9 | Must have CIFAR images downloaded. E.g. from https://www.cs.toronto.edu/~kriz/cifar.html. 
10 | '''
11 | 
12 | D = 1024
13 | PIXEL_VALUE_RANGE = 256
14 | VARIANCE_OUTLIER_DISTRIBUTION = 20
15 | 
16 | def unpickle(file):
17 |     import pickle
18 |     with open(file, 'rb') as fo:
19 |         dict = pickle.load(fo, encoding='bytes')
20 |     return dict
21 | 
22 | 
23 | # returns a pair (cifar_red, cifar_by_class): red channels of all images, and images grouped by class
24 | def init():
25 |     cifar1 = unpickle('data/cifar-10-batches-py/data_batch_1')
26 |     cifar2 = unpickle('data/cifar-10-batches-py/data_batch_2')
27 |     cifar3 = unpickle('data/cifar-10-batches-py/data_batch_3')
28 |     cifar4 = unpickle('data/cifar-10-batches-py/data_batch_4')
29 |     cifar5 = unpickle('data/cifar-10-batches-py/data_batch_5')
30 |     cifar6 = unpickle('data/cifar-10-batches-py/test_batch')
31 | 
32 |     cifar = np.concatenate((cifar1[b'data'], cifar2[b'data'], cifar3[b'data'], cifar4[b'data'], cifar5[b'data'], cifar6[b'data']))
33 |     cifar_red = cifar[:,0:1024].astype(int) #keep only the red channel to speed things up
34 | 
35 |     # group the images by label, keeping only the red channel
36 |     cifar_by_label = [[] for i in range(10)]
37 |     for batch in [cifar1,cifar2,cifar3,cifar4,cifar5,cifar6]:
38 |         for i in range(len(batch[b'labels'])):
39 |             cifar_by_label[batch[b'labels'][i]].append(batch[b'data'][i])
40 | 
41 |     for i in range(len(cifar_by_label)):
42 |         cifar_by_label[i] = np.array([y[:1024].astype(int) for y in cifar_by_label[i]])
43 | 
44 | 
45 |     # cifar_by_label already holds exactly the per-class red-channel arrays
46 |     cifar_by_class = cifar_by_label
81 | 
82 |     return cifar_red,cifar_by_class
83 | 
84 | 
85 | # randomly subsamples n elements of np array, returns np array
86 | def subsample(array,n,sample_idx=None):
87 |     if sample_idx is not None:
88 |         return array[sample_idx]
89 |     else:
90 |         return np.array([array[np.random.randint(array.shape[0])] for i in range(n)])
91 | 
92 | 
93 | '''
94 | Computes data used for whitening.
95 | Input:
96 | -data: image data
97 | -fast_whiten: whether to do approximate or exact whitening.
98 | -sample_idx: indices used for whitening 99 | Returns: 100 | -whitening matrix, computed with either exact or appx inverse and 5000 random cifar images from all classes 101 | ''' 102 | def get_whitening(data, fast_whiten=False, sample_idx=None): 103 | 104 | N = 5000 105 | 106 | cifar_red,airplanes = data 107 | 108 | # subsample 109 | whitening_imgs = subsample(cifar_red,N,sample_idx) 110 | 111 | w_mean = np.sum(whitening_imgs,axis=0) / whitening_imgs.shape[0] 112 | w_centered = whitening_imgs - np.outer(np.ones(whitening_imgs.shape[0]), w_mean) 113 | w_cov = np.dot(np.transpose(w_centered), w_centered) / whitening_imgs.shape[0] 114 | 115 | if fast_whiten: 116 | whiten_dim = int(0.3*cifar_red.shape[-1]) 117 | sv = decom.TruncatedSVD(whiten_dim) 118 | sv.fit(w_cov) 119 | 120 | top_evals, top_evecs = sv.singular_values_, sv.components_ 121 | top_evals = 1/np.sqrt(top_evals) 122 | 123 | return (top_evals, top_evecs) 124 | 125 | else: 126 | whiten = sp.linalg.sqrtm(w_cov) 127 | whiten = np.linalg.inv(whiten) 128 | return whiten 129 | 130 | 131 | def get_corrupted_data(data,num_directions,frac_bad,W,one_class=False,which_class=1,fast_whiten=False,sample_idx=None): 132 | ''' 133 | args: 134 | data -- a pair cifar,airplanes consisting of all cifar images and just the airplane images 135 | num_directions -- how many different bad pixels to make 136 | frac_bad -- what percent of data should be outliers 137 | W -- a whitening matrix 138 | one_class -- use only one class of images. IN THIS CASE WE DO NOT RANDOMLY SUBSAMPLE 139 | 140 | returns: 141 | good_data,bad_data -- a pair of numpy arrays with good whitened data and bad whitened data 142 | 143 | we randomly subsample 5000 cifar images and corrupt a subset of them 144 | 145 | ''' 146 | 147 | imgs,by_label= data 148 | 149 | # make a fresh copy of the data so we can modify it 150 | imgs = np.copy(imgs) 151 | 152 | # randomly subsample 5000 images 153 | imgs = subsample(imgs,5000,sample_idx) 154 | 155 | # if using one class of images 156 | if one_class: 157 | imgs = np.copy(by_label[which_class]) 158 | 159 | # split the data 160 | num_bad = int(frac_bad * len(imgs)) 161 | num_good = len(imgs) - num_bad 162 | 163 | bad_data = imgs[:num_bad] 164 | good_data = imgs[num_bad:] 165 | 166 | # compute how many of each outlier type 167 | #hot_fracs = [np.random.randint(VARIANCE_OUTLIER_DISTRIBUTION) for i in range(num_directions)] 168 | hot_fracs = [1.3**i for i in range(num_directions)] 169 | s = float(sum(hot_fracs)) 170 | hot_fracs = map(lambda x: x/s, hot_fracs) 171 | hot_nums = [int(x * num_bad) for x in hot_fracs] 172 | 173 | # introduce corruptions 174 | for i in range(num_directions): 175 | 176 | # pick locations and values for hot pixels at random 177 | px_val = np.random.randint(PIXEL_VALUE_RANGE) 178 | px_loc = np.random.randint(D) 179 | start_idx = sum(hot_nums[:i]) 180 | 181 | #for j in range(start_idx, start_idx+ ): 182 | for j in range(start_idx, start_idx+hot_nums[i]): 183 | bad_data[j][px_loc] = px_val 184 | #start_idx = end_idx 185 | 186 | # whiten and center the data 187 | if fast_whiten: 188 | top_evals, top_evecs = W 189 | X = np.concatenate((bad_data,good_data),axis=0) 190 | projected = np.matmul(top_evecs.transpose()/(top_evecs**2).sum(-1), np.matmul(top_evecs, X.transpose())).transpose() 191 | 192 | X = np.matmul(np.matmul(top_evecs.transpose(), np.diag(top_evals)), np.matmul(top_evecs, X.transpose())).transpose() + (X-projected ) 193 | bad_data_w = X[:len(bad_data)] 194 | good_data_w = X[len(bad_data):] 195 | else: 196 | 
bad_data_w = np.dot(bad_data, W) 197 | good_data_w= np.dot(good_data, W) 198 | 199 | mean = np.sum(np.concatenate((bad_data_w,good_data_w),axis=0), axis=0) / imgs.shape[0] 200 | bad_data_cw = bad_data_w - np.outer(np.ones(bad_data_w.shape[0]), mean) 201 | good_data_cw = good_data_w - np.outer(np.ones(good_data_w.shape[0]), mean) 202 | 203 | return good_data_cw,bad_data_cw 204 | -------------------------------------------------------------------------------- /words.py: -------------------------------------------------------------------------------- 1 | 2 | ''' 3 | Detect outliers in word embeddings 4 | ''' 5 | import torch 6 | import sklearn.decomposition as decom 7 | import data 8 | import utils 9 | import numpy as np 10 | import numpy.linalg as linalg 11 | import re 12 | 13 | import pdb 14 | 15 | USE_ALLENNLP = False 16 | #use flag, as some users reported issues with installation. 17 | if USE_ALLENNLP: 18 | import allennlp.data.tokenizers.word_tokenizer as tokenizer 19 | from allennlp.data.tokenizers.word_filter import StopwordFilter 20 | tk = tokenizer.WordTokenizer() 21 | stop_word_filter = StopwordFilter() 22 | else: 23 | print('Note: using rudimentary tokenizer, for better results enable allennlp.') 24 | stop_word_filter = utils.stop_word_filter() 25 | tk = utils.tokenizer() 26 | 27 | ''' 28 | Combines content and noise words embeddings 29 | ''' 30 | def doc_word_embed_content_noise(content_path, noise_path, whiten_path=None, content_lines=None, noise_lines=None, opt=None): 31 | no_add_set = set() 32 | doc_word_embed_f = doc_word_embed_sen 33 | content_words_ar, content_word_embeds = doc_word_embed_f(content_path, no_add_set, content_lines=content_lines) 34 | words_set = set(content_words_ar) 35 | noise_words_ar, noise_word_embeds = doc_word_embed_f(noise_path, set(content_words_ar), content_lines=noise_lines) 36 | content_words_ar.extend(noise_words_ar) 37 | words_ar = content_words_ar 38 | word_embeds = torch.cat((content_word_embeds, noise_word_embeds), dim=0) 39 | 40 | whitening = opt.whiten if opt is not None else True 41 | if whitening and whiten_path is not None: 42 | #use an article of data in the inliers topic to whiten data. 43 | whiten_ar, whiten_word_embeds = doc_word_embed_f(whiten_path, set()) #, content_lines=content_lines)#,content_lines=content_lines) 44 | 45 | whiten_cov = utils.cov(whiten_word_embeds) 46 | fast_whiten = False #True 47 | if not fast_whiten: 48 | U, D, V_t = linalg.svd(whiten_cov) 49 | #D_avg = D.mean() #D[len(D)//2] 50 | #print('D_avg! {}'.format(D_avg)) 51 | 52 | cov_inv = torch.from_numpy(np.matmul(linalg.pinv(np.diag(np.sqrt(D))), U.transpose())).to(utils.device) 53 | #cov_inv = torch.from_numpy(np.matmul(U, np.matmul(linalg.pinv(np.diag(np.sqrt(D))), V_t))).to(utils.device) 54 | 55 | word_embeds0=word_embeds 56 | #change multiplication order! 
57 | word_embeds = torch.mm(cov_inv, word_embeds.t()).t() 58 | if False: 59 | 60 | after_cov = utils.cov(word_embeds) 61 | U1, D1, V_t1 = linalg.svd(after_cov) 62 | pdb.set_trace() 63 | 64 | content_whitened = torch.mm(cov_inv, content_word_embeds.t()).t() 65 | after_cov2 = utils.cov(content_whitened) 66 | _, D1, _ = linalg.svd(after_cov2) 67 | print('after whitening D {}'.format(D1[:7])) 68 | else: 69 | #### faster whitening 70 | sv = decom.TruncatedSVD(30) 71 | sv.fit(whiten_cov.cpu().numpy()) 72 | top_evals, top_evecs = sv.singular_values_, sv.components_ 73 | top_evals = torch.from_numpy(1/np.sqrt(top_evals)).to(utils.device) 74 | top_evecs = torch.from_numpy(top_evecs).to(utils.device) 75 | #pdb.set_trace() 76 | 77 | X = word_embeds 78 | projected = torch.mm(top_evecs.t()/(top_evecs**2).sum(-1), torch.mm(top_evecs, X.t())).t() 79 | #eval_ones = torch.eye(len(top_evals), device=top_evals.device) 80 | ##projected = torch.mm(torch.mm(top_evecs.t(), eval_ones), torch.mm(top_evecs, X.t())).t() 81 | 82 | #(d x k) * (k x d) * (d x n), project onto and squeeze the components along top evecs 83 | ##word_embeds = torch.mm((top_evecs/top_evals.unsqueeze(-1)).t(), torch.mm(top_evecs, X.t())).t() + (X-torch.mm(top_evecs.t(), torch.mm(top_evecs, X.t()) ).t()) 84 | #pdb.set_trace() 85 | ##word_embeds = torch.mm((top_evecs/(top_evals*(top_evecs**2).sum(-1)).unsqueeze(-1)).t(), torch.mm(top_evecs, X.t())).t() + (X-projected ) 86 | #word_embeds = torch.mm((top_evecs/(top_evals*(top_evecs**2).sum(-1)).unsqueeze(-1)).t(), torch.mm(top_evecs, X.t())).t() + (X-projected ) 87 | word_embeds = torch.mm(torch.mm(top_evecs.t(), top_evals.diag()), torch.mm(top_evecs, X.t())).t() + (X-projected ) 88 | 89 | noise_idx = torch.LongTensor(list(range(len(content_word_embeds), len(word_embeds)))).to(utils.device) 90 | if False: 91 | #normalie per direction 92 | word_embeds_norm = ((word_embeds-word_embeds.mean(0))**2).sum(dim=1, keepdim=True).sqrt() 93 | debug_top_dir = False 94 | if debug_top_dir: 95 | w1 = (content_word_embeds - word_embeds.mean(0))#/word_embeds_norm[:len(content_word_embeds)] 96 | 97 | w2 = (noise_word_embeds - word_embeds.mean(0))#/word_embeds_norm[len(content_word_embeds):] 98 | mean_diff = ((w1.mean(0)-w2.mean(0))**2).sum().sqrt() 99 | w1_norm = (w1**2).sum(-1).sqrt().mean() 100 | w2_norm = (w2**2).sum(-1).sqrt().mean() 101 | X = (word_embeds - word_embeds.mean(0))#/word_embeds_norm 102 | cov = torch.mm(X.t(), X)/word_embeds.size(0) 103 | U, D, V_t = linalg.svd(cov.cpu().numpy()) 104 | U1 = torch.from_numpy(U[1]).to(utils.device) 105 | mean1_dir = w1.mean(0) 106 | mean1_proj = (mean1_dir*U1).sum() 107 | mean2_dir = w2.mean(0) 108 | mean2_proj = (mean2_dir*U1).sum() 109 | diff_proj = ((mean1_dir-mean2_dir)*U1).sum() 110 | 111 | #plot histogram of these projections 112 | proj1 = (w1*U1).sum(-1) 113 | proj2 = (w2*U1).sum(-1) 114 | utils.hist(proj1, 'inliers') 115 | utils.hist(proj2, 'outliers') 116 | pdb.set_trace() 117 | #word_embeds=(word_embeds - word_embeds.mean(0))/word_embeds_norm 118 | return words_ar, word_embeds, noise_idx 119 | 120 | ''' 121 | Read in file, get embeddings, remove stop words 122 | Input: 123 | -no_add_set: words to not add 124 | ''' 125 | def doc_word_embed(path, no_add_set, content_lines=None): 126 | if content_lines is not None: 127 | lines = content_lines 128 | else: 129 | with open(path, 'r') as file: 130 | lines = file.readlines() 131 | 132 | words = [] 133 | vocab, embeds = data.process_glove_data(dim=100) 134 | embed_map = dict(zip(vocab, embeds)) 135 | 136 | #list 
of list of tokens 137 | tokens_l = tk.batch_tokenize(lines) 138 | #stop_word_filter = StopwordFilter() 139 | tokens_l1 = [] 140 | for sentence_l in tokens_l: 141 | tokens_l1.extend(sentence_l) 142 | tokens_l = [tokens_l1] 143 | 144 | n_avg = 5 #5 145 | word_embeds = [] 146 | words_ar = [] 147 | added_set = set(no_add_set) 148 | for sentence in tokens_l: 149 | 150 | sentence = stop_word_filter.filter_words(sentence) 151 | cur_embed = torch.zeros_like(embed_map['a']) 152 | cur_counter = 0 153 | for j,w in enumerate(sentence): 154 | w = w.text.lower() if USE_ALLENNLP else w.lower() 155 | if w in embed_map:# and w not in added_set: 156 | if cur_counter == n_avg:# or j==len(sentence)-1: 157 | added_set.add(w) 158 | words_ar.append(w) 159 | #word_embeds.append(embed_map[w]) 160 | #word_embeds.append(cur_embed/(cur_counter if cur_counter > 0 else 1)) 161 | word_embeds.append(cur_embed/n_avg) 162 | 163 | cur_embed = torch.zeros_like(embed_map['a']) 164 | cur_counter = 0 165 | else: 166 | cur_counter += 1 167 | cur_embed += embed_map[w] 168 | 169 | word_embeds = torch.stack(word_embeds, dim=0).to(utils.device) 170 | if False: #is_noise :#False: #sanity check 171 | word_embeds[:] = word_embeds.mean(0) #word_embeds[0] 172 | return words_ar, word_embeds 173 | ''' 174 | embedding of sentences. 175 | ''' 176 | def doc_word_embed_sen(path, no_add_set, content_lines=None): 177 | if content_lines is not None: 178 | lines = content_lines 179 | else: 180 | with open(path, 'r') as file: 181 | lines = file.readlines() 182 | 183 | lines1 = [] 184 | patt = re.compile('[;\.:!,?]') 185 | for line in lines: 186 | #cur_lines = [] 187 | for cur_line in patt.split(line): 188 | lines1.append(cur_line) 189 | #lines1.append(cur_lines) 190 | lines = lines1 191 | 192 | words = [] 193 | vocab, embeds = data.process_glove_data(dim=100) 194 | embed_map = dict(zip(vocab, embeds)) 195 | #list of list of tokens 196 | tokens_l = tk.batch_tokenize(lines) 197 | 198 | ''' 199 | tokens_l1 = [] 200 | for sentence_l in tokens_l: 201 | tokens_l1.extend(sentence_l) 202 | tokens_l = [tokens_l1] 203 | ''' 204 | max_len = 200 205 | word_embeds = [] 206 | words_ar = [] 207 | #added_set = set(no_add_set) 208 | for sentence in tokens_l: 209 | sentence = stop_word_filter.filter_words(sentence) 210 | if len(sentence) < 4: 211 | continue 212 | cur_embed = torch.zeros_like(embed_map['a']) 213 | cur_counter = 0 214 | for j,w in enumerate(sentence): 215 | w = w.text.lower() if USE_ALLENNLP else w.lower() 216 | if w in embed_map:# and w not in added_set: 217 | if cur_counter == max_len:# or j==len(sentence)-1: 218 | #added_set.add(w) 219 | words_ar.append(w) 220 | #word_embeds.append(embed_map[w]) 221 | #word_embeds.append(cur_embed/(cur_counter if cur_counter > 0 else 1)) 222 | word_embeds.append(cur_embed/max_len) 223 | 224 | cur_embed = torch.zeros_like(embed_map['a']) 225 | cur_counter = 0 226 | else: 227 | cur_counter += 1 228 | cur_embed += embed_map[w] 229 | 230 | word_embeds.append(cur_embed / len(sentence)) 231 | 232 | word_embeds = torch.stack(word_embeds, dim=0).to(utils.device) 233 | if False: #is_noise :#False: #sanity check 234 | word_embeds[:] = word_embeds.mean(0) #word_embeds[0] 235 | 236 | return words_ar, word_embeds 237 | 238 | def doc_word_embed0(path, no_add_set): 239 | with open(path, 'r') as file: 240 | lines = file.readlines() 241 | 242 | words = [] 243 | vocab, embeds = data.process_glove_data(dim=100) 244 | embed_map = dict(zip(vocab, embeds)) 245 | 246 | #list of list of tokens 247 | tokens_l = tk.batch_tokenize(lines) 248 
| 249 | word_embeds = [] 250 | words_ar = [] 251 | added_set = set(no_add_set) 252 | for sentence in tokens_l: 253 | sentence = stop_word_filter.filter_words(sentence) 254 | for w in sentence: 255 | 256 | w = w.text.lower() if USE_ALLENNLP else w.lower() 257 | if w in embed_map and w not in added_set: 258 | added_set.add(w) 259 | words_ar.append(w) 260 | word_embeds.append(embed_map[w]) 261 | 262 | word_embeds = torch.stack(word_embeds, dim=0).to(utils.device) 263 | if False: #sanity check 264 | word_embeds[:] = word_embeds[0] 265 | #word_embeds = word_embeds / (word_embeds**2).sum(dim=1, keepdim=True).sqrt() 266 | return words_ar, word_embeds 267 | 268 | ''' 269 | Create sentence embeddings on the file with the supplied path. 270 | ''' 271 | def doc_sentence_embed(path): 272 | with open(path, 'r') as file: 273 | lines = file.readlines() 274 | 275 | lines1 = [] 276 | for line in lines: 277 | lines1.extend(line.lower().split('.') ) 278 | 279 | lines = lines1 280 | words = [] 281 | vocab, embeds = data.process_glove_data(dim=100) 282 | embed_map = dict(zip(vocab, embeds)) 283 | 284 | tokens_l = tk.batch_tokenize(lines) 285 | word_embeds = [] 286 | words_ar = [] 287 | added_set = set() 288 | for sentence in tokens_l: 289 | if len(sentence) < 3: 290 | continue 291 | sentence_embed = 0 292 | aa = True 293 | for w in sentence: 294 | w = w.text.lower() if USE_ALLENNLP else w.lower() 295 | if w in embed_map:# and w not in added_set: 296 | ##added_set.add(w) 297 | ##words_ar.append(w) 298 | ##word_embeds.append(embed_map[w]) 299 | sentence_embed += embed_map[w] 300 | aa = False 301 | if aa: 302 | continue 303 | words_ar.append(sentence) 304 | word_embeds.append(sentence_embed/len(sentence)) 305 | 306 | word_embeds = torch.stack(word_embeds, dim=0).to(utils.device) 307 | #word_embeds = word_embeds / (word_embeds**2).sum(dim=1, keepdim=True).sqrt() 308 | return words_ar, word_embeds 309 | 310 | if __name__=='__main__': 311 | doc_word_embed('data/test.txt', set()) 312 | 313 | -------------------------------------------------------------------------------- /pixel.py: -------------------------------------------------------------------------------- 1 | 2 | import matplotlib 3 | matplotlib.use('agg') 4 | import matplotlib.pyplot as plt 5 | 6 | import torch 7 | import numpy as np 8 | import numpy.linalg as linalg 9 | import sklearn.decomposition as decom 10 | from scipy.stats import ortho_group 11 | import scipy.stats as st 12 | import random 13 | import utils 14 | import os.path as osp 15 | import data 16 | import baselines 17 | import words 18 | import mean 19 | import cifar_corruptor as cif 20 | import pdb 21 | 22 | ''' 23 | Outlier detection and mean estimation on CIFAR data. 
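Rough shape of one trial, as a sketch of what the functions below do (arguments
abbreviated; see test_pixel_dirs/test_pixel_lamb for the real settings):
    cif_data = cif.init()                                   #load CIFAR batches
    W = cif.get_whitening(cif_data, fast_whiten=...)        #whitening data
    X, X_n = cif.get_corrupted_data(cif_data, n_dir, p, W)  #inliers, hot-pixel outliers
    #...concatenate, center, then score with mean.compute_tau1_tau0 and the
    #baselines in baselines.py, reporting AUCs.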
24 | ''' 25 | 26 | def test_pixel_dirs(opt): 27 | opt.n = 8000 #10000 #50 28 | opt.feat_dim = 1024 #1400 #1000 29 | n = opt.n 30 | feat_dim = opt.feat_dim 31 | #number of top dirs for calculating tau0 32 | opt.n_top_dir = 1 33 | #number of directions to add noise 34 | opt.p = 0.2 #default total portion corrupted 35 | 36 | #use original samples for whitening 37 | same_whitening_samples = False 38 | cif_data = cif.init() 39 | #number of directions 40 | n_dir_l = list(range(1, 16, 3)) 41 | n_repeat = 20 42 | data_l = [] 43 | n_sample = 5000 44 | for n_dir in n_dir_l: 45 | cur_data_l = [] 46 | for _ in range(n_repeat): 47 | if same_whitening_samples: 48 | sample_idx = np.random.randint(low=0, high=n_sample, size=(n_sample,)) 49 | else: 50 | sample_idx = None 51 | if opt.whiten: 52 | whiten_mx = cif.get_whitening(cif_data, fast_whiten=opt.fast_whiten, sample_idx=sample_idx) 53 | else: 54 | whiten_mx = np.eye(feat_dim) 55 | X, X_n = cif.get_corrupted_data(cif_data, n_dir, opt.p, whiten_mx, fast_whiten=opt.fast_whiten, sample_idx=sample_idx) 56 | X, X_n = torch.from_numpy(X.astype(np.float32)).to(utils.device), torch.from_numpy(X_n).to(utils.device, torch.float32) 57 | noise_idx = torch.LongTensor(list(range(len(X_n)))).to(utils.device) + len(X) 58 | X = torch.cat((X, X_n), dim=0) 59 | #pdb.set_trace() 60 | X = X - X.mean(0) 61 | if opt.fast_jl: 62 | ##enable if doing fast JL 63 | X = utils.pad_to_2power(X) 64 | cur_data_l.append((X, noise_idx)) 65 | data_l.append(cur_data_l) 66 | 67 | print('samples feat dim {}'.format(X.size(1))) 68 | 69 | #which baseline to use as tau0, can be 'isolation_forest' 70 | opt.baseline = 'tau0' #'l2' #'tau0' #'isolation_forest' #'l2' # 71 | print('baseline method: {}'.format(opt.baseline)) 72 | opt.n_iter = 1 73 | #amount to remove wrt cur_p 74 | opt.remove_factor = 1./opt.n_iter 75 | 76 | #scalar to multiply norm of noise vectors with. This is deprecated 77 | opt.norm_scale = 1.3 78 | #amount to divide noise norm by 79 | opt.noise_norm_div = 8 80 | opt.lamb_multiplier = 6 81 | 82 | #opt.n_dir = N_DIR 83 | #n_dir = opt.n_dir 84 | acc_l = [] 85 | #numpy array used for plotting. 
86 | k_l = [] 87 | p_l = [] 88 | tau_l = [] 89 | res_l = [] 90 | 91 | #no need to include tau0 92 | if opt.fast_jl: 93 | outlier_methods_l = ['l2'] 94 | else: 95 | outlier_methods_l = ['l2', 'iso forest', 'ell env', 'lof', 'knn'] 96 | 97 | #+3 for tau1 tau0 and n_dir 98 | scores_ar = np.zeros((len(n_dir_l), len(outlier_methods_l)+3)) 99 | std_ar = np.zeros((len(n_dir_l), len(outlier_methods_l)+3)) 100 | 101 | for j, n_dir in enumerate(n_dir_l): 102 | 103 | cur_data_l = data_l[j] 104 | opt.n_dir = n_dir 105 | 106 | #percentage to remove 107 | opt.remove_p = opt.p*opt.remove_factor 108 | #for cur_dir in range(3, n_dir, 9): 109 | #cur_res_l = [n, feat_dim, n_noise_dir, opt.p, opt.lamb_multiplier, opt.norm_scale] 110 | acc_mx = torch.zeros(n_repeat, 2) 111 | cur_scores_ar = np.zeros((n_repeat, len(outlier_methods_l)+2)) 112 | for i in range(n_repeat): 113 | X, noise_idx = cur_data_l[i] 114 | ##cur_scores_ar[i] = train(X, n_noise_dir, opt.p, outlier_methods_l, opt) 115 | cur_scores_ar[i] = test_pixel2(X, noise_idx, outlier_methods_l, opt) 116 | acc_mx[i, 0] = cur_scores_ar[i, 1] #acc0 117 | acc_mx[i, 1] = cur_scores_ar[i, 0] #acc1 118 | 119 | scores_ar[j, 1:] = np.mean(cur_scores_ar, axis=0) 120 | if opt.use_std: 121 | std_ar[j, 1:] = np.std(cur_scores_ar, axis=0) 122 | else: 123 | se = np.clip(st.sem(cur_scores_ar, axis=0), 1e-3, None) 124 | low, high = st.t.interval(0.95, cur_scores_ar.shape[0]-1, loc=scores_ar[j, 1:], scale=se) 125 | std_ar[j, 1:] = (high - low)/2. 126 | 127 | scores_ar[j, 0] = n_dir 128 | std_ar[j, 0] = n_dir 129 | 130 | acc_mean = acc_mx.mean(dim=0) 131 | acc0, acc1 = acc_mean[0].item(), acc_mean[1].item() 132 | print('n_noise_dir {} lamb {} acc0 {} acc1 {}'.format(n_dir, opt.lamb_multiplier, acc0, acc1)) 133 | #cur_res_l.extend([acc0, acc1]) 134 | 135 | print('About to plot!') 136 | print(std_ar) 137 | pdb.set_trace() 138 | #if plot_lambda: 139 | #legends = ['lamb', 'acc', 'tau'] 140 | #else: 141 | # legends = ['k', 'acc', 'tau', 'p'] 142 | #plot both tau1 vs tau0, and tau1 against all baselines. 
143 | ##utils.plot_acc_syn_lamb(p_l, acc_l, tau_l, legends, opt) 144 | 145 | scores_ar = scores_ar.transpose() 146 | std_ar = std_ar.transpose() 147 | utils.plot_scatter_flex(scores_ar, ['tau1', 'tau0'] + outlier_methods_l, opt, std_ar=std_ar) 148 | m = {'opt':opt, 'scores_ar':scores_ar, 'conf_ar':std_ar} 149 | with open(osp.join('results', opt.dir, 'dirs_data.npy'), 'wb') as f: 150 | torch.save(m, f) 151 | print('saved under {}'.format(f)) 152 | 153 | def test_pixel_lamb(opt): 154 | 155 | n_dir_l = [3, 6, 10] 156 | #n_dir_l = [3] 157 | legend_l = [] 158 | scores_l = [] 159 | conf_l = [] 160 | cif_data = cif.init() 161 | 162 | if opt.compute_scores_diff: 163 | for n_dir in n_dir_l: 164 | legend_l.append(str(n_dir)) 165 | opt.n_dir = n_dir 166 | mean1, conf1 = test_pixel_lamb2(cif_data, opt) 167 | #scores_l.append(mean1[:, 1]) 168 | #conf_l.append(conf1[:, 1]) 169 | scores_l.append(mean1) 170 | conf_l.append(conf1) 171 | 172 | n_lamb = mean1.shape[-1] 173 | scores_ar = np.stack(scores_l, axis=0) 174 | conf_ar = np.stack(conf_l, axis=0) 175 | tau0_ar = np.concatenate((mean1[0].reshape(1, -1), scores_ar[:,2,:].reshape(len(n_dir_l), n_lamb)), axis=0) 176 | tau0_conf_ar = np.concatenate((mean1[0].reshape(1, -1), conf_ar[:,2,:].reshape(len(n_dir_l), n_lamb)), axis=0) 177 | 178 | l2_ar = np.concatenate((mean1[0].reshape(1, -1), scores_ar[:,3,:].reshape(len(n_dir_l), n_lamb)), axis=0) 179 | l2_conf_ar = np.concatenate((mean1[0].reshape(1, -1), conf_ar[:,3,:].reshape(len(n_dir_l), n_lamb)), axis=0) 180 | #scores_ar = np.concatenate((mean1[:, 0].reshape(1, -1), scores_ar[:,:,3]), axis=0) 181 | #scores_ar = np.stack([mean1[:, 0]]+conf_l, axis=0) 182 | #np.concatenate((mean1[:, 0].reshape(1,-1), np.stack(scores_l, axis=0)), axis=0) 183 | pdb.set_trace() 184 | utils.plot_scatter_flex(tau0_ar, legend_l, opt, std_ar=tau0_conf_ar, name='tau0') 185 | utils.plot_scatter_flex(l2_ar, legend_l, opt, std_ar=l2_conf_ar, name='l2') 186 | else: 187 | for n_dir in n_dir_l: 188 | legend_l.append(str(n_dir)) 189 | opt.n_dir = n_dir 190 | mean1, conf1 = test_pixel_lamb2(cif_data, opt) 191 | scores_l.append(mean1[1]) 192 | conf_l.append(conf1[1]) 193 | 194 | scores_ar = np.stack([mean1[0]]+scores_l, axis=0) 195 | conf_ar = np.stack([mean1[0]]+conf_l, axis=0) 196 | #scores_ar = np.concatenate((mean1[:, 0].reshape(1,-1), np.stack(scores_l, axis=0)), axis=0) 197 | #conf_ar = np.concatenate((mean1[:, 0].reshape(1,-1), np.stack(conf_l, axis=0)), axis=0) 198 | pdb.set_trace() 199 | utils.plot_scatter_flex(scores_ar, legend_l, opt, std_ar=conf_ar) 200 | 201 | m = {'opt':opt, 'scores_ar':scores_ar, 'conf_ar':conf_ar} 202 | with open(osp.join('results', opt.dir, 'lamb_data.npy'), 'wb') as f: 203 | torch.save(m, f) 204 | print('saved under {}'.format(f)) 205 | 206 | ''' 207 | Returns: 208 | -mean and confidence intervals of various scores, tau1 + baselines. 
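Both arrays are transposed before returning: row 0 holds the lambda values
(range(0, 22, 3)), row 1 the tau1 (QUE) AUCs, row 2 tau0, and the remaining
rows the baselines; with opt.compute_scores_diff, rows 2 and 3 instead hold the
tau1-tau0 and tau1-l2 gaps. Illustrative use, as in test_pixel_lamb above:
    mean1, conf1 = test_pixel_lamb2(cif.init(), opt)
    tau1_auc_per_lamb = mean1[1]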
209 | ''' 210 | def test_pixel_lamb2(cif_data, opt): 211 | 212 | #number of top dirs for calculating tau0 213 | opt.n_top_dir = 1 214 | #number of directions to add noise 215 | opt.p = 0.2 #default total portion corrupted 216 | 217 | n_repeat = 5 218 | data_l = [] 219 | 220 | for _ in range(n_repeat): 221 | whiten_mx = cif.get_whitening(cif_data, fast_whiten=opt.fast_whiten) 222 | X, X_n = cif.get_corrupted_data(cif_data, opt.n_dir, opt.p, whiten_mx, fast_whiten=opt.fast_whiten) 223 | X, X_n = torch.from_numpy(X.astype(np.float32)).to(utils.device), torch.from_numpy(X_n).to(utils.device, torch.float32) 224 | noise_idx = torch.LongTensor(list(range(len(X_n)))).to(utils.device) + len(X) 225 | X = torch.cat((X, X_n), dim=0) 226 | X = X - X.mean(0) 227 | 228 | if opt.fast_jl: 229 | X = utils.pad_to_2power(X) 230 | data_l.append((X, noise_idx)) 231 | print('samples feat dim: {}'.format(X.size(1))) 232 | 233 | #which baseline to use as tau0, can be 'isolation_forest' 234 | opt.baseline = 'tau0' #'l2' #'tau0' #'isolation_forest' #'l2' # 235 | print('baseline method: {}'.format(opt.baseline)) 236 | opt.n_iter = 1 237 | #amount to remove wrt cur_p 238 | opt.remove_factor = 1./opt.n_iter 239 | 240 | #scalar to multiply norm of noise vectors with. This is deprecated 241 | opt.norm_scale = 1.3 242 | #amount to divide noise norm by 243 | opt.noise_norm_div = 8 244 | 245 | #opt.n_dir = N_DIR 246 | #n_dir = opt.n_dir 247 | acc_l = [] 248 | #numpy array used for plotting. 249 | k_l = [] 250 | p_l = [] 251 | tau_l = [] 252 | res_l = [] 253 | 254 | #no need to include tau0 255 | if opt.fast_whiten: 256 | #for studying lambda only need to compare with best baselines 257 | outlier_methods_l = ['l2'] 258 | else: 259 | outlier_methods_l = ['l2', 'iso forest', 'ell env', 'lof', 'knn'] 260 | 261 | lamb_l = list(range(0, 22, 3)) 262 | #+3 for tau1 tau0 and lamb 263 | scores_ar = np.zeros((len(lamb_l), len(outlier_methods_l)+3)) 264 | std_ar = np.zeros((len(lamb_l), len(outlier_methods_l)+3)) 265 | 266 | for j, lamb in enumerate(lamb_l): 267 | 268 | opt.lamb_multiplier = lamb 269 | #percentage to remove 270 | opt.remove_p = opt.p*opt.remove_factor 271 | #for cur_dir in range(3, n_dir, 9): 272 | #cur_res_l = [n, feat_dim, n_noise_dir, opt.p, opt.lamb_multiplier, opt.norm_scale] 273 | acc_mx = torch.zeros(n_repeat, 2) 274 | cur_scores_ar = np.zeros((n_repeat, len(outlier_methods_l)+2)) 275 | for i in range(n_repeat): 276 | X, noise_idx = data_l[i] 277 | ##cur_scores_ar[i] = train(X, n_noise_dir, opt.p, outlier_methods_l, opt) 278 | cur_scores_ar[i] = test_pixel2(X, noise_idx, outlier_methods_l, opt) 279 | acc_mx[i, 0] = cur_scores_ar[i, 1] #acc0 280 | acc_mx[i, 1] = cur_scores_ar[i, 0] #acc1 281 | 282 | ''' 283 | if opt.use_std: 284 | std_ar[j, 1:] = np.std(cur_scores_ar, axis=0) 285 | else: 286 | se = np.clip(st.sem(cur_scores_ar, axis=0), 1e-3, None) 287 | low, high = st.t.interval(0.95, cur_scores_ar.shape[0]-1, loc=scores_ar[j, 1:], scale=se) 288 | std_ar[j, 1:] = (high - low)/2. 289 | ''' 290 | 291 | if opt.compute_scores_diff: 292 | #tau1 - tau0 293 | cur_scores_ar[:, 1] = cur_scores_ar[:, 0] - cur_scores_ar[:, 1] 294 | cur_scores_ar[:, 2] = cur_scores_ar[:, 0] - cur_scores_ar[:, 2] 295 | 296 | scores_ar[j, 1:] = cur_scores_ar.mean(axis=0) 297 | if opt.use_std: 298 | std_ar[j, 1:] = cur_scores_ar.std(axis=0) 299 | else: 300 | #low, high = st.t.interval(0.95, n_repeat-1, loc=auc_prob1, scale=st.sem(cur_auc_prob1_l)) 301 | #conf_int1 = (high - low)/2. 
302 | se = np.clip(st.sem(cur_scores_ar, axis=0), 1e-4, None) 303 | low, high = st.t.interval(0.95, n_repeat-1, loc=scores_ar[j, 1:], scale=se) 304 | std_ar[j, 1:] = (high - low)/2. 305 | 306 | scores_ar[j, 0] = lamb 307 | 308 | scores_ar = scores_ar.transpose() 309 | std_ar = std_ar.transpose() 310 | print(std_ar) 311 | plot = False 312 | if plot: 313 | print('About to plot!') 314 | pdb.set_trace() 315 | utils.plot_scatter_flex(scores_ar, ['tau1', 'tau0'] + outlier_methods_l, opt, std_ar=std_ar) 316 | 317 | return scores_ar, std_ar 318 | 319 | ''' 320 | Returns: 321 | -scores of tau1 and baselines, length of outlier_method_l + 2 322 | ''' 323 | def test_pixel2(X, noise_idx, outlier_method_l, opt): 324 | 325 | ''' 326 | content_path = 'data/sherlock.txt' if content_lines is None else None 327 | #noise_path = 'data/news_noise1.txt' if noise_lines is not None else None 328 | noise_path = 'data/sherlock_noise3.txt' if noise_lines is None else None 329 | ''' 330 | #words_ar, X, noise_idx = words.doc_word_embed_content_noise(content_path, noise_path, 'data/sherlock_whiten.txt', content_lines, noise_lines)#.to(utils.device) #('data/sherlock_noise3.txt', 'data/test_noise.txt')#.to(utils.device) 331 | 332 | noise_idx = noise_idx.unsqueeze(-1) 333 | print('** {} number of outliers {}'.format(X.size(0), len(noise_idx))) 334 | #pdb.set_trace() 335 | 336 | opt.n, opt.feat_dim = X.size(0), X.size(1) 337 | #percentage of points to remove. 338 | opt.remove_p = 0.2 339 | #number of top dirs for calculating tau0. 340 | opt.n_top_dir = 1 341 | opt.n_iter = 1 342 | #use select_idx rather than the scores tau, since tau's are scores for remaining points after outliers. 343 | tau1, select_idx1, n_removed1, tau0, select_idx0, n_removed0 = mean.compute_tau1_tau0(X, opt) 344 | ##tau1, select_idx1, n_removed1, tau0, select_idx0, n_removed0 = torch.ones(len(X)).to(utils.device), None, 5, torch.ones(len(X)).to(utils.device), None, 5 #mean.compute_tau1_tau0(X, opt) 345 | 346 | all_idx = torch.zeros(X.size(0), device=utils.device) 347 | ones = torch.ones(noise_idx.size(0), device=utils.device) 348 | 349 | all_idx.scatter_add_(dim=0, index=noise_idx.squeeze(), src=ones) 350 | 351 | opt.baseline = 'tau0' #'lof'#'knn' 'l2' #'l2' #'tau0' #'l2'#'isolation_forest'#'dbscan' #'isolation_forest' 352 | scores_l = [] 353 | 354 | for method in outlier_method_l: 355 | if method == 'iso forest': 356 | tau = baselines.isolation_forest(X) 357 | elif method == 'ell env': 358 | tau = baselines.ellenv(X) 359 | elif method == 'lof': 360 | tau = baselines.knn_dist_lof(X) 361 | elif method == 'dbscan': 362 | tau = baselines.dbscan(X) 363 | elif method == 'l2': 364 | tau = baselines.l2(X) 365 | elif method == 'knn': 366 | tau = baselines.knn_dist(X) 367 | elif method == 'tau2': 368 | select_idx2 = torch.LongTensor(list(range(len(X)))).to(utils.device) 369 | tau = mean.compute_tau2(X, select_idx2, opt) 370 | else: 371 | raise Exception('Outlier method {} not supported'.format(method)) 372 | good_scores = tau[all_idx==0] 373 | bad_scores = tau[all_idx==1] 374 | auc = utils.auc(good_scores, bad_scores) 375 | scores_l.append(auc) 376 | 377 | if opt.n_iter > 1: 378 | #all_idx = torch.LongTensor(range(len(X_classes))).to(utils.device) 379 | all_idx = torch.LongTensor(range(len(X))).to(utils.device) 380 | zeros1 = torch.zeros(len(X), device=utils.device) 381 | zeros1[select_idx1] = 1 382 | outliers_idx1 = all_idx[zeros1==0] 383 | zeros0 = torch.zeros(len(X), device=utils.device) 384 | zeros0[select_idx0] = 1 385 | outliers_idx0 = all_idx[zeros0==0] 
386 | if opt.baseline != 'tau0': 387 | outliers_idx0 = torch.topk(tau0, k=n_removed0, largest=True)[1] 388 | 389 | else: 390 | #should not be used if n_iter > 1 391 | outliers_idx0 = torch.topk(tau0, k=n_removed0, largest=True)[1] 392 | outliers_idx1 = torch.topk(tau1, k=n_removed1, largest=True)[1] 393 | 394 | #Distribution of true outliers with respect to the predicted scores. 395 | compute_auc_b = True 396 | if compute_auc_b: 397 | #complement of noise_idx 398 | #X_range = list(range(len(X))) 399 | zeros = torch.zeros(len(tau1), device=utils.device) 400 | zeros[noise_idx] = 1 401 | 402 | inliers_tau1 = tau1[zeros==0] #this vs index_select 403 | outliers_tau1 = tau1[zeros==1]#torch.index_select(tau1, dim=0, index=noise_idx) 404 | ##utils.inlier_outlier_hist(inliers_tau1, outliers_tau1, 'tau1', high=40) 405 | tau1_auc = utils.auc(inliers_tau1, outliers_tau1) 406 | 407 | inliers_tau0 = tau0[zeros==0] #this vs index_select 408 | outliers_tau0 = tau0[zeros==1] #torch.index_select(tau0, dim=0, index=noise_idx) 409 | ##utils.inlier_outlier_hist(inliers_tau0, outliers_tau0, opt.baseline, high=40) 410 | tau0_auc = utils.auc(inliers_tau0, outliers_tau0) 411 | 412 | print('tau1 size {}'.format(tau1.size(0))) 413 | outliers_idx0_exp = outliers_idx0.unsqueeze(0).expand(len(noise_idx), -1) 414 | outliers_idx1_exp = outliers_idx1.unsqueeze(0).expand(len(noise_idx), -1) 415 | assert len(outliers_idx0) == len(outliers_idx1) 416 | 417 | tau0_cor = noise_idx.eq(outliers_idx0_exp).sum() 418 | tau1_cor = noise_idx.eq(outliers_idx1_exp).sum() 419 | print('{}_cor {} out of {} tau1_cor {} out of {}'.format(opt.baseline, tau0_cor, len(outliers_idx0), tau1_cor, len(outliers_idx1))) 420 | 421 | #return tau0_cor.item()/len(outliers_idx0), tau1_cor.item()/len(outliers_idx0), tau0_auc, tau1_auc #0 instead of 1 422 | return [tau1_auc, tau0_auc] + scores_l 423 | 424 | 425 | if __name__ == '__main__': 426 | opt = utils.parse_args() 427 | 428 | opt.use_std = True 429 | opt.compute_scores_diff = True 430 | opt.whiten = True 431 | opt.fast_whiten = True 432 | 433 | #directory to store results 434 | opt.dir = 'cifar' 435 | method = opt.experiment_type 436 | if method == 'image_lamb': 437 | opt.type = 'lamb' 438 | test_pixel_lamb(opt) 439 | elif method == 'image_dirs': 440 | opt.type = 'dirs' 441 | test_pixel_dirs(opt) 442 | else: 443 | raise Exception('Wrong script for experiment type {}'.format(method)) 444 | 445 | 446 | -------------------------------------------------------------------------------- /part_utils.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Utilities functions 3 | ''' 4 | import torch 5 | import numpy 6 | import numpy as np 7 | import os 8 | import os.path as osp 9 | import pickle 10 | import argparse 11 | from scipy.stats import ortho_group 12 | import matplotlib 13 | matplotlib.use('agg') 14 | import matplotlib.pyplot as plt 15 | import seaborn as sns 16 | import pandas as pd 17 | 18 | import pdb 19 | 20 | #parse the configs from config file 21 | 22 | def read_config(): 23 | with open('config', 'r') as file: 24 | lines = file.readlines() 25 | 26 | name2config = {} 27 | for line in lines: 28 | 29 | if line[0] == '#' or '=' not in line: 30 | continue 31 | line_l = line.split('=') 32 | name2config[line_l[0].strip()] = line_l[1].strip() 33 | m = name2config 34 | if 'kahip_dir' not in m or 'data_dir' not in m or 'glove_dir' not in m or 'sift_dir' not in m: 35 | raise Exception('Config must have kahip_dir, data_dir, glove_dir, and sift_dir') 36 | return 
name2config
37 | 
38 | name2config = read_config()
39 | 
40 | device = 'cuda' if torch.cuda.is_available() else 'cpu'
41 | kahip_dir = name2config['kahip_dir']
42 | graph_file = 'knn.graph'
43 | data_dir = name2config['data_dir']
44 | 
45 | parts_path = osp.join(data_dir, 'partition', '')
46 | dsnode_path = osp.join(data_dir, 'train_dsnode')
47 | 
48 | glove_dir = name2config['glove_dir']
49 | sift_dir = name2config['sift_dir']
50 | 
51 | #starter numbers
52 | N_CLUSTERS = 256 #16
53 | N_HIDDEN = 512
54 | #for reference, this is 128 for sift, 784 for mnist, and 100 for glove
55 | N_INPUT = 128
56 | 
57 | '''
58 | One unified parse_args to ensure consistency across different components.
59 | Returns opt.
60 | '''
61 | def parse_args():
62 |     parser = argparse.ArgumentParser()
63 |     parser.add_argument('--n_clusters', default=N_CLUSTERS, type=int, help='number of clusters')
64 |     parser.add_argument('--kahip_config', default='strong', help='fast, eco, or strong')
65 |     parser.add_argument('--parts_path_root', default=parts_path, help='path root to partition')
66 |     parser.add_argument('--dsnode_path', default=dsnode_path, help='path to datanode dsnode for training')
67 |     parser.add_argument('--k', default=10, type=int, help='number of neighbors')
68 |     parser.add_argument('--nn_mult', default=1, type=int, help='multiplier for opt.k to create distribution of bins of nearest neighbors during training')
69 |     parser.add_argument('--data_dir', default=data_dir, help='data dir')
70 |     parser.add_argument('--graph_file', default=graph_file, help='file to store knn graph')
71 | 
72 |     parser.add_argument('--glove', default=True, help='whether using GloVe data')
73 |     parser.add_argument('--glove_c', default=False, help='whether using catalyzer-processed GloVe data')
74 |     parser.add_argument('--sift_c', default=False, help='whether using catalyzer-processed SIFT data')
75 |     parser.add_argument('--sift', default=False, help='whether using SIFT data')
76 |     parser.add_argument('--fast_kmeans', default=True, help='whether using fast kmeans, non-sklearn')
77 |     parser.add_argument('--itq', default=False, help='whether using ITQ solver')
78 |     parser.add_argument('--pca', default=False, help='whether using PCA solver')
79 |     parser.add_argument('--rp', default=False, help='whether using random projection solver')
80 |     parser.add_argument('--kmeans_use_kahip_height', default=-2, type=int, help='height if kmeans using kahip height, i.e. for combining kahip+kmeans methods')
81 |     parser.add_argument('--compute_gt_nn', default=False, help='whether to compute ground-truth for dataset points. Ground truth partitions instead of learned, i.e. if everything were partitioned by kahip')
82 | 
83 |     #meta and more hyperparameters
84 |     parser.add_argument('--write_res', default=True, help='whether to write acc and probe count results for kmeans')
85 |     parser.add_argument('--normalize_data', default=False, help='whether to normalize input data')
86 |     #parser.add_argument('--normalize_feature', default=True, help='whether to scale features')
87 |     parser.add_argument('--max_bin_count', default=30, type=int, help='max bin count for kmeans') #default=160
88 |     parser.add_argument('--acc_thresh', default=0.97, type=float, help='acc threshold for kmeans')
89 |     parser.add_argument('--n_repeat_km', default=3, type=int, help='number of experimental repeats for kmeans')
90 | 
91 |     #params for training
92 |     parser.add_argument('--n_input', default=N_INPUT, type=int, help='dimension of neural net input')
93 |     parser.add_argument('--n_hidden', default=N_HIDDEN, type=int, help='hidden dimension')
94 |     parser.add_argument('--n_class', default=N_CLUSTERS, type=int, help='number of classes for training')
95 |     parser.add_argument('--n_epochs', default=1, type=int, help='number of epochs for training') #35
96 |     parser.add_argument('--lr', default=0.0008, type=float, help='learning rate')
97 | 
98 |     opt = parser.parse_args()
99 | 
100 |     if opt.glove:
101 |         opt.n_input = 100
102 |     elif opt.glove_c:
103 |         opt.n_input = 100
104 |     elif opt.sift or opt.sift_c:
105 |         opt.n_input = 128
106 |     else:
107 |         opt.n_input = 784 #for mnist
108 | 
109 |     if (opt.glove or opt.glove_c) and not opt.normalize_data:
110 |         print('GloVe data must be normalized! Setting normalize_data to True...')
111 |         opt.normalize_data = True
112 | 
113 |     if opt.glove and opt.sift:
114 |         raise Exception('Must choose only one of opt.glove and opt.sift!')
115 | 
116 |     if not opt.fast_kmeans^opt.itq:
117 |         #raise Exception('Must choose only one of opt.fast_kmeans and opt.itq!')
118 |         print('NOTE: fast_kmeans and itq options share the same value')
119 | 
120 |     if not opt.fast_kmeans:
121 |         print('NOTE: fast_kmeans not enabled')
122 | 
123 |     return opt
124 | 
125 | class NestedList:
126 |     def __init__(self):
127 |         self.master = {}
128 | 
129 |     def add_list(self, l, idx):
130 |         if not isinstance(l, list):
131 |             raise Exception('Must add list to NestedList!')
132 |         self.master[idx] = l
133 | 
134 |     def get_list(self, idx):
135 |         return self.master[idx]
136 | '''
137 | l2 normalize along last dim
138 | Input: torch tensor.
139 | '''
140 | def normalize(vec):
141 |     norm = vec.norm(p=2, dim=-1, keepdim=True)
142 |     return vec/norm
143 | 
144 | def normalize_np(vec):
145 |     norm = numpy.linalg.norm(vec, axis=-1, keepdims=True)
146 |     return vec/norm
147 | 
148 | '''
149 | Cross polytope LSH.
150 | To find the part: apply a random rotation, then pick the nearest spherical
151 | lattice point after normalization, i.e. the argmax coordinate together with its sign.
152 | Input:
153 | -X: data, 2D tensor (n, d)
154 | -n_clusters: number of buckets; must be even. Returns the projection matrix and bucket indices.
155 | '''
156 | def polytope_lsh(X, n_clusters):
157 |     #random orthogonal rotation, cast to match the data's dtype
158 | 
159 |     #M = torch.randn(X.size(-1), proj_dim)
160 |     M = torch.from_numpy(ortho_group.rvs(X.size(-1))).to(X.dtype)
161 |     proj_dim = n_clusters // 2
162 |     M = M[:, :proj_dim]
163 |     X = torch.mm(X, M)
164 |     #X = X[:, :proj_dim]
165 | 
166 |     max_idx = torch.argmax(X.abs(), dim=-1) #largest-magnitude coordinate per point
167 |     max_entries = torch.gather(X, dim=-1, index=max_idx.unsqueeze(-1)).squeeze(-1)
168 |     #coordinates with negative sign hash to the second half of the 2*proj_dim buckets
169 |     max_idx[max_entries<0] += proj_dim
170 | 
171 |     return M, max_idx.view(-1)
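#Illustrative use of polytope_lsh/polytope_rank (hypothetical shapes and values):
#    X = torch.randn(1000, 64)                    #1000 points in 64 dims
#    M, parts = polytope_lsh(X, n_clusters=16)    #parts[i] is a bucket in [0, 16)
#    q = torch.randn(5, 64)
#    ranked_bins = polytope_rank(q, M, n_bins=4)  #top-4 candidate bins per query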
172 | 
173 | '''
174 | get ranking of bins using cross polytope info.
175 | Input:
176 | -q: query input, 2D tensor
177 | -M: projection mx. 2D tensor. d x n_total_clusters/2
178 | '''
179 | def polytope_rank(q, M, n_bins):
180 |     q = torch.mm(q, M)
181 |     n_queries, d = q.size(0), q.size(1)
182 |     q = q.view(n_queries, 1, d) #broadcast each query against all signed basis vectors
183 |     bases = torch.eye(d, device=device)
184 |     bases = torch.cat((bases, -bases), dim=0)
185 |     bases_exp = bases.unsqueeze(0).expand(n_queries, 2*d, d)
186 |     #inner product with each signed basis vector in the last dimension
187 |     idx = torch.topk((bases_exp*q).sum(-1), k=n_bins, dim=-1)[1]
188 |     return idx
189 | 
190 | '''
191 | Compute histograms of distances to the mth neighbor. Useful e.g.
192 | after catalyzer processing.
193 | Input:
194 | -X: data
195 | -q: queries
196 | -m: the mth neighbor to take distance to.
197 | '''
198 | def plot_dist_hist(X, q, m, data_name):
199 |     dist = l2_dist(q, X)
200 |     dist, ranks = torch.topk(dist, k=m, dim=-1, largest=False)
201 |     dist = dist / dist[:, 0].unsqueeze(-1)
202 |     #first look at the mean and median of distances
203 |     mth_dist = dist[:, m-1]
204 |     plt.hist(mth_dist.cpu().numpy(), bins=100, label=str(m)+'th neighbor')
205 |     plt.xlabel('distance')
206 |     plt.ylabel('count')
207 |     plt.xlim(0, 4)
208 |     plt.ylim(0, 140)
209 |     plt.title('Dist to {}^th nearest neighbor'.format(m))
210 |     plt.grid(True)
211 |     fig_path = osp.join(data_dir, '{}_dist_{}_hist.jpg'.format(data_name, m))
212 |     plt.savefig(fig_path)
213 |     print('fig saved {}'.format(fig_path))
214 |     #pdb.set_trace()
215 |     return mth_dist, plt
216 | '''
217 | Plot distance scatter plot, *up to* m^th neighbor, normalized by nearest neighbor dist.
218 | '''
219 | def plot_dist_hist_upto(X, q, m, data_name):
220 |     dist = l2_dist(q, X)
221 |     dist, ranks = torch.topk(dist, k=m, dim=-1, largest=False)
222 |     dist = dist / dist[:, 0].unsqueeze(-1)
223 |     #first look at the mean and median of distances
224 |     m_dist = dist[:, :m]
225 |     m_dist = m_dist.mean(0)
226 | 
227 |     df = pd.DataFrame({'k':list(range(m)), 'dist':m_dist.cpu().numpy()})
228 |     fig = sns.scatterplot(x='k', y='dist', data=df, label=data_name)
229 |     fig.figure.legend()
230 |     fig.set_title('{}: distance wrt k up to {}'.format(data_name, m))
231 |     fig_path = osp.join(data_dir, '{}_dist_upto{}.jpg'.format(data_name, m))
232 |     fig.figure.savefig(fig_path)
233 |     print('figure saved under {}'.format(fig_path))
234 | 
235 | 
250 | 
251 | '''
252 | Type can be query, train, or answers.
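E.g. (illustrative): load_data_dep('train') loads dataset_unnorm.npy from data_dir.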
253 | ''' 254 | def load_data_dep(type='query'): 255 | if type == 'query': 256 | return torch.from_numpy(np.load(osp.join(data_dir, 'queries_unnorm.npy'))) 257 | elif type == 'answers': 258 | #answers are NN of the query points 259 | return torch.from_numpy(np.load(osp.join(data_dir, 'answers_unnorm.npy'))) 260 | elif type == 'train': 261 | return torch.from_numpy(np.load(osp.join(data_dir, 'dataset_unnorm.npy'))) 262 | else: 263 | raise Exception('Unsupported data type') 264 | 265 | ''' 266 | All data are normalized. 267 | glove_dir : '~/partition/glove-100-angular/normalized' 268 | ''' 269 | def load_glove_data(type='query'): 270 | if type == 'query': 271 | return torch.from_numpy(np.load(osp.join(data_dir, 'glove_queries.npy'))) 272 | elif type == 'answers': 273 | #answers are NN of the query points 274 | return torch.from_numpy(np.load(osp.join(data_dir, 'glove_answers.npy'))) 275 | elif type == 'train': 276 | return torch.from_numpy(np.load(osp.join(data_dir, 'glove_dataset.npy'))) 277 | else: 278 | raise Exception('Unsupported data type') 279 | 280 | ''' 281 | catalyzer'd glove data 282 | ''' 283 | def load_glove_c_data(type='query'): 284 | if type == 'query': 285 | return torch.from_numpy(np.load(osp.join(data_dir, 'glove_c0.08_queries.npy'))) 286 | elif type == 'answers': 287 | #answers are NN of the query points 288 | return torch.from_numpy(np.load(osp.join(data_dir, 'glove_answers.npy'))) 289 | elif type == 'train': 290 | return torch.from_numpy(np.load(osp.join(data_dir, 'glove_c0.08_dataset.npy'))) 291 | else: 292 | raise Exception('Unsupported data type') 293 | 294 | def load_sift_c_data(type='query'): 295 | if type == 'query': 296 | return torch.from_numpy(np.load(osp.join(data_dir, 'sift_c_queries.npy'))) 297 | elif type == 'answers': 298 | #answers are NN of the query points 299 | return torch.from_numpy(np.load(osp.join(data_dir, 'sift_answers.npy'))) 300 | elif type == 'train': 301 | return torch.from_numpy(np.load(osp.join(data_dir, 'sift_c_dataset.npy'))) 302 | else: 303 | raise Exception('Unsupported data type') 304 | 305 | 306 | ''' 307 | All data are normalized. 308 | glove_dir : '~/partition/glove-100-angular/normalized' 309 | ''' 310 | def load_sift_data(type='query'): 311 | if type == 'query': 312 | return torch.from_numpy(np.load(osp.join(data_dir, 'sift_queries_unnorm.npy'))) 313 | elif type == 'answers': 314 | #answers are NN of the query points 315 | return torch.from_numpy(np.load(osp.join(data_dir, 'sift_answers_unnorm.npy'))) 316 | elif type == 'train': 317 | return torch.from_numpy(np.load(osp.join(data_dir, 'sift_dataset_unnorm.npy'))) 318 | else: 319 | raise Exception('Unsupported data type') 320 | 321 | ''' 322 | Glove data according 323 | Input: 324 | -n_parts: number of parts. 325 | ''' 326 | def glove_top_parts_path(n_parts): 327 | if n_parts not in [2, 4, 8, 16, 32, 64, 128, 256, 512]: 328 | raise Exception('Glove partitioning has not been precomputed for {} parts.'.format(n_parts)) 329 | strength = 'strong' #'eco' if n_parts in [128, 256] else 'strong' 330 | glove_top_parts_path = osp.join(glove_dir, 'partition_{}_{}'.format(n_parts, strength), 'partition.txt') 331 | if n_parts == 16: 332 | glove_top_parts_path = '/home/yihdong/partition/data/partition/16strongglove0ht1' 333 | return glove_top_parts_path 334 | 335 | ''' 336 | SIFT partitioning. 337 | Input: 338 | -n_parts: number of parts. 
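E.g. sift_top_parts_path(16) resolves to <data_dir>/partition_16_strong/partition.txt,
since strength is fixed to 'strong' in the body below (illustrative).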
339 | '''
340 | def sift_top_parts_path(n_parts):
341 |     if n_parts not in [2, 4, 8, 16, 32, 64, 128, 256]:
342 |         raise Exception('SIFT partitioning has not been precomputed for {} parts.'.format(n_parts))
343 | 
344 |     #strength = 'eco' if n_parts in [128, 256] else 'strong'
345 |     strength = 'strong'
346 |     sift_top_parts_path = osp.join(data_dir, 'partition_{}_{}'.format(n_parts, strength), 'partition.txt')
347 | 
348 |     return sift_top_parts_path
349 | 
350 | '''
351 | Memory-compatible.
352 | Ranks of closest points not self.
353 | Uses l2 dist. But uses cosine dist if data normalized.
354 | Input:
355 | -data: tensors
356 | -specify k if only interested in the top k results.
357 | -largest: whether pick largest when ranking.
358 | -include_self: include the point itself in the final ranking.
359 | '''
360 | def dist_rank(data_x, k, data_y=None, largest=False, opt=None, include_self=False):
361 | 
362 |     if isinstance(data_x, np.ndarray):
363 |         data_x = torch.from_numpy(data_x)
364 | 
365 |     if data_y is None:
366 |         data_y = data_x
367 |     else:
368 |         if isinstance(data_y, np.ndarray):
369 |             data_y = torch.from_numpy(data_y)
370 |     k0 = k
371 |     device_o = data_x.device
372 |     data_x = data_x.to(device)
373 |     data_y = data_y.to(device)
374 | 
375 |     (data_x_len, dim) = data_x.size()
376 |     data_y_len = data_y.size(0)
377 |     #break into chunks. 5e6 is total for MNIST point size
378 |     #chunk_sz = int(5e6 // data_y_len)
379 |     #chunk_sz = 16384
380 |     #chunk_sz = 500 #700 mem error. 1 mil points
381 |     if data_y_len > 990000:
382 |         chunk_sz = 600 #1000 if over 1.1 mil
383 |         #chunk_sz = 500 #1000 if over 1.1 mil
384 |     else:
385 |         chunk_sz = 3000
386 | 
387 |     if k+1 > len(data_y):
388 |         k = len(data_y) - 1
389 |     #if opt is not None and opt.sift:
390 | 
391 |     if device == 'cuda':
392 |         dist_mx = torch.cuda.LongTensor(data_x_len, k+1)
393 |         act_dist = torch.cuda.FloatTensor(data_x_len, k+1)
394 |     else:
395 |         dist_mx = torch.LongTensor(data_x_len, k+1)
396 |         act_dist = torch.FloatTensor(data_x_len, k+1)
397 |     data_normalized = True if opt is not None and opt.normalize_data else False
398 |     largest = True if largest else (True if data_normalized else False)
399 | 
400 |     #compute l2 dist <--be memory efficient by blocking
401 |     total_chunks = int((data_x_len-1) // chunk_sz) + 1
402 |     y_t = data_y.t()
403 |     if not data_normalized:
404 |         y_norm = (data_y**2).sum(-1).view(1, -1)
405 | 
406 |     for i in range(total_chunks):
407 |         base = i*chunk_sz
408 |         upto = min((i+1)*chunk_sz, data_x_len)
409 |         cur_len = upto-base
410 |         x = data_x[base : upto]
411 | 
412 |         if not data_normalized:
413 |             x_norm = (x**2).sum(-1).view(-1, 1)
414 |             #plus op broadcasts
415 |             dist = x_norm + y_norm
416 |             dist -= 2*torch.mm(x, y_t)
417 |         else:
418 |             dist = -torch.mm(x, y_t)
419 | 
420 |         topk_d, topk = torch.topk(dist, k=k+1, dim=1, largest=largest)
421 | 
422 |         dist_mx[base:upto, :k+1] = topk #torch.topk(dist, k=k+1, dim=1, largest=largest)[1][:, 1:]
423 |         act_dist[base:upto, :k+1] = topk_d #torch.topk(dist, k=k+1, dim=1, largest=largest)[1][:, 1:]
424 | 
425 |     topk = dist_mx
426 |     if k > 3 and opt is not None and opt.sift:
427 |         #topk = dist_mx
428 |         #sift contains duplicate points, don't run this in general.
429 | identity_ranks = torch.LongTensor(range(len(topk))).to(topk.device) 430 | topk_0 = topk[:, 0] 431 | topk_1 = topk[:, 1] 432 | topk_2 = topk[:, 2] 433 | topk_3 = topk[:, 3] 434 | 435 | id_idx1 = topk_1 == identity_ranks 436 | id_idx2 = topk_2 == identity_ranks 437 | id_idx3 = topk_3 == identity_ranks 438 | 439 | if torch.sum(id_idx1).item() > 0: 440 | topk[id_idx1, 1] = topk_0[id_idx1] 441 | 442 | if torch.sum(id_idx2).item() > 0: 443 | topk[id_idx2, 2] = topk_0[id_idx2] 444 | 445 | if torch.sum(id_idx3).item() > 0: 446 | topk[id_idx3, 3] = topk_0[id_idx3] 447 | 448 | 449 | if not include_self: 450 | topk = topk[:, 1:] 451 | act_dist = act_dist[:, 1:] 452 | elif topk.size(-1) > k0: 453 | topk = topk[:, :-1] 454 | topk = topk.to(device_o) 455 | return act_dist, topk 456 | 457 | ''' 458 | Memory-compatible. 459 | Input: 460 | -data: tensors 461 | -data_y: if None take dist from data_x to itself 462 | ''' 463 | def l2_dist(data_x, data_y=None): 464 | 465 | if data_y is not None: 466 | return _l2_dist2(data_x, data_y) 467 | else: 468 | return _l2_dist1(data_x) 469 | 470 | ''' 471 | Memory-compatible, when insufficient GPU mem. To be combined with _l2_dist2 later. 472 | Input: 473 | -data: tensor 474 | ''' 475 | def _l2_dist1(data): 476 | 477 | if isinstance(data, numpy.ndarray): 478 | data = torch.from_numpy(data) 479 | (data_len, dim) = data.size() 480 | #break into chunks. 5e6 is total for MNIST point size 481 | chunk_sz = int(5e6 // data_len) 482 | dist_mx = torch.FloatTensor(data_len, data_len) 483 | 484 | #compute l2 dist <--be memory efficient by blocking 485 | total_chunks = int((data_len-1) // chunk_sz) + 1 486 | y_t = data.t() 487 | y_norm = (data**2).sum(-1).view(1, -1) 488 | 489 | for i in range(total_chunks): 490 | base = i*chunk_sz 491 | upto = min((i+1)*chunk_sz, data_len) 492 | cur_len = upto-base 493 | x = data[base : upto] 494 | x_norm = (x**2).sum(-1).view(-1, 1) 495 | #plus op broadcasts 496 | dist_mx[base:upto] = x_norm + y_norm - 2*torch.mm(x, y_t) 497 | 498 | 499 | return dist_mx 500 | 501 | ''' 502 | Memory-compatible. 503 | Input: 504 | -data: tensor 505 | ''' 506 | def _l2_dist2(data_x, data_y): 507 | 508 | (data_x_len, dim) = data_x.size() 509 | data_y_len = data_y.size(0) 510 | #break into chunks. 
5e6 is total for MNIST point size 511 | chunk_sz = int(5e6 // data_y_len) 512 | dist_mx = torch.FloatTensor(data_x_len, data_y_len) 513 | 514 | #compute l2 dist <--be memory efficient by blocking 515 | total_chunks = int((data_x_len-1) // chunk_sz) + 1 516 | y_t = data_y.t() 517 | y_norm = (data_y**2).sum(-1).view(1, -1) 518 | 519 | for i in range(total_chunks): 520 | base = i*chunk_sz 521 | upto = min((i+1)*chunk_sz, data_x_len) 522 | cur_len = upto-base 523 | x = data_x[base : upto] 524 | x_norm = (x**2).sum(-1).view(-1, 1) 525 | #plus op broadcasts 526 | dist_mx[base:upto] = x_norm + y_norm - 2*torch.mm(x, y_t) 527 | 528 | #data_x = data[base : upto].unsqueeze(cur_len, data_len, dime(1).expand(cur_len, data_len, dim) 529 | # ) 530 | return dist_mx 531 | 532 | 533 | ''' 534 | convert numpy array or list to markdown table 535 | Input: 536 | -numpy array (or two-nested list) 537 | -s 538 | 539 | ''' 540 | def mx2md(mx, row_label, col_label): 541 | #height, width = mx.shape 542 | height, width = len(mx), len(mx[0]) 543 | 544 | if height != len(row_label) or width != len(col_label): 545 | raise Exception('mx2md: height != len(row_label) or width != len(col_label)') 546 | 547 | l = ['-'] 548 | l.extend([str(i) for i in col_label]) 549 | rows = [l] 550 | rows.append(['---' for i in range(width+1)]) 551 | 552 | for i, row in enumerate(mx): 553 | l = [str(row_label[i])] 554 | l.extend([str(j) for j in mx[i]]) 555 | rows.append(l) 556 | 557 | md = '\n'.join(['|'.join(row) for row in rows]) 558 | #md0 = ['\n'.join(row) for row in rows] 559 | return md 560 | 561 | ''' 562 | convert multiple numpy arrays or lists of same shape to markdown table 563 | Input: 564 | -numpy array (or two-nested list) 565 | 566 | ''' 567 | def mxs2md(mx_l, row_label, col_label): 568 | 569 | height, width = len(mx_l[0]), len(mx_l[0][0]) 570 | 571 | for i, mx in enumerate(mx_l, 1): 572 | if (height, width) != (len(mx), len(mx[0])): 573 | raise Exception('shape mismatch: height != len(row_label) or width != len(col_label)') 574 | 575 | if height != len(row_label) or width != len(col_label): 576 | raise Exception('mx2md: height != len(row_label) or width != len(col_label)') 577 | 578 | l = ['-'] 579 | l.extend([str(i) for i in col_label]) 580 | rows = [l] 581 | rows.append(['---' for i in range(width+1)]) 582 | 583 | for i, row in enumerate(mx): 584 | l = [str(row_label[i])] 585 | 586 | #l.extend([str(j) for j in mx_k[i]]) 587 | l.extend([' / '.join([str(mx_k[i][j]) for mx_k in mx_l]) for j in range(width)]) 588 | rows.append(l) 589 | 590 | md = '\n'.join(['|'.join(row) for row in rows]) 591 | #md0 = ['\n'.join(row) for row in rows] 592 | return md 593 | 594 | def load_lines(path): 595 | with open(path, 'r') as file: 596 | lines = file.read().splitlines() 597 | return lines 598 | 599 | ''' 600 | Input: lines is list of objects, not newline-terminated yet. 
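E.g. (illustrative): write_lines(['a', 1, 2.5], 'out.txt') writes one str()-converted
object per line, each terminated with os.linesep.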
594 | def load_lines(path):
595 |     with open(path, 'r') as file:
596 |         lines = file.read().splitlines()
597 |     return lines
598 | 
599 | '''
600 | Input: lines is a list of objects, not yet newline-terminated.
601 | '''
602 | def write_lines(lines, path):
603 |     lines1 = []
604 |     for line in lines:
605 |         lines1.append(str(line) + '\n') #'\n', not os.linesep: text mode already translates newlines on write
606 |     with open(path, 'w') as file:
607 |         file.writelines(lines1)
608 | 
609 | def pickle_dump(obj, path):
610 |     with open(path, 'wb') as file:
611 |         pickle.dump(obj, file)
612 | 
613 | def pickle_load(path):
614 |     with open(path, 'rb') as file:
615 |         return pickle.load(file)
616 | 
617 | 
618 | if __name__ == '__main__':
619 |     mx1 = np.zeros((2,2))
620 |     mx2 = np.ones((2,2))
621 | 
622 |     row = ['1','2']
623 |     col = ['3','4']
624 | 
625 |     print(mxs2md([mx1,mx2], row, col))
626 | 
627 | 
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | 
2 | '''
3 | Utility functions
4 | '''
5 | from __future__ import unicode_literals
6 | import matplotlib
7 | matplotlib.use('agg')
8 | import matplotlib.pyplot as plt
9 | import pandas as pd
10 | import numpy as np
11 | import os
12 | import os.path as osp
13 | import utils #self-import; lets module-level names such as utils.res_dir resolve below
14 | import argparse
15 | import torch
16 | import math
17 | import numpy.linalg as linalg
18 | import scipy.linalg
19 | 
20 | import pdb
21 | 
22 | res_dir = 'results'
23 | data_dir = 'data'
24 | device = 'cuda' if torch.cuda.is_available() else 'cpu'
25 | 
26 | def parse_args():
27 |     parser = argparse.ArgumentParser()
28 |     parser.add_argument('--max_dir', default=1, type=int, help='Max number of directions')
29 |     parser.add_argument('--lamb_multiplier', type=float, default=1., help='Set the alpha multiplier')
30 |     parser.add_argument('--experiment_type', default='syn_lamb', help='Set the type of experiment, e.g. syn_dirs, syn_lamb, text_lamb, text_dirs, image_lamb, image_dirs, i.e. varying alpha or the number of corruption directions on the respective dataset')
31 |     parser.add_argument('--generate_data', help='Generate synthetic data to run synthetic data experiments', dest='generate_data', action='store_true')
32 |     parser.set_defaults(generate_data=False)
33 |     parser.add_argument('--fast_jl', help='Use fast method to generate approximate QUE scores', dest='fast_jl', action='store_true')
34 |     parser.set_defaults(fast_jl=False)
35 |     parser.add_argument('--fast_whiten', help='Use approximate whitening', dest='fast_whiten', action='store_true')
36 |     parser.set_defaults(fast_whiten=False)
37 |     parser.add_argument('--high_dim', help='Generate high-dimensional data, if running synthetic data experiments', dest='high_dim', action='store_true')
38 |     parser.set_defaults(high_dim=False)
39 | 
40 |     opt = parser.parse_args()
41 | 
42 |     if len(opt.experiment_type) > 3 and opt.experiment_type[:3] == 'syn':
43 |         opt.generate_data = True
44 | 
45 |     return opt
46 | 
47 | def create_dir(path):
48 |     if not os.path.exists(path):
49 |         os.mkdir(path)
50 | 
51 | create_dir(res_dir)
52 | 
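`get_chebyshev_deg` below hardcodes the monomial expansions of the Chebyshev polynomials T_0, ..., T_6 (e.g. T_3(x) = 4x^3 - 3x). A hedged sanity check against numpy, should one want to verify or extend the table (illustrative code, not from the repo):

```
import numpy as np
from numpy.polynomial import chebyshev

for k in range(7):
    basis = [0]*k + [1]                #coefficients of T_k in the Chebyshev basis
    mono = chebyshev.cheb2poly(basis)  #monomial coefficients, lowest degree first
    print(k, mono)                     #k=3 prints [ 0. -3.  0.  4.], i.e. 4x^3 - 3x
```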
53 | '''
54 | Get the monomial degrees and coefficients of the kth Chebyshev polynomial.
55 | '''
56 | def get_chebyshev_deg(k):
57 |     if k == 0:
58 |         coeff = [1]
59 |         deg = [0]
60 |     elif k == 1:
61 |         coeff = [1]
62 |         deg = [1]
63 |     elif k == 2:
64 |         coeff = [2, -1]
65 |         deg = [2, 0]
66 |     elif k == 3:
67 |         coeff = [4, -3]
68 |         deg = [3, 1]
69 |     elif k == 4:
70 |         coeff = [8, -8, 1]
71 |         deg = [4, 2, 0]
72 |     elif k == 5:
73 |         coeff = [16, -20, 5]
74 |         deg = [5, 3, 1]
75 |     elif k == 6:
76 |         coeff = [32, -48, 18, -1]
77 |         deg = [6, 4, 2, 0]
78 |     else:
79 |         raise Exception('deg {} chebyshev not supported'.format(k))
80 |     return coeff, deg
81 | 
82 | '''
83 | Combination of JL projection and
84 | Chebyshev expansion of the matrix exponential.
85 | Input:
86 | -X: data matrix, 2D tensor. X is sparse for gene data!
87 | Returns:
88 | -tau1: scores, 1D tensor (n,)
89 | '''
90 | def jl_chebyshev(X, lamb):
91 | 
92 | 
93 | 
94 |     X = X - X.mean(0, keepdim=True)
95 | 
96 |     n_data, feat_dim = X.size()
97 |     X_scaled = X/n_data
98 | 
99 |     #if lamb == 0 there is no scaling, so as not to magnify the degree-0 bessel term in the approximation
100 |     scale = int(dominant_eval_cov(np.sqrt(lamb)*X)[0]) if lamb > 0 else 1
101 | 
102 |     if scale > 1:
103 |         print('Scaling M! {}'.format(scale))
104 |         #scale the matrix down when its norm is large; exp(M) is recovered below
105 |         #by applying exp(M/scale) scale times. Keep scale odd.
106 |         if scale%2 == 0:
107 |             scale -= 1
108 |         X_scaled /= scale
109 |     else:
110 |         scale = 1
111 | 
112 |     subsample_freq = int(feat_dim/math.log(feat_dim, 2)) #100
113 |     k = math.ceil(feat_dim/subsample_freq)
114 | 
115 |     X_t = X.t()
116 |     #fast Hadamard transform (ffht) vs transform by multiplication with the Hadamard mx
117 |     ffht_b = False
118 |     P, H, D = get_jl_mx(feat_dim, k, ffht_b)
119 | 
120 |     I_proj = torch.eye(feat_dim, feat_dim, device=X.device)
121 | 
122 |     M = D
123 |     I_proj = torch.mm(D, I_proj)
124 |     if ffht_b:
125 |         #can be obtained from https://github.com/FALCONN-LIB/FFHT
126 |         #import here so not everyone needs to install it
127 |         import ffht
128 | 
129 |         M = M.t()
130 | 
131 |         M_np = M.cpu().numpy()
132 | 
133 |         I_np = I_proj.cpu().numpy()
134 |         for i in range(M.size(0)):
135 |             ffht.fht(M_np[i])
136 |         for i in range(I_proj.size(0)):
137 |             ffht.fht(I_np[i])
138 | 
139 |         M = torch.from_numpy(M_np).to(dtype=M.dtype, device=X.device).t()
140 |         I_proj = torch.from_numpy(I_np).to(dtype=M.dtype, device=X.device)
141 |     else:
142 |         #multiply with the Hadamard matrix explicitly
143 |         M = torch.mm(H, M)
144 |         I_proj = torch.mm(H, I_proj)
145 | 
146 |     #apply P now so downstream multiplications are faster: kd instead of d^2
147 |     #subsample to get the reduced dimension
148 |     subsample = True
149 |     if subsample:
150 |         #random sampling performs well in practice and has lower complexity
151 |         #note torch.randint(low=0, high=feat_dim, size=(feat_dim//5,)) would produce repeats
152 |         if device == 'cuda':
153 | 
154 |             select_idx = torch.cuda.LongTensor(list(range(0, feat_dim, subsample_freq)))
155 |         else:
156 |             select_idx = torch.LongTensor(list(range(0, feat_dim, subsample_freq)))
157 | 
158 |         M = M[select_idx]
159 | 
160 |         I_proj = I_proj[select_idx]
161 |     else:
162 |         M = torch.sparse.mm(P, M)
163 |         I_proj = torch.sparse.mm(P, I_proj)
164 | 
165 |     #M is now the projection mx
166 |     A = M
167 |     for _ in range(scale):
168 |         #(k x d)
169 |         A = sketch_and_apply(lamb, X, X_scaled, A, I_proj)
170 | 
171 | 
172 | 
173 | 
174 |     #compute tau1 scores (this A is the previous M^{1/2})
175 |     tau1 = (torch.mm(A, X_t)**2).sum(0)
176 | 
177 |     return tau1
178 | 
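The Bessel coefficients and the scaling loop above implement the classical Chebyshev expansion of the exponential: for ||M|| not much larger than 1 (which the scaling step enforces),

$$e^{M} \approx I_0(1)\,\mathrm{Id} + 2\sum_{k=1}^{5} I_k(1)\, T_k(M),$$

where $I_k$ is the modified Bessel function of the first kind. As far as we can tell from the values, the precomputed constants `bessel_i[0]` and `bessel_neg_i` further down are exactly $I_0(1), \dots, I_5(1)$; applying the degree-6 truncation of $\exp(M/\mathrm{scale})$ `scale` times then approximately recovers $\exp(M)$.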
179 | '''
180 | -M: projection (sketch) matrix, k x d
181 | -X, X_scaled: input and scaled input
182 | Returns:
183 | -k x d projected matrix
184 | '''
185 | def sketch_and_apply(lamb, X, X_scaled, M, I_proj):
186 |     X_t = X.t()
187 |     M = torch.mm(M, X_t)
188 |     M = torch.mm(M, X_scaled)
189 | 
190 |     check_cov = False
191 |     if check_cov:
192 |         #sanity check: use the exact cov mx
193 | 
194 |         M = cov(X)
195 |         subsample_freq = 1
196 |         feat_dim = X.size(1)
197 |         k = feat_dim
198 |         I_proj = torch.eye(k, k, device=X.device)
199 | 
200 |     check_exp = False
201 |     #disabled sanity checks: exact matrix exponential, via SVD and via scipy
202 |     if False:
203 |         U, D, V_t = linalg.svd(lamb*M.cpu().numpy())
204 | 
205 |         U = torch.from_numpy(U.astype('float32')).to(device)
206 |         D_exp = torch.from_numpy(np.exp(D.astype('float32'))).to(device).diag()
207 |         m = torch.mm(U, D_exp)
208 |         m = torch.mm(m, U.t())
209 | 
210 |         return m
211 |     if check_exp:
212 |         M = torch.from_numpy(scipy.linalg.expm(lamb*M.cpu().numpy())).to(device)
213 | 
214 | 
215 | 
216 | 
217 |         return M
218 | 
219 |     ## Matrix exponential appx ##
220 |     total_deg = 6
221 |     monomials = [0]*total_deg
222 |     #k x d
223 |     monomials[1] = M
224 | 
225 |     #create the monomials appearing in the chebyshev polys. Start at deg 2, since M was already multiplied by one cov above.
226 |     for i in range(2, total_deg):
227 |         monomials[i] = torch.mm(torch.mm(monomials[i-1], X_t), X_scaled)
228 | 
229 |     monomials[0] = I_proj
230 |     M = 0
231 |     #M is now (k x d)
232 |     #sum the chebyshev polys of each degree, weighted by bessel values
233 |     for kk in range(1, total_deg):
234 |         #coefficients and degrees for the chebyshev poly. Includes the 0th deg.
235 |         coeff, deg = get_chebyshev_deg(kk)
236 | 
237 |         T_k = 0
238 |         for i, d in enumerate(deg):
239 |             c = coeff[i]
240 |             T_k += c*lamb**d*monomials[d]
241 | 
242 |         #bessel values include multiplication with powers of i
243 |         bessel_k = get_bessel('-i', kk)
244 |         M = M + bessel_k*T_k
245 | 
246 | 
247 |     #degree 0 term. M is now (k x d)
248 | 
249 |     #(k x d) matrix
250 |     M = 2*M + get_bessel('i', 0) * I_proj
251 | 
252 |     return M
253 | 
254 | 
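`get_jl_mx` below assembles the three factors of a subsampled randomized Hadamard transform x -> PHDx: D a random +/-1 diagonal, H a normalized Hadamard matrix, and P a row subsampler (realized in `jl_chebyshev` by strided indexing rather than a dense P). A toy version of the same construction (illustrative names, not repo code):

```
import math
import torch

d = 8                                            #d must be a power of 2
x = torch.randn(d)
sign = torch.randint(0, 2, (d,)).float()*2 - 1   #D: random +/-1 signs
H = torch.tensor([[1.]])
while H.size(0) < d:                             #Sylvester doubling construction
    H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
H = H / math.sqrt(d)                             #rows orthonormal
y = (H @ (sign * x))[::2]                        #P: keep every other coordinate
```

Multiplying by HD spreads the mass of x roughly evenly across coordinates, which is why uniform row sampling (after rescaling) approximately preserves norms.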
255 | '''
256 | Create the JL projection matrices.
257 | Input:
258 | -d: original dim
259 | -k: reduced dim
260 | '''
261 | def get_jl_mx(d, k, ffht_b):
262 |     #P is the (conceptually sparse) k x d subsampling mx
263 | 
264 |     P = torch.ones(k, d, device=device) #torch.sparse( )
265 | 
266 |     if not ffht_b:
267 |         H = get_hadamard(d)
268 |     else:
269 |         H = None
270 |     #diagonal Rademacher mx
271 |     sign = torch.randint(low=0, high=2, size=(d,), device=device, dtype=torch.float32)
272 |     sign[sign==0] = -1
273 |     D = sign.diag()
274 | 
275 |     return P, H, D
276 | 
277 | #dict of Hadamard matrices of given dimensions
278 | H2 = {}
279 | '''
280 | -d: dimension of H. Must be a power of 2.
281 | -could be replaced with an FFT-style transform for d log(d) cost.
282 | '''
283 | def get_hadamard(d):
284 | 
285 |     if d in H2:
286 |         return H2[d]
287 |     if osp.exists('h{}.pt'.format(d)):
288 |         H2[d] = torch.load('h{}.pt'.format(d)).to(device)
289 |         return H2[d]
290 |     power = math.log(d, 2)
291 |     if power-round(power) != 0:
292 |         raise Exception('Dimension of Hadamard matrix must be power of 2')
293 |     power = int(power)
294 | 
295 |     M2 = torch.FloatTensor([1, 1, 1, -1])
296 |     if device == 'cuda':
297 |         M2 = M2.cuda()
298 |     i = 2
299 |     H = M2
300 |     while i <= power:
301 | 
302 |         H = torch.ger(M2, H.view(-1))
303 |         #reshape into 4 block matrices
304 |         H = H.view(-1, 2**(i-1), 2**(i-1))
305 |         H = torch.cat((torch.cat((H[0], H[1]), dim=1), torch.cat((H[2], H[3]), dim=1)), dim=0)
306 | 
307 | 
308 |         i += 1
309 |     H2[d] = H.view(d, d) / np.sqrt(d)
310 |     torch.save(H2[d], 'h{}.pt'.format(d)) #save the normalized mx, so the cached and freshly computed paths agree
311 |     return H2[d]
312 | 
313 | '''
314 | Pad the feature dimension to a power of 2 with zeros.
315 | Input: 2D tensor (n_data, feat_dim).
316 | '''
317 | def pad_to_2power(X):
318 |     n_data, feat_dim = X.size(0), X.size(-1)
319 |     power = int(math.ceil(math.log(feat_dim, 2)))
320 |     power_diff = 2**power-feat_dim
321 |     if power_diff == 0:
322 |         return X
323 |     padding = torch.zeros(n_data, power_diff, dtype=X.dtype, device=X.device)
324 |     X = torch.cat((X, padding), dim=-1)
325 | 
326 |     return X
327 | 
328 | '''
329 | Find the dominant eval of the covariance X^t X / n (and the evec in the process) using the power method,
330 | without explicitly forming the covariance.
331 | Returns:
332 | -dominant eval + corresponding eigenvector
333 | '''
334 | def dominant_eval_cov(X):
335 |     n_data = X.size(0)
336 |     X = X - X.mean(dim=0, keepdim=True)
337 |     X_t = X.t()
338 |     X_t_scaled = X_t/n_data
339 |     n_round = 5
340 | 
341 |     v = torch.randn(X.size(-1), 1, device=X.device)
342 |     for _ in range(n_round):
343 |         v = torch.mm(X_t_scaled, torch.mm(X, v))
344 |         #normalize each iteration instead of only at the end, to avoid overflow
345 | 
346 |         v = v / (v**2).sum().sqrt()
347 |     mu = torch.mm(v.t(), torch.mm(X_t_scaled, torch.mm(X, v))) / (v**2).sum()
348 | 
349 |     return mu.item(), v.view(-1)
350 | '''
351 | Dominant eval of the matrix A, via the power method.
352 | Returns: top eval and evec
353 | '''
354 | def dominant_eval(A):
355 |     #power iteration applied directly to A
356 | 
357 | 
358 | 
359 | 
360 | 
361 |     n_round = 5
362 |     v = torch.randn(A.size(-1), 1, device=A.device)
363 |     for _ in range(n_round):
364 |         v = torch.mm(A, v)
365 |         #normalize each iteration instead of only at the end, to avoid overflow
366 | 
367 |         v = v / (v**2).sum().sqrt()
368 |     mu = torch.mm(v.t(), torch.mm(A, v)) / (v**2).sum()
369 | 
370 |     return mu.item(), v.view(-1)
371 | 
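A quick way to sanity-check the power iteration against a dense eigensolver (illustrative code, not from the repo; assumes a recent PyTorch for `torch.linalg.eigvalsh`):

```
import torch

A = torch.randn(50, 50)
A = torch.mm(A, A.t())              #symmetric PSD test matrix
mu, v = dominant_eval(A)            #power-iteration estimate; n_round=5, so fairly coarse
top = torch.linalg.eigvalsh(A)[-1]  #exact top eigenvalue
print(mu, top.item())               #close when the top eigenvalue is well separated
```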
372 | '''
373 | Top k evals (and evecs) of the symmetric matrix A, rather than just the top one, via repeated deflation. Note: A is modified in place.
374 | '''
375 | def dominant_eval_k(A, k):
376 | 
377 |     evals = torch.zeros(k).to(device)
378 |     evecs = torch.zeros(k, A.size(-1)).to(device)
379 | 
380 |     for i in range(k):
381 | 
382 |         cur_eval, cur_evec = dominant_eval(A)
383 |         A -= (cur_evec*A).sum(-1, keepdim=True) * (cur_evec/(cur_evec**2).sum()) #deflate the current direction
384 | 
385 |         evals[i] = cur_eval
386 |         evecs[i] = cur_evec
387 | 
388 |     return evals, evecs
389 | 
390 | '''
391 | Top covariance evals, for e.g. visualization + debugging.
392 | '''
393 | def get_top_evals(X, k=10):
394 |     X_cov = cov(X)
395 |     U, D, V_t = linalg.svd(X_cov.cpu().numpy())
396 |     return D[:k]
397 | 
398 | #bessel function values at i and -i, index is degree.
399 | #sum_{j=0}^\infty ((-1)^j/(2^(2j+k) *j!*(k+j)! )) * (-i)^(2*j+k), for k=0 this is BesselI(0, 1)
400 | bessel_i = [1.266066] #I_0(1)
401 | #values before folding in the powers of i: [1.266066, -0.565159j, -0.1357476, 0.0221684j, 0.00273712, -0.00027146j]
402 | #below includes multiplication with powers of i, i**k
403 | bessel_neg_i = [1.266066, 0.565159, 0.1357476, 0.0221684, 0.00273712, 0.00027146]
404 | 
405 | '''
406 | Get the precomputed deg^th Bessel function value at the input arg.
407 | '''
408 | def get_bessel(arg, deg):
409 |     if arg == 'i':
410 |         if deg >= len(bessel_i):
411 |             raise Exception('Bessel i not computed for deg {}'.format(deg))
412 |         return bessel_i[deg]
413 |     elif arg == '-i':
414 |         if deg >= len(bessel_neg_i):
415 |             raise Exception('Bessel -i not computed for deg {}'.format(deg))
416 |         return bessel_neg_i[deg]
417 | 
418 | 
419 | '''
420 | Projection (vector) of dirs onto the target direction.
421 | '''
422 | def project_onto(tgt, dirs):
423 | 
424 |     projection = (tgt*dirs).sum(-1, keepdim=True) * (tgt/(tgt**2).sum())
425 | 
426 |     return projection
427 | 
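`project_onto` computes, row by row, the standard projection proj_t(d) = (<d, t> / <t, t>) t. A two-line check (illustrative, not repo code):

```
import torch

tgt = torch.tensor([1., 0., 0.])
dirs = torch.tensor([[2., 3., 4.], [5., 6., 7.]])
print(project_onto(tgt, dirs))   #expect [[2., 0., 0.], [5., 0., 0.]]
```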
428 | '''
429 | Plot accuracies.
430 | legends: the last field corresponds to hue, and dictates the kind of plot, e.g. lambda or p
431 | '''
432 | 
433 | def plot_acc(k_l, acc_l, tau_l, p_l, legends, opt):
434 |     import seaborn as sns
435 |     opt.lamb = round(opt.lamb, 2)
436 |     df = create_df(k_l, acc_l, tau_l, p_l, legends)
437 | 
438 |     fig = sns.scatterplot(x=legends[0], y=legends[1], style=legends[2], hue=legends[3], data=df)
439 | 
440 |     fig.set(ylim=(0, 1.05))
441 |     fig.set_title('acc vs k. n_iter {} remove_fac {} p {} on dataset {} tau0: {}'.format(opt.n_iter, opt.remove_factor, opt.p, opt.dataset_name, opt.baseline))
442 |     fig_path = osp.join(utils.res_dir, 'plot{}{}_{}_{}_{}{}iter{}{}{}.jpg'.format('N{}_noise'.format(opt.noise_norm_div), opt.feat_dim, opt.n_dir, opt.norm_scale, legends[-1], opt.p, opt.n_iter, opt.dataset_name, opt.baseline))
443 |     fig.figure.savefig(fig_path)
444 |     print('figure saved under {}'.format(fig_path))
445 | 
446 | '''
447 | Plot wrt lambda.
448 | '''
449 | def plot_acc_syn_lamb(p_l, acc_l, tau_l, legends, opt):
450 |     import seaborn as sns
451 |     opt.lamb = round(opt.lamb, 2)
452 |     df = pd.DataFrame({legends[0]:p_l, legends[1]:acc_l, legends[2]:tau_l})
453 | 
454 | 
455 |     fig = sns.scatterplot(x=legends[0], y=legends[1], style=legends[2], data=df)
456 | 
457 |     fig.set(ylim=(0, 1.05))
458 |     fig.set_title('acc vs lambda. n_iter {} remove_fac {} p {} on dataset {} tau0: {}'.format(opt.n_iter, opt.remove_factor, opt.p, opt.dataset_name, opt.baseline))
459 |     fig_path = osp.join(utils.res_dir, 'syn', 'lamb_{}_{}_{}_{}{}iter{}{}{}.jpg'.format(opt.feat_dim, opt.n_dir, opt.norm_scale, legends[-1], opt.p, opt.n_iter, opt.dataset_name, opt.baseline))
460 |     fig.figure.savefig(fig_path)
461 |     print('figure saved under {}'.format(fig_path))
462 | 
463 | '''
464 | Scatter plot of input X, e.g. for varying lambda.
465 | Input:
466 | -std: standard deviation around each point (currently unused).
467 | '''
468 | def plot_scatter(X, Y, legends, opt, std=None):
469 |     import seaborn as sns
470 |     df = pd.DataFrame({legends[0]:X, legends[1]:Y})
471 |     fig = sns.scatterplot(x=legends[0], y=legends[1], data=df, label=(legends[1]))
472 | 
473 | 
474 |     fig.set(ylim=(0, 1.05))
475 |     plt.grid(True)
476 | 
477 | 
478 |     fig.set_title('Recall scores as a function of varying {} for text {}'.format(legends[0], opt.text_name))
479 |     fig_path = osp.join(utils.res_dir, 'text', '{}_{}_{}1.jpg'.format(legends[0], legends[1], opt.text_name))
480 |     fig.figure.savefig(fig_path)
481 |     print('figure saved under {}'.format(fig_path))
482 | 
483 | '''
484 | Plot a flexible number of variables.
485 | Useful for e.g. plotting against baselines.
486 | Input:
487 | -data_ar: np arrays; the 0th entry is the x axis.
488 | -std_ar: standard deviations for error bars.
489 | -legend_l: has length one less than data_ar.
490 | -name: extra suffix for the file name.
491 | '''
492 | def plot_scatter_flex(data_ar, legend_l, opt, std_ar=None, name=''):
493 | 
494 |     plt.clf()
495 | 
496 |     markers = ['^', 'o', 'x', '.', '1', '3', '+', '4', '5']
497 | 
498 | 
499 | 
500 | 
501 | 
502 |     for i in range(1, len(data_ar)):
503 | 
504 |         cur_legend = get_label_name(legend_l[i-1])
505 |         plt.errorbar(data_ar[0], data_ar[i], yerr=std_ar[i], marker=markers[i-1], label=cur_legend)
506 | 
507 |     label_name = get_label_name(name) #e.g. 'naive spectral' for 'tau0'
508 | 
509 | 
510 |     plt.grid(True)
511 |     plt.legend()
512 |     if opt.type == 'lamb':
513 |         plt.xlabel('Alpha')
514 |         plt.ylabel('ROCAUC(QUE) - ROCAUC({})'.format(label_name))
515 |         plt.title('ROCAUC(QUE) improvement over ROCAUC({})'.format(label_name))
516 |     else:
517 |         x_label = get_label_name(opt.type)
518 |         plt.xlabel(x_label)
519 |         plt.ylabel('ROCAUC')
520 |         plt.title('ROCAUC of QUE vs baseline methods')
521 | 
522 | 
523 |     fname_append = ''
524 |     if not opt.whiten:
525 |         fname_append += '_nw'
526 |     if opt.fast_whiten:
527 |         fname_append += '_fw'
528 |     if opt.fast_jl:
529 |         fname_append += '_fast'
530 | 
531 | 
532 | 
533 | 
534 |     create_dir(osp.join(utils.res_dir, opt.dir))
535 |     fig_path = osp.join(utils.res_dir, opt.dir, 'baselines_{}{}{}.jpg'.format(opt.type, name, fname_append))
536 |     plt.savefig(fig_path)
537 |     print('figure saved under {}'.format(fig_path))
538 | 
539 | '''
540 | Get the label name to be used on plots.
541 | '''
542 | def get_label_name(name):
543 |     name2label = {'tau0':'naive spectral', 'tau1':'QUE', 'lamb':'Alpha',
544 |                   'dirs':'number of directions (k)'}
545 |     try:
546 |         return name2label[name]
547 |     except KeyError:
548 |         return name
549 | 
550 | '''
551 | Computes the average probability that an outlier score exceeds an inlier score, i.e. the ROC AUC (ignoring ties).
552 | Input:
553 | -inlier and outlier scores, 1D tensors
554 | '''
555 | def auc(inlier_scores, outlier_scores0):
556 | 
557 |     n_inliers, n_outliers = len(inlier_scores), len(outlier_scores0)
558 |     #for very large inputs, the scores can first be moved to cpu:
559 |     #inlier_scores = inlier_scores.to('cpu')
560 |     #outlier_scores0 = outlier_scores0.to('cpu')
561 |     prob_l = []
562 |     chunk_sz = 500
563 |     for i in range(0, n_outliers, chunk_sz):
564 |         start = i
565 |         end = min(n_outliers, i+chunk_sz)
566 |         cur_n = end - start
567 |         outlier_scores = outlier_scores0[start:end]
568 | 
569 |         #average probability of an inlier score lying below the outlier scores
570 |         outlier_scores_exp = outlier_scores.unsqueeze(-1).expand(-1, n_inliers)
571 |         inlier_scores_exp = inlier_scores.unsqueeze(0).expand(cur_n, -1)
572 |         zeros = torch.zeros(cur_n, n_inliers).to(device)
573 |         zeros[outlier_scores_exp > inlier_scores_exp] = 1
574 |         prob = (zeros.sum(-1) / n_inliers).mean().item()
575 |         prob_l.append(prob)
576 |     return np.mean(prob_l) #note: chunks are weighted equally, though the last chunk may be smaller
577 | 
578 | '''
579 | Plot a histogram of the given tensor.
580 | Input:
581 | -X: data
582 | -name: keyword to be used in the file name
583 | '''
584 | def hist(X, name, high=10):
585 |     X = X.cpu().numpy()
586 |     plt.hist(X, 50, label=str(name))
587 | 
588 |     plt.xlabel('projection')
589 |     plt.ylabel('count')
590 |     plt.title('projections of {} onto top covariance dir'.format(name))
591 | 
592 | 
593 |     plt.axis([X.min(), X.max(), 0, high])
594 |     plt.grid(True)
595 | 
596 |     fig_path = osp.join(utils.res_dir, 'eval_proj_{}.jpg'.format(name))
597 |     plt.savefig(fig_path)
598 |     print('figure saved under {}'.format(fig_path))
599 | 
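The `auc` helper a little above is the Mann-Whitney form of the ROC AUC: the probability that a randomly chosen outlier scores higher than a randomly chosen inlier. Up to tie handling it should agree with sklearn (which is listed in requirements.txt); a hedged cross-check, not repo code:

```
import torch
from sklearn.metrics import roc_auc_score

inl = torch.randn(1000)        #inlier scores
out = torch.randn(200) + 1.5   #outlier scores, shifted higher
labels = [0]*len(inl) + [1]*len(out)
scores = torch.cat([inl, out]).tolist()
print(roc_auc_score(labels, scores))  #compare with auc(inl.to(device), out.to(device))
```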
600 | '''
601 | Inlier and outlier histograms.
602 | Input:
603 | -X/Y: inlier/outlier score (or other measurement) distributions according to some score
604 | -score_name: keyword used in the title and file name
605 | '''
606 | def inlier_outlier_hist(X, Y, score_name, high=50):
607 | 
608 | 
609 |     X = X.cpu().numpy()
610 |     Y = Y.cpu().numpy()
611 | 
612 |     n_bins_x = 50
613 |     n_bins_y = max(1, int(n_bins_x * (Y.max()-Y.min()) / (X.max()-X.min())))
614 |     plt.hist(X, n_bins_x, label='inliers')
615 |     plt.hist(Y, n_bins_y, label='outliers')
616 | 
617 |     plt.xlabel('knn distance')
618 |     plt.ylabel('sample count')
619 |     label_name = get_label_name(score_name)
620 |     plt.title('Distance to k-nearest neighbors ({})'.format(label_name))
621 | 
622 | 
623 |     plt.legend()
624 | 
625 |     plt.axis([min(X.min(), Y.min()), 30, 0, high]) #for ads, high y 300
626 |     #plt.axis([min(X.min(), Y.min()), 3, 0, high]) #syn: high x 3
627 |     plt.grid(True)
628 | 
629 |     fig_path = osp.join(utils.res_dir, 'knn_inout_{}.jpg'.format(score_name))
630 |     plt.savefig(fig_path)
631 |     print('figure saved under {}'.format(fig_path))
632 | 
633 | 
634 | '''
635 | Create a dataframe for plotting.
636 | -k_l: numbers of dirs; p_l: percentage of all noise combined
637 | -legends: array of legend strings, eg ['a', 'b', 'c', 'd']
638 | '''
639 | def create_df(k_l, acc_l, tau_l, p_l, legends):
640 | 
641 | 
642 |     return pd.DataFrame({legends[0]:k_l, legends[1]:acc_l, legends[2]:tau_l, legends[3]:p_l})
643 | 
644 | '''
645 | Take inner products of the rows in one matrix with the rows in another.
646 | Input:
647 | -2D tensors
648 | '''
649 | def inner(mx1, mx2):
650 |     return (mx1 * mx2).sum(dim=1)
651 | 
652 | '''
653 | Inner product matrix of all pairs of rows.
654 | '''
655 | def inner_mx(mx1, mx2):
656 |     return torch.mm(mx1, mx2.t())
657 | 
658 | 
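A quick illustration of the two helpers above (not repo code): `inner` pairs up rows, while `inner_mx` forms all pairwise inner products.

```
import torch

a = torch.arange(6.).view(2, 3)
b = torch.ones(2, 3)
print((a * b).sum(dim=1))    #inner(a, b): rowwise, tensor([ 3., 12.])
print(torch.mm(a, b.t()))    #inner_mx(a, b): all pairs, 2 x 2
```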
659 | '''
660 | Input: lines is a list of objects, not yet newline-terminated.
661 | '''
662 | def write_lines(lines, path, mode='w'):
663 |     lines1 = []
664 |     for line in lines:
665 |         lines1.append(str(line) + '\n') #'\n', not os.linesep: text mode already translates newlines on write
666 |     with open(path, mode) as file:
667 |         file.writelines(lines1)
668 | 
669 | def read_lines(path):
670 |     with open(path, 'r') as file:
671 |         return file.readlines()
672 | 
673 | '''
674 | Input:
675 | -X: shape (n_sample, n_feat)
676 | '''
677 | def cov(X):
678 |     #mean-center, then form the covariance
679 |     X = X - X.mean(dim=0, keepdim=True)
680 | 
681 |     cov = torch.mm(X.t(), X) / X.size(0)
682 |     return cov
683 | 
684 | ########################
685 | 
686 | def create_df_(acc_mx, probe_mx, height, k, opt):
687 |     #construct probe_count, acc, and dist_count
688 |     dist_count_l = [] #total number of points we compute distances to
689 |     acc_l = []
690 |     probe_l = []
691 |     counter = 0
692 |     n_clusters_ar = [2**(i+1) for i in range(20)]
693 | 
694 |     #i indicates n_clusters
695 |     for i, acc_ar in enumerate(acc_mx):
696 |         n_clusters = n_clusters_ar[i]
697 |         #j is n_bins
698 |         for j, acc in enumerate(acc_ar):
699 |             probe_count = probe_mx[i][j]
700 |             if not opt.glove and not opt.sift:
701 |                 if height == 1 and probe_count > 2000:
702 |                     continue
703 |                 elif probe_count > 3000:
704 |                     continue
705 | 
706 |             # \sum_u^h n_bins^u * n_clusters * k
707 |             exp = np.array([l for l in range(height)])
708 | 
709 |             dist_count = np.sum(k * n_clusters * j**exp)
710 |             if not opt.glove and not opt.sift:
711 |                 if dist_count > 50000:
712 |                     continue
713 |             dist_count_l.append(dist_count)
714 |             acc_l.append(acc)
715 | 
716 |             probe_l.append(probe_count + dist_count)
717 | 
718 |             counter += 1
719 | 
720 |     df = pd.DataFrame({'probe_count':probe_l, 'acc':acc_l, 'dist_count':dist_count_l})
721 |     return df
722 | 
723 | def plot_acc_(acc_mx_l, probe_mx_l, height_l, k, json_data, opt):
724 |     import seaborn as sns
725 |     df_l = []
726 |     height_df_l = []
727 |     for i, acc_mx in enumerate(acc_mx_l):
728 |         probe_mx = probe_mx_l[i]
729 |         height = height_l[i]
730 |         df = create_df_(acc_mx, probe_mx, height, k, opt)
731 |         df_l.append(df)
732 |         height_df_l.extend([height] * len(df))
733 | 
734 |     method, max_loyd = json_data['km_method'], json_data['max_loyd']
735 |     df = pd.concat(df_l, axis=0, ignore_index=True)
736 | 
737 |     height_df = pd.DataFrame({'height': height_df_l})
738 | 
739 |     df = pd.concat([df, height_df], axis=1)
740 | 
741 |     fig = sns.scatterplot(x='probe_count', y='acc', hue='height', data=df)
742 | 
743 |     fig.set_title('') #title and path left as placeholders
744 |     fig_path = osp.join(' ', ' ')
745 |     fig.figure.savefig(fig_path)
746 | 
747 | def np_save(obj, path):
748 |     with open(path, 'wb') as f:
749 |         np.save(f, obj)
750 |     print('saved under {}'.format(path))
751 | 
752 | '''
753 | Memory-efficient.
754 | Ranks of the closest points, excluding the point itself.
755 | Uses l2 dist, or cosine similarity if the data are normalized.
756 | Input:
757 | -data: tensors
758 | -specify k if only interested in the top k results.
759 | -largest: whether to pick the largest values when ranking.
760 | -include_self: include the point itself in the final ranking.
761 | '''
762 | def dist_rank(data_x, k, data_y=None, largest=False, opt=None, include_self=False):
763 | 
764 |     if isinstance(data_x, np.ndarray):
765 |         data_x = torch.from_numpy(data_x)
766 | 
767 |     if data_y is None:
768 |         data_y = data_x
769 |     else:
770 |         if isinstance(data_y, np.ndarray):
771 |             data_y = torch.from_numpy(data_y)
772 |     k0 = k
773 |     device_o = data_x.device
774 |     data_x = data_x.to(device)
775 |     data_y = data_y.to(device)
776 | 
777 |     (data_x_len, dim) = data_x.size()
778 |     data_y_len = data_y.size(0)
779 |     #break into chunks; sizes below are tuned empirically to fit in memory
780 |     #chunk_sz = int(5e6 // data_y_len)
781 | 
782 | 
783 |     if data_y_len > 990000:
784 |         chunk_sz = 600 #1000 if over 1.1 mil
785 | 
786 |     else:
787 |         chunk_sz = 3000
788 | 
789 |     if k+1 > len(data_y):
790 |         k = len(data_y) - 1
791 | 
792 | 
793 |     if device == 'cuda':
794 |         dist_mx = torch.cuda.LongTensor(data_x_len, k+1)
795 |         act_dist = torch.cuda.FloatTensor(data_x_len, k+1)
796 |     else:
797 |         dist_mx = torch.LongTensor(data_x_len, k+1)
798 |         act_dist = torch.FloatTensor(data_x_len, k+1)
799 |     data_normalized = True if opt is not None and opt.normalize_data else False
800 |     largest = largest or data_normalized #normalized data are ranked by inner product, so pick largest
801 | 
802 |     #compute l2 dist, blocking over rows to be memory efficient
803 |     total_chunks = int((data_x_len-1) // chunk_sz) + 1
804 |     y_t = data_y.t()
805 |     if not data_normalized:
806 |         y_norm = (data_y**2).sum(-1).view(1, -1)
807 | 
808 |     for i in range(total_chunks):
809 |         base = i*chunk_sz
810 |         upto = min((i+1)*chunk_sz, data_x_len)
811 |         cur_len = upto-base
812 |         x = data_x[base : upto]
813 | 
814 |         if not data_normalized:
815 |             x_norm = (x**2).sum(-1).view(-1, 1)
816 |             #plus op broadcasts
817 |             dist = x_norm + y_norm
818 |             dist -= 2*torch.mm(x, y_t)
819 |         else:
820 |             dist = torch.mm(x, y_t) #inner-product similarity; ranked with largest=True
821 | 
822 |         topk_d, topk = torch.topk(dist, k=k+1, dim=1, largest=largest)
823 | 
824 |         dist_mx[base:upto, :k+1] = topk
825 |         act_dist[base:upto, :k+1] = topk_d
826 | 
827 |     topk = dist_mx
828 |     if k > 3 and opt is not None and opt.sift:
829 |         #sift contains duplicate points, don't run this in general:
830 |         #if a point appears as its own 2nd-4th neighbor, replace that slot with the top-ranked point.
831 |         identity_ranks = torch.LongTensor(range(len(topk))).to(topk.device)
832 |         topk_0 = topk[:, 0]
833 |         topk_1 = topk[:, 1]
834 |         topk_2 = topk[:, 2]
835 |         topk_3 = topk[:, 3]
836 | 
837 |         id_idx1 = topk_1 == identity_ranks
838 |         id_idx2 = topk_2 == identity_ranks
839 |         id_idx3 = topk_3 == identity_ranks
840 | 
841 |         if torch.sum(id_idx1).item() > 0:
842 |             topk[id_idx1, 1] = topk_0[id_idx1]
843 | 
844 |         if torch.sum(id_idx2).item() > 0:
845 |             topk[id_idx2, 2] = topk_0[id_idx2]
846 | 
847 |         if torch.sum(id_idx3).item() > 0:
848 |             topk[id_idx3, 3] = topk_0[id_idx3]
849 | 
850 | 
851 |     if not include_self:
852 |         topk = topk[:, 1:]
853 |         act_dist = act_dist[:, 1:]
854 |     elif topk.size(-1) > k0:
855 |         topk = topk[:, :-1]
856 |     topk = topk.to(device_o)
857 |     return act_dist, topk
858 | 
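A small usage sketch of `dist_rank` for kNN-distance outlier scores, the kind of score plotted by `inlier_outlier_hist` above (illustrative code, not from the repo):

```
import torch

data = torch.randn(1000, 32)
data[:20] += 5                        #plant a few far-away points
dists, ranks = dist_rank(data, k=10)  #(squared) distances to and indices of the 10 nearest neighbors
scores = dists.mean(-1)               #mean kNN distance as an outlier score
print(scores[:20].mean().item(), scores[20:].mean().item())  #planted points should score higher
```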
862 | """ 863 | def __init__(self): 864 | import re 865 | self.patt = re.compile('[ ;,.?!`\'":|\s~%&*()#$@+-=]') 866 | 867 | def batch_tokenize(self, sent_l): 868 | sent_l2 = [] 869 | for sent in sent_l: 870 | sent_l2.append(self.patt.split(sent)) 871 | 872 | return sent_l2 873 | 874 | class stop_word_filter: 875 | 876 | def filter_words(self, tok_l): 877 | """ 878 | Input: tok_l: list of tokens 879 | """ 880 | tok_l2 = [] 881 | for tok in tok_l: 882 | if tok not in STOP_WORDS: 883 | tok_l2.append(tok) 884 | return tok_l2 885 | 886 | ## This below is due to the authors of spacy, reproduced here as some users have ## 887 | ## reported difficulties installing the language packages required for processig text ## 888 | 889 | # Stop words 890 | STOP_WORDS = set( 891 | """ 892 | a about above across after afterwards again against all almost alone along 893 | already also although always am among amongst amount an and another any anyhow 894 | anyone anything anyway anywhere are around as at 895 | back be became because become becomes becoming been before beforehand behind 896 | being below beside besides between beyond both bottom but by 897 | call can cannot ca could 898 | did do does doing done down due during 899 | each eight either eleven else elsewhere empty enough even ever every 900 | everyone everything everywhere except 901 | few fifteen fifty first five for former formerly forty four from front full 902 | further 903 | get give go 904 | had has have he hence her here hereafter hereby herein hereupon hers herself 905 | him himself his how however hundred 906 | i if in indeed into is it its itself 907 | keep 908 | last latter latterly least less 909 | just 910 | made make many may me meanwhile might mine more moreover most mostly move much 911 | must my myself 912 | name namely neither never nevertheless next nine no nobody none noone nor not 913 | nothing now nowhere 914 | of off often on once one only onto or other others otherwise our ours ourselves 915 | out over own 916 | part per perhaps please put 917 | quite 918 | rather re really regarding 919 | same say see seem seemed seeming seems serious several she should show side 920 | since six sixty so some somehow someone something sometime sometimes somewhere 921 | still such 922 | take ten than that the their them themselves then thence there thereafter 923 | thereby therefore therein thereupon these they third this those though three 924 | through throughout thru thus to together too top toward towards twelve twenty 925 | two 926 | under until up unless upon us used using 927 | various very very via was we well were what whatever when whence whenever where 928 | whereafter whereas whereby wherein whereupon wherever whether which while 929 | whither who whoever whole whom whose why will with within without would 930 | yet you your yours yourself yourselves 931 | """.split() 932 | ) 933 | 934 | contractions = ["n't", "'d", "'ll", "'m", "'re", "'s", "'ve"] 935 | STOP_WORDS.update(contractions) 936 | 937 | for apostrophe in ["‘", "’"]: 938 | for stopword in contractions: 939 | STOP_WORDS.add(stopword.replace("'", apostrophe)) 940 | --------------------------------------------------------------------------------