├── .gitignore ├── .travis.yml ├── Gemfile ├── LICENSE.txt ├── README.md ├── Rakefile ├── bin ├── console └── setup ├── epitome.gemspec ├── lib ├── epitome.rb └── epitome │ ├── corpus.rb │ ├── document.rb │ └── version.rb └── test ├── corpus_test.rb ├── minitest_helper.rb └── test_epitome.rb /.gitignore: -------------------------------------------------------------------------------- 1 | /.bundle/ 2 | /.yardoc 3 | /Gemfile.lock 4 | /_yardoc/ 5 | /coverage/ 6 | /doc/ 7 | /pkg/ 8 | /spec/reports/ 9 | /tmp/ 10 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: ruby 2 | rvm: 3 | - 2.2.0 4 | -------------------------------------------------------------------------------- /Gemfile: -------------------------------------------------------------------------------- 1 | source 'https://rubygems.org' 2 | 3 | # Specify your gem's dependencies in epitome.gemspec 4 | gemspec 5 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 McFreely 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in 13 | all copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | THE SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Epitome 2 | 3 | A small gem to make your text shorter. It's an implementation of the LexRank algorithm. LexRank is designed to summarize a collection of texts, but it works the same way on a single text. 4 | 5 | ## Installation 6 | 7 | Add this line to your application's Gemfile: 8 | 9 | ```ruby 10 | gem 'epitome' 11 | ``` 12 | 13 | And then execute: 14 | 15 | $ bundle 16 | 17 | Or install it yourself as: 18 | 19 | $ gem install epitome 20 | 21 | ## Usage 22 | 23 | First, create some documents. 24 | 25 | ```ruby 26 | document_one = Epitome::Document.new("The cat likes catnip. He rolls and rolls") 27 | document_two = Epitome::Document.new("The cat plays in front of the dog. The dog is placid.") 28 | ``` 29 | 30 | Then, organize your documents in a corpus. 31 | 32 | ```ruby 33 | document_collection = [document_one, document_two] 34 | @corpus = Epitome::Corpus.new(document_collection) 35 | ``` 36 | 37 | Finally, output the summary. 38 | ```ruby 39 | @corpus.summary(3) 40 | ``` 41 | 42 | This returns the three highest-ranked sentences joined into a single string. 
43 | 44 | ## Options 45 | ### Summary options 46 | You can pass arguments to set the length of the summary and the similarity threshold. 47 | ```ruby 48 | @corpus.summary(5, 0.2) 49 | ``` 50 | The length is the number of sentences in the final output. 51 | 52 | The threshold is typically a value between 0.1 and 0.3; 0.2 is considered to give the best results and is the default. 53 | 54 | ### Stopword option 55 | When creating the corpus, you can set the language of the stopword list to be used. 56 | ```ruby 57 | @corpus = Epitome::Corpus.new(document_collection, "fr") 58 | ``` 59 | The default value is English (`"en"`). 60 | You can find more about the stopword filter [here](https://github.com/brenes/stopwords-filter). 61 | ## Contributing 62 | 63 | 1. Fork it ( https://github.com/McFreely/epitome/fork ) 64 | 2. Create your feature branch (`git checkout -b my-new-feature`) 65 | 3. Commit your changes (`git commit -am 'Add some feature'`) 66 | 4. Push to the branch (`git push origin my-new-feature`) 67 | 5. Create a new Pull Request 68 | -------------------------------------------------------------------------------- /Rakefile: -------------------------------------------------------------------------------- 1 | require "bundler/gem_tasks" 2 | 3 | require "rake/testtask" 4 | 5 | Rake::TestTask.new do |t| 6 | t.test_files = FileList['test/*_test.rb'] 7 | end 8 | 9 | task default: :test 10 | -------------------------------------------------------------------------------- /bin/console: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env ruby 2 | 3 | require "bundler/setup" 4 | require "epitome" 5 | 6 | # You can add fixtures and/or initialization code here to make experimenting 7 | # with your gem easier. You can also use a different console, if you like. 8 | 9 | # (If you use this, don't forget to add pry to your Gemfile!) 
10 | # require "pry" 11 | # Pry.start 12 | 13 | require "irb" 14 | IRB.start 15 | -------------------------------------------------------------------------------- /bin/setup: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -euo pipefail 3 | IFS=$'\n\t' 4 | 5 | bundle install 6 | 7 | # Do any other automated setup that you need to do here 8 | -------------------------------------------------------------------------------- /epitome.gemspec: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | lib = File.expand_path('../lib', __FILE__) 3 | $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib) 4 | require 'epitome/version' 5 | 6 | Gem::Specification.new do |spec| 7 | spec.name = "epitome" 8 | spec.version = Epitome::VERSION 9 | spec.authors = ["McFreely"] 10 | spec.email = ["paulmcfreely@gmail.com"] 11 | 12 | spec.summary = %q{Epitome makes your texts shorter.} 13 | spec.description = %q{An implementation of the LexRank algorithm, which summarizes a corpus of text documents.} 14 | spec.homepage = "https://github.com/McFreely/epitome" 15 | spec.license = "MIT" 16 | 17 | spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) } 18 | spec.bindir = "exe" 19 | spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) } 20 | spec.require_paths = ["lib"] 21 | 22 | spec.add_development_dependency "bundler", "~> 1.9" 23 | spec.add_development_dependency "rake", "~> 10.0" 24 | spec.add_development_dependency "minitest" 25 | spec.add_dependency "scalpel" 26 | spec.add_dependency "stopwords-filter" 27 | end 28 | -------------------------------------------------------------------------------- /lib/epitome.rb: -------------------------------------------------------------------------------- 1 | require "epitome/version" 2 | 3 | module Epitome 4 | end 5 | 6 | require "epitome/document" 7 | require "epitome/corpus" 8 | 
-------------------------------------------------------------------------------- /lib/epitome/corpus.rb: -------------------------------------------------------------------------------- 1 | require 'matrix' 2 | require 'stopwords' 3 | 4 | module Epitome 5 | class Corpus 6 | attr_reader :original_corpus 7 | def initialize(document_collection, lang="en") 8 | # lang is the language used to initialize the stopword list 9 | @lang = lang 10 | 11 | # Massage the document_collection into a more workable form 12 | @original_corpus = {} 13 | document_collection.each { |document| @original_corpus[document.id] = document.text } 14 | @clean_corpus = {} 15 | @original_corpus.each do |key, value| 16 | @clean_corpus[key] = clean value 17 | end 18 | 19 | # Dictionary of term-frequency for each word 20 | # to avoid unnecessary computations 21 | @word_tf_doc = {} 22 | 23 | # Just the sentences 24 | @sentences = @original_corpus.values.flatten 25 | 26 | # The number of documents in the corpus 27 | @n_docs = @original_corpus.keys.size 28 | 29 | end 30 | 31 | def summary(summary_length, threshold=0.2) 32 | s = @clean_corpus.values.flatten 33 | # n is the number of sentences in the total corpus 34 | n = @clean_corpus.values.flatten.size 35 | 36 | # Vector of Similarity Degree for each sentence in the corpus 37 | degree = Array.new(n) {0.00} 38 | 39 | # Square matrix of dimension n = number of sentences 40 | cosine_matrix = Matrix.build(n) do |i, j| 41 | if idf_modified_cosine(s[i], s[j]) > threshold 42 | degree[i] += 1.0 43 | 1.0 44 | else 45 | 0.0 46 | end 47 | end 48 | 49 | # Similarity Matrix 50 | similarity_matrix = Matrix.build(n) do |i,j| 51 | degree[i] == 0 ? 
0.0 : ( cosine_matrix[i,j] / degree[i] ) 52 | end 53 | 54 | # Random walk à la PageRank 55 | # in the form of a power method (the last argument is the error tolerance) 56 | results = power_method similarity_matrix, n, 0.85 57 | 58 | # Pair each original sentence with its score and keep the top-ranked ones 59 | # scores => { sentence => score } => summary text 60 | h = Hash[@sentences.zip(results)] 61 | return h.sort_by {|k, v| v}.reverse.first(summary_length).to_h.keys.join(" ") 62 | end 63 | 64 | private 65 | def clean(sentence_array) 66 | # Lowercase the sentences and strip stopwords to avoid unnecessary operations 67 | # 68 | # Create stopword filter 69 | filter = Stopwords::Snowball::Filter.new @lang 70 | sentence_array.map do |s| 71 | s = s.downcase 72 | filter.filter(s.split).join(" ") 73 | end 74 | end 75 | 76 | def n_docs_including_w(word) 77 | # Count the number of documents in the corpus containing the word 78 | # Look for the word in the dictionary first, calculate if not present 79 | @word_tf_doc.fetch(word) do |w| 80 | count = 0 81 | docs = [] 82 | 83 | # Concatenate each document's sentences to make it easier to search 84 | @clean_corpus.values.each { |sentences| docs << sentences.join(" ") } 85 | 86 | # Here, we use an interpolated string instead of a regex to avoid 87 | # weird corner cases 88 | docs.each { |s| count += 1 if s.include? "#{word}" } 89 | 90 | @word_tf_doc[w] = count 91 | count 92 | end 93 | end 94 | 95 | def idf(word) 96 | # Inverse document frequency of word: 97 | # log(number of documents / number of documents containing word), in floats 98 | result = Math.log( @n_docs.to_f / n_docs_including_w(word) ) 99 | 100 | # Return 1 to avoid words having all the same tf-idf by multiplying by 0 101 | return result == 0 ? 
1.0 : result 102 | end 103 | 104 | def tf(sentence, word) 105 | # Number of occurrences of word in sentence 106 | sentence.scan(word).count 107 | end 108 | 109 | def sentence_tfidf_sum(sentence) 110 | # Sum of the squared tf-idf values of the words in a sentence 111 | sentence.split(" ") 112 | .map { |word| (tf(sentence, word) * idf(word))**2 } 113 | .inject(:+) 114 | end 115 | 116 | def idf_modified_cosine(x, y) 117 | # Compute the similarity between two sentences x, y 118 | # using the idf-modified cosine formula 119 | numerator = (x + " " + y).split(" ") 120 | .map { |word| tf(x, word) * tf(y, word) * (idf(word)**2) } 121 | .inject(:+) 122 | 123 | denominator = Math.sqrt(sentence_tfidf_sum(x)) * Math.sqrt(sentence_tfidf_sum(y)) 124 | numerator / denominator 125 | end 126 | 127 | def power_method(matrix, n, e) 128 | # Accepts a stochastic, irreducible & aperiodic matrix, 129 | # its size n and an error tolerance e 130 | # Returns the eigenvector p as an array 131 | 132 | # init values 133 | t = 0 134 | p = Vector.elements(Array.new(n) { 1.0 / n }) 135 | sigma = 1 136 | 137 | until sigma < e 138 | t += 1 139 | prev_p = p.clone 140 | p = matrix.transpose * prev_p 141 | sigma = (p - prev_p).norm 142 | end 143 | 144 | p.to_a 145 | end 146 | end 147 | end 148 | 149 | -------------------------------------------------------------------------------- /lib/epitome/document.rb: -------------------------------------------------------------------------------- 1 | require "scalpel" 2 | require "securerandom" 3 | 4 | module Epitome 5 | class Document 6 | attr_reader :id 7 | attr_reader :text 8 | def initialize(text) 9 | @id = SecureRandom.uuid 10 | @text = Scalpel.cut text 11 | end 12 | end 13 | end 14 | -------------------------------------------------------------------------------- /lib/epitome/version.rb: -------------------------------------------------------------------------------- 1 | module Epitome 2 | VERSION = "0.3.0" 3 | end 4 | 
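The similarity computed by `Corpus#idf_modified_cosine` in lib/epitome/corpus.rb is LexRank's idf-modified cosine. The sketch below restates it as standalone Ruby on pre-tokenized sentences, under a few assumptions: the helper names (`term_frequency`, `inverse_document_frequency`, `tfidf_norm`, `modified_cosine`) are illustrative and not part of the gem, words are matched as whole tokens rather than substrings, every word is assumed to occur in at least one document, and a word found in every document gets an idf of 0 rather than the gem's fallback of 1.0.

```ruby
# Hypothetical standalone helpers; none of these names exist in the gem.

# Number of times `word` occurs among the tokens of a sentence.
def term_frequency(tokens, word)
  tokens.count(word)
end

# log(N / n_w) over a list of tokenized documents.
# Float division keeps the ratio from being truncated before the logarithm.
# Assumes `word` occurs in at least one document.
def inverse_document_frequency(word, documents)
  containing = documents.count { |doc| doc.include?(word) }
  Math.log(documents.size.to_f / containing)
end

# Square root of the sum of (tf * idf)^2 over the distinct words of a sentence.
def tfidf_norm(tokens, documents)
  Math.sqrt(tokens.uniq.sum { |w|
    (term_frequency(tokens, w) * inverse_document_frequency(w, documents))**2
  })
end

# idf-modified cosine between two tokenized sentences x and y.
def modified_cosine(x, y, documents)
  shared = x & y # distinct words present in both sentences
  numerator = shared.sum { |w|
    term_frequency(x, w) * term_frequency(y, w) *
      inverse_document_frequency(w, documents)**2
  }
  denominator = tfidf_norm(x, documents) * tfidf_norm(y, documents)
  denominator.zero? ? 0.0 : numerator / denominator
end

# Three tiny one-sentence documents, already lowercased and tokenized.
documents = [%w[cat likes catnip], %w[cat plays dog], %w[dog placid]]
puts modified_cosine(documents[0], documents[1], documents)
```

Because the numerator and both norms use the same `(tf * idf)^2` terms, a sentence compared with itself scores 1.0 and two sentences with no shared words score 0.0, which is what makes a fixed threshold such as 0.2 meaningful.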
-------------------------------------------------------------------------------- /test/corpus_test.rb: -------------------------------------------------------------------------------- 1 | require "minitest/autorun" 2 | 3 | require "epitome/document" 4 | require "epitome/corpus" 5 | 6 | class CorpusTest < Minitest::Test 7 | def setup 8 | doc_one = "The cat likes to eat pasta. He wants more each time. Gorbachev was seen trynig to defuse the tension in the room with one of his hallmark jokes." 9 | doc_two = "Dog dog dog. Gorbachev prefer pesto sauce." 10 | @document_one = Epitome::Document.new(doc_one) 11 | @document_two = Epitome::Document.new(doc_two) 12 | document_collection = [@document_one, @document_two] 13 | @corpus = Epitome::Corpus.new(document_collection) 14 | end 15 | 16 | def test_document_preparation 17 | refute_empty @document_one.id 18 | refute_empty @document_two.id 19 | 20 | assert_equal ["The cat likes to eat pasta.", "He wants more each time.", "Gorbachev was seen trynig to defuse the tension in the room with one of his hallmark jokes."], @document_one.text 21 | assert_equal ["Dog dog dog.", "Gorbachev prefer pesto sauce."], @document_two.text 22 | end 23 | 24 | def test_corpus_generation 25 | assert_equal 2, @corpus.original_corpus.keys.size 26 | assert_equal 2, @corpus.original_corpus.values.size 27 | end 28 | 29 | def test_summary 30 | summary = @corpus.summary(4, 0.2) 31 | 32 | refute_empty summary 33 | assert_equal String, summary.class 34 | end 35 | 36 | def test_weight 37 | text = "The market-sellers outside San Salvador's cathedral have been doing a roaring trade in recent days. A dollar for a poster of a smiling Oscar Romero or how about a baseball cap with his face on it? A driver winds down his window and stops outside a stall to hand over some cash for a t-shirt and then goes on his way. People are getting ready for a day of celebration. 
At least 250,000 people are expected to descend on the small capital of San Salvador on Saturday as they witness the beatification of one of the region's biggest heroes. Archbishop Oscar Romero was not just a churchman. He took a stand during El Salvador's darkest moments. El Salvador formally apologised for the murder of Archbishop Romero in 2010 At least 250,000 are expected in San Salvador to celebrate the beatification of Oscar Romero When the US-backed Salvadorean army was using death squads and torture to stop leftist revolutionaries from seizing power, he was not afraid to speak out in his weekly sermons. The law of God which says thou shalt not kill must come before any human order to kill. It is high time you recovered your conscience, he said in his last homily in March 1980, calling on the National Guard and police to stop the violence. I implore you, I beg you, I order you in the name of God: Stop the repression. That was a sermon that cost him his life. A day later, while giving mass, he was hit through the heart by a single bullet, killed by a right-wing death squad. With him died hopes of peace. In the months after Oscar Romero's assassination, the violence intensified and more than a decade of civil war followed. The conflict left around 80,000 people dead. Controversial figure Flying in to El Salvador, you land at the international airport named after Archbishop Romero. Your passport is stamped with his little portrait too. Small details that show he has a big following here. But he was not a figure loved by all. For some, he was more guerrilla than a man of God&*& He wasn't political but he lived in a very conflictive political time. Everything was politicised, says Father Jesus Delgado, Oscar Romero's friend and personal assistant. There was a line in the middle and the ones who supported the government were good and the ones who were against the government were bad - it was that simple. But he also faced opposition within the Church. 
Oscar Romero's path to sainthood had been stalled for years Several conservative Latin American cardinals in the Vatican blocked his beatification for years because they were concerned his death was prompted more by his politics than by his preaching. We cannot overlook that many of his most vocal opponents were in the church, says Professor Michael Lee, a theologian at Fordham University. It was not just a matter of faith and politics as two separate things but the political dimension of faith itself. Some linked Father Romero to Liberation Theology. It was a movement that grew out of the region's poverty and inequality with the belief that the Church could play a role in bringing about social change. Some radical priests became involved in revolutionary movements but friends of Oscar Romero say he was not one of them. Symbolic decision In the unstable political context of El Salvador though, there was a lot of mistrust. It took decades for that mentality to change. Not until Pope Francis became the first Latin American pontiff was his beatification unblocked. The Pope declared him a martyr who had died because of hatred of his faith, ending the decades-long debate. Francis becoming Pope represents a whole sea change because the Latin American church is now in charge of the universal church, says Dr Austen Ivereigh, author of a biography of Pope Francis. That's why this has huge symbolic significance, the unblocking of the cause of Romero. It really does signal the arrival of the Latin American church in Rome. Romero's younger brother has said that he remembers his brother as a committed man For Oscar Romero's supporters in El Salvador, this about turn has been a long time coming. Romero was their hero and that he is recognised as a saint of the church gives them huge affirmation and encouragement and inspiration, says Julian Filochowski, who is the chair of the Oscar Romero Trust in the UK. He's like Martin Luther King. 
It puts him in that same orbit as great iconic figures. In a country where religion is all important, he also divides opinions. The Catholic Church is undoubtedly powerful but more than a third of people here now identify themselves as evangelical. Several I spoke to said they did not recognise him as a saintly figure. Rising violence Flying the flag for the Romero family on Saturday is Gaspar Romero, the Archbishop's youngest brother who says he remembers his sibling as a hard-working and committed man. He was always very humble and dedicated to his studies, he says. He was committed to protecting the poor, if he was alive today he would be doing the same work. And it is work that many feel is more relevant than ever - El Salvador is fast becoming one of the most violent places in the world. Many feel the country is in a worse place than it ever was during civil conflict. This time though, there is no Oscar Romero to make a stand." 38 | 39 | 40 | @document = Epitome::Document.new(text) 41 | @corpus = Epitome::Corpus.new([@document]) 42 | puts "-----" 43 | puts text 44 | puts "-----" 45 | 46 | summary = @corpus.summary(10, 0.2) 47 | 48 | puts summary 49 | puts "-----" 50 | 51 | refute_empty summary 52 | end 53 | end 54 | -------------------------------------------------------------------------------- /test/minitest_helper.rb: -------------------------------------------------------------------------------- 1 | $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__) 2 | require 'epitome' 3 | 4 | require 'minitest/autorun' 5 | -------------------------------------------------------------------------------- /test/test_epitome.rb: -------------------------------------------------------------------------------- 1 | require 'minitest_helper' 2 | 3 | class TestEpitome < Minitest::Test 4 | def test_that_it_has_a_version_number 5 | refute_nil ::Epitome::VERSION 6 | end 7 | 8 | def test_it_loads_document_and_corpus 9 | assert defined?(::Epitome::Document) 10 | assert defined?(::Epitome::Corpus) 11 | end 12 | end 13 | 
--------------------------------------------------------------------------------
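`Corpus#power_method` in lib/epitome/corpus.rb ranks sentences by repeatedly multiplying a score vector by the transposed similarity matrix until the change between two iterations falls below a tolerance. A minimal standalone sketch of that iteration follows, assuming plain nested arrays in place of `Matrix` and `Vector`. The name `power_iteration`, the default tolerance of `1e-6` and the iteration cap are assumptions made for illustration, not values taken from the gem.

```ruby
# Hypothetical helper, not part of the gem: power iteration on a
# row-stochastic matrix given as an array of rows.
def power_iteration(matrix, tolerance = 1e-6, max_iterations = 1_000)
  n = matrix.size
  rank = Array.new(n) { 1.0 / n } # start from the uniform distribution

  max_iterations.times do
    # next_rank = M^T * rank, i.e. next_rank[j] = sum over i of M[i][j] * rank[i]
    next_rank = Array.new(n) { |j| (0...n).sum { |i| matrix[i][j] * rank[i] } }

    # Euclidean norm of the change between two successive iterations
    delta = Math.sqrt(next_rank.zip(rank).sum { |a, b| (a - b)**2 })

    rank = next_rank
    break if delta < tolerance
  end

  rank
end

# Each row sums to 1; the middle state is linked to both of the others.
similarity = [
  [0.5,  0.5, 0.0],
  [0.25, 0.5, 0.25],
  [0.0,  0.5, 0.5]
]

p power_iteration(similarity) # approximately [0.25, 0.5, 0.25]
```

The iteration cap is there so that a matrix that never settles cannot loop forever. A small tolerance forces the loop to keep going until the scores stop moving, whereas a large one can stop it after a single multiplication.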