├── README.md
├── libraries
    ├── com.mongodb.jar
    ├── com.mongodb.rhino.jar
    ├── google-api-translate-java-0.95.jar
    ├── goose-1.4.1.jar
    └── jsoup-1.6.1.jar
└── src
    └── proose
        ├── libraries
            ├── goose.js
            └── register.js
        ├── resources
            └── page.js
        ├── routing.js
        ├── settings.js
        └── web
            └── static
                └── index.html


/README.md:
--------------------------------------------------------------------------------
[Proose](http://github.com/mdorn/proose) is a web services wrapper
around the [Goose](http://github.com/jiminoc/goose) HTML content extraction
library.

Proose also has limited (5,000 character maximum) support for the
unofficial [Google Translate Java API](http://code.google.com/p/google-api-translate-java/).

Proose is based on [Prudence](http://threecrickets.com/prudence/), the
RESTful web platform for the JVM. It was inspired by the need for a server-side
implementation of [Readability.js](http://code.google.com/p/arc90labs-readability/).
Goose seems to be the best one in any language; Proose exposes it via a web services API
written primarily in a few lines of server-side JavaScript running on top of Prudence.

To use it, you'll need the JavaScript-enabled edition of Prudence (v1.1). You'll need to install the `proose` source in your instance's `applications` directory, and install or link the included jar dependencies (located in `libraries` in the repo) in the instance's `libraries` directory. These are the dependencies:

* Goose: (http://github.com/jiminoc/goose)
* MongoDB/Rhino integration (http://code.google.com/p/mongodb-rhino/) (Note: used only for the included JSON class. Prior to version 1.1, these jars were included in Prudence, but they have since been moved out of it.)
* JSoup: (http://jsoup.org/packages/jsoup-1.4.1.jar)
* (Optional) Google Translate Java API: (http://code.google.com/p/google-api-translate-java/downloads/list)

Once it's up and running, send it an HTTP POST whose body is JSON containing a `uri`; it returns
a JSON representation of the title and main text of that page:

    curl -i -H "Accept: application/json" -X POST -d '{"uri": "http://threecrickets.com/prudence/rest/"}' http://localhost:8080/proose/page/

    {
        "title": "Prudence: Scalable REST/JVM Web Development Platform",
        "text": "There's a lot of buzz about REST, but also a lot confusion about what it is and what it's good for. This essay attempts to convey REST's simple essence. Let's start, then, not at REST, but at an attempt to create a new architecture for building scalable applications. Our goals are for it to be minimal, straightforward, and still have enough features to be productive. We want to learn some lessons from the failures of other, more elaborate and complete architectures. ..."
    }

    curl -i -H "Accept: application/json" -X POST -d '{"uri": "http://threecrickets.com/prudence/legal/", "source_language": "en", "target_language": "fr"}' http://localhost:8080/proose/page/

    {
        "title": "Licence et les marques - Prudence: REST Scalable / Plate-forme de développement Web JVM - Trois grillons",
        "text": "Prudence vous est fourni sous la licence GNU Lesser General Public License version 3.0.\n\nEn outre, nous voulons mentionner expressément que les grillons Trois LLC, le titulaire du droit d'auteur que de tout le code source, n'a pas l'intention de libérer les futures versions du projet Prudence open source sous plusieurs licences restrictives, telles que la GPL.
Si nous changeons la licence de nouveau dans l'avenir, il ne pouvait être pour une licence moins restrictive (comme Apache Public License).\n\nNotez que cet accord ne couvre pas les bibliothèques redistribués tiers. Les bibliothèques vous sont fournis à des fins de commodité, mais restent sous leurs licences respectives, qui sont reproduites dans les «licences / *" fichiers. Pour les demandes d'autorisation spéciales, s'il vous plaît contacter Trois grillons LLC."
    }


--------------------------------------------------------------------------------
/libraries/com.mongodb.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdorn/proose/8da868a50c7f465b8e09bacc508182084ac03b3d/libraries/com.mongodb.jar

--------------------------------------------------------------------------------
/libraries/com.mongodb.rhino.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdorn/proose/8da868a50c7f465b8e09bacc508182084ac03b3d/libraries/com.mongodb.rhino.jar

--------------------------------------------------------------------------------
/libraries/google-api-translate-java-0.95.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdorn/proose/8da868a50c7f465b8e09bacc508182084ac03b3d/libraries/google-api-translate-java-0.95.jar

--------------------------------------------------------------------------------
/libraries/goose-1.4.1.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdorn/proose/8da868a50c7f465b8e09bacc508182084ac03b3d/libraries/goose-1.4.1.jar

--------------------------------------------------------------------------------
/libraries/jsoup-1.6.1.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdorn/proose/8da868a50c7f465b8e09bacc508182084ac03b3d/libraries/jsoup-1.6.1.jar

--------------------------------------------------------------------------------
/src/proose/libraries/goose.js:
--------------------------------------------------------------------------------
importClass(com.jimplush.goose.Configuration, com.jimplush.goose.ContentExtractor)
importClass(org.apache.commons.lang.StringEscapeUtils)
try {
    importClass(com.google.api.translate.Language)
    importClass(com.google.api.translate.Translate)
} catch(error) {
    application.logger.warning("Google Translate Java API not found: Translation feature unavailable.")
}
document.execute('register/')

var Goose = Goose || function() {
    var Public = {
        config: null,
        extractor: null,
        // stays null when the optional Google Translate library is missing
        translate: null,
        Extractor: function() {
            this.extract = function(uri, srclang, tlang) {
                // declared locally so it does not leak into the global scope
                var retval
                try {
                    // use Goose to extract article title and main text
                    var article = Public.extractor.extractContent(String(uri))
                    retval = {
                        "title": String(article.getTitle()),
                        "text": String(StringEscapeUtils.unescapeHtml(article.getCleanedArticleText()))
                    }
                } catch(error) {
                    application.logger.debug(error + ": " + uri)
                    return null
                }
                if (srclang && tlang) {
                    // use Google Translate Java API
                    var title = Public.translate.execute(retval.title, Language.fromString(srclang), Language.fromString(tlang))
                    var text = Public.translate.execute(retval.text, Language.fromString(srclang), Language.fromString(tlang))
                    retval = {
                        "title": String(title),
                        "text": String(text)
                    }
                }
                return retval
            }
        }
    }
    // Initialize
    Public.config = register(Configuration, null, {'setEnableImageFetching': false})
    Public.extractor = register(ContentExtractor, Public.config)
    try {
        Public.translate = register(Translate, null, {'setHttpReferrer':
            application.globals.get('proose.settings.httpReferrer')})
    } catch(error) {
        // Google Translate library is optional
    }
    return Public
}()


--------------------------------------------------------------------------------
/src/proose/libraries/register.js:
--------------------------------------------------------------------------------
// Returns the application-wide shared instance of a Java class,
// creating and configuring it on first use.
function register(cls, params, attributes) {
    var namespace = application.globals.get('proose.settings.namespace')
    // don't have access to cls.class.getName() in JS, hence regex
    var clsname = String(cls).match(/ (.*)]/)[1]
    var globname = namespace + '.' + clsname
    var globinst = application.globals.get(globname)
    if (!globinst) {
        var instance = params ? new cls(params) : new cls()
        // named javaClass so it does not shadow the cls parameter
        var javaClass = instance.getClass()
        // can't use java.lang.Class in Rhino, hence no getMethod()
        var methods = javaClass.getMethods()
        var method_names = []
        for (var m in methods) {
            method_names[m] = String(methods[m].getName())
        }
        // each attribute key is a setter name, each value its single argument
        for (var a in attributes) {
            var method_index = method_names.indexOf(a)
            var method = methods[method_index]
            method.invoke(instance, attributes[a])
        }
        globinst = application.getGlobal(globname, instance)
    }
    return globinst
}

--------------------------------------------------------------------------------
/src/proose/resources/page.js:
--------------------------------------------------------------------------------
importClass(com.mongodb.rhino.JSON)
document.execute('goose/')

function handleInit(conversation) {
    conversation.addMediaTypeByName('text/html')
    conversation.addMediaTypeByName('application/json')
}

function handleGet(conversation) {
    // return 501 // not implemented
    // hard-coded demo: extracts a fixed page and translates it from English into French
    var uri = "http://threecrickets.com/prudence/legal/"
    var goose = new Goose.Extractor()
    var result = goose.extract(uri, 'en', 'fr')
    return JSON.to(result, true)
}

function handlePost(conversation) {
    // Example usage:
    // ## Get article title and body:
    // curl -i -H "Accept: application/json" -X POST \
    //      -d '{"uri": "http://threecrickets.com/prudence/rest/"}' http://localhost:8080/proose/page/
    // ## Get translation of an article from English into French:
    // curl -i -H "Accept: application/json" -X POST \
    //      -d '{"uri": "http://threecrickets.com/prudence/legal/", "source_language": "en", "target_language": "fr"}' \
    //      http://localhost:8080/proose/page/
    var text = conversation.entity.text
    var json = JSON.from(String(text))
    var uri = json.uri
    var srclang = json.source_language || null
    var tlang = json.target_language || null
    var goose = new Goose.Extractor()
    var result = goose.extract(uri, srclang, tlang)
    if (!result) {
        return 404
    } else {
        return JSON.to(result, true)
    }
}

--------------------------------------------------------------------------------
/src/proose/routing.js:
--------------------------------------------------------------------------------
document.execute('defaults/application/routing/')
router.capture('page/', 'page/')

--------------------------------------------------------------------------------
/src/proose/settings.js:
--------------------------------------------------------------------------------
document.execute('defaults/application/settings/')
showDebugOnError = false

applicationName = 'Proose'
applicationDescription = "Exposes Goose's text extraction functionality via a Web Services API based on the Prudence framework."
applicationAuthor = 'Matt Dorn'
applicationHomeURL = 'http://github.com/mdorn/proose'
applicationContactEmail = 'matt.dorn@gmail.com'

predefinedGlobals['proose.settings.httpReferrer'] = 'http://example.com'
predefinedGlobals['proose.settings.namespace'] = 'proose'

--------------------------------------------------------------------------------
/src/proose/web/static/index.html:
--------------------------------------------------------------------------------

welcome to proose.

--------------------------------------------------------------------------------
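
The README describes the request format as a JSON body with a `uri` field plus optional `source_language` and `target_language` fields. A client could assemble that body with a small helper like the one below. This is a minimal sketch and not part of the Proose source: the name `buildProoseRequest` and its validation rules are hypothetical. It assumes the two language fields should only be sent as a pair, because `goose.js` translates only when both are present.

```javascript
// Hypothetical client-side helper; not part of the Proose source.
// Builds the JSON body that page.js reads in handlePost().
function buildProoseRequest(uri, sourceLanguage, targetLanguage) {
    if (typeof uri !== 'string' || uri.length === 0) {
        throw new Error('uri must be a non-empty string')
    }
    var body = { uri: uri }
    // goose.js translates only when both languages are given,
    // so this sketch rejects a half-specified pair (an assumption).
    if (sourceLanguage || targetLanguage) {
        if (!sourceLanguage || !targetLanguage) {
            throw new Error('source and target languages must be given together')
        }
        body.source_language = sourceLanguage
        body.target_language = targetLanguage
    }
    return JSON.stringify(body)
}

// The two request bodies used by the curl commands in the README.
var plain = buildProoseRequest('http://threecrickets.com/prudence/rest/')
var translated = buildProoseRequest('http://threecrickets.com/prudence/legal/', 'en', 'fr')
console.log(plain)
console.log(translated)
```

The resulting string can be passed to curl's `-d` option, or sent by any HTTP client that POSTs to `/proose/page/` with an `Accept: application/json` header.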