Ngram

From IntereditionWiki

The Ngram Extractor is an easy-to-use tool that lets anyone perform Ngram operations on their TEI-XML files.

Team

Maria Sukhareva, Russell Horton, Jeffrey C. Witt, Henry Lynam


User Interface

- Google gadget

Source Code

  • Web Service

https://bitbucket.org/hlynam/ngram [1]

  • Google Gadget

https://github.com/jeffreycwitt/ngram [2]

Version 2.0 Google Gadget published to Internet:

http://jeffreycwitt.com/ngrams/ngram2.xml [3]

Version 3.0 Ngram Extractor published to Internet:

http://jeffreycwitt.com/ngrams/ngram3.php [4]


Wednesday Notes

  • TEI-XML file location:

- dropbox location

- dropbox folder

- xslt file (could be in the dropbox folder)

  • Select ngrams (range):

- min: 1

- max: 5

  • Select stats operation:

- frequency

- marginal frequency

- create contingency table

  • Output format:

- xml ngram

- mysql dump

- summary stats
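The ngram range and the frequency operation listed above can be sketched roughly as follows. This is only an illustration, assuming the text has already been split into tokens; the function name `extract_ngrams` and its parameters are hypothetical and are not taken from the project's code.

```python
from collections import Counter

def extract_ngrams(tokens, min_n=1, max_n=5):
    """Count every contiguous ngram whose length lies in [min_n, max_n]."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "call me ishmael call me".split()
counts = extract_ngrams(tokens, min_n=1, max_n=2)
print(counts[("call", "me")])  # frequency of the bigram "call me"
```

Keying the counter on tuples keeps ngrams of different lengths in one table, which would make it straightforward to dump them to XML or MySQL later.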


Thursday Notes

Parsed ngrams for Moby Dick and loaded them into MySQL. Currently running a Dice coefficient calculation over the entire book. Also loaded one million, ten million, and one hundred million trigrams from the Google Ngrams corpus into MySQL. We are seeing query times of 2-5 ms to find counts when words are specified in any position of the ngram.

This is too slow to process large corpora quickly. Running the whole process on Moby Dick (about 200k words) would take around 15 minutes.
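For reference, the Dice coefficient of a word pair is 2·f(xy) / (f(x) + f(y)), where f(xy) is the pair frequency and f(x), f(y) are the marginal frequencies. The notes do not say how the team applied it to the parsed ngrams, so the following is only an in-memory sketch for bigrams. The function name `dice_scores` is hypothetical, and no MySQL is involved here.

```python
from collections import Counter

def dice_scores(tokens):
    """Return the Dice coefficient for every adjacent word pair."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    left = Counter()   # marginal frequency of a word in first position
    right = Counter()  # marginal frequency of a word in second position
    for (w1, w2), c in bigrams.items():
        left[w1] += c
        right[w2] += c
    return {
        (w1, w2): 2 * c / (left[w1] + right[w2])
        for (w1, w2), c in bigrams.items()
    }

scores = dice_scores("moby dick moby dick whale".split())
print(scores[("moby", "dick")])
```

Computing the marginals once in memory avoids a database round trip per lookup, which is the cost the timings above point to.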

Jeff and Henry created an iGoogle app called "ngram extractor".

The extractor makes a call with 5 parameters to Henry's site. Henry's site returns an XML file, which is then parsed in the iGoogle app, which displays the results for the user.

Friday Notes

  • MySQL Optimized Ngram Processor

- Implemented the ngram service in CoffeeScript/JavaScript on Node.js with Express, deployed it to dotCloud, and made it available online.


  • Google Gadgets component is operational

- It communicates with the web service

- It correctly bypasses Google Gadget caching using a random variable addition

- It parses and displays the returned Ngram data

- It is nicely formatted
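The cache bypass mentioned above works by making every request URL unique, so a cached copy is never reused. The gadget itself would do this in JavaScript; the sketch below shows the same idea in Python. The parameter name `nocache` and the example URL are assumptions, not the names the gadget actually uses.

```python
import random
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_cache_buster(url, param="nocache"):
    """Append a random query parameter so each request URL is unique."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, str(random.randint(0, 10**9))))
    return urlunsplit(parts._replace(query=urlencode(query)))

# Hypothetical service URL, used for illustration only.
busted = add_cache_buster("http://example.org/ngram?length=3")
print(busted)
```

The existing query parameters are preserved, so the web service can simply ignore the extra one.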


  • Web Service

- It parses Ngrams with Ngram Length and the Collocation Span

- It calculates frequency

- It handles duplicate words in the collocation spans (almost)

- It produces an alphabetically sorted XML output of the Ngrams with their frequencies
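The last item, an alphabetically sorted XML listing of ngrams with their frequencies, might look something like the sketch below. The element and attribute names (`ngrams`, `ngram`, `frequency`) are invented for illustration; the notes do not document the actual schema returned by the web service.

```python
import xml.etree.ElementTree as ET

def ngrams_to_xml(counts):
    """Serialize {ngram tuple: frequency} as alphabetically sorted XML."""
    root = ET.Element("ngrams")
    for ngram in sorted(counts):  # tuples sort word by word
        item = ET.SubElement(root, "ngram", frequency=str(counts[ngram]))
        item.text = " ".join(ngram)
    return ET.tostring(root, encoding="unicode")

xml_out = ngrams_to_xml({("the", "whale"): 3, ("a", "ship"): 1})
print(xml_out)
```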

Saturday Notes

  • Web Service

Cleaned up function name and parameters

Deleted unused parameters

This allows us to use the ngram.asmx page as a test of POST functionality of the web service

Fixed Ngram extractor bug

Sorted by frequency

Read XML file from supplied URL

Extracted p data
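Reading the XML and extracting the p data could be sketched along these lines. This is an assumption about the approach rather than the service's actual code (the real service is an .asmx page). The helper names are hypothetical, and the sample document is made up so the example runs without network access.

```python
import urllib.request
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def fetch_xml(url):
    """Download an XML document from a URL (not called in this example)."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def paragraph_texts(xml_data):
    """Return the text of every <p>, with or without the TEI namespace."""
    root = ET.fromstring(xml_data)
    paragraphs = []
    for elem in root.iter():
        if elem.tag in ("p", TEI_NS + "p"):
            # Join text from nested elements and normalize whitespace.
            paragraphs.append(" ".join("".join(elem.itertext()).split()))
    return paragraphs

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>Call me <hi>Ishmael</hi>.</p>
<p>Some years ago.</p>
</body></text></TEI>"""
paras = paragraph_texts(sample)
print(paras)
```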

  • Google Gadget

Deleted unused parameters

Published this url as a Google Gadget

Demo of Ngram Extractor

  • Overview of Ngram Extractor microservice (Henry)

The Google Gadget connects to the Web Service, which converts a TEI-XML file into Ngrams in XML format

Add the Google Gadget by searching for:

Ngram Extractor

Demo Values:

- URL Location: http://www.jeffreycwitt.com/plaoul/translation_english/trans_engl_prollecture1.xml

- Ngram Length: 3

- Collocation Span: 4


  • Google Gadget User Interface (Jeff)

Talk about issues in creating and debugging a Gadget

cache, editing, debugging, publishing a gadget

Access an external TEI-XML file

what constitutes text, how to detect sentence boundaries


  • Talk about Ngram algorithm and the parameters (Maria)

Ngram Length, Collocation Span

Frequencies, marginal frequencies


  • Talk about web service (Henry)

Demo web service at:

http://henrylynam.com/ngram/ngram.asmx [5]

Demo Values:

- URL Location: http://www.jeffreycwitt.com/plaoul/translation_english/trans_engl_prollecture1.xml

- Ngram Length: 3

- Collocation Span: 4


Access web service from web server

Show string format of GET in web service

Show returned XML data


Lessons Learned

Google Gadget issues (versions)

Interoperability issues server side: web services to allow developers to work in parallel in different languages / frameworks?

Server monolithic framework issues

Overall: a great experience