Garfield Nate

Big Fat Hairy Programmer

The Extended Euclidean Algorithm in Perl

| Comments

This week I learned about the extended Euclidean algorithm for finding a linear combination of two numbers that yields their GCD. For example, the GCD of 213 and 171 is 3, and -4(213) + 5(171) = 3. This algorithm is important in the RSA encryption scheme.

I had quite a difficult time getting myself to fully understand how it works. I jumped between Wikipedia, my data structures textbook (don’t buy it), a YouTube video, and this excellent number theory class lecture. The lecture is the best, though I think there may be a typographical error in the recursive formula.

The basic idea uses recursion with an easy base step. We call Euclid(a,b) with a ≥ b:

  • The base case is when b is 0. The GCD of x and 0 is always x, and the coefficients that produce it are 1 and 0: 1*x + 0*0 = x (the second coefficient could actually be anything, since it multiplies 0). So the base case returns (1,0).
  • Any other step starts by recursively calling Euclid(b, a mod b). We know that the GCD of a and b is the same as the GCD of b and a mod b (lemma 12 in the lecture). This recursive call is guaranteed to eventually get to the base case of b = 0.
  • After finding the coefficients for producing the GCD from b and a mod b, we can calculate the ones for producing the GCD from a and b, because a mod b can be put in terms of a and b (see the code comments for the formulas).

To really help myself understand the whole thing, I wrote a Perl script to illustrate it. I put in lots of comments as I worked my way through it.

use strict;
use warnings;
use 5.010;
#start with a >= b
my @nums = sort {$b <=> $a} @ARGV;

gcd(@nums);

#input: two numbers (a,b) a >= b > 0
#output: the coefficients which yield their GCD
sub gcd {
 my ($a, $b) = @_;

 #base case: the GCD of x and 0 is always x,
 #and the coefficients will always be 1 and 0 (or anything) because
 #1*x + 0*0 = x
 if($b == 0){
  say "GCD is $a";
  say "(a,b) = ($a,$b), coefficients = (1,0)";
  say "1x$a + 0x$b = $a";
  return (1, 0);
 }

 #otherwise, we evaluate u and v for k = ub + vr, where r is a mod b
 #gcd(b, a%b) gives the same value
 my $remainder = $a % $b;
 my ($u, $v) = gcd($b, $remainder);
 #now we can find k in terms of a and b because we know r in terms a and b
 #r = a - bq, where q = the whole part of a/b
 #k = ub + vr = ub + v(a - bq) = va + b(u-qv)
 #so the coefficient on a is v, and the coefficient on b is u - qv
 my $x = $v;
 my $q = int($a/$b);
 my $y = $u - $q*$x;
 say "(a,b) = ($a,$b), coefficients are ($x,$y)";
 say "${x}x$a + ${y}x$b = " . ($x*$a + $y*$b);
 return ($x, $y);
}
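
If you save the script as euclid.pl (my name for it) and run it on the numbers from the introduction, the recursion unwinds from the base case back up to the original arguments:

>perl euclid.pl 213 171
GCD is 3
(a,b) = (3,0), coefficients = (1,0)
1x3 + 0x0 = 3
(a,b) = (42,3), coefficients are (0,1)
0x42 + 1x3 = 3
(a,b) = (171,42), coefficients are (1,-4)
1x171 + -4x42 = 3
(a,b) = (213,171), coefficients are (-4,5)
-4x213 + 5x171 = 3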

Feel free to leave a comment if you think that something could be stated more clearly. I hope it helps anyone else trying to learn how the extended Euclidean algorithm works.

Running Perl With Sublime Text 2

| Comments

I’ve been having fun trying out Sublime Text. It’s pretty, fast, and extremely extensible.

The first thing that I wanted was to be able to work well with Perl. I installed Package Control, followed by SublimeLinter, which has the perlcritic command built in. Making this useful requires a little finagling; perlcritic is by no means a quick program (being a really thorough linter for a language which is complex to parse), and the defaults for SublimeLinter cause it to run over and over again as you type. To fix this, I edited Packages/SublimeLinter/SublimeLinter.sublime-settings and changed the “sublimelinter” setting to false. Now, in order to lint the current file, I have to press ctrl+alt+l. (Update: I don’t recommend this for Sublime Text 2 because of speed problems. See this issue on Github. ST3 should be fine, though.)

Next, I wanted to be able to run my Perl scripts. Sublime has the ctrl+b shortcut for running a build for the current file. What the build actually does is specified in either a build file or the project file. To create a new build file for perl, go to Tools->Build System -> New Build System. The build file I’ve seen on different sites for Perl looks like this:

{
    "cmd": ["perl", "$file"],
    "file_regex": ".* at (.*) line ([0-9]*)",
    "selector": "source.perl"
}

Save this as perl.sublime-build. With this, whenever you are working on a Perl file and hit ctrl+b, the command “perl your_file.pl” will be run. This, however, was not good enough for me. Most of the time I am working on tests for a Perl module, so I have to run perl -Ilib t/my_test_file.t. I also want to be able to run individual tests as well as prove using shortcuts.

To do this, we need to turn the module directory into a Sublime Text project. This is pretty simple. First, open the module directory in Sublime Text. Select Project->Save Project As, then choose the name of the project and save it in the top directory of the module. Paste the following simple contents into the project file:

{
 "folders":
 [
  {
   "path": "."
  }
 ]
}

All this does is add the entire directory to the project. Next, we edit the Perl build file to reference the root of the project so we can add the top-level lib directory to our include path:

{
    "cmd": ["perl", "-Ilib", "$file"],
    "working_dir": "$project_path",
    "file_regex": ".* at (.*) line ([0-9]*)",
    "selector": "source.perl"
}

Great! Now we can run Perl on tests contained in module directories. This still works fine for standalone scripts, too.

Now I’d like to run my whole test suite using prove. By default, ctrl+shift+b runs a build variant with the name “Run”, so we’ll just make a prove variant with that name. I’d much rather give it a more descriptive name, but the Sublime shortcut requires this name. You can change the shortcut, but then you wouldn’t be able to use the shortcut for other builds (other languages). It’s all up to you. Here is the final build file:

{
    "cmd": ["perl", "-Ilib", "$file"],
    "working_dir": "$project_path",
    "file_regex": ".* at (.*) line ([0-9]*)",
    "selector": "source.perl",
    "variants": [
        {
            "cmd": ["prove", "-vlr", "--merge"],
            "working_dir": "$project_path",
            "name": "Run",
            "windows": {
                "cmd": ["prove.bat", "-vlr", "--merge"]
            }
        }
    ]
}

Note that I needed a Windows variant for prove, since Sublime doesn’t run build commands through the system shell and so can’t resolve prove to its batch wrapper on its own. You could, alternatively, add “shell”: true to use the system’s command shell so that you don’t need a separate command for Windows.
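
That alternative variant might look something like this (my untested sketch; it assumes the shell can resolve prove on each platform):

{
    "cmd": ["prove", "-vlr", "--merge"],
    "working_dir": "$project_path",
    "name": "Run",
    "shell": true
}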

With this build file in place, I can now press ctrl+b to run any Perl script, with its project lib directory in @INC, and ctrl+shift+b to run prove. Voila!


Managing Global State: The Flip-Flop Operator

| Comments

Today I was faced with another mysterious failing test while writing a test suite for some legacy code. I knew it had to be a problem with persisted state, because this particular test failed only when a particular data set was processed with the same object that had just been used to process another set.

My first step to trying to fix this was to delete all of the values stored in the object during the processing procedure:

    delete $self->{stateDatum1};
    delete $self->{stateDatum2};
    #etc....

Nothing changed. I reduced the problematic code into a small example for this post. First, the module to be tested:

package Demo::Bad::GlobalFlipFlop;
use strict;
use warnings;
use autodie;
use 5.010;

sub new {
 my ($class) = @_;
 my $self = {};
 bless $self, $class;
 return $self;
}

#return true if parsing succeeded, false otherwise.
sub parse {
 my ($self, $file) = @_;
 open my $file_in, '<', $file;

 my $started = 0;
 while( <$file_in> ){

  #flip-flop
  next unless /^=startHere/i .. 0;    # start processing
  $started = 1;
  #continue doing something with file contents...
  say 'hello:)' if(/hello/);
  say 'goodbye:(' if(/goodbye/);
 }
 close $file_in;
 if(not $started){
  say "File not processed; missing '=startHere' line.";
  return;
 }
 return 1;
}

1;

The main idea here is that we are processing some file and returning a boolean representing its validity. The only requirement of validity of the file is that a certain start token is found within it; everything before the start sequence is ignored. Here are valid and invalid example files:

#good_file.txt
=startHere
hello
goodbye

#bad_file.txt- doesn't contain a start sequence
hello
goodbye

Now, the test file:

use strict;
use warnings;
use autodie;
use Test::More tests => 2;
use Demo::Bad::GlobalFlipFlop;

my $good_name = 'good_file.txt';
my $bad_file = 'bad_file.txt';

my $demo = Demo::Bad::GlobalFlipFlop->new();

ok( $demo->parse($good_name) );
ok( not $demo->parse($bad_file) );

The output of running this file:

>perl test.pl
1..2
hello:)
goodbye:(
ok 1
hello:)
goodbye:(
not ok 2
#   Failed test at test.pl line 61.
# Looks like you failed 1 test of 2.

Why did it fail the second test, which involves checking that an invalid file is considered invalid?

The bug is in the line which matches the start token:

next unless /^=startHere/i .. 0; # start processing

The regex, flip-flop operator and 0 were clearly some sort of idiom that I was unfamiliar with. I had only ever used the flip-flop with numbers, such as 1..10, which iterates from numbers 1 through 10. How does it work? Let’s check perlop:

> Each ".." operator maintains its own boolean state, even across calls to a subroutine that contains it. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true *AFTER* which the range operator becomes false again.
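
To see that hidden state in action, here’s a minimal standalone sketch (my own example, not from the original code):

use strict;
use warnings;
use 5.010;

sub scan {
    for (@_) {
        #same idiom as the buggy parser: turns on at START, never turns off
        next unless /^START/ .. 0;
        say "processing: $_";
    }
}

scan('junk', 'START', 'alpha');   #prints START and alpha
scan('beta', 'gamma');            #prints both lines too: the flip-flop
                                  #remembered its state from the first call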

The mysterious line thus worked like this:

  • Skip lines of the input file until the left side, a match for the start token, is true.
  • Don’t skip lines again until the right side, 0, evaluates as true (which never happens).
  • The state of this flip-flop operator is stored between subsequent calls to the subroutine. It’s a hidden global variable!

Usually flip-flop operators are used in contexts that are guaranteed a reset after iteration (such as 1..10). Not so here! I replaced the offending code with some that keeps state for me:
my $started = 0;
while(<$file_in>){
    if(/^=startHere/i){
        $started = 1;
    }
    next unless $started;
    #continue processing...
}

With this, everything works as expected:

>perl test.pl
1..2
ok 1
File not processed; missing '=startHere' line.
ok 2

Note that this bug only presented itself to me because I changed the legacy standalone script to be its own module, creating the possibility of storing state between subroutine calls.

When Not to Use Perl's Implicit Close; Suffering From Buffering

| Comments

This post is a quick note on a bug I had difficulty tracking down.

One nice feature of Perl, introduced long before my time, is that of implicit closing. Perl closes filehandles for you when you forget (maybe on purpose). So the following is not a resource leak as a standalone script:

    open my $file, '>:utf8', '/path/to/new/file'
        or die "couldn't open file: $!";
    print $file 'Hello!';

When the script finishes, Perl will close $file for you, so you can be nice and lazy. The caveat to this is that the variable $. isn’t reset as it would be with a normal close (see docs here). $. holds the current line number from the last file read. So if you were processing a file line-by-line and found an error, you might print an error like bad value foo on line XYZ using the $. variable for XYZ. I raised a question about this on StackOverflow.

Today I found another case where not explicitly closing a filehandle means trouble. I was working on testing a modulino-style script with flexible outputs. You can call a method to set the handle that this script prints to. In my test script, I was setting the handle to be some filehandle and then checking the contents of the file against a string. The problem? The file was always empty at run time, but contained what I expected it to when I manually inspected it. Here’s some example broken code:

#ImplicitClose.pm
package Demo::Bad::ImplicitClose;
use strict;
use warnings;
sub new {
    my ($class) = @_;
    my $self = {};
    bless $self, $class;
    return $self;
}

sub output_fh {
    my ( $self, $fh ) = @_;
    if ($fh) {
        if ( ref($fh) eq 'GLOB' ) {
            $self->{output_fh} = $fh;
        }
        else {
            open my $fh2, '>', $fh or die "Couldn't open $fh";
            $self->{output_fh} = $fh2;
        }
    }
    return $self->{output_fh};
}

sub some_long_method {
    my ($self, $text) = @_;
    print { $self->{output_fh} } $text;
}
1;
#test.pl
use strict;
use warnings;
use autodie;
use Test::More tests => 1;
use File::Slurp;
use Demo::Bad::ImplicitClose;
my $file_name = 'file1.txt';
#make sure we pass the test from outputting something *this* run
unlink $file_name if -e $file_name;
my $print = 'some junk';
my $demo = Demo::Bad::ImplicitClose->new();
$demo->output_fh($file_name);
$demo->some_long_method($print);
my $contents = read_file($file_name);
is($contents, $print);

If you run test.pl, you’ll see that its one and only test fails:

>perl -I[folder where you put the Demo directory] test.pl
1..1
not ok 1
#   Failed test at test.pl line 68.
#          got: ''
#     expected: 'some junk'
# Looks like you failed 1 test of 1.

Then, when you inspect the contents of file1.txt, you have:

some junk

What happened here? I was suffering from buffering. Because neither test.pl nor ImplicitClose.pm closed the file, it was still open when I tried to read it. Nothing had been written to it yet because the amount printed was so small that it had to wait in the buffer, either until there was more to write or until the file was closed, which would flush the buffer. The implicit close wouldn’t be performed until the filehandle’s reference count reached 0, and the $demo object still held a reference to it. So the test would have worked fine if I had assigned undef to $demo, or just closed the filehandle.
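
A minimal fix along those lines, using the module’s own output_fh accessor to fetch the handle back (a sketch against the test file above):

#at the end of test.pl: close the handle to flush the buffer before reading
$demo->some_long_method($print);
close($demo->output_fh);   #flushes the buffered 'some junk' to disk
my $contents = read_file($file_name);
is($contents, $print);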

Watch those implicit closes.

Testing Perl Distributions With Test Subdirectories

| Comments

Normally I run my test suites with the prove utility:

prove -vl --merge

The v option turns on verbose processing, and the l option adds lib/ to the include path. prove then runs all of the tests in the t/ folder.

Today, I had a new problem. I merged multiple distributions into one (without losing any Git history!), and each had a test suite that I wanted to keep separate. Naturally, I moved the tests from each distribution into its own subdirectory under t/. However, this time when I ran prove -vl, I got this message:

Files=0, Tests=0,  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
Result: NOTESTS

Dubious… Well, I needed to know how to test with subdirectories in the t/ folder, so I looked at the prove documentation and found the -r option. The r stands for “recurse”, meaning that the test files would be found by recursing into the directories of the distribution (starting at the top in the root of the distribution). That turned out to be exactly what I needed!

prove -vlr
t/parser/01-testParser.t
...
All tests successful.
Files=28, Tests=1815, 211 wallclock secs ( 0.92 usr + 0.28 sys = 1.20 CPU)
Result: PASS

Woohoo!

Also, both MakeMaker and Module::Build recurse in the same way during module testing. If you use Dist::Zilla, then you’ll probably have the plugins [MakeMaker] and [ModuleBuild]. Using these, dzil test will recurse in the same way.
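
For reference, a bare-bones dist.ini along those lines might look like this (the name, author, and plugin list are my own illustration, not from any real distribution):

name    = My-Module
version = 0.001
author  = A. U. Thor
license = Perl_5
copyright_holder = A. U. Thor

[GatherDir]
[MakeMaker]
[ModuleBuild]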

Perl Tip: Don't Use a Makefile for Your Module

| Comments

During my internship at SoarTech, I got a chance to learn a lot more about creating Perl modules. I put together a package of scripts for converting old file formats for speech recognition grammars, and I thought it worked beautifully. Of course, to start my module I used the classic tool h2xs:

h2xs -X -n Foo::Bar

My final code was well tested, and quick to install. I asked my co-worker to install it on his machine, and it was just as easy to use.

I was confident when I presented it to the company, until someone asked, “so, these Perl scripts, they work on Windows, Mac, Linux, etc., right?” I told them they should, since Perl is cross-platform. I became (a good kind of) paranoid, and asked my boss to test my code on his Mac (I was using Windows). The thing exploded when fed my script! I couldn’t believe it! What could I have done so wrong?

So, the next day I stayed home a bit to borrow my wife’s Mac and do more testing. But there was a big problem: my module used ExtUtils::MakeMaker to install itself. This has been the standard for years, and the majority of CPAN modules use this for installation. The cpan utility recognizes it, and runs installation automatically. However, MakeMaker is DOOMED! It requires an external tool, make, which you can find on every *nix platform, but everywhere else it has to be installed by the user. Strawberry Perl and ActiveState Perl for Windows come with a version (dmake or nmake). But on Mac, you have to install XCode, a whopping 4 gigabyte distribution for Mac developers.

Dangit…

My solution was to follow Michael Schwern’s advice and convert to using Module::Build, which does not have external dependencies. There happens to be a converter to help you switch. When I ran it on my code, it didn’t give a completely valid output, but the edits I did were minimal. From a user standpoint, the module will still be installed using the cpan utility, so nothing has changed.
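
For illustration, here’s roughly what a minimal Build.PL looks like (reusing the Foo::Bar name from the h2xs example above; the dependency lists are placeholders):

use strict;
use warnings;
use Module::Build;

my $builder = Module::Build->new(
    module_name    => 'Foo::Bar',
    license        => 'perl',
    requires       => { 'perl' => '5.006' },
    build_requires => { 'Test::More' => 0 },
);
$builder->create_build_script;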

When I put the new distribution, with a shiny new Build.PL file, on a Mac, I still had some failed tests, but they weren’t intermingled with the giant BOOM that happens when the cpan utility can’t find make. After fixing a bug or two, my module works on Windows and Mac, and my boss is a happy camper.

Graphing Grammar Parses From CMU Sphinx 4

| Comments

CMU Sphinx comes with some neat grammar parsing stuff that I never knew about. It uses JSGF (as detailed here) and comes with several demos, showing how to use a basic grammar, arc weights, tags, and even getting a JavaScript representation of the final parse! At work I’ve been needing to do some custom processing of the grammar output, but it was more conceptually difficult than I’d planned. So after figuring out how to traverse a parse tree, I decided to write a little application to print out the parse of a given sentence. The Eclipse project for it can be found here.

Given this small grammar:

#JSGF V1.0;
grammar sidTests ;

public <greet> = <greeting> [<person>] [i am <person>];

<greeting> = konnichiwa {language:japanese} | hello {language:english} | guten tag {language:german};

<person> = john {gender:man} | martha {gender:female} | kelly;

If we parse the sentence “konnichiwa kelly i am john”, the program outputs the following:

digraph {
 "greet-2147483647" [label="greet" color=magenta];
 "greet-2147483647" -> "(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646";
 "(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646" [label="(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )" color=green];
 "(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646" -> "greeting-2147483645";
 "greeting-2147483645" [label="greeting" color=magenta];
 "greeting-2147483645" -> "konnichiwa {language:japanese}-2147483644";
 "konnichiwa {language:japanese}-2147483644" [label="konnichiwa {language:japanese}" color=green];
 "konnichiwa {language:japanese}-2147483644" -> "language:japanese-2147483643";
 "language:japanese-2147483643" [label="{language:japanese}" color=red];
 "language:japanese-2147483643" -> "konnichiwa-2147483642";
 "konnichiwa-2147483642" [label="konnichiwa" color=cadetblue shape=box];
 "(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646" -> "(<sidTests.person> = kelly)-2147483641";
 "(<sidTests.person> = kelly)-2147483641" [label="(<sidTests.person> = kelly)" color=green];
 "(<sidTests.person> = kelly)-2147483641" -> "person-2147483640";
 "person-2147483640" [label="person" color=magenta];
 "person-2147483640" -> "kelly-2147483639";
 "kelly-2147483639" [label="kelly" color=green];
 "kelly-2147483639" -> "kelly-2147483638";
 "kelly-2147483638" [label="kelly" color=cadetblue shape=box];
 "(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646" -> "i am (<sidTests.person> = john {gender:man})-2147483637";
 "i am (<sidTests.person> = john {gender:man})-2147483637" [label="i am (<sidTests.person> = john {gender:man})" color=green];
 "i am (<sidTests.person> = john {gender:man})-2147483637" -> "i-2147483636";
 "i-2147483636" [label="i" color=cadetblue shape=box];
 "i am (<sidTests.person> = john {gender:man})-2147483637" -> "am-2147483635";
 "am-2147483635" [label="am" color=cadetblue shape=box];
 "i am (<sidTests.person> = john {gender:man})-2147483637" -> "person-2147483634";
 "person-2147483634" [label="person" color=magenta];
 "person-2147483634" -> "john {gender:man}-2147483633";
 "john {gender:man}-2147483633" [label="john {gender:man}" color=green];
 "john {gender:man}-2147483633" -> "gender:man-2147483632";
 "gender:man-2147483632" [label="{gender:man}" color=red];
 "gender:man-2147483632" -> "john-2147483631";
 "john-2147483631" [label="john" color=cadetblue shape=box];
}

which is all a big mess until we run it through GraphViz and see this:

A graph explaining how our sentence was parsed! I color-coded the parse: green is a RuleSequence, magenta is a RuleParse, light blue is a Token, red is a Tag, and yellow (of which there strangely aren’t any) is a RuleName.
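
To reproduce the image, save the DOT output to a file (parse.dot is my name for it) and run GraphViz’s dot tool:

dot -Tpng parse.dot -o parse.png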

I notice two very strange things here. First, RuleParses don’t have anything as a direct child except for RuleSequences (RuleName is also possible but not shown). So RuleSequences will always be present and may have only one child. Second, text is treated as a sub-component of a tag instead of the other way around. So the text is tagging the tag? I don’t know why they designed it that way, but at least now I have a graph of the parse, so I can figure out how to process it properly.

Getting WordNet Verb Frames With JAWS

| Comments

I love using JAWS to access WordNet. It has a rather extensive API, runs quickly, and doesn’t require too much configuration. All you have to do is download the JAWS binary jar and WordNet, and then specify to JAWS where the WordNet files are (I will demonstrate this later).

One thing that did take a while to figure out was how to get verb frames from it. A verb frame is an indication of how the verb may be used. For example, the entry for the verb “fax” in WordNet contains the following frames:

  • Somebody ----s something to somebody
  • Somebody ----s somebody something
  • Somebody ----s somebody
  • Somebody ----s something
  • Somebody ----s

Let’s look at the WordNet file to see how frames are specified, this time using the verb “sun” as an example. If you open data.verb, you will see that “sun” occurs twice. The frames are listed like so:

02112546 39 v 04 sun 0 insolate 0 solarize 0 solarise ... 01 + 08 00 | expose to the rays of the sun or affect by exposure to the sun

00104147 29 v 02 sun 0 sunbathe ... 03 + 02 00 + 22 00 + 09 01 | expose one's body to the sun

The 01 and 03 indicate the number of verb frames; 08, 02, 22, and 09 are frame numbers. The 00’s and 01 that follow the frame numbers indicate which words in the synset the numbers apply to. 00 means the frame is applicable to all members. The 01 in the second entry means that frame 9 is only for the first word, sun, and not for the second word, sunbathe.

There are two methods provided by JAWS to get frames. They are both contained in the VerbSynset class:

 /**
  * Returns the sentence frames (if any) associated with this verb meaning.
  * Sentence frames are examples of how the verb can be used / applied, and
  * all the frames returned by this method apply to all word forms in the
  * synset.
  *
  * @return Sentence frames associated with all word forms in this synset.
  * @see
  *         Format of Lexicographer Files ("Verb Frames")
  */
 public String[] getSentenceFrames();

 /**
  * Returns the sentence frames (if any) that are specific to a particular
  * word form within this synset, where sentence frames are examples of
  * how the word form can be used / applied.
  *
  * @param  wordForm Word form for which to return sentence frames.
  * @return Sentence frames that are specific to the word form.
  * @see
  *         Format of Lexicographer Files ("Verb Frames")
  */
 public String[] getSentenceFrames(String wordForm);

Keep in mind that the VerbSynset class is completely divorced from the actual orthographic representation of a word, since a synset may belong to several different words. The first method returns all of the frames that apply to every word in the synset, or to all of the frames marked with a 00 in the data.verb file as shown above. The second method returns only the frames which are marked as being specific to a single orthographic representation, specified by the one argument for the method. The return values are complementary and each is incomplete by itself. However, given only the synset offset or only the word to look up, JAWS is returning as much information as is possible. If you know both the synset number and the orthographic representation of a word you need frames for (and I don’t see why you wouldn’t), then the getWordFramesComplete method in the program below demonstrates how to get all of the available frames:

package edu.byu.xnlsoar.test;

import java.util.ArrayList;
import java.util.List;

import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.SynsetType;
import edu.smu.tspell.wordnet.VerbSynset;
import edu.smu.tspell.wordnet.WordNetDatabase;
import edu.smu.tspell.wordnet.impl.file.SampleFrameFactory;
import edu.smu.tspell.wordnet.impl.file.SynsetFactory;
import edu.smu.tspell.wordnet.impl.file.SynsetPointer;

public class DemoFrames {

 private static WordNetDatabase database;
 private static SynsetFactory synsetFactory;
 //initialize everything here
 static{
  System.setProperty("wordnet.database.dir", "./lib/3.0/dict");
  database = WordNetDatabase.getFileInstance();
  synsetFactory = SynsetFactory.getInstance();
 }

 /**
  * @param synsetOffset Synset number to look up frames for
  * @return Frames for the synset; only returns frames
  * which apply to every word in the synset
  */
 public static List<String> getGeneralSynsetFrames(int synsetOffset){
  SynsetPointer sp = new SynsetPointer(SynsetType.VERB, synsetOffset);
  VerbSynset vSynset = (VerbSynset) synsetFactory.getSynset(sp);
  List<String> frames = new ArrayList<String>();
  for(String s : vSynset.getSentenceFrames())
   frames.add(s);
  return frames;
 }

 /**
  * @param lemma Base form of the word you want to look up
  * @return Frames for the lemma; only returns those
  * that are specific to a particular word form within each synset.
  */
 public static List<String> getWordFramesSpecific(String lemma){
  List<String> frames = new ArrayList<String>();
  Synset[] synsets = database.getSynsets(lemma,SynsetType.VERB);
  for(Synset synset : synsets){
   for(String s : ((VerbSynset) synset).getSentenceFrames(lemma))
    frames.add(s);
  }
  return frames;
 }

 /**
  * This one is more difficult to understand...
  * @param lemma Base form of the word you want to look up
  * @return Frames for the lemma; only returns those
  * that match every word in each of the synsets that contain this word.
  */
 public static List<String> getWordFramesGeneral(String lemma){
  List<String> frames = new ArrayList<String>();
  Synset[] synsets = database.getSynsets(lemma,SynsetType.VERB);
  for(Synset synset : synsets){
   for(String s : ((VerbSynset) synset).getSentenceFrames())
    frames.add(s);
  }
  return frames;
 }

 /**
  * This method is the best. It returns all possible frames
  * given a synset number and the accompanying word.
  * @param lemma Base form of the word you want to look up
  * @param synsetOffset Synset number to look up frames for
  * @return All frames for this word within this synset.
  */
 public static List<String> getWordFramesComplete(String lemma, int synsetOffset){
  SynsetPointer sp = new SynsetPointer(SynsetType.VERB, synsetOffset);
  VerbSynset vSynset = (VerbSynset) synsetFactory.getSynset(sp);
  List<String> frames = new ArrayList<String>();
  for(String s : vSynset.getSentenceFrames(lemma))
   frames.add(s);
  for(String s : vSynset.getSentenceFrames())
   frames.add(s);
  return frames;
 }

 /**
  * Prints out several different frame queries for "sun" and "sunbathe"
  */
 public static void main(String[] args) {
  int offset = 104147;//the synset meaning "expose one's body to the sun"
  System.out.println(getGeneralSynsetFrames(offset));//returns 2 frames
  System.out.println(getWordFramesSpecific("sun"));//returns 1 frame
  System.out.println(getWordFramesGeneral("sun"));//returns 3 frames
  System.out.println(getWordFramesComplete("sunbathe",offset));//returns 2 frames
  System.out.println(getWordFramesComplete("sun",offset));//returns 3 frames (different from before)
 }
}

getWordFramesComplete calls both of the available methods in JAWS, retrieving both frames that apply to all words in a synset and the frames that are specific to a single word in the synset.

Review: The Development of Language Processing Strategies: A Cross-linguistic Study Between Japanese and English

| Comments

"The Development of Language Processing Strategies: A Cross-linguistic Study Between Japanese and English"The Development of Language Processing Strategies: A Cross-linguistic Study Between Japanese and English by Reiko Mazuka

My rating: 4 of 5 stars

This is basically an updated version of Mazuka’s PhD thesis. It is a significant work on human sentence processing, drawing on data from a head-final language and a head-initial language.

Mazuka presents data on sentence processing experiments with English speaking adults and Japanese speaking children and adults. She shows that sentence processing strategies are the same in children and adults (though their ability differs with age), and that sentence processing strategies differ cross-linguistically. Her experimental data include probe latency tasks (PLTs) for lexical and semantic information in English and Japanese sentences.

A probe latency task involves a subject listening to a sentence and responding to questions about its contents. In lexical PLT, a subject is asked if the sentence contained the specified lexical item. In semantic PLT, the subject is asked if a sentence contained a portion which has a similar meaning to a specified word or phrase. The experimenter then carefully designs sentences which test the subjects’ ability to process different types of sentences. Mazuka’s experiments measure response time as well as the accuracy of the subjects’ responses. Her findings on cross-linguistic processing differences are as follows:

  • English speakers showed processing differences for main and subordinate clauses, while Japanese speakers did not.
  • English speakers showed different effects for semantic and lexical tasks, while Japanese speakers did not.
  • In English speakers, response time for semantic probe latency tasks involving sentence-initial subordinate clauses (an LB structure) was increased, while in Japanese speakers it was greatly decreased.
  • Japanese speakers’ response times to both lexical and semantic PLTs involving left-branching and coordinate structures were the same; English speakers showed much larger recency effects in LB than in coordinate sentences.

Hypotheses about the human language processing mechanism which assume a single processing strategy do not account for these data. Japanese speakers process LB structures efficiently, and English speakers process RB structures efficiently. This is impossible in a parser which assumes only one processing strategy, and a parser which could efficiently process both would be too powerful to account for real human data. For Japanese speakers to process LB structures as efficiently as English speakers do RB structures, processing must be done bottom-up instead of top-down. Mazuka therefore hypothesizes that Universal Grammar (UG) contains a parameter which determines whether a language is right- or left-branching (RB or LB), and that this is linked with the processing strategy by specifying whether processing should be done top-down or bottom-up. She also hypothesizes that in English, main and subordinate clauses are processed to a different semantic level at some initial encoding stage, accounting for the differences in English main and subordinate clause PLT tasks. This needs to be further tested in the future with PLTs involving two-clause sentences beginning with an explicit subordinator in Japanese.

She states that future research is required to determine the exact relationship between her experimental data and the operation of the human sentence parser as she has hypothesized.

Since some languages, such as German, actually branch in different directions for different types of clauses, her hypothesis needs to be revised to account for this. I’m hoping that her hypotheses can be tested in detail in some sort of cognitive modeling system.

List of Japanese NLP Tools

| Comments

I haven’t tried out all of these so I don’t have comments for everything, but hopefully this list will come in useful for someone.

Morphological analyzers/tokenizers

  • Itadaki: a Japanese processing module for OpenOffice. I’ve done a tiny bit of work and issue documentation on a fork here, and someone forked that to work with a Japanese/German dictionary here.
  • GoSen: Uses sen as a base, and is part of Itadaki; a pure Java version of ChaSen. See my previous post on where to download it from.
  • MeCab: This page also contains a comparison of MeCab, ChaSen, JUMAN, and Kakasi.
  • ChaSen
  • JUMAN
  • Cabocha: Uses support vector machines for morphological and dependency structure analysis.
  • Gomoku
  • Igo
  • Kuromoji: Donated to Apache and used in Solr. Looks nice.

Corpora

  • Hypermedia Corpus
  • TüBa-J/S: Japanese treebank from the University of Tübingen. Not as heavily annotated as I’d hoped. You have to send them an agreement to download it, but it’s free.
  • GSK: Not free, but very cheap.
  • LDC: Expensive unless your institution is a member

Other lexical resources

  • Kakasi: Gives readings for kanji compounds.
  • WordNet: Still under development by NICT. The sense numbers are cross-indexed with those in the English WordNet, so it could be useful for translation. However, there are no verb frames like there are in English.
  • LCS Database: From Okayama University
  • Framenet: Unfortunately you can only do online browsing.
  • Chakoshi: Online collocation search engine.