
Incorrect Result #2

Open
hosseinadib-hub opened this issue Apr 22, 2011 · 13 comments
@hosseinadib-hub

I followed all your instructions, but I encountered an error in scanData.py (a problem with Unicode characters).

I somehow worked around the error (I'm not sure I really fixed it!). I think the problem is with this code,
"articleBuffer.append((id, ctitle))", which I changed to
"articleBuffer.append((id, ' \u ' + ctitle))".

But "\u" was appended to the start of each record, so I deleted it from the records with SQL commands.

Still, I didn't get a suitable result: my score on WordSim353 is 0.4, but ESA's reported score was 0.75.
My Spearman rank correlation implementation is:

public static double spearmanRankCorrelationCoefficient(Double[] a, Double[] b) {
    check(a, b);
    SortedMap<Double,Double> ranking = new TreeMap<Double,Double>();
    for (int i = 0; i < a.length; ++i) {
        ranking.put(a[i], b[i]);
    }

    double[] sortedB = new double[b.length];
    for (int i = 0; i < b.length; i++) {
        sortedB[i] = b[i];
    }

    Arrays.sort(sortedB);
    Map<Double,Integer> otherRanking = new HashMap<Double,Integer>();
    for (int i = 0; i < b.length; ++i) {
        otherRanking.put(sortedB[i], i);
    }

    // keep track of the last value we saw in the key set so we can check
    // for ties.  If there are ties then Pearson's product-moment
    // coefficient should be returned instead.
    Double last = null;

    // sum of the squared differences in rank
    double diff = 0d;

    // the current rank of the element in a that we are looking at
    int curRank = 0;

    for (Map.Entry<Double,Double> e : ranking.entrySet()) {
        Double x = e.getKey();
        Double y = e.getValue();
        // check that there are no tied rankings
        if (last == null)
            last = x;
        else if (last.equals(x))
            // if there was a tie, return the correlation instead.
            return correlation(a, b);
        else
            last = x;

        // determine the difference in the ranks for both values
        int rankDiff = curRank - otherRanking.get(y).intValue();
        diff += rankDiff * rankDiff;

        curRank++;
    }

    return 1 - ((6 * diff) / (a.length * (a.length * a.length - 1)));
}
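For cross-checking a Java implementation against the same formula, here is a minimal Python sketch of Spearman's rank correlation (assuming no tied values; the function name and sample data are illustrative, not from this project):

```python
def spearman_rank_correlation(a, b):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)).

    Assumes there are no tied values in either list; with ties,
    Pearson's correlation on the ranks should be used instead.
    """
    n = len(a)
    rank_a = {v: r for r, v in enumerate(sorted(a))}  # value -> rank in a
    rank_b = {v: r for r, v in enumerate(sorted(b))}  # value -> rank in b
    d2 = sum((rank_a[x] - rank_b[y]) ** 2 for x, y in zip(a, b))
    return 1 - (6.0 * d2) / (n * (n * n - 1))

print(spearman_rank_correlation([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0 (same order)
print(spearman_rank_correlation([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0 (reversed)
```

With no ties in either list, this should agree with a correct Java implementation pair for pair, which makes it a quick sanity check on WordSim353 scores.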

Please guide me toward getting a suitable result.

You listed some instructions at https://github.com/faraday/wikiprep-esa/wiki/roadmap; I want to know whether you implemented these instructions in your code.

Regards

@ghost ghost assigned faraday Apr 25, 2011
@pchanosakau

Hi there,

I am having some trouble implementing ESA as well (I got a Spearman correlation of about 0.65).

Instead of calculating the Spearman correlation yourself, how about using the following web site,
which calculates it for you?

http://www.maccery.com/maths/

Could you also share your Spearman correlation coefficient when you are done?

@faraday
Owner

faraday commented Apr 26, 2011

amiradib, why do you think that line is incorrect?
With that line, we keep a buffer of (article id, article title) tuples, and when the buffer fills up, we write them to the database. Why do you append "\u" to each record?

Yes, the instructions listed in Roadmap document are applied in this project.
I cannot access my experiment results right now, but I will also upload the correlation values, which yielded 0.737 with the 20051105 dump from Gabrilovich et al.

@faraday
Owner

faraday commented Apr 27, 2011

Wikiprep-ESA Spearman correlation detail for 20051105 dump from Gabrilovich et al:
http://www.ceng.metu.edu.tr/~cagatay/spearman-20051105.xls

@hosseinadib-hub
Author

Thanks so much, I'll review it to find the problem.

"why do you think that line is incorrect?"
I don't think that this line is incorrect, but I get an error at runtime, and when I traced the code I found that the problem comes from this line.

The error arises when the content of the "ctitle" variable has a Unicode character, like "â". By changing this line I bypassed the problem, but "\u" is appended to the start of all records!
Maybe this problem is related to the operating system; my operating system is Fedora 12.

@pchanosakau

Thanks a lot for sharing the Spearman correlation data!
The site I linked in my last comment verified that it was indeed about 0.73.

This gave me hints about where my code went wrong.

@faraday
Owner

faraday commented Apr 29, 2011

amiradib, I set the default encoding of Python to UTF-8 while I was working; that might be the difference that causes problems for you.
As you stated, you can either modify the related parts to explicitly define strings as Unicode, or you can change your encoding before executing the script. You can consider changing it in sitecustomize.py.

http://diveintopython.org/xml_processing/unicode.html
http://blog.ianbicking.org/illusive-setdefaultencoding.html
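For illustration only (this is not the actual scanData.py code, and the id and title values are made up), explicitly encoding the title to UTF-8 bytes avoids the "\u" prefix hack entirely:

```python
# Hypothetical sketch of the buffer-append step: encode the title
# explicitly instead of relying on the interpreter's default encoding.
ctitle = u"Ch\u00e2teau"   # a title containing a non-ASCII character like "â"
article_id = 42            # made-up article id

articleBuffer = []         # buffer of (article id, article title) tuples
articleBuffer.append((article_id, ctitle.encode("utf-8")))

# The stored value is plain UTF-8 bytes, safe to write to the database
# without tripping over the default ASCII codec.
```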

@dbk067

dbk067 commented May 12, 2011

I also had to change those lines to .encode("utf-8") even though my default encoding is UTF-8; no idea why. I also changed line 82, because the test should fail for Gabrilovich too (I believe).

Having done that I ran the following with no problem:

python scanLinks.py <hgw.xml file from Wikiprep dump>
python scanData.py <hgw.xml file from Wikiprep dump>
python addAnchors.py <anchor_text file from Wikiprep dump>

java -cp esa-lucene.jar edu.wiki.index.ESAWikipediaIndexer
java -cp esa-lucene.jar edu.wiki.modify.IndexModifier

java -cp esa-lucene.jar edu.wiki.demo.TestESAVectors

However, the results of TestESAVectors don't seem right. I used "Bank of America" as the original paper did, but my results were wildly different:
338498 Bank of America Plaza (Atlanta) 0.2070109466738399
205026 South Bank 0.20568114549072036
603356 National bank 0.19167196995310218
216340 Bank robbery 0.18271097494829652
386181 London South Bank University 0.16930809511027398
55569 First Bank of the United States 0.16357204991418456
345394 Grand Banks 0.16237282536193956
169177 Bank of Canada 0.1602520683031212
158268 Bank Holiday 0.16010798473208784
355517 Ernie Banks 0.15979087608657905

I'm not quite sure why they look nothing like the paper's results. Am I using the wrong function?

Thanks,
David

@pchanosakau

Hi David,

Do you mind sharing whether or not you were able to reproduce the Spearman correlation of about 0.73?

I looked at the Java/Python code myself, and honestly I could not find anything
that looks like a bug. I am also investigating why some of us get different results.

Thanks,
Patrick

@dbk067

dbk067 commented May 14, 2011

I ran TestWordsim353 and plugged my numbers into faraday's spreadsheet above. Rank correlation was 0.742.

@pchanosakau

Thanks, David!

@dbk067

dbk067 commented May 16, 2011

You're welcome. Let me know if I can do anything else to help.

I'm curious, are other people getting results similar to mine for TestESAVectors on "Bank of America"?
Is there a different function that might give results closer to those in the original ESA paper?

@pchanosakau

I finally managed to run the code. Here are my results from running TestESAVectors.java on "Bank of America".

I set the following parameters:
LINK_ALPHA = 0.0f;
WINDOW_THRES= 0.005f;

21139 North America 0.4362808871937762
31882 United States 0.29172723222927655
26769 South America 0.2899955079067316
6121 Central America 0.18718546016137955
24113 President of the United States 0.1865016623155888
216340 Bank robbery 0.18471200944081695
603356 National bank 0.17603763862445493
169177 Bank of Canada 0.16796354588362047
55569 First Bank of the United States 0.16377390524738045
5666 Central bank 0.16374378819479793

David, would you mind sharing your parameter values to get 0.742 Spearman correlation?
I am just curious about it.

Setting LINK_ALPHA = 0.0f, and WINDOW_THRES= 0.005f gave me about 0.73.
If I set LINK_ALPHA = 0.0f, and WINDOW_THRES= 0.05f, the correlation became 0.66.

@dbk067

dbk067 commented Jun 1, 2011

I don't believe I changed them at all.

LINK_ALPHA = 0.5f
WINDOW_THRES= 0.005f
