minor: poem titles as search results are coming up under previous poem title #54

Open
ghost opened this issue Dec 8, 2016 · 13 comments

@ghost

ghost commented Dec 8, 2016

search 'marriage heaven hell'

notice that the first result (which is for the title of the Marriage of Heaven and Hell) is under For Children the Gates of Paradise, which is the poem before it

@ghost
Author

ghost commented Dec 10, 2016

@queryluke could this just be a matter of mislabeling, or is it deeper?

@queryluke
Collaborator

It's deeper. When the Erdman XML is parsed, it creates Page JSON objects. These objects have a "headings" attribute, which is a nested array of heading IDs. For page 33, the headings attribute is:

"headings":["[[\"b1\", [[\"b1.6\", []], [\"b1.7\", [[\"b1.7.1\", []]]]]]]"],

b1.6 is For Children because there is a little bit of this poem on page 33
b1.7 is Marriage

So this is a bit of a conundrum. Right now the script always selects the first 2nd-level header in the list (in this case b1.6). I can switch it to always select the LAST 2nd-level header (b1.7), but then a search for "mother sister" (the last line of Children) would show the result under Marriage.
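
For illustration, here is a minimal sketch of the first-versus-last choice described above, using the page 33 value quoted in this comment. It assumes each entry in the nested array is an `[id, children]` pair; the helper name `idsAtLevel` is hypothetical and not taken from the project's code.

```javascript
// The "headings" value for page 33 as quoted above: an array holding one
// JSON-encoded string of nested [id, children] pairs (assumed structure).
const headings = ['[["b1", [["b1.6", []], ["b1.7", [["b1.7.1", []]]]]]]'];

// Hypothetical helper: collect the ids found at a given depth (1 = top level).
function idsAtLevel(nodes, level) {
  if (level === 1) return nodes.map(([id]) => id);
  return nodes.flatMap(([, children]) => idsAtLevel(children, level - 1));
}

const tree = JSON.parse(headings[0]);
const secondLevel = idsAtLevel(tree, 2);

// Current behavior: the first 2nd-level header labels every hit on the page.
const firstHeader = secondLevel[0];
// Alternative: the last 2nd-level header labels every hit on the page.
const lastHeader = secondLevel[secondLevel.length - 1];

console.log(firstHeader, lastHeader);
```

Either choice assigns a single heading to the whole page, so any page that spans two works will mislabel hits from one of them.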

Fixing this on the JavaScript side would be ugly and nearly impossible, so it's something you'll want to discuss with Nathan. I'm sure he'll have his own ideas on how to fix it.

@ghost
Author

ghost commented Dec 10, 2016

ok, gonna assign to nathan

ghost assigned nathan-rice Dec 10, 2016

@nathan-rice
Collaborator

It isn't clear exactly what the desired behavior is here. As Luke mentioned, the page title is set to the first header. I can set it to the last header, or the second header (if there is more than one).

@ghost
Author

ghost commented Apr 21, 2017

The title in the results should be the poem title of the poem/work that contains the line

@nathan-rice
Collaborator

The information isn't stored that way. Page titles are mapped to headers in a one-to-one relation. If you want, I can stuff all the headers in the mapping, and you can write JavaScript to pick the one that should actually be displayed.

@ghost
Author

ghost commented Apr 21, 2017

The number of titles per page is inconsistent, so choosing first or second would be arbitrary.

Basically the result should correspond to the actual poem it's in

@ghost
Author

ghost commented Apr 21, 2017

Ok, I'll confer with joe

@ghost
Author

ghost commented Apr 22, 2017

had a look more closely. i'm not sure what to say. if someone searches "marriage heaven hell" and the title of the poem "The Marriage of Heaven and Hell" is a result, then that is what should show as the header of the result, not the previous poem's title. i understand the issue in the code, but we do need to fix it. i'm not sure what you mean by stuffing all the headers in the mapping--i haven't looked at the code closely--but if you did that, how would we select the right one in javascript? we'd have to do another mini search in the javascript? is there a way to detect a result coming from a poem/work title and then use that title as the result header?

@nathan-rice
Collaborator

The problem here stems from the fact that your unit of data is a page, but your desired unit of search results is not.

In my opinion the best option is not to use the page heading in the search results, but instead use the page number. That is technically correct and avoids confusion.

Probably the most direct way to get the behavior you want is to do a JavaScript search on the page for the relevant text, then work backwards in the DOM from that text node to the previous heading, which you then use for the title. Any other option would require completely redoing how data is stored in Solr, which would basically involve rewriting the entire application.
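
A rough sketch of the backwards walk, for illustration only. It relies on the standard DOM properties `previousElementSibling`, `lastElementChild`, and `parentElement`. The function names and the `isHtmlHeading` test are assumptions, since the thread does not show how headings are marked up in the rendered page.

```javascript
// Assumption: headings are rendered as <h1>..<h6>. Swap in whatever test
// matches the real markup (a class name, a data attribute, and so on).
const isHtmlHeading = (element) => /^H[1-6]$/.test(element.tagName);

// Return the last heading, in document order, inside `element` (or `element`
// itself if it is a heading and contains no later one). Null if none.
function lastHeadingWithin(element, isHeading) {
  let child = element.lastElementChild;
  while (child) {
    const found = lastHeadingWithin(child, isHeading);
    if (found) return found;
    child = child.previousElementSibling;
  }
  return isHeading(element) ? element : null;
}

// Walk backwards in document order from `node` (the element holding the
// matched text) and return the nearest preceding heading, or null.
function previousHeading(node, isHeading) {
  let current = node;
  while (current) {
    // Look through earlier siblings, nearest first.
    let sibling = current.previousElementSibling;
    while (sibling) {
      const found = lastHeadingWithin(sibling, isHeading);
      if (found) return found;
      sibling = sibling.previousElementSibling;
    }
    // Nothing among earlier siblings: climb one level and try again.
    current = current.parentElement;
    if (current && isHeading(current)) return current;
  }
  return null;
}

// Example call, where `matchedElement` is the element containing the hit:
// const heading = previousHeading(matchedElement, isHtmlHeading);
```

This only works if the rendered page keeps each work's heading ahead of its lines in document order; a hit on a heading's own text would need to be checked separately, since the walk starts strictly before the matched node.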

@ghost
Author

ghost commented Apr 22, 2017

we can't use the page number because then the results wouldn't amount to a proper concordance and the information conveyed would be a lot less useful.

we'll have to go the javascript way. @queryluke, is this the solution in the javascript that you were thinking of?

@ghost
Author

ghost commented Jun 20, 2017

@nathan-rice i wanted to remind you of this issue. joe v. just pointed out another instance of it. search "sin". the second result under THE [ FIRST ]BOOK OF URIZEN is actually a line in THE BOOK of AHANIA, which comes after THE [ FIRST ]BOOK OF URIZEN

@nathan-rice
Collaborator

Page 84 occurs under both the Urizen and Ahania headings due to the structure of the XML. Currently the javascript groups query result text by heading, using the first heading on the page. As a result, though it is in Ahania, the heading for the result is Urizen.
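
To make the grouping behavior concrete, here is a small sketch under assumed data shapes. The hit object, its field names, and `groupByFirstHeading` are hypothetical; only the page number and the two titles come from this thread.

```javascript
const URIZEN = "THE [ FIRST ]BOOK OF URIZEN";
const AHANIA = "THE BOOK of AHANIA";

// Hypothetical hit: the matched line belongs to Ahania, but page 84 carries
// both headings, listed in document order.
const hits = [
  { page: 84, text: "(a line from Ahania)", headings: [URIZEN, AHANIA] },
];

// Group hits under the first heading of their page, as described above.
function groupByFirstHeading(allHits) {
  const groups = new Map();
  for (const hit of allHits) {
    const heading = hit.headings[0];
    if (!groups.has(heading)) groups.set(heading, []);
    groups.get(heading).push(hit);
  }
  return groups;
}

const groups = groupByFirstHeading(hits);
console.log([...groups.keys()]);
```

Because the page is the unit of data, the Ahania line ends up under the Urizen key, and no group for Ahania is created at all.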

Changing this behavior (for example, by taking the last heading) will just break other cases. The best available fix is to move the <pb page="#"> outside the <div2> containing Urizen. Beyond that, there isn't really a good solution to this problem given the current data model, and I doubt the problem is a big enough deal to warrant overhauling that.
