Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve TypeScript typings #257

Closed
wants to merge 6 commits into from

Conversation

singingwolfboy
Copy link

@singingwolfboy singingwolfboy commented Mar 27, 2024

This alters the TypeScript typings to make it clear that the fields, idField, and storeFields properties must be keys in the document type. It also improves the typing of the SearchResult type so that it knows to combine it with the stored fields, and preserves the type information when doing so.

Now, in order to use this project with TypeScript, here is an example of the right way to do so:

interface Doc {
  id: number;
  title: string;
  pubDate: Date;
  author: { name: string };
  category: string;
}

const ms = new MiniSearch<Doc, 'title' | 'category'>({
  // Typescript can autocomplete field names, and even suggest deep paths like `author.name`!
  fields: ['title', 'pubDate', 'author.name'],
   // note that the `storeFields` are repeated in the type annotation
  storeFields: ['title', 'category'],
})
const document: Doc = {
  id: 1,
  title: 'Divina Commedia',
  pubDate: new Date(1320, 0, 1),
  author: { name: 'Dante Alighieri' },
  category: 'poetry'
}
ms.add(document)

const results = ms.search('Dante')
results[0].category  // Typescript understands that the `category` field is included in the results!

Comment on lines -1926 to +1934
const fieldValue = extractField(doc, fieldName)
if (fieldValue !== undefined) documentFields[fieldName] = fieldValue
documentFields[fieldName] = doc[fieldName]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only change that I made to the logic of the code. Instead of calling extractField on the field and saving the result to this._storedFields, this now simply accesses the field directly, which preserves the correct type information.

What was the reason for using extractField here? I didn't understand why that was happening, which is why I felt that replacing it was the right thing to do. If there's a reason (or edge case) for why this code was using extractField, please let me know, so I can make sure to test that reason with my change!

@@ -111,7 +111,7 @@ describe('MiniSearch', () => {
expect(extractField).toHaveBeenCalledWith(document, 'title')
expect(extractField).toHaveBeenCalledWith(document, 'pubDate')
expect(extractField).toHaveBeenCalledWith(document, 'author.name')
expect(extractField).toHaveBeenCalledWith(document, 'category')
expect(extractField).not.toHaveBeenCalledWith(document, 'category')
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important for extractField to be called on fields that are in storeFields? If so, why?

@lucaong
Copy link
Owner

lucaong commented Apr 3, 2024

Hi @singingwolfboy , thanks a lot for your contribution! I agree that preserving the type information as much as possible would be desirable.

Unfortunately, this change would introduce a big breaking change on a feature that a lot of users rely upon: in MiniSearch, the document does not necessarily have to be a simple key/value object. That's just the default, but in principle it can be anything, and the application developer is able to control this via the extractField function.

Suppose, for example, that an application is indexing news articles, and each document has a publishedAt field, which is a Date object. Suppose that the application should allow to search for articles published in a specific month/year, or even on a specific day of the week (e.g. "wednesday"). This could be achieved by defining an extractField like:

new MiniSearch({
  fields: ['publishedAt', /* etc... */],
  extractField: (document, fieldName) => {
    if (fieldName === 'publishedAt') {
      return document[fieldName].toLocaleDateString('en-UK', {
        weekday: 'long',
        year: 'numeric',
        month: 'long',
        day: 'numeric',
      })
    } else {
      return document[fieldName]
    }
  }
})

This feature is widely used, and existed prior to the switch to TypeScript. If types were the most important goal, one could require developers to transform the documents before indexing, to ensure that all fields are simple key/value objects with string values. While MiniSearch tries to provide type safety whenever possible, convenience of use is a more important goal, and the large majority of users use it with JavaScript, not TypeScript.

Therefore, I am unfortunately not going to merge this pull request, as the cost of the large breaking change surpasses the benefits. TypeScript users, if they are not using a custom extractField option, can preserve type safety by wrapping MiniSearch in a thin factory function enforcing the keyOf restriction.

That said, this is not the first time this topic comes up: see #229 and #208 for example. I am devising a partial solution that would allow more type safety while not introducing a breaking changes.

Thanks again for the contribution, highly appreciated!

@singingwolfboy
Copy link
Author

@lucaong I completely understand the value of the extractField option, and as far as I can tell, this code change preserves that feature! You can tell because the automated tests are passing on this pull request, at least when I run them locally -- I need you to approve the GitHub Actions workflow before it runs on this pull request. Not only is there a test that exercises this functionality, but the code example that I put in the description of this pull request does the same thing. The extractField function should continue to work, as expected, for all fields listed in the fields option.

The one and only change in functionality that I have made in this pull request (as far as I'm aware) is that the extractField function is no longer being called for fields that are listed in the storeFields option. In the code example that I put in the description of this pull request, that means that when the title and category fields are stored with the search index, the contents of these fields are saved directly, without passing them through extractField. (Note that title is listed in both the fields and storeFields lists; we use extractField when using title to create the search index, and we do not use it when storing the literal value with that index.)

This is useful because when we want to store extra information in the index, I believe it's useful to store that information exactly as it was listed in the document. For example, a Date object should be stored as a Date object, not as a string representation. The string representation is helpful for finding the document in the search, so that we can index on the string "Wednesday 3 April 2024", but once we've found the document and returned a SearchResult, the developer probably wants to work with the native JavaScript Date, not with the string "Wednesday 3 April 2024".

Can you explain why we want to call extractField on data that is stored in this._storedFields? Maybe I'm misunderstanding why you built it this way. Is it for performance reasons, for example?

@lucaong
Copy link
Owner

lucaong commented Apr 3, 2024

Thanks @singingwolfboy for the explanation !

It is true that, most typically, storedFields are taken plainly from the object without modification, but that's not always the case. One common case is for nested fields (this is suggested in one of the examples in the README):

// Assuming that our documents look like:
const documents = [
  { id: 1, title: 'Moby Dick', author: { name: 'Herman Melville' } },
  { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', author: { name: 'Robert Pirsig' } },
  { id: 3, title: 'Neuromancer', author: { name: 'William Gibson' } },
  { id: 4, title: 'Zen in the Art of Archery', author: { name: 'Eugen Herrigel' } },
  // ...and more
]

// We can support nested fields (author.name) with a
// custom `extractField` function:

let miniSearch = new MiniSearch({
  fields: ['title', 'author.name'],
  extractField: (document, fieldName) => {
    // Access nested fields
    return fieldName.split('.').reduce((doc, key) => doc && doc[key], document)
  }
})

In the case above, should one want to store the author name, one has the option to either do storeFields: ['author'], keeping it as an object, or storeFields: ['author.name'], storing only the name part. The second way is only possible if we pass it through extractField. This feature is not strictly necessary, but keeps things consistent, and is definitely in use in many projects. We could publish a major version and make a backward incompatible change, but I don't think it's worth.

What do you think?

@lucaong
Copy link
Owner

lucaong commented Apr 3, 2024

What I think is a backward incompatible change that worries me more, is that if you enforce the type string & keyof T for the fieldName argument of many functions like tokenize or extractField, then TypeScript users won't be able to compile anymore projects that define "computed fields" like the nested field example above (where "author.name" is not a key of the document type, yet is a valid fieldName in the current MiniSearch implementation).

This was not clear in my first example, because I used a field name that actually exists as a key on the document type, but should be clearer in my second example. Sorry for that.

Tests pass because they are written in JavaScript, but I believe that TypeScript projects doing what shown in the example above would not compile anymore after the change.

I do have an idea to possibly improve type safety without breaking backwards compatibility, I hope I find the time to outline it in an issue at some point soon. If I manage to, I will mention you in the issue, so you can provide your opinion.

@singingwolfboy
Copy link
Author

In the case above, should one want to store the author name, one has the option to either do storeFields: ['author'], keeping it as an object, or storeFields: ['author.name'], storing only the name part. The second way is only possible if we pass it through extractField.

That's not the only way to do it! In fact, I would say it's not even the best way to do it. If the intended use case is supporting dotted paths to store specific fields from nested objects, then we can write the function that handles the storeFields configuration to support that use case. That would still preserve type information, which is the major goal of this pull request! The downside of relying on extractField is that coerces all output to strings, which is useful for the search index, but not useful for additional stored data.

TypeScript users won't be able to compile anymore projects that define "computed fields" like the nested field example above (where "author.name" is not a key of the document type, yet is a valid fieldName in the current MiniSearch implementation).

You underestimate how expressive TypeScript is! 😄 In fact, this pull request already accounts for dotted paths like you describe. If you check the types.ts file, I copied a recipe from a blog post that allows for constructing fields exactly the way you describe. If keyof T gives you just the top-level fields in T, then Path<T> gives you that plus all the deeply nested dotted paths. In fact, a good text editor with TypeScript support will render these options in a dropdown, so the developer can see all the different ways they can define fields for MiniSearch!

I tried to take a conservative approach so far, and I've only used the Path<T> type for the fields property in the options. But if I should use that type for other parts of the codebase, it's easy to replace! You know this project better than me, so please tell me which places should be keyof T and which should be Path<T>.

Tests pass because they are written in JavaScript, but I believe that TypeScript projects doing what shown in the example above would not compile anymore after the change.

Would you like me to rewrite the tests in TypeScript? Would that convince you?

@lucaong
Copy link
Owner

lucaong commented Apr 3, 2024

I really appreciate your pull request, and I am motivated to find a solution to improve the type safety of MiniSearch, at least in the common case of using plain key/value objects as documents without a custom extractField.

That said, I still think it's a breaking change: the extractField function (and tokenize, processTerm, etc.) currently can take any value for the fieldName, not only keys of the document type, or dotted paths. In fact, the Field Extraction example in the README shows one such case:

// Assuming that our documents look like:
const documents = [
  { id: 1, title: 'Moby Dick', author: { name: 'Herman Melville' }, pubDate: new Date(1851, 9, 18) },
  { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', author: { name: 'Robert Pirsig' }, pubDate: new Date(1974, 3, 1) },
  { id: 3, title: 'Neuromancer', author: { name: 'William Gibson' }, pubDate: new Date(1984, 6, 1) },
  { id: 4, title: 'Zen in the Art of Archery', author: { name: 'Eugen Herrigel' }, pubDate: new Date(1948, 0, 1) },
  // ...and more
]

// We can support nested fields (author.name) and date fields (pubDate) with a
// custom `extractField` function:

let miniSearch = new MiniSearch({
  fields: ['title', 'author.name', 'pubYear'],
  extractField: (document, fieldName) => {
    // If field name is 'pubYear', extract just the year from 'pubDate'
    if (fieldName === 'pubYear') {
      const pubDate = document['pubDate']
      return pubDate && pubDate.getFullYear().toString()
    }

    // Access nested fields
    return fieldName.split('.').reduce((doc, key) => doc && doc[key], document)
  }
})

Here, the pubYear field is a computed field, which is neither a key of the document, nor a dotted path. This, I assume, would not compile with this PR. Or am I mistaken?

What I think could be done in the future, is to support a different syntax in the fields options. Basically, the values in the fields array could be strings (like now) or, alternatively, objects that specify more options like:

// Future proposal, NOT YET WORKING

new MiniSearch({
  fields: [
    {
      name: 'title',
      index: true, // Index this field. Can be set to false (and store to true) when only storing a field
      store: true, // equivalent to adding this field to storeFields
      tokenize: (s) => s.split(/\s+/), // tokenizer for this field
      extractField: (doc, _) => doc.title, // extractor for this field
      processTerm: (term) => term.toLowerCase()  // processor for this field
    },
    { name: 'body', store: true }, // Uses the default tokenizer, extractor, processor, etc.
    { path: 'author.name' }, // When using path instead of name we can be type safe (as in path
                             // must be a valid path in the doc type). Path is mutually exclusive to
                             // extractField, and by default is also used as name, if not specified
    'info' /* same as { name: 'info' }. If it was the same as { path: 'info' }
           *  this would be a breaking change when specifying a default
           *  extractField function, as we cannot require the name to be a key in the
           *  document type. Maybe we can make it the same as { path: 'info' } if and only
           *  if there is no custom `extractField` defined? */
  ],
  storeFields: ['title', 'body', 'info'], // In case of conflict with fields it should error
  processTerm: (term) => term.toLowerCase(), // Provides a default for all fields
  tokenize: (s) => s.split(/\s+/), // Provides a default for all fields
  extractField: (doc, fieldName) => doc[fieldName], // Provides a default for all fields,
  searchOptions: {
    boost: { title: 2 },
    prefix: true,
    fuzzy: 0.2,
    tokenize: (s) => s.split(/\s+/) // Search tokenizer
  }
})

For clarity, here I am using all possible combinations to annotate how they interact together, but typically one would either use the fields option with object values, or specify the various storeFields, extractField, processTerm, tokenize, etc. at the "global" level (and not both).

Therefore, these would be type safe in the type of field names:

new MiniSearch({ fields: [
    { name: 'title' },
    { name: 'body', store: true },
    { path: 'author.name' }
  ]
})

new MiniSearch({ fields: ['title', 'body', 'info'] })

While these would allow any string as field names (because a custom extractor is specified):

new MiniSearch({
  fields: [
    { name: 'title', extractField: (doc, _) => doc.title },
  ]
})

new MiniSearch({
  fields: ['title', 'body'],
  extractField: (doc, fieldName) => doc[fieldName]
})

The proposal above supports the current syntax with no breaking change, but offers a type safe syntax for users that do not need a custom extractField. I believe that the above can be made type safe when using the { path: ... } syntax, or even with the current string syntax whenever extractField is not specified. When a custom extractor is specified, complete flexibility is enabled, but as a result the type cannot be restricted to a dotted path anymore (one can still manually annotate the type of extractField arguments in that case).

What do you think?

@lucaong
Copy link
Owner

lucaong commented Apr 3, 2024

Maybe it is even possible to enforce type safety with the current syntax whenever extractField is not customized. For example by making the Options type the union between a version with no extractField and a stricter type for field names, and a version with an extractField function, but less strict type for field names.

The above proposal would be a further improvement on it, by allowing dotted paths to be specified without an extractField function, and therefore be kept type safe.

One drawback is that the more complicated types would make the documentation less user friendly: now it is relatively easy to look up the meaning of each option in the docs, but with this change one would have to follow the type union. This can still be solved by explicitly documenting options.

@singingwolfboy
Copy link
Author

Here, the pubYear field is a computed field, which is neither a key of the document, nor a dotted path. This, I assume, would not compile with this PR. Or am I mistaken?

Hmm, that does present a problem. But not an insurmountable one! We could simply add some extra fields to the type annotation. Here's some example code, that doesn't currently work with this pull request, but we could alter the code to make it work:

interface Doc {
  id: number;
  title: string;
  author: { name: string; }
  pubDate: Date;
}

// note the two type arguments
const ms = new MiniSearch<Doc, 'pubYear'>({
  fields: ['title', 'author.name', 'pubYear'],
  extractField: (document, fieldName) => {
    // If field name is 'pubYear', extract just the year from 'pubDate'
    if (fieldName === 'pubYear') {
      const pubDate = document['pubDate']
      return pubDate && pubDate.getFullYear().toString()
    }

    // Access nested fields
    return fieldName.split('.').reduce((doc, key) => doc && doc[key], document)
  }
})

We can use type arguments to the MiniSearch constructor to pass additional information, such as additional field names that should be considered acceptable when passed to the fields option (and others). I'm currently doing a similar sort of thing in this pull request for the storeFields option; there might be a way to avoid the repetition, if we can find someone who is better at TypeScript than I am. But either way, I'm confident that we can support this use-case without making a breaking change.

What I think could be done in the future, is to support a different syntax in the fields options. Basically, the values in the fields array could be strings (like now) or, alternatively, objects that specify more options...

I like this, too! I particularly like that you can now pass functions that are scoped by field: no more if (fieldName === 'oneSpecificField')! That makes the code easier to read, and probably more performant as well, since they don't need to be constantly called for fields that don't need any custom logic at all.

I'm OK with abandoning my efforts on this pull request, if you think that this other approach works better. Maybe I could help with specifying it and writing the code/tests?

@lucaong
Copy link
Owner

lucaong commented Apr 3, 2024

Yes, contributions like yours are very welcome :)

My concern is mostly to avoid big breaking changes with limited benefit. It's acceptable to add some breaking change in a major release, especially if the benefit outweights the cost. Let me think about this a bit more before closing your PR. In any case, I am completely on your side that additional type safety is desirable, and I definitely can use your help on this or alternative proposals.

@lucaong
Copy link
Owner

lucaong commented Jul 8, 2024

I will close this PR, but the conversation is still very relevant to implement the strongly typed interface of the next major release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants