Not able to Increase the number of samples in generated synthetic data #426

Open
amrita-rajput opened this issue Nov 8, 2024 · 3 comments
Labels
bug Something isn't working question Further information is requested

Comments

@amrita-rajput

I'm using the following command to generate synthetic data, with --sdg-scale-factor set to 100:

ilab data generate --endpoint-url http://localhost:8080/v1 --chunk-word-count 1000 --model mistralai/mistral-large --pipeline full --sdg-scale-factor 100

The generated synthetic data always contains 30 samples, and changing the --sdg-scale-factor value has no effect on that number.

@amrita-rajput amrita-rajput added the bug Something isn't working label Nov 8, 2024
@amrita-rajput amrita-rajput changed the title Increasing the number of samples in generated synthetic data Not able to Increase the number of samples in generated synthetic data Nov 8, 2024
@jaideepr97
Member

@amrita-rajput Apologies for the delayed response.
Are you working with skills or knowledge?

I may be wrong about this, but I think --sdg-scale-factor is ignored when working with knowledge.
@aakankshaduggal can keep me honest.

@jaideepr97 jaideepr97 added the question Further information is requested label Nov 28, 2024
@ktam3 ktam3 transferred this issue from instructlab/instructlab Dec 4, 2024
@ktam3

ktam3 commented Dec 4, 2024

cc @bbrowning @khaledsulayman, could you PTAL?

@bbrowning
Contributor

Because the number of generated samples is always stuck at 30, this user is generating skills and is hitting #420. Our behavior today is confusing: we do use --sdg-scale-factor to scale some of our generated data, but we only ever mix 30 samples from any taxonomy leaf node into the final output.
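
To illustrate why the final count stays flat, here is a minimal Python sketch of the behavior described above. It is not the actual SDG code: the function names `generate_for_leaf` and `mix_leaf` and the constant `SAMPLES_PER_LEAF` are hypothetical, and the cap of 30 is simply the number reported in this thread. The sketch assumes one leaf node whose generated pool grows with the scale factor and is then capped during mixing.

```python
import random

# Assumed cap per taxonomy leaf node, taken from the count reported in this thread.
SAMPLES_PER_LEAF = 30


def generate_for_leaf(seed_examples, scale_factor):
    """Hypothetical generation step: the pool grows with scale_factor."""
    return [
        f"{seed}-variant-{i}"
        for seed in seed_examples
        for i in range(scale_factor)
    ]


def mix_leaf(generated, cap=SAMPLES_PER_LEAF, rng=None):
    """Hypothetical mixing step: keep at most `cap` samples from one leaf."""
    rng = rng or random.Random(0)
    if len(generated) <= cap:
        return list(generated)
    return rng.sample(generated, cap)


seeds = ["q1", "q2", "q3", "q4", "q5"]
for factor in (10, 100):
    generated = generate_for_leaf(seeds, factor)
    mixed = mix_leaf(generated)
    # The generated pool scales with the factor; the mixed output does not.
    print(factor, len(generated), len(mixed))
```

Under these assumptions, raising the scale factor from 10 to 100 grows the generated pool tenfold, while the mixed output for the leaf stays at 30 samples, which matches the symptom reported in this issue.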


4 participants