feat: add import data from gql script #247
base: develop
Conversation
This Node.js script fetches ArDrive L1 transactions and bundles from a gateway through the GQL interface. The script accepts the following parameters:

- `--gqlEndpoint`: defaults to Goldsky
- `--minHeight`: defaults to 0
- `--maxHeight`: defaults to the latest Arweave block
- `--blockRangeSize`: defaults to 100

Example:

```
node --import=./register.js scripts/import-data/fetch-data-gql.ts --minHeight 1000000 --maxHeight 1000500
```
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop     #247     +/-   ##
==========================================
  Coverage    70.86%   70.87%
==========================================
  Files           35       35
  Lines         8958     8967      +9
  Branches       523      523
==========================================
+ Hits          6348     6355      +7
- Misses        2608     2610      +2
  Partials         2        2
==========================================
```
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (2)

scripts/import-data/fetch-data-gql.ts (2)

Lines 30-46: Consider using a command-line argument parsing library

Utilizing a library such as `commander`, `yargs`, or `minimist` can simplify argument handling, provide built-in validations, and improve code readability. A sketch of one option is shown below.
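For illustration, here is a minimal sketch of the same flags parsed with `yargs` (this assumes yargs v17 with its `hideBin` helper; the option names mirror the script's flags, but this is not code from the PR):

```typescript
import yargs from 'yargs';
import { hideBin } from 'yargs/helpers';

// Parse the script's flags with built-in type coercion and defaults.
const argv = yargs(hideBin(process.argv))
  .option('gqlEndpoint', {
    type: 'string',
    default: 'https://arweave-search.goldsky.com/graphql',
  })
  .option('minHeight', { type: 'number', default: 0 })
  .option('maxHeight', { type: 'number' }) // undefined => fetch latest block
  .option('blockRangeSize', { type: 'number', default: 100 })
  .check((args) => {
    // Reject invalid heights up front instead of deep in the script.
    if (args.minHeight < 0) throw new Error('--minHeight must be >= 0');
    if (args.blockRangeSize <= 0) throw new Error('--blockRangeSize must be > 0');
    return true;
  })
  .parseSync();
```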
Lines 243-270: Improve error handling in `writeTransactionsToFile`

Enhance robustness by handling specific file system errors and providing more informative messages to the user. Consider wrapping file operations in try-catch blocks and distinguishing between different types of errors (e.g., permission issues, disk space), as in the sketch below.
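As a hedged sketch of what distinguishing file system errors could look like (the helper name and messages are illustrative, not from the PR; the `NodeJS.ErrnoException` codes are standard Node.js):

```typescript
import fs from 'node:fs/promises';

async function writeFileWithDiagnostics(filePath: string, content: string) {
  try {
    await fs.writeFile(filePath, content);
  } catch (error) {
    const err = error as NodeJS.ErrnoException;
    // Map common errno codes to actionable messages.
    if (err.code === 'EACCES') {
      console.error(`Permission denied writing ${filePath}`);
    } else if (err.code === 'ENOSPC') {
      console.error(`Disk full while writing ${filePath}`);
    } else if (err.code === 'ENOENT') {
      console.error(`Directory for ${filePath} does not exist`);
    } else {
      console.error(`Failed to write ${filePath}: ${err.message}`);
    }
    throw error;
  }
}
```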
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (3)

- `.gitignore` (1 hunks)
- `scripts/import-data/fetch-data-gql.ts` (1 hunks)
- `tsconfig.json` (1 hunks)

✅ Files skipped from review due to trivial changes (1)

- `.gitignore`
🔇 Additional comments (2)

scripts/import-data/fetch-data-gql.ts (1)

Lines 273-275: Verify that `MAX_BLOCK_HEIGHT` is defined after fetching

Ensure that `MAX_BLOCK_HEIGHT` is correctly assigned after calling `fetchLatestBlockHeight()` to avoid potential issues in subsequent code. Run the following script to confirm `MAX_BLOCK_HEIGHT` is properly assigned.

✅ Verification successful

`MAX_BLOCK_HEIGHT` assignment is properly implemented. The verification shows that:

- `MAX_BLOCK_HEIGHT` is correctly declared as a number
- It's properly assigned either from CLI arguments or by fetching from the Arweave network
- `fetchLatestBlockHeight` correctly retrieves and returns the block height
- The value is used consistently throughout the code for block range calculations and logging
🏁 Scripts executed

The following scripts were executed for the analysis:

```bash
#!/bin/bash
# Description: Verify that MAX_BLOCK_HEIGHT is assigned after fetching the latest block height.
node -e "
(async () => {
  let MAX_BLOCK_HEIGHT;
  const fetchLatestBlockHeight = async () => {
    const response = await fetch('https://arweave.net/info');
    const data = await response.json();
    return data.height;
  };
  MAX_BLOCK_HEIGHT = await fetchLatestBlockHeight();
  console.log('MAX_BLOCK_HEIGHT:', MAX_BLOCK_HEIGHT);
})();
"
```

Length of output: 381

```bash
#!/bin/bash
# Examine how MAX_BLOCK_HEIGHT is used in the codebase
rg "MAX_BLOCK_HEIGHT" -A 3
# Also check the implementation of fetchLatestBlockHeight
ast-grep --pattern 'async function fetchLatestBlockHeight() { $$$ }'
```

Length of output: 1788

```bash
#!/bin/bash
# Check the implementation of fetchLatestBlockHeight with a different pattern
ast-grep --pattern 'const fetchLatestBlockHeight = async () => { $$$ }'
# Also check how getBlockRanges uses maxBlock parameter
ast-grep --pattern 'function getBlockRanges($$$) { $$$ }'
```

Length of output: 634
tsconfig.json (1)

Line 28: Including the `scripts` directory in TypeScript compilation

Adding `"scripts"` to the `include` array ensures that TypeScript compiles the new script files, which is appropriate.
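For reference, the relevant part of `tsconfig.json` presumably looks like the following after the change (the `src` entry is an assumption; only the addition of `"scripts"` is confirmed by the review):

```jsonc
{
  "include": ["src", "scripts"]
}
```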
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)

scripts/import-data/fetch-data-gql.ts (1)

Lines 323-325: Update log message to reflect the current block range

The log message indicates the full block range from `MIN_BLOCK_HEIGHT` to `MAX_BLOCK_HEIGHT`, which may be misleading during iteration. Modify it to display the current range being processed for clearer progress tracking. Apply this diff to fix the log message:

```diff
  console.log(
-   `Transactions and bundles from block ${MIN_BLOCK_HEIGHT} to ${MAX_BLOCK_HEIGHT} saved!`,
+   `Transactions and bundles from block ${range.min} to ${range.max} saved!`,
  );
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (1)

- `scripts/import-data/fetch-data-gql.ts` (1 hunks)
🔇 Additional comments (4)

scripts/import-data/fetch-data-gql.ts (4)

Lines 111-112: Correct the property name when fetching the latest block height

The `fetchLatestBlockHeight` function attempts to extract `blocks` from the response, but the correct property for the latest block height is `height`. This change ensures that the latest block height is accurately retrieved. Apply this diff to fix the property name:

```diff
- const { blocks } = await response.json();
- return blocks as number;
+ const { height } = await response.json();
+ return height as number;
```

Or adjust the code accordingly:

```diff
  const data = await response.json();
- const { blocks } = data;
+ const { height } = data;
  return height as number;
```
Lines 125-129: 🛠️ Refactor suggestion: Allow processing when `minBlock` equals `maxBlock`

The current condition prevents processing a single block range where `minBlock` equals `maxBlock`. Adjust the condition to include this scenario, allowing the script to handle ranges where both values are the same. Apply this diff to modify the condition:

```diff
- if (minBlock >= maxBlock || rangeSize <= 0) {
+ if (minBlock > maxBlock || rangeSize <= 0) {
    throw new Error(
      'Invalid input: ensure minBlock < maxBlock and rangeSize > 0',
    );
  }
```
Line 174: Conditionally include the `after` parameter in the GraphQL query

Passing an empty string for the `after` parameter may cause issues with the GraphQL endpoint. Omit the `after` parameter when `cursor` is undefined to prevent potential errors. Apply this diff to adjust the query:

```diff
    sort: HEIGHT_ASC
-   after: "${cursor !== undefined ? cursor : ''}"
+   ${cursor !== undefined ? `after: "${cursor}"` : ''}
  ) {
```
Lines 239-241: Handle empty `edges` array to prevent errors

When processing the `edges` array, ensure it is not empty before accessing its elements to avoid runtime exceptions when the array is empty. Apply this diff to add a condition:

```diff
  hasNextPage = pageInfo.hasNextPage;
- cursor = hasNextPage ? edges[edges.length - 1].cursor : undefined;
+ if (hasNextPage && edges.length > 0) {
+   cursor = edges[edges.length - 1].cursor;
+ } else {
+   cursor = undefined;
+ }
```
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (3)

scripts/import-data/fetch-data-gql.ts (3)

Lines 158-171: Consider moving App-Name values to configuration

The hardcoded App-Name values should be moved to a configuration file for better maintainability and reusability. Create a config file and import the values:

```typescript
// config.ts
export const ARDRIVE_APP_NAMES = [
  'ArDrive-App',
  'ArDrive-Web',
  'ArDrive-CLI',
  'ArDrive-Desktop',
  'ArDrive-Mobile',
  'ArDrive-Core',
  'ArDrive-Sync',
];
```

Then update the query:

```diff
  tags: [
    {
      name: "App-Name"
-     values: [
-       "ArDrive-App"
-       "ArDrive-Web"
-       "ArDrive-CLI"
-       "ArDrive-Desktop"
-       "ArDrive-Mobile"
-       "ArDrive-Core"
-       "ArDrive-Sync"
-     ]
+     values: ARDRIVE_APP_NAMES
    }
  ]
```
Lines 227-262: Improve pagination handling with rate limiting

The while loop for pagination could benefit from rate limiting to prevent overwhelming the API. Add rate limiting:

```diff
+ const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));
+ const RATE_LIMIT_DELAY = 1000; // 1 second delay between requests
  while (hasNextPage) {
    console.log(
      `Fetching transactions and bundles from block ${min} to ${max}. Page ${page}`,
    );
    const {
      transactions: { edges, pageInfo },
    } = await fetchGql({
      minBlock: min,
      maxBlock: max,
      cursor,
    });
+   // Add delay between requests
+   await sleep(RATE_LIMIT_DELAY);
```
Lines 267-294: Consider batching file operations for large datasets

For better performance with large datasets, consider batching file operations and adding progress tracking:

```diff
+ const BATCH_SIZE = 100;
+ let processedCount = 0;
+ const totalCount = transactions.size;
- for (const [height, ids] of transactions.entries()) {
+ const entries = Array.from(transactions.entries());
+ for (let i = 0; i < entries.length; i += BATCH_SIZE) {
+   const batch = entries.slice(i, i + BATCH_SIZE);
+   await Promise.all(
+     batch.map(async ([height, ids]) => {
        if (ids.size === 0) return;
        const content = JSON.stringify([...ids], null, 2);
        const filePath = path.join(outputDir, `${height}.json`);
        try {
          await fs.writeFile(filePath, content);
+         processedCount += 1;
+         if (processedCount % 10 === 0) {
+           console.log(`Progress: ${processedCount}/${totalCount} files written`);
+         }
        } catch (error) {
          console.error(`Failed to write ${filePath}: ${error}`);
          throw error;
        }
+     })
+   );
  }
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (1)

- `scripts/import-data/fetch-data-gql.ts` (1 hunks)
🔇 Additional comments (3)

scripts/import-data/fetch-data-gql.ts (3)

Lines 1-24: LGTM! Well-structured imports and license header

The code follows best practices by:

- Including the AGPL v3 license header
- Using Node.js built-in modules
- Implementing proper ESM path resolution

Lines 111-112: Fix incorrect property access in `fetchLatestBlockHeight`

The response from arweave.net/info contains a `height` property, not `blocks`.

Lines 239-241: Handle empty edges array to prevent errors

When processing the edges array, ensure it's not empty before accessing its elements.
```typescript
let GQL_ENDPOINT = 'https://arweave-search.goldsky.com/graphql';
let MIN_BLOCK_HEIGHT = 0;
let MAX_BLOCK_HEIGHT: number | undefined;
let BLOCK_RANGE_SIZE = 100;

args.forEach((arg, index) => {
  switch (arg) {
    case '--gqlEndpoint':
      if (args[index + 1]) {
        GQL_ENDPOINT = args[index + 1];
      } else {
        console.error('Missing value for --gqlEndpoint');
        process.exit(1);
      }
      break;
    case '--minHeight':
      if (args[index + 1]) {
        MIN_BLOCK_HEIGHT = parseInt(args[index + 1], 10);
      } else {
        console.error('Missing value for --minHeight');
        process.exit(1);
      }
      break;
    case '--maxHeight':
      if (args[index + 1]) {
        MAX_BLOCK_HEIGHT = parseInt(args[index + 1], 10);
      } else {
        console.error('Missing value for --maxHeight');
        process.exit(1);
      }
      break;
    case '--blockRangeSize':
      if (args[index + 1]) {
        BLOCK_RANGE_SIZE = parseInt(args[index + 1], 10);
      } else {
        console.error('Missing value for --blockRangeSize');
        process.exit(1);
      }
      break;
    default:
      break;
  }
});
```
🛠️ Refactor suggestion
Add input validation for parsed arguments
While the code handles missing values well, it should validate the parsed values to ensure they are valid numbers and within acceptable ranges.
Apply this diff to add validation:

```diff
  case '--minHeight':
    if (args[index + 1]) {
-     MIN_BLOCK_HEIGHT = parseInt(args[index + 1], 10);
+     const value = parseInt(args[index + 1], 10);
+     if (isNaN(value) || value < 0) {
+       console.error('Invalid value for --minHeight. Must be a non-negative integer.');
+       process.exit(1);
+     }
+     MIN_BLOCK_HEIGHT = value;
    } else {
      console.error('Missing value for --minHeight');
      process.exit(1);
    }
    break;
  case '--maxHeight':
    if (args[index + 1]) {
-     MAX_BLOCK_HEIGHT = parseInt(args[index + 1], 10);
+     const value = parseInt(args[index + 1], 10);
+     if (isNaN(value) || value < 0) {
+       console.error('Invalid value for --maxHeight. Must be a non-negative integer.');
+       process.exit(1);
+     }
+     MAX_BLOCK_HEIGHT = value;
    } else {
      console.error('Missing value for --maxHeight');
      process.exit(1);
    }
    break;
  case '--blockRangeSize':
    if (args[index + 1]) {
-     BLOCK_RANGE_SIZE = parseInt(args[index + 1], 10);
+     const value = parseInt(args[index + 1], 10);
+     if (isNaN(value) || value <= 0) {
+       console.error('Invalid value for --blockRangeSize. Must be a positive integer.');
+       process.exit(1);
+     }
+     BLOCK_RANGE_SIZE = value;
    } else {
      console.error('Missing value for --blockRangeSize');
      process.exit(1);
    }
```
```typescript
(async () => {
  if (MAX_BLOCK_HEIGHT === undefined) {
    MAX_BLOCK_HEIGHT = await fetchLatestBlockHeight();
  }

  const blockRanges = getBlockRanges({
    minBlock: MIN_BLOCK_HEIGHT,
    maxBlock: MAX_BLOCK_HEIGHT,
    rangeSize: BLOCK_RANGE_SIZE,
  });

  console.log(
    `Starting to fetch transactions and bundles from block ${MIN_BLOCK_HEIGHT} to ${MAX_BLOCK_HEIGHT}`,
  );

  for (const range of blockRanges) {
    const { transactions, bundles } = await getTransactionsForRange(range);

    await writeTransactionsToFile({
      outputDir: path.join(__dirname, 'transactions'),
      transactions,
    });
    await writeTransactionsToFile({
      outputDir: path.join(__dirname, 'bundles'),
      transactions: bundles,
    });

    if (transactions.size !== 0 || bundles.size !== 0) {
      console.log(
        `Transactions and bundles from block ${range.min} to ${range.max} saved!`,
      );
    }
  }
})();
```
🛠️ Refactor suggestion
Add interrupt handling and detailed progress tracking
The main execution could benefit from proper cleanup on interruption and more detailed progress tracking.
```diff
+  let interrupted = false;
+
+  process.on('SIGINT', () => {
+    console.log('\nGracefully shutting down...');
+    interrupted = true;
+  });
+
+  const totalRanges = blockRanges.length;
+  let processedRanges = 0;
+
   for (const range of blockRanges) {
+    if (interrupted) {
+      console.log('Operation interrupted by user');
+      process.exit(0);
+    }
+
     const { transactions, bundles } = await getTransactionsForRange(range);
     await writeTransactionsToFile({
       outputDir: path.join(__dirname, 'transactions'),
       transactions,
     });
     await writeTransactionsToFile({
       outputDir: path.join(__dirname, 'bundles'),
       transactions: bundles,
     });
+    processedRanges += 1;
     if (transactions.size !== 0 || bundles.size !== 0) {
       console.log(
-        `Transactions and bundles from block ${range.min} to ${range.max} saved!`,
+        `Progress: ${processedRanges}/${totalRanges} ranges processed. ` +
+        `Transactions and bundles from block ${range.min} to ${range.max} saved!`,
       );
     }
   }
```
We need an extra GQL call to get the root parent for BDIs. It also would be worthwhile to make the GQL query we're using to find data items configurable.
I commented on a couple of places where we could pull in libraries too, but perhaps the intent is to make this script as standalone as possible. I think that's worth doing for now, but if it becomes too cumbersome, pulling in some libraries is fine too.
```typescript
        min: ${minBlock}
        max: ${maxBlock}
      }
      tags: [
```
Let's make this configurable (maybe via cli specified file or environment variable) so users can use custom filters. We can leave the ArDrive tags as default right now for our own convenience.
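One possible shape for such configurability (a sketch only; the `GQL_TAGS_FILE` variable name and JSON schema are assumptions, not part of this PR):

```typescript
import fs from 'node:fs/promises';

type TagFilter = { name: string; values: string[] };

// Load custom tag filters from a JSON file pointed to by an env var,
// falling back to the ArDrive App-Name tags this PR uses by default.
async function loadTagFilters(): Promise<TagFilter[]> {
  const filePath = process.env.GQL_TAGS_FILE;
  if (filePath === undefined) {
    return [
      { name: 'App-Name', values: ['ArDrive-App', 'ArDrive-Web', 'ArDrive-CLI'] },
    ];
  }
  const raw = await fs.readFile(filePath, 'utf8');
  return JSON.parse(raw) as TagFilter[];
}
```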
```typescript
      }

      if (bundleId !== undefined) {
        bundles.get(blockHeight)?.add(bundleId);
```
We should do another GQL call here (one for each unique bundle ID) to retrieve the bundle itself and check whether it has a parent (to get the root ID in case of BDIs). Let's make this behavior configurable via CLI arg too, but default it to enabled.
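A minimal sketch of that lookup (assuming the standard Arweave GQL schema, where `transaction(id:)` exposes `bundledIn { id }`; the function name and use of the script's `GQL_ENDPOINT` constant are illustrative):

```typescript
// Follow bundledIn links until reaching a transaction with no parent,
// i.e., the root L1 bundle for a (possibly nested) BDI.
async function getRootTxId(id: string): Promise<string> {
  let currentId = id;
  for (;;) {
    const response = await fetch(GQL_ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        query: `query { transaction(id: "${currentId}") { bundledIn { id } } }`,
      }),
    });
    const { data } = await response.json();
    const parentId = data?.transaction?.bundledIn?.id;
    if (parentId === undefined || parentId === null) {
      return currentId;
    }
    currentId = parentId;
  }
}
```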
Do you want to be able to get BDIs and L1 bundles and make it configurable like "get only L1 bundles"/"get L1 bundles and BDIs"?
I was thinking it might be nice to skip the extra lookups and only get the BDIs for cases where someone wants the script to run faster.
So it would be a setting for bundles where you can get the root tx id OR the bdi id (current behavior in this PR), right?
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (7)

scripts/import-data/import-data.ts (2)

Lines 36-103: Refactor command-line argument parsing for scalability and maintainability

The repetitive pattern in the command-line argument parsing can make the code harder to maintain as new options are added. Consider using a dedicated command-line argument parsing library like `commander`, `yargs`, or `minimist` to simplify this section and improve scalability.

Line 31: Verify existence of directories and handle missing paths

The script assumes that `TRANSACTIONS_DIR` and `BUNDLES_DIR` exist. If these directories do not exist, the script may fail. It's advisable to check for the existence of these directories and create them if necessary, or inform the user about the missing directories. Would you like assistance in adding code to verify and create directories if they are missing?

Also applies to: lines 70-85
scripts/import-data/utils.ts (1)
19-62
: Include response details in error messagesWhen a non-OK HTTP response is received, the error thrown only includes the status code. Including the response body or status text can provide more context and aid in debugging.
Apply this diff to enhance error messages:
... if (!response.ok) { - throw new Error(`HTTP error! status: ${response.status}`); + const errorText = await response.text(); + throw new Error(`HTTP error! status: ${response.status}, body: ${errorText}`); } ...src/workers/bundle-data-importer.ts (1)
141-143
: Add documentation for theisQueueFull
methodThe new
isQueueFull
method is a valuable addition. Adding JSDoc comments would improve code readability and maintainability by explaining the purpose and usage of this method.Apply this diff to add documentation:
+ /** + * Checks if the bundle data importer queue has reached its maximum capacity. + * @returns {Promise<boolean>} True if the queue is full, false otherwise. + */ async isQueueFull(): Promise<boolean> { return this.queue.length() >= this.maxQueueSize; }docker-compose.yaml (1)
111-111
: Set a default value forBUNDLE_DATA_IMPORTER_QUEUE_SIZE
The environment variable
BUNDLE_DATA_IMPORTER_QUEUE_SIZE
currently defaults to an empty string if not set, which may lead to unexpected behavior. Consider setting a sensible default value to ensure the application functions correctly even if the environment variable is not provided.Apply this diff to provide a default value (e.g., 1000):
- - BUNDLE_DATA_IMPORTER_QUEUE_SIZE=${BUNDLE_DATA_IMPORTER_QUEUE_SIZE:-} + - BUNDLE_DATA_IMPORTER_QUEUE_SIZE=${BUNDLE_DATA_IMPORTER_QUEUE_SIZE:-1000}scripts/import-data/fetch-data-gql.ts (2)
Lines 241-285: Improve error handling and progress tracking in `getTransactionsForRange`

The function could benefit from:

- Better error handling for GraphQL responses
- Progress tracking for long-running operations
- Graceful interruption handling

Apply this diff to enhance the function:

```diff
  const getTransactionsForRange = async ({ min, max }: BlockRange) => {
    let cursor: string | undefined;
    let hasNextPage = true;
    const transactions: BlockTransactions = new Map();
    const bundles: BlockTransactions = new Map();
+   let totalProcessed = 0;
+   const startTime = Date.now();

    while (hasNextPage) {
+     try {
        const {
          transactions: { edges, pageInfo },
        } = await fetchGql({
          minBlock: min,
          maxBlock: max,
          cursor,
        });

        hasNextPage = pageInfo.hasNextPage;
-       cursor = hasNextPage ? edges[edges.length - 1].cursor : undefined;
+       if (hasNextPage && edges.length > 0) {
+         cursor = edges[edges.length - 1].cursor;
+       } else {
+         cursor = undefined;
+       }

        for (const edge of edges) {
          const blockHeight = edge.node.block.height;
          const bundleId = edge.node.bundledIn?.id;
          const id = edge.node.id;

          if (!transactions.has(blockHeight)) {
            transactions.set(blockHeight, new Set());
          }
          if (!bundles.has(blockHeight)) {
            bundles.set(blockHeight, new Set());
          }

          if (bundleId !== undefined) {
            if (BUNDLES_FETCH_ROOT_TX) {
              const rootTxId = await getRootTxId(bundleId);
              bundles.get(blockHeight)?.add(rootTxId);
            } else {
              bundles.get(blockHeight)?.add(bundleId);
            }
          } else {
            transactions.get(blockHeight)?.add(id);
          }
+         totalProcessed++;
        }
+
+       // Log progress every 1000 items
+       if (totalProcessed % 1000 === 0) {
+         const elapsedSeconds = (Date.now() - startTime) / 1000;
+         console.log(
+           `Progress: Processed ${totalProcessed} items in ${elapsedSeconds.toFixed(2)}s`,
+         );
+       }
+     } catch (error) {
+       console.error(`Error processing range ${min}-${max}: ${error}`);
+       throw error;
+     }
    }

    return { transactions, bundles };
  };
```
Lines 287-314: Enhance error handling in `writeTransactionsToFile`

The function should handle more specific error cases and provide better error messages. Apply this diff to improve error handling:

```diff
  const writeTransactionsToFile = async ({
    outputDir,
    transactions,
  }: {
    outputDir: string;
    transactions: BlockTransactions;
  }) => {
    try {
      await fs.mkdir(outputDir, { recursive: true });
    } catch (error) {
-     console.error(`Failed to create directory: ${error}`);
+     const errorMessage = error instanceof Error ? error.message : String(error);
+     console.error(`Failed to create directory ${outputDir}: ${errorMessage}`);
      throw error;
    }

    for (const [height, ids] of transactions.entries()) {
      if (ids.size === 0) continue;

-     const content = JSON.stringify([...ids]);
      const filePath = path.join(outputDir, `${height}.json`);
      try {
+       const content = JSON.stringify([...ids], null, 2); // Pretty print for better readability
        await fs.writeFile(filePath, content);
      } catch (error) {
-       console.error(`Failed to write ${filePath}: ${error}`);
+       const errorMessage = error instanceof Error ? error.message : String(error);
+       console.error(
+         `Failed to write to ${filePath} for block height ${height}: ${errorMessage}`,
+       );
        throw error;
      }
    }
  };
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (8)

- `docker-compose.yaml` (1 hunks)
- `scripts/import-data/fetch-data-gql.ts` (1 hunks)
- `scripts/import-data/import-data.ts` (1 hunks)
- `scripts/import-data/utils.ts` (1 hunks)
- `src/config.ts` (1 hunks)
- `src/routes/ar-io.ts` (1 hunks)
- `src/system.ts` (1 hunks)
- `src/workers/bundle-data-importer.ts` (3 hunks)
🔇 Additional comments (3)

src/routes/ar-io.ts (1)

Lines 213-216: LGTM! Good addition of queue overflow protection

The check for queue capacity before accepting new bundles is a good practice to prevent system overload. A sketch of the pattern is shown below.
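For illustration, a guard like this might look as follows in an Express handler (the route path, response text, and `system` import are assumptions; `isQueueFull` is the method added in this PR):

```typescript
import { Router } from 'express';
import * as system from '../system.js';

const router = Router();

// Reject new bundle imports while the importer queue is at capacity.
router.post('/ar-io/admin/queue-bundle', async (req, res) => {
  if (await system.bundleDataImporter.isQueueFull()) {
    res.status(429).send('Bundle importer queue is full');
    return;
  }
  // ... validate the request and queue the bundle for import ...
  res.status(200).send('Bundle queued');
});
```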
src/config.ts (1)

Lines 201-205: LGTM! Well-documented configuration constant

The new configuration constant follows the established pattern and provides a reasonable default value.
src/system.ts (1)

Lines 472-477: LGTM! Verify the usage of the exported `bundleDataImporter`.

The changes look good. The export of `bundleDataImporter` and the addition of the `maxQueueSize` parameter align with the PR objectives for importing data from GQL. Let's verify the usage of the exported `bundleDataImporter`.

✅ Verification successful

The previous search might have been too specific, so we broadened it to find any references to bundleDataImporter and system imports, as well as route files where this exported instance might be used. We found references to `bundleDataImporter` in three files, then examined how it's actually used in the routes and test files to verify that the export is properly utilized and necessary.

Export of bundleDataImporter is correctly used and necessary. The export is properly utilized in the routes file (src/routes/ar-io.ts), where it's accessed through the system instance to check queue status and process bundles. The test file confirms the implementation details and usage patterns align with the exported functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

```bash
#!/bin/bash
# Search for imports and usage of bundleDataImporter
rg -A 3 "import.*bundleDataImporter.*from.*system"
```

Length of output: 52

```bash
#!/bin/bash
# Broader search for bundleDataImporter usage
rg -l "bundleDataImporter"
# Also search for any imports from system.ts
rg "from ['\"].*system['\"]"
# Check if there are any route files that might use it
fd "routes?"
```

Length of output: 163

```bash
#!/bin/bash
# Check how bundleDataImporter is used in the routes file
rg -A 5 "bundleDataImporter" src/routes/ar-io.ts
# Also check the test file to understand the usage pattern
rg -A 5 "bundleDataImporter" src/workers/bundle-data-importer.test.ts
```

Length of output: 2442
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (3)

scripts/import-data/utils.ts (1)

Lines 76-100: Optimize file processing and add folder validation

The function could benefit from several improvements:

- Validate folder existence before processing
- Use async iteration for better memory efficiency with large directories
- Add input validation for min/max values

Apply this diff to implement the improvements:

```diff
  export const getFilesInRange = async ({
    folder,
    min,
    max,
  }: {
    folder: string;
    min: number;
    max: number;
  }): Promise<string[]> => {
+   if (min > max) {
+     throw new Error('Min value cannot be greater than max value');
+   }
    try {
+     await fs.access(folder);
      const files = await fs.readdir(folder);
      const filesInRange = files
        .filter((file) => path.extname(file) === '.json')
        .filter((file) => {
          const match = file.match(/^\d+/);
-         const number = match ? parseInt(match[0], 10) : null;
+         if (!match) return false;
+         const number = parseInt(match[0], 10);
+         if (isNaN(number)) return false;
          return number !== null && number >= min && number <= max;
        });
      return filesInRange;
    } catch (error) {
+     if (error instanceof Error && error.message.includes('ENOENT')) {
+       throw new Error(`Folder does not exist: ${folder}`);
+     }
      throw new Error(`Error processing files: ${(error as Error).message}`);
    }
  };
```
scripts/import-data/count-fetched-ids.ts (2)

Lines 26-29: Move configuration to a separate file

Consider moving these constants to a dedicated configuration file for better maintainability and reusability across scripts. Create a new file `config.ts`:

```typescript
import path from 'node:path';
import { fileURLToPath } from 'node:url';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

export const DEFAULT_CONFIG = {
  TRANSACTIONS_DIR: path.join(__dirname, 'transactions'),
  BUNDLES_DIR: path.join(__dirname, 'bundles'),
  MIN_BLOCK_HEIGHT: 0,
  MAX_BLOCK_HEIGHT: Infinity,
} as const;
```

Lines 86-130: Remove commented-out code

The commented-out `importFromFiles` function should be removed if it's no longer needed. If it's for future use, consider moving it to a separate file or tracking it in an issue.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (3)

- `scripts/import-data/count-fetched-ids.ts` (1 hunks)
- `scripts/import-data/import-data.ts` (1 hunks)
- `scripts/import-data/utils.ts` (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

- `scripts/import-data/import-data.ts`
Noticed some commented code that was left in that could be removed, but otherwise looks good.
```typescript
  return counter;
};

// const importFromFiles = async ({
```
Looks like this could be removed.
Yep, my bad
Marked my review as pending again since the Parquet export script will go into this branch too.
a970197 to 035c895
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (1)

scripts/import-data/export-parquet.ts (1)

Lines 25-31: Use a consistent naming convention for constants

Consider using SCREAMING_SNAKE_CASE for all constant declarations to maintain consistency with JavaScript/TypeScript conventions:

```diff
- let ARIO_ENDPOINT = 'http://localhost:4000';
- let ADMIN_KEY: string | undefined;
- let OUTPUT_DIR = path.join(__dirname, 'parquet');
- let MAX_FILE_ROWS = 1_000_000;
- let MIN_BLOCK_HEIGHT = 0;
- let MAX_BLOCK_HEIGHT: number | undefined;
+ const ARIO_ENDPOINT = 'http://localhost:4000';
+ const ADMIN_KEY: string | undefined;
+ const OUTPUT_DIR = path.join(__dirname, 'parquet');
+ const MAX_FILE_ROWS = 1_000_000;
+ const MIN_BLOCK_HEIGHT = 0;
+ const MAX_BLOCK_HEIGHT: number | undefined;
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (5)

- `.gitignore` (1 hunks)
- `scripts/import-data/count-fetched-ids.ts` (1 hunks)
- `scripts/import-data/export-parquet.ts` (1 hunks)
- `scripts/import-data/fetch-data-gql.ts` (1 hunks)
- `scripts/import-data/utils.ts` (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

- `.gitignore`
- `scripts/import-data/count-fetched-ids.ts`
- `scripts/import-data/utils.ts`
- `scripts/import-data/fetch-data-gql.ts`
🔇 Additional comments (1)

scripts/import-data/export-parquet.ts (1)

Lines 1-24: LGTM! Well-structured imports and proper license header

The code follows best practices by using the node: protocol for built-in modules and includes the necessary .js extension for ESM imports.
```typescript
(async () => {
  if (ADMIN_KEY === undefined) {
    throw new Error('Missing admin key');
  }

  if (MAX_BLOCK_HEIGHT === undefined) {
    MAX_BLOCK_HEIGHT = await fetchLatestBlockHeight();
  }

  await fetchWithRetry(`${ARIO_ENDPOINT}/ar-io/admin/export-parquet`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${ADMIN_KEY}`,
    },
    body: JSON.stringify({
      outputDir: OUTPUT_DIR,
      startHeight: MIN_BLOCK_HEIGHT,
      endHeight: MAX_BLOCK_HEIGHT,
      maxFileRows: MAX_FILE_ROWS,
    }),
  });

  console.log(
    `Parquet export started from block ${MIN_BLOCK_HEIGHT} to ${MAX_BLOCK_HEIGHT}`,
  );
})();
```
🛠️ Refactor suggestion
Enhance error handling and user feedback.
The main execution block needs improvements in error handling and user feedback:
```diff
 (async () => {
+  try {
     if (ADMIN_KEY === undefined) {
       throw new Error('Missing admin key');
     }
     if (MAX_BLOCK_HEIGHT === undefined) {
       MAX_BLOCK_HEIGHT = await fetchLatestBlockHeight();
     }
-    await fetchWithRetry(`${ARIO_ENDPOINT}/ar-io/admin/export-parquet`, {
+    const response = await fetchWithRetry(`${ARIO_ENDPOINT}/ar-io/admin/export-parquet`, {
       method: 'POST',
       headers: {
         'Content-Type': 'application/json',
         Authorization: `Bearer ${ADMIN_KEY}`,
       },
       body: JSON.stringify({
         outputDir: OUTPUT_DIR,
         startHeight: MIN_BLOCK_HEIGHT,
         endHeight: MAX_BLOCK_HEIGHT,
         maxFileRows: MAX_FILE_ROWS,
       }),
     });
+
+    if (!response.ok) {
+      throw new Error(`Export failed with status: ${response.status}`);
+    }
     console.log(
       `Parquet export started from block ${MIN_BLOCK_HEIGHT} to ${MAX_BLOCK_HEIGHT}`,
     );
+  } catch (error) {
+    console.error('Export failed:', error.message);
+    process.exit(1);
+  }
 })();
```
Also consider adding:

- Progress updates for long-running exports
- A status endpoint to check export progress
- Graceful shutdown handling (see the sketch after this list)
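A minimal sketch of graceful shutdown handling (illustrative only; this script currently issues a single POST, so the pattern matters most if the export grows into a multi-step loop):

```typescript
// Track an interrupt flag so a long-running loop can exit cleanly.
let shuttingDown = false;

process.on('SIGINT', () => {
  console.log('\nReceived SIGINT, finishing the current step before exiting...');
  shuttingDown = true;
});

async function runSteps(steps: Array<() => Promise<void>>): Promise<void> {
  for (const step of steps) {
    if (shuttingDown) {
      console.log('Shutdown requested; stopping early.');
      return;
    }
    await step();
  }
}
```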
It will throw an error if the POST fails.
@karlprieb, understood. Since `fetchWithRetry` throws an error if the POST fails, additional error handling might not be necessary. Thank you for the clarification. 🐰🤔
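For context, a minimal sketch of what a `fetchWithRetry` helper like the one in `scripts/import-data/utils.ts` might look like (the retry count, backoff, and exact signature here are assumptions, not the PR's actual implementation):

```typescript
async function fetchWithRetry(
  url: string,
  init: RequestInit,
  retries = 3,
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const response = await fetch(url, init);
      if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
      }
      return response;
    } catch (error) {
      lastError = error;
      // Exponential backoff between attempts.
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
    }
  }
  throw lastError;
}
```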
✏️ Learnings added

Learnt from: karlprieb
PR: ar-io/ar-io-node#247
File: scripts/import-data/export-parquet.ts:0-0
Timestamp: 2024-12-13T17:40:34.289Z
Learning: The `fetchWithRetry` function throws an error if the POST request fails.
cfe7e45 to 3e409f4