Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix line ending handling in reverse_readfile/readline across OS, and not skipping empty lines #712

Merged
merged 102 commits into from
Dec 11, 2024

Conversation

DanielYang59
Copy link
Contributor

@DanielYang59 DanielYang59 commented Sep 4, 2024

I. Fix line ending handling (mainly in Windows with "\r\n")

  • Hard coded line ending leads to dangling \r for Windows
  • Hard coded line ending length 1 prevents reverse_readline from working for Windows:

    monty/src/monty/io.py

    Lines 131 to 133 in 1270c7b

    if file_size < max_mem or isinstance(m_file, gzip.GzipFile) or os.name == "nt":
    for line in reversed(m_file.readlines()):
    yield line.rstrip()
  • [Behaviour change] Yield line as is, without removing the line ending character (no rstrip)

II. Performance benchmark

Compare with current implementation to make sure there's no performance regression, with Python 3.12, using script e4940e0.

Ubuntu 22.04-WSL2

Windows 11


III. Test downstream packages

Summary by CodeRabbit

  • New Features

    • Enhanced file handling for compressed files and line endings.
    • Added functionality to determine line endings in files.
  • Bug Fixes

    • Improved error handling for empty files and unknown file types.
  • Tests

    • Introduced new tests for line ending detection and file reading scenarios.
    • Updated existing tests for consistency and clarity across different platforms.
  • Chores

    • Minor adjustments to type hints and import statements for clarity.

Copy link

coderabbitai bot commented Sep 4, 2024

Walkthrough

The changes in this pull request primarily enhance file handling capabilities in the src/monty/io.py file, particularly for compressed files and line endings. Key updates include the introduction of a new helper function _get_line_ending, modifications to the reverse_readfile and reverse_readline functions to improve error handling and maintain line endings, and the addition of type hints for better type safety. The test suite has also been expanded and refined to cover more scenarios related to these functionalities, ensuring robust validation of the new features.

Changes

File Change Summary
src/monty/io.py - Added _get_line_ending function for line ending detection.
- Updated reverse_readfile and reverse_readline for better error handling and line ending preservation.
- Enforced explicit mode specifications in zopen and added warnings for deprecated extensions.
tests/test_io.py - Introduced TestGetLineEnding class with parameterized tests for _get_line_ending.
- Enhanced existing tests in TestReverseReadline and TestReverseReadfile for line endings and edge cases.
tests/test_shutil.py - Updated platform checks in TestCopyR and TestRemove for clarity and consistency across tests.
tests/test_tempfile.py - Refined platform detection logic in TestScratchDir and corrected indentation in test_no_copy.
src/monty/json.py - Modified orjson import to include a type ignore comment.
src/monty/re.py - Updated type hint for gen.close() in regrep function to broaden type ignoring.

Possibly related PRs

🐰 In the land where files do play,
Line endings dance in a joyful array.
With helpers and tests, we leap and bound,
In the world of I/O, new treasures are found!
So hop along, dear friends, don’t delay,
Celebrate the changes in a bouncy ballet! 🎉


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

codecov bot commented Sep 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.19%. Comparing base (ae78429) to head (3df6709).
Report is 103 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #712      +/-   ##
==========================================
+ Coverage   83.90%   84.19%   +0.28%     
==========================================
  Files          27       27              
  Lines        1640     1670      +30     
  Branches      302      314      +12     
==========================================
+ Hits         1376     1406      +30     
  Misses        202      202              
  Partials       62       62              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@DanielYang59 DanielYang59 changed the title Allow selecting line endings in reverse_readfile and reverse_readline Detect line endings in reverse_readfile and reverse_readline Sep 4, 2024
@hongyi-zhao
Copy link

If, for some reason, the OUTCAR file has mixed style line endings, then how should it be handled? Therefore, I believe that the safest approach might be to first convert the given OUTCAR file to a unified UNIX line ending style.

@DanielYang59
Copy link
Contributor Author

Hi @hongyi-zhao thanks for the comment.

If, for some reason, the OUTCAR file has mixed style line endings

I believe that's technically possible, but I don't think reverse_readline/file should handle cases where files are badly formatted. reverse_readline/file would expect the file to be correctly formatted (i.e. has a single line ending across the entire file in our case), if not, the user should correct the format (dos2unix or unit2dos or such) before sending it into the pipeline.

@hongyi-zhao
Copy link

dos2unix or unit2dos or such

It should be unix2dos.

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Sep 5, 2024

dos2unix or unit2dos or such

It should be unix2dos.

Either serves the same purpose to unify line ending (reverse_readfile/line assume unified line ending, not just unified Unix line ending), so it doesn't matter which one you're opting for.

BTW, unix2dos (Unix to DOS) converts to DOS/Windows format.

@hongyi-zhao
Copy link

Don't mind, I mean you have a typo in your previous posting: unit2dos should be unix2dos.

@DanielYang59
Copy link
Contributor Author

Ah okay, didn't notice that, thanks!

Copy link
Contributor

@janosh janosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for fixing this issue @DanielYang59! 👍

src/monty/io.py Show resolved Hide resolved
tests/test_io.py Show resolved Hide resolved
@DanielYang59 DanielYang59 marked this pull request as ready for review September 19, 2024 03:46
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Outside diff range and nitpick comments (7)
benchmark/benchmark.py (2)

169-171: Include monty module version in the output

For better reproducibility and clarity, it's beneficial to display the version of the monty module being used in the benchmark. This information can be crucial when comparing results across different environments or after updates.

You can modify your script to include the monty version:

 import sys
+import monty

 os_info = platform.platform()
 python_version = sys.version.split()[0]
 print(f"\nRunning on OS: {os_info}, Python {python_version}")
+print(f"Monty version: {monty.__version__}")

6-6: Evaluate the necessity of the __future__ import

The from __future__ import annotations statement is used to postpone the evaluation of type annotations, which can be helpful in some cases. However, since the script isn't extensively utilizing annotations that require this import, and if you're running on Python 3.7 or later, this import may not be necessary.

Consider removing the import if it's not needed:

-from __future__ import annotations

Alternatively, if you plan to expand the use of annotations, especially with forward references or complex type hints, you might decide to retain it.

benchmark/pypi-7.12-win11.txt (2)

5-5: Ensure Consistent Phrasing for Timing Outputs

The lines reporting the creation of test files use the phrase "time used," whereas other timing outputs use "time taken." For consistency and clarity, consider changing "time used" to "time taken" in these lines.

Apply this diff to maintain consistency:

-Test file of size 1 MB created with 40757 lines, time used 0.02 seconds.
+Test file of size 1 MB created with 40757 lines, time taken: 0.02 seconds.

-Test file of size 10 MB created with 392476 lines, time used 0.17 seconds.
+Test file of size 10 MB created with 392476 lines, time taken: 0.17 seconds.

-Test file of size 100 MB created with 3784596 lines, time used 1.71 seconds.
+Test file of size 100 MB created with 3784596 lines, time taken: 1.71 seconds.

-Test file of size 500 MB created with 18462038 lines, time used 8.34 seconds.
+Test file of size 500 MB created with 18462038 lines, time taken: 8.34 seconds.

-Test file of size 1000 MB created with 36540934 lines, time used 16.35 seconds.
+Test file of size 1000 MB created with 36540934 lines, time taken: 16.35 seconds.

-Test file of size 5000 MB created with 178466370 lines, time used 85.79 seconds.
+Test file of size 5000 MB created with 178466370 lines, time taken: 85.79 seconds.

Also applies to: 29-29, 53-53, 77-77, 101-101, 125-125


1-145: Consider Excluding Benchmark Result Files from the Repository

Including raw benchmark result files like benchmark/pypi-7.12-win11.txt in the repository can increase its size unnecessarily and may not provide significant value to other developers. Benchmark results can vary between environments and are typically regenerated as needed.

Consider removing the benchmark result file from the repository. Instead, you can summarize key findings in the project's documentation or the pull request description. This approach keeps the repository lean and focuses on the most relevant information.

tests/test_io.py (3)

46-49: Typo in variable name start_pot; consider renaming to start_pos

The variable start_pot in lines 46 and 48 appears to be a typo. Renaming it to start_pos would better represent its purpose as the starting position of the file pointer.

Apply this diff to correct the variable name:

             with open(test_file, "r", encoding="utf-8") as f:
-                start_pot = f.tell()
+                start_pos = f.tell()
                 assert _get_line_ending(f) == l_end
-                assert f.tell() == start_pot
+                assert f.tell() == start_pos

52-55: Typo in variable name start_pot; consider renaming to start_pos

Similarly, in the binary mode test, consider renaming start_pot to start_pos for consistency.

Apply this diff to correct the variable name:

             with open(test_file, "rb") as f:
-                start_pot = f.tell()
+                start_pos = f.tell()
                 assert _get_line_ending(f) == l_end
-                assert f.tell() == start_pot
+                assert f.tell() == start_pos

198-198: Clarify the failure message in the empty file test

The failure message "No error should be thrown." might be misleading, as the test ensures no lines are read from an empty file. Consider updating the message for clarity.

Apply this diff to update the message:

             for _line in reverse_readline(f):
-                pytest.fail("No error should be thrown.")
+                pytest.fail("No lines should be read from an empty file.")
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1270c7b and f354756.

Files ignored due to path filters (3)
  • tests/test_files/3000_lines.txt.bz2 is excluded by !**/*.bz2
  • tests/test_files/3000_lines.txt.gz is excluded by !**/*.gz
  • tests/test_files/3000lines.txt.gz is excluded by !**/*.gz
Files selected for processing (12)
  • .github/workflows/test.yml (2 hunks)
  • benchmark/benchmark.py (1 hunks)
  • benchmark/develop-ubuntu2204.txt (1 hunks)
  • benchmark/develop-win11.txt (1 hunks)
  • benchmark/pypi-7.12-ubuntu2204.txt (1 hunks)
  • benchmark/pypi-7.12-win11.txt (1 hunks)
  • src/monty/io.py (5 hunks)
  • tests/test_files/3000_lines.txt (1 hunks)
  • tests/test_io.py (4 hunks)
  • tests/test_multiprocessing.py (1 hunks)
  • tests/test_shutil.py (1 hunks)
  • tests/test_tempfile.py (2 hunks)
Files skipped from review due to trivial changes (3)
  • tests/test_files/3000_lines.txt
  • tests/test_multiprocessing.py
  • tests/test_shutil.py
Additional comments not posted (4)
.github/workflows/test.yml (2)

9-9: Confirm the necessity of fail-fast: false in the build strategy

Adding fail-fast: false ensures that all matrix jobs continue to run even if one fails, allowing for comprehensive test results across all environments. Please verify that this aligns with your intended testing workflow and that the extended build times are acceptable.


33-33: Ensure compatibility with Codecov Action V4

Updating the Codecov action from @v3 to @v4 may introduce changes that affect coverage report uploads. Please confirm that the new version is compatible with your configuration and update any settings if necessary.

You can review the release notes for codecov/codecov-action@v4 to check for breaking changes or required adjustments:

tests/test_tempfile.py (2)

4-4: Addition of platform import is appropriate and necessary.

The platform module is required for the operating system check using platform.system() later in the code.


124-124: OS check updated for clarity and compatibility.

Replacing the OS check with platform.system() != "Windows" provides a clearer and more explicit way to determine if the code is not running on a Windows system, enhancing readability and maintainability.

benchmark/benchmark.py Outdated Show resolved Hide resolved
benchmark/benchmark.py Outdated Show resolved Hide resolved
benchmark/pypi-7.12-win11.txt Outdated Show resolved Hide resolved
benchmark/develop-win11.txt Outdated Show resolved Hide resolved
benchmark/develop-ubuntu2204.txt Outdated Show resolved Hide resolved
src/monty/io.py Outdated Show resolved Hide resolved
src/monty/io.py Show resolved Hide resolved
src/monty/io.py Show resolved Hide resolved
tests/test_io.py Outdated Show resolved Hide resolved
tests/test_io.py Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (4)
tests/test_tempfile.py (1)

125-125: LGTM: Improved OS check for better clarity and consistency.

The change from os.name != "nt" to platform.system() != "Windows" improves the clarity and reliability of the OS check. This aligns well with similar changes made in other files of the project.

For even better consistency across the codebase, consider using a constant for the "Windows" string:

WINDOWS = "Windows"
# ...
if platform.system() != WINDOWS:

This would make it easier to maintain if you need to change the string in the future.

tests/test_shutil.py (1)

Line range hint 170-170: Appropriate test skipping for Windows platform

The addition of @unittest.skipIf(platform.system() == "Windows", "Skip on windows") to all test methods in the TestRemove class is a good improvement. This change aligns well with the PR objectives of enhancing cross-platform compatibility and prevents test failures on Windows due to unsupported operations.

Consider making the skip message more informative, e.g., "Skipping file removal test on Windows due to different file system behavior". This would provide more context about why the test is being skipped.

Also applies to: 178-178, 186-186, 198-198

src/monty/io.py (1)

209-212: Consider simplifying line ending handling

Since l_end can only be "\n" or "\r\n", and their lengths are known, you may simplify the code by directly assigning the length without using len() and cast.

Apply this diff to simplify:

     l_end: Literal["\r\n", "\n"] = _get_line_ending(m_file)
-    len_l_end: Literal[1, 2] = cast(Literal[1, 2], len(l_end))
+    len_l_end: int = 2 if l_end == "\r\n" else 1

This removes the need for cast and makes the code clearer.

tests/test_io.py (1)

264-266: Clarify line ending handling in assertions for consistency.

The assertion combines line.rstrip(os.linesep) with l_end, which may lead to confusion due to OS-specific line endings.

Consider normalizing the line endings before comparison to make the assertion clearer:

-    assert (
-        line.rstrip(os.linesep) + l_end
-        == contents[len(contents) - idx - 1]
-    )
+    expected_line = contents[len(contents) - idx - 1].replace('\r\n', '\n').rstrip('\n')
+    actual_line = line.replace('\r\n', '\n').rstrip('\n')
+    assert actual_line == expected_line
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 1cdc75b and 279cd4c.

📒 Files selected for processing (5)
  • .github/workflows/test.yml (2 hunks)
  • src/monty/io.py (5 hunks)
  • tests/test_io.py (4 hunks)
  • tests/test_shutil.py (1 hunks)
  • tests/test_tempfile.py (2 hunks)
🧰 Additional context used
🪛 actionlint
.github/workflows/test.yml

10-10: key "fail-fast" is duplicated in "strategy" section. previously defined at line:8,col:7

(syntax-check)

🪛 yamllint
.github/workflows/test.yml

[error] 10-10: duplication of key "fail-fast" in mapping

(key-duplicates)

🔇 Additional comments (15)
tests/test_tempfile.py (2)

4-4: LGTM: Import statement added correctly.

The platform module import is correctly placed and necessary for the subsequent changes in the file.


Line range hint 1-153: Summary: Improved OS detection for better cross-platform compatibility.

The changes in this file enhance the OS detection mechanism, making it more explicit and consistent with other parts of the project. The addition of the platform module and the updated condition in test_symlink method improve the clarity and reliability of the tests on different operating systems.

These modifications align well with the PR objectives of addressing line ending handling issues across different OS, particularly for Windows systems. The changes do not introduce any new issues and maintain the existing functionality of the tests.

tests/test_shutil.py (1)

34-34: Improved platform check for symlink creation

The change from os.name != "nt" to platform.system() != "Windows" is a good improvement. It makes the condition more explicit and easier to understand, enhancing code readability and maintainability. This change is consistent with similar updates in other files, as mentioned in the PR summary, and aligns well with the goal of improving cross-platform compatibility.

src/monty/io.py (11)

16-16: LGTM

The import of warnings is appropriate and enhances the module's ability to issue warnings.


18-19: LGTM

Importing TYPE_CHECKING, Literal, and cast from typing is appropriate for type annotations and conditional imports used later in the code.


26-26: LGTM

Importing IO, Iterator, and Union inside the TYPE_CHECKING block is standard practice for optional type annotations that assist with static analysis without impacting runtime.


115-116: LGTM

The updated function signature for reverse_readfile includes type hints, improving code readability and static analysis.


165-168: LGTM

The function reverse_readline now includes detailed type annotations and defaults, improving clarity and type safety.


210-211: Verify the correctness of cast usage for len_l_end

The use of cast to assign the length of l_end to len_l_end with type Literal[1, 2] is intended to aid static type checking. Ensure that this use of cast is appropriate and does not introduce type inconsistencies.

Review the necessity of the cast and consider if a simple type annotation would suffice.


264-266: Ensure correct decoding based on file mode

When reading blocks from the file, ensure that decoding is handled properly depending on whether the file is opened in text or binary mode.


135-138: ⚠️ Potential issue

Address potential high memory usage when reading compressed files

In the reverse_readfile function, these lines use file.readlines() to read compressed files, which loads the entire file into memory. For large gzip or bz2 files, this can lead to excessive memory consumption.

Consider processing the compressed files in a memory-efficient way. For example, you could read and yield lines one at a time:

     if isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
-        for line in reversed(file.readlines()):
-            # "readlines" would keep the line end character
-            yield line.decode("utf-8")
+        lines = []
+        for line in file:
+            lines.append(line)
+            if len(lines) >= max_mem // blk_size:
+                for rev_line in reversed(lines):
+                    yield rev_line.decode("utf-8")
+                lines.clear()
+        for rev_line in reversed(lines):
+            yield rev_line.decode("utf-8")

This approach processes the file in chunks, reducing peak memory usage.

Likely invalid or redundant comment.


105-112: ⚠️ Potential issue

Consider supporting additional line endings or clarifying the error message in _get_line_ending

The _get_line_ending function currently checks only for "\r\n" (Windows) and "\n" (Unix/Linux) line endings. If a file uses an older Mac OS line ending ("\r"), the function will raise a ValueError. Consider adding support for "\r" line endings or updating the error message to specify the supported line endings.

To handle "\r" line endings, apply this diff:

     if first_line.endswith(b"\r\n"):
         return "\r\n"
     if first_line.endswith(b"\n"):
         return "\n"
+    if first_line.endswith(b"\r"):
+        return "\r"

     # It's likely the line is missing a line ending for the first line
     raise ValueError(f"Unknown line ending in line {repr(first_line)}.")

Alternatively, clarify the error message:

     # It's likely the line is missing a line ending for the first line
-    raise ValueError(f"Unknown line ending in line {repr(first_line)}.")
+    raise ValueError(f"Unsupported line ending in line {repr(first_line)}. Only '\\n' and '\\r\\n' are supported.")

Likely invalid or redundant comment.


20-23: Ensure conditional import of lzma handles availability correctly

The conditional import of lzma allows the code to handle environments where lzma is unavailable. Verify that subsequent code gracefully handles lzma being None.

Run the following script to check usage of lzma when it is None:


129-132: Ensure efficient retrieval of line endings

The use of _get_line_ending enhances the robustness of line ending detection. Verify that this does not introduce significant overhead for large files.

Run the following script to benchmark the performance impact:

tests/test_io.py (1)

3-4: Necessary imports for compression handling.

The imports bz2 and gzip are correctly included to handle bzip2 and gzip files in the tests.

src/monty/io.py Show resolved Hide resolved
src/monty/io.py Outdated Show resolved Hide resolved
tests/test_io.py Outdated Show resolved Hide resolved
@DanielYang59 DanielYang59 marked this pull request as draft October 22, 2024 03:20
@DanielYang59 DanielYang59 marked this pull request as ready for review October 22, 2024 03:33
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (5)
src/monty/io.py (4)

45-45: LGTM: Enhanced file format support and Path handling

The addition of .XZ and .LZMA file support improves the function's versatility. The explicit string conversion of Path objects is a good practice for ensuring compatibility.

Consider using os.fspath() for converting Path objects to strings, as it's the recommended way to handle both string and path-like objects:

-    if filename is not None and isinstance(filename, Path):
-        filename = str(filename)
+    if filename is not None:
+        filename = os.fspath(filename)

This change would handle both str and Path objects uniformly.

Also applies to: 50-51


55-107: LGTM: Well-implemented line ending detection

The _get_line_ending function is a robust implementation for detecting file line endings. It handles various file types, follows standards, and uses appropriate error handling.

Consider adding a comment explaining why \r (CR) line endings are not supported, as some legacy systems might still use them:

# Note: We don't check for '\r' (CR) line endings as they are rarely used in modern systems.
# If support for CR line endings is needed, add an additional check here.
if first_line.endswith(b"\r\n"):
    return "\r\n"
if first_line.endswith(b"\n"):
    return "\n"

This addition would clarify the design decision and make it easier to add support for CR line endings if needed in the future.


Line range hint 110-157: LGTM: Improved line ending handling and file type support

The changes to reverse_readfile enhance its robustness and consistency in handling different file types and line endings. The use of _get_line_ending and yielding lines with original endings are good improvements.

Consider caching the result of len(filemap) to avoid repeated calls:

 filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
+filemap_len = len(filemap)
-file_size = len(filemap)
+file_size = filemap_len
 while file_size > 0:
     # ... existing code ...
-    elif file_size != len(filemap):
+    elif file_size != filemap_len:

This minor optimization could slightly improve performance, especially for large files.


160-272: LGTM: Enhanced type safety and line ending handling

The updates to reverse_readline improve type safety with better annotations and ensure consistent line ending handling across different file types. The logic for different file handling scenarios is well-implemented.

Consider adding a comment explaining the rationale behind the in-memory reversal for small files and gzip files:

# For small files or gzip files, we reverse the entire file in memory.
# This is more efficient for small files and necessary for gzip files
# which don't support reverse seeking.
if file_size < max_mem or isinstance(m_file, gzip.GzipFile):
    for line in reversed(m_file.readlines()):
        yield line if isinstance(line, str) else cast(bytes, line).decode("utf-8")

This comment would clarify the design decision and make the code more maintainable.

tests/test_io.py (1)

281-281: Consider consistent naming for class variables

For consistency, consider using uppercase for all class-level constants. In this case, NUM_LINES is uppercase, while NUMLINES in the TestReverseReadline class is not.

Apply this diff for consistency:

-    NUMLINES = 3000
+    NUM_LINES = 3000

Also update all references to NUMLINES in the TestReverseReadline class to NUM_LINES.

Also applies to: 284-284

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 279cd4c and d70e80e.

📒 Files selected for processing (5)
  • .github/workflows/test.yml (1 hunks)
  • src/monty/io.py (5 hunks)
  • src/monty/json.py (1 hunks)
  • src/monty/re.py (1 hunks)
  • tests/test_io.py (4 hunks)
✅ Files skipped from review due to trivial changes (2)
  • src/monty/json.py
  • src/monty/re.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/test.yml
🧰 Additional context used
🔇 Additional comments (6)
src/monty/io.py (3)

17-17: LGTM: Improved imports and type annotations

The new imports and type annotations enhance the code's type safety and provide better tooling support. The use of warnings is appropriate for the new functionality.

Also applies to: 19-19, 22-22


374-374: LGTM: Improved docstring clarity

The updated docstring now clearly states that the function only works on UNIX-like operating systems, which is an important clarification for users.


Line range hint 1-384: LGTM: Significant improvements to file I/O handling

The changes in this file represent a substantial improvement in file I/O operations, particularly in handling line endings and supporting various file types. Key improvements include:

  1. Centralized line ending detection with the new _get_line_ending function.
  2. Enhanced support for compressed file formats.
  3. Improved type annotations for better type safety and code clarity.
  4. More robust handling of edge cases, such as empty files.

These changes make the code more reliable, maintainable, and consistent across different file types and operating systems. The additions are well-integrated with the existing code and follow good coding practices.

tests/test_io.py (3)

24-118: Excellent addition of comprehensive test cases for _get_line_ending

The new TestGetLineEnding class provides a thorough set of test cases for the _get_line_ending function. It covers various scenarios including different line endings, file types (text, binary, compressed), and edge cases. This comprehensive approach ensures robust testing of the function's behavior across different situations.


Line range hint 122-277: Improved test coverage for reverse_readline

The updates to TestReverseReadline class enhance the test coverage by including tests for different line endings, empty files, and both text and binary modes. This ensures that the reverse_readline function works correctly across various scenarios.


Line range hint 279-370: Enhanced test coverage for reverse_readfile

The updates to TestReverseReadfile class improve the test coverage by including tests for different line endings, empty files, and files with empty lines. This ensures that the reverse_readfile function works correctly across various scenarios.

tests/test_io.py Show resolved Hide resolved
tests/test_io.py Show resolved Hide resolved
@shyuep
Copy link
Contributor

shyuep commented Dec 10, 2024

Pls fix the merge conflicts.
Thanks.

@shyuep shyuep merged commit bf66f0d into materialsvirtuallab:master Dec 11, 2024
7 of 8 checks passed
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
src/monty/io.py (3)

143-156: Consider adding support for more file types.

The function handles common file types but could be extended to support more compressed formats (e.g., xz/lzma) for consistency with zopen.

Apply this diff to add support for more file types:

-    elif isinstance(file, (io.BufferedReader, gzip.GzipFile, bz2.BZ2File)):
+    elif isinstance(file, (io.BufferedReader, gzip.GzipFile, bz2.BZ2File, lzma.LZMAFile)):

157-169: Consider adding support for classic Mac OS line endings.

The function only handles Windows (\r\n) and Unix (\n) line endings. While classic Mac OS line endings (\r) are rare, supporting them would improve compatibility.

Apply this diff to add support for classic Mac OS line endings:

     if first_line.endswith(b"\r\n"):
         return "\r\n"
     if first_line.endswith(b"\n"):
         return "\n"
+    if first_line.endswith(b"\r"):
+        return "\r"

     # It's likely the line is missing a line ending for the first line
-    raise ValueError(f"Unknown line ending in line {repr(first_line)}.")
+    raise ValueError(
+        f"Unknown line ending in line {repr(first_line)}. "
+        "Supported line endings are: \\r\\n (Windows), \\n (Unix), \\r (Classic Mac OS)."
+    )

262-268: Consider using a type guard for line ending length.

The cast to Literal[1, 2] could be replaced with a type guard for better type safety.

Apply this diff to improve type safety:

-    len_l_end: Literal[1, 2] = cast(Literal[1, 2], len(l_end))
+    def get_line_ending_length(l_end: Literal["\r\n", "\n"]) -> Literal[1, 2]:
+        if l_end == "\r\n":
+            return 2
+        return 1
+    len_l_end = get_line_ending_length(l_end)
tests/test_io.py (2)

152-180: Consider adding performance benchmarks.

The big file test verifies functionality but doesn't measure performance. Consider adding benchmarks to ensure performance doesn't regress.

Add performance benchmarks using timeit:

def test_big_file_performance(self, l_end):
    """Test performance of reading big files."""
    import timeit
    
    setup_code = """
    file_name = "big_file.txt"
    with open(file_name, "rb") as file:
        lines = list(reverse_readline(file, max_mem=4096))
    """
    
    # Measure time
    time_taken = timeit.timeit(setup=setup_code, number=100)
    assert time_taken < 1.0  # Adjust threshold as needed

244-274: Consider testing mixed line endings.

While the current implementation doesn't handle mixed line endings, it would be valuable to have a test that verifies the behavior (raising an error) when encountering mixed line endings.

Add a test for mixed line endings:

def test_mixed_line_endings(self):
    """Test behavior with mixed line endings."""
    contents = ("Line1\n", "Line2\r\n", "Line3\n")
    file_name = "mixed_endings.txt"
    
    with ScratchDir("."):
        with open(file_name, "wb") as file:
            for line in contents:
                file.write(line.encode())
        
        with pytest.raises(ValueError, match="Inconsistent line endings"):
            with open(file_name, "rb") as file:
                list(reverse_readline(file))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d70e80e and 3df6709.

📒 Files selected for processing (6)
  • src/monty/io.py (5 hunks)
  • src/monty/json.py (1 hunks)
  • src/monty/re.py (1 hunks)
  • tests/test_io.py (4 hunks)
  • tests/test_shutil.py (1 hunks)
  • tests/test_tempfile.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/test_shutil.py
  • tests/test_tempfile.py
  • src/monty/re.py
  • src/monty/json.py
🧰 Additional context used
🪛 Ruff (0.8.0)
src/monty/io.py

173-173: Undefined name Iterator

(F821)


225-225: Undefined name Iterator

(F821)

tests/test_io.py

197-198: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

🪛 GitHub Check: codecov/patch
src/monty/io.py

[warning] 274-274: src/monty/io.py#L274
Added line #L274 was not covered by tests

🔇 Additional comments (9)
src/monty/io.py (6)

19-19: LGTM!

The Literal import is correctly placed and used for type hints in the new _get_line_ending function.


116-142: LGTM!

The function signature and docstring are well-defined. The docstring includes:

  • Clear description of functionality
  • POSIX standard reference
  • Return values and error cases
  • Warning cases

186-189: LGTM!

The line ending handling is improved by using the new _get_line_ending helper function.


222-225: LGTM!

The type hints improve type safety by:

  • Specifying supported file types
  • Using Iterator[str] return type
🧰 Tools
🪛 Ruff (0.8.0)

225-225: Undefined name Iterator

(F821)


190-195: 🛠️ Refactor suggestion

Consider optimizing memory usage for compressed files.

Loading the entire compressed file into memory with readlines() could be inefficient for large files.

Consider processing compressed files in chunks:

     if isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
-        for line in reversed(file.readlines()):
-            # "readlines" would keep the line end character
-            yield line.decode("utf-8")
+        lines = []
+        chunk_size = 1024 * 1024  # 1MB chunks
+        while True:
+            chunk = file.read(chunk_size)
+            if not chunk:
+                break
+            lines.extend(chunk.splitlines(keepends=True))
+        for line in reversed(lines):
+            yield line.decode("utf-8")

Likely invalid or redundant comment.


196-218: 🛠️ Refactor suggestion

Consider improving error handling for memory mapping.

The broad except ValueError could mask other issues. Consider checking file size before attempting memory mapping.

Apply this diff to improve error handling:

     else:
+        file_size = os.path.getsize(filename)
+        if file_size == 0:
+            warnings.warn("File is empty.", stacklevel=2)
+            return
         try:
-            filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
-        except ValueError:
+            filemap = mmap.mmap(file.fileno(), file_size, access=mmap.ACCESS_READ)
+        except mmap.error as e:
+            warnings.warn(f"Memory mapping failed: {e}", stacklevel=2)
             return

Likely invalid or redundant comment.

tests/test_io.py (3)

25-120: LGTM! Comprehensive test coverage.

The test class thoroughly covers:

  • Different line endings (Unix, Windows)
  • Various file types (text, binary, compressed)
  • Edge cases (empty files, missing line endings)
  • Error cases (unknown file types, unknown line endings)

202-243: LGTM! Thorough testing of empty lines.

The test ensures empty lines are preserved across different file types and memory modes.


327-371: LGTM! Consistent test coverage with reverse_readline.

The tests ensure consistent behavior between reverse_readfile and reverse_readline for:

  • Empty lines
  • Different line endings
  • Various file types

@DanielYang59 DanielYang59 deleted the readline-line-ending branch December 11, 2024 02:30
@DanielYang59
Copy link
Contributor Author

Thanks, I was about to fix these :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants