Fix line ending handling in `reverse_readfile/readline` across OS, and not skipping empty lines #712

DanielYang59 · 2024-09-04T06:24:01Z

I. Fix line ending handling (mainly in Windows with `"\r\n"`)

Hard coded line ending leads to dangling \r for Windows

Hard coded line ending length 1 prevents reverse_readline from working for Windows:

Lines 131 to 133 in 1270c7b

    
           if file_size < max_mem or isinstance(m_file, gzip.GzipFile) or os.name == "nt": 
        
               for line in reversed(m_file.readlines()): 
        
                   yield line.rstrip()

[Behaviour change] Yield line as is, without removing the line ending character (no rstrip)

II. Performance benchmark

Compare with current implementation to make sure there's no performance regression, with Python 3.12, using script e4940e0.

Ubuntu 22.04-WSL2

Installed from PyPI (2024.7.12): pypi-7.12-ubuntu2204.txt
Installed from this branch: develop-ubuntu2204.txt

Windows 11

Installed from PyPI (v2024.7.12): pypi-7.12-win11.txt
Installed from this branch: develop-win11.txt

III. Test downstream packages

pymatgen: Fix monty imports, enhance test for Outcar parser to cover uncompressed format materialsproject/pymatgen#4068

Summary by CodeRabbit

New Features
- Enhanced file handling for compressed files and line endings.
- Added functionality to determine line endings in files.
Bug Fixes
- Improved error handling for empty files and unknown file types.
Tests
- Introduced new tests for line ending detection and file reading scenarios.
- Updated existing tests for consistency and clarity across different platforms.
Chores
- Minor adjustments to type hints and import statements for clarity.

coderabbitai · 2024-09-04T06:24:07Z

Walkthrough

The changes in this pull request primarily enhance file handling capabilities in the src/monty/io.py file, particularly for compressed files and line endings. Key updates include the introduction of a new helper function _get_line_ending, modifications to the reverse_readfile and reverse_readline functions to improve error handling and maintain line endings, and the addition of type hints for better type safety. The test suite has also been expanded and refined to cover more scenarios related to these functionalities, ensuring robust validation of the new features.

Changes

File	Change Summary
`src/monty/io.py`	- Added `_get_line_ending` function for line ending detection. - Updated `reverse_readfile` and `reverse_readline` for better error handling and line ending preservation. - Enforced explicit mode specifications in `zopen` and added warnings for deprecated extensions.
`tests/test_io.py`	- Introduced `TestGetLineEnding` class with parameterized tests for `_get_line_ending`. - Enhanced existing tests in `TestReverseReadline` and `TestReverseReadfile` for line endings and edge cases.
`tests/test_shutil.py`	- Updated platform checks in `TestCopyR` and `TestRemove` for clarity and consistency across tests.
`tests/test_tempfile.py`	- Refined platform detection logic in `TestScratchDir` and corrected indentation in `test_no_copy`.
`src/monty/json.py`	- Modified `orjson` import to include a type ignore comment.
`src/monty/re.py`	- Updated type hint for `gen.close()` in `regrep` function to broaden type ignoring.

Possibly related PRs

Fix line ending in reverse_readfile/readline in Windows #700: This PR addresses line ending handling in the reverse_readfile and reverse_readline functions, which are directly related to the changes made in the main PR regarding file handling and line endings.
zopen changes: forbid implicit binary/text mode, signature change, default UTF-8 encoding in text mode, drop .z support after one-year #730: This PR introduces significant modifications to the zopen function, including changes to its signature and the enforcement of explicit mode specifications, which aligns with the updates made in the main PR to the zopen function in src/monty/io.py.

🐰 In the land where files do play,
Line endings dance in a joyful array.
With helpers and tests, we leap and bound,
In the world of I/O, new treasures are found!
So hop along, dear friends, don’t delay,
Celebrate the changes in a bouncy ballet! 🎉

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

codecov · 2024-09-04T06:40:30Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.19%. Comparing base (ae78429) to head (3df6709).
Report is 103 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #712      +/-   ##
==========================================
+ Coverage   83.90%   84.19%   +0.28%     
==========================================
  Files          27       27              
  Lines        1640     1670      +30     
  Branches      302      314      +12     
==========================================
+ Hits         1376     1406      +30     
  Misses        202      202              
  Partials       62       62

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/monty/io.py

hongyi-zhao · 2024-09-04T13:52:55Z

If, for some reason, the OUTCAR file has mixed style line endings, then how should it be handled? Therefore, I believe that the safest approach might be to first convert the given OUTCAR file to a unified UNIX line ending style.

DanielYang59 · 2024-09-05T03:03:30Z

Hi @hongyi-zhao thanks for the comment.

If, for some reason, the OUTCAR file has mixed style line endings

I believe that's technically possible, but I don't think reverse_readline/file should handle cases where files are badly formatted. reverse_readline/file would expect the file to be correctly formatted (i.e. has a single line ending across the entire file in our case), if not, the user should correct the format (dos2unix or unit2dos or such) before sending it into the pipeline.

hongyi-zhao · 2024-09-05T06:21:13Z

dos2unix or unit2dos or such

It should be unix2dos.

DanielYang59 · 2024-09-05T08:12:47Z

dos2unix or unit2dos or such

It should be unix2dos.

Either serves the same purpose to unify line ending (reverse_readfile/line assume unified line ending, not just unified Unix line ending), so it doesn't matter which one you're opting for.

BTW, unix2dos (Unix to DOS) converts to DOS/Windows format.

hongyi-zhao · 2024-09-05T08:23:03Z

Don't mind, I mean you have a typo in your previous posting: unit2dos should be unix2dos.

DanielYang59 · 2024-09-05T08:38:24Z

Ah okay, didn't notice that, thanks!

janosh

thanks for fixing this issue @DanielYang59! 👍

src/monty/io.py

tests/test_io.py

…into readline-line-ending

coderabbitai

Actionable comments posted: 11

Outside diff range and nitpick comments (7)

benchmark/benchmark.py (2)
169-171: Include monty module version in the output

For better reproducibility and clarity, it's beneficial to display the version of the monty module being used in the benchmark. This information can be crucial when comparing results across different environments or after updates.

You can modify your script to include the monty version:
 import sys
+import monty

 os_info = platform.platform()
 python_version = sys.version.split()[0]
 print(f"\nRunning on OS: {os_info}, Python {python_version}")
+print(f"Monty version: {monty.__version__}")
6-6: Evaluate the necessity of the __future__ import

The from __future__ import annotations statement is used to postpone the evaluation of type annotations, which can be helpful in some cases. However, since the script isn't extensively utilizing annotations that require this import, and if you're running on Python 3.7 or later, this import may not be necessary.

Consider removing the import if it's not needed:
-from __future__ import annotations
Alternatively, if you plan to expand the use of annotations, especially with forward references or complex type hints, you might decide to retain it.
benchmark/pypi-7.12-win11.txt (2)
5-5: Ensure Consistent Phrasing for Timing Outputs

The lines reporting the creation of test files use the phrase "time used," whereas other timing outputs use "time taken." For consistency and clarity, consider changing "time used" to "time taken" in these lines.

Apply this diff to maintain consistency:
-Test file of size 1 MB created with 40757 lines, time used 0.02 seconds.
+Test file of size 1 MB created with 40757 lines, time taken: 0.02 seconds.

-Test file of size 10 MB created with 392476 lines, time used 0.17 seconds.
+Test file of size 10 MB created with 392476 lines, time taken: 0.17 seconds.

-Test file of size 100 MB created with 3784596 lines, time used 1.71 seconds.
+Test file of size 100 MB created with 3784596 lines, time taken: 1.71 seconds.

-Test file of size 500 MB created with 18462038 lines, time used 8.34 seconds.
+Test file of size 500 MB created with 18462038 lines, time taken: 8.34 seconds.

-Test file of size 1000 MB created with 36540934 lines, time used 16.35 seconds.
+Test file of size 1000 MB created with 36540934 lines, time taken: 16.35 seconds.

-Test file of size 5000 MB created with 178466370 lines, time used 85.79 seconds.
+Test file of size 5000 MB created with 178466370 lines, time taken: 85.79 seconds.
Also applies to: 29-29, 53-53, 77-77, 101-101, 125-125

1-145: Consider Excluding Benchmark Result Files from the Repository

Including raw benchmark result files like benchmark/pypi-7.12-win11.txt in the repository can increase its size unnecessarily and may not provide significant value to other developers. Benchmark results can vary between environments and are typically regenerated as needed.

Consider removing the benchmark result file from the repository. Instead, you can summarize key findings in the project's documentation or the pull request description. This approach keeps the repository lean and focuses on the most relevant information.
tests/test_io.py (3)
46-49: Typo in variable name start_pot; consider renaming to start_pos

The variable start_pot in lines 46 and 48 appears to be a typo. Renaming it to start_pos would better represent its purpose as the starting position of the file pointer.

Apply this diff to correct the variable name:
             with open(test_file, "r", encoding="utf-8") as f:
-                start_pot = f.tell()
+                start_pos = f.tell()
                 assert _get_line_ending(f) == l_end
-                assert f.tell() == start_pot
+                assert f.tell() == start_pos
52-55: Typo in variable name start_pot; consider renaming to start_pos

Similarly, in the binary mode test, consider renaming start_pot to start_pos for consistency.

Apply this diff to correct the variable name:
             with open(test_file, "rb") as f:
-                start_pot = f.tell()
+                start_pos = f.tell()
                 assert _get_line_ending(f) == l_end
-                assert f.tell() == start_pot
+                assert f.tell() == start_pos
198-198: Clarify the failure message in the empty file test

The failure message "No error should be thrown." might be misleading, as the test ensures no lines are read from an empty file. Consider updating the message for clarity.

Apply this diff to update the message:
             for _line in reverse_readline(f):
-                pytest.fail("No error should be thrown.")
+                pytest.fail("No lines should be read from an empty file.")

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1270c7b and f354756.

Files ignored due to path filters (3)

tests/test_files/3000_lines.txt.bz2 is excluded by !**/*.bz2
tests/test_files/3000_lines.txt.gz is excluded by !**/*.gz
tests/test_files/3000lines.txt.gz is excluded by !**/*.gz

Files selected for processing (12)

.github/workflows/test.yml (2 hunks)
benchmark/benchmark.py (1 hunks)
benchmark/develop-ubuntu2204.txt (1 hunks)
benchmark/develop-win11.txt (1 hunks)
benchmark/pypi-7.12-ubuntu2204.txt (1 hunks)
benchmark/pypi-7.12-win11.txt (1 hunks)
src/monty/io.py (5 hunks)
tests/test_files/3000_lines.txt (1 hunks)
tests/test_io.py (4 hunks)
tests/test_multiprocessing.py (1 hunks)
tests/test_shutil.py (1 hunks)
tests/test_tempfile.py (2 hunks)

Files skipped from review due to trivial changes (3)

tests/test_files/3000_lines.txt
tests/test_multiprocessing.py
tests/test_shutil.py

Additional comments not posted (4)

.github/workflows/test.yml (2)

9-9: Confirm the necessity of fail-fast: false in the build strategy

Adding fail-fast: false ensures that all matrix jobs continue to run even if one fails, allowing for comprehensive test results across all environments. Please verify that this aligns with your intended testing workflow and that the extended build times are acceptable.

33-33: Ensure compatibility with Codecov Action V4

Updating the Codecov action from @v3 to @v4 may introduce changes that affect coverage report uploads. Please confirm that the new version is compatible with your configuration and update any settings if necessary.

You can review the release notes for codecov/codecov-action@v4 to check for breaking changes or required adjustments:

tests/test_tempfile.py (2)

4-4: Addition of platform import is appropriate and necessary.

The platform module is required for the operating system check using platform.system() later in the code.

124-124: OS check updated for clarity and compatibility.

Replacing the OS check with platform.system() != "Windows" provides a clearer and more explicit way to determine if the code is not running on a Windows system, enhancing readability and maintainability.

benchmark/benchmark.py

benchmark/pypi-7.12-win11.txt

benchmark/develop-win11.txt

benchmark/develop-ubuntu2204.txt

src/monty/io.py

tests/test_io.py

coderabbitai

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (4)

tests/test_tempfile.py (1)
125-125: LGTM: Improved OS check for better clarity and consistency.

The change from os.name != "nt" to platform.system() != "Windows" improves the clarity and reliability of the OS check. This aligns well with similar changes made in other files of the project.

For even better consistency across the codebase, consider using a constant for the "Windows" string:
WINDOWS = "Windows"
# ...
if platform.system() != WINDOWS:
This would make it easier to maintain if you need to change the string in the future.
tests/test_shutil.py (1)

Line range hint 170-170: Appropriate test skipping for Windows platform

The addition of @unittest.skipIf(platform.system() == "Windows", "Skip on windows") to all test methods in the TestRemove class is a good improvement. This change aligns well with the PR objectives of enhancing cross-platform compatibility and prevents test failures on Windows due to unsupported operations.

Consider making the skip message more informative, e.g., "Skipping file removal test on Windows due to different file system behavior". This would provide more context about why the test is being skipped.

Also applies to: 178-178, 186-186, 198-198
src/monty/io.py (1)
209-212: Consider simplifying line ending handling

Since l_end can only be "\n" or "\r\n", and their lengths are known, you may simplify the code by directly assigning the length without using len() and cast.

Apply this diff to simplify:
     l_end: Literal["\r\n", "\n"] = _get_line_ending(m_file)
-    len_l_end: Literal[1, 2] = cast(Literal[1, 2], len(l_end))
+    len_l_end: int = 2 if l_end == "\r\n" else 1
This removes the need for cast and makes the code clearer.
tests/test_io.py (1)
264-266: Clarify line ending handling in assertions for consistency.

The assertion combines line.rstrip(os.linesep) with l_end, which may lead to confusion due to OS-specific line endings.

Consider normalizing the line endings before comparison to make the assertion clearer:
-    assert (
-        line.rstrip(os.linesep) + l_end
-        == contents[len(contents) - idx - 1]
-    )
+    expected_line = contents[len(contents) - idx - 1].replace('\r\n', '\n').rstrip('\n')
+    actual_line = line.replace('\r\n', '\n').rstrip('\n')
+    assert actual_line == expected_line

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 1cdc75b and 279cd4c.

📒 Files selected for processing (5)

.github/workflows/test.yml (2 hunks)
src/monty/io.py (5 hunks)
tests/test_io.py (4 hunks)
tests/test_shutil.py (1 hunks)
tests/test_tempfile.py (2 hunks)

🧰 Additional context used

🪛 actionlint

.github/workflows/test.yml

10-10: key "fail-fast" is duplicated in "strategy" section. previously defined at line:8,col:7

(syntax-check)

🪛 yamllint

.github/workflows/test.yml

[error] 10-10: duplication of key "fail-fast" in mapping

(key-duplicates)

🔇 Additional comments (15)

tests/test_tempfile.py (2)

4-4: LGTM: Import statement added correctly.

The platform module import is correctly placed and necessary for the subsequent changes in the file.

Line range hint 1-153: Summary: Improved OS detection for better cross-platform compatibility.

The changes in this file enhance the OS detection mechanism, making it more explicit and consistent with other parts of the project. The addition of the platform module and the updated condition in test_symlink method improve the clarity and reliability of the tests on different operating systems.

These modifications align well with the PR objectives of addressing line ending handling issues across different OS, particularly for Windows systems. The changes do not introduce any new issues and maintain the existing functionality of the tests.

tests/test_shutil.py (1)

34-34: Improved platform check for symlink creation

The change from os.name != "nt" to platform.system() != "Windows" is a good improvement. It makes the condition more explicit and easier to understand, enhancing code readability and maintainability. This change is consistent with similar updates in other files, as mentioned in the PR summary, and aligns well with the goal of improving cross-platform compatibility.
src/monty/io.py (11)
16-16: LGTM

The import of warnings is appropriate and enhances the module's ability to issue warnings.

18-19: LGTM

Importing TYPE_CHECKING, Literal, and cast from typing is appropriate for type annotations and conditional imports used later in the code.

26-26: LGTM

Importing IO, Iterator, and Union inside the TYPE_CHECKING block is standard practice for optional type annotations that assist with static analysis without impacting runtime.

115-116: LGTM

The updated function signature for reverse_readfile includes type hints, improving code readability and static analysis.

165-168: LGTM

The function reverse_readline now includes detailed type annotations and defaults, improving clarity and type safety.

210-211: Verify the correctness of cast usage for len_l_end

The use of cast to assign the length of l_end to len_l_end with type Literal[1, 2] is intended to aid static type checking. Ensure that this use of cast is appropriate and does not introduce type inconsistencies.

Review the necessity of the cast and consider if a simple type annotation would suffice.

264-266: Ensure correct decoding based on file mode

When reading blocks from the file, ensure that decoding is handled properly depending on whether the file is opened in text or binary mode.

135-138: ⚠️ Potential issue

Address potential high memory usage when reading compressed files

In the reverse_readfile function, these lines use file.readlines() to read compressed files, which loads the entire file into memory. For large gzip or bz2 files, this can lead to excessive memory consumption.

Consider processing the compressed files in a memory-efficient way. For example, you could read and yield lines one at a time:
     if isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
-        for line in reversed(file.readlines()):
-            # "readlines" would keep the line end character
-            yield line.decode("utf-8")
+        lines = []
+        for line in file:
+            lines.append(line)
+            if len(lines) >= max_mem // blk_size:
+                for rev_line in reversed(lines):
+                    yield rev_line.decode("utf-8")
+                lines.clear()
+        for rev_line in reversed(lines):
+            yield rev_line.decode("utf-8")
This approach processes the file in chunks, reducing peak memory usage.

Likely invalid or redundant comment.

105-112: ⚠️ Potential issue

Consider supporting additional line endings or clarifying the error message in _get_line_ending

The _get_line_ending function currently checks only for "\r\n" (Windows) and "\n" (Unix/Linux) line endings. If a file uses an older Mac OS line ending ("\r"), the function will raise a ValueError. Consider adding support for "\r" line endings or updating the error message to specify the supported line endings.

To handle "\r" line endings, apply this diff:
     if first_line.endswith(b"\r\n"):
         return "\r\n"
     if first_line.endswith(b"\n"):
         return "\n"
+    if first_line.endswith(b"\r"):
+        return "\r"

     # It's likely the line is missing a line ending for the first line
     raise ValueError(f"Unknown line ending in line {repr(first_line)}.")
Alternatively, clarify the error message:
     # It's likely the line is missing a line ending for the first line
-    raise ValueError(f"Unknown line ending in line {repr(first_line)}.")
+    raise ValueError(f"Unsupported line ending in line {repr(first_line)}. Only '\\n' and '\\r\\n' are supported.")
Likely invalid or redundant comment.

20-23: Ensure conditional import of lzma handles availability correctly

The conditional import of lzma allows the code to handle environments where lzma is unavailable. Verify that subsequent code gracefully handles lzma being None.

Run the following script to check usage of lzma when it is None:

129-132: Ensure efficient retrieval of line endings

The use of _get_line_ending enhances the robustness of line ending detection. Verify that this does not introduce significant overhead for large files.

Run the following script to benchmark the performance impact:
tests/test_io.py (1)

3-4: Necessary imports for compression handling.

The imports bz2 and gzip are correctly included to handle bzip2 and gzip files in the tests.

src/monty/io.py

tests/test_io.py

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (5)

src/monty/io.py (4)
45-45: LGTM: Enhanced file format support and Path handling

The addition of .XZ and .LZMA file support improves the function's versatility. The explicit string conversion of Path objects is a good practice for ensuring compatibility.

Consider using os.fspath() for converting Path objects to strings, as it's the recommended way to handle both string and path-like objects:
-    if filename is not None and isinstance(filename, Path):
-        filename = str(filename)
+    if filename is not None:
+        filename = os.fspath(filename)
This change would handle both str and Path objects uniformly.

Also applies to: 50-51

55-107: LGTM: Well-implemented line ending detection

The _get_line_ending function is a robust implementation for detecting file line endings. It handles various file types, follows standards, and uses appropriate error handling.

Consider adding a comment explaining why \r (CR) line endings are not supported, as some legacy systems might still use them:
# Note: We don't check for '\r' (CR) line endings as they are rarely used in modern systems.
# If support for CR line endings is needed, add an additional check here.
if first_line.endswith(b"\r\n"):
    return "\r\n"
if first_line.endswith(b"\n"):
    return "\n"
This addition would clarify the design decision and make it easier to add support for CR line endings if needed in the future.

Line range hint 110-157: LGTM: Improved line ending handling and file type support

The changes to reverse_readfile enhance its robustness and consistency in handling different file types and line endings. The use of _get_line_ending and yielding lines with original endings are good improvements.

Consider caching the result of len(filemap) to avoid repeated calls:
 filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
+filemap_len = len(filemap)
-file_size = len(filemap)
+file_size = filemap_len
 while file_size > 0:
     # ... existing code ...
-    elif file_size != len(filemap):
+    elif file_size != filemap_len:
This minor optimization could slightly improve performance, especially for large files.

160-272: LGTM: Enhanced type safety and line ending handling

The updates to reverse_readline improve type safety with better annotations and ensure consistent line ending handling across different file types. The logic for different file handling scenarios is well-implemented.

Consider adding a comment explaining the rationale behind the in-memory reversal for small files and gzip files:
# For small files or gzip files, we reverse the entire file in memory.
# This is more efficient for small files and necessary for gzip files
# which don't support reverse seeking.
if file_size < max_mem or isinstance(m_file, gzip.GzipFile):
    for line in reversed(m_file.readlines()):
        yield line if isinstance(line, str) else cast(bytes, line).decode("utf-8")
This comment would clarify the design decision and make the code more maintainable.
tests/test_io.py (1)
281-281: Consider consistent naming for class variables

For consistency, consider using uppercase for all class-level constants. In this case, NUM_LINES is uppercase, while NUMLINES in the TestReverseReadline class is not.

Apply this diff for consistency:
-    NUMLINES = 3000
+    NUM_LINES = 3000
Also update all references to NUMLINES in the TestReverseReadline class to NUM_LINES.

Also applies to: 284-284

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 279cd4c and d70e80e.

📒 Files selected for processing (5)

.github/workflows/test.yml (1 hunks)
src/monty/io.py (5 hunks)
src/monty/json.py (1 hunks)
src/monty/re.py (1 hunks)
tests/test_io.py (4 hunks)

✅ Files skipped from review due to trivial changes (2)

src/monty/json.py
src/monty/re.py

🚧 Files skipped from review as they are similar to previous changes (1)

.github/workflows/test.yml

🧰 Additional context used

🔇 Additional comments (6)

src/monty/io.py (3)

17-17: LGTM: Improved imports and type annotations

The new imports and type annotations enhance the code's type safety and provide better tooling support. The use of warnings is appropriate for the new functionality.

Also applies to: 19-19, 22-22

374-374: LGTM: Improved docstring clarity

The updated docstring now clearly states that the function only works on UNIX-like operating systems, which is an important clarification for users.

Line range hint 1-384: LGTM: Significant improvements to file I/O handling

The changes in this file represent a substantial improvement in file I/O operations, particularly in handling line endings and supporting various file types. Key improvements include:

Centralized line ending detection with the new _get_line_ending function.

Enhanced support for compressed file formats.

Improved type annotations for better type safety and code clarity.

More robust handling of edge cases, such as empty files.

These changes make the code more reliable, maintainable, and consistent across different file types and operating systems. The additions are well-integrated with the existing code and follow good coding practices.

tests/test_io.py (3)

24-118: Excellent addition of comprehensive test cases for _get_line_ending

The new TestGetLineEnding class provides a thorough set of test cases for the _get_line_ending function. It covers various scenarios including different line endings, file types (text, binary, compressed), and edge cases. This comprehensive approach ensures robust testing of the function's behavior across different situations.

Line range hint 122-277: Improved test coverage for reverse_readline

The updates to TestReverseReadline class enhance the test coverage by including tests for different line endings, empty files, and both text and binary modes. This ensures that the reverse_readline function works correctly across various scenarios.

Line range hint 279-370: Enhanced test coverage for reverse_readfile

The updates to TestReverseReadfile class improve the test coverage by including tests for different line endings, empty files, and files with empty lines. This ensures that the reverse_readfile function works correctly across various scenarios.

tests/test_io.py

shyuep · 2024-12-10T20:52:34Z

Pls fix the merge conflicts.
Thanks.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)

src/monty/io.py (3)
143-156: Consider adding support for more file types.

The function handles common file types but could be extended to support more compressed formats (e.g., xz/lzma) for consistency with zopen.

Apply this diff to add support for more file types:
-    elif isinstance(file, (io.BufferedReader, gzip.GzipFile, bz2.BZ2File)):
+    elif isinstance(file, (io.BufferedReader, gzip.GzipFile, bz2.BZ2File, lzma.LZMAFile)):
157-169: Consider adding support for classic Mac OS line endings.

The function only handles Windows (\r\n) and Unix (\n) line endings. While classic Mac OS line endings (\r) are rare, supporting them would improve compatibility.

Apply this diff to add support for classic Mac OS line endings:
     if first_line.endswith(b"\r\n"):
         return "\r\n"
     if first_line.endswith(b"\n"):
         return "\n"
+    if first_line.endswith(b"\r"):
+        return "\r"

     # It's likely the line is missing a line ending for the first line
-    raise ValueError(f"Unknown line ending in line {repr(first_line)}.")
+    raise ValueError(
+        f"Unknown line ending in line {repr(first_line)}. "
+        "Supported line endings are: \\r\\n (Windows), \\n (Unix), \\r (Classic Mac OS)."
+    )
262-268: Consider using a type guard for line ending length.

The cast to Literal[1, 2] could be replaced with a type guard for better type safety.

Apply this diff to improve type safety:
-    len_l_end: Literal[1, 2] = cast(Literal[1, 2], len(l_end))
+    def get_line_ending_length(l_end: Literal["\r\n", "\n"]) -> Literal[1, 2]:
+        if l_end == "\r\n":
+            return 2
+        return 1
+    len_l_end = get_line_ending_length(l_end)
tests/test_io.py (2)
152-180: Consider adding performance benchmarks.

The big file test verifies functionality but doesn't measure performance. Consider adding benchmarks to ensure performance doesn't regress.

Add performance benchmarks using timeit:
def test_big_file_performance(self, l_end):
    """Test performance of reading big files."""
    import timeit
    
    setup_code = """
    file_name = "big_file.txt"
    with open(file_name, "rb") as file:
        lines = list(reverse_readline(file, max_mem=4096))
    """
    
    # Measure time
    time_taken = timeit.timeit(setup=setup_code, number=100)
    assert time_taken < 1.0  # Adjust threshold as needed
244-274: Consider testing mixed line endings.

While the current implementation doesn't handle mixed line endings, it would be valuable to have a test that verifies the behavior (raising an error) when encountering mixed line endings.

Add a test for mixed line endings:
def test_mixed_line_endings(self):
    """Test behavior with mixed line endings."""
    contents = ("Line1\n", "Line2\r\n", "Line3\n")
    file_name = "mixed_endings.txt"
    
    with ScratchDir("."):
        with open(file_name, "wb") as file:
            for line in contents:
                file.write(line.encode())
        
        with pytest.raises(ValueError, match="Inconsistent line endings"):
            with open(file_name, "rb") as file:
                list(reverse_readline(file))

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d70e80e and 3df6709.

📒 Files selected for processing (6)

src/monty/io.py (5 hunks)
src/monty/json.py (1 hunks)
src/monty/re.py (1 hunks)
tests/test_io.py (4 hunks)
tests/test_shutil.py (1 hunks)
tests/test_tempfile.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

tests/test_shutil.py
tests/test_tempfile.py
src/monty/re.py
src/monty/json.py

🧰 Additional context used

🪛 Ruff (0.8.0)

src/monty/io.py

173-173: Undefined name Iterator

(F821)

225-225: Undefined name Iterator

(F821)

tests/test_io.py

197-198: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

🪛 GitHub Check: codecov/patch

src/monty/io.py

[warning] 274-274: src/monty/io.py#L274
Added line #L274 was not covered by tests

🔇 Additional comments (9)

src/monty/io.py (6)

19-19: LGTM!

The Literal import is correctly placed and used for type hints in the new _get_line_ending function.

116-142: LGTM!

The function signature and docstring are well-defined. The docstring includes:

Clear description of functionality
POSIX standard reference
Return values and error cases
Warning cases

186-189: LGTM!

The line ending handling is improved by using the new _get_line_ending helper function.

222-225: LGTM!

The type hints improve type safety by:

Specifying supported file types
Using Iterator[str] return type

🧰 Tools

🪛 Ruff (0.8.0)

225-225: Undefined name Iterator

(F821)

190-195: 🛠️ Refactor suggestion

Consider optimizing memory usage for compressed files.

Loading the entire compressed file into memory with readlines() could be inefficient for large files.

Consider processing compressed files in chunks:

     if isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
-        for line in reversed(file.readlines()):
-            # "readlines" would keep the line end character
-            yield line.decode("utf-8")
+        lines = []
+        chunk_size = 1024 * 1024  # 1MB chunks
+        while True:
+            chunk = file.read(chunk_size)
+            if not chunk:
+                break
+            lines.extend(chunk.splitlines(keepends=True))
+        for line in reversed(lines):
+            yield line.decode("utf-8")

Likely invalid or redundant comment.

196-218: 🛠️ Refactor suggestion

Consider improving error handling for memory mapping.

The broad except ValueError could mask other issues. Consider checking file size before attempting memory mapping.

Apply this diff to improve error handling:

     else:
+        file_size = os.path.getsize(filename)
+        if file_size == 0:
+            warnings.warn("File is empty.", stacklevel=2)
+            return
         try:
-            filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
-        except ValueError:
+            filemap = mmap.mmap(file.fileno(), file_size, access=mmap.ACCESS_READ)
+        except mmap.error as e:
+            warnings.warn(f"Memory mapping failed: {e}", stacklevel=2)
             return

Likely invalid or redundant comment.

tests/test_io.py (3)

25-120: LGTM! Comprehensive test coverage.

The test class thoroughly covers:

Different line endings (Unix, Windows)
Various file types (text, binary, compressed)
Edge cases (empty files, missing line endings)
Error cases (unknown file types, unknown line endings)

202-243: LGTM! Thorough testing of empty lines.

The test ensures empty lines are preserved across different file types and memory modes.

327-371: LGTM! Consistent test coverage with reverse_readline.

The tests ensure consistent behavior between reverse_readfile and reverse_readline for:

Empty lines
Different line endings
Various file types

DanielYang59 · 2024-12-11T02:30:46Z

Thanks, I was about to fix these :)

DanielYang59 added 3 commits September 4, 2024 14:21

add l_end arg to reverse_readfile

e2952c4

rstrip remove hard coded trailing white space

619a38d

add default value for l_end

6cbea60

DanielYang59 added 2 commits September 4, 2024 14:37

add l_end to reverse_readline

0de9696

continue CI test jobs upon failure

f114afe

DanielYang59 added 2 commits September 4, 2024 14:41

bump codecov to v4

deb1ad7

tweak docstring

c4a845c

DanielYang59 commented Sep 4, 2024

View reviewed changes

src/monty/io.py Show resolved Hide resolved

DanielYang59 changed the title ~~Allow selecting line endings in reverse_readfile and reverse_readline~~ Detect line endings in reverse_readfile and reverse_readline Sep 4, 2024

var name and docstring tweak

afbe573

DanielYang59 added 9 commits September 5, 2024 17:16

add helper function to get line ending and unit test

a4c4fe3

add bzip2 and gzip file support and test

86ea01b

sort import and tweak docstring

c7ec2de

use unix line ending \n as default if empty

ae125c3

fix docstring

343e0db

update unit test

d288e0d

remove accidental comment sign

064c064

use if after return

30624aa

tweak docstring

3084123

janosh reviewed Sep 5, 2024

View reviewed changes

src/monty/io.py Show resolved Hide resolved

tests/test_io.py Show resolved Hide resolved

DanielYang59 added 2 commits September 6, 2024 14:05

reset pointer to fix test in windows

1639725

encode in linux

dfe553d

DanielYang59 added 3 commits September 18, 2024 19:20

remove dup test script

0ea6828

Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …

8db6bb4

…into readline-line-ending

clear finished TODO tag

f354756

DanielYang59 marked this pull request as ready for review September 19, 2024 03:46

coderabbitai bot reviewed Sep 19, 2024

View reviewed changes

DanielYang59 and others added 7 commits September 19, 2024 13:28

tweak var name

96c9108

fix missing newline char in comment

e1786ef

guard warning filter with context manager

f90074c

add type annotation

27b28a6

put tag into condition branch

1cdc75b

untrack benchmark script and results

e4940e0

Merge branch 'master' into readline-line-ending

279cd4c

coderabbitai bot reviewed Oct 21, 2024

View reviewed changes

src/monty/io.py Show resolved Hide resolved

src/monty/io.py Outdated Show resolved Hide resolved

tests/test_io.py Outdated Show resolved Hide resolved

fix typo in test var name

6613d33

DanielYang59 marked this pull request as draft October 22, 2024 03:20

remove merge issue

3baea3a

DanielYang59 force-pushed the readline-line-ending branch from 7515632 to 3baea3a Compare October 22, 2024 03:21

DanielYang59 added 3 commits October 22, 2024 11:23

Merge branch 'master' into readline-line-ending

61f4aff

revise comment

f1fd669

suppress mypy errors

d70e80e

DanielYang59 marked this pull request as ready for review October 22, 2024 03:33

coderabbitai bot reviewed Oct 22, 2024

View reviewed changes

tests/test_io.py Show resolved Hide resolved

tests/test_io.py Show resolved Hide resolved

shyuep and others added 2 commits December 10, 2024 17:45

Merge branch 'master' into readline-line-ending

2653961

pre-commit auto-fixes

3df6709

shyuep merged commit bf66f0d into materialsvirtuallab:master Dec 11, 2024
7 of 8 checks passed

coderabbitai bot reviewed Dec 11, 2024

View reviewed changes

DanielYang59 deleted the readline-line-ending branch December 11, 2024 02:30

coderabbitai bot mentioned this pull request Dec 12, 2024

Fix collections.ControlledDict handling of Iterators via update method #732

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix line ending handling in `reverse_readfile/readline` across OS, and not skipping empty lines #712

Fix line ending handling in `reverse_readfile/readline` across OS, and not skipping empty lines #712

DanielYang59 commented Sep 4, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 4, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

codecov bot commented Sep 4, 2024 •

edited

Loading

hongyi-zhao commented Sep 4, 2024

DanielYang59 commented Sep 5, 2024

hongyi-zhao commented Sep 5, 2024

DanielYang59 commented Sep 5, 2024 •

edited

Loading

hongyi-zhao commented Sep 5, 2024

DanielYang59 commented Sep 5, 2024

janosh left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

shyuep commented Dec 10, 2024

coderabbitai bot left a comment

DanielYang59 commented Dec 11, 2024

	if file_size < max_mem or isinstance(m_file, gzip.GzipFile) or os.name == "nt":
	for line in reversed(m_file.readlines()):
	yield line.rstrip()

Fix line ending handling in reverse_readfile/readline across OS, and not skipping empty lines #712

Fix line ending handling in reverse_readfile/readline across OS, and not skipping empty lines #712

Conversation

DanielYang59 commented Sep 4, 2024 • edited by coderabbitai bot Loading

I. Fix line ending handling (mainly in Windows with "\r\n")

II. Performance benchmark

Ubuntu 22.04-WSL2

Windows 11

III. Test downstream packages

Summary by CodeRabbit

coderabbitai bot commented Sep 4, 2024 • edited Loading

Walkthrough

Changes

Possibly related PRs

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

codecov bot commented Sep 4, 2024 • edited Loading

Codecov Report

hongyi-zhao commented Sep 4, 2024

DanielYang59 commented Sep 5, 2024

hongyi-zhao commented Sep 5, 2024

DanielYang59 commented Sep 5, 2024 • edited Loading

hongyi-zhao commented Sep 5, 2024

DanielYang59 commented Sep 5, 2024

janosh left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

shyuep commented Dec 10, 2024

coderabbitai bot left a comment

Choose a reason for hiding this comment

DanielYang59 commented Dec 11, 2024

Fix line ending handling in `reverse_readfile/readline` across OS, and not skipping empty lines #712

Fix line ending handling in `reverse_readfile/readline` across OS, and not skipping empty lines #712

DanielYang59 commented Sep 4, 2024 •

edited by coderabbitai bot

Loading

I. Fix line ending handling (mainly in Windows with `"\r\n"`)

coderabbitai bot commented Sep 4, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

codecov bot commented Sep 4, 2024 •

edited

Loading

DanielYang59 commented Sep 5, 2024 •

edited

Loading