Process text sketch #180

ChristianGeng · 2024-11-14T16:14:32Z

Summary by Sourcery

Enhance the processing of text and audio data by introducing a new series generator for index handling and refactoring the post-processing logic into a separate method. Update tests to cover new functionality and improve code maintainability.

Enhancements:

Refactor the process index function to use a new series generator for handling different index types.
Introduce a new method _postprocess_xs to handle post-processing of processed data, improving code reuse and readability.

Tests:

Add new test cases to cover the data_identity function and process_func parameter in the test_process_index function.

sourcery-ai · 2024-11-14T16:14:37Z

Reviewer's Guide by Sourcery

This PR implements improvements to text processing functionality, focusing on handling different types of data and indices in the audio processing pipeline. The changes include better handling of file types, improved index type management, and enhanced data processing capabilities.

Sequence diagram for process_file method

sequenceDiagram
    participant Process
    participant Utils
    Process->>Utils: read_text(file, root)
    Note right of Process: For text files
    Process->>Utils: read_audio(file, start, end, root)
    Note right of Process: For audio/video files
    Process->>Process: _process_data(data, idx)
    Process->>Process: _process_data(signal, sampling_rate, idx)

Updated class diagram for Process class

classDiagram
    class Process {
        +int num_workers
        +bool multiprocessing
        +bool verbose
        +Callable read_func
        +void _process_file(String file, ...)
        +Series process_files(List files)
        +Series process_folder(String folder)
        +Series _process_index_wo_segment(Index index)
        +static Series _postprocess_xs(List xs)
        +Any _call_data(...)
        +Dict _special_args(...)
    }
    note for Process "Added read_func attribute and updated methods to use it"

File-Level Changes

Change	Details	Files
Added new data processing and index type handling functionality	Added helper function to determine index type based on preserve_index and segment flags Implemented series generator function to handle different index types Added support for process_func parameter in test_process_index Enhanced handling of None values in index processing	`tests/test_process_text.py`
Enhanced Process class with improved file reading capabilities	Added read_func parameter to Process class initialization Implemented dynamic setting of read_audio and read_text methods Made sampling_rate parameter optional in identity function Added postprocessing method for handling different data types	`audinterface/core/process.py`
Improved error handling and edge cases	Changed RuntimeError to FileNotFoundError for missing files Added handling for dictionary and iterable text data types Improved handling of None values in starts and ends lists Ensured non-scalar answers in data processing	`audinterface/core/process.py` `tests/test_process_text.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time. You can also use
this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey @ChristianGeng - I've reviewed your changes - here's some feedback:

Overall Comments:

Please remove the commented out code blocks - they're no longer needed and make the code harder to read

Here's what I looked at during the review

🟡 General issues: 2 issues found
🟢 Security: all looks good
🟢 Testing: all looks good
🟡 Complexity: 1 issue found
🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2024-11-14T16:15:41Z

audinterface/core/process.py

@@ -1147,6 +1208,8 @@ def _call_data(
        process_func_args = process_func_args or self.process_func_args
        special_args = self._special_args(idx, root, file, process_func_args)
        y = self.process_func(data, **special_args, **process_func_args)
+        # ensure non-scalar answer
+        y = [y] if len(y) == 1 else y


issue: Handle potential scalar values in _call_data return processing

The len(y) call assumes y is a sequence type. Consider checking if y is a scalar value first to avoid potential AttributeError.

sourcery-ai · 2024-11-14T16:15:41Z

audinterface/core/process.py

+            pd.Series: A pandas Series containing the postprocessed data.
+        """
+        ys = [x[0] for x in xs]
+        # TODO: put into single list comprehension for all these three diagnostics


suggestion: Keep separate diagnostic checks for better maintainability

The current separate checks are more readable and easier to debug than a combined list comprehension would be. Consider removing this TODO and keeping the current structure.

all_dict = all(map(lambda x: isinstance(x, dict), [x[0] for x in xs])) all_iterable = all(map(lambda x: isinstance(x, Iterable), [x[0] for x in xs]))

sourcery-ai · 2024-11-14T16:15:41Z

audinterface/core/process.py

+        y = self._postprocess_xs(xs)
+        return y
+
+    @staticmethod


issue (complexity): Consider refactoring the type handling and flattening logic into separate helper methods.

The _postprocess_xs method could be simplified while maintaining functionality. Here are two specific suggestions:

Extract the duplicated starts/ends handling into a helper method:

def _flatten_or_collect(items): """Flatten iterable items or collect None values.""" try: return list(itertools.chain.from_iterable(items)) except TypeError: # Handle case where all items are None if [x for x in filter(None, items)] == []: return items raise def _postprocess_xs(xs): ys = [x[0] for x in xs] # Simplify type handling with clear conversion rules if all(isinstance(x, dict) for x in ys): keys = list(itertools.chain.from_iterable(x.keys() for x in ys)) values = list(itertools.chain.from_iterable(x.values() for x in ys)) y = [{k: v} for k, v in zip(keys, values)] elif all(isinstance(x, str) for x in ys): y = [[x] for x in ys] # Preserve text items as single-item lists else: y = list(itertools.chain.from_iterable(ys)) files = list(itertools.chain.from_iterable(x[1] for x in xs)) starts = _flatten_or_collect([x[2] for x in xs]) ends = _flatten_or_collect([x[3] for x in xs]) # Rest of the method...

Consider making the type handling more explicit by documenting expected types and using a simpler pattern that handles just the required cases rather than trying to detect all possibilities.

sourcery-ai · 2024-11-14T16:15:41Z

tests/test_process_text.py

+    if num_files == 0:
+        index = pd.RangeIndex(0, 0, 1)


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2024-11-14T16:15:41Z

tests/test_process_text.py

+    if num_files == 0:
+        index = pd.RangeIndex(0, 0, 1)


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2024-11-14T16:15:42Z

audinterface/core/process.py

+        y = self._postprocess_xs(xs)
        return y


suggestion (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable)

Suggested change

y = self._postprocess_xs(xs)

return y

return self._postprocess_xs(xs)

sourcery-ai · 2024-11-14T16:15:42Z

audinterface/core/process.py

+        y = self._postprocess_xs(xs)
+        return y


suggestion (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable)

Suggested change

y = self._postprocess_xs(xs)

return y

return self._postprocess_xs(xs)

sourcery-ai · 2024-11-14T16:15:42Z

audinterface/core/process.py

+            pass
+            starts_non_iterable = [x for x in filter(None, [x[2] for x in xs])] == []
+            assert starts_non_iterable, "unknown problem"
+            starts = [x[2] for x in xs]
+
+        # same as for starts
+        try:
+            ends = list(itertools.chain.from_iterable([x[3] for x in xs]))
+        except TypeError:
+            pass
+            ends_non_iterable = [x for x in filter(None, [x[3] for x in xs])] == []


issue (code-quality): We've found these issues:

Remove redundant pass statement [×2] (remove-redundant-pass)

Replace identity comprehension with call to collection constructor [×2] (identity-comprehension)

Replaces an empty collection equality with a boolean operation (simplify-empty-collection-comparison)

Explanation

Convert list/set/tuple comprehensions that do not change the input elements into.

Before

# List comprehensions [item for item in coll] [item for item in friends.names()] # Dict comprehensions {k: v for k, v in coll} {k: v for k, v in coll.items()} # Only if we know coll is a `dict` # Unneeded call to `.items()` dict(coll.items()) # Only if we know coll is a `dict` # Set comprehensions {item for item in coll}

After

# List comprehensions list(iter(coll)) list(iter(friends.names())) # Dict comprehensions dict(coll) dict(coll) # Unneeded call to `.items()` dict(coll) # Set comprehensions set(coll)

All these comprehensions are just creating a copy of the original collection.
They can all be simplified by simply constructing a new collection directly. The
resulting code is easier to read and shows the intent more clearly.

Convert list/set/tuple comprehensions that do not change the input elements into.

Before

# List comprehensions [item for item in coll] [item for item in friends.names()] # Dict comprehensions {k: v for k, v in coll} {k: v for k, v in coll.items()} # Only if we know coll is a `dict` # Unneeded call to `.items()` dict(coll.items()) # Only if we know coll is a `dict` # Set comprehensions {item for item in coll}

After

# List comprehensions list(iter(coll)) list(iter(friends.names())) # Dict comprehensions dict(coll) dict(coll) # Unneeded call to `.items()` dict(coll) # Set comprehensions set(coll)

All these comprehensions are just creating a copy of the original collection.
They can all be simplified by simply constructing a new collection directly. The
resulting code is easier to read and shows the intent more clearly.

sourcery-ai · 2024-11-14T16:15:42Z

tests/test_process_text.py

@@ -191,21 +205,63 @@ def test_process_folder(
    pd.testing.assert_series_equal(y, pd.Series(dtype=object))


+def _get_idx_type(preserve_index, segment_is_None, idx):
+    """Get expected index type.
+


issue (code-quality): We've found these issues:

Replace unneeded comprehension with generator (comprehension-to-generator)

Replace if statement with if expression (assign-if-exp)

Inline variable that is immediately returned [×2] (inline-immediately-returned-variable)

sourcery-ai · 2024-11-14T16:15:42Z

tests/test_process_text.py

+            file = idx
+            yield file, value


suggestion (code-quality): We've found these issues:

Simplify conditional into switch-like form [×2] (switch)

Inline variable that is only used once (inline-variable)

Suggested change

file = idx

yield file, value

yield (idx, value)

ChristianGeng added 11 commits September 20, 2024 14:45

export read_text.

f2e914a

Add read_func. argument.

2e7d90d

identity: make sr a kwarg.

cfb1ed0

Fix process_index.

1476e62

Add clumsy method tp postprocess.

c0bc008

Reformat test_process_text.

1625a48

Add missing extension.

07c5f1b

Fix index problems for 0-len files.

fe5f671

improve docstring.

a9eaa7f

use correct interface for text.

521a344

Ruff change.

bc4fa86

sourcery-ai bot reviewed Nov 14, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Process text sketch #180

Process text sketch #180

ChristianGeng commented Nov 14, 2024 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Nov 14, 2024 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

sourcery-ai bot left a comment

sourcery-ai bot Nov 14, 2024

sourcery-ai bot Nov 14, 2024

sourcery-ai bot Nov 14, 2024

sourcery-ai bot Nov 14, 2024

sourcery-ai bot Nov 14, 2024

sourcery-ai bot Nov 14, 2024

sourcery-ai bot Nov 14, 2024

sourcery-ai bot Nov 14, 2024

sourcery-ai bot Nov 14, 2024

sourcery-ai bot Nov 14, 2024

	y = self._postprocess_xs(xs)
	return y
	return self._postprocess_xs(xs)

Process text sketch #180

Are you sure you want to change the base?

Process text sketch #180

Conversation

ChristianGeng commented Nov 14, 2024 • edited by sourcery-ai bot Loading

Summary by Sourcery

sourcery-ai bot commented Nov 14, 2024 • edited Loading

Reviewer's Guide by Sourcery

Sequence diagram for process_file method

Updated class diagram for Process class

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

sourcery-ai bot left a comment

Choose a reason for hiding this comment

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

Before

After

Before

After

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

sourcery-ai bot Nov 14, 2024

Choose a reason for hiding this comment

ChristianGeng commented Nov 14, 2024 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Nov 14, 2024 •

edited

Loading