Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process text sketch #180

Open
wants to merge 11 commits into
base: process-text
Choose a base branch
from
Open

Process text sketch #180

wants to merge 11 commits into from

Conversation

ChristianGeng
Copy link
Member

@ChristianGeng ChristianGeng commented Nov 14, 2024

Summary by Sourcery

Enhance the processing of text and audio data by introducing a new series generator for index handling and refactoring the post-processing logic into a separate method. Update tests to cover new functionality and improve code maintainability.

Enhancements:

  • Refactor the process index function to use a new series generator for handling different index types.
  • Introduce a new method _postprocess_xs to handle post-processing of processed data, improving code reuse and readability.

Tests:

  • Add new test cases to cover the data_identity function and process_func parameter in the test_process_index function.

Copy link

sourcery-ai bot commented Nov 14, 2024

Reviewer's Guide by Sourcery

This PR implements improvements to text processing functionality, focusing on handling different types of data and indices in the audio processing pipeline. The changes include better handling of file types, improved index type management, and enhanced data processing capabilities.

Sequence diagram for process_file method

sequenceDiagram
    participant Process
    participant Utils
    Process->>Utils: read_text(file, root)
    Note right of Process: For text files
    Process->>Utils: read_audio(file, start, end, root)
    Note right of Process: For audio/video files
    Process->>Process: _process_data(data, idx)
    Process->>Process: _process_data(signal, sampling_rate, idx)
Loading

Updated class diagram for Process class

classDiagram
    class Process {
        +int num_workers
        +bool multiprocessing
        +bool verbose
        +Callable read_func
        +void _process_file(String file, ...)
        +Series process_files(List files)
        +Series process_folder(String folder)
        +Series _process_index_wo_segment(Index index)
        +static Series _postprocess_xs(List xs)
        +Any _call_data(...)
        +Dict _special_args(...)
    }
    note for Process "Added read_func attribute and updated methods to use it"
Loading

File-Level Changes

Change Details Files
Added new data processing and index type handling functionality
  • Added helper function to determine index type based on preserve_index and segment flags
  • Implemented series generator function to handle different index types
  • Added support for process_func parameter in test_process_index
  • Enhanced handling of None values in index processing
tests/test_process_text.py
Enhanced Process class with improved file reading capabilities
  • Added read_func parameter to Process class initialization
  • Implemented dynamic setting of read_audio and read_text methods
  • Made sampling_rate parameter optional in identity function
  • Added postprocessing method for handling different data types
audinterface/core/process.py
Improved error handling and edge cases
  • Changed RuntimeError to FileNotFoundError for missing files
  • Added handling for dictionary and iterable text data types
  • Improved handling of None values in starts and ends lists
  • Ensured non-scalar answers in data processing
audinterface/core/process.py
tests/test_process_text.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ChristianGeng - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Please remove the commented out code blocks - they're no longer needed and make the code harder to read
Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@@ -1147,6 +1208,8 @@ def _call_data(
process_func_args = process_func_args or self.process_func_args
special_args = self._special_args(idx, root, file, process_func_args)
y = self.process_func(data, **special_args, **process_func_args)
# ensure non-scalar answer
y = [y] if len(y) == 1 else y
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Handle potential scalar values in _call_data return processing

The len(y) call assumes y is a sequence type. Consider checking if y is a scalar value first to avoid potential AttributeError.

pd.Series: A pandas Series containing the postprocessed data.
"""
ys = [x[0] for x in xs]
# TODO: put into single list comprehension for all these three diagnostics
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Keep separate diagnostic checks for better maintainability

The current separate checks are more readable and easier to debug than a combined list comprehension would be. Consider removing this TODO and keeping the current structure.

        all_dict = all(map(lambda x: isinstance(x, dict), [x[0] for x in xs]))
        all_iterable = all(map(lambda x: isinstance(x, Iterable), [x[0] for x in xs]))

y = self._postprocess_xs(xs)
return y

@staticmethod
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider refactoring the type handling and flattening logic into separate helper methods.

The _postprocess_xs method could be simplified while maintaining functionality. Here are two specific suggestions:

  1. Extract the duplicated starts/ends handling into a helper method:
def _flatten_or_collect(items):
    """Flatten iterable items or collect None values."""
    try:
        return list(itertools.chain.from_iterable(items))
    except TypeError:
        # Handle case where all items are None
        if [x for x in filter(None, items)] == []:
            return items
        raise

def _postprocess_xs(xs):
    ys = [x[0] for x in xs]

    # Simplify type handling with clear conversion rules
    if all(isinstance(x, dict) for x in ys):
        keys = list(itertools.chain.from_iterable(x.keys() for x in ys))
        values = list(itertools.chain.from_iterable(x.values() for x in ys))
        y = [{k: v} for k, v in zip(keys, values)]
    elif all(isinstance(x, str) for x in ys):
        y = [[x] for x in ys]  # Preserve text items as single-item lists
    else:
        y = list(itertools.chain.from_iterable(ys))

    files = list(itertools.chain.from_iterable(x[1] for x in xs))
    starts = _flatten_or_collect([x[2] for x in xs])
    ends = _flatten_or_collect([x[3] for x in xs])

    # Rest of the method...
  1. Consider making the type handling more explicit by documenting expected types and using a simpler pattern that handles just the required cases rather than trying to detect all possibilities.

Comment on lines +140 to +141
if num_files == 0:
index = pd.RangeIndex(0, 0, 1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +152 to +153
if num_files == 0:
index = pd.RangeIndex(0, 0, 1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +509 to 510
y = self._postprocess_xs(xs)
return y
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable)

Suggested change
y = self._postprocess_xs(xs)
return y
return self._postprocess_xs(xs)

Comment on lines +606 to +607
y = self._postprocess_xs(xs)
return y
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable)

Suggested change
y = self._postprocess_xs(xs)
return y
return self._postprocess_xs(xs)

Comment on lines +656 to +666
pass
starts_non_iterable = [x for x in filter(None, [x[2] for x in xs])] == []
assert starts_non_iterable, "unknown problem"
starts = [x[2] for x in xs]

# same as for starts
try:
ends = list(itertools.chain.from_iterable([x[3] for x in xs]))
except TypeError:
pass
ends_non_iterable = [x for x in filter(None, [x[3] for x in xs])] == []
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:


Explanation
Convert list/set/tuple comprehensions that do not change the input elements into.

Before

# List comprehensions
[item for item in coll]
[item for item in friends.names()]

# Dict comprehensions
{k: v for k, v in coll}
{k: v for k, v in coll.items()}  # Only if we know coll is a `dict`

# Unneeded call to `.items()`
dict(coll.items())  # Only if we know coll is a `dict`

# Set comprehensions
{item for item in coll}

After

# List comprehensions
list(iter(coll))
list(iter(friends.names()))

# Dict comprehensions
dict(coll)
dict(coll)

# Unneeded call to `.items()`
dict(coll)

# Set comprehensions
set(coll)

All these comprehensions are just creating a copy of the original collection.
They can all be simplified by simply constructing a new collection directly. The
resulting code is easier to read and shows the intent more clearly.

Convert list/set/tuple comprehensions that do not change the input elements into.

Before

# List comprehensions
[item for item in coll]
[item for item in friends.names()]

# Dict comprehensions
{k: v for k, v in coll}
{k: v for k, v in coll.items()}  # Only if we know coll is a `dict`

# Unneeded call to `.items()`
dict(coll.items())  # Only if we know coll is a `dict`

# Set comprehensions
{item for item in coll}

After

# List comprehensions
list(iter(coll))
list(iter(friends.names()))

# Dict comprehensions
dict(coll)
dict(coll)

# Unneeded call to `.items()`
dict(coll)

# Set comprehensions
set(coll)

All these comprehensions are just creating a copy of the original collection.
They can all be simplified by simply constructing a new collection directly. The
resulting code is easier to read and shows the intent more clearly.

@@ -191,21 +205,63 @@ def test_process_folder(
pd.testing.assert_series_equal(y, pd.Series(dtype=object))


def _get_idx_type(preserve_index, segment_is_None, idx):
"""Get expected index type.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:

Comment on lines +239 to +240
file = idx
yield file, value
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): We've found these issues:

  • Simplify conditional into switch-like form [×2] (switch)
  • Inline variable that is only used once (inline-variable)
Suggested change
file = idx
yield file, value
yield (idx, value)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant