-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Process text sketch #180
base: process-text
Are you sure you want to change the base?
Process text sketch #180
Conversation
Reviewer's Guide by SourceryThis PR implements improvements to text processing functionality, focusing on handling different types of data and indices in the audio processing pipeline. The changes include better handling of file types, improved index type management, and enhanced data processing capabilities. Sequence diagram for process_file methodsequenceDiagram
participant Process
participant Utils
Process->>Utils: read_text(file, root)
Note right of Process: For text files
Process->>Utils: read_audio(file, start, end, root)
Note right of Process: For audio/video files
Process->>Process: _process_data(data, idx)
Process->>Process: _process_data(signal, sampling_rate, idx)
Updated class diagram for Process classclassDiagram
class Process {
+int num_workers
+bool multiprocessing
+bool verbose
+Callable read_func
+void _process_file(String file, ...)
+Series process_files(List files)
+Series process_folder(String folder)
+Series _process_index_wo_segment(Index index)
+static Series _postprocess_xs(List xs)
+Any _call_data(...)
+Dict _special_args(...)
}
note for Process "Added read_func attribute and updated methods to use it"
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ChristianGeng - I've reviewed your changes - here's some feedback:
Overall Comments:
- Please remove the commented out code blocks - they're no longer needed and make the code harder to read
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
@@ -1147,6 +1208,8 @@ def _call_data( | |||
process_func_args = process_func_args or self.process_func_args | |||
special_args = self._special_args(idx, root, file, process_func_args) | |||
y = self.process_func(data, **special_args, **process_func_args) | |||
# ensure non-scalar answer | |||
y = [y] if len(y) == 1 else y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Handle potential scalar values in _call_data return processing
The len(y) call assumes y is a sequence type. Consider checking if y is a scalar value first to avoid potential AttributeError.
pd.Series: A pandas Series containing the postprocessed data. | ||
""" | ||
ys = [x[0] for x in xs] | ||
# TODO: put into single list comprehension for all these three diagnostics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Keep separate diagnostic checks for better maintainability
The current separate checks are more readable and easier to debug than a combined list comprehension would be. Consider removing this TODO and keeping the current structure.
all_dict = all(map(lambda x: isinstance(x, dict), [x[0] for x in xs]))
all_iterable = all(map(lambda x: isinstance(x, Iterable), [x[0] for x in xs]))
y = self._postprocess_xs(xs) | ||
return y | ||
|
||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (complexity): Consider refactoring the type handling and flattening logic into separate helper methods.
The _postprocess_xs
method could be simplified while maintaining functionality. Here are two specific suggestions:
- Extract the duplicated starts/ends handling into a helper method:
def _flatten_or_collect(items):
"""Flatten iterable items or collect None values."""
try:
return list(itertools.chain.from_iterable(items))
except TypeError:
# Handle case where all items are None
if [x for x in filter(None, items)] == []:
return items
raise
def _postprocess_xs(xs):
ys = [x[0] for x in xs]
# Simplify type handling with clear conversion rules
if all(isinstance(x, dict) for x in ys):
keys = list(itertools.chain.from_iterable(x.keys() for x in ys))
values = list(itertools.chain.from_iterable(x.values() for x in ys))
y = [{k: v} for k, v in zip(keys, values)]
elif all(isinstance(x, str) for x in ys):
y = [[x] for x in ys] # Preserve text items as single-item lists
else:
y = list(itertools.chain.from_iterable(ys))
files = list(itertools.chain.from_iterable(x[1] for x in xs))
starts = _flatten_or_collect([x[2] for x in xs])
ends = _flatten_or_collect([x[3] for x in xs])
# Rest of the method...
- Consider making the type handling more explicit by documenting expected types and using a simpler pattern that handles just the required cases rather than trying to detect all possibilities.
if num_files == 0: | ||
index = pd.RangeIndex(0, 0, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests
)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
if num_files == 0: | ||
index = pd.RangeIndex(0, 0, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests
)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
y = self._postprocess_xs(xs) | ||
return y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable
)
y = self._postprocess_xs(xs) | |
return y | |
return self._postprocess_xs(xs) |
y = self._postprocess_xs(xs) | ||
return y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable
)
y = self._postprocess_xs(xs) | |
return y | |
return self._postprocess_xs(xs) |
pass | ||
starts_non_iterable = [x for x in filter(None, [x[2] for x in xs])] == [] | ||
assert starts_non_iterable, "unknown problem" | ||
starts = [x[2] for x in xs] | ||
|
||
# same as for starts | ||
try: | ||
ends = list(itertools.chain.from_iterable([x[3] for x in xs])) | ||
except TypeError: | ||
pass | ||
ends_non_iterable = [x for x in filter(None, [x[3] for x in xs])] == [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): We've found these issues:
- Remove redundant pass statement [×2] (
remove-redundant-pass
) - Replace identity comprehension with call to collection constructor [×2] (
identity-comprehension
) - Replaces an empty collection equality with a boolean operation (
simplify-empty-collection-comparison
)
Explanation
Convert list/set/tuple comprehensions that do not change the input elements into.
Before
# List comprehensions
[item for item in coll]
[item for item in friends.names()]
# Dict comprehensions
{k: v for k, v in coll}
{k: v for k, v in coll.items()} # Only if we know coll is a `dict`
# Unneeded call to `.items()`
dict(coll.items()) # Only if we know coll is a `dict`
# Set comprehensions
{item for item in coll}
After
# List comprehensions
list(iter(coll))
list(iter(friends.names()))
# Dict comprehensions
dict(coll)
dict(coll)
# Unneeded call to `.items()`
dict(coll)
# Set comprehensions
set(coll)
All these comprehensions are just creating a copy of the original collection.
They can all be simplified by simply constructing a new collection directly. The
resulting code is easier to read and shows the intent more clearly.
Convert list/set/tuple comprehensions that do not change the input elements into.
Before
# List comprehensions
[item for item in coll]
[item for item in friends.names()]
# Dict comprehensions
{k: v for k, v in coll}
{k: v for k, v in coll.items()} # Only if we know coll is a `dict`
# Unneeded call to `.items()`
dict(coll.items()) # Only if we know coll is a `dict`
# Set comprehensions
{item for item in coll}
After
# List comprehensions
list(iter(coll))
list(iter(friends.names()))
# Dict comprehensions
dict(coll)
dict(coll)
# Unneeded call to `.items()`
dict(coll)
# Set comprehensions
set(coll)
All these comprehensions are just creating a copy of the original collection.
They can all be simplified by simply constructing a new collection directly. The
resulting code is easier to read and shows the intent more clearly.
@@ -191,21 +205,63 @@ def test_process_folder( | |||
pd.testing.assert_series_equal(y, pd.Series(dtype=object)) | |||
|
|||
|
|||
def _get_idx_type(preserve_index, segment_is_None, idx): | |||
"""Get expected index type. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): We've found these issues:
- Replace unneeded comprehension with generator (
comprehension-to-generator
) - Replace if statement with if expression (
assign-if-exp
) - Inline variable that is immediately returned [×2] (
inline-immediately-returned-variable
)
file = idx | ||
yield file, value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): We've found these issues:
- Simplify conditional into switch-like form [×2] (
switch
) - Inline variable that is only used once (
inline-variable
)
file = idx | |
yield file, value | |
yield (idx, value) |
Summary by Sourcery
Enhance the processing of text and audio data by introducing a new series generator for index handling and refactoring the post-processing logic into a separate method. Update tests to cover new functionality and improve code maintainability.
Enhancements:
Tests: