Squashed commit of the following:
commit eeb1dde
Merge: 2504c37 4323946
Author: Stijn Peeters <[email protected]>
Date:   Thu Jul 11 16:47:45 2024 +0200

    Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

commit 4323946
Author: Dale Wahl <[email protected]>
Date:   Thu Jul 11 12:08:08 2024 +0200

    fix processor more button

    would only show top level analysis if not logged in

commit d6ab2b0
Author: Dale Wahl <[email protected]>
Date:   Tue Jul 9 15:35:25 2024 +0200

    search_gab - use MappedItem

commit 2504c37
Author: Stijn Peeters <[email protected]>
Date:   Sat Jul 6 17:51:22 2024 +0200

    Fix multiline spacing in multi select list

commit fea66ce
Author: Dale Wahl <[email protected]>
Date:   Fri Jul 5 13:15:45 2024 +0200

    use processor media_type if dataset does not have media_type; set default media_type for downloaders

commit d41fa34
Author: Dale Wahl <[email protected]>
Date:   Fri Jul 5 12:57:18 2024 +0200

    video_hasher: handle no metadata file

commit 2820dce
Author: Dale Wahl <[email protected]>
Date:   Fri Jul 5 12:50:09 2024 +0200

    num_rows not num_items()

commit fb09162
Author: Dale Wahl <[email protected]>
Date:   Fri Jul 5 12:44:03 2024 +0200

    Google vision API returning 400s; properly log and record processed entries; google networks should not run on empty datasets

commit ebf39d8
Author: Dale Wahl <[email protected]>
Date:   Fri Jul 5 12:28:13 2024 +0200

    fix image_category_wall

    whoops, cleared categories and post_values after filling them!

commit 1ad9ec2
Author: Stijn Peeters <[email protected]>
Date:   Fri Jul 5 12:03:54 2024 +0200

    fsdfdsgd sorry

commit c7254c0
Author: Stijn Peeters <[email protected]>
Date:   Fri Jul 5 12:01:21 2024 +0200

    Fix razdel versioning

commit b9a327a
Author: Stijn Peeters <[email protected]>
Date:   Fri Jul 5 11:57:47 2024 +0200

    Reorganise tokeniser, stopwords

commit fb13bc4
Merge: 0b74569 e304649
Author: Stijn Peeters <[email protected]>
Date:   Fri Jul 5 11:56:08 2024 +0200

    Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

commit e304649
Author: Dale Wahl <[email protected]>
Date:   Fri Jul 5 10:51:53 2024 +0200

    media_upload allow setting for max_form_part and warn users of failure above certain number of files

commit e4f982b
Author: Dale Wahl <[email protected]>
Date:   Fri Jul 5 09:50:49 2024 +0200

    Update media_import help text; looks like failure happens somewhere between 600-1000 files due to Flask request size limits

commit 0b74569
Author: Stijn Peeters <[email protected]>
Date:   Thu Jul 4 17:55:36 2024 +0200

    Add razdel as option for Russian tokenisation

commit 9f15a2b
Author: Dale Wahl <[email protected]>
Date:   Thu Jul 4 17:13:15 2024 +0200

    remove the log

commit ffcb6a4
Author: Dale Wahl <[email protected]>
Date:   Thu Jul 4 17:12:43 2024 +0200

    Inform user if too many files are uploaded

    I do not understand why this is appearing. app.config['MAX_CONTENT_LENGTH'] is set to None. Problem persists in Flask alone (i.e., does not appear to be Gunicorn/Nginx/Apache).

commit 9cad12d
Author: Stijn Peeters <[email protected]>
Date:   Thu Jul 4 15:09:42 2024 +0200

    Bump version

commit aad94f3
Author: Dale Wahl <[email protected]>
Date:   Thu Jul 4 10:51:01 2024 +0200

    Update setup.py to ensure videohash updates

commit d9154a6
Author: Dale Wahl <[email protected]>
Date:   Tue Jul 2 17:45:26 2024 +0200

    clip: categorizing requires categories...

    seriously, guys?

commit 0af9a5e
Author: Dale Wahl <[email protected]>
Date:   Tue Jul 2 17:31:49 2024 +0200

    blip2: fix no metadata file found (uploads...)

commit d695053
Author: Dale Wahl <[email protected]>
Date:   Tue Jul 2 17:25:26 2024 +0200

    cat_vis_wall - use str as category type if mixed

    i.e., use floats as string categories

commit bcb9140
Author: Sal Hagen <[email protected]>
Date:   Tue Jul 2 16:06:43 2024 +0200

    Add Twitter author profile pic and banner URLs

commit 1b3b02f
Author: Dale Wahl <[email protected]>
Date:   Tue Jul 2 11:42:50 2024 +0200

    add migrate.py log file in Docker

commit 2aaa972
Author: Dale Wahl <[email protected]>
Date:   Tue Jul 2 11:42:22 2024 +0200

    add necessary pip packages for upgrade in Docker environment; add error logging and save to file for troubleshooting

commit 18b8a53
Author: Dale Wahl <[email protected]>
Date:   Tue Jul 2 11:41:36 2024 +0200

    update Dockerfile to keep build environment

    useful for interactive upgrade

commit 7b224b9
Author: Dale Wahl <[email protected]>
Date:   Tue Jul 2 11:41:12 2024 +0200

    remove docker-compose.yml versions

commit acf5de0
Author: Stijn Peeters <[email protected]>
Date:   Mon Jul 1 17:38:32 2024 +0200

    Better issues.md, footer link

commit 1953ca3
Author: Dale Wahl <[email protected]>
Date:   Mon Jul 1 12:00:07 2024 +0200

    FIX: get_key() is more of a creating of a key than general getting of a key...

commit 12289bb
Author: Dale Wahl <[email protected]>
Date:   Mon Jul 1 11:37:06 2024 +0200

    .metadata.json may not have top_parent via Media Uploader

    This may exist in other processors if a proper check is not in place; will need to review

commit 25f4ed6
Author: Dale Wahl <[email protected]>
Date:   Tue Jun 25 14:43:40 2024 +0200

    Media upload datasource! (#419)

    * basic changes to allow files box

    * basic imports, yay!

    * video_scene_timelines to work on video imports!

    * add is_compatible_with checks to processors that cannot run on new media top_datasets

    * more is_compatible fixes

    * necessary function for checking media_types

    * enable more processors on media datasets

    * consolidate user_input file type

    * detect mimetype from filename

    best I can do without downloading all the files first.

    * handle zip archives; allow log and metadata files

    * do not count metadata or log files in num_files

    * move machine learning processors so they can be imported elsewhere

    * audio_to_text datasource

    * When validating zip file uploads, send list of file attributes instead of the first 128K of the zip file

    * Check type of files in zip when uploading media

    * Skip useless files when uploading media as zip

    * check multiple zip types in JS

    * js !=== python

    * fix media_type for loose file imports; fix extension for audio_to_text preset; fix merge for some processors w/ media_type

    ---------

    Co-authored-by: Stijn Peeters <[email protected]>

commit 4ce689b
Author: Stijn Peeters <[email protected]>
Date:   Mon Jun 24 11:58:50 2024 +0200

    Avoid KeyError

commit 155522d
Author: Dale Wahl <[email protected]>
Date:   Thu Jun 20 15:58:21 2024 +0200

    add generated images to image wall w/ text visual

commit eecde51
Author: Dale Wahl <[email protected]>
Date:   Thu Jun 20 15:57:56 2024 +0200

    allow users to NOT generate all images from prompts

commit d0b9574
Author: Stijn Peeters <[email protected]>
Date:   Wed Jun 19 16:28:26 2024 +0200

    ...don't mangle URLs in preview links

commit c105e36
Merge: 0028a99 8d4f99b
Author: Dale Wahl <[email protected]>
Date:   Wed Jun 19 16:25:36 2024 +0200

    Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

commit 0028a99
Author: Dale Wahl <[email protected]>
Date:   Wed Jun 19 16:25:12 2024 +0200

    add followups to processors

commit 8d4f99b
Author: Stijn Peeters <[email protected]>
Date:   Wed Jun 19 16:17:22 2024 +0200

    More flexible URL linking in CSV preview

commit f4f8e66
Author: Dale Wahl <[email protected]>
Date:   Wed Jun 19 13:54:00 2024 +0200

    tokeniser fix: use default lang for word_tokenize if language is 'other'

commit 127472e
Author: Stijn Peeters <[email protected]>
Date:   Tue Jun 18 16:45:01 2024 +0200

    Better log messages for Telegram data source

commit e8714b6
Author: Stijn Peeters <[email protected]>
Date:   Mon Jun 17 17:42:21 2024 +0200

    Add 'crawl' feature to Telegram data source

    Fixes #321 (though might need a bit more testing)

commit 25fded7
Merge: d67cf44 b10e3bb
Author: sal-phd-desktop <[email protected]>
Date:   Fri Jun 14 16:23:02 2024 +0200

    Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

commit d67cf44
Author: sal-phd-desktop <[email protected]>
Date:   Fri Jun 14 16:22:59 2024 +0200

    Fix export 4chan script and remove some unnecessary code

commit b10e3bb
Author: Dale Wahl <[email protected]>
Date:   Thu Jun 13 15:14:06 2024 +0200

    video_hasher prefix: fix extension type

commit ba565cd
Author: Dale Wahl <[email protected]>
Date:   Thu Jun 13 14:53:13 2024 +0200

    video_hasher: fix to work with Pillow updates; add max amount videos

commit 90da5d2
Author: Dale Wahl <[email protected]>
Date:   Thu Jun 13 10:25:24 2024 +0200

    image_cat_wall fix the fix

commit a8b943d
Author: Dale Wahl <[email protected]>
Date:   Wed Jun 12 13:29:41 2024 +0200

    add OCR processor to image w/ text visualization

commit e7e636b
Author: Dale Wahl <[email protected]>
Date:   Tue Jun 11 15:23:12 2024 +0200

    add image_wall_w_text to follow on BLIP captions

commit f74b978
Author: Dale Wahl <[email protected]>
Date:   Thu Jun 6 11:05:25 2024 +0200

    image_category_wall: allow multiple images per item/post

commit e3c9ea5
Author: Dale Wahl <[email protected]>
Date:   Thu May 30 16:27:50 2024 +0200

    image_category_wall convert None to str for category

commit 0087457
Author: Dale Wahl <[email protected]>
Date:   Thu May 30 14:54:51 2024 +0200

    image_category_wall fix float categories

commit e0c55a8
Author: Dale Wahl <[email protected]>
Date:   Thu May 30 12:51:42 2024 +0200

    download_images fix divide by zero when user can download all

commit 3580fc9
Author: Dale Wahl <[email protected]>
Date:   Thu May 30 12:51:24 2024 +0200

    image_category_wall remove 'max' when user can use all images

commit f2145bd
Author: Dale Wahl <[email protected]>
Date:   Wed May 29 17:59:23 2024 +0200

    rank_attributes: option to count missing data or blanks

commit 01e7ab9
Author: Dale Wahl <[email protected]>
Date:   Wed May 29 16:53:57 2024 +0200

    fix missing field strategy so default_stategy not overwritten on second loop

    default_stategy would be set correctly to the callable, but overwritten on second loop (and map_missing is a dictionary at that point).
stijn-uva committed Jul 11, 2024
1 parent 64629bd commit 6e0721d
Showing 109 changed files with 1,711 additions and 22,761 deletions.
2 changes: 1 addition & 1 deletion .zenodo.json
@@ -3,7 +3,7 @@
 "license": "MPL-2.0",
 "title": "4CAT Capture and Analysis Toolkit",
 "upload_type": "software",
-"version": "v1.44",
+"version": "v1.45",
 "keywords": [
 "webmining",
 "scraping",
36 changes: 35 additions & 1 deletion LICENSE-3DPARTY
@@ -802,4 +802,38 @@ Incorporates the Graphology graph manipulation library
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-THE SOFTWARE.
+THE SOFTWARE.
+
+-------------------------------------------------------------------------------
+Incorporates the zip.js library
+- at /webtool/static/js/zip.min.js
+- from https://github.com/gildas-lormeau/zip.js
+
+BSD 3-Clause License
+
+Copyright (c) 2023, Gildas Lormeau
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this
+list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+this list of conditions and the following disclaimer in the documentation
+and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its
+contributors may be used to endorse or promote products derived from
+this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
2 changes: 1 addition & 1 deletion VERSION
@@ -1,4 +1,4 @@
-1.44
+1.45

 This file should not be modified. It is used by 4CAT to determine whether it
 needs to run migration scripts to e.g. update the database structure to a more
2 changes: 1 addition & 1 deletion backend/lib/processor.py
@@ -589,7 +589,7 @@ def extract_archived_file_by_name(self, filename, archive_path, staging_area=Non

         with zipfile.ZipFile(archive_path, "r") as archive_file:
             if filename not in archive_file.namelist():
-                raise KeyError("File %s not found in archive %s" % (filename, archive_path))
+                raise FileNotFoundError("File %s not found in archive %s" % (filename, archive_path))
             else:
                 archive_file.extract(filename, staging_area)
                 return staging_area.joinpath(filename)
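
The hunk above only swaps the exception type raised when the requested file is not in the archive. A minimal standalone sketch of the same pattern follows. The function name `extract_file_by_name` and the mandatory `staging_area` argument are assumptions made for illustration; this is not the 4CAT method itself, which lives on the processor class and is only partly visible in the diff.

```python
import zipfile
from pathlib import Path


def extract_file_by_name(filename, archive_path, staging_area):
    """Extract one named member of a zip archive into staging_area.

    Hypothetical standalone helper, loosely modelled on the hunk above.
    """
    staging_area = Path(staging_area)
    with zipfile.ZipFile(archive_path, "r") as archive_file:
        if filename not in archive_file.namelist():
            # A missing archive member is reported as a missing file,
            # not as a missing dictionary key.
            raise FileNotFoundError(
                "File %s not found in archive %s" % (filename, archive_path)
            )
        archive_file.extract(filename, staging_area)
    return staging_area.joinpath(filename)
```

Since `FileNotFoundError` is not a subclass of `KeyError`, any caller that previously wrapped such a call in `except KeyError` would need to catch `FileNotFoundError` (or `OSError`) instead.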
1 change: 1 addition & 0 deletions common/assets/stopwords-iso.json

Large diffs are not rendered by default.

58 changes: 58 additions & 0 deletions common/assets/stopwords-languages.json
@@ -0,0 +1,58 @@
+{"af": "Afrikaans",
+"ar": "Arabic",
+"hy": "Armenian",
+"eu": "Basque",
+"bn": "Bengali",
+"br": "Breton",
+"bg": "Bulgarian",
+"ca": "Catalan; Valencian",
+"cs": "Czech",
+"zh": "Chinese",
+"da": "Danish",
+"de": "German",
+"nl": "Dutch",
+"el": "Greek (Modern)",
+"en": "English",
+"eo": "Esperanto",
+"et": "Estonian",
+"fa": "Persian",
+"fi": "Finnish",
+"fr": "French",
+"ga": "Irish",
+"gl": "Galician",
+"gu": "Gujarati",
+"ha": "Hausa",
+"he": "Hebrew",
+"hi": "Hindi",
+"hr": "Croatian",
+"hu": "Hungarian",
+"id": "Indonesian",
+"it": "Italian",
+"ja": "Japanese",
+"ko": "Korean",
+"ku": "Kurdish",
+"la": "Latin",
+"lv": "Latvian",
+"lt": "Lithuanian",
+"mr": "Marathi",
+"ms": "Malay",
+"no": "Norwegian",
+"pl": "Polish",
+"pt": "Portuguese",
+"ro": "Romanian; Moldavian; Moldovan",
+"ru": "Russian",
+"sk": "Slovak",
+"sl": "Slovenian",
+"so": "Somali",
+"st": "Sotho, Southern",
+"es": "Spanish; Castilian",
+"sw": "Swahili",
+"sv": "Swedish",
+"tl": "Tagalog",
+"th": "Thai",
+"tr": "Turkish",
+"uk": "Ukrainian",
+"ur": "Urdu",
+"vi": "Vietnamese",
+"yo": "Yoruba",
+"zu": "Zulu"}
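
The new file is a flat JSON object mapping two-letter language codes to display names. How 4CAT consumes it is not shown in this diff, so the following is only a sketch of one plausible way to read such a mapping; the function names `load_language_names` and `language_label` are hypothetical.

```python
import json


def load_language_names(path):
    """Return the {language code: display name} mapping stored at path."""
    with open(path, encoding="utf-8") as infile:
        return json.load(infile)


def language_label(names, code):
    """Display name for a language code; falls back to the code itself."""
    return names.get(code, code)
```

Falling back to the raw code keeps a lookup from failing when a code is absent from the mapping.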